Introduction
The skimr package (from rOpenSci) summarizes the contents of a data frame, and if the default output does not suit you, you can customize it.
With skimr versions earlier than 2.0.1, run this script on Linux so that the histograms display correctly.
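To find out whether the version note applies to your setup, you can check the installed skimr version and the operating system from R. This is a small sketch using base R only (`packageVersion()` and `Sys.info()`); the variable names and the message text are illustrative and not part of skimr.

```r
# Compare the installed skimr version with the 2.0.1 threshold
old_skimr <- packageVersion("skimr") < "2.0.1"

# Sys.info() reports the operating system name ("Linux", "Windows", "Darwin", ...)
on_linux <- identical(unname(Sys.info()["sysname"]), "Linux")

if (old_skimr && !on_linux) {
  message("skimr < 2.0.1 outside Linux: histograms may not display correctly")
}
```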
The data
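# Work on a copy of the built-in iris dataset
iris2 <- iris
# Introduce one missing value (row 1, column 3 = Petal.Length)
iris2[1, 3] <- NA
# Convert a numeric column to a factor to get a second factor variable
iris2$Sepal.Length <- factor(iris2$Sepal.Length)
library("skimr")
Default settings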
The usual output
skim(iris2)
| Name | iris2 |
| Number of rows | 150 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| factor | 2 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Sepal.Length | 0 | 1 | FALSE | 35 | 5: 10, 5.1: 9, 6.3: 9, 5.7: 8 |
| Species | 0 | 1 | FALSE | 3 | set: 50, ver: 50, vir: 50 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Sepal.Width | 0 | 1.00 | 3.06 | 0.44 | 2.0 | 2.8 | 3.0 | 3.3 | 4.4 | ▁▆▇▂▁ |
| Petal.Length | 1 | 0.99 | 3.77 | 1.76 | 1.0 | 1.6 | 4.4 | 5.1 | 6.9 | ▇▁▆▇▂ |
| Petal.Width | 0 | 1.00 | 1.20 | 0.76 | 0.1 | 0.3 | 1.3 | 1.8 | 2.5 | ▇▁▇▅▃ |
KPIs computed for each variable type
get_default_skimmer_names()
$AsIs
[1] "n_unique" "min_length" "max_length"
$character
[1] "min" "max" "empty" "n_unique" "whitespace"
$complex
[1] "mean"
$Date
[1] "min" "max" "median" "n_unique"
$difftime
[1] "min" "max" "median" "n_unique"
$factor
[1] "ordered" "n_unique" "top_counts"
$list
[1] "n_unique" "min_length" "max_length"
$logical
[1] "mean" "count"
$numeric
[1] "mean" "sd" "p0" "p25" "p50" "p75" "p100" "hist"
$POSIXct
[1] "min" "max" "median" "n_unique"
$Timespan
[1] "min" "max" "median" "n_unique"
$ts
[1] "start" "end" "frequency" "deltat" "mean"
[6] "sd" "min" "max" "median" "line_graph"
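Since `get_default_skimmer_names()` returns a named list, as the output above shows, the KPIs for a single column type can be extracted by name. A minimal sketch; the variable name `defaults` is only illustrative.

```r
library(skimr)

# Full named list: one character vector of KPI names per column type
defaults <- get_default_skimmer_names()

# KPIs computed for numeric columns only
defaults$numeric

# Column types for which skimr provides default KPIs
names(defaults)
```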
Focusing on a few KPIs
iris2 %>% skim() %>% focus(n_missing, numeric.mean)
| Name | Piped data |
| Number of rows | 150 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| factor | 2 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing |
|---|---|
| Sepal.Length | 0 |
| Species | 0 |
Variable type: numeric
| skim_variable | n_missing | mean |
|---|---|---|
| Sepal.Width | 0 | 3.06 |
| Petal.Length | 1 | 3.77 |
| Petal.Width | 0 | 1.20 |
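The name `numeric.mean` passed to `focus()` corresponds to a column of the object returned by `skim()`: base KPIs keep their plain name, while type-specific KPIs are prefixed with the column type. A quick way to list the available names, sketched here with the `iris2` data rebuilt so the block runs on its own (the variable `skimmed` is illustrative):

```r
library(skimr)

# Same data preparation as at the start of the article
iris2 <- iris
iris2[1, 3] <- NA
iris2$Sepal.Length <- factor(iris2$Sepal.Length)

skimmed <- skim(iris2)

# Lists skim_type, skim_variable, the base KPIs, then the prefixed KPIs
names(skimmed)

# Same idea as before, this time keeping one factor KPI and one numeric KPI
skimmed %>% focus(factor.n_unique, numeric.sd)
```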
Removing some KPIs for numeric and factor columns
The "base" KPIs n_missing and complete_rate have a special status: to remove one of them, you have to redefine the base argument, as shown below where only n_missing is kept. For the other KPIs, setting them to NULL is enough.
mon_skim <- skim_with(
  numeric = sfl(hist = NULL, p25 = NULL, p50 = NULL, p75 = NULL,
                mean = NULL, sd = NULL),
  factor = sfl(ordered = NULL),
  base = sfl(n_missing = n_missing)
)
mon_skim(iris2)
| Name | iris2 |
| Number of rows | 150 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| factor | 2 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | n_unique | top_counts |
|---|---|---|---|
| Sepal.Length | 0 | 35 | 5: 10, 5.1: 9, 6.3: 9, 5.7: 8 |
| Species | 0 | 3 | set: 50, ver: 50, vir: 50 |
Variable type: numeric
| skim_variable | n_missing | p0 | p100 |
|---|---|---|---|
| Sepal.Width | 0 | 2.0 | 4.4 |
| Petal.Length | 1 | 1.0 | 6.9 |
| Petal.Width | 0 | 0.1 | 2.5 |
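The same `base` argument can be used the other way round, for instance to drop n_missing and keep complete_rate. The sketch below assumes the skimr helper functions `n_complete()` and `complete_rate()`; the names `mon_skim_base` and `res` are hypothetical, and the unmodified `iris` data is used to keep the block self-contained.

```r
library(skimr)

# Base KPIs apply to every column type; n_missing is not listed, so it is dropped
mon_skim_base <- skim_with(
  base = sfl(n_complete = n_complete, complete_rate = complete_rate)
)

res <- mon_skim_base(iris)

# Check which KPI columns are present in the result
names(res)
```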
Custom KPI for numeric columns
We add a custom KPI, and the option append = FALSE removes the default numeric KPIs. As the output shows, the base KPIs remain and the factor columns are unaffected:
mon_skim <- skim_with(
  numeric = sfl(ma_fonction = function(x) {
    max(x, na.rm = TRUE) / min(x, na.rm = TRUE)
  }),
  append = FALSE
)
mon_skim(iris2)
| Name | iris2 |
| Number of rows | 150 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| factor | 2 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Sepal.Length | 0 | 1 | FALSE | 35 | 5: 10, 5.1: 9, 6.3: 9, 5.7: 8 |
| Species | 0 | 1 | FALSE | 3 | set: 50, ver: 50, vir: 50 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | ma_fonction |
|---|---|---|---|
| Sepal.Width | 0 | 1.00 | 2.2 |
| Petal.Length | 1 | 0.99 | 6.9 |
| Petal.Width | 0 | 1.00 | 25.0 |
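If you would rather keep the default KPIs and simply add the custom one, a possible variant is to leave `append` at its default value (TRUE). This is a sketch on the unmodified `iris` data; the names `max_min_ratio`, `skim_plus` and `res` are illustrative.

```r
library(skimr)

# Ratio of the largest to the smallest observed value, ignoring NAs
max_min_ratio <- function(x) {
  max(x, na.rm = TRUE) / min(x, na.rm = TRUE)
}

# append is left at its default (TRUE), so the default numeric KPIs are kept
skim_plus <- skim_with(numeric = sfl(max_min_ratio = max_min_ratio))

res <- skim_plus(iris)

# Show the custom KPI next to one of the default KPIs
res %>% focus(numeric.mean, numeric.max_min_ratio)
```

Defining the function separately and passing it by name keeps the `sfl()` call short and lets you reuse the same KPI for other column types.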