Introduction
The skimr package (maintained under rOpenSci) summarises the contents of a data frame, and if the default summary does not suit you, it can be customised.
With skimr versions earlier than 2.0.1, this script has to be run on Linux for the histograms to display.
The data
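# Work on a copy of the built-in iris dataset
iris2 = iris
# Introduce one missing value (row 1, column 3: Petal.Length)
iris2[1, 3] = NA
# Turn a numeric column into a factor so that two factor columns are summarised
iris2$Sepal.Length = factor(iris2$Sepal.Length)
library("skimr")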
Default settings
Standard output
skim(iris2)
Name | iris2 |
Number of rows | 150 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
factor | 2 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Sepal.Length | 0 | 1 | FALSE | 35 | 5: 10, 5.1: 9, 6.3: 9, 5.7: 8 |
Species | 0 | 1 | FALSE | 3 | set: 50, ver: 50, vir: 50 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Sepal.Width | 0 | 1.00 | 3.06 | 0.44 | 2.0 | 2.8 | 3.0 | 3.3 | 4.4 | ▁▆▇▂▁ |
Petal.Length | 1 | 0.99 | 3.77 | 1.76 | 1.0 | 1.6 | 4.4 | 5.1 | 6.9 | ▇▁▆▇▂ |
Petal.Width | 0 | 1.00 | 1.20 | 0.76 | 0.1 | 0.3 | 1.3 | 1.8 | 2.5 | ▇▁▇▅▃ |
Statistics computed for each variable type
get_default_skimmer_names()
$AsIs
[1] "n_unique" "min_length" "max_length"
$character
[1] "min" "max" "empty" "n_unique" "whitespace"
$complex
[1] "mean"
$Date
[1] "min" "max" "median" "n_unique"
$difftime
[1] "min" "max" "median" "n_unique"
$factor
[1] "ordered" "n_unique" "top_counts"
$list
[1] "n_unique" "min_length" "max_length"
$logical
[1] "mean" "count"
$numeric
[1] "mean" "sd" "p0" "p25" "p50" "p75" "p100" "hist"
$POSIXct
[1] "min" "max" "median" "n_unique"
$Timespan
[1] "min" "max" "median" "n_unique"
$ts
[1] "start" "end" "frequency" "deltat" "mean"
[6] "sd" "min" "max" "median" "line_graph"
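Since the result is a named list, the statistics for a single type can be looked up by name. A minimal sketch, relying only on the list structure printed above (the variable name `noms` is ours):

```r
library(skimr)

# Named list: one character vector of statistic names per column type
noms <- get_default_skimmer_names()

# Statistics computed by default for numeric columns
noms$numeric

# Is the inline histogram part of the numeric defaults?
"hist" %in% noms$numeric
```

This is a convenient way to check which names can be set to NULL for a given type before customising the summary.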
Focusing on a few statistics
# Keep only the missing-value count and the mean of the numeric columns
iris2 %>% skim() %>% focus(n_missing, numeric.mean)
Name | Piped data |
Number of rows | 150 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
factor | 2 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing |
---|---|
Sepal.Length | 0 |
Species | 0 |
Variable type: numeric
skim_variable | n_missing | mean |
---|---|---|
Sepal.Width | 0 | 3.06 |
Petal.Length | 1 | 3.77 |
Petal.Width | 0 | 1.20 |
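focus() selects columns of the skim object by name, which is why the mean is written numeric.mean: type-specific statistics are stored with their type as a prefix, whereas n_missing is not. A quick way to list the names that focus() accepts, as a sketch (the names `res` and `colonnes` are ours):

```r
library(skimr)

# Same data preparation as at the top of the post
iris2 <- iris
iris2[1, 3] <- NA
iris2$Sepal.Length <- factor(iris2$Sepal.Length)

# The skim object is a data frame: its column names are the names focus() selects on
res <- skim(iris2)
colonnes <- names(res)
colonnes
```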
Removing some statistics for numeric and factor columns
The "base" statistics n_missing and complete_rate have a special status: to remove one of them, the base argument has to be redefined as shown below. Any other statistic is removed simply by setting it to NULL.
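# Custom skim function: drop most numeric statistics and the "ordered" flag,
# and keep n_missing as the only base statistic (complete_rate is dropped)
mon_skim = skim_with(numeric = sfl(hist = NULL, p25 = NULL, p50 = NULL, p75 = NULL,
                                   mean = NULL, sd = NULL),
                     factor = sfl(ordered = NULL),
                     base = sfl(n_missing = n_missing))
mon_skim(iris2)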
Name | iris2 |
Number of rows | 150 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
factor | 2 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | n_unique | top_counts |
---|---|---|---|
Sepal.Length | 0 | 35 | 5: 10, 5.1: 9, 6.3: 9, 5.7: 8 |
Species | 0 | 3 | set: 50, ver: 50, vir: 50 |
Variable type: numeric
skim_variable | n_missing | p0 | p100 |
---|---|---|---|
Sepal.Width | 0 | 2.0 | 4.4 |
Petal.Length | 1 | 1.0 | 6.9 |
Petal.Width | 0 | 0.1 | 2.5 |
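The same mechanism works in the other direction. As a sketch, assuming the exported skimr function complete_rate() is used as the only base statistic (the name `skim_taux` is ours), the missing-value count disappears and the completion rate is kept, while the per-type defaults are left untouched:

```r
library(skimr)

# Same data preparation as at the top of the post
iris2 <- iris
iris2[1, 3] <- NA
iris2$Sepal.Length <- factor(iris2$Sepal.Length)

# Redefine the base statistics: only complete_rate remains
skim_taux <- skim_with(base = sfl(complete_rate = complete_rate))

res_taux <- skim_taux(iris2)
names(res_taux)
```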
A custom statistic for numeric columns
We add a custom statistic; the option append = FALSE drops the default statistics for the types listed (here, numeric), while the base statistics n_missing and complete_rate are kept:
# ma_fonction: ratio of the largest to the smallest value, ignoring NAs
mon_skim = skim_with(numeric = sfl(ma_fonction = function(x) {
  max(x, na.rm = TRUE)/min(x, na.rm = TRUE)}),
  append = FALSE)
mon_skim(iris2)
Name | iris2 |
Number of rows | 150 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
factor | 2 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Sepal.Length | 0 | 1 | FALSE | 35 | 5: 10, 5.1: 9, 6.3: 9, 5.7: 8 |
Species | 0 | 1 | FALSE | 3 | set: 50, ver: 50, vir: 50 |
Variable type: numeric
skim_variable | n_missing | complete_rate | ma_fonction |
---|---|---|---|
Sepal.Width | 0 | 1.00 | 2.2 |
Petal.Length | 1 | 0.99 | 6.9 |
Petal.Width | 0 | 1.00 | 25.0 |
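For comparison, leaving append at its default value (TRUE) adds the custom statistic next to the default ones instead of replacing them. A minimal sketch, where the coefficient of variation `cv` and the name `mon_skim_cv` are illustrative choices of ours:

```r
library(skimr)

# Same data preparation as at the top of the post
iris2 <- iris
iris2[1, 3] <- NA
iris2$Sepal.Length <- factor(iris2$Sepal.Length)

# append is left at its default, so mean, sd, quantiles and hist are kept
mon_skim_cv <- skim_with(numeric = sfl(cv = function(x) {
  sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
}))

res_cv <- mon_skim_cv(iris2)
names(res_cv)
```

Choosing between the two modes comes down to whether the custom statistic is meant to complement the default summary or to stand alone.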