Introduction

The skimr package from rOpenSci summarizes the contents of a data frame, and if the default summary does not suit you, you can customize it.

With skimr versions earlier than 2.0.1, run this script on Linux so that the histograms display correctly.
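To find out whether this caveat applies to your installation, you can compare the installed version with 2.0.1. This is only a minimal sketch: the variable name `needs_linux_for_hist` is hypothetical, and the threshold is simply the one stated above.

```r
# Compare the installed skimr version with the 2.0.1 threshold mentioned above.
installed <- utils::packageVersion("skimr")

# Hypothetical helper variable: TRUE means the histogram caveat applies.
needs_linux_for_hist <- installed < "2.0.1"
needs_linux_for_hist
```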

The data

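# Work on a copy of iris so the original dataset stays untouched
iris2 = iris
# Introduce one missing value (row 1, column 3 = Petal.Length)
iris2[1,3] = NA
# Convert Sepal.Length to a factor to get a second factor column
iris2$Sepal.Length = factor(iris2$Sepal.Length)

library("skimr")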
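Before skimming, it can help to confirm what the preparation steps did. The sketch below uses base R only and rebuilds `iris2` so that it runs on its own. It is an illustrative check, not part of the original script.

```r
# Rebuild the modified dataset exactly as above
iris2 <- iris
iris2[1, 3] <- NA
iris2$Sepal.Length <- factor(iris2$Sepal.Length)

# Count missing values per column: only Petal.Length should have one
colSums(is.na(iris2))

# Class of each column: two factors (Sepal.Length, Species) and three numerics
sapply(iris2, class)
```

This is why the summaries in the following sections report two factor columns, three numeric columns, and a single missing value.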

Default settings

Standard output

skim(iris2)
Data summary
Name iris2
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
factor 2
numeric 3
________________________
Group variables None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
Sepal.Length   0          1              FALSE    35        5: 10, 5.1: 9, 6.3: 9, 5.7: 8
Species        0          1              FALSE    3         set: 50, ver: 50, vir: 50

Variable type: numeric

skim_variable  n_missing  complete_rate  mean  sd    p0   p25  p50  p75  p100  hist
Sepal.Width    0          1.00           3.06  0.44  2.0  2.8  3.0  3.3  4.4   ▁▆▇▂▁
Petal.Length   1          0.99           3.77  1.76  1.0  1.6  4.4  5.1  6.9   ▇▁▆▇▂
Petal.Width    0          1.00           1.20  0.76  0.1  0.3  1.3  1.8  2.5   ▇▁▇▅▃

KPIs computed for each variable type

get_default_skimmer_names()
$AsIs
[1] "n_unique"   "min_length" "max_length"

$character
[1] "min"        "max"        "empty"      "n_unique"   "whitespace"

$complex
[1] "mean"

$Date
[1] "min"      "max"      "median"   "n_unique"

$difftime
[1] "min"      "max"      "median"   "n_unique"

$factor
[1] "ordered"    "n_unique"   "top_counts"

$list
[1] "n_unique"   "min_length" "max_length"

$logical
[1] "mean"  "count"

$numeric
[1] "mean" "sd"   "p0"   "p25"  "p50"  "p75"  "p100" "hist"

$POSIXct
[1] "min"      "max"      "median"   "n_unique"

$Timespan
[1] "min"      "max"      "median"   "n_unique"

$ts
 [1] "start"      "end"        "frequency"  "deltat"     "mean"      
 [6] "sd"         "min"        "max"        "median"     "line_graph"
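As the output above shows, the function returns a named list with one character vector per variable type, so the defaults for a single type can be read with ordinary list indexing. A minimal sketch (the variable name `defaults` is just for illustration):

```r
library("skimr")

# The result is a named list, so standard list indexing applies
defaults <- get_default_skimmer_names()

# KPIs computed by default for numeric and factor columns
defaults[["numeric"]]
defaults[["factor"]]
```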

Focusing on a few KPIs

iris2 %>% skim() %>% focus(n_missing, numeric.mean)
Data summary
Name Piped data
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
factor 2
numeric 3
________________________
Group variables None

Variable type: factor

skim_variable  n_missing
Sepal.Length   0
Species        0

Variable type: numeric

skim_variable  n_missing  mean
Sepal.Width    0          3.06
Petal.Length   1          3.77
Petal.Width    0          1.20
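The focus() call above refers to `n_missing` without a prefix but to `numeric.mean` with one. Listing the column names of the underlying skim result makes that convention visible: base KPIs are shared by all types, while type-specific KPIs are stored as `<type>.<kpi>`. A small sketch, with `skimmed` as an illustrative variable name:

```r
library("skimr")

# Rebuild the modified dataset so the example runs on its own
iris2 <- iris
iris2[1, 3] <- NA
iris2$Sepal.Length <- factor(iris2$Sepal.Length)

# The printed summary hides the prefixes; the underlying columns keep them
skimmed <- skim(iris2)
names(skimmed)
```

These are the names to pass to focus() when selecting KPIs.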

Removing some KPIs for numeric and factor columns

The "base" KPIs n_missing and complete_rate have a special status: to remove them, you have to modify the base parameter, as shown below. For the other KPIs, simply set them to NULL.

# Drop KPIs by setting them to NULL; base keeps only n_missing (complete_rate is removed)
mon_skim = skim_with(numeric = sfl(hist = NULL, p25 = NULL, p50 = NULL, p75 = NULL,
                                   mean = NULL, sd = NULL),
                     factor = sfl(ordered = NULL),
                     base = sfl(n_missing = n_missing))

mon_skim(iris2)
Data summary
Name iris2
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
factor 2
numeric 3
________________________
Group variables None

Variable type: factor

skim_variable  n_missing  n_unique  top_counts
Sepal.Length   0          35        5: 10, 5.1: 9, 6.3: 9, 5.7: 8
Species        0          3         set: 50, ver: 50, vir: 50

Variable type: numeric

skim_variable  n_missing  p0   p100
Sepal.Width    0          2.0  4.4
Petal.Length   1          1.0  6.9
Petal.Width    0          0.1  2.5
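The same mechanism works the other way round. The sketch below is a hypothetical variant of the call above (the name `skim_rate_only` is not from the original script): it keeps complete_rate as the only base KPI, which removes n_missing.

```r
library("skimr")

# Rebuild the modified dataset so the example runs on its own
iris2 <- iris
iris2[1, 3] <- NA
iris2$Sepal.Length <- factor(iris2$Sepal.Length)

# Hypothetical variant: complete_rate becomes the only base KPI
skim_rate_only <- skim_with(base = sfl(complete_rate = complete_rate))
res <- skim_rate_only(iris2)

# n_missing no longer appears among the columns
names(res)
```

Whatever is listed in base replaces the default pair, so any base KPI left out of the sfl() call disappears from the summary.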

Custom KPI for numeric columns

We add a custom KPI. The option append = FALSE removes the default KPIs for the type being customized (here numeric), while the factor columns keep theirs:

# ma_fonction: ratio of the maximum to the minimum, ignoring missing values
mon_skim = skim_with(numeric = sfl(ma_fonction = function(x) {
                                    max(x, na.rm = TRUE)/min(x, na.rm = TRUE)}),
                    append = FALSE)
mon_skim(iris2)
Data summary
Name iris2
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
factor 2
numeric 3
________________________
Group variables None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
Sepal.Length   0          1              FALSE    35        5: 10, 5.1: 9, 6.3: 9, 5.7: 8
Species        0          1              FALSE    3         set: 50, ver: 50, vir: 50

Variable type: numeric

skim_variable  n_missing  complete_rate  ma_fonction
Sepal.Width    0          1.00           2.2
Petal.Length   1          0.99           6.9
Petal.Width    0          1.00           25.0
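For comparison, here is a sketch of the same custom KPI with append = TRUE, which keeps the default numeric KPIs and adds the new one next to them. The names `skim_ratio` and `ratio` are illustrative, not from the original script.

```r
library("skimr")

# Rebuild the modified dataset so the example runs on its own
iris2 <- iris
iris2[1, 3] <- NA
iris2$Sepal.Length <- factor(iris2$Sepal.Length)

# Same max/min ratio as above, but appended to the default numeric KPIs
skim_ratio <- skim_with(
  numeric = sfl(ratio = function(x) max(x, na.rm = TRUE) / min(x, na.rm = TRUE)),
  append = TRUE
)
res <- skim_ratio(iris2)

# Both the default mean and the custom ratio are present
names(res)
```

Use append = FALSE when the custom KPI should replace the defaults, and append = TRUE when it should complement them.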
