class: center, middle, inverse, title-slide

# Aggregates in R and Python in RStudio
### Sebastien Foulle
### 07/09/2018

---

<style>
.python{
  background-color: gold !important;
}
pre {
  white-space: pre-wrap;
  box-shadow: 10px 5px 5px lightblue;
}
</style>

## Contents of this document

The following slides present the usual aggregation operations in R with *dplyr* and in Python with *pandas*.

Prerequisites for building this **xaringan** presentation with RStudio:

- a Python distribution (Anaconda works very well)
- the R package *reticulate*, which makes R objects available in Python chunks (syntax *r.my_r_object*) and Python objects available in R chunks (syntax *py$my_python_object*)

.pull-left[

```r
library("reticulate")
# path to the Python executable
use_python("C:/Users/Sebastien/Anaconda3/python.exe")
library("dplyr")
tips = reshape2::tips[75:78,-1]
```

```python
import numpy as np
import pandas as pd
```

]

---

## The data

.pull-left[

<p style="text-align:center">
</p>

```r
tips
```

```
    tip    sex smoker  day   time size
75 2.20 Female     No  Sat Dinner    2
76 1.25   Male     No  Sat Dinner    2
77 3.08   Male    Yes  Sat Dinner    2
78 4.00   Male     No Thur  Lunch    4
```

]

.pull-right[

<p style="text-align:center">
</p>

```python
tips = r.tips
tips
```

```
     tip     sex smoker   day    time  size
75  2.20  Female     No   Sat  Dinner     2
76  1.25    Male     No   Sat  Dinner     2
77  3.08    Male    Yes   Sat  Dinner     2
78  4.00    Male     No  Thur   Lunch     4
```

]

---

## Data types

.pull-left[

<p style="text-align:center">
</p>

There is one numeric field, one integer field and four factors.

```r
str(tips)
```

```
'data.frame':	4 obs. of  6 variables:
 $ tip   : num  2.2 1.25 3.08 4
 $ sex   : Factor w/ 2 levels "Female","Male": 1 2 2 2
 $ smoker: Factor w/ 2 levels "No","Yes": 1 1 2 1
 $ day   : Factor w/ 4 levels "Fri","Sat","Sun",..: 2 2 2 4
 $ time  : Factor w/ 2 levels "Dinner","Lunch": 1 1 1 2
 $ size  : int  2 2 2 4
```

]

.pull-right[

<p style="text-align:center">
</p>

There is one numeric field, one integer field and four *pandas* categories.

```python
tips.info()
```

```
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 75 to 78
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   tip     4 non-null      float64
 1   sex     4 non-null      category
 2   smoker  4 non-null      category
 3   day     4 non-null      category
 4   time    4 non-null      category
 5   size    4 non-null      int32
dtypes: category(4), float64(1), int32(1)
memory usage: 672.0+ bytes
```

]

---

## Simple aggregates

.pull-left[

<p style="text-align:center">
</p>

```r
tips %>% summarise(moy = mean(tip))
```

```
     moy
1 2.6325
```

]

.pull-right[

<p style="text-align:center">
</p>

```python
tips[["tip"]].mean()
```

```
tip    2.6325
dtype: float64
```

]

---

## Simple aggregates for all numeric columns

.pull-left[

<p style="text-align:center">
</p>

```r
tips %>% summarise_if(is.numeric, mean)
```

```
     tip size
1 2.6325  2.5
```

]

.pull-right[

<p style="text-align:center">
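</p>

```python
# numeric_only=True skips the categorical columns (pandas 2.x raises a TypeError without it)
tips.mean(numeric_only=True)
```

```
tip     2.6325
size    2.5000
dtype: float64
```

]

---

## Application: % of missing values per column

.pull-left[

<p style="text-align:center">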
</p>

```r
tips2 = tips; tips2[1,1] = NA;
# in base R: colMeans(is.na(tips2))
is.na(tips2) %>% as.data.frame %>% summarise_all(mean)
```

```
   tip sex smoker day time size
1 0.25   0      0   0    0    0
```

]

.pull-right[

<p style="text-align:center">
</p>

```python
tips2 = r.tips2
tips2.isna().mean().reset_index(name = "pct_na")
```

```
    index  pct_na
0     tip    0.25
1     sex    0.00
2  smoker    0.00
3     day    0.00
4    time    0.00
5    size    0.00
```

]

---

## Aggregates by group

.pull-left[

<p style="text-align:center">
</p>

The output shows `2.62` instead of `2.625` because a *tibble* prints 3 significant digits by default. Two fixes: set *options(pillar.sigfig = 7)*, or append *%>% as.data.frame* to the pipeline.

```r
tips %>% group_by(sex,smoker) %>% summarise(moy=mean(tip))
```

```
# A tibble: 3 x 3
# Groups:   sex [2]
  sex    smoker   moy
  <fct>  <fct>  <dbl>
1 Female No      2.2
2 Male   No      2.62
3 Male   Yes     3.08
```

]

.pull-right[

<p style="text-align:center">
</p>

```python
tips.groupby(["sex", "smoker"], observed = True)["tip"].mean().reset_index(name = "moy")
```

```
      sex smoker    moy
0  Female     No  2.200
1    Male     No  2.625
2    Male    Yes  3.080
```

]

---

## Application: contingency table

.pull-left[

<p style="text-align:center">
</p>

```r
# shorter: tips %>% count(sex, smoker, name = "freq")
tips %>% group_by(sex,smoker) %>% summarise(freq =n())
```

```
# A tibble: 3 x 3
# Groups:   sex [2]
  sex    smoker  freq
  <fct>  <fct>  <int>
1 Female No         1
2 Male   No         2
3 Male   Yes        1
```

]

.pull-right[

<p style="text-align:center">
</p>

```python
tips.groupby(["sex", "smoker"], observed = True).size().reset_index(name = "freq")
```

```
      sex smoker  freq
0  Female     No     1
1    Male     No     2
2    Male    Yes     1
```

]

---

## Multiple aggregates per group

.pull-left[

<p style="text-align:center">
</p>

```r
tips %>% group_by(sex,smoker) %>% summarise(moy=mean(tip), somme=sum(size))
```

```
# A tibble: 3 x 4
# Groups:   sex [2]
  sex    smoker   moy somme
  <fct>  <fct>  <dbl> <int>
1 Female No      2.2      2
2 Male   No      2.62     6
3 Male   Yes     3.08     2
```

]

.pull-right[

<p style="text-align:center">
</p>

```python
tips.groupby(["sex", "smoker"], observed = True).agg(moy=('tip', 'mean'), somme=('size', 'sum')).reset_index()
```

```
      sex smoker    moy  somme
0  Female     No  2.200      2
1    Male     No  2.625      6
2    Male    Yes  3.080      2
```

]

---

## Group aggregates added back to the original table

.pull-left[

<p style="text-align:center">
</p>

```r
options(pillar.sigfig = 7)
tips %>% group_by(sex,smoker) %>% mutate(moy=mean(tip))
```

```
# A tibble: 4 x 7
# Groups:   sex, smoker [3]
    tip sex    smoker day   time    size   moy
  <dbl> <fct>  <fct>  <fct> <fct>  <int> <dbl>
1  2.2  Female No     Sat   Dinner     2 2.2
2  1.25 Male   No     Sat   Dinner     2 2.625
3  3.08 Male   Yes    Sat   Dinner     2 3.08
4  4    Male   No     Thur  Lunch      4 2.625
```

]

.pull-right[

<p style="text-align:center">
</p>

```python
tips["moy"] = tips.groupby(["sex", "smoker"], observed = True)["tip"].transform("mean")
tips
```

```
     tip     sex smoker   day    time  size    moy
75  2.20  Female     No   Sat  Dinner     2  2.200
76  1.25    Male     No   Sat  Dinner     2  2.625
77  3.08    Male    Yes   Sat  Dinner     2  3.080
78  4.00    Male     No  Thur   Lunch     4  2.625
```

]

---

## Application: keeping the tips above the mean

.pull-left[

<p style="text-align:center">
</p>

```r
tips %>% mutate(moy = mean(tip)) %>% filter(tip > moy)
```

```
    tip  sex smoker  day   time size    moy
77 3.08 Male    Yes  Sat Dinner    2 2.6325
78 4.00 Male     No Thur  Lunch    4 2.6325
```

]

.pull-right[

<p style="text-align:center">
</p>

```python
tips["zero"] = 0
tips["moy"]=tips.groupby("zero")["tip"].transform("mean")
tips.query('tip > moy')
```

```
     tip   sex smoker   day    time  size  zero     moy
77  3.08  Male    Yes   Sat  Dinner     2     0  2.6325
78  4.00  Male     No  Thur   Lunch     4     0  2.6325
```

]

---

## Variant: group aggregates with a dictionary + mapping

.pull-left[

<p style="text-align:center">
</p>

In R, use a named vector.

```r
vecteur = tips %>% group_by(day) %>% summarise(moy = mean(tip)) %>% tibble::deframe()
vecteur
tips["moy"] = vecteur[as.character(tips$day)]
tips
```

```
     Sat     Thur
2.176667 4.000000
    tip    sex smoker  day   time size      moy
75 2.20 Female     No  Sat Dinner    2 2.176667
76 1.25   Male     No  Sat Dinner    2 2.176667
77 3.08   Male    Yes  Sat Dinner    2 2.176667
78 4.00   Male     No Thur  Lunch    4 4.000000
```

]

.pull-right[

<p style="text-align:center">
</p>

In Python, use a dictionary.

```python
dico = tips.groupby("day", observed = True).tip.mean().to_dict()
dico
```

```
{'Sat': 2.1766666666666667, 'Thur': 4.0}
```

```python
tips["moy"] = tips.day.map(dico)
tips
```

```
     tip     sex smoker   day    time  size       moy
75  2.20  Female     No   Sat  Dinner     2  2.176667
76  1.25    Male     No   Sat  Dinner     2  2.176667
77  3.08    Male    Yes   Sat  Dinner     2  2.176667
78  4.00    Male     No  Thur   Lunch     4  4.000000
```

]
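
---

## Check: dictionary + mapping vs `transform`

The dictionary + mapping variant and the `transform("mean")` approach from the earlier slide should give the same per-row group mean. Below is a minimal standalone sketch, assuming the four rows shown in these slides are rebuilt by hand as plain strings rather than categories (so neither an R session nor *reticulate* is needed). The names `day_means`, `via_map`, `via_transform` and `same` are illustrative.

```python
import pandas as pd

# The four rows displayed in the slides, rebuilt by hand (assumption: plain
# string columns instead of the pandas categories obtained through r.tips).
tips = pd.DataFrame(
    {
        "tip": [2.20, 1.25, 3.08, 4.00],
        "sex": ["Female", "Male", "Male", "Male"],
        "smoker": ["No", "No", "Yes", "No"],
        "day": ["Sat", "Sat", "Sat", "Thur"],
        "time": ["Dinner", "Dinner", "Dinner", "Lunch"],
        "size": [2, 2, 2, 4],
    },
    index=["75", "76", "77", "78"],
)

# Variant 1: build a {day: mean tip} dictionary, then map it onto each row.
day_means = tips.groupby("day")["tip"].mean().to_dict()
via_map = tips["day"].map(day_means)

# Variant 2: let groupby broadcast the group mean back onto each row.
via_transform = tips.groupby("day")["tip"].transform("mean")

# Compare the two results row by row, up to floating-point noise.
same = bool((via_map - via_transform).abs().max() < 1e-12)
print(same)
```

The dictionary is handy when the same lookup has to be reused on another table, whereas `transform` avoids the intermediate object when everything stays in one DataFrame.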