### Load packages
library(tidyverse) # Collection of all the good stuff like dplyr, ggplot2, etc.
library(magrittr) # For extra piping operators (e.g. %<>%)
The Palmer penguins dataset is excellent for EDA and UML. It contains different measurements for 3 species of closely related penguins from several islands in Antarctica.
Let's have a look:
Penguin dataset: https://github.com/allisonhorst/palmerpenguins
# load the dataset from GitHub - original source
penguins <- read_csv("https://github.com/allisonhorst/palmerpenguins/raw/5b5891f01b52ae26ad8cb9755ec93672f49328a8/data/penguins_size.csv")
penguins %>% head()
penguins %>% glimpse()
Rows: 344
Columns: 7
$ species_short <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", …
$ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen", "Torgersen…
$ culmen_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37.8, 37.8, 41.1, 38.6, 34.6, 36.6, 38.7, 42.5, 34.4, 46.0, 37.8, 37.7, 35.9, 38.2…
$ culmen_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, 17.1, 17.3, 17.6, 21.2, 21.1, 17.8, 19.0, 20.7, 18.4, 21.5, 18.3, 18.7, 19.2, 18.1…
$ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180, 182, 191, 198, 185, 195, 197, 184, 194, 174, 180, 189, 185, 180, 187, 183, 187, 1…
$ body_mass_g <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, 3300, 3700, 3200, 3800, 4400, 3700, 3450, 4500, 3325, 4200, 3400, 3600, 3800, 3950…
$ sex <chr> "MALE", "FEMALE", "FEMALE", NA, "FEMALE", "MALE", "FEMALE", "MALE", NA, NA, NA, NA, "FEMALE", "MALE", "MALE", "FEMALE", "FEMALE", "MALE", "F…
# drop all missing observations
penguins %<>% drop_na()
penguins %>% count(species_short)
penguins %>% count(species_short, island) %>%
pivot_wider(names_from = island, values_from = n, values_fill = 0)
library(GGally)
penguins %>% ggpairs(legend = 1,
columns = c(3:6),
mapping = ggplot2::aes(colour=species_short, alpha = 0.5),
lower = list(continuous = wrap("smooth", alpha = 0.3, size=0.1))) +
theme(legend.position = "bottom")
Overall, we can see some general tendencies in the data.
Before we can do any machine learning, it is a good idea to scale the data. Most algorithms are not agnostic to magnitudes, so bringing all variables onto the same scale is crucial.
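As a quick illustration of what scaling does (a minimal sketch with made-up values, using base R's scale()): after standardization, every column has mean 0 and standard deviation 1, so no variable dominates just because of its units.

```r
# Two toy columns on very different scales (hypothetical values)
df <- data.frame(flipper_mm = c(181, 186, 195, 193),
                 mass_g     = c(3750, 3800, 3250, 3450))

scaled <- scale(df)  # center each column to mean 0, rescale to sd 1

round(colMeans(scaled), 10)  # both columns now have mean ~0
apply(scaled, 2, sd)         # and standard deviation 1
```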
We use the FactoMineR package to compute the PCA, and factoextra for extracting and visualizing the results. FactoMineR is a great package (and my favorite) for computing principal component methods in R. It's very easy to use and very well documented. factoextra is just a convenient ggplot wrapper that easily produces nice and informative diagnostic plots for a variety of DR and clustering techniques.
Let's do that. Notice the scale.unit = TRUE argument, which you should ALWAYS use. Afterwards, we take a look at the resulting list object.
library(FactoMineR) # PCA and other principal component methods
library(factoextra) # fviz_ helpers for visualizing the results

res_pca <- penguins %>%
  select_if(is.numeric) %>%
  PCA(scale.unit = TRUE, graph = FALSE)
Ok, let's look at the “screeplot”, a diagnostic visualization that displays the variance explained by each component. Here we use the factoextra package, as for all following visualizations with the fviz_ prefix. Notice that the output is in every case a ggplot2 object, which could be complemented with further layers.
res_pca %>%
fviz_screeplot(addlabels = TRUE,
ncp = 10,
ggtheme = theme_gray())
As expected, we see that the first component already captures the main share of the variance. Let's look at the corresponding eigenvalues.
res_pca$eig %>% as_tibble()
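The same eigenvalues can also be obtained with base R's prcomp(). A minimal sketch on simulated data (an assumption for illustration, not the penguin result): with standardized variables, the eigenvalues of the correlation matrix sum to the number of variables, and correlated variables pile their variance onto the first component.

```r
set.seed(1)
# Simulated data: two correlated columns plus two independent noise columns
x1 <- rnorm(100)
X  <- cbind(a = x1, b = x1 + rnorm(100, sd = 0.3),
            c = rnorm(100), d = rnorm(100))

pca <- prcomp(X, scale. = TRUE)     # scale. = TRUE ~ scale.unit = TRUE in FactoMineR
eigenvalues <- pca$sdev^2           # eigenvalues of the correlation matrix

eigenvalues                         # they sum to 4 (the number of variables)
sum(eigenvalues > 1)                # how many components the eigenvalue > 1 rule keeps
```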
For feature selection, our rule of thumb (the Kaiser criterion) is to only keep components with an eigenvalue > 1, meaning that in this case we would reduce our data to a single dimension. Let's project the observations onto 2-dimensional space and take a look at the vectors of our features. In this case, you could easily condense almost all information into one dimension. We will include the second as well for 2-d plotting, but would have no hard feelings about discarding the rest.
res_pca %>%
fviz_pca_var(alpha.var = "cos2",
col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE,
ggtheme = theme_gray())
Let's look at the numeric values.
res_pca %>% get_pca_var()
Principal Component Analysis Results for variables
===================================================
Name Description
1 "$coord" "Coordinates for the variables"
2 "$cor" "Correlations between variables and dimensions"
3 "$cos2" "Cos2 for the variables"
4 "$contrib" "contributions of the variables"
res_pca$var$coord %>%
as_tibble() %>%
head()
The results object also contains the loadings of the observations on the components.
res_pca %>% get_pca_ind()
Principal Component Analysis Results for individuals
===================================================
Name Description
1 "$coord" "Coordinates for the individuals"
2 "$cos2" "Cos2 for the individuals"
3 "$contrib" "contributions of the individuals"
res_pca$ind$coord %>%
as_tibble() %>%
head()
Let’s visualize our observations and the variable-loading together in the space of the first 2 components.
res_pca %>%
fviz_pca_biplot(alpha.ind = "cos2",
col.ind = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
geom = "point",
ggtheme = theme_gray())
We can also briefly check whether our dimensionality reduction is helpful to differentiate between species.
res_pca %>%
fviz_pca_biplot(alpha.ind = "cos2",
geom = "point",
habillage = penguins %>% pull(species_short) %>% factor(),
addEllipses = TRUE,
ggtheme = theme_gray())
Now let's try out UMAP, a newer dimensionality reduction algorithm that comes with many interesting features: https://umap-learn.readthedocs.io/en/latest/
If you want to learn more from the person behind the algorithm, check out this excellent talk by Leland McInnes (https://youtu.be/nq6iPZVUxZU) or the paper: https://arxiv.org/abs/1802.03426
# install.packages('uwot') # If necessary install
library(uwot)
res_umap <- penguins %>%
  select_if(is.numeric) %>%
umap(n_neighbors = 15,
metric = "cosine",
min_dist = 0.01,
scale = TRUE)
res_umap %>% as_tibble() %>%
glimpse()
Rows: 334
Columns: 2
$ V1 <dbl> -6.745879, -7.580199, -8.403356, -6.654872, -5.534422, -7.580670, -4.743643, -8.540195, -5.400925, -4.831200, -7.705870, -6.612927, -4.340905, -7.741749, -…
$ V2 <dbl> -0.35196864, -0.11502516, -0.83884513, -1.69948804, -2.05393969, -0.09900867, -1.41245281, 0.25474346, -2.06168162, -1.49675905, -1.02527558, -1.65299618, …
res_umap %>%
as_tibble() %>%
bind_cols(penguins %>% select(island, species_short)) %>%
ggplot(aes(x = V1, y = V2, col = species_short)) +
geom_point(shape = 21, alpha = 0.5)
UMAP seems to do a better job at reducing the dimensionality in a way that the resulting embedding fits well with the species distinction.
Next, we try k-means clustering, where we first have to determine the number of clusters k. Note that kmeans does not scale the data internally, so we do that upfront with scale().
# We use the fviz_nbclust() function of the factoextra package
penguins %>%
  select_if(is.numeric) %>%
  scale() %>%
  fviz_nbclust(kmeans, method = "wss")
# Before we start, something weird upfront: the function takes the observation names from the rownames (which nobody uses anymore, and which are deprecated by `dplyr`). So remember to define them right before you cluster, otherwise the next `dplyr` function will delete them again.
penguins_clust <- penguins %>% column_to_rownames('species_short') %>%
  select_if(is.numeric) %>%
  scale()
Error in `.rowNamesDF<-`(x, value = value) :
duplicate 'row.names' are not allowed
res_km <- penguins %>%
  select_if(is.numeric) %>%
  scale() %>%
  kmeans(centers = 3) # fit k-means with 3 clusters on the scaled data
res_km
K-means clustering with 3 clusters of sizes 129, 120, 85
Cluster means:
culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g
1 -1.0470735 0.4884505 -0.8833910 -0.7644518
2 0.6497893 -1.0965845 1.1566225 1.0953386
3 0.6717383 0.8068238 -0.2922031 -0.3861924
Clustering vector:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 3 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 3 1 1 1 3 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 3 1 1 1 3 1 3 1 1 1 3 1 3 1 1 1 1 1 1
[83] 1 1 1 3 1 1 1 3 1 1 1 3 1 3 1 1 1 1 1 1 1 3 1 3 1 3 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[165] 3 3 1 3 1 3 3 3 3 3 3 3 1 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[247] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[329] 2 2 2 2 2 2
Within cluster sum of squares by cluster:
[1] 120.8400 140.1537 109.6498
(between_SS / total_SS = 72.2 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size" "iter" "ifault"
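The (between_SS / total_SS) ratio printed above can be recomputed from the components listed: totss always equals tot.withinss + betweenss, and the closer betweenss is to totss, the better the separation. A small sketch on the built-in iris data (an assumption for illustration, not the penguin result):

```r
set.seed(123)
# k-means on a built-in dataset, scaled like in the tutorial
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 20)

# The identity totss = tot.withinss + betweenss always holds
all.equal(km$totss, km$tot.withinss + km$betweenss)

km$betweenss / km$totss  # the ratio reported as between_SS / total_SS
```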
Again, let's visualize it. To get a meaningful 2-d visualization, we again project the observations onto the space of the first 2 components.
res_km %>%
  fviz_cluster(data = penguins %>% select_if(is.numeric),
               ggtheme = theme_gray())
Ok, we got 3 clusters. Let's look at what's in them.
penguins %>%
  bind_cols(cluster = res_km$cluster) %>%
  select_if(is.numeric) %>%
  group_by(cluster) %>%
  mutate(n = n()) %>%
  summarise_all(mean)
Let's see if the clusters recover the species correctly:
table(penguins$species_short, res_km$cluster)
1 2 3
Adelie 124 0 22
Chinstrap 5 0 63
Gentoo 0 120 0
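One way to summarize such a cross-table in a single number is cluster purity: each cluster votes for its most frequent species, and we count the share of observations that vote lands on. A minimal sketch with the table above hard-coded, so it runs without the penguin data:

```r
# Species-by-cluster counts from the k-means result above
tab <- matrix(c(124,   0, 22,
                  5,   0, 63,
                  0, 120,  0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("Adelie", "Chinstrap", "Gentoo"), 1:3))

# Purity: sum of the per-cluster maxima divided by the total count
purity <- sum(apply(tab, 2, max)) / sum(tab)
round(purity, 3)  # ~0.919: roughly 92% of penguins sit in the "right" cluster
```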
For hierarchical clustering, we use the hcut function. hcut has a stand = TRUE argument, meaning we do not need to scale the data beforehand.
res_hc <- penguins %>%
  select_if(is.numeric) %>%
  hcut(hc_func = "hclust",
       k = 3,
       stand = TRUE)
res_hc %>%
fviz_dend(rect = TRUE, cex = 0.5)
Notice how the dendrogram is built: every data point eventually merges into a single cluster, with the height (distance) shown on the y-axis.
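The cut-the-tree-at-a-height idea behind hcut can be reproduced with base R's hclust() and cutree(). A sketch on simulated data (an assumption for illustration, not the penguin measurements):

```r
set.seed(42)
# Two clearly separated 2-d point clouds of 20 points each
X <- rbind(matrix(rnorm(40), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

# Build the full dendrogram with Ward linkage
hc <- hclust(dist(X), method = "ward.D2")

# Cutting the tree into k = 2 clusters recovers the two clouds
cl <- cutree(hc, k = 2)
table(cl)  # 20 points per cluster
```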
Let’s inspect what’s in the clusters.
penguins %>%
  bind_cols(cluster = res_hc$cluster) %>%
  select_if(is.numeric) %>%
group_by(cluster) %>%
mutate(n = n()) %>%
summarise_all(mean)
And again visualize them:
res_hc %>%
  fviz_cluster(data = penguins %>% select_if(is.numeric),
               ggtheme = theme_gray())
Let's see again how well we did at separating the species:
table(penguins$species_short, res_hc$cluster)
1 2 3
Adelie 146 0 0
Chinstrap 11 57 0
Gentoo 0 0 120
Finally, we can cluster directly on the principal components. The HCPC function already does that for us, and also offers a nice diagnostic visualization.
res_hcpc <- res_pca %>%
  HCPC(nb.clust = -1, # self-determined: highest relative loss of inertia
       graph = FALSE)
res_hcpc %>%
plot(choice = "3D.map")
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] uwot_0.1.8 Matrix_1.2-18 factoextra_1.0.7 FactoMineR_2.3 GGally_2.0.0 magrittr_1.5 forcats_0.5.0 stringr_1.4.0 dplyr_1.0.2
[10] purrr_0.3.4 readr_1.3.1 tidyr_1.1.2 tibble_3.0.3 ggplot2_3.3.2 tidyverse_1.3.0 knitr_1.29
loaded via a namespace (and not attached):
[1] nlme_3.1-149 fs_1.5.0 lubridate_1.7.9 RcppAnnoy_0.0.16 RColorBrewer_1.1-2 progress_1.2.2 httr_1.4.2
[8] tools_4.0.2 backports_1.1.9 utf8_1.1.4 R6_2.4.1 DBI_1.1.0 mgcv_1.8-33 colorspace_1.4-1
[15] withr_2.2.0 tidyselect_1.1.0 prettyunits_1.1.1 curl_4.3 compiler_4.0.2 cli_2.0.2 rvest_0.3.6
[22] flashClust_1.01-2 xml2_1.3.2 labeling_0.3 scales_1.1.1 digest_0.6.25 foreign_0.8-80 rmarkdown_2.3
[29] rio_0.5.16 base64enc_0.1-3 htmltools_0.5.0 pkgconfig_2.0.3 dbplyr_1.4.4 rlang_0.4.7 readxl_1.3.1
[36] rstudioapi_0.11 farver_2.0.3 generics_0.0.2 jsonlite_1.7.1 zip_2.1.1 car_3.0-9 leaps_3.1
[43] Rcpp_1.0.5 munsell_0.5.0 fansi_0.4.1 abind_1.4-5 lifecycle_0.2.0 scatterplot3d_0.3-41 stringi_1.5.3
[50] yaml_2.2.1 carData_3.0-4 MASS_7.3-53 plyr_1.8.6 grid_4.0.2 blob_1.2.1 ggrepel_0.8.2
[57] crayon_1.3.4 lattice_0.20-41 haven_2.3.1 splines_4.0.2 hms_0.5.3 pillar_1.4.6 ggpubr_0.4.0
[64] ggsignif_0.6.0 codetools_0.2-16 reprex_0.3.0 glue_1.4.2 evaluate_0.14 data.table_1.13.0 modelr_0.1.8
[71] vctrs_0.3.4 cellranger_1.1.0 gtable_0.3.0 reshape_0.8.8 assertthat_0.2.1 xfun_0.17 openxlsx_4.1.5
[78] broom_0.7.0 rstatix_0.6.0 cluster_2.1.0 ellipsis_0.3.1