### Load packages
library(tidyverse) # Collection of core packages such as dplyr, ggplot2, etc.
library(magrittr) # For extra piping operators (e.g. %<>%)

Introduction

Welcome to the applied session in data visualization for Exploratory Data Analysis (EDA) in R.

Introduction to ggplot2

ggplot2 can be thought of as a mini-language (a domain-specific language) within the R language. It is an R implementation of the ideas in Wilkinson’s book The Grammar of Graphics. A Layered Grammar of Graphics describes how Hadley Wickham implemented these ideas in the design of ggplot2. Thanks to its conceptual richness and the extensive functionality it provides, ggplot2 has over time become the main sub-ecosystem for graphic visualization in R. Most packages dedicated to specialized forms of visualization (networks, interactions, etc.) use ggplot2 as their underlying platform, so it makes sense to dive a bit deeper into its functionality.

Conceptually, the main idea behind the Grammar of Graphics is that a statistical graphic is a mapping from variables to aesthetic attributes (x axis value, y axis value, color, shape, size) of geometric objects (points, lines, bars).

While the Grammar of Graphics contains more elements, this brief intro focuses on the two main ones: aesthetics and geometries.

  • Aesthetics: Define the “surface” of your plot, in terms of what has to be mapped (size, color) on the x and y (and potentially additional) axes. Aesthetics are defined within the aes() function.
  • Geometries: Visual elements you can see in the plot itself, such as bars, lines, and points. They are defined within various geom_XYZ() functions.

Basically, you define a surface grid and then plot something on top. We will cover all of this in depth in later sessions; for now, that’s all you need to know to understand the following simple examples.
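As a minimal sketch of this two-step idea, the following example first defines the aesthetics and then adds a geometry on top. The rides data frame and its values are made up for illustration and are not part of the BIXI data.

```r
library(ggplot2)

# Hypothetical toy data: five rides with a start hour, a duration and a rider type
rides <- data.frame(
  hour     = c(8, 9, 12, 17, 18),
  duration = c(300, 450, 900, 600, 380),
  rider    = c("member", "member", "non-member", "member", "non-member")
)

# Step 1, aesthetics: which variable is mapped to which visual attribute
p_base <- ggplot(rides, aes(x = hour, y = duration, color = rider))

# Step 2, geometry: the visual element drawn for every row of the data
p <- p_base + geom_point()

p
```

Printing p_base alone would only show the empty coordinate grid; the points appear once the geom_point() layer is added.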

Application: the BIXI Bikeshare dataset

Let’s take a step back and zoom in on different forms of visualization. We will now take a look at the BIXI Bikeshare data, covering 500k bike rides in the BIXI bike-sharing system in Montreal.

bike <- readRDS(url("https://github.com/SDS-AAU/SDS-master/raw/master/00_data/bikes_montreal.rds?dl=1"))

Let’s take a look:

bike %>% glimpse()
Rows: 500,000
Columns: 12
$ start_date         <dttm> 2017-08-16 12:10:00, 2017-06-25 23:22:00, 2017-08-10 17:26:00, 2017-08-17 15:25:00, 2017-10-12 10:39:00, 2017-07-03 08:53:00, 2…
$ start_station_code <int> 6213, 6393, 6114, 6044, 6389, 6411, 6738, 6425, 7042, 6034, 6213, 6184, 6008, 6202, 6048, 6087, 6195, 6013, 6168, 6154, 6901, 62…
$ end_date           <dttm> 2017-08-16 12:30:00, 2017-06-25 23:27:00, 2017-08-10 17:29:00, 2017-08-17 15:32:00, 2017-10-12 10:49:00, 2017-07-03 08:55:00, 2…
$ end_station_code   <int> 6391, 6394, 6113, 6015, 6262, 6206, 6090, 6406, 6185, 6039, 6188, 6142, 6012, 6038, 6020, 6032, 7080, 6078, 6302, 6164, 6011, 62…
$ duration_sec       <int> 1237, 294, 156, 419, 601, 108, 438, 757, 1144, 578, 366, 326, 154, 835, 863, 565, 929, 262, 633, 559, 1071, 120, 847, 1223, 287,…
$ start_day          <date> 2017-08-16, 2017-06-25, 2017-08-10, 2017-08-17, 2017-10-12, 2017-07-03, 2017-07-04, 2017-06-27, 2017-11-09, 2017-07-03, 2017-05…
$ start_dow          <fct> Wed, Sun, Thu, Thu, Thu, Mon, Tue, Tue, Thu, Mon, Wed, Thu, Wed, Thu, Wed, Wed, Mon, Sun, Thu, Wed, Thu, Wed, Sun, Sat, Sun, Thu…
$ weekday            <fct> workweek, weekend, workweek, workweek, workweek, workweek, workweek, workweek, workweek, workweek, workweek, workweek, workweek,…
$ start_hod          <dbl> 12, 23, 17, 15, 10, 8, 16, 17, 9, 17, 16, 8, 10, 18, 7, 11, 8, 15, 23, 16, 8, 22, 9, 22, 19, 17, 14, 18, 15, 14, 21, 20, 14, 17,…
$ start_mon          <dbl> 8, 6, 8, 8, 10, 7, 7, 6, 11, 7, 5, 6, 8, 8, 10, 7, 8, 10, 6, 5, 8, 5, 4, 8, 6, 5, 9, 4, 10, 8, 9, 10, 8, 5, 5, 6, 7, 6, 6, 6, 10…
$ start_wk           <dbl> 33, 26, 32, 33, 41, 27, 27, 26, 45, 27, 22, 26, 33, 32, 40, 30, 33, 41, 26, 19, 35, 22, 17, 33, 23, 19, 39, 16, 42, 33, 37, 43, …
$ membership         <fct> member, non-member, member, member, member, member, member, member, member, member, member, member, member, member, member, memb…
bike %>% head()

We see here a number of different variable types present, namely:

  • Continuous variables
  • Categorical variables
  • Temporal variables

Remember that the first thing we do is define the aesthetics, starting with the dimensions (x, y) of the visualization.

bike %>% ggplot(aes(x = weekday, y = start_hod)) 

The result will be an empty plane with the dimensions we defined. Note that there are more aesthetic dimensions that can be used to convey information visually, for instance:

  • Position (x, y)
  • Color
  • Shape
  • Alpha (Transparency)

We will explore them later.
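As a small preview, the sketch below uses color and shape as mapped aesthetics and alpha as a fixed setting. It relies on a made-up toy data frame that only borrows three column names from the bike data. Aesthetics inside aes() vary with the data, while arguments outside aes() apply the same constant value to every point.

```r
library(ggplot2)

# Hypothetical toy data reusing three column names from the bike data
rides <- data.frame(
  start_hod    = c(7, 8, 12, 17, 18, 23),
  duration_sec = c(420, 300, 900, 610, 550, 1200),
  membership   = c("member", "member", "non-member",
                   "member", "member", "non-member")
)

p <- ggplot(rides, aes(x = start_hod, y = duration_sec)) +
  geom_point(
    aes(color = membership, shape = membership), # mapped: depends on the data
    alpha = 0.6,                                 # set: constant transparency
    size = 3                                     # set: constant point size
  )

p
```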

Basic visualization of variable types

Summaries of One Variable: Continuous

When attempting to summarize a single variable, histograms and density distributions are often the visualization of choice. We can do that easily by using the geom_histogram() layer. Notice that we only define an x aesthetic, since we only summarize one variable.

bike %>% ggplot(aes(x = duration_sec)) +
  geom_histogram()
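Called without arguments, geom_histogram() falls back to 30 bins and prints a message suggesting that you pick a better value. The sketch below shows the two usual ways of controlling the binning. It uses simulated durations as a stand-in for duration_sec, so the numbers are assumptions and not the real data.

```r
library(ggplot2)

set.seed(42)
# Hypothetical stand-in for duration_sec: 1,000 right-skewed ride durations
rides <- data.frame(
  duration_sec = round(rlnorm(1000, meanlog = 6.5, sdlog = 0.7))
)

# Option 1: choose the number of bins
p_bins <- ggplot(rides, aes(x = duration_sec)) +
  geom_histogram(bins = 50)

# Option 2: fix the bin width in data units (here: 60 seconds)
p_width <- ggplot(rides, aes(x = duration_sec)) +
  geom_histogram(binwidth = 60)

p_bins
p_width
```

A fixed binwidth is often easier to interpret, because each bar then covers a known interval such as one minute.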

To plot a probability density function (PDF) instead, we can use the geom_density() layer.

bike %>% ggplot(aes(x = duration_sec)) +
  geom_density()

Note the distribution appears right-skewed, since we have some outliers of very long bike rides. Adding a log-scale on the x-axis might help to reduce their impact on the visualization.

bike %>% ggplot(aes(x = duration_sec)) +
  geom_histogram() +
  scale_x_log10() 
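One detail worth knowing is that the scale transformation is applied before the histogram is computed, so the bins have equal width in log10 units, not in seconds. The following sketch illustrates this with simulated durations (an assumption, not the BIXI data) by inspecting the computed bins with layer_data().

```r
library(ggplot2)

set.seed(1)
# Hypothetical stand-in for duration_sec (strictly positive values)
rides <- data.frame(
  duration_sec = round(rlnorm(1000, meanlog = 6.5, sdlog = 0.7))
)

p_log <- ggplot(rides, aes(x = duration_sec)) +
  geom_histogram(bins = 30) +
  scale_x_log10()

# The computed bins live in log10 space: x is log10(duration_sec)
bins_log <- layer_data(p_log)
range(bins_log$x)

p_log
```

Values of zero or below cannot be shown on a log scale, so such rows would be dropped with a warning.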

If we already want to start looking at conditional distributions, we can add an additional fill aesthetic.

bike %>% ggplot(aes(x = duration_sec, fill = weekday)) +
  geom_histogram() +
  scale_x_log10() 
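With a fill aesthetic, geom_histogram() stacks the groups on top of each other by default, which can make the shapes of the individual distributions hard to compare. The sketch below, based on simulated data with made-up group sizes, contrasts the default with two possible alternatives: overlaid semi-transparent histograms and overlaid densities.

```r
library(ggplot2)

set.seed(7)
# Hypothetical data: 600 workweek rides and 400 somewhat longer weekend rides
rides <- data.frame(
  duration_sec = round(c(rlnorm(600, 6.3, 0.6), rlnorm(400, 6.9, 0.7))),
  weekday      = rep(c("workweek", "weekend"), times = c(600, 400))
)

base <- ggplot(rides, aes(x = duration_sec, fill = weekday)) +
  scale_x_log10()

# Default: bars of the two groups are stacked
p_stack <- base + geom_histogram(bins = 30)

# Alternative 1: draw both histograms from the baseline, semi-transparent
p_overlay <- base +
  geom_histogram(bins = 30, position = "identity", alpha = 0.5)

# Alternative 2: overlaid density curves, which ignore the group sizes
p_density <- base + geom_density(alpha = 0.5)

p_stack
p_overlay
p_density
```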

Summaries of One Variable: Discrete

To do the same for a discrete variable, we would start with a simple barplot via geom_bar(). Notice again that we only define an x aesthetic. By default, ggplot uses the count on the y-axis.

bike %>% ggplot(aes(x = start_dow)) +
  geom_bar()
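To make the implicit counting visible, the following sketch compares geom_bar() with the equivalent two-step version, where the counts are computed first with count() and then drawn with geom_col(). The six-row data frame is a made-up example.

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data: six rides on three weekdays
rides <- data.frame(
  start_dow = factor(c("Mon", "Mon", "Tue", "Wed", "Wed", "Wed"),
                     levels = c("Mon", "Tue", "Wed"))
)

# geom_bar() counts the rows per category itself
p_bar <- ggplot(rides, aes(x = start_dow)) +
  geom_bar()

# Equivalent: count explicitly, then plot the precomputed values
dow_counts <- rides %>% count(start_dow)
p_col <- ggplot(dow_counts, aes(x = start_dow, y = n)) +
  geom_col()

p_bar
p_col
```

The explicit version is handy when the summary is something other than a plain count, for example a mean duration per weekday.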

We could also use membership as a fill aesthetic to map further information into the plot.

bike %>% ggplot(aes(x = start_dow, fill = membership)) +
  geom_bar()

Summaries of One Variable: Temporal

A temporal variable can also be visualized as a line-plot with geom_line().

bike %>%
  count(start_wk) %>%
  ggplot(aes(x = start_wk, y = n)) +
  geom_line()

To instead (or in addition) add a trendline, we can use geom_smooth().

bike %>%
  count(start_wk) %>%
  ggplot(aes(x = start_wk, y = n)) +
  geom_smooth()
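Since layers are simply added on top of each other, the "in addition" variant is just both geoms in one plot. The sketch below uses simulated weekly counts (an assumption, not the BIXI numbers) and states the smoothing method explicitly, which also avoids the message that geom_smooth() prints when it picks a method itself.

```r
library(ggplot2)

set.seed(3)
# Hypothetical weekly ride counts with a seasonal peak around week 30
weekly <- data.frame(start_wk = 16:45)
weekly$n <- round(15000 - 60 * (weekly$start_wk - 30)^2 + rnorm(30, sd = 800))

p <- ggplot(weekly, aes(x = start_wk, y = n)) +
  geom_line(color = "grey50") +                               # raw weekly counts
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)  # smoothed trend

p
```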

Summarizing multiple variables jointly

OK, that was pretty easy. However, the insights gained so far are rather limited. To tease out interesting patterns in our data, it might not be enough to look at only one variable at a time. To display relationships between multiple variables, we can mainly:

  • Use aesthetics such as color, fill, size, shape (alter the aesthetics within one plot)
  • Use facet_wrap() (produce multiple plots)

Let’s look at some examples:

First, we could take a look at the number of daily rides with workweek / weekend days colored differently.

# Compute daily counts & plot
bike %>%
  count(start_day, weekday) %>%
  ggplot(aes(start_day, n, color = weekday)) +
  geom_point()

Now let’s look at how rides are distributed according to the time of day. Let’s make a summary plot of weekly ride counts faceted by start hour of day and broken down by workweek/weekend. Here, we will use facet_grid().

# Compute week_hod & plot
bike %>%
  count(start_wk, start_hod, weekday) %>%
  ggplot(aes(start_wk, n, color = weekday)) +
  geom_point() +
  facet_grid(~ start_hod) +
  scale_y_sqrt()

Expanding on the previous plot, let’s add one more variable into our summary, adding a facet dimension for whether or not the rider is a member of BIXI.

# Compute wk_memb_hod & plot
bike %>%
  count(start_wk, start_hod, weekday, membership) %>%
  ggplot(aes(start_wk, n, color = weekday)) +
  geom_point() +
  facet_grid(membership ~ start_hod) +
  scale_y_sqrt()
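The formula passed to facet_grid() reads as rows ~ columns: the variable on the left defines the rows of panels, the one on the right the columns. A minimal sketch on a made-up grid of values shows how the formula determines the number of panels, with facet_wrap() shown as a possible alternative that wraps the panels into a fixed number of columns.

```r
library(ggplot2)

# Hypothetical toy data: 6 weeks x 2 hours x 2 membership types
toy <- expand.grid(
  start_wk   = 20:25,
  start_hod  = c(8, 17),
  membership = c("member", "non-member"),
  stringsAsFactors = FALSE
)
toy$n <- seq_len(nrow(toy)) # arbitrary placeholder counts

base <- ggplot(toy, aes(start_wk, n)) +
  geom_point()

# Columns only: one panel per hour
p_cols <- base + facet_grid(~ start_hod)

# Rows and columns: membership in rows, hour in columns (2 x 2 panels)
p_grid <- base + facet_grid(membership ~ start_hod)

# Alternative: wrap all combinations into two columns
p_wrap <- base + facet_wrap(~ start_hod + membership, ncol = 2)

p_grid
```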

Let’s now look at the number of rides vs. hour for each day. To start, we’ll create a summary dataset for the first full month in the dataset (May) and look at it.

# Compute daily_may & plot
bike %>%
  filter(start_mon == 5) %>%
  count(start_day, start_hod, membership) %>%
  ggplot(aes(start_hod, n, color = membership)) +
  geom_point() +
  facet_wrap(~ start_day, ncol = 7)

Endnotes

References

Suggestions for further study

Own exploration

There is so much more to explore. However, since time is limited, I will leave it up to you to explore more.

  • Take a moment to review the different geoms offered by ggplot here.
  • For inspiration what can be done, check here.
  • Check ggplot2 addons here. Some of my favorites are:
    • ggforce: For a collection of additional features
    • patchwork: For easy integration of multiple plots jointly
    • GGally: Collection of many cool plotting features, including many standard stats plots for correlation, distribution etc.
    • ggmap: For geoplotting
    • ggraph: For network plots (will be handled later)
    • ggridges: Ridge features, for example to create joy-plots
    • ggalluvial: For alluvial plots

Datacamp

Other online courses

Papers, Ebooks & chapters

Session Info

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] plotly_4.9.2.1   patchwork_1.0.1  ggpubr_0.4.0     GGally_2.0.0     kableExtra_1.1.0 knitr_1.29       magrittr_1.5     forcats_0.5.0    stringr_1.4.0   
[10] dplyr_1.0.2      purrr_0.3.4      readr_1.3.1      tidyr_1.1.1      tibble_3.0.3     ggplot2_3.3.2    tidyverse_1.3.0 

loaded via a namespace (and not attached):
 [1] nlme_3.1-149       fs_1.5.0           lubridate_1.7.9    webshot_0.5.2      RColorBrewer_1.1-2 progress_1.2.2     httr_1.4.2         tools_4.0.2       
 [9] backports_1.1.8    utf8_1.1.4         R6_2.4.1           mgcv_1.8-32        DBI_1.1.0          lazyeval_0.2.2     colorspace_1.4-1   withr_2.2.0       
[17] tidyselect_1.1.0   prettyunits_1.1.1  curl_4.3           compiler_4.0.2     cli_2.0.2          rvest_0.3.6        pacman_0.5.1       xml2_1.3.2        
[25] labeling_0.3       scales_1.1.1       digest_0.6.25      foreign_0.8-80     rmarkdown_2.3      rio_0.5.16         base64enc_0.1-3    pkgconfig_2.0.3   
[33] htmltools_0.5.0    fastmap_1.0.1      dbplyr_1.4.4       htmlwidgets_1.5.1  rlang_0.4.7        readxl_1.3.1       rstudioapi_0.11    shiny_1.5.0       
[41] farver_2.0.3       generics_0.0.2     jsonlite_1.7.0     crosstalk_1.1.0.1  zip_2.1.0          car_3.0-9          Matrix_1.2-18      Rcpp_1.0.5        
[49] munsell_0.5.0      fansi_0.4.1        abind_1.4-5        lifecycle_0.2.0    stringi_1.4.6      yaml_2.2.1         carData_3.0-4      plyr_1.8.6        
[57] grid_4.0.2         blob_1.2.1         promises_1.1.1     crayon_1.3.4       lattice_0.20-41    splines_4.0.2      haven_2.3.1        hms_0.5.3         
[65] pillar_1.4.6       ggsignif_0.6.0     reprex_0.3.0       glue_1.4.1         evaluate_0.14      data.table_1.13.0  modelr_0.1.8       vctrs_0.3.2       
[73] httpuv_1.5.4       cellranger_1.1.0   gtable_0.3.0       reshape_0.8.8      assertthat_0.2.1   xfun_0.16          openxlsx_4.1.5     mime_0.9          
[81] xtable_1.8-4       broom_0.7.0        rstatix_0.6.0      later_1.1.0.1      rsconnect_0.8.16   viridisLite_0.3.0  ellipsis_0.3.1    
