Application: the BIXI Bikeshare dataset
Let's take a step back and zoom in on different forms of visualization. We will now take a look at the BIXI Bikeshare Data, covering 500k bike rides in the BIXI bike-sharing system in Montreal.
bike <- readRDS(url("https://github.com/SDS-AAU/SDS-master/raw/master/00_data/bikes_montreal.rds?dl=1"))
Let's take a look:
bike %>% glimpse()
Rows: 500,000
Columns: 12
$ start_date <dttm> 2017-08-16 12:10:00, 2017-06-25 23:22:00, 2017-08-10 17:26:00, 2017-08-17 15:25:00, 2017-10-12 10:39:00, 2017-07-03 08:53:00, 2…
$ start_station_code <int> 6213, 6393, 6114, 6044, 6389, 6411, 6738, 6425, 7042, 6034, 6213, 6184, 6008, 6202, 6048, 6087, 6195, 6013, 6168, 6154, 6901, 62…
$ end_date <dttm> 2017-08-16 12:30:00, 2017-06-25 23:27:00, 2017-08-10 17:29:00, 2017-08-17 15:32:00, 2017-10-12 10:49:00, 2017-07-03 08:55:00, 2…
$ end_station_code <int> 6391, 6394, 6113, 6015, 6262, 6206, 6090, 6406, 6185, 6039, 6188, 6142, 6012, 6038, 6020, 6032, 7080, 6078, 6302, 6164, 6011, 62…
$ duration_sec <int> 1237, 294, 156, 419, 601, 108, 438, 757, 1144, 578, 366, 326, 154, 835, 863, 565, 929, 262, 633, 559, 1071, 120, 847, 1223, 287,…
$ start_day <date> 2017-08-16, 2017-06-25, 2017-08-10, 2017-08-17, 2017-10-12, 2017-07-03, 2017-07-04, 2017-06-27, 2017-11-09, 2017-07-03, 2017-05…
$ start_dow <fct> Wed, Sun, Thu, Thu, Thu, Mon, Tue, Tue, Thu, Mon, Wed, Thu, Wed, Thu, Wed, Wed, Mon, Sun, Thu, Wed, Thu, Wed, Sun, Sat, Sun, Thu…
$ weekday <fct> workweek, weekend, workweek, workweek, workweek, workweek, workweek, workweek, workweek, workweek, workweek, workweek, workweek,…
$ start_hod <dbl> 12, 23, 17, 15, 10, 8, 16, 17, 9, 17, 16, 8, 10, 18, 7, 11, 8, 15, 23, 16, 8, 22, 9, 22, 19, 17, 14, 18, 15, 14, 21, 20, 14, 17,…
$ start_mon <dbl> 8, 6, 8, 8, 10, 7, 7, 6, 11, 7, 5, 6, 8, 8, 10, 7, 8, 10, 6, 5, 8, 5, 4, 8, 6, 5, 9, 4, 10, 8, 9, 10, 8, 5, 5, 6, 7, 6, 6, 6, 10…
$ start_wk <dbl> 33, 26, 32, 33, 41, 27, 27, 26, 45, 27, 22, 26, 33, 32, 40, 30, 33, 41, 26, 19, 35, 22, 17, 33, 23, 19, 39, 16, 42, 33, 37, 43, …
$ membership <fct> member, non-member, member, member, member, member, member, member, member, member, member, member, member, member, member, memb…
bike %>% head()
We see here a number of different variable types present, namely:
- Continuous variables
- Categorical variables
- Temporal variables
First, remember that the first thing we do is define the aesthetics, starting with the dimensions (x, y) of the visualization.
bike %>% ggplot(aes(x = weekday, y = start_hod))
The result is an empty plane with the dimensions we defined. Note that there are more aesthetic dimensions that can be used to convey information visually, for instance:
- Position (x, y)
- Color
- Shape
- Alpha (Transparency)
We will explore them later.
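As a minimal sketch of what defining the aesthetics does, the snippet below uses a tiny made-up data frame (`toy_rides` and its values are assumptions for illustration, not part of the BIXI data). It shows that a plot with only `aes()` has no layers yet, and that a further aesthetic such as color is mapped in the same way as x and y.

```r
library(ggplot2)

# Hypothetical toy data standing in for a few bike rides
toy_rides <- data.frame(
  weekday    = factor(c("workweek", "workweek", "weekend", "weekend")),
  start_hod  = c(8, 17, 11, 15),
  membership = factor(c("member", "member", "non-member", "member"))
)

# Only the aesthetics are defined: an empty coordinate plane, no layers yet
p_empty <- ggplot(toy_rides, aes(x = weekday, y = start_hod))
n_layers_empty <- length(p_empty$layers)

# Adding a geom layer with one more aesthetic (color) draws the data
p_points <- p_empty + geom_point(aes(color = membership))
n_layers_points <- length(p_points$layers)
```

Printing `p_empty` should give the empty plane described above, while `p_points` shows one colored point per row.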
Basic visualization of variable types
Summaries of One Variable: Continuous
When summarizing a single continuous variable, histograms and density plots are often the visualization of choice. We can do that easily with the geom_histogram() layer. Notice that we only define an x aesthetic, since we only summarize one variable.
bike %>% ggplot(aes(x = duration_sec)) +
geom_histogram()
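By default, geom_histogram() uses 30 bins and prints a message suggesting that a better value be picked. The sketch below sets the number of bins explicitly. It runs on simulated durations (`toy` and the log-normal parameters are assumptions for illustration, not the BIXI data) and uses `layer_data()` to inspect what ggplot2 computed for the layer.

```r
library(ggplot2)

set.seed(1)
# Hypothetical right-skewed durations in seconds (simulated, not the BIXI data)
toy <- data.frame(duration_sec = rlnorm(1000, meanlog = 6.5, sdlog = 0.7))

# Choose the number of bins explicitly instead of relying on the default of 30
p_hist <- ggplot(toy, aes(x = duration_sec)) +
  geom_histogram(bins = 50)

# layer_data() returns the values computed for the layer (one row per bin)
hist_data <- layer_data(p_hist)
total_count <- sum(hist_data$count)
```

The bin counts add up to the number of rows, so every ride lands in exactly one bin. How many bins works best depends on the data and is a judgment call.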
To plot an estimate of the probability density function (PDF) instead, we can use the geom_density() layer.
bike %>% ggplot(aes(x = duration_sec)) +
geom_density()
Note that the distribution is right-skewed, since a few very long bike rides stretch the tail. Adding a log scale on the x-axis might help to reduce their impact on the visualization.
bike %>% ggplot(aes(x = duration_sec)) +
geom_histogram() +
scale_x_log10()
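To see why a log scale tames a long right tail, here is a small base R sketch with example durations chosen purely for illustration (1, 10 and 100 minutes). On the raw scale the gaps between them grow tenfold, while on the log10 scale each tenfold increase becomes one equal step.

```r
# Example durations in seconds, chosen for illustration: 1, 10 and 100 minutes
durations <- c(60, 600, 6000)

# On the raw scale the gaps between neighbouring values grow tenfold
raw_gaps <- diff(durations)

# On the log10 scale each tenfold increase is one equal step
log_gaps <- diff(log10(durations))
```

Keep in mind that a log scale cannot display zero or negative values. ggplot2 drops such rows with a warning.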
If we already want to start looking at conditional distributions, we can add an additional fill aesthetic.
bike %>% ggplot(aes(x = duration_sec, fill = weekday)) +
geom_histogram() +
scale_x_log10()
Summaries of One Variable: Discrete
To do the same for a discrete variable, we start with a simple bar plot via geom_bar(). Notice again that we only define an x aesthetic. By default, ggplot will use the count on the y-axis.
bike %>% ggplot(aes(x = start_dow)) +
geom_bar()
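The counting that geom_bar() does can be checked by hand. The sketch below uses a made-up six-row data frame (`toy` is an assumption for illustration) and compares the bar heights that ggplot2 computes with a manual `table()`.

```r
library(ggplot2)

# Hypothetical toy data: day of week for six rides
toy <- data.frame(
  start_dow = factor(c("Mon", "Mon", "Tue", "Wed", "Wed", "Wed"),
                     levels = c("Mon", "Tue", "Wed"))
)

# geom_bar() counts the rows per category itself
p_bar <- ggplot(toy, aes(x = start_dow)) +
  geom_bar()
bar_counts <- as.numeric(layer_data(p_bar)$count)

# The same numbers computed by hand
manual_counts <- as.numeric(table(toy$start_dow))
```

If the counts are already computed, geom_col() with an explicit y aesthetic plots them as they are.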
We could also use membership as the fill aesthetic to map further information into the plot.
bike %>% ggplot(aes(x = start_dow, fill = membership)) +
geom_bar()
Summaries of One Variable: Temporal
A temporal variable can also be visualized as a line plot with geom_line().
bike %>%
  count(start_wk) %>%
  ggplot(aes(x = start_wk, y = n)) +
  geom_line()
To add a trend line instead (or in addition), we can use geom_smooth().
bike %>%
  count(start_wk) %>%
  ggplot(aes(x = start_wk, y = n)) +
  geom_smooth()
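For the "in addition" case, both layers can simply be stacked. The following is a sketch on simulated weekly counts (`weekly` and its values are assumptions for illustration, not the BIXI data). It names the smoothing method explicitly, which avoids the message that geom_smooth() prints when it picks one automatically.

```r
library(ggplot2)

set.seed(42)
# Hypothetical weekly ride counts (simulated, not the BIXI data)
weekly <- data.frame(start_wk = 15:45)
weekly$n <- round(10000 + 8000 * sin((weekly$start_wk - 15) / 30 * pi) +
                    rnorm(nrow(weekly), sd = 1000))

# Raw weekly counts as a line, with a smoothed trend drawn on top
p_trend <- ggplot(weekly, aes(x = start_wk, y = n)) +
  geom_line() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)

n_layers <- length(p_trend$layers)
```

Layers are drawn in the order they are added, so the trend line sits on top of the raw line here.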
Summarizing multiple variables jointly
OK, that was pretty easy. However, the insights gained so far are limited. To tease out interesting patterns in our data, it might not be enough to look at one variable at a time. To display relationships between multiple variables, we can mainly:
- Use aesthetics such as color, fill, size, shape (alter the aesthetics within one plot)
- Use facet_wrap() (produce multiple plots)
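The two options can be contrasted on the same data. The sketch below uses a small made-up data frame (`toy` and its counts are assumptions for illustration) and builds one plot that separates the groups by color and one that separates them into panels.

```r
library(ggplot2)

# Hypothetical toy data: ride counts per hour for two rider groups
toy <- data.frame(
  start_hod  = rep(c(8, 12, 17), times = 2),
  n          = c(120, 60, 150, 20, 80, 40),
  membership = rep(c("member", "non-member"), each = 3)
)

# Option 1: one panel, groups distinguished by color
p_color <- ggplot(toy, aes(start_hod, n, color = membership)) +
  geom_point()

# Option 2: one panel per group
p_facet <- ggplot(toy, aes(start_hod, n)) +
  geom_point() +
  facet_wrap(~ membership)

# Check which of the two plots actually uses wrapped facets
color_plot_is_faceted <- inherits(p_color$facet, "FacetWrap")
facet_plot_is_faceted <- inherits(p_facet$facet, "FacetWrap")
```

Color keeps the groups directly comparable in one panel. Facets tend to read better once the points of different groups start to overlap.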
Let's look at some examples:
First, we could take a look at the number of daily rides with workweek / weekend days colored differently.
# Compute daily counts & plot
bike %>%
  count(start_day, weekday) %>%
  ggplot(aes(start_day, n, color = weekday)) +
  geom_point()
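The count() step is what turns one row per ride into one row per day. A minimal sketch on a made-up table (`toy` and its five rides are assumptions for illustration) shows the shape of the result that is then passed on to ggplot().

```r
library(dplyr)

# Hypothetical toy rides: one row per ride
toy <- tibble(
  start_day = as.Date(c("2017-05-01", "2017-05-01",
                        "2017-05-06", "2017-05-06", "2017-05-06")),
  weekday   = c("workweek", "workweek", "weekend", "weekend", "weekend")
)

# count() collapses the rides to one row per day, with the tally in column n
daily <- toy %>%
  count(start_day, weekday)
```

The new column `n` is the one mapped to the y-axis in the plots of this section.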
Now let’s look at how rides are distributed according to the time of day. Let’s make a summary plot of weekly ride counts faceted by start hour of day and broken down by workweek/weekend. Here, we will use facet_grid().
# Compute week_hod & plot
bike %>%
  count(start_wk, start_hod, weekday) %>%
  ggplot(aes(start_wk, n, color = weekday)) +
  geom_point() +
  facet_grid(~ start_hod) +
  scale_y_sqrt()
Expanding on the previous plot, let’s add one more variable into our summary, adding a facet dimension for whether or not the rider is a member of BIXI.
# Compute wk_memb_hod & plot
bike %>%
  count(start_wk, start_hod, weekday, membership) %>%
  ggplot(aes(start_wk, n, color = weekday)) +
  geom_point() +
  facet_grid(membership ~ start_hod) +
  scale_y_sqrt()
Let’s now look at the number of rides vs. hour for each day. To start, we’ll create a summary dataset for the first full month in the dataset (May) and look at it.
# Compute daily_may & plot
bike %>%
  filter(start_mon == 5) %>%
  count(start_day, start_hod, membership) %>%
  ggplot(aes(start_hod, n, color = membership)) +
  geom_point() +
  facet_wrap(~ start_day, ncol = 7)
