In this session, you will learn the basic grammar of data manipulation, some best-practice advices. Since data manipulation always follows a purpose and requires some understanding of the data at hand, we will also have a first glance data exploration and visualization. However, we will her only cover the very basics and skip most of the details. Again, you will have a dedicated sessions lateron.
In this session, you will learn:
From my experience, this covers ca. 80% of common data manipulation tasks. Sound like fun? Lets get started!
So, let’s zoom in the dplyr
functionality and syntax for
a moment, before we illustrate it at the following data-manipulation
tasks.
At it’s core are 5 central verbs of data/manipulation, which are
filter()
picks cases based on their values.select()
picks variables based on their names.arrange()
changes the ordering of the rows.mutate()
adds new variables that are functions of
existing variablessummarise()
reduces multiple values down to a single
summary.These all combine naturally with group_by()
which allows
you to perform any operation “by group”. You can learn more about them
in vignette("dplyr")
. As well as these single-table verbs,
dplyr also provides a variety of two-table verbs (joins and mutations),
which you can learn about in vignette("two-table")
.
We will explore now a bit with the starwars dataset, which comes
together with dplyr
. First, let’s get an intuition what’s
in it. head()
shows us the first 10 rows.
glimpse()
gives us an overview over all variable
types.
Rows: 87
Columns: 14
$ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", "Owen Lars", "Beru Whitesun lars", "R5-D4", "Biggs Darklighter", "Obi-Wan Kenobi"…
$ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 228, 180, 173, 175, 170, 180, 66, 170, 183, 200, 190, 177, 175, 180, 150, NA, 88, 160, 19…
$ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.0, 84.0, NA, 112.0, 80.0, 74.0, 1358.0, 77.0, 110.0, 17.0, 75.0, 78.2, 140.0, 113.0, 79.…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", NA, "black", "auburn, white", "blond", "auburn, grey", "brown", "brown", NA, NA, "brown",…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "light", "white, red", "light", "fair", "fair", "fair", "unknown", "fair", "green", "gree…
$ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue", "red", "brown", "blue-gray", "blue", "blue", "blue", "brown", "black", "orange", "hazel…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, 41.9, 64.0, 200.0, 29.0, 44.0, 600.0, 21.0, NA, 896.0, 82.0, 31.5, 15.0, 53.0, 31.0, 37.…
$ sex <chr> "male", "none", "none", "male", "female", "male", "female", "none", "male", "male", "male", "male", "male", "male", "male", "hermaphroditic", "male…
$ gender <chr> "masculine", "masculine", "masculine", "masculine", "feminine", "masculine", "feminine", "masculine", "masculine", "masculine", "masculine", "mascu…
$ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "Tatooine", "Tatooine", "Tatooine", "Tatooine", "Stewjon", "Tatooine", "Eriadu", "Kashyyyk…
$ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Human", "Droid", "Human", "Human", "Human", "Human", "Wookiee", "Human", "Rodian", "Hutt", "…
$ films <list> [<"The Empire Strikes Back", "Revenge of the Sith", "Return of the Jedi", "A New Hope", "The Force Awakens">, <"The Empire Strikes Back", "Attack …
$ vehicles <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imperial Speeder Bike", <>, <>, <>, <>, "Tribubble bongo", <"Zephyr-G swoop bike", "XJ-6 a…
$ starships <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1", <>, <>, <>, <>, "X-wing", <"Jedi starfighter", "Trade Federation cruiser", "Naboo star…
filter()
The verb filter()
lets you subset a dataframe by rows
(observations), meaning the output will filter for only rows which
fulfill a certain condition.
Conditions can be combined with boolean operators
(&, |
)
select()
The verb select()
lets you subset a dataframe by column
(variable), meaning the output will only contain certain columns in the
stated order
It can also be used to deselect columns with a leading
-
Lastly, many of dplyr
’s other conditions and helper
functions can be applied to select or deselect columns. More on that
later.
arrange()
The verb arrange()
defines the way the rows of your
dataframe are ordered
mutate()
The verb mutate()
lets you manipulate existing variables
or create new ones.
summarize()
The verb summarize()
reduces your dataset to one
observation, which is summarized according to a defined function.
By now, we worked with the traditional data structure, where every column corresponds to one variable. We also call that a **wide*“** data format. This is in most cases the most efficient and prefered way to keep your data. However, there are some cases, where another dataformat would be more convenient.
For example, for range of graphical as well as numerical summaries, it would be nice to have every observation and every variable to be mapped in a single row. This format is also calles long (or in R slang, tidy) data. Adittionally, some real life datasets (including a lot of WorldBank, Eurostats, OECD, ect.) data is usually available in a long format. Therefore, weare in need of easily switching between long and wide formats, to reshape your data (in Excel slang, people call that pivoting).
Lets take a look at a pivot_wider
example
And now pivot_longer
group_by
Finally, by_group()
is offers the perfect complement to
the 5 basic verbs, since it allows to perform aqll the above mentioned
tasks sepperate according to used-defined groupings in one or multiple
categories. Lets look at some examples:
First of all, it works amazingly well with the
summarize()
verb, producing within group summaries
Note that it is adviced to finish always such operations with
ungroup()
. Otherwise, the still active grouping might at a
later point lead to all kind of unforseen results.
Some little exercises in between: Please solve Exercise
1 in the corresponding section on github
It’s rare that a data analysis involves only a single table of data.
In practice, you’ll normally have many tables that contribute to an
analysis, and you need flexible tools to combine them. In
dplyr
, there are three families of verbs that work with two
tables at a time:
Among those, mutating joins are the by far most used ones, so we will
briefly discuss them here, and cover other types later in the applied
part. For questions right away, consult
vignette(two-"two-table")
. Mutating joins allow you to
combine variables from multiple tables (merging data). We here assume
our data is “tidy” (meaning: rows = observations, columns =
variables).
Lets briefly create some example data:
There are 4 types of mutating join to combine variables from
dataframe x
and `y, which differ in their behaviour when a
match is not found. We’ll illustrate each with a simple example:
left_join(x, y)
includes all observations in
x
, regardless of whether they match or not. This is the
most commonly used join because it ensures that you don’t lose
observations from your primary table.right_join(x, y)
includes all observations in
y
. It’s equivalent to left_join(y, x)
, but the
columns will be ordered differently.inner_join(x, y)
only includes observations that match
in both x
and y
.full_join()
includes all observations from
x
and y
.The left, right and full joins are collectively know as outer joins. When a row doesn’t match in an outer join, the new variables are filled in with missing values.
While mutating joins are primarily used to add new variables, they can also generate new observations. If a match is not unique, a join will add all possible combinations (the Cartesian product) of the matching observations:
Note: In case of multiple matches in y
, adittional rows
will be added. Therefore, if that is not the behavior you want, use the
distinct()
function afterwards.
By default, joins are performed on the bases of all
variables appearing in x
and y
. You can also
specify one or multiple variables that should be matched by using the
by
argument. This can also be used to match on variables
with different names in x
and y
.
Some little exercises in between: Please solve Exercise
2 in the corresponding section on github
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] magrittr_1.5 forcats_0.5.0 stringr_1.4.0 dplyr_1.0.2 purrr_0.3.4 readr_1.3.1 tidyr_1.1.2 tibble_3.0.3 ggplot2_3.3.2 tidyverse_1.3.0
[11] knitr_1.29
loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 cellranger_1.1.0 pillar_1.4.6 compiler_4.0.2 dbplyr_1.4.4 tools_4.0.2 jsonlite_1.7.1 lubridate_1.7.9 lifecycle_0.2.0
[10] gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.7 reprex_0.3.0 cli_2.0.2 DBI_1.1.0 rstudioapi_0.11 yaml_2.2.1 haven_2.3.1
[19] xfun_0.17 withr_2.2.0 xml2_1.3.2 httr_1.4.2 fs_1.5.0 generics_0.0.2 vctrs_0.3.4 hms_0.5.3 grid_4.0.2
[28] tidyselect_1.1.0 glue_1.4.2 R6_2.4.1 fansi_0.4.1 readxl_1.3.1 modelr_0.1.8 blob_1.2.1 backports_1.1.9 scales_1.1.1
[37] ellipsis_0.3.1 rvest_0.3.6 assertthat_0.2.1 colorspace_1.4-1 utf8_1.1.4 stringi_1.5.3 munsell_0.5.0 broom_0.7.0 crayon_1.3.4