Updated August 31, 2022

This session

In this session, you will learn the basic grammar of data manipulation. This includes:

  • How to do basic variable filtering, selection, and manipulation
  • How to create various types of data summarization
  • How to reshape (pivot) your data
  • How to also apply these actions on grouped data
  • How to join data from different sources

From my experience, this covers ca. 80% of common data manipulation tasks. Sound like fun? Lets get started!

Introduction

What to do with data?

Basic Operations of Data Manipulation

The 5(+2) core operations (SQL syntax)

At it’s core are 5 central operations of data manipulation:

  1. FILTER picks cases based on their values.
  2. SELECT picks variables based on their names.
  3. ARRANGE changes the ordering of the rows.
  4. MUTATE adds new variables that are functions of existing variables
  5. SUMMARIZE reduces multiple values down to a single summary.

They are often completed by:

  1. PIVOT and other changes of the dataframe’s shape
  2. GROUP_BY operations on group level

1: FILTER: Subsetting rows

FILTER oerations subset a dataframe by rows (observations), so that output only contains rows which fulfill a certain condition.

DF_CARS
speed dist
4 2
4 10
7 4
7 22
8 16
9 10
10 18
10 26
10 34
11 17
11 28
12 14
12 20
12 24
12 28
DF_CARS FILTER SPEED >= 5
speed dist
7 4
7 22
8 16
9 10
10 18
10 26
10 34
11 17
11 28
12 14
12 20
12 24
12 28
13 26
13 34

2: SELECT: Subsetting columns

SELECT oerations subset a dataframe by column (variable) so that output only contains certain columns in the stated order

DF_IRIS
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5.0 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3.0 1.4 0.1 setosa
4.3 3.0 1.1 0.1 setosa
5.8 4.0 1.2 0.2 setosa
DF_IRIS SELECT SEPAL_LENGHT SPECIES
Sepal.Length Species
5.1 setosa
4.9 setosa
4.7 setosa
4.6 setosa
5.0 setosa
5.4 setosa
4.6 setosa
5.0 setosa
4.4 setosa
4.9 setosa
5.4 setosa
4.8 setosa
4.8 setosa
4.3 setosa
5.8 setosa

3: ARRANGE: Reoder by column value

ARRANGE oerations reorder dataframe’s rows by selected column values

DF_CARS ARRANGE DESC SPEED
speed dist
25 85
24 70
24 92
24 93
24 120
23 54
22 66
20 32
20 48
20 52
20 56
20 64
19 36
19 46
19 68

4: MUTATE: Assign variable values

MUTATE oerations manipulate existing variables or create new ones.

DF_IRIS MUTATE SEPAL_RADIUS = SEPAL_LENGHT * 3.14
Sepal.Length Sepal.Width Sepal.Radius
5.1 3.5 16.014
4.9 3.0 15.386
4.7 3.2 14.758
4.6 3.1 14.444
5.0 3.6 15.700
5.4 3.9 16.956
4.6 3.4 14.444
5.0 3.4 15.700
4.4 2.9 13.816
4.9 3.1 15.386
5.4 3.7 16.956
4.8 3.4 15.072
4.8 3.0 15.072
4.3 3.0 13.502
5.8 4.0 18.212

5: SUMMARIZE: Summarizes column values

SUMMARIZE operations reduce the dataframe to one row, which is summarized according to a defined function.

DF_IRIS SUMMARIZE MEAN_LENGHT = MEAN SEPAL_LENGHT
##   Mean.Lenght
## 1    5.843333

(+) 1: PIVOT: Changing the data shape

  • Different forms of PIVOT are the most popular shape changing operations
  • Here, variable values…
    • distributed over several columns and one row (wide formats) are gathered into one column and multiple rows (pivot longer).
    • distributed over several rows and one column (long format) are spread into several columns and one row (pivot wider).
  • Changes between long & wide formats, which is often helpful to create tidy data, or prepare tidy data for different types of summaries & viualizations.
  • Often important when working with time series.

PIVOT LONGER

PIVOT_LONGER corresponds to a data frame’s shape conversion from wide to long [n,1] -> [1,n]

DF_IRIS 
Species Sepal.Length Sepal.Width
setosa 5.1 3.5
setosa 4.9 3.0
setosa 4.7 3.2
setosa 4.6 3.1
setosa 5.0 3.6
setosa 5.4 3.9
setosa 4.6 3.4
setosa 5.0 3.4
setosa 4.4 2.9
setosa 4.9 3.1
setosa 5.4 3.7
setosa 4.8 3.4
setosa 4.8 3.0
setosa 4.3 3.0
setosa 5.8 4.0
DF_IRIS PIVOT_LONGER -SPECIES
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5.0 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3.0 1.4 0.1 setosa
4.3 3.0 1.1 0.1 setosa
5.8 4.0 1.2 0.2 setosa

PIVOT WIDER

PIVOT_WIDER corresponds to a data frame’s shape conversion from long to wide [1,m] -> [m,1]

DF_IRIS_LONG
Species name value
setosa Sepal.Length 5.1
setosa Sepal.Width 3.5
setosa Sepal.Length 4.9
setosa Sepal.Width 3.0
setosa Sepal.Length 4.7
setosa Sepal.Width 3.2
setosa Sepal.Length 4.6
setosa Sepal.Width 3.1
setosa Sepal.Length 5.0
setosa Sepal.Width 3.6
setosa Sepal.Length 5.4
setosa Sepal.Width 3.9
setosa Sepal.Length 4.6
setosa Sepal.Width 3.4
setosa Sepal.Length 5.0
setosa Sepal.Width 3.4
setosa Sepal.Length 4.4
setosa Sepal.Width 2.9
setosa Sepal.Length 4.9
setosa Sepal.Width 3.1
setosa Sepal.Length 5.4
setosa Sepal.Width 3.7
setosa Sepal.Length 4.8
setosa Sepal.Width 3.4
setosa Sepal.Length 4.8
setosa Sepal.Width 3.0
setosa Sepal.Length 4.3
setosa Sepal.Width 3.0
setosa Sepal.Length 5.8
setosa Sepal.Width 4.0
setosa Sepal.Length 5.7
setosa Sepal.Width 4.4
setosa Sepal.Length 5.4
setosa Sepal.Width 3.9
setosa Sepal.Length 5.1
setosa Sepal.Width 3.5
setosa Sepal.Length 5.7
setosa Sepal.Width 3.8
setosa Sepal.Length 5.1
setosa Sepal.Width 3.8
setosa Sepal.Length 5.4
setosa Sepal.Width 3.4
setosa Sepal.Length 5.1
setosa Sepal.Width 3.7
setosa Sepal.Length 4.6
setosa Sepal.Width 3.6
setosa Sepal.Length 5.1
setosa Sepal.Width 3.3
setosa Sepal.Length 4.8
setosa Sepal.Width 3.4
setosa Sepal.Length 5.0
setosa Sepal.Width 3.0
setosa Sepal.Length 5.0
setosa Sepal.Width 3.4
setosa Sepal.Length 5.2
setosa Sepal.Width 3.5
setosa Sepal.Length 5.2
setosa Sepal.Width 3.4
setosa Sepal.Length 4.7
setosa Sepal.Width 3.2
setosa Sepal.Length 4.8
setosa Sepal.Width 3.1
setosa Sepal.Length 5.4
setosa Sepal.Width 3.4
setosa Sepal.Length 5.2
setosa Sepal.Width 4.1
setosa Sepal.Length 5.5
setosa Sepal.Width 4.2
setosa Sepal.Length 4.9
setosa Sepal.Width 3.1
setosa Sepal.Length 5.0
setosa Sepal.Width 3.2
setosa Sepal.Length 5.5
setosa Sepal.Width 3.5
setosa Sepal.Length 4.9
setosa Sepal.Width 3.6
setosa Sepal.Length 4.4
setosa Sepal.Width 3.0
setosa Sepal.Length 5.1
setosa Sepal.Width 3.4
setosa Sepal.Length 5.0
setosa Sepal.Width 3.5
setosa Sepal.Length 4.5
setosa Sepal.Width 2.3
setosa Sepal.Length 4.4
setosa Sepal.Width 3.2
setosa Sepal.Length 5.0
setosa Sepal.Width 3.5
setosa Sepal.Length 5.1
setosa Sepal.Width 3.8
setosa Sepal.Length 4.8
setosa Sepal.Width 3.0
setosa Sepal.Length 5.1
setosa Sepal.Width 3.8
setosa Sepal.Length 4.6
setosa Sepal.Width 3.2
setosa Sepal.Length 5.3
setosa Sepal.Width 3.7
setosa Sepal.Length 5.0
setosa Sepal.Width 3.3
versicolor Sepal.Length 7.0
versicolor Sepal.Width 3.2
versicolor Sepal.Length 6.4
versicolor Sepal.Width 3.2
versicolor Sepal.Length 6.9
versicolor Sepal.Width 3.1
versicolor Sepal.Length 5.5
versicolor Sepal.Width 2.3
versicolor Sepal.Length 6.5
versicolor Sepal.Width 2.8
versicolor Sepal.Length 5.7
versicolor Sepal.Width 2.8
versicolor Sepal.Length 6.3
versicolor Sepal.Width 3.3
versicolor Sepal.Length 4.9
versicolor Sepal.Width 2.4
versicolor Sepal.Length 6.6
versicolor Sepal.Width 2.9
versicolor Sepal.Length 5.2
versicolor Sepal.Width 2.7
versicolor Sepal.Length 5.0
versicolor Sepal.Width 2.0
versicolor Sepal.Length 5.9
versicolor Sepal.Width 3.0
versicolor Sepal.Length 6.0
versicolor Sepal.Width 2.2
versicolor Sepal.Length 6.1
versicolor Sepal.Width 2.9
versicolor Sepal.Length 5.6
versicolor Sepal.Width 2.9
versicolor Sepal.Length 6.7
versicolor Sepal.Width 3.1
versicolor Sepal.Length 5.6
versicolor Sepal.Width 3.0
versicolor Sepal.Length 5.8
versicolor Sepal.Width 2.7
versicolor Sepal.Length 6.2
versicolor Sepal.Width 2.2
versicolor Sepal.Length 5.6
versicolor Sepal.Width 2.5
versicolor Sepal.Length 5.9
versicolor Sepal.Width 3.2
versicolor Sepal.Length 6.1
versicolor Sepal.Width 2.8
versicolor Sepal.Length 6.3
versicolor Sepal.Width 2.5
versicolor Sepal.Length 6.1
versicolor Sepal.Width 2.8
versicolor Sepal.Length 6.4
versicolor Sepal.Width 2.9
versicolor Sepal.Length 6.6
versicolor Sepal.Width 3.0
versicolor Sepal.Length 6.8
versicolor Sepal.Width 2.8
versicolor Sepal.Length 6.7
versicolor Sepal.Width 3.0
versicolor Sepal.Length 6.0
versicolor Sepal.Width 2.9
versicolor Sepal.Length 5.7
versicolor Sepal.Width 2.6
versicolor Sepal.Length 5.5
versicolor Sepal.Width 2.4
versicolor Sepal.Length 5.5
versicolor Sepal.Width 2.4
versicolor Sepal.Length 5.8
versicolor Sepal.Width 2.7
versicolor Sepal.Length 6.0
versicolor Sepal.Width 2.7
versicolor Sepal.Length 5.4
versicolor Sepal.Width 3.0
versicolor Sepal.Length 6.0
versicolor Sepal.Width 3.4
versicolor Sepal.Length 6.7
versicolor Sepal.Width 3.1
versicolor Sepal.Length 6.3
versicolor Sepal.Width 2.3
versicolor Sepal.Length 5.6
versicolor Sepal.Width 3.0
versicolor Sepal.Length 5.5
versicolor Sepal.Width 2.5
versicolor Sepal.Length 5.5
versicolor Sepal.Width 2.6
versicolor Sepal.Length 6.1
versicolor Sepal.Width 3.0
versicolor Sepal.Length 5.8
versicolor Sepal.Width 2.6
versicolor Sepal.Length 5.0
versicolor Sepal.Width 2.3
versicolor Sepal.Length 5.6
versicolor Sepal.Width 2.7
versicolor Sepal.Length 5.7
versicolor Sepal.Width 3.0
versicolor Sepal.Length 5.7
versicolor Sepal.Width 2.9
versicolor Sepal.Length 6.2
versicolor Sepal.Width 2.9
versicolor Sepal.Length 5.1
versicolor Sepal.Width 2.5
versicolor Sepal.Length 5.7
versicolor Sepal.Width 2.8
virginica Sepal.Length 6.3
virginica Sepal.Width 3.3
virginica Sepal.Length 5.8
virginica Sepal.Width 2.7
virginica Sepal.Length 7.1
virginica Sepal.Width 3.0
virginica Sepal.Length 6.3
virginica Sepal.Width 2.9
virginica Sepal.Length 6.5
virginica Sepal.Width 3.0
virginica Sepal.Length 7.6
virginica Sepal.Width 3.0
virginica Sepal.Length 4.9
virginica Sepal.Width 2.5
virginica Sepal.Length 7.3
virginica Sepal.Width 2.9
virginica Sepal.Length 6.7
virginica Sepal.Width 2.5
virginica Sepal.Length 7.2
virginica Sepal.Width 3.6
virginica Sepal.Length 6.5
virginica Sepal.Width 3.2
virginica Sepal.Length 6.4
virginica Sepal.Width 2.7
virginica Sepal.Length 6.8
virginica Sepal.Width 3.0
virginica Sepal.Length 5.7
virginica Sepal.Width 2.5
virginica Sepal.Length 5.8
virginica Sepal.Width 2.8
virginica Sepal.Length 6.4
virginica Sepal.Width 3.2
virginica Sepal.Length 6.5
virginica Sepal.Width 3.0
virginica Sepal.Length 7.7
virginica Sepal.Width 3.8
virginica Sepal.Length 7.7
virginica Sepal.Width 2.6
virginica Sepal.Length 6.0
virginica Sepal.Width 2.2
virginica Sepal.Length 6.9
virginica Sepal.Width 3.2
virginica Sepal.Length 5.6
virginica Sepal.Width 2.8
virginica Sepal.Length 7.7
virginica Sepal.Width 2.8
virginica Sepal.Length 6.3
virginica Sepal.Width 2.7
virginica Sepal.Length 6.7
virginica Sepal.Width 3.3
virginica Sepal.Length 7.2
virginica Sepal.Width 3.2
virginica Sepal.Length 6.2
virginica Sepal.Width 2.8
virginica Sepal.Length 6.1
virginica Sepal.Width 3.0
virginica Sepal.Length 6.4
virginica Sepal.Width 2.8
virginica Sepal.Length 7.2
virginica Sepal.Width 3.0
virginica Sepal.Length 7.4
virginica Sepal.Width 2.8
virginica Sepal.Length 7.9
virginica Sepal.Width 3.8
virginica Sepal.Length 6.4
virginica Sepal.Width 2.8
virginica Sepal.Length 6.3
virginica Sepal.Width 2.8
virginica Sepal.Length 6.1
virginica Sepal.Width 2.6
virginica Sepal.Length 7.7
virginica Sepal.Width 3.0
virginica Sepal.Length 6.3
virginica Sepal.Width 3.4
virginica Sepal.Length 6.4
virginica Sepal.Width 3.1
virginica Sepal.Length 6.0
virginica Sepal.Width 3.0
virginica Sepal.Length 6.9
virginica Sepal.Width 3.1
virginica Sepal.Length 6.7
virginica Sepal.Width 3.1
virginica Sepal.Length 6.9
virginica Sepal.Width 3.1
virginica Sepal.Length 5.8
virginica Sepal.Width 2.7
virginica Sepal.Length 6.8
virginica Sepal.Width 3.2
virginica Sepal.Length 6.7
virginica Sepal.Width 3.3
virginica Sepal.Length 6.7
virginica Sepal.Width 3.0
virginica Sepal.Length 6.3
virginica Sepal.Width 2.5
virginica Sepal.Length 6.5
virginica Sepal.Width 3.0
virginica Sepal.Length 6.2
virginica Sepal.Width 3.4
virginica Sepal.Length 5.9
virginica Sepal.Width 3.0
DF_IRIS_LONG PIVOT_WIDER 
                NAMES_FROM SPECIES
                VALUS_FROM SEPAL_LENGHT
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5.0 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3.0 1.4 0.1 setosa
4.3 3.0 1.1 0.1 setosa
5.8 4.0 1.2 0.2 setosa

7: GROUP_BY: Grouped operations

GROUP_BY executes further operations separately for every unique group. Often (but not exclusively) followed by SUMMARIZE

DF_IRIS GROUP_BY SPECIES SUMMARIZE MEAN_LENGHT = MEAN SEPAL_LENGHT
Species Mean.Lenght
setosa 5.006
versicolor 5.936
virginica 6.588

Multi-Table Operations

Introduction

  • Multi-table operations draw their input data from more than one dataframe.
  • This is important when data is assembled from different sources.
  • Also necessary when working with relational data and/or stored in an relational database (eg. SQL)
  • Most common operation carried out across dataframes are the family of joins

Introduction to joins

  • Join operations combine data from multiple dataframes
    • Mutating joins, which add new variables to one data frame from matching observations in another.
    • Filtering joins, which filter observations from one data frame based on whether or not they match an observation in the other table.
  • Keys: Variable(-a) used to connect each pair of tables and uniquely identifies an observation Primary Key: Identifies observation in its own table. Foreign Key: Identifies observations in another table

Mutating Joins

  • Combine information from different dataframes to combine or alter existing data
    • INNER_JOIN(A,B) Combines variables from A and B for observations in A and B jointly
    • LEFT_JOIN(A,B): Combines A with variables from B for observations present in A
    • RIGHT_JOIN(A,B): Combines B with variables from A for observations present in B
    • FULL_JOIN(A,B) Combines variables from A and B for observations in A or B

Inner Join

  • Inner joins find observations present in both datarames
  • Thereby, they combine columns, while subletting rows for (A AND B)
  • Appropriate in cases only observations containing the joint information are valuable for the analysis.

Outer

Summary

What did we cover today?

Today, I guess the main take-aways are:

  • Ca. 80% of data operations in structured data can be broken down into a handful of operations.
  • Singe-table operations alter, filter, select, or change the shape of a dataframe.
  • Multi-Table operations combine information from different datarames.
  • Mutting joings can be distinguished in inner and outer joins