VDJ 2022: 2. Data Wrangling with dplyr

By Yuka Takemon in R workshop

September 18, 2022

Date

September 18, 2022

Time

12:00 AM

Workshop info

When: October 1st, 9am (PST, Vancouver, BC)
Where: Northeastern University
Requirements: Participants must have a laptop or desktop with a Mac, Linux, or Windows operating system. (Tablets and Chromebooks are not advised.) Please have the latest version of R and RStudio downloaded and running (free!).
Code of conduct: Everyone participating in the Vancouver DataJam activities are required to conform to the Code of Conduct

These materials have been adapted from the Software Carpentry: R Novice Lesson. You can find the original materials here

Objectives

This lesson will cover some basic functions that can be used to manipulate and analyze data in R.

There are eight main functions we’ll be talking about today, each allowing us to manipulate data frames. These eight functions are:

glimpse() – Take a peak at the structure and the first few row of the data
select() – Choose columns (variables or attributes) from our data frame
%>% – take the results from the left and push it to the right
filter() – Choose rows (samples or observations) from our data frame
mutate() – Create new columns
group_by() – Group rows based on a particular value within that column
summarize() – Perform some function on the grouped data
left_join() – Combine two tables based on a shared column

Here we go!

Illustration by Allison Horst

Setup

If you haven’t already, make sure you have tidyverse and gapminder installed and loaded with the following commands:

# Download the packages in to our personal library
install.packages(c("tidyverse", "gapminder"))

# Load the packages for use
library(tidyverse)
library(gapminder)

We’re going to use the gapminder data set to apply our knowledge of functions. Before we use their data, let’s quickly learn about the data set:

Gapminder Foundation is a non-profit venture registered in Stockholm, Sweden, that promotes sustainable global development and achievement of the United Nations Millennium Development Goals by increased use and understanding of statistics and other information about social, economic and environmental development at local, national and global levels. [1]

glimpse()

Let’s take a quick look at our data frame to remind ourselves of its structure. We do this using the glimpse() command, which will display the structure and the first few rows of our data frame.

glimpse(gapminder)

Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

What do you observe?

Number of rows and columns
Name of columns
Content of columns
Type of data in the columns

Quick side note on tibbles and data frames

In R, one of the main types of objects/variables we’re going to be working with is a data frame. This is much like a table you would view in Excel, where column represent variables or measures and rows represent measurements, samples, or observations.

Quick side note on loading and saving data

We cheated by loading our data using the gapminder packages. Normally we would need to read in and write out data.

# Write data out as comma separated file
write_csv(gapminder, "~/my_project/data/gapminder.csv")

# Read .csv file in
write_csv(gapminder, "~/my_project/data/gapminder.csv")

# other formats include .tsv or any defined separator

Choose Columns: select

The first function we’ll be using is select(). This function let’s us pick columns from our data frame, based on name (e.g. year) or by index (e.g. 3).

Let’s try using select() to pick out a few columns: “country”, “year”, “lifeExp”, and “pop”. We’ll be assigning these columns to a new data frame,gapminder_select. Then we’ll use glimpse() to see if it worked.

# select() code here
gapminder_select <- select(gapminder, country, year, lifeExp, pop)

# Check the data frame
glimpse(gapminder_select)

Rows: 1,704
Columns: 4
$ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "A…
$ year    <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 20…
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.822…
$ pop     <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 1288…

As you can see, our new data frame contains only a subset of the columns from the original data frame, based on the names we provided in the select() command.

Pipe: %>%

The pipe symbol %>% sends or “pipes” an object (e.g. a data frame like gapminder) INTO a function (e.g. select()).

So, the above select() command can be rewritten as follows (NOTE: the “.” is a placeholder, which represents the object being piped). Again, we can check our result using head().

# select() using pipe syntax
gapminder_pipe <- gapminder %>% 
  select(., country, year, lifeExp, pop)

glimpse(gapminder_pipe)

Rows: 1,704
Columns: 4
$ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "A…
$ year    <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 20…
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.822…
$ pop     <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 1288…

We can actually simplify the above command further - dplyr’s functions such as select() are smart enough that you don’t actually need to include the “.” placeholder, as shown below.

# select() using pipe syntax w/out a placeholder
gapminder_pipe2 <- gapminder %>% 
  select(country, year, lifeExp, pop)

head(gapminder_pipe2)

# A tibble: 6 × 4
  country      year lifeExp      pop
  <fct>       <int>   <dbl>    <int>
1 Afghanistan  1952    28.8  8425333
2 Afghanistan  1957    30.3  9240934
3 Afghanistan  1962    32.0 10267083
4 Afghanistan  1967    34.0 11537966
5 Afghanistan  1972    36.1 13079460
6 Afghanistan  1977    38.4 14880372

Challenge 1: (5 mins)

Reorder the code below to create a code that uses select() command and pipe (%>%) notation to pick the columns continent, GDP per capita, life expectancy, and year from the gapminder data frame. Then assign them to a new variable result, and display the results using glimpse().

glimpse(result)
result <-
select(continent, gdpPercap, lifeExp, year)
%>%
gapminder

Choose Rows: filter

So we’ve covered selecting columns, but what about rows? This is where filter() comes in. This function allows us to choose rows from our data frame using some logical criteria. An example is filtering for rows in which the country is Canada. This can also be applied to numerical values, such as the year being equal to 1967, or life expectancy greater than 30.

NOTE: In R, equality (e.g. country is Canada, year is 1967) is done using a double equals sign (==).

Let’s go through a couple examples.

# Filter rows where country is Canada
gapminder_canada <- gapminder %>% 
  filter(country == "Canada")

head(gapminder_canada)

# A tibble: 6 × 6
  country continent  year lifeExp      pop gdpPercap
  <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
1 Canada  Americas   1952    68.8 14785584    11367.
2 Canada  Americas   1957    70.0 17010154    12490.
3 Canada  Americas   1962    71.3 18985849    13462.
4 Canada  Americas   1967    72.1 20819767    16077.
5 Canada  Americas   1972    72.9 22284500    18971.
6 Canada  Americas   1977    74.2 23796400    22091.

Let’s try another one, this time filtering on life expectancy above a certain threshold:

# Filter for rows where life expectancy is greater than 50
gapminder_LE <- gapminder %>% 
  filter(lifeExp > 50)

head(gapminder_LE)

# A tibble: 6 × 6
  country continent  year lifeExp     pop gdpPercap
  <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
1 Albania Europe     1952    55.2 1282697     1601.
2 Albania Europe     1957    59.3 1476505     1942.
3 Albania Europe     1962    64.8 1728137     2313.
4 Albania Europe     1967    66.2 1984060     2760.
5 Albania Europe     1972    67.7 2263554     3313.
6 Albania Europe     1977    68.9 2509048     3533.

We can also filter with multiple arguments, each separated by a comma:

# filter() for Canada and life expectancy greater than 80
gapminder_C_LE <- gapminder %>% 
  filter(country == "Canada", lifeExp > 80)

head(gapminder_C_LE)

# A tibble: 1 × 6
  country continent  year lifeExp      pop gdpPercap
  <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
1 Canada  Americas   2007    80.7 33390141    36319.

Challenge 2: (5 mins)

Choose the code(s) the will produce the correct answer for the task below:

Use filter() to choose data for all African countries, from the year 1982 and onwards and assign them to a variable called gapminder_africa. Hint: >= is used for the greater than or equal to symbol.

gapminder_africa <- filter(., continent == "Africa", year >= 1982)

gapminder_africa <- gapminder %>% 
  filter(continent == "Africa", year >= 1982)

gapminder_africa <- filter(gapminder, continent == "Africa") %>%
  filter(year > 1982)

gapminder_africa <- gapminder %>% 
  filter(continent == "Africa" & year >= 1982)

Can you explain why some of them didn’t work?

Create New Columns: mutate()

Let’s say we now want to calculate the total GDP, which is done by mutiplying the GDP per capita by the population. mutate() will perform this calculation on each row in the data frame, one row at a time (i.e. row-wise).

# Use mutate() to calculate GDP in billions
gapminder_totalgdp <- gapminder %>% 
  mutate(gdp_total = gdpPercap * pop)

head(gapminder_totalgdp)

# A tibble: 6 × 7
  country     continent  year lifeExp      pop gdpPercap    gdp_total
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>        <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.  6567086330.
2 Afghanistan Asia       1957    30.3  9240934      821.  7585448670.
3 Afghanistan Asia       1962    32.0 10267083      853.  8758855797.
4 Afghanistan Asia       1967    34.0 11537966      836.  9648014150.
5 Afghanistan Asia       1972    36.1 13079460      740.  9678553274.
6 Afghanistan Asia       1977    38.4 14880372      786. 11697659231.

Combine Functions with Pipes

We’ve seen that pipes ( %>% ) can be used to send an object such as a data frame into a function, such as select(), or filter(). But they can also be used to send the output of one function into another function. This allows us to chain together multiple commmands, without the need for intermediate variables.

Let’s take a look at this in an example.

# select() the five columns, and filter() for Canada
gapminder_multi <- gapminder %>%
    select(country, year, lifeExp, pop, gdpPercap) %>%
    filter(country == "Canada")

head(gapminder_multi)

# A tibble: 6 × 5
  country  year lifeExp      pop gdpPercap
  <fct>   <int>   <dbl>    <int>     <dbl>
1 Canada   1952    68.8 14785584    11367.
2 Canada   1957    70.0 17010154    12490.
3 Canada   1962    71.3 18985849    13462.
4 Canada   1967    72.1 20819767    16077.
5 Canada   1972    72.9 22284500    18971.
6 Canada   1977    74.2 23796400    22091.

We can further expand on this by incorporating our mutate() command from earlier, linking multiple functions into a single command.

Challenge 3: (5 mins)

select() all columns except for continent and filter() to show only data from Canada. Then calculate the total gdp for each row and place it into a new column named gdp_total.

Hint: Instead of selecting the columns you want, you can also subtract the columns you don’t want with -. eg -year

Calculations with group_by and summarise

These functions allow us to work on our data in specific groups. For example, we can use group_by() to group observations by country, then calculate the maximum, minimum, and average life expectancy for each country.

# group_by() country, calculate average life expectancy
gapminder %>%
group_by(country) %>%
summarise(LifeExp_ang = max(lifeExp),
          LifeExp_ang = min(lifeExp),
          LifeExp_ang = mean(lifeExp))

# A tibble: 142 × 2
   country     LifeExp_ang
   <fct>             <dbl>
 1 Afghanistan        37.5
 2 Albania            68.4
 3 Algeria            59.0
 4 Angola             37.9
 5 Argentina          69.1
 6 Australia          74.7
 7 Austria            73.1
 8 Bahrain            65.6
 9 Bangladesh         49.8
10 Belgium            73.6
# … with 132 more rows

Extra: group_by() can be combined with tally()to count all the rows corresponding to that group.
gapminder %>% group_by(continent, year) %>% tally()

Combining tables with _join()

Let’s say you have two data frames that you want to combine. Both data frames contain a column with unique identifiers, but each data frame may contain different columns of information. That’s where join functions come in!

There are a variety of ways data can be joined. Here are a few common ones you may come across:

left_join()

The left_join() function takes one data frame on the “left” and using a specified column, looks for matching entries on the “right”. Note that the output data frame will contain all rows and columns from the left dataframe, as well as all columns from right, but only matching rows from.

First let’s load some simple example data to play with:

band_members

# A tibble: 3 × 2
  name  band   
  <chr> <chr>  
1 Mick  Stones 
2 John  Beatles
3 Paul  Beatles

band_instruments

# A tibble: 3 × 2
  name  plays 
  <chr> <chr> 
1 John  guitar
2 Paul  bass  
3 Keith guitar

Now we can use left_join() to combine the two tables, based on matching values in a specified column. The syntax is as follows:

left_join(band_members, band_instruments)

# A tibble: 3 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass

right_join()

right_join() works exactly the same with and you can also specify which column you wish to join by to get the same results

right_join(band_members, band_instruments, by = "name")

# A tibble: 3 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass  
3 Keith <NA>    guitar

Extra: Note that you can have different column names in each of your data frames, and still join the tables together. The syntax for this is:
left_join(x, y, by = c("columnX" = "columnY"))

Challenge 4: Tying it all together (10 mins)

Now let’s use all the commands we’ve covered and combine them with pipes into a single statement.

Calculate the mean and standard deviation (sd()) of the total GDP for all counties in Asia from 1990 and onwards.

Challenge answers

Challenge 1 : B, E, D, C, A

# Here is the answer
result <- gapminder %>% 
  select(continent, gdpPercap, lifeExp, year)

glimpse(result)

Challenge 2 : B & D

Challenge 3 :

gapminder %>%
    select(-continent) %>%
    filter(country == "Canada") %>%
    mutate(gdp_total = gdpPercap * pop)

Challenge 4 :

gapminder %>%
    filter(continent == "Asia", year >= 1990) %>%
    mutate(totalGDP = gdpPercap * pop) %>%
    group_by(country) %>%
    summarise(totalGDP_mean = mean(totalGDP), 
              totalGDP_sd = sd(totalGDP))

# A tibble: 33 × 3
   country          totalGDP_mean totalGDP_sd
   <fct>                    <dbl>       <dbl>
 1 Afghanistan            1.85e10     8.94e 9
 2 Bahrain                1.47e10     4.81e 9
 3 Bangladesh             1.45e11     4.94e10
 4 Cambodia               1.28e10     7.82e 9
 5 China                  3.82e12     2.00e12
 6 Hong Kong, China       2.03e11     5.57e10
 7 India                  1.74e12     7.33e11
 8 Indonesia              6.15e11     1.43e11
 9 Iran                   5.96e11     1.58e11
10 Iraq                   8.98e10     2.91e10
# … with 23 more rows

Posted on:: September 18, 2022

Length:: 12 minute read, 2370 words

Categories:: R workshop

See Also: