Content

  • Pipe syntax
  • Data manipulation:
    • Choosing variables and rows in the data (select and filter)
    • Renaming and changing variables, as well as calculating new variables (rename and mutate)
    • Combining, joining and merging data

Packages in this session

Remember: A package is a collection of functions and data. We load a package with the function library, which gives us access to the contents of the package.

library(nycflights13)
## Warning: package 'nycflights13' was built under R version
## 4.4.2
library(dplyr)
# or:
library(tidyverse)

dplyr is the large data-manipulation package, which contains functions for manipulating data by choosing rows and columns, create new variables, sort data, combine and merge datasets and summarising data.

Overview of data

When working with new data, it is a good idea to get an overview of the dataset

glimpse(flights)
## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013,…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, …
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA",…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 5…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", …
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", …
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, …
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1…
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6,…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, …
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05…

About pipes

What a pipe does

select is a function in the dplyr package, which selects variables in a dataset. It can be used like this:

select(flights, carrier, dest)

(first argument is your input data, the following arguments are the ones you want to select)

Alternatively, pipe syntax can be used:

flights |> select(carrier, dest)

The pipe |> pulls everything to the left of it through the pipe and inserts it into the first argument of the function on the right-hand side. You can think of it as …and then …

What a pipe does

flights
## # A tibble: 336,776 × 19
##    year month   day dep_time sched_dep_time dep_delay
##   <int> <int> <int>    <int>          <int>     <dbl>
## 1  2013     1     1      517            515         2
## 2  2013     1     1      533            529         4
## 3  2013     1     1      542            540         2
## 4  2013     1     1      544            545        -1
## # ℹ 336,772 more rows
## # ℹ 13 more variables: arr_time <int>,
## #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

What a pipe does

flights |> select(carrier, dest)
## # A tibble: 336,776 × 2
##   carrier dest 
##   <chr>   <chr>
## 1 UA      IAH  
## 2 UA      IAH  
## 3 AA      MIA  
## 4 B6      BQN  
## # ℹ 336,772 more rows

Why use a pipe?

Pipes lets us chain together commands in a way that is more readable.

Without pipes

new_data <- select(flights, carrier, arr_delay, distance)
new_data <- filter(new_data, distance > 1000)
new_data <- mutate(new_data, arr_delay = if_else(arr_delay > 0, "YES", "NO"))

With pipes

new_data <- flights |> 
  select(carrier, arr_delay, distance) |> 
  filter(distance > 1000) |> 
  mutate(arr_delay = if_else(arr_delay > 0, "YES", "NO"))

Other pipes

  • The pipe |> is the native pipe in R from version 4.1 onwards
  • Can be produced by using Ctrl + Shift + m
  • Before the native pipe, the %>% pipe from the magrittr package was used. It does practically the same thing.
    • If you come across code containing %>%, you can usually either run it as is or replace it with |>
  • magrittr contains a number of other pipes for specific uses that we won’t touch upon here

Pipe settings

To be able to use Ctrl+Shift+m for producing |>, we need to tick the box ‘Use native pipe operator’ in Tools \(\rightarrow\) Global options \(\rightarrow\) Code

Datamanipulation with dplyr

Selecting columns in your dataset

select

flights |> 
  select(carrier, dest)
## # A tibble: 336,776 × 2
##   carrier dest 
##   <chr>   <chr>
## 1 UA      IAH  
## 2 UA      IAH  
## 3 AA      MIA  
## 4 B6      BQN  
## # ℹ 336,772 more rows

Selecting columns in your dataset

deselect / drop

flights |> 
  select(-month, -day)
## # A tibble: 336,776 × 17
##    year dep_time sched_dep_time dep_delay arr_time
##   <int>    <int>          <int>     <dbl>    <int>
## 1  2013      517            515         2      830
## 2  2013      533            529         4      850
## 3  2013      542            540         2      923
## 4  2013      544            545        -1     1004
## # ℹ 336,772 more rows
## # ℹ 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Selecting columns in your dataset

  • Helper functions in select
    • starts_with(), ends_with(), contains(), matches()

Renaming columns

flights |> 
  rename(departure = dep_time)
## # A tibble: 336,776 × 19
##    year month   day departure sched_dep_time dep_delay
##   <int> <int> <int>     <int>          <int>     <dbl>
## 1  2013     1     1       517            515         2
## 2  2013     1     1       533            529         4
## 3  2013     1     1       542            540         2
## 4  2013     1     1       544            545        -1
## # ℹ 336,772 more rows
## # ℹ 13 more variables: arr_time <int>,
## #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Renaming columns

‘rename_with’ can be used for renaming all or a selection of columns at the same time with a function

flights |> rename_with(toupper, everything()) 
## # A tibble: 336,776 × 19
##    YEAR MONTH   DAY DEP_TIME SCHED_DEP_TIME DEP_DELAY
##   <int> <int> <int>    <int>          <int>     <dbl>
## 1  2013     1     1      517            515         2
## 2  2013     1     1      533            529         4
## 3  2013     1     1      542            540         2
## 4  2013     1     1      544            545        -1
## # ℹ 336,772 more rows
## # ℹ 13 more variables: ARR_TIME <int>,
## #   SCHED_ARR_TIME <int>, ARR_DELAY <dbl>, CARRIER <chr>,
## #   FLIGHT <int>, TAILNUM <chr>, ORIGIN <chr>, DEST <chr>,
## #   AIR_TIME <dbl>, DISTANCE <dbl>, HOUR <dbl>,
## #   MINUTE <dbl>, TIME_HOUR <dttm>

Renaming columns

‘rename_with’ can be used for renaming all or a selection of columns at the same time with a function

flights |> rename_with(toupper, contains("r")) 
## # A tibble: 336,776 × 19
##    YEAR month   day dep_time sched_dep_time dep_delay
##   <int> <int> <int>    <int>          <int>     <dbl>
## 1  2013     1     1      517            515         2
## 2  2013     1     1      533            529         4
## 3  2013     1     1      542            540         2
## 4  2013     1     1      544            545        -1
## # ℹ 336,772 more rows
## # ℹ 13 more variables: ARR_TIME <int>,
## #   SCHED_ARR_TIME <int>, ARR_DELAY <dbl>, CARRIER <chr>,
## #   flight <int>, tailnum <chr>, ORIGIN <chr>, dest <chr>,
## #   AIR_TIME <dbl>, distance <dbl>, HOUR <dbl>,
## #   minute <dbl>, TIME_HOUR <dttm>

Selecting rows

filter

flights |> 
  filter(distance > 1000) 
## # A tibble: 147,105 × 19
##    year month   day dep_time sched_dep_time dep_delay
##   <int> <int> <int>    <int>          <int>     <dbl>
## 1  2013     1     1      517            515         2
## 2  2013     1     1      533            529         4
## 3  2013     1     1      542            540         2
## 4  2013     1     1      544            545        -1
## # ℹ 147,101 more rows
## # ℹ 13 more variables: arr_time <int>,
## #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Selecting rows

Use & (and) or | (or) to filter the data on multiple conditions

flights |> 
  filter(distance > 1000 & month == 12)
## # A tibble: 12,946 × 19
##    year month   day dep_time sched_dep_time dep_delay
##   <int> <int> <int>    <int>          <int>     <dbl>
## 1  2013    12     1       13           2359        14
## 2  2013    12     1       17           2359        18
## 3  2013    12     1      520            515         5
## 4  2013    12     1      536            540        -4
## # ℹ 12,942 more rows
## # ℹ 13 more variables: arr_time <int>,
## #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Selecting rows

Helper functions for filter

  • if_all() and if_any()
flights |> 
  filter(if_all(contains("delay"), ~.x > 0))
## # A tibble: 92,303 × 19
##    year month   day dep_time sched_dep_time dep_delay
##   <int> <int> <int>    <int>          <int>     <dbl>
## 1  2013     1     1      517            515         2
## 2  2013     1     1      533            529         4
## 3  2013     1     1      542            540         2
## 4  2013     1     1      608            600         8
## # ℹ 92,299 more rows
## # ℹ 13 more variables: arr_time <int>,
## #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Creating new variables

mutate

flights |> mutate(arr_delay = if_else(arr_delay > 0, "YES", "NO"), 
                  distance_km = 1.61 * distance,
                  .keep = "used")
## # A tibble: 336,776 × 3
##   arr_delay distance distance_km
##   <chr>        <dbl>       <dbl>
## 1 YES           1400       2254 
## 2 YES           1416       2280.
## 3 YES           1089       1753.
## 4 NO            1576       2537.
## # ℹ 336,772 more rows

Mutate: if_else and case_when

flights |> 
  mutate(arr_delay = if_else(arr_delay > 0, "YES", "NO"),
         flight_type = case_when(distance < 500 ~ "Short flight",
                                 air_time > 180 ~ "Long flight",
                                 TRUE ~ "Medium flight"),
         .keep = "used")
## # A tibble: 336,776 × 4
##   arr_delay air_time distance flight_type  
##   <chr>        <dbl>    <dbl> <chr>        
## 1 YES            227     1400 Long flight  
## 2 YES            227     1416 Long flight  
## 3 YES            160     1089 Medium flight
## 4 NO             183     1576 Long flight  
## # ℹ 336,772 more rows

Exercises D1

Combining data

bind_rows()

bind_rows() attaches tables after each other - for example datasets from different years:

bind_rows(data1, data2)

There is also a bind_cols() function, but it is generally bad practice to use that - rows might not be sorted in the same way.

Joins

When tables are combined by one or more ID variables, it is called a join

The following types of joins are available:

  • Mutating joins
    • left_join()
    • inner_join()
    • full_join()
    • right_join() (Less used)
  • Filtering joins
    • semi_join()
    • anti_join()

“Mutating” joins

“Filtering” joins

Example join

flights2
## # A tibble: 336,776 × 6
##    year month   day  hour tailnum carrier
##   <int> <int> <int> <dbl> <chr>   <chr>  
## 1  2013     1     1     5 N14228  UA     
## 2  2013     1     1     5 N24211  UA     
## 3  2013     1     1     5 N619AA  AA     
## # ℹ 336,773 more rows
airlines
## # A tibble: 16 × 2
##   carrier name                  
##   <chr>   <chr>                 
## 1 9E      Endeavor Air Inc.     
## 2 AA      American Airlines Inc.
## 3 AS      Alaska Airlines Inc.  
## # ℹ 13 more rows

left_join

Joining the airlines with the flights2 dataset based on the ‘carrier’ column.

flights2 |> 
  left_join(airlines, by = "carrier")
## # A tibble: 336,776 × 7
##    year month   day  hour tailnum carrier name              
##   <int> <int> <int> <dbl> <chr>   <chr>   <chr>             
## 1  2013     1     1     5 N14228  UA      United Air Lines …
## 2  2013     1     1     5 N24211  UA      United Air Lines …
## 3  2013     1     1     5 N619AA  AA      American Airlines…
## # ℹ 336,773 more rows

Another example

planes
## # A tibble: 3,322 × 9
##   tailnum  year type  manufacturer model engines seats speed
##   <chr>   <int> <chr> <chr>        <chr>   <int> <int> <int>
## 1 N10156   2004 Fixe… EMBRAER      EMB-…       2    55    NA
## 2 N102UW   1998 Fixe… AIRBUS INDU… A320…       2   182    NA
## 3 N103US   1999 Fixe… AIRBUS INDU… A320…       2   182    NA
## # ℹ 3,319 more rows
## # ℹ 1 more variable: engine <chr>

inner_join

Finding all planes with the engine type ‘Reciprocating’ in the planes dataset, then adding the production year and number of seats to the dataset.

flights2 |> 
  inner_join(planes |> 
               filter(engine=="Reciprocating") |> 
               select(tailnum, year_prod = year, seats),
             by = "tailnum") |> 
  head(4)
## # A tibble: 4 × 8
##    year month   day  hour tailnum carrier year_prod seats
##   <int> <int> <int> <dbl> <chr>   <chr>       <int> <int>
## 1  2013     1     1     9 N737MQ  MQ           1977     4
## 2  2013     1     1     9 N517AA  AA             NA     2
## 3  2013     1     1     9 N508JB  B6           2007     4
## # ℹ 1 more row

semi_join

First creating a dataset with selected airlines to use with the filter joins.

selected_airlines <- airlines |> 
              filter(str_detect(str_to_upper(name), "AIRWAYS"))

selected_airlines
## # A tibble: 3 × 2
##   carrier name                       
##   <chr>   <chr>                      
## 1 B6      JetBlue Airways            
## 2 FL      AirTran Airways Corporation
## 3 US      US Airways Inc.

semi_join

Semi-joining our dataset with flights2.

flights3 <- flights2 |> 
  semi_join(selected_airlines, by = "carrier")

flights3 |> head()
## # A tibble: 6 × 6
##    year month   day  hour tailnum carrier
##   <int> <int> <int> <dbl> <chr>   <chr>  
## 1  2013     1     1     5 N804JB  B6     
## 2  2013     1     1     6 N516JB  B6     
## 3  2013     1     1     6 N593JB  B6     
## # ℹ 3 more rows

semi_join

Counting to see if there is only the type of ‘carrier’ that we expect.

flights3 |> 
  count(carrier)
## # A tibble: 3 × 2
##   carrier     n
##   <chr>   <int>
## 1 B6      54635
## 2 FL       3260
## 3 US      20536

anti_join

The opposite procedure can be done with anti-join

flights4 <- flights2 |> 
  anti_join(selected_airlines, by = "carrier")

flights4 |> head()
## # A tibble: 6 × 6
##    year month   day  hour tailnum carrier
##   <int> <int> <int> <dbl> <chr>   <chr>  
## 1  2013     1     1     5 N14228  UA     
## 2  2013     1     1     5 N24211  UA     
## 3  2013     1     1     5 N619AA  AA     
## # ℹ 3 more rows

anti_join

Checking if any of the carriers in selected_airlines exist in flights4.

flights4 |> 
  semi_join(selected_airlines, by = "carrier")
## # A tibble: 0 × 6
## # ℹ 6 variables: year <int>, month <int>, day <int>,
## #   hour <dbl>, tailnum <chr>, carrier <chr>

Further reading:

Exercises D2