1. Joins

We recommend opening Rstudio's cheat sheet for dplyr.

library("tidyverse")

Exercises on joins

We'll be joining tables from the nycflights13 data set

library("nycflights13")
## Warning: package 'nycflights13' was built under R version 4.4.2

Exercise 1.1

Select time_hour, flight, carrier and tailnum from flights and join with the planes table. Check that the new table has the same amount of rows as the flights table.

You'll need to use the functions select(), left_join() and nrow()

data_flights_planes <- flights |> 
  select(time_hour, flight, carrier, tailnum) |> 
  left_join(planes, by = "tailnum")

nrow(data_flights_planes) == nrow(flights)

Exercise 1.2: Filtering joins

Choose planes with 1, 3 or 4 engines from planes and use semi_join() to filter so that only flights with these planes are left.

Choose the desired rows first by using filter() and %in%

planes_for_join <- planes |>
  filter(engines %in% c(1, 3:4)) 

data_1_3_4_eng <- flights |>
  semi_join(planes_for_join, by = "tailnum")

Exercise 1.3: Filtering joins

Some of the planes have a missing value in the year variable. Remove flights with these planes by using anti_join(). Check that some flights have been removed.

Choose the planes to exclude first by using filter() and is.na().

planes_to_remove <- planes |>
  filter(is.na(year)) 

data_planes_year <- flights |>
  anti_join(planes_to_remove, by = "tailnum")

nrow(flights) > nrow(data_planes_year)

Exercise 1.4: Joining tables and creating new variables

Find out if any flights in the data have been flying at a higher average speed than that plane's listed speed.

You'll need to create multiple new variables using 'air_time' (which is in minutes) and 'distance' to be able to compare the miles per hour of the flight to that plane's speed.

data_flights_speed <- flights |> 
  left_join(planes, by = "tailnum") |> 
  mutate(air_time_hours = air_time / 60,
         mph = distance / air_time_hours,
         super_speed = mph > speed)

data_flights_speed |> count(super_speed)

Exercise 1.5

Using the weather data and the table you created in the last exercise, find out the overall average wind speed for flights that were faster than their listed plane speeds. Compare it to the average wind speed for the slower flights.

Look under time_hour in the documentation for the flights data set for an idea of how to join the weather data. Then use filter() and mean(). The latter function needs a vector as its first argument - you cannot pipe a table into it.

data_with_weather <- data_flights_speed |> 
  left_join(weather, by = c("origin", "time_hour"))

data_super_speed <- data_with_weather |> filter(super_speed == TRUE)
data_less_speed <- data_with_weather |> filter(super_speed == FALSE)

mean(data_super_speed$wind_speed, na.rm = TRUE)
mean(data_less_speed$wind_speed, na.rm = TRUE)