R for Data Science 5.6 – 5.7 Exercise

R4DS讀到現在,覺得這真是本很棒的入門書,tidyverse裡面的package一脈相承的設計哲學讓data manipulation可以系統化的學習,而非只是死背一卡車的函數。尤其dplyr,函數操作濃濃的SQL味道,很容易上手,讀到這裡終於有信心用R做前處理,而不是看完整本書覺得腦子裡只有碎片化的函數飛來飛去……

第五章最末兩小節的練習題特別多且應用性強,故獨立一篇記錄之。

Exercise 5.6.2
Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count())

#not_cancelled %>% count(dest)

#not_cancelled %>%
#  group_by(dest) %>%
#  summarise(n = n())

#not_cancelled %>%
#  group_by(dest) %>%
#  summarise(n = length(dest))
# using summarise() to create one or more scalar variables 
# summarizing the variables of an existing tbl
# summarise(), to reduce multiple values down to a single value

# count() is effectively a short-cut for group_by() followed by # tally()
#not_cancelled %>%
#  group_by(dest) %>%
#  tally()

#> # A tibble: 104 x 2
#>   dest      n
#>   <chr> <int>
#> 1 ABQ     254
#> 2 ACK     264
#> 3 ALB     418
#> 4 ANC       8
#> 5 ATL   16837
#> 6 AUS    2411
#> # … with 98 more rows
# not_cancelled %>% count(tailnum, wt = distance)

#not_cancelled %>%
#  group_by(tailnum) %>%
#  summarise(n = sum(distance))

# we can also use the combination group_by() and tally().
# any arguments to tally() are summed.
not_cancelled %>%
  group_by(tailnum) %>%
  tally(distance)

#> # A tibble: 4,037 x 2
#>   tailnum      n
#>   <chr>    <dbl>
#> 1 D942DN    3418
#> 2 N0EGMQ  239143
#> 3 N10156  109664
#> 4 N102UW   25722
#> 5 N103US   24619
#> 6 N104UW   24616
#> # … with 4,031 more rows

Exercise 5.6.3
Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay)) is slightly suboptimal. Why? Which is the most important column?

飛機起飛並不代表抵達,飛行過程可能發生意外或是折返回起飛機場,所以 arr_delay應該才是我們需要研究的目標column。

Exercise 5.6.4 *****
Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?

cancelled_per_day <-
  flights %>%
  mutate(cancelled = (is.na(arr_delay) | is.na(dep_delay))) %>%
# add a column "cancelled" , which show TRUE or FALSE
  group_by(year, month, day) %>%
  summarise(
    cancelled_num = sum(cancelled),
    flights_num = n(),
  )
# Plotting flights_num against cancelled_num
ggplot(cancelled_per_day) +
  geom_point(aes(x = flights_num, y = cancelled_num))

# with regression line
#  ggplot(cancelled_per_day, aes(x = flights_num, y = #cancelled_num)) +
#    geom_point() +
#    geom_smooth(method = "lm")
# 自己覺得畫出迴歸線後,發現 變數之間的關係不是很明顯
# whether there is a relationship between the proportion of 
# flights cancelled and the average departure delay

cancelled_and_delays <-
  flights %>%
  mutate(cancelled = (is.na(arr_delay) | is.na(dep_delay))) %>%
  group_by(year, month, day) %>%
  summarise(
    cancelled_prop = mean(cancelled),
    avg_dep_delay = mean(dep_delay, na.rm = TRUE),
                                 # remove NA             
    avg_arr_delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  ungroup()
  ggplot(cancelled_and_delays, aes(x = avg_dep_delay, y = cancelled_prop)) +
    geom_point() + 
    geom_smooth(method = lm)
  ggplot(cancelled_and_delays, aes(x = avg_arr_delay, y = cancelled_prop)) +
    geom_point() + 
    geom_smooth(method = lm)

Exercise 5.6.5
Which carrier has the worst delays?
Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))

flights %>%
  group_by(carrier) %>%
  summarise(arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(arr_delay))
#> # A tibble: 16 x 2
#>   carrier arr_delay
#>   <chr>       <dbl>
#> 1 F9           21.9
#> 2 FL           20.1
#> 3 EV           15.8
#> 4 YV           15.6
#> 5 OO           11.9
#> 6 MQ           10.8
#> # … with 10 more rows


filter(airlines, carrier == "F9")
#> # A tibble: 1 x 2
#>   carrier name                  
#>   <chr>   <chr>                 
#> 1 F9      Frontier Airlines Inc.

Exercise 5.7.2
Which plane (tailnum) has the worst on-time record?The question does not define a way to measure on-time record, so I will consider two metrics:
1. proportion of flights not delayed or cancelled, and
2. mean arrival delay.

# 1.
on_time_prop_final <- flights %>%
  filter(!is.na(tailnum)) %>%
  mutate(on_time = (arr_delay <= 0)) %>%
  group_by(tailnum) %>%
  summarise(n = n(), on_time_prop = mean(on_time, na.rm = TRUE)) %>%
  arrange(on_time_prop)
# 2. mean arrival delay.
mean_arr_delay <- flights %>%
  filter(!is.na(tailnum)) %>%
  group_by(tailnum) %>%
  summarise(n = n(), avg_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_arr_delay))

Exercise 5.7.3
What time of day should you fly if you want to avoid delays as much as possible?

best_hour <- flights %>%
  group_by(hour) %>%
  summarise(avg_arr_delay = mean(arr_delay, na.rm = TRUE),
            avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(avg_arr_delay, avg_arr_delay)

Exercise 5.7.4
For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.

dest_delay_total <- flights %>%
  filter(arr_delay > 0) %>%
  group_by(dest) %>%
  mutate(
    arr_delay_total = sum(arr_delay),
    # using sum()
    arr_delay_prop = arr_delay / arr_delay_total
  ) %>%
  select(dest, arr_delay_total, arr_delay_prop)

Exercise 5.7.5 *****
Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag() explore how the delay of a flight is related to the delay of the immediately preceding flight.

lead-lag {dplyr} R Documentation
Description
Find the “next" or “previous" values in a vector. Useful for comparing values ahead of or behind the current values.
lag():這一筆與下一筆觀察值的差,等於下筆觀察值減這一筆觀察值。
lead():這一筆與前一筆觀察值的差,等於這一筆觀察值減前一筆觀察值。

Step 1.
Calculates the departure delay of the preceding flight from the same airport.

lagged_delays <- flights %>%
  arrange(origin, month, day, dep_time) %>%
  group_by(origin) %>%
  mutate(dep_delay_lag = lag(dep_delay)) %>%
  filter(!is.na(dep_delay), !is.na(dep_delay_lag))

Step 2.
Plots the relationship between the mean delay of a flight for all values of the previous flight. 
“Previous Departure Delay" against “Departure Delay"

lagged_delays %>%
  group_by(dep_delay_lag) %>%
  summarise(dep_delay_mean = mean(dep_delay)) %>%
  ggplot(aes(y = dep_delay_mean, x = dep_delay_lag)) +
  geom_point() +

# change distance between breaks for continuous x-axis ---
  scale_x_continuous(breaks = seq(0, 1500, by = 120)) +

  labs(y = "Departure Delay", x = "Previous Departure Delay")

“Previous Departure Delay" 與 “Departure Delay"之間的關係,在個別機場上或是 overall(所有機場的數據)的效果相似。

lagged_delays %>%
  group_by(origin, dep_delay_lag) %>%
  summarise(dep_delay_mean = mean(dep_delay)) %>%
  ggplot(aes(y = dep_delay_mean, x = dep_delay_lag)) +
  geom_point() +
  # facet_wrap() 分割子圖之參數設定
  facet_wrap(~origin, ncol = 1) +
  labs(y = "Departure Delay", x = "Previous Departure Delay")

Exercise 5.7.6 *****
Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?

使用Standardization偵測不尋常的觀察值。

standardized_flights <- flights %>%
  filter(!is.na(air_time)) %>%
  group_by(dest, origin) %>%
  #from the same origin to the same destination
  mutate(
    air_time_mean = mean(air_time),
    air_time_sd = sd(air_time),
    n = n()
  ) %>%
  ungroup() %>%
  mutate(air_time_standard = (air_time - air_time_mean) / (air_time_sd + 1))
# add 1 to the denominator and numerator to avoid dividing 
# by zero
# 等價於以上程式區塊
standardized_flights <- flights %>%
  filter(!is.na(air_time)) %>%
  group_by(origin, dest) %>%
  mutate( n = n(),
         air_time_mean = mean(air_time),
         air_time_sd = sd(air_time),
         ) %>%
  mutate(standard_air_time = (air_time - air_time_mean)
         / (air_time_sd + 1 )
         ) %>%
  select(standard_air_time, everything())
ggplot(standardized_flights, aes(x = air_time_standard)) +
  geom_density()
# show the top ten outliers
standardized_flights %>%
  arrange(air_time_standard) %>%
  select(
    carrier, flight, origin, dest, month, day,
    air_time, air_time_mean, air_time_standard
  ) %>%
  head(10) %>%
  print(width = Inf) # optional to ensure that all columns will 
                     # be printed ---
# use the median as a measure of central tendency and the 
# interquartile range (IQR) as a measure of spread
# more robust to outliers

standardized_flights2 <- flights %>%
  filter(!is.na(air_time)) %>%
  group_by(dest, origin) %>%
  mutate(
    air_time_median = median(air_time),
    air_time_iqr = IQR(air_time),
    n = n(),
    air_time_standard = (air_time - air_time_median) / air_time_iqr
  )

ggplot(standardized_flights2, aes(x = air_time_standard)) +
  geom_density()

Knowing the substance of the data analysis at hand is one of the most important tools of a data scientist. The tools of statistics are a complement, not a substitute, for that knowledge.

Exercise 5.7.7 *****
Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.

flights %>%
  # find all airports with > 1 carrier
  group_by(dest) %>%
  mutate(n_carriers = n_distinct(carrier)) %>%
  # use n_distinct() to count diff carrier
  filter(n_carriers > 1) %>%
  # rank carriers by numer of destinations
  group_by(carrier) %>%
  summarize(n_dest = n_distinct(dest)) %>%
  arrange(desc(n_dest))

Exercise 5.7.8 *******
For each plane, count the number of flights before the first delay of greater than 1 hour.

拆解問題,分為三個程式區塊實作:

#1. reconstruct the tbl
a <- flights %>%
  # sort in increasing order
  select(tailnum, year, month, day, dep_delay) %>%
  filter(!is.na(dep_delay)) %>%
  arrange(tailnum, year, month, day) %>%
  group_by(tailnum)
# 2. add a column to tbl, 
# "num of flights greater than 1 hour"
b <- flights %>%
  # sort in increasing order
  select(tailnum, year, month, day, dep_delay) %>%
  filter(!is.na(dep_delay)) %>%
  arrange(tailnum, year, month, day) %>%
  group_by(tailnum) %>%
  # cumulative number of flights delayed over one hour
  # hint: use cumsum() 
  mutate(cumulative_hr_delays = cumsum(dep_delay > 60))
ans <- flights %>%
  # sort in increasing order
  select(tailnum, year, month, day, dep_delay) %>%
  filter(!is.na(dep_delay)) %>%
  arrange(tailnum, year, month, day) %>%
  group_by(tailnum) %>%
  # cumulative number of flights delayed over one hour
  mutate(cumulative_hr_delays = cumsum(dep_delay > 60)) %>%
  # count the number of flights == 0
  # "count the number of flights before the first delay" ---
  summarise(total_flights = sum(cumulative_hr_delays < 1)) %>%
  arrange(total_flights)

發表者:Q

塵世中一個迷途小書僮

發表留言

使用 WordPress.com 設計專業網站
立即開始使用