R for Data Science Ch 7.4-7.5.1

7.3節談到outlier,7.4節接續介紹處理outlier的方法。

7.4 Missing values
當在data set中出現異常值、離群值時,常用以下兩個方法解決:

  1. Drop the entire row with the strange values:
diamonds2 <- diamonds %>% 
  filter(between(y, 3, 20))

作者並不建議直接將觀察值丟棄:一個觀察值的單一 column 是離群值,並不代表該觀察值的所有 column 都有問題,而且丟棄整列觀察值會損失樣本。

2. Instead, I recommend replacing the unusual values with missing values.
作者建議以遺漏值 (missing value) 置換 unusual values,在 tidyverse 中可以使用 mutate() 搭配 ifelse() 或 case_when() 實作。

diamonds2 <- diamonds %>% 
  mutate(y = ifelse(y < 3 | y > 20, NA, y))
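
若改以 case_when() 實作,等價寫法如下(最小示意;case_when() 要求各分支回傳相同型別,故以 NA_real_ 表示遺漏值):

diamonds2 <- diamonds %>% 
  mutate(y = case_when(
    y < 3 | y > 20 ~ NA_real_,  # 離群值置換為遺漏值
    TRUE ~ y
  ))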

使用 ggplot2 繪圖時,會預設移除 missing value 再繪圖,但會跳出 warning 訊息。

ggplot(data = diamonds2, mapping = aes(x = x, y = y)) + 
  geom_point()
#> Warning: Removed 9 rows containing missing values (geom_point).

如果要隱藏 warning message,可以設定 na.rm = TRUE,如以下程式碼:

ggplot(data = diamonds2, mapping = aes(x = x, y = y)) + 
  geom_point(na.rm = TRUE)

若需要觀察遺漏值,可以使用 is.na() 函數取出遺漏值,並繪圖觀察:

nycflights13::flights %>% 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>% 
  ggplot(mapping = aes(sched_dep_time)) + 
    geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)

7.4.1 Exercises
What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?

繪製直方圖時,預設就會移除遺漏值再進行繪圖:

# Missing values are removed when the number of observations 
# in each bin are calculated.
diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

ggplot(diamonds2, aes(x = y)) +
  geom_histogram()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Removed 9 rows containing non-finite values (stat_bin).

繪製長條圖時,遺漏值則會獨立成一個 group 顯示在長條圖上。

# In the geom_bar() function, NA is treated as another category. 
diamonds %>%
  mutate(cut = if_else(runif(n()) < 0.1, NA_character_, as.character(cut))) %>%
  ggplot() +
  geom_bar(mapping = aes(x = cut))

Exercise 7.4.2
What does na.rm = TRUE do in mean() and sum()?

在 sum() 與 mean() 中使用 na.rm = TRUE 會在運算前移除遺漏值。

mean(c(0, 1, 2, NA), na.rm = TRUE)
#> [1] 1
sum(c(0, 1, 2, NA), na.rm = TRUE)
#> [1] 3
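
作為對照,若不設定 na.rm,只要向量中含有 NA,運算結果就會是 NA:

mean(c(0, 1, 2, NA))
#> [1] NA
sum(c(0, 1, 2, NA))
#> [1] NA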

7.5 Covariation
變異 (variation) 描述的是單一變數內數值的變動,而共變 (covariation) 描述的是變數之間的關係:Covariation is the tendency for the values of two or more variables to vary together in a related way. 觀察共變最好的方法就是將變數之間的關係視覺化,變數的類型將決定使用何種視覺化方法。

7.5.1 A categorical and continuous variable
通常以 frequency polygon 作圖,但是 geom_freqpoly() 預設的 y 軸為 count(計數),當某些組別的 count 數很小時,不容易看出組別間分配的差異,不利於比較,以下為例子:

ggplot(data = diamonds, mapping = aes(x = price)) + 
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)

比較困難的原因在於各組別的總計數差異懸殊,由以下長條圖可以看出「cut」各等級的觀察值個數相差甚大:

ggplot(diamonds) + 
  geom_bar(mapping = aes(x = cut))

為了容易比較組別之間的差距,我們將縱軸 (y-axis) 設為機率密度 (density),density 為標準化後的總計數 (count standardised)。縱軸為機率密度的折線圖,每條折線之下的區域面積總和為 1。以下為縱軸為機率密度的折線圖:

# y-axis: replace count with density
ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) + 
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)

由以上的機率密度圖可看出,Fair 等級的鑽石「似乎」有最高的平均價格,但折線圖其實不易判斷組別之間的中央趨勢,接著我們再以箱型圖 (boxplot) 深入探索「a continuous variable broken down by a categorical variable」,以下以 geom_boxplot() 繪製:

# using geom_boxplot() to plot boxplot
# "cut" is an ordered factor
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot()

在上圖範例中,「cut」變數本身有次序性。
若用來繪製箱型圖的類別變數本身沒有次序性,我們可以使用 reorder() 函數將類別變數之下不同的 level 重新排序。
以下範例我們使用不同 level 之下資料點的中位數為重新排序的標準:

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()
# reorder() "class" with median of "hwy" at each level
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))
# using coord_flip() to flip plot 90° to deal with long variable names
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  coord_flip()

7.5.1.1 Exercises
Use the boxplot to improve the visualization of the departure times of cancelled vs. non-cancelled flights.

nycflights13::flights %>%
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>%
  ggplot() +
  geom_boxplot(mapping = aes(y = sched_dep_time, x = cancelled))

Exercise 7.5.1.2
先針對第一個提問作圖:What variable in the diamonds dataset is most important for predicting the price of a diamond? 

# Use a scatter plot to visualize relationship between "price" and "carat"
# scatter plots display the relationship between two continuous variables
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()

由上圖可以看出觀察值(資料點)個數非常多,以下我們將「carat」裝箱 (binning),對每個 bin 繪製箱型圖。裝箱時 binwidth 數值的決定非常重要:數值設定太大,會遮蔽住變數間的關係;數值設定太小,每個 bin 中的樣本個數太少,雜訊會掩蓋整體趨勢。

ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))

「color」和「clarity」屬於已排序的類別變數,作者同樣使用 boxplot 描繪它們與「price」(連續變數)之間的關係,並認為箱型圖更容易觀察出類別變數與連續變數之間的關係。

# reverse the order of the color levels 
# so they will be in increasing order of quality on the x-axis
# before plotting
diamonds %>%
  mutate(color = fct_rev(color)) %>%
  ggplot(aes(x = color, y = price)) +
  geom_boxplot()

# a weak negative relationship between "color" and "price"
# a weak negative relationship between "clarity" and "price" 
# the scale of clarity goes from I1 (worst) to IF (best)
ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = clarity, y = price))

綜合以上三張圖可以發現,「clarity」和「color」這兩個自變數的組內變異皆大於組間變異,它們與「price」之間的微弱關係因此被掩蓋;而「carat」是預測「price」較佳的自變數 (the best predictor of price)。

再來,以箱型圖描繪「cut」與「carat」之間的關係:

ggplot(diamonds, aes(x = cut, y = carat)) +
  geom_boxplot()

由上圖可以看出,「cut」變數之下各 level 的組內變異數非常大,我們注意到:最大克拉數的鑽石有著最差等級的切割,而「cut」與「carat」之間有些微的負向關係。可能是因為大克拉數的鑽石只要普通甚至略差的切割工藝就能賣出,但是小克拉數的鑽石則需要較高等級的切割工藝才能賣出。

Exercise 7.5.1.3
Install the ggstance package, and create a horizontal box plot (水平箱型圖). How does this compare to using coord_flip()?

geom_boxplot() 函數與 coord_flip() 函數製作的水平箱型圖如下:

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  coord_flip()
# 畫出一般箱型圖再翻轉

以下程式區塊使用 ggstance 這個包繪製水平箱型圖:

library("ggstance")

ggplot(data = mpg) +
  geom_boxploth(mapping = aes(y = reorder(class, hwy, FUN = median), x = hwy))
# 使用 ggstance 的 geom_boxploth(),直接指定橫軸與縱軸

Exercise 7.5.1.4
箱型圖適用於小樣本的資料集,以下介紹應用於大數據的箱型圖──Letter-Value Plots: Boxplots for Large Data。letter-value plot 引入一系列額外的順序統計量 "letter values",能夠更好地描述落在 Q1 與 Q3 之外的 tail behavior。一般來說,大數據相較於小樣本有以下特點:
  1. Give precise estimates of quantiles(分位數) beyond the quartiles(四分位數).
  2. Larger datasets should have more outliers (in absolute numbers).

在R語言中使用 lvplot 包中的geom_lv() 繪製Letter-Value Plots。

library("tidyverse")
library("lvplot")
ggplot(diamonds, aes(x = cut, y = price)) +
  geom_lv()

Exercise 7.5.1.5
Compare and contrast geom_violin() with a faceted geom_histogram(), or a colored geom_freqpoly(). What are the pros and cons of each method?

著色的 geom_freqpoly() 利於快速查看整體趨勢,如以下範例,我們可以一眼看出「cut」之下的哪個 level 有最高的「price」,但折線間的重疊使我們難以區分整體分配與各組別間的關係(the overlapping lines makes it difficult to distinguish how the overall distributions relate to each other)。

geom_violin() 與 faceted geom_histogram() 有相似的優點和缺點:能用肉眼就了解各個 level 之下的資料長相(偏度、中央趨勢、離散趨勢等等);但是難以比較分配的縱軸數值,無法一眼看出「cut」之下的哪個 level 有最高的「price」。

以上三種繪圖函數呈現的分配形狀,皆取決於 tuning parameters(例如 binwidth)。(All of these methods depend on tuning parameters to determine the level of smoothness of the distribution.)
以下是三種繪圖函數的程式碼與繪圖範例:

# geom_freqpoly()
ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
  geom_freqpoly(mapping = aes(color = cut), binwidth = 500)
# faceted geom_histogram()
ggplot(data = diamonds, mapping = aes(x = price)) +
  geom_histogram() +
  facet_wrap(~cut, ncol = 1, scales = "free_y")
# geom_violin()
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_violin() +
  coord_flip()

Exercise 7.5.1.6
在數據集屬於小樣本的情況下,geom_jitter() 可以讓連續型變數與類別變數間的關係更清楚。ggbeeswarm 包提供許多與 geom_jitter() 功能相似的方法,如以下說明(範例程式見清單之後):

  • geom_quasirandom() produces plots that are a mix of jitter and violin plots. There are several different methods that determine exactly how the random location of the points is generated.
  • geom_beeswarm() produces a plot similar to a violin plot, but by offsetting the points.
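
以下為一個最小示意(假設已安裝 ggbeeswarm,沿用 mpg 資料集):

library("ggbeeswarm")

# geom_quasirandom():jitter 與 violin 的混合
ggplot(data = mpg) +
  geom_quasirandom(mapping = aes(x = class, y = hwy))

# geom_beeswarm():以點的偏移 (offset) 呈現分布
ggplot(data = mpg) +
  geom_beeswarm(mapping = aes(x = class, y = hwy))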

R for Data Science Ch 7.1 – 7.3

7 Exploratory Data Analysis
This chapter will show you how to use visualization and transformation to explore your data in a systematic way, a task that statisticians call exploratory data analysis, or EDA for short. 
EDA is an iterative cycle :

  1. Generate questions about your data.
  2. Search for answers by visualizing, transforming, and modelling your data (the tools of EDA).
  3. Use what you learn to refine your questions and/or generate new questions.

探索性資料分析 (Exploratory Data Analysis, EDA) 就是重複:提問、以資料回答問題、根據資料修正或提出新問題的循環過程。涉及視覺化、資料轉換、建模等步驟。EDA的目的就是深入了解資料。

究竟應提出甚麼問題以開啟EDA的流程,並無特定方法或規則可依循,本書建議可以由以下兩類問題開始進行EDA:

  1. What type of variation occurs within my variables?
  2. What type of covariation occurs between my variables?

7.2節開始定義一些常用的統計學名詞:observation, variable, value, tabular data, 和大多統計書籍裡常用名詞一樣,沒有特別艱澀冷僻的概念,有一般統計概念的讀者應可迅速上手。

7.3 Variation
The best way to understand that pattern is to visualize the distribution of the variable’s values.
針對感興趣的變數進行視覺化前,首先需要判斷變數是類別 (categorical) 或連續 (continuous)。

在R語言中,類別變數會以 factor vector 或 character vector 儲存 ,通常用 bar chart 描繪類別變數的機率分配。

library(tidyverse)
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
# use dplyr::count() to count sample size in each group
diamonds %>% 
  count(cut)
#> # A tibble: 5 x 2
#>   cut           n
#>   <ord>     <int>
#> 1 Fair       1610
#> 2 Good       4906
#> 3 Very Good 12082
#> 4 Premium   13791
#> 5 Ideal     21551

在R語言中,連續型變數會是 number 或 date-time 格式 ,通常用 histogram 描繪連續型變數的機率分配。

ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
# combine dplyr::count() and ggplot2::cut_width() 
# to show the number of observations that fall in each bin

diamonds %>% 
  count(cut_width(carat, 0.5))
#> # A tibble: 11 x 2
#>   `cut_width(carat, 0.5)`     n
#>   <fct>                   <int>
#> 1 [-0.25,0.25]              785
#> 2 (0.25,0.75]             29498
#> 3 (0.75,1.25]             15977
#> 4 (1.25,1.75]              5313
#> 5 (1.75,2.25]              2002
#> 6 (2.25,2.75]               322
#> # … with 5 more rows

繪製 histogram 時,可以使用 binwidth 引數調整每個 bin 的寬度(以 x 軸變數的單位衡量),寬度的調整可能會改變分配呈現的形狀,如以下程式區塊及其圖形:

smaller <- diamonds %>% 
  filter(carat < 3)
  
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

如果想在同一張圖上顯示(重疊,覆蓋)許多張 histogram ,建議使用折線圖 ,即geom_freqpoly() 函數, geom_freqpoly() 進行與 geom_histogram() 相同的計算,但是以折線描繪分配。

ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
  geom_freqpoly(binwidth = 0.1)

以長條圖、直方圖或是折線圖描繪出分配後,可針對變數的變異性 (variation) 深入探討以下問題:

  • Which values are the most common? Why?
  • Which values are rare? Why? Does that match your expectations?
  • Can you see any unusual patterns? What might explain them?

以 diamonds 數據集為例,我們可以在繪出直方圖 (histogram) 後探討以下問題:

  • Why are there more diamonds at whole carats and common fractions of carats?
  • Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?
  • Why are there no diamonds bigger than 3 carats?
smaller <- diamonds %>% 
  filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.01)

由上圖可以發現,相似的數值聚集成數個 cluster,暗示資料中存在 subgroup (Clusters of similar values suggest that subgroups exist in your data.),引申出以下問題:

  • How are the observations within each cluster similar to each other?
  • How are the observations in separate clusters different from each other?
  • How can you explain or describe the clusters?
  • Why might the appearance of clusters be misleading?

7.3.3 Unusual values
Outliers are observations that are unusual; data points that don’t seem to fit the pattern.
當資料數量非常大時,直方圖難以辨認出離群值 (outlier):

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5)

為了能夠辨認出以上直方圖中的離群值,我們可以使用 coord_cartesian() 函數,將 y 軸的單位 zoom-in 至較小的全距。
The Cartesian coordinate system (笛卡爾座標) is the most familiar, and common, type of coordinate system.
Setting limits on the coordinate system will zoom the plot (like you’re looking at it with a magnifying glass), and will not change the underlying data like setting limits on a scale will.

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))
  # y 軸全距只顯示 (0, 50)
# 在 ggplot中也有 xlim() ylim() 函數,但是採取直接把設定範圍之外的
# 資料點全部丟棄的做法,和 coord_cartesian() 的作法不同

由上圖可看出,在 x 軸(即變數 y 的值)為 0、約 30、約 60 的地方有離群值。進一步抓出這些離群值:

unusual <- diamonds %>% 
  filter(y < 3 | y > 20) %>% 
  select(price, x, y, z) %>%
  arrange(y)
unusual
#> # A tibble: 9 x 4
#>   price     x     y     z
#>   <int> <dbl> <dbl> <dbl>
#> 1  5139  0      0    0   
#> 2  6381  0      0    0   
#> 3 12800  0      0    0   
#> 4 15686  0      0    0   
#> 5 18034  0      0    0   
#> 6  2130  0      0    0   
#> 7  2130  0      0    0   
#> 8  2075  5.15  31.8  5.12
#> 9 12210  8.09  58.9  8.06

x、y、z 測量的是鑽石三個維度的長度,以 mm 為單位,顯然不可能出現數值為零的觀察值,故可以剔除這些數值為零的觀察值。32mm 與 59mm 的觀察值也明顯不合理,因為這樣的寬度對於鑽石來說太過離譜。

若去除離群值後的分析結果幾乎等於保留離群值的分析結果,可以使用遺漏值( missing value )取代離群值;若去除離群值的分析結果與保留離群值的分析結果差距甚大,就必須進一步研究造成離群值的原因。

7.3 Exercises

7.3.1
Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

# 首先使用 summary() 列出各變數的敘述統計量
summary(select(diamonds, x, y, z))
#>        x               y              z       
#>  Min.   : 0.00   Min.   : 0.0   Min.   : 0.0  
#>  1st Qu.: 4.71   1st Qu.: 4.7   1st Qu.: 2.9  
#>  Median : 5.70   Median : 5.7   Median : 3.5  
#>  Mean   : 5.73   Mean   : 5.7   Mean   : 3.5  
#>  3rd Qu.: 6.54   3rd Qu.: 6.5   3rd Qu.: 4.0  
#>  Max.   :10.74   Max.   :58.9   Max.   :31.8

#再繪製各變數的密度分配,連續型變數使用直方圖
ggplot(diamonds) +
  geom_histogram(mapping = aes(x = x), binwidth = 0.01)

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.01)

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = z), binwidth = 0.01)

綜合以上統計量與機率密度分配,我們有以下幾點發現:

  1. 針對各變數之四分位距 ( interquartile range, IQR, Q3 – Q1) 知, x 與 y 普遍的觀察值大於 z ( with x and y having inter-quartile ranges of 4.7-6.5, while z has an inter-quartile range of 2.9-4.0 )
  2. 各變數皆有離群值
  3. 各變數之機率密度分配皆為右偏(正偏)分配
  4. 各變數之機率密度分配皆有多個peak,可能屬於multimodal distribution

# 比較三個變數之間的大小關係
summarise(diamonds, mean(x > y), mean(x > z), mean(y > z))
#> # A tibble: 1 x 3
#>   `mean(x > y)` `mean(x > z)` `mean(y > z)`
#>           <dbl>         <dbl>         <dbl>
#> 1         0.434         1.000         1.000
# 取出各變數中數值為零的觀察值,進一步研究這些離群值
filter(diamonds, x == 0 | y == 0 | z == 0)

#> # A tibble: 20 x 10
#>   carat cut     color clarity depth table price     x     y     z
#>   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1  1    Premium G     SI2      59.1    59  3142  6.55  6.48     0
#> 2  1.01 Premium H     I1       58.1    59  3167  6.66  6.6      0
#> 3  1.1  Premium G     SI2      63      59  3696  6.5   6.47     0
#> 4  1.01 Premium F     SI2      59.2    58  3837  6.5   6.47     0
#> 5  1.5  Good    G     I1       64      61  4731  7.15  7.04     0
#> 6  1.07 Ideal   F     SI2      61.6    56  4954  0     6.62     0
#> # … with 14 more rows

# 取出變數 y、z 中,數值異常大的離群值
# 分別是 y == 58.9, y == 31.8, z == 31.8
# 由大到小列出前幾筆數值
diamonds %>%
  arrange(desc(y)) %>%
  head()
#> # A tibble: 6 x 10
#>   carat cut     color clarity depth table price     x     y     z
#>   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1  2    Premium H     SI2      58.9    57 12210  8.09 58.9   8.06
#> 2  0.51 Ideal   E     VS1      61.8    55  2075  5.15 31.8   5.12
#> 3  5.01 Fair    J     I1       65.5    59 18018 10.7  10.5   6.98
#> 4  4.5  Fair    J     I1       65.8    58 18531 10.2  10.2   6.72
#> 5  4.01 Premium I     I1       61      61 15223 10.1  10.1   6.17
#> 6  4.01 Premium J     I1       62.5    62 15223 10.0   9.94  6.24
# 由大到小列出前幾筆數值
diamonds %>%
  arrange(desc(z)) %>%
  head()
#> # A tibble: 6 x 10
#>   carat cut       color clarity depth table price     x     y     z
#>   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1  0.51 Very Good E     VS1      61.8  54.7  1970  5.12  5.15 31.8 
#> 2  2    Premium   H     SI2      58.9  57   12210  8.09 58.9   8.06
#> 3  5.01 Fair      J     I1       65.5  59   18018 10.7  10.5   6.98
#> 4  4.5  Fair      J     I1       65.8  58   18531 10.2  10.2   6.72
#> 5  4.13 Fair      H     I1       64.8  61   17329 10     9.85  6.43
#> 6  3.65 Fair      H     I1       67.1  53   11668  9.53  9.48  6.38

觀察到:不同變數的離群值之間似乎沒有相關聯。
進一步針對兩兩變數做散佈圖,觀察變數與變數之間的關係:

ggplot(diamonds, aes(x = x, y = y)) +
  geom_point()
ggplot(diamonds, aes(x = x, y = z)) +
  geom_point()
ggplot(diamonds, aes(x = y, y = z)) +
  geom_point()

以下我們將原始資料剔除離群值後,探索 x, y, z 三個變數的機率密度分布:

filter(diamonds, x > 0, x < 10) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = x), binwidth = 0.01) +
  scale_x_continuous(breaks = 1:10)
# scale_x_continuous(breaks = 1:10) 用來調整 x 軸刻度


filter(diamonds, y > 0, y < 10) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = y), binwidth = 0.01) +
  scale_x_continuous(breaks = 1:10)

filter(diamonds, z > 0, z < 10) %>%
  ggplot() +
  geom_histogram(mapping = aes(x = z), binwidth = 0.01) +
  scale_x_continuous(breaks = 1:10)

由以上三張圖,推測 spike 來自於鑽石切割中常用的數值。

Exercise 7.3.2
Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
藉由改變直方圖的 binwidth,觀察變數的 global feature 和 local feature。

summary(select(diamonds, price))
#     price      
# Min.   :  326  
# 1st Qu.:  950  
# Median : 2401  
# Mean   : 3933  
# 3rd Qu.: 5324  
# Max.   :18823 
# 先針對小於中位數的數值繪圖
# binwidth = 10
ggplot(filter(diamonds, price < 2500), aes(x = price)) +
  # filter(...) 為 ggplot 的 data 引數
  geom_histogram(binwidth = 10, center = 0)

# 針對全部資料作圖
# binwidth = 100
ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 100, center = 0)

Exercise 7.3.3
How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?

diamonds %>%
  filter(carat >= 0.99, carat <= 1) %>%
# 列出 carat 介於 0.99 到 1 間的觀察值
  count(carat)
#> # A tibble: 2 x 2
#>   carat     n
#>   <dbl> <int>
#> 1  0.99    23
#> 2  1     1558

1 克拉的觀察值個數是0.99克拉的近七十倍。
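
進一步列出 0.9 至 1.1 克拉之間各數值的計數(示意),可以看到計數在 1 克拉處跳升,推測賣方傾向將接近整數的重量「進位」標示:

diamonds %>%
  filter(carat >= 0.9, carat <= 1.1) %>%
  count(carat) %>%
  print(n = Inf)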

Exercise 7.3.4
Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?

coord_cartesian() 函數是在運算及繪圖之後,針對指定(限定)的範圍 zoom-in

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = price)) +
  coord_cartesian(xlim = c(100, 5000), ylim = c(0, 3000))

xlim() 與 ylim() 函數則是在統計運算前就將落在 x- 與 y-limits 之外的資料丟棄,因此會影響整體直方圖的長相。

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = price)) +
  xlim(100, 5000) +
  ylim(0, 3000)

R for Data Science Ch.6 & 8

第六章講解一些 RStudio 的快捷鍵、使用 R 語言和他人協作時的慣例(共享的 code 開頭要寫明使用哪些 package,但盡量避免 install.packages() 或 setwd() 等會變更他人環境設定的指令),以及 RStudio 回報錯誤時如何診斷問題來源。

看到以下畫面表示有 Syntax error,游標移動過去就可以看到 error message。

這裡是RStudio官方針對Code Diagnostics的頁面。

第八章講解使用RStudio實作專案時,工作區的設定、資料夾和路徑的管理。

R has a powerful notion of the working directory. This is where R looks for files that you ask it to load, and where it will put any files that you ask it to save.

# running getwd() to print current working directory
getwd()

8.3 Paths and directories
Paths and directories are a little complicated because there are two basic styles of paths:

  1. Windows uses backslashes to separate the components of the path (e.g. plots\diamonds.pdf), while Mac and Linux use slashes (e.g. plots/diamonds.pdf). (路徑寫法的示意見清單後)
  2. Absolute paths (i.e. paths that point to the same place regardless of your working directory) look different. In Windows they start with a drive letter (e.g. C:) or two backslashes (e.g. \\servername).
    You should never use absolute paths in your scripts, because they hinder sharing: no one else will have exactly the same directory configuration as you.
  3. The last minor difference is the place that ~ points to. ~ is a convenient shortcut to your home directory. Windows doesn’t really have the notion of a home directory, so it instead points to your documents directory.
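
以讀取專案內資料檔為例(示意;檔名為假設):

# 相對路徑:相對於 project 的 working directory
df <- read.csv("data/my_data.csv")  # 假設的檔案路徑,僅為示意
# 避免絕對路徑,例如 "C:\\Users\\me\\data\\my_data.csv"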

8.4 RStudio projects

R experts keep all the files associated with a project together — input data, R scripts, analytical results, figures. 在 RStudio 中內建 projects 來實現以上功能。

Click File > New Project, then:

務必記得subdirectory!!

完成以上設定後,輸入 getwd() 確認目前的 working directory 是否為 project 所設定的路徑。

8.5 Summary
R project內建了嚴謹的工作流,確保project之間的隔離與單一project內的工作便利性,其workflow如下:

  • Create an RStudio project for each data analysis project.
  • Keep data files there; we’ll talk about loading them into R in data import.
  • Keep scripts there; edit them, run them in bits or as a whole.
  • Save your outputs (plots and cleaned data) there.
  • Only ever use relative paths, not absolute paths.

R for Data Science 5.6 – 5.7 Exercise

R4DS讀到現在,覺得這真是本很棒的入門書,tidyverse裡面的package一脈相承的設計哲學讓data manipulation可以系統化的學習,而非只是死背一卡車的函數。尤其dplyr,函數操作濃濃的SQL味道,很容易上手,讀到這裡終於有信心用R做前處理,而不是看完整本書覺得腦子裡只有碎片化的函數飛來飛去……

第五章最末兩小節的練習題特別多且應用性強,故獨立一篇記錄之。

Exercise 5.6.2
Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count())

#not_cancelled %>% count(dest)

#not_cancelled %>%
#  group_by(dest) %>%
#  summarise(n = n())

#not_cancelled %>%
#  group_by(dest) %>%
#  summarise(n = length(dest))
# using summarise() to create one or more scalar variables 
# summarizing the variables of an existing tbl
# summarise(), to reduce multiple values down to a single value

# count() is effectively a short-cut for group_by() followed by tally()
#not_cancelled %>%
#  group_by(dest) %>%
#  tally()

#> # A tibble: 104 x 2
#>   dest      n
#>   <chr> <int>
#> 1 ABQ     254
#> 2 ACK     264
#> 3 ALB     418
#> 4 ANC       8
#> 5 ATL   16837
#> 6 AUS    2411
#> # … with 98 more rows
# not_cancelled %>% count(tailnum, wt = distance)

#not_cancelled %>%
#  group_by(tailnum) %>%
#  summarise(n = sum(distance))

# we can also use the combination group_by() and tally().
# any arguments to tally() are summed.
not_cancelled %>%
  group_by(tailnum) %>%
  tally(distance)

#> # A tibble: 4,037 x 2
#>   tailnum      n
#>   <chr>    <dbl>
#> 1 D942DN    3418
#> 2 N0EGMQ  239143
#> 3 N10156  109664
#> 4 N102UW   25722
#> 5 N103US   24619
#> 6 N104UW   24616
#> # … with 4,031 more rows

Exercise 5.6.3
Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay)) is slightly suboptimal. Why? Which is the most important column?

飛機起飛並不代表一定抵達:飛行過程可能發生意外或折返回起飛機場,所以 arr_delay 才是判斷航班是否取消時最重要的 column。
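
可以驗證確實存在「有起飛紀錄、卻沒有抵達延誤紀錄」的航班(示意):

flights %>%
  filter(!is.na(dep_delay), is.na(arr_delay)) %>%
  select(dep_time, dep_delay, arr_time, arr_delay)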

Exercise 5.6.4 *****
Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?

cancelled_per_day <-
  flights %>%
  mutate(cancelled = (is.na(arr_delay) | is.na(dep_delay))) %>%
# add a column "cancelled" , which show TRUE or FALSE
  group_by(year, month, day) %>%
  summarise(
    cancelled_num = sum(cancelled),
    flights_num = n()
  )
# Plotting flights_num against cancelled_num
ggplot(cancelled_per_day) +
  geom_point(aes(x = flights_num, y = cancelled_num))

# with regression line
# ggplot(cancelled_per_day, aes(x = flights_num, y = cancelled_num)) +
#   geom_point() +
#   geom_smooth(method = "lm")
# 畫出迴歸線後,發現變數之間的關係不是很明顯
# whether there is a relationship between the proportion of 
# flights cancelled and the average departure delay

cancelled_and_delays <-
  flights %>%
  mutate(cancelled = (is.na(arr_delay) | is.na(dep_delay))) %>%
  group_by(year, month, day) %>%
  summarise(
    cancelled_prop = mean(cancelled),
    avg_dep_delay = mean(dep_delay, na.rm = TRUE),  # remove NA
    avg_arr_delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  ungroup()

ggplot(cancelled_and_delays, aes(x = avg_dep_delay, y = cancelled_prop)) +
  geom_point() +
  geom_smooth(method = lm)

ggplot(cancelled_and_delays, aes(x = avg_arr_delay, y = cancelled_prop)) +
  geom_point() +
  geom_smooth(method = lm)

Exercise 5.6.5
Which carrier has the worst delays?
Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))

flights %>%
  group_by(carrier) %>%
  summarise(arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(arr_delay))
#> # A tibble: 16 x 2
#>   carrier arr_delay
#>   <chr>       <dbl>
#> 1 F9           21.9
#> 2 FL           20.1
#> 3 EV           15.8
#> 4 YV           15.6
#> 5 OO           11.9
#> 6 MQ           10.8
#> # … with 10 more rows


filter(airlines, carrier == "F9")
#> # A tibble: 1 x 2
#>   carrier name                  
#>   <chr>   <chr>                 
#> 1 F9      Frontier Airlines Inc.

Exercise 5.7.2
Which plane (tailnum) has the worst on-time record? The question does not define a way to measure on-time record, so I will consider two metrics:
1. proportion of flights not delayed or cancelled, and
2. mean arrival delay.

# 1.
on_time_prop_final <- flights %>%
  filter(!is.na(tailnum)) %>%
  mutate(on_time = (arr_delay <= 0)) %>%
  group_by(tailnum) %>%
  summarise(n = n(), on_time_prop = mean(on_time, na.rm = TRUE)) %>%
  arrange(on_time_prop)
# 2. mean arrival delay.
mean_arr_delay <- flights %>%
  filter(!is.na(tailnum)) %>%
  group_by(tailnum) %>%
  summarise(n = n(), avg_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_arr_delay))

Exercise 5.7.3
What time of day should you fly if you want to avoid delays as much as possible?

best_hour <- flights %>%
  group_by(hour) %>%
  summarise(avg_arr_delay = mean(arr_delay, na.rm = TRUE),
            avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(avg_arr_delay, avg_dep_delay)
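
可再以折線圖觀察各時段的平均延誤(示意;hour 欄位仍保留在 best_hour 中):

ggplot(best_hour, aes(x = hour, y = avg_arr_delay)) +
  geom_line()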

Exercise 5.7.4
For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.

dest_delay_total <- flights %>%
  filter(arr_delay > 0) %>%
  group_by(dest) %>%
  mutate(
    arr_delay_total = sum(arr_delay),
    # using sum()
    arr_delay_prop = arr_delay / arr_delay_total
  ) %>%
  select(dest, arr_delay_total, arr_delay_prop)
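
可以驗證每個 dest 之下 arr_delay_prop 的總和為 1(示意;dest_delay_total 仍保有 group_by(dest) 的分組):

dest_delay_total %>%
  summarise(total_prop = sum(arr_delay_prop))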

Exercise 5.7.5 *****
Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag() explore how the delay of a flight is related to the delay of the immediately preceding flight.

lead-lag {dplyr} R Documentation
Description
Find the "next" or "previous" values in a vector. Useful for comparing values ahead of or behind the current values.
lag(x):回傳向後位移一位的向量,每個位置對應「前一筆」觀察值。
lead(x):回傳向前位移一位的向量,每個位置對應「下一筆」觀察值。
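
以一個簡單向量對照兩者:

x <- 1:5
lag(x)
#> [1] NA  1  2  3  4
lead(x)
#> [1]  2  3  4  5 NA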

Step 1.
Calculates the departure delay of the preceding flight from the same airport.

lagged_delays <- flights %>%
  arrange(origin, month, day, dep_time) %>%
  group_by(origin) %>%
  mutate(dep_delay_lag = lag(dep_delay)) %>%
  filter(!is.na(dep_delay), !is.na(dep_delay_lag))

Step 2.
Plots the relationship between the mean delay of a flight for all values of the previous flight. 
“Previous Departure Delay" against “Departure Delay"

lagged_delays %>%
  group_by(dep_delay_lag) %>%
  summarise(dep_delay_mean = mean(dep_delay)) %>%
  ggplot(aes(y = dep_delay_mean, x = dep_delay_lag)) +
  geom_point() +
  # change distance between breaks for continuous x-axis
  scale_x_continuous(breaks = seq(0, 1500, by = 120)) +
  labs(y = "Departure Delay", x = "Previous Departure Delay")

“Previous Departure Delay" 與 “Departure Delay"之間的關係,在個別機場上或是 overall(所有機場的數據)的效果相似。

lagged_delays %>%
  group_by(origin, dep_delay_lag) %>%
  summarise(dep_delay_mean = mean(dep_delay)) %>%
  ggplot(aes(y = dep_delay_mean, x = dep_delay_lag)) +
  geom_point() +
  # facet_wrap() 分割子圖之參數設定
  facet_wrap(~origin, ncol = 1) +
  labs(y = "Departure Delay", x = "Previous Departure Delay")

Exercise 5.7.6 *****
Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?

使用Standardization偵測不尋常的觀察值。

standardized_flights <- flights %>%
  filter(!is.na(air_time)) %>%
  group_by(dest, origin) %>%
  #from the same origin to the same destination
  mutate(
    air_time_mean = mean(air_time),
    air_time_sd = sd(air_time),
    n = n()
  ) %>%
  ungroup() %>%
  mutate(air_time_standard = (air_time - air_time_mean) / (air_time_sd + 1))
# add 1 to the denominator to avoid dividing by zero
# 等價於以上程式區塊
standardized_flights <- flights %>%
  filter(!is.na(air_time)) %>%
  group_by(origin, dest) %>%
  mutate(
    n = n(),
    air_time_mean = mean(air_time),
    air_time_sd = sd(air_time)
  ) %>%
  mutate(
    air_time_standard = (air_time - air_time_mean) / (air_time_sd + 1)
  ) %>%
  select(air_time_standard, everything())
ggplot(standardized_flights, aes(x = air_time_standard)) +
  geom_density()
# show the top ten outliers
standardized_flights %>%
  arrange(air_time_standard) %>%
  select(
    carrier, flight, origin, dest, month, day,
    air_time, air_time_mean, air_time_standard
  ) %>%
  head(10) %>%
  print(width = Inf) # optional to ensure that all columns will 
                     # be printed
# use the median as a measure of central tendency and the 
# interquartile range (IQR) as a measure of spread
# more robust to outliers

standardized_flights2 <- flights %>%
  filter(!is.na(air_time)) %>%
  group_by(dest, origin) %>%
  mutate(
    air_time_median = median(air_time),
    air_time_iqr = IQR(air_time),
    n = n(),
    air_time_standard = (air_time - air_time_median) / air_time_iqr
  )

ggplot(standardized_flights2, aes(x = air_time_standard)) +
  geom_density()

Knowing the substance of the data analysis at hand is one of the most important tools of a data scientist. The tools of statistics are a complement, not a substitute, for that knowledge.

Exercise 5.7.7 *****
Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.

flights %>%
  # find all airports with > 1 carrier
  group_by(dest) %>%
  mutate(n_carriers = n_distinct(carrier)) %>%
  # use n_distinct() to count diff carrier
  filter(n_carriers > 1) %>%
  # rank carriers by number of destinations
  group_by(carrier) %>%
  summarize(n_dest = n_distinct(dest)) %>%
  arrange(desc(n_dest))

Exercise 5.7.8 *******
For each plane, count the number of flights before the first delay of greater than 1 hour.

拆解問題,分為三個程式區塊實作:

#1. reconstruct the tbl
a <- flights %>%
  # sort in increasing order
  select(tailnum, year, month, day, dep_delay) %>%
  filter(!is.na(dep_delay)) %>%
  arrange(tailnum, year, month, day) %>%
  group_by(tailnum)
# 2. add a column to tbl, 
# "num of flights greater than 1 hour"
b <- flights %>%
  # sort in increasing order
  select(tailnum, year, month, day, dep_delay) %>%
  filter(!is.na(dep_delay)) %>%
  arrange(tailnum, year, month, day) %>%
  group_by(tailnum) %>%
  # cumulative number of flights delayed over one hour
  # hint: use cumsum() 
  mutate(cumulative_hr_delays = cumsum(dep_delay > 60))
# 3. count the number of flights before the first delay > 1 hour
ans <- flights %>%
  # sort in increasing order
  select(tailnum, year, month, day, dep_delay) %>%
  filter(!is.na(dep_delay)) %>%
  arrange(tailnum, year, month, day) %>%
  group_by(tailnum) %>%
  # cumulative number of flights delayed over one hour
  mutate(cumulative_hr_delays = cumsum(dep_delay > 60)) %>%
  # 累計延誤次數 cumulative_hr_delays == 0 的班次,
  # 即 "the number of flights before the first delay"
  summarise(total_flights = sum(cumulative_hr_delays < 1)) %>%
  arrange(total_flights)

R for Data Science Ch 5.6-5.7

5.6 Grouped summaries with summarise()

summarise() 函數可將整個 data frame 濃縮成一列我們所需的資訊。

#using summarise to show "mean" of dep_delay
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))

#> # A tibble: 1 x 1
#>   delay
#>   <dbl>
#> 1  12.6

summarise()搭配group_by()可以針對我們感興趣的群組列出常用統計量,相當實用。若將經過group_by()運算的 data frame 再做 summarise() , summarise() 會自動只對群組化過的資料範圍運算。( automatically applied “by group” )

by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
# 將會得到每日(因為已經以 year, month, day 群組化資料)的平均 dep_delay

5.6.1 Combining multiple operations with the pipe

假設我們想要了解航班目的地與飛行距離、平均delay分鐘數間的關係。會使用一連串的函數運算。

by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
                   count = n(),
                   dist = mean(distance, na.rm = TRUE),
                   delay = mean(arr_delay, na.rm = TRUE)
)
# use n() to count the number of observations in the current group

delay <- filter(delay, count > 20, dest != "HNL")
# 只留下sample size > 20 的地點且排除dest = "HNL"的航班

# It looks like delays increase with distance up to ~750 miles 
# and then decrease. Maybe as flights get longer there's more 
# ability to make up delays in the air?

ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)


以上程式區塊執行了三個步驟的功能:

  1. Group flights by destination.
  2. Summarise to compute distance, average delay, and number of flights.
  3. Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.

我們可以使用 pipe運算子 %>% 來簡化原本必須分段撰寫的分析。使用pipe讓我們聚焦在data frame轉換的動作上,而不用分心於中介的data frame。使用pipe同時可以增加程式碼的可閱讀性。

例如 f(x, y) 可寫作 x %>% f(y)
g(f(x, y), z) 可寫作 x %>% f(y) %>% g(z)

# group, then summarise, then filter
# a good way to pronounce %>% when reading code is "then"

delays <- flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(count > 20, dest != "HNL")

除了ggplot2之外,其他tidyverse中的包都可以使用pipe。

5.6.2 Missing values
若不設定 na.rm ,我們將會在data frame中看見非常多遺漏值!

flights %>% 
  group_by(year, month, day) %>% 
  summarise(mean = mean(dep_delay))
#> # A tibble: 365 x 4
#> # Groups:   year, month [?]
#>    year month   day  mean
#>   <int> <int> <int> <dbl>
#> 1  2013     1     1    NA
#> 2  2013     1     2    NA
#> 3  2013     1     3    NA
#> 4  2013     1     4    NA
#> 5  2013     1     5    NA
#> 6  2013     1     6    NA
#> # … with 359 more rows

因此tidyverse中所有的 aggregation function 都可以設定 na.rm 以便在計算前移除遺漏值。

flights %>% 
  group_by(year, month, day) %>% 
  summarise(mean = mean(dep_delay, na.rm = TRUE))
#> # A tibble: 365 x 4
#> # Groups:   year, month [?]
#>    year month   day  mean
#>   <int> <int> <int> <dbl>
#> 1  2013     1     1 11.5 
#> 2  2013     1     2 13.9 
#> 3  2013     1     3 11.0 
#> 4  2013     1     4  8.95
#> 5  2013     1     5  5.73
#> 6  2013     1     6  7.15
#> # … with 359 more rows
# use is.na() to remove missing value from data frame

not_cancelled <- flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(mean = mean(dep_delay))
#> # A tibble: 365 x 4
#> # Groups:   year, month [?]
#>    year month   day  mean
#>   <int> <int> <int> <dbl>
#> 1  2013     1     1 11.4 
#> 2  2013     1     2 13.7 
#> 3  2013     1     3 10.9 
#> 4  2013     1     4  8.97
#> 5  2013     1     5  5.73
#> 6  2013     1     6  7.15
#> # … with 359 more rows 

5.6.3 Counts

使用 n() 計算觀察值個數(樣本個數),
sum(!is.na(x)) 計算非遺漏值的觀察值個數。

# flights去除遺漏值後的data frame,將其命名為not_cancelled
not_cancelled <- flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))

# 以 tailnum(個別飛機)群組化後,計算平均 delay 分鐘數,整理為 delays data frame
delays <- not_cancelled %>% 
  group_by(tailnum) %>% 
  summarise(
    delay = mean(arr_delay)
  )

# 以平均 delay 分鐘數 為 x 軸,畫出 delay 分鐘數的計次圖

ggplot(data = delays, mapping = aes(x = delay)) + 
  geom_freqpoly(binwidth = 10)

假如我們想知道平均delay分鐘數的個別次數,以下圖觀察會更清楚:

或者我們可以直接畫出各架飛機(tailnum)的平均delay時間:

ggplot(data = delays, mapping = aes(x = tailnum, y = delay)) + 
  geom_point()

當我們忽略樣本數太小的群組時(例如以下程式區塊:忽略班次數不超過25的飛機),可以更清楚觀察到變數之間的pattern:

not_cancelled <- flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))


delays <- not_cancelled %>% 
  group_by(tailnum) %>% 
  summarise(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )

# 注意,由於ggplot2不支援pipe,layer之間還是以 + 連接
delays %>% 
  filter(n > 25) %>% 
  ggplot(mapping = aes(x = n, y = delay)) + 
  geom_point(alpha = 1/10)

在調整 n 的門檻、探索不同觀察值個數的效果時,常用 RStudio 的快速鍵 Cmd/Ctrl + Shift + P,它會將先前送出的程式區塊再次送到 console(resends the previously sent chunk from the editor to the console),只要修改 n 的門檻值,按下 Cmd/Ctrl + Shift + P 即可快速重跑同樣的指令。

以下使用 Lahman 包中的 Batting 數據集作為演練。

# Convert to a tibble so it prints nicely
batting <- as_tibble(Lahman::Batting)

batters <- batting %>% 
  group_by(playerID) %>% 
  summarise(
    ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
    ab = sum(AB, na.rm = TRUE)
  )

batters %>% 
  filter(ab > 100) %>% 
  ggplot(mapping = aes(x = ab, y = ba)) +
    geom_point() + 
    geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
由上圖可見 skill (ba) 與 opportunities to hit the ball (ab) 之間呈正相關。

5.6.4 Useful summary functions
常見統計量的函數運算、常用指令

  • 衡量中央趨勢: mean(x)、median(x)
  • 衡量分散趨勢: sd(x)、IQR(x)、mad(x)
    median absolute deviation mad(x) 較不易受離群值影響。
  • 衡量次序: min(x)、quantile(x, 0.25)、max(x)
  • 衡量位置: first(x)、nth(x, 2)、last(x). These work similarly to x[1], x[2], and x[length(x)] but let you set a default value if that position does not exist
  • 計數(Count): n()、sum(!is.na(x))、n_distinct(x)
  • TRUE, FALSE 與 sum()、mean() 的應用:可將條件式寫進 sum() 或 mean(),運用邏輯值轉換為數值的特性,輕鬆計算符合條件的觀察值個數。
    When used with numeric functions, TRUE is converted to 1 and FALSE to 0. This makes sum() and mean() very useful: sum(x) gives the number of TRUEs in x, and mean(x) gives the proportion.
library(tidyverse)
library(nycflights13)

not_cancelled <- flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))

# Measures of location
not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(
    avg_delay1 = mean(arr_delay),
    avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
  )

# Measures of spread

not_cancelled %>% 
  group_by(dest) %>% 
  summarise(distance_sd = sd(distance)) %>% 
  arrange(desc(distance_sd))

# Measures of rank
not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(
    first = min(dep_time),
    last = max(dep_time)
  )

# Measures of position

not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(
    first_dep = first(dep_time), 
    last_dep = last(dep_time)
  )

# Measures of position within filter()

not_cancelled %>% 
  group_by(year, month, day) %>% 
  mutate(r = min_rank(desc(dep_time))) %>% 
  filter(r %in% range(r))

# Which destinations have the most carriers?
not_cancelled %>% 
  group_by(dest) %>% 
  summarise(carriers = n_distinct(carrier)) %>% 
  arrange(desc(carriers))

# SQL-like command: count(), if all you want is a count
not_cancelled %>% 
  count(dest)
#> # A tibble: 104 x 2
#>   dest      n
#>   <chr> <int>
#> 1 ABQ     254
#> 2 ACK     264
#> 3 ALB     418
#> 4 ANC       8
#> 5 ATL   16837
#> 6 AUS    2411
#> # … with 98 more rows

# count() can optionally provide a weight variable
not_cancelled %>% 
  count(tailnum, wt = distance)

# 等價於 group_by(tailnum) 後對 distance 求和
# not_cancelled %>%
#   group_by(tailnum) %>%
#   summarise(n = sum(distance))

#> # A tibble: 4,037 x 2
#>   tailnum      n
#>   <chr>    <dbl>
#> 1 D942DN    3418
#> 2 N0EGMQ  239143
#> 3 N10156  109664
#> 4 N102UW   25722
#> 5 N103US   24619
#> 6 N104UW   24616
#> # … with 4,031 more rows

# How many flights left before 5am? (these usually indicate 
# delayed flights from the previous day)
# Using sum( )
not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(n_early = sum(dep_time < 500))

# What proportion of flights are delayed by more than an hour?
# Using mean( ) to show proportion
not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(hour_perc = mean(arr_delay > 60))

5.7 Grouped mutates (and filters)
group_by()、summarise()、mutate()、filter(),以下為應用實例:

# Find the worst members of each group:
# (flights_sml 為 5.5 節中以 select() 建立的子集)

flights_sml %>% 
  group_by(year, month, day) %>%
  filter(rank(desc(arr_delay)) < 10)
#> # A tibble: 3,306 x 7
#> # Groups:   year, month, day [365]
#>    year month   day dep_delay arr_delay distance air_time
#>   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>
#> 1  2013     1     1       853       851      184       41
#> 2  2013     1     1       290       338     1134      213
#> 3  2013     1     1       260       263      266       46
#> 4  2013     1     1       157       174      213       60
#> 5  2013     1     1       216       222      708      121
#> 6  2013     1     1       255       250      589      115
#> # … with 3,300 more rows
# Find all groups bigger than a threshold:

popular_dests <- flights %>% 
  group_by(dest) %>% 
  filter(n() > 365)
popular_dests

#> # A tibble: 332,577 x 19
#> # Groups:   dest [77]
#>    year month   day dep_time sched_dep_time dep_delay arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>
#> 1  2013     1     1      517            515         2      830
#> 2  2013     1     1      533            529         4      850
#> 3  2013     1     1      542            540         2      923
#> 4  2013     1     1      544            545        -1     1004
#> 5  2013     1     1      554            600        -6      812
#> 6  2013     1     1      554            558        -4      740
#> # … with 3.326e+05 more rows, and 12 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>
# Standardise to compute per group metrics:

popular_dests %>% 
  filter(arr_delay > 0) %>% 
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>% 
  select(year:day, dest, arr_delay, prop_delay)
#> # A tibble: 131,106 x 6
#> # Groups:   dest [77]
#>    year month   day dest  arr_delay prop_delay
#>   <int> <int> <int> <chr>     <dbl>      <dbl>
#> 1  2013     1     1 IAH          11  0.000111 
#> 2  2013     1     1 IAH          20  0.000201 
#> 3  2013     1     1 MIA          33  0.000235 
#> 4  2013     1     1 ORD          12  0.0000424
#> 5  2013     1     1 FLL          19  0.0000938
#> 6  2013     1     1 ORD           8  0.0000283
#> # … with 1.311e+05 more rows

R for Data Science Ch 5.5

5.5 Add new variables with mutate()
mutate() 函數是以現有的 column 為基礎,在 data frame 中添加新 column。
複習:在 RStudio 中,可以使用 View() 查看 data frame 的所有 columns。

flights_sml <- select(flights, 
                      year:day, 
                      ends_with("delay"), 
                      distance, 
                      air_time
)
flights_sml
mutate(flights_sml,
       gain = dep_delay - arr_delay,
       speed = distance / air_time * 60
)

# You can refer to columns that you’ve just created:
mutate(flights_sml,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
  • 使用 transmute() 只列出新創的 column:
transmute(flights,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
#> # A tibble: 336,776 x 3
#>    gain hours gain_per_hour
#>   <dbl> <dbl>         <dbl>
#> 1    -9  3.78         -2.38
#> 2   -16  3.78         -4.23
#> 3   -31  2.67        -11.6 
#> 4    17  3.05          5.57
#> 5    19  1.93          9.83
#> 6   -16  2.5          -6.4 
#> # … with 3.368e+05 more rows
  • Useful creation functions to use with mutate()
    The key property is that the function must be vectorised: it must take a vector of values as input, return a vector with the same number of values as output.
    以下列出常見的用法:
  1. Arithmetic operators: +, -, *, /, ^.
    These are all vectorised, using the so called “recycling rules”. If one parameter is shorter than the other, it will be automatically extended to be the same length. This is most useful when one of the arguments is a single number: air_time / 60, hours * 60 + minute, etc.
    Arithmetic operators are also useful in conjunction with the aggregate functions(聚合函數) you’ll learn about later. For example, x / sum(x) calculates the proportion of a total, and y - mean(y) computes the difference from the mean.
  2. Modular arithmetic: %/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y).
    Modular arithmetic is a handy tool because it allows you to break integers up into pieces. For example, in the flights dataset, you can compute hour and minute from dep_time with:
  3. Logs: log(), log2(), log10(). Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude. They also convert multiplicative relationships to additive, a feature we’ll come back to in modelling.(範例見下方程式區塊)
  4. Offsets: lead() and lag() allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. x - lag(x)) or find when values change (x != lag(x)). They are most useful in conjunction with group_by(), which you’ll learn about shortly.
  5. Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean() for cumulative means.
  6. Logical comparisons: <, <=, >, >=, !=, and ==. 邏輯運算子組合成的條件式較複雜時,建議將條件式的運算結果先存入臨時變數,確保每個步驟皆正確無誤。
  7. Ranking: there are a number of ranking functions, but you should start with min_rank():對資料進行排序,遇到重複值,排序相同,但每個值都佔一個位置,遺漏值不列入。
# An example of modular arithmetic
transmute(flights,
  dep_time,
  hour = dep_time %/% 100,
  minute = dep_time %% 100
)
#> # A tibble: 336,776 x 3
#>   dep_time  hour minute
#>      <int> <dbl>  <dbl>
#> 1      517     5     17
#> 2      533     5     33
#> 3      542     5     42
#> 4      544     5     44
#> 5      554     5     54
#> 6      554     5     54
#> # … with 3.368e+05 more rows
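# Logs 範例(補充示意):對數可壓縮跨數量級的資料,
# 並將乘法關係轉為加法關係
log2(c(1, 2, 4, 8))
#> [1] 0 1 2 3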
# lead() and lag()
# Find the "next" or "previous" values in a vector. 
# Useful for comparing values ahead of or behind the current values.
(x <- 1:10)
#>  [1]  1  2  3  4  5  6  7  8  9 10

lag(x)
#>  [1] NA  1  2  3  4  5  6  7  8  9

lead(x)
#>  [1]  2  3  4  5  6  7  8  9 10 NA
# Cumulative and rolling aggregates
x
#>  [1]  1  2  3  4  5  6  7  8  9 10

cumsum(x)
#>  [1]  1  3  6 10 15 21 28 36 45 55

cummean(x)
#>  [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
# 常用的排序(ranking)函數
y <- c(1, 2, 2, NA, 3, 4)
min_rank(y)
#> [1]  1  2  2 NA  4  5
min_rank(desc(y))
#> [1]  5  3  3 NA  2  1

row_number(y)
#> [1]  1  2  3 NA  4  5

dense_rank(y)
#> [1]  1  2  2 NA  3  4

percent_rank(y)
#> [1] 0.00 0.25 0.25   NA 0.75 1.00

cume_dist(y)
#> [1] 0.2 0.6 0.6  NA 0.8 1.0

Exercise 5.5.1
Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

Hint: converting format HHMM to minutes since midnight(int.)
1. the integer division operator, %/%
2. the modulo operator, %%
3. now, we can convert all the times to minutes after midnight
4. x %% 1440 will convert 1440 to zero while keeping all the other times the same; this step converts midnight (24:00, i.e. 1440 mins) to zero.

# convert HHMM to minutes after midnight 
# convert 24:00 to zero
flights_times <- mutate(flights,
  dep_time_mins = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
  sched_dep_time_mins = (sched_dep_time %/% 100 * 60 +
    sched_dep_time %% 100) %% 1440
)

# view only relevant columns
select(
  flights_times, dep_time, dep_time_mins, sched_dep_time,
  sched_dep_time_mins
)
#> # A tibble: 336,776 x 4
#>   dep_time dep_time_mins sched_dep_time sched_dep_time_mins
#>      <int>         <dbl>          <int>               <dbl>
#> 1      517           317            515                 315
#> 2      533           333            529                 329
#> 3      542           342            540                 340
#> 4      544           344            545                 345
#> 5      554           354            600                 360
#> 6      554           354            558                 358
#> # … with 3.368e+05 more rows
# define a function to avoid copying and pasting code
time2mins <- function(x) {
  (x %/% 100 * 60 + x %% 100) %% 1440
}


#Using time2mins, the previous code simplifies to the following.


flights_times <- mutate(flights,
  dep_time_mins = time2mins(dep_time),
  sched_dep_time_mins = time2mins(sched_dep_time)
)
# show only the relevant columns
select(
  flights_times, dep_time, dep_time_mins, sched_dep_time,
  sched_dep_time_mins
)
#> # A tibble: 336,776 x 4
#>   dep_time dep_time_mins sched_dep_time sched_dep_time_mins
#>      <int>         <dbl>          <int>               <dbl>
#> 1      517           317            515                 315
#> 2      533           333            529                 329
#> 3      542           342            540                 340
#> 4      544           344            545                 345
#> 5      554           354            600                 360
#> 6      554           354            558                 358
#> # … with 3.368e+05 more rows

Exercise 5.5.2
Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?

# define air-time = arr_time - dep_time
flights_airtime <-
  mutate(flights,
    dep_time = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
    arr_time = (arr_time %/% 100 * 60 + arr_time %% 100) %% 1440,
    air_time_diff = air_time - arr_time + dep_time
  )

# If air_time = arr_time - dep_time , there should be no flights with non-zero values of air_time_diff.
# use nrow to calculate numbers of rows
nrow(filter(flights_airtime, air_time_diff != 0))
#> [1] 327150

除了數據錯誤之外,可能有下列原因造成 air_time != arr_time - dep_time:

  1. 班機於午夜前起飛、隔日清晨抵達,所以 arr_time 數值小於 dep_time。若屬於此種狀況,air_time 與 arr_time - dep_time 會相差 1440 分鐘(24 小時)。
  2. 飛行跨越時區,air_time 與 arr_time - dep_time 的差被時差抵銷。data set 為啟程自紐約市的國內航班,因此時區造成的時間差會是 60 的倍數(60 minutes (Central), 120 minutes (Mountain), 180 minutes (Pacific), 240 minutes (Alaska), or 300 minutes (Hawaii))

不論是以上哪種情況,時間差應該皆是60的倍數。對此針對時間差作圖,求證假設。

ggplot(flights_airtime, aes(x = air_time_diff)) +
  geom_histogram(binwidth = 1)
#> Warning: Removed 9430 rows containing non-finite values (stat_bin).

由上圖可知大多數的時間差並非是60的倍數,即上述假設不成立。進一步我們過濾出目的地為洛杉磯的班機,其時間差應該是180分鐘。並作圖驗證之。

ggplot(filter(flights_airtime, dest == "LAX"), aes(x = air_time_diff)) +
  geom_histogram(binwidth = 1)
#> Warning: Removed 148 rows containing non-finite values (stat_bin).

重新觀察documentation,發現時間差可能由於未考慮牽引機拖班機進Bay的時間,所以欄位間關係不如預期。顯示核對column定義的重要性。

Exercise 5.5.3
比較 dep_time、sched_dep_time、dep_delay 這些 column 之間,應該會有 dep_time - sched_dep_time = dep_delay 的關係。

flights_deptime <-
  mutate(flights,
    dep_time_min = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
    sched_dep_time_min = (sched_dep_time %/% 100 * 60 +
      sched_dep_time %% 100) %% 1440,    # 先將時間格式轉換為分鐘
    dep_delay_diff = dep_delay - dep_time_min + sched_dep_time_min
    # 添加一個新欄位 dep_delay_diff,定義如上行 code
  )
# 由我們的定義可知 dep_delay_diff 中每一 row 的數值應該為 0
# 驗證之

filter(flights_deptime, dep_delay_diff != 0)
#> # A tibble: 1,236 x 22
#>    year month   day dep_time sched_dep_time dep_delay arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>
#> 1  2013     1     1      848           1835       853     1001
#> 2  2013     1     2       42           2359        43      518
#> 3  2013     1     2      126           2250       156      233
#> 4  2013     1     3       32           2359        33      504
#> 5  2013     1     3       50           2145       185      203
#> 6  2013     1     3      235           2359       156      700
#> # … with 1,230 more rows, and 15 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, dep_time_min <dbl>,
#> #   sched_dep_time_min <dbl>, dep_delay_diff <dbl>

結果有 1236 rows 不為零,可能的原因是班機表定於午夜(24:00)前起飛,實際起飛時間卻在午夜之後,若屬於此種狀況,dep_delay_diff 應該為 1440,繪圖驗證之:

ggplot(
  filter(flights_deptime, dep_delay_diff > 0),
  aes(y = sched_dep_time_min, x = dep_delay_diff)
) +
  geom_point()

Exercise 5.5.4

Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank()

dplyr 包有許多用來排序的函數:row_number()、min_rank()、dense_rank(),對於重複值 (ties) 各有不同的處理方式。以下內容摘錄自 https://www.cnblogs.com/wkslearner/p/5768845.html

min_rank()是对数据大小进行编号排序,遇到重复值,排序相同,但每个值都占一个位置,缺失值不计入 。

row_number()是对数据大小进行编号排序,遇到重复值,排序继续加1,缺失值不计入 。
其運算結果與index(or row number) of each element相同。

dense_rank() 是对数据大小进行编号排序,遇到重复值,排序相同,所有重复值只占一个位置,缺失值不计入 。
對於ties的處理與min_rank相同,但是對於tie的next rank有不同處理。如以下程式碼:

# create a seq, naming temp
(temp <- c(8, 1, 8, 2, 9, 5, 2, 8) )
#> [1] 8 1 8 2 9 5 2 8

(min_rank(temp))
#> [1] 5 1 5 2 8 4 2 5
# 重複值的每個元素都占用一個位置

(dense_rank(temp))
#> [1] 4 1 4 2 5 3 2 4
# 遇到重複值,所有重複值只佔一個位置

以下範例可看出這些ranking function的不同之處。

# create a tibble, which naming rankme
rankme <- tibble(
  x = c(10, 5, 1, 5, 5)
)

# using mutate() to add column
rankme <- mutate(rankme,
                 x_row_number = row_number(x),
                 x_min_rank = min_rank(x),
                 x_dense_rank = dense_rank(x)
)

# show the difference between ranking functions
# using arrange(dataframe, var) to reorder rows by var
arrange(rankme, x)

若沒有特別指定,通常使用min_rank()函數排序。
以下程式碼可以列出前十delayed_flights。

flights_delayed <- mutate(flights,
  dep_delay_min_rank = min_rank(desc(dep_delay)),
  dep_delay_row_number = row_number(desc(dep_delay)),
  dep_delay_dense_rank = dense_rank(desc(dep_delay))
)

#只留下前十遲到班機,top_n()要配合 %>% 運算子,容後再述
flights_delayed <- filter(
  flights_delayed,
  !(dep_delay_min_rank > 10 | dep_delay_row_number > 10 |
    dep_delay_dense_rank > 10)
)

#reorder by dep_delay_min_rank
flights_delayed <- arrange(flights_delayed, dep_delay_min_rank)

print(select(
  flights_delayed, month, day, carrier, flight, dep_delay,
  dep_delay_min_rank, dep_delay_row_number, dep_delay_dense_rank
),
n = Inf
)
#> # A tibble: 10 x 8
#>    month   day carrier flight dep_delay dep_delay_min_r… dep_delay_row_n…
#>    <int> <int> <chr>    <int>     <dbl>            <int>            <int>
#>  1     1     9 HA          51      1301                1                1
#>  2     6    15 MQ        3535      1137                2                2
#>  3     1    10 MQ        3695      1126                3                3
#>  4     9    20 AA         177      1014                4                4
#>  5     7    22 MQ        3075      1005                5                5
#>  6     4    10 DL        2391       960                6                6
#>  7     3    17 DL        2119       911                7                7
#>  8     6    27 DL        2007       899                8                8
#>  9     7    22 DL        2047       898                9                9
#> 10    12     5 AA         172       896               10               10
#> # … with 1 more variable: dep_delay_dense_rank <int>

Exercise 5.5.5
What does 1:3 + 1:10 return? Why?

在 R 語言中基本的資料運算單位是向量 (vector),運算是元素級別 (element-wise) 的:函數以映射方式應用到每個元素之上,長度不一時套用 recycling rules。

1:3 + 1:10
#> Warning in 1:3 + 1:10: longer object length is not a multiple of shorter
#> object length
#>  [1]  2  4  6  5  7  9  8 10 12 11
# 1:3 + 1:10 等於
c(1 + 1, 2 + 2, 3 + 3, 1 + 4, 2 + 5, 3 + 6, 1 + 7, 2 + 8, 3 + 9, 
1 + 10)

#>  [1]  2  4  6  5  7  9  8 10 12 11

a <- 1:12
b <- 1:3
a + b
#>  [1]  2  4  6  5  7  9  8 10 12 11 13 15

# 1:3 + 1:12 等於
c(1 + 1, 2 + 2, 3 + 3, 1 + 4, 2 + 5, 3 + 6, 1 + 7, 2 + 8, 3 + 9, 
1 + 10, 2 + 11, 3 + 12)
# mapping element-wise vector 1:3 four cycles to 1:12

Exercise 5.5.6
R 語言提供三角函數類運算函數,可使用 ?Trig 開啟 help page。
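
例如(示意;R 的三角函數以弧度為單位):

sin(pi / 2)
#> [1] 1
cos(pi)
#> [1] -1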

R for Data Science Ch 5.3-5.4

5.3 Arrange rows with arrange()
arrange() 以指定的column將資料重新排序(re-order),用來重新排序的變數可以不只一個, 遺漏值會被重新排序在最尾。

arrange(flights, year, month, day)

# Use desc() to re-order by a column in descending order:
arrange(flights, desc(dep_delay))

# Missing values are always sorted at the end:
df <- tibble(x = c(5, 2, NA))
arrange(df, x)
arrange(df, desc(x))

5.3 Exercises

Exercise 5.3.1
How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).

The flights will first be sorted by desc(is.na(dep_time)). Since desc(is.na(dep_time)) is either TRUE when dep_time is missing, or FALSE, when it is not, the rows with missing values of dep_time will come first, since TRUE > FALSE.

arrange(flights, desc(is.na(dep_time)), dep_time)
               # ^^^^^^^^^^^^^^^^^^^^^
               # return T or F, T > F
               # sort all missing values to the start 

5.4 Select columns with select()
Raw data sets usually contain a great many variables; select() picks out just the columns you name, letting you focus on the variables of interest. If the same variable is passed to select() more than once, it appears only once in the result. (The select() call ignores the duplication.)

# Select columns by name
select(flights, year, month, day)

# Select all columns between year and day (inclusive)
select(flights, year:day)

# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))

Helper functions that can be used inside select():

  • starts_with("abc"): matches names that begin with “abc”.
  • ends_with("xyz"): matches names that end with “xyz”.
  • contains("ijk"): matches names that contain “ijk”.
  • matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.
  • num_range("x", 1:3): matches x1, x2 and x3.
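
A quick sketch of these helpers on flights (note that contains("arr") also matches carrier, since the string "arr" occurs inside that name):

select(flights, starts_with("dep"))   # dep_time, dep_delay
select(flights, ends_with("delay"))   # dep_delay, arr_delay
select(flights, contains("arr"))      # arr_time, sched_arr_time, arr_delay, carrier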

rename() is a variant of select() that keeps all variables and renames the specified ones.

rename(flights, tail_num = tailnum)

Using everything() inside select() brings every remaining variable of the data frame into the selection, saving you from typing out a long list of names.

select(flights, time_hour, air_time, everything())

Because a variable that appears more than once in select() is still listed only once, combining this property with everything() lets you move chosen columns to the front, rearranging the column order without listing every column.

# Using select()  with everything() to rearrange columns  
select(flights, arr_delay, everything())

Exercise 5.4.1
Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

# Specify columns names as unquoted variable names.
select(flights, dep_time, dep_delay, arr_time, arr_delay)

# Specify column names as strings.
select(flights, "dep_time", "dep_delay", "arr_time", "arr_delay")

# Specify the column numbers of the variables.
select(flights, 4, 6, 7, 9)

# Specify the names of the variables with character vector and one_of().
select(flights, one_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))

# Selecting the variables by matching the start of their names using starts_with().
select(flights, starts_with("dep_"), starts_with("arr_"))

one_of() is especially useful because the names of the variables can be stored in a character vector and passed to it.

variables <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, one_of(variables))

Note: matching the names with contains() does not work here, because no pattern matches all four variables without also matching others (for example, contains("dep") would also pick up sched_dep_time).

Exercise 5.4.3
What does the one_of() function do? Why might it be helpful in conjunction with this vector?

The one_of() function selects variables with a character vector rather than unquoted variable name arguments.  This function is useful because it is easier to programmatically generate character vectors with variable names than to generate unquoted variable names, which are easier to type.

vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, one_of(vars))
select(flights, vars)

However, there is a problem with this approach: it is not clear whether vars refers to a column in the data frame or to an external variable.
If a column with the same name exists, or to make sure no such conflict can occur, use the !!! (bang-bang-bang) operator.

select(flights, !!!vars)
#> # A tibble: 336,776 x 5
#>    year month   day dep_delay arr_delay
#>   <int> <int> <int>     <dbl>     <dbl>
#> 1  2013     1     1         2        11
#> 2  2013     1     1         4        20
#> 3  2013     1     1         2        33
#> 4  2013     1     1        -1       -18
#> 5  2013     1     1        -6       -25
#> 6  2013     1     1        -4        12
#> # … with 3.368e+05 more rows

# This behavior, which is used by many tidyverse functions, 
# is an example of what is called non-standard evaluation (NSE) in R. 

Exercise 5.4.4
Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

select(flights, contains("TIME"))

#> # A tibble: 336,776 x 6
#>   dep_time sched_dep_time arr_time sched_arr_time air_time
#>      <int>          <int>    <int>          <int>    <dbl>
#> 1      517            515      830            819      227
#> 2      533            529      850            830      227
#> 3      542            540      923            850      160
#> 4      544            545     1004           1022      183
#> 5      554            600      812            837      116
#> 6      554            558      740            728      150
#> # … with 3.368e+05 more rows, and 1 more variable: time_hour <dttm>

The pattern given to contains() is a string, and matching is case-insensitive by default; use ignore.case = FALSE to change that.

select(flights, contains("TIME", ignore.case = FALSE))

R for Data Science Ch 5.1-5.2

Data transformation

5.1 Intro.
This chapter shows how to use the dplyr package to reshape data into the form we need.

library(nycflights13)
library(tidyverse)

After loading the tidyverse, some of its functions conflict with functions in base R, and the tidyverse versions mask the base ones. To use a masked base function after loading the tidyverse, call it by its full name in the form package::function.
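
For example, dplyr's filter() masks stats::filter() from base R; the base version remains reachable through its namespace (a minimal sketch):

# call base R's filter() explicitly after loading the tidyverse
stats::filter(1:10, rep(1/3, 3))   # base moving-average filter, not dplyr's row filter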

flights

View(flights)
# will open the whole data set in the RStudio viewer

Typing flights shows that flights is a tibble, a kind of data frame better suited to the tidyverse. A tibble prints only the first few rows and as many columns as fit on screen; use View(flights) to see the whole flights data set.

This chapter covers the following key dplyr functions, which play the role of "verbs" in data manipulation:

  • Pick observations by their values (filter()).
  • Reorder the rows (arrange()).
  • Pick variables by their names (select()).
  • Create new variables with functions of existing variables (mutate()).
  • Collapse many values down to a single summary (summarise()).
  • These can all be used in conjunction with group_by() which changes the scope of each function from operating on the entire dataset to operating on it group-by-group.

All of these functions share a similar argument structure:

  1. The first argument is a data frame.
  2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
  3. The result is a new data frame.

5.2 Filter rows with filter()
filter() subsets rows by value: the output is the subset of rows that satisfy the conditions given inside filter(). (Use filter() to choose rows/cases where conditions are TRUE.)
Unlike base subsetting with [, rows where the condition evaluates to NA are dropped.
The result of a filter() call is usually assigned to a new variable for later use.

# filter(data, conditions describing the values to keep)
filter(flights, month == 1, day == 1)
# The first argument is the name of the data frame. 
# The second and subsequent arguments are the expressions that filter the data frame.

# store the filtered output in the variable jan1
jan1 <- filter(flights, month == 1, day == 1)

# wrapping the statement in () assigns the result and prints it at the same time
(dec25 <- filter(flights, month == 12, day == 25))

  • Using comparison operators with filter():
    >, >=, <, <=, != (not equal), and == (equal). Use == rather than = when testing equality, and avoid == for floating-point values; R provides near() for that (see the sketch after this list).
  • Using logical operators with filter():
    As well as & and |, R also has && and ||. Don’t use them here!
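
A quick illustration of why == is unreliable for floating-point numbers, and how near() helps:

sqrt(2) ^ 2 == 2
#> [1] FALSE
near(sqrt(2) ^ 2, 2)
#> [1] TRUE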
[Figure: the complete set of Boolean operations]
# all flights that departed in November or December
filter(flights, month == 11 | month == 12)

Note: the code block above is not equivalent to filter(flights, month == 11 | 12).
That version compares month to the value of the expression 11 | 12, which evaluates to TRUE (numerically 1), so it returns all January flights instead.
A convenient way to avoid this kind of logic bug is the %in% operator:
x %in% y keeps the observations whose x value is one of the values in y. Rewritten with it:

# x %in% y  
nov_dec <- filter(flights, month %in% c(11, 12))
  • Applying De Morgan’s laws with filter(): !(x & y) is the same as !x | !y, and !(x | y) is the same as !x & !y. For example:
# filter( ) with De Morgan’s law
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
  • Missing values: NA means the value is unknown. Missing values are "contagious": almost any operation involving NA yields NA, blocking further data manipulation.
  • The generic function is.na( ) indicates which elements are missing.
  • The generic function is.na( )<- sets elements to NA.
  • The generic function anyNA implements any(is.na(x)) in a possibly faster way (especially for atomic vectors).
  • filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values.
# filter( ) excludes both FALSE and NA values
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)

#> # A tibble: 1 x 1
#>       x
#>   <dbl>
#> 1     3
  • To keep the missing values as well, ask for them explicitly:
# preserve missing value
filter(df, is.na(x) | x > 1)

#> # A tibble: 2 x 1
#>       x
#>   <dbl>
#> 1    NA
#> 2     3
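
A minimal sketch of NA's contagiousness and the helper functions listed above:

NA + 1          # arithmetic with NA yields NA
#> [1] NA
x <- c(1, NA, 3)
is.na(x)        # which elements are missing?
#> [1] FALSE  TRUE FALSE
anyNA(x)        # is any element missing?
#> [1] TRUE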

Exercise 5.2

  • Exercise 5.2.1
    Find all flights that
  1. Had an arrival delay of two or more hours
  2. Flew to Houston (IAH or HOU)
  3. Were operated by United, American, or Delta
  4. Departed in summer (July, August, and September)
  5. Arrived more than two hours late, but didn’t leave late
  6. Were delayed by at least an hour, but made up over 30 minutes in flight
  7. Departed between midnight and 6 am (inclusive)
# Had an arrival delay of two or more hours
filter(flights, arr_delay >= 120)
# Flew to Houston (IAH or HOU)
filter(flights, dest == "IAH" | dest == "HOU")

# Using %in% operator:
# filter(flights, dest %in% c("IAH", "HOU"))
# Check the carrier codes for Delta, American, and United.
?flights
?airlines
filter(flights, carrier %in% c("AA", "DL", "UA"))
# Departed in summer (July, August, and September)
filter(flights, month >= 7, month <= 9)

# Using %in% operator and 
# the  : operator, which is used to specify the integer range.
filter(flights, month %in% 7:9)

# Using the | operator.
filter(flights, month == 7 | month == 8 | month == 9)
# However, it is quite verbose.

# Using the between() function 
filter(flights, between(month, 7, 9))
# Arrived more than two hours late, but didn’t leave late
filter(flights, arr_delay > 120, dep_delay <= 0)
# Were delayed by at least an hour, 
# but made up over 30 minutes in flight
filter(flights, dep_delay >= 60, dep_delay - arr_delay > 30)
# Departed between midnight and 6 am (inclusive) 

# Checking the minimum and maximum of dep_time.
summary(flights$dep_time)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>       1     907    1401    1349    1744    2400    8255

#This is an example of why it is always good to check the summary statistics of your data. 

filter(flights, dep_time <= 600 | dep_time == 2400)

# Using modular arithmetic
# the %% operator returns the remainder after division
c(600, 1200, 2400) %% 2400
#> [1]  600 1200    0
# midnight (2400 in HHMM form) maps to 0, so it passes the <= 600 test below

filter(flights, dep_time %% 2400 <= 600)

Exercise 5.2.3
How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

# Using is.na( )

filter(flights, is.na(dep_time))
#> # A tibble: 8,255 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>
#> 1  2013     1     1       NA           1630        NA       NA
#> 2  2013     1     1       NA           1935        NA       NA
#> 3  2013     1     1       NA           1500        NA       NA
#> 4  2013     1     1       NA            600        NA       NA
#> 5  2013     1     2       NA           1540        NA       NA
#> 6  2013     1     2       NA           1620        NA       NA
#> # … with 8,249 more rows, and 12 more variables: sched_arr_time <int>,
#> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>

R for Data Science Ch.4

4.1 Coding basics

  • Assignment statements create new objects with <-. Don’t be lazy and use =: it will work, but it will cause confusion later. RStudio shortcut: Alt + - (the minus sign)
#  create new objects with <-
object_name <- value

4.2 What’s in a name? Naming conventions and rules

  • Object names must start with a letter, and can only contain letters, numbers, _ and .
  • Choose meaningful names, so the code is easy to understand when you reread it later
  • The author recommends snake_case: lowercase words separated by underscores (_)
i_use_snake_case
otherPeopleUseCamelCase
some.people.use.periods
And_aFew.People_RENOUNCEconvention
  • Type part of a name and press Tab to list every matching object, function, and command
  • Press F1 to open the relevant help page
  • The + tells you that R is waiting for more input. Usually that means you’ve forgotten either a " or a ) .

4.3 Calling functions

y <- seq(1, 10, length.out = 5)
y

# This common action can be shortened by surrounding the assignment with parentheses,
# which causes the assignment and "print to screen" to happen at once.
(y <- seq(1, 10, length.out = 5))
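
Either way, the value printed is:

#> [1]  1.00  3.25  5.50  7.75 10.00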
  • The Environment pane in the upper right of RStudio lists every object you have created

Exercise 4.1
Two groups of easily confused letters and numerals that are a common source of bugs:

  • the numeral zero (0), the Latin small letter O (o), and the Latin capital letter O (O),
  • the numeral one (1), the Latin small letter I (i), the Latin capital letter I (I), and Latin small letter L (l).

Error messages of the form "object '...' not found" mean exactly what they say.  The most common scenarios in which I encounter this error message are

  1. I forgot to create the object, or an error prevented the object from being created.
  2. I made a typo in the object’s name, either when using it or when I created it (as in the example above), or I forgot what I had originally named it. If you find yourself often writing the wrong name for an object, it is a good indication that the original name was not a good one.
  3. I forgot to load the package that contains the object using library().

Exercise 4.3
Press Alt + Shift + K, this gives a menu with keyboard shortcuts. This can be found in the menu under Tools -> Keyboard Shortcuts Help.

R for Data Science Ch 3.8-3.10

3.8 Position adjustments
You can color a bar chart using either the color aesthetic, or, more usefully, fill.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, color = cut))

# use the fill aesthetic
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = cut))
  • map the fill aesthetic to another variable
# map the fill aesthetic to another variable, like clarity: 
# the bars are automatically stacked. 
# Each colored rectangle represents a combination of cut and clarity---
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity)) #default position = "stack"

The stacking is performed automatically by the position adjustment specified by the position argument.
Besides the default "stack", the position argument offers three other options: "identity", "dodge", or "fill".

# default position argument, position = "stack"
# stacking: the counts accumulate, piled up category by category
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + 
  geom_bar(position = "stack")
[Figure: position = "stack"]
# "identity" places each bar exactly at its count, so the bars overlap
#more useful for 2d geoms, like points, where it is the default.
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + 
  geom_bar(position = "identity")

# setting transparency resolves the overlap and makes the counts in each category easier to read
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + 
  geom_bar(alpha = 1/5, position = "identity")
[Figure: position = "identity"]
[Figure: alpha = 1/5, position = "identity"]
#position = "fill"
# every stacked bar is standardised to the same height, and the y-axis shows proportions rather than counts
# i.e. the share of each clarity level within each cut
#This makes it easier to compare proportions across groups.

ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + 
  geom_bar(position = "fill")
[Figure: position = "fill"]
#position = "dodge"
# one bar per clarity-cut combination (8 clarity levels x 5 cut levels = 40 bars), placed side by side within each cut
# This makes it easier to compare individual values.

ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + 
  geom_bar(position = "dodge")
[Figure: position = "dodge"]

One last adjustment is not useful for bar charts but can be very useful for scatterplots: when several points share the same location, position = "jitter" spreads them apart by adding a small amount of random noise to each point. The amount of spread can be set explicitly (see the sketch after the example below), which separates overlapping points and so resolves overplotting.

# overplotting:
# the plot displays only 126 points, even though there are 234 observations in the data set
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

[Figure: overplotting]
#position = "jitter"
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
[Figure: position = "jitter"]
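
As mentioned above, the spread can be controlled explicitly; a minimal sketch using position_jitter(), where width and height are given in data units:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy),
             position = position_jitter(width = 0.2, height = 0.2))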

Exercise 3.8.3

  • Compare and contrast geom_jitter() with geom_count().
# geom_jitter() adds random variation to the locations of the points in the graph
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter()
# geom_count() sizes the points relative to the number of observations. 
# Combinations of (x, y) values with more observations will be larger than 
# those with fewer observations.
# geom_count() does not change the x and y coordinates of the points

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()

As that example shows, unfortunately, there is no universal solution to overplotting. The costs and benefits of different approaches will depend on the structure of the data and the goal of the data scientist.

3.9 Coordinate systems
The default coordinate system is the Cartesian coordinate system where the x and y positions act independently to determine the location of each point.  There are a number of other coordinate systems that are occasionally helpful.

  • coord_flip() switches the x and y axes. This is useful (for example), if you want horizontal boxplots. It’s also useful for long labels.
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot()

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() +
  coord_flip()
[Figure: long labels overlapping]
[Figure: coord_flip()]
  • coord_quickmap() sets the aspect ratio correctly for maps. This is very important if you’re plotting spatial data with ggplot2.
# map_data() requires the maps package to be installed
nz <- map_data("nz")

ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")

ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_quickmap()
  • coord_polar() uses polar coordinates. Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.
bar <- ggplot(data = diamonds) + 
  geom_bar(
    mapping = aes(x = cut, fill = cut), 
    show.legend = FALSE,
    width = 1
  ) + 
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)

bar + coord_flip()
bar + coord_polar()

Exercise 3.9.1
A pie chart is a stacked bar chart with the addition of polar coordinates.
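
A minimal sketch of that idea, mapping a single stacked bar onto the angle:

# one stacked bar, then polar coordinates with the angle taken from the y (count) axis
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = factor(1), fill = cut), width = 1) +
  coord_polar(theta = "y")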

Exercise 3.9.2
The labs function adds axis titles, plot titles, and a caption to the plot.
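
A minimal sketch:

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() +
  labs(
    title = "Highway mileage by car class",
    x = "Car class",
    y = "Highway miles per gallon",
    caption = "Data from the mpg data set"
  )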

3.10 The layered grammar of graphics, a formal system for building plots

#code template
# each <SLOT> below is one parameter of the layered grammar
ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>, 
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>
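
For instance, every slot of the template filled in for a concrete bar chart, with the defaults written out explicitly (a minimal sketch):

ggplot(data = diamonds) + 
  geom_bar(
     mapping = aes(x = cut, fill = cut),
     stat = "count",        # default stat for geom_bar()
     position = "stack"     # default position adjustment
  ) +
  coord_cartesian() +        # default coordinate system
  facet_null()               # default: no faceting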