R for Data Science Ch 5.3-5.4

5.3 Arrange rows with arrange()
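arrange() reorders the rows of a data frame by the specified columns. More than one variable can be used for sorting: each additional column breaks ties in the preceding ones. Missing values are always sorted at the end.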

arrange(flights, year, month, day)

# Use desc() to re-order by a column in descending order:
arrange(flights, desc(dep_delay))

# Missing values are always sorted at the end:
df <- tibble(x = c(5, 2, NA))
arrange(df, x)
arrange(df, desc(x))
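
As a small illustration of how a second sorting column breaks ties, here is a sketch on a made-up tibble (toy, g and v are hypothetical names, not part of flights):

```r
library(dplyr)

# Toy data (hypothetical): two groups, each with two values.
toy <- tibble(g = c(2, 1, 2, 1), v = c(1, 4, 3, 2))

# Sort by g ascending; rows with equal g are then sorted by v descending.
sorted <- arrange(toy, g, desc(v))
sorted$g  # 1 1 2 2
sorted$v  # 4 2 3 1
```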

5.3 Exercises

Exercise 5.3.1
How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).

The flights will first be sorted by desc(is.na(dep_time)). is.na(dep_time) is TRUE when dep_time is missing and FALSE when it is not. Since TRUE > FALSE, sorting it in descending order puts the rows with missing dep_time first. The remaining rows are then sorted by dep_time.

arrange(flights, desc(is.na(dep_time)), dep_time)
               # ^^^^^^^^^^^^^^^^^^^^^
               # return T or F, T > F
               # sort all missing values to the start 
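
The same trick can be checked on the three-row tibble from section 5.3, where the result is easy to read off. This is only a sketch; na_first and na_first_desc are names made up for the example:

```r
library(dplyr)

df <- tibble(x = c(5, 2, NA))

# is.na(x) is TRUE only for the missing row, so desc() moves it to the top;
# the second key orders the non-missing rows.
na_first <- arrange(df, desc(is.na(x)), x)
na_first$x       # NA 2 5

# Missing values first, remaining rows in descending order.
na_first_desc <- arrange(df, desc(is.na(x)), desc(x))
na_first_desc$x  # NA 5 2
```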

5.4 Select columns with select()
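Raw datasets often contain a very large number of variables. select() lets you list just the columns you need, so you can focus on the variables of interest. If the same variable is named more than once inside select(), it appears only once in the result. (The select() call ignores the duplication.)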

# Select columns by name
select(flights, year, month, day)

# Select all columns between year and day (inclusive)
select(flights, year:day)

# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))

Helper functions that can be used inside select():

  • starts_with("abc"): matches names that begin with “abc”.
  • ends_with("xyz"): matches names that end with “xyz”.
  • contains("ijk"): matches names that contain “ijk”.
  • matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.
  • num_range("x", 1:3): matches x1, x2 and x3.
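
To see what each helper picks up, here is a sketch on a made-up one-row tibble. The column names are hypothetical and were chosen so that each helper matches something different:

```r
library(dplyr)

# Toy data (hypothetical column names, one per helper).
toy <- tibble(x1 = 1, x2 = 2, x3 = 3, abc_n = 4, n_xyz = 5, mijkn = 6, loop = 7)

by_start <- names(select(toy, starts_with("abc")))    # abc_n
by_end   <- names(select(toy, ends_with("xyz")))      # n_xyz
by_part  <- names(select(toy, contains("ijk")))       # mijkn
by_regex <- names(select(toy, matches("(.)\\1")))     # loop (repeated "o")
by_range <- names(select(toy, num_range("x", 1:3)))   # x1, x2, x3
```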

rename() is a variant of select() that keeps all variables and renames only the ones you specify.

rename(flights, tail_num = tailnum)
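
The difference between renaming with rename() and renaming inside select() can be sketched on a made-up tibble (toy and its values are hypothetical; only the column name tailnum is borrowed from flights):

```r
library(dplyr)

# Toy data (hypothetical values).
toy <- tibble(year = 2013, tailnum = "N00000", origin = "EWR")

# rename() keeps every column and changes only the requested name.
renamed <- names(rename(toy, tail_num = tailnum))   # year, tail_num, origin

# select() also renames, but drops every column that is not mentioned.
selected <- names(select(toy, tail_num = tailnum))  # tail_num
```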

Using everything() inside select() brings in all the variables of the data frame, which saves typing a large number of variable names by hand.

select(flights, time_hour, air_time, everything())

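Because a variable that appears more than once in select() is still listed only once, combining this property with everything() lets you rearrange the column order without listing every column.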

# Using select()  with everything() to rearrange columns  
select(flights, arr_delay, everything())
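
A minimal sketch of both properties on a made-up three-column tibble (toy, a, b and c are hypothetical names):

```r
library(dplyr)

# Toy data (hypothetical).
toy <- tibble(a = 1, b = 2, c = 3)

# A column named twice is returned once.
deduped <- names(select(toy, a, a, b))           # a, b

# c is listed first; everything() adds the remaining columns in their
# original order and does not repeat c.
reordered <- names(select(toy, c, everything())) # c, a, b
```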

Exercise 5.4.1
Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

# Specify column names as unquoted variable names.
select(flights, dep_time, dep_delay, arr_time, arr_delay)

# Specify column names as strings.
select(flights, "dep_time", "dep_delay", "arr_time", "arr_delay")

# Specify the column numbers of the variables.
select(flights, 4, 6, 7, 9)

# Specify the names of the variables with character vector and one_of().
select(flights, one_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))

# Selecting the variables by matching the start of their names using starts_with().
select(flights, starts_with("dep_"), starts_with("arr_"))

The one_of() approach is useful because the names of the variables can be stored in a variable and passed to one_of().

variables <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, one_of(variables))

Warning: matching the names with contains() does not work here, since there is no pattern that includes all four variables without incorrectly including others.
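
The over-matching can be sketched on a made-up one-row tibble whose column names are copied from the relevant part of flights (the values are placeholders). As one possible alternative, an anchored regular expression passed to matches() picks exactly the four columns:

```r
library(dplyr)

# Toy data: column names as in flights, values are placeholders.
toy <- tibble(dep_time = 1, sched_dep_time = 1, dep_delay = 1,
              arr_time = 1, sched_arr_time = 1, arr_delay = 1, air_time = 1)

# contains() also picks up the sched_* columns and air_time,
# and still misses the two delay columns.
too_many <- names(select(toy, contains("_time")))

# An anchored regular expression matches only the four wanted names.
exact <- names(select(toy, matches("^(dep|arr)_(time|delay)$")))
```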

Exercise 5.4.3
What does the one_of() function do? Why might it be helpful in conjunction with this vector?

The one_of() function selects variables with a character vector rather than unquoted variable name arguments.  This function is useful because it is easier to programmatically generate character vectors with variable names than to generate unquoted variable names, which are easier to type.

vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, one_of(vars))
select(flights, vars)

However, there is a problem with select(flights, vars): it is not clear whether vars refers to a column named vars or to a variable in the environment.
If a column has the same name as the variable, or to ensure that the variable cannot conflict with the names of the columns in the data frame, use the !!! (bang-bang-bang) operator, which splices the contents of vars into the call.

select(flights, !!!vars)
#> # A tibble: 336,776 x 5
#>    year month   day dep_delay arr_delay
#>   <int> <int> <int>     <dbl>     <dbl>
#> 1  2013     1     1         2        11
#> 2  2013     1     1         4        20
#> 3  2013     1     1         2        33
#> 4  2013     1     1        -1       -18
#> 5  2013     1     1        -6       -25
#> 6  2013     1     1        -4        12
#> # … with 3.368e+05 more rows

# This behavior, which is used by many tidyverse functions, 
# is an example of what is called non-standard evaluation (NSE) in R. 
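
The ambiguity can be made concrete with a made-up tibble that happens to contain a column called vars (toy is hypothetical). The last line uses all_of(), which assumes dplyr 1.0.0 or later; it is shown here as an alternative and is not part of the original notes:

```r
library(dplyr)

# Toy data (hypothetical): note the column literally named "vars".
toy <- tibble(year = 2013, month = 1, vars = "a column")
vars <- c("year", "month")

# The column wins over the variable in the environment.
ambiguous <- names(select(toy, vars))          # vars

# !!! splices the character vector, so the environment variable is used.
spliced <- names(select(toy, !!!vars))         # year, month

# Assumption: dplyr >= 1.0.0, where all_of() is available.
explicit <- names(select(toy, all_of(vars)))   # year, month
```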

Exercise 5.4.4
Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

select(flights, contains("TIME"))

#> # A tibble: 336,776 x 6
#>   dep_time sched_dep_time arr_time sched_arr_time air_time
#>      <int>          <int>    <int>          <int>    <dbl>
#> 1      517            515      830            819      227
#> 2      533            529      850            830      227
#> 3      542            540      923            850      160
#> 4      544            545     1004           1022      183
#> 5      554            600      812            837      116
#> 6      554            558      740            728      150
#> # … with 3.368e+05 more rows, and 1 more variable: time_hour <dttm>

The pattern passed to contains() is a string, and matching is case-insensitive by default. Set ignore.case = FALSE to change this.

select(flights, contains("TIME", ignore.case = FALSE))
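
A sketch of the effect on a made-up tibble with lower-case column names (toy is hypothetical): with the default, "TIME" matches the time columns, while with ignore.case = FALSE it matches nothing and select() returns a tibble with zero columns.

```r
library(dplyr)

# Toy data (hypothetical values, lower-case names as in flights).
toy <- tibble(dep_time = 517, arr_time = 830, dep_delay = 2)

# Default: case is ignored, so "TIME" matches the *_time columns.
insensitive <- names(select(toy, contains("TIME")))              # dep_time, arr_time

# Case-sensitive: no column name contains upper-case "TIME".
n_sensitive <- ncol(select(toy, contains("TIME", ignore.case = FALSE)))  # 0
```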

Posted by: Q
