5.1 apply 族

函数 输入 输出
apply 矩阵、数据框 向量
lapply 向量、列表 列表
sapply 向量、列表 向量、矩阵
mapply 多个向量 列表
tapply 数据框、数组 向量
vapply 列表 矩阵
eapply 列表 列表
rapply 嵌套列表 嵌套列表

除此之外,还有 dendrapply() 专门处理层次聚类或分类回归树型结构, 而函数 kernapply() 用于时间序列的平滑处理

# Reproduce example 10.4.3 from Brockwell and Davis (1991) [@Brockwell_1991_Time]
spectrum(sunspot.year, kernel = kernel("daniell", c(11,7,3)), log = "no")
太阳黑子的频谱

图 5.2: 太阳黑子的频谱

将函数应用到多个向量,返回一个列表,生成四组服从正态分布 \(\mathcal{N}(\mu_i,\sigma_i)\) 的随机数,它们的均值和方差依次是 \(\mu_i = \sigma_i = 1 \ldots 4\)

means <- 1:4
sds <- 1:4
set.seed(2020)
samples <- mapply(rnorm, mean = means, sd = sds, MoreArgs = list(n = 50), SIMPLIFY = FALSE)
samples
## [[1]]
##  [1]  1.37697212  1.30154837 -0.09802317 -0.13040590 -1.79653432  1.72057350
##  [7]  1.93912102  0.77062225  2.75913135  1.11736679  0.14687718  1.90925918
## [13]  2.19637296  0.62841610  0.87673977  2.80004312  2.70399588 -2.03876461
## [19] -1.28897495  1.05830349  3.17436525  2.09818265  1.31822032  0.92685244
## [25]  1.83426874  1.19875064  2.29784138  1.93671831  0.85256681  1.11043199
## [31]  0.18749534  0.25629783  2.09534507  3.43537371  1.38811847  1.29062767
## [37]  0.71440171  1.07601472  0.43970140  1.44718837  1.90850113  0.49494040
## [43]  0.69899599  0.27396402 -0.18007703  1.25307471  0.62928870  1.02217956
## [49]  1.66004412  1.48879364
## 
## [[2]]
##  [1]  1.62242017  3.20271904  0.65247989  2.95210048  2.23750646  2.24245257
##  [7]  1.62790641 -0.65654238  0.86615410  3.15766787  5.81807446  2.50151409
## [13] -1.19663012  8.40326349  3.91047075  2.73728923  3.84583814  1.58895731
## [19]  2.18593340  2.33652436  3.59167825  5.29201121 -1.43384863  1.36331379
## [25]  0.19172006  0.59201441 -1.55619892  0.55548967  2.09230842  2.48731604
## [31]  3.25666261  1.95072284  6.62830662  2.35442051 -0.04882953  6.54936260
## [37] -1.77811333  4.18790320  5.69233405  3.04206535 -1.06592422 -1.87872994
## [43]  2.97383308  4.49047338  1.56545315  0.44081377  2.69774900  3.36344850
## [49]  0.93707723  0.64521316
## 
## [[3]]
##  [1] -2.18635182  0.02621703  1.24348331  4.15056524  5.23999476  0.21473726
##  [7]  1.98547111  7.63534205  3.79952663  3.89860180  2.03159394  7.30604230
## [13]  6.01958161 -2.15824091  3.89676139  0.52582309 -0.59876950 -0.82916135
## [19]  2.63046580  9.49782660  2.06315108  4.10231768  6.80831778 -3.79456152
## [25] -0.86954973  3.56365781  5.25492857  8.35404401  7.52481521  0.14051487
## [31]  3.31026822  1.18331851  2.70719812  2.62965568 -0.13403855  2.77538898
## [37]  8.28040600 -1.29635876 10.98660308 -0.87357460  3.04530532  2.88095828
## [43]  9.57432097 -2.92931265  4.39099699  2.21336353 -0.40714352  3.63389844
## [49]  3.29828378 -6.17005096
## 
## [[4]]
##  [1]  2.6114534 -3.7865046  3.1356934 -1.7967859  5.3817362  4.7528096
##  [7] -0.5114111  4.2018014  1.2692036  6.5922153  6.4414579  1.9493723
## [13]  7.0176236  4.7524199 -3.7131425  8.9432854  5.3131093  0.5803420
## [19] -3.5827529  7.3867697  8.9761004  4.9450398  0.9572258  7.0888838
## [25]  6.6462270 -2.3749335 -4.7613877 -0.7070158  8.2000205  3.9771877
## [31]  2.8959507  7.3940852 -0.1957386 -2.9343453 13.6565563  3.2174329
## [37]  7.7071478  1.1461455 -1.1989669  7.5371856  8.8025661 -0.6813591
## [43]  7.0458875  7.4803610  1.0910292  6.5064829 -0.3657709  1.9356219
## [49]  4.0677359  6.6439628

我们借用图5.3来看一下 mapply 的效果,多组随机数生成非常有助于快速模拟。

par(mfrow = c(2, 2), mar = c(2, 2, 2, 2))
invisible(lapply(samples, function(x) {
  plot(x, pch = 16, col = "grey")
  abline(h = mean(x), lwd = 2, col = "darkorange")
}))
 lapply 函数

图 5.3: lapply 函数

分别计算每个样本的平均值

sapply(samples, mean)
## [1] 1.125622 2.184323 2.731533 3.432760

分别计算每个样本的1,2,3 分位点

lapply(samples, quantile, probs = 1:3/4)
## [[1]]
##       25%       50%       75% 
## 0.6286342 1.1580587 1.8899430 
## 
## [[2]]
##       25%       50%       75% 
## 0.6470298 2.2399795 3.2431767 
## 
## [[3]]
##        25%        50%        75% 
## 0.05479149 2.74129355 4.33088906 
## 
## [[4]]
##          25%          50%          75% 
## -0.001718463  4.022461769  6.924774426

仅用 sapply() 函数替换上面的 lapply(),我们可以得到一个矩阵,值得注意的是函数 quantile()fivenum() 算出来的结果有一些差异

sapply(samples, quantile, probs = 1:3/4)
##          [,1]      [,2]       [,3]         [,4]
## 25% 0.6286342 0.6470298 0.05479149 -0.001718463
## 50% 1.1580587 2.2399795 2.74129355  4.022461769
## 75% 1.8899430 3.2431767 4.33088906  6.924774426
vapply(samples, fivenum, c(Min. = 0, "1st Qu." = 0, Median = 0, "3rd Qu." = 0, Max. = 0))
##               [,1]       [,2]        [,3]       [,4]
## Min.    -2.0387646 -1.8787299 -6.17005096 -4.7613877
## 1st Qu.  0.6284161  0.6452132  0.02621703 -0.1957386
## Median   1.1580587  2.2399795  2.74129355  4.0224618
## 3rd Qu.  1.9085011  3.2566626  4.39099699  7.0176236
## Max.     3.4353737  8.4032635 10.98660308 13.6565563

vapply 和 sapply 类似,但是预先指定返回值类型,这样可以更加安全,有时也更快。

以数据集 presidents 为例,它是一个 ts 对象类型的时间序列数据,记录了 1945 年至 1974 年每个季度美国总统的支持率,这组数据中存在缺失值,以 NA 表示。支持率的变化趋势见图 5.4

plot(presidents)
1945-1974美国总统的支持率

图 5.4: 1945-1974美国总统的支持率

计算这 30 年每个季度的平均支持率

tapply(presidents, cycle(presidents), mean, na.rm = TRUE)
##        1        2        3        4 
## 58.44828 56.43333 57.22222 53.07143

cycle() 函数计算序列中每个观察值在周期中的位置,presidents 的周期为 4,根据位置划分组,然后分组求平均,也可以化作如下计算步骤,虽然看起来复杂,但是数据操作的过程很清晰,不再看起来像是一个黑箱。

# Base R
cbind(expand.grid(quarter = c("Qtr1", "Qtr2", "Qtr3", "Qtr4"), year = 1945:1974), rate = as.vector(presidents)) %>%
  reshape(., v.names = "rate", idvar = "year", timevar = "quarter", direction = "wide", sep = "") %>%
  `colnames<-`(., gsub(pattern = "(rate)", x = colnames(.), replacement =  "")) %>% 
  `[`(., -1) %>% 
  apply(., 2, mean, na.rm = TRUE)
##     Qtr1     Qtr2     Qtr3     Qtr4 
## 58.44828 56.43333 57.22222 53.07143

tapply 函数来做分组求和

# 一个变量分组求和
tapply(warpbreaks$breaks, warpbreaks[, 3, drop = FALSE], sum)
## tension
##   L   M   H 
## 655 475 390
# 两个变量分组计数
with(warpbreaks, table(wool, tension))
##     tension
## wool L M H
##    A 9 9 9
##    B 9 9 9
# 两个变量分组求和
aggregate(breaks ~ wool + tension, data = warpbreaks,  sum) %>% 
  reshape(., v.names = "breaks", idvar = "wool", timevar = "tension", direction = "wide", sep = "") %>% 
  `colnames<-`(., gsub(pattern = "(breaks)", x = colnames(.), replacement =  ""))
##   wool   L   M   H
## 1    A 401 216 221
## 2    B 254 259 169