Operations on data structures
The R environment has a rich set of options available for performing operations on data within the various data structures. These operations can be performed in a variety of ways and can be restricted according to various criteria. The focus of this section is the purpose and formats of the various apply commands.
The apply commands are used to instruct R to use a given command on specific parts of a list, vector, or array. Each data type has different versions of the apply commands that are available. Before discussing the different commands, it is important to define the notion of the margins of a table or array. The margins are defined along any dimension, and the dimension used must be specified. The margin.table command can be used to determine the sums of the rows, the sums of the columns, or the sum of the entire array or table:
> A <- matrix(1:12,nrow=3,byrow=TRUE)
> A
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> margin.table(A)
[1] 78
> margin.table(A,1)
[1] 10 26 42
> margin.table(A,2)
[1] 15 18 21 24
The last two commands specify the optional margin argument. The margin.table(A,1) command specifies that the sums are taken along the first dimension, which gives one sum for each row. The margin.table(A,2) command specifies that the sums are taken along the second dimension, which gives one sum for each column. The idea of specifying which dimension to use in a command is important when using the apply commands.
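Because margins can be taken along any dimension, the same idea carries over to arrays with more than two dimensions. The following is a minimal sketch, assuming a hypothetical 2 x 3 x 4 array named arr that is not part of the example above:

```r
# Hypothetical three-dimensional array: 2 rows, 3 columns, 4 layers
arr <- array(1:24, dim = c(2, 3, 4))

# Keep only the third dimension: one total per layer
layerSums <- margin.table(arr, 3)

# Keep the first two dimensions: a 2 x 3 table of sums taken across the layers
cellSums <- margin.table(arr, c(1, 2))

print(layerSums)
print(cellSums)
```

The margin argument names the dimensions that are kept in the result; all other dimensions are summed over.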
The apply commands
The various apply commands are used to operate on the different data structures. Each one (apply, lapply, sapply, tapply, and mapply) will be briefly discussed in order in the following sections.
apply
The apply command is used to apply a given function across a given margin of an array or table. For example, to take the sum of each row or each column of a two-way table, use the apply command with arguments for the table, the dimension to use, and the sum command:
> A <- matrix(1:12,nrow=3,byrow=TRUE)
> A
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> apply(A,1,sum)
[1] 10 26 42
> apply(A,2,sum)
[1] 15 18 21 24
You should be able to verify these results using the rowSums and colSums commands as well as the margin.table command discussed previously.
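One way to carry out that check is sketched below. The variable names rowTotals, colTotals, and rowSpread are introduced here for illustration only, and the anonymous function is an assumed example of passing a function other than sum:

```r
A <- matrix(1:12, nrow = 3, byrow = TRUE)

# Sums along the first margin (rows) and the second margin (columns)
rowTotals <- apply(A, 1, sum)
colTotals <- apply(A, 2, sum)

# Each comparison prints TRUE when the two approaches agree
print(all(rowTotals == rowSums(A)))
print(all(colTotals == colSums(A)))
print(all(rowTotals == margin.table(A, 1)))
print(all(colTotals == margin.table(A, 2)))

# apply accepts any function, here the spread (max minus min) of each row
rowSpread <- apply(A, 1, function(r) max(r) - min(r))
print(rowSpread)
```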
lapply and sapply
The lapply command is used to apply a function to each element in a list. The result is a list in which each component is the result of applying the function to the component of the original list with the same name:
> theList <- list(one=c(1,2,3),two=c(TRUE,FALSE,TRUE,TRUE))
> sumResult <- lapply(theList,sum)
> sumResult
$one
[1] 6

$two
[1] 3

> typeof(sumResult)
[1] "list"
> sumResult$one
[1] 6
The sapply command is similar to the lapply command, and it performs the same operation. The difference is that the result is simplified to a vector (or a matrix) if possible; otherwise, a list is returned:
> theList <- list(one=c(1,2,3),two=c(TRUE,FALSE,TRUE,TRUE))
> meanResult <- sapply(theList,mean)
> meanResult
 one  two 
2.00 0.75 
> typeof(meanResult)
[1] "double"
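The relationship between the two commands can be explored with a short sketch. The names viaLapply, viaSapply, ranges, and reversed are illustrative and do not appear in the example above; the choice of the range and rev functions is likewise an assumption made only to produce results of different lengths:

```r
theList <- list(one = c(1, 2, 3), two = c(TRUE, FALSE, TRUE, TRUE))

# With length-one results, sapply gives the same values as unlisting lapply
viaLapply <- unlist(lapply(theList, mean))
viaSapply <- sapply(theList, mean)
print(all.equal(viaLapply, viaSapply))

# When every result has the same length greater than one, sapply builds a matrix
ranges <- sapply(theList, range)
print(ranges)

# When the results differ in length, no simplification happens and a list is returned
reversed <- sapply(theList, rev)
print(typeof(reversed))
```

Since the shape of the sapply result depends on the data, lapply is the more predictable choice when later code expects a list.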
tapply
The tapply command is used to apply a function to subsets of a vector, where the subsets are defined by the levels of one or more factors. The function takes at least three arguments. The first is the data to operate on, the second is the factor (or list of factors) that assigns each observation to a group, and the third is the operation to perform. In the following example, a vector is defined that holds the diameters of trees. A second vector is defined, which specifies what kind of tree was measured for each observation. The goal is to find the standard deviation for each type of tree:
> diameters <- c(28.8, 27.3, 45.8, 34.8, 25.3)
> tree <- as.factor(c("pine","pine","oak","pine","oak"))
> tapply(diameters,tree,sd)
      oak      pine 
14.495689  3.968627 
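To see what the grouping step does, the same result can be reproduced by splitting the data by factor level and applying the function to each piece. This is a sketch for illustration rather than a description of how tapply is implemented; the names groups, bySplit, byTapply, and meanByTree are assumptions:

```r
diameters <- c(28.8, 27.3, 45.8, 34.8, 25.3)
tree <- as.factor(c("pine", "pine", "oak", "pine", "oak"))

# split returns a list with one vector of diameters per tree type
groups <- split(diameters, tree)
print(groups)

# Applying sd to each group gives the same values as tapply
bySplit <- sapply(groups, sd)
byTapply <- tapply(diameters, tree, sd)
print(all.equal(as.vector(byTapply), as.vector(bySplit)))

# Any summary function can be used in place of sd, for example the mean
meanByTree <- tapply(diameters, tree, mean)
print(meanByTree)
```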
mapply
The last command to examine is the mapply command. The mapply command takes a function to apply and one or more vectors or lists. It calls the function with the first element of each vector as its arguments. It then calls the function with the second element of each vector. This is repeated until it goes through every element. Note that if one of the vectors has fewer elements than the others, the mapply command recycles it; that is, it starts again at the beginning of that vector to fill in the missing values:
> a <- c(1,2,3)
> b <- c(1,2,3)
> mapply(sum,a,b)
[1] 2 4 6
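The recycling behavior can be seen when the arguments have different lengths. The following is a minimal sketch; the vectors long and short and the anonymous power function are hypothetical and chosen only for illustration:

```r
long <- c(1, 2, 3, 4)
short <- c(10, 20)

# short is recycled, so the calls are sum(1,10), sum(2,20), sum(3,10), sum(4,20)
recycled <- mapply(sum, long, short)
print(recycled)

# A single value is recycled for every element of the longer vector
squared <- mapply(function(x, p) x^p, long, 2)
print(squared)
```

Here the longer length is a multiple of the shorter one, which keeps the recycling pattern easy to follow.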