R Object-oriented Programming

Operations on data structures

The R environment has a rich set of options available for performing operations on data within the various data structures. These operations can be performed in a variety of ways and can be restricted according to various criteria. The focus of this section is the purpose and formats of the various apply commands.

The apply commands instruct R to use a given function on specific parts of a list, vector, or array. Different versions of the apply commands are available for different data types. Before discussing the different commands, it is important to define the notion of the margins of a table or array. The margins are defined along any dimension, and the dimension used must be specified. The margin.table command can be used to determine the sums of the rows, the sums of the columns, or the sum of the entire array or table:

> A <- matrix(1:12,nrow=3,byrow=TRUE)
> A
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> margin.table(A)
[1] 78
> margin.table(A,1)
[1] 10 26 42
> margin.table(A,2)
[1] 15 18 21 24

The last two commands specify the optional margin argument. The margin.table(A,1) command specifies that the sums are taken along the first dimension, giving one sum for each row. The margin.table(A,2) command specifies that the sums are taken along the second dimension, giving one sum for each column. Specifying which dimension to use is equally important when using the apply commands.
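
Since margins are defined along any dimension, the same idea extends beyond two-way tables. The following is a minimal sketch, assuming a hypothetical 2 x 3 x 4 array named B that does not appear in the original text. It shows the total, the sums along the third dimension, and the sums over the first two dimensions taken together:

```r
# B is a hypothetical three-dimensional array used only for illustration.
B <- array(1:24, dim = c(2, 3, 4))

# With no margin argument, the whole array is summed.
totalB <- margin.table(B)

# Margin 3: one sum for each of the four 2 x 3 slices.
sliceSums <- margin.table(B, 3)

# Margins 1 and 2 together: a 2 x 3 table in which each cell
# adds up the four values found at that position across the slices.
cellSums <- margin.table(B, c(1, 2))

print(totalB)
print(sliceSums)
print(cellSums)
```

The margin argument can therefore be a single dimension or a vector of dimensions, and the result keeps only the dimensions that were named.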

The apply commands

The various apply commands are used to operate on the different data structures. Each one—apply, lapply, sapply, tapply, and mapply—will be briefly discussed in order in the following sections.

apply

The apply command is used to apply a given function across a given margin of an array or table. For example, to take the sum of each row or column of a two-way table, use the apply command with three arguments: the table, the dimension to use, and the function to apply (here, sum):

> A <- matrix(1:12,nrow=3,byrow=TRUE)
> A
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
> apply(A,1,sum)
[1] 10 26 42
> apply(A,2,sum)
[1] 15 18 21 24

You should be able to verify these results using the rowSums and colSums commands as well as the margin.table command discussed previously.
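
One way to carry out that check is sketched below. The variable names rowsViaApply and colsViaApply are illustrative and not part of the original text. The all.equal function is used for the comparison because apply returns integers here, whereas rowSums and colSums return doubles:

```r
A <- matrix(1:12, nrow = 3, byrow = TRUE)

# Row and column sums computed with apply.
rowsViaApply <- apply(A, 1, sum)
colsViaApply <- apply(A, 2, sum)

# Compare against the dedicated commands; each comparison should print TRUE.
print(isTRUE(all.equal(rowsViaApply, rowSums(A))))
print(isTRUE(all.equal(colsViaApply, colSums(A))))

# margin.table returns a one-dimensional array, so convert it to a
# plain vector before comparing.
print(isTRUE(all.equal(rowsViaApply, as.vector(margin.table(A, 1)))))
print(isTRUE(all.equal(colsViaApply, as.vector(margin.table(A, 2)))))
```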

lapply and sapply

The lapply command is used to apply a function to each element in a list. The result is a list, where each component of the returned object is the function applied to the object in the original list with the same name:

> theList <- list(one=c(1,2,3),two=c(TRUE,FALSE,TRUE,TRUE))
> sumResult <- lapply(theList,sum)
> sumResult
$one
[1] 6

$two
[1] 3

> typeof(sumResult)
[1] "list"
> sumResult$one
[1] 6

The sapply command is similar to the lapply command, and it performs the same operation. The difference is that the result is simplified to a vector (or a matrix) if possible:

> theList <- list(one=c(1,2,3),two=c(TRUE,FALSE,TRUE,TRUE))
> meanResult <- sapply(theList,mean)
> meanResult
 one  two 
2.00 0.75 
> typeof(meanResult)
[1] "double"
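
What "if possible" means depends on the shape of the individual results. The sketch below reuses theList and is meant only as an illustration. The names rangeResult, revResult, and meanChecked are made up for this example, and vapply is mentioned as a stricter relative of sapply rather than something the original text covers:

```r
theList <- list(one = c(1, 2, 3), two = c(TRUE, FALSE, TRUE, TRUE))

# range returns two values per element, so the results are
# simplified to a matrix with one column per list element.
rangeResult <- sapply(theList, range)
print(rangeResult)

# rev returns results of different lengths (3 and 4), so no
# simplification is possible and a list is returned, as with lapply.
revResult <- sapply(theList, rev)
print(typeof(revResult))

# vapply declares the expected type and length of each result
# and stops with an error if a result does not match.
meanChecked <- vapply(theList, mean, numeric(1))
print(meanChecked)
```

Because the return type of sapply depends on the data, vapply may be the safer choice inside functions where the caller relies on a particular type.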

tapply

The tapply command is used to apply a function to groups of values within a vector. The function takes at least three arguments. The first is the data on which to operate, the second is the factor (or list of factors) that assigns each observation to a group, and the third is the function to apply to each group. In the following example, a vector is defined that has the diameter of trees. A second vector is defined, which specifies what kind of tree was measured for each observation. The goal is to find the standard deviation for each type of tree:

> diameters <- c(28.8, 27.3, 45.8, 34.8, 25.3)
> tree <- as.factor(c("pine","pine","oak","pine","oak"))
> tapply(diameters,tree,sd)
      oak      pine 
14.495689  3.968627 
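
The third argument can be any function that summarizes a vector. The following sketch reuses the same data with mean and length. It also shows an equivalent route using split and sapply, which is offered as one way to see what tapply is doing and is not taken from the original text. The names meanByTree, countByTree, and sdBySplit are illustrative:

```r
diameters <- c(28.8, 27.3, 45.8, 34.8, 25.3)
tree <- as.factor(c("pine", "pine", "oak", "pine", "oak"))

# Mean diameter and number of observations for each type of tree.
meanByTree <- tapply(diameters, tree, mean)
countByTree <- tapply(diameters, tree, length)
print(meanByTree)
print(countByTree)

# split breaks the data into a list with one element per factor level;
# sapply then applies sd to each element of that list.
sdBySplit <- sapply(split(diameters, tree), sd)
print(sdBySplit)
```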

mapply

The last command to examine is the mapply command. The mapply command takes a function to apply and one or more vectors or lists. It calls the function with the first element of each argument, then with the second element of each argument, and so on until every element has been used. Note that if one of the arguments has fewer elements than the others, the mapply command recycles it, starting again at its beginning to supply the missing values. In the following example, both vectors have the same length, so no recycling is needed:

> a <- c(1,2,3)
> b <- c(1,2,3)
> mapply(sum,a,b)
[1] 2 4 6
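
The recycling behavior can be seen when the arguments differ in length. The sketch below uses made-up vectors named long and short. The lengths are chosen so that the longer one is a multiple of the shorter one, which avoids a warning. The second call passes the rep function to show that the result is a list when the individual results have different lengths:

```r
# Hypothetical vectors of different lengths, for illustration only.
long <- c(1, 2, 3, 4)
short <- c(10, 20)

# short is recycled as 10, 20, 10, 20, so the calls are
# sum(1, 10), sum(2, 20), sum(3, 10), and sum(4, 20).
recycled <- mapply(sum, long, short)
print(recycled)

# Each call is rep(x, times): rep(1, 3), rep(2, 2), rep(3, 1).
# The results have different lengths, so a list is returned.
repeated <- mapply(rep, 1:3, 3:1)
print(repeated)
```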