Formatting, reshaping, and organizing data should be a second nature of every data scientists.
List is the most general (and flexible) form of objects in R
.
[]
is used to access the sub-list, [[]]
to access the element.
list()
function
alnum=list(c("a","b","c","d"), c(1,2))
[[1]]
[1] "a" "b" "c" "d"
[[2]]
[1] 1 2
The first sub-list of ‘alnum’ would be the character vector. This character vector may be retrieved by running the following code.
alnum[[1]]
[1] "a" "b" "c" "d"
Similarly, the following code will retrieve the second element of the second sub-list.
alnum[[2]][2]
[1] 2
One can also name the sub-list simply adding labels before the sub list.
alnum2=list(
alphabet=c("e","f","g","h"),
integer=c(3,4))
$alphabet
[1] "e" "f" "g" "h"
$integer
[1] 3 4
When the sub-lists are labeled, [[]]
is no longer necessary.
alnum2$alphabet #equivalent of alnum2[[1]]
[1] "e" "f" "g" "h"
alnum2$integer[2] #equivalent of alnum2[[2]][2]
[1] 4
cf) as.list()
cf) unlist()
Functions such as lapply
or strsplit
generates its output in a list format.
lapply(1:2, function(x) c(x-1,x+1))
[[1]]
[1] 0 2
[[2]]
[1] 1 3
strsplit("abcdef", "c")
[[1]]
[1] "ab" "def"
append
function
When someone is trying to combine two or more lists together, this is probably what they want.
append(alnum, alnum2)
[[1]]
[1] "a" "b" "c" "d"
[[2]]
[1] 1 2
$alphabet
[1] "e" "f" "g" "h"
$integer
[1] 3 4
Do not get confused with the following code.
list(alnum, alnum2)
[[1]]
[[1]][[1]]
[1] "a" "b" "c" "d"
[[1]][[2]]
[1] 1 2
[[2]]
[[2]]$alphabet
[1] "e" "f" "g" "h"
[[2]]$integer
[1] 3 4
Often times (especially when using a function that gives output in a list format) one may wish to coerce a list to a 2D matrix or into a data frame. The function do.call
may be handy in this case in conjunction with functions such as rbind
, cbind
, rbind.data.frame
, cbind.data.frame
.
lapply(1:2, function(x) c(x-1,x+1)) %>%
do.call("rbind", .)
[,1] [,2]
[1,] 0 2
[2,] 1 3
cf) Check dplyr::bind_rows or plyr::rbind.fill for list with unequal length.
Data frame is a list of vectors of equal length with all contents being heterogeneous.
cf) matrix for homogeneous
base
R
functions
Use data.frame
, as.data.frame
, rbind.data.frame
, cbind.data.frame
to create or change existing data to data frame format. Note that the data.frame
function builds data frame but as.data.frame
is a function that coerce other objects into dataframe
class.
e.g. Check the output difference between the two codes.
data.frame(matrix(1:9, nrow=3), letters[1:3])
X1 X2 X3 letters.1.3.
1 1 4 7 a
2 2 5 8 b
3 3 6 9 c
as.data.frame(matrix(1:9, nrow=3), letters[1:3])
V1 V2 V3
a 1 4 7
b 2 5 8
c 3 6 9
Useful argument. stringsAsFactors
(logical): F
if the character vector should not be converted to a factor. e.g.
check=matrix(letters[1:9], nrow=3, byrow=T)
check %>% as.data.frame() %>% str()
'data.frame': 3 obs. of 3 variables:
$ V1: Factor w/ 3 levels "a","d","g": 1 2 3
$ V2: Factor w/ 3 levels "b","e","h": 1 2 3
$ V3: Factor w/ 3 levels "c","f","i": 1 2 3
check %>% as.data.frame(stringsAsFactors=F) %>% str()
'data.frame': 3 obs. of 3 variables:
$ V1: chr "a" "d" "g"
$ V2: chr "b" "e" "h"
$ V3: chr "c" "f" "i"
Note that the factors and levels are dropped.
$
operator
df$colmn name
df[,column number]
df[, "column name"]
dplyr
package
Utilize the function select
from the dplyr
package.
select(df, column name)
recommend using it in conjunction with the pipe operator %>%
and other functions such as filter
, summarize
or mutate
.
e.g.
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mtcars$mpg
mtcars[,1]
mtcars[,"mpg"]
mtcars %>%
select(mpg)
[1] 21.0 21.0 22.8 21.4 18.7 18.1
Note
mtcars[,1:5]
mtcars[,c(1,3,5)]
mtcars[,c("mpg", "disp", "drat")]
mtcars[,-c(1,3,5)]
@
for S4. df[row number,]
df["row name"“, ]
e.g.
mtcars[1,]
mtcars["Mazda RX4",]
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4
==
and !=
df[df$“column name”==“value of interest”] df[df$“column name”!=“value of interest”]
e.g. There are 3 different types of cylinders in the mtcars
data set.
4 6 8
11 7 14
Use the logical operators ==
or !=
to filter data with certain cylinder number.
mtcars[mtcars$cyl==6,]
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
mtcars[mtcars$cyl!=8,]
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Logical operators such as |
and &
may also be applied.
mtcars[(mtcars$cyl==6)|(mtcars$cyl==8),]
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
mtcars[(mtcars$cyl==6) & (mtcars$hp==110),]
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
%in%
and its corresponding negation
%in%
is an operator that matches first argument with its second and returns a logical vector. It is somewhat similar to the ==
operator, but more useful if there are multiple values that needs to be subsetted. For example, the code in the above example: "mtcars[(mtcars$cyl==6)|(mtcars$cyl==8),]" can be simplified as "mtcars[mtcars$cyl %in% c(6, 8),]".
identical(mtcars[mtcars$cyl %in% c(6, 8),], mtcars[(mtcars$cyl==6)|(mtcars$cyl==8),])
[1] TRUE
Similarly, %in%
may be negated as the following code.
mtcars[!mtcars$cyl %in% c(6, 8),]
mpg cyl disp hp drat wt qsec vs am gear carb
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Note that mtcars[mtcars$cyl !%in% c(6, 8),]
would be a wrong code. I personally like to introduce a new function %!in%
that negates the %in%
function.
'%!in%' <-function(x, y)!('%in%'(x,y))
identical(mtcars[!mtcars$cyl %in% c(6, 8),],mtcars[mtcars$cyl %!in% c(6, 8),])
[1] TRUE
dplyr
package
filter
function from the dplyr
pacakge can do the same job, except is can be well integrated with other functions using the pipe operator %>%
.
mtcars %>%
filter(cyl<6)
Warning: package 'bindrcpp' was built under R version 3.3.3
mpg cyl disp hp drat wt qsec vs am gear carb
1 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
2 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
3 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
4 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
5 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
6 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
7 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
8 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
9 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
10 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
11 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
The filter
function may be applied with |
, %in%
, or &
operators as well.
mtcars %>%
filter(cyl %in% c(6,8))
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
5 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
6 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
7 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
8 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
9 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
10 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
11 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
12 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
13 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
14 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
15 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
16 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
17 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
18 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
19 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
20 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
21 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
mtcars %>%
filter(cyl ==6 | gear==4)
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
5 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
6 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
7 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
8 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
9 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
10 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
11 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
12 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
13 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
14 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
15 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
mtcars %>%
filter(cyl==6 & gear==4)
mpg cyl disp hp drat wt qsec vs am gear carb
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
4 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Data set structured in a human friendly format dose not necessarily imply computer/machine friendly fashion or vise versa. Often times, therefore, data structure should be changed from a human friendly wide form to a computer friendly long form or in the reversed manner. This may be even more important if one is sending and receiving data with co-workers who are not familiar with programming. tidyr
and reshape
are two useful packages that converts between the two data format: long vs. wide.
tidyr
gather
function
Consider the following sub data set from the mtcars
data set.
cars mpg cyl hp gear
1 Mazda RX4 21.0 6 110 4
2 Merc 450SLC 15.2 8 180 3
This is an example of a wide form data set where all variables are seperated in each column of the data frame. This may be a human friendly data format, however, may not be efficient for the computer to read in. Use the gather
function from the tidyr
package to convert the data from wide format to long format.
gather
simply collapses multiple columns and its values from the original data frame into two columns: the ‘key’ and the ‘value’ column. The levels of the ‘key’ variable will be the name of the columns gathered, and the ‘value’ column will contain its corresponding values.
Therefore, the function has 4 important arguments: data, key, value, and selected column.
# Using column numbers
subcars %>%
gather(key=spec, value=performance, 2, 4)
# Using column names
subcars %>%
gather(key=spec, value=performance, mpg, hp)
cars cyl gear spec performance
1 Mazda RX4 6 4 mpg 21.0
2 Merc 450SLC 8 3 mpg 15.2
3 Mazda RX4 6 4 hp 110.0
4 Merc 450SLC 8 3 hp 180.0
# Using column numbers with :
subcars %>%
gather(key=spec, value=values, 2:5)
# Using column names with :
subcars %>%
gather(key=spec, value=values, mpg:gear)
cars spec values
1 Mazda RX4 mpg 21.0
2 Merc 450SLC mpg 15.2
3 Mazda RX4 cyl 6.0
4 Merc 450SLC cyl 8.0
5 Mazda RX4 hp 110.0
6 Merc 450SLC hp 180.0
7 Mazda RX4 gear 4.0
8 Merc 450SLC gear 3.0
Sometimes, thinking in terms of complement or ‘everything else’ might be effecient when writing a script. Use the-
to indicate what else to ‘gather’.
subcars %>%
gather(key=spec, value=performance, -1, -3)
subcars %>%
gather(key=spec, value=performance, -cars, -cyl)
cars cyl spec performance
1 Mazda RX4 6 mpg 21.0
2 Merc 450SLC 8 mpg 15.2
3 Mazda RX4 6 hp 110.0
4 Merc 450SLC 8 hp 180.0
5 Mazda RX4 6 gear 4.0
6 Merc 450SLC 8 gear 3.0
spread
function
The spread
function will do the opposite of the gather
function. This is useful when one is trying to present a data set.
cars cyl gear spec performance
1 Mazda RX4 6 4 mpg 21.0
2 Merc 450SLC 8 3 mpg 15.2
3 Mazda RX4 6 4 hp 110.0
4 Merc 450SLC 8 3 hp 180.0
longform %>%
spread(key=spec, value=performance)
cars cyl gear hp mpg
1 Mazda RX4 6 4 110 21.0
2 Merc 450SLC 8 3 180 15.2
An important argument that might be useful is the fill
argument. This argument replaces missing values, either implicit or explicit.
# Example of implicit missing
implicit_missing%>%
spread(key=spec, value=performance, fill=NA)
cars cyl gear spec performance
1 Mazda RX4 6 4 mpg 21
3 Mazda RX4 6 4 hp 110
4 Merc 450SLC 8 3 hp 180
cars cyl gear hp mpg
1 Mazda RX4 6 4 110 21
2 Merc 450SLC 8 3 180 NA
# Example of implicit missing
explicit_missing%>%
spread(key=spec, value=performance, fill=NA)
cars cyl gear spec performance
1 Mazda RX4 6 4 mpg 21
2 Merc 450SLC 8 3 mpg NA
3 Mazda RX4 6 4 hp 110
4 Merc 450SLC 8 3 hp 180
cars cyl gear hp mpg
1 Mazda RX4 6 4 110 21
2 Merc 450SLC 8 3 180 NA
reshape2
The reshape2
package could do similar operations introduced in the tidyr
section above. I think tidyr
package has a simpler syntax compared to the reshape 2
package, however, reshape 2
package can perform additional features that the tidyr
package cannot.
melt
function
The melt
function can ‘gather’ the data set.
subcars %>%
melt(
variable.name = "spec",
value.names = "performance",
id.vars = c("cars", "cyl", "gear"))
cars cyl gear spec value
1 Mazda RX4 6 4 mpg 21.0
2 Merc 450SLC 8 3 mpg 15.2
3 Mazda RX4 6 4 hp 110.0
4 Merc 450SLC 8 3 hp 180.0
dcast
function
dcast
function is compatible to that of the ‘spread’ function.
longform %>%
dcast(formula = cars+cyl+gear ~ spec, value.var = "performance")
cars cyl gear hp mpg
1 Mazda RX4 6 4 110 21.0
2 Merc 450SLC 8 3 180 15.2
longform %>%
dcast(cars+cyl+gear ~ spec)
Using performance as value column: use value.var to override.
cars cyl gear hp mpg
1 Mazda RX4 6 4 110 21.0
2 Merc 450SLC 8 3 180 15.2
longform %>%
dcast(cars+cyl+gear ~ ., mean)
Using performance as value column: use value.var to override.
cars cyl gear .
1 Mazda RX4 6 4 65.5
2 Merc 450SLC 8 3 97.6
One positive side of using the reshape 2
package over the simpler tidyr
package is that the cast
function will aggregate all variables not defined in the function. It will not make much sence adding or taking means of “gear” and “mpg” for this particular example, but the example is given to check the syntax.
cars cyl gear spec performance
1 Mazda RX4 6 4 mpg 21.0
2 Merc 450SLC 8 3 mpg 15.2
3 Mazda RX4 6 4 hp 110.0
4 Merc 450SLC 8 3 hp 180.0
# Sum of gear and mpg variable for each car+cylinder
longform %>%
dcast(cars+cyl~., sum)
Using performance as value column: use value.var to override.
cars cyl .
1 Mazda RX4 6 131.0
2 Merc 450SLC 8 195.2
# Mean of gear and mpg variable for each car+cylinder combination
longform %>%
dcast(cars+cyl ~ ., mean)
Using performance as value column: use value.var to override.
cars cyl .
1 Mazda RX4 6 65.5
2 Merc 450SLC 8 97.6
a=lapply(1:4, function(x) rep(x,x))
sapply(a, function(x) x+1)
## [[1]]
## [1] 2
##
## [[2]]
## [1] 3 3
##
## [[3]]
## [1] 4 4 4
##
## [[4]]
## [1] 5 5 5 5