The Data Table Package

Data Frames

  • First let’s look at the ‘data.frame’. The data.frame is the bread and butter of R. It is an intuitive way to organize data into rows and columns, and subsetting is very straight forward. Organization is critical to a data.frame.
  • Data.frames require that each row and column are the same length, however, they may contain differing types of data. Thus, it is similar to both a matrix and a list.
  • df[i,j], “Take df, subset rows using i and columns using j”
# Load the iris dataset to work with
data("iris") 
# Subset rows on the left of the comma within the the brackets. In this example, i = rows (1:2), j = all columns when blank. The result is row 1 and 2, and all 4 columns of the iris dataset. See below.
df <- iris[1:2,]

# BTW, I use kable to present each table in a nice HTML format
kable(df) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
# Subset columns right of the comma. Use the combine function 'c' to create a numeric vector (all the same type of data within a vector).  In this case the numeric vector combines 2 and 4. The result shows columns 2 and 4 only. See below.
df <- iris[1:2, c(2,4)]
kable(df) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Sepal.Width Petal.Width
3.5 0.2
3.0 0.2

Logical Operators

  • Both logical operators, such as and (&), or (|), and not equal to (!=), and arithmetic operators, such as addition (+) and multiplication (*) are used within the bracket to subset rows and columns.
# Use logical expressions with bracket notation to subset. In this example, all records that are greater than or equal to a Sepal.Width of 3.8, are subsetted.  See the 6 records that met the conditions below.
df <- iris[iris$Sepal.Width > 3.8,]
kable(df) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
6 5.4 3.9 1.7 0.4 setosa
15 5.8 4.0 1.2 0.2 setosa
16 5.7 4.4 1.5 0.4 setosa
17 5.4 3.9 1.3 0.4 setosa
33 5.2 4.1 1.5 0.1 setosa
34 5.5 4.2 1.4 0.2 setosa
  • Another example.
# Use the '$' operator when subsetting with column names in the bracket.
df <- iris[iris$Petal.Length > 6 & iris$Sepal.Length < 7.3,]
kable(df) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
110 7.2 3.6 6.1 2.5 virginica

Data Aggregation

  • To aggregate data, you have to use tapply, aggregate, and/or table. For example, get the mean of Sepal.Length:
# Use 'tapply' to find the mean Sepal.Length for each Species.  The 'tapply' function follows the format: (X, INDEX, FUN), where X is an atomic object, typically a vector, INDEX is a list of one or more factors, each of same length as X, and FUN is the function to be applied.  Since the data are already in a data.frame, the lengths are equal. The results show the mean 'Sepal.Length' for each 'Species'.
df <- tapply(iris$Sepal.Length, iris$Species, mean)
data.frame(df) %>%
mutate_if(is.numeric, format, digits=3, nsmall = 0) %>%
kable(.) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
df
5.01
5.94
6.59
# Examine the first 5 rows. Note there are three levels of 'Petal.Length', 1.3, 1.4, and 1.5.
df <- iris[1:5,]
kable(df) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
# Use aggregate to find the mean of 'Sepal.Length' OF 'Species' GROUPED BY 'Petal.Length' (remeber, there are three levels). Highlight the 'aggregate' function in your RStudio editor and then hit F1 to open up the reference material for this function.
df <- aggregate(Sepal.Length[1:5] ~ Species[1:5] + Petal.Length[1:5], data = iris, mean)
kable(df) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Species[1:5] Petal.Length[1:5] Sepal.Length[1:5]
setosa 1.3 4.7
setosa 1.4 5.0
setosa 1.5 4.6
  • Thus, for plants grouped by Petal.Lengths (1.3, 1.4. and 1.5), the calculated means (of Species within these groups, only “setosa” in this case) are equal to 4.7, 5.0, and 1.5. Note that the data is sorted on the GROUP BY in ascending order.

Formulas

  • one ~ one, one ~ many, many ~ one, and many ~ many:
    • aggregate(weight ~ feed, data = chickwts, mean)
    • aggregate(breaks ~ wool + tension, data = warpbreaks, mean)
    • aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
    • aggregate(cbind(ncases, ncontrols) ~ alcgp + tobgp, data = esoph, sum)

Dot notation:

  • aggregate(. ~ Species, data = iris, mean)
  • aggregate(len ~ ., data = ToothGrowth, mean)

Often followed by xtabs():

  • ag <- aggregate(len ~ ., data = ToothGrowth, mean)
  • xtabs(len ~ ., data = ag)
  • See example of data below, with a few modifications for the illustration (i.e. column names)
data = aggregate(Sepal.Length[1:5] ~ Species[1:5] +
                   Petal.Length[1:5], data = iris, mean)
colnames(data) <- c("Species", "Sepal", "Petal")
df <- xtabs(Sepal ~ ., data = data)
kable(df) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F)
4.6 4.7 5
setosa 1.5 1.3 1.4
versicolor 0.0 0.0 0.0
virginica 0.0 0.0 0.0
# The table function creates a matix to see how the 5 rows are grouped into the 3 Petal.Length.
df <- table(Petal = iris$Petal.Length[1:5], 
            Sepal = iris$Sepal.Length[1:5])
kable(df) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) 
4.6 4.7 4.9 5 5.1
1.3 0 1 0 0 0
1.4 0 0 1 1 1
1.5 1 0 0 0 0

The Data.Table

  • So, just when you think you’ve mastered the data.frame, you find out about the data.table package. Using the iris data, I’ll work through the main sections found in the data.table cheat sheet, which can be found here, which first appeared on DataCamp Blog.
  • dt[i,j,by], “Take dt, subset rows using i, then calculate j grouped by by”

Subsetting Rows Using i

# or dt[3:5], Subset rows by number, similar to data.frame.
dt <- data.table(iris[3:5,])
kable(dt) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F)  
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
# Subset on column names to select rows that satisfy the condition. dt[Species == "setosa"]. The [1:5] shows the first 5 rows, without this the whole data set is displayed (a feature not available with a data.frame)
dt <- dt[Species == "setosa",][1:5]
kable(dt) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) 
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
NA NA NA NA NA
NA NA NA NA NA
# Select on multiple values in a column.
dt <- dt[Species %in% c("setosa", "versicolor")][48:52]
kable(dt) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) 
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
NA NA NA NA NA
NA NA NA NA NA
NA NA NA NA NA
NA NA NA NA NA
NA NA NA NA NA

Manipulate on Columns in j

# dt[,Species] returns a vector [1] "sestosa"....."versicolor", etc..
# Subset multiple columns with dot notation, which are returned as a data.table.  This is a very nice way to reduce the dimension of data you're interested in and work with smaller tables. Note that the .() is an alias of a list. If it is not used, the results returned are a vector.
dt <- dt[,.(Species, Sepal.Length)][48:52]
kable(dt) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F)
Species Sepal.Length
NA NA
NA NA
NA NA
NA NA
NA NA

Call functions in j.

dt <- data.table(iris)
dt <- c(dt[,length(Sepal.Length)], 
        dt[,sum(Sepal.Length)], 
        dt[,mean(Sepal.Length)], 
        dt[,sd(Sepal.Length)])
data.table(dt) %>%
mutate_if(is.numeric, format, digits=2, nsmall = 0) %>%
kable(.) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F)
dt
150.00
876.50
5.84
0.83
  • How about all of these functions in one formula.
# use the combine functions to perform multiple functions on the same column
dt <- data.table(iris)
dt <- dt[,c(length(Sepal.Length), 
              sum(Sepal.Length), 
              mean(Sepal.Length), 
              sd(Sepal.Length))] 

# create a list of names
names(dt) <- c("Length", "Sum", "Mean", "Standard Deviation") 

# Return the vector of results with names for a report. Easy, and saves coding and lots of time.
data.table(dt) %>%
  mutate_if(is.numeric, format, digits=2, nsmall = 0) %>%
  kable(.) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F)
dt
150.00
876.50
5.84
0.83
  • If the column lengths are different, the 2nd column will get recycled. Multiple expressions can also be wrapped into {} brackets in the j position. Such as {print(Species) plot(Sepal.Length) NULL}
# Use bracket notation similar to data.frame; the return carriage must be used to perform each wrapped function.
# dt[, {print(Species[1:5]) 
#       plot(Sepal.Length) 
#       NULL}] 

# Or use the dot notiation and commas to avoid the return carriage.  What ever you prefer. 
dt <- data.table(iris)
dt <- dt[,.(print(Species[1:5]), plot(Sepal.Length), NULL)]

[1] setosa setosa setosa setosa setosa Levels: setosa versicolor virginica

data.table(dt) %>%
  mutate_if(is.numeric, format, digits = 2, nsmall = 0) %>%
  kable(.) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F)
V1
setosa
setosa
setosa
setosa
setosa

Doing j BY Group

# use the dot notation to provide labels to the column.
dt <- data.table(iris)
dt <- dt[ ,.(Results = c(length(Sepal.Length), 
                           sum(Sepal.Length), 
                           mean(Sepal.Length), 
                           sd(Sepal.Length))), 
            by = "Species"] 

# create a list of names that are then applied to a column. Save time by using the length of the number of levels found from the by-selector.
dt$Category <- rep(c("Length", "Sum", 
                       "Mean", "Standard Deviation"),
                     length(levels(data$Species))) 

# Returns a data.table of results for a report. Easy, and saves time.
data.table(dt) %>%
  mutate_if(is.numeric, format, digits = 2, nsmall = 0) %>%
  kable(.) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F)
Species Results Category
setosa 50.00 Length
setosa 250.30 Sum
setosa 5.01 Mean
setosa 0.35 Standard Deviation
versicolor 50.00 Length
versicolor 296.80 Sum
versicolor 5.94 Mean
versicolor 0.52 Standard Deviation
virginica 50.00 Length
virginica 329.40 Sum
virginica 6.59 Mean
virginica 0.64 Standard Deviation
# Group by several factors using the dot nottion .(X, Y).
dt <- data.table(iris)
dt <- dt[,.("Mean" = mean(Sepal.Length)), 
         by =.("Species" = Species, 
              "Petal Length" = Petal.Length)][1:5]
data.table(dt) %>%
  mutate_if(is.numeric, format, digits = 2, nsmall = 0) %>%
  kable(.) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F)  
Species Petal Length Mean
setosa 1.4 4.9
setosa 1.3 4.8
setosa 1.5 5.1
setosa 1.7 5.4
setosa 1.6 4.9
# Count the sum by multiple columns. This is a great feature!!  By the way, you can perform functions within the dot notation in the by bracket notation also.
dt <- data.table(iris)
dt <- dt[,.N,by = .(Species, Petal.Length)][1:5]
kable(dt) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F)
Species Petal.Length N
setosa 1.4 13
setosa 1.3 7
setosa 1.5 13
setosa 1.7 4
setosa 1.6 7

Adding/Updating Columns By Referencing in j Using :=

# let's create a subset to work with first
dtt <- data.table(iris[1:5,])

kable(dtt) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
  • use the := to update a value in a column
# test$Sepal.Length  ......  [1] 5.1 4.9 4.7 4.6 5.0
dt <- dtt[, Sepal.Length := exp(Sepal.Length)]

# use the combine function to update multiple columns at the same time.
kable(dt) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
164.02191 3.5 1.4 0.2 setosa
134.28978 3.0 1.4 0.2 setosa
109.94717 3.2 1.3 0.2 setosa
99.48432 3.1 1.5 0.2 setosa
148.41316 3.6 1.4 0.2 setosa
  • using functional :=
# test[,X := NULL] # removes a column instanly
# test[,c(X, Y) := NULL] # removes two columns
# Cols.chosen = c("Species", "Sepal.Width") # define columns
# test[, (Cols.chosen) := NULL] # removes columns based on the names in the vector
dt <- data.table(dtt[, ':=' (test = exp(Sepal.Length)),
                      by = Petal.Length][]) 
data.table(dt) %>%
  mutate_if(is.numeric, format, digits = 2, nsmall = 0) %>%
  kable(.) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F) 
Sepal.Length Sepal.Width Petal.Length Petal.Width Species test
164 3.5 1.4 0.2 setosa 1.7e+71
134 3.0 1.4 0.2 setosa 2.1e+58
110 3.2 1.3 0.2 setosa 5.6e+47
99 3.1 1.5 0.2 setosa 1.6e+43
148 3.6 1.4 0.2 setosa 2.9e+64
  • Finally, there are many advanced features, indexing and key functions that enhance data.table operations. Several websites and blogs have shown that data.table operations much more efficient than data.frames.
  • Although it will take some time to master the data.table, it will save you tons of time in the future, computationally and writing code. Until next time…