dplyr Compared to data.table

A few of my recent posts included manipulation of data inside the data.table by using special characters in the j and by positions. Another very popular approach is to use the dplyr package, rather than data.table. dplyr is a grammar of data manipulation, which includes arrange, filter, mutate, select, and summarise, all of which can be used with other functions, such as the group_by function or even logical operators. For comparison, I provide the similar action with data.table.

{{ instagram BgaKgO-HKF- }}

arrange

  • Used to sort the order of rows
  • Note the use of as_tibble, which simpliifes the data frame and improves the appearance
  • Note the use of forward-pipe operator %>%
data("iris")
iris <- as_tibble(iris)

df <- iris %>% # uses the pipe operator
  arrange(desc(Sepal.Length)) 
pander(head(df)) # note dplyr resets the rownames/order
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
7.9 3.8 6.4 2 virginica
7.7 3.8 6.7 2.2 virginica
7.7 2.6 6.9 2.3 virginica
7.7 2.8 6.7 2 virginica
7.7 3 6.1 2.3 virginica
7.6 3 6.6 2.1 virginica
  • setorder is similar to arrange
  • Make sure to use the negative-sign to set descending
dt <- data.table(iris)
setorder(dt, -Sepal.Length)  # note the '-' for descending
pander(head(dt))  # note data.table preserves row name order
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
7.9 3.8 6.4 2 virginica
7.7 3.8 6.7 2.2 virginica
7.7 2.6 6.9 2.3 virginica
7.7 2.8 6.7 2 virginica
7.7 3 6.1 2.3 virginica
7.6 3 6.6 2.1 virginica

filter

  • Picks rows based on their values
df <- iris %>%
    filter(Sepal.Length == 7.7)
pander(head(df))  # note dplyr reset the row name order
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
7.7 3.8 6.7 2.2 virginica
7.7 2.6 6.9 2.3 virginica
7.7 2.8 6.7 2 virginica
7.7 3 6.1 2.3 virginica
  • filter is similar to setting a row filter in the i position of the data.table
  • Note the positions in the data.table: > dt[i,j,by]
dt_filt <- dt[Sepal.Length == 7.7, ]  # filter values in the i position
pander(dt_filt)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
7.7 3.8 6.7 2.2 virginica
7.7 2.6 6.9 2.3 virginica
7.7 2.8 6.7 2 virginica
7.7 3 6.1 2.3 virginica

mutate

  • Adds a new variable as a function of existing variables
df <- iris %>%
    mutate(Sepal.Length, Sepal.Length_new = Sepal.Length/2)
pander(head(df))  # note the new column is added to the end of the table
Table continues below
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
Sepal.Length_new
2.55
2.45
2.35
2.3
2.5
2.7
  • for data.table, use the impute special character ‘:=’ in the j position
  • ‘:=’ creates a new variable in the data.table
dt_mut <- dt[, `:=`(Sepal.Length_new, Sepal.Length/2)]  # note the new column is added to the end of the data.table
pander(head(dt_mut))
Table continues below
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
7.9 3.8 6.4 2 virginica
7.7 3.8 6.7 2.2 virginica
7.7 2.6 6.9 2.3 virginica
7.7 2.8 6.7 2 virginica
7.7 3 6.1 2.3 virginica
7.6 3 6.6 2.1 virginica
Sepal.Length_new
3.95
3.85
3.85
3.85
3.85
3.8

select

  • Picks the columns you want in the final table.
  • Below, I’ve combine this function with mutate, which shows how various functions can be piped together
df <- iris %>%
  as_tibble() %>% mutate( # pipe several manipulationa together
    Sepal.Length_new = Sepal.Length/2, # adds column
    Sepal.Width_new = Sepal.Width/2, # adds column
    Petal.Width = NULL)  %>% # removes a column
  dplyr::select(Sepal.Length_new, Sepal.Width_new) # will potentially conflict with other packages

pander(head(df)) # note the new column is added to the end of the table
Sepal.Length_new Sepal.Width_new
2.55 1.75
2.45 1.5
2.35 1.6
2.3 1.55
2.5 1.8
2.7 1.95
  • For the data.table, you string operations together with additional brackets, much like the pipe operators
  • Use the list feature ‘.(x, y, z)’ to extract the features of interest - note the period in front of the parentheses, which results in lists
dt <- data.table(iris)

dt_mut <- dt[, `:=`(Sepal.Length_new, Sepal.Length/2)][, `:=`(Sepal.Width_new, Sepal.Width/2)][,
    `:=`(Sepal.Width, NULL)][, .(Sepal.Length_new, Sepal.Width_new)]  # note the new column is added to the end of the data.table, the final operations selects only the columns of interest
pander(head(dt_mut))
Sepal.Length_new Sepal.Width_new
2.55 1.75
2.45 1.5
2.35 1.6
2.3 1.55
2.5 1.8
2.7 1.95

summarise

  • Reduces multiple values to a single summary
df <- iris %>%
    dplyr::summarise(mean = mean(Sepal.Length), n = n())

pander(df)
mean n
5.843 150
dt_sum <- dt[, .(mean = mean(Sepal.Length), n = length(Sepal.Length))]

pander(dt_sum)
mean n
5.843 150

Both dplyr and data.table have tons more features to explore. Checkout dplyr here. And, checkout data.table here. A great post that considers differences in Speed, Memory Usage, Syntax and Features can be found here. Until next time…