Indexing Rows in Data.Table

During the day (and night), I’m a military scientist at Ft. Detrick. I have to tell you…I use R to do everything on a daily basis. Whether it’s a hard core research project, administrative tasks, such as a manpower analysis, financial modeling, or communicating my work to senior leaders with reproducible research documents in RMarkdown, I use R. R adds efficiency, effectiveness and innovation to the way I manage and perform daily tasks. Imagine performing your daily work without the use of Excel. That’s right, you likely need it to be effective. But, R is Excel on Human Growth Hormone, plus a few other super Steroids. The problem is you’ve learned to think about data through an Excel lens (kudos Bill Gates). Anyways, I digress…

One of the most common tasks I’ve run into is having to index a dataset. The package data.table provides powerful tools to address many problems like this. When you have a chance, check out my blog on the data.table package here. Like, Share, and Subscribe!

Solution

  • Use special read-only symbols that can be used in the j position of a data.table: .SD, .N, .I, .GRP, .BY.
  • Use .I to index the data.table

Let’s Try it Out!

Create a Data.Table to Work With

  • First let’s make a data.table with letters in the ‘x’ column.
  • I’m using the set.seed function to ensure it is reproducible if you try this out later.
  • Also, we will use the head function to see the first few rows of data.
  • BTW, most of the time, I use ‘dt’ to name objects that are a data.table, ‘df’ for data.frame, ‘ls’ for list, etc…
set.seed(111)
l <- sample(letters[21:26], 20, TRUE)
dt <- data.table(x = l)
head(dt)

x 1: z 2: w 3: x 4: w 5: u 6: w

  • By the way, you can set ‘letters’ to ‘LETTERS’ to return capitalized versions.
  • Also, change 20:26 to the specific index of each letter in the alphabet to return only the letter you want. For example, 1:5 returns a, b, c, d, and e.

How to Index the Data.Table?

  • You may want to index the whole data.table, especially if you plan to split it up and merge it back together at later time .
  • While you can simple just add dt$index <- 1:length(nrow(dt)) to add an indexed column, data.table provides special read-only symbols that can be used in the j position: .SD, .N, .I, .GRP, .BY.
  • recall that a data.table has similar SQL terminology: dt[i,j,by]
    • i is the row index/filter etc…WHERE
    • j is to SELECT
    • by is to GROUP BY
    • Take dt, subset the rows using ‘i’, then calculate ‘j’ grouped by ‘by
  • Here we will use .I to index in various ways.

How Many Times Does ‘x’ Occur?

  • You may want to determine how many times a given variable occurs in your dataset
  • WHERE x == ‘x’
  • SELECT .I (the indexed number)
  • GROUP BY n/a
dt[x == "x", .I]

[1] 1 2 3

At Which Positions Does ‘x’ Occur?

  • You may want to determine where (i.e. similar to the function which) a given variable occurs in your dataset
    • WHERE x == ‘x’ (inside the j position)
    • SELECT .I (the indexed number)
    • GROUP BY n/a
dt[, .I[x == "x"]]

[1] 3 9 15

At Which Positions Does Each Unique Letter First Occur?

  • You may want to determine the row of the first occurrence for each unique value.
    • WHERE n/a
    • SELECT .I[1] (the indexed first occurrence)
    • GROUP BY x (the column of interest)
dt[, .I[1], by = x]

x V1 1: z 1 2: w 2 3: x 3 4: u 5 5: y 7 6: v 10

At Which Positions Does Each Unique Letter Occur a Second Time?

  • You may want to determine the row of the second occurrence for each unique value.
    • WHERE n/a
    • SELECT .I[2] (the indexed second occurrence)
    • GROUP BY x (the column of interest)
dt[, .I[2], by = x]

x V1 1: z 16 2: w 4 3: x 9 4: u 11 5: y 12 6: v 14

Data.table is a power package to manipulate data and has the backbone to handle big data. Once you transition to thinking about data in terms of dt[i,j,by] you can apply these tools to a lot of your daily tasks.

Until next time…