data.table

Topics related to data.table:

Getting started with data.table

Data.table is a package for the R statistical computing environment. It extends the functionality of data frames from base R, particularly improving on their performance and syntax. A number of related tasks, including rolling and non-equi joins, are handled in a consistent concise syntax like DT[where, select|update|do, by].

A number of complementary functions are also included in the package:

  • I/O: fread/fwrite
  • Reshaping: melt/dcast/rbindlist/split
  • Runs of values: rleid

Adding and modifying columns

The official vignette, "Reference semantics", is the best introduction to this topic.

A reminder: DT[where, select|update|do, by] syntax is used to work with columns of a data.table.

  • The "where" part is the i argument
  • The "select|update|do" part is the j argument

These two arguments are usually passed by position instead of by name.

All modifications to columns can be done in j. Additionally, the set function is available for this use.

Creating a data.table

A data.table is an enhanced version of the data.frame class from base R. As such, its class() attribute is the vector "data.table" "data.frame" and functions that work on a data.frame will also work with a data.table. There are many ways to create, load or coerce to a data.table, as seen here.

Subsetting rows by group

A reminder: DT[where, select|update|do, by] syntax is used to work with columns of a data.table.

  • The "where" part is the i argument
  • The "select|update|do" part is the j argument

These two arguments are usually passed by position instead of by name.

Computing summary statistics

A reminder: DT[where, select|update|do, by] syntax is used to work with columns of a data.table.

  • The "where" part is the i argument
  • The "select|update|do" part is the j argument

These two arguments are usually passed by position instead of by name.

Using .SD and .SDcols for the subset of data

A reminder: DT[where, select|update|do, by] syntax is used to work with columns of a data.table.

  • The "where" part is the i argument
  • The "select|update|do" part is the j argument

These two arguments are usually passed by position instead of by name.

Reshaping, stacking and splitting

The official vignette, "Efficient reshaping using data.tables", is the best introduction to this topic.

Many reshaping tasks require moving between long and wide formats:

  • Wide data is data with each column representing a seperate variable, and rows representing seperate observations
  • Long data is data with the form ID | variable | value, where each row representing a observation-variable pair

Using list columns to store data

In case it looks weird that we're talking about lists without using that word in the code, note that .() is an alias for list() when used inside a DT[...] call.

Joins and merges

Working with keyed tables

If x & i have a key or x is keyed to match i's first few columns, then the on can be skipped like x[i].

Disambiguating column names in common

In j of x[i, on, j], columns of i can be referred with i.* prefixes.

Grouping on subsets

In j of x[i, on, j, by=.EACHI], j is computed for each row of i.

This is the only value of by worth using. For any other value, columns of i are not available.

Using keys and indices

The official vignettes are the best introduction to this topic:

Keys vs indices

A data.table can be "keyed" by a sequence of columns, telling interested functions that the data is sorted by those columns. To get or set the key, use the functions documented at ?key.

Similarly, functions can take advantage of a data.table's "indices." Each index -- and a table can have more than one -- stores information about the order of the data with respect a sequence of columns. Like a key, an index can speed up certain tasks. To get or set indices, use the functions documented at ?indices.

Indices may also be set automatically (currently only for a single column at a time). See ?datatable.optimize for details on how this works and how to disable it if necessary.

Verification and updating

Missing values are allowed in a key column.

Keys and indices are stored as attributes and may, by accident, not correspond to the actual order of data in the table. Many functions check the validity of the key or index before using it, but it's worth keeping in mind.

Keys and indices are removed after updates where it's not obvious that sort order is preserved. For example, starting from DT = data.table(a=c(1,2,4), key="a"), if we update like DT[2, a := 3], the key is broken.

Cleaning data

Why is my old code not working?