data.table is a package for the R statistical computing environment. It extends the functionality of data frames from base R, particularly improving on their performance and syntax. A number of related tasks, including rolling and non-equi joins, are handled in a consistent, concise syntax like DT[where, select|update|do, by].
A number of complementary functions are also included in the package:
- fread/fwrite
- melt/dcast/rbindlist/split
- rleid
The package's official wiki has some essential materials:
As a new user, you will want to check out the vignettes, FAQ and cheat sheet.
Before asking a question -- here on StackOverflow or anywhere else -- please read the support page.
For help on individual functions, the syntax is help("fread") or ?fread. If the package has not been loaded, use the full name like ?data.table::fread.
Install the stable release from CRAN:
install.packages("data.table")
Or the development version from GitHub:
install.packages("data.table", type = "source",
repos = "http://Rdatatable.github.io/data.table")
To revert from devel to CRAN, the current version must first be removed:
remove.packages("data.table")
install.packages("data.table")
Visit the website for full installation instructions and the latest version numbers.
Usually you will want to load the package and all of its functions with a line like
library(data.table)
If you only need one or two functions, you can refer to them like data.table::fread instead.
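As a small illustration of the namespace-qualified form, the sketch below writes and reads a CSV without calling library(). The data frame and the temporary file are made up for the example.

```r
# Hypothetical example data; nothing here comes from a real dataset.
df <- data.frame(id = 1:3, value = c(2.5, 3.5, 4.5))

# A temporary file, used only for this illustration.
path <- tempfile(fileext = ".csv")

# Call the functions through the namespace instead of attaching the package.
data.table::fwrite(df, path)
dt <- data.table::fread(path)

# Clean up the temporary file.
unlink(path)
```

fread returns a data.table, so dt can be used with the DT[...] syntax once the package is attached.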
The DT[where, select|update|do, by] syntax is used to work with columns of a data.table.
- The "where" part is the i argument.
- The "select|update|do" part is the j argument.
These two arguments are usually passed by position instead of by name.
A sequence of steps can be chained like DT[...][...].
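A minimal sketch of i, j, by and chaining, assuming a made-up table with columns grp and x:

```r
library(data.table)

# Hypothetical example data
DT <- data.table(
  grp = c("a", "a", "b", "b", "c"),
  x   = c(1, 2, 3, 4, 5)
)

# i filters rows ("where"), j computes ("select|update|do"), by groups.
# i and j are passed by position; by is passed by name.
res <- DT[x > 1, .(total = sum(x)), by = grp]

# Chaining: each further [...] operates on the result of the previous one.
# Here the grouped result is sorted by decreasing total and the first row kept.
top <- DT[x > 1, .(total = sum(x)), by = grp][order(-total)][1]
```

With this data, res has one row per group and top is the single row for group "b".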
Functions and symbols available inside DT[...]:
Function or symbol | Notes |
---|---|
.() | in several arguments, replaces list() |
J() | in i, replaces list() |
:= | in j, a function used to add or modify columns |
.N | in i, the total number of rows; in j, the number of rows in a group |
.I | in j, the vector of row numbers in the table (filtered by i) |
.SD | in j, the current subset of the data, with columns selected by the .SDcols argument |
.GRP | in j, the current index of the subset of the data |
.BY | in j, the list of by values for the current subset of data |
V1, V2, ... | default names for unnamed columns created in j |
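The following sketch exercises several of the symbols from the table on a made-up table. The column names grp, x and y are assumptions chosen for the example.

```r
library(data.table)

# Hypothetical example data
DT <- data.table(
  grp = c("a", "a", "b", "b", "b"),
  x   = 1:5,
  y   = c(10, 20, 30, 40, 50)
)

# .N in i: the total number of rows, so this selects the last row
last_row <- DT[.N]

# .N in j: the number of rows in each group
counts <- DT[, .N, by = grp]

# .SD with .SDcols: apply a function to the selected columns per group
means <- DT[, lapply(.SD, mean), by = grp, .SDcols = c("x", "y")]

# .I in j: row numbers in the table; the unnamed result column is called V1
first_i <- DT[, .I[1], by = grp]

# := adds a column by reference; .GRP is the index of the current group
DT[, g := .GRP, by = grp]
```

Note that the := step modifies DT in place, so no assignment with <- is needed.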
Joins inside DT[...]:
Notation | Notes |
---|---|
DT1[DT2, on, j] | join two tables |
i.* | special prefix on DT2's columns after the join |
by=.EACHI | special option available only with a join |
DT1[!DT2, on, j] | anti-join two tables |
DT1[DT2, on, roll, j] | join two tables, rolling on the last column in on= |
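A sketch of the join notations above, using three made-up tables (orders, prices and items are hypothetical names, as are their columns):

```r
library(data.table)

# Hypothetical tables: orders, a price list that changes over time, and items
orders <- data.table(item = c("apple", "pear", "apple"),
                     day  = c(2L, 5L, 9L),
                     qty  = c(3L, 1L, 2L))
prices <- data.table(item  = c("apple", "apple", "pear"),
                     day   = c(1L, 8L, 1L),
                     price = c(1.0, 1.5, 2.0))
items  <- data.table(item = c("apple", "kiwi"))

# Join: for each row of items, look up the matching rows of orders.
# Unmatched rows of items ("kiwi") are kept with NA in the other columns.
joined <- orders[items, on = "item"]

# by = .EACHI: evaluate j once for each row of items
per_item <- orders[items, on = "item", .(total_qty = sum(qty)), by = .EACHI]

# Anti-join: rows of orders that have no match in items
unmatched <- orders[!items, on = "item"]

# Rolling join: match on item exactly and roll on day, the last column in on=,
# so each order picks up the most recent price on or before its day
priced <- prices[orders, on = .(item, day), roll = TRUE]

# The i. prefix refers to columns of the table passed in i (here, orders)
cost <- prices[orders, on = .(item, day), roll = TRUE,
               .(item, cost = price * i.qty)]
```

The result of a join has one row per match for each row of the table in i, in the order of that table.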
Notation | Notes |
---|---|
melt(DT, id.vars, measure.vars) | transform to long format; for multiple columns, use measure.vars = patterns(...) |
dcast(DT, formula) | transform to wide format |
rbind(DT1, DT2, ...) | stack enumerated data.tables |
rbindlist(DT_list, idcol) | stack a list of data.tables |
split(DT, by) | split a data.table into a list |
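A short round trip through the reshaping functions, assuming a made-up wide table with one id column and two score columns:

```r
library(data.table)

# Hypothetical wide table
wide <- data.table(id = 1:2, score_a = c(10, 20), score_b = c(30, 40))

# Wide to long: one row per id and measured column
long <- melt(wide, id.vars = "id", measure.vars = c("score_a", "score_b"),
             variable.name = "test", value.name = "score")

# Long back to wide: one column per value of test
back <- dcast(long, id ~ test, value.var = "score")

# Split into a named list of data.tables, one per value of test
parts <- split(long, by = "test")

# Stack the list again; idcol stores the list names in a new column
stacked <- rbindlist(parts, idcol = "source")
```

Here variable.name, value.name and value.var are optional arguments used only to give the columns readable names.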
Function(s) | Notes |
---|---|
foverlaps | overlap joins |
merge | another way of joining two tables |
set | another way of adding or modifying columns |
fintersect, fsetdiff, funion, fsetequal, unique, duplicated, anyDuplicated | set-theory operations with rows as elements |
CJ | the Cartesian product of vectors |
uniqueN | the number of distinct rows |
rowidv(DT, cols) | row ID (1 to .N) within each group determined by cols |
rleidv(DT, cols) | run ID (1 to the number of runs), where a run is a stretch of consecutive rows with equal values in cols |
shift(DT, n) | apply a shift operator to every column |
setorder , setcolorder , setnames , setkey , setindex , setattr | modify attributes and order by reference |
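A few of these helpers in action on a made-up table. The sketch uses the vector forms rleid, rowid and shift inside j, and takes a copy before reordering so that the original table is left untouched.

```r
library(data.table)

# Hypothetical example data
DT <- data.table(grp = c("a", "a", "b", "a", "a"), x = c(5, 3, 4, 1, 2))

DT[, run := rleid(grp)]        # run ID: changes whenever grp changes
DT[, idx := rowid(grp)]        # row ID within each value of grp
DT[, prev_x := shift(x, 1)]    # previous value of x (lag by one row)

# Number of distinct rows, considering only the grp column
n_groups <- uniqueN(DT, by = "grp")

# Cartesian product of two vectors
grid <- CJ(g = c("a", "b"), k = 1:2)

# set() is another way to add or modify a column by reference
set(DT, j = "x2", value = DT$x * 2)

# setorder() sorts by reference, so sort a copy to keep DT as it is
sorted <- setorder(copy(DT), grp, -x)
```

Functions whose names start with set modify their input in place, which is why copy() is used before setorder() here.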
Features | Notes |
---|---|
IDate and ITime | integer dates and times |
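A brief sketch of the integer date and time classes; the date and time values are arbitrary examples.

```r
library(data.table)

# Arbitrary example values
d  <- as.IDate("2024-03-15")
tm <- as.ITime("13:45:30")

# Both classes are stored as integers: days since 1970-01-01 for IDate,
# seconds since midnight for ITime
storage <- c(typeof(d), typeof(tm))

# Date arithmetic keeps the IDate class
next_day <- d + 1L

# Accessor functions extract the components
parts <- c(year = year(d), month = month(d), hour = hour(tm), minute = minute(tm))
```

Integer storage is what makes these classes convenient for grouping and joining on dates and times.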