Getting started with R LanguageData framesReading and writing tabular data in plain-text files (CSV, TSV, etc.)Pipe operators (%>% and others)Linear Models (Regression)data.tableboxplotFormulaSplit functionCreating vectorsFactorsPattern Matching and ReplacementRun-length encodingDate and TimeSpeeding up tough-to-vectorize codeggplot2ListsIntroduction to Geographical MapsBase PlottingSet operationstidyverseRcppRandom Numbers GeneratorString manipulation with stringi packageParallel processingSubsettingDebuggingInstalling packagesArima ModelsDistribution FunctionsShinyspatial analysissqldfCode profilingControl flow structuresColumn wise operationJSONRODBClubridateTime Series and Forecastingstrsplit functionWeb scraping and parsingGeneralized linear modelsReshaping data between long and wide formsRMarkdown and knitr presentationScope of variablesPerforming a Permutation TestxgboostR code vectorization best practicesMissing valuesHierarchical Linear ModelingClassesIntrospection*apply family of functions (functionals)Text miningANOVARaster and Image AnalysisSurvival analysisFault-tolerant/resilient codeReproducible RUpdating R and the package libraryFourier Series and Transformations.RprofiledplyrcaretExtracting and Listing Files in Compressed ArchivesProbability Distributions with RR in LaTeX with knitrWeb Crawling in RArithmetic OperatorsCreating reports with RMarkdownGPU-accelerated computingheatmap and heatmap.2Network analysis with the igraph packageFunctional programmingGet user inputroxygen2HashmapsSpark API (SparkR)Meta: Documentation GuidelinesI/O for foreign tables (Excel, SAS, SPSS, Stata)I/O for database tablesI/O for geographic data (shapefiles, etc.)I/O for raster imagesI/O for R's binary formatReading and writing stringsInput and outputRecyclingExpression: parse + evalRegular Expressions (regex)CombinatoricsPivot and unpivot with data.tableInspecting packagesSolving ODEs in RFeature Selection in R -- Removing Extraneous FeaturesBibliography in RMDWriting functions in RColor schemes for graphicsHierarchical clustering with hclustRandom Forest AlgorithmBar ChartCleaning dataRESTful R ServicesMachine learningVariablesThe Date classThe logical classThe character classNumeric classes and storage modesMatricesDate-time classes (POSIXct and POSIXlt)Using texreg to export models in a paper-ready wayPublishingImplement State Machine Pattern using S4 ClassReshape using tidyrModifying strings by substitutionNon-standard evaluation and standard evaluationRandomizationObject-Oriented Programming in RRegular Expression Syntax in RCoercionStandardize analyses by writing standalone R scriptsAnalyze tweets with RNatural language processingUsing pipe assignment in your own package %<>%: How to ?R Markdown Notebooks (from RStudio)Updating R versionAggregating data framesData acquisitionR memento by examplesCreating packages with devtools

Pivot and unpivot with data.table

Other topics

Remarks:

Much of what goes into conditioning data to build models or visualizations can be accomplished with data.table. As compare to other options, data.table offers advantages of speed and flexibility.

Pivot and unpivot tabular data with data.table - I

Convert from wide form to long form

Load data USArrests from datasets.

data("USArrests")
head(USArrests)

           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7

Use ?USArrests to find out more. First, convert to data.table. The names of states are row names in the original data.frame.

library(data.table)
DT <- as.data.table(USArrests, keep.rownames=TRUE)

This is data in the wide form. It has a column for each variable. The data can also be stored in long form without loss of information. The long form has one column that stores the variable names. Then, it has another column for the variable values. The long form of USArrests looks like so.

            State    Crime  Rate
  1:       Alabama   Murder  13.2
  2:        Alaska   Murder  10.0
  3:       Arizona   Murder   8.1
  4:      Arkansas   Murder   8.8
  5:    California   Murder   9.0
 ---                             
196:      Virginia     Rape  20.7
197:    Washington     Rape  26.2
198: West Virginia     Rape   9.3
199:     Wisconsin     Rape  10.8
200:       Wyoming     Rape  15.6

We use the melt function to switch from wide form to long form.

DTm <- melt(DT)
names(DTm) <- c("State", "Crime", "Rate")

By default, melt treats all columns with numeric data as variables with values. In USArrests, the variable UrbanPop represents the percentage urban population of a state. It is different from the other variabes, Murder, Assault and Rape, which are violent crimes reported per 100,000 people. Suppose we want to retain UrbanPop column. We achieve this by setting id.vars as follows.

DTmu <- melt(DT, id.vars=c("rn", "UrbanPop" ), 
             variable.name='Crime', value.name = "Rate")
names(DTmu)[1] <- "State"

Note that we have specified the names of the column containing category names (Murder, Assault, etc.) with variable.name and the column containing the values with value.name. Our data looks like so.

             State UrbanPop  Crime Rate
  1:       Alabama       58 Murder 13.2
  2:        Alaska       48 Murder 10.0
  3:       Arizona       80 Murder  8.1
  4:      Arkansas       50 Murder  8.8
  5:    California       91 Murder  9.0

Generating summaries with with split-apply-combine style approach is a breeze. For example, to summarize violent crimes by state?

DTmu[, .(ViolentCrime = sum(Rate)), by=State]

This gives:

        State ViolentCrime
1:    Alabama        270.4
2:     Alaska        317.5
3:    Arizona        333.1
4:   Arkansas        218.3
5: California        325.6
6:   Colorado        250.6

Pivot and unpivot tabular data with data.table - II

Convert from long form to wide form

To recover data from the previous example, use dcast like so.

DTc <- dcast(DTmu, State + UrbanPop ~ Crime)

This gives the data in the original wide form.

             State UrbanPop Murder Assault Rape
 1:        Alabama       58   13.2     236 21.2
 2:         Alaska       48   10.0     263 44.5
 3:        Arizona       80    8.1     294 31.0
 4:       Arkansas       50    8.8     190 19.5
 5:     California       91    9.0     276 40.6

Here, the formula notation is used to specify the columns that form a unique record identifier (LHS) and the column containing category labels for new column names (RHS). Which column to use for the numeric values? By default, dcast uses the first column with numerical values left over when from the formula specification. To make explicit, use the parameter value.var with column name.

When the operation produces a list of values in each cell, dcast provides a fun.aggregate method to handle the situation. Say I am interested in states with similar urban population when investigating crime rates. I add a column Decile with computed information.

DTmu[, Decile := cut(UrbanPop, quantile(UrbanPop, probs = seq(0, 1, by=0.1)))]
levels(DTmu$Decile) <- paste0(1:10, "D")

Now, casting Decile ~ Crime produces multiple values per cell. I can use fun.aggregate to determine how these are handled. Both text and numerical values can be handle this way.

dcast(DTmu, Decile ~ Crime, value.var="Rate", fun.aggregate=sum)

This gives:

dcast(DTmu, Decile ~ Crime, value.var="Rate", fun.aggregate=mean)

This gives:

             State UrbanPop  Crime Rate Decile
  1:       Alabama       58 Murder 13.2     4D
  2:        Alaska       48 Murder 10.0     2D
  3:       Arizona       80 Murder  8.1     8D
  4:      Arkansas       50 Murder  8.8     2D
  5:    California       91 Murder  9.0    10D

There are multiple states in each decile of the urban population. Use fun.aggregate to specify how these should be handled.

dcast(DTmu, Decile ~ Crime, value.var="Rate", fun.aggregate=sum)

This sums over the data for like states, giving the following.

    Decile Murder Assault  Rape
 1:     1D   39.4     808  62.6
 2:     2D   35.3     815  94.3
 3:     3D   22.6     451  67.7
 4:     4D   54.9     898 106.0
 5:     5D   42.4     758 107.6 

Syntax:

  • Melt with melt(DT, id.vars=c(..), variable.name="CategoryLabel", value.name="Value")
  • Cast with dcast(DT, LHS ~ RHS, value.var="Value", fun.aggregate=sum)

Parameters:

ParameterDetails
id.varstell melt which columns to retain
variable.nametell melt what to call the column with category labels
value.nametell melt what to call the column that has values associated with category labels
value.vartell dcast where to find the values to cast in columns
formulatell dcast which columns to retain to form a unique record identifier (LHS) and which one holds the category labels (RHS)
fun.aggregatespecify the function to use when the casting operation generates a list of values in each cell

Contributors

Topic Id: 6934

Example Ids: 23437,23438

This site is not affiliated with any of the contributors.