R Language

Topics related to R Language:

Getting started with R Language

Editing R Docs on Stack Overflow

See the documentation guidelines for general rules when creating documentation.

A few features of R that newcomers from other languages may find unusual

  • Unlike in many other languages, variables in R do not require a type declaration.
  • The same variable can be assigned values of different data types at different times, if required.
  • Indexing of atomic vectors and lists starts from 1, not 0.
  • R arrays (and the special case of matrices) have a dim attribute that sets them apart from plain "atomic vectors", which have no dim attribute.
  • A list in R allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, even other lists, etc. It is not even required that these objects are related to each other in any way.
  • Recycling
  • Missing values
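Several of the points above can be checked directly in a base R session. The following is a minimal sketch; the variable names (x, v, m, stuff) are made up for illustration.

```r
# A variable needs no type declaration and can be rebound to another type.
x <- 42L
class(x)            # "integer"
x <- "forty-two"
class(x)            # "character"

# Indexing starts at 1, not 0.
v <- c(10, 20, 30)
v[1]                # 10

# A matrix is a vector that carries a dim attribute.
m <- matrix(1:6, nrow = 2)
attributes(m)$dim   # 2 3

# A list can hold unrelated objects of different kinds.
stuff <- list(numbers = v, grid = m, label = "anything")
length(stuff)       # 3
```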

Data frames

Reading and writing tabular data in plain-text files (CSV, TSV, etc.)

Pipe operators (%>% and others)

Packages that use %>%

The pipe operator is defined in the magrittr package, but it gained huge visibility and popularity with the dplyr package (which imports the definition from magrittr). Now it is part of tidyverse, which is a collection of packages that "work in harmony because they share common data representations and API design".

The magrittr package also provides several variations of the pipe operator for those who want more flexibility in piping, such as the compound assignment pipe %<>%, the exposition pipe %$%, and the tee operator %T>%. It also provides a suite of alias functions to replace common functions that have special syntax (+, [, [[, etc.) so that they can be easily used within a chain of pipes.

Finding documentation

As with any infix operator (such as +, *, ^, &, %in%), you can find the official documentation if you put it in quotes: ?'%>%' or help('%>%') (assuming you have loaded a package that attaches pkg:magrittr).

Hotkeys

There is a special hotkey in RStudio for the pipe operator: Ctrl+Shift+M (Windows & Linux), Cmd+Shift+M (Mac).

Performance Considerations

While the pipe operator is useful, be aware that there is a negative impact on performance due mainly to the overhead of using it. Consider the following two things carefully when using the pipe operator:

  • Machine performance (loops)
  • Evaluation (object %>% rm() does not remove object)

Linear Models (Regression)

data.table

Installation and support

To install the data.table package:

# install from CRAN
install.packages("data.table")       

# or install development version 
install.packages("data.table", type = "source", repos = "http://Rdatatable.github.io/data.table")

# and to revert from devel to CRAN, the current version must first be removed
remove.packages("data.table")
install.packages("data.table")

The package's official site has wiki pages providing help getting started, and lists of presentations and articles from around the web. Before asking a question -- here on StackOverflow or anywhere else -- please read the support page.

Loading the package

Many of the functions in the examples above exist in the data.table namespace. To use them, you will need to add a line like library(data.table) first or to use their full path, like data.table::fread instead of simply fread. For help on individual functions, the syntax is help("fread") or ?fread. Again, if the package is not loaded, use the full name like ?data.table::fread.

boxplot

Formula

Split function

Creating vectors

Factors

An object with class factor is a vector with a particular set of characteristics.

  1. It is stored internally as an integer vector.
  2. It maintains a levels attribute that shows the character representation of the values.
  3. Its class is stored as factor

To illustrate, let us generate a vector of 1,000 observations from a set of colors.

set.seed(1)
Color <- sample(x = c("Red", "Blue", "Green", "Yellow"), 
                size = 1000, 
                replace = TRUE)
Color <- factor(Color)

We can observe each of the characteristics of Color listed above:

#* 1. It is stored internally as an `integer` vector
typeof(Color)
[1] "integer"
#* 2. It maintains a `levels` attribute that shows the character representation of the values.
#* 3. Its class is stored as `factor`
attributes(Color)
$levels
[1] "Blue"   "Green"  "Red"    "Yellow"

$class
[1] "factor"

The primary advantage of a factor object is efficiency in data storage. An integer requires less memory to store than a character. Such efficiency was highly desirable when many computers had much more limited resources than current machines (for a more detailed history of the motivations behind using factors, see stringsAsFactors: an Unauthorized Biography). The difference in memory use can be seen even in our Color object. As you can see, storing Color as a character requires about 1.7 times as much memory as the factor object.

#* Amount of memory required to store Color as a factor.
object.size(Color)
4624 bytes
#* Amount of memory required to store Color as a character
object.size(as.character(Color))
8232 bytes

Mapping the integer to the level

While the internal computation of factors sees the object as an integer, the desired representation for human consumption is the character level. For example,

head(Color)
[1] Blue   Blue   Green  Yellow Red    Yellow  
Levels: Blue Green Red Yellow

is easier for humans to comprehend than

head(as.numeric(Color))
[1] 1 1 2 4 3 4

An approximate illustration of how R goes about matching the character representation to the internal integer value is:

head(levels(Color)[as.numeric(Color)])
[1] "Blue"   "Blue"   "Green"  "Yellow" "Red"    "Yellow"

Compare these results to

head(Color)
[1] Blue   Blue   Green  Yellow Red    Yellow  
Levels: Blue Green Red Yellow

Modern use of factors

In 2007, R introduced a hashing method for characters that reduced the memory burden of character vectors (ref: stringsAsFactors: an Unauthorized Biography). Take note that when we determined that characters require 1.7 times more storage space than factors, that was calculated in a recent version of R, meaning that the memory use of character vectors was even more taxing before 2007.

Owing to the hashing method in modern R and to far greater memory resources in modern computers, the issue of memory efficiency in storing character values has been reduced to a very small concern. The prevailing attitude in the R Community is a preference for character vectors over factors in most situations. The primary causes for the shift away from factors are

  1. The increase of unstructured and/or loosely controlled character data
  2. The tendency of factors to not behave as desired when the user forgets she is dealing with a factor and not a character

In the first case, it makes no sense to store free text or open response fields as factors, as there will unlikely be any pattern that allows for more than one observation per level. Alternatively, if the data structure is not carefully controlled, it is possible to get multiple levels that correspond to the same category (such as "blue", "Blue", and "BLUE"). In such cases, many prefer to manage these discrepancies as characters prior to converting to a factor (if conversion takes place at all).

In the second case, if the user thinks she is working with a character vector, certain methods may not respond as anticipated. This basic misunderstanding can lead to confusion and frustration while trying to debug scripts and code. While, strictly speaking, this may be considered the fault of the user, most users are happy to avoid using factors and avoid these situations altogether.
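A common instance of this second pitfall is converting a factor whose labels look like numbers. The sketch below uses a made-up factor f; as.numeric returns the internal integer codes rather than the labels.

```r
# Numbers that arrived as text and were turned into a factor.
f <- factor(c("10", "20", "10", "5"))

levels(f)                     # "10" "20" "5"  (sorted as text)

# as.numeric() exposes the internal integer codes, not the labels.
as.numeric(f)                 # 1 2 1 3

# Going through the character labels recovers the intended values.
as.numeric(as.character(f))   # 10 20 10 5
as.numeric(levels(f))[f]      # 10 20 10 5
```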

Pattern Matching and Replacement

Differences from other languages

Escaped regex symbols (like \1) must be escaped a second time (like \\1), not only in the pattern argument, but also in the replacement argument to sub and gsub.

By default, the pattern for all commands (grep, sub, regexpr) is not Perl Compatible Regular Expression (PCRE) so some things like lookarounds are not supported. However, each function accepts a perl=TRUE argument to enable them. See the R Regular Expressions topic for details.
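Both points can be illustrated with base R alone. This is a small sketch with made-up inputs: the first call shows doubled backslashes in the pattern and the replacement, the second shows a lookahead that needs perl = TRUE.

```r
# Swap two captured words: the backreferences \1 and \2 are written
# as "\\1" and "\\2" inside an R string.
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "hello world")
swapped              # "world hello"

# A lookahead is a PCRE feature, switched on with perl = TRUE.
foo_before_bar <- grepl("foo(?=bar)", c("foobar", "foobaz"), perl = TRUE)
foo_before_bar       # TRUE FALSE
```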

Specialized packages

Run-length encoding

Date and Time

Classes

  • POSIXct

    A date-time class, POSIXct stores time as seconds since the UNIX epoch at 1970-01-01 00:00:00 UTC. It is the format returned when pulling the current time with Sys.time().

  • POSIXlt

    A date-time class, stores a list of day, month, year, hour, minute, second, and so on. This is the format returned by strptime.

  • Date

    The only date class, stores the date as a floating-point number of days since 1970-01-01.
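The internal representations of the three classes can be inspected by stripping the class or converting to a number. A minimal sketch with made-up values close to the epoch, so the numbers are easy to read:

```r
# Date: a double counting days since 1970-01-01.
d <- as.Date("1970-01-10")
typeof(d)        # "double"
as.numeric(d)    # 9

# POSIXct: a number counting seconds since the epoch.
ct <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
as.numeric(ct)   # 60

# POSIXlt: a list of calendar components.
lt <- as.POSIXlt(ct)
lt$min           # 1
lt$year          # 70 (years since 1900)
```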

Selecting a date-time format

POSIXct is the sole option in the tidyverse and world of UNIX. It is faster and takes up less memory than POSIXlt.

origin = as.POSIXct("1970-01-01 00:00:00", format ="%Y-%m-%d %H:%M:%S", tz = "UTC")

origin
## [1] "1970-01-01 UTC"

origin + 47
## [1] "1970-01-01 00:00:47 UTC"

as.numeric(origin)     # At epoch
## 0

as.numeric(Sys.time()) # Right now (output as of July 21, 2016 at 11:47:37 EDT)
## 1469116057

posixlt = as.POSIXlt(Sys.time(), format ="%Y-%m-%d %H:%M:%S", tz = "America/Chicago")

# Conversion to POSIXct
posixct = as.POSIXct(posixlt)
posixct

# Accessing components
posixlt$sec   # Seconds 0-61
posixlt$min   # Minutes 0-59
posixlt$hour  # Hour 0-23
posixlt$mday  # Day of the Month 1-31
posixlt$mon   # Months after the first of the year 0-11
posixlt$year  # Years since 1900.

ct = as.POSIXct("2015-05-25")
lt = as.POSIXlt("2015-05-25")

object.size(ct)
# 520 bytes
object.size(lt)
# 1816 bytes

Specialized packages

  • anytime
  • data.table IDate and ITime
  • fasttime
  • lubridate
  • nanotime

Speeding up tough-to-vectorize code

ggplot2

ggplot2 has its own excellent reference website: http://ggplot2.tidyverse.org/.

Most of the time, it is more convenient to adapt the structure or content of the plotted data (e.g. a data.frame) than adjusting things within the plot afterwards.

RStudio publishes a very helpful "Data Visualization with ggplot2" cheatsheet.

Lists

Introduction to Geographical Maps

Base Plotting

The items listed in the "Parameters" section are a small fraction of the possible parameters that can be modified or set by the par function. See par for a more complete list. In addition, all the graphics devices, including the system-specific interactive graphics devices, have a set of parameters that can customize the output.
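A common pattern with par is to save the previous settings, change a few, and restore them afterwards. The sketch below assumes a throwaway pdf(NULL) device so that it also runs non-interactively; the chosen margin and cex values are arbitrary.

```r
# Open a device that writes no file, so nothing is left on disk.
pdf(NULL)

# par() returns the previous values of the parameters it changes.
old <- par(mar = c(4, 4, 1, 1), cex = 0.8)
plot(1:10)

# Restore the saved settings.
par(old)
restored <- par("mar")

dev.off()
```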

Set operations

tidyverse

Rcpp

Random Numbers Generator

String manipulation with stringi package

To install the package, simply run:

install.packages("stringi")

To load it:

require("stringi")

Parallel processing

Subsetting

Missing values:

Missing values (NAs) used in subsetting with [ return NA, since an NA index picks an unknown element and so returns NA in the corresponding element.

The "default" type of NA is "logical" (typeof(NA)), which means that, like any "logical" vector used in subsetting, it will be recycled to match the length of the subsetted object. So x[NA] is equivalent to x[as.logical(NA)], which is equivalent to x[rep_len(as.logical(NA), length(x))] and, consequently, it returns a missing value (NA) for each element of x. As an example:

x <- 1:3
x[NA]
## [1] NA NA NA

While indexing with "numeric"/"integer" NA picks a single NA element (for each NA in index):

x[as.integer(NA)]
## [1] NA

x[c(NA, 1, NA, NA)]
## [1] NA  1 NA NA

Subsetting out of bounds:

The [ operator, with one argument passed, allows indices that are > length(x) and returns NA for atomic vectors or NULL for generic vectors. In contrast, with [[ and when [ is passed more arguments (i.e. subsetting out of bounds objects with length(dim(x)) >= 2) an error is returned:

(1:3)[10]
## [1] NA
(1:3)[[10]]
## Error in (1:3)[[10]] : subscript out of bounds
as.matrix(1:3)[10]
## [1] NA
as.matrix(1:3)[, 10]
## Error in as.matrix(1:3)[, 10] : subscript out of bounds
list(1, 2, 3)[10]
## [[1]]
## NULL
list(1, 2, 3)[[10]]
## Error in list(1, 2, 3)[[10]] : subscript out of bounds

The behaviour is the same when subsetting with "character" vectors, that are not matched in the "names" attribute of the object, too:

c(a = 1, b = 2)["c"]
## <NA> 
##   NA 
list(a = 1, b = 2)["c"]
## $<NA>
## NULL

Help topics:

See ?Extract for further information.

Debugging

Installing packages

Related Docs

Arima Models

The Arima function in the forecast package is more explicit in how it deals with constants, which may make it easier for some users relative to the arima function in base R.

ARIMA is a general framework for modeling and making predictions from time series data using (primarily) the series itself. The purpose of the framework is to differentiate short- and long-term dynamics in a series to improve the accuracy and certainty of forecasts. More poetically, ARIMA models provide a method for describing how shocks to a system transmit through time.

From an econometric perspective, ARIMA elements are necessary to correct serial correlation and ensure stationarity.
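The remark about constants can be seen with base R's arima alone. In this sketch, which uses the lh series from the datasets package, the undifferenced fit reports a term labelled "intercept", while the differenced fit has no constant term.

```r
# lh is a small time series that ships with R's datasets package.
fit_level <- arima(lh, order = c(1, 0, 0))   # AR(1), no differencing
names(coef(fit_level))                       # "ar1" "intercept"

fit_diff <- arima(lh, order = c(1, 1, 0))    # AR(1) on first differences
names(coef(fit_diff))                        # "ar1"
```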

Distribution Functions

There are generally four prefixes:

  • d: the density function for the given distribution
  • p: the cumulative distribution function
  • q: the quantile function, which returns the quantile associated with a given probability
  • r: a random sample from the distribution

For the distributions built into R's base installation, see ?Distributions.
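For the normal distribution, the four prefixes give dnorm, pnorm, qnorm and rnorm. A short sketch (the seed and sample size are arbitrary):

```r
dnorm(0)               # density at 0, equal to 1 / sqrt(2 * pi)
pnorm(0)               # P(X <= 0) = 0.5
qnorm(0.5)             # the median, 0

# p and q are inverses of each other.
qnorm(pnorm(1.96))     # 1.96 (up to floating-point error)

set.seed(42)           # make the random draws reproducible
draws <- rnorm(5)      # five random values from N(0, 1)
length(draws)          # 5
```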

Shiny

spatial analysis

sqldf

Code profiling

Control flow structures

For loops are a flow control method for repeating a task or set of tasks over a domain. The core structure of a for loop is

for ( [index] in [domain]){
  [body]
}

Where

  1. [index] is a name that takes exactly one value of [domain] on each iteration of the loop.
  2. [domain] is a vector of values over which to iterate.
  3. [body] is the set of instructions to apply on each iteration.

As a trivial example, consider the use of a for loop to obtain the cumulative sum of a vector of values.

x <- 1:4
cumulative_sum <- 0
# iterate over the positions of x, so x[i] is valid for any values in x
for (i in seq_along(x)){
  cumulative_sum <- cumulative_sum + x[i]
}
cumulative_sum

Optimizing Structure of For Loops

For loops can be useful for conceptualizing and executing tasks to repeat. If not constructed carefully, however, they can be very slow to execute compared to the preferred use of the apply family of functions. Nonetheless, there are a handful of elements you can include in your for loop construction to optimize the loop. In many cases, good construction of the for loop will yield computational efficiency very close to that of an apply function.

A 'properly constructed' for loop builds on the core structure and includes a statement declaring the object that will capture each iteration of the loop. This object should have both a class and a length declared.

[output] <- [vector_of_length]
for ([index] in [length_safe_domain]){
  [output][index] <- [body]
}

To illustrate, let us write a loop to square each value in a numeric vector (this is a trivial example for illustration only. The 'correct' way of completing this task would be x_squared <- x^2).

x <- 1:100
x_squared <- vector("numeric", length = length(x))
for (i in seq_along(x)){
  x_squared[i] <- x[i]^2
}

Again, notice that we first declared a receptacle for the output x_squared, and gave it the class "numeric" with the same length as x. Additionally, we declared a "length safe domain" using the seq_along function. seq_along generates a vector of indices for an object that is suited for use in for loops. While it seems intuitive to use for (i in 1:length(x)), if x has length 0 the domain becomes 1:0, that is c(1, 0), so the loop runs twice over indices that do not exist instead of not running at all, which leads to errors or wrong results.
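The difference between the two domains is easy to observe on an empty vector. A minimal sketch, with hypothetical counters runs_colon and runs_safe recording how often each loop body executes:

```r
x <- numeric(0)        # an empty vector

1:length(x)            # 1 0
seq_along(x)           # integer(0)

runs_colon <- 0
for (i in 1:length(x)) runs_colon <- runs_colon + 1

runs_safe <- 0
for (i in seq_along(x)) runs_safe <- runs_safe + 1

c(runs_colon, runs_safe)   # 2 0
```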

Receptacle objects and length safe domains are handled internally by the apply family of functions and users are encouraged to adopt the apply approach in place of for loops as much as possible. However, if properly constructed, a for loop may occasionally provide greater code clarity with minimal loss of efficiency.

Vectorizing For Loops

For loops can often be a useful tool in conceptualizing the tasks that need to be completed within each iteration. When the loop is completely developed and conceptualized, there may be advantages to turning the loop into a function.

In this example, we will develop a for loop to calculate the mean of each column in the mtcars dataset (again, a trivial example as it could be accomplished via the colMeans function).

column_mean_loop <- vector("numeric", length(mtcars))
for (k in seq_along(mtcars)){
  column_mean_loop[k] <- mean(mtcars[[k]])
}

The for loop can be converted to an apply function by rewriting the body of the loop as a function.

col_mean_fn <- function(x) mean(x)
column_mean_apply <- vapply(mtcars, col_mean_fn, numeric(1))

And to compare the results:

identical(column_mean_loop, 
          unname(column_mean_apply)) #* vapply added names to the elements
                                     #* remove them for comparison

The advantage of the vectorized form is that we were able to eliminate a few lines of code. The mechanics of determining the length and type of the output object and iterating over a length safe domain are handled for us by the apply function. Additionally, the apply function is a little bit faster than the loop. The difference in speed is often negligible in human terms, depending on the number of iterations and the complexity of the body.

Column wise operation

JSON

RODBC

lubridate

To install package from CRAN:

install.packages("lubridate")

To install development version from Github:

library(devtools)
# dev mode allows testing of development packages in a sandbox, without interfering
# with the other packages you have installed.
dev_mode(on = TRUE)
install_github("hadley/lubridate")
dev_mode(on = FALSE)

To get vignettes on lubridate package:

vignette("lubridate")

To get help about some function foo:

help(foo)     # help about function foo
?foo          # same thing

# Example
# help("is.period")
# ?is.period

To get examples for a function foo:

example("foo")

# Example
# example("interval")

Time Series and Forecasting

strsplit function

Web scraping and parsing

Generalized linear models

Reshaping data between long and wide forms

Helpful packages

RMarkdown and knitr presentation

Sub options parameters:

sub-option | description | formats (html, pdf, word, odt, rtf, md, github, ioslides, slidy, beamer)
citation_package | The LaTeX package to process citations: natbib, biblatex or none | XXX
code_folding | Let readers toggle the display of R code: "none", "hide", or "show" | X
colortheme | Beamer color theme to use | X
css | CSS file to use to style document | XXX
dev | Graphics device to use for figure output (e.g. "png") | XXXXXXX
duration | Add a countdown timer (in minutes) to footer of slides | X
fig_caption | Should figures be rendered with captions? | XXXXXXX
fig_height, fig_width | Default figure height and width (in inches) for document | XXXXXXXXXX
highlight | Syntax highlighting: "tango", "pygments", "kate", "zenburn", "textmate" | XXXXX
includes | File of content to place in document (in_header, before_body, after_body) | XXXXXXXX
incremental | Should bullets appear one at a time (on presenter mouse clicks)? | XXX
keep_md | Save a copy of .md file that contains knitr output | XXXXXX
keep_tex | Save a copy of .tex file that contains knitr output | XX
latex_engine | Engine to render LaTeX: "pdflatex", "xelatex", or "lualatex" | XX
lib_dir | Directory of dependency files to use (Bootstrap, MathJax, etc.) | XXX
mathjax | Set to local or a URL to use a local/URL version of MathJax to render | XXX
md_extensions | Markdown extensions to add to default definition of R Markdown | XXXXXXXXXX
number_sections | Add section numbering to headers | XX
pandoc_args | Additional arguments to pass to Pandoc | XXXXXXXXXX
preserve_yaml | Preserve YAML front matter in final document? | X
reference_docx | docx file whose styles should be copied when producing docx output | X
self_contained | Embed dependencies into the doc | XXX
slide_level | The lowest heading level that defines individual slides | X
smaller | Use the smaller font size in the presentation? | X
smart | Convert straight quotes to curly, dashes to em-dashes, ... to ellipses, etc. | XXX
template | Pandoc template to use when rendering file | XXXXX
theme | Bootswatch or Beamer theme to use for page | XX
toc | Add a table of contents at start of document | XXXXXXX
toc_depth | The lowest level of headings to add to table of contents | XXXXXX
toc_float | Float the table of contents to the left of the main content | X

Scope of variables

The most common pitfall with scope arises in parallelization. All variables and functions must be passed into a new environment that is run on each thread.

Performing a Permutation Test

xgboost

R code vectorization best practices

Missing values

Missing values are represented by the symbol NA (not available). Impossible values (e.g., as a result of sqrt(-1)) are represented by the symbol NaN (not a number).
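The two symbols behave slightly differently under the test functions is.na and is.nan. A brief sketch with made-up values:

```r
x <- c(1, NA, 3)

is.na(x)                 # FALSE TRUE FALSE
mean(x)                  # NA: a missing value propagates
mean(x, na.rm = TRUE)    # 2: drop missing values first

y <- 0 / 0               # NaN
is.nan(y)                # TRUE
is.na(y)                 # TRUE: NaN also counts as missing
is.nan(NA)               # FALSE: NA is not NaN

# Comparing with == yields NA, so use is.na() to test for missingness.
NA == NA                 # NA
```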

Hierarchical Linear Modeling

Classes

There are several functions for inspecting the "type" of an object. The most useful such function is class, although sometimes it is necessary to examine the mode of an object. Since we are discussing "types", one might think that typeof would be useful, but generally the result from mode will be more useful, because objects with no explicit "class"-attribute will have function dispatch determined by the "implicit class" determined by their mode.
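The three functions can disagree for the same object. The sketch below defines a hypothetical helper, inspect, that reports all three side by side.

```r
# Hypothetical helper: report class, mode and typeof for one object.
inspect <- function(obj) {
  c(class = class(obj)[1], mode = mode(obj), typeof = typeof(obj))
}

inspect(1:3)                  # "integer"    "numeric"  "integer"
inspect(data.frame(a = 1))    # "data.frame" "list"     "list"
inspect(sum)                  # "function"   "function" "builtin"
```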

Introspection

*apply family of functions (functionals)

A function in the *apply family is an abstraction of a for loop. Compared with the for loops *apply functions have the following advantages:

  1. They require less code to write.
  2. They don't need an iteration counter.
  3. They don't use temporary variables to store intermediate results.

However, for loops are more general and give us more control, making it possible to carry out complex computations that are not always trivial to do using *apply functions.

The relationship between for loops and *apply functions is explained in the documentation for for loops.

Members of the *apply Family

The *apply family of functions contains several variants of the same principle that differ based primarily on the kind of output they return.

function | Input | Output
apply | matrix, data.frame, or array | vector or matrix (depending on the length of each element returned)
sapply | vector or list | vector or matrix (depending on the length of each element returned)
lapply | vector or list | list
vapply | vector or list | vector or matrix (depending on the length of each element returned) of the user-designated class
mapply | multiple vectors, lists or a combination | vector or matrix when the results can be simplified, otherwise a list

See "Examples" to see how each of these functions is used.
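As a quick orientation, here is one call per family member on small made-up inputs, with the kind of result each returns:

```r
m <- matrix(1:6, nrow = 2)

col_sums <- apply(m, 2, sum)                            # vector: 3 7 11
squares_list <- lapply(1:3, function(i) i^2)            # list of 1, 4, 9
squares_vec <- sapply(1:3, function(i) i^2)             # vector: 1 4 9
squares_chk <- vapply(1:3, function(i) i^2, numeric(1)) # vector, type checked
pair_sums <- mapply(function(a, b) a + b, 1:3, 4:6)     # vector: 5 7 9
```

vapply fails with an error if a result does not match the declared template (here numeric(1)), which makes it the safer choice inside functions.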

Text mining

ANOVA

Raster and Image Analysis

Survival analysis

Fault-tolerant/resilient code

tryCatch

tryCatch returns the value associated to executing expr unless there's a condition: a warning or an error. If that's the case, specific return values (e.g. return(NA) above) can be specified by supplying a handler function for the respective conditions (see arguments warning and error in ?tryCatch). These can be functions that already exist, but you can also define them within tryCatch (as we did above).
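A minimal sketch of this pattern, using a hypothetical helper safe_log rather than the URL-reading example this section discusses. Both handlers are defined inline and return NA.

```r
# Hypothetical helper: return NA instead of warning or failing.
safe_log <- function(x) {
  tryCatch(
    log(x),
    warning = function(w) NA_real_,   # e.g. log(-1) warns "NaNs produced"
    error   = function(e) NA_real_    # e.g. log("a") is an error
  )
}

safe_log(1)      # 0
safe_log(-1)     # NA
safe_log("a")    # NA
```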

Implications of choosing specific return values of the handler functions

As we've specified that NA should be returned in case of an error in the "try part", the third element in y is NA. If we had chosen NULL as the return value, the third element of y would be NULL; lapply keeps such elements, so y would still have length 3, but the element disappears if y is later flattened with unlist. Also note that if you don't specify an explicit return value via return, the handler functions will return NULL (i.e. in case of an error or a warning condition).

"Undesired" warning message

When the third element of our urls vector hits our function, we get the following warning in addition to the fact that an error occurs (readLines first complains that it can't open the connection via a warning before actually failing with an error):

Warning message:
    In file(con, "r") : cannot open file 'I'm no URL': No such file or directory

An error "wins" over a warning, so we're not really interested in the warning in this particular case. Thus we have set warn = FALSE in readLines, but that doesn't seem to have any effect. An alternative way to suppress the warning is to use

suppressWarnings(readLines(con = url))

instead of

readLines(con = url, warn = FALSE)

Reproducible R

Updating R and the package library

Fourier Series and Transformations

The Fourier transform decomposes a function of time (a signal) into the frequencies that make it up, similarly to how a musical chord can be expressed as the amplitude (or loudness) of its constituent notes. The Fourier transform of a function of time itself is a complex-valued function of frequency, whose absolute value represents the amount of that frequency present in the original function, and whose complex argument is the phase offset of the basic sinusoid in that frequency.

The Fourier transform is called the frequency domain representation of the original signal. The term Fourier transform refers to both the frequency domain representation and the mathematical operation that associates the frequency domain representation to a function of time. The Fourier transform is not limited to functions of time, but in order to have a unified language, the domain of the original function is commonly referred to as the time domain. For many functions of practical interest one can define an operation that reverses this: the inverse Fourier transformation, also called Fourier synthesis, of a frequency domain representation combines the contributions of all the different frequencies to recover the original function of time.

Linear operations performed in one domain (time or frequency) have corresponding operations in the other domain, which are sometimes easier to perform. The operation of differentiation in the time domain corresponds to multiplication by the frequency, so some differential equations are easier to analyze in the frequency domain. Also, convolution in the time domain corresponds to ordinary multiplication in the frequency domain. Concretely, this means that any linear time-invariant system, such as an electronic filter applied to a signal, can be expressed relatively simply as an operation on frequencies. So significant simplification is often achieved by transforming time functions to the frequency domain, performing the desired operations, and transforming the result back to time.

Harmonic analysis is the systematic study of the relationship between the frequency and time domains, including the kinds of functions or operations that are "simpler" in one or the other, and has deep connections to almost all areas of modern mathematics.

Functions that are localized in the time domain have Fourier transforms that are spread out across the frequency domain and vice versa. The critical case is the Gaussian function, of substantial importance in probability theory and statistics as well as in the study of physical phenomena exhibiting normal distribution (e.g., diffusion), which with appropriate normalizations goes to itself under the Fourier transform. Joseph Fourier introduced the transform in his study of heat transfer, where Gaussian functions appear as solutions of the heat equation.

The Fourier transform can be formally defined as an improper Riemann integral, making it an integral transform, although this definition is not suitable for many applications requiring a more sophisticated integration theory.

For example, many relatively simple applications use the Dirac delta function, which can be treated formally as if it were a function, but the justification requires a mathematically more sophisticated viewpoint. The Fourier transform can also be generalized to functions of several variables on Euclidean space, sending a function of 3-dimensional space to a function of 3-dimensional momentum (or a function of space and time to a function of 4-momentum).

This idea makes the spatial Fourier transform very natural in the study of waves, as well as in quantum mechanics, where it is important to be able to represent wave solutions as functions of either space or momentum and sometimes both. In general, functions to which Fourier methods are applicable are complex-valued, and possibly vector-valued. Still further generalization is possible to functions on groups, which, besides the original Fourier transform on ℝ or ℝⁿ (viewed as groups under addition), notably includes the discrete-time Fourier transform (DTFT, group = ℤ), the discrete Fourier transform (DFT, group = ℤ mod N) and the Fourier series or circular Fourier transform (group = S¹, the unit circle ≈ closed finite interval with endpoints identified). The latter is routinely employed to handle periodic functions. The Fast Fourier transform (FFT) is an algorithm for computing the DFT.
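In R, the DFT is available through the base function fft. The sketch below builds a made-up signal (a sine wave with 4 cycles over 32 samples), finds the dominant frequency bin, and recovers the signal with the inverse transform.

```r
n <- 32
i <- 0:(n - 1)
signal <- sin(2 * pi * 4 * i / n)    # 4 cycles over the sample window

spectrum <- fft(signal)
magnitude <- Mod(spectrum)           # amount of each frequency present

# Bin k + 1 holds frequency k, so 4 cycles peak at index 5.
peak <- which.max(magnitude[1:(n / 2)])
peak                                 # 5

# R's inverse transform is unnormalized: divide by n to recover the signal.
recovered <- Re(fft(spectrum, inverse = TRUE)) / n
all.equal(recovered, signal)         # TRUE
```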

.Rprofile

dplyr

dplyr is an iteration of plyr that provides flexible "verb"-based functions to manipulate data in R. The latest version of dplyr can be downloaded from CRAN using

install.packages("dplyr")

The key object in dplyr is a tbl, a representation of a tabular data structure. Currently dplyr (version 0.5.0) supports:

  • data frames
  • data tables
  • SQLite
  • PostgreSQL/Redshift
  • MySQL/MariaDB
  • Bigquery
  • MonetDB
  • data cubes with arrays (partial implementation)

caret

Extracting and Listing Files in Compressed Archives

Probability Distributions with R

R in LaTeX with knitr

Knitr is a tool that allows us to interweave natural language (in the form of LaTeX) and source code (in the form of R). In general, the concept of interspersing natural language and source code is called literate programming. Since knitr files contain a mixture of LaTeX (traditionally housed in .tex files) and R (traditionally housed in .R files) a new file extension called R noweb (.Rnw) is required. .Rnw files contain a mixture of LaTeX and R code.

Knitr allows for the generation of statistical reports in PDF format and is a key tool for achieving reproducible research.

Compiling .Rnw files to a PDF is a two-step process. First, we need to execute the R code and capture the output in a format that a LaTeX compiler can understand (a process called 'knitting'). We do this using the knitr package. The command for this is shown below, assuming you have installed the knitr package:

Rscript -e "library(knitr); knit('r-noweb-file.Rnw')"

This will generate a normal .tex file (called r-noweb-file.tex in this example) which can then be turned into a PDF file using:

pdflatex r-noweb-file.tex

Web Crawling in R

Arithmetic Operators

Nearly all operators in R are really functions. For example, + is a function defined as function (e1, e2) .Primitive("+") where e1 is the left-hand side of the operator and e2 is the right-hand side of the operator. This means it is possible to accomplish rather counterintuitive effects by masking the + in base with a user defined function.

For example:

`+` <- function(e1, e2) {e1-e2}

> 3+10
[1] -7
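A less drastic way to use the same fact, assuming a fresh session in which no operators have been masked: operators can be called by name, passed to other functions, and new infix operators can be defined with the %name% syntax. The %+% operator below is a made-up example.

```r
# Call an operator like any other function.
`*`(3, 10)               # 30

# Pass an operator as an argument.
sapply(1:3, `-`, 1)      # 0 1 2

# Define a new infix operator (hypothetical: join two strings with a space).
`%+%` <- function(lhs, rhs) paste(lhs, rhs)
"pipe" %+% "dream"       # "pipe dream"
```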

Creating reports with RMarkdown

GPU-accelerated computing

heatmap and heatmap.2

Network analysis with the igraph package

Functional programming

Get user input

roxygen2

Hashmaps

Spark API (SparkR)

The SparkR package lets you work with distributed data frames on top of a Spark cluster. These allow you to perform operations such as selection, filtering, and aggregation on very large datasets. See the SparkR overview and the SparkR package documentation for details.

Meta: Documentation Guidelines

I/O for foreign tables (Excel, SAS, SPSS, Stata)

I/O for database tables

Specialized packages

I/O for geographic data (shapefiles, etc.)

I/O for raster images

I/O for R's binary format

Reading and writing strings

Input and output

To construct file paths, for reading or writing, use file.path.

Use dir to see what files are in a directory.
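A small round trip using both functions. The directory and file names ("io-demo", "notes.txt") are made up, and everything is written under tempdir() so nothing is left behind after the session.

```r
# Build paths piece by piece; file.path inserts the separator.
demo_dir <- file.path(tempdir(), "io-demo")
dir.create(demo_dir, showWarnings = FALSE)
out_file <- file.path(demo_dir, "notes.txt")

# Write two lines, then list the directory and read the file back.
writeLines(c("first line", "second line"), out_file)
dir(demo_dir)            # "notes.txt"
contents <- readLines(out_file)
contents                 # "first line" "second line"
```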

Recycling

What is recycling in R

Recycling is when an object is automatically extended in certain operations to match the length of another, longer object.

For example, the vectorised addition results in the following:

c(1,2,3) + c(1,2,3,4,5,6)  
[1] 2 4 6 5 7 9

Because of the recycling, the operation that actually happened was:

c(1,2,3,1,2,3) + c(1,2,3,4,5,6)

In cases where the longer object is not a multiple of the shorter one, a warning message is presented:

c(1,2,3) + c(1,2,3,4,5,6,7)
[1] 2 4 6 5 7 9 8
Warning message:
In c(1, 2, 3) + c(1, 2, 3, 4, 5, 6, 7) :
  longer object length is not a multiple of shorter object length

Another example of recycling:

matrix(nrow =5, ncol = 2, 1:5 )
     [,1] [,2]
[1,]    1    1
[2,]    2    2
[3,]    3    3
[4,]    4    4
[5,]    5    5
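The recycling rule can be made explicit with rep_len, which repeats a vector up to a requested length. The following sketch, with illustrative variable names, checks that the implicit and explicit forms agree and shows two other common places where recycling happens.

```r
short <- c(1, 2, 3)
long  <- c(1, 2, 3, 4, 5, 6)

# Implicit recycling versus the explicit equivalent
implicit <- short + long
explicit <- rep_len(short, length(long)) + long
print(identical(implicit, explicit))

# A length-one vector is recycled across every element
doubled <- long * 2

# A logical index is recycled too: c(FALSE, TRUE) keeps every second element
every_second <- long[c(FALSE, TRUE)]
print(every_second)
```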

Expression: parse + eval

The function parse converts text and files into expressions.

The function eval evaluates expressions.
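A short sketch of the two functions working together. The environment env and the variable x are hypothetical names used only to show that eval can look up variables in a chosen environment.

```r
# parse() turns text into an unevaluated expression object
expr <- parse(text = "1 + 2 * 3")
expr_class <- class(expr)

# eval() evaluates the expression
result <- eval(expr)

# eval() can evaluate in a specific environment
env <- new.env()
assign("x", 10, envir = env)
squared <- eval(parse(text = "x^2"), envir = env)

print(c(result, squared))
```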

Regular Expressions (regex)

Character classes

  • "[AB]" could be A or B
  • "[[:alpha:]]" could be any letter
  • "[[:lower:]]" stands for any lower-case letter. Note that "[a-z]" is close but doesn't match, e.g., ú.
  • "[[:upper:]]" stands for any upper-case letter. Note that "[A-Z]" is close but doesn't match, e.g., Ú.
  • "[[:digit:]]" stands for any digit: 0, 1, 2, ..., or 9 and is equivalent to "[0-9]".
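The classes above can be tried out with grepl, which reports for each string whether the pattern matches anywhere in it. This sketch uses only ASCII input, since matching of accented letters can depend on the locale.

```r
x <- c("A1", "b2", "C", "3")

has_a_or_b <- grepl("[AB]", x)          # only "A1" contains A or B
has_alpha  <- grepl("[[:alpha:]]", x)   # strings containing a letter
has_lower  <- grepl("[[:lower:]]", x)   # strings containing a lower-case letter
has_upper  <- grepl("[[:upper:]]", x)   # strings containing an upper-case letter
has_digit  <- grepl("[[:digit:]]", x)   # strings containing a digit

print(data.frame(x, has_a_or_b, has_alpha, has_lower, has_upper, has_digit))
```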

Quantifiers

+, * and ? apply as usual in regex: + matches one or more times, * matches zero or more times, and ? matches zero or one time.
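A small sketch of the three quantifiers applied to the letter b. The patterns are wrapped in ^ and $ so that the whole string has to match, which makes the difference between the quantifiers visible.

```r
x <- c("ac", "abc", "abbc")

one_or_more  <- grepl("^ab+c$", x)   # at least one b
zero_or_more <- grepl("^ab*c$", x)   # any number of b, including none
zero_or_one  <- grepl("^ab?c$", x)   # at most one b

print(rbind(one_or_more, zero_or_more, zero_or_one))
```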

Start and end of line indicators

You can specify the position of the regex in the string:

  • "^..." forces the regular expression to be at the beginning of the string
  • "...$" forces the regular expression to be at the end of the string
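The two anchors can be illustrated with a few example strings (chosen arbitrarily for this sketch):

```r
x <- c("apple pie", "pineapple", "apple")

starts_with <- grepl("^apple", x)    # "apple" at the beginning
ends_with   <- grepl("apple$", x)    # "apple" at the end
exact       <- grepl("^apple$", x)   # the whole string is "apple"

print(rbind(starts_with, ends_with, exact))
```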

Differences from other languages

Please note that regular expressions in R often look ever-so-slightly different from regular expressions used in other languages.

  • R requires double-backslash escapes (because "\" is already the escape character in R strings), so where most regular expression engines match whitespace with \s, R needs \\s.

  • UTF-8 characters in R should be escaped with a capital U, e.g. [\U{1F600}] and [\U1F600] match 😀, whereas in, e.g., Ruby, this would be matched with a lower-case u.
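The double-backslash rule can be checked directly: the R string "\\s" holds just two characters, a backslash and an s, and that is what the regex engine receives. A minimal sketch with an arbitrary input string:

```r
x <- "a b\tc"   # contains a space and a tab

# "\\s" in R source is the two-character regex \s
pattern_length <- nchar("\\s")

# Replace every whitespace character with an underscore
replaced <- gsub("\\s", "_", x)

print(replaced)
```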

Additional Resources

The site reg101 is a good place to test a regular expression online before using it in an R script.

The R Programming wikibook has a page dedicated to text processing with many examples using regular expressions.

Combinatorics

Pivot and unpivot with data.table

Much of what goes into conditioning data to build models or visualizations can be accomplished with data.table. Compared to other options, data.table offers advantages in speed and flexibility.

Inspecting packages

Solving ODEs in R

Note that the rates of change must be returned in the same order as the state variables are specified. In the example "The Lorenz model", this means that in the function "Lorenz" the command

return(list(c(dX, dY, dZ)))

has the same order as the definition of the state variables

yini <- c(X = 1, Y = 1, Z = 1)
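The ordering requirement can be sketched in base R without running a solver. The derivative function below follows the (t, state, parms) convention used by the deSolve package; the parameter names and values (sigma, rho, beta) are the textbook Lorenz constants and are an assumption here, not taken from the example referred to above.

```r
# Assumed parameter values (classic Lorenz constants), for illustration only
parms <- c(sigma = 10, rho = 28, beta = 8 / 3)

# State variables, in the order X, Y, Z
yini <- c(X = 1, Y = 1, Z = 1)

# Derivative function: returns a list whose first element holds the rates
Lorenz <- function(t, state, parms) {
  with(as.list(c(state, parms)), {
    dX <- sigma * (Y - X)
    dY <- X * (rho - Z) - Y
    dZ <- X * Y - beta * Z
    # Same order as yini: X, then Y, then Z
    list(c(dX = dX, dY = dY, dZ = dZ))
  })
}

# Evaluate the rates at the initial state and compare the ordering
rates <- Lorenz(0, yini, parms)[[1]]
same_order <- identical(sub("^d", "", names(rates)), names(yini))
print(rates)

# With deSolve installed, the function would be passed on to the solver, e.g.
# deSolve::ode(y = yini, times = seq(0, 10, by = 0.01), func = Lorenz, parms = parms)
```

Naming the elements of the returned vector is optional, but it makes a mismatch with the state vector easy to spot.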

Feature Selection in R -- Removing Extraneous Features

Bibliography in RMD

  • The purpose of this documentation is to integrate an academic bibliography into an RMD file.

  • To use the documentation given above, you have to install rmarkdown in R via install.packages("rmarkdown").

  • Sometimes Rmarkdown removes the hyperlinks of the citations. The solution for this is adding the following code to your YAML header: link-citations: true

  • The bibliography may have any of these formats:

Format          File extension
MODS            .mods
BibLaTeX        .bib
BibTeX          .bibtex
RIS             .ris
EndNote         .enl
EndNote XML     .xml
ISI             .wos
MEDLINE         .medline
Copac           .copac
JSON citeproc   .json

Writing functions in R

Color schemes for graphics

Hierarchical clustering with hclust

Random Forest Algorithm

Bar Chart

Cleaning data

RESTful R Services

Machine learning

Variables

The Date class

Related topics

Jumbled notes

  • Date: Stores time as the number of days since the UNIX epoch (1970-01-01), with negative values for earlier dates.
  • The value is intended to be a whole number of days, but this is not enforced in the internal representation (it is usually stored as a double).
  • Dates are always printed following the rules of the current Gregorian calendar, even for dates before that calendar was in use.
  • It doesn't keep track of time zones, so it should not be used to truncate the time out of POSIXct or POSIXlt objects.
  • Sys.Date() returns an object of class Date.
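A few of these notes can be verified directly. This is a minimal sketch with arbitrary example dates:

```r
# Days since the epoch: 1970-01-10 is nine days after 1970-01-01
d <- as.Date("1970-01-10")
days_since_epoch <- as.numeric(d)

# Dates before the epoch are stored as negative numbers
before_epoch <- as.numeric(as.Date("1969-12-31"))

# Sys.Date() returns a Date object; the underlying storage is a double
today_class <- class(Sys.Date())
storage     <- typeof(d)

print(c(days_since_epoch, before_epoch))
print(c(today_class, storage))
```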

More notes

The logical class

Shorthand

TRUE, FALSE and NA are the only values for logical vectors, and all three are reserved words. T and F can be used as shorthand for TRUE and FALSE in a clean R session, but neither T nor F is reserved, so assigning non-default values to those names can set users up for difficulties.
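The difference between the reserved words and the shorthand can be sketched as follows. The reassignment of T is kept inside local() so it does not leak into the rest of the session.

```r
# T is an ordinary variable, so it can be reassigned (here only locally)
shadowed <- local({
  T <- FALSE
  if (T) "yes" else "no"
})

# TRUE is a reserved word: trying to assign to it raises an error
assign_error <- tryCatch(
  eval(parse(text = "TRUE <- FALSE")),
  error = function(e) conditionMessage(e)
)

# A bare NA is logical
na_class <- class(NA)

print(shadowed)
print(assign_error)
```

Writing TRUE and FALSE in full avoids the problem entirely.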

The character class

Numeric classes and storage modes

Matrices

Date-time classes (POSIXct and POSIXlt)

Pitfalls

With POSIXct, midnight will display only the date and time zone, though the full time is still stored.
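A short sketch of this pitfall, using an arbitrary date and the UTC time zone. An explicit format string brings the time components back.

```r
midnight <- as.POSIXct("2024-03-01 00:00:00", tz = "UTC")

# Default formatting drops the time when it is exactly midnight
default_text <- format(midnight)

# An explicit format shows that the time is still stored
full_text <- format(midnight, "%Y-%m-%d %H:%M:%S")

# One second later, the default formatting includes the time again
later_text <- format(midnight + 1)

print(c(default_text, full_text, later_text))
```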

Related topics

Specialized packages

  • lubridate

Using texreg to export models in a paper-ready way

Links

Publishing

Implement State Machine Pattern using S4 Class

Reshape using tidyr

Modifying strings by substitution

Non-standard evaluation and standard evaluation

Randomization

Users who are coming from other programming languages may be confused by the lack of a rand function equivalent to what they may have experienced before. Basic random number generation is done using the r* family of functions for each distribution (see the link above). Random numbers drawn uniformly from a range can be generated using runif, for "random uniform". Since this also looks suspiciously like "run if", it is often hard to figure out for new R users.
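As a minimal sketch of the r* family, the following draws from a uniform and a normal distribution and samples from a finite set. The seed value and the ranges are arbitrary choices for illustration.

```r
# Fix the seed so the draws are reproducible
set.seed(42)
first_draw <- runif(3)
set.seed(42)
second_draw <- runif(3)
reproducible <- identical(first_draw, second_draw)

# Five uniform numbers between 0 and 10
u <- runif(5, min = 0, max = 10)

# Three draws from the standard normal distribution
n <- rnorm(3)

# Ten rolls of a six-sided die
rolls <- sample(1:6, 10, replace = TRUE)

print(u)
print(rolls)
```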

Object-Oriented Programming in R

Regular Expression Syntax in R

Coercion

Standardize analyses by writing standalone R scripts

To represent the standard input/output channels, use the functions file("stdin") (input from the terminal or from another program via a pipe), stdout() (standard output) and stderr() (standard error). Note that while there is a function stdin(), it cannot be used when supplying a ready-made script to R, because it will read the next lines of that script instead of user input.
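A sketch of how a standalone script might use these channels. The function name shout and the file names in the comment are hypothetical; the logic is written against a generic connection so it can be demonstrated here with a textConnection instead of real standard input.

```r
# Read all lines from an input connection, write them upper-cased to an
# output connection, and report progress on standard error
shout <- function(con_in, con_out = stdout()) {
  lines <- readLines(con_in)
  writeLines(toupper(lines), con = con_out)
  writeLines(sprintf("processed %d line(s)", length(lines)), con = stderr())
  invisible(length(lines))
}

# In a script run as `Rscript shout.R < input.txt` the call would be:
# shout(file("stdin"))

# Demonstration with an in-memory connection
demo_input <- textConnection(c("hello", "world"))
n_read <- shout(demo_input)
close(demo_input)
```

Keeping diagnostics on stderr() means the output on stdout() can be piped into another program without extra filtering.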

Analyze tweets with R

Natural language processing

Using pipe assignment in your own package %<>%: How to ?

R Markdown Notebooks (from RStudio)

Updating R version

Aggregating data frames

Data acquisition

R memento by examples

Creating packages with devtools