arrays
(and the special case of matrices) have a dim
attribute that sets them apart from R's "atomic vectors" which have no attributes.
Note that exporting to a plain text format sacrifices much of the information encoded in the data, such as variable classes, for the sake of wide portability. For cases that do not require such portability, a format like .RData or Feather may be more useful.
Input/output for other types of files is covered in several other topics, all linked from Input and output.
%>%
The pipe operator is defined in the magrittr
package, but it gained huge visibility and popularity with the dplyr
package (which imports the definition from magrittr
). Now it is part of tidyverse
, which is a collection of packages that "work in harmony because they share common data representations and API design".
The magrittr
package also provides several variations of the pipe operator for those who want more flexibility in piping, such as the compound assignment pipe %<>%
, the exposition pipe %$%
, and the tee operator %T>%
. It also provides a suite of alias functions to replace common functions that have special syntax (+
, [
, [[
, etc.) so that they can be easily used within a chain of pipes.
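A brief sketch of the pipe variants and alias functions mentioned above may help. It assumes the magrittr package is installed (the code is skipped otherwise), and the variable names are illustrative only.

```r
# Assumes magrittr is installed; the example is skipped otherwise.
if (requireNamespace("magrittr", quietly = TRUE)) {
  library(magrittr)

  # Basic pipe: the left-hand side becomes the first argument on the right.
  basic <- c(1, 4, 9) %>% sqrt() %>% sum()

  # Compound assignment pipe: pipe x through sort() and assign the result back to x.
  x <- c(3, 1, 2)
  x %<>% sort()

  # Exposition pipe: expose the columns of a data frame by name.
  avg_mpg <- mtcars %$% mean(mpg)

  # Tee operator: call print() for its side effect, then pass the original value on.
  total <- c(1, 2, 3) %T>% print() %>% sum()

  # Alias functions stand in for operators with special syntax:
  # add() is `+` and extract() is `[`.
  shifted <- 1:3 %>% add(10) %>% extract(2)
}
```

Here `basic` and `total` are both 6, `x` is sorted in place, and `shifted` is the second element of `11:13`.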
As with any infix operator (such as +
, *
, ^
, &
, %in%
), you can find the official documentation if you put it in quotes: ?'%>%'
or help('%>%')
(assuming you have loaded a package that attaches magrittr
).
There is a special hotkey in RStudio for the pipe operator: Ctrl+Shift+M
(Windows & Linux), Cmd+Shift+M
(Mac).
While the pipe operator is useful, be aware that it carries a small performance cost, mainly from the overhead of the extra function calls involved. Consider the following two things carefully when using the pipe operator:
object %>% rm()
does not remove object
)
To install the data.table package:
# install from CRAN
install.packages("data.table")
# or install development version
install.packages("data.table", type = "source", repos = "http://Rdatatable.github.io/data.table")
# and to revert from devel to CRAN, the current version must first be removed
remove.packages("data.table")
install.packages("data.table")
The package's official site has wiki pages providing help getting started, and lists of presentations and articles from around the web. Before asking a question -- here on StackOverflow or anywhere else -- please read the support page.
Many of the functions in the examples above exist in the data.table namespace. To use them, you will need to add a line like library(data.table)
first or to use their full path, like data.table::fread
instead of simply fread
. For help on individual functions, the syntax is help("fread")
or ?fread
. Again, if the package is not loaded, use the full name like ?data.table::fread
.
An object with class factor
is a vector with a particular set of characteristics.
1. It is stored internally as an integer vector.
2. It maintains a levels attribute that shows the character representation of the values.
3. Its class is stored as factor.
To illustrate, let us generate a vector of 1,000 observations from a set of colors.
set.seed(1)
Color <- sample(x = c("Red", "Blue", "Green", "Yellow"),
size = 1000,
replace = TRUE)
Color <- factor(Color)
We can observe each of the characteristics of Color
listed above:
#* 1. It is stored internally as an `integer` vector
typeof(Color)
[1] "integer"
#* 2. It maintains a `levels` attribute that shows the character representation of the values.
#* 3. Its class is stored as `factor`
attributes(Color)
$levels
[1] "Blue"   "Green"  "Red"    "Yellow"

$class
[1] "factor"
The primary advantage of a factor object is efficiency in data storage. An integer requires less memory to store than a character. Such efficiency was highly desirable when many computers had much more limited resources than current machines (for a more detailed history of the motivations behind using factors, see stringsAsFactors
: an Unauthorized Biography). The difference in memory use can be seen even in our Color
object. As you can see, storing Color
as a character requires about 1.7 times as much memory as the factor object.
#* Amount of memory required to store Color as a factor.
object.size(Color)
4624 bytes
#* Amount of memory required to store Color as a character
object.size(as.character(Color))
8232 bytes
While the internal computation of factors sees the object as an integer, the desired representation for human consumption is the character level. For example,
head(Color)
[1] Blue   Blue   Green  Yellow Red    Yellow
Levels: Blue Green Red Yellow
is easier for a human to read than
head(as.numeric(Color))
[1] 1 1 2 4 3 4
An approximate illustration of how R goes about matching the character representation to the internal integer value is:
head(levels(Color)[as.numeric(Color)])
[1] "Blue" "Blue" "Green" "Yellow" "Red" "Yellow"
Compare these results to
head(Color)
[1] Blue   Blue   Green  Yellow Red    Yellow
Levels: Blue Green Red Yellow
In 2007, R introduced a hashing method for character strings that reduced the memory burden of character vectors (ref: stringsAsFactors
: an Unauthorized Biography). Take note that when we determined that characters require 1.7 times more storage space than factors, that was calculated in a recent version of R, meaning that the memory use of character vectors was even more taxing before 2007.
Owing to the hashing method in modern R and to far greater memory resources in modern computers, the issue of memory efficiency in storing character values has been reduced to a very small concern. The prevailing attitude in the R Community is a preference for character vectors over factors in most situations. The primary causes for the shift away from factors are
In the first case, it makes no sense to store free text or open response fields as factors, as there will unlikely be any pattern that allows for more than one observation per level. Alternatively, if the data structure is not carefully controlled, it is possible to get multiple levels that correspond to the same category (such as "blue", "Blue", and "BLUE"). In such cases, many prefer to manage these discrepancies as characters prior to converting to a factor (if conversion takes place at all).
In the second case, if the user thinks she is working with a character vector, certain methods may not respond as anticipated. This basic misunderstanding can lead to confusion and frustration while trying to debug scripts and code. While, strictly speaking, this may be considered the fault of the user, most users are happy to avoid using factors and avoid these situations altogether.
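A common instance of a method not responding as anticipated is numeric conversion. The sketch below uses a made-up factor whose labels look like numbers; the variable names are illustrative.

```r
# A factor whose labels happen to look like numbers
f <- factor(c("10", "20", "10", "30"))

# as.numeric() returns the internal integer codes, not the labels
codes <- as.numeric(f)

# Converting through character recovers the values the labels represent
values <- as.numeric(as.character(f))
```

`codes` holds 1, 2, 1, 3 (positions in the levels), while `values` holds 10, 20, 10, 30.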
Escaped regex symbols (like \1
) must be escaped a second time (like \\1
), not only in the pattern
argument, but also in the replacement
to sub
and gsub
.
By default, the pattern for these functions (grep, sub, regexpr) is not interpreted as a Perl Compatible Regular Expression (PCRE), so some features, such as lookarounds, are not supported. However, each function accepts a perl=TRUE
argument to enable them. See the R Regular Expressions topic for details.
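Both points can be seen in a short sketch; the input strings are invented for illustration.

```r
x <- c("2016-07-21", "price: 100 USD")

# Backreferences in the replacement are written "\\1", "\\2", ... in an R string.
swapped <- sub("(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1", x[1])

# A lookahead needs perl = TRUE: match digits only when followed by " USD".
has_usd_amount <- grepl("\\d+(?= USD)", x, perl = TRUE)
```

`swapped` is "21/07/2016", and `has_usd_amount` is FALSE for the date and TRUE for the price string.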
A run is a consecutive sequence of repeated values or observations. For repeated values, R's "run-length encoding" concisely describes a vector in terms of its runs. Consider:
dat <- c(1, 2, 2, 2, 3, 1, 4, 4, 1, 1)
We have a length-one run of 1s; then a length-three run of 2s; then a length-one run of 3s; and so on. R's run-length encoding captures all the lengths and values of a vector's runs.
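The encoding of the vector above can be inspected with base R's rle, and reversed with inverse.rle:

```r
dat <- c(1, 2, 2, 2, 3, 1, 4, 4, 1, 1)

r <- rle(dat)
r$lengths  # 1 3 1 1 2 2
r$values   # 1 2 3 1 4 1

# inverse.rle() rebuilds the original vector from the encoding
restored <- inverse.rle(r)
```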
A run can also refer to consecutive observations in tabular data. While R doesn't have a natural way of encoding these, they can be handled with rleid
from the data.table package.
POSIXct: a date-time class that stores time as seconds since the UNIX epoch on 1970-01-01 00:00:00 UTC
. It is the format returned when pulling the current time with Sys.time()
.
POSIXlt: a date-time class that stores a list of day, month, year, hour, minute, second, and so on. This is the format returned by strptime
.
Date: the only date class; it stores the date as the number of days since 1970-01-01, held as a floating-point number.
POSIXct is the sole option in the tidyverse and world of UNIX. It is faster and takes up less memory than POSIXlt.
origin = as.POSIXct("1970-01-01 00:00:00", format ="%Y-%m-%d %H:%M:%S", tz = "UTC")
origin
## [1] "1970-01-01 UTC"
origin + 47
## [1] "1970-01-01 00:00:47 UTC"
as.numeric(origin) # At epoch
## 0
as.numeric(Sys.time()) # Right now (output as of July 21, 2016 at 11:47:37 EDT)
## 1469116057
posixlt = as.POSIXlt(Sys.time(), format ="%Y-%m-%d %H:%M:%S", tz = "America/Chicago")
# Conversion to POSIXct
posixct = as.POSIXct(posixlt)
posixct
# Accessing components
posixlt$sec # Seconds 0-61
posixlt$min # Minutes 0-59
posixlt$hour # Hour 0-23
posixlt$mday # Day of the Month 1-31
posixlt$mon # Months after the first of the year 0-11
posixlt$year # Years since 1900.
ct = as.POSIXct("2015-05-25")
lt = as.POSIXlt("2015-05-25")
object.size(ct)
# 520 bytes
object.size(lt)
# 1816 bytes
ggplot2
has its own perfect reference website http://ggplot2.tidyverse.org/.
Most of the time, it is more convenient to adapt the structure or content of the plotted data (e.g. a data.frame
) than to adjust things within the plot afterwards.
RStudio publishes a very helpful "Data Visualization with ggplot2" cheatsheet that can be found here.
The items listed in the "Parameters" section are a small fraction of the possible parameters that can be modified or set by the par
function. See par
for a more complete list. In addition, all the graphics devices, including the system-specific interactive graphics devices, have their own sets of parameters that can customize the output.
A set contains only one copy of each distinct element. Unlike some other programming languages, base R does not have a dedicated data type for sets. Instead, R treats a vector like a set by taking only its distinct elements. This applies to the set operators, setdiff
, intersect
, union
, setequal
and %in%
. For v %in% S
, only S
is treated as a set, however, not the vector v
.
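A small sketch of these operators on vectors with duplicates (the vectors are arbitrary):

```r
a <- c(1, 2, 2, 3)
b <- c(3, 4, 4)

u <- union(a, b)                        # 1 2 3 4
i <- intersect(a, b)                    # 3
d <- setdiff(a, b)                      # 1 2
same <- setequal(c(1, 1, 2), c(2, 1))   # TRUE: duplicates and order are ignored

# %in% keeps every element of its left-hand side, duplicates included
membership <- a %in% b                  # FALSE FALSE FALSE TRUE
```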
For a true set data type in R, the Rcpp package provides some options.
To install the package, simply run:
install.packages("stringi")
To load it:
require("stringi")
Parallelization on remote machines requires libraries to be installed on each machine. Prefer package::function()
calls. Several packages have parallelization natively built-in, including caret
, pls
and plyr
.
Microsoft R Open (Revolution R) also uses multi-threaded BLAS/LAPACK libraries, which intrinsically parallelize many common functions.
Missing values:
Missing values (NA
s) used in subsetting with [
return NA
since a NA
index
picks an unknown element and so returns NA in the corresponding element.
The "default" type of NA
is "logical" (typeof(NA)
) which means that, like any "logical" vector used in subsetting, it will be recycled to match the length of the subsetted object. So x[NA]
is equivalent to x[as.logical(NA)]
which is equivalent to x[rep_len(as.logical(NA), length(x))]
and, consequently, it returns a missing value (NA
) for each element of x
. As an example:
x <- 1:3
x[NA]
## [1] NA NA NA
While indexing with "numeric"/"integer" NA
picks a single NA
element (for each NA
in index):
x[as.integer(NA)]
## [1] NA
x[c(NA, 1, NA, NA)]
## [1] NA 1 NA NA
Subsetting out of bounds:
The [
operator, with one argument passed, allows indices that are > length(x)
and returns NA
for atomic vectors or NULL
for generic vectors. In contrast, with [[
and when [
is passed more arguments (i.e. when subsetting out of bounds on objects with length(dim(x)) >= 2
) an error is returned:
(1:3)[10]
## [1] NA
(1:3)[[10]]
## Error in (1:3)[[10]] : subscript out of bounds
as.matrix(1:3)[10]
## [1] NA
as.matrix(1:3)[, 10]
## Error in as.matrix(1:3)[, 10] : subscript out of bounds
list(1, 2, 3)[10]
## [[1]]
## NULL
list(1, 2, 3)[[10]]
## Error in list(1, 2, 3)[[10]] : subscript out of bounds
The behaviour is the same when subsetting with "character" vectors that are not matched in the "names" attribute of the object:
c(a = 1, b = 2)["c"]
## <NA>
## NA
list(a = 1, b = 2)["c"]
## $<NA>
## NULL
Help topics:
See ?Extract
for further information.
The Arima
function in the forecast package is more explicit in how it deals with constants, which may make it easier for some users relative to the arima
function in base R.
ARIMA is a general framework for modeling and making predictions from time series data using (primarily) the series itself. The purpose of the framework is to differentiate short- and long-term dynamics in a series to improve the accuracy and certainty of forecasts. More poetically, ARIMA models provide a method for describing how shocks to a system transmit through time.
From an econometric perspective, ARIMA elements are necessary to correct serial correlation and ensure stationarity.
There are generally four prefixes:
For the distributions built into R's base installation, see ?Distributions
.
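As an illustration with the normal distribution, the prefixes documented in ?Distributions are d (density), p (cumulative probability), q (quantile) and r (random generation). The variable names below are illustrative.

```r
dens <- dnorm(0)        # density of the standard normal at 0
prob <- pnorm(1.96)     # P(X <= 1.96), about 0.975
quant <- qnorm(0.975)   # quantile for probability 0.975, about 1.96

set.seed(42)            # fix the seed so the draws are repeatable
draws <- rnorm(5)       # five random draws
```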
For loops are a flow control method for repeating a task or set of tasks over a domain. The core structure of a for loop is
for ( [index] in [domain]){
[body]
}
Where
[index]
is a name that takes exactly one value of [domain]
over each iteration of the loop.
[domain]
is a vector of values over which to iterate.
[body]
is the set of instructions to apply on each iteration.
As a trivial example, consider the use of a for loop to obtain the cumulative sum of a vector of values.
x <- 1:4
cumulative_sum <- 0
for (i in x){
# i takes each value of x in turn, so add i itself rather than x[i]
cumulative_sum <- cumulative_sum + i
}
cumulative_sum
For loops can be useful for conceptualizing and executing tasks to repeat. If not constructed carefully, however, they can be very slow to execute compared to the preferred use of the apply
family of functions. Nonetheless, there are a handful of elements you can include in your for loop construction to optimize the loop. In many cases, good construction of the for loop will yield computational efficiency very close to that of an apply function.
A 'properly constructed' for loop builds on the core structure and includes a statement declaring the object that will capture each iteration of the loop. This object should have both a class and a length declared.
[output] <- [vector_of_length]
for ([index] in [length_safe_domain]){
[output][index] <- [body]
}
To illustrate, let us write a loop to square each value in a numeric vector (this is a trivial example for illustration only. The 'correct' way of completing this task would be x_squared <- x^2
).
x <- 1:100
x_squared <- vector("numeric", length = length(x))
for (i in seq_along(x)){
x_squared[i] <- x[i]^2
}
Again, notice that we first declared a receptacle for the output x_squared
, and gave it the class "numeric" with the same length as x
. Additionally, we declared a "length safe domain" using the seq_along
function. seq_along
generates a vector of indices for an object that is suited for use in for loops. While it seems intuitive to use for (i in 1:length(x))
, if x
has 0 length, the loop will iterate over the domain of 1:0
, that is c(1, 0), so the body runs twice instead of not at all, which typically produces wrong results or an error.
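The difference between the two domains can be checked directly on a zero-length vector; the counters below are illustrative.

```r
x <- numeric(0)

# 1:length(x) is 1:0, i.e. c(1, 0), so the body runs twice
n_colon <- 0
for (i in 1:length(x)) n_colon <- n_colon + 1

# seq_along(x) is integer(0), so the body never runs
n_seq_along <- 0
for (i in seq_along(x)) n_seq_along <- n_seq_along + 1
```

After running this, `n_colon` is 2 and `n_seq_along` is 0.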
Receptacle objects and length safe domains are handled internally by the apply
family of functions and users are encouraged to adopt the apply
approach in place of for loops as much as possible. However, if properly constructed, a for loop may occasionally provide greater code clarity with minimal loss of efficiency.
For loops can often be a useful tool in conceptualizing the tasks that need to be completed within each iteration. When the loop is completely developed and conceptualized, there may be advantages to turning the loop into a function.
In this example, we will develop a for loop to calculate the mean of each column in the mtcars
dataset (again, a trivial example as it could be accomplished via the colMeans
function).
column_mean_loop <- vector("numeric", length(mtcars))
for (k in seq_along(mtcars)){
column_mean_loop[k] <- mean(mtcars[[k]])
}
The for loop can be converted to an apply function by rewriting the body of the loop as a function.
col_mean_fn <- function(x) mean(x)
column_mean_apply <- vapply(mtcars, col_mean_fn, numeric(1))
And to compare the results:
identical(column_mean_loop,
unname(column_mean_apply)) #* vapply added names to the elements
#* remove them for comparison
The advantage of the vectorized form is that we were able to eliminate a few lines of code. The mechanics of determining the length and type of the output object and iterating over a length safe domain are handled for us by the apply function. Additionally, the apply function is a little bit faster than the loop. The difference in speed is often negligible in human terms, depending on the number of iterations and the complexity of the body.
To install package from CRAN:
install.packages("lubridate")
To install development version from Github:
library(devtools)
# dev mode allows testing of development packages in a sandbox, without interfering
# with the other packages you have installed.
dev_mode(on = TRUE)
install_github("hadley/lubridate")
dev_mode(on = FALSE)
To get vignettes on lubridate package:
vignette("lubridate")
To get help about some function foo
:
help(foo) # help about function foo
?foo # same thing
# Example
# help("is.period")
# ?is.period
To get examples for a function foo
:
example("foo")
# Example
# example("interval")
Forecasting and time-series analysis may be handled with commonplace functions from the stats
package, such as glm()
or a large number of specialized packages. The CRAN Task View for time-series analysis provides a detailed listing of key packages by topic with short descriptions.
Scraping refers to using a computer to retrieve the code of a webpage. Once the code is obtained, it must be parsed into a useful form for further use in R.
Base R does not have many of the tools required for these processes, so scraping and parsing are typically done with packages. Some packages are most useful for scraping (RSelenium
, httr
, curl
, RCurl
), some for parsing (XML
, xml2
), and some for both (rvest
).
A related process is scraping a web API, which unlike a webpage returns data intended to be machine-readable. Many of the same packages are used for both.
Some websites object to being scraped, whether due to increased server loads or concerns about data ownership. If a website forbids scraping in its Terms of Use, scraping it violates those terms and may be illegal.
sub-option | description | html | word | odt | rtf | md | github | ioslides | slidy | beamer | |
---|---|---|---|---|---|---|---|---|---|---|---|
citation_package | The LaTeX package to process citations, natbib, biblatex or none | X | X | X | |||||||
code_folding | Let readers toggle the display of R code: "none", "hide", or "show" | X | |||||||||
colortheme | Beamer color theme to use | X | |||||||||
css | CSS file to use to style document | X | X | X | |||||||
dev | Graphics device to use for figure output (e.g. "png") | X | X | X | X | X | X | X | |||
duration | Add a countdown timer (in minutes) to footer of slides | X | |||||||||
fig_caption | Should figures be rendered with captions? | X | X | X | X | X | X | X | |||
fig_height, fig_width | Default figure height and width (in inches) for document | X | X | X | X | X | X | X | X | X | X |
highlight | Syntax highlighting: "tango", "pygments", "kate","zenburn", "textmate" | X | X | X | X | X | |||||
includes | File of content to place in document (in_header, before_body, after_body) | X | X | X | X | X | X | X | X | ||
incremental | Should bullets appear one at a time (on presenter mouse clicks)? | X | X | X | |||||||
keep_md | Save a copy of .md file that contains knitr output | X | X | X | X | X | X | ||||
keep_tex | Save a copy of .tex file that contains knitr output | X | X | ||||||||
latex_engine | Engine to render LaTeX: "pdflatex", "xelatex", or "lualatex" | X | X | ||||||||
lib_dir | Directory of dependency files to use (Bootstrap, MathJax, etc.) | X | X | X | |||||||
mathjax | Set to local or a URL to use a local/URL version of MathJax to render | X | X | X | |||||||
md_extensions | Markdown extensions to add to default definition of R Markdown | X | X | X | X | X | X | X | X | X | X |
number_sections | Add section numbering to headers | X | X | ||||||||
pandoc_args | Additional arguments to pass to Pandoc | X | X | X | X | X | X | X | X | X | X |
preserve_yaml | Preserve YAML front matter in final document? | X | |||||||||
reference_docx | docx file whose styles should be copied when producing docx output | X | |||||||||
self_contained | Embed dependencies into the doc | X | X | X | |||||||
slide_level | The lowest heading level that defines individual slides | X | |||||||||
smaller | Use the smaller font size in the presentation? | X | |||||||||
smart | Convert straight quotes to curly, dashes to em-dashes, ... to ellipses, etc. | X | X | X | |||||||
template | Pandoc template to use when rendering file | X | X | X | X | X | |||||
theme | Bootswatch or Beamer theme to use for page | X | X | ||||||||
toc | Add a table of contents at start of document | X | X | X | X | X | X | X | |||
toc_depth | The lowest level of headings to add to table of contents | X | X | X | X | X | X | ||||
toc_float | Float the table of contents to the left of the main content | X |
The most common pitfall with scope arises in parallelization. All variables and functions must be passed into a new environment that is run on each thread.
Missing values are represented by the symbol NA
(not available). Impossible values (e.g., as a result of sqrt(-1)
) are represented by the symbol NaN
(not a number).
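The two symbols can be told apart with is.na and is.nan, as in this short sketch (the vector is made up for illustration):

```r
# suppressWarnings() hides the "NaNs produced" warning from sqrt(-1)
x <- c(1, NA, suppressWarnings(sqrt(-1)))

missing_flags <- is.na(x)    # FALSE TRUE TRUE: NaN also counts as missing
nan_flags <- is.nan(x)       # FALSE FALSE TRUE: only the impossible value

# na.rm = TRUE drops both NA and NaN before computing
m <- mean(x, na.rm = TRUE)   # 1
```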
There are several functions for inspecting the "type" of an object. The most useful such function is class
, although sometimes it is necessary to examine the mode
of an object. Since we are discussing "types", one might think that typeof
would be useful, but generally the result from mode
will be more useful, because objects with no explicit "class"-attribute will have function dispatch determined by the "implicit class" determined by their mode.
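The three functions can disagree on the same object, as this sketch with a matrix and a factor shows. The exact value of class() for a matrix depends on the R version, so no output is shown for it.

```r
m <- matrix(1:4, nrow = 2)
m_class <- class(m)    # includes "matrix"
m_mode <- mode(m)      # "numeric"
m_type <- typeof(m)    # "integer"

f <- factor(c("a", "b"))
f_class <- class(f)    # "factor"
f_mode <- mode(f)      # "numeric"
f_type <- typeof(f)    # "integer"
```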
A function in the *apply
family is an abstraction of a for
loop. Compared with the for
loops *apply
functions have the following advantages:
However for
loops are more general and give us more control, making it possible to carry out complex computations that are not always trivial to express using *apply
functions.
The relationship between for
loops and *apply
functions is explained in the documentation for for
loops.
*apply
Family
The *apply
family of functions contains several variants of the same principle that differ based primarily on the kind of output they return.
function | Input | Output |
---|---|---|
apply | matrix , data.frame , or array | vector or matrix (depending on the length of each element returned) |
sapply | vector or list | vector or matrix (depending on the length of each element returned) |
lapply | vector or list | list |
vapply | vector or list | vector or matrix (depending on the length of each element returned) of the user-designated class |
mapply | multiple vectors, lists or a combination | vector or matrix when the results can be simplified, otherwise a list |
See "Examples" to see how each of these functions is used.
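A compact sketch of the output each variant in the table returns, using small made-up inputs:

```r
m <- matrix(1:6, nrow = 2)

row_sums <- apply(m, 1, sum)                           # one value per row
as_list <- lapply(1:3, function(i) i^2)                # always a list
simplified <- sapply(1:3, function(i) i^2)             # simplified to a vector
checked <- vapply(1:3, function(i) i^2, numeric(1))    # type and length enforced
pairwise <- mapply(function(a, b) a + b, 1:3, 4:6)     # walks both inputs together
```

`row_sums` is 9 and 12, `as_list` is a list of 1, 4, 9, `simplified` and `checked` are the same values as a numeric vector, and `pairwise` is 5, 7, 9.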
tryCatch
tryCatch
returns the value associated to executing expr
unless there's a condition: a warning or an error. If that's the case, specific return values (e.g. return(NA)
above) can be specified by supplying a handler function for the respective conditions (see arguments warning
and error
in ?tryCatch
). These can be functions that already exist, but you can also define them within tryCatch
(as we did above).
As we've specified that NA
should be returned in case of an error in the "try part", the third element in y
is NA
. If we had chosen NULL
to be the return value, the length of y
would still have been 3
, because lapply
keeps NULL
results as list elements; they disappear only if the list is later flattened, for example with unlist
. Also note that if you don't specify an explicit return value via return
, the handler functions will return NULL
(i.e. in case of an error or a warning condition).
When the third element of our urls
vector hits our function, we get the following warning in addition to the fact that an error occurs (readLines
first complains that it can't open the connection via a warning before actually failing with an error):
Warning message:
In file(con, "r") : cannot open file 'I'm no URL': No such file or directory
An error "wins" over a warning, so we're not really interested in the warning in this particular case. Thus we have set warn = FALSE
in readLines
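, but that has no effect here: that argument only controls warnings about a missing final end-of-line or embedded nuls, not the warning raised when the connection cannot be opened. An alternative way to suppress the warning is to use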
suppressWarnings(readLines(con = url))
instead of
readLines(con = url, warn = FALSE)
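A minimal sketch of the handler pattern described above. It uses a hypothetical safe_log function and made-up inputs rather than the URL reader the text discusses, so that it runs without network access.

```r
# safe_log and inputs are hypothetical names used only for illustration.
safe_log <- function(x) {
  tryCatch(
    log(x),
    warning = function(w) {
      message("warning caught: ", conditionMessage(w))
      NA  # value returned when a warning is signalled
    },
    error = function(e) {
      message("error caught: ", conditionMessage(e))
      NA  # value returned when an error is signalled
    }
  )
}

# 10 succeeds, -1 triggers a warning, "a" triggers an error
inputs <- list(10, -1, "a")
y <- lapply(inputs, safe_log)
```

`y` has one element per input: the logarithm of 10, followed by NA for the warning case and NA for the error case.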
To create reproducible results, all sources of variation need to be fixed. For instance, if a (pseudo)random number generator is used, the seed needs to be fixed if you want to recreate the same results. Another way to reduce variation is to combine text and computation in the same document.
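Fixing the seed can be sketched as follows; the seed value 123 is arbitrary.

```r
set.seed(123)
first <- runif(3)

# Resetting the same seed reproduces exactly the same draws
set.seed(123)
second <- runif(3)

same_draws <- identical(first, second)  # TRUE
```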
Peng, R. D. (2011). Reproducible Research in Computational Science. Science, 334(6060), 1226–1227. http://doi.org/10.1126/science.1213847
Peng, Roger D. Report Writing for Data Science in R. Leanpub, 2015. https://leanpub.com/reportwriting.
The Fourier transform decomposes a function of time (a signal) into the frequencies that make it up, similarly to how a musical chord can be expressed as the amplitude (or loudness) of its constituent notes. The Fourier transform of a function of time itself is a complex-valued function of frequency, whose absolute value represents the amount of that frequency present in the original function, and whose complex argument is the phase offset of the basic sinusoid in that frequency.
The Fourier transform is called the frequency domain representation of the original signal. The term Fourier transform refers to both the frequency domain representation and the mathematical operation that associates the frequency domain representation to a function of time. The Fourier transform is not limited to functions of time, but in order to have a unified language, the domain of the original function is commonly referred to as the time domain. For many functions of practical interest one can define an operation that reverses this: the inverse Fourier transformation, also called Fourier synthesis, of a frequency domain representation combines the contributions of all the different frequencies to recover the original function of time.
Linear operations performed in one domain (time or frequency) have corresponding operations in the other domain, which are sometimes easier to perform. The operation of differentiation in the time domain corresponds to multiplication by the frequency, so some differential equations are easier to analyze in the frequency domain. Also, convolution in the time domain corresponds to ordinary multiplication in the frequency domain. Concretely, this means that any linear time-invariant system, such as an electronic filter applied to a signal, can be expressed relatively simply as an operation on frequencies. So significant simplification is often achieved by transforming time functions to the frequency domain, performing the desired operations, and transforming the result back to time.
Harmonic analysis is the systematic study of the relationship between the frequency and time domains, including the kinds of functions or operations that are "simpler" in one or the other, and has deep connections to almost all areas of modern mathematics.
Functions that are localized in the time domain have Fourier transforms that are spread out across the frequency domain and vice versa. The critical case is the Gaussian function, of substantial importance in probability theory and statistics as well as in the study of physical phenomena exhibiting normal distribution (e.g., diffusion), which with appropriate normalizations goes to itself under the Fourier transform. Joseph Fourier introduced the transform in his study of heat transfer, where Gaussian functions appear as solutions of the heat equation.
The Fourier transform can be formally defined as an improper Riemann integral, making it an integral transform, although this definition is not suitable for many applications requiring a more sophisticated integration theory.
For example, many relatively simple applications use the Dirac delta function, which can be treated formally as if it were a function, but the justification requires a mathematically more sophisticated viewpoint. The Fourier transform can also be generalized to functions of several variables on Euclidean space, sending a function of 3-dimensional space to a function of 3-dimensional momentum (or a function of space and time to a function of 4-momentum).
This idea makes the spatial Fourier transform very natural in the study of waves, as well as in quantum mechanics, where it is important to be able to represent wave solutions as functions of either space or momentum, and sometimes both. In general, functions to which Fourier methods are applicable are complex-valued, and possibly vector-valued. Still further generalization is possible to functions on groups, which, besides the original Fourier transform on ℝ or ℝⁿ (viewed as groups under addition), notably includes the discrete-time Fourier transform (DTFT, group = ℤ), the discrete Fourier transform (DFT, group = ℤ mod N) and the Fourier series or circular Fourier transform (group = S¹, the unit circle ≈ closed finite interval with endpoints identified). The latter is routinely employed to handle periodic functions. The fast Fourier transform (FFT) is an algorithm for computing the DFT.
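In R, the DFT is available through the base function fft. The sketch below, with an assumed 5 Hz test signal sampled at 100 Hz, shows how the absolute value of the transform picks out the frequency present, and how the inverse transform recovers the signal.

```r
# Sample a 5 Hz sine wave for one second at 100 samples per second
fs <- 100
t <- (0:(fs - 1)) / fs
signal <- sin(2 * pi * 5 * t)

spectrum <- fft(signal)
amplitude <- Mod(spectrum)   # absolute value: how much of each frequency is present

# With one second of data, element k + 1 corresponds to k Hz.
# Only the first half is inspected; for real input the second half mirrors it.
half <- amplitude[1:(length(signal) / 2)]
dominant_hz <- which.max(half) - 1

# fft(inverse = TRUE) is unnormalized, so divide by the number of samples
recovered <- Re(fft(spectrum, inverse = TRUE)) / length(signal)
```

`dominant_hz` comes out as 5, and `recovered` matches `signal` up to rounding error.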
There is a nice chapter on the matter in Efficient R programming
dplyr is an iteration of plyr that provides flexible "verb"-based functions to manipulate data in R. The latest version of dplyr can be installed from CRAN using
install.packages("dplyr")
The key object in dplyr is a tbl, a representation of a tabular data structure. Currently dplyr (version 0.5.0) supports:
Knitr is a tool that allows us to interweave natural language (in the form of LaTeX) and source code (in the form of R). In general, the concept of interspersing natural language and source code is called literate programming. Since knitr files contain a mixture of LaTeX (traditionally housed in .tex files) and R (traditionally housed in .R files) a new file extension called R noweb (.Rnw) is required. .Rnw files contain a mixture of LaTeX and R code.
Knitr allows for the generation of statistical reports in PDF format and is a key tool for achieving reproducible research.
Compiling .Rnw files to a PDF is a two step process. First, we need to know how to execute the R code and capture the output in a format that a LaTeX compiler can understand (a process called 'knitting'). We do this using the knitr package. The command for this is shown below, assuming you have installed the knitr package:
Rscript -e "library(knitr); knit('r-noweb-file.Rnw')"
This will generate a normal .tex file (called r-noweb-file.tex in this example) which can then be turned into a PDF file using:
pdflatex r-noweb-file.tex
Nearly all operators in R are really functions. For example, +
is a function defined as function (e1, e2) .Primitive("+")
where e1 is the left-hand side of the operator and e2 is the right-hand side of the operator. This means it is possible to accomplish rather counterintuitive effects by masking the +
in base with a user defined function.
For example:
`+` <- function(e1, e2) {e1-e2}
> 3+10
[1] -7
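Because operators are functions, they can also be called with ordinary function syntax, passed to other functions, or defined from scratch. A short sketch, in which the operator name %+% is an arbitrary choice for illustration:

```r
# Calling an operator with function syntax: same as 10 - 3
difference <- `-`(10, 3)

# Passing an operator to a higher-order function
doubled <- sapply(1:3, `*`, 2)

# A user-defined infix operator: any name wrapped in % signs
`%+%` <- function(a, b) paste(a, b)
greeting <- "hello" %+% "world"
```

This gives 7 for `difference`, 2 4 6 for `doubled`, and "hello world" for `greeting`.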
GPU computing requires a 'platform' which can connect to and utilize the hardware. The two primary low-level languages that accomplish this are CUDA and OpenCL. The former requires installation of the proprietary NVIDIA CUDA Toolkit and is only applicable on NVIDIA GPUs. The latter is both company (e.g. NVIDIA, AMD, Intel) and hardware independent (CPU or GPU) but requires the installation of an SDK (software development kit). In order to use a GPU via R you will need to install one of these pieces of software first.
Once either the CUDA Toolkit or an OpenCL SDK is installed, you can install an appropriate R package. Almost all the R GPU packages are dependent upon CUDA and limited to NVIDIA GPUs. These include:
There are currently only two OpenCL enabled packages
Warning - installation can be difficult for different operating systems with different environment variables and GPU platforms.
The SparkR package lets you work with distributed data frames on top of a Spark cluster. These allow you to do operations like selection, filtering, and aggregation on very large datasets.
SparkR overview
SparkR package documentation
To construct file paths, for reading or writing, use file.path.
Use dir to see what files are in a directory.
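A short sketch of both functions together, assuming a temporary directory as the working location; the file name example.csv is hypothetical.

```r
# Build a path from components; file.path inserts the separator for you.
data_dir <- tempdir()
csv_path <- file.path(data_dir, "example.csv")

# Write a small file so there is something to list.
write.csv(data.frame(x = 1:3), csv_path, row.names = FALSE)

# List the CSV files in the directory (pattern is a regular expression).
csv_files <- dir(data_dir, pattern = "\\.csv$")

# Components can be nested to any depth.
nested <- file.path("data", "raw", "input.csv")   # "data/raw/input.csv"
```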
What is recycling in R
Recycling is when an object is automatically extended in certain operations to match the length of another, longer object.
For example, the vectorised addition results in the following:
c(1,2,3) + c(1,2,3,4,5,6)
[1] 2 4 6 5 7 9
Because of the recycling, the operation that actually happened was:
c(1,2,3,1,2,3) + c(1,2,3,4,5,6)
In cases where the longer object is not a multiple of the shorter one, a warning message is presented:
c(1,2,3) + c(1,2,3,4,5,6,7)
[1] 2 4 6 5 7 9 8
Warning message:
In c(1, 2, 3) + c(1, 2, 3, 4, 5, 6, 7) :
longer object length is not a multiple of shorter object length
Another example of recycling:
matrix(1:5, nrow = 5, ncol = 2)
     [,1] [,2]
[1,]    1    1
[2,]    2    2
[3,]    3    3
[4,]    4    4
[5,]    5    5
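The implicit extension described above can be made explicit with rep_len, which repeats a vector up to a requested length. The following sketch, with illustrative variable names, checks that both routes give the same result.

```r
short <- c(1, 2, 3)
long  <- c(1, 2, 3, 4, 5, 6)

# What recycling does behind the scenes: repeat the shorter vector
# until it matches the length of the longer one.
expanded <- rep_len(short, length(long))   # 1 2 3 1 2 3

implicit <- short + long      # relies on recycling
explicit <- expanded + long   # recycling written out by hand

same <- identical(implicit, explicit)      # TRUE
```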
The function parse converts text and files into expressions.
The function eval evaluates expressions.
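A minimal sketch of the two functions working together, using the text argument of parse so that no file is needed:

```r
# Turn a string of R code into an (unevaluated) expression object.
expr <- parse(text = "1 + 2 * 3")
expr_class <- class(expr)    # "expression"

# Evaluate the expression to obtain its value.
result <- eval(expr)         # 7
```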
"[AB]"
could be A or B
"[[:alpha:]]" could be any letter
"[[:lower:]]" stands for any lower-case letter. Note that "[a-z]" is close but doesn't match, e.g., ú.
"[[:upper:]]" stands for any upper-case letter. Note that "[A-Z]" is close but doesn't match, e.g., Ú.
"[[:digit:]]" stands for any digit: 0, 1, 2, ..., or 9 and is equivalent to "[0-9]".
+, * and ? apply as usual in regex: + matches at least once, * matches 0 or more times, and ? matches 0 or 1 time.
You can specify the position of the regex in the string:
"^..." forces the regular expression to be at the beginning of the string
"...$" forces the regular expression to be at the end of the string
Please note that regular expressions in R often look ever-so-slightly different from regular expressions used in other languages.
R requires double-backslash escapes (because "\" already implies escaping in general in R strings), so, for example, to capture whitespace in most regular expression engines, one simply needs to type \s, vs. \\s in R.
UTF-8 characters in R should be escaped with a capital U, e.g. [\U{1F600}] and [\U1F600] match 😀, whereas in, e.g., Ruby, this would be matched with a lower-case u.
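The character classes, anchors, and double-backslash escapes described above can be tried out with grepl. This is a small sketch on made-up ASCII strings, so the results do not depend on the locale.

```r
words <- c("cat", "Bat", "rat7", "a b")

# Only lower-case letters from start to end of the string.
all_lower   <- grepl("^[[:lower:]]+$", words)   # TRUE FALSE FALSE FALSE

# Ends with a digit.
ends_digit  <- grepl("[[:digit:]]$", words)     # FALSE FALSE TRUE FALSE

# Contains whitespace: \s must be written as \\s inside an R string.
has_space   <- grepl("\\s", words)              # FALSE FALSE FALSE TRUE

# A bracket set: either C or B followed by "at".
c_or_b_at   <- grepl("^[CB]at$", words)         # FALSE TRUE FALSE FALSE
```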
The site regex101 is a good place for checking a regex online before using it in an R script.
The R Programming wikibook has a page dedicated to text processing with many examples using regular expressions.
Much of what goes into conditioning data to build models or visualizations can be accomplished with data.table. Compared to other options, data.table offers advantages of speed and flexibility.
The Comprehensive R Archive Network (CRAN) is the primary package repository.
Note that the rates of change must be returned in the same order as the state variables are specified. In the example "The Lorenz model", this means that in the function Lorenz the command
return(list(c(dX, dY, dZ)))
has the same order as the definition of the state variables
yini <- c(X = 1, Y = 1, Z = 1)
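To make the ordering requirement concrete, here is a sketch of what such a derivative function could look like. The parameter values and the exact form of the equations are assumptions for illustration, not necessarily those of the original example, and the function is called directly rather than through an ODE solver.

```r
# Assumed parameter values, for illustration only.
parameters <- c(a = -8/3, b = -10, c = 28)

# State variables in the order X, Y, Z.
yini <- c(X = 1, Y = 1, Z = 1)

Lorenz <- function(t, state, parameters) {
  with(as.list(c(state, parameters)), {
    dX <- a * X + Y * Z
    dY <- b * (Y - Z)
    dZ <- -X * Y + c * Y - Z
    # Return the rates in the SAME order as the state variables: X, Y, Z.
    list(c(dX, dY, dZ))
  })
}

# Evaluate the rates of change once at the initial state.
derivs <- Lorenz(0, yini, parameters)
```

Swapping two entries in the returned vector would not raise an error; a solver would simply assign each rate to the wrong state variable, which is why the ordering matters.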
The purpose of this documentation is to show how to integrate an academic bibliography into an RMD file.
To use the documentation given above, you have to install rmarkdown in R via install.packages("rmarkdown").
Sometimes Rmarkdown removes the hyperlinks of the citations. The solution for this is adding the following code to your YAML header:
link-citations: true
The bibliography may have any of these formats:
Format | File extension |
---|---|
MODS | .mods |
BibLaTeX | .bib |
BibTeX | .bibtex |
RIS | .ris |
EndNote | .enl |
EndNote XML | .xml |
ISI | .wos |
MEDLINE | .medline |
Copac | .copac |
JSON citeproc | .json |
Besides hclust, other methods are available; see the CRAN Package View on Clustering.
Date: stores time as the number of days since the UNIX epoch on 1970-01-01, with negative values for earlier dates.
POSIXct or POSIXlt objects.
Sys.Date() returns an object of class Date.
ymd, mdy, etc. are alternatives to as.Date that also parse to Date class; see Parsing dates and datetimes from strings with lubridate.
R users often want to publish analysis and results in a reproducible way. See Reproducible R for details.
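Returning to the Date class described above, the day-count representation can be inspected directly. This is a small sketch using base R only:

```r
# Dates are stored as days since 1970-01-01.
after_epoch  <- as.numeric(as.Date("1970-01-10"))   # 9
before_epoch <- as.numeric(as.Date("1969-12-31"))   # -1

# Sys.Date() returns today's date as a Date object.
today_class <- class(Sys.Date())                    # "Date"

# Date arithmetic works in whole days.
one_week_later <- as.Date("2020-02-25") + 7         # "2020-03-03"
```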
Users who are coming from other programming languages may be confused by the lack of a rand function equivalent to what they may have experienced before. Basic random number generation is done using the r* family of functions for each distribution (see the link above). Random numbers drawn uniformly from a range can be generated using runif, for "random uniform". Since this also looks suspiciously like "run if", new R users often find it hard to discover.
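A brief sketch of the r* functions in use; the seed value and the ranges are arbitrary choices for illustration.

```r
# Fix the seed so the draws are reproducible from run to run.
set.seed(42)

# Five numbers drawn uniformly from the interval [0, 10].
u <- runif(5, min = 0, max = 10)

# Five draws from a standard normal distribution.
n <- rnorm(5)

# Resetting the same seed reproduces the same uniform draws.
set.seed(42)
u_again <- runif(5, min = 0, max = 10)
```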
To represent the standard input/output channels, use the functions file("stdin") (input from the terminal or from another program via a pipe), stdout() (standard output) and stderr() (standard error). Note that while there is a function stdin(), it cannot be used when supplying a ready-made script to R, because it will read the next lines of that script instead of user input.
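A minimal sketch of these channels in a script; the helper name read_first_line is hypothetical and the function is only defined here, not called, since calling it waits for input.

```r
# Read one line from standard input via file("stdin"), which works in
# scripts where stdin() would read from the script itself.
read_first_line <- function() {
  con <- file("stdin")
  on.exit(close(con))      # always release the connection
  readLines(con, n = 1)
}

# Write to standard output and standard error explicitly.
writeLines("normal output", con = stdout())
writeLines("diagnostic output", con = stderr())
```

Assuming the code is saved in a file (say script.R, a made-up name) with a call to read_first_line() added, it could be fed input through a pipe, for example with echo hello | Rscript script.R.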