
*apply family of functions (functionals)

Remarks:

A function in the *apply family is an abstraction of a for loop. Compared with for loops, *apply functions have the following advantages:

They require less code to write.
They don't need an iteration counter.
They don't need temporary variables to store intermediate results.

However, for loops are more general and give more control, which makes it possible to carry out complex computations that are not always trivial with *apply functions.

The relationship between for loops and *apply functions is explained in the documentation for for loops.
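
As a minimal sketch of that relationship, the following computes the same result with a for loop and with sapply. The vector x and the squaring operation are illustrative assumptions, not taken from the text above.

```r
# Illustrative input (an assumption for this sketch)
x <- c(2, 4, 6)

# for loop: needs a preallocated result and an iteration counter
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# sapply: same computation without a counter or explicit preallocation
squares_apply <- sapply(x, function(v) v^2)

# Both approaches give the same numeric vector
identical(squares_loop, squares_apply)
```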

Members of the `*apply` Family

The *apply family of functions contains several variants of the same principle that differ based primarily on the kind of output they return.

function	Input	Output
`apply`	`matrix`, `data.frame`, or `array`	vector or matrix (depending on the length of each element returned)
`sapply`	vector or `list`	vector or matrix (depending on the length of each element returned)
`lapply`	vector or `list`	`list`
`vapply`	vector or `list`	vector or matrix (depending on the length of each element returned) of the user-designated class
`mapply`	multiple vectors, `lists` or a combination	vector or matrix by default; `list` when `SIMPLIFY = FALSE` or when the results cannot be simplified

See "Examples" to see how each of these functions is used.
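
To make the differences in output concrete, here is a small sketch that runs several family members on made-up inputs. The objects m and x are assumptions chosen only for illustration.

```r
# Made-up inputs for illustration
m <- matrix(1:6, nrow = 2)
x <- list(a = 1:3, b = 4:6)

row_sums   <- apply(m, 1, sum)                        # vector, one value per row
as_list    <- lapply(x, sum)                          # always a list
as_vector  <- sapply(x, sum)                          # simplified to a named vector
as_matrix  <- sapply(x, range)                        # simplified to a 2 x 2 matrix
as_checked <- vapply(x, sum, FUN.VALUE = integer(1))  # vector, checked to be integer

str(as_list)
as_matrix
```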

Use anonymous functions with apply

apply is used to evaluate a function (maybe an anonymous one) over the margins of an array or matrix.

Let's use the iris dataset to illustrate this idea. The iris dataset has measurements of 150 flowers from 3 species. Let's see how this dataset is structured:

> head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1         5.1          3.5          1.4         0.2  setosa
2         4.9          3.0          1.4         0.2  setosa
3         4.7          3.2          1.3         0.2  setosa
4         4.6          3.1          1.5         0.2  setosa
5         5.0          3.6          1.4         0.2  setosa
6         5.4          3.9          1.7         0.4  setosa

Now, imagine that you want to know the mean of each of these variables. One way to solve this might be to use a for loop, but R programmers will often prefer to use apply (for reasons why, see Remarks):

> apply(iris[1:4], 2, mean)

Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
    5.843333     3.057333     3.758000     1.199333

In the first parameter, we subset iris to include only the first 4 columns, because mean only works on numeric data.
The second parameter value of 2 indicates that we want to work on the columns only (the second subscript of the r×c array); 1 would give the row means.
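
The effect of the margin argument is easiest to see on a small matrix. The sketch below uses a made-up 2 x 3 matrix m (an assumption for illustration) and computes both row and column means.

```r
# A made-up 2 x 3 matrix with labelled rows and columns
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))

row_means <- apply(m, 1, mean)  # margin 1: one mean per row
col_means <- apply(m, 2, mean)  # margin 2: one mean per column

row_means
col_means
```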

In the same way we can calculate more meaningful values:

# standard deviation
apply(iris[1:4], 2, sd)
# variance
apply(iris[1:4], 2, var)

Caveat: for column and row sums and means, R has dedicated built-in functions that are faster than apply: colSums, rowSums, colMeans and rowMeans.
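
As a quick check, the dedicated function gives the same column means as the apply call used earlier:

```r
# Column means computed two ways on the numeric columns of iris
means_builtin <- colMeans(iris[1:4])
means_apply   <- apply(iris[1:4], 2, mean)

# The results agree
all.equal(means_builtin, means_apply)
```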

Now, let's do a different and more meaningful task: let's calculate the mean only for those values which are bigger than 0.5. For that, we will create our own mean function.

> our.mean.function <- function(x) { mean(x[x > 0.5]) }
> apply(iris[1:4], 2, our.mean.function)

Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
    5.843333     3.057333     3.758000     1.665347

(Note the difference in the mean of Petal.Width)

But, what if we don't want to use this function in the rest of our code? Then, we can use an anonymous function, and write our code like this:

apply(iris[1:4], 2, function(x) { mean(x[x > 0.5]) })

So, as we have seen, we can use apply to execute the same operation on columns or rows of a dataset using only one line.

Caveat: Since apply returns very different kinds of output depending on the length of the results of the specified function, it may not be the best choice in cases where you are not working interactively. Some of the other *apply family functions are a bit more predictable (see Remarks).
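
The following sketch illustrates that caveat on a made-up matrix m (an assumption for illustration): the shape of what apply returns changes with the length of each result.

```r
# A made-up 2 x 3 matrix; its columns are (1, 2), (3, 4) and (5, 6)
m <- matrix(1:6, nrow = 2)

res_vector <- apply(m, 2, sum)                      # length-1 results: a vector
res_matrix <- apply(m, 2, range)                    # length-2 results: a matrix
res_list   <- apply(m, 2, function(v) v[v > 2])     # varying lengths: a list

class(res_vector)
dim(res_matrix)
class(res_list)
```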

Bulk File Loading

This approach suits a large number of files with well-structured names that all need to be processed in the same way.

First, create a vector of the file names to be read. There are multiple options for this:

Creating the vector manually with paste0()

 files <- paste0("file_", 1:100, ".rds")

Using list.files() with a regex search term for the file type. This requires some knowledge of regular expressions (regex) if other files of the same type are in the directory.
```
 files <- list.files("./", pattern = "\\.rds$", full.names = TRUE)
```

In the paste0() call, 1:100 is the varying part of the file naming format; replace it with whatever vector matches your files.

lapply returns each result as an element of a list.

readRDS is specific to .rds files; substitute the reader function appropriate to your file type.

my_file_list <- lapply(files, readRDS)

In testing, this is not necessarily faster than a for loop, but it collects all files as elements of one list without assigning each one explicitly.
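
The whole workflow can be tried end to end without any existing data. The sketch below is an illustration under assumptions: it writes three small made-up data frames to a temporary directory (the directory name bulk_rds_demo and the file contents are invented), then reads them back with lapply.

```r
# Work in a fresh temporary directory so the sketch does not touch real files
tmp_dir <- file.path(tempdir(), "bulk_rds_demo")
dir.create(tmp_dir, showWarnings = FALSE)

# Write three small example objects as file_1.rds ... file_3.rds
for (i in 1:3) {
  saveRDS(data.frame(id = i, value = i * 10),
          file.path(tmp_dir, paste0("file_", i, ".rds")))
}

# Collect the file names, then read every file into one list
files <- list.files(tmp_dir, pattern = "\\.rds$", full.names = TRUE)
my_file_list <- lapply(files, readRDS)

# Name the list elements after their source files for easier lookup
names(my_file_list) <- basename(files)
my_file_list[["file_2.rds"]]
```

Naming the list elements is optional, but it makes it easier to trace each element back to the file it came from.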

Finally, we often need to load multiple packages at once. This can be done by applying library() to a vector of the package names:

lapply(c("jsonlite", "stringr", "igraph"), library, character.only = TRUE)
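
Called at the top level, that lapply prints the list it returns. A possible variation, sketched here with packages that ship with R so it runs without extra installation (the choice of packages is an assumption), wraps the call in invisible():

```r
# Packages from the base distribution are used so the sketch runs anywhere
pkgs <- c("stats", "utils", "tools")

# invisible() hides the list that lapply() would otherwise print
invisible(lapply(pkgs, library, character.only = TRUE))

# Confirm that every requested package is now on the search path
all_attached <- all(paste0("package:", pkgs) %in% search())
all_attached
```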

Combining multiple `data.frames` (`lapply`, `mapply`)

In this exercise, we will generate four bootstrap linear regression models and combine the summaries of these models into a single data frame.

library(broom)

#* Create the bootstrap data sets
BootData <- lapply(1:4,
                   function(i) mtcars[sample(1:nrow(mtcars),
                                             size = nrow(mtcars),
                                             replace = TRUE), ])

#* Fit the models
Models <- lapply(BootData,
                 function(BD) lm(mpg ~ qsec + wt + factor(am),
                                 data = BD))

#* Tidy the output into a data.frame
Tidied <- lapply(Models,
                 tidy)

#* Give each element in the Tidied list a name
Tidied <- setNames(Tidied, paste0("Boot", seq_along(Tidied)))

At this point, we can take two approaches to inserting the names into the data.frame.

#* Insert the element name into the summary with `lapply`
#* Requires passing the names attribute to `lapply` and referencing `Tidied` within
#* the applied function.
Described_lapply <- 
 lapply(names(Tidied),
        function(nm) cbind(nm, Tidied[[nm]]))

Combined_lapply <- do.call("rbind", Described_lapply)

#* Insert the element name into the summary with `mapply`
#* Allows us to pass the names and the elements as separate arguments.
Described_mapply <- 
 mapply(
  function(nm, dframe) cbind(nm, dframe),
  names(Tidied),
  Tidied,
  SIMPLIFY = FALSE)

Combined_mapply <- do.call("rbind", Described_mapply)
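
The name-insertion pattern can also be tried without fitting any models. The sketch below is a simplified illustration: the list Pieces and its two small data frames are made up to stand in for the tidied summaries.

```r
# Two small made-up data frames standing in for the tidied model summaries
Pieces <- list(Boot1 = data.frame(term = c("a", "b"), estimate = c(1, 2)),
               Boot2 = data.frame(term = c("a", "b"), estimate = c(3, 4)))

# Pass names and elements in parallel, keeping the result as a list
Described <- mapply(function(nm, dframe) cbind(nm, dframe),
                    names(Pieces), Pieces,
                    SIMPLIFY = FALSE)

# Stack the pieces and reset the row names to plain integers
Combined <- do.call("rbind", Described)
rownames(Combined) <- NULL
Combined
```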

If you're a fan of magrittr style pipes, you can accomplish the entire task in a single chain (though it may not be prudent to do so if you need any of the intermediary objects, such as the model objects themselves):

library(magrittr)
library(broom)
Combined <- lapply(1:4,
                   function(i) mtcars[sample(1:nrow(mtcars),
                                             size = nrow(mtcars),
                                             replace = TRUE), ]) %>%
 lapply(function(BD) lm( mpg ~ qsec + wt + factor(am), data = BD)) %>%
 lapply(tidy) %>%
 setNames(paste0("Boot", seq_along(.))) %>%
 mapply(function(nm, dframe) cbind(nm, dframe),
        nm = names(.),
        dframe = .,
        SIMPLIFY = FALSE) %>%
 do.call("rbind", .)

Using built-in functionals

Built-in functionals: lapply(), sapply(), and mapply()

R comes with built-in functionals, of which perhaps the best known is the apply family of functions. Here is a description of some of the most common apply functions:

lapply() = takes a list (or vector) as an argument, applies the specified function to each element, and returns a list.
sapply() = the same as lapply() but attempts to simplify the output to a vector or a matrix.
- vapply() = a variant of sapply() in which the output object's type must be specified.
mapply() = like lapply() but can pass multiple vectors as input to the specified function. Can be simplified like sapply().
- Map() is a wrapper around mapply() with SIMPLIFY = FALSE.
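
The two variants in the list, vapply() and Map(), can be sketched as follows. The input list x is a made-up example.

```r
# Made-up input for illustration
x <- list(a = 1:3, b = 4:6)

# vapply() checks that every result is a single numeric value
means <- vapply(x, mean, FUN.VALUE = numeric(1))

# Map() returns a list, like mapply() with SIMPLIFY = FALSE
sums_map    <- Map(function(u, v) u + v, 1:3, 4:6)
sums_mapply <- mapply(function(u, v) u + v, 1:3, 4:6, SIMPLIFY = FALSE)

means
identical(sums_map, sums_mapply)
```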

lapply()

lapply() can iterate in two different ways, either over the elements themselves or over their indices:

lapply(variable, FUN)
lapply(seq_along(variable), FUN)

# Two ways of finding the mean of each column of df
set.seed(1)
df <- data.frame(x = rnorm(25), y = rnorm(25))
lapply(df, mean)
lapply(seq_along(df), function(x) mean(df[[x]]))
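
Iterating over indices is mainly useful when the function needs the position or the name of each element as well as its value. A minimal sketch, using a small made-up data frame called scores (an assumption for illustration):

```r
# Small made-up data frame
scores <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))

# The index i gives access to both the column name and the column values
labelled <- lapply(seq_along(scores), function(i) {
  paste0(names(scores)[i], ": ", mean(scores[[i]]))
})

labelled
```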

sapply()

sapply() will attempt to resolve its output to either a vector or a matrix.

# Two examples to show the different outputs of sapply()
sapply(letters, print)  ## produces a vector
x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE,FALSE,FALSE,TRUE))
sapply(x, quantile)  ## produces a matrix
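
Because the simplification is only attempted, the type of the result can change with the input. The sketch below shows three cases, with vapply() as a stricter alternative; the inputs are made up for illustration.

```r
# Results of equal length 1: simplified to a numeric vector
res_vec <- sapply(1:3, function(i) i * 2)

# Results of differing lengths: no simplification, a list is returned
res_list <- sapply(1:3, function(i) seq_len(i))

# Empty input: sapply() returns an empty list, not an empty vector
res_empty <- sapply(integer(0), function(i) i * 2)

# vapply() returns the declared type even for empty input
res_strict <- vapply(integer(0), function(i) i * 2, FUN.VALUE = numeric(1))

class(res_vec)
class(res_list)
class(res_empty)
class(res_strict)
```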

mapply()

mapply() works much like lapply() except it can take multiple vectors as input (hence the m for multivariate).

mapply(sum, 1:5, 10:6, 3) # 3 will be "recycled" by mapply
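
To spell out what the call above computes, and to show how a constant argument can be passed explicitly instead of relying on recycling, here is a short sketch. The anonymous function and its power argument are assumptions made for illustration.

```r
# Element-wise: sum(1, 10, 3), sum(2, 9, 3), ..., sum(5, 6, 3)
recycled <- mapply(sum, 1:5, 10:6, 3)

# MoreArgs passes arguments that are the same for every call
squared <- mapply(function(x, y, power) (x + y)^power,
                  1:3, 4:6,
                  MoreArgs = list(power = 2))

recycled
squared
```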

Using user-defined functionals

User-defined functionals

Users can create their own functionals to varying degrees of complexity. The following examples are from Functionals by Hadley Wickham:

randomise <- function(f) f(runif(1e3))
        
lapply2 <- function(x, f, ...) {
    out <- vector("list", length(x))
    for (i in seq_along(x)) {
        out[[i]] <- f(x[[i]], ...)
    }
    out
}

In the first case, randomise accepts a single argument f, and calls it on a sample of Uniform random variables. To demonstrate equivalence, we call set.seed below:

set.seed(123)
randomise(mean)
#[1] 0.4972778
    
set.seed(123)
mean(runif(1e3))
#[1] 0.4972778


set.seed(123)
randomise(max)
#[1] 0.9994045

set.seed(123)
max(runif(1e3))
#[1] 0.9994045

The second example is a re-implementation of base::lapply: a functional that uses a for loop to apply an operation (f) to each element in a list (x). The ... parameter allows the user to pass additional arguments to f, such as the na.rm option in the mean function:

lapply(list(c(1, 3, 5), c(2, NA, 6)), mean)
# [[1]]
# [1] 3
# 
# [[2]]
# [1] NA

lapply2(list(c(1, 3, 5), c(2, NA, 6)), mean)
# [[1]]
# [1] 3
# 
# [[2]]
# [1] NA


lapply(list(c(1, 3, 5), c(2, NA, 6)), mean, na.rm = TRUE)
# [[1]]
# [1] 3
# 
# [[2]]
# [1] 4

lapply2(list(c(1, 3, 5), c(2, NA, 6)), mean, na.rm = TRUE)
# [[1]]
# [1] 3
# 
# [[2]]
# [1] 4
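
One difference from base::lapply is that lapply2 does not carry over the names of its input. The sketch below repeats lapply2 so that it runs on its own and adds a hypothetical variant, lapply2_named, which is not from the source and only illustrates how the names could be copied.

```r
# Copy of lapply2 from the text, repeated so the sketch runs on its own
lapply2 <- function(x, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], ...)
  }
  out
}

# Hypothetical variant (not from the source) that also copies the names
lapply2_named <- function(x, f, ...) {
  out <- lapply2(x, f, ...)
  names(out) <- names(x)
  out
}

x <- list(a = c(1, 3, 5), b = c(2, NA, 6))

names(lapply(x, mean, na.rm = TRUE))   # names are kept by base::lapply
names(lapply2(x, mean, na.rm = TRUE))  # names are dropped by lapply2
identical(lapply2_named(x, mean, na.rm = TRUE),
          lapply(x, mean, na.rm = TRUE))
```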