Missing values

Remarks:

Missing values are represented by the symbol NA (not available). Impossible values (e.g., as a result of sqrt(-1)) are represented by the symbol NaN (not a number).
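
The two symbols can be told apart with is.na and is.nan. A minimal sketch, using a made-up vector x for illustration:

```r
# Hypothetical vector containing both kinds of "missing" value
x <- c(1, NA, NaN)

# is.na() flags both NA and NaN
is.na(x)
# [1] FALSE  TRUE  TRUE

# is.nan() flags only NaN
is.nan(x)
# [1] FALSE FALSE  TRUE
```

So a check with is.na alone will not tell you whether a value was missing from the start or came out of an undefined calculation.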

Examining missing data

anyNA reports whether any missing values are present, while is.na reports missing values elementwise:

vec <- c(1, 2, 3, NA, 5)

anyNA(vec)
# [1] TRUE
is.na(vec)
# [1] FALSE FALSE FALSE  TRUE FALSE

is.na returns a logical vector that is coerced to integer values under arithmetic operations (with FALSE=0, TRUE=1). We can use this to find out how many missing values there are:

sum(is.na(vec))
# [1] 1
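
The same coercion gives the proportion of missing values, and which locates them. A short sketch that repeats the vector from above so it runs on its own:

```r
vec <- c(1, 2, 3, NA, 5)

# Proportion of missing values (the mean of a logical vector)
mean(is.na(vec))
# [1] 0.2

# Positions of the missing values
which(is.na(vec))
# [1] 4
```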

Extending this approach, we can use colSums and is.na on a data frame to count NAs per column:

colSums(is.na(airquality))
#   Ozone Solar.R    Wind    Temp   Month     Day 
#      37       7       0       0       0       0
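
The same idea works by row or as a proportion. The sketch below uses the built-in airquality data; the variable name na_counts is only illustrative:

```r
# Columns that contain at least one NA
na_counts <- colSums(is.na(airquality))
names(na_counts)[na_counts > 0]
# [1] "Ozone"   "Solar.R"

# Number of rows with at least one NA
sum(rowSums(is.na(airquality)) > 0)
# [1] 42

# Proportion of NAs per column
colMeans(is.na(airquality))
```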

The naniar package (available on CRAN) offers further tools for exploring missing values.

Reading and writing data with NA values

When reading tabular datasets with the read.* functions, R by default treats the string "NA" as a missing value. However, missing values are not always coded as NA: sometimes a dot (.), a hyphen (-) or some other string (e.g. empty) marks a missing value. The na.strings parameter of the read.* functions tells R which strings to treat as NA:

read.csv("name_of_csv_file.csv", na.strings = "-")

It is also possible to indicate that more than one symbol needs to be read as NA:

read.csv('missing.csv', na.strings = c('.','-'))

Similarly, NAs can be written with customized strings using the na argument to write.csv. Other tools for reading and writing tables have similar options.
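
A self-contained round trip might look like the sketch below. The CSV text, the column names and the temporary file are made up for illustration; read.csv's text argument stands in for a file on disk:

```r
# Hypothetical CSV content in which "." and "-" mark missing scores
csv_text <- "id,score\n1,10\n2,-\n3,.\n"

# Read it, treating both strings as NA
dat <- read.csv(text = csv_text, na.strings = c(".", "-"))
is.na(dat$score)
# [1] FALSE  TRUE  TRUE

# Write it back out, using "-" for every NA
tmp <- tempfile(fileext = ".csv")
write.csv(dat, tmp, na = "-", row.names = FALSE)
readLines(tmp)
```

Because score contains only numbers once the NA strings are recognised, it is read as a numeric column rather than as character.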

Using NAs of different classes

The symbol NA is for a logical missing value:

class(NA)
#[1] "logical"

This is convenient, since it can easily be coerced to other atomic vector types, and is therefore usually the only NA you will need:

x <- c(1, NA, 1)
class(x[2])
#[1] "numeric"

If you do need a single NA value of another type, use NA_character_, NA_integer_, NA_real_ or NA_complex_. For missing values of fancy classes, subsetting with NA_integer_ usually works; for example, to get a missing-value Date:

class(Sys.Date()[NA_integer_])
# [1] "Date"
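
The typed constants can be checked with typeof. As a sketch, as.Date(NA) is shown as another way to get a missing Date:

```r
typeof(NA)
# [1] "logical"
typeof(NA_integer_)
# [1] "integer"
typeof(NA_real_)
# [1] "double"
typeof(NA_character_)
# [1] "character"
typeof(NA_complex_)
# [1] "complex"

# Coercing NA directly also gives a missing Date
class(as.Date(NA))
# [1] "Date"
```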

TRUE/FALSE and/or NA

NA is of logical type, and a logical operator applied to an NA returns NA if the outcome is ambiguous. Below, NA OR TRUE evaluates to TRUE because at least one side is TRUE. NA OR FALSE, however, returns NA because we do not know whether NA would have been TRUE or FALSE.

NA | TRUE
# [1] TRUE  
# TRUE | TRUE is TRUE and FALSE | TRUE is also TRUE.

NA | FALSE
# [1] NA  
# TRUE | FALSE is TRUE but FALSE | FALSE is FALSE.

NA & TRUE
# [1] NA  
# TRUE & TRUE is TRUE but FALSE & TRUE is FALSE.

NA & FALSE
# [1] FALSE
# TRUE & FALSE is FALSE and FALSE & FALSE is also FALSE.
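
The same logic carries over to any and all, which behave like repeated | and &. A brief sketch:

```r
# One TRUE settles any(), whatever the NA would have been
any(c(NA, TRUE))
# [1] TRUE

# Without a TRUE, the NA leaves the answer open
any(c(NA, FALSE))
# [1] NA

# One FALSE settles all()
all(c(NA, FALSE))
# [1] FALSE

# na.rm = TRUE drops the NA before evaluating
any(c(NA, FALSE), na.rm = TRUE)
# [1] FALSE
```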

These properties are helpful if you want to subset a data set based on some columns that contain NA.

df <- data.frame(v1=0:9, 
                 v2=c(rep(1:2, each=4), NA, NA), 
                 v3=c(NA, letters[2:10]))

df[df$v2 == 1 & !is.na(df$v2), ]
#  v1 v2   v3
#1  0  1 <NA>
#2  1  1    b
#3  2  1    c
#4  3  1    d

df[df$v2 == 1, ]
#     v1 v2   v3
#1     0  1 <NA>
#2     1  1    b
#3     2  1    c
#4     3  1    d
#NA   NA NA <NA>
#NA.1 NA NA <NA>
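
The all-NA rows appear because indexing with a logical NA returns an NA row. Besides adding !is.na(...), a few base R idioms drop them; the sketch below rebuilds the same df so that it runs on its own:

```r
df <- data.frame(v1 = 0:9,
                 v2 = c(rep(1:2, each = 4), NA, NA),
                 v3 = c(NA, letters[2:10]))

# which() keeps only the positions where the condition is TRUE
by_which <- df[which(df$v2 == 1), ]

# subset() treats NA in the condition as FALSE
by_subset <- subset(df, v2 == 1)

# %in% never returns NA
by_in <- df[df$v2 %in% 1, ]

nrow(by_which)
# [1] 4
```

All three return the same four rows as the !is.na version.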

Omitting or replacing missing values

Recoding missing values

Missing data are often not coded as NA in datasets. In SPSS, for example, missing values are often represented by the value 99.

num.vec <- c(1, 2, 3, 99, 5)
num.vec
## [1]  1  2  3 99  5

It is possible to assign NA directly using subsetting:

num.vec[num.vec == 99] <- NA

However, the preferred method is to use is.na<- as below. The help file (?is.na) states:

is.na<- may provide a safer way to set missingness. It behaves differently for factors, for example.

is.na(num.vec) <- num.vec == 99

Both methods give the same result:

num.vec
## [1]  1  2  3 NA  5
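
When several sentinel codes are in use across a whole data frame, the same is.na<- idiom can be applied column by column. A sketch, in which the survey data and the codes 99 and 999 are made up for illustration:

```r
# Hypothetical survey data in which 99 and 999 both mean "missing"
survey <- data.frame(age = c(25, 99, 40),
                     income = c(999, 3000, 99))
codes <- c(99, 999)

# Recode every column; survey[] keeps the data frame structure
survey[] <- lapply(survey, function(col) {
  is.na(col) <- col %in% codes
  col
})

survey
#   age income
# 1  25     NA
# 2  NA   3000
# 3  40     NA
```

This assumes the sentinel values can never be legitimate data in any of the columns, which should be checked against the codebook first.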

Removing missing values

Missing values can be removed from a vector in several ways:

num.vec[!is.na(num.vec)]
num.vec[complete.cases(num.vec)]
# na.omit also attaches an "na.action" attribute recording the dropped positions
na.omit(num.vec)
## [1] 1 2 3 5
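
The same functions work row-wise on data frames. A sketch using the built-in airquality data:

```r
nrow(airquality)
# [1] 153

# Drop every row that has an NA in any column
nrow(na.omit(airquality))
# [1] 111

# Drop only rows that have an NA in selected columns
keep <- complete.cases(airquality[, c("Ozone", "Wind")])
nrow(airquality[keep, ])
# [1] 116
```

Restricting complete.cases to the columns actually used in an analysis avoids throwing away rows because of NAs in unrelated columns.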

Excluding missing values from calculations

When summary functions such as mean are applied to vectors with missing values, a missing value is returned:

mean(num.vec) # returns: [1] NA

The na.rm parameter tells the function to exclude the NA values from the calculation:

mean(num.vec, na.rm = TRUE) # returns: [1] 2.75

# an alternative to using 'na.rm = TRUE':
mean(num.vec[!is.na(num.vec)]) # returns: [1] 2.75
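
Many other base summary functions take the same na.rm argument. A short sketch, with the vector repeated so it runs on its own:

```r
num.vec <- c(1, 2, 3, NA, 5)

sum(num.vec, na.rm = TRUE)
# [1] 11
median(num.vec, na.rm = TRUE)
# [1] 2.5
range(num.vec, na.rm = TRUE)
# [1] 1 5
```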

Some R functions, such as lm, have an na.action parameter. Its default is taken from options("na.action"), which is na.omit unless changed; options(na.action = 'na.exclude') changes the default for the whole session.

To use a different na.action for a single call without changing the session default, pass it in the function call, e.g.:

lm(y2 ~ y1, data = anscombe, na.action = 'na.exclude')
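
The practical difference between na.omit and na.exclude shows up in residuals and fitted values. The sketch below uses airquality rather than anscombe, because anscombe has no missing values; the model Ozone ~ Wind is only an illustration:

```r
# Same model, two ways of handling the rows with missing Ozone
fit_omit <- lm(Ozone ~ Wind, data = airquality, na.action = na.omit)
fit_excl <- lm(Ozone ~ Wind, data = airquality, na.action = na.exclude)

# na.omit: residuals only for the rows actually used
length(residuals(fit_omit))
# [1] 116

# na.exclude: residuals padded with NA to the original number of rows
length(residuals(fit_excl))
# [1] 153
sum(is.na(residuals(fit_excl)))
# [1] 37
```

The coefficients are identical in both fits. na.exclude is convenient when residuals or fitted values need to be attached back to the original data frame.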
