Getting started with R LanguageData framesReading and writing tabular data in plain-text files (CSV, TSV, etc.)Pipe operators (%>% and others)Linear Models (Regression)data.tableboxplotFormulaSplit functionCreating vectorsFactorsPattern Matching and ReplacementRun-length encodingDate and TimeSpeeding up tough-to-vectorize codeggplot2ListsIntroduction to Geographical MapsBase PlottingSet operationstidyverseRcppRandom Numbers GeneratorString manipulation with stringi packageParallel processingSubsettingDebuggingInstalling packagesArima ModelsDistribution FunctionsShinyspatial analysissqldfCode profilingControl flow structuresColumn wise operationJSONRODBClubridateTime Series and Forecastingstrsplit functionWeb scraping and parsingGeneralized linear modelsReshaping data between long and wide formsRMarkdown and knitr presentationScope of variablesPerforming a Permutation TestxgboostR code vectorization best practicesMissing valuesHierarchical Linear ModelingClassesIntrospection*apply family of functions (functionals)Text miningANOVARaster and Image AnalysisSurvival analysisFault-tolerant/resilient codeReproducible RUpdating R and the package libraryFourier Series and Transformations.RprofiledplyrcaretExtracting and Listing Files in Compressed ArchivesProbability Distributions with RR in LaTeX with knitrWeb Crawling in RArithmetic OperatorsCreating reports with RMarkdownGPU-accelerated computingheatmap and heatmap.2Network analysis with the igraph packageFunctional programmingGet user inputroxygen2HashmapsSpark API (SparkR)Meta: Documentation GuidelinesI/O for foreign tables (Excel, SAS, SPSS, Stata)I/O for database tablesI/O for geographic data (shapefiles, etc.)I/O for raster imagesI/O for R's binary formatReading and writing stringsInput and outputRecyclingExpression: parse + evalRegular Expressions (regex)CombinatoricsPivot and unpivot with data.tableInspecting packagesSolving ODEs in RFeature Selection in R -- Removing Extraneous FeaturesBibliography in RMDWriting functions in RColor schemes for graphicsHierarchical clustering with hclustRandom Forest AlgorithmBar ChartCleaning dataRESTful R ServicesMachine learningVariablesThe Date classThe logical classThe character classNumeric classes and storage modesMatricesDate-time classes (POSIXct and POSIXlt)Using texreg to export models in a paper-ready wayPublishingImplement State Machine Pattern using S4 ClassReshape using tidyrModifying strings by substitutionNon-standard evaluation and standard evaluationRandomizationObject-Oriented Programming in RRegular Expression Syntax in RCoercionStandardize analyses by writing standalone R scriptsAnalyze tweets with RNatural language processingUsing pipe assignment in your own package %<>%: How to ?R Markdown Notebooks (from RStudio)Updating R versionAggregating data framesData acquisitionR memento by examplesCreating packages with devtools

Missing values

Other topics

Remarks:

Missing values are represented by the symbol NA (not available). Impossible values (e.g., as a result of sqrt(-1)) are represented by the symbol NaN (not a number).

Examining missing data

anyNA reports whether any missing values are present; while is.na reports missing values elementwise:

vec <- c(1, 2, 3, NA, 5)

anyNA(vec)
# [1] TRUE
is.na(vec)
# [1] FALSE FALSE FALSE  TRUE FALSE

ìs.na returns a logical vector that is coerced to integer values under arithmetic operations (with FALSE=0, TRUE=1). We can use this to find out how many missing values there are:

sum(is.na(vec))
# [1] 1

Extending this approach, we can use colSums and is.na on a data frame to count NAs per column:

colSums(is.na(airquality))
#   Ozone Solar.R    Wind    Temp   Month     Day 
#      37       7       0       0       0       0 

The naniar package (currently on github but not CRAN) offers further tools for exploring missing values.

Reading and writing data with NA values

When reading tabular datasets with the read.* functions, R automatically looks for missing values that look like "NA". However, missing values are not always represented by NA. Sometimes a dot (.), a hyphen(-) or a character-value (e.g.: empty) indicates that a value is NA. The na.strings parameter of the read.* function can be used to tell R which symbols/characters need to be treated as NA values:

read.csv("name_of_csv_file.csv", na.strings = "-")

It is also possible to indicate that more than one symbol needs to be read as NA:

read.csv('missing.csv', na.strings = c('.','-'))

Similarly, NAs can be written with customized strings using the na argument to write.csv. Other tools for reading and writing tables have similar options.

Using NAs of different classes

The symbol NA is for a logical missing value:

class(NA)
#[1] "logical"

This is convenient, since it can easily be coerced to other atomic vector types, and is therefore usually the only NA you will need:

x <- c(1, NA, 1)
class(x[2])
#[1] "numeric"

If you do need a single NA value of another type, use NA_character_, NA_integer_, NA_real_ or NA_complex_. For missing values of fancy classes, subsetting with NA_integer_ usually works; for example, to get a missing-value Date:

class(Sys.Date()[NA_integer_])
# [1] "Date"

TRUE/FALSE and/or NA

NA is a logical type and a logical operator with an NA will return NA if the outcome is ambiguous. Below, NA OR TRUE evaluates to TRUE because at least one side evaluates to TRUE, however NA OR FALSE returns NA because we do not know whether NA would have been TRUE or FALSE

NA | TRUE
# [1] TRUE  
# TRUE | TRUE is TRUE and FALSE | TRUE is also TRUE.

NA | FALSE
# [1] NA  
# TRUE | FALSE is TRUE but FALSE | FALSE is FALSE.

NA & TRUE
# [1] NA  
# TRUE & TRUE is TRUE but FALSE & TRUE is FALSE.

NA & FALSE
# [1] FALSE
# TRUE & FALSE is FALSE and FALSE & FALSE is also FALSE.

These properties are helpful if you want to subset a data set based on some columns that contain NA.

df <- data.frame(v1=0:9, 
                 v2=c(rep(1:2, each=4), NA, NA), 
                 v3=c(NA, letters[2:10]))

df[df$v2 == 1 & !is.na(df$v2), ]
#  v1 v2   v3
#1  0  1 <NA>
#2  1  1    b
#3  2  1    c
#4  3  1    d

df[df$v2 == 1, ]
     v1 v2   v3
#1     0  1 <NA>
#2     1  1    b
#3     2  1    c
#4     3  1    d
#NA   NA NA <NA>
#NA.1 NA NA <NA>

Omitting or replacing missing values

Recoding missing values

Regularly, missing data isn't coded as NA in datasets. In SPSS for example, missing values are often represented by the value 99.

num.vec <- c(1, 2, 3, 99, 5)
num.vec
## [1]  1  2  3 99  5

It is possible to directly assign NA using subsetting

num.vec[num.vec == 99] <- NA

However, the preferred method is to use is.na<- as below. The help file (?is.na) states:

is.na<- may provide a safer way to set missingness. It behaves differently for factors, for example.

is.na(num.vec) <- num.vec == 99

Both methods return

num.vec
## [1]  1  2  3 NA  5

Removing missing values

Missing values can be removed in several ways from a vector:

num.vec[!is.na(num.vec)]
num.vec[complete.cases(num.vec)]
na.omit(num.vec)
## [1] 1 2 3 5

Excluding missing values from calculations

When using arithmetic functions on vectors with missing values, a missing value will be returned:

mean(num.vec) # returns: [1] NA

The na.rm parameter tells the function to exclude the NA values from the calculation:

mean(num.vec, na.rm = TRUE) # returns: [1] 2.75

# an alternative to using 'na.rm = TRUE':
mean(num.vec[!is.na(num.vec)]) # returns: [1] 2.75

Some R functions, like lm, have a na.action parameter. The default-value for this is na.omit, but with options(na.action = 'na.exclude') the default behavior of R can be changed.

If it is not necessary to change the default behavior, but for a specific situation another na.action is needed, the na.action parameter needs to be included in the function call, e.g.:

 lm(y2 ~ y1, data = anscombe, na.action = 'na.exclude')

Contributors

Topic Id: 3388

Example Ids: 11656,11657,11665,12423,28025

This site is not affiliated with any of the contributors.