Getting started with R Language Data frames Reading and writing tabular data in plain-text files (CSV, TSV, etc.)Pipe operators (%>% and others)Linear Models (Regression)data.table boxplot Formula Split function Creating vectors Factors Pattern Matching and Replacement Run-length encoding Date and Time Speeding up tough-to-vectorize code ggplot2 Lists Introduction to Geographical Maps Base Plotting Set operations tidyverse Rcpp Random Numbers Generator String manipulation with stringi package Parallel processing Subsetting Debugging Installing packages Arima Models Distribution Functions Shiny spatial analysis sqldf Code profiling Control flow structures Column wise operation JSON RODBC lubridate Time Series and Forecasting strsplit function Web scraping and parsing Generalized linear models Reshaping data between long and wide forms RMarkdown and knitr presentation Scope of variables Performing a Permutation Test xgboost R code vectorization best practices Missing values Hierarchical Linear Modeling Classes Introspection *apply family of functions (functionals)Text mining ANOVA Raster and Image Analysis Survival analysis Fault-tolerant/resilient code Reproducible R Updating R and the package library Fourier Series and Transformations .Rprofile dplyr caret Extracting and Listing Files in Compressed Archives Probability Distributions with R R in LaTeX with knitr Web Crawling in R Arithmetic Operators Creating reports with RMarkdown GPU-accelerated computing heatmap and heatmap.2 Network analysis with the igraph package Functional programming Get user input roxygen2 Hashmaps Spark API (SparkR)Meta: Documentation Guidelines I/O for foreign tables (Excel, SAS, SPSS, Stata)I/O for database tables I/O for geographic data (shapefiles, etc.)I/O for raster images I/O for R's binary format Reading and writing strings Input and output Recycling Expression: parse + eval Regular Expressions (regex)Combinatorics Pivot and unpivot with data.table Inspecting packages Solving ODEs in R Feature Selection in R -- Removing Extraneous Features Bibliography in RMD Writing functions in R Color schemes for graphics Hierarchical clustering with hclust Random Forest Algorithm Bar Chart Cleaning data RESTful R Services Machine learning Variables The Date class The logical class The character class Numeric classes and storage modes Matrices Date-time classes (POSIXct and POSIXlt)Using texreg to export models in a paper-ready way Publishing Implement State Machine Pattern using S4 Class Reshape using tidyr Modifying strings by substitution Non-standard evaluation and standard evaluation Randomization Object-Oriented Programming in R Regular Expression Syntax in R Coercion Standardize analyses by writing standalone R scripts Analyze tweets with R Natural language processing Using pipe assignment in your own package %<>%: How to ?R Markdown Notebooks (from RStudio)Updating R version Aggregating data frames Data acquisition R memento by examples Creating packages with devtools

Performing a Permutation Test

A fairly general function

We will use the built in tooth growth dataset. We are interested in whether there is a statistically significant difference in tooth growth when the guinea pigs are given vitamin C vs orange juice.

Here's the full example:

teethVC = ToothGrowth[ToothGrowth$supp == 'VC',]
teethOJ = ToothGrowth[ToothGrowth$supp == 'OJ',]

permutationTest = function(vectorA, vectorB, testStat){
  N = 10^5
  fullSet = c(vectorA, vectorB)
  lengthA = length(vectorA)
  lengthB = length(vectorB)
  trials <- replicate(N, 
                      {index <- sample(lengthB + lengthA, size = lengthA, replace = FALSE)
                      testStat((fullSet[index]), fullSet[-index])  } )
  trials
}
vec1 =teethVC$len;
vec2 =teethOJ$len;
subtractMeans = function(a, b){ return (mean(a) - mean(b))}
result = permutationTest(vec1, vec2, subtractMeans)
observedMeanDifference = subtractMeans(vec1, vec2)
result = c(result, observedMeanDifference)
hist(result)
abline(v=observedMeanDifference, col = "blue")
pValue = 2*mean(result <= (observedMeanDifference))
pValue

After we read in the CSV, we define the function

permutationTest = function(vectorA, vectorB, testStat){
  N = 10^5
  fullSet = c(vectorA, vectorB)
  lengthA = length(vectorA)
  lengthB = length(vectorB)
  trials <- replicate(N, 
                      {index <- sample(lengthB + lengthA, size = lengthA, replace = FALSE)
                      testStat((fullSet[index]), fullSet[-index])  } )
  trials
}

This function takes two vectors, and shuffles their contents together, then performs the function testStat on the shuffled vectors. The result of teststat is added to trials, which is the return value.

It does this N = 10^5 times. Note that the value N could very well have been a parameter to the function.

This leaves us with a new set of data, trials, the set of means that might result if there truly is no relationship between the two variables.

Now to define our test statistic:

subtractMeans = function(a, b){ return (mean(a) - mean(b))}

Perform the test:

result = permutationTest(vec1, vec2, subtractMeans)

Calculate our actual observed mean difference:

observedMeanDifference = subtractMeans(vec1, vec2)

Let's see what our observation looks like on a histogram of our test statistic.

hist(result)
abline(v=observedMeanDifference, col = "blue")

It doesn't look like our observed result is very likely to occur by random chance...

We want to calculate the p-value, the likeliehood of the original observed result if their is no relationship between the two variables.

pValue = 2*mean(result >= (observedMeanDifference))

Let's break that down a bit:

result >= (observedMeanDifference)

Will create a boolean vector, like:

FALSE TRUE FALSE FALSE TRUE FALSE ...

With TRUE every time the value of result is greater than or equal to the observedMean.

The function mean will interpret this vector as 1 for TRUE and 0 for FALSE, and give us the percentage of 1's in the mix, ie the number of times our shuffled vector mean difference surpassed or equalled what we observed.

Finally, we multiply by 2 because the distribution of our test statistic is highly symmetric, and we really want to know which results are "more extreme" than our observed result.

All that's left is to output the p-value, which turns out to be 0.06093939. Interpretation of this value is subjective, but I would say that it looks like Vitamin C promotes tooth growth quite a lot more than Orange Juice does.

Contributors

Topic Id: 3216

Example Ids: 11042

This site is not affiliated with any of the contributors.