Hierarchical clustering with hclust

Remarks:

Besides hclust, other clustering methods are available; see the CRAN Task View on Clustering.
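
For instance, the cluster package (which also provides the ruspini data used in the examples below) includes partitioning methods such as pam. A minimal sketch, just to illustrate that alternatives exist:

    library(cluster)
    ## Partitioning Around Medoids as a non-hierarchical alternative to hclust
    pam_fit <- pam(ruspini, k = 4)
    plot(ruspini, asp = 1, pch = 20, col = pam_fit$clustering)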

Example 1 - Basic use of hclust, display of dendrogram, plot clusters

The cluster package contains the ruspini data, a standard data set for illustrating cluster analysis.

    library(cluster)                ## to get the ruspini data
    plot(ruspini, asp=1, pch=20)    ## take a look at the data

Ruspini Data

hclust expects a distance matrix, not the original data, so we compute the tree from dist(ruspini) using the default parameters and display it. The hang parameter, which defaults to -1 when an hclust object is converted with as.dendrogram, lines all of the leaves of the tree up along the baseline.

    ruspini_hc_defaults <- hclust(dist(ruspini))
    dend <- as.dendrogram(ruspini_hc_defaults)
    if(!require(dendextend)) install.packages("dendextend"); library(dendextend)
    dend <- color_branches(dend, k = 4) 
    plot(dend)

Dendrogram of the Ruspini data with the branches colored into four clusters
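
The same baseline alignment is available directly from the base plot method for hclust objects, without converting to a dendrogram first (a small sketch; hang = -1 drops every leaf to the baseline, while the plot.hclust default of 0.1 hangs each leaf just below its merge height):

    ## plot.hclust also takes a hang argument; hang = -1 lines the
    ## leaves up along the baseline
    plot(ruspini_hc_defaults, hang = -1)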

Cut the tree to give four clusters and replot the data, coloring the points by cluster. k is the desired number of clusters.

    rhc_def_4 <- cutree(ruspini_hc_defaults, k = 4)
    plot(ruspini, pch = 20, asp = 1, col = rhc_def_4)

Ruspini data - first clustering

This clustering is a little odd. We can get a better clustering by scaling the data first.

    scaled_ruspini_hc_defaults <- hclust(dist(scale(ruspini)))
    srhc_def_4 <- cutree(scaled_ruspini_hc_defaults, k = 4)
    plot(ruspini, pch = 20, asp = 1, col = srhc_def_4)

Scaled Ruspini data - clustered
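
Instead of requesting a fixed number of clusters, cutree can also cut the tree at a height h; the number of clusters then depends on how many branches that height intersects. A minimal sketch (the height used here is purely illustrative, not tuned to this data):

    ## cut the scaled tree at an illustrative dendrogram height
    srhc_by_height <- cutree(scaled_ruspini_hc_defaults, h = 3)
    table(srhc_by_height)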

The default dissimilarity measure for comparing clusters is "complete". You can specify a different measure with the method parameter.

    ruspini_hc_single = hclust(dist(ruspini), method="single")

Example 2 - hclust and outliers

With hierarchical clustering, outliers often show up as one-point clusters.

Generate three Gaussian distributions to illustrate the effect of outliers.

    set.seed(656)
    x <- c(rnorm(150, 0, 1), rnorm(150, 9, 1), rnorm(150, 4.5, 1))
    y <- c(rnorm(150, 0, 1), rnorm(150, 0, 1), rnorm(150, 5, 1))
    XYdf <- data.frame(x, y)
    plot(XYdf, pch = 20)

Test Data

Build the cluster structure and split it into three clusters.

    XY_sing = hclust(dist(XYdf), method="single")
    XYs3 = cutree(XY_sing,k=3)
    table(XYs3)
    XYs3
      1   2   3 
    448   1   1 

hclust found two outliers and put everything else into one big cluster. To get the "real" clusters, you may need to set k higher.

    XYs6 <- cutree(XY_sing, k = 6)
    table(XYs6)
    XYs6
      1   2   3   4   5   6 
    148 150   1 149   1   1 
    plot(XYdf, pch=20, col=XYs6)

Test data clustered - 3 Outliers & 3 large clusters

This StackOverflow post has some guidance on how to pick the number of clusters, but be aware of this behavior in hierarchical clustering.
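
One quick diagnostic, sketched here as a common heuristic rather than a definitive rule: the hclust object stores its merge heights in the $height component (in increasing order), and a large jump between successive heights suggests a natural place to cut.

    ## heights of the last 10 merges; a big jump hints at a cut point
    last10 <- tail(XY_sing$height, 10)
    barplot(last10,
            names.arg = 10:1,   # clusters remaining after each merge
            xlab = "clusters remaining after merge",
            ylab = "merge height")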
