R: Clustering with trimming

cluster_events {beadplexr}

R Documentation

Clustering with trimming

Description

Identify clusters with one of several algorithms, then trim each cluster.

Usage

bp_kmeans(df, .parameter, .column_name, .k, .trim = 0, .data = NULL, ...)

bp_clara(df, .parameter, .column_name, .k, .trim = 0, .data = NULL, ...)

bp_dbscan(
  df,
  .parameter,
  .column_name,
  .eps = 0.2,
  .MinPts = 50,
  .data = NULL,
  ...
)

bp_mclust(
  df,
  .parameter,
  .column_name,
  .k,
  .trim = 0,
  .sample_frac = 0.05,
  .max_subset = 500,
  .data = NULL,
  ...
)

bp_density_cut(df, .parameter, .column_name, .k, .trim = 0, .data = NULL, ...)

Arguments

`df`	A tidy data.frame.
`.parameter`	A character giving the name of column(s) where populations are identified.
`.column_name`	A character giving the name of the column to store the population information.
`.k`	Numeric giving the number of expected clusters, or a set of initial cluster centers.
`.trim`	A numeric between 0 and 1, giving the fraction of points to remove by marking them NA.
`.data`	Deprecated. Use `df`.
`...`	Additional arguments passed to appropriate methods, see below.
`.eps`	Reachability distance, see `fpc::dbscan()`.
`.MinPts`	Reachability minimum no. of points, see `fpc::dbscan()`.
`.sample_frac`	A numeric between 0 and 1 giving the fraction of points to use in initialisation of `Mclust()`.
`.max_subset`	A numeric giving the maximum number of events to use in initialisation of `Mclust()`, see below.

Value

The data.frame in `df` with the cluster classification added in the column given by `.column_name`.
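
To illustrate what trimming by marking points `NA` can look like, here is a minimal base R sketch. It assumes that trimming drops the points farthest from the centre of their own cluster, which this page does not confirm; `trim_sketch()` is a hypothetical helper and not part of beadplexr.

```r
# Hypothetical helper (not part of beadplexr): label points with kmeans and
# set the label of the most distant fraction of each cluster to NA.
trim_sketch <- function(x, k, trim = 0.1) {
  fit <- stats::kmeans(x, centers = k, nstart = 5)
  label <- fit$cluster
  # Euclidean distance from each point to the centre of its own cluster
  d <- sqrt(rowSums((x - fit$centers[label, , drop = FALSE])^2))
  for (cl in seq_len(k)) {
    in_cl <- which(label == cl)
    n_drop <- floor(length(in_cl) * trim)
    if (n_drop > 0) {
      # Indices of the n_drop points farthest from the cluster centre
      far <- in_cl[order(d[in_cl], decreasing = TRUE)[seq_len(n_drop)]]
      label[far] <- NA
    }
  }
  label
}

set.seed(1)
# Simulated data: two well separated groups of 100 points each
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 6), ncol = 2))
population <- trim_sketch(x, k = 2, trim = 0.1)
table(population, useNA = "ifany")
```

The trimmed points stay in the data and only lose their cluster label, which mirrors the description of `.trim` in the arguments.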

Additional parameters

Information on the additional arguments can be found here:

clara: cluster::clara()
kmeans: kmeans()
dbscan: fpc::dbscan()
mclust: mclust::Mclust()
density_cut: approx_adjust()

Default parameters to `clara()`

cluster::clara() is by default called with the following parameters:

samples: 100
pamLike: TRUE
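
As a rough illustration of these two settings, the following calls `cluster::clara()` directly on simulated data. The data and object names are made up for the example, and the call is not necessarily identical to the one made inside `bp_clara()`.

```r
library(cluster)

set.seed(1)
# Simulated data: two well separated groups of 100 points each
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 6), ncol = 2))

# samples: number of sub-samples drawn from the data
# pamLike: use the same swap algorithm as cluster::pam()
fit <- clara(x, k = 2, samples = 100, pamLike = TRUE)
table(fit$clustering)
```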

Parameters to dbscan

Finding the right parameters for the density-based clustering requires some trial and error, but the parameters usually stay stable throughout an entire experiment and over time (assuming that there is only little drift in the flow cytometer). There is no guarantee that the correct number of clusters is returned, and it might be better to use this on the forward-side scatter discrimination.

Scaling of the parameters seems to be appropriate in most cases for the forward-side scatter discrimination and is performed automatically.
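
The effect of scaling on a distance-based parameter can be sketched in base R. The example assumes that scaling means centring each column and dividing by its standard deviation, as `scale()` does; the simulated values and the helper `nn_dist()` are illustrative only.

```r
set.seed(1)
# Simulated scatter values on very different scales (illustrative numbers only)
events <- cbind(`FSC-A` = rnorm(500, mean = 5e5, sd = 5e4),
                `SSC-A` = rnorm(500, mean = 2e4, sd = 1e3))

# Centre each column and divide by its standard deviation
scaled <- scale(events)

apply(events, 2, sd)  # raw spreads differ by more than an order of magnitude
apply(scaled, 2, sd)  # both columns now have unit spread

# Hypothetical helper: distance from each point to its nearest neighbour,
# the kind of quantity a reachability distance is compared against
nn_dist <- function(m) {
  d <- as.matrix(dist(m))
  diag(d) <- Inf
  apply(d, 1, min)
}
summary(nn_dist(events))
summary(nn_dist(scaled))
```

On the raw values a reachability distance would have to be chosen in instrument units and would be dominated by the wider column. After scaling, one value can be reused across parameters with different ranges.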

Parameters to mclust

Mclust is slow and memory hungry on large datasets. Using a subset of the data to initialise the clustering greatly improves the speed. I have found that a subset of 500 events works well and that a subset of 5000 events gives no markedly better clustering, but initialisation with 500 events makes the clustering complete about 12 times faster than with 5000 events.
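
A minimal sketch of initialising on a subset is given below. The rule used to pick the subset size (a fraction of the events, capped at a maximum) is an assumption about how `.sample_frac` and `.max_subset` interact, and `init_subset()` is a hypothetical helper. The `initialization = list(subset = ...)` argument is part of `mclust::Mclust()`.

```r
# Assumed rule (not confirmed by this page): use a fraction of the events
# for initialisation, but never more than `max_subset`.
init_subset <- function(n, sample_frac = 0.05, max_subset = 500) {
  size <- min(floor(n * sample_frac), max_subset)
  sample.int(n, size = size)
}

set.seed(1)
# Simulated data: two groups of 10000 events each
x <- rbind(matrix(rnorm(20000, mean = 0), ncol = 2),
           matrix(rnorm(20000, mean = 6), ncol = 2))
idx <- init_subset(nrow(x))
length(idx)

# Only fit the model when mclust is installed
if (requireNamespace("mclust", quietly = TRUE)) {
  library(mclust)
  # The subset is used for initialisation only; all events are classified
  fit <- Mclust(x, G = 2, initialization = list(subset = idx), verbose = FALSE)
  table(fit$classification)
}
```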

Parameters to density_cut

This simple function works by smoothing a density function until the desired number of clusters is found. The clusters are then separated at the lowest point between two neighbouring clusters.
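
The idea can be sketched in base R as follows. This is not the implementation used by `bp_density_cut()` or `approx_adjust()`; `density_cut_sketch()` and its grid of bandwidth adjustments are hypothetical and only illustrate smoothing until the wanted number of peaks appears and cutting at the minima in between.

```r
# Hypothetical helper (not part of beadplexr)
density_cut_sketch <- function(x, k, adjust_grid = seq(0.5, 5, by = 0.25)) {
  for (adj in adjust_grid) {
    d <- stats::density(x, adjust = adj)
    # Change in the sign of the slope: -2 marks a local maximum,
    # +2 marks a local minimum
    turn <- diff(sign(diff(d$y)))
    peaks <- which(turn == -2) + 1
    if (length(peaks) == k) {
      valleys <- which(turn == 2) + 1
      # Keep only the minima that lie between the first and the last peak
      valleys <- valleys[valleys > min(peaks) & valleys < max(peaks)]
      cuts <- d$x[valleys]
      # Number the clusters from low to high values
      return(findInterval(x, cuts) + 1L)
    }
  }
  stop("no bandwidth adjustment in the grid gave ", k, " peaks")
}

set.seed(1)
# Simulated one-dimensional data with two populations
x <- c(rnorm(500, mean = 0), rnorm(500, mean = 6))
cl <- density_cut_sketch(x, k = 2)
table(cl)
```

Increasing `adjust` widens the kernel bandwidth, so small spurious bumps merge until only the requested number of peaks remains.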

Examples

library(beadplexr)
library(dplyr)
library(ggplot2)

data("lplex")

lplex[[1]] |>
  # Speed things up a bit by selecting one fourth of the events.
  # Probably not something you'd usually do
  dplyr::sample_frac(0.25) |>
  bp_kmeans(.parameter = c("FSC-A", "SSC-A"),
            .column_name = "population", .trim = 0.1, .k = 2) |>
  ggplot() +
  aes(x = `FSC-A`, y = `SSC-A`, colour = population) +
  geom_point()

lplex[[1]] |>
  # Speed things up a bit by selecting one fourth of the events.
  # Probably not something you'd usually do
  dplyr::sample_frac(0.25) |>
  bp_clara(.parameter = c("FSC-A", "SSC-A"),
           .column_name = "population", .trim = 0.1, .k = 2) |>
  ggplot() +
  aes(x = `FSC-A`, y = `SSC-A`, colour = population) +
  geom_point()

lplex[[1]] |>
  # Speed things up a bit by selecting one fourth of the events.
  # Probably not something you'd usually do
  dplyr::sample_frac(0.25) |>
  bp_clara(.parameter = c("FSC-A", "SSC-A"),
           .column_name = "population", .trim = 0, .k = 2) |>
  ggplot() +
  aes(x = `FSC-A`, y = `SSC-A`, colour = population) +
  geom_point()

## Not run: 
lplex[[1]] |>
  # Speed things up a bit by selecting one fourth of the events.
  # Probably not something you'd usually do
  dplyr::sample_frac(0.25) |>
  bp_dbscan(.parameter = c("FSC-A", "SSC-A"), .column_name = "population",
            .eps = 0.2, .MinPts = 50) |>
  ggplot() +
  aes(x = `FSC-A`, y = `SSC-A`, colour = population) +
  geom_point()

pop1 <- lplex[[1]] |>
  # Speed things up a bit by selecting one fourth of the events.
  # Probably not something you'd usually do
  dplyr::sample_frac(0.25) |>
  bp_dbscan(.parameter = c("FSC-A", "SSC-A"), .column_name = "population",
    .eps = 0.2, .MinPts = 50) |>
  dplyr::filter(population == "1")

pop1 |>
  bp_dbscan(.parameter = c("FL6-H", "FL2-H"), .column_name = "population",
    .eps = 0.2, .MinPts = 50) |>
  pull(population) |>
  unique()

pop1 |>
  bp_dbscan(.parameter = c("FL6-H", "FL2-H"), .column_name = "population",
    .eps = 0.2, .MinPts = 50, scale = FALSE) |>
  pull(population) |>
  unique()

## End(Not run)

lplex[[1]] |>
  bp_mclust(.parameter = c("FSC-A", "SSC-A"),
           .column_name = "population", .trim = 0, .k = 2) |>
  ggplot() +
  aes(x = `FSC-A`, y = `SSC-A`, colour = population) +
  geom_point()

lplex[[1]] |>
  bp_density_cut(.parameter = c("FSC-A"),
           .column_name = "population", .trim = 0, .k = 2) |>
  ggplot() +
  aes(x = `FSC-A`, y = `SSC-A`, colour = population) +
  geom_point()

[Package beadplexr version 0.5.0 Index]