Public API · RedClust.jl

Public API

Main functions

RedClust.fitpriorFunction
fitprior(data, algo, diss = false;
 Kmin = 1,
 Kmax = Int(floor(size(data)[end] / 2)),
 verbose = true)

Determines the best prior hyperparameters from the data. A notional clustering is obtained using k-means or k-medoids and the elbow method, and the distances are split into within-cluster distances and inter-cluster distances based on the notional clustering. These distances are then used to fit the prior hyperparameters using MLE and empirical Bayes sampling.

Required Arguments

  • data::Union{Vector{Vector{Float64}}, Matrix{Float64}}: can either be a vector of (possibly multi-dimensional) observations, or a matrix with each column an observation, or a square matrix of pairwise dissimilarities.
  • algo::String: must be one of "k-means" or "k-medoids".

Optional Arguments

  • diss::Bool: if true, data will be assumed to be a pairwise dissimilarity matrix.
  • Kmin::Integer: minimum number of clusters to scan for the elbow method.
  • Kmax::Integer: maximum number of clusters to scan for the elbow method. If left unspecified, it is set to half the number of observations.
  • verbose::Bool: if false, disables all info messages and progress bars.

Returns

An object of type PriorHyperparamsList.

source
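As an illustration, here is a minimal sketch of fitting the hyperparameters to synthetic data; the variable names are illustrative, and generatemixture is documented under Convenience functions below.

using RedClust
# Generate a small synthetic dataset; see generatemixture below.
mix = generatemixture(100, 5; dim = 5, σ = 0.2, rng = 42)
# Fit the prior hyperparameters from the raw observations using k-means.
params = fitprior(mix.points, "k-means"; Kmin = 2, Kmax = 20)
# Equivalently, fit directly from the pairwise dissimilarity matrix.
params_diss = fitprior(mix.distancematrix, "k-medoids", true)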
RedClust.fitprior2Function
fitprior2(data, algo, diss = false;
 Kmin = 1,
 Kmax = Int(floor(size(data)[end] / 2)),
 verbose = true)

Determines the best prior hyperparameters from the data. Uses the same method as fitprior to obtain values for $\sigma$, $\eta$, $u$, and $v$, but derives values for the cluster-specific parameters by considering within-cluster and cross-cluster distances over clusterings with $K$ clusters for all values of $K \in [K_\mathrm{min}, K_\mathrm{max}]$, weighted by the prior predictive distribution of $K$ in that range.

Required Arguments

  • data::Union{Vector{Vector{Float64}}, Matrix{Float64}}: can either be a vector of (possibly multi-dimensional) observations, or a matrix with each column an observation, or a square matrix of pairwise dissimilarities.
  • algo::String: must be one of "k-means" or "k-medoids".

Optional Arguments

  • diss::Bool: if true, data will be assumed to be a pairwise dissimilarity matrix.
  • Kmin::Integer: minimum number of clusters to scan for the elbow method.
  • Kmax::Integer: maximum number of clusters to scan for the elbow method. If left unspecified, it is set to half the number of observations.
  • verbose::Bool: if false, disables all info messages and progress bars.

Returns

An object of type PriorHyperparamsList.

source
RedClust.sampleKMethod
sampleK(params::PriorHyperparamsList, numsamples::Int, n::Int)::Vector{Int}
sampleK(η::Real, σ::Real, u::Real, v::Real, numsamples::Int, n::Int)

Returns a vector of length numsamples containing samples of $K$ (number of clusters) generated from its marginal prior predictive distribution inferred from params. The parameter n is the number of observations in the model.

source
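A brief usage sketch, assuming params was returned by fitprior for a dataset with 100 observations:

Ksamples = sampleK(params, 10000, 100)              # 10000 prior predictive draws of K
Ksamples2 = sampleK(2.0, 1.0, 1.0, 1.0, 10000, 100) # or pass η, σ, u, v directly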
RedClust.sampledistFunction
sampledist(params::PriorHyperparamsList, type::String, numsamples = 1)::Vector{Float64}

Generate a vector of samples of length numsamples from the prior predictive distribution on the distances, as encapsulated in params. type must be either "intercluster" or "intracluster".

source
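For example, the two fitted distance distributions can be compared as follows (assuming params was returned by fitprior):

wcd = sampledist(params, "intracluster", 5000) # within-cluster distance samples
icd = sampledist(params, "intercluster", 5000) # inter-cluster distance samples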
RedClust.runsamplerFunction
runsampler(data,
 options = MCMCOptionsList(),
 params = PriorHyperparamsList();
verbose = true) -> MCMCResult

Runs the MCMC sampler on the data.

Arguments

  • data::MCMCData: contains the distance matrix.
  • options::MCMCOptionsList: contains the number of iterations, burnin, etc.
  • params::PriorHyperparamsList: contains the prior hyperparameters for the model.
  • verbose::Bool: if false, disables all info messages and progress bars.

Returns

A struct of type MCMCResult containing the MCMC samples, convergence diagnostics, and summary statistics.

See also

MCMCData, MCMCOptionsList, fitprior, PriorHyperparamsList, MCMCResult.

source
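A minimal end-to-end sketch, assuming distmatrix is a precomputed pairwise dissimilarity matrix (the variable names are illustrative):

data = MCMCData(distmatrix)
params = fitprior(distmatrix, "k-medoids", true)
options = MCMCOptionsList(numiters = 50000)
result = runsampler(data, options, params)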

Point estimation

RedClust.binderlossMethod
binderloss(a::ClustLabelVector, b::ClustLabelVector;
normalised = true) -> Float64

Computes the Binder loss between two clusterings. If normalised = true then the result is equal to one minus the Rand index between a and b.

source
RedClust.getpointestimateMethod
getpointestimate(samples::MCMCResult; method = "MAP", loss = "VI")

Computes a point estimate from a vector of samples of cluster allocations by searching for a minimiser of the posterior expectation of some loss function.

Arguments

  • method::String: must be one of the following -
- `"MAP"`: maximum a posteriori.
 - `"MLE"`: maximum likelihood estimation.
 - `"MPEL"`: minimum posterior expected loss.
`"MAP"` and `"MLE"` search among the MCMC samples for the clustering with the maximum log posterior and log likelihood respectively. `"MPEL"` searches for a clustering that minimises the posterior expected loss of some loss function specified by the `loss` argument. The search space is the set of samples in `samples`.
  • loss: Determines the loss function used for the "MPEL" method. Must be either a string or a function. If specified as a string, it must be one of "binder" (Binder loss), "omARI" (one minus the Adjusted Rand Index), "VI" (Variation of Information distance), or "ID" (Information Distance). If specified as a function, it must have a method defined for (x::Vector{Int}, y::Vector{Int}) -> Real.

Returns

Returns a tuple (clust, i) where clust is a clustering allocation in samples and i is its sample index.

source
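For example, assuming result is an MCMCResult returned by runsampler:

# Clustering minimising the posterior expected Variation of Information distance.
pointestimate, index = getpointestimate(result; method = "MPEL", loss = "VI")
# A custom loss with a method for (Vector{Int}, Vector{Int}) -> Real also works.
myloss(a, b) = binderloss(a, b; normalised = false)
est, idx = getpointestimate(result; method = "MPEL", loss = myloss)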
RedClust.infodistMethod
infodist(a::ClustLabelVector, b::ClustLabelVector;
normalised = true) -> Float64

Computes the information distance between two clusterings. The information distance is defined as

\[d_{\mathrm{ID}}(a, b) = \max \{H(A), H(B)\} - I(A, B)\]

where $A$ and $B$ are the cluster membership probability functions for $a$ and $b$ respectively, $H$ denotes the entropy of a distribution, and $I$ denotes the mutual information between two distributions. The information distance has range $[0, \log N]$ where $N$ is the number of observations. If normalised = true, the result is scaled to the range $[0, 1]$.

source

Summary and clustering evaluation

RedClust.evaluateclusteringMethod

evaluateclustering(clusts::ClustLabelVector, truth::ClustLabelVector)

Returns a named tuple of values quantifying the accuracy of the clustering assignments in clusts with respect to the ground truth clustering assignments in truth.

Return values

  • nbloss: Normalised Binder loss (= 1 - rand index). Lower is better.
  • ari: Adjusted Rand index. Higher is better.
  • vi: Variation of Information distance ($\in [0, \log N]$). Lower is better.
  • nvi: Normalised VI distance ($\in [0, 1]$).
  • id: Information Distance ($\in [0, \log N]$). Lower is better.
  • nid: Normalised Information Distance ($\in [0, 1]$).
  • nmi: Normalised Mutual Information ($\in [0, 1]$). Higher is better.
source
RedClust.summariseMethod

summarise([io::IO], clusts::ClustLabelVector, truth::ClustLabelVector, printoutput = true) -> String

Prints a summary of the clustering accuracy of clusts with respect to the ground truth in truth. The output is printed to the output stream io, which defaults to stdout if not provided.

source
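A short sketch of both functions, assuming pointestimate is a clustering point estimate and clusts holds the ground-truth labels:

metrics = evaluateclustering(pointestimate, clusts)
metrics.ari   # adjusted Rand index
metrics.nmi   # normalised mutual information
summarise(stdout, pointestimate, clusts) # print a formatted summary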

Convenience functions

RedClust.adjacencymatrixMethod

adjacencymatrix(clusts::ClustLabelVector) -> Matrix{Bool}

Returns the n×n adjacency matrix corresponding to the given cluster label vector clusts, where n = length(clusts).

source
RedClust.generatemixtureMethod

generatemixture(N, K; [α, dim, radius, σ, rng])

Generates a multivariate Normal mixture, with kernel weights generated from a Dirichlet prior. The kernels are centred at the vertices of a dim-dimensional simplex with edge length radius.

Required Arguments

  • N::Integer: number of observations to generate.
  • K::Integer: number of mixture kernels.

Optional Arguments

  • α::Float64 = K: parameter for the Dirichlet prior.
  • dim::Integer = K: dimension of the observations.
  • radius::Float64 = 1: radius of the simplex whose vertices are the kernel means.
  • σ::Float64 = 0.1: variance of each kernel.
  • rng: a random number generator to use, or an integer to seed the default random number generator with. If not provided, the default RNG provided by the Random.jl package will be used with default seeding.

Returns

Named tuple containing the following fields-

  • points::Vector{Vector{Float64}}: a vector of N observations.
  • distancematrix::Matrix{Float64}: an N×N matrix of pairwise Euclidean distances between the observations.
  • clusts::ClustLabelVector: vector of N cluster assignments.
  • probs::Vector{Float64}: vector of K cluster weights generated from the Dirichlet prior, used to generate the observations.
  • oracle_coclustering::Matrix{Float64}: N×N matrix of co-clustering probabilities, calculated assuming full knowledge of the cluster centres and cluster weights.
source
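For instance (the parameter values here are only illustrative):

mix = generatemixture(100, 10; α = 10, dim = 10, σ = 0.25, rng = 44)
pnts = mix.points          # vector of 100 observations
D = mix.distancematrix     # 100×100 matrix of pairwise Euclidean distances
truth = mix.clusts         # ground-truth cluster labels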
RedClust.makematrixMethod

makematrix(x::AbstractVector{<:AbstractVector}) -> Matrix

Convert a vector of vectors into a matrix, where each vector becomes a column in the matrix.

source
RedClust.sortlabelsMethod

sortlabels(x::ClustLabelVector) -> ClustLabelVector

Returns a cluster label vector y such that x and y have the same adjacency structure and labels in y occur in sorted ascending order.

source
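The convenience functions above compose naturally; a brief sketch, assuming mix was returned by generatemixture:

A = adjacencymatrix(mix.clusts) # N×N Boolean co-membership matrix
X = makematrix(mix.points)      # matrix with one observation per column
y = sortlabels(mix.clusts)      # same partition, labels relabelled in ascending order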

Example datasets

RedClust.example_datasetMethod
example_dataset(n::Int)

Returns a named tuple containing the dataset from the n$^{\mathrm{th}}$ simulated example in the main paper. This dataset was generated using the following code in Julia v1.8.1:

	using RedClust
 	using Random: seed!
 	seed!(44)
 	K = 10 # Number of clusters
 	N = 100 # Number of points
 	data_σ = [0.25, 0.2, 0.18][n] # Variance of the normal kernel
 	data_dim = [10, 50, 10][n] # Data dimension
	data = generatemixture(N, K; α = 10, σ = data_σ, dim = data_dim);

Note however that the above code may produce different results on your computer, because Julia's default random number generator is not guaranteed to produce the same stream across different computers, different versions of Julia, or different versions of the Random.jl package, even with the same seed. The datasets have therefore been included with this package, and it is recommended to access them via this function.

See also generatemixture.

source
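For example, the data for the first simulated example can be loaded with:

dataset1 = example_dataset(1) # named tuple with the data for the first simulated example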
RedClust.example_datasetsMethod

Return a read-only handle to an HDF5 file that contains example datasets from the main paper. Remember to close the file once you are done reading from it. This function is provided for reproducibility purposes only; it is recommended to read the datasets via the convenience function example_dataset.

source

Types

RedClust.MCMCDataType

MCMCData(points::AbstractVector{<:AbstractVector{<:Float64}})
MCMCData(dissimilaritymatrix::AbstractMatrix{Float64})

Contains the pairwise dissimilarities for the MCMC sampler.

source
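Both constructors can be used, for example (assuming mix was returned by generatemixture):

d1 = MCMCData(mix.points)         # from a vector of observations
d2 = MCMCData(mix.distancematrix) # from a pairwise dissimilarity matrix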
RedClust.MCMCOptionsListType
MCMCOptionsList(;
 numiters = 5000,
 burnin = floor(0.2 * numiters),
 thin = 1,
 - `burnin::Integer`: number of iterations to discard as burn-in.
 - `thin::Integer`: will keep every `thin` samples.
 - `numGibbs::Integer`: number of intermediate Gibbs scans in the split-merge step.
 - `numMH::Integer`: number of split-merge steps per MCMC iteration.
source
RedClust.MCMCResultType

Struct containing MCMC samples.

Fields

  • options::MCMCOptionsList: options passed to the sampler.
  • params::PriorHyperparamsList: prior hyperparameters used by the sampler.
  • clusts::Vector{ClustLabelVector}: contains the clustering allocations. clusts[i] is a vector containing the clustering allocation from the ith sample.
  • posterior_coclustering::Matrix{Float64}: the posterior coclustering matrix.
  • K::Vector{Int}: posterior samples of $K$, i.e., the number of clusters. K[i] is the number of clusters in clusts[i].
  • r::Vector{Float64}: posterior samples of the parameter $r$.
  • p::Vector{Float64}: posterior samples of the parameter $p$.
  • K_mean, r_mean, p_mean: posterior mean of K, r, and p respectively.
  • K_variance, r_variance, p_variance: posterior variance of K, r, and p respectively.
  • K_acf::Vector{Float64}, r_acf::Vector{Float64}, p_acf::Vector{Float64}: autocorrelation function for K, r, and p respectively.
  • K_iac, r_iac, and p_iac: integrated autocorrelation coefficient for K, r, and p respectively.
  • K_ess::Float64, r_ess::Float64, and p_ess::Float64: effective sample size for K, r, and p respectively.
  • loglik::Vector{Float64}: log-likelihood for each sample.
  • logposterior::Vector{Float64}: a function proportional to the log-posterior for each sample, with constant of proportionality equal to the normalising constant of the partition prior.
  • splitmerge_splits: Boolean vector indicating the iterations when a split proposal was used in the split-merge step. Has length numMH * numiters (see MCMCOptionsList).
  • splitmerge_acceptance_rate: acceptance rate of the split-merge proposals.
  • r_acceptances: Boolean vector indicating the iterations (including burnin and the thinned out iterations) where the Metropolis-Hastings proposal for r was accepted.
  • r_acceptance_rate: Metropolis-Hastings acceptance rate for r.
  • runtime: total runtime for all iterations.
  • mean_iter_time: average time taken for each iteration.
source
RedClust.PriorHyperparamsListType

PriorHyperparamsList(; [kwargs])

Contains the prior hyperparameters for the model. It is recommended to set the values using the fitprior function. Do not set these values manually unless you know what you are doing.

Constructor arguments

  • δ1::Float64 = 1: the parameter $\delta_1$.
  • α::Float64 = 1: the shape parameter $\alpha$ in the prior for each $\lambda_k$.
  • β::Float64 = 1: the rate parameter $\beta$ in the prior for each $\lambda_k$.
  • δ2::Float64 = 1: the parameter $\delta_2$.
  • ζ::Float64 = 1: the shape parameter $\zeta$ in the prior for each $\theta_{kt}$.
  • γ::Float64 = 1: the rate parameter $\gamma$ in the prior for each $\theta_{kt}$.
  • η::Float64 = 1: the shape parameter $\eta$ in the prior for $r$.
  • σ::Float64 = 1: the rate parameter $\sigma$ in the prior for $r$.
  • u::Float64 = 1: the parameter $u$ in the prior for $p$.
  • v::Float64 = 1: the parameter $v$ in the prior for $p$.
  • proposalsd_r::Float64 = 1: standard deviation of the truncated Normal proposal when sampling r via a Metropolis-Hastings step.
  • K_initial::Int = 1: initial number of clusters to start the MCMC.
  • repulsion::Bool = true: whether to use the repulsive component of the likelihood.
  • maxK::Int = 0: maximum number of clusters to allow. If set to 0 then no maximum is imposed.

See also

fitprior

source
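If you do need to set values manually, the keyword constructor can be called directly; the values below are purely illustrative:

params = PriorHyperparamsList(; η = 2.0, σ = 1.0, u = 1.0, v = 1.0, K_initial = 5, repulsion = true)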
Changelog · RedClust.jl

Changelog

[1.2.2]

Added

  • tests for Julia 1.9 in CI matrix
  • HDF5.jl v0.17 compatibility

[1.2.1]

Added

Fixed

  • error when constructing MCMCData with input distance matrix that has non-zero diagonal entries.
  • edge case for `fitprior2` when Kmin = Kmax = 1 or Kmin = Kmax = N now generates a warning and falls back to default repulsion/cohesion parameters.

[1.2.0]

Added

  • added function fitprior2, alternative method to fit prior hyperparameters.

Changed

  • pretty print for MCMCResult also shows whether the model includes repulsion.
  • moved basic_example.jl from the examples folder to docs.
  • removed examples (added to a separate branch).

[1.1.0]

Added

  • added function signature for sampleK to accept individual parameters.

[1.0.1]

Documentation now updated to use Literate.jl to generate the example.

[1.0.0]

Added

  • examples from the main paper are now included in the examples folder. Data from the original paper is included in the data folder in the standard HDF5 format.
  • the examples now have more plots.
  • functionality to sample the distances from the prior predictive distribution (see sampledist).
  • functionality to sample $K$ from its marginal prior predictive (see sampleK).
  • generatemixture accepts a random number generator or a seed for the default RNG for reproducibility of results.

Removed

  • dependency on RCall and the various calls to the salso algorithm were removed. It is left to the user to make these calls if necessary.

[0.2.2]

Fixed

  • corrected time not printing in the correct format.
  • moved prettytime and prettynumber to utils.jl.

[0.2.1]

Added

  • added tests for pretty printing.
  • added tests for custom loss function in getpointestimate.

Fixed

  • verbose output in sampler missing newline.
  • custom loss function when computing a point estimate.

Removed

  • _infodist not in use, removed.

[0.2.0]

Added

  • added log-posterior to result.
  • added log-likelihood and log-posterior plots to basic example.
  • point-estimation via maximum likelihood and maximum a posteriori.
  • infodist function to compute the information distance between clusterings.
  • convenience constructor for MCMCData.
  • add verbose option for runsampler and fitprior.
  • pretty printing for MCMCData, MCMCOptionsList, and PriorHyperparamsList.
  • added bibliographic references to documentation.
  • added changelog.

Breaking

  • MCMCOptions constructor changed to account for the change in point-estimate calculation.
  • fitprior now only returns the hyperparameter list.
  • removed summarise for MCMCResult objects (use pretty printing instead).

Fixed

  • corrected computation of log-likelihood in result.
  • fixed edge case error in fitprior when the found value of K is either 1 or N. This case arises only for edge values of Kmin and Kmax, since the elbow method will otherwise not choose 1 or N for K.

Changed

  • added bounds checking for Kmin and Kmax in fitprior.
  • added input message for better debugging in fitprior.
  • added progress bar for fitprior if useR = false.
  • added input validation for generatemixture.
  • added input validation for binderloss and evaluateclustering.
  • added input validation for the constructors of MCMCData and MCMCOptionsList.
  • separated computation of log-likelihood and log-prior.
  • removed redundant loss function calculations when computing point-estimate.
  • Binder loss calculation (binderloss) is now approximate (using randindex from Clustering.jl) but faster.
  • runsampler no longer automatically calculates a point estimate.
  • summarise has a separate signature for MCMC output and for point estimate summary, and now provides more measures for point estimates.
  • minor optimisations using @inbounds, @simd, and @turbo.
  • using StaticArrays.jl and separate function for restricted Gibbs scan.
  • speedup via non-generic implementation of matrix and vector sums with LoopVectorization.jl.
Cite This Package · RedClust.jl

Citation Information

If you want to use this package in your work, please cite it as:

Natarajan, A., De Iorio, M., Heinecke, A., Mayer, E. and Glenn, S. (2023). ‘Cohesion and Repulsion in Bayesian Distance Clustering’, Journal of the American Statistical Association, 119(546), pp. 1374–1384. DOI: 10.1080/01621459.2023.2191821.

For BibTeX users:

@article{NDI23,
   doi = {10.1080/01621459.2023.2191821},
   author = {Natarajan, Abhinav and De Iorio, Maria and Heinecke, Andreas and Mayer, Emanuel and Glenn, Simon},
   title = {Cohesion and Repulsion in Bayesian Distance Clustering},
   journal = {Journal of the American Statistical Association},
+  volume = {119},
+  issue = {546},
+  pages={1374--1384},
   year = {2023}
-}
}
Funding Information · RedClust.jl

Funding Information

We are grateful to the following funding organisations for their financial support for the research that led to the main publication behind this package.

  • National University of Singapore, HSS Seed Fund R-607-000-449-646
  • Yale-NUS College, grant IG20-RA001
  • Singapore Ministry of Education, Academic Research Fund Tier 2 Grant MOE-T2EP40121-0021

We can visualise the true adjacency matrix of the observations with respect to the true clusters that they were drawn from, as well as the oracle coclustering matrix. The latter is the matrix of co-clustering probabilities of the observations conditioned upon the data generation process. This takes into account full information about the cluster weights (and how they are generated), the mixture kernels for each cluster, and the location and scale parameters for these kernels.

sqmatrixplot(combine_sqmatrices(oracle_coclustering, 1.0 * adjacencymatrix(clusts)), title = "Adjacency vs Oracle Co-clustering Probabilities \n(upper right and lower left triangle)\n")
[Figure: Adjacency vs Oracle Co-clustering Probabilities (upper right and lower left triangle)]

We can visualise the matrix of pairwise distances between the observations.

sqmatrixplot(distmatrix, title = "Matrix of Pairwise Distances")
[Figure: Matrix of Pairwise Distances]

We can also plot the histogram of distances, grouped by whether they are inter-cluster distances (ICD) or within-cluster distances (WCD).

begin
     empirical_intracluster = uppertriangle(distmatrix)[
 end
[Figure: Histogram of within-cluster (WCD) and inter-cluster (ICD) distances]

Prior Hyperparameters

RedClust includes the function fitprior to heuristically choose prior hyperparameters based on the data.

params = fitprior(points, "k-means", false)
Model Hyperparameters
 Likelihood Hyperparameters
 δ₁ = 17.380


We can also evaluate the prior hyperparameters by checking the marginal predictive distribution on $K$ (the number of clusters).

begin
     Ksamples = sampleK(params, 10000, N)
     histogram_pmf(Ksamples, legend = false,

[Figure: Prior predictive distribution of K]

Sampling

Running the MCMC is straightforward. We set up the MCMC options using MCMCOptionsList.

options = MCMCOptionsList(numiters=50000)
MCMC Options
 50000 iterations
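The sampler call itself does not appear above; presumably it is along the lines of the following sketch, which uses only the documented API (the construction of the MCMCData object is assumed):

result = runsampler(MCMCData(distmatrix), options, params)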

Summary for K
 IAC : 6.240


Plot the posterior distribution of $K$:

histogram_pmf(result.K,
 xlabel = "\$K\$", ylabel = "Probability", title = "Posterior Distribution of \$K\$")
[Figure: Posterior distribution of K]

Plot the posterior distribution of $r$:

begin
     histogram(result.r, normalize = :pdf,

[Figure: Posterior distribution of r]

Plot the posterior distribution of $p$:

begin
     histogram(result.p, normalize = :pdf,
     ylabel = "Density", xlabel = "\$p\$",

[Figure: Posterior distribution of p]

Convergence statistics

Plot the traceplot of the autocorrelation function of $K$:

plot(result.K_acf, legend = false, linewidth = 1,
 xlabel = "Lag", ylabel = "Autocorrelation",
 title = "Autocorrelation Function of \$K\$")
[Figure: Autocorrelation function of K]

Plot the traceplot of the autocorrelation function of $r$:

plot(result.r_acf, legend = false, linewidth = 1,
 xlabel = "Lag", ylabel = "Autocorrelation",
 title = "Autocorrelation Function of \$r\$")
[Figure: Autocorrelation function of r]

Plot the traceplot of the autocorrelation function of $p$:

plot(result.p_acf, legend = false, linewidth = 1,
 xlabel = "Lag", ylabel = "Autocorrelation",
 title = "Autocorrelation Function of \$p\$")
[Figure: Autocorrelation function of p]

Check the trace plot of the log-likelihood to make sure the MCMC is moving well:

plot(result.loglik, legend = false, linewidth = 1,
 xlabel = "Iteration", ylabel = "Log likelihood",
 title = "Log-Likelihood Trace Plot")
[Figure: Log-likelihood trace plot]

Check the trace plot of the log-posterior:

plot(result.logposterior, legend = false, linewidth = 1,
 xlabel = "Iteration", ylabel = "Log posterior",
 title = "Log-Posterior Trace Plot")
[Figure: Log-posterior trace plot]

Point Estimates

The function getpointestimate finds an optimal point estimate, based on some notion of optimality. For example, to get the maximum a posteriori estimate we can run the following.

pointestimate, index = getpointestimate(result; method="MAP")
([1, 1, 1, 1, 1, 2, 1, 1, 1, 3  …  2, 2, 2, 2, 2, 2, 2, 2, 2, 2], 9716)

We can compare the point-estimate to the true clustering through their adjacency matrices.

sqmatrixplot(combine_sqmatrices(adjacencymatrix(pointestimate), adjacencymatrix(clusts)),
 title = "True Clustering vs MAP Point Estimate")
[Figure: True Clustering vs MAP Point Estimate]

We can check the accuracy of the point estimate in terms of clustering metrics.

summarise(pointestimate, clusts)
Clustering summary
 Number of clusters : 13
Adjusted Rand Index : 0.9028187471471413
Normalised Variation of Information (NVI) distance : 0.06086895626900138
Normalised Information Distance (NID) : 0.0429266154915723
Normalised Mutual Information : 0.9400474997102378


This page was generated using Literate.jl.


This results in the following partition prior:

\[\pi(\rho_n \mid r, p) \propto K! p^{n-K}(1-p)^{rK}\Gamma(r)^{-K}\prod_{k=1}^K n_k \Gamma(n_k+r-1)\]

where $\rho_n$ is the partition and $K$ is the number of clusters in the partition. The partition likelihood is given by

\[\begin{align*} \lambda_k &\overset{\mathrm{iid}}{\sim} \mathrm{Gamma}(\alpha, \beta), \quad 1 \leq k \leq K\\ \theta_{kt} &\overset{\mathrm{iid}}{\sim} \mathrm{Gamma}(\zeta, \gamma), \quad 1 \leq k < t \leq K \end{align*}\]

\[\pi(D \mid \rho_n, \boldsymbol{\lambda}, \boldsymbol{\theta}, r, p) = \left[ \prod_{k=1}^K \prod_{\substack{i, j \in C_k \\ i \neq j}} \frac{D_{ij}^{\delta_1 - 1}\lambda_k^{\delta_1}}{\Gamma(\delta_1)} \exp(-\lambda_k D_{ij}) \right] \left[\prod_{\substack{t, k = 1 \\ t \neq k}}^K \prod_{\substack{i \in C_k \\ j \in C_t}} \frac{D_{ij}^{\delta_2-1}\theta_{kt}^{\delta_2}}{\Gamma(\delta_2)} \exp(-\theta_{kt}D_{ij}) \right]\]

where $\boldsymbol D$ is the matrix of pairwise dissimilarities between observations.

Point estimation

A clustering point-estimate $\boldsymbol c^*$ can be determined in a number of ways.

  1. Choosing the sample with maximum likelihood or maximum posterior. This corresponds to an MLE or MAP estimate using the MCMC samples as an approximation to the likelihood and posterior.
  2. Searching for a clustering that minimises the posterior expectation of a loss function $\ell$ such as the Binder loss (Binder, 1978), the Variation of Information distance (Meilă, 2007; Wade and Ghahramani, 2018), or the Normalised Information Distance (Kraskov et al., 2005). That is,

\[\boldsymbol c^* = \argmin_{\boldsymbol c} \mathbb{E}_{\boldsymbol c'}[\ell(\boldsymbol c, \boldsymbol c')]\]

  • A naive method is to restrict the search space to those clusterings visited by the MCMC sampler. This method is implemented in RedClust in the function getpointestimate.
  • A better method is the SALSO algorithm (Dahl et al., 2022), implemented in the R package salso, which heuristically searches the space of all possible clusterings. You can use the RCall package in Julia to make calls to the salso package in R if you have R and salso installed; a sketch is given below.
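A minimal sketch of that workflow, assuming R and the salso package are installed, that result is an MCMCResult from runsampler, and that the call signature of salso matches its own documentation (check there before use):

using RCall, RedClust
draws = Matrix(makematrix(result.clusts)') # samples × observations matrix of cluster labels
@rput draws
R"library(salso); est <- salso(draws)"     # run the SALSO search in R
@rget est                                  # bring the point estimate back into Julia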

Bibliography

Betancourt, B., Zanella, G. and Steorts, R. C. (2022), ‘Random partition models for microclustering tasks’, Journal of the American Statistical Association 117(539), 1215–1227. DOI: 10.1080/01621459.2020.1841647.

Binder, D. A. (1978). Bayesian Cluster Analysis. Biometrika, 65(1), 31–38. DOI: 10.2307/2335273.

Dahl, D. B., Johnson, D. J. and Müller, P. (2022), ‘Search algorithms and loss functions for Bayesian clustering’, Journal of Computational and Graphical Statistics 0(0), 1–13. DOI: 10.1080/10618600.2022.2069779.

Kraskov, A., Stögbauer, H., Andrzejak, R. G. and Grassberger, P. (2005), ‘Hierarchical clustering using mutual information’, Europhysics Letters (EPL) 70(2), 278–284. DOI: 10.1209/epl/i2004-10483-y.

Meilă, M. (2007), ‘Comparing clusterings—an information based distance’, Journal of Multivariate Analysis 98(5), 873–895. DOI: 10.1016/j.jmva.2006.11.013.

Miller, J., Betancourt, B., Zaidi, A., Wallach, H. and Steorts, R. C. (2015), ‘Microclustering: When the cluster sizes grow sublinearly with the size of the data set’, arXiv 1512.00792.

Natarajan, A., De Iorio, M., Heinecke, A., Mayer, E. and Glenn, S. (2023). ‘Cohesion and Repulsion in Bayesian Distance Clustering’, Journal of the American Statistical Association 119(546), 1374–1384. DOI: 10.1080/01621459.2023.2191821.

Wade, S. and Ghahramani, Z. (2018), ‘Bayesian Cluster Analysis: Point Estimation and Credible Balls (with Discussion)’, Bayesian Analysis 13(2), 559 – 626. DOI: 10.1214/17-BA1073.

+\end{align*}\]

\[\pi(D, \mid \rho_n, \boldsymbol{\lambda}, \boldsymbol{\theta}, r, p) = \left[ \prod_{k=1}^K \prod_{\substack{i, j \in C_k \\ i \neq j}} \frac{D_{ij}^{\delta_1 - 1}\lambda_k^{\delta_1}}{\Gamma(\delta_1)} \exp(-\lambda_k D_{ij}) \right] \left[\prod_{\substack{t, k = 1 \\ t \neq K}} \prod_{\substack{i \in C_k \\ j \in C_t}} \frac{D_{ij}^{\delta_2-1}\theta_{kt}^{\delta_2}}{\Gamma(\delta_2)} \exp(-\theta_{kt}D_{ij}) \right]\]

where $\boldsymbol D$ is the matrix of pairwise dissimilarities between observations.

Point estimation

A clustering point-estimate $\boldsymbol c^*$ can be determined in a number of ways.

  1. Choosing the sample with maximum likelihood or maximum posterior. This corresponds to an MLE or MAP estimate using the MCMC samples as an approximation to the likelihood and posterior.
  2. Searching for a clustering that minimises the posterior expectation of a loss function $\ell$ such as the Binder loss (Binder, 1978), the Variation of Information distance (Meilă, 2007; Wade and Ghahramani, 2018), or the Normalised Information Distance (Kraskov et al., 2005). That is,

\[\boldsymbol c^* = \argmin_{\boldsymbol c} \mathbb{E}_{\boldsymbol c'}[\ell(\boldsymbol c, \boldsymbol c')]\]

  1. A naive method is to restrict the search space to those clusterings visited by the MCMC sampler. This method is implemented in RedClust in the function getpointestimate.
  2. A better method is the SALSO algorithm (Dahl et al., 2022), implemented in the R package salso, which heuristically searches the space of all possible clusterings. You can use the RCall package in Julia to make calls to the salso package in R if you have R and salso installed.

Bibliography

Betancourt, B., Zanella, G. and Steorts, R. C. (2022), ‘Random partition models for microclustering tasks’, Journal of the American Statistical Association 117(539), 1215–1227. DOI: 10.1080/01621459.2020.1841647.

Binder, D. A. (1978). Bayesian Cluster Analysis. Biometrika, 65(1), 31–38. DOI: 10.2307/2335273.

Dahl, D. B., Johnson, D. J. and Müller, P. (2022), ‘Search algorithms and loss functions for Bayesian clustering’, Journal of Computational and Graphical Statistics 0(0), 1–13. DOI: 10.1080/10618600.2022.2069779.

Kraskov, A., Stögbauer, H., Andrzejak, R. G. and Grassberger, P. (2005), ‘Hierarchical clustering using mutual information’, Europhysics Letters (EPL) 70(2), 278–284. DOI: 10.1209/epl/i2004-10483-y.

Meilă, M. (2007), ‘Comparing clusterings—an information based distance’, Journal of Multivariate Analysis 98(5), 873–895. DOI: 10.1016/j.jmva.2006.11.013.

Miller, J., Betancourt, B., Zaidi, A., Wallach, H. and Steorts, R. C. (2015), ‘Microclustering: When the cluster sizes grow sublinearly with the size of the data set’, arXiv 1512.00792.

Natarajan, A., De Iorio, M., Heinecke, A., Mayer, E. and Glenn, S. (2023), ‘Cohesion and Repulsion in Bayesian Distance Clustering’, Journal of the American Statistical Association. DOI: 10.1080/01621459.2023.2191821.

Wade, S. and Ghahramani, Z. (2018), ‘Bayesian Cluster Analysis: Point Estimation and Credible Balls (with Discussion)’, Bayesian Analysis 13(2), 559–626. DOI: 10.1214/17-BA1073.

diff --git a/dev/search/index.html b/dev/search/index.html
index 19ae077..b2ede97 100644
--- a/dev/search/index.html
+++ b/dev/search/index.html
@@ -1,2 +1,2 @@
-Search · RedClust.jl

Loading search...

    +Search · RedClust.jl

    Loading search...

      diff --git a/dev/search_index.js b/dev/search_index.js index 0b7499e..1a24a6d 100644 --- a/dev/search_index.js +++ b/dev/search_index.js @@ -1,3 +1,3 @@ var documenterSearchIndex = {"docs": -[{"location":"api/#Public-API","page":"Public API","title":"Public API","text":"","category":"section"},{"location":"api/#Main-functions","page":"Public API","title":"Main functions","text":"","category":"section"},{"location":"api/","page":"Public API","title":"Public API","text":"Modules = [RedClust]\nPages = [\"prior.jl\", \"mcmc.jl\"]\nOrder = [:function]","category":"page"},{"location":"api/#RedClust.fitprior","page":"Public API","title":"RedClust.fitprior","text":"fitprior(data, algo, diss = false;\nKmin = 1,\nKmax = Int(floor(size(data)[end] / 2),\nverbose = true)\n\nDetermines the best prior hyperparameters from the data. A notional clustering is obtained using k-means or k-medoids and the elbow method, and the distances are split into within-cluster distances and inter-cluster distances based on the notional clustering. These distances are then used to fit the prior hyperparameters using MLE and empirical Bayes sampling.\n\nRequired Arguments\n\ndata::Union{Vector{Vector{Float64}}, Matrix{Float64}}: can either be a vector of (possibly multi-dimensional) observations, or a matrix with each column an observation, or a square matrix of pairwise dissimilarities.\nalgo::String: must be one of \"k-means\" or \"k-medoids\".\n\nOptional Arguments\n\ndiss::bool: if true, data will be assumed to be a pairwise dissimilarity matrix.\nKmin::Integer: minimum number of clusters to scan for the elbow method.\nKmax::Integer: maximum number of clusters to scan for the elbow method. If left unspecified, it is set to half the number of observations.\nverbose::Bool: if false, disables all info messages and progress bars.\n\nReturns\n\nAn object of type PriorHyperparamsList.\n\n\n\n\n\n","category":"function"},{"location":"api/#RedClust.fitprior2","page":"Public API","title":"RedClust.fitprior2","text":"fitprior2(data, algo, diss = false;\nKmin = 1,\nKmax = Int(floor(size(data)[end] / 2),\nverbose = true)\n\nDetermines the best prior hyperparameters from the data. Uses the same method as fitprior to obtain values for sigma, eta, u, and v, but derives values for the cluster-specific parameters by considering within-cluster and cross-cluster distances over clusterings with K clusters for all values of K in K_mathrmmin K_mathrmmax, weighted by the prior predictive distribution of K in that range.\n\nRequired Arguments\n\ndata::Union{Vector{Vector{Float64}}, Matrix{Float64}}: can either be a vector of (possibly multi-dimensional) observations, or a matrix with each column an observation, or a square matrix of pairwise dissimilarities.\nalgo::String: must be one of \"k-means\" or \"k-medoids\".\n\nOptional Arguments\n\ndiss::bool: if true, data will be assumed to be a pairwise dissimilarity matrix.\nKmin::Integer: minimum number of clusters to scan for the elbow method.\nKmax::Integer: maximum number of clusters to scan for the elbow method. 
If left unspecified, it is set to half the number of observations.\nverbose::Bool: if false, disables all info messages and progress bars.\n\nReturns\n\nAn object of type PriorHyperparamsList.\n\n\n\n\n\n","category":"function"},{"location":"api/#RedClust.sampleK-Tuple{Real, Real, Real, Real, Int64, Int64}","page":"Public API","title":"RedClust.sampleK","text":"sampleK(params::PriorHyperparamsList, numsamples::Int, n::Int)::Vector{Int}\nsampleK(η::Real, σ::Real, u::Real, v::Real, numsamples::Int, n::Int)\n\nReturns a vector of length numsamples containing samples of K (number of clusters) generated from its marginal prior predictive distribution inferred from params. The parameter n is the number of observations in the model.\n\n\n\n\n\n","category":"method"},{"location":"api/#RedClust.sampledist","page":"Public API","title":"RedClust.sampledist","text":"sampledist(params::PriorHyperparamsList, type::String, numsamples = 1)::Vector{Float64}\n\nGenerate a vector of samples of length numsamples from the prior predictive distribution on the distances, as encapsulated in params. type must be either \"intercluster\" or \"intracluster\".\n\n\n\n\n\n","category":"function"},{"location":"api/#RedClust.runsampler","page":"Public API","title":"RedClust.runsampler","text":"runsampler(data,\noptions = MCMCOptionsList(),\nparams = PriorHyperparamsList();\nverbose = true) -> MCMCResult\n\nRuns the MCMC sampler on the data.\n\nArguments\n\ndata::MCMCData: contains the distance matrix.\noptions::MCMCOptionsList: contains the number of iterations, burnin, etc.\nparams::PriorHyperparamsList: contains the prior hyperparameters for the model.\nverbose::Bool: if false, disables all info messages and progress bars.\n\nReturns\n\nA struct of type MCMCResult containing the MCMC samples, convergence diagnostics, and summary statistics.\n\nSee also\n\nMCMCData, MCMCOptionsList, fitprior, PriorHyperparamsList, MCMCResult.\n\n\n\n\n\n","category":"function"},{"location":"api/#Point-estimation","page":"Public API","title":"Point estimation","text":"","category":"section"},{"location":"api/","page":"Public API","title":"Public API","text":"Modules = [RedClust]\nPages = [\"pointestimate.jl\"]\nOrder = [:function]","category":"page"},{"location":"api/#RedClust.binderloss-Tuple{Vector{Int64}, Vector{Int64}}","page":"Public API","title":"RedClust.binderloss","text":"binderloss(a::ClustLabelVector, b::ClustLabelVector;\nnormalised = true) -> Float64\n\nComputes the Binder loss between two clusterings. If normalised = true then the result is equal to one minus the rand index between a and b.\n\n\n\n\n\n","category":"method"},{"location":"api/#RedClust.getpointestimate-Tuple{MCMCResult}","page":"Public API","title":"RedClust.getpointestimate","text":"getpointestimate(samples::MCMCResult; method = \"MAP\", loss = \"VI\")\n\nComputes a point estimate from a vector of samples of cluster allocations by searching for a minimiser of the posterior expectation of some loss function.\n\nArguments\n\nmethod::String: must be one of the following -\n\n- `\"MAP\"`: maximum a posteriori.\n- `\"MLE\"`: maximum likelihood estimation.\n- `\"MPEL\"`: minimum posterior expected loss.\n`\"MAP\"` and `\"MLE\"` search among the MCMC samples for the clustering with the maximum log posterior and log likelihood respectively. `\"MPEL\"` searches for a clustering that minimises the posterior expected loss of some loss function specified by the `loss` argument. 
The search space is the set of samples in `samples`.\n\nloss: Determines the loss function used for the \"MPEL\" method. Must be either be a string or a function. If specified as a string, must be one of \"binder\" (Binder loss), \"omARI\" (one minus the Adjusted Rand Index), \"VI\" (Variation of Information distance), or \"ID\" (Information Distance). If specified as a function, must have a method defined for (x::Vector{Int}, y::Vector{Int}) -> Real.\n\nReturns\n\nReturns a tuple (clust, i) where clust is a clustering allocation in samples and i is its sample index.\n\n\n\n\n\n","category":"method"},{"location":"api/#RedClust.infodist-Tuple{Vector{Int64}, Vector{Int64}}","page":"Public API","title":"RedClust.infodist","text":"infodist(a::ClustLabelVector, b::ClustLabelVector;\nnormalised = true) -> Float64\n\nComputes the information distance between two clusterings. The information distance is defined as\n\nd_mathrmID(a b) = max H(A) H(B) - I(A B)\n\nwhere A and B are the cluster membership probability functions for a and b respectively, H denotes the entropy of a distribution, and I denotes the mutual information between two distributions. The information distance has range 0 log N where N is the number of observations. If normalised = true, the result is scaled to the range 0 1.\n\n\n\n\n\n","category":"method"},{"location":"api/#Summary-and-clustering-evaluation","page":"Public API","title":"Summary and clustering evaluation","text":"","category":"section"},{"location":"api/","page":"Public API","title":"Public API","text":"Modules = [RedClust]\nPages = [\"summaries.jl\"]\nOrder = [:function]","category":"page"},{"location":"api/#RedClust.evaluateclustering-Tuple{Vector{Int64}, Vector{Int64}}","page":"Public API","title":"RedClust.evaluateclustering","text":"evaluateclustering(clusts::ClustLabelVector, truth::ClustLabelVector) Returns a named tuple of values quantifying the accuracy of the clustering assignments in clusts with respect to the ground truth clustering assignments in truth.\n\nReturn values\n\nnbloss: Normalised Binder loss (= 1 - rand index). Lower is better.\nari: Adjusted Rand index. Higher is better.\nvi: Variation of Information distance (in 0 log N). Lower is better.\nnvi: Normalised VI distance (in 0 1).\nid: Information Distance (in 0 log N). Lower is better.\nnid: Normalised Information Distance (in 0 1).\nnmi: Normalised Mutual Information (in 0 1). Higher is better.\n\n\n\n\n\n","category":"method"},{"location":"api/#RedClust.summarise-Tuple{IO, Vector{Int64}, Vector{Int64}}","page":"Public API","title":"RedClust.summarise","text":"summarise([io::IO], clusts::ClustLabelVector, truth::ClustLabelVector, printoutput = true) -> String Prints a summary of the clustering accuracy of clusts with respect to the ground truth in truth. 
The output is printed to the output stream io, which defaults to stdout if not provided.\n\n\n\n\n\n","category":"method"},{"location":"api/#Convenience-functions","page":"Public API","title":"Convenience functions","text":"","category":"section"},{"location":"api/","page":"Public API","title":"Public API","text":"Modules = [RedClust]\nPages = [\"utils.jl\"]\nOrder = [:function]","category":"page"},{"location":"api/#RedClust.adjacencymatrix-Tuple{Vector{Int64}}","page":"Public API","title":"RedClust.adjacencymatrix","text":"adjacencymatrix(clusts::ClustLabelVector) -> Matrix{Bool} Returns the n×n adjacency matrix corresponding to the given cluster label vector clusts, where n = length(clusts).\n\n\n\n\n\n","category":"method"},{"location":"api/#RedClust.generatemixture-Tuple{Integer, Integer}","page":"Public API","title":"RedClust.generatemixture","text":"generatemixture(N, K; [α, dim, radius, σ, rng])\n\nGenerates a multivariate Normal mixture, with kernel weights generated from a Dirichlet prior. The kernels are centred at the vertices of a dim-dimensional simplex with edge length radius.\n\nRequired Arguments\n\nN::Integer: number of observations to generate.\nK::Integer: number of mixture kernels.\n\nOptional Arguments\n\nα::Float64 = K: parameter for the Dirichlet prior.\ndim::Integer = K: dimension of the observations.\nradius::Float64 = 1: radius of the simplex whose vertices are the kernel means.\nσ::Float64 = 0.1: variance of each kernel.\nrng: a random number generator to use, or an integer to seed the default random number generator with. If not provided, the default RNG provided by the Random.jl package will be used with default seeding.\n\nReturns\n\nNamed tuple containing the following fields-\n\npoints::Vector{Vector{Float64}}: a vector of N observations.\ndistancematrix::Matrix{Float64}: an N×N matrix of pairwise Euclidean distances between the observations.\nclusts::ClustLabelVector: vector of N cluster assignments.\nprobs::Float64: vector of K cluster weights generated from the Dirichlet prior, used to generate the observations.\noracle_coclustering::Matrix{Float64}: N×N matrix of co-clustering probabilities, calculated assuming full knowledge of the cluster centres and cluster weights.\n\n\n\n\n\n","category":"method"},{"location":"api/#RedClust.makematrix-Tuple{AbstractVector{<:AbstractVector}}","page":"Public API","title":"RedClust.makematrix","text":"makematrix(x::AbstractVector{<:AbstractVector}) -> Matrix\n\nConvert a vector of vectors into a matrix, where each vector becomes a column in the matrix.\n\n\n\n\n\n","category":"method"},{"location":"api/#RedClust.sortlabels-Tuple{Vector{Int64}}","page":"Public API","title":"RedClust.sortlabels","text":"sortlabels(x::ClustLabelVector) -> ClustLabelVector Returns a cluster label vector y such that x and y have the same adjacency structure and labels in y occur in sorted ascending order.\n\n\n\n\n\n","category":"method"},{"location":"api/#Example-datasets","page":"Public API","title":"Example datasets","text":"","category":"section"},{"location":"api/","page":"Public API","title":"Public API","text":"Modules = [RedClust]\nPages = [\"example_data.jl\"]\nOrder = [:function]","category":"page"},{"location":"api/#RedClust.example_dataset-Tuple{Int64}","page":"Public API","title":"RedClust.example_dataset","text":"example_dataset(n::Int)\n\nReturns a named tuple containing the dataset from the n^mathrmth simulated example in the main paper. 
This dataset was generated using the following code in Julia v1.8.1:\n\n\tusing RedClust\n\tusing Random: seed!\n\tseed!(44)\n\tK = 10 # Number of clusters\n\tN = 100 # Number of points\n\tdata_σ = [0.25, 0.2, 0.18][n] # Variance of the normal kernel\n\tdata_dim = [10, 50, 10][n] # Data dimension\n\tdata = generatemixture(N, K; α = 10, σ = data_σ, dim = data_dim);\n\nNote however that the above code may produce different results on your computer because the random number generator in Julia is not meant for reproducible results across different computers, different versions of Julia, or different versions of the Random.jl package, even with appropriate seeding. Therefore the datasets have been included with this package, and it is recommended to access them via this function.\n\nSee also generatemixture.\n\n\n\n\n\n","category":"method"},{"location":"api/#RedClust.example_datasets-Tuple{}","page":"Public API","title":"RedClust.example_datasets","text":"Return a read-only handle to a HDF5 file that contains example datasets from the main paper. You must remember to close the file once you are done reading from it. This function is provided for reproducibility purposes only; it is recommended to read the datasets via the convenience function example_dataset.\n\n\n\n\n\n","category":"method"},{"location":"api/#Types","page":"Public API","title":"Types","text":"","category":"section"},{"location":"api/","page":"Public API","title":"Public API","text":"Modules = [RedClust]\nOrder = [:type]","category":"page"},{"location":"api/#RedClust.ClustLabelVector","page":"Public API","title":"RedClust.ClustLabelVector","text":"Alias for Vector{Int}\n\n\n\n\n\n","category":"type"},{"location":"api/#RedClust.MCMCData","page":"Public API","title":"RedClust.MCMCData","text":"MCMCData(points::AbstractVector{<:AbstractVector{<:Float64}}) MCMCData(dissimilaritymatrix::AbstractMatrix{Float64})\n\nContains the pairwise dissimilarities for the MCMC sampler.\n\n\n\n\n\n","category":"type"},{"location":"api/#RedClust.MCMCOptionsList","page":"Public API","title":"RedClust.MCMCOptionsList","text":"MCMCOptionsList(;\nnumiters = 5000,\nburnin = floor(0.2 * numiters),\nthin = 1,\nnumGibbs = 5,\nnumMH = 1)\n\nList of options for running the MCMC.\n\n# Constructor arguments\n- `numiters::Integer`: number of iterations to run.\n- `burnin::Integer`: number of iterations to discard as burn-in.\n- `thin::Integer`: will keep every `thin` samples.\n- `numGibbs:Integer`: number of intermediate Gibbs scans in the split-merge step.\n- `numMH:Integer`: number of split-merge steps per MCMC iteration.\n\n\n\n\n\n","category":"type"},{"location":"api/#RedClust.MCMCResult","page":"Public API","title":"RedClust.MCMCResult","text":"Struct containing MCMC samples.\n\nFields\n\noptions::MCMCOptionsList: options passed to the sampler.\nparams::PriorHyperparamsList: prior hyperparameters used by the sampler.\nclusts::Vector{ClustLabelVector}: contains the clustering allocations. clusts[i] is a vector containing the clustering allocation from the ith sample.\nposterior_coclustering::Matrix{Float64}: the posterior coclustering matrix.\nK::Vector{Int}: posterior samples of K, i.e., the number of clusters. 
K[i] is the number of clusters in clusts[i].\nr::Vector{Float64}: posterior samples of the parameter r.\np::Vector{Float64}: posterior samples of the parameter p.\nK_mean, r_mean, p_mean: posterior mean of K, r, and p respectively.\nK_variance, r_variance, p_variance: posterior variance of K, r, and p respectively.\nK_acf::Vector{Float64}, r_acf::Vector{Float64}, p_acf::Vector{Float64}: autocorrelation function for K, r, and p respectively.\nK_iac, r_iac, and p_iac: integrated autocorrelation coefficient for K, r, and p respectively.\nK_ess::Float64, r_ess::Float64, and p_ess::Float64: effective sample size for K, r, and p respectively.\nloglik::Vector{Float64}: log-likelihood for each sample.\nlogposterior::Vector{Float64}: a function proportional to the log-posterior for each sample, with constant of proportionality equal to the normalising constant of the partition prior.\nsplitmerge_splits: Boolean vector indicating the iterations when a split proposal was used in the split-merge step. Has length numMH * numiters (see MCMCOptionsList).\nsplitmerge_acceptance_rate: acceptance rate of the split-merge proposals.\nr_acceptances: Boolean vector indicating the iterations (including burnin and the thinned out iterations) where the Metropolis-Hastings proposal for r was accepted.\nr_acceptance_rate:: Metropolis-Hastings acceptance rate for r.\nruntime: total runtime for all iterations.\nmean_iter_time: average time taken for each iteration.\n\n\n\n\n\n","category":"type"},{"location":"api/#RedClust.PriorHyperparamsList","page":"Public API","title":"RedClust.PriorHyperparamsList","text":"PriorHyperparamsList(; [kwargs])\n\nContains the prior hyperparameters for the model. It is recommended to set the values using the fitprior function. Do not set these values manually unless you know what you are doing.\n\nConstructor arguments\n\nδ1::Float64 = 1: the parameter delta_1.\nα::Float64 = 1: the shape parameter alpha in the prior for each lambda_k.\nβ::Float64 = 1: the rate parameter beta in the prior for each lambda_k.\nδ2::Float64 = 1: the parameter delta_2.\nζ::Float64 = 1: the shape parameter zeta in the prior for each theta_kt.\nγ::Float64 = 1: the rate parameter gamma in the prior for each theta_kt.\nη::Float64 = 1: the shape parameter eta in the prior for r.\nσ::Float64 = 1: the rate parameter sigma in the prior for r.\nu::Float64 = 1: the parameter u in the prior for p.\nv::Float64 = 1: the parameter v in the prior for p.\nproposalsd_r::Float64 = 1: standard deviation of the truncated Normal proposal when sampling r via a Metropolis-Hastings step.\nK_initial::Int = 1: initial number of clusters to start the MCMC.\nrepulsion::Bool = true: whether to use the repulsive component of the likelihood.\nmaxK::Int = 0: maxmimum number of clusters to allow. 
If set to 0 then no maximum is imposed.\n\nSee also\n\nfitprior\n\n\n\n\n\n","category":"type"},{"location":"funding/#Funding-Information","page":"Funding Information","title":"Funding Information","text":"","category":"section"},{"location":"funding/","page":"Funding Information","title":"Funding Information","text":"We are grateful to the following funding organisations for their financial support for the research that led to the main publication behind this package.","category":"page"},{"location":"funding/","page":"Funding Information","title":"Funding Information","text":"National University of Singapore, HSS Seed Fund R-607-000-449-646\nYale-NUS College, grant IG20-RA001\nSingapore Ministry of Education, Academic Research Fund Tier 2 Grant MOE-T2EP40121-0021","category":"page"},{"location":"cite/#Citation-Information","page":"Cite This Package","title":"Citation Information","text":"","category":"section"},{"location":"cite/","page":"Cite This Package","title":"Cite This Package","text":"If you want to use this package in your work, please cite it as:","category":"page"},{"location":"cite/","page":"Cite This Package","title":"Cite This Package","text":"Natarajan, A., De Iorio, M., Heinecke, A., Mayer, E. and Glenn, S. (2023). ‘Cohesion and Repulsion in Bayesian Distance Clustering’, Journal of the Americal Statistical Association. DOI: 10.1080/01621459.2023.2191821.","category":"page"},{"location":"cite/","page":"Cite This Package","title":"Cite This Package","text":"For BibTeX users:","category":"page"},{"location":"cite/","page":"Cite This Package","title":"Cite This Package","text":"@article{NDI23,\n doi = {10.1080/01621459.2023.2191821},\n author = {Natarajan, Abhinav and De Iorio, Maria and Heinecke, Andreas and Mayer, Emanuel and Glenn, Simon},\n title = {Cohesion and Repulsion in Bayesian Distance Clustering},\n journal = {Journal of the American Statistical Association},\n year = {2023}\n}","category":"page"},{"location":"changelog/#Changelog","page":"Changelog","title":"Changelog","text":"","category":"section"},{"location":"changelog/#[1.2.2]","page":"Changelog","title":"[1.2.2]","text":"","category":"section"},{"location":"changelog/#Added","page":"Changelog","title":"Added","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"tests for Julia 1.9 in CI matrix\nHDF5.jl v0.17 compatibility","category":"page"},{"location":"changelog/#[1.2.1]","page":"Changelog","title":"[1.2.1]","text":"","category":"section"},{"location":"changelog/#Added-2","page":"Changelog","title":"Added","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"tests for fitprior2 in CI pipeline.","category":"page"},{"location":"changelog/#Fixed","page":"Changelog","title":"Fixed","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"error when constructing MCMCData with input distance matrix that has non-zero diagonal entries.\nedge case for `fitprior2' when Kmin = Kmax = 1 or Kmin = Kmax = N generates a warning and fallback to default repulsion/cohesion parameters. 
","category":"page"},{"location":"changelog/#[1.2.0]","page":"Changelog","title":"[1.2.0]","text":"","category":"section"},{"location":"changelog/#Added-3","page":"Changelog","title":"Added","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"added function fitprior2, alternative method to fit prior hyperparameters.","category":"page"},{"location":"changelog/#Changed","page":"Changelog","title":"Changed","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"pretty print for MCMCResult also shows whether the model includes repulsion.\nmoved basic_example.jl from the examples folder to docs. \nremoved examples (added to a separate branch). ","category":"page"},{"location":"changelog/#[1.1.0]","page":"Changelog","title":"[1.1.0]","text":"","category":"section"},{"location":"changelog/#Added-4","page":"Changelog","title":"Added","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"added function signature for sampleK to accept individual parameters.","category":"page"},{"location":"changelog/#[1.0.1]","page":"Changelog","title":"[1.0.1]","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"Documentation now updated to use Literate.jl to generate the example. ","category":"page"},{"location":"changelog/#[1.0.0]","page":"Changelog","title":"[1.0.0]","text":"","category":"section"},{"location":"changelog/#Added-5","page":"Changelog","title":"Added","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"examples from the main paper are now included in the examples folder. Data from the original paper is included in the data folder in the standard HDF5 format. \nthe examples now have more plots.\nfunctionality to sample the distances from the prior predictive distribution (see sampledist).\nfunctionality to sample K from its marginal prior predictive (see sampleK).\ngeneratemixture accepts a random number generator or a seed for the default RNG for reproducibility of results. ","category":"page"},{"location":"changelog/#Removed","page":"Changelog","title":"Removed","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"dependency on RCall and the various calls to the salso algorithm were removed. It is left to the user to make these calls if necessary. ","category":"page"},{"location":"changelog/#[0.2.2]","page":"Changelog","title":"[0.2.2]","text":"","category":"section"},{"location":"changelog/#Fixed-2","page":"Changelog","title":"Fixed","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"corrected time not printing in the correct format. \nmoved prettytime and prettynumber to utils.jl.","category":"page"},{"location":"changelog/#[0.2.1]","page":"Changelog","title":"[0.2.1]","text":"","category":"section"},{"location":"changelog/#Added-6","page":"Changelog","title":"Added","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"added tests for pretty printing.\nadded tests for custom loss function in getpointestimate. ","category":"page"},{"location":"changelog/#Fixed-3","page":"Changelog","title":"Fixed","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"verbose output in sampler missing newline. 
\ncustom loss function when computing a point estimate. ","category":"page"},{"location":"changelog/#Removed-2","page":"Changelog","title":"Removed","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"_infodist not in use, removed.","category":"page"},{"location":"changelog/#[0.2.0]","page":"Changelog","title":"[0.2.0]","text":"","category":"section"},{"location":"changelog/#Added-7","page":"Changelog","title":"Added","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"added log-posterior to result.\nadded log-likelihood and log-posterior plots to basic example.\npoint-estimation via maximum likelihood and maximum a posteriori.\ninfodist function to compute the information distance between clusterings.\nconvenience constructor for MCMCData.\nadd verbose option for runsampler and fitprior.\npretty printing for MCMCData, MCMCOptionsList, and PriorHyperparamsList.\nadded bibliographic references to documentation.\nadded changelog.","category":"page"},{"location":"changelog/#Breaking","page":"Changelog","title":"Breaking","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"MCMCOptions constructor changed account for change in point-estimate calculation. \nfitprior now only returns the hyperparameter list.\nremoved summarise for MCMCResult objects (use pretty printing instead).","category":"page"},{"location":"changelog/#Fixed-4","page":"Changelog","title":"Fixed","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"corrected computation of log-likelihood in result.\nfixed edge case error in fitprior when the found value of K is either 1 or N. This case arises only for edge values of Kmin and Kmax, since the elbow method will otherwise not choose 1 or N for K.","category":"page"},{"location":"changelog/#Changed-2","page":"Changelog","title":"Changed","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"added bounds checking for Kmin and Kmax in fitprior.\nadded input message for better debugging in fitprior.\nadded progress bar for fitprior if useR = false. \nadded input validation for generatemixture.\nadded input validation for binderloss and evaluateclustering.\nadded input validation for the constructors of MCMCData and MCMCOptionsList.\nseparated computation of log-likelihood and log-prior.\nremoved redundant loss function calculations when computing point-estimate.\nBinder loss calculation (binderloss) is now approximate (using randindex from Clustering.jl) but faster.\nrunsampler no longer automatically calculates a point estimate. \nsummarise has a separate signature for MCMC output and for point estimate summary, and now provides more measures for point estimates.\nminor optimisations using @inbounds, @simd, and @turbo.\nusing StaticArrays.jl and separate function for restricted Gibbs scan. \nspeedup via non-generic implementation of matrix and vector sums with LoopVectorization.jl.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"EditURL = \"../../basic_example_for_docs.jl\"","category":"page"},{"location":"generated/example/#Example","page":"Example","title":"Example","text":"","category":"section"},{"location":"generated/example/","page":"Example","title":"Example","text":"Here we demonstrate the usage of RedClust through an example. 
This example can be downloaded as a Julia script here.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"We begin by setting up the necessary includes.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"using RedClust, Plots, StatsPlots\nusing Random: seed!\nusing StatsBase: counts\nusing LinearAlgebra: triu, diagind\nnothing # hide","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"#jl ## We define some colors optimized for color-blind individuals based on [this article](https://www.nature.com/articles/nmeth.1618) by Bang Wong in Nature.\nfunction wong_colors(alpha = 1.0)\n colors = [\n RGB(0/255, 114/255, 178/255), # blue\n RGB(230/255, 159/255, 0/255), # orange\n RGB(0/255, 158/255, 115/255), # green\n RGB(204/255, 121/255, 167/255), # reddish purple\n RGB(86/255, 180/255, 233/255), # sky blue\n RGB(213/255, 94/255, 0/255), # vermillion\n RGB(240/255, 228/255, 66/255), # yellow\n ]\n @. RGBA{Float32}(red(colors), green(colors), blue(colors), alpha)\nend\n## Set plotting defaults\ndefault(fontfamily = \"Computer Modern\",\ncolor_palette = wong_colors(0.8),\ngridlinewidth = 1,\nframestyle = :box,\nlinecolor = :match,\nlinewidth = 0.5,\nguidefontsize = 14,\ntickfontsize = 12,\ncolorbar_tickfontsize = 12,\nlegend_font_pointsize = 12,\nplot_titlefontsize = 14\n)\n## seed the default RNG so that documentation remains stable\nseed!(44)","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"Next we define some convenience functions for plotting.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"# Heatmap of square matrix\nfunction sqmatrixplot(X::Matrix; kwargs...)\n M, N = size(X)\n heatmap(\n X,\n aspect_ratio=:equal,\n color=:Blues,\n xlim=(1,M), ylim=(1,N),\n yflip = true, xmirror=true;\n kwargs...)\nend\n\n# Histogram with integer bins\nfunction histogram_pmf(X::AbstractVector{<:Integer}; binwidth::Int=1, kwargs...)\n xmin = minimum(X)\n xmax = maximum(X)\n binedges = xmin:binwidth:xmax\n c = counts(X)./length(X)\n bincounts = map(sum, [c[(i-xmin+1):minimum([i-xmin+1+binwidth-1, xmax-xmin+1])] for i in binedges])\n bar(binedges, bincounts,\n linewidth = 0,\n legend = false,\n xticks = binedges; kwargs...)\nend\n\n# Combine two symmetric square matrices together into the upper and lower triangle of a square matrix\nfunction combine_sqmatrices(lower::Matrix, upper::Matrix, diagonal::String = \"lower\")\n if size(lower)[1] != size(lower)[2]\n throw(ArgumentError(\"Argument `lower` must be square, has dimensions $(size(lower)).\"))\n end\n if size(upper)[1] != size(upper)[2]\n throw(ArgumentError(\"Argument `upper` must be a square matrix, has dimensions $(size(upper)).\"))\n end\n if !all(size(lower) .== size(upper))\n throw(ArgumentError(\"Arguments `lower` and `upper` must have the same size.\"))\n end\n if !(eltype(lower) <: eltype(upper)) && !(eltype(upper) <: eltype(lower))\n throw(ArgumentError(\"Arguments must have compatible entries, got $(eltype(lower)) and $(eltype(upper)).\"))\n end\n if diagonal ∉ [\"lower\", \"upper\"]\n throw(ArgumentError(\"Keyword argument `diagonal` must be either \\\"lower\\\" or \\\"upper\\\".\"))\n end\n result = copy(lower)\n temp = trues(size(lower))\n upper_idx = triu(temp, 1)\n diagonal_idx = diagind(temp)\n result[upper_idx] .= upper[upper_idx]\n result[diagonal_idx] .= ((diagonal == \"lower\") ? 
lower : upper)[diagonal_idx]\n return result\nend\nnothing # hide","category":"page"},{"location":"generated/example/#Generating-Data","page":"Example","title":"Generating Data","text":"","category":"section"},{"location":"generated/example/","page":"Example","title":"Example","text":"We can generate some example data using the function generatemixture.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"begin\n K = 10 # Number of clusters\n N = 100 # Number of points\n data_σ = 0.25 # Variance of the normal kernel\n data_dim = 10 # Data dimension\n α = 10 # parameter for Dirichlet prior on cluster weights\n data = generatemixture(N, K;\n α = α, σ = data_σ, dim = data_dim)\n points, distmatrix, clusts, probs, oracle_coclustering = data\n nothing # hide\nend","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"Alternatively, the function example_dataset can be used to retrieve the datasets used in the original RedClust paper.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"begin\n data = example_dataset(1)\n points, distmatrix, clusts, probs, oracle_coclustering = data\n nothing # hide\nend","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"We can visualise the true adjacency matrix of the observations with respect to the true clusters that they were drawn from, as well as the oracle coclustering matrix. The latter is the matrix of co-clustering probabilities of the observations conditioned upon the data generation process. This takes into account full information about the cluster weights (and how they are generated), the mixture kernels for each cluster, and the location and scale parameters for these kernels.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"sqmatrixplot(combine_sqmatrices(oracle_coclustering, 1.0 * adjacencymatrix(clusts)), title = \"Adjacency vs Oracle Co-clustering Probabilities \\n(upper right and lower left triangle)\\n\")","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"We can visualise the matrix of pairwise distances between the observations.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"sqmatrixplot(distmatrix, title = \"Matrix of Pairwise Distances\")","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"We can also plot the histogram of distances, grouped by whether they are inter-cluster distances (ICD) or within-cluster distances (WCD).","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"begin\n empirical_intracluster = uppertriangle(distmatrix)[\n uppertriangle(adjacencymatrix(clusts)) .== 1]\n empirical_intercluster = uppertriangle(distmatrix)[\n uppertriangle(adjacencymatrix(clusts)) .== 0]\n histogram(empirical_intercluster,\n bins = minimum(empirical_intercluster):0.05:maximum(empirical_intercluster),\n label=\"ICD\", xlabel = \"Distance\", ylabel=\"Frequency\",\n title = \"Observed distribution of distances\")\n histogram!(empirical_intracluster,\n bins = minimum(empirical_intracluster):0.05:maximum(empirical_intracluster),\n label=\"WCD\")\nend","category":"page"},{"location":"generated/example/#Prior-Hyperparameters","page":"Example","title":"Prior 
Hyperparameters","text":"","category":"section"},{"location":"generated/example/","page":"Example","title":"Example","text":"RedClust includes the function fitprior to heuristically choose prior hyperparameters based on the data.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"params = fitprior(points, \"k-means\", false)","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"We can check how good the chosen prior hyperparameters are by comparing the empirical distribution of distances to the (marginal) prior predictive distribution.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"begin\n pred_intracluster = sampledist(params, \"intracluster\", 10000)\n pred_intercluster = sampledist(params, \"intercluster\", 10000)\n density(pred_intercluster,\n label=\"Simulated ICD\", xlabel = \"Distance\", ylabel = \"Density\",\n linewidth = 2, linestyle = :dash)\n density!(empirical_intercluster,\n label=\"Empirical ICD\",\n color = 1, linewidth = 2)\n density!(pred_intracluster,\n label=\"Simulated WCD\",\n linewidth = 2, linestyle = :dash, color = 2)\n density!(empirical_intracluster,\n label=\"Empirical WCD\",\n linewidth = 2, color = 2)\n title!(\"Distances: Prior Predictive vs Empirical Distribution\")\nend","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"We can also evaluate the prior hyperparameters by checking the marginal predictive distribution on K (the number of clusters).","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"begin\n Ksamples = sampleK(params, 10000, N)\n histogram_pmf(Ksamples, legend = false,\n xticks=collect(0:10:maximum(Ksamples)),\n xlabel = \"\\$K\\$\", ylabel = \"Probability\", title = \"Marginal Prior Predictive Distribution of \\$K\\$\")\nend","category":"page"},{"location":"generated/example/#Sampling","page":"Example","title":"Sampling","text":"","category":"section"},{"location":"generated/example/","page":"Example","title":"Example","text":"Running the MCMC is straightforward. We set up the MCMC options using MCMCOptionsList.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"options = MCMCOptionsList(numiters=50000)","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"We then set up the input data using MCMCData.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"data = MCMCData(points)","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"We can then run the sampler using runsampler.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"result = runsampler(data, options, params)","category":"page"},{"location":"generated/example/#MCMC-Result","page":"Example","title":"MCMC Result","text":"","category":"section"},{"location":"generated/example/","page":"Example","title":"Example","text":"The MCMC result contains several details about the MCMC, including acceptance rate, runtime, and convergence diagnostics. For full details see MCMCResult. In this example we have the ground truth cluster labels, so we can evaluate the result. 
For example, we can compare the posterior coclustering matrix to the oracle co-clustering probabilities.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"sqmatrixplot(combine_sqmatrices(result.posterior_coclustering, oracle_coclustering),\ntitle=\"Posterior vs Oracle Coclustering Probabilities\")","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"Plot the posterior distribution of K:","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"histogram_pmf(result.K,\nxlabel = \"\\$K\\$\", ylabel = \"Probability\", title = \"Posterior Distribution of \\$K\\$\")","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"Plot the posterior distribution of r:","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"begin\n histogram(result.r, normalize = :pdf,\n legend_font_pointsize=12,\n label=\"Empirical density\", ylabel = \"Density\", xlabel = \"\\$r\\$\",\n title = \"Posterior Distribution of \\$r\\$\")\n density!(result.r,\n color=:black, linewidth = 2, linestyle=:dash,\n label=\"Kernel estimate\", legend_font_pointsize=12)\nend","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"Plot the posterior distribution of p:","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"begin\n histogram(result.p, normalize = :pdf,\n ylabel = \"Density\", xlabel = \"\\$p\\$\",\n title = \"Posterior Distribution of \\$p\\$\",\n label = \"Empirical density\",\n legend_font_pointsize=12)\n density!(result.p, color=:black, linewidth = 2, linestyle=:dash,\n label = \"Kernel estimate\",\n legend_position = :topleft)\nend","category":"page"},{"location":"generated/example/#Convergence-statistics","page":"Example","title":"Convergence statistics","text":"","category":"section"},{"location":"generated/example/","page":"Example","title":"Example","text":"Plot the traceplot of the autocorrelation function of K:","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"plot(result.K_acf, legend = false, linewidth = 1,\nxlabel = \"Lag\", ylabel = \"Autocorrelation\",\ntitle = \"Autocorrelation Function of \\$K\\$\")","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"Plot the traceplot of the autocorrelation function of r:","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"plot(result.r_acf, legend = false, linewidth = 1,\nxlabel = \"Lag\", ylabel = \"Autocorrelation\",\ntitle = \"Autocorrelation Function of \\$r\\$\")","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"Plot the traceplot of the autocorrelation function of p:","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"plot(result.p_acf, legend = false, linewidth = 1,\nxlabel = \"Lag\", ylabel = \"Autocorrelation\",\ntitle = \"Autocorrelation Function of \\$p\\$\")","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"Check the trace plot of the log-likelihood to make sure the MCMC is moving well:","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"plot(result.loglik, legend = false, linewidth = 1,\nxlabel = \"Iteration\", ylabel = \"Log likelihood\",\ntitle = \"Log-Likelihood 
Trace Plot\")","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"Check the trace plot of the log-posterior:","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"plot(result.logposterior, legend = false, linewidth = 1,\nxlabel = \"Iteration\", ylabel = \"Log posterior\",\ntitle = \"Log-Posterior Trace Plot\")","category":"page"},{"location":"generated/example/#Point-Estimates","page":"Example","title":"Point Estimates","text":"","category":"section"},{"location":"generated/example/","page":"Example","title":"Example","text":"The function getpointestimate finds an optimal point estimate, based on some notion of optimality. For example, to get the maximum a posteriori estimate we can run the following.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"pointestimate, index = getpointestimate(result; method=\"MAP\")","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"We can compare the point-estimate to the true clustering through their adjacency matrices.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"sqmatrixplot(combine_sqmatrices(adjacencymatrix(pointestimate), adjacencymatrix(clusts)),\ntitle = \"True Clustering vs MAP Point Estimate\")","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"We can check the accuracy of the point estimate in terms of clustering metrics.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"summarise(pointestimate, clusts)","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"This page was generated using Literate.jl.","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"CurrentModule = RedClust","category":"page"},{"location":"#Introduction","page":"Introduction","title":"Introduction","text":"","category":"section"},{"location":"","page":"Introduction","title":"Introduction","text":"RedClust is a Julia package for Bayesian clustering of high-dimensional Euclidean data using pairwise dissimilarity information instead of the raw observations. 
It uses an MCMC sampler to generate posterior samples from the space of all possible clustering structures on the data, and it is an implementation of the algorithm described by Natarajan et al., 2023.","category":"page"},{"location":"#Installation","page":"Introduction","title":"Installation","text":"","category":"section"},{"location":"","page":"Introduction","title":"Introduction","text":"The package can be installed by typing ]add RedClust into the Julia REPL or by using Pkg:","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"using Pkg\nPkg.add(\"RedClust\")","category":"page"},{"location":"#Basic-example","page":"Introduction","title":"Basic example","text":"","category":"section"},{"location":"","page":"Introduction","title":"Introduction","text":"using RedClust\n# Generate data\npoints, distM, clusts, probs, oracle_coclustering = \n\tgeneratemixture(N, K; α = 10, σ = data_σ, dim = data_dim)\n# Let RedClust choose the best prior hyperparameters\nparams = fitprior(points, \"k-means\", false)\n# Set the MCMC options\noptions = MCMCOptionsList(numiters = 5000)\ndata = MCMCData(points)\n# Run the sampler\nresult = runsampler(data, options, params)\n# Get a point estimate \npointestimate, index = getpointestimate(result)\n# Summary of MCMC and point estimate\nsummarise(result)\nsummarise(pointestimate, clusts)","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"A more elaborate example can be found in the Example section. Examples from the paper and its supplementary material can be found in the 'examples' branch of the Github repository for RedClust.jl.","category":"page"},{"location":"#Model","page":"Introduction","title":"Model","text":"","category":"section"},{"location":"","page":"Introduction","title":"Introduction","text":"RedClust implements the model described in Natarajan et al. (2022). The key features are-","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"The use of a random partition model with an unknown number of clusters K which allows for posterior inference on K. That is, there is a prior distribution on the space of all possible clustering structures with any number of clusters from one to the number of observations. The number of clusters is an object of inference to be determined by MCMC sampling. \nThe pairwise dissimilarities between observations are assumed to be mutually independent given the clustering assignment; that is, the clustering likelihood is a composite or pseduo-likelihood. \nThe clustering likelihood is comprised of a cohesive part and a repulsive part. The repulsive component of the likelihood imposes a strong identifiability constraint on the clustering; clusters must not only comprise points that have similar small distances among themselves, but also similar distances to points in other clusters.","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"The prior on the clustering structure can be chosen according to application in question. The current version of RedClust implements a microclustering prior (Miller et al., 2015; Betancourt et al., 2022). 
This means that the partition is generated by drawing cluster sizes n_1 ldots n_K from a random distribution nu (conditional upon n_1 + ldots n_K = n where n is the number of observations), and cluster labels are given by a uniform random permutation of ","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"underbrace1 ldots 1_n_1 text times ldots underbraceK ldots K_n_K text times","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"RedClust implements a microclustering prior with nu a shifted negative binomial with random parameters r and p, which are sampled from a Gamma and Beta distribution respectively.","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"beginalign*\nr sim mathrmGamma(eta sigma)\np sim mathrmBeta(u v)\nendalign*","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"This results in the following partition prior: ","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"pi(rho_n mid r p) propto K p^n-K(1-p)^rKGamma(r)^-Kprod_k=1^K n_k Gamma(n_k+r-1)","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"where rho_n is the partition and K is the number of clusters in the partition. The partition likelihood is given by","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"beginalign*\nlambda_k oversetmathrmiidsim mathrmGamma(alpha beta) quad 1 leq k leq K\ntheta_kt oversetmathrmiidsim mathrmGamma(zeta gamma) quad 1 leq k t leq K\nendalign*","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"pi(D mid rho_n boldsymbollambda boldsymboltheta r p) = left prod_k=1^K prod_substacki j in C_k i neq j fracD_ij^delta_1 - 1lambda_k^delta_1Gamma(delta_1) exp(-lambda_k D_ij) right leftprod_substackt k = 1 t neq K prod_substacki in C_k j in C_t fracD_ij^delta_2-1theta_kt^delta_2Gamma(delta_2) exp(-theta_ktD_ij) right","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"where boldsymbol D is the matrix of pairwise dissimilarities between observations.","category":"page"},{"location":"#Point-estimation","page":"Introduction","title":"Point estimation","text":"","category":"section"},{"location":"","page":"Introduction","title":"Introduction","text":"A clustering point-estimate boldsymbol c^* can be determined in a number of ways. ","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"Choosing the sample with maximum likelihood or maximum posterior. This corresponds to an MLE or MAP estimate using the MCMC samples as an approximation to the likelihood and posterior.\nSearching for a clustering that minimises the posterior expectation of a loss function ell such as the Binder loss (Binder, 1978), the Variation of Information distance (Meilă, 2007; Wade and Ghahramani, 2018), or the Normalised Information Distance (Kraskov et al., 2005). That is, ","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"boldsymbol c^* = argmin_boldsymbol c mathbbE_boldsymbol cell(boldsymbol c boldsymbol c)","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"A naive method is to restrict the search space to those clusterings visited by the MCMC sampler. 
This method is implemented in RedClust in the function getpointestimate.\nA better method is the SALSO algorithm (Dahl et al., 2022), implemented in the R package salso, which heuristically searches the space of all possible clusterings. You can use the RCall package in Julia to make calls to the salso package in R if you have R and salso installed. ","category":"page"},{"location":"#Bibliography","page":"Introduction","title":"Bibliography","text":"","category":"section"},{"location":"","page":"Introduction","title":"Introduction","text":"","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"Betancourt, B., Zanella, G. and Steorts, R. C. (2022), ‘Random partition models for microclustering tasks’, Journal of the American Statistical Association 117(539), 1215–1227. DOI: 10.1080/01621459.2020.1841647.","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"Binder, D. A. (1978). Bayesian Cluster Analysis. Biometrika, 65(1), 31–38. DOI: 10.2307/2335273.","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"Dahl, D. B., Johnson, D. J. and M¨uller, P. (2022), ‘Search algorithms and loss functions for Bayesian clustering’, Journal of Computational and Graphical Statistics 0(0), 1–13. DOI: 10.1080/10618600.2022.2069779.","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"Kraskov, A., Stögbauer, H., Andrzejak, R. G. and P Grassberger, P. (2005), ‘Hierarchical clustering using mutual information’, Europhysics Letters (EPL) 70(2), 278-284, DOI: 10.1209/epl/i2004-10483-y.","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"Marina Meilă (2007), ‘Comparing clusterings—an information based distance’, Journal of Multivariate Analysis, 98(5), 873-895, DOI: 10.1016/j.jmva.2006.11.013.","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"Miller, J., Betancourt, B., Zaidi, A., Wallach, H. and Steorts, R. C. (2015), ‘Microclustering: When the cluster sizes grow sublinearly with the size of the data set’, arXiv 1512.00792.","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"Natarajan, A., De Iorio, M., Heinecke, A., Mayer, E. and Glenn, S. (2023). ‘Cohesion and Repulsion in Bayesian Distance Clustering’, Journal of the Americal Statistical Association. DOI: 10.1080/01621459.2023.2191821.","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"Wade, S. and Ghahramani, Z. (2018), ‘Bayesian Cluster Analysis: Point Estimation and Credible Balls (with Discussion)’, Bayesian Analysis 13(2), 559 – 626. 
DOI: 10.1214/17-BA1073.","category":"page"}] +[{"location":"api/#Public-API","page":"Public API","title":"Public API","text":"","category":"section"},{"location":"api/#Main-functions","page":"Public API","title":"Main functions","text":"","category":"section"},{"location":"api/","page":"Public API","title":"Public API","text":"Modules = [RedClust]\nPages = [\"prior.jl\", \"mcmc.jl\"]\nOrder = [:function]","category":"page"},{"location":"api/#RedClust.fitprior","page":"Public API","title":"RedClust.fitprior","text":"fitprior(data, algo, diss = false;\nKmin = 1,\nKmax = Int(floor(size(data)[end] / 2),\nverbose = true)\n\nDetermines the best prior hyperparameters from the data. A notional clustering is obtained using k-means or k-medoids and the elbow method, and the distances are split into within-cluster distances and inter-cluster distances based on the notional clustering. These distances are then used to fit the prior hyperparameters using MLE and empirical Bayes sampling.\n\nRequired Arguments\n\ndata::Union{Vector{Vector{Float64}}, Matrix{Float64}}: can either be a vector of (possibly multi-dimensional) observations, or a matrix with each column an observation, or a square matrix of pairwise dissimilarities.\nalgo::String: must be one of \"k-means\" or \"k-medoids\".\n\nOptional Arguments\n\ndiss::bool: if true, data will be assumed to be a pairwise dissimilarity matrix.\nKmin::Integer: minimum number of clusters to scan for the elbow method.\nKmax::Integer: maximum number of clusters to scan for the elbow method. If left unspecified, it is set to half the number of observations.\nverbose::Bool: if false, disables all info messages and progress bars.\n\nReturns\n\nAn object of type PriorHyperparamsList.\n\n\n\n\n\n","category":"function"},{"location":"api/#RedClust.fitprior2","page":"Public API","title":"RedClust.fitprior2","text":"fitprior2(data, algo, diss = false;\nKmin = 1,\nKmax = Int(floor(size(data)[end] / 2),\nverbose = true)\n\nDetermines the best prior hyperparameters from the data. Uses the same method as fitprior to obtain values for sigma, eta, u, and v, but derives values for the cluster-specific parameters by considering within-cluster and cross-cluster distances over clusterings with K clusters for all values of K in K_mathrmmin K_mathrmmax, weighted by the prior predictive distribution of K in that range.\n\nRequired Arguments\n\ndata::Union{Vector{Vector{Float64}}, Matrix{Float64}}: can either be a vector of (possibly multi-dimensional) observations, or a matrix with each column an observation, or a square matrix of pairwise dissimilarities.\nalgo::String: must be one of \"k-means\" or \"k-medoids\".\n\nOptional Arguments\n\ndiss::bool: if true, data will be assumed to be a pairwise dissimilarity matrix.\nKmin::Integer: minimum number of clusters to scan for the elbow method.\nKmax::Integer: maximum number of clusters to scan for the elbow method. 
If left unspecified, it is set to half the number of observations.\nverbose::Bool: if false, disables all info messages and progress bars.\n\nReturns\n\nAn object of type PriorHyperparamsList.\n\n\n\n\n\n","category":"function"},{"location":"api/#RedClust.sampleK-Tuple{Real, Real, Real, Real, Int64, Int64}","page":"Public API","title":"RedClust.sampleK","text":"sampleK(params::PriorHyperparamsList, numsamples::Int, n::Int)::Vector{Int}\nsampleK(η::Real, σ::Real, u::Real, v::Real, numsamples::Int, n::Int)\n\nReturns a vector of length numsamples containing samples of K (number of clusters) generated from its marginal prior predictive distribution inferred from params. The parameter n is the number of observations in the model.\n\n\n\n\n\n","category":"method"},{"location":"api/#RedClust.sampledist","page":"Public API","title":"RedClust.sampledist","text":"sampledist(params::PriorHyperparamsList, type::String, numsamples = 1)::Vector{Float64}\n\nGenerate a vector of samples of length numsamples from the prior predictive distribution on the distances, as encapsulated in params. type must be either \"intercluster\" or \"intracluster\".\n\n\n\n\n\n","category":"function"},{"location":"api/#RedClust.runsampler","page":"Public API","title":"RedClust.runsampler","text":"runsampler(data,\noptions = MCMCOptionsList(),\nparams = PriorHyperparamsList();\nverbose = true) -> MCMCResult\n\nRuns the MCMC sampler on the data.\n\nArguments\n\ndata::MCMCData: contains the distance matrix.\noptions::MCMCOptionsList: contains the number of iterations, burnin, etc.\nparams::PriorHyperparamsList: contains the prior hyperparameters for the model.\nverbose::Bool: if false, disables all info messages and progress bars.\n\nReturns\n\nA struct of type MCMCResult containing the MCMC samples, convergence diagnostics, and summary statistics.\n\nSee also\n\nMCMCData, MCMCOptionsList, fitprior, PriorHyperparamsList, MCMCResult.\n\n\n\n\n\n","category":"function"},{"location":"api/#Point-estimation","page":"Public API","title":"Point estimation","text":"","category":"section"},{"location":"api/","page":"Public API","title":"Public API","text":"Modules = [RedClust]\nPages = [\"pointestimate.jl\"]\nOrder = [:function]","category":"page"},{"location":"api/#RedClust.binderloss-Tuple{Vector{Int64}, Vector{Int64}}","page":"Public API","title":"RedClust.binderloss","text":"binderloss(a::ClustLabelVector, b::ClustLabelVector;\nnormalised = true) -> Float64\n\nComputes the Binder loss between two clusterings. If normalised = true then the result is equal to one minus the rand index between a and b.\n\n\n\n\n\n","category":"method"},{"location":"api/#RedClust.getpointestimate-Tuple{MCMCResult}","page":"Public API","title":"RedClust.getpointestimate","text":"getpointestimate(samples::MCMCResult; method = \"MAP\", loss = \"VI\")\n\nComputes a point estimate from a vector of samples of cluster allocations by searching for a minimiser of the posterior expectation of some loss function.\n\nArguments\n\nmethod::String: must be one of the following -\n\n- `\"MAP\"`: maximum a posteriori.\n- `\"MLE\"`: maximum likelihood estimation.\n- `\"MPEL\"`: minimum posterior expected loss.\n`\"MAP\"` and `\"MLE\"` search among the MCMC samples for the clustering with the maximum log posterior and log likelihood respectively. `\"MPEL\"` searches for a clustering that minimises the posterior expected loss of some loss function specified by the `loss` argument. 
The search space is the set of samples in `samples`.\n\nloss: Determines the loss function used for the \"MPEL\" method. Must be either be a string or a function. If specified as a string, must be one of \"binder\" (Binder loss), \"omARI\" (one minus the Adjusted Rand Index), \"VI\" (Variation of Information distance), or \"ID\" (Information Distance). If specified as a function, must have a method defined for (x::Vector{Int}, y::Vector{Int}) -> Real.\n\nReturns\n\nReturns a tuple (clust, i) where clust is a clustering allocation in samples and i is its sample index.\n\n\n\n\n\n","category":"method"},{"location":"api/#RedClust.infodist-Tuple{Vector{Int64}, Vector{Int64}}","page":"Public API","title":"RedClust.infodist","text":"infodist(a::ClustLabelVector, b::ClustLabelVector;\nnormalised = true) -> Float64\n\nComputes the information distance between two clusterings. The information distance is defined as\n\nd_mathrmID(a b) = max H(A) H(B) - I(A B)\n\nwhere A and B are the cluster membership probability functions for a and b respectively, H denotes the entropy of a distribution, and I denotes the mutual information between two distributions. The information distance has range 0 log N where N is the number of observations. If normalised = true, the result is scaled to the range 0 1.\n\n\n\n\n\n","category":"method"},{"location":"api/#Summary-and-clustering-evaluation","page":"Public API","title":"Summary and clustering evaluation","text":"","category":"section"},{"location":"api/","page":"Public API","title":"Public API","text":"Modules = [RedClust]\nPages = [\"summaries.jl\"]\nOrder = [:function]","category":"page"},{"location":"api/#RedClust.evaluateclustering-Tuple{Vector{Int64}, Vector{Int64}}","page":"Public API","title":"RedClust.evaluateclustering","text":"evaluateclustering(clusts::ClustLabelVector, truth::ClustLabelVector) Returns a named tuple of values quantifying the accuracy of the clustering assignments in clusts with respect to the ground truth clustering assignments in truth.\n\nReturn values\n\nnbloss: Normalised Binder loss (= 1 - rand index). Lower is better.\nari: Adjusted Rand index. Higher is better.\nvi: Variation of Information distance (in 0 log N). Lower is better.\nnvi: Normalised VI distance (in 0 1).\nid: Information Distance (in 0 log N). Lower is better.\nnid: Normalised Information Distance (in 0 1).\nnmi: Normalised Mutual Information (in 0 1). Higher is better.\n\n\n\n\n\n","category":"method"},{"location":"api/#RedClust.summarise-Tuple{IO, Vector{Int64}, Vector{Int64}}","page":"Public API","title":"RedClust.summarise","text":"summarise([io::IO], clusts::ClustLabelVector, truth::ClustLabelVector, printoutput = true) -> String Prints a summary of the clustering accuracy of clusts with respect to the ground truth in truth. 
The output is printed to the output stream io, which defaults to stdout if not provided.\n\n\n\n\n\n","category":"method"},{"location":"api/#Convenience-functions","page":"Public API","title":"Convenience functions","text":"","category":"section"},{"location":"api/","page":"Public API","title":"Public API","text":"Modules = [RedClust]\nPages = [\"utils.jl\"]\nOrder = [:function]","category":"page"},{"location":"api/#RedClust.adjacencymatrix-Tuple{Vector{Int64}}","page":"Public API","title":"RedClust.adjacencymatrix","text":"adjacencymatrix(clusts::ClustLabelVector) -> Matrix{Bool} Returns the n×n adjacency matrix corresponding to the given cluster label vector clusts, where n = length(clusts).\n\n\n\n\n\n","category":"method"},{"location":"api/#RedClust.generatemixture-Tuple{Integer, Integer}","page":"Public API","title":"RedClust.generatemixture","text":"generatemixture(N, K; [α, dim, radius, σ, rng])\n\nGenerates a multivariate Normal mixture, with kernel weights generated from a Dirichlet prior. The kernels are centred at the vertices of a dim-dimensional simplex with edge length radius.\n\nRequired Arguments\n\nN::Integer: number of observations to generate.\nK::Integer: number of mixture kernels.\n\nOptional Arguments\n\nα::Float64 = K: parameter for the Dirichlet prior.\ndim::Integer = K: dimension of the observations.\nradius::Float64 = 1: radius of the simplex whose vertices are the kernel means.\nσ::Float64 = 0.1: variance of each kernel.\nrng: a random number generator to use, or an integer to seed the default random number generator with. If not provided, the default RNG provided by the Random.jl package will be used with default seeding.\n\nReturns\n\nNamed tuple containing the following fields-\n\npoints::Vector{Vector{Float64}}: a vector of N observations.\ndistancematrix::Matrix{Float64}: an N×N matrix of pairwise Euclidean distances between the observations.\nclusts::ClustLabelVector: vector of N cluster assignments.\nprobs::Float64: vector of K cluster weights generated from the Dirichlet prior, used to generate the observations.\noracle_coclustering::Matrix{Float64}: N×N matrix of co-clustering probabilities, calculated assuming full knowledge of the cluster centres and cluster weights.\n\n\n\n\n\n","category":"method"},{"location":"api/#RedClust.makematrix-Tuple{AbstractVector{<:AbstractVector}}","page":"Public API","title":"RedClust.makematrix","text":"makematrix(x::AbstractVector{<:AbstractVector}) -> Matrix\n\nConvert a vector of vectors into a matrix, where each vector becomes a column in the matrix.\n\n\n\n\n\n","category":"method"},{"location":"api/#RedClust.sortlabels-Tuple{Vector{Int64}}","page":"Public API","title":"RedClust.sortlabels","text":"sortlabels(x::ClustLabelVector) -> ClustLabelVector Returns a cluster label vector y such that x and y have the same adjacency structure and labels in y occur in sorted ascending order.\n\n\n\n\n\n","category":"method"},{"location":"api/#Example-datasets","page":"Public API","title":"Example datasets","text":"","category":"section"},{"location":"api/","page":"Public API","title":"Public API","text":"Modules = [RedClust]\nPages = [\"example_data.jl\"]\nOrder = [:function]","category":"page"},{"location":"api/#RedClust.example_dataset-Tuple{Int64}","page":"Public API","title":"RedClust.example_dataset","text":"example_dataset(n::Int)\n\nReturns a named tuple containing the dataset from the n^mathrmth simulated example in the main paper. 
This dataset was generated using the following code in Julia v1.8.1:\n\n\tusing RedClust\n\tusing Random: seed!\n\tseed!(44)\n\tK = 10 # Number of clusters\n\tN = 100 # Number of points\n\tdata_σ = [0.25, 0.2, 0.18][n] # Variance of the normal kernel\n\tdata_dim = [10, 50, 10][n] # Data dimension\n\tdata = generatemixture(N, K; α = 10, σ = data_σ, dim = data_dim);\n\nNote however that the above code may produce different results on your computer because the random number generator in Julia is not meant for reproducible results across different computers, different versions of Julia, or different versions of the Random.jl package, even with appropriate seeding. Therefore the datasets have been included with this package, and it is recommended to access them via this function.\n\nSee also generatemixture.\n\n\n\n\n\n","category":"method"},{"location":"api/#RedClust.example_datasets-Tuple{}","page":"Public API","title":"RedClust.example_datasets","text":"Return a read-only handle to a HDF5 file that contains example datasets from the main paper. You must remember to close the file once you are done reading from it. This function is provided for reproducibility purposes only; it is recommended to read the datasets via the convenience function example_dataset.\n\n\n\n\n\n","category":"method"},{"location":"api/#Types","page":"Public API","title":"Types","text":"","category":"section"},{"location":"api/","page":"Public API","title":"Public API","text":"Modules = [RedClust]\nOrder = [:type]","category":"page"},{"location":"api/#RedClust.ClustLabelVector","page":"Public API","title":"RedClust.ClustLabelVector","text":"Alias for Vector{Int}\n\n\n\n\n\n","category":"type"},{"location":"api/#RedClust.MCMCData","page":"Public API","title":"RedClust.MCMCData","text":"MCMCData(points::AbstractVector{<:AbstractVector{<:Float64}}) MCMCData(dissimilaritymatrix::AbstractMatrix{Float64})\n\nContains the pairwise dissimilarities for the MCMC sampler.\n\n\n\n\n\n","category":"type"},{"location":"api/#RedClust.MCMCOptionsList","page":"Public API","title":"RedClust.MCMCOptionsList","text":"MCMCOptionsList(;\nnumiters = 5000,\nburnin = floor(0.2 * numiters),\nthin = 1,\nnumGibbs = 5,\nnumMH = 1)\n\nList of options for running the MCMC.\n\n# Constructor arguments\n- `numiters::Integer`: number of iterations to run.\n- `burnin::Integer`: number of iterations to discard as burn-in.\n- `thin::Integer`: will keep every `thin` samples.\n- `numGibbs:Integer`: number of intermediate Gibbs scans in the split-merge step.\n- `numMH:Integer`: number of split-merge steps per MCMC iteration.\n\n\n\n\n\n","category":"type"},{"location":"api/#RedClust.MCMCResult","page":"Public API","title":"RedClust.MCMCResult","text":"Struct containing MCMC samples.\n\nFields\n\noptions::MCMCOptionsList: options passed to the sampler.\nparams::PriorHyperparamsList: prior hyperparameters used by the sampler.\nclusts::Vector{ClustLabelVector}: contains the clustering allocations. clusts[i] is a vector containing the clustering allocation from the ith sample.\nposterior_coclustering::Matrix{Float64}: the posterior coclustering matrix.\nK::Vector{Int}: posterior samples of K, i.e., the number of clusters. 
K[i] is the number of clusters in clusts[i].\nr::Vector{Float64}: posterior samples of the parameter r.\np::Vector{Float64}: posterior samples of the parameter p.\nK_mean, r_mean, p_mean: posterior mean of K, r, and p respectively.\nK_variance, r_variance, p_variance: posterior variance of K, r, and p respectively.\nK_acf::Vector{Float64}, r_acf::Vector{Float64}, p_acf::Vector{Float64}: autocorrelation function for K, r, and p respectively.\nK_iac, r_iac, and p_iac: integrated autocorrelation coefficient for K, r, and p respectively.\nK_ess::Float64, r_ess::Float64, and p_ess::Float64: effective sample size for K, r, and p respectively.\nloglik::Vector{Float64}: log-likelihood for each sample.\nlogposterior::Vector{Float64}: a function proportional to the log-posterior for each sample, with constant of proportionality equal to the normalising constant of the partition prior.\nsplitmerge_splits: Boolean vector indicating the iterations when a split proposal was used in the split-merge step. Has length numMH * numiters (see MCMCOptionsList).\nsplitmerge_acceptance_rate: acceptance rate of the split-merge proposals.\nr_acceptances: Boolean vector indicating the iterations (including burnin and the thinned out iterations) where the Metropolis-Hastings proposal for r was accepted.\nr_acceptance_rate: Metropolis-Hastings acceptance rate for r.\nruntime: total runtime for all iterations.\nmean_iter_time: average time taken for each iteration.\n\n\n\n\n\n","category":"type"},{"location":"api/#RedClust.PriorHyperparamsList","page":"Public API","title":"RedClust.PriorHyperparamsList","text":"PriorHyperparamsList(; [kwargs])\n\nContains the prior hyperparameters for the model. It is recommended to set the values using the fitprior function. Do not set these values manually unless you know what you are doing.\n\nConstructor arguments\n\nδ1::Float64 = 1: the parameter delta_1.\nα::Float64 = 1: the shape parameter alpha in the prior for each lambda_k.\nβ::Float64 = 1: the rate parameter beta in the prior for each lambda_k.\nδ2::Float64 = 1: the parameter delta_2.\nζ::Float64 = 1: the shape parameter zeta in the prior for each theta_kt.\nγ::Float64 = 1: the rate parameter gamma in the prior for each theta_kt.\nη::Float64 = 1: the shape parameter eta in the prior for r.\nσ::Float64 = 1: the rate parameter sigma in the prior for r.\nu::Float64 = 1: the parameter u in the prior for p.\nv::Float64 = 1: the parameter v in the prior for p.\nproposalsd_r::Float64 = 1: standard deviation of the truncated Normal proposal when sampling r via a Metropolis-Hastings step.\nK_initial::Int = 1: initial number of clusters to start the MCMC.\nrepulsion::Bool = true: whether to use the repulsive component of the likelihood.\nmaxK::Int = 0: maximum number of clusters to allow. 
If set to 0 then no maximum is imposed.\n\nSee also\n\nfitprior\n\n\n\n\n\n","category":"type"},{"location":"funding/#Funding-Information","page":"Funding Information","title":"Funding Information","text":"","category":"section"},{"location":"funding/","page":"Funding Information","title":"Funding Information","text":"We are grateful to the following funding organisations for their financial support for the research that led to the main publication behind this package.","category":"page"},{"location":"funding/","page":"Funding Information","title":"Funding Information","text":"National University of Singapore, HSS Seed Fund R-607-000-449-646\nYale-NUS College, grant IG20-RA001\nSingapore Ministry of Education, Academic Research Fund Tier 2 Grant MOE-T2EP40121-0021","category":"page"},{"location":"cite/#Citation-Information","page":"Cite This Package","title":"Citation Information","text":"","category":"section"},{"location":"cite/","page":"Cite This Package","title":"Cite This Package","text":"If you want to use this package in your work, please cite it as:","category":"page"},{"location":"cite/","page":"Cite This Package","title":"Cite This Package","text":"Natarajan, A., De Iorio, M., Heinecke, A., Mayer, E. and Glenn, S. (2023). ‘Cohesion and Repulsion in Bayesian Distance Clustering’, Journal of the American Statistical Association, 119(546), pp. 1374–1384. DOI: 10.1080/01621459.2023.2191821.","category":"page"},{"location":"cite/","page":"Cite This Package","title":"Cite This Package","text":"For BibTeX users:","category":"page"},{"location":"cite/","page":"Cite This Package","title":"Cite This Package","text":"@article{NDI23,\n doi = {10.1080/01621459.2023.2191821},\n author = {Natarajan, Abhinav and De Iorio, Maria and Heinecke, Andreas and Mayer, Emanuel and Glenn, Simon},\n title = {Cohesion and Repulsion in Bayesian Distance Clustering},\n journal = {Journal of the American Statistical Association},\n volume = {119},\n number = {546},\n pages = {1374--1384},\n year = {2023}\n}","category":"page"},{"location":"changelog/#Changelog","page":"Changelog","title":"Changelog","text":"","category":"section"},{"location":"changelog/#[1.2.2]","page":"Changelog","title":"[1.2.2]","text":"","category":"section"},{"location":"changelog/#Added","page":"Changelog","title":"Added","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"tests for Julia 1.9 in CI matrix\nHDF5.jl v0.17 compatibility","category":"page"},{"location":"changelog/#[1.2.1]","page":"Changelog","title":"[1.2.1]","text":"","category":"section"},{"location":"changelog/#Added-2","page":"Changelog","title":"Added","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"tests for fitprior2 in CI pipeline.","category":"page"},{"location":"changelog/#Fixed","page":"Changelog","title":"Fixed","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"error when constructing MCMCData with input distance matrix that has non-zero diagonal entries.\nedge case for fitprior2 when Kmin = Kmax = 1 or Kmin = Kmax = N now generates a warning and falls back to default repulsion/cohesion parameters. 
","category":"page"},{"location":"changelog/#[1.2.0]","page":"Changelog","title":"[1.2.0]","text":"","category":"section"},{"location":"changelog/#Added-3","page":"Changelog","title":"Added","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"added function fitprior2, alternative method to fit prior hyperparameters.","category":"page"},{"location":"changelog/#Changed","page":"Changelog","title":"Changed","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"pretty print for MCMCResult also shows whether the model includes repulsion.\nmoved basic_example.jl from the examples folder to docs. \nremoved examples (added to a separate branch). ","category":"page"},{"location":"changelog/#[1.1.0]","page":"Changelog","title":"[1.1.0]","text":"","category":"section"},{"location":"changelog/#Added-4","page":"Changelog","title":"Added","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"added function signature for sampleK to accept individual parameters.","category":"page"},{"location":"changelog/#[1.0.1]","page":"Changelog","title":"[1.0.1]","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"Documentation now updated to use Literate.jl to generate the example. ","category":"page"},{"location":"changelog/#[1.0.0]","page":"Changelog","title":"[1.0.0]","text":"","category":"section"},{"location":"changelog/#Added-5","page":"Changelog","title":"Added","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"examples from the main paper are now included in the examples folder. Data from the original paper is included in the data folder in the standard HDF5 format. \nthe examples now have more plots.\nfunctionality to sample the distances from the prior predictive distribution (see sampledist).\nfunctionality to sample K from its marginal prior predictive (see sampleK).\ngeneratemixture accepts a random number generator or a seed for the default RNG for reproducibility of results. ","category":"page"},{"location":"changelog/#Removed","page":"Changelog","title":"Removed","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"dependency on RCall and the various calls to the salso algorithm were removed. It is left to the user to make these calls if necessary. ","category":"page"},{"location":"changelog/#[0.2.2]","page":"Changelog","title":"[0.2.2]","text":"","category":"section"},{"location":"changelog/#Fixed-2","page":"Changelog","title":"Fixed","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"corrected time not printing in the correct format. \nmoved prettytime and prettynumber to utils.jl.","category":"page"},{"location":"changelog/#[0.2.1]","page":"Changelog","title":"[0.2.1]","text":"","category":"section"},{"location":"changelog/#Added-6","page":"Changelog","title":"Added","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"added tests for pretty printing.\nadded tests for custom loss function in getpointestimate. ","category":"page"},{"location":"changelog/#Fixed-3","page":"Changelog","title":"Fixed","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"verbose output in sampler missing newline. 
\ncustom loss function when computing a point estimate. ","category":"page"},{"location":"changelog/#Removed-2","page":"Changelog","title":"Removed","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"_infodist not in use, removed.","category":"page"},{"location":"changelog/#[0.2.0]","page":"Changelog","title":"[0.2.0]","text":"","category":"section"},{"location":"changelog/#Added-7","page":"Changelog","title":"Added","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"added log-posterior to result.\nadded log-likelihood and log-posterior plots to basic example.\npoint-estimation via maximum likelihood and maximum a posteriori.\ninfodist function to compute the information distance between clusterings.\nconvenience constructor for MCMCData.\nadd verbose option for runsampler and fitprior.\npretty printing for MCMCData, MCMCOptionsList, and PriorHyperparamsList.\nadded bibliographic references to documentation.\nadded changelog.","category":"page"},{"location":"changelog/#Breaking","page":"Changelog","title":"Breaking","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"MCMCOptions constructor changed account for change in point-estimate calculation. \nfitprior now only returns the hyperparameter list.\nremoved summarise for MCMCResult objects (use pretty printing instead).","category":"page"},{"location":"changelog/#Fixed-4","page":"Changelog","title":"Fixed","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"corrected computation of log-likelihood in result.\nfixed edge case error in fitprior when the found value of K is either 1 or N. This case arises only for edge values of Kmin and Kmax, since the elbow method will otherwise not choose 1 or N for K.","category":"page"},{"location":"changelog/#Changed-2","page":"Changelog","title":"Changed","text":"","category":"section"},{"location":"changelog/","page":"Changelog","title":"Changelog","text":"added bounds checking for Kmin and Kmax in fitprior.\nadded input message for better debugging in fitprior.\nadded progress bar for fitprior if useR = false. \nadded input validation for generatemixture.\nadded input validation for binderloss and evaluateclustering.\nadded input validation for the constructors of MCMCData and MCMCOptionsList.\nseparated computation of log-likelihood and log-prior.\nremoved redundant loss function calculations when computing point-estimate.\nBinder loss calculation (binderloss) is now approximate (using randindex from Clustering.jl) but faster.\nrunsampler no longer automatically calculates a point estimate. \nsummarise has a separate signature for MCMC output and for point estimate summary, and now provides more measures for point estimates.\nminor optimisations using @inbounds, @simd, and @turbo.\nusing StaticArrays.jl and separate function for restricted Gibbs scan. \nspeedup via non-generic implementation of matrix and vector sums with LoopVectorization.jl.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"EditURL = \"../../basic_example_for_docs.jl\"","category":"page"},{"location":"generated/example/#Example","page":"Example","title":"Example","text":"","category":"section"},{"location":"generated/example/","page":"Example","title":"Example","text":"Here we demonstrate the usage of RedClust through an example. 
This example can be downloaded as a Julia script here.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"We begin by setting up the necessary includes.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"using RedClust, Plots, StatsPlots\nusing Random: seed!\nusing StatsBase: counts\nusing LinearAlgebra: triu, diagind\nnothing # hide","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"#jl ## We define some colors optimized for color-blind individuals based on [this article](https://www.nature.com/articles/nmeth.1618) by Bang Wong in Nature.\nfunction wong_colors(alpha = 1.0)\n colors = [\n RGB(0/255, 114/255, 178/255), # blue\n RGB(230/255, 159/255, 0/255), # orange\n RGB(0/255, 158/255, 115/255), # green\n RGB(204/255, 121/255, 167/255), # reddish purple\n RGB(86/255, 180/255, 233/255), # sky blue\n RGB(213/255, 94/255, 0/255), # vermillion\n RGB(240/255, 228/255, 66/255), # yellow\n ]\n @. RGBA{Float32}(red(colors), green(colors), blue(colors), alpha)\nend\n## Set plotting defaults\ndefault(fontfamily = \"Computer Modern\",\ncolor_palette = wong_colors(0.8),\ngridlinewidth = 1,\nframestyle = :box,\nlinecolor = :match,\nlinewidth = 0.5,\nguidefontsize = 14,\ntickfontsize = 12,\ncolorbar_tickfontsize = 12,\nlegend_font_pointsize = 12,\nplot_titlefontsize = 14\n)\n## seed the default RNG so that documentation remains stable\nseed!(44)","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"Next we define some convenience functions for plotting.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"# Heatmap of square matrix\nfunction sqmatrixplot(X::Matrix; kwargs...)\n M, N = size(X)\n heatmap(\n X,\n aspect_ratio=:equal,\n color=:Blues,\n xlim=(1,M), ylim=(1,N),\n yflip = true, xmirror=true;\n kwargs...)\nend\n\n# Histogram with integer bins\nfunction histogram_pmf(X::AbstractVector{<:Integer}; binwidth::Int=1, kwargs...)\n xmin = minimum(X)\n xmax = maximum(X)\n binedges = xmin:binwidth:xmax\n c = counts(X)./length(X)\n bincounts = map(sum, [c[(i-xmin+1):minimum([i-xmin+1+binwidth-1, xmax-xmin+1])] for i in binedges])\n bar(binedges, bincounts,\n linewidth = 0,\n legend = false,\n xticks = binedges; kwargs...)\nend\n\n# Combine two symmetric square matrices together into the upper and lower triangle of a square matrix\nfunction combine_sqmatrices(lower::Matrix, upper::Matrix, diagonal::String = \"lower\")\n if size(lower)[1] != size(lower)[2]\n throw(ArgumentError(\"Argument `lower` must be square, has dimensions $(size(lower)).\"))\n end\n if size(upper)[1] != size(upper)[2]\n throw(ArgumentError(\"Argument `upper` must be a square matrix, has dimensions $(size(upper)).\"))\n end\n if !all(size(lower) .== size(upper))\n throw(ArgumentError(\"Arguments `lower` and `upper` must have the same size.\"))\n end\n if !(eltype(lower) <: eltype(upper)) && !(eltype(upper) <: eltype(lower))\n throw(ArgumentError(\"Arguments must have compatible entries, got $(eltype(lower)) and $(eltype(upper)).\"))\n end\n if diagonal ∉ [\"lower\", \"upper\"]\n throw(ArgumentError(\"Keyword argument `diagonal` must be either \\\"lower\\\" or \\\"upper\\\".\"))\n end\n result = copy(lower)\n temp = trues(size(lower))\n upper_idx = triu(temp, 1)\n diagonal_idx = diagind(temp)\n result[upper_idx] .= upper[upper_idx]\n result[diagonal_idx] .= ((diagonal == \"lower\") ? 
lower : upper)[diagonal_idx]\n return result\nend\nnothing # hide","category":"page"},{"location":"generated/example/#Generating-Data","page":"Example","title":"Generating Data","text":"","category":"section"},{"location":"generated/example/","page":"Example","title":"Example","text":"We can generate some example data using the function generatemixture.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"begin\n K = 10 # Number of clusters\n N = 100 # Number of points\n data_σ = 0.25 # Variance of the normal kernel\n data_dim = 10 # Data dimension\n α = 10 # parameter for Dirichlet prior on cluster weights\n data = generatemixture(N, K;\n α = α, σ = data_σ, dim = data_dim)\n points, distmatrix, clusts, probs, oracle_coclustering = data\n nothing # hide\nend","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"Alternatively, the function example_dataset can be used to retrieve the datasets used in the original RedClust paper.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"begin\n data = example_dataset(1)\n points, distmatrix, clusts, probs, oracle_coclustering = data\n nothing # hide\nend","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"We can visualise the true adjacency matrix of the observations with respect to the true clusters that they were drawn from, as well as the oracle coclustering matrix. The latter is the matrix of co-clustering probabilities of the observations conditioned upon the data generation process. This takes into account full information about the cluster weights (and how they are generated), the mixture kernels for each cluster, and the location and scale parameters for these kernels.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"sqmatrixplot(combine_sqmatrices(oracle_coclustering, 1.0 * adjacencymatrix(clusts)), title = \"Adjacency vs Oracle Co-clustering Probabilities \\n(upper right and lower left triangle)\\n\")","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"We can visualise the matrix of pairwise distances between the observations.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"sqmatrixplot(distmatrix, title = \"Matrix of Pairwise Distances\")","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"We can also plot the histogram of distances, grouped by whether they are inter-cluster distances (ICD) or within-cluster distances (WCD).","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"begin\n empirical_intracluster = uppertriangle(distmatrix)[\n uppertriangle(adjacencymatrix(clusts)) .== 1]\n empirical_intercluster = uppertriangle(distmatrix)[\n uppertriangle(adjacencymatrix(clusts)) .== 0]\n histogram(empirical_intercluster,\n bins = minimum(empirical_intercluster):0.05:maximum(empirical_intercluster),\n label=\"ICD\", xlabel = \"Distance\", ylabel=\"Frequency\",\n title = \"Observed distribution of distances\")\n histogram!(empirical_intracluster,\n bins = minimum(empirical_intracluster):0.05:maximum(empirical_intracluster),\n label=\"WCD\")\nend","category":"page"},{"location":"generated/example/#Prior-Hyperparameters","page":"Example","title":"Prior 
Hyperparameters","text":"","category":"section"},{"location":"generated/example/","page":"Example","title":"Example","text":"RedClust includes the function fitprior to heuristically choose prior hyperparameters based on the data.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"params = fitprior(points, \"k-means\", false)","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"We can check how good the chosen prior hyperparameters are by comparing the empirical distribution of distances to the (marginal) prior predictive distribution.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"begin\n pred_intracluster = sampledist(params, \"intracluster\", 10000)\n pred_intercluster = sampledist(params, \"intercluster\", 10000)\n density(pred_intercluster,\n label=\"Simulated ICD\", xlabel = \"Distance\", ylabel = \"Density\",\n linewidth = 2, linestyle = :dash)\n density!(empirical_intercluster,\n label=\"Empirical ICD\",\n color = 1, linewidth = 2)\n density!(pred_intracluster,\n label=\"Simulated WCD\",\n linewidth = 2, linestyle = :dash, color = 2)\n density!(empirical_intracluster,\n label=\"Empirical WCD\",\n linewidth = 2, color = 2)\n title!(\"Distances: Prior Predictive vs Empirical Distribution\")\nend","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"We can also evaluate the prior hyperparameters by checking the marginal predictive distribution on K (the number of clusters).","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"begin\n Ksamples = sampleK(params, 10000, N)\n histogram_pmf(Ksamples, legend = false,\n xticks=collect(0:10:maximum(Ksamples)),\n xlabel = \"\\$K\\$\", ylabel = \"Probability\", title = \"Marginal Prior Predictive Distribution of \\$K\\$\")\nend","category":"page"},{"location":"generated/example/#Sampling","page":"Example","title":"Sampling","text":"","category":"section"},{"location":"generated/example/","page":"Example","title":"Example","text":"Running the MCMC is straightforward. We set up the MCMC options using MCMCOptionsList.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"options = MCMCOptionsList(numiters=50000)","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"We then set up the input data using MCMCData.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"data = MCMCData(points)","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"We can then run the sampler using runsampler.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"result = runsampler(data, options, params)","category":"page"},{"location":"generated/example/#MCMC-Result","page":"Example","title":"MCMC Result","text":"","category":"section"},{"location":"generated/example/","page":"Example","title":"Example","text":"The MCMC result contains several details about the MCMC, including acceptance rate, runtime, and convergence diagnostics. For full details see MCMCResult. In this example we have the ground truth cluster labels, so we can evaluate the result. 
For example, we can compare the posterior coclustering matrix to the oracle co-clustering probabilities.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"sqmatrixplot(combine_sqmatrices(result.posterior_coclustering, oracle_coclustering),\ntitle=\"Posterior vs Oracle Coclustering Probabilities\")","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"Plot the posterior distribution of K:","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"histogram_pmf(result.K,\nxlabel = \"\\$K\\$\", ylabel = \"Probability\", title = \"Posterior Distribution of \\$K\\$\")","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"Plot the posterior distribution of r:","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"begin\n histogram(result.r, normalize = :pdf,\n legend_font_pointsize=12,\n label=\"Empirical density\", ylabel = \"Density\", xlabel = \"\\$r\\$\",\n title = \"Posterior Distribution of \\$r\\$\")\n density!(result.r,\n color=:black, linewidth = 2, linestyle=:dash,\n label=\"Kernel estimate\", legend_font_pointsize=12)\nend","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"Plot the posterior distribution of p:","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"begin\n histogram(result.p, normalize = :pdf,\n ylabel = \"Density\", xlabel = \"\\$p\\$\",\n title = \"Posterior Distribution of \\$p\\$\",\n label = \"Empirical density\",\n legend_font_pointsize=12)\n density!(result.p, color=:black, linewidth = 2, linestyle=:dash,\n label = \"Kernel estimate\",\n legend_position = :topleft)\nend","category":"page"},{"location":"generated/example/#Convergence-statistics","page":"Example","title":"Convergence statistics","text":"","category":"section"},{"location":"generated/example/","page":"Example","title":"Example","text":"Plot the traceplot of the autocorrelation function of K:","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"plot(result.K_acf, legend = false, linewidth = 1,\nxlabel = \"Lag\", ylabel = \"Autocorrelation\",\ntitle = \"Autocorrelation Function of \\$K\\$\")","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"Plot the traceplot of the autocorrelation function of r:","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"plot(result.r_acf, legend = false, linewidth = 1,\nxlabel = \"Lag\", ylabel = \"Autocorrelation\",\ntitle = \"Autocorrelation Function of \\$r\\$\")","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"Plot the traceplot of the autocorrelation function of p:","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"plot(result.p_acf, legend = false, linewidth = 1,\nxlabel = \"Lag\", ylabel = \"Autocorrelation\",\ntitle = \"Autocorrelation Function of \\$p\\$\")","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"Check the trace plot of the log-likelihood to make sure the MCMC is moving well:","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"plot(result.loglik, legend = false, linewidth = 1,\nxlabel = \"Iteration\", ylabel = \"Log likelihood\",\ntitle = \"Log-Likelihood 
Trace Plot\")","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"Check the trace plot of the log-posterior:","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"plot(result.logposterior, legend = false, linewidth = 1,\nxlabel = \"Iteration\", ylabel = \"Log posterior\",\ntitle = \"Log-Posterior Trace Plot\")","category":"page"},{"location":"generated/example/#Point-Estimates","page":"Example","title":"Point Estimates","text":"","category":"section"},{"location":"generated/example/","page":"Example","title":"Example","text":"The function getpointestimate finds an optimal point estimate, based on some notion of optimality. For example, to get the maximum a posteriori estimate we can run the following.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"pointestimate, index = getpointestimate(result; method=\"MAP\")","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"We can compare the point-estimate to the true clustering through their adjacency matrices.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"sqmatrixplot(combine_sqmatrices(adjacencymatrix(pointestimate), adjacencymatrix(clusts)),\ntitle = \"True Clustering vs MAP Point Estimate\")","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"We can check the accuracy of the point estimate in terms of clustering metrics.","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"summarise(pointestimate, clusts)","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"","category":"page"},{"location":"generated/example/","page":"Example","title":"Example","text":"This page was generated using Literate.jl.","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"CurrentModule = RedClust","category":"page"},{"location":"#Introduction","page":"Introduction","title":"Introduction","text":"","category":"section"},{"location":"","page":"Introduction","title":"Introduction","text":"RedClust is a Julia package for Bayesian clustering of high-dimensional Euclidean data using pairwise dissimilarity information instead of the raw observations. 
It uses an MCMC sampler to generate posterior samples from the space of all possible clustering structures on the data, and it is an implementation of the algorithm described by Natarajan et al., 2023.","category":"page"},{"location":"#Installation","page":"Introduction","title":"Installation","text":"","category":"section"},{"location":"","page":"Introduction","title":"Introduction","text":"The package can be installed by typing ]add RedClust into the Julia REPL or by using Pkg:","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"using Pkg\nPkg.add(\"RedClust\")","category":"page"},{"location":"#Basic-example","page":"Introduction","title":"Basic example","text":"","category":"section"},{"location":"","page":"Introduction","title":"Introduction","text":"using RedClust\n# Generate data\nN = 100 # number of observations\nK = 10 # number of clusters\ndata_σ = 0.25 # variance of the normal kernel\ndata_dim = 10 # data dimension\npoints, distM, clusts, probs, oracle_coclustering = \n\tgeneratemixture(N, K; α = 10, σ = data_σ, dim = data_dim)\n# Let RedClust choose the best prior hyperparameters\nparams = fitprior(points, \"k-means\", false)\n# Set the MCMC options\noptions = MCMCOptionsList(numiters = 5000)\ndata = MCMCData(points)\n# Run the sampler\nresult = runsampler(data, options, params)\n# Get a point estimate \npointestimate, index = getpointestimate(result)\n# Summary of MCMC and point estimate\nsummarise(result)\nsummarise(pointestimate, clusts)","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"A more elaborate example can be found in the Example section. Examples from the paper and its supplementary material can be found in the 'examples' branch of the GitHub repository for RedClust.jl.","category":"page"},{"location":"#Model","page":"Introduction","title":"Model","text":"","category":"section"},{"location":"","page":"Introduction","title":"Introduction","text":"RedClust implements the model described in Natarajan et al. (2023). The key features are:","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"The use of a random partition model with an unknown number of clusters K which allows for posterior inference on K. That is, there is a prior distribution on the space of all possible clustering structures with any number of clusters from one to the number of observations. The number of clusters is an object of inference to be determined by MCMC sampling. \nThe pairwise dissimilarities between observations are assumed to be mutually independent given the clustering assignment; that is, the clustering likelihood is a composite or pseudo-likelihood. \nThe clustering likelihood comprises a cohesive part and a repulsive part. The repulsive component of the likelihood imposes a strong identifiability constraint on the clustering; clusters must not only comprise points that have similar small distances among themselves, but also similar distances to points in other clusters.","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"The prior on the clustering structure can be chosen according to the application in question. The current version of RedClust implements a microclustering prior (Miller et al., 2015; Betancourt et al., 2022). 
This means that the partition is generated by drawing cluster sizes n_1 ldots n_K from a random distribution nu (conditional upon n_1 + ldots n_K = n where n is the number of observations), and cluster labels are given by a uniform random permutation of ","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"underbrace1 ldots 1_n_1 text times ldots underbraceK ldots K_n_K text times","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"RedClust implements a microclustering prior with nu a shifted negative binomial with random parameters r and p, which are sampled from a Gamma and Beta distribution respectively.","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"beginalign*\nr sim mathrmGamma(eta sigma)\np sim mathrmBeta(u v)\nendalign*","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"This results in the following partition prior: ","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"pi(rho_n mid r p) propto K p^n-K(1-p)^rKGamma(r)^-Kprod_k=1^K n_k Gamma(n_k+r-1)","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"where rho_n is the partition and K is the number of clusters in the partition. The partition likelihood is given by","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"beginalign*\nlambda_k oversetmathrmiidsim mathrmGamma(alpha beta) quad 1 leq k leq K\ntheta_kt oversetmathrmiidsim mathrmGamma(zeta gamma) quad 1 leq k t leq K\nendalign*","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"pi(D mid rho_n boldsymbollambda boldsymboltheta r p) = left prod_k=1^K prod_substacki j in C_k i neq j fracD_ij^delta_1 - 1lambda_k^delta_1Gamma(delta_1) exp(-lambda_k D_ij) right leftprod_substackt k = 1 t neq K prod_substacki in C_k j in C_t fracD_ij^delta_2-1theta_kt^delta_2Gamma(delta_2) exp(-theta_ktD_ij) right","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"where boldsymbol D is the matrix of pairwise dissimilarities between observations.","category":"page"},{"location":"#Point-estimation","page":"Introduction","title":"Point estimation","text":"","category":"section"},{"location":"","page":"Introduction","title":"Introduction","text":"A clustering point-estimate boldsymbol c^* can be determined in a number of ways. ","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"Choosing the sample with maximum likelihood or maximum posterior. This corresponds to an MLE or MAP estimate using the MCMC samples as an approximation to the likelihood and posterior.\nSearching for a clustering that minimises the posterior expectation of a loss function ell such as the Binder loss (Binder, 1978), the Variation of Information distance (Meilă, 2007; Wade and Ghahramani, 2018), or the Normalised Information Distance (Kraskov et al., 2005). That is, ","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"boldsymbol c^* = argmin_boldsymbol c mathbbE_boldsymbol cell(boldsymbol c boldsymbol c)","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"A naive method is to restrict the search space to those clusterings visited by the MCMC sampler. 
This method is implemented in RedClust in the function getpointestimate.\nA better method is the SALSO algorithm (Dahl et al., 2022), implemented in the R package salso, which heuristically searches the space of all possible clusterings. You can use the RCall package in Julia to make calls to the salso package in R if you have R and salso installed. ","category":"page"},{"location":"#Bibliography","page":"Introduction","title":"Bibliography","text":"","category":"section"},{"location":"","page":"Introduction","title":"Introduction","text":"","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"Betancourt, B., Zanella, G. and Steorts, R. C. (2022), ‘Random partition models for microclustering tasks’, Journal of the American Statistical Association 117(539), 1215–1227. DOI: 10.1080/01621459.2020.1841647.","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"Binder, D. A. (1978), ‘Bayesian Cluster Analysis’, Biometrika 65(1), 31–38. DOI: 10.2307/2335273.","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"Dahl, D. B., Johnson, D. J. and Müller, P. (2022), ‘Search algorithms and loss functions for Bayesian clustering’, Journal of Computational and Graphical Statistics 0(0), 1–13. DOI: 10.1080/10618600.2022.2069779.","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"Kraskov, A., Stögbauer, H., Andrzejak, R. G. and Grassberger, P. (2005), ‘Hierarchical clustering using mutual information’, Europhysics Letters (EPL) 70(2), 278–284. DOI: 10.1209/epl/i2004-10483-y.","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"Meilă, M. (2007), ‘Comparing clusterings—an information based distance’, Journal of Multivariate Analysis 98(5), 873–895. DOI: 10.1016/j.jmva.2006.11.013.","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"Miller, J., Betancourt, B., Zaidi, A., Wallach, H. and Steorts, R. C. (2015), ‘Microclustering: When the cluster sizes grow sublinearly with the size of the data set’, arXiv 1512.00792.","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"Natarajan, A., De Iorio, M., Heinecke, A., Mayer, E. and Glenn, S. (2023), ‘Cohesion and Repulsion in Bayesian Distance Clustering’, Journal of the American Statistical Association 119(546), 1374–1384. DOI: 10.1080/01621459.2023.2191821.","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"","category":"page"},{"location":"","page":"Introduction","title":"Introduction","text":"Wade, S. and Ghahramani, Z. (2018), ‘Bayesian Cluster Analysis: Point Estimation and Credible Balls (with Discussion)’, Bayesian Analysis 13(2), 559–626. DOI: 10.1214/17-BA1073.","category":"page"}] }
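
The Point estimation notes above mention that the SALSO point estimate can be obtained by calling the R package salso from Julia via RCall. A minimal sketch of such a call is given below; it is not part of RedClust, and it assumes that R, the salso R package, and RCall.jl are installed, that `result` is the MCMCResult returned by runsampler, and that `clusts` holds the ground-truth labels as in the example. The salso call signature used here (`loss = "VI"`) is an assumption about that R package's interface.

```julia
# Sketch only (not part of RedClust): obtaining a SALSO point estimate via RCall.
# Assumes R, the `salso` R package, and RCall.jl are installed, and that `result`
# is an MCMCResult from runsampler and `clusts` the true labels from the example.
using RCall, RedClust

# salso expects a draws-by-n matrix of cluster labels; makematrix stacks the
# sampled allocations as columns, so transpose to get one draw per row.
draws = collect(transpose(makematrix(result.clusts)))
@rput draws
# as.integer strips the attributes salso attaches to its estimate, so rcopy can
# return a plain Vector{Int}; the loss argument "VI" is assumed here.
salso_estimate = rcopy(Vector{Int}, R"as.integer(salso::salso(draws, loss = 'VI'))")

# Evaluate against the ground truth, as with the MAP estimate in the example.
summarise(salso_estimate, clusts)
```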