Clonality (summary statistics)

clonality() Creates a tibble giving the total number of sequences, number of unique productive sequences, number of genomes, entropy, clonality, Gini coefficient, TCR/BCR convergence, and the frequency of the top productive sequences for any given sample.

Usage

clonality(study_table, rarefy = FALSE, iterations = 100, min_count = 1000)

Arguments

study_table

A tibble of antigen receptor sequencing data imported by the LymphoSeq2 function readImmunoSeq(). "junction_aa", "duplicate_count", and "duplicate_frequency" are required columns. Note that clonality is usually calculated from productive junction sequences. Therefore, it is not recommended to run this function using a productive sequence list aggregated by amino acids.

rarefy

A Boolean value.

TRUE: Rarefied diversity metrics are calculated by sampling each repertoire in the input table down to the size of the repertoire with the smallest number of sequences and calculating the diversity metrics on the sampled data. The process is repeated for the number of iterations specified by the user (100 by default), and the diversity metrics are averaged over the iterations.
FALSE (the default): Diversity metrics are calculated from the raw repertoire data for each sample.

iterations

Number of iterations over which the sampled clonality metrics are computed and averaged when rarefy = TRUE. Default 100.

min_count

The minimum depth to which each repertoire in the study must be sampled. Default 1000.

Value

Returns a tibble giving the total number of sequences, number of unique productive sequences, number of genomes, clonality, Gini coefficient, Simpson index, inverse Simpson index, and the frequency of the top productive sequence for each repertoire_id.

Details

Clonality is derived from the Shannon entropy of the frequencies of all productive sequences. The entropy is divided by the logarithm of the total number of unique productive sequences to give a normalized entropy. This normalized entropy value is then inverted (1 - normalized entropy) to produce the clonality metric.
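The calculation described above can be sketched in base R. This is a minimal illustration, not the LymphoSeq2 implementation: the helper name shannon_clonality() is hypothetical, the input is assumed to be a plain vector of per-sequence counts, and returning NA for a repertoire with fewer than two unique sequences is an assumption made here because the normalization is undefined in that case.

```r
# Hypothetical helper (not part of LymphoSeq2): clonality from a vector of counts.
shannon_clonality <- function(counts) {
  counts <- counts[counts > 0]              # ignore sequences with zero counts
  if (length(counts) < 2) {
    return(NA_real_)                        # assumption: log(1) = 0 makes the ratio undefined
  }
  p <- counts / sum(counts)                 # frequency of each unique sequence
  entropy <- -sum(p * log2(p))              # Shannon entropy
  normalized <- entropy / log2(length(p))   # divide by log of the number of unique sequences
  1 - normalized                            # invert to obtain clonality
}

even_repertoire   <- c(25, 25, 25, 25)      # all sequences equally frequent
skewed_repertoire <- c(97, 1, 1, 1)         # one dominant sequence

shannon_clonality(even_repertoire)          # 0: perfectly even repertoire
shannon_clonality(skewed_repertoire)        # larger value: more clonal repertoire
```

The base of the logarithm cancels in the ratio, so log2() here could be replaced by log() without changing the result.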

The Gini coefficient is an alternative measure of repertoire diversity and is derived from the Lorenz curve. The Lorenz curve is drawn such that the x-axis represents the cumulative percentage of unique sequences and the y-axis represents the cumulative percentage of reads. A line passing through the origin with a slope of 1 reflects equal frequencies of all clones. The Gini coefficient is the ratio of the area between the line of equality and the observed Lorenz curve to the total area under the line of equality. Both the Gini coefficient and clonality are reported on a scale from 0 to 1, where 0 indicates all sequences have the same frequency and 1 indicates the repertoire is dominated by a single sequence.
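The Lorenz-curve construction above can be sketched as follows. This is an illustrative base R version, not the routine used inside LymphoSeq2: the helper name gini_lorenz() is hypothetical, the input is assumed to be a vector of per-sequence counts, and the area under the curve is approximated with the trapezoid rule.

```r
# Hypothetical helper (not part of LymphoSeq2): Gini coefficient via the Lorenz curve.
gini_lorenz <- function(counts) {
  x <- sort(counts[counts > 0])             # order sequences from rarest to most abundant
  n <- length(x)
  lorenz_x <- c(0, seq_len(n) / n)          # cumulative share of unique sequences
  lorenz_y <- c(0, cumsum(x) / sum(x))      # cumulative share of reads
  # Trapezoid rule for the area under the Lorenz curve
  widths  <- diff(lorenz_x)
  heights <- (lorenz_y[-1] + lorenz_y[-length(lorenz_y)]) / 2
  area_lorenz <- sum(widths * heights)
  area_equality <- 0.5                      # area under the line with slope 1
  (area_equality - area_lorenz) / area_equality
}

gini_lorenz(c(25, 25, 25, 25))              # 0: Lorenz curve lies on the line of equality
gini_lorenz(c(97, 1, 1, 1))                 # 0.72: reads concentrated in one sequence
```

With this discrete construction, a repertoire of n sequences dominated by a single clone can reach at most (n - 1) / n, so the value approaches 1 only as the number of unique sequences grows.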

TCR/BCR convergence is defined as the average number of productive CDR3 nucleotide sequences that form the same productive CDR3 amino acid sequence.
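As a rough sketch of this definition, the base R snippet below counts the distinct CDR3 nucleotide sequences behind each CDR3 amino acid sequence and averages them. The helper name mean_convergence() is hypothetical, and the use of a "junction" vector for the nucleotide sequences is an assumption; only "junction_aa" is named in this page. The toy sequences are made up for illustration.

```r
# Hypothetical helper (not part of LymphoSeq2): average number of distinct
# nucleotide sequences encoding each amino acid sequence.
mean_convergence <- function(junction, junction_aa) {
  # Keep each nucleotide / amino acid pairing once
  pairs <- unique(data.frame(junction, junction_aa, stringsAsFactors = FALSE))
  nrow(pairs) / length(unique(pairs$junction_aa))
}

# Toy data: three nucleotide sequences encode "CA", one encodes "CS"
junction    <- c("TGTGCC", "TGCGCC", "TGTGCA", "TGTGCC", "TGTAGC")
junction_aa <- c("CA",     "CA",     "CA",     "CA",     "CS")

mean_convergence(junction, junction_aa)     # 2: four distinct pairs over two amino acid sequences
```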

Sequencing depth and the amount of input available can often confound diversity metrics. For example, a peripheral blood sample can appear to be more clonal than a tumor sample when it is not sequenced to adequate depth. To overcome this, all repertoires can be sampled down to the depth of the sample with the fewest sequences before the diversity metrics are calculated. Repeating this process multiple times and averaging the diversity metrics can give a more accurate representation of sample diversity and enable comparison of repertoire samples from different experiments and different tissues of origin.
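The down-sampling idea can be illustrated with a short base R sketch. It is not the procedure implemented by clonality(): the helper name rarefied_metric() is hypothetical, repertoires are assumed to be plain count vectors, sampling is done without replacement, the metric shown is simply the frequency of the top sequence, and the min_count argument is not modeled.

```r
# Hypothetical helper (not part of LymphoSeq2): average a metric over repeated
# down-sampling of every repertoire to the depth of the smallest one.
rarefied_metric <- function(count_list, metric, iterations = 100) {
  depth <- min(vapply(count_list, sum, numeric(1)))        # smallest repertoire size
  vapply(count_list, function(counts) {
    clone_ids <- rep(seq_along(counts), times = counts)    # one entry per sequenced read
    values <- replicate(iterations, {
      drawn <- clone_ids[sample.int(length(clone_ids), depth)]  # sample without replacement
      metric(tabulate(drawn, nbins = length(counts)))      # recount sequences and score
    })
    mean(values)                                           # average over iterations
  }, numeric(1))
}

# Illustrative metric: frequency of the most abundant sequence
top_frequency <- function(counts) max(counts) / sum(counts)

set.seed(1)
repertoires <- list(
  blood = c(500, 300, 200, rep(1, 1000)),   # deep sample with many rare sequences
  tumor = c(60, 30, 10)                     # shallow sample
)
rarefied <- rarefied_metric(repertoires, top_frequency, iterations = 50)
rarefied
```

Because the tumor repertoire is already at the minimum depth, every draw reproduces it exactly, while the blood repertoire is scored on random subsets of the same size.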

Examples

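# Locate the example TCRB data shipped with the package
file_path <- system.file("extdata", "TCRB_sequencing", package = "LymphoSeq2")
# Import the sequencing files into a study table
study_table <- LymphoSeq2::readImmunoSeq(path = file_path, threads = 1)
# Keep the top 100 sequences to keep the example small
study_table <- LymphoSeq2::topSeqs(study_table, top = 100)
# Summary statistics on the raw repertoire data
raw_clonality <- LymphoSeq2::clonality(study_table)
# Summary statistics averaged over 100 rarefied samples
sampled_clonality <- LymphoSeq2::clonality(study_table,
  rarefy = TRUE,
  iterations = 100,
  min_count = 100
)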
