Skip to content

1 scDNA: CopyKit

Ryan Mulqueen edited this page Jun 19, 2026 · 1 revision

CopyKit Workshop

Overview

The goal of CopyKit is to help you analyze single cell DNA sequencing datasets for copy number.

This tutorial presents an example workflow of CopyKit. Detailed information for methods used of each function can be found in CopyKit user guide.

Installation

  1. Install the stable or development version from github using devtools. Note that knn smoothing is only available in development version.

  2. Install from package archive files, download v0.1.3 here.

devtools::install_github("navinlabcode/copykit")
devtools::install_github("navinlabcode/copykit@devel")

Check version and load libraries.

packageVersion("copykit")
## [1] '0.1.2'
library(copykit)
library(kableExtra)

Parallelization

Whenever possible, CopyKit uses the BiocParallel framework.

library(BiocParallel)
n_threads <- 8
register(MulticoreParam(workers = n_threads, progressbar = F), default = T)

Confirm parameters:

bpparam()
## class: MulticoreParam
##   bpisup: FALSE; bpnworkers: 8; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
##   bpexportglobals: TRUE; bpforceGC: TRUE
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: FORK

Segmentation

Run CBS segmentation from bam files.

This part takes more than 10 mins depending on the sample size thus will be skipped in the workshop.

tumor <- runVarbin(
  dir = "~/path/to/bam/files/",
  genome = "hg38",
  resolution = "220kb", # c("220kb", "55kb", "110kb", "195kb", "280kb", "500kb", "1Mb", "2.8Mb")
  remove_Y = FALSE,
  is_paired_end = FALSE,
  seed = 17,
  BPPARAM = bpparam()
)

Example data set

First we load in the example CopyKit object. This data set was generated from scDNA-seq of a human liver sample.

tumor <- readRDS("sample_obj.rds")

Then take a look at the data structure of CopyKit object.

# overview CopyKit object
print(tumor)                          
## class: CopyKit 
## dim: 11310 600 
## metadata(3): genome resolution vst
## assays(6): bincounts ft ... ratios logr
## rownames(11310): 1 2 ... 11309 11310
## rowData names(3): gc_content abspos arm
## colnames(600): PMTC6LiverC103DL6L7S3_S1255_L004_R1_001
##   PMTC6LiverC204DL1S7_S588_L002_R1_001 ...
##   PMTC6LiverC271AL6L7S5_S1423_L004_R1_001
##   PMTC6LiverC136AL6L7S3_S1288_L004_R1_001
## colData names(9): sample reads_assigned_bins ... reads_total
##   percentage_duplicates
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowRanges has: 11310 ranges
## Phylo: Phylogenetic tree with 0 tips and  nodes
## consensus dim: 0 0
# view first 5 rows of colData  
kable(colData(tumor)[1:5,])                            
sample reads_assigned_bins reads_unmapped reads_duplicates reads_multimapped reads_unassigned reads_ambiguous reads_total percentage_duplicates
PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 506994 27749 67923 0 105388 115 708169 0.096
PMTC6LiverC204DL1S7_S588_L002_R1_001 PMTC6LiverC204DL1S7_S588_L002_R1_001 476881 36334 45716 0 98424 96 657451 0.070
PMTC6LiverC258AL1S3_S258_L001_R1_001 PMTC6LiverC258AL1S3_S258_L001_R1_001 549322 27843 50278 0 111227 107 738777 0.068
PMTC6LiverC154AL1S2_S154_L001_R1_001 PMTC6LiverC154AL1S2_S154_L001_R1_001 812987 36743 78314 0 155468 179 1083691 0.072
PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 624401 35599 74604 0 131717 111 866432 0.086
# view available assays 
assays(tumor)                        
## List of length 6
## names(6): bincounts ft smoothed_bincounts segment_ratios ratios logr
# view first 5 rows and columns of segment ratios matrix
kable(assays(tumor)$segment_ratios[1:5,1:5]) 
PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 PMTC6LiverC204DL1S7_S588_L002_R1_001 PMTC6LiverC258AL1S3_S258_L001_R1_001 PMTC6LiverC154AL1S2_S154_L001_R1_001 PMTC6LiverC247AL4L5S3_S1015_L003_R1_001
1 1 0.98 0.71 1
1 1 0.98 0.71 1
1 1 0.98 0.71 1
1 1 0.98 0.71 1
1 1 0.98 0.71 1
# Check dimensions of the segment ratios matrix: 
# 11310 rows (genomic bins) by 600 columns (cells)
dim(assays(tumor)$segment_ratios)     
## [1] 11310   600

Quality Control

We can check some basic QC metrics for our sample. Our object already contains reads_total and percentage_duplicates, which are populated in the runVarbin step. Additionally, runMetrics calculates overdispersion and breakpoint counts.

tumor <- runMetrics(tumor)
## Calculating overdispersion.

## Counting breakpoints.

## Done.
plotMetrics(tumor, metric = c("reads_total", "percentage_duplicates",
                              "overdispersion", "breakpoint_count"))
Image

First we want to detect euploid cells by calculating the sample-wise coefficient of variation from the segment ratio means. The threshold can be changed from the automatic detection to a custom threshold with the argument resolution. By setting a threshold = 0.1, findAneuploidCells() will mark as euploid all cells with a coefficient of variation less or equal than 0.1.

tumor <- findAneuploidCells(tumor, seed = 17)
## number of iterations= 15

## Copykit detected 244 that are possibly diploid cells using a resolution of: 0.067

## Added information to colData(CopyKit).
kable(colData(tumor)[1:5,])
sample reads_assigned_bins reads_unmapped reads_duplicates reads_multimapped reads_unassigned reads_ambiguous reads_total percentage_duplicates overdispersion breakpoint_count is_aneuploid find_normal_cv
PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 506994 27749 67923 0 105388 115 708169 0.096 0.0044259 1 FALSE 0.02
PMTC6LiverC204DL1S7_S588_L002_R1_001 PMTC6LiverC204DL1S7_S588_L002_R1_001 476881 36334 45716 0 98424 96 657451 0.070 0.0025688 0 FALSE 0.00
PMTC6LiverC258AL1S3_S258_L001_R1_001 PMTC6LiverC258AL1S3_S258_L001_R1_001 549322 27843 50278 0 111227 107 738777 0.068 0.0009886 1 FALSE 0.06
PMTC6LiverC154AL1S2_S154_L001_R1_001 PMTC6LiverC154AL1S2_S154_L001_R1_001 812987 36743 78314 0 155468 179 1083691 0.072 0.0026183 76 TRUE 0.54
PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 624401 35599 74604 0 131717 111 866432 0.086 0.0025095 0 FALSE 0.00

Then we detect low-quality samples by a k-nearest-neighbors filtering (default k=5). Cells in which the correlation value is lower than the defined threshold are classified as low-quality cells (default resolution=0.9).

tumor <- findOutliers(tumor, k = 5, resolution = 0.9)
## Calculating correlation matrix.

## Marked 70 cells as outliers.

## Adding information to metadata. Access with colData(scCNA).

## Done.
kable(colData(tumor)[1:5,])
sample reads_assigned_bins reads_unmapped reads_duplicates reads_multimapped reads_unassigned reads_ambiguous reads_total percentage_duplicates overdispersion breakpoint_count is_aneuploid find_normal_cv cell_corr_value outlier
PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 506994 27749 67923 0 105388 115 708169 0.096 0.0044259 1 FALSE 0.02 1.000 FALSE
PMTC6LiverC204DL1S7_S588_L002_R1_001 PMTC6LiverC204DL1S7_S588_L002_R1_001 476881 36334 45716 0 98424 96 657451 0.070 0.0025688 0 FALSE 0.00 1.000 FALSE
PMTC6LiverC258AL1S3_S258_L001_R1_001 PMTC6LiverC258AL1S3_S258_L001_R1_001 549322 27843 50278 0 111227 107 738777 0.068 0.0009886 1 FALSE 0.06 0.678 TRUE
PMTC6LiverC154AL1S2_S154_L001_R1_001 PMTC6LiverC154AL1S2_S154_L001_R1_001 812987 36743 78314 0 155468 179 1083691 0.072 0.0026183 76 TRUE 0.54 0.972 FALSE
PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 624401 35599 74604 0 131717 111 866432 0.086 0.0025095 0 FALSE 0.00 1.000 FALSE

Visualize the QC results by a heatmap.

plotHeatmap(tumor, 
            order_cells = "hclust",
            label = c('is_aneuploid','outlier'),
            row_split = 'outlier', 
            n_threads = n_threads)
Image

In the next step of the QC we remove unwanted cells by sub-setting the CopyKit object.

tumor <- tumor[,colData(tumor)$outlier == FALSE]

diploid <- tumor[,colData(tumor)$is_aneuploid == FALSE]
tumor <- tumor[,colData(tumor)$is_aneuploid == TRUE]

dim(diploid); dim(tumor)
## [1] 11310   217

## [1] 11310   313

CopyKit objects can be easily merged back. This can also be used to merge 2 CopyKit objects.

re_merge <- cbind(diploid, tumor)
dim(re_merge)
## [1] 11310   530

Optional For data with low read counts, to improve copy number calling, we adopt a data smoothing method called “KNN smoothing” by borrowing information from cells’ nearest neighbors.

tumor <-knnSmooth(tumor, k = 4)

The dataset should now be ready to proceed with the analysis. ## Clustering We apply dimension reduction and embed the data into 2 dimensional space.

tumor <- runUmap(tumor, seed = 17)
kable(reducedDim(tumor)[1:5,])
PMTC6LiverC154AL1S2_S154_L001_R1_001 0.9889162 1.7724430
PMTC6LiverC72AL3S1_S840_L003_R1_001 -0.1651536 0.4005138
PMTC6LiverC223AL1S3_S223_L001_R1_001 3.1984547 0.1422417
PMTC6LiverC153AL1S6_S537_L002_R1_001 0.4976004 1.8813207
PMTC6LiverC198DL1S7_S582_L002_R1_001 1.0966156 0.0016753
plotUmap(tumor)
Image

By default, CopyKit uses hdbscan as clustering method. Alternative methods available in CopyKit include ‘leiden’ and ‘louvain’. To perform clustering, each algorithm requires us to define k, the number of nearest neighbors to consider.

To help with choosing k, CopyKit provides the helper function findSuggestedK() to bootstrap clustering over a range of k values, and returns the value that maximizes the jaccard similarity.

tumor <- findSuggestedK(tumor, k_range = 2:6)
## Calculating jaccard similarity for k range: 2 3 4 5 6

## 

## Suggested k = 6 with median jaccard similarity of: 0.931
plotSuggestedK(tumor)
Image
plotSuggestedK(tumor, geom = 'tile')
Image

Finally we cluster the cells using hdbscan with preferred k.

tumor <- findClusters(tumor)  
## Using suggested k_subclones = 6

## Finding clusters, using method: hdbscan

## Found 2 outliers cells (group 'c0')

## Found 4 subclones.

## Done.

Check the clustering result.

plotUmap(tumor,  label = "subclones")
Image
plotHeatmap(tumor, 
            order_cells = "hclust",
            row_split = "subclones",
            label = "subclones", 
            n_threads = n_threads)
Image

It is possible that a subgroup of clustering outliers is identified. Outliers are added to subgroup c0 and should be removed.

tumor_c0 <- tumor[,colData(tumor)$subclones == 'c0']
tumor <- tumor[,colData(tumor)$subclones != 'c0']
colData(tumor)$subclones <- droplevels(colData(tumor)$subclones)

Excluded data can be easily merged back. This can also be used to merge 2 CopyKit objects.

merged <- cbind(tumor, tumor_c0)

Ploidy and Integer estimation

CopyKit supports different methods of calculating integer copy number profiles.

  1. Computational ploidies: scquantum is a method that estimates ploidy, and uses the ploidy estimate to scale the data to absolute copy numbers given bincount data from single-cell copy number profiling.
tumor <- calcInteger(tumor, 
                     method = 'scquantum', 
                     assay = 'bincounts')
med_ploidy <- median(tumor@colData$ploidy)

plotMetrics(tumor, metric = 'ploidy') +
  ggplot2::annotate("text", x = Inf, y = Inf, hjust = 1.1, vjust = 1.5,
           label = paste0("Median ploidy = ", round(med_ploidy, 2)))
Image
  1. Fixed: a fixed value of ploidy (generally determined using Flow Cytometry) will be used to scale all cells.
tumor <- calcInteger(tumor, 
                     method = 'fixed', 
                     ploidy_value = 4.5)

Plot the integer copy number heatmap

plotHeatmap(tumor, 
            assay = 'integer',
            order_cells = "hclust",
            row_split = "subclones",
            label = 'subclones', 
            n_threads = n_threads)
Image

We can also check profile of any cell using the plotRatio(). Here shown profile of a cell from c1.

plotRatio(tumor, sample_name = 'PMTC6LiverC281AL6L7S5_S1433_L004_R1_001')
Image

Phylogeny

To run a phylogenetic analysis of cells’ copy number profiles, use the function runPhylo(). Available methods are Neighbor Joining and Balanced Minimum evolution. Trees can also be visualized by plotPhylo()

tumor <- runPhylo(tumor,
                  assay = 'segment_ratios',
                  metric = 'manhattan',
                  n_threads = n_threads)
phylo(tumor)
## 
## Phylogenetic tree with 311 tips and 310 internal nodes.
## 
## Tip labels:
##   PMTC6LiverC154AL1S2_S154_L001_R1_001, PMTC6LiverC72AL3S1_S840_L003_R1_001, PMTC6LiverC223AL1S3_S223_L001_R1_001, PMTC6LiverC153AL1S6_S537_L002_R1_001, PMTC6LiverC198DL1S7_S582_L002_R1_001, PMTC6LiverC161AL4L5S1_S929_L003_R1_001, ...
## 
## Rooted; includes branch lengths.
plotPhylo(tumor, label = 'subclones')
Image

Single cell phylogeny is very prone to data noise. By taking the median of all cells from the same subclone, we can build subclonal consensus profiles and construct phylogeny from there. We can also root the consensus tree by a diploid profile.

tumor <- calcConsensus(tumor, assay = 'segment_ratios')
consensus(tumor)[1:5,]
##      c1   c2   c3   c4
## V1 0.69 0.73 0.68 0.69
## V2 0.69 0.73 0.68 0.69
## V3 0.69 0.73 0.68 0.69
## V4 0.69 0.73 0.68 0.69
## V5 0.69 0.73 0.68 0.69
tumor <- runConsensusPhylo(tumor, root = 'neutral')
consensusPhylo(tumor)
## 
## Phylogenetic tree with 4 tips and 3 internal nodes.
## 
## Tip labels:
##   c1, c2, c3, c4
## 
## Rooted; includes branch lengths.
plotPhylo(tumor, label = 'subclones', consensus = TRUE)
Image

Gene copy number

To further understand the what copy number events contribute to the lineage, we can also plot consensus heatmaps annotated by oncogenes.

genes = c("CDKN2A",
          "FGFR1",
          "TP53",
          "PTEN",
          "MYC",
          "CDKN1A",
          "MDM2",
          "AURKA",
          "PIK3CA",
          "CCND1",
          "KRAS")

tumor <- calcConsensus(tumor, assay = 'integer')
plotHeatmap(tumor, 
            consensus = TRUE,
            label = 'subclones',
            genes = genes,
            n_threads = n_threads)
Image

Or more intuitively, check copy number states across of genes in different subclones.

plotGeneCopy(tumor, 
             genes = genes,
             label = 'subclones',
             dodge.width = 0.8)
Image

Also note that all the plot functions except heatmap are based on ggplot2, thus we can further customize them by adding themes, etc..

plotGeneCopy(tumor, 
             genes = genes,
             label = 'subclones',
             dodge.width = 0.8) +
  ggplot2::ylim(c(0,4)) +
  ggplot2::ggtitle("Copy number status across genes") +
  ggplot2::theme_linedraw() + 
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90))
Image

Saving plots from copykit

We can also output plots using pdf(file = ‘/path/to/file/filename.pdf’). We can control the dimensions using width and height arguments.

p1 <- plotUmap(tumor,  label = "subclones")
## Plotting Umap.

## Coloring by: subclones.
p2 <- plotHeatmap(tumor, 
            consensus = TRUE,
            label = 'subclones',
            genes = genes,
            n_threads = n_threads)
## order_cells argument is NULL. Samples are ordered according to
##               colnames(CopyKit)

## Plotting Heatmap.
pdf(file = 'copykit_output_plots.pdf')
print(p1)
print(p2)
dev.off()
## png 
##   2
pdf(file = 'copykit_consensus_heatmap.pdf', height = 3)
print(p2)
dev.off()
## png 
##   2

Session info

sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Red Hat Enterprise Linux
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.3.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] BiocParallel_1.28.3         kableExtra_1.3.4           
##  [3] copykit_0.1.2               DNAcopy_1.68.0             
##  [5] Rsubread_2.8.2              SingleCellExperiment_1.16.0
##  [7] SummarizedExperiment_1.24.0 Biobase_2.54.0             
##  [9] GenomicRanges_1.46.1        GenomeInfoDb_1.30.1        
## [11] IRanges_2.28.0              S4Vectors_0.32.4           
## [13] BiocGenerics_0.40.0         MatrixGenerics_1.6.0       
## [15] matrixStats_1.1.0           knitr_1.45                 
## 
## loaded via a namespace (and not attached):
##   [1] circlize_0.4.15        systemfonts_1.1.0      igraph_1.5.0.1        
##   [4] lazyeval_0.2.2         splines_4.1.2          ggplot2_3.5.1         
##   [7] amap_0.8-19            digest_0.6.34          foreach_1.5.2         
##  [10] yulab.utils_0.1.4      htmltools_0.5.7        magick_2.7.3          
##  [13] viridis_0.6.5          ggalluvial_0.12.5      fansi_1.0.6           
##  [16] magrittr_2.0.3         memoise_2.0.1          cluster_2.1.2         
##  [19] doParallel_1.0.17      mixtools_2.0.0         fastcluster_1.2.6     
##  [22] ComplexHeatmap_2.10.0  svglite_2.1.1          colorspace_2.1-0      
##  [25] rvest_1.0.4            xfun_0.42              dplyr_1.1.4           
##  [28] crayon_1.5.2           RCurl_1.98-1.14        jsonlite_1.8.8        
##  [31] survival_3.3-0         iterators_1.0.14       ape_5.7-1             
##  [34] glue_1.7.0             gtable_0.3.4           zlibbioc_1.40.0       
##  [37] XVector_0.34.0         webshot_0.5.2          GetoptLong_1.0.5      
##  [40] DelayedArray_0.20.0    kernlab_0.9-32         shape_1.4.6           
##  [43] prabclus_2.3-3         DEoptimR_1.1-3         scales_1.3.0          
##  [46] miniUI_0.1.1.1         Rcpp_1.0.12            viridisLite_0.4.2     
##  [49] xtable_1.8-4           clue_0.3-65            gridGraphics_0.5-1    
##  [52] tidytree_0.4.6         mclust_6.0.1           htmlwidgets_1.6.4     
##  [55] httr_1.4.7             FNN_1.1.4              RColorBrewer_1.1-3    
##  [58] fpc_2.2-11             modeltools_0.2-23      ellipsis_0.3.2        
##  [61] farver_2.1.1           pkgconfig_2.0.3        flexmix_2.3-19        
##  [64] sass_0.4.8             nnet_7.3-17            uwot_0.1.16           
##  [67] utf8_1.2.4             labeling_0.4.3         ggplotify_0.1.2       
##  [70] tidyselect_1.2.1       rlang_1.1.3            later_1.3.2           
##  [73] munsell_0.5.0          tools_4.1.2            cachem_1.0.8          
##  [76] cli_3.6.2              dbscan_1.1-12          generics_0.1.3        
##  [79] evaluate_0.23          stringr_1.5.1          fastmap_1.1.1         
##  [82] yaml_2.3.8             ggtree_3.2.1           fs_1.6.3              
##  [85] robustbase_0.99-2      purrr_1.0.2            nlme_3.1-155          
##  [88] mime_0.12              aplot_0.2.2            xml2_1.3.6            
##  [91] compiler_4.1.2         rstudioapi_0.15.0      beeswarm_0.4.0        
##  [94] plotly_4.10.4          png_0.1-8              scquantum_1.0.0       
##  [97] treeio_1.18.1          tibble_3.2.1           bslib_0.6.1           
## [100] stringi_1.8.3          highr_0.10             forcats_1.0.0         
## [103] lattice_0.22-6         bluster_1.4.0          Matrix_1.6-5          
## [106] vctrs_0.6.5            pillar_1.9.0           lifecycle_1.0.4       
## [109] jquerylib_0.1.4        GlobalOptions_0.1.2    BiocNeighbors_1.12.0  
## [112] irlba_2.3.5.1          data.table_1.15.0      bitops_1.0-7          
## [115] httpuv_1.6.14          patchwork_1.2.0        R6_2.5.1              
## [118] promises_1.2.1         gridExtra_2.3          vipor_0.4.7           
## [121] codetools_0.2-18       MASS_7.3-55            gtools_3.9.5          
## [124] rjson_0.2.21           withr_3.0.0            GenomeInfoDbData_1.2.7
## [127] diptest_0.77-0         parallel_4.1.2         grid_4.1.2            
## [130] ggfun_0.1.4            tidyr_1.3.1            class_7.3-20          
## [133] rmarkdown_2.25         segmented_2.0-2        ggnewscale_0.4.10     
## [136] shiny_1.8.0            ggbeeswarm_0.7.2

Clone this wiki locally