1 scDNA: CopyKit

CopyKit Workshop

Overview

The goal of CopyKit is to help you analyze single cell DNA sequencing datasets for copy number.

This tutorial presents an example workflow of CopyKit. Detailed information for methods used of each function can be found in CopyKit user guide.

Installation

Install the stable or development version from github using devtools. Note that knn smoothing is only available in development version.
Install from package archive files, download v0.1.3 here.

devtools::install_github("navinlabcode/copykit")
devtools::install_github("navinlabcode/copykit@devel")

Check version and load libraries.

packageVersion("copykit")

## [1] '0.1.2'

library(copykit)
library(kableExtra)

Parallelization

Whenever possible, CopyKit uses the BiocParallel framework.

library(BiocParallel)
n_threads <- 8
register(MulticoreParam(workers = n_threads, progressbar = F), default = T)

Confirm parameters:

bpparam()

## class: MulticoreParam
##   bpisup: FALSE; bpnworkers: 8; bptasks: 0; bpjobname: BPJOB
##   bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
##   bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
##   bpexportglobals: TRUE; bpforceGC: TRUE
##   bplogdir: NA
##   bpresultdir: NA
##   cluster type: FORK

Segmentation

Run CBS segmentation from bam files.

This part takes more than 10 mins depending on the sample size thus will be skipped in the workshop.

tumor <- runVarbin(
  dir = "~/path/to/bam/files/",
  genome = "hg38",
  resolution = "220kb", # c("220kb", "55kb", "110kb", "195kb", "280kb", "500kb", "1Mb", "2.8Mb")
  remove_Y = FALSE,
  is_paired_end = FALSE,
  seed = 17,
  BPPARAM = bpparam()
)

Example data set

First we load in the example CopyKit object. This data set was generated from scDNA-seq of a human liver sample.

tumor <- readRDS("sample_obj.rds")

Then take a look at the data structure of CopyKit object.

# overview CopyKit object
print(tumor)

## class: CopyKit 
## dim: 11310 600 
## metadata(3): genome resolution vst
## assays(6): bincounts ft ... ratios logr
## rownames(11310): 1 2 ... 11309 11310
## rowData names(3): gc_content abspos arm
## colnames(600): PMTC6LiverC103DL6L7S3_S1255_L004_R1_001
##   PMTC6LiverC204DL1S7_S588_L002_R1_001 ...
##   PMTC6LiverC271AL6L7S5_S1423_L004_R1_001
##   PMTC6LiverC136AL6L7S3_S1288_L004_R1_001
## colData names(9): sample reads_assigned_bins ... reads_total
##   percentage_duplicates
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowRanges has: 11310 ranges
## Phylo: Phylogenetic tree with 0 tips and  nodes
## consensus dim: 0 0

# view first 5 rows of colData  
kable(colData(tumor)[1:5,])

	sample	reads_assigned_bins	reads_unmapped	reads_duplicates	reads_unassigned	reads_ambiguous	reads_total	percentage_duplicates
PMTC6LiverC103DL6L7S3_S1255_L004_R1_001	PMTC6LiverC103DL6L7S3_S1255_L004_R1_001	506994	27749	67923	105388	115	708169	0.096
PMTC6LiverC204DL1S7_S588_L002_R1_001	PMTC6LiverC204DL1S7_S588_L002_R1_001	476881	36334	45716	98424	96	657451	0.070
PMTC6LiverC258AL1S3_S258_L001_R1_001	PMTC6LiverC258AL1S3_S258_L001_R1_001	549322	27843	50278	111227	107	738777	0.068
PMTC6LiverC154AL1S2_S154_L001_R1_001	PMTC6LiverC154AL1S2_S154_L001_R1_001	812987	36743	78314	155468	179	1083691	0.072
PMTC6LiverC247AL4L5S3_S1015_L003_R1_001	PMTC6LiverC247AL4L5S3_S1015_L003_R1_001	624401	35599	74604	131717	111	866432	0.086

# view available assays 
assays(tumor)

## List of length 6
## names(6): bincounts ft smoothed_bincounts segment_ratios ratios logr

# view first 5 rows and columns of segment ratios matrix
kable(assays(tumor)$segment_ratios[1:5,1:5])

PMTC6LiverC103DL6L7S3_S1255_L004_R1_001	PMTC6LiverC204DL1S7_S588_L002_R1_001	PMTC6LiverC258AL1S3_S258_L001_R1_001	PMTC6LiverC154AL1S2_S154_L001_R1_001	PMTC6LiverC247AL4L5S3_S1015_L003_R1_001
1	1	0.98	0.71	1
1	1	0.98	0.71	1
1	1	0.98	0.71	1
1	1	0.98	0.71	1
1	1	0.98	0.71	1

# Check dimensions of the segment ratios matrix: 
# 11310 rows (genomic bins) by 600 columns (cells)
dim(assays(tumor)$segment_ratios)

## [1] 11310   600

Quality Control

We can check some basic QC metrics for our sample. Our object already contains reads_total and percentage_duplicates, which are populated in the runVarbin step. Additionally, runMetrics calculates overdispersion and breakpoint counts.

tumor <- runMetrics(tumor)

## Calculating overdispersion.

## Counting breakpoints.

## Done.

plotMetrics(tumor, metric = c("reads_total", "percentage_duplicates",
                              "overdispersion", "breakpoint_count"))

First we want to detect euploid cells by calculating the sample-wise coefficient of variation from the segment ratio means. The threshold can be changed from the automatic detection to a custom threshold with the argument resolution. By setting a threshold = 0.1, findAneuploidCells() will mark as euploid all cells with a coefficient of variation less or equal than 0.1.

tumor <- findAneuploidCells(tumor, seed = 17)

## number of iterations= 15

## Copykit detected 244 that are possibly diploid cells using a resolution of: 0.067

## Added information to colData(CopyKit).

kable(colData(tumor)[1:5,])

	sample	reads_assigned_bins	reads_unmapped	reads_duplicates	reads_unassigned	reads_ambiguous	reads_total	percentage_duplicates	overdispersion	breakpoint_count	is_aneuploid	find_normal_cv
PMTC6LiverC103DL6L7S3_S1255_L004_R1_001	PMTC6LiverC103DL6L7S3_S1255_L004_R1_001	506994	27749	67923	105388	115	708169	0.096	0.0044259	1	FALSE	0.02
PMTC6LiverC204DL1S7_S588_L002_R1_001	PMTC6LiverC204DL1S7_S588_L002_R1_001	476881	36334	45716	98424	96	657451	0.070	0.0025688	0	FALSE	0.00
PMTC6LiverC258AL1S3_S258_L001_R1_001	PMTC6LiverC258AL1S3_S258_L001_R1_001	549322	27843	50278	111227	107	738777	0.068	0.0009886	1	FALSE	0.06
PMTC6LiverC154AL1S2_S154_L001_R1_001	PMTC6LiverC154AL1S2_S154_L001_R1_001	812987	36743	78314	155468	179	1083691	0.072	0.0026183	76	TRUE	0.54
PMTC6LiverC247AL4L5S3_S1015_L003_R1_001	PMTC6LiverC247AL4L5S3_S1015_L003_R1_001	624401	35599	74604	131717	111	866432	0.086	0.0025095	0	FALSE	0.00

Then we detect low-quality samples by a k-nearest-neighbors filtering (default k=5). Cells in which the correlation value is lower than the defined threshold are classified as low-quality cells (default resolution=0.9).

tumor <- findOutliers(tumor, k = 5, resolution = 0.9)

## Calculating correlation matrix.

## Marked 70 cells as outliers.

## Adding information to metadata. Access with colData(scCNA).

## Done.

kable(colData(tumor)[1:5,])

	sample	reads_assigned_bins	reads_unmapped	reads_duplicates	reads_unassigned	reads_ambiguous	reads_total	percentage_duplicates	overdispersion	breakpoint_count	is_aneuploid	find_normal_cv	cell_corr_value	outlier
PMTC6LiverC103DL6L7S3_S1255_L004_R1_001	PMTC6LiverC103DL6L7S3_S1255_L004_R1_001	506994	27749	67923	105388	115	708169	0.096	0.0044259	1	FALSE	0.02	1.000	FALSE
PMTC6LiverC204DL1S7_S588_L002_R1_001	PMTC6LiverC204DL1S7_S588_L002_R1_001	476881	36334	45716	98424	96	657451	0.070	0.0025688	0	FALSE	0.00	1.000	FALSE
PMTC6LiverC258AL1S3_S258_L001_R1_001	PMTC6LiverC258AL1S3_S258_L001_R1_001	549322	27843	50278	111227	107	738777	0.068	0.0009886	1	FALSE	0.06	0.678	TRUE
PMTC6LiverC154AL1S2_S154_L001_R1_001	PMTC6LiverC154AL1S2_S154_L001_R1_001	812987	36743	78314	155468	179	1083691	0.072	0.0026183	76	TRUE	0.54	0.972	FALSE
PMTC6LiverC247AL4L5S3_S1015_L003_R1_001	PMTC6LiverC247AL4L5S3_S1015_L003_R1_001	624401	35599	74604	131717	111	866432	0.086	0.0025095	0	FALSE	0.00	1.000	FALSE

Visualize the QC results by a heatmap.

plotHeatmap(tumor, 
            order_cells = "hclust",
            label = c('is_aneuploid','outlier'),
            row_split = 'outlier', 
            n_threads = n_threads)

In the next step of the QC we remove unwanted cells by sub-setting the CopyKit object.

tumor <- tumor[,colData(tumor)$outlier == FALSE]

diploid <- tumor[,colData(tumor)$is_aneuploid == FALSE]
tumor <- tumor[,colData(tumor)$is_aneuploid == TRUE]

dim(diploid); dim(tumor)

## [1] 11310   217

## [1] 11310   313

CopyKit objects can be easily merged back. This can also be used to merge 2 CopyKit objects.

re_merge <- cbind(diploid, tumor)
dim(re_merge)

## [1] 11310   530

Optional For data with low read counts, to improve copy number calling, we adopt a data smoothing method called “KNN smoothing” by borrowing information from cells’ nearest neighbors.

tumor <-knnSmooth(tumor, k = 4)

The dataset should now be ready to proceed with the analysis. ## Clustering We apply dimension reduction and embed the data into 2 dimensional space.

tumor <- runUmap(tumor, seed = 17)
kable(reducedDim(tumor)[1:5,])

PMTC6LiverC154AL1S2_S154_L001_R1_001	0.9889162	1.7724430
PMTC6LiverC72AL3S1_S840_L003_R1_001	-0.1651536	0.4005138
PMTC6LiverC223AL1S3_S223_L001_R1_001	3.1984547	0.1422417
PMTC6LiverC153AL1S6_S537_L002_R1_001	0.4976004	1.8813207
PMTC6LiverC198DL1S7_S582_L002_R1_001	1.0966156	0.0016753

plotUmap(tumor)

By default, CopyKit uses hdbscan as clustering method. Alternative methods available in CopyKit include ‘leiden’ and ‘louvain’. To perform clustering, each algorithm requires us to define k, the number of nearest neighbors to consider.

To help with choosing k, CopyKit provides the helper function findSuggestedK() to bootstrap clustering over a range of k values, and returns the value that maximizes the jaccard similarity.

tumor <- findSuggestedK(tumor, k_range = 2:6)

## Calculating jaccard similarity for k range: 2 3 4 5 6

## 

## Suggested k = 6 with median jaccard similarity of: 0.931

plotSuggestedK(tumor)

plotSuggestedK(tumor, geom = 'tile')

Finally we cluster the cells using hdbscan with preferred k.

tumor <- findClusters(tumor)

## Using suggested k_subclones = 6

## Finding clusters, using method: hdbscan

## Found 2 outliers cells (group 'c0')

## Found 4 subclones.

## Done.

Check the clustering result.

plotUmap(tumor,  label = "subclones")

plotHeatmap(tumor, 
            order_cells = "hclust",
            row_split = "subclones",
            label = "subclones", 
            n_threads = n_threads)

It is possible that a subgroup of clustering outliers is identified. Outliers are added to subgroup c0 and should be removed.

tumor_c0 <- tumor[,colData(tumor)$subclones == 'c0']
tumor <- tumor[,colData(tumor)$subclones != 'c0']
colData(tumor)$subclones <- droplevels(colData(tumor)$subclones)

Excluded data can be easily merged back. This can also be used to merge 2 CopyKit objects.

merged <- cbind(tumor, tumor_c0)

Ploidy and Integer estimation

CopyKit supports different methods of calculating integer copy number profiles.

Computational ploidies: scquantum is a method that estimates ploidy, and uses the ploidy estimate to scale the data to absolute copy numbers given bincount data from single-cell copy number profiling.

tumor <- calcInteger(tumor, 
                     method = 'scquantum', 
                     assay = 'bincounts')
med_ploidy <- median(tumor@colData$ploidy)

plotMetrics(tumor, metric = 'ploidy') +
  ggplot2::annotate("text", x = Inf, y = Inf, hjust = 1.1, vjust = 1.5,
           label = paste0("Median ploidy = ", round(med_ploidy, 2)))

Fixed: a fixed value of ploidy (generally determined using Flow Cytometry) will be used to scale all cells.

tumor <- calcInteger(tumor, 
                     method = 'fixed', 
                     ploidy_value = 4.5)

Plot the integer copy number heatmap

plotHeatmap(tumor, 
            assay = 'integer',
            order_cells = "hclust",
            row_split = "subclones",
            label = 'subclones', 
            n_threads = n_threads)

We can also check profile of any cell using the plotRatio(). Here shown profile of a cell from c1.

plotRatio(tumor, sample_name = 'PMTC6LiverC281AL6L7S5_S1433_L004_R1_001')

Phylogeny

To run a phylogenetic analysis of cells’ copy number profiles, use the function runPhylo(). Available methods are Neighbor Joining and Balanced Minimum evolution. Trees can also be visualized by plotPhylo()

tumor <- runPhylo(tumor,
                  assay = 'segment_ratios',
                  metric = 'manhattan',
                  n_threads = n_threads)
phylo(tumor)

## 
## Phylogenetic tree with 311 tips and 310 internal nodes.
## 
## Tip labels:
##   PMTC6LiverC154AL1S2_S154_L001_R1_001, PMTC6LiverC72AL3S1_S840_L003_R1_001, PMTC6LiverC223AL1S3_S223_L001_R1_001, PMTC6LiverC153AL1S6_S537_L002_R1_001, PMTC6LiverC198DL1S7_S582_L002_R1_001, PMTC6LiverC161AL4L5S1_S929_L003_R1_001, ...
## 
## Rooted; includes branch lengths.

plotPhylo(tumor, label = 'subclones')

Single cell phylogeny is very prone to data noise. By taking the median of all cells from the same subclone, we can build subclonal consensus profiles and construct phylogeny from there. We can also root the consensus tree by a diploid profile.

tumor <- calcConsensus(tumor, assay = 'segment_ratios')
consensus(tumor)[1:5,]

##      c1   c2   c3   c4
## V1 0.69 0.73 0.68 0.69
## V2 0.69 0.73 0.68 0.69
## V3 0.69 0.73 0.68 0.69
## V4 0.69 0.73 0.68 0.69
## V5 0.69 0.73 0.68 0.69

tumor <- runConsensusPhylo(tumor, root = 'neutral')
consensusPhylo(tumor)

## 
## Phylogenetic tree with 4 tips and 3 internal nodes.
## 
## Tip labels:
##   c1, c2, c3, c4
## 
## Rooted; includes branch lengths.

plotPhylo(tumor, label = 'subclones', consensus = TRUE)

Gene copy number

To further understand the what copy number events contribute to the lineage, we can also plot consensus heatmaps annotated by oncogenes.

genes = c("CDKN2A",
          "FGFR1",
          "TP53",
          "PTEN",
          "MYC",
          "CDKN1A",
          "MDM2",
          "AURKA",
          "PIK3CA",
          "CCND1",
          "KRAS")

tumor <- calcConsensus(tumor, assay = 'integer')
plotHeatmap(tumor, 
            consensus = TRUE,
            label = 'subclones',
            genes = genes,
            n_threads = n_threads)

Or more intuitively, check copy number states across of genes in different subclones.

plotGeneCopy(tumor, 
             genes = genes,
             label = 'subclones',
             dodge.width = 0.8)

Also note that all the plot functions except heatmap are based on ggplot2, thus we can further customize them by adding themes, etc..

plotGeneCopy(tumor, 
             genes = genes,
             label = 'subclones',
             dodge.width = 0.8) +
  ggplot2::ylim(c(0,4)) +
  ggplot2::ggtitle("Copy number status across genes") +
  ggplot2::theme_linedraw() + 
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90))

Saving plots from copykit

We can also output plots using pdf(file = ‘/path/to/file/filename.pdf’). We can control the dimensions using width and height arguments.

p1 <- plotUmap(tumor,  label = "subclones")

## Plotting Umap.

## Coloring by: subclones.

p2 <- plotHeatmap(tumor, 
            consensus = TRUE,
            label = 'subclones',
            genes = genes,
            n_threads = n_threads)

## order_cells argument is NULL. Samples are ordered according to
##               colnames(CopyKit)

## Plotting Heatmap.

pdf(file = 'copykit_output_plots.pdf')
print(p1)
print(p2)
dev.off()

## png 
##   2

pdf(file = 'copykit_consensus_heatmap.pdf', height = 3)
print(p2)
dev.off()

## png 
##   2

Session info

sessionInfo()

## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Red Hat Enterprise Linux
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.3.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] BiocParallel_1.28.3         kableExtra_1.3.4           
##  [3] copykit_0.1.2               DNAcopy_1.68.0             
##  [5] Rsubread_2.8.2              SingleCellExperiment_1.16.0
##  [7] SummarizedExperiment_1.24.0 Biobase_2.54.0             
##  [9] GenomicRanges_1.46.1        GenomeInfoDb_1.30.1        
## [11] IRanges_2.28.0              S4Vectors_0.32.4           
## [13] BiocGenerics_0.40.0         MatrixGenerics_1.6.0       
## [15] matrixStats_1.1.0           knitr_1.45                 
## 
## loaded via a namespace (and not attached):
##   [1] circlize_0.4.15        systemfonts_1.1.0      igraph_1.5.0.1        
##   [4] lazyeval_0.2.2         splines_4.1.2          ggplot2_3.5.1         
##   [7] amap_0.8-19            digest_0.6.34          foreach_1.5.2         
##  [10] yulab.utils_0.1.4      htmltools_0.5.7        magick_2.7.3          
##  [13] viridis_0.6.5          ggalluvial_0.12.5      fansi_1.0.6           
##  [16] magrittr_2.0.3         memoise_2.0.1          cluster_2.1.2         
##  [19] doParallel_1.0.17      mixtools_2.0.0         fastcluster_1.2.6     
##  [22] ComplexHeatmap_2.10.0  svglite_2.1.1          colorspace_2.1-0      
##  [25] rvest_1.0.4            xfun_0.42              dplyr_1.1.4           
##  [28] crayon_1.5.2           RCurl_1.98-1.14        jsonlite_1.8.8        
##  [31] survival_3.3-0         iterators_1.0.14       ape_5.7-1             
##  [34] glue_1.7.0             gtable_0.3.4           zlibbioc_1.40.0       
##  [37] XVector_0.34.0         webshot_0.5.2          GetoptLong_1.0.5      
##  [40] DelayedArray_0.20.0    kernlab_0.9-32         shape_1.4.6           
##  [43] prabclus_2.3-3         DEoptimR_1.1-3         scales_1.3.0          
##  [46] miniUI_0.1.1.1         Rcpp_1.0.12            viridisLite_0.4.2     
##  [49] xtable_1.8-4           clue_0.3-65            gridGraphics_0.5-1    
##  [52] tidytree_0.4.6         mclust_6.0.1           htmlwidgets_1.6.4     
##  [55] httr_1.4.7             FNN_1.1.4              RColorBrewer_1.1-3    
##  [58] fpc_2.2-11             modeltools_0.2-23      ellipsis_0.3.2        
##  [61] farver_2.1.1           pkgconfig_2.0.3        flexmix_2.3-19        
##  [64] sass_0.4.8             nnet_7.3-17            uwot_0.1.16           
##  [67] utf8_1.2.4             labeling_0.4.3         ggplotify_0.1.2       
##  [70] tidyselect_1.2.1       rlang_1.1.3            later_1.3.2           
##  [73] munsell_0.5.0          tools_4.1.2            cachem_1.0.8          
##  [76] cli_3.6.2              dbscan_1.1-12          generics_0.1.3        
##  [79] evaluate_0.23          stringr_1.5.1          fastmap_1.1.1         
##  [82] yaml_2.3.8             ggtree_3.2.1           fs_1.6.3              
##  [85] robustbase_0.99-2      purrr_1.0.2            nlme_3.1-155          
##  [88] mime_0.12              aplot_0.2.2            xml2_1.3.6            
##  [91] compiler_4.1.2         rstudioapi_0.15.0      beeswarm_0.4.0        
##  [94] plotly_4.10.4          png_0.1-8              scquantum_1.0.0       
##  [97] treeio_1.18.1          tibble_3.2.1           bslib_0.6.1           
## [100] stringi_1.8.3          highr_0.10             forcats_1.0.0         
## [103] lattice_0.22-6         bluster_1.4.0          Matrix_1.6-5          
## [106] vctrs_0.6.5            pillar_1.9.0           lifecycle_1.0.4       
## [109] jquerylib_0.1.4        GlobalOptions_0.1.2    BiocNeighbors_1.12.0  
## [112] irlba_2.3.5.1          data.table_1.15.0      bitops_1.0-7          
## [115] httpuv_1.6.14          patchwork_1.2.0        R6_2.5.1              
## [118] promises_1.2.1         gridExtra_2.3          vipor_0.4.7           
## [121] codetools_0.2-18       MASS_7.3-55            gtools_3.9.5          
## [124] rjson_0.2.21           withr_3.0.0            GenomeInfoDbData_1.2.7
## [127] diptest_0.77-0         parallel_4.1.2         grid_4.1.2            
## [130] ggfun_0.1.4            tidyr_1.3.1            class_7.3-20          
## [133] rmarkdown_2.25         segmented_2.0-2        ggnewscale_0.4.10     
## [136] shiny_1.8.0            ggbeeswarm_0.7.2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

1 scDNA: CopyKit

CopyKit Workshop

Overview

Installation

Parallelization

Segmentation

Example data set

Quality Control

Ploidy and Integer estimation

Phylogeny

Gene copy number

Saving plots from copykit

Session info

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally