-
Notifications
You must be signed in to change notification settings - Fork 4
1 scDNA: CopyKit
The goal of CopyKit is to help you analyze single cell DNA sequencing datasets for copy number.
This tutorial presents an example workflow of CopyKit. Detailed information for methods used of each function can be found in CopyKit user guide.
-
Install the stable or development version from github using
devtools. Note that knn smoothing is only available in development version. -
Install from package archive files, download
v0.1.3here.
devtools::install_github("navinlabcode/copykit")
devtools::install_github("navinlabcode/copykit@devel")Check version and load libraries.
packageVersion("copykit")## [1] '0.1.2'
library(copykit)
library(kableExtra)Whenever possible, CopyKit uses the BiocParallel framework.
library(BiocParallel)
n_threads <- 8
register(MulticoreParam(workers = n_threads, progressbar = F), default = T)Confirm parameters:
bpparam()## class: MulticoreParam
## bpisup: FALSE; bpnworkers: 8; bptasks: 0; bpjobname: BPJOB
## bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
## bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
## bpexportglobals: TRUE; bpforceGC: TRUE
## bplogdir: NA
## bpresultdir: NA
## cluster type: FORK
Run CBS segmentation from bam files.
This part takes more than 10 mins depending on the sample size thus will be skipped in the workshop.
tumor <- runVarbin(
dir = "~/path/to/bam/files/",
genome = "hg38",
resolution = "220kb", # c("220kb", "55kb", "110kb", "195kb", "280kb", "500kb", "1Mb", "2.8Mb")
remove_Y = FALSE,
is_paired_end = FALSE,
seed = 17,
BPPARAM = bpparam()
)First we load in the example CopyKit object. This data set was generated from scDNA-seq of a human liver sample.
tumor <- readRDS("sample_obj.rds")Then take a look at the data structure of CopyKit object.
# overview CopyKit object
print(tumor) ## class: CopyKit
## dim: 11310 600
## metadata(3): genome resolution vst
## assays(6): bincounts ft ... ratios logr
## rownames(11310): 1 2 ... 11309 11310
## rowData names(3): gc_content abspos arm
## colnames(600): PMTC6LiverC103DL6L7S3_S1255_L004_R1_001
## PMTC6LiverC204DL1S7_S588_L002_R1_001 ...
## PMTC6LiverC271AL6L7S5_S1423_L004_R1_001
## PMTC6LiverC136AL6L7S3_S1288_L004_R1_001
## colData names(9): sample reads_assigned_bins ... reads_total
## percentage_duplicates
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowRanges has: 11310 ranges
## Phylo: Phylogenetic tree with 0 tips and nodes
## consensus dim: 0 0
# view first 5 rows of colData
kable(colData(tumor)[1:5,]) | sample | reads_assigned_bins | reads_unmapped | reads_duplicates | reads_multimapped | reads_unassigned | reads_ambiguous | reads_total | percentage_duplicates | |
|---|---|---|---|---|---|---|---|---|---|
| PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 | PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 | 506994 | 27749 | 67923 | 0 | 105388 | 115 | 708169 | 0.096 |
| PMTC6LiverC204DL1S7_S588_L002_R1_001 | PMTC6LiverC204DL1S7_S588_L002_R1_001 | 476881 | 36334 | 45716 | 0 | 98424 | 96 | 657451 | 0.070 |
| PMTC6LiverC258AL1S3_S258_L001_R1_001 | PMTC6LiverC258AL1S3_S258_L001_R1_001 | 549322 | 27843 | 50278 | 0 | 111227 | 107 | 738777 | 0.068 |
| PMTC6LiverC154AL1S2_S154_L001_R1_001 | PMTC6LiverC154AL1S2_S154_L001_R1_001 | 812987 | 36743 | 78314 | 0 | 155468 | 179 | 1083691 | 0.072 |
| PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 | PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 | 624401 | 35599 | 74604 | 0 | 131717 | 111 | 866432 | 0.086 |
# view available assays
assays(tumor) ## List of length 6
## names(6): bincounts ft smoothed_bincounts segment_ratios ratios logr
# view first 5 rows and columns of segment ratios matrix
kable(assays(tumor)$segment_ratios[1:5,1:5]) | PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 | PMTC6LiverC204DL1S7_S588_L002_R1_001 | PMTC6LiverC258AL1S3_S258_L001_R1_001 | PMTC6LiverC154AL1S2_S154_L001_R1_001 | PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 |
|---|---|---|---|---|
| 1 | 1 | 0.98 | 0.71 | 1 |
| 1 | 1 | 0.98 | 0.71 | 1 |
| 1 | 1 | 0.98 | 0.71 | 1 |
| 1 | 1 | 0.98 | 0.71 | 1 |
| 1 | 1 | 0.98 | 0.71 | 1 |
# Check dimensions of the segment ratios matrix:
# 11310 rows (genomic bins) by 600 columns (cells)
dim(assays(tumor)$segment_ratios) ## [1] 11310 600
We can check some basic QC metrics for our sample. Our object already contains reads_total and percentage_duplicates, which are populated in the runVarbin step. Additionally, runMetrics calculates overdispersion and breakpoint counts.
tumor <- runMetrics(tumor)## Calculating overdispersion.
## Counting breakpoints.
## Done.
plotMetrics(tumor, metric = c("reads_total", "percentage_duplicates",
"overdispersion", "breakpoint_count"))
First we want to detect euploid cells by calculating the sample-wise
coefficient of variation from the segment ratio means. The threshold can
be changed from the automatic detection to a custom threshold with the
argument resolution. By setting a threshold = 0.1,
findAneuploidCells() will mark as euploid all cells with a coefficient
of variation less or equal than 0.1.
tumor <- findAneuploidCells(tumor, seed = 17)## number of iterations= 15
## Copykit detected 244 that are possibly diploid cells using a resolution of: 0.067
## Added information to colData(CopyKit).
kable(colData(tumor)[1:5,])| sample | reads_assigned_bins | reads_unmapped | reads_duplicates | reads_multimapped | reads_unassigned | reads_ambiguous | reads_total | percentage_duplicates | overdispersion | breakpoint_count | is_aneuploid | find_normal_cv | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 | PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 | 506994 | 27749 | 67923 | 0 | 105388 | 115 | 708169 | 0.096 | 0.0044259 | 1 | FALSE | 0.02 |
| PMTC6LiverC204DL1S7_S588_L002_R1_001 | PMTC6LiverC204DL1S7_S588_L002_R1_001 | 476881 | 36334 | 45716 | 0 | 98424 | 96 | 657451 | 0.070 | 0.0025688 | 0 | FALSE | 0.00 |
| PMTC6LiverC258AL1S3_S258_L001_R1_001 | PMTC6LiverC258AL1S3_S258_L001_R1_001 | 549322 | 27843 | 50278 | 0 | 111227 | 107 | 738777 | 0.068 | 0.0009886 | 1 | FALSE | 0.06 |
| PMTC6LiverC154AL1S2_S154_L001_R1_001 | PMTC6LiverC154AL1S2_S154_L001_R1_001 | 812987 | 36743 | 78314 | 0 | 155468 | 179 | 1083691 | 0.072 | 0.0026183 | 76 | TRUE | 0.54 |
| PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 | PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 | 624401 | 35599 | 74604 | 0 | 131717 | 111 | 866432 | 0.086 | 0.0025095 | 0 | FALSE | 0.00 |
Then we detect low-quality samples by a k-nearest-neighbors filtering (default k=5). Cells in which the correlation value is lower than the defined threshold are classified as low-quality cells (default resolution=0.9).
tumor <- findOutliers(tumor, k = 5, resolution = 0.9)## Calculating correlation matrix.
## Marked 70 cells as outliers.
## Adding information to metadata. Access with colData(scCNA).
## Done.
kable(colData(tumor)[1:5,])| sample | reads_assigned_bins | reads_unmapped | reads_duplicates | reads_multimapped | reads_unassigned | reads_ambiguous | reads_total | percentage_duplicates | overdispersion | breakpoint_count | is_aneuploid | find_normal_cv | cell_corr_value | outlier | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 | PMTC6LiverC103DL6L7S3_S1255_L004_R1_001 | 506994 | 27749 | 67923 | 0 | 105388 | 115 | 708169 | 0.096 | 0.0044259 | 1 | FALSE | 0.02 | 1.000 | FALSE |
| PMTC6LiverC204DL1S7_S588_L002_R1_001 | PMTC6LiverC204DL1S7_S588_L002_R1_001 | 476881 | 36334 | 45716 | 0 | 98424 | 96 | 657451 | 0.070 | 0.0025688 | 0 | FALSE | 0.00 | 1.000 | FALSE |
| PMTC6LiverC258AL1S3_S258_L001_R1_001 | PMTC6LiverC258AL1S3_S258_L001_R1_001 | 549322 | 27843 | 50278 | 0 | 111227 | 107 | 738777 | 0.068 | 0.0009886 | 1 | FALSE | 0.06 | 0.678 | TRUE |
| PMTC6LiverC154AL1S2_S154_L001_R1_001 | PMTC6LiverC154AL1S2_S154_L001_R1_001 | 812987 | 36743 | 78314 | 0 | 155468 | 179 | 1083691 | 0.072 | 0.0026183 | 76 | TRUE | 0.54 | 0.972 | FALSE |
| PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 | PMTC6LiverC247AL4L5S3_S1015_L003_R1_001 | 624401 | 35599 | 74604 | 0 | 131717 | 111 | 866432 | 0.086 | 0.0025095 | 0 | FALSE | 0.00 | 1.000 | FALSE |
Visualize the QC results by a heatmap.
plotHeatmap(tumor,
order_cells = "hclust",
label = c('is_aneuploid','outlier'),
row_split = 'outlier',
n_threads = n_threads)
In the next step of the QC we remove unwanted cells by sub-setting the CopyKit object.
tumor <- tumor[,colData(tumor)$outlier == FALSE]
diploid <- tumor[,colData(tumor)$is_aneuploid == FALSE]
tumor <- tumor[,colData(tumor)$is_aneuploid == TRUE]
dim(diploid); dim(tumor)## [1] 11310 217
## [1] 11310 313
CopyKit objects can be easily merged back. This can also be used to merge 2 CopyKit objects.
re_merge <- cbind(diploid, tumor)
dim(re_merge)## [1] 11310 530
Optional For data with low read counts, to improve copy number calling, we adopt a data smoothing method called “KNN smoothing” by borrowing information from cells’ nearest neighbors.
tumor <-knnSmooth(tumor, k = 4)The dataset should now be ready to proceed with the analysis. ## Clustering We apply dimension reduction and embed the data into 2 dimensional space.
tumor <- runUmap(tumor, seed = 17)
kable(reducedDim(tumor)[1:5,])| PMTC6LiverC154AL1S2_S154_L001_R1_001 | 0.9889162 | 1.7724430 |
| PMTC6LiverC72AL3S1_S840_L003_R1_001 | -0.1651536 | 0.4005138 |
| PMTC6LiverC223AL1S3_S223_L001_R1_001 | 3.1984547 | 0.1422417 |
| PMTC6LiverC153AL1S6_S537_L002_R1_001 | 0.4976004 | 1.8813207 |
| PMTC6LiverC198DL1S7_S582_L002_R1_001 | 1.0966156 | 0.0016753 |
plotUmap(tumor)
By default, CopyKit uses
hdbscan
as clustering method. Alternative methods available in CopyKit include
‘leiden’ and ‘louvain’. To perform clustering, each algorithm requires
us to define k, the number of nearest neighbors to consider.
To help with choosing k, CopyKit provides the helper function
findSuggestedK() to bootstrap clustering over a range of k values, and
returns the value that maximizes the jaccard similarity.
tumor <- findSuggestedK(tumor, k_range = 2:6)## Calculating jaccard similarity for k range: 2 3 4 5 6
##
## Suggested k = 6 with median jaccard similarity of: 0.931
plotSuggestedK(tumor)
plotSuggestedK(tumor, geom = 'tile')
Finally we cluster the cells using hdbscan with preferred k.
tumor <- findClusters(tumor) ## Using suggested k_subclones = 6
## Finding clusters, using method: hdbscan
## Found 2 outliers cells (group 'c0')
## Found 4 subclones.
## Done.
Check the clustering result.
plotUmap(tumor, label = "subclones")
plotHeatmap(tumor,
order_cells = "hclust",
row_split = "subclones",
label = "subclones",
n_threads = n_threads)
It is possible that a subgroup of clustering outliers is identified. Outliers are added to subgroup c0 and should be removed.
tumor_c0 <- tumor[,colData(tumor)$subclones == 'c0']
tumor <- tumor[,colData(tumor)$subclones != 'c0']
colData(tumor)$subclones <- droplevels(colData(tumor)$subclones)Excluded data can be easily merged back. This can also be used to merge 2 CopyKit objects.
merged <- cbind(tumor, tumor_c0)CopyKit supports different methods of calculating integer copy number profiles.
- Computational ploidies: scquantum is a method that estimates ploidy, and uses the ploidy estimate to scale the data to absolute copy numbers given bincount data from single-cell copy number profiling.
tumor <- calcInteger(tumor,
method = 'scquantum',
assay = 'bincounts')
med_ploidy <- median(tumor@colData$ploidy)
plotMetrics(tumor, metric = 'ploidy') +
ggplot2::annotate("text", x = Inf, y = Inf, hjust = 1.1, vjust = 1.5,
label = paste0("Median ploidy = ", round(med_ploidy, 2)))
- Fixed: a fixed value of ploidy (generally determined using Flow Cytometry) will be used to scale all cells.
tumor <- calcInteger(tumor,
method = 'fixed',
ploidy_value = 4.5)Plot the integer copy number heatmap
plotHeatmap(tumor,
assay = 'integer',
order_cells = "hclust",
row_split = "subclones",
label = 'subclones',
n_threads = n_threads)
We can also check profile of any cell using the plotRatio(). Here shown
profile of a cell from c1.
plotRatio(tumor, sample_name = 'PMTC6LiverC281AL6L7S5_S1433_L004_R1_001')
To run a phylogenetic analysis of cells’ copy number profiles, use the
function runPhylo(). Available methods are Neighbor
Joining and Balanced Minimum
evolution. Trees can also be
visualized by plotPhylo()
tumor <- runPhylo(tumor,
assay = 'segment_ratios',
metric = 'manhattan',
n_threads = n_threads)
phylo(tumor)##
## Phylogenetic tree with 311 tips and 310 internal nodes.
##
## Tip labels:
## PMTC6LiverC154AL1S2_S154_L001_R1_001, PMTC6LiverC72AL3S1_S840_L003_R1_001, PMTC6LiverC223AL1S3_S223_L001_R1_001, PMTC6LiverC153AL1S6_S537_L002_R1_001, PMTC6LiverC198DL1S7_S582_L002_R1_001, PMTC6LiverC161AL4L5S1_S929_L003_R1_001, ...
##
## Rooted; includes branch lengths.
plotPhylo(tumor, label = 'subclones')
Single cell phylogeny is very prone to data noise. By taking the median of all cells from the same subclone, we can build subclonal consensus profiles and construct phylogeny from there. We can also root the consensus tree by a diploid profile.
tumor <- calcConsensus(tumor, assay = 'segment_ratios')
consensus(tumor)[1:5,]## c1 c2 c3 c4
## V1 0.69 0.73 0.68 0.69
## V2 0.69 0.73 0.68 0.69
## V3 0.69 0.73 0.68 0.69
## V4 0.69 0.73 0.68 0.69
## V5 0.69 0.73 0.68 0.69
tumor <- runConsensusPhylo(tumor, root = 'neutral')
consensusPhylo(tumor)##
## Phylogenetic tree with 4 tips and 3 internal nodes.
##
## Tip labels:
## c1, c2, c3, c4
##
## Rooted; includes branch lengths.
plotPhylo(tumor, label = 'subclones', consensus = TRUE)
To further understand the what copy number events contribute to the lineage, we can also plot consensus heatmaps annotated by oncogenes.
genes = c("CDKN2A",
"FGFR1",
"TP53",
"PTEN",
"MYC",
"CDKN1A",
"MDM2",
"AURKA",
"PIK3CA",
"CCND1",
"KRAS")
tumor <- calcConsensus(tumor, assay = 'integer')
plotHeatmap(tumor,
consensus = TRUE,
label = 'subclones',
genes = genes,
n_threads = n_threads)
Or more intuitively, check copy number states across of genes in different subclones.
plotGeneCopy(tumor,
genes = genes,
label = 'subclones',
dodge.width = 0.8)
Also note that all the plot functions except heatmap are based on
ggplot2, thus we can further customize them by adding themes, etc..
plotGeneCopy(tumor,
genes = genes,
label = 'subclones',
dodge.width = 0.8) +
ggplot2::ylim(c(0,4)) +
ggplot2::ggtitle("Copy number status across genes") +
ggplot2::theme_linedraw() +
ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90))
We can also output plots using pdf(file = ‘/path/to/file/filename.pdf’). We can control the dimensions using width and height arguments.
p1 <- plotUmap(tumor, label = "subclones")## Plotting Umap.
## Coloring by: subclones.
p2 <- plotHeatmap(tumor,
consensus = TRUE,
label = 'subclones',
genes = genes,
n_threads = n_threads)## order_cells argument is NULL. Samples are ordered according to
## colnames(CopyKit)
## Plotting Heatmap.
pdf(file = 'copykit_output_plots.pdf')
print(p1)
print(p2)
dev.off()## png
## 2
pdf(file = 'copykit_consensus_heatmap.pdf', height = 3)
print(p2)
dev.off()## png
## 2
sessionInfo()## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Red Hat Enterprise Linux
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.3.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] BiocParallel_1.28.3 kableExtra_1.3.4
## [3] copykit_0.1.2 DNAcopy_1.68.0
## [5] Rsubread_2.8.2 SingleCellExperiment_1.16.0
## [7] SummarizedExperiment_1.24.0 Biobase_2.54.0
## [9] GenomicRanges_1.46.1 GenomeInfoDb_1.30.1
## [11] IRanges_2.28.0 S4Vectors_0.32.4
## [13] BiocGenerics_0.40.0 MatrixGenerics_1.6.0
## [15] matrixStats_1.1.0 knitr_1.45
##
## loaded via a namespace (and not attached):
## [1] circlize_0.4.15 systemfonts_1.1.0 igraph_1.5.0.1
## [4] lazyeval_0.2.2 splines_4.1.2 ggplot2_3.5.1
## [7] amap_0.8-19 digest_0.6.34 foreach_1.5.2
## [10] yulab.utils_0.1.4 htmltools_0.5.7 magick_2.7.3
## [13] viridis_0.6.5 ggalluvial_0.12.5 fansi_1.0.6
## [16] magrittr_2.0.3 memoise_2.0.1 cluster_2.1.2
## [19] doParallel_1.0.17 mixtools_2.0.0 fastcluster_1.2.6
## [22] ComplexHeatmap_2.10.0 svglite_2.1.1 colorspace_2.1-0
## [25] rvest_1.0.4 xfun_0.42 dplyr_1.1.4
## [28] crayon_1.5.2 RCurl_1.98-1.14 jsonlite_1.8.8
## [31] survival_3.3-0 iterators_1.0.14 ape_5.7-1
## [34] glue_1.7.0 gtable_0.3.4 zlibbioc_1.40.0
## [37] XVector_0.34.0 webshot_0.5.2 GetoptLong_1.0.5
## [40] DelayedArray_0.20.0 kernlab_0.9-32 shape_1.4.6
## [43] prabclus_2.3-3 DEoptimR_1.1-3 scales_1.3.0
## [46] miniUI_0.1.1.1 Rcpp_1.0.12 viridisLite_0.4.2
## [49] xtable_1.8-4 clue_0.3-65 gridGraphics_0.5-1
## [52] tidytree_0.4.6 mclust_6.0.1 htmlwidgets_1.6.4
## [55] httr_1.4.7 FNN_1.1.4 RColorBrewer_1.1-3
## [58] fpc_2.2-11 modeltools_0.2-23 ellipsis_0.3.2
## [61] farver_2.1.1 pkgconfig_2.0.3 flexmix_2.3-19
## [64] sass_0.4.8 nnet_7.3-17 uwot_0.1.16
## [67] utf8_1.2.4 labeling_0.4.3 ggplotify_0.1.2
## [70] tidyselect_1.2.1 rlang_1.1.3 later_1.3.2
## [73] munsell_0.5.0 tools_4.1.2 cachem_1.0.8
## [76] cli_3.6.2 dbscan_1.1-12 generics_0.1.3
## [79] evaluate_0.23 stringr_1.5.1 fastmap_1.1.1
## [82] yaml_2.3.8 ggtree_3.2.1 fs_1.6.3
## [85] robustbase_0.99-2 purrr_1.0.2 nlme_3.1-155
## [88] mime_0.12 aplot_0.2.2 xml2_1.3.6
## [91] compiler_4.1.2 rstudioapi_0.15.0 beeswarm_0.4.0
## [94] plotly_4.10.4 png_0.1-8 scquantum_1.0.0
## [97] treeio_1.18.1 tibble_3.2.1 bslib_0.6.1
## [100] stringi_1.8.3 highr_0.10 forcats_1.0.0
## [103] lattice_0.22-6 bluster_1.4.0 Matrix_1.6-5
## [106] vctrs_0.6.5 pillar_1.9.0 lifecycle_1.0.4
## [109] jquerylib_0.1.4 GlobalOptions_0.1.2 BiocNeighbors_1.12.0
## [112] irlba_2.3.5.1 data.table_1.15.0 bitops_1.0-7
## [115] httpuv_1.6.14 patchwork_1.2.0 R6_2.5.1
## [118] promises_1.2.1 gridExtra_2.3 vipor_0.4.7
## [121] codetools_0.2-18 MASS_7.3-55 gtools_3.9.5
## [124] rjson_0.2.21 withr_3.0.0 GenomeInfoDbData_1.2.7
## [127] diptest_0.77-0 parallel_4.1.2 grid_4.1.2
## [130] ggfun_0.1.4 tidyr_1.3.1 class_7.3-20
## [133] rmarkdown_2.25 segmented_2.0-2 ggnewscale_0.4.10
## [136] shiny_1.8.0 ggbeeswarm_0.7.2