Calling Segmented Data From SNP Assay
Kevin R. Coombes
17 March 2011
Contents
1 Executive Summary
1.1 Introduction
1.1.1 Aims/Objectives
1.2 Methods
1.2.1 Description of Data
1.2.2 Statistical Methods
1.3 Results
1.4 Conclusions
2 Details
3 Disambiguate
3.1 Again: Collapsing Segments Based on Consecutive Calls
3.2 Finalize
4 Appendix
1 Executive Summary
1.1 Introduction
This report describes the analysis of a data set from Lynn Barron, a member of the laboratory of
Lynne V. Abruzzo. This dataset was acquired using Illumina 610K SNP chips.
This is the tenth part of a series of related reports.
1.1.1 Aims/Objectives
The objective of this report is to use the merged segment data (as produced by Report 6) to make
meaningful calls about the copy number and LOH status of each segment.
10-disambiguate 2
1.2 Methods
1.2.1 Description of Data
The dataset contains measurements on 176 previously untreated patients with CLL. Extensive
clinical followup is available.
1.2.2 Statistical Methods
Raw data were processed in BeadStudio to yield genotype calls, log R ratios (LRR), and B allele
frequencies (BAF) for each SNP in each sample. Since the study does not include matched normal
DNA from the samples, the BeadStudio computations were performed relative to the pool of 120
HapMap samples run by Illumina.
In Report 2, we applied the circular binary segmentation (CBS) algorithm to the intensity (log
R ratio; LRR) data for each sample and each chromosome. CBS was first described by Olshen et
al. [Biostatistics 2004; 5:557–72]; we use the implementation of CBS from the R package DNAcopy.
In Report 3, we computed the odds ratio for LOH versus no LOH in windows of width 40 along
each chromosome. In Report 4, we applied the CBS algorithm to transformed B allele frequency
(BAF) values on each chromosome of each cell line sample. In all three of those reports, we saved
the segmentation results in per-sample files. In Report 6, we pooled the segment data from the
different algorithms for each sample. We also computed summary statistics along each resulting
segment, including the LRR mean and standard deviation, a summary of the genotypes for the
SNPs in the region, and the best fit for modeling the BAF as a mixture of multiple components. In
Reports 1 and 7, we fit a statistical model to put the log R ratio data back on a properly normalized
scale that could be interpreted consistently in terms of copy number.
In this report, we continue to assign a meaningful “call” or interpretation to each segment. The
main point here is to remove segments with “ambiguous” calls. Ambiguity arises only for segments
that are potentially homozygous but contain fewer than 100 SNP markers. If an ambiguous segment
is flanked by segments that (a) have the same call as each other and (b) have the same copy number
call as the ambiguous segment, then the three segments are merged.
1.3 Results
We generate a tab-separated-values file (smaller.tsv) and a binary R file (smaller.rda)
containing all updated segment calls for all samples and all chromosomes.
1.4 Conclusions
It would be nice to have some....
2 Details
We load the current segmentation and call data from the previous report.
> load("shrunk.rda")
3 Disambiguate
First, we count the number of segments assigned to each copy number level. Along the way, we
make a record of the current copy number assignments.
> ncopy <- as.numeric(substring(as.character(shrunk$Call), 2, 3))
> ncopy[shrunk$Call == "DoubleLoss"] <- 0
> table(ncopy)
ncopy
    0     1     2     3     4     5     6     7
 1537  5836 57590  2901   768   142    49     4
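As a small illustration of how the copy number is read off the call labels, the sketch below applies the same substring() logic to a handful of made-up labels (the vector toyCalls is hypothetical and not part of the data). Labels that do not start with N followed by two digits, such as NoCall, become NA, and table() omits NA values by default, so the counts in such a table need not add up to the number of segments.

```r
# Hypothetical call labels, for illustration only
toyCalls <- c("N02.BalHet", "N01.Homozygous", "DoubleLoss", "NoCall",
              "N03.UnbalHet")

# Characters 2-3 hold the copy number for labels of the form "Nxx.*";
# other labels yield NA (the coercion warning is suppressed here)
toyCopy <- suppressWarnings(as.numeric(substring(toyCalls, 2, 3)))

# DoubleLoss is treated as copy number 0, as in the report
toyCopy[toyCalls == "DoubleLoss"] <- 0

print(toyCopy)
print(table(toyCopy))                   # the NA from NoCall is dropped
print(table(toyCopy, useNA = "ifany"))  # the NA is shown explicitly
```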
We are now going to change the calls for some of the ambiguous segments. As noted above,
a segment was initially called “ambiguous” only if it was apparently homozygous (based on the
number of components in the B allele frequency plots) but contained fewer than 100 SNPs. By
merging adjacent segments (and “NoCall” segments with common flanking segments), some of these
regions now contain more than 100 SNPs.
> summary(shrunk[grep("Ambiguous", as.character(shrunk$Call)), c(1:2,
+ 4:6, 9:14, 25:27)])
     SamID           chrom          loc.start            loc.end
 CLZ.5  :  281   2      : 1406   Min.   :      274   Min.   :    28938
 CL081  :  275   6      : 1167   1st Qu.: 26884934   1st Qu.: 27351531
 CLZ.40 :  274   7      : 1149   Median : 62973655   Median : 63541486
 CL143  :  272   4      : 1125   Mean   : 72474032   Mean   : 72938245
 CLZ.48 :  269   1      : 1116   3rd Qu.:104514704   3rd Qu.:104773483
 CLZ.27 :  257   5      :  986   Max.   :246911720   Max.   :247185943
 (Other):15325   (Other):10004
    num.mark         seg.median            seg.mad              AA
 Min.   :  3.00   Min.   :-0.156145   Min.   :0.01796   Min.   : 0.00
 1st Qu.: 31.00   1st Qu.:-0.031230   1st Qu.:0.13590   1st Qu.: 9.00
 Median : 48.00   Median : 0.007823   Median :0.16495   Median :18.00
 Mean   : 50.35   Mean   : 0.014625   Mean   :0.17279   Mean   :19.07
 3rd Qu.: 68.00   3rd Qu.: 0.047583   3rd Qu.:0.20038   3rd Qu.:28.00
 Max.   :306.00   Max.   : 0.968868   Max.   :0.70293   Max.   :89.00
       AB               BB              NC                Mix               nBAFComp
 Min.   : 0.000   Min.   : 0.00   Min.   :  0.000   Min.   :  0.0000   Min.   :1.344
 1st Qu.: 0.000   1st Qu.:10.00   1st Qu.:  0.000   1st Qu.:  0.3180   1st Qu.:2.000
 Median : 1.000   Median :20.00   Median :  2.000   Median :  0.3607   Median :2.000
 Mean   : 1.173   Mean   :21.11   Mean   :  8.996   Mean   :  0.3640   Mean   :2.000
 3rd Qu.: 2.000   3rd Qu.:31.00   3rd Qu.: 10.000   3rd Qu.:  0.4659   3rd Qu.:2.000
 Max.   :11.000   Max.   :99.00   Max.   :262.000   Max.   :  0.4700   Max.   :2.667
                                                    NA's   :182.0000
     ABperc
 Min.   :  0.000
 1st Qu.:  0.000
 Median :  2.000
 Mean   :  3.208
 3rd Qu.:  4.545
 Max.   :100.000
 NA's   :117.000
Whenever an ambiguous region is flanked by segments that (a) have the same call as each other
and (b) have the same copy number call as the ambiguous region, we want to make the same call
for all three segments. The next block of code compares a central region with the regions on its
left and right.
> N <- nrow(shrunk)
> leftc <- c(NA, shrunk[1:(N - 1), "chrom"])
> centc <- shrunk[, "chrom"]
> ritec <- c(shrunk[2:N, "chrom"], NA)
> leftC <- c(NA, shrunk[1:(N - 1), "Call"])
> centC <- shrunk[, "Call"]
> riteC <- c(shrunk[2:N, "Call"], NA)
> leftn <- c(NA, ncopy[1:(N - 1)])
> centn <- ncopy
> riten <- c(ncopy[2:N], NA)
> one <- leftc == ritec
> two <- leftC == riteC
> three <- leftn == riten
> down1 <- leftc == centc
> down3 <- leftn == centn
> up1 <- ritec == centc
> up3 <- riten == centn
Now we only look at the ambiguous segments.
> ambi <- regexpr("Ambiguous", as.character(shrunk$Call)) > 0
> sum(ambi)
[1] 16953
We pick out three subsets of the ambiguous regions.
1. v1 defines segments where the flanking segments are on the same chromosome (one) and have
the same call (two) and have the same copy number as the central segment (up3 and down3).
2. v2 defines segments that do not satisfy the previous criterion but for which the flanking region
to the left is on the same chromosome (down1) and has the same copy number (down3) as the
ambiguous segment.
3. v3 defines segments that do not match either of the preceding conditions but for which the
flanking segment on the right has matching chromosome and copy number information.
Note that the checks for chromosome are to handle the special case of the first and last segments
on a chromosome.
> v1 <- ambi & one & two & down3 & up3
> v2 <- ambi & !(one & two & down3 & up3) & (down1 & down3)
> v3 <- ambi & !(one & two & down3 & up3) & !(down1 & down3) & (up1 &
+ up3)
Now we use these lists of segments to update the calls.
> mycall <- shrunk$Call
> w <- which(v1 & !is.na(v1))
> mycall[w] <- shrunk[w - 1, "Call"]
> w <- which(v2 & !is.na(v2))
> mycall[w] <- shrunk[w - 1, "Call"]
> w <- which(v3 & !is.na(v3))
> mycall[w] <- shrunk[w + 1, "Call"]
> sum(mycall != shrunk$Call)
[1] 14542
> sum(ambi & mycall != shrunk$Call)
[1] 14542
> shrunk$Call <- factor(mycall)
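To make the three rules concrete, here is a minimal sketch on a made-up set of eight segments from a single sample. The data frame toy and the helpers lag1, lead1, and same are hypothetical and exist only for this illustration. The sketch compares character vectors throughout, so the comparisons do not depend on how factor codes line up with level labels, and it treats a missing neighbor as a non-match, which may differ in edge cases from the NA handling in the code above.

```r
# Hypothetical segments, ordered along the genome within one sample
toy <- data.frame(
  chrom = c("1", "1", "1", "1", "1", "2", "2", "2"),
  Call  = c("N02.BalHet", "N02.Ambiguous", "N02.BalHet",
            "N01.Ambiguous", "N01.Homozygous",
            "N01.Ambiguous", "N03.UnbalHet", "N03.Ambiguous"),
  stringsAsFactors = FALSE
)
ncopy <- as.numeric(substring(toy$Call, 2, 3))

# Shift a vector so that position i holds its left or right neighbor
lag1  <- function(x) c(NA, x[-length(x)])
lead1 <- function(x) c(x[-1], NA)
# Equality test that treats a missing value as "no match"
same  <- function(a, b) !is.na(a) & !is.na(b) & a == b

ambi   <- grepl("Ambiguous", toy$Call)
leftN  <- same(lag1(ncopy), ncopy)    # left neighbor has the same copy number
riteN  <- same(lead1(ncopy), ncopy)   # right neighbor has the same copy number
leftOK <- same(lag1(toy$chrom), toy$chrom) & leftN
riteOK <- same(lead1(toy$chrom), toy$chrom) & riteN
flanks <- same(lag1(toy$chrom), lead1(toy$chrom)) &
          same(lag1(toy$Call), lead1(toy$Call))

v1 <- ambi & flanks & leftN & riteN   # both flanks agree with each other
v2 <- ambi & !v1 & leftOK             # otherwise, borrow from the left
v3 <- ambi & !v1 & !leftOK & riteOK   # otherwise, borrow from the right

newcall <- toy$Call
newcall[v1 | v2] <- lag1(toy$Call)[v1 | v2]
newcall[v3]      <- lead1(toy$Call)[v3]
print(data.frame(toy, newcall))
```

In this toy example, segment 6 stays ambiguous: its left neighbor lies on a different chromosome and its right neighbor has a different copy number.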
3.1 Again: Collapsing Segments Based on Consecutive Calls
We load the saved collapsing functions, which merge adjacent segments with the same call. For
example, if two consecutive segments are both called normal, they are merged into a single segment
and the statistics describing the segment are updated.
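The idea of collapsing runs of identical calls can be sketched with rle(). The function collapse_runs below is a hypothetical stand-in written for this illustration; it is not the collapseCommon function used in this report, and it updates only the segment boundaries and the marker count, not the other per-segment statistics.

```r
# Hypothetical helper: merge consecutive rows that share the same Call
collapse_runs <- function(seg) {
  runs  <- rle(as.character(seg$Call))
  last  <- cumsum(runs$lengths)        # index of the last row in each run
  first <- last - runs$lengths + 1     # index of the first row in each run
  data.frame(
    loc.start = seg$loc.start[first],
    loc.end   = seg$loc.end[last],
    num.mark  = vapply(seq_along(first),
                       function(i) sum(seg$num.mark[first[i]:last[i]]),
                       numeric(1)),
    Call      = runs$values,
    stringsAsFactors = FALSE
  )
}

# Made-up segments on one chromosome of one sample
toySeg <- data.frame(
  loc.start = c(1, 101, 201, 301),
  loc.end   = c(100, 200, 300, 400),
  num.mark  = c(10, 20, 5, 8),
  Call      = c("N02.BalHet", "N02.BalHet", "N01.Homozygous", "N02.BalHet"),
  stringsAsFactors = FALSE
)
merged <- collapse_runs(toySeg)
print(merged)
```

Only the first two rows are merged here; the last row has the same call as the first run but is not adjacent to it.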
> load("collapseFunctions.rda")
Now we actually perform the collapsing and re-calling of segments.
> # start from an empty data frame with the same columns as shrunk
> smaller <- shrunk[1, ]
> smaller <- smaller[-1, ]
> for (chrname in levels(shrunk$chrom)) {
+     print(paste("Chromosome", chrname))
+     for (cid in levels(shrunk$SamID)) {
+         cat(paste("Chromosome", chrname, "; Sample", cid, "\n"), file = stderr())
+         nchanged <- 1
+         thedata <- shrunk[shrunk$chrom == chrname & shrunk$SamID == cid, ]
+         # repeat until neither collapsing nor re-calling changes anything
+         while (nchanged > 0) {
+             simple <- collapseCommon(thedata)
+             nchanged <- nrow(thedata) - nrow(simple)
+             thecall <- recall(simple[, "Call"], "NoCall")
+             nchanged <- nchanged + sum(thecall != simple[, "Call"])
+             simple[, "Call"] <- thecall
+             thedata <- simple
+         }
+         smaller <- rbind(smaller, simple)
+     }
+ }
[1] "Chromosome 1"
[1] "Chromosome 2"
[1] "Chromosome 3"
[1] "Chromosome 4"
[1] "Chromosome 5"
[1] "Chromosome 6"
[1] "Chromosome 7"
[1] "Chromosome 8"
[1] "Chromosome 9"
[1] "Chromosome 10"
[1] "Chromosome 11"
[1] "Chromosome 12"
[1] "Chromosome 13"
[1] "Chromosome 14"
[1] "Chromosome 15"
[1] "Chromosome 16"
[1] "Chromosome 17"
[1] "Chromosome 18"
[1] "Chromosome 19"
[1] "Chromosome 20"
[1] "Chromosome 21"
[1] "Chromosome 22"
[1] "Chromosome X"
> dim(shrunk)
[1] 69346 28
> dim(smaller)
[1] 42728 28
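The loop above alternates collapsing and re-calling until nothing changes. The sketch below shows why more than one pass can be needed, using plain character vectors of calls. Both helpers are hypothetical stand-ins (collapse_calls for collapseCommon, fill_flanked for recall); the real functions are not shown in this report and may behave differently.

```r
# Hypothetical stand-in for collapsing: keep one entry per run of equal calls
collapse_calls <- function(calls) rle(calls)$values

# Hypothetical stand-in for re-calling: a target segment flanked on both
# sides by the same (non-target) call takes that call
fill_flanked <- function(calls, target = "NoCall") {
  n <- length(calls)
  if (n < 3) return(calls)
  out <- calls
  for (i in 2:(n - 1)) {
    if (calls[i] == target && calls[i - 1] == calls[i + 1] &&
        calls[i - 1] != target) {
      out[i] <- calls[i - 1]
    }
  }
  out
}

calls  <- c("A", "NoCall", "A", "B", "NoCall", "NoCall", "B")
passes <- 0
repeat {
  merged   <- collapse_calls(calls)
  filled   <- fill_flanked(merged)
  nchanged <- (length(calls) - length(merged)) + sum(filled != merged)
  calls    <- filled
  passes   <- passes + 1
  if (nchanged == 0) break
}
print(calls)
print(passes)
```

In this example the two adjacent NoCall entries must first be collapsed into one before they count as flanked, and the re-called entries must then be collapsed again, so a single pass would not be enough.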
> table(smaller$Call)
    DoubleLoss N01.Homozygous   N01.UnbalHet  N02.Ambiguous     N02.BalHet
          1537           4645           1191            717          20415
N02.Homozygous   N02.UnbalHet  N03.Ambiguous N03.Homozygous   N03.UnbalHet
          8680           1382           1280            653            817
 N04.Ambiguous     N04.BalHet N04.Homozygous   N04.UnbalHet  N05.Ambiguous
           254            180             58            254            124
  N05.UnbalHet  N06.Ambiguous     N06.BalHet   N06.UnbalHet  N07.Ambiguous
            18             32             11              6              4
        NoCall
           470
> summary(smaller)
SamID chrom flag loc.start loc.end
CL081 : 493 6 : 3638 BAF:14783 Min. : 274 Min. : 28938
CL090 : 404 4 : 3243 CNV:20309 1st Qu.: 22246227 1st Qu.: 32756221
CLZ.11 : 398 2 : 3024 LOH: 7636 Median : 56015058 Median : 67416597
CLZ.48 : 390 8 : 2993 Mean : 65743895 Mean : 77638554
CLZ.46 : 387 X : 2772 3rd Qu.: 97369096 3rd Qu.:108898262
CL048 : 372 5 : 2571 Max. :246901621 Max. :247185943
(Other):40284 (Other):24487
num.mark seg.mean seg.sd seg.median
Min. : 2 Min. :-7.874288 Min. :5.817e-03 Min. :-7.874288
1st Qu.: 78 1st Qu.:-0.049632 1st Qu.:1.621e-01 1st Qu.:-0.030334
Median : 248 Median :-0.021311 Median :1.959e-01 Median :-0.001572
Mean : 2503 Mean :-0.190089 Mean :2.995e-01 Mean :-0.190423
3rd Qu.: 3453 3rd Qu.: 0.009656 3rd Qu.:2.426e-01 3rd Qu.: 0.026978
Max. :49517 Max. : 0.741210 Max. :5.860e+00 Max. : 0.968868
NA's :1.414e+03
seg.mad AA AB BB
Min. :0.004606 Min. : 0.0 Min. : 0.0 Min. : 0.0
1st Qu.:0.136593 1st Qu.: 18.0 1st Qu.: 0.0 1st Qu.: 21.0
Median :0.159553 Median : 92.0 Median : 18.0 Median : 100.0
Mean :0.230181 Mean : 774.8 Mean : 738.8 Mean : 878.3
3rd Qu.:0.189790 3rd Qu.: 1054.0 3rd Qu.: 1005.0 3rd Qu.: 1193.0
Max. :7.929879 Max. :14933.0 Max. :15663.0 Max. :17096.0
NC Silw1 Silw2 Silw3
Min. : 0.0 Min. : 0.0000 Min. : 0.0000 Min. : 0.0000
1st Qu.: 4.0 1st Qu.: 0.4472 1st Qu.: 0.7084 1st Qu.: 0.9752
Median : 26.0 Median : 0.4559 Median : 0.7494 Median : 0.9794
Mean : 110.9 Mean : 0.4574 Mean : 0.8118 Mean : 0.9654
3rd Qu.: 149.0 3rd Qu.: 0.4878 3rd Qu.: 0.9774 3rd Qu.: 0.9859
Max. :5003.0 Max. : 0.4999 Max. : 1.0000 Max. : 1.0000
NA's :3376.0000 NA's :3376.0000 NA's :3376.0000
Silw4 nG1 nG2 nG3
Min. : 0.0000 Min. : 0.0 Min. : 0.0 Min. : 0.0
1st Qu.: 0.8724 1st Qu.: 50.0 1st Qu.: 1.0 1st Qu.: 1.0
Median : 0.8891 Median : 153.0 Median : 40.0 Median : 33.0
Mean : 0.9066 Mean : 896.6 Mean : 480.5 Mean : 337.8
3rd Qu.: 0.9696 3rd Qu.: 1281.0 3rd Qu.: 692.0 3rd Qu.: 476.0
Max. : 1.0000 Max. :15792.0 Max. :9163.0 Max. :7010.0
NA's :3376.0000 NA's : 3376.0 NA's :3376.0 NA's :3376.0
nG4 ME.5 ME.0 Mix
Min. : 0.0 Min. : 0.0000 Min. : 0.000 Min. : 0.0000
1st Qu.: 52.0 1st Qu.: 0.6152 1st Qu.: 1.665 1st Qu.: 0.3424
Median : 163.0 Median : 0.8228 Median : 371.495 Median : 0.4664
Mean : 998.4 Mean : 27.4249 Mean : 2393.905 Mean : 0.4087
3rd Qu.: 1433.0 3rd Qu.: 1.0000 3rd Qu.: 3446.210 3rd Qu.: 0.4700
Max. :17806.0 Max. :8730.7368 Max. :66329.078 Max. : 0.4700
NA's : 3376.0 NA's :3376.0000 NA's : 3376.000 NA's :3376.0000
nBAFComp ABperc Call
Min. : 0.000 Min. : 0.00 N02.BalHet :20415
1st Qu.: 2.000 1st Qu.: 0.00 N02.Homozygous: 8680
Median : 2.981 Median : 27.30 N01.Homozygous: 4645
Mean : 2.648 Mean : 19.29 DoubleLoss : 1537
3rd Qu.: 3.000 3rd Qu.: 32.51 N02.UnbalHet : 1382
Max. : 4.000 Max. :100.00 N03.Ambiguous : 1280
NA's :788.000 NA's :846.00 (Other) : 4789
3.2 Finalize
We also write a TSV file of the segments after collapsing.
> rownames(smaller) <- NULL
> write.table(smaller, "smaller.tsv", sep = "\t", row.names = FALSE,
+ col.names = TRUE)
> save(smaller, file = "smaller.rda")
> rm(shrunk)
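For readers who want to work from the TSV file instead of the binary R file, the following sketch shows one way to read such a file back. It uses a small made-up data frame and a temporary file instead of smaller.tsv. Forcing the chrom column to character is an assumption about what downstream code would want; without it, a file whose chromosome labels are all numeric would be read as integers.

```r
# Made-up segments, written and read back through a temporary TSV file
toyOut <- data.frame(
  SamID    = c("S1", "S1"),
  chrom    = c("1", "X"),
  num.mark = c(120L, 45L),
  Call     = c("N02.BalHet", "N01.Homozygous"),
  stringsAsFactors = FALSE
)
tmp <- tempfile(fileext = ".tsv")
write.table(toyOut, tmp, sep = "\t", row.names = FALSE, col.names = TRUE)

# read.delim assumes a header line and tab separators
back <- read.delim(tmp, stringsAsFactors = FALSE,
                   colClasses = c(chrom = "character"))
unlink(tmp)
str(back)
```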
4 Appendix
This analysis was run in the following directory:
> getwd()
[1] "c:/MyStuff/SNP-CLL/AA"
This analysis was run in the following software environment:
> sessionInfo()
R version 2.12.0 (2010-10-15)
Platform: x86_64-pc-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base