Published by , 2017-01-21 06:55:03

Calling Segmented Data From SNP Assay

Kevin R. Coombes
17 March 2011

Contents

1 Executive Summary
1.1 Introduction
1.1.1 Aims/Objectives
1.2 Methods
1.2.1 Description of Data
1.2.2 Statistical Methods
1.3 Results
1.4 Conclusions

2 Details

3 Disambiguate
3.1 Again: Collapsing Segments Based on Consecutive Calls
3.2 Finalize

4 Appendix

1 Executive Summary

1.1 Introduction

This report describes the analysis of a data set from Lynn Barron, a member of the laboratory of
Lynne V. Abruzzo. This dataset was acquired using Illumina 610K SNP chips.

This is the tenth part of a series of related reports.

1.1.1 Aims/Objectives
The objective of this report is to use the merged segment data (as produced by Report 6) to make
meaningful calls about the copy number and LOH status of each segment.

10-disambiguate 2

1.2 Methods

1.2.1 Description of Data

The dataset contains measurements on 176 previously untreated patients with CLL. Extensive
clinical followup is available.

1.2.2 Statistical Methods

Raw data were processed in BeadStudio to yield genotype calls, log R ratios (LRR), and B allele
frequencies (BAF) for each SNP in each sample. Since the study does not include matched normal
DNA from the samples, the BeadStudio computations were performed relative to the pool of 120
HapMap samples run by Illumina.

In Report 2, we applied the circular binary segmentation (CBS) algorithm to the intensity (log
R ratio; LRR) data for each sample and each chromosome. CBS was first described by Olshen et
al. [Biostatistics 2004; 5:557–72]; we use the implementation of CBS from the R package DNAcopy.
In Report 3, we computed the odds ratio for LOH versus no LOH in windows of width 40 along
each chromosome. In Report 4, we applied the CBS algorithm to transformed B allele frequency
(BAF) values on each chromosome of each cell line sample. In all three of those reports, we saved
the segmentation results in per-sample files. In Report 6, we pooled the segment data from the
different algorithms for each sample. We also computed summary statistics along each resulting
segment, including the LRR mean and standard deviation, a summary of the genotypes for the
SNPs in the region, and the best fit for modeling the BAF as a mixture of multiple components. In
Reports 1 and 7, we fit a statistical model to put the log R ratio data back on a properly normalized
scale that could be interpreted consistently in terms of copy number.

In this report, we continue to assign a meaningful “call” or interpretation to each segment. The
main point here is to remove segments with “ambiguous” calls. Ambiguity only arises for segments
that are potentially homozygous but contain fewer than 100 SNP markers. If such a segment is
flanked by segments that (a) have the same call as each other and (b) have the same copy number
call as the ambiguous segment, then the three segments are merged.

1.3 Results

We generate a tab-separated-values file (smaller.tsv) and a binary R file (smaller.rda)
containing all updated segment calls for all samples and all chromosomes.

1.4 Conclusions

It would be nice to have some....

2 Details

We load the current segmentation and call data from the previous report.

> load("shrunk.rda")

3 Disambiguate

First, we count the number of segments assigned to each copy number level. Along the way, we
make a record of the current copy number assignments.

> ncopy <- as.numeric(substring(as.character(shrunk$Call), 2, 3))
> ncopy[shrunk$Call == "DoubleLoss"] <- 0
> table(ncopy)

ncopy
    0     1     2     3     4     5     6     7
 1537  5836 57590  2901   768   142    49     4
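
As a small illustration of how the copy number is read off the call labels, the following sketch applies the same substring approach to a handful of made-up labels. The vector names (calls, ncopy_toy) are assumptions for this example only; the labels follow the "Nxx.Status" pattern that appears in the tables of this report.

```r
# Hypothetical call labels following the "Nxx.Status" pattern used in this report
calls <- c("N02.BalHet", "N01.Homozygous", "DoubleLoss", "N03.Ambiguous", "NoCall")

# Characters 2-3 hold the two-digit copy number; labels that do not follow
# the pattern become NA (suppressWarnings hides the coercion warning)
ncopy_toy <- suppressWarnings(as.numeric(substring(calls, 2, 3)))

# As in the report, "DoubleLoss" is assigned copy number 0 explicitly
ncopy_toy[calls == "DoubleLoss"] <- 0

print(ncopy_toy)
```

Labels such as "NoCall" carry no copy number and stay NA, so they do not appear in a table of the parsed values.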

We are now going to change the calls for some of the ambiguous segments. As noted above,
a segment was initially called “ambiguous” only if it was apparently homozygous (based on the
number of components in the B allele frequency plots) but contained fewer than 100 SNPs. By
merging adjacent segments (and “NoCall” segments with common flanking segments), some of these
regions now contain more than 100 SNPs.

> summary(shrunk[grep("Ambiguous", as.character(shrunk$Call)), c(1:2,
+ 4:6, 9:14, 25:27)])

     SamID           chrom         loc.start            loc.end
 CLZ.5  :  281   2      : 1406   Min.   :      274   Min.   :    28938
 CL081  :  275   6      : 1167   1st Qu.: 26884934   1st Qu.: 27351531
 CLZ.40 :  274   7      : 1149   Median : 62973655   Median : 63541486
 CL143  :  272   4      : 1125   Mean   : 72474032   Mean   : 72938245
 CLZ.48 :  269   1      : 1116   3rd Qu.:104514704   3rd Qu.:104773483
 CLZ.27 :  257   5      :  986   Max.   :246911720   Max.   :247185943
 (Other):15325   (Other):10004

    num.mark        seg.median           seg.mad             AA
 Min.   :  3.00   Min.   :-0.156145   Min.   :0.01796   Min.   : 0.00
 1st Qu.: 31.00   1st Qu.:-0.031230   1st Qu.:0.13590   1st Qu.: 9.00
 Median : 48.00   Median : 0.007823   Median :0.16495   Median :18.00
 Mean   : 50.35   Mean   : 0.014625   Mean   :0.17279   Mean   :19.07
 3rd Qu.: 68.00   3rd Qu.: 0.047583   3rd Qu.:0.20038   3rd Qu.:28.00
 Max.   :306.00   Max.   : 0.968868   Max.   :0.70293   Max.   :89.00

       AB               BB              NC               Mix              nBAFComp
 Min.   : 0.000   Min.   : 0.00   Min.   :  0.000   Min.   :  0.0000   Min.   :1.344
 1st Qu.: 0.000   1st Qu.:10.00   1st Qu.:  0.000   1st Qu.:  0.3180   1st Qu.:2.000
 Median : 1.000   Median :20.00   Median :  2.000   Median :  0.3607   Median :2.000
 Mean   : 1.173   Mean   :21.11   Mean   :  8.996   Mean   :  0.3640   Mean   :2.000
 3rd Qu.: 2.000   3rd Qu.:31.00   3rd Qu.: 10.000   3rd Qu.:  0.4659   3rd Qu.:2.000
 Max.   :11.000   Max.   :99.00   Max.   :262.000   Max.   :  0.4700   Max.   :2.667
                                                    NA's   :182.0000

     ABperc
 Min.   :  0.000
 1st Qu.:  0.000
 Median :  2.000
 Mean   :  3.208
 3rd Qu.:  4.545
 Max.   :100.000
 NA's   :117.000

Whenever an ambiguous region is flanked by segments that (a) have the same call as each other
and (b) have the same copy number call as the ambiguous region, we want to make the same call
for all three segments. The next block of code compares a central region with the regions on its
left and right.

> N <- nrow(shrunk)
> leftc <- c(NA, shrunk[1:(N - 1), "chrom"])
> centc <- shrunk[, "chrom"]
> ritec <- c(shrunk[2:N, "chrom"], NA)
> leftC <- c(NA, shrunk[1:(N - 1), "Call"])
> centC <- shrunk[, "Call"]
> riteC <- c(shrunk[2:N, "Call"], NA)
> leftn <- c(NA, ncopy[1:(N - 1)])
> centn <- ncopy
> riten <- c(ncopy[2:N], NA)
> one <- leftc == ritec
> two <- leftC == riteC
> three <- leftn == riten
> down1 <- leftc == centc
> down3 <- leftn == centn
> up1 <- ritec == centc
> up3 <- riten == centn
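
The neighbour comparison can be seen on a toy table. This is a minimal sketch, not the report's code: the data frame seg and the variable names are invented for illustration, and the columns are stored as character (an assumption made here) so that the comparisons do not depend on how c() treats factors.

```r
# Toy segment table; chromosome and call values are illustrative only
seg <- data.frame(
  chrom = c("1", "1", "1", "X", "X"),
  Call  = c("N02.BalHet", "N02.Ambiguous", "N02.BalHet",
            "N01.Homozygous", "N01.Ambiguous"),
  stringsAsFactors = FALSE
)
n <- nrow(seg)

# Left and right neighbours, padded with NA at the ends of the table
left_chrom  <- c(NA, seg$chrom[-n])
right_chrom <- c(seg$chrom[-1], NA)
left_call   <- c(NA, seg$Call[-n])
right_call  <- c(seg$Call[-1], NA)

same_chrom_flanks <- left_chrom == right_chrom   # plays the role of `one`
same_call_flanks  <- left_call == right_call     # plays the role of `two`
left_same_chrom   <- left_chrom == seg$chrom     # plays the role of `down1`

print(data.frame(seg, same_chrom_flanks, same_call_flanks, left_same_chrom))
```

The first and last rows yield NA for the flank comparisons because one neighbour is missing, which is why NA values have to be handled when the flags are later used for indexing.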

Now we only look at the ambiguous segments.

> ambi <- regexpr("Ambiguous", as.character(shrunk$Call)) > 0
> sum(ambi)

[1] 16953

We pick out three subsets of the ambiguous regions.

1. v1 defines segments where the flanking segments are on the same chromosome (one) and have
the same call (two) and have the same copy number as the central segment (up3 and down3).

2. v2 defines segments that do not satisfy the previous criterion but for which the flanking region
to the left is on the same chromosome (down1) and has the same copy number (down3) as the
ambiguous segment.

3. v3 defines segments that do not match either of the preceding conditions but for which the
flanking segment on the right has matching chromosome and copy number information.

Note that the checks for chromosome are to handle the special case of the first and last segments
on a chromosome.

> v1 <- ambi & one & two & down3 & up3
> v2 <- ambi & !(one & two & down3 & up3) & (down1 & down3)
> v3 <- ambi & !(one & two & down3 & up3) & !(down1 & down3) & (up1 &
+ up3)

Now we use these lists of segments to update the calls.

> mycall <- shrunk$Call
> w <- which(v1 & !is.na(v1))
> mycall[w] <- shrunk[w - 1, "Call"]
> w <- which(v2 & !is.na(v2))
> mycall[w] <- shrunk[w - 1, "Call"]
> w <- which(v3 & !is.na(v3))
> mycall[w] <- shrunk[w + 1, "Call"]
> sum(mycall != shrunk$Call)

[1] 14542

> sum(ambi & mycall != shrunk$Call)

[1] 14542

> shrunk$Call <- factor(mycall)
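
To make the order of the three rules concrete, here is a hedged, self-contained sketch on six made-up segments from a single chromosome (so the chromosome checks are omitted). The names r1, r2, r3 and newcall are hypothetical, and NA flags are neutralized with %in% TRUE, which is a simplification assumed for this example rather than the exact logic above.

```r
# Toy data for one chromosome; all values are illustrative
call  <- c("N02.BalHet", "N02.Ambiguous", "N02.BalHet",
           "N01.Homozygous", "N01.Ambiguous", "N03.UnbalHet")
ncopy <- c(2, 2, 2, 1, 1, 3)
n <- length(call)

left_call  <- c(NA, call[-n])
right_call <- c(call[-1], NA)
left_n     <- c(NA, ncopy[-n])
right_n    <- c(ncopy[-1], NA)
ambi <- grepl("Ambiguous", call)

# Rule 1: flanks agree with each other and match the centre's copy number
r1 <- ambi & (left_call == right_call) & (left_n == ncopy) & (right_n == ncopy)
# Rule 2: otherwise, the left neighbour has the same copy number
r2 <- ambi & !(r1 %in% TRUE) & (left_n == ncopy)
# Rule 3: otherwise, the right neighbour has the same copy number
r3 <- ambi & !(r1 %in% TRUE) & !(r2 %in% TRUE) & (right_n == ncopy)

# which() drops NA flags, so only definite matches are updated
newcall <- call
w <- which(r1)
newcall[w] <- call[w - 1]
w <- which(r2)
newcall[w] <- call[w - 1]
w <- which(r3)
newcall[w] <- call[w + 1]
print(newcall)
```

In this toy case the second segment is resolved by rule 1 and the fifth by rule 2, so both take the call of their left neighbour.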

3.1 Again: Collapsing Segments Based on Consecutive Calls

We use this function to collapse adjacent segments with the same call. For example, if two
consecutive segments are both called normal, they are merged into a single segment and the
statistics describing the segment are updated.

> load("collapseFunctions.rda")
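
The source of collapseCommon is not shown in this report. As a rough idea of what collapsing runs of identical calls can look like, the following sketch merges consecutive rows with the same call on a toy table. The helper collapse_runs is hypothetical and only updates start, end, and marker count; it is not the function loaded above.

```r
# Toy segments on one chromosome of one sample; columns mimic a subset of
# the report's segment table, values are made up
seg <- data.frame(
  loc.start = c(1, 101, 201, 301, 401),
  loc.end   = c(100, 200, 300, 400, 500),
  num.mark  = c(10, 20, 30, 40, 50),
  Call      = c("N02.BalHet", "N02.BalHet", "N01.Homozygous",
                "N02.BalHet", "N02.BalHet"),
  stringsAsFactors = FALSE
)

# collapse_runs is a hypothetical helper, not collapseCommon
collapse_runs <- function(d) {
  runs   <- rle(d$Call)
  last   <- cumsum(runs$lengths)        # last row of each run
  first  <- last - runs$lengths + 1     # first row of each run
  run_id <- rep(seq_along(runs$lengths), runs$lengths)
  data.frame(
    loc.start = d$loc.start[first],
    loc.end   = d$loc.end[last],
    num.mark  = as.numeric(tapply(d$num.mark, run_id, sum)),
    Call      = runs$values,
    stringsAsFactors = FALSE
  )
}

merged <- collapse_runs(seg)
print(merged)
```

Summary statistics such as segment means or genotype counts would need their own update rules, which this sketch leaves out.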

Now we actually perform the collapsing and re-calling of segments.

> smaller <- shrunk[1, ]
> smaller <- smaller[-1, ]
> for (chrname in levels(shrunk$chrom)) {
+     print(paste("Chromosome", chrname))
+     for (cid in levels(shrunk$SamID)) {
+         cat(paste("Chromosome", chrname, "; Sample", cid, "\n"), file = stderr())
+         nchanged <- 1
+         thedata <- shrunk[shrunk$chrom == chrname & shrunk$SamID ==
+             cid, ]
+         while (nchanged > 0) {
+             simple <- collapseCommon(thedata)
+             nchanged <- nrow(thedata) - nrow(simple)
+             thecall <- recall(simple[, "Call"], "NoCall")
+             nchanged <- nchanged + sum(thecall != simple[, "Call"])
+             simple[, "Call"] <- thecall
+             thedata <- simple
+         }
+         smaller <- rbind(smaller, simple)
+     }
+ }

[1] "Chromosome 1"
[1] "Chromosome 2"
[1] "Chromosome 3"
[1] "Chromosome 4"
[1] "Chromosome 5"
[1] "Chromosome 6"
[1] "Chromosome 7"
[1] "Chromosome 8"
[1] "Chromosome 9"
[1] "Chromosome 10"
[1] "Chromosome 11"
[1] "Chromosome 12"
[1] "Chromosome 13"
[1] "Chromosome 14"
[1] "Chromosome 15"
[1] "Chromosome 16"
[1] "Chromosome 17"
[1] "Chromosome 18"
[1] "Chromosome 19"
[1] "Chromosome 20"
[1] "Chromosome 21"
[1] "Chromosome 22"

[1] "Chromosome X"
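
The loop above grows the result with rbind once per chromosome and sample. One common alternative, sketched here on a toy table with invented names (toy, pieces, combined), is to collect the per-group pieces in a list and bind them in a single step; this is offered as an assumption about a possible variant, not as what was run for this report.

```r
# Toy table with two grouping factors, standing in for chrom and SamID
toy <- data.frame(
  chrom = factor(c("1", "1", "2", "2")),
  SamID = factor(c("A", "B", "A", "B")),
  value = 1:4
)

pieces <- list()
for (chrname in levels(toy$chrom)) {
  for (cid in levels(toy$SamID)) {
    part <- toy[toy$chrom == chrname & toy$SamID == cid, ]
    # per-group processing (collapsing, re-calling) would go here
    pieces[[length(pieces) + 1]] <- part
  }
}

# Bind all pieces once instead of growing a data frame inside the loop
combined <- do.call(rbind, pieces)
rownames(combined) <- NULL
print(combined)
```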

> dim(shrunk)

[1] 69346 28

> dim(smaller)

[1] 42728 28

> table(smaller$Call)

    DoubleLoss N01.Homozygous   N01.UnbalHet  N02.Ambiguous     N02.BalHet
          1537           4645           1191            717          20415
N02.Homozygous   N02.UnbalHet  N03.Ambiguous N03.Homozygous   N03.UnbalHet
          8680           1382           1280            653            817
 N04.Ambiguous     N04.BalHet N04.Homozygous   N04.UnbalHet  N05.Ambiguous
           254            180             58            254            124
  N05.UnbalHet  N06.Ambiguous     N06.BalHet   N06.UnbalHet  N07.Ambiguous
            18             32             11              6              4
        NoCall
           470

> summary(smaller)

     SamID           chrom        flag         loc.start            loc.end
 CL081  :  493   6      : 3638   BAF:14783   Min.   :      274   Min.   :    28938
 CL090  :  404   4      : 3243   CNV:20309   1st Qu.: 22246227   1st Qu.: 32756221
 CLZ.11 :  398   2      : 3024   LOH: 7636   Median : 56015058   Median : 67416597
 CLZ.48 :  390   8      : 2993               Mean   : 65743895   Mean   : 77638554
 CLZ.46 :  387   X      : 2772               3rd Qu.: 97369096   3rd Qu.:108898262
 CL048  :  372   5      : 2571               Max.   :246901621   Max.   :247185943
 (Other):40284   (Other):24487

    num.mark         seg.mean             seg.sd             seg.median
 Min.   :    2   Min.   :-7.874288   Min.   :5.817e-03   Min.   :-7.874288
 1st Qu.:   78   1st Qu.:-0.049632   1st Qu.:1.621e-01   1st Qu.:-0.030334
 Median :  248   Median :-0.021311   Median :1.959e-01   Median :-0.001572
 Mean   : 2503   Mean   :-0.190089   Mean   :2.995e-01   Mean   :-0.190423
 3rd Qu.: 3453   3rd Qu.: 0.009656   3rd Qu.:2.426e-01   3rd Qu.: 0.026978
 Max.   :49517   Max.   : 0.741210   Max.   :5.860e+00   Max.   : 0.968868
                                     NA's   :1.414e+03

    seg.mad               AA                AB                BB
 Min.   :0.004606   Min.   :    0.0   Min.   :    0.0   Min.   :    0.0
 1st Qu.:0.136593   1st Qu.:   18.0   1st Qu.:    0.0   1st Qu.:   21.0
 Median :0.159553   Median :   92.0   Median :   18.0   Median :  100.0
 Mean   :0.230181   Mean   :  774.8   Mean   :  738.8   Mean   :  878.3
 3rd Qu.:0.189790   3rd Qu.: 1054.0   3rd Qu.: 1005.0   3rd Qu.: 1193.0
 Max.   :7.929879   Max.   :14933.0   Max.   :15663.0   Max.   :17096.0

       NC              Silw1                Silw2                Silw3
 Min.   :   0.0   Min.   :   0.0000   Min.   :   0.0000   Min.   :   0.0000
 1st Qu.:   4.0   1st Qu.:   0.4472   1st Qu.:   0.7084   1st Qu.:   0.9752
 Median :  26.0   Median :   0.4559   Median :   0.7494   Median :   0.9794
 Mean   : 110.9   Mean   :   0.4574   Mean   :   0.8118   Mean   :   0.9654
 3rd Qu.: 149.0   3rd Qu.:   0.4878   3rd Qu.:   0.9774   3rd Qu.:   0.9859
 Max.   :5003.0   Max.   :   0.4999   Max.   :   1.0000   Max.   :   1.0000
                  NA's   :3376.0000   NA's   :3376.0000   NA's   :3376.0000

     Silw4                 nG1               nG2              nG3
 Min.   :   0.0000   Min.   :    0.0   Min.   :   0.0   Min.   :   0.0
 1st Qu.:   0.8724   1st Qu.:   50.0   1st Qu.:   1.0   1st Qu.:   1.0
 Median :   0.8891   Median :  153.0   Median :  40.0   Median :  33.0
 Mean   :   0.9066   Mean   :  896.6   Mean   : 480.5   Mean   : 337.8
 3rd Qu.:   0.9696   3rd Qu.: 1281.0   3rd Qu.: 692.0   3rd Qu.: 476.0
 Max.   :   1.0000   Max.   :15792.0   Max.   :9163.0   Max.   :7010.0
 NA's   :3376.0000   NA's   : 3376.0   NA's   :3376.0   NA's   :3376.0

      nG4               ME.5                 ME.0                 Mix
 Min.   :    0.0   Min.   :   0.0000   Min.   :    0.000   Min.   :   0.0000
 1st Qu.:   52.0   1st Qu.:   0.6152   1st Qu.:    1.665   1st Qu.:   0.3424
 Median :  163.0   Median :   0.8228   Median :  371.495   Median :   0.4664
 Mean   :  998.4   Mean   :  27.4249   Mean   : 2393.905   Mean   :   0.4087
 3rd Qu.: 1433.0   3rd Qu.:   1.0000   3rd Qu.: 3446.210   3rd Qu.:   0.4700
 Max.   :17806.0   Max.   :8730.7368   Max.   :66329.078   Max.   :   0.4700
 NA's   : 3376.0   NA's   :3376.0000   NA's   : 3376.000   NA's   :3376.0000

    nBAFComp           ABperc                  Call
 Min.   :  0.000   Min.   :  0.00   N02.BalHet    :20415
 1st Qu.:  2.000   1st Qu.:  0.00   N02.Homozygous: 8680
 Median :  2.981   Median : 27.30   N01.Homozygous: 4645
 Mean   :  2.648   Mean   : 19.29   DoubleLoss    : 1537
 3rd Qu.:  3.000   3rd Qu.: 32.51   N02.UnbalHet  : 1382
 Max.   :  4.000   Max.   :100.00   N03.Ambiguous : 1280
 NA's   :788.000   NA's   :846.00   (Other)       : 4789

3.2 Finalize

We also write a TSV file of the segments after collapsing.

> rownames(smaller) <- NULL
> write.table(smaller, "smaller.tsv", sep = "\t", row.names = FALSE,
+ col.names = TRUE)
> save(smaller, file = "smaller.rda")
> rm(shrunk)
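
A file written this way can be read back with read.delim. The sketch below round-trips a toy data frame through a temporary file; the data, the file path, and the colClasses choice are assumptions for illustration and do not describe how smaller.tsv is consumed downstream.

```r
# Toy stand-in for the `smaller` data frame; values are illustrative
toy <- data.frame(SamID = c("S1", "S2"), chrom = c("1", "X"),
                  num.mark = c(120, 45), stringsAsFactors = FALSE)

# Write to a temporary file so the example leaves no files behind
path <- tempfile(fileext = ".tsv")
write.table(toy, path, sep = "\t", row.names = FALSE, col.names = TRUE)

# Read back; colClasses keeps chromosome names such as "1" as text
back <- read.delim(path, colClasses = c("character", "character", "numeric"))
unlink(path)
print(back)
```

Declaring the chromosome column as character avoids having names like "1" and "X" read back with mixed or unintended types.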

4 Appendix

This analysis was run in the following directory:
> getwd()

[1] "c:/MyStuff/SNP-CLL/AA"
This analysis was run in the following software environment:

> sessionInfo()

R version 2.12.0 (2010-10-15)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

