Resolving multicopy duplications de novo using polyploid phasing

dc.contributor.author: Chaisson, Mark J
dc.contributor.author: Mukherjee, Sudipto
dc.contributor.author: Kannan, Sreeram
dc.contributor.author: Eichler, Evan E
dc.date.accessioned: 2017-06-12T17:58:17Z
dc.date.available: 2017-06-12T17:58:17Z
dc.date.issued: 2017-05
dc.description.abstract: While the rise of single-molecule sequencing systems has enabled an unprecedented rise in the ability to assemble complex regions of the genome, long segmental duplications in the genome still remain a challenging frontier in assembly. Segmental duplications are at the same time both gene rich and prone to large structural rearrangements, making the resolution of their sequences important in medical and evolutionary studies. Duplicated sequences that are collapsed in mammalian de novo assemblies are rarely identical; after a sequence is duplicated, it begins to acquire paralog-specific variants. In this paper, we study the problem of resolving the variations in multicopy long segmental duplications by developing and utilizing algorithms for polyploid phasing. We develop two algorithms: the first is targeted at maximizing the likelihood of observing the reads given the underlying haplotypes using discrete matrix completion. The second algorithm is based on correlation clustering and exploits an assumption, which is often satisfied in these duplications, that each paralog has a sizable number of paralog-specific variants. We develop a detailed simulation methodology, and demonstrate the superior performance of the proposed algorithms on an array of simulated datasets. We measure the likelihood score as well as reconstruction accuracy, i.e., what fraction of the reads are clustered correctly. On both performance metrics, we find that our algorithms dominate existing algorithms on more than 93% of the datasets. While the discrete matrix completion performs better on likelihood score, the correlation clustering algorithm performs better on reconstruction accuracy due to the stronger regularization inherent in the algorithm. We also show that our correlation-clustering algorithm can reconstruct on average 7.0 haplotypes in 10-copy duplication datasets, whereas existing algorithms reconstruct less than 1 copy on average. [en_US]
dc.description.sponsorship: NIH HG002385 [en_US]
dc.identifier.citation: Chaisson MJ, Mukherjee S, Kannan S, Eichler EE. (2017) Resolving multicopy duplications de novo using polyploid phasing. In: Sahinalp S. (eds) Research in Computational Molecular Biology. RECOMB 2017. Lecture Notes in Computer Science, vol 10229. Springer, Cham, 117–133. [en_US]
dc.identifier.uri: http://hdl.handle.net/1773/39168
dc.language.iso: en [en_US]
dc.publisher: Springer [en_US]
dc.subject: multicopy duplications [en_US]
dc.subject: polyploid phasing [en_US]
dc.subject: RECOMB [en_US]
dc.title: Resolving multicopy duplications de novo using polyploid phasing [en_US]
dc.type: Article [en_US]

Files

Original bundle

Name: AssemblyByPhasing.pdf
Size: 476.75 KB
Format: Adobe Portable Document Format
Description: Main article

License bundle

Name: license.txt
Size: 1.6 KB
Format: Item-specific license agreed upon to submission
Description: