Resolving multicopy duplications de novo using polyploid phasing
Authors
Chaisson, Mark J
Mukherjee, Sudipto
Kannan, Sreeram
Eichler, Evan E
Publisher
Springer
Abstract
While the rise of single-molecule sequencing systems has enabled
an unprecedented improvement in the ability to assemble complex regions
of the genome, long segmental duplications remain a
challenging frontier in assembly. Segmental duplications are
both gene rich and prone to large structural rearrangements, making
the resolution of their sequences important in medical and evolutionary
studies. Duplicated sequences that are collapsed in mammalian de novo
assemblies are rarely identical; after a sequence is duplicated, it begins
to acquire paralog-specific variants. In this paper, we study the problem
of resolving the variations in multicopy long-segmental duplications by
developing and utilizing algorithms for polyploid phasing. We develop
two algorithms: the first one is targeted at maximizing the likelihood of
observing the reads given the underlying haplotypes using discrete
matrix completion. The second algorithm is based on correlation clustering
and exploits an assumption, which is often satisfied in these duplications,
that each paralog has a sizable number of paralog-specific variants. We
develop a detailed simulation methodology, and demonstrate the superior
performance of the proposed algorithms on an array of simulated
datasets. We measure the likelihood score as well as reconstruction accuracy,
i.e., the fraction of reads that are clustered correctly. On both
performance metrics, we find that our algorithms dominate existing
algorithms on more than 93% of the datasets. While the discrete
matrix completion performs better on likelihood score, the correlation
clustering algorithm performs better on reconstruction accuracy due to
the stronger regularization inherent in the algorithm. We also show that
our correlation-clustering algorithm can reconstruct on average 7.0
haplotypes in 10-copy duplication datasets, whereas existing algorithms
reconstruct fewer than 1 copy on average.
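
The reconstruction accuracy mentioned in the abstract (the fraction of reads clustered correctly) can be illustrated with a minimal sketch. The abstract does not give the exact definition used in the paper, so the following assumes that predicted clusters carry arbitrary labels and that accuracy is scored under the best one-to-one matching between predicted clusters and true paralog copies. The function name `reconstruction_accuracy` and the brute-force matching are illustrative choices, not the authors' implementation.

```python
from collections import Counter
from itertools import permutations


def reconstruction_accuracy(true_labels, pred_labels):
    """Fraction of reads placed in the correct copy, scored under the best
    one-to-one relabeling of predicted clusters (an assumed definition)."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have equal length")
    if not true_labels:
        return 0.0

    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(pred_labels))

    # Pad the shorter side with placeholders so every cluster can be matched.
    k = max(len(true_ids), len(pred_ids))
    true_ids += [None] * (k - len(true_ids))
    pred_ids += [None] * (k - len(pred_ids))

    # counts[(t, p)] = number of reads from true copy t placed in cluster p.
    counts = Counter(zip(true_labels, pred_labels))

    # Brute-force search over matchings; practical only for small copy numbers.
    best = 0
    for perm in permutations(true_ids):
        matched = sum(counts[(t, p)] for t, p in zip(perm, pred_ids))
        best = max(best, matched)
    return best / len(true_labels)


if __name__ == "__main__":
    true_copy = [0, 0, 1, 1, 2, 2]   # hypothetical true paralog of each read
    predicted = [2, 2, 0, 0, 1, 0]   # hypothetical cluster assigned to each read
    print(reconstruction_accuracy(true_copy, predicted))
```

Because the search enumerates every matching, its cost grows factorially with the number of copies; for larger copy numbers an assignment solver would be a more practical substitute.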
Citation
Chaisson MJ, Mukherjee S, Kannan S, Eichler EE. (2017) Resolving multicopy duplications de novo using polyploid phasing. In: Sahinalp S. (eds) Research in Computational Molecular Biology. RECOMB 2017. Lecture Notes in Computer Science, vol 10229. Springer, Cham, 117–133.
