Timezone: »
Storing data in synthetic DNA offers the possibility of improving information density and durability by several orders of magnitude compared to current storage technologies. However, DNA data storage requires a computationally intensive process to retrieve the data. In particular, a crucial step in the data retrieval pipeline involves clustering billions of strings with respect to edit distance. We observe that datasets in this domain have many notable properties, such as containing a very large number of small clusters that are well-separated in the edit distance metric space. In this regime, existing algorithms are unsuitable because of either their long running time or low accuracy. To address this issue, we present a novel distributed algorithm for approximately computing the underlying clusters. Our algorithm converges efficiently on any dataset that satisfies certain separability properties, such as those coming from DNA storage systems. We also prove that, under these assumptions, our algorithm is robust to outliers and high levels of noise. We provide empirical justification of the accuracy, scalability, and convergence of our algorithm on real and synthetic data. Compared to the state-of-the-art algorithm for clustering DNA sequences, our algorithm simultaneously achieves higher accuracy and a 1000x speedup on three real datasets.
Author Information
Cyrus Rashtchian (University of Washington)
Konstantin Makarychev (Northwestern University)
Miklos Racz (Princeton University)
Siena Ang (Microsoft)
Djordje Jevdjic (Microsoft Research)
Sergey Yekhanin (Microsoft)
Luis Ceze (University of Washington)
Karin Strauss (Microsoft Research)
Related Events (a corresponding poster, oral, or spotlight)
-
2017 Poster: Clustering Billions of Reads for DNA Data Storage »
Wed. Dec 6th 02:30 -- 06:30 AM Room Pacific Ballroom #74
More from the Same Authors
-
2021 Spotlight: Correlated Stochastic Block Models: Exact Graph Matching with Applications to Recovering Communities »
Miklos Racz · Anirudh Sridhar -
2021 : Learned Compiler Optimizations »
Luis Ceze -
2021 Poster: Differentially Private n-gram Extraction »
Kunho Kim · Sivakanth Gopi · Janardhan Kulkarni · Sergey Yekhanin -
2021 Poster: Correlated Stochastic Block Models: Exact Graph Matching with Applications to Recovering Communities »
Miklos Racz · Anirudh Sridhar -
2020 Poster: Improved Guarantees for k-means++ and k-means++ Parallel »
Konstantin Makarychev · Aravind Reddy · Liren Shan -
2019 Poster: An Algorithmic Framework For Differentially Private Data Analysis on Trusted Processors »
Janardhan Kulkarni · Olga Ohrimenko · Bolin Ding · Sergey Yekhanin · Joshua Allen · Harsha Nori -
2019 Poster: Correlation clustering with local objectives »
Sanchit Kalhan · Konstantin Makarychev · Timothy Zhou -
2018 Poster: Learning to Optimize Tensor Programs »
Tianqi Chen · Lianmin Zheng · Eddie Yan · Ziheng Jiang · Thierry Moreau · Luis Ceze · Carlos Guestrin · Arvind Krishnamurthy -
2018 Spotlight: Learning to Optimize Tensor Programs »
Tianqi Chen · Lianmin Zheng · Eddie Yan · Ziheng Jiang · Thierry Moreau · Luis Ceze · Carlos Guestrin · Arvind Krishnamurthy -
2017 Poster: Collecting Telemetry Data Privately »
Bolin Ding · Janardhan Kulkarni · Sergey Yekhanin