Invited Talk
New Methods for the Analysis of Genome Variation Data
Richard Durbin
Harvey's Convention Center Floor, CC
Genetic variation in genome sequences within a species such as humans underpins our biological diversity, is the basis for the genetic
contribution to disease, provides information about our ancestry, and is the substrate for evolution.
Genetic variation has a complex structure of shared inheritance from a common ancestor at
each position in the genome, with the pattern of sharing changing along the genome as a consequence of genetic recombination.
The scale of data sets that can be obtained from modern sequencing and genotyping methods, currently of the order of hundreds of
terabytes, makes analysis computationally challenging.
During the last few years, a number of tools such as BWA and Bowtie have been developed for sequence matching based on suffix-array-derived data structures, in particular the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index. These have the attractive property that they not only give asymptotically optimal search but are also highly compressed data structures (the BWT underlies the bzip2 compression algorithm). I will discuss a number of approaches based
on these data structures for primary data processing, sequence assembly, variation detection and large-scale genetic analysis, with applications to very
large-scale human genetic variation data sets.
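
To make the BWT and FM-index idea concrete, the following is a minimal illustrative sketch of backward search over a toy sequence. It is not taken from the talk and is not how BWA or Bowtie are implemented: the function names (build_fm_index, locate) are hypothetical, the suffix array is built naively, and the occurrence table is stored uncompressed, so the sketch shows only the search logic and none of the compression.

```python
from collections import Counter

def build_fm_index(text):
    """Build a toy FM index (assumption: '$' does not occur in text)."""
    text += "$"  # sentinel, sorts before A, C, G, T
    # Naive suffix array construction; real tools use far faster methods.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (index -1 wraps to '$').
    bwt = "".join(text[i - 1] for i in sa)
    # C[c] = number of characters in text that are strictly smaller than c.
    counts = Counter(text)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    # occ[c][i] = number of occurrences of c in bwt[:i] (uncompressed here).
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, bwt, C, occ

def locate(pattern, sa, C, occ):
    """Backward search: return sorted start positions of pattern in the text."""
    lo, hi = 0, len(sa)  # half-open suffix array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])

if __name__ == "__main__":
    sa, bwt, C, occ = build_fm_index("GATTACAGATTACA")
    print(locate("ATTA", sa, C, occ))
```

Each pattern character narrows the suffix array interval with two table lookups, so the number of search steps depends on the pattern length rather than on the size of the indexed sequence.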