A major challenge in applying machine learning to genomics is the scarcity of labeled data, which can often be obtained only through expensive and time-consuming laboratory experiments. However, the advent of high-throughput sequencing has made large quantities of unlabeled genome data available, which can be exploited for semi-supervised learning through representation learning. In this paper, we investigate the impact of a popular and well-established language model, namely \emph{BERT}, on genome sequence datasets. Specifically, we develop \emph{GenomeNet-BERT} to produce useful representations for downstream classification tasks. We compare its performance to strictly supervised training and to baselines across different training set sizes. Our experiments show that this architecture improves performance over existing methods at the cost of more resource-intensive training.
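To make the idea of learning representations from unlabeled genome data more concrete, the following is a minimal sketch of a BERT-style masked-token setup on a nucleotide sequence. It is an illustration only and not the GenomeNet-BERT implementation: the overlapping k-mer tokenization, the function names `tokenize` and `mask_tokens`, the `[MASK]` symbol, and the 15% masking rate are all assumptions made for this example. It also simplifies the original BERT procedure, which sometimes substitutes a random token or leaves the selected token unchanged instead of always masking it.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, chosen for this sketch


def tokenize(sequence, k=3):
    """Split a nucleotide string into overlapping k-mers.

    k-mer tokenization is an assumption of this sketch; the abstract
    does not state how sequences are tokenized.
    """
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace tokens with MASK and record the originals.

    Returns (inputs, labels). labels[i] is the original token where
    position i was masked and None elsewhere, so a model would only be
    trained to predict the masked positions.
    """
    rng = random.Random(seed)  # local RNG keeps the example reproducible
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


tokens = tokenize("ACGTACGTAC", k=3)
inputs, labels = mask_tokens(tokens, mask_prob=0.15, seed=0)
```

In such a setup, a model pretrained to recover the masked tokens needs no labels, and its learned representations could then be fine-tuned or used as features for a downstream classifier trained on the smaller labeled set.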
Author Information
Noah Hurmer (University of Munich, Ludwig-Maximilians-Universität München)
Xiao-Yin To (Ludwig-Maximilians-Universität München)
Martin Binder (Department of Statistics)
Hüseyin Anil Gündüz (LMU Munich)
Philipp Münch (Harvard University)
René Mreches (Helmholtz Centre for Infection Research)
Alice McHardy (HZI)
Bernd Bischl (LMU)
Mina Rezaei (Ludwig-Maximilian University)