Large repositories of source code create new challenges and opportunities for statistical machine learning. Here we first develop an infrastructure for the automated crawling, parsing, and database storage of open source software. The infrastructure allows us to gather Internet-scale source code. For instance, in one experiment, we gather 4,632 Java projects from SourceForge and Apache totaling over 38 million lines of code from 9,250 developers. Simple statistical analyses of the data first reveal robust power-law behavior for package, SLOC, and method call distributions. We then develop and apply unsupervised author-topic probabilistic models to automatically discover the topics embedded in the code and extract topic-word and author-topic distributions. In addition to serving as a convenient summary of program function and developer activities, these and other related distributions provide a statistical and information-theoretic basis for quantifying and analyzing developer similarity and competence, topic scattering, and document tangling, with direct applications to software engineering. Finally, by combining software textual content with structural information captured by our CodeRank approach, we are able to significantly improve software retrieval performance, increasing the AUC metric to 0.86, roughly 10-30% better than previous approaches based on text alone.
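The abstract does not say which information-theoretic measures are used, so the following is only a minimal sketch of how author-topic distributions could support such comparisons. It assumes Jensen-Shannon divergence as the developer-similarity measure and Shannon entropy as the scattering measure; both choices, the developer names, and all the probabilities are illustrative assumptions, not values or methods taken from the paper.

```python
import math


def entropy(p):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical distributions,
    1 for distributions with disjoint support."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2


# Hypothetical author-topic distributions over four topics (made-up numbers).
alice = [0.70, 0.20, 0.05, 0.05]
bob = [0.65, 0.25, 0.05, 0.05]
carol = [0.05, 0.05, 0.20, 0.70]

# A smaller divergence means the two developers work on similar topics.
alice_vs_bob = js_divergence(alice, bob)
alice_vs_carol = js_divergence(alice, carol)

# Hypothetical distribution of one topic over four source files.
# Higher entropy means the topic is spread (scattered) across more files.
concentrated_topic = [0.97, 0.01, 0.01, 0.01]
scattered_topic = [0.25, 0.25, 0.25, 0.25]

if __name__ == "__main__":
    print("alice vs bob:  ", alice_vs_bob)
    print("alice vs carol:", alice_vs_carol)
    print("scattering (concentrated):", entropy(concentrated_topic))
    print("scattering (scattered):   ", entropy(scattered_topic))
```

Under these assumptions, `alice` and `bob` come out closer to each other than `alice` and `carol`, and the uniformly spread topic has the higher entropy. Jensen-Shannon divergence is used here because it is symmetric and bounded; other measures would serve equally well for a sketch.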
Author Information
Erik Linstead (Chapman University)
Paul Rigor (Pacific Northwest National Laboratory)
Sushil Bajracharya
Cristina Lopes
Pierre Baldi (UC Irvine)
More from the Same Authors
-
2021 : Deep learning reconstruction of the neutrino energy with a shallow Askaryan detector »
Stephen McAleer · Christian Glaser · Pierre Baldi -
2021 : G-SpaNet: Generalized Permutationless Set Assignment for Particle Physics using Symmetry Preserving Attention »
Alexander Shmakov · Shih-chieh Hsu · Pierre Baldi -
2022 : Geometry-aware Autoregressive Models for Calorimeter Shower Simulations »
Junze Liu · Aishik Ghosh · Dylan Smith · Pierre Baldi · Daniel Whiteson -
2022 : Foundations of Attention Mechanisms in Deep Neural Network Architectures »
Pierre Baldi · Roman Vershynin -
2022 : Feasible Adversarial Robust Reinforcement Learning for Underspecified Environments »
JB Lanier · Stephen McAleer · Pierre Baldi · Roy Fox -
2023 Poster: End-To-End Latent Variational Diffusion Models for Inverse Problems in High Energy Physics »
Alexander Shmakov · Kevin Greif · Michael Fenton · Aishik Ghosh · Pierre Baldi · Daniel Whiteson -
2023 Poster: Language Models can Solve Computer Tasks »
Geunwoo Kim · Pierre Baldi · Stephen McAleer -
2023 Poster: AI for Interpretable Chemistry: Predicting Radical Mechanistic Pathways via Contrastive Learning »
Mohammadamin Tavakoli · Pierre Baldi · Ann Marie Carlton · Yin Ting Chiu · Alexander Shmakov · David Van Vranken -
2023 Poster: ClimSim: An open large-scale dataset for training high-resolution physics emulators in hybrid multi-scale climate models »
Sungduk Yu · Walter Hannah · Liran Peng · Jerry Lin · Mohamed Aziz Bhouri · Ritwik Gupta · Björn Lütjens · Justus Will · Gunnar Behrens · Nora Loose · Charles Stern · Tom Beucler · Bryce Harrop · Benjamin Hillman · Andrea Jenney · Savannah Ferretti · Nana Liu · Animashree Anandkumar · Noah Brenowitz · Veronika Eyring · Nicholas Geneva · Pierre Gentine · Stephan Mandt · Jaideep Pathak · Akshay Subramaniam · Carl Vondrick · Rose Yu · Laure Zanna · Ryan Abernathey · Fiaz Ahmed · David Bader · Pierre Baldi · Elizabeth Barnes · Christopher Bretherton · Julius Busecke · Peter Caldwell · Wayne Chuang · Yilun Han · YU HUANG · Fernando Iglesias-Suarez · Sanket Jantre · Karthik Kashinath · Marat Khairoutdinov · Thorsten Kurth · Nicholas Lutsko · Po-Lun Ma · Griffin Mooers · J. David Neelin · David Randall · Sara Shamekh · Mark Taylor · Nathan Urban · Janni Yuval · Guang Zhang · Tian Zheng · Mike Pritchard -
2023 Oral: ClimSim: An open large-scale dataset for training high-resolution physics emulators in hybrid multi-scale climate models »
Sungduk Yu · Walter Hannah · Liran Peng · Jerry Lin · Mohamed Aziz Bhouri · Ritwik Gupta · Björn Lütjens · Justus Will · Gunnar Behrens · Nora Loose · Charles Stern · Tom Beucler · Bryce Harrop · Benjamin Hillman · Andrea Jenney · Savannah Ferretti · Nana Liu · Animashree Anandkumar · Noah Brenowitz · Veronika Eyring · Nicholas Geneva · Pierre Gentine · Stephan Mandt · Jaideep Pathak · Akshay Subramaniam · Carl Vondrick · Rose Yu · Laure Zanna · Ryan Abernathey · Fiaz Ahmed · David Bader · Pierre Baldi · Elizabeth Barnes · Christopher Bretherton · Julius Busecke · Peter Caldwell · Wayne Chuang · Yilun Han · YU HUANG · Fernando Iglesias-Suarez · Sanket Jantre · Karthik Kashinath · Marat Khairoutdinov · Thorsten Kurth · Nicholas Lutsko · Po-Lun Ma · Griffin Mooers · J. David Neelin · David Randall · Sara Shamekh · Mark Taylor · Nathan Urban · Janni Yuval · Guang Zhang · Tian Zheng · Mike Pritchard -
2021 Poster: XDO: A Double Oracle Algorithm for Extensive-Form Games »
Stephen McAleer · JB Lanier · Kevin A Wang · Pierre Baldi · Roy Fox -
2020 Poster: Pipeline PSRO: A Scalable Approach for Finding Approximate Nash Equilibria in Large Games »
Stephen McAleer · JB Lanier · Roy Fox · Pierre Baldi -
2019 Poster: Modeling Dynamic Functional Connectivity with Latent Factor Gaussian Processes »
Lingge Li · Dustin Pluta · Babak Shahbaba · Norbert Fortin · Hernando Ombao · Pierre Baldi -
2018 Poster: On Neuronal Capacity »
Pierre Baldi · Roman Vershynin -
2018 Oral: On Neuronal Capacity »
Pierre Baldi · Roman Vershynin -
2017 : Poster session »
Abbas Zaidi · Christoph Kurz · David Heckerman · YiJyun Lin · Stefan Riezler · Ilya Shpitser · Songbai Yan · Olivier Goudet · Yash Deshpande · Judea Pearl · Jovana Mitrovic · Brian Vegetabile · Tae Hwy Lee · Karen Sachs · Karthika Mohan · Reagan Rose · Julius Ramakers · Negar Hassanpour · Pierre Baldi · Razieh Nabi · Noah Hammarlund · Eli Sherman · Carolin Lawrence · Fattaneh Jabbari · Vira Semenova · Maria Dimakopoulou · Pratik Gajane · Russell Greiner · Ilias Zadik · Alexander Blocker · Hao Xu · Tal EL HAY · Tony Jebara · Benoit Rostykus -
2014 Workshop: High-energy particle physics, machine learning, and the HiggsML data challenge (HEPML) »
Glen Cowan · Balázs Kégl · Kyle Cranmer · Gábor Melis · Tim Salimans · Vladimir Vava Gligorov · Daniel Whiteson · Lester Mackey · Wojciech Kotlowski · Roberto Díaz Morales · Pierre Baldi · Cecile Germain · David Rousseau · Isabelle Guyon · Tianqi Chen -
2014 Poster: Searching for Higgs Boson Decay Modes with Deep Learning »
Peter Sadowski · Daniel Whiteson · Pierre Baldi -
2014 Spotlight: Searching for Higgs Boson Decay Modes with Deep Learning »
Peter Sadowski · Daniel Whiteson · Pierre Baldi -
2013 Poster: Understanding Dropout »
Pierre Baldi · Peter Sadowski -
2013 Oral: Understanding Dropout »
Pierre Baldi · Peter Sadowski -
2012 Poster: Deep Spatio-Temporal Architectures and Learning for Protein Structure Prediction »
Pietro Di Lena · Pierre Baldi · Ken Nagata -
2012 Spotlight: Deep Spatio-Temporal Architectures and Learning for Protein Structure Prediction »
Pietro Di Lena · Pierre Baldi · Ken Nagata -
2011 Poster: A Machine Learning Approach to Predict Chemical Reactions »
Matthew A Kayala · Pierre Baldi -
2010 Workshop: Charting Chemical Space: Challenges and Opportunities for AI and Machine Learning »
Pierre Baldi · Klaus-Robert Müller · Gisbert Schneider -
2006 Poster: A Scalable Machine Learning Approach to Go »
Lin Wu · Pierre Baldi -
2006 Talk: A Scalable Machine Learning Approach to Go »
Lin Wu · Pierre Baldi