Machine learning research has always prized algorithmic contributions. However, many recent big breakthroughs have been driven by scaling up the same basic algorithms and architectures. The most recent example is OpenAI’s massive language model GPT-3, which won a best paper award at NeurIPS in 2020. GPT-3 was based on the same Transformer architecture as its predecessors, but when scaled up, it resulted in remarkable unexpected behaviors, which had a massive impact on the way we think about language models. As more progress becomes driven by scaling, how should we adapt as a community? Should it affect what problems are considered interesting? Should publication norms take scale into account, or de-emphasize algorithmic contributions? How do we ensure that smaller institutions or academic labs can meaningfully research and audit large-scale systems? From a safety perspective, if behaviors appear emergently at scale, how can we ensure that systems behave as intended? In this panel, we will explore these critical questions so that the NeurIPS community at large can continue to make fundamental advances in the era of massive scaling.
Benchmark datasets have played a crucial role in driving empirical progress in machine learning, leading to an interesting dynamic between those on a quest for state-of-the-art performance and those creating new challenging benchmarks. In this panel, we reflect on how benchmarks can lead to scientific progress, both in terms of new algorithmic innovations and improved scientific understanding. First, what qualities of a machine learning system should a good benchmark dataset seek to measure? How well can benchmarks assess performance in dynamic and novel environments, or in tasks with an open-ended set of acceptable answers? Benchmarks can also raise significant ethical concerns including poor data collection practices, under- and misrepresentation of subjects, as well as misspecification of objectives. Second, even given high-quality, carefully constructed benchmarks, which research questions can we hope to answer from leaderboard-climbing, and which ones are deprioritized or impossible to answer due to the limitations of the benchmark paradigm? In general, we hope to deepen the community’s awareness of the important role of benchmarks for advancing the science of machine learning.
Grappling with copyright law is unavoidable for ML researchers. Copyright protects works like text, photographs, and videos--all of which are used as ML training data, often without consent of the copyright owner. Relying on public domain works (like works published pre-1926), Creative Commons-licensed data (like Wikipedia) or ubiquitous data (like the Enron emails) seems like an easy way to avoid dealing with copyright. Unfortunately, only relying on those works predictably introduces bias into ML algorithms. This Workshop will not provide any legal advice, but it will equip researchers with the tools to understand copyright law and its relationship to ML bias, how the fair use doctrine may allow some copyrighted works to be used as training data without consent, and resources for obtaining legal advice related to copyright and ML research. Attendees will be able to participate in a Q&A after the presentation.
These are some of the resources mentioned in the discussion:
As machine learning becomes increasingly widespread in the real world, there has been a growing set of well-documented potential harms that need to be acknowledged and addressed. In particular, valid concerns about data privacy, algorithmic bias, automation risk, potential malicious uses, and more have highlighted the need for the active consideration of critical ethical issues in the field. In the light of this, there have been calls for machine learning researchers to actively consider not only the potential benefits of their research but also its potential negative societal impact, and adopt measures that enable positive trajectories to unfold while mitigating risk of harm. However, grappling with ethics is still a difficult and unfamiliar problem for many in the field. A common difficulty with assessing ethical impact is its indirectness: most papers focus on general-purpose methodologies (e.g., optimization algorithms), whereas ethical concerns are more apparent when considering downstream applications (e.g., surveillance systems). Also, real-world impact (both positive and negative) often emerges from the cumulative progress of many papers, so it is difficult to attribute the impact to an individual paper. Furthermore, regular research ethics mechanisms such as an Institutional Review Board (IRB) are not always a good fit for machine learning …