Mathematical reasoning is a unique aspect of human intelligence and a fundamental building block for scientific and intellectual pursuits. However, learning mathematics is often a challenging human endeavor that relies on expert instructors to create, teach and evaluate mathematical material. From an educational perspective, AI systems that aid in this process offer increased inclusion and accessibility, efficiency, and understanding of mathematics. Moreover, building systems capable of understanding, creating, and using mathematics offers a unique setting for studying reasoning in AI. This workshop will investigate the intersection of mathematics education and AI.
Sat 6:55 a.m. - 7:00 a.m.

Introduction and Opening Remarks
(Opening Remarks)
Sat 7:00 a.m. - 7:30 a.m.

Reasoning and Abstraction as Challenges for AI
(Invited Talk)
Cezary Kaliszyk
Sat 7:30 a.m. - 8:00 a.m.

Length Generalization in Quantitative Reasoning
(Invited Talk)
Behnam Neyshabur
Sat 8:00 a.m. - 8:30 a.m.

Has Progress on Math been Surprising?
(Invited Talk)
In 2021, we commissioned forecasters to predict progress on ML benchmarks, including the MATH dataset for mathematical problem solving. Progress on MATH ended up being much faster than predicted. I'll discuss what we should and shouldn't take away from this, my own predictions for future progress, and general implications for predicting future developments in ML.
Jacob Steinhardt
Sat 8:30 a.m. - 10:00 a.m.

Poster Session
Sat 10:00 a.m. - 11:00 a.m.

Lunch Break
(Break)
Sat 11:00 a.m. - 11:20 a.m.

Teaching Algorithmic Reasoning via In-context Learning
(Contributed Talk)
Large language models (LLMs) have shown increasing in-context learning capabilities through scaling up model and data size. Despite this progress, LLMs are still unable to solve algorithmic reasoning problems. While providing a rationale with the final answer has led to further improvements in multi-step reasoning problems, Anil et al. (2022) showed that even simple algorithmic reasoning tasks such as parity are far from solved. In this work, we identify and study four key stages for successfully teaching algorithmic reasoning to LLMs: (1) formulating algorithms as skills, (2) teaching multiple skills simultaneously (skill accumulation), (3) teaching how to combine skills (skill composition), and (4) teaching how to use skills as tools. We show that it is possible to teach algorithmic reasoning to LLMs via in-context learning, which we refer to as algorithmic prompting. We evaluate our approach on a variety of arithmetic and quantitative reasoning tasks, and demonstrate significant boosts in performance over existing prompting techniques. In particular, for long parity, addition, multiplication, and subtraction, we achieve an error reduction of approximately 10x, 9x, 5x, and 2x, respectively, compared to the best available baselines.
Hattie Zhou · Azade Nova · Aaron Courville · Hugo Larochelle · Behnam Neyshabur · Hanie Sedghi
Sat 11:20 a.m. - 11:40 a.m.

Solving Math Word Problems with Process-based and Outcome-based Feedback
(Contributed Talk)
Recent work has shown that prompting language models to generate reasoning steps improves performance on many reasoning tasks. When moving beyond prompting, this raises the question of how we should supervise the fine-tuning of such models: outcome-based approaches which supervise the final result, or process-based approaches which supervise the reasoning process itself? Differences between these approaches might naturally be expected not just in final-answer errors but also in reasoning errors, which can be difficult to detect and are problematic in many real-world domains such as education. We run the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task, GSM8K. We find that pure outcome-based supervision produces similar final-answer error rates with less label supervision. However, for correct reasoning steps we find it necessary to use process-based supervision or supervision from learned reward models that emulate process-based feedback. In total, we improve the previous best results from 16.8% to 12.7% final-answer error and from 14.0% to 3.4% reasoning error among final-answer-correct solutions.
Jonathan Uesato · Nate Kushman · Ramana Kumar · H. Francis Song · Noah Siegel · Lisa Wang · Antonia Creswell · Geoffrey Irving · Irina Higgins
Sat 11:40 a.m. - 12:00 p.m.

ProofNet: A Benchmark for Autoformalizing and Formally Proving Undergraduate-Level Mathematics Problems
(Contributed Talk)
Zhangir Azerbayev · Bartosz Piotrowski · Jeremy Avigad
Sat 12:00 p.m. - 12:30 p.m.

Towards Systematic Reasoning with Language Models
(Invited Talk)
Mathematics requires systematic reasoning, namely the stepwise application of knowledge in a sound manner to reach a conclusion. Can language models (LMs) perform this kind of systematic reasoning with knowledge provided to them? Or, even more ambitiously, can LMs reason systematically with their own internal knowledge acquired during pretraining? In this talk, I'll attempt to answer these questions, illustrated with our recent work on using LMs for logical deduction, proof generation, and multi-step textual entailment problems. While progress has been made, there is still a way to go. To illustrate this, I'll conclude by posing a (currently unsolved) grand challenge to the math reasoning community: answering Fermi problems, which requires combining systematic reasoning, mathematics, and world knowledge.
Peter Clark
Sat 12:30 p.m. - 1:00 p.m.

Coffee Break
(Break)
Sat 1:00 p.m. - 1:30 p.m.

Leveraging Maths to Understand Transformers
(Invited Talk)
Francois Charton
Sat 1:30 p.m. - 2:00 p.m.

Learning Mathematical Reasoning for Education
(Invited Talk)
Noah Goodman
Sat 2:00 p.m. - 2:55 p.m.

MATH-AI: Toward Human-Level Mathematical Reasoning
(Discussion Panel)
Francois Charton · Noah Goodman · Behnam Neyshabur · Talia Ringer · Daniel Selsam
Sat 2:55 p.m. - 3:00 p.m.

Closing Remarks


Neural Combinatorial Logic Circuit Synthesis from Input-Output Examples
(Poster)
We propose a novel, fully explainable neural approach to synthesis of combinatorial logic circuits from input-output examples. The key advantage of our method is that it readily extends to inductive scenarios, where the set of examples is incomplete but still indicative of the desired behaviour. Our method can be employed for a virtually arbitrary choice of atoms, from logic gates to FPGA blocks, as long as they can be formulated in a differentiable fashion, and consistently yields good results for synthesis of practical circuits of increasing size. In particular, we succeed in learning a number of arithmetic, bitwise, and signal-routing operations, and even generalise towards the correct behaviour in inductive scenarios. Our method, attacking a discrete logical synthesis problem with an explainable neural approach, hints at a wider promise for synthesis and reasoning-related tasks.
Peter Belcak · Roger Wattenhofer


Automatic Generation of Socratic Questions for Learning to Solve Math Word Problems
(Poster)
Socratic questioning is an educational method that allows students to discover answers to complex problems by asking them a series of thoughtful questions. Generation of didactically sound questions is challenging, requiring an understanding of the reasoning process involved in the problem. We hypothesize that such a questioning strategy can not only enhance human performance but also assist math word problem (MWP) solvers. In this work, we explore the ability of large language models (LMs) to generate sequential questions for guiding math word problem solving. We propose various guided question generation schemes based on input conditioning and reinforcement learning. On both automatic and human quality evaluations, we find that LMs constrained with desirable question properties generate superior questions and improve the overall performance of a math word problem solver.
Kumar Shridhar · Jakub Macina · Menna El-Assady · Tanmay Sinha · Mrinmaya Sachan


Generating Reflexive Polytopes via Sequence Modeling
(Poster)
We train neural network sequence models to generate reflexive lattice polytopes. We demonstrate that they can generate mathematical objects satisfying various geometric properties. We use the completeness of our datasets to give evidence that the models are understanding some underlying structure of the data.
Bernt Ivar Utstøl Nødland


A Causal Framework to Quantify Robustness of Mathematical Reasoning with Language Models
(Poster)
We have recently witnessed a number of impressive results on hard mathematical reasoning problems with large language models (LLMs). At the same time, the robustness of these models has also been called into question. Building on the idea of behavioral testing, we propose a novel framework, which pins down the causal effect of each factor in the input, e.g., the surface form of the problem text, the operands, and math operators, on the output. By grounding the behavioral analysis in a causal graph describing an intuitive reasoning process, we study the behavior of LLMs in terms of robustness and sensitivity to direct interventions in the input space. We apply our framework on a test bed of bivariate math word problems. Our analysis shows that robustness does not appear to continuously improve as a function of scale, but that the recent LLM, GPT-3 Instruct (175B), achieves a dramatic improvement in both robustness and sensitivity, compared to all other GPT variants.
Alessandro Stolfo · Zhijing Jin · Kumar Shridhar · Bernhard Schölkopf · Mrinmaya Sachan


What is my math transformer doing? Three results on interpretability and generalization
(Poster)
We investigate the failure cases and out-of-distribution behavior of transformers trained on matrix inversion, eigendecomposition, and eigenvalue calculation. We show that incorrect model predictions still retain deep mathematical properties of the solution (e.g., correct eigenvalues, unit norm of eigenvectors), and that almost all model failures can be attributed to, and predicted from, properties of the problem or solution. This demonstrates that, when in doubt, math transformers do not hallucinate crazy solutions (as was sometimes proposed) but remain
Francois Charton


Learning to Understand Plane Geometry Diagram
(Poster)
Geometry diagram parsing plays a key role in geometry problem solving, wherein primitive extraction and relation parsing remain challenging due to the complex layout and between-primitive relationships. In this paper, we propose a powerful diagram parser based on deep learning and graph reasoning. Specifically, a modified instance segmentation method is proposed to extract geometric primitives, and a graph neural network (GNN) is leveraged to realize relation parsing and primitive classification, incorporating geometric features and prior knowledge. All the modules are integrated into an end-to-end model called PGDPNet to perform all the subtasks simultaneously. In addition, we build a new large-scale geometry diagram dataset named PGDP5K with primitive-level annotations. Experiments on PGDP5K and an existing dataset, IMP-Geometry3K, show that our model substantially outperforms state-of-the-art methods on four subtasks. The full version of this paper has been accepted by IJCAI 2022.
Mingliang Zhang · Fei Yin · Yihan Hao · Chenglin Liu


Lemma: Bootstrapping High-Level Mathematical Reasoning with Learned Symbolic Abstractions
(Poster)
Humans tame the complexity of mathematical reasoning by developing hierarchies of abstractions. With proper abstractions, solutions to hard problems can be expressed concisely, thus making them more likely to be found. In this paper, we propose Learning Mathematical Abstractions (LEMMA): an algorithm that implements this idea for reinforcement learning agents in mathematical domains. LEMMA augments Expert Iteration with an abstraction step, where solutions found so far are revisited and rewritten in terms of new higher-level actions, which then become available to solve new problems. We evaluate LEMMA on two mathematical reasoning tasks, equation solving and fraction simplification, in a step-by-step fashion. In these two domains, LEMMA improves the ability of an existing agent, both solving more problems and generalizing more effectively to harder problems than those seen during training.
Zhening Li · Gabriel Poesia Reis e Silva · Omar Costilla Reyes · Noah Goodman · Armando Solar-Lezama


MWP-BERT: A Numeracy-augmented Pre-trained Encoder for Math Word Problems
(Poster)
Math word problem (MWP) solving faces a dilemma in number representation learning. In order to avoid the number representation issue and reduce the search space of feasible solutions, existing works striving for MWP solving usually replace real numbers with symbolic placeholders to focus on logic reasoning. However, instead of the number value itself, it is the reusable numerical property that matters more in numerical reasoning. Therefore, we argue that injecting numerical properties into symbolic placeholders with a contextualized representation learning schema can provide a way out of the dilemma in the number representation issue here. In this work, we introduce this idea to the popular pre-trained language model (PLM) techniques and build MWP-BERT, an effective contextual number representation PLM. We demonstrate the effectiveness of our MWP-BERT on MWP solving and several MWP-specific understanding tasks on both English and Chinese benchmarks.
Zhenwen Liang · Jipeng Zhang · Lei Wang · Wei Qin · Jie Shao · Xiangliang Zhang


Inversely Eliciting Numerical Reasoning in Language Models via Solving Linear Systems
(Poster)
Recent language models have struggled to generalize to a large range of numbers in numerical reasoning. In this paper, we propose a novel method that leverages simple numbers as anchors to characterize the implicitly inferred arithmetic expressions from language models, and then explicitly applies the expressions to original numbers to get the answers. Experimental results on several numerical reasoning benchmarks demonstrate that our approach is highly effective. More importantly, our approach works in the inference phase without extra model training, making it highly portable and achieving significant and consistent performance benefits across a variety of language models in zero-shot, few-shot, and fine-tuning scenarios.
Fan Zhou · Haoyu Dong · Qian Liu · Zhoujun Cheng · Shi Han · Dongmei Zhang


EuclidNet: Deep Visual Reasoning for Constructible Problems in Geometry
(Poster)
In this paper, we present a deep learning-based framework for solving geometric construction problems through visual reasoning, which is useful for automated geometry theorem proving. Constructible problems in geometry often ask for the sequence of straightedge-and-compass constructions that reaches a given goal from some initial setup. Our EuclidNet framework leverages the neural network architecture Mask R-CNN to extract the visual features from the initial setup and goal configuration with extra points of intersection, and then generates possible construction steps as intermediary data models that are used as feedback in the training process for further refinement of the construction step sequence. This process is repeated recursively until either a solution is found, in which case we backtrack the path for a step-by-step construction guide, or the problem is identified as unsolvable. Our EuclidNet framework is validated on complex Japanese Sangaku geometry problems, demonstrating its capacity to leverage backtracking for deep visual reasoning of challenging problems.
Man Fai Wong · Xintong Qi · Chee-Wei Tan


Estimating Numbers without Regression
(Poster)
Despite recent successes in language models, their ability to represent numbers is insufficient. Humans conceptualize numbers based on their magnitudes, effectively projecting them on a number line, whereas subword tokenization fails to explicitly capture magnitude by splitting numbers into arbitrary chunks. To alleviate this shortcoming, alternative approaches have been proposed that modify numbers at various stages of the language modeling pipeline. These methods change either (1) the notation in which numbers are written (e.g., scientific vs. decimal), (2) the vocabulary used to represent numbers, or (3) the entire architecture of the underlying language model, to directly regress to a desired number. In this work, we show that a potential trade-off to the more complex architectural changes is to simply change the model's vocabulary instead, e.g., introduce a new token for numbers in the range 10-100. In the context of masked number prediction, we find that a carefully designed tokenization scheme is both the simplest to implement and sufficient, i.e., with similar performance to the state-of-the-art approach that requires making significant architectural changes. Finally, we evaluate the various number representation schemes on the downstream task of numerical fact estimation (for Fermi Problems) in a zero-shot setting and find similar trends, i.e., changes at the tokenization level achieve near state-of-the-art results while requiring minimal resources compared to other number representation schemes.
Avijit Thawani · Jay Pujara · Ashwin Kalyan
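
The vocabulary-level idea in the abstract above (one token per magnitude range, such as 10-100) can be illustrated with a minimal sketch. The function name `magnitude_token` and the `[NUM_...]` token format are hypothetical choices made for this illustration; they are not taken from the paper, whose actual tokenization scheme may differ.

```python
import math


def magnitude_token(value: float) -> str:
    """Map a number to a coarse, hypothetical vocabulary token by order of magnitude."""
    if value == 0:
        return "[NUM_ZERO]"
    # Order of magnitude: 42 -> 1, so the bucket is [10, 100).
    exponent = math.floor(math.log10(abs(value)))
    low, high = 10 ** exponent, 10 ** (exponent + 1)
    sign = "-" if value < 0 else ""
    return f"[NUM_{sign}{low:g}-{high:g}]"


# All numbers between 10 and 100 share one token, so the model sees magnitude
# directly instead of arbitrary subword chunks.
print(magnitude_token(42))
print(magnitude_token(7))
```

Under this sketch, 42 and 99.9 collapse to the same token while 7 falls in a different bucket, which is the property the abstract attributes to magnitude-aware vocabularies.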


Learn to Select Good Examples with Reinforcement Learning for Semi-structured Mathematical Reasoning
(Poster)
Recent large pre-trained language models such as GPT-3 have achieved remarkable progress on mathematical reasoning tasks written in text form, such as math word problems (MWP). However, it is unknown if models can handle more complex problems that involve heterogeneous information, such as tabular data. To fill the gap, we present Tabular Math Word Problems (TabMWP), a new dataset containing 38,431 open-domain problems that require mathematical reasoning on both textual and tabular data, where each question is aligned with a tabular context. We evaluate different pre-trained models on TabMWP, including the GPT-3 model in a few-shot setting. As earlier studies suggest, since few-shot GPT-3 relies on the selection of in-context examples, its performance is unstable and can degrade to near chance. This issue is more severe when handling complex problems like TabMWP. To mitigate this, we further propose a novel approach, PromptPG, which utilizes policy gradient to learn to select good in-context examples from a small amount of training data. Experimental results show that our method outperforms the best baseline by 5.31% in accuracy and reduces the prediction variance significantly compared to random selection.
Pan Lu · Liang Qiu · Kai-Wei Chang · Ying Nian Wu · Song-Chun Zhu · Tanmay Rajpurohit · Peter Clark · Ashwin Kalyan


Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs
(Poster)
The formalization of existing mathematical proofs is a notoriously difficult process. Despite decades of research on automation and proof assistants, writing formal proofs remains arduous and only accessible to a few experts. While previous studies to automate formalization focused on powerful search algorithms, no attempts were made to take advantage of available informal proofs. In this work, we introduce Draft, Sketch, and Prove (DSP), a method that maps informal proofs to formal proof sketches, and uses the sketches to guide an automated prover by directing its search to easier subproblems. We investigate two relevant setups where informal proofs are either written by humans or generated by a language model. Our experiments and ablation studies show that large language models are able to produce well-structured formal sketches that follow the same reasoning steps as the informal proofs. Guiding an automated prover with these sketches enhances its performance from 20.9% to 39.3% on a collection of mathematical competition problems.
Albert Jiang · Sean Welleck · Jin Peng Zhou · Timothee Lacroix · Jiacheng Liu · Wenda Li · Mateja Jamnik · Guillaume Lample · Yuhuai Wu


Overcoming Barriers to Skill Injection in Language Modeling: Case Study in Arithmetic
(Poster)
Through their transfer learning abilities, highly parameterized large pre-trained language models have dominated the NLP landscape for a multitude of downstream language tasks. Though linguistically proficient, the inability of these models to incorporate the learning of non-linguistic entities (numerals and arithmetic reasoning) limits their usage for tasks that require numeric comprehension or strict mathematical reasoning. However, as we illustrate in this paper, building a general-purpose language model that also happens to be proficient in mathematical reasoning is not as straightforward as training it on a numeric dataset. In this work, we develop a novel framework that enables language models to be mathematically proficient while retaining their linguistic prowess. Specifically, we offer information-theoretic interventions to overcome the catastrophic forgetting of linguistic skills that occurs while injecting non-linguistic skills into language models.
Mandar Sharma · Nikhil Muralidhar · Naren Ramakrishnan


Teaching Algorithmic Reasoning via In-context Learning
(Poster)
Large language models (LLMs) have shown increasing in-context learning capabilities through scaling up model and data size. Despite this progress, LLMs are still unable to solve algorithmic reasoning problems. While providing a rationale with the final answer has led to further improvements in multi-step reasoning problems, Anil et al. (2022) showed that even simple algorithmic reasoning tasks such as parity are far from solved. In this work, we identify and study four key stages for successfully teaching algorithmic reasoning to LLMs: (1) formulating algorithms as skills, (2) teaching multiple skills simultaneously (skill accumulation), (3) teaching how to combine skills (skill composition), and (4) teaching how to use skills as tools. We show that it is possible to teach algorithmic reasoning to LLMs via in-context learning, which we refer to as algorithmic prompting. We evaluate our approach on a variety of arithmetic and quantitative reasoning tasks, and demonstrate significant boosts in performance over existing prompting techniques. In particular, for long parity, addition, multiplication, and subtraction, we achieve an error reduction of approximately 10x, 9x, 5x, and 2x, respectively, compared to the best available baselines.
Hattie Zhou · Azade Nova · Aaron Courville · Hugo Larochelle · Behnam Neyshabur · Hanie Sedghi
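
To make the first stage above ("formulating algorithms as skills") concrete for the parity task, here is a minimal sketch that writes out every intermediate step of a running-parity computation as text, the kind of fully explicit rationale that could be placed in a prompt. The function `parity_rationale` and its step wording are assumptions made for this illustration and are not the authors' prompt format.

```python
def parity_rationale(bits):
    """Return (rationale_text, parity) with one explicit line per processed bit."""
    state = 0  # running parity: 0 = even number of ones so far, 1 = odd
    steps = []
    for i, bit in enumerate(bits, start=1):
        new_state = state ^ bit  # XOR flips the parity whenever the bit is 1
        steps.append(f"Step {i}: bit={bit}, running parity {state} -> {new_state}")
        state = new_state
    steps.append(f"Answer: {'odd' if state else 'even'}")
    return "\n".join(steps), state


text, parity = parity_rationale([1, 0, 1, 1])
print(text)
```

Because every state transition is spelled out, the same rationale template applies unchanged to longer bit strings, which is the length-generalization setting the abstract refers to.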


Broken Neural Scaling Laws
(Poster)
We present a smoothly broken power law functional form that accurately models the scaling behaviors of deep neural networks (i.e., how the evaluation metric of interest varies as the amount of compute used for training, number of model parameters, or training dataset size varies) for each task within a large and diverse set of upstream and downstream tasks, in zero-shot, prompted, and fine-tuned settings. This set includes large-scale vision and unsupervised language tasks, arithmetic, and reinforcement learning. This functional form yields extrapolations of scaling behavior that often are an order of magnitude more accurate than the ones obtained by other functional forms for neural scaling behavior. Moreover, this functional form accurately models the non-monotonic transitions present in the scaling behavior of phenomena such as double descent and the delayed, sharp inflection points present in the scaling behavior of tasks such as arithmetic. Lastly, we use this functional form to glean insights about the limit of the predictability of scaling behavior.
Ethan Caballero · Kshitij Gupta · Irina Rish · David Krueger
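
The abstract above does not spell out the functional form, so the following is only a sketch of a generic smoothly broken power law with a single break. The parameter names (`a`, `b`, `c0`, `c1`, `d1`, `f1`) are assumptions for this illustration, and the paper's exact parameterization should be checked against the paper itself. On a log-log plot this curve has slope `-c0` well below the break location `d1` and slope `-(c0 + c1)` well above it, with `f1` controlling how sharp the transition is.

```python
import math


def broken_power_law(x, a, b, c0, c1, d1, f1):
    """Single-break smoothly broken power law (illustrative parameterization).

    y = a + b * x**(-c0) * (1 + (x / d1)**(1 / f1))**(-c1 * f1)
    """
    return a + b * x ** (-c0) * (1.0 + (x / d1) ** (1.0 / f1)) ** (-c1 * f1)


def loglog_slope(x, **params):
    """Numerical log-log slope between x and 2x."""
    y1 = broken_power_law(x, **params)
    y2 = broken_power_law(2 * x, **params)
    return math.log(y2 / y1) / math.log(2.0)


params = dict(a=0.0, b=1.0, c0=0.5, c1=1.0, d1=1e3, f1=0.1)
print(loglog_slope(1.0, **params))   # regime before the break
print(loglog_slope(1e6, **params))   # regime after the break
```

A plain power law would give the same slope in both regimes; the extra factor is what lets a single curve change slope at the break.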


Towards automating formalisation of theorem statements using large language models
(Poster)
Mathematics formalisation is the task of translating mathematics (i.e., definitions, theorem statements, proofs) written in natural language, as found in books and papers, into a formal language that can then be checked for correctness by a program. It is a thriving activity today, but formalisation remains cumbersome. In this paper, we explore the abilities of a large language model (Codex) to help with formalisation in the Lean theorem prover. We find that with careful input-dependent prompt selection and post-processing, Codex is able to formalise short mathematical statements at undergraduate level with nearly 75% accuracy for 120 theorem statements.
Siddhartha Gadgil · Anand Tadipatri · Navin Goyal · Ayush Agrawal · Ashvni Narayanan


Graph neural networks for Ramsey graphs
(Poster)
Ramsey-like problems are ubiquitous in extremal combinatorics and occupy a central place in the field. In simple terms, Ramsey theory wishes to find the minimum size of a large graph structure such that some sought substructure, generally a clique or an independent set, is guaranteed to exist. Due to considerations of computational complexity, brute-force approaches to solving these problems are usually not feasible, as the substructures cannot be checked in polynomial time. At the same time, we seek extremal graphs that completely avoid such substructures to better understand the graph theory governing their occurrence. We investigate the feasibility of Graph Neural Networks (GNNs) in terms of indicating and refining search procedures for finding these special classes of Ramsey-extremal graphs, which are of interest to mathematicians.
Amur Ghose · Amit Levi · Yingxue Zhang
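
As a small illustration of the brute-force check the abstract above describes as infeasible at scale, the sketch below tests every k-vertex subset of a graph for a clique or an independent set. The function name and graph encoding are choices made for this example, and this is not the GNN-guided search the authors study. The 5-cycle is the standard extremal example: it has neither a triangle nor an independent set of size 3, which shows that R(3,3) > 5.

```python
from itertools import combinations


def has_clique_or_independent_set(n, edges, k):
    """Check all k-subsets of vertices 0..n-1 for a k-clique or an independent k-set."""
    edge_set = {frozenset(e) for e in edges}
    for subset in combinations(range(n), k):
        present = [frozenset(p) in edge_set for p in combinations(subset, 2)]
        # All pairs adjacent -> clique; no pair adjacent -> independent set.
        if all(present) or not any(present):
            return True
    return False


cycle5 = [(i, (i + 1) % 5) for i in range(5)]
cycle6 = [(i, (i + 1) % 6) for i in range(6)]
print(has_clique_or_independent_set(5, cycle5, 3))  # 5-cycle avoids both substructures
print(has_clique_or_independent_set(6, cycle6, 3))  # vertices 0, 2, 4 are independent
```

The number of subsets examined grows as n choose k, which is why exhaustive checking becomes impractical for larger Ramsey parameters.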


Improving Compositional Generalization in Math Word Problem Solving
(Poster)
Compositional generalization refers to a model's capability to generalize to newly composed input data based on the data components observed during training. It has triggered a series of compositional generalization analyses on different tasks, as generalization is an important aspect of language and problem-solving skills. However, similar discussion on math word problems (MWPs) is limited. In this manuscript, we study compositional generalization in MWP solving. Specifically, we first introduce a data splitting method to create compositional splits from existing MWP datasets. Meanwhile, we synthesize data to isolate the effect of compositions. To improve compositional generalization in MWP solving, we propose an iterative data augmentation method that introduces diverse compositional variation into training data and can be combined with MWP methods. During the evaluation, we examine a set of methods and find that all of them encounter severe performance loss on the evaluated datasets. We also find that our data augmentation method can significantly improve the compositional generalization of general MWP methods.
Yunshi Lan · Lei Wang · Jing Jiang · Ee-Peng Lim


ProofNet: A Benchmark for Autoformalizing and Formally Proving Undergraduate-Level Mathematics Problems
(Poster)
We introduce ProofNet, a benchmark for autoformalization and formal proving of undergraduate-level mathematics. The ProofNet benchmark consists of 297 theorem statements expressed in both natural language and the Lean 3 theorem prover, 100 of which are also accompanied by natural language proofs. The problems are primarily drawn from popular undergraduate pure mathematics textbooks, and cover topics such as real and complex analysis, linear algebra, abstract algebra, and topology. We intend for ProofNet to be a challenging benchmark that will drive progress in autoformalization and automatic theorem proving. We report baseline results on the autoformalization of statements using few-shot learning with large language models.
Zhangir Azerbayev · Bartosz Piotrowski · Jeremy Avigad


Learning to Reason With Relational Abstractions
(Poster)
Large language models have recently shown promising progress in mathematical reasoning when fine-tuned with human-generated sequences walking through a sequence of solution steps. However, the solution sequences are not formally structured and the resulting model-generated sequences may not reflect the kind of systematic reasoning we might expect an expert human to produce. In this paper, we study how to build stronger reasoning capability in language models using the idea of relational abstractions. We introduce new types of sequences that more explicitly provide an abstract characterization of the transitions through intermediate solution steps to the goal state. We find that models that are supplied with such sequences as prompts can solve tasks with a significantly higher accuracy, and models that are trained to produce such sequences solve problems better than those that are trained with previously used human-generated sequences and other baselines. Our work thus takes several steps toward elucidating and improving how language models perform on tasks requiring multi-step mathematical reasoning.
Andrew Nam · James McClelland · Mengye Ren · Chelsea Finn


Out-of-Distribution Generalization in Algorithmic Reasoning Through Curriculum Learning
(Poster)
Out-of-distribution generalization (OODG) is a longstanding challenge for neural networks, and is quite apparent in tasks with well-defined variables and rules, where explicit use of the rules can solve problems independently of the particular values of the variables. Large transformer-based language models have pushed the boundaries on how well neural networks can generalize to novel inputs, but their complexity obscures how they achieve such robustness. As a step toward understanding how transformer-based systems generalize, we explore the question of OODG in smaller-scale transformers. Using a reasoning task based on the puzzle Sudoku, we show that OODG can occur on complex problems if the training set includes examples sampled from the whole distribution of simpler component tasks.
Andrew Nam · Mustafa Abdool · Trevor Maxfield · James McClelland


On the Abilities of Mathematical Extrapolation with Implicit Models
(Poster)
Deep neural networks excel on a variety of different tasks, often surpassing human intelligence. However, when presented with out-of-distribution data, these models tend to break down even on the simplest tasks. In this paper, we compare implicitly defined and classical deep learning models on a series of mathematical extrapolation tasks, where the models are tested with out-of-distribution samples during inference time. Throughout our experiments, implicit models greatly outperform classical deep learning networks that overfit the training distribution. We showcase implicit models' unique advantages for extrapolation thanks to their flexible and selective framework. Thanks to their potentially unlimited depth, implicit models not only adapt well to out-of-distribution inputs but also understand the underlying structure of inputs much better.
Alicia Tsai · Juliette Decugis · Ashwin Ganesh · Max Emerling · Laurent El Ghaoui


Program Synthesis for Integer Sequence Generation
(Poster)
Recent advances in program synthesis have shown success with methods that employ deep learning on synthetic data generated from domain-specific languages (DSLs). In this work, we propose an algorithm for program synthesis that extends these methods. It uses transfer learning from pre-trained language models, and employs a policy improvement operator based on policy-guided search. This hybrid approach combats the challenges of searching a large language space with sparse rewards. We show its effectiveness on the task of integer sequence generation, a special case of programming-by-examples with fixed inputs. Our preliminary results demonstrate that the inclusion of policy-guided search leads to a 1.6% increase in the number of correct programs compared to supervised baselines.
Natasha Butt · Auke Wiggers · Taco Cohen · Max Welling


LILA: A Unified Benchmark for Mathematical Reasoning
(Poster)
Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks from grocery shopping to climate modeling. Towards evaluating and improving AI systems in this domain, we propose LILA, a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions: (i) mathematical abilities, e.g., arithmetic, calculus; (ii) language format, e.g., question answering, fill-in-the-blanks; (iii) language diversity, e.g., no language, simple language; (iv) external knowledge, e.g., commonsense, physics. We construct our benchmark by extending 20 existing datasets, collecting task instructions and solutions in the form of Python programs, thereby obtaining explainable solutions in addition to the correct answer. We introduce two evaluation datasets to measure out-of-distribution performance and robustness to language perturbation. Finally, we introduce BHASKARA and its variants, a family of mathematical reasoning models fine-tuned on LILA. Importantly, we find that multitasking leads to significant improvements (average relative improvement of 21.83% F1 score vs. single-task models), while the best-performing model only obtains 60.40%, indicating the room for improvement in general mathematical reasoning and understanding.
Swaroop Mishra · Matthew Finlayson · Pan Lu · Leonard Tang · Sean Welleck · Chitta Baral · Tanmay Rajpurohit · Oyvind Tafjord · Ashish Sabharwal · Peter Clark · Ashwin Kalyan



Solving Math Word Problems with Process-based and Outcome-based Feedback
(Poster)
Recent work has shown that prompting language models to generate reasoning steps improves performance on many reasoning tasks. When moving beyond prompting, this raises the question of how we should supervise the fine-tuning of such models: outcome-based approaches which supervise the final result, or process-based approaches which supervise the reasoning process itself? Differences between these approaches might naturally be expected not just in final-answer errors but also in reasoning errors, which can be difficult to detect and are problematic in many real-world domains such as education. We run the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task, GSM8K. We find that pure outcome-based supervision produces similar final-answer error rates with less label supervision. However, for correct reasoning steps we find it necessary to use process-based supervision or supervision from learned reward models that emulate process-based feedback. In total, we improve the previous best results from 16.8% to 12.7% final-answer error and from 14.0% to 3.4% reasoning error among final-answer-correct solutions.
Jonathan Uesato · Nate Kushman · Ramana Kumar · H. Francis Song · Noah Siegel · Lisa Wang · Antonia Creswell · Geoffrey Irving · Irina Higgins
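
The two headline improvements in the abstract above are stated as absolute error rates. A short worked calculation, using only the four percentages quoted there, shows the corresponding relative reductions; the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error that was removed."""
    return (before - after) / before


# Error rates in percent, as quoted in the abstract.
final_answer = relative_reduction(16.8, 12.7)  # final-answer error
reasoning = relative_reduction(14.0, 3.4)      # reasoning error among final-answer-correct solutions

print(f"final-answer error reduced by {final_answer:.1%}")
print(f"reasoning error reduced by {reasoning:.1%}")
```

Read this way, roughly a quarter of final-answer errors and about three quarters of reasoning errors are removed relative to the previous best results.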