Workshop
Machine Learning with New Compute Paradigms
Jannes Gladrow · Benjamin Scellier · Eric Xing · Babak Rahmani · Francesca Parmigiani · Paul Prucnal · Cheng Zhang
Room 235 - 236
As GPU computing approaches a plateau in efficiency and cost as Moore's law reaches its limit, there is a growing need to explore alternative computing paradigms, such as (opto-)analog, neuromorphic, and low-power computing. This NeurIPS workshop aims to unite researchers from machine learning and alternative computation fields to establish a new hardware-ML feedback loop. By co-designing models with specialized accelerators, we can leverage the benefits of increased throughput or lower per-flop power consumption. Novel devices hold the potential to further accelerate standard deep learning or even enable efficient inference and training of hitherto compute-constrained model classes. However, new compute paradigms typically present challenges such as intrinsic noise, restricted sets of compute operations, or limited bit-depth, and thus require model-hardware co-design. This workshop's goal is to foster cross-disciplinary collaboration to capitalize on the opportunities offered by emerging AI accelerators.
Schedule
Sat 7:00 a.m. - 7:10 a.m.

Opening Remarks (Talk)

Cheng Zhang
Sat 7:10 a.m. - 7:30 a.m.

Computing with physical systems: re-imagining special-purpose computers from the bottom up (Invited Talk)

In this talk I will discuss how, by eliminating many of the layers of abstraction used in conventional computers and working as close to the underlying physics as possible, we may be able to create special-purpose processors that are orders of magnitude faster or more energy-efficient than the present state of the art.

Peter McMahon
Sat 7:30 a.m. - 7:40 a.m.

SpiNNaker2: A Large-Scale Neuromorphic System for Event-Based and Asynchronous Machine Learning (Oral)

The joint progress of artificial neural networks and domain-specific hardware accelerators such as GPUs and TPUs has taken over many domains of machine learning research. This development is accompanied by a rapid growth in the computational demands of larger models and more data. Concurrently, emerging properties of foundation models such as in-context learning drive new opportunities for machine learning applications. However, the computational cost of such applications is a limiting factor of the technology in data centers, and more importantly in mobile devices and edge systems. To mediate the energy footprint and non-trivial latency of contemporary systems, neuromorphic computing systems deeply integrate computational principles of neurobiological systems by leveraging low-power analog and digital technologies. SpiNNaker2 is a digital neuromorphic chip developed for scalable machine learning. The event-based and asynchronous design of SpiNNaker2 allows the composition of large-scale systems from thousands of chips. In this work, we present the design and operating principles of SpiNNaker2 systems. Furthermore, we outline a number of machine learning applications that we developed on either the full chip or earlier prototypes. The already available applications range from accelerating artificial neural networks over bio-inspired spiking neural networks to generalized event-based neural networks. With the successful development and deployment of SpiNNaker2, we aim to facilitate the advancement of event-based and asynchronous algorithms for future generations of machine learning systems.

Hector Gonzalez · Jiaxin Huang · Florian Kelber · Khaleelulla Khan Nazeer · Tim Hauke Langer · Chen Liu · Matthias Lohrmann · Amirhossein Rostami · Mark Schoene · Bernhard Vogginger · Timo Wunderlich · Yexin Yan · Mahmoud Akl · Christian Mayr

Sat 7:40 a.m. - 7:50 a.m.

Scaling up Memristor Monte Carlo with magnetic domain-wall physics (Oral)

By exploiting the intrinsic random nature of nanoscale devices, Memristor Monte Carlo (MMC) is a promising enabler of edge learning systems. However, due to multiple algorithmic and device-level limitations, existing demonstrations have been restricted to very small neural network models and datasets. We discuss these limitations, and describe how they can be overcome by mapping the stochastic gradient Langevin dynamics (SGLD) algorithm onto the physics of magnetic domain-wall memristors, scaling up MMC models by five orders of magnitude. We propose the push-pull pulse programming method, which realises SGLD in physics, and use it to train a domain-wall-based ResNet-18 on the CIFAR-10 dataset. On this task, we observe no performance degradation relative to a floating-point model down to an update precision of between 6 and 7 bits, indicating that we have made a step towards a large-scale edge learning system leveraging noisy analogue devices.

Thomas Dalgaty · Shogo Yamada · Anca Molnos · Eiji Kawasaki · Thomas Mesquida · Rummens François · Tatsuo Shibata · Yukihiro Urakawa · Yukio Terasaki · Tomoyuki Sasaki · Marc Duranton
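For readers less familiar with SGLD, which this poster maps onto domain-wall device physics: each update is an ordinary gradient step plus injected Gaussian noise whose scale is tied to the step size, so the iterates sample the posterior rather than converge to a point. A minimal digital sketch on a 1-D toy target (our illustrative setup, not the authors' device mapping):

```python
import numpy as np

def sgld_step(theta, grad, lr, rng):
    """One SGLD update: gradient descent step plus sqrt(2*lr) Gaussian noise."""
    return theta - lr * grad + np.sqrt(2.0 * lr) * rng.normal()

# Toy target: p(theta) ~ exp(-(theta - 3)^2 / 2), so the gradient of
# -log p(theta) is simply (theta - 3).
rng = np.random.default_rng(0)
theta, samples = 0.0, []
for _ in range(20000):
    theta = sgld_step(theta, theta - 3.0, lr=0.01, rng=rng)
    samples.append(theta)
samples = np.array(samples[5000:])  # discard burn-in
print(samples.mean(), samples.var())  # close to 3 and 1
```

In MMC the Gaussian noise term is supplied for free by the stochastic switching of the memristive device instead of a pseudo-random number generator.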

Sat 7:50 a.m. - 8:10 a.m.

Analog AI Accelerators (Invited Talk)

Deep learning has irreversibly changed and drastically enhanced how we process information. The rapidly increasing computation time and energy costs required to train ever larger AI models make it evident that the future of artificial intelligence depends on realizing fast and energy-efficient processors. With the slowdown in transistor scaling and the diminishing returns expected from future CMOS, the concept of analog computing has been put forward as an alternative. Analog neural networks process locally stored information in a fully parallel manner in the analog domain, using physical device properties instead of conventional Boolean arithmetic. This presentation will give an overview of analog neural networks and the underlying device technologies used to implement them.

Jesus del Alamo
Sat 8:10 a.m. - 8:25 a.m.

Break
Sat 8:25 a.m. - 8:35 a.m.

Analog-Optical Computation for optimization and machine-learning inference (Talk)

Solving optimization problems is challenging for existing digital computers and even for future quantum hardware. The practical importance of diverse problems, from healthcare to financial optimization, has driven the emergence of specialised hardware over the past decade. However, their support for problems with only binary variables severely restricts the scope of practical problems that can be efficiently embedded. We build the analog iterative machine (AIM), the first instance of an optoelectronic solver that natively implements a wider class of quadratic unconstrained mixed optimization (QUMO) problems and supports all-to-all connectivity of both continuous and binary variables. Beyond synthetic 7-bit problems at small scale, AIM solves the financial transaction settlement problem entirely in the analog domain, with higher accuracy than quantum hardware and at room temperature. With compute-in-memory operation and spatial-division-multiplexed representation of variables, AIM's design paves the path to a chip-scale architecture with a 100-times speedup per unit power over the latest GPUs for solving problems with 10,000 variables. The robustness of the AIM algorithm at such scale is further demonstrated by comparing it with commercial production solvers across multiple benchmarks, where for several problems we report new best solutions. By combining the superior QUMO abstraction, sophisticated gradient descent methods inspired by machine learning, and commodity hardware, AIM introduces a novel platform with a step change in expressiveness, performance, and scalability for optimization in the post-Moore's-law era.

Jannes Gladrow
Sat 8:35 a.m. - 8:45 a.m.

Bayesian Metaplasticity from Synaptic Uncertainty (Oral)

Catastrophic forgetting remains a challenge for neural networks, especially in lifelong learning scenarios. In this study, we introduce MEtaplasticity from Synaptic Uncertainty (MESU), inspired by metaplasticity and Bayesian inference principles. MESU harnesses synaptic uncertainty to retain information over time, with its update rule closely approximating the diagonal Newton's method for synaptic updates. Through continual learning experiments on permuted MNIST tasks, we demonstrate MESU's remarkable capability to maintain learning performance across 100 tasks without the need for explicit task boundaries.

Djohan Bonnet · Tifenn Hirtzlin · Tarcisius Januel · Thomas Dalgaty · Damien Querlioz · Elisa Vianello
Sat 8:45 a.m. - 9:05 a.m.

Training physical systems with Equilibrium Propagation (Invited Talk)

The algorithm of Equilibrium Propagation (EP) [1] is highly interesting for training physical systems, as it extracts backprop-equivalent gradients directly from their convergence to a steady state [2,3]. In my talk, I will show that it is an excellent starting point for building and training physical systems to perform classification tasks. I will first describe how we have used EP to train the hardware D-Wave Ising machine in a supervised way to recognize handwritten digits [4]. I will then show that EP can unlock self-learning in spiking neural networks [5]. Finally, I will explain how we can extend EP to unsupervised learning.

Julie Grollier
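As a concrete toy of the two-phase idea behind EP (a free relaxation to a steady state, a weakly nudged relaxation toward the target, and a local contrastive weight update), here is a 1-D sketch using closed-form energy minima; the scalar setup and all names are ours, not the speaker's:

```python
import numpy as np

def ep_update(w, x, y, beta, lr):
    """Equilibrium Propagation on E(s) = (s - w*x)^2 / 2 with cost C(s) = (s - y)^2 / 2.

    Free phase:   s0 minimizes E           -> s0 = w*x
    Nudged phase: sb minimizes E + beta*C  -> sb = (w*x + beta*y) / (1 + beta)
    Update:       dw = -(lr/beta) * (dE/dw(sb) - dE/dw(s0)), with dE/dw = -(s - w*x)*x
    """
    s0 = w * x
    sb = (w * x + beta * y) / (1 + beta)
    dEdw_free = -(s0 - w * x) * x      # = 0 at the free fixed point
    dEdw_nudged = -(sb - w * x) * x
    return w - lr * (dEdw_nudged - dEdw_free) / beta

# Fit y = 2*x from (x, y) pairs; the EP update approximates gradient descent.
rng = np.random.default_rng(1)
w = 0.0
for _ in range(500):
    x = rng.uniform(0.5, 1.5)
    w = ep_update(w, x, 2.0 * x, beta=0.1, lr=0.2)
print(round(w, 2))  # converges near 2.0
```

The point of the construction is that the contrastive update uses only locally measurable quantities at the two equilibria, which is what makes it attractive for physical hardware.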
Sat 9:05 a.m. - 9:25 a.m.

Low-precision Sampling for Probabilistic Deep Learning (Invited Talk)

Sampling from a probability distribution is a ubiquitous challenge in machine learning, ranging from generative AI to approximate Bayesian inference. This talk will show how to leverage low-precision compute to accelerate Markov chain Monte Carlo (MCMC) sampling with theoretical guarantees on the convergence. First, I will introduce a general and theoretically grounded framework to enable low-precision sampling, with applications to Stochastic Gradient Langevin Dynamics and Stochastic Gradient Hamiltonian Monte Carlo. Then I will present an approach for binary sampling operating at 1-bit precision. Finally, I will show the experimental results of low-precision sampling on various deep learning tasks.

Ruqi Zhang
Sat 9:25 a.m. - 10:30 a.m.

Inference analysis of optical transformers (Poster)

This paper explores the utilization of optical computing for accelerating inference in transformer models, which have demonstrated substantial success in various applications. Optical computing offers ultrafast computation and ultrahigh energy efficiency compared to conventional electronics. Our findings suggest that optical implementation has the potential to achieve a significant 10-100 times improvement in the inference throughput of compute-limited transformer models.

Xianxin Guo · Chenchen Wang · Djamshid Damry
Sat 9:25 a.m. - 10:30 a.m.

Diffractive Optical Neural Networks with Arbitrary Spatial Coherence (Poster)

Diffractive optical neural networks (DONNs) have emerged as a promising optical hardware platform for ultrafast and energy-efficient signal processing. However, previous experimental demonstrations of DONNs have only been performed using coherent light, which is not present in the natural world. Here, we study the role of spatial optical coherence in DONN operation. We propose a numerical approach to efficiently simulate DONNs under input illumination with arbitrary spatial coherence and discuss the corresponding computational complexity using coherent, partially coherent, and incoherent light. We also investigate the expressive power of DONNs and examine how coherence affects their performance. We show that under fully incoherent illumination, the DONN performance cannot surpass that of a linear model. As a demonstration, we train and evaluate simulated DONNs on the MNIST dataset using light with varying spatial coherence.

Matthew Filipovich · Aleksei Malyshev · Alexander Lvovsky
Sat 9:25 a.m. - 10:30 a.m.

The Data Movement Bottleneck in Analog Computing Accelerators: An Analog Optical Fourier Transform and Convolution Accelerator Case Study (Poster)

Most modern computing tasks are constrained to having digital electronic input and output data. Due to these constraints imposed by the user, any analog computing accelerator must perform an analog-to-digital conversion on its input data and a subsequent digital-to-analog conversion on its output data. This places performance limits on analog computing accelerator hardware. To avoid this, the analog hardware must replace the full functionality of traditional digital electronic computer hardware. This is not currently possible for optical computing accelerators due to limitations in gain, input-output isolation, and information storage in current optical hardware. We conducted a case study on an analog optical Fourier transform and convolution accelerator. Using 27 empirically measured benchmarks, we estimate that an ideal optical accelerator that accelerates Fourier transforms and convolutions can produce an average speedup of $9.4 \times$ and a median speedup of $1.9 \times$ for the set of benchmarks. The maximum speedups achieved were $45.3 \times$ for a pure Fourier transform and $159.4 \times$ for a pure convolution. An optical Fourier transform and convolution accelerator only produces significant speedup for applications consisting exclusively of Fourier transforms and convolutions.

James Meech · Vasileios Tsoutsouras · Phillip Stanley-Marbell
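The ceiling on overall speedup that this case study describes is essentially Amdahl's law: only the fraction of runtime spent in Fourier transforms and convolutions can be accelerated, no matter how fast the optics. A quick sketch with made-up fractions (ours, not the paper's measured benchmarks):

```python
def amdahl_speedup(accel_fraction: float, local_speedup: float) -> float:
    """Overall speedup when only `accel_fraction` of runtime is sped up
    by a factor of `local_speedup`; the rest runs at the original speed."""
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / local_speedup)

# Even an effectively infinite optical FFT barely helps if FFTs are 50% of runtime:
print(amdahl_speedup(0.5, 1e9))   # ~2.0
print(amdahl_speedup(0.99, 100))  # ~50.25
```

This is why the paper reports large speedups only for workloads that are almost entirely transforms and convolutions.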
Sat 9:25 a.m. - 10:30 a.m.

Hierarchy of the echo state property in quantum reservoir computing (Poster)

The echo state property (ESP) represents a fundamental concept in the reservoir computing framework that ensures stable output-only training of reservoir networks. However, the conventional definition of ESP does not aptly describe possibly non-stationary systems, where statistical properties evolve. To address this issue, we introduce two new categories of ESP: $\textit{non-stationary ESP}$, designed for possibly non-stationary systems, and $\textit{subspace/subset ESP}$, designed for systems whose subsystems have ESP. Following the definitions, we numerically demonstrate the correspondence between non-stationary ESP in the quantum reservoir computer (QRC) framework with typical Hamiltonian dynamics and input encoding methods using nonlinear autoregressive moving-average (NARMA) tasks. These newly defined properties present a new understanding toward the practical design of QRC and other possibly non-stationary RC systems.

Shumpei Kobayashi · Hoan Tran Quoc · Kohei Nakajima
Sat 9:25 a.m. - 10:30 a.m.

Contrastive power-efficient physical learning in resistor networks (Poster)

The prospect of substantial reductions in the power consumption of AI is a major motivation for the development of neuromorphic hardware. Less attention has been given to the complementary research of power-efficient learning rules for such systems. Here we study self-learning physical systems trained by local learning rules based on contrastive learning. We show how the physical learning rule can be biased toward finding power-efficient solutions to learning problems, and demonstrate in simulations and laboratory experiments the emergence of a trade-off between power efficiency and task performance.

Menachem Stern · Sam Dillavou · Dinesh Jayaraman · Douglas Durian · Andrea Liu
Sat 9:25 a.m. - 10:30 a.m.

Energy-Based Learning Algorithms for Analog Computing: A Comparative Study (Poster)

This work compares seven energy-based learning algorithms, namely contrastive learning (CL), equilibrium propagation (EP), coupled learning (CpL), and different variants of these algorithms depending on the type of perturbation used. The algorithms are compared on deep convolutional Hopfield networks (DCHNs) and evaluated on five vision tasks (MNIST, Fashion-MNIST, SVHN, CIFAR-10, and CIFAR-100). The results reveal that while all algorithms perform similarly on the simplest task (MNIST), differences in performance become evident as task complexity increases. Perhaps surprisingly, we find that negative perturbations yield significantly better results than positive ones, and the centered variant of EP emerges as the top-performing algorithm. Lastly, we report new state-of-the-art DCHN simulations on all five datasets (both in terms of speed and accuracy), achieving a 13.5x speedup compared to Laborieux et al. (2021).

Benjamin Scellier · Maxence Ernoult · Jack Kendall · Suhas Kumar
Sat 9:25 a.m. - 10:30 a.m.

Expanding Spiking Neural Networks With Dendrites for Deep Learning (Poster)

As deep learning networks increase in size and performance, so do associated computational costs, approaching prohibitive levels. Dendrites offer powerful nonlinear "on-the-wire" computational capabilities, increasing the expressivity of the point neuron while preserving many of the advantages of SNNs. We seek to demonstrate the potential of dendritic computations by combining them with the low-power event-driven computation of Spiking Neural Networks (SNNs) for deep learning applications. To this end, we have developed a library that adds dendritic computation to SNNs within the PyTorch framework, enabling complex deep learning networks that still retain the low-power advantages of SNNs. Our library leverages a dendrite CMOS hardware model to inform the software model, which enables nonlinear computation integrated with snnTorch at scale. Finally, we discuss potential deep learning applications in the context of current state-of-the-art deep learning methods and energy-efficient neuromorphic hardware.

Mark Plagge · Suma Cardwell · Frances Chance
Sat 9:25 a.m. - 10:30 a.m.

Activity Sparsity Complements Weight Sparsity for Efficient RNN Inference (Poster)

Artificial neural networks open up unprecedented machine learning capabilities at the cost of seemingly ever-growing computational requirements. Concurrently, the field of neuromorphic computing develops biologically inspired spiking neural networks and hardware platforms with the goal of bridging the efficiency gap between biological brains and deep learning systems. Yet, spiking neural networks often fall behind deep learning systems on many machine learning tasks. In this work, we demonstrate that the reduction factor of sparsely activated recurrent neural networks multiplies with the reduction factor of sparse weights. Our model achieves up to a $20\times$ reduction of operations while maintaining perplexities below $60$ on the Penn Treebank language modeling task. This reduction factor has not been achieved with solely sparsely connected LSTMs, and the language modeling performance of our model has not been achieved with sparsely activated spiking neural networks. Our results suggest further driving the convergence of methods from deep learning and neuromorphic computing for efficient machine learning.

Rishav Mukherji · Mark Schoene · Khaleelulla Khan Nazeer · Christian Mayr · Anand Subramoney
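The headline claim, that the reduction factors from activity sparsity and weight sparsity multiply, follows from a simple operation count: a recurrent step only touches the weight columns of active units, and only the non-zero weights within them. A back-of-the-envelope sketch with hypothetical densities (not the paper's measured numbers):

```python
def rnn_ops(n_active_inputs: int, weight_density: float, hidden_size: int) -> float:
    """Multiply-accumulate count for one recurrent step: active input units
    times the fraction of non-zero weights times the output dimension."""
    return n_active_inputs * weight_density * hidden_size

hidden = 1000
dense = rnn_ops(hidden, 1.0, hidden)         # fully dense step
sparse = rnn_ops(hidden // 5, 0.25, hidden)  # 20% activity, 25% weight density
print(dense / sparse)  # 20.0 -- the two reduction factors multiply (5 * 4)
```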
Sat 9:25 a.m. - 10:30 a.m.

Algebraic Design of Physical Computing System for Time-Series Generation (Poster)

Recently, computational techniques that employ physical systems (physical computing systems) have been developed. To utilize physical computing systems, their design strategy is important. Although there are practical learning-based methods and theoretical approaches, no general method exists that provides specific design guidelines for given systems with rigorous theoretical support. In this paper, we propose a novel algebraic design framework for a physical computing system for time-series generation, which is capable of extracting specific design guidelines. Our approach describes input-output relationships algebraically and relates them to this task. We present two theorems and the results of experiments. The first theorem offers a basic strategy for algebraic design. The second theorem explores the "replaceability" of such systems.

Mizuka Komatsu · Takaharu Yaguchi · Kohei Nakajima
Sat 9:25 a.m. - 10:30 a.m.

PhyFF: Physical forward-forward algorithm for in-hardware training and inference (Poster)

Training of digital deep learning models primarily relies on backpropagation, which poses challenges for physical implementation due to its dependency on precise knowledge of the computations performed in the forward pass of the neural network. To address this issue, we propose a physical forward-forward training algorithm (phyFF) that is inspired by the original forward-forward algorithm. This novel approach facilitates direct training of deep physical neural networks comprising layers of diverse physical nonlinear systems, without the need for complete knowledge of the underlying physics. We demonstrate the superiority of this method over current hardware-aware training techniques. The proposed method achieves faster training speeds, reduces digital computational requirements, and lowers the training power consumption of physical systems.

Ali Momeni · Babak Rahmani · Matthieu Malléjac · Philipp del Hougne · Romain Fleury
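For readers unfamiliar with the original forward-forward idea that phyFF builds on: each layer is trained locally so that a "goodness" score (for example, the sum of squared activations) is high for positive data and low for negative data, with no backward pass through the stack. A minimal single-layer sketch in plain NumPy (our toy setup and data, not the authors' phyFF implementation):

```python
import numpy as np

def goodness(h):
    """Layer-local goodness score: sum of squared activations per sample."""
    return np.sum(h ** 2, axis=-1)

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, size=(4, 8))
theta, lr = 2.0, 0.05                        # goodness threshold and step size

pos = rng.normal(1.0, 0.5, size=(256, 4))    # "positive" data
neg = rng.normal(-1.0, 0.5, size=(256, 4))   # "negative" data

for _ in range(200):
    for x, is_pos in ((pos, True), (neg, False)):
        h = np.maximum(x @ W, 0.0)                     # ReLU layer
        p = 1.0 / (1.0 + np.exp(theta - goodness(h)))  # sigmoid(goodness - theta)
        dLdg = (p - 1.0) if is_pos else p              # d(-log-likelihood)/d goodness
        # Local update: gradient flows only through this layer's own activations.
        W -= lr * x.T @ (dLdg[:, None] * 2.0 * h) / len(x)

# After training, positive data should score higher goodness than negative data.
g_pos = goodness(np.maximum(pos @ W, 0.0)).mean()
g_neg = goodness(np.maximum(neg @ W, 0.0)).mean()
print(g_pos > g_neg)  # True
```

Because the update for each layer depends only on that layer's inputs and outputs, a physical layer can be trained even when its internal transfer function is not precisely known, which is the property phyFF exploits.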
Sat 9:25 a.m. - 10:30 a.m.

A Green Granular Convolutional Neural Network with Software-FPGA Co-designed Learning (Poster)

Different from traditional tedious CPU/GPU-based training algorithms using gradient descent methods, the software-FPGA co-designed learning algorithm is created to quickly solve a system of linear equations to directly calculate optimal values of the hyperparameters of the green granular neural network (GGNN). To reduce both $CO_2$ emissions and energy consumption effectively, a novel green granular convolutional neural network (GGCNN) is developed by using a new classifier that uses GGNNs as building blocks with the new fast software-FPGA co-designed learning. Initial simulation results indicate that the FPGA equation solver code ran faster than the Python equation solver code. Therefore, implementing the GGCNN with software-FPGA co-designed learning is feasible. In the future, the GGCNN will be evaluated by comparing it with a convolutional neural network (CNN) using traditional software-CPU/GPU-based learning in terms of speed, model size, accuracy, $CO_2$ emissions, and energy consumption on popular datasets. New algorithms will be created to divide the inputs into different input groups that will be used to build different small-size GGNNs to overcome the curse of dimensionality.

Yanqing Zhang · Huaiyuan Chu
Sat 9:25 a.m. - 10:30 a.m.

Real-Time fJ/MAC PDE Solvers via Tensorized, Back-Propagation-Free Optical PINN Training (Poster)

Solving partial differential equations (PDEs) numerically often requires huge computing time, energy cost, and hardware resources in practical applications. This has limited their applications in many scenarios (e.g., autonomous systems, supersonic flows) that have a limited energy budget and require near real-time response. Leveraging optical/photonic computing, this paper develops an on-chip training framework for physics-informed neural networks (PINNs), aiming to solve high-dimensional PDEs with fJ/MAC power consumption and ultra-low latency. Despite the ultra-high speed of optical neural networks, training a PINN on an optical chip is hard due to (1) the large size of photonic devices, and (2) the lack of scalable optical memory devices to store the intermediate results of back-propagation (BP). To enable realistic optical PINN training, this paper presents a BP-free method to avoid the BP process. We also employ a tensor-compressed approach to improve the convergence and scalability of our optical PINN training. This training framework is designed with tensorized optical neural networks (TONN) for scalable inference acceleration and MZI phase-domain tuning for \textit{in-situ} optimization. Our simulation results on a 20-dimensional HJB PDE show that our photonic accelerator can reduce the number of MZIs by a factor of $1.17 \times 10^3$, with only 1.36 J and 1.15 s to solve this equation. This is the first real-size optical PINN training framework that can be applied to solve high-dimensional PDEs.

Yequan Zhao · Xian Xiao · Xinling Yu · Ziyue Liu · Zhixiong Chen · Geza Kurczveil · Raymond Beausoleil · Zheng Zhang
Sat 9:25 a.m. - 10:30 a.m.

Enhancing Low-Precision Sampling via Stochastic Gradient Hamiltonian Monte Carlo (Poster)

Low-precision training has emerged as a promising low-cost technique to enhance the training efficiency of deep neural networks without sacrificing much accuracy. Its Bayesian counterpart can further provide uncertainty quantification and improved generalization accuracy. This paper investigates low-precision samplers via Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) with low-precision and full-precision gradient accumulators for both strongly log-concave and non-log-concave distributions. Theoretically, our results show that, to achieve $\epsilon$-error in the 2-Wasserstein distance for non-log-concave distributions, low-precision SGHMC achieves a quadratic improvement ($\tilde{\mathcal{O}}\left(\epsilon^{-2}{\mu^*}^{-2}\log^2\left(\epsilon^{-1}\right)\right)$) compared to the state-of-the-art low-precision sampler, Stochastic Gradient Langevin Dynamics (SGLD) ($\tilde{\mathcal{O}}\left(\epsilon^{-4}{\lambda^*}^{-1}\log^5\left(\epsilon^{-1}\right)\right)$). Moreover, we prove that low-precision SGHMC is more robust to quantization error than low-precision SGLD, due to the robustness of the momentum-based update w.r.t. gradient noise. Empirically, we conduct experiments on synthetic data and the MNIST, CIFAR-10, and CIFAR-100 datasets, which successfully validate our theoretical findings. Our study highlights the potential of low-precision SGHMC as an efficient and accurate sampling method for large-scale and resource-limited deep learning.

Ziyi Wang · Yujie Chen · Ruqi Zhang · Qifan Song
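To make "low-precision sampler" concrete: the parameters are stored on a coarse fixed-point grid, while the momentum-based SGHMC update tolerates the resulting rounding noise. A minimal sketch on a 1-D Gaussian target (our toy setup, illustrating the idea rather than the paper's algorithm verbatim):

```python
import numpy as np

def quantize(x, step=1 / 64):
    """Round to a fixed-point grid, emulating low-precision parameter storage."""
    return np.round(x / step) * step

def sghmc_lowprec(n_steps, lr=0.01, alpha=0.1, seed=0):
    """SGHMC targeting N(0, 1): U(theta) = theta^2/2, so grad U = theta.
    theta is kept quantized; the momentum v stays at full precision."""
    rng = np.random.default_rng(seed)
    theta, v = 0.0, 0.0
    out = []
    for _ in range(n_steps):
        grad = theta                                     # exact gradient of U
        noise = rng.normal(0.0, np.sqrt(2 * alpha * lr)) # friction-matched noise
        v = (1 - alpha) * v - lr * grad + noise
        theta = quantize(theta + v)                      # low-precision update
        out.append(theta)
    return np.array(out[n_steps // 4:])                  # drop burn-in

samples = sghmc_lowprec(40000)
print(samples.mean(), samples.var())  # roughly 0 and 1
```

The momentum averages the gradient (and rounding) noise over many steps, which is the intuition behind the robustness result stated in the abstract.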
Sat 9:25 a.m. - 10:30 a.m.

Squeezed Edge YOLO: Onboard Object Detection on Edge Devices (Poster)

Demand for efficient onboard object detection is increasing due to its key role in autonomous navigation. However, deploying object detection models such as YOLO on resource-constrained edge devices is challenging due to the high computational requirements of such models. In this paper, a Squeezed Edge YOLO is proposed which is compressed and optimized to kilobytes of parameters in order to fit onboard such edge devices. To evaluate the proposed Squeezed Edge YOLO, two use cases - human and shape detection - are used to show the model accuracy and performance. Moreover, the proposed model is deployed onboard a GAP8 processor with 8 RISC-V cores and an NVIDIA Jetson Nano with 4 GB of memory. Experimental results show the proposed Squeezed Edge YOLO model size is reduced by a factor of 8x, which leads to a 76\% improvement in energy efficiency and 3.3x faster throughput.

Edward Humes · Mozhgan Navardi · Tinoosh Mohsenin
Sat 9:25 a.m. - 10:30 a.m.

Virtual reservoir acceleration for CPU and GPU: Case study for coupled spin-torque oscillator reservoir (Poster)

We provide high-speed implementations for simulating reservoirs described by $N$ coupled spin-torque oscillators, where $N$ also corresponds to the number of reservoir nodes. We benchmark a variety of implementations based on CPU and GPU. Our new methods are at least 2.6 times quicker than the baseline for $N$ in the range $1$ to $10^4$. More specifically, over all implementations the best factor is 78.9 for $N=1$, which decreases to 2.6 for $N=10^3$ and finally increases to 23.8 for $N=10^4$. The GPU outperforms the CPU significantly at $N=2500$. Our results show that GPU implementations should be tested for reservoir simulations. The implementations considered here can be used for any reservoir whose evolution can be approximated using an explicit method.

Thomas de Jong · Nozomi Akashi · Tomohiro Taniguchi · Hirofumi Notsu · Kohei Nakajima
Sat 9:25 a.m. - 10:30 a.m.

Adjoint Method: The Connection between Analog-based Equilibrium Propagation Architectures and Neural ODEs (Poster)

Analog neural networks (ANNs) hold significant potential for substantial reductions in power consumption in modern neural networks, particularly when employing the increasingly popular Energy-Based Models (EBMs) in tandem with the local Equilibrium Propagation (EP) training algorithm. This paper analyzes the relationship between this family of ANNs and the concept of Neural Ordinary Differential Equations (Neural ODEs). Using the adjoint method, we formally demonstrate that ANN-EP can be derived from Neural ODEs by constraining the differential equations to those with a steady-state response. This finding opens avenues for the ANN-EP community to extend ANNs to non-steady-state scenarios. Additionally, it provides an efficient setting for Neural ODEs that significantly reduces the training cost.

Mohamed Watfa · Alberto Garcia-Ortiz
Sat 9:25 a.m. - 10:30 a.m.

Beyond Digital: Harnessing Analog Hardware for Machine Learning (Poster)

A remarkable surge in utilizing large deep-learning models yields state-of-the-art results in a variety of tasks. Recent model sizes often exceed billions of parameters, underscoring the importance of fast and energy-efficient processing. The significant costs associated with training and inference primarily stem from the constrained memory bandwidth of current hardware and the computationally intensive nature of these models. Historically, the design of machine learning models has predominantly been guided by the operational parameters of classical digital devices. In contrast, analog computations have the potential to offer vastly improved power efficiency for both inference and training tasks. This work details several machine-learning methodologies that could leverage existing analog hardware infrastructures. To foster the development of analog hardware-aware machine learning techniques, we explore both optical and electronic hardware configurations suitable for executing the fundamental mathematical operations inherent to these models. Integrating analog hardware with innovative machine learning approaches may pave the way for cost-effective AI systems at scale.

Marvin Syed · Kirill Kalinin · Natalia Berloff
Sat 9:25 a.m. - 10:30 a.m.

Biologically-plausible hierarchical chunking on mixed-signal neuromorphic hardware (Poster)

Chunking is a computational principle essential for memory compression, structural decomposition, and predictive processing. Humans seamlessly group perceptual sequences in units of chunks, parsed and memorized as separate entities. On an algorithmic level, computational models such as the Hierarchical Chunking Model (HCM) propose grouping proximal observational units as chunks, which resembles human chunk learning. Here we propose a biologically plausible and highly efficient implementation of the HCM: the neuromorphic HCM (nHCM). When parsing through perceptual sequences, the nHCM uses sparsely connected spiking neurons to construct hierarchical chunk representations in an event-driven way. Even when simulated on a standard computer, the nHCM showed remarkable improvement in speed, power consumption, and memory usage compared to its original counterpart. Then, we validate the model on mixed-signal neuromorphic hardware using recurrent spiking neural networks (SNN) with biologically plausible dynamics. We verified the robust computing properties of this implementation, overcoming the heterogeneity, variability, and low precision of the bio-plausible electronic analog circuits. With a successful implementation on both computers and neuromorphic processors, we show that the algorithm, and in general the neuromorphic co-design paradigm, is inherently efficient and robust. This work demonstrates cognitively plausible sequence learning in energy-efficient dedicated neural computing electronic processing systems.

Atilla Schreiber · Shuchen Wu · Chenxi Wu · Giacomo Indiveri · Eric Schulz
Sat 9:25 a.m.  10:30 a.m.

Frequency propagation: Multimechanism learning in nonlinear physical networks
(
Poster
)
link
We introduce frequency propagation, a learning algorithm for nonlinear physical networks. In a resistive electrical circuit with variable resistors, an activation current is applied at a set of input nodes at one frequency, and an error current is applied at a set of output nodes at another frequency. The voltage response of the circuit to these boundary currents is the superposition of an 'activation signal' and an 'error signal' whose coefficients can be read in different frequencies of the frequency domain. Each conductance is updated proportionally to the product of the two coefficients. The learning rule is local and proved to perform gradient descent on a loss function.We argue that frequency propagation is an instance of a multimechanism learning strategy for physical networks, be it resistive, elastic, or flow networks. Multimechanism learning strategies incorporate at least two physical quantities, potentially governed by independent physical mechanisms, to act as activation and error signals in the training process. Locally available information about these two signals is then used to update the trainable parameters to perform gradient descent. We demonstrate how earlier work implementing learning via chemical signaling in flow networks [1] also falls under the rubric of multimechanism learning.[1]  V. Anisetti, B. Scellier, and J. M. Schwarz, “Learning by noninterfering feedback chemical signaling235in physical networks,” arXiv preprint arXiv:2203.12098, 2022. 
Vidyesh Anisetti · Ananth Kandala · Benjamin Scellier · J. M. Schwarz 🔗 
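The read-out-and-update rule described in the abstract can be sketched in a few lines. Everything below (the function names, the choice of frequencies, and the single-edge scalar model) is an illustrative assumption, not the authors' implementation:

```python
import numpy as np

# Illustrative sketch of the frequency-propagation update rule (not the
# authors' implementation): the edge voltage is a superposition of an
# activation signal and an error signal at two distinct frequencies, and
# the local conductance update is the product of the two read-out
# coefficients.

def fourier_coefficient(signal, freq, t):
    """Project a sampled signal onto cos(2*pi*freq*t)."""
    dt = t[1] - t[0]
    return 2.0 * np.sum(signal * np.cos(2 * np.pi * freq * t)) * dt / (t[-1] - t[0])

t = np.linspace(0.0, 1.0, 10_000)
f_act, f_err = 3.0, 7.0          # distinct activation / error frequencies
a, e = 0.8, -0.2                 # true coefficients (unknown to the circuit)
v_edge = a * np.cos(2 * np.pi * f_act * t) + e * np.cos(2 * np.pi * f_err * t)

# Local read-out at each frequency, then the gradient-descent-like update
lr = 0.1
a_hat = fourier_coefficient(v_edge, f_act, t)
e_hat = fourier_coefficient(v_edge, f_err, t)
delta_g = -lr * a_hat * e_hat    # conductance change for this edge
```

Because the two signals live at different frequencies, each coefficient can be recovered locally from the same superposed voltage, which is what makes the rule local.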
Sat 9:25 a.m. – 10:30 a.m.

The Benefits of Self-Supervised Learning for Training Physical Neural Networks
(
Poster
)
link
Physical Neural Networks (PNNs) are energy-efficient alternatives to their digital counterparts. Because they are inherently variable, noisy, and hard to differentiate, PNNs require tailored training methods. Additionally, while the properties of PNNs make them good candidates for edge computing, where memory and computational resources are constrained, most algorithms developed for training PNNs focus on supervised learning, even though labeled data may not be accessible on the edge. Here, we propose Self-Supervised Learning (SSL) as an ideal framework for training PNNs (we focus on computer vision tasks): 1. SSL eliminates the reliance on labeled data, and 2. because SSL forces the network to extract high-level concepts, networks trained with SSL should be highly robust to noise and device variability. We show with simulations that the latter property effectively emerges when a network is trained on MNIST in the SSL setting but not when it is trained in a supervised fashion. We also show empirically that we can optimize layer-wise SSL objectives rather than a single global one while still matching the performance of global optimization on MNIST and CIFAR-10. This could allow local learning without any backpropagation, especially in the scheme we propose with stochastic optimization. We expect this preliminary, simulation-based work to pave the way for a robust paradigm for training PNNs and hope to stimulate interest in the unconventional computing community and beyond. 
Jeremie Laydevant · Peter McMahon · Davide Venturelli · Paul Lott 🔗 
Sat 9:25 a.m. – 10:30 a.m.

Towards low power cognitive load analysis using EEG signal: A neuromorphic computing approach
(
Poster
)
link
Real-time on-device cognitive load assessment using EEG is very useful for applications like brain-computer interfaces, robotics, and adaptive learning. Existing deep-learning models can achieve high accuracy, but due to their large memory and energy requirements, they cannot be deployed on battery-driven, low-compute, low-memory edge devices such as wearable EEG devices. In this paper, we use brain-inspired spiking neural networks and neuromorphic computing paradigms, which promise at least $10^4$ times lower energy requirements than existing solutions. We designed two different spiking network architectures and tested them on two publicly available cognitive load datasets (EEGMAT \& STEW). We achieved accuracy comparable to existing work without performing any artifact removal from the EEG signal. Our model offers $\sim8\times$ lower memory requirements and $\sim10^3\times$ lower computational cost, and consumes at most 0.33 $\mu$J of energy per inference.

Dighanchal Banerjee · Sounak Dey · Debatri Chatterjee · Arpan Pal 🔗 
Sat 9:25 a.m. – 10:30 a.m.

Neuromorphic Co-Design as a Game
(
Poster
)
link
Co-design is currently a prominent topic in computing, speaking to the mutual benefit of coordinating design choices across several layers of the technology stack. For example, this may mean designing algorithms that most efficiently take advantage of the acceleration properties of a given architecture, while simultaneously designing the hardware to support the structural needs of a class of computation. The implications of these design decisions are influential enough to be deemed a lottery, enabling an idea to win out over others irrespective of its individual merits. Coordination is a well-studied topic in the mathematics of game theory, where in many cases the outcome is sub-optimal without a coordination mechanism. Here we consider what insights game-theoretic analysis can offer for computer architecture co-design. In particular, we consider the interplay between algorithm and architecture advances in the field of neuromorphic computing. Analyzing developments in spiking neural network algorithms and neuromorphic hardware as a co-design game, we use the Stag Hunt model to illustrate the challenges spiking algorithms or architectures face in advancing the field independently, and advocate for a strategic pursuit to advance neuromorphic computing. 
Craig M Vineyard · William Severa · Brad Aimone 🔗 
Sat 9:25 a.m. – 10:30 a.m.

Device Co-design using Reinforcement Learning and Evolutionary Optimization
(
Poster
)
link
Device discovery and circuit modeling for emerging devices, such as magnetic tunnel junctions, require detailed and time-consuming device and circuit simulations. In this work, we propose using AI-guided techniques such as reinforcement learning and evolutionary optimization to accelerate device discovery, enhance the creativity of solutions, and automate the optimization of true random number generators designed for a given distribution. We present preliminary results designing true random number generators using magnetic tunnel junctions optimized for performance. 
Catherine Schuman · Suma Cardwell · Karan Patel · J. Smith · Jared Arzate · Andrew Maicke · Samuel Liu · Jaesuk Kwon · Jean Anne Incorvia 🔗 
Sat 9:25 a.m. – 10:30 a.m.

Mean-Field Assisted Deep Boltzmann Learning with Probabilistic Computers
(
Poster
)
link
Despite their appeal as physics-inspired, energy-based generative models, general Boltzmann Machines (BM) are considered intractable to train. This belief led to simplified BM variants with restricted intra-layer connections, or to layer-by-layer training of deep BMs. Recent developments in domain-specific hardware, specifically probabilistic computers (p-computers) built from probabilistic bits (p-bits), may change established wisdom on the tractability of deep BMs. In this paper, we show that deep and unrestricted BMs can be trained using p-computers generating hundreds of billions of Markov Chain Monte Carlo (MCMC) samples per second, on sparse networks originally developed for use in D-Wave's annealers. To maximize the learning efficiency of the p-computer, we introduce two families of Mean-Field-Theory-assisted learning algorithms, or xMFTs (x = Naive and Hierarchical). The xMFTs are used to estimate the averages and correlations during the positive phase of the contrastive divergence (CD) algorithm, and our custom-designed p-computer is used to estimate the averages and correlations in the negative phase. A custom Field-Programmable Gate Array (FPGA) emulation of the p-computer architecture performs up to 45 billion flips per second, allowing the implementation of CD-$n$ where $n$ can be on the order of millions, unlike RBMs where $n$ is typically 1 or 2. Experiments on the full MNIST dataset with the combined algorithm show that the positive phase can be efficiently computed by xMFTs without much degradation when the negative phase is computed by the p-computer. Our algorithm can be used in other scalable Ising machines, and its variants can be used to train BMs previously thought to be intractable.

Shuvro Chowdhury · Shaila Niazi · Kerem Camsari 🔗 
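The mean-field side of such an algorithm amounts to solving a self-consistency equation for the spin averages instead of sampling them. The sketch below shows a naive mean-field iteration for a small random Ising network; the network size, coupling scale, and damping schedule are illustrative assumptions, not the paper's setup:

```python
import numpy as np

# Illustrative naive mean-field (NMF) estimate of spin averages for a small
# random Ising/Boltzmann network; size, coupling scale, and damping are
# assumptions, not the paper's configuration.

rng = np.random.default_rng(0)
n = 8
J = rng.normal(scale=0.1, size=(n, n))
J = np.triu(J, 1) + np.triu(J, 1).T       # symmetric couplings, zero diagonal
h = rng.normal(scale=0.2, size=n)         # local fields

# Damped fixed-point iteration of  m_i = tanh(h_i + sum_j J_ij m_j)
m = np.zeros(n)
for _ in range(200):
    m = 0.5 * m + 0.5 * np.tanh(h + J @ m)

# Mean-field estimates used in place of sampled moments:
# <s_i> ~ m_i  and  <s_i s_j> ~ m_i m_j  for i != j
corr = np.outer(m, m)
np.fill_diagonal(corr, 1.0)
```

Estimates of this kind would stand in for the sampled averages and correlations of one CD phase, while a sampler (here, the p-computer) handles the other.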
Sat 9:25 a.m. – 10:30 a.m.

Emergent learning in physical systems as feedback-based aging in a glassy landscape
(
Poster
)
link
By training linear physical networks to learn linear transformations, we discern how their physical properties evolve under weight-update rules. Our findings highlight a striking similarity between the learning behavior of such networks and the processes of aging and memory formation in disordered and glassy systems. We show that the learning dynamics resemble an aging process, in which the system relaxes in response to repeated application of the feedback boundary forces in the presence of an input force, thus encoding a memory of the input-output relationship. With this relaxation comes an increase in the correlation length, indicated by the two-point correlation function of the network's components. We also observe that the square root of the mean-squared error as a function of epoch takes on a non-exponential form, a typical feature of glassy systems. This physical interpretation suggests that, by encoding more detailed information into input and feedback boundary forces, emergent learning can be rather ubiquitous and thus serve as a very early physical mechanism, from an evolutionary standpoint, for learning in biological systems. 
Vidyesh Anisetti · Ananth Kandala · Jennifer Schwarz 🔗 
Sat 9:25 a.m. – 10:30 a.m.

Unleashing Hyperdimensional Computing with Nyström-Method-based Data-Adaptive Encoding
(
Poster
)
link
Hyperdimensional Computing (HDC) performs machine learning tasks by first encoding data into high-dimensional distributed representations called hypervectors. Learning tasks can then be performed on those hypervectors with a set of computationally efficient and simple operations. HDC has gained significant attention in recent years due to its excellent hardware efficiency. The core of every HDC algorithm is the encoding function, which determines the expressive power of the hypervectors and is thus the critical bottleneck for performance. However, existing HDC encoding methods are task-dependent and often capture only a very basic notion of similarity, which can limit the accuracy of HDC models. To unleash the potential of HDC on arbitrary tasks, we propose a novel encoding method inspired by the Nyström method for kernel approximation. Our approach generates an encoding function that approximates any user-defined positive-definite similarity function on the data via dot products between encodings in HD-space. This allows HDC to tackle a broader range of tasks with better learning accuracy while still retaining its hardware efficiency. We empirically evaluate our proposed encoding method against existing HDC encoding methods commonly used in various classification tasks. Our results show that our HDC encoding method achieves better accuracy across a variety of learning tasks. On graph and string datasets, our method achieves 10\%-37\% and 3\%-18\% better classification accuracy, respectively. 
Quanling Zhao · Anthony Thomas · Xiaofan Yu · Tajana S Rosing 🔗 
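The standard Nyström construction behind this kind of encoding can be sketched in a few lines; the RBF kernel, bandwidth, landmark count, and random data below are illustrative assumptions, not the paper's configuration. Dot products between encodings then approximate the chosen kernel (exactly so on the landmark points):

```python
import numpy as np

# Illustrative Nyström-style encoding: vectors whose dot products
# approximate a user-chosen positive-definite kernel (here RBF).

def rbf(X, Y, gamma=0.5):
    """RBF kernel matrix between rows of X and rows of Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
landmarks = X[rng.choice(100, size=20, replace=False)]

# Nyström map: phi(x) = K(x, L) W^{-1/2}  with  W = K(L, L)
W = rbf(landmarks, landmarks)
eigval, eigvec = np.linalg.eigh(W)
W_inv_sqrt = eigvec @ np.diag(1.0 / np.sqrt(np.clip(eigval, 1e-10, None))) @ eigvec.T
phi = rbf(X, landmarks) @ W_inv_sqrt     # the hypervector-like encoding

# Dot products between encodings approximate the kernel matrix
K_approx = phi @ phi.T
err = np.abs(K_approx - rbf(X, X)).mean()
```

Swapping `rbf` for any other positive-definite similarity function yields an encoding tailored to that notion of similarity, which is the flexibility the abstract describes.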
Sat 9:25 a.m. – 10:30 a.m.

Thermodynamic AI and Thermodynamic Linear Algebra
(
Poster
)
link
Many Artificial Intelligence (AI) algorithms are inspired by physics and employ stochastic fluctuations, such as generative diffusion models, Bayesian neural networks, and Monte Carlo inference. These algorithms are currently run on digital hardware, ultimately limiting their scalability and overall potential. Here, we propose a novel computing device, called Thermodynamic AI hardware, that could accelerate such algorithms. Thermodynamic AI hardware can be viewed as a novel form of computing, since it uses novel fundamental building blocks, called stochastic units (s-units), which naturally evolve over time via stochastic trajectories. In addition to these s-units, Thermodynamic AI hardware employs a Maxwell's demon device that guides the system to produce non-trivial states. We provide a few simple physical architectures for building these devices, such as RC electrical circuits. Moreover, we show that this same hardware can be used to accelerate various linear algebra primitives. We present simple thermodynamic algorithms for (1) solving linear systems of equations, (2) computing matrix inverses, (3) computing matrix determinants, and (4) solving Lyapunov equations. Under reasonable assumptions, we rigorously establish asymptotic speedups for our algorithms, relative to digital methods, that scale linearly with dimension. Numerical simulations also suggest a speedup is achievable in practical scenarios. 
Patrick Coles · Maxwell Aifer · Kaelan Donatella · Denis Melanson · Max Hunter Gordon · Thomas Ahle · Daniel Simpson · Gavin Crooks · Antonio Martinez · Faris Sbahi 🔗 
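As a toy illustration of the linear-system primitive (1), the sketch below simulates an overdamped stochastic (Ornstein-Uhlenbeck) system whose drift is $b - Ax$, so its stationary mean is $A^{-1}b$; time-averaging a noisy trajectory then estimates the solution. The step size, noise level, and averaging window are illustrative assumptions, not the proposed hardware:

```python
import numpy as np

# Illustrative thermodynamic-style linear solver: simulate
#   dx = (b - A x) dt + sqrt(2 * noise * dt) * xi
# and time-average the trajectory; the stationary mean is A^{-1} b.

rng = np.random.default_rng(3)
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, 0.0])

dt, noise, steps, burn_in = 1e-3, 0.1, 100_000, 20_000
x = np.zeros(2)
samples = []
for k in range(steps):
    x = x + dt * (b - A @ x) + np.sqrt(2.0 * noise * dt) * rng.normal(size=2)
    if k >= burn_in:
        samples.append(x)

x_est = np.mean(samples, axis=0)         # stochastic estimate of A^{-1} b
x_exact = np.linalg.solve(A, b)          # digital reference solution
```

On hardware, the trajectory would be a physical voltage evolving in continuous time rather than a simulated loop, which is where the claimed speedups originate.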
Sat 9:25 a.m. – 10:30 a.m.

Nonlinear Classification Without a Processor
(
Poster
)
link
Computers, as well as most neuromorphic hardware systems, use central processing and top-down algorithmic control to train for machine learning tasks. In contrast, brains are ensembles of 100 billion neurons working in tandem, giving them tremendous advantages in power efficiency and speed. Many physical systems `learn' through history dependence, but training a physical system to perform arbitrary nonlinear tasks without a processor has not been possible. Here we demonstrate the successful implementation of such a system: a learning metamaterial. This nonlinear analog circuit comprises identical copies of a single simple element, each following the same local update rule. When voltages (inputs) are applied to the system, inference is performed by physics in microseconds. When labels are properly enforced (also via voltages), the system's internal state evolves in time, approximating gradient descent. Our system $\textit{learns on its own}$; it requires no processor. Once trained, it performs inference passively, requiring approximately 100~$\mu$W of total power dissipation across its edges. We demonstrate the flexibility and power efficiency of our system by solving nonlinear 2D classification tasks. Learning metamaterials have immense potential as fast, efficient, robust learning systems for edge computing, from smart sensors to medical devices to robotic control.

Sam Dillavou · Benjamin Beyer · Menachem Stern · Marc Miskin · Andrea Liu · Douglas Durian 🔗 
Sat 9:25 a.m. – 10:30 a.m.

Neural Deep Operator Networks representation of Coherent Ising Machine Dynamics
(
Poster
)
link
Coherent Ising Machines (CIMs) are optical devices that employ parametric oscillators to tackle binary optimization problems, and whose simplified dynamics are described by a series of coupled ordinary differential equations. In this study, we learn the deterministic dynamics of CIMs using neural Deep Operator Networks (DeepONets). After successfully training the networks over multiple initial conditions and problem instances, we benchmark the comparative performance of the neural network. In our tests, the network is capable of delivering solutions of quality comparable to the exact dynamics up to 175 spins, and we do not identify roadblocks to going further: given sufficient training resources, CIM solvers could successfully be represented by a neural network at large scale. 
Arsalan Taassob · Davide Venturelli · Paul Lott 🔗 
Sat 9:25 a.m. – 10:30 a.m.

Exploiting Symmetric Temporally Sparse BPTT for Efficient RNN Training
(
Poster
)
link
Recurrent Neural Networks (RNNs) are useful in temporal sequence tasks. However, training RNNs involves dense matrix multiplications, which require hardware that can support a large number of arithmetic operations and memory accesses. Implementing online training of RNNs on the edge calls for optimized algorithms that enable efficient deployment on hardware. Inspired by the spiking neuron model, the Delta RNN exploits temporal sparsity during inference by skipping the update of hidden states from inactivated neurons, i.e., those whose change of activation across two timesteps is below a defined threshold. This work describes a training algorithm for Delta RNNs that exploits temporal sparsity in the backward propagation phase to reduce the computational requirements of training on the edge. Because the computation graphs of forward and backward propagation are symmetric during training, the gradient computation of inactivated neurons can be skipped. Results show a reduction of ~80% in matrix operations for training a 56k-parameter Delta LSTM on the Fluent Speech Commands dataset with negligible accuracy loss. Logic simulations of a hardware accelerator designed for the training algorithm show a 2-10X speedup in matrix computations for an activation sparsity range of 50%-90%. Additionally, we show that the proposed Delta RNN training will be useful for online incremental learning on edge devices with limited computing resources. 
Xi Chen · Chang Gao · Zuowen Wang · Longbiao Cheng · Sheng Zhou · ShihChii Liu · Tobi Delbruck 🔗 
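The delta-threshold principle that produces these savings can be sketched for a single step; the layer size, threshold value, and single-step setting below are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

# Illustrative single step of the delta-threshold idea behind Delta RNNs:
# only neurons whose activation changed by more than a threshold since the
# last propagated value trigger an update, so only the corresponding
# columns of the weight matrix enter the product.

rng = np.random.default_rng(2)
n = 16
W = rng.normal(scale=0.3, size=(n, n))
theta = 0.1                              # delta threshold

h_prev_sent = np.zeros(n)                # last activations actually propagated
h = np.tanh(rng.normal(size=n))          # current activations

delta = h - h_prev_sent
active = np.abs(delta) > theta           # neurons that trigger an update

# Sparse update: only columns of active neurons are touched
update = W[:, active] @ delta[active]
h_prev_sent = np.where(active, h, h_prev_sent)

skipped = 1.0 - active.mean()            # fraction of skipped columns
```

Since the backward pass mirrors this structure, the same mask can skip the gradient computation of inactivated neurons, which is the symmetry the abstract exploits.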
Sat 9:25 a.m. – 10:30 a.m.

Poster Session I
(
Poster Session
)

🔗 
Sat 10:30 a.m. – 11:30 a.m.

Lunch Break
(
Break
)

🔗 
Sat 11:30 a.m. – 11:50 a.m.

The Optimization Conundrum: Balancing Data, Speed, and Accuracy in Real-Time Machine Learning
(
Invited Talk
)
link
SlidesLive Video 
Mihaela van der Schaar 🔗 
Sat 11:50 a.m. – 12:00 p.m.

A Lagrangian Perspective on Dual Propagation
(
Oral
)
link
SlidesLive Video The search for "biologically plausible" learning algorithms has converged on the idea of representing gradients as activity differences. However, most approaches require a high degree of synchronization (distinct phases during learning) and introduce high computational overhead, which raises doubt regarding their biological plausibility as well as their potential usefulness for neuromorphic computing. Furthermore, they commonly rely on applying infinitesimal perturbations (nudges) to output units, which is impractical in noisy environments. Recently it has been shown that by modelling artificial neurons as dyads with two oppositely nudged compartments, it is possible for a fully local learning algorithm to bridge the performance gap to backpropagation, without requiring separate learning phases, while also being compatible with significant levels of nudging. However, the algorithm, called dual propagation, has the drawback that convergence of its inference method relies on symmetric nudging of the output units, which may be infeasible in biological and analog implementations. Starting from a modified version of LeCun's Lagrangian approach to backpropagation, we derive a slightly altered variant of dual propagation, which is robust to asymmetric nudging. 
Rasmus Høier · Christopher Zach 🔗 
Sat 12:00 p.m. – 12:10 p.m.

Scaling of Optical Transformers
(
Oral
)
link
SlidesLive Video The rapidly increasing size of deeplearning models has renewed interest in alternatives to digitalelectronic computers as a means to dramatically reduce the energy cost of running stateoftheart neural networks. Optical matrixvector multipliers are best suited to performing computations with very large operands, which suggests that large Transformer models could be a good target for them. However, the ability of optical accelerators to run efficiently depends on the model being run, and if the model can be run at all when subject to the noise, error, and low precision of analogoptical hardware. Here we investigate whether Transformers meet the criteria to be efficient when running optically, what benefits can be had for doing so, and how worthwhile it is at scale. We found using smallscale experiments on and simulation of a prototype hardware accelerator that Transformers may run on optical hardware, and that elements of their design  the ability to parallelprocess data using the same weights, and trends in scaling them to enormous widths  allow them to achieve an asymptotic energyefficiency advantage running optically compared to on digital hardware. Based on a model of a full optical accelerator system, we predict that wellengineered, largescale optical hardware should be able to achieve a 100× energyefficiency advantage over current digitalelectronic processors in running some of the largest current Transformer models, and if both the models and the optical hardware are scaled to the quadrillionparameter regime, optical accelerators could have a > 8,000× energyefficiency advantage. 
Maxwell Anderson · ShiYuan Ma · Tianyu Wang · Logan Wright · Peter McMahon 🔗 
Sat 12:10 p.m. – 12:30 p.m.

Low-Precision & Analog Computational Techniques for Sustainable & Accurate AI Inference & Training
(
Invited Talk
)
link
SlidesLive Video The recent rise of Generative AI has led to a dramatic increase in the sizes and computational needs for AI models. This compute explosion has raised serious cost and sustainability concerns in both the training & deployment phases of these large models. Lowprecision techniques, that lower the precision of the weights, activations, and gradients, have been successfully employed to reduce the trainingprecision from 32bits down to 8bits (FP8) and the inferenceprecision down to 4bits (INT4). These advances have enabled more than a 10fold improvement in compute efficiency over the past decade – however, it is expected that further gains may be limited. Recent developments in analog computational techniques offer the promise of achieving an additional 10100X enhancement in crucial metrics, including energy efficiency and computational density. In this presentation, we will provide an overview of these significant recent breakthroughs, which are likely to play a pivotal role in advancing Generative AI and making it more sustainable and accessible to a wider audience. 
Kailash Gopalakrishnan 🔗 
Sat 12:30 p.m. – 2:00 p.m.

Poster Session II
(
Poster Session
)

🔗 
Sat 2:00 p.m. – 3:00 p.m.

Panel Discussion
(
Panel
)
SlidesLive Video 
🔗 