Workshop
Machine Learning with New Compute Paradigms
Jannes Gladrow · Benjamin Scellier · Eric Xing · Babak Rahmani · Francesca Parmigiani · Paul Prucnal · Cheng Zhang
Room 235 - 236
As GPU computing approaches a plateau in efficiency and cost, with Moore's law reaching its limit, there is a growing need to explore alternative computing paradigms, such as (opto-)analog, neuromorphic, and low-power computing. This NeurIPS workshop aims to unite researchers from machine learning and alternative computation fields to establish a new hardware-ML feedback loop. By co-designing models with specialized accelerators, we can leverage the benefits of increased throughput and lower per-FLOP power consumption. Novel devices hold the potential to further accelerate standard deep learning or even enable efficient inference and training of hitherto compute-constrained model classes. However, new compute paradigms typically present challenges such as intrinsic noise, restricted sets of compute operations, or limited bit depth, and thus require model-hardware co-design. This workshop's goal is to foster cross-disciplinary collaboration to capitalize on the opportunities offered by emerging AI accelerators.
Schedule
Sat 7:00 a.m. - 7:10 a.m.
Opening Remarks (Talk)
Cheng Zhang
Sat 7:10 a.m. - 7:30 a.m.
Computing with physical systems: reimagining special-purpose computers from the bottom up (Invited Talk)
In this talk I will discuss how, by eliminating many of the layers of abstraction used in conventional computers and working as close to the underlying physics as possible, we may be able to create special-purpose processors that are orders of magnitude faster or more energy-efficient than the present state of the art.
Peter McMahon
Sat 7:30 a.m. - 7:40 a.m.
SpiNNaker2: A Large-Scale Neuromorphic System for Event-Based and Asynchronous Machine Learning (Oral)
The joint progress of artificial neural networks and domain-specific hardware accelerators such as GPUs and TPUs has taken over many domains of machine learning research. This development is accompanied by rapid growth in the computational demands of larger models and more data. Concurrently, emergent properties of foundation models such as in-context learning drive new opportunities for machine learning applications. However, the computational cost of such applications is a limiting factor of the technology in data centers, and more importantly in mobile devices and edge systems. To mitigate the energy footprint and non-trivial latency of contemporary systems, neuromorphic computing systems deeply integrate computational principles of neurobiological systems by leveraging low-power analog and digital technologies. SpiNNaker2 is a digital neuromorphic chip developed for scalable machine learning. The event-based and asynchronous design of SpiNNaker2 allows the composition of large-scale systems from thousands of chips. In this work, we present the design and operating principles of SpiNNaker2 systems. Furthermore, we outline a number of machine learning applications that we developed on either the full chip or earlier prototypes. The already available applications range from accelerating artificial neural networks, over bio-inspired spiking neural networks, to generalized event-based neural networks. With the successful development and deployment of SpiNNaker2, we aim to facilitate the advancement of event-based and asynchronous algorithms for future generations of machine learning systems.
Hector Gonzalez · Jiaxin Huang · Florian Kelber · Khaleelulla Khan Nazeer · Tim Hauke Langer · Chen Liu · Matthias Lohrmann · Amirhossein Rostami · Mark Schoene · Bernhard Vogginger · Timo Wunderlich · Yexin Yan · Mahmoud Akl · Christian Mayr
Sat 7:40 a.m. - 7:50 a.m.
Scaling-up Memristor Monte Carlo with magnetic domain-wall physics (Oral)
By exploiting the intrinsic random nature of nanoscale devices, Memristor Monte Carlo (MMC) is a promising enabler of edge learning systems. However, due to multiple algorithmic and device-level limitations, existing demonstrations have been restricted to very small neural network models and datasets. We discuss these limitations and describe how they can be overcome by mapping the stochastic gradient Langevin dynamics (SGLD) algorithm onto the physics of magnetic domain-wall memristors, scaling up MMC models by five orders of magnitude. We propose the push-pull pulse programming method that realises SGLD in-physics, and use it to train a domain-wall-based ResNet18 on the CIFAR-10 dataset. On this task, we observe no performance degradation relative to a floating-point model down to an update precision of between 6 and 7 bits, indicating we have made a step towards a large-scale edge learning system leveraging noisy analogue devices.
Thomas Dalgaty · Shogo Yamada · Anca Molnos · Eiji Kawasaki · Thomas Mesquida · Rummens François · TATSUO SHIBATA · Yukihiro Urakawa · Yukio Terasaki · Tomoyuki Sasaki · Marc Duranton
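For reference, the digital form of the SGLD update that the push-pull pulse programming realises in-physics is a gradient step plus injected Gaussian noise. A minimal sketch (illustrative only, not the authors' device-level scheme):

```python
import numpy as np

def sgld_step(theta, grad, lr, rng):
    """One stochastic gradient Langevin dynamics update.

    theta : parameter vector
    grad  : stochastic gradient of the loss at theta
    lr    : step size, which also sets the injected noise variance 2*lr
    """
    noise = rng.normal(0.0, np.sqrt(2.0 * lr), size=theta.shape)
    return theta - lr * grad + noise

rng = np.random.default_rng(0)
theta = np.zeros(4)
theta = sgld_step(theta, grad=np.ones(4), lr=0.01, rng=rng)
```

In MMC, the device's intrinsic programming stochasticity plays the role of the explicitly injected noise term.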
Sat 7:50 a.m. - 8:10 a.m.
Analog AI Accelerators (Invited Talk)
Deep learning has irreversibly changed and drastically enhanced how we process information. The rapidly increasing computation time and energy costs required to train ever larger AI models make it evident that the future of artificial intelligence depends on realizing fast and energy-efficient processors. With the slowdown in transistor scaling and the diminishing returns expected from future CMOS, the concept of analog computing has been put forward as an alternative. Analog neural networks process information that is stored locally, in a fully parallel manner, in the analog domain, using physical device properties instead of conventional Boolean arithmetic. This presentation will give an overview of analog neural networks and the underlying device technologies used to implement them.
Jesus del Alamo
Sat 8:10 a.m. - 8:25 a.m.
Break
Sat 8:25 a.m. - 8:35 a.m.
Analog-Optical Computation for optimization and machine-learning inference (Talk)
Solving optimization problems is challenging for existing digital computers and even for future quantum hardware. The practical importance of diverse problems, from healthcare to financial optimization, has driven the emergence of specialised hardware over the past decade. However, their support for problems with only binary variables severely restricts the scope of practical problems that can be efficiently embedded. We build the analog iterative machine (AIM), the first instance of an opto-electronic solver that natively implements a wider class of quadratic unconstrained mixed optimization (QUMO) problems and supports all-to-all connectivity of both continuous and binary variables. Beyond synthetic 7-bit problems at small scale, AIM solves the financial transaction settlement problem entirely in the analog domain, with higher accuracy than quantum hardware and at room temperature. With compute-in-memory operation and spatial-division multiplexed representation of variables, AIM's design paves the way to a chip-scale architecture with a 100-times speed-up per unit power over the latest GPUs for solving problems with 10,000 variables. The robustness of the AIM algorithm at such scale is further demonstrated by comparing it with commercial production solvers across multiple benchmarks, where for several problems we report new best solutions. By combining the superior QUMO abstraction, sophisticated gradient descent methods inspired by machine learning, and commodity hardware, AIM introduces a novel platform with a step change in expressiveness, performance, and scalability for optimization in the post-Moore's-law era.
Jannes Gladrow
Sat 8:35 a.m. - 8:45 a.m.
Bayesian Metaplasticity from Synaptic Uncertainty (Oral)
Catastrophic forgetting remains a challenge for neural networks, especially in lifelong learning scenarios. In this study, we introduce MEtaplasticity from Synaptic Uncertainty (MESU), inspired by metaplasticity and Bayesian inference principles. MESU harnesses synaptic uncertainty to retain information over time, with its update rule closely approximating the diagonal Newton's method for synaptic updates. Through continual learning experiments on permuted MNIST tasks, we demonstrate MESU's remarkable capability to maintain learning performance across 100 tasks without the need for explicit task boundaries.
Djohan Bonnet · Tifenn HIRTZLIN · Tarcisius Januel · Thomas Dalgaty · Damien Querlioz · Elisa Vianello
Sat 8:45 a.m. - 9:05 a.m.
Training physical systems with Equilibrium Propagation (Invited Talk)
The algorithm of Equilibrium Propagation (EP) [1] is highly interesting for training physical systems, as it extracts backprop-equivalent gradients directly from their convergence to a steady state [2,3]. In my talk, I will show that it is an excellent starting point for building and training physical systems to perform classification tasks. I will first describe how we have used EP to train the hardware D-Wave Ising machine in a supervised way to recognize handwritten digits [4]. I will then show that EP can unlock self-learning in spiking neural networks [5]. Finally, I will explain how we can extend EP to unsupervised learning.
Julie Grollier
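The two-phase gradient extraction that EP performs can be illustrated on a toy one-neuron energy function. The following sketch is our simplified example (not the speaker's hardware setup): the nudged-minus-free estimate approaches the true loss gradient as the nudging strength beta shrinks.

```python
import numpy as np

def free_state(theta, x):
    # Minimum of the toy energy E(s) = 0.5*s**2 - theta*x*s
    return theta * x

def nudged_state(theta, x, y, beta):
    # Minimum of E(s) + beta * 0.5*(s - y)**2
    return (theta * x + beta * y) / (1.0 + beta)

def ep_gradient(theta, x, y, beta=1e-3):
    """EP estimate of dC/dtheta for the loss C = 0.5*(s_free - y)**2.

    Two-phase rule: (1/beta) * (dE/dtheta at nudged state - at free state),
    where dE/dtheta = -x*s for this toy energy.
    """
    s0 = free_state(theta, x)
    sb = nudged_state(theta, x, y, beta)
    return (1.0 / beta) * ((-x * sb) - (-x * s0))

g_ep = ep_gradient(theta=0.5, x=2.0, y=0.3)
g_true = 2.0 * (0.5 * 2.0 - 0.3)  # exact gradient x*(theta*x - y)
```

Only two relaxations of the physical system and a local contrast of energies are needed, which is what makes the rule attractive for in-hardware training.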
Sat 9:05 a.m. - 9:25 a.m.
Low-precision Sampling for Probabilistic Deep Learning (Invited Talk)
Sampling from a probability distribution is a ubiquitous challenge in machine learning, ranging from generative AI to approximate Bayesian inference. This talk will show how to leverage low-precision compute to accelerate Markov chain Monte Carlo (MCMC) sampling with theoretical guarantees on convergence. First, I will introduce a general and theoretically grounded framework to enable low-precision sampling, with applications to Stochastic Gradient Langevin Dynamics and Stochastic Gradient Hamiltonian Monte Carlo. Then I will present an approach for binary sampling, operating at 1-bit precision. Finally, I will show experimental results of low-precision sampling on various deep learning tasks.
Ruqi Zhang
Sat 9:25 a.m. - 10:30 a.m.
Inference analysis of optical transformers (Poster)
This paper explores the utilization of optical computing for accelerating inference in transformer models, which have demonstrated substantial success in various applications. Optical computing offers ultra-fast computation and ultra-high energy efficiency compared to conventional electronics. Our findings suggest that optical implementation has the potential to achieve a significant 10-100 times improvement in the inference throughput of compute-limited transformer models.
Xianxin Guo · Chenchen Wang · Djamshid Damry
Sat 9:25 a.m. - 10:30 a.m.
Diffractive Optical Neural Networks with Arbitrary Spatial Coherence (Poster)
Diffractive optical neural networks (DONNs) have emerged as a promising optical hardware platform for ultra-fast and energy-efficient signal processing. However, previous experimental demonstrations of DONNs have only been performed using coherent light, which is not present in the natural world. Here, we study the role of spatial optical coherence in DONN operation. We propose a numerical approach to efficiently simulate DONNs under input illumination with arbitrary spatial coherence and discuss the corresponding computational complexity using coherent, partially coherent, and incoherent light. We also investigate the expressive power of DONNs and examine how coherence affects their performance. We show that under fully incoherent illumination, the DONN performance cannot surpass that of a linear model. As a demonstration, we train and evaluate simulated DONNs on the MNIST dataset using light with varying spatial coherence.
Matthew Filipovich · Aleksei Malyshev · Alexander Lvovsky
Sat 9:25 a.m. - 10:30 a.m.
The Data Movement Bottleneck in Analog Computing Accelerators: An Analog Optical Fourier Transform and Convolution Accelerator Case Study (Poster)
Most modern computing tasks are constrained to having digital electronic input and output data. Due to these constraints imposed by the user, any analog computing accelerator must perform an analog-to-digital conversion on its input data and a subsequent digital-to-analog conversion on its output data. This places performance limits on analog computing accelerator hardware. To avoid this, the analog hardware must replace the full functionality of traditional digital electronic computer hardware. This is not currently possible for optical computing accelerators due to limitations in gain, input-output isolation, and information storage in current optical hardware. We conducted a case study on an analog optical Fourier transform and convolution accelerator using 27 empirically-measured benchmarks. We estimate that an ideal optical accelerator that accelerates Fourier transforms and convolutions can produce an average speedup of $9.4 \times$ and a median speedup of $1.9 \times$ for the set of benchmarks. The maximum speedups achieved were $45.3 \times$ for a pure Fourier transform and $159.4 \times$ for a pure convolution. An optical Fourier transform and convolution accelerator only produces significant speedup for applications consisting exclusively of Fourier transforms and convolutions.
James Meech · Vasileios Tsoutsouras · Phillip Stanley-Marbell
Sat 9:25 a.m. - 10:30 a.m.
Hierarchy of the echo state property in quantum reservoir computing (Poster)
The echo state property (ESP) represents a fundamental concept in the reservoir computing framework that ensures stable output-only training of reservoir networks. However, the conventional definition of ESP does not aptly describe possibly non-stationary systems, where statistical properties evolve. To address this issue, we introduce two new categories of ESP: $\textit{non-stationary ESP}$, designed for possibly non-stationary systems, and $\textit{subspace/subset ESP}$, designed for systems whose subsystems have ESP. Following the definitions, we numerically demonstrate the correspondence between non-stationary ESP in the quantum reservoir computer (QRC) framework with typical Hamiltonian dynamics and input encoding methods using nonlinear autoregressive moving-average (NARMA) tasks. These newly defined properties present a new understanding toward the practical design of QRC and other possibly non-stationary RC systems.
Shumpei Kobayashi · Hoan Tran Quoc · Kohei Nakajima
Sat 9:25 a.m. - 10:30 a.m.
Contrastive power-efficient physical learning in resistor networks (Poster)
The prospect of substantial reductions in the power consumption of AI is a major motivation for the development of neuromorphic hardware. Less attention has been given to the complementary problem of power-efficient learning rules for such systems. Here we study self-learning physical systems trained by local learning rules based on contrastive learning. We show how the physical learning rule can be biased toward finding power-efficient solutions to learning problems, and demonstrate in simulations and laboratory experiments the emergence of a trade-off between power-efficiency and task performance.
Menachem Stern · Sam Dillavou · Dinesh Jayaraman · Douglas Durian · Andrea Liu
Sat 9:25 a.m. - 10:30 a.m.
Energy-Based Learning Algorithms for Analog Computing: A Comparative Study (Poster)
This work compares seven energy-based learning algorithms, namely contrastive learning (CL), equilibrium propagation (EP), coupled learning (CpL), and different variants of these algorithms depending on the type of perturbation used. The algorithms are compared on deep convolutional Hopfield networks (DCHNs) and evaluated on five vision tasks (MNIST, Fashion-MNIST, SVHN, CIFAR-10 and CIFAR-100). The results reveal that while all algorithms perform similarly on the simplest task (MNIST), differences in performance become evident as task complexity increases. Perhaps surprisingly, we find that negative perturbations yield significantly better results than positive ones, and the centered variant of EP emerges as the top-performing algorithm. Lastly, we report new state-of-the-art DCHN simulations on all five datasets (both in terms of speed and accuracy), achieving a 13.5x speedup compared to Laborieux et al. (2021).
Benjamin Scellier · Maxence Ernoult · Jack Kendall · Suhas Kumar
Sat 9:25 a.m. - 10:30 a.m.
Expanding Spiking Neural Networks With Dendrites for Deep Learning (Poster)
As deep learning networks increase in size and performance, so do the associated computational costs, approaching prohibitive levels. Dendrites offer powerful nonlinear "on-the-wire" computational capabilities, increasing the expressivity of the point neuron while preserving many of the advantages of Spiking Neural Networks (SNNs). We seek to demonstrate the potential of dendritic computations by combining them with the low-power event-driven computation of SNNs for deep learning applications. To this end, we have developed a library that adds dendritic computation to SNNs within the PyTorch framework, enabling complex deep learning networks that still retain the low-power advantages of SNNs. Our library leverages a dendrite CMOS hardware model to inform the software model, which enables nonlinear computation integrated with snnTorch at scale. Finally, we discuss potential deep learning applications in the context of current state-of-the-art deep learning methods and energy-efficient neuromorphic hardware.
Mark Plagge · Suma Cardwell · Frances Chance
Sat 9:25 a.m. - 10:30 a.m.
Activity Sparsity Complements Weight Sparsity for Efficient RNN Inference (Poster)
Artificial neural networks open up unprecedented machine learning capabilities at the cost of seemingly ever-growing computational requirements. Concurrently, the field of neuromorphic computing develops biologically inspired spiking neural networks and hardware platforms with the goal of bridging the efficiency gap between biological brains and deep learning systems. Yet, spiking neural networks often fall behind deep learning systems on many machine learning tasks. In this work, we demonstrate that the reduction factor of sparsely activated recurrent neural networks multiplies with the reduction factor of sparse weights. Our model achieves up to a $20\times$ reduction of operations while maintaining perplexities below $60$ on the Penn Treebank language modeling task. This reduction factor has not been achieved with solely sparsely connected LSTMs, and the language modeling performance of our model has not been achieved with sparsely activated spiking neural networks. Our results suggest further driving the convergence of methods from deep learning and neuromorphic computing for efficient machine learning.
Rishav Mukherji · Mark Schoene · Khaleelulla Khan Nazeer · Christian Mayr · Anand Subramoney
Sat 9:25 a.m. - 10:30 a.m.
Algebraic Design of Physical Computing System for Time-Series Generation (Poster)
Recently, computational techniques that employ physical systems (physical computing systems) have been developed. To utilize physical computing systems, their design strategy is important. Although there are practical learning-based methods and theoretical approaches, no general method exists that provides specific design guidelines for given systems with rigorous theoretical support. In this paper, we propose a novel algebraic design framework for a physical computing system for time-series generation, which is capable of extracting specific design guidelines. Our approach describes input-output relationships algebraically and relates them to this task. We present two theorems and the results of experiments. The first theorem offers a basic strategy for algebraic design. The second theorem explores the "replaceability" of such systems.
Mizuka Komatsu · Takaharu Yaguchi · Kohei Nakajima
Sat 9:25 a.m. - 10:30 a.m.
PhyFF: Physical forward forward algorithm for in-hardware training and inference (Poster)
Training of digital deep learning models primarily relies on backpropagation, which poses challenges for physical implementation due to its dependency on precise knowledge of the computations performed in the forward pass of the neural network. To address this issue, we propose a physical forward forward training algorithm (phyFF) that is inspired by the original forward forward algorithm. This novel approach facilitates direct training of deep physical neural networks comprising layers of diverse physical nonlinear systems, without the need for complete knowledge of the underlying physics. We demonstrate the superiority of this method over current hardware-aware training techniques. The proposed method achieves faster training speeds, reduces digital computational requirements, and lowers the power consumption of training in physical systems.
Ali Momeni · Babak Rahmani · Matthieu Malléjac · Philipp del Hougne · Romain Fleury
Sat 9:25 a.m. - 10:30 a.m.
A Green Granular Convolutional Neural Network with Software-FPGA Co-designed Learning (Poster)
Different from traditional tedious CPU-GPU-based training algorithms using gradient descent methods, the software-FPGA co-designed learning algorithm is created to quickly solve a system of linear equations to directly calculate optimal values of the hyperparameters of the green granular neural network (GGNN). To reduce both $CO_2$ emissions and energy consumption effectively, a novel green granular convolutional neural network (GGCNN) is developed by using a new classifier that uses GGNNs as building blocks with new fast software-FPGA co-designed learning. Initial simulation results indicate that the FPGA equation solver code ran faster than the Python equation solver code. Therefore, implementing the GGCNN with software-FPGA co-designed learning is feasible. In the future, the GGCNN will be evaluated by comparing it with a convolutional neural network (CNN) with traditional software-CPU-GPU-based learning in terms of speed, model size, accuracy, $CO_2$ emissions and energy consumption on popular datasets. New algorithms will be created to divide the inputs into different input groups that will be used to build different small-size GGNNs to address the curse of dimensionality.
Yanqing Zhang · Huaiyuan Chu
Sat 9:25 a.m. - 10:30 a.m.
Real-Time FJ/MAC PDE Solvers via Tensorized, Back-Propagation-Free Optical PINN Training (Poster)
Solving partial differential equations (PDEs) numerically often requires huge computing time, energy cost, and hardware resources in practical applications. This has limited their applications in many scenarios (e.g., autonomous systems, supersonic flows) that have a limited energy budget and require near real-time response. Leveraging optical/photonic computing, this paper develops an on-chip training framework for physics-informed neural networks (PINNs), aiming to solve high-dimensional PDEs with fJ/MAC power consumption and ultra-low latency. Despite the ultra-high speed of optical neural networks, training a PINN on an optical chip is hard due to (1) the large size of photonic devices, and (2) the lack of scalable optical memory devices to store the intermediate results of back-propagation (BP). To enable realistic optical PINN training, this paper presents a BP-free method to avoid the BP process. We also employ a tensor-compressed approach to improve the convergence and scalability of our optical PINN training. This training framework is designed with tensorized optical neural networks (TONN) for scalable inference acceleration and MZI phase-domain tuning for in-situ optimization. Our simulation results on a 20-dim HJB PDE show that our photonic accelerator can reduce the number of MZIs by a factor of $1.17\times 10^3$, with only 1.36 J and 1.15 s to solve this equation. This is the first real-size optical PINN training framework that can be applied to solve high-dimensional PDEs.
Yequan Zhao · Xian Xiao · Xinling Yu · Ziyue Liu · Zhixiong Chen · Geza Kurczveil · Raymond Beausoleil · Zheng Zhang
Sat 9:25 a.m. - 10:30 a.m.
Enhancing Low-Precision Sampling via Stochastic Gradient Hamiltonian Monte Carlo (Poster)
Low-precision training has emerged as a promising low-cost technique to enhance the training efficiency of deep neural networks without sacrificing much accuracy. Its Bayesian counterpart can further provide uncertainty quantification and improved generalization accuracy. This paper investigates low-precision samplers via Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) with low-precision and full-precision gradient accumulators for both strongly log-concave and non-log-concave distributions. Theoretically, our results show that, to achieve $\epsilon$-error in the 2-Wasserstein distance for non-log-concave distributions, low-precision SGHMC achieves quadratic improvement ($\tilde{\mathcal{O}}\left({\epsilon^{-2}{\mu^*}^{-2}\log^2\left({\epsilon^{-1}}\right)}\right)$) compared to the state-of-the-art low-precision sampler, Stochastic Gradient Langevin Dynamics (SGLD) ($\tilde{\mathcal{O}}\left({{\epsilon}^{-4}{\lambda^{*}}^{-1}\log^5\left({\epsilon^{-1}}\right)}\right)$). Moreover, we prove that low-precision SGHMC is more robust to quantization error than low-precision SGLD due to the robustness of the momentum-based update w.r.t. gradient noise. Empirically, we conduct experiments on synthetic data and the MNIST, CIFAR-10 \& CIFAR-100 datasets, which successfully validate our theoretical findings. Our study highlights the potential of low-precision SGHMC as an efficient and accurate sampling method for large-scale and resource-limited deep learning.
Ziyi Wang · Yujie Chen · Ruqi Zhang · Qifan Song
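As an illustrative sketch of the setting (not the paper's exact quantizer or analysis), a low-precision SGHMC step can be written as a momentum update followed by stochastically rounded parameter storage; the momentum's averaging of gradient noise is the source of the robustness the paper analyses. The grid spacing `delta` is a hypothetical choice for the example.

```python
import numpy as np

def stochastic_round(x, delta, rng):
    """Quantize x to a fixed-point grid of spacing delta with stochastic rounding."""
    scaled = x / delta
    floor = np.floor(scaled)
    prob = scaled - floor  # round up with probability equal to the remainder
    return delta * (floor + (rng.random(x.shape) < prob))

def sghmc_step(theta, m, grad, lr, friction, delta, rng):
    """One SGHMC update with low-precision (quantized) parameter storage.

    The momentum m integrates gradients over time, so per-step quantization
    and gradient noise are partially averaged out before reaching theta.
    """
    noise = rng.normal(0.0, np.sqrt(2.0 * friction * lr), size=m.shape)
    m = (1.0 - friction) * m - lr * grad + noise
    theta = stochastic_round(theta + m, delta, rng)
    return theta, m

rng = np.random.default_rng(0)
delta = 1.0 / 64.0  # ~6-bit fractional fixed-point grid (illustrative)
theta, m = np.zeros(8), np.zeros(8)
theta, m = sghmc_step(theta, m, grad=np.ones(8), lr=1e-2,
                      friction=0.1, delta=delta, rng=rng)
```

Stochastic rounding keeps the quantized update unbiased in expectation, which is the standard prerequisite for the convergence guarantees discussed in the abstract.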
Sat 9:25 a.m. - 10:30 a.m.
Squeezed Edge YOLO: Onboard Object Detection on Edge Devices (Poster)
Demand for efficient onboard object detection is increasing due to its key role in autonomous navigation. However, deploying object detection models such as YOLO on resource-constrained edge devices is challenging due to the high computational requirements of such models. In this paper, a Squeezed Edge YOLO is proposed which is compressed and optimized to kilobytes of parameters in order to fit onboard such edge devices. To evaluate the proposed Squeezed Edge YOLO, two use cases - human and shape detection - are used to show the model accuracy and performance. Moreover, the proposed model is deployed onboard a GAP8 processor with 8 RISC-V cores and an NVIDIA Jetson Nano with 4 GB of memory. Experimental results show the proposed Squeezed Edge YOLO model size is optimized by a factor of 8x, which leads to 76% improvement in energy efficiency and 3.3x faster throughput.
Edward Humes · Mozhgan Navardi · Tinoosh Mohsenin
Sat 9:25 a.m. - 10:30 a.m.
Virtual reservoir acceleration for CPU and GPU: Case study for coupled spin-torque oscillator reservoir (Poster)
We provide high-speed implementations for simulating reservoirs described by $N$ coupled spin-torque oscillators, where $N$ also corresponds to the number of reservoir nodes. We benchmark a variety of implementations based on CPU and GPU. Our new methods are at least 2.6 times quicker than the baseline for $N$ in the range $1$ to $10^4$. More specifically, over all implementations the best factor is 78.9 for $N=1$, which decreases to 2.6 for $N=10^3$ and finally increases to 23.8 for $N=10^4$. GPU outperforms CPU significantly at $N=2500$. Our results show that GPU implementations should be tested for reservoir simulations. The implementations considered here can be used for any reservoir with evolution that can be approximated using an explicit method.
Thomas de Jong · Nozomi Akashi · Tomohiro Taniguchi · Hirofumi Notsu · Kohei Nakajima
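The closing remark generalises: any reservoir whose evolution admits an explicit time-stepping scheme can be simulated this way. A minimal sketch with generic leaky-tanh nodes under explicit Euler (an illustrative assumption, not the paper's spin-torque oscillator dynamics):

```python
import numpy as np

def simulate_reservoir(u, w_in, w, dt=0.01, steps_per_input=10):
    """Explicit-Euler simulation of an N-node coupled reservoir.

    Node dynamics (illustrative): dx/dt = -x + tanh(W x + W_in * u_t).
    Returns one reservoir state per input sample, for output-only training.
    """
    n = w.shape[0]
    x = np.zeros(n)
    states = []
    for u_t in u:
        for _ in range(steps_per_input):
            x = x + dt * (-x + np.tanh(w @ x + w_in * u_t))
        states.append(x.copy())
    return np.array(states)

rng = np.random.default_rng(1)
n = 50
w = rng.normal(0.0, 1.0 / np.sqrt(n), (n, n))  # random coupling matrix
w_in = rng.normal(0.0, 1.0, n)                 # input weights
states = simulate_reservoir(rng.random(20), w_in, w)
```

Because the inner update is dense linear algebra, it vectorises naturally on CPU and maps to GPU batched matrix products, which is the regime the benchmarks above explore.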
Sat 9:25 a.m. - 10:30 a.m.
Adjoint Method: The Connection between Analog-based Equilibrium Propagation Architectures and Neural ODEs (Poster)
Analog neural networks (ANNs) hold significant potential for substantial reductions in power consumption in modern neural networks, particularly when employing the increasingly popular Energy-Based Models (EBMs) in tandem with the local Equilibrium Propagation (EP) training algorithm. This paper analyzes the relationship between this family of ANNs and the concept of Neural Ordinary Differential Equations (Neural ODEs). Using the adjoint method, we formally demonstrate that ANN-EP can be derived from Neural ODEs by constraining the differential equations to those with a steady-state response. This finding opens avenues for the ANN-EP community to extend ANNs to non-steady-state scenarios. Additionally, it provides an efficient setting for Neural ODEs that significantly reduces the training cost.
Mohamed Watfa · Alberto Garcia-Ortiz
Sat 9:25 a.m. - 10:30 a.m.
Beyond Digital: Harnessing Analog Hardware for Machine Learning (Poster)
A remarkable surge in utilizing large deep-learning models yields state-of-the-art results in a variety of tasks. Recent model sizes often exceed billions of parameters, underscoring the importance of fast and energy-efficient processing. The significant costs associated with training and inference primarily stem from the constrained memory bandwidth of current hardware and the computationally intensive nature of these models. Historically, the design of machine learning models has predominantly been guided by the operational parameters of classical digital devices. In contrast, analog computations have the potential to offer vastly improved power efficiency for both inference and training tasks. This work details several machine-learning methodologies that could leverage existing analog hardware infrastructures. To foster the development of analog hardware-aware machine learning techniques, we explore both optical and electronic hardware configurations suitable for executing the fundamental mathematical operations inherent to these models. Integrating analog hardware with innovative machine learning approaches may pave the way for cost-effective AI systems at scale.
Marvin Syed · Kirill Kalinin · Natalia Berloff
Sat 9:25 a.m. - 10:30 a.m.
Biologically-plausible hierarchical chunking on mixed-signal neuromorphic hardware (Poster)
Chunking is a computational principle essential for memory compression, structural decomposition, and predictive processing. Humans seamlessly group perceptual sequences in units of chunks, parsed and memorized as separate entities. On an algorithmic level, computational models such as the Hierarchical Chunking Model (HCM) propose grouping proximal observational units as chunks, which resembles human chunk learning. Here we propose a biologically plausible and highly efficient implementation of the HCM: the neuromorphic HCM (nHCM). When parsing through perceptual sequences, the nHCM uses sparsely connected spiking neurons to construct hierarchical chunk representations in an event-driven way. Even when simulated on a standard computer, the nHCM showed remarkable improvement in speed, power consumption, and memory usage compared to its original counterpart. Then, we validate the model on mixed-signal neuromorphic hardware using recurrent spiking neural networks (SNN) with biologically plausible dynamics. We verified the robust computing properties of this implementation, overcoming the heterogeneity, variability, and low precision of the bio-plausible electronic analog circuits. With a successful implementation on both computers and neuromorphic processors, we show that the algorithm, and in general the neuromorphic co-design paradigm, is inherently efficient and robust. This work demonstrates cognitively-plausible sequence learning in energy-efficient dedicated neural computing electronic processing systems.
Atilla Schreiber · Shuchen Wu · Chenxi Wu · Giacomo Indiveri · Eric Schulz
Sat 9:25 a.m. - 10:30 a.m.
|
Frequency propagation: Multi-mechanism learning in nonlinear physical networks
(
Poster
)
>
link
We introduce frequency propagation, a learning algorithm for nonlinear physical networks. In a resistive electrical circuit with variable resistors, an activation current is applied at a set of input nodes at one frequency, and an error current is applied at a set of output nodes at another frequency. The voltage response of the circuit to these boundary currents is the superposition of an 'activation signal' and an 'error signal', whose coefficients can be read off at the two distinct frequencies in the frequency domain. Each conductance is updated proportionally to the product of the two coefficients. The learning rule is local and provably performs gradient descent on a loss function. We argue that frequency propagation is an instance of a multi-mechanism learning strategy for physical networks, be they resistive, elastic, or flow networks. Multi-mechanism learning strategies incorporate at least two physical quantities, potentially governed by independent physical mechanisms, to act as activation and error signals in the training process. Locally available information about these two signals is then used to update the trainable parameters to perform gradient descent. We demonstrate how earlier work implementing learning via chemical signaling in flow networks [1] also falls under the rubric of multi-mechanism learning. [1] V. Anisetti, B. Scellier, and J. M. Schwarz, "Learning by non-interfering feedback chemical signaling in physical networks," arXiv preprint arXiv:2203.12098, 2022. |
Vidyesh Anisetti · Ananth Kandala · Benjamin Scellier · J. M. Schwarz 🔗 |
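The readout step described in the abstract, recovering the activation and error coefficients at their respective frequencies and multiplying them to update a conductance, can be sketched for a single edge (a toy illustration under our own assumptions, not the paper's circuit solver; the frequencies, coefficients, and learning rate are made up):

```python
import numpy as np

# One edge: its voltage trace is the superposition of an 'activation'
# component at f1 and an 'error' component at f2.
f1, f2 = 5.0, 13.0          # distinct drive frequencies (assumed values)
a_true, e_true = 0.8, -0.3  # activation / error coefficients on this edge
t = np.linspace(0.0, 1.0, 4000, endpoint=False)
v = a_true * np.sin(2 * np.pi * f1 * t) + e_true * np.sin(2 * np.pi * f2 * t)

def coeff(v, f, t):
    """Recover the amplitude of the sin(2*pi*f*t) component by projection;
    over an integer number of periods this is exact."""
    return 2.0 * np.mean(v * np.sin(2 * np.pi * f * t))

a_hat = coeff(v, f1, t)   # activation coefficient read at f1
e_hat = coeff(v, f2, t)   # error coefficient read at f2

lr, g = 0.1, 1.0
g += lr * a_hat * e_hat   # local conductance update: product of the two
```

Because the two signals live at separate frequencies, the same physical wire carries both, and each edge needs only its own local trace to compute its update.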
Sat 9:25 a.m. - 10:30 a.m.
|
The Benefits of Self-Supervised Learning for Training Physical Neural Networks
(
Poster
)
>
link
Physical Neural Networks (PNNs) are energy-efficient alternatives to their digital counterparts. Because they are inherently variable, noisy, and hardly differentiable, PNNs require tailored training methods. Additionally, while the properties of PNNs make them good candidates for edge computing, where memory and computational resources are constrained, most algorithms developed for training PNNs focus on supervised learning, even though labeled data may not be available at the edge. Here, we propose Self-Supervised Learning (SSL) as an ideal framework for training PNNs (we focus on computer vision tasks): 1. SSL eliminates the reliance on labeled data, and 2. because SSL forces the network to extract high-level concepts, networks trained with SSL should be highly robust to noise and device variability. We investigate and show with simulations that the latter properties effectively emerge when a network is trained on MNIST in the SSL setting, but not when it is trained in a supervised manner. We also show empirically that we can optimize layer-wise SSL objectives rather than a single global one while still matching the performance of global optimization on MNIST and CIFAR-10. This could allow local learning without backpropagation at all, especially in the stochastic-optimization scheme we propose. We expect this preliminary work, based on simulations, to pave the way for a robust paradigm for training PNNs and hope to stimulate interest in the community of unconventional computing and beyond. |
Jeremie Laydevant · Peter McMahon · Davide Venturelli · Paul Lott 🔗 |
Sat 9:25 a.m. - 10:30 a.m.
|
Towards low power cognitive load analysis using EEG signal: A neuromorphic computing approach
(
Poster
)
>
link
Real-time on-device cognitive load assessment using EEG is very useful for applications such as brain-computer interfaces, robotics, and adaptive learning. Existing deep-learning-based models can achieve high accuracy, but their large memory and energy requirements prevent deployment on battery-driven, low-compute, low-memory edge devices such as wearable EEG devices. In this paper, we use brain-inspired spiking neural networks and neuromorphic computing paradigms, which promise at least $10^4\times$ lower energy requirements than existing solutions. We design two different spiking network architectures and test them on two publicly available cognitive load datasets (EEGMAT \& STEW). We achieve accuracy comparable to existing approaches without performing any artifact removal from the EEG signal. Our model offers $\sim8\times$ lower memory requirements and $\sim10^3\times$ lower computational cost, and consumes at most 0.33 $\mu$J of energy per inference.
|
Dighanchal Banerjee · Sounak Dey · Debatri Chatterjee · Arpan Pal 🔗 |
Sat 9:25 a.m. - 10:30 a.m.
|
Neuromorphic Co-Design as a Game
(
Poster
)
>
link
Co-design is currently a prominent topic in computing, speaking to the mutual benefit of coordinating the design choices of several layers in the technology stack. For example, this may mean designing algorithms that most efficiently exploit the acceleration properties of a given architecture, while simultaneously designing the hardware to support the structural needs of a class of computation. The implications of these design decisions are influential enough to be deemed a lottery, enabling an idea to win out over others irrespective of its individual merits. Coordination is a well-studied topic in the mathematics of game theory, where in many cases the outcome is sub-optimal without a coordination mechanism. Here we consider what insights game-theoretic analysis can offer for computer architecture co-design. In particular, we consider the interplay between algorithm and architecture advances in the field of neuromorphic computing. Analyzing the development of spiking neural network algorithms and neuromorphic hardware as a co-design game, we use the Stag Hunt model to illustrate the challenges spiking algorithms or architectures face in advancing the field independently, and we advocate for a strategic pursuit to advance neuromorphic computing. |
Craig M Vineyard · William Severa · Brad Aimone 🔗 |
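The Stag Hunt structure invoked above is easy to make concrete: coordinating on co-design pays best, but advancing independently is the safer bet, so both outcomes are self-reinforcing. A minimal sketch with illustrative payoff numbers of our own choosing (not taken from the paper):

```python
import numpy as np

# Symmetric Stag Hunt: rows = algorithm designer, cols = hardware designer.
# Strategy 0 = "coordinate on neuromorphic co-design" (hunt the stag),
# strategy 1 = "advance independently" (hunt the hare).
payoff = np.array([[4, 0],
                   [3, 2]])   # row player's payoffs (assumed values)

def best_response(payoff, opponent_strategy):
    """Index of the row player's best reply to the opponent's pure strategy."""
    return int(np.argmax(payoff[:, opponent_strategy]))

# Both (stag, stag) and (hare, hare) are pure Nash equilibria:
# each strategy is a best response to itself.
stag_nash = best_response(payoff, 0) == 0
hare_nash = best_response(payoff, 1) == 1
```

The co-design argument is that the field can get stuck in the risk-dominant (hare, hare) equilibrium unless a coordination mechanism moves both players to the payoff-dominant one.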
Sat 9:25 a.m. - 10:30 a.m.
|
Device Codesign using Reinforcement Learning and Evolutionary Optimization
(
Poster
)
>
link
Device discovery and circuit modeling for emerging devices, such as magnetic tunnel junctions, require detailed and time-consuming device and circuit simulations. In this work, we propose using AI-guided techniques such as reinforcement learning and evolutionary optimization to accelerate device discovery, broaden the creativity of solutions, and automate optimization in designing true random number generators for a given distribution. We present preliminary results on designing true random number generators using magnetic tunnel junctions optimized for performance. |
Catherine Schuman · Suma Cardwell · Karan Patel · J. Smith · Jared Arzate · Andrew Maicke · Samuel Liu · Jaesuk Kwon · Jean Anne Incorvia 🔗 |
Sat 9:25 a.m. - 10:30 a.m.
|
Mean-Field Assisted Deep Boltzmann Learning with Probabilistic Computers
(
Poster
)
>
link
Despite their appeal as physics-inspired, energy-based generative models, general Boltzmann Machines (BMs) are considered intractable to train. This belief has led to simplified BM variants with restricted intralayer connections, or to layer-by-layer training of deep BMs. Recent developments in domain-specific hardware -- specifically probabilistic computers (p-computers) with probabilistic bits (p-bits) -- may change established wisdom on the tractability of deep BMs. In this paper, we show that deep and unrestricted BMs can be trained using p-computers generating hundreds of billions of Markov Chain Monte Carlo (MCMC) samples per second, on sparse networks originally developed for use in D-Wave's annealers. To maximize the efficiency of learning on the p-computer, we introduce two families of Mean-Field Theory assisted learning algorithms, or xMFTs (x = Naive and Hierarchical). The xMFTs estimate the averages and correlations during the positive phase of the contrastive divergence (CD) algorithm, and our custom-designed p-computer estimates the averages and correlations in the negative phase. A custom Field-Programmable Gate Array (FPGA) emulation of the p-computer architecture performs up to 45 billion flips per second, allowing the implementation of CD-$n$ where $n$ can be on the order of millions, unlike RBMs where $n$ is typically 1 or 2. Experiments on the full MNIST dataset with the combined algorithm show that the positive phase can be computed efficiently by xMFTs without much degradation when the negative phase is computed by the p-computer. Our algorithm can be used in other scalable Ising machines, and its variants can be used to train BMs previously thought to be intractable.
|
Shuvro Chowdhury · Shaila Niazi · Kerem Camsari 🔗 |
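The naive mean-field estimate used for the positive phase can be sketched as a damped fixed-point iteration on the magnetizations (our generic illustration of naive MFT for $\pm 1$ spins, not the authors' implementation; couplings and sizes are random assumed values):

```python
import numpy as np

def naive_mft(W, b, iters=200, damping=0.5):
    """Naive mean-field fixed point for a Boltzmann machine with +/-1 spins:
    m_i = tanh(b_i + sum_j W_ij m_j). Returns approximate magnetizations
    (averages); their outer product approximates the pair correlations."""
    m = np.zeros_like(b)
    for _ in range(iters):
        m = (1 - damping) * m + damping * np.tanh(b + W @ m)
    return m

rng = np.random.default_rng(0)
n = 10
W = rng.standard_normal((n, n)) * 0.1
W = (W + W.T) / 2            # symmetric couplings
np.fill_diagonal(W, 0.0)     # no self-coupling
b = rng.standard_normal(n) * 0.1
m = naive_mft(W, b)
corr = np.outer(m, m)        # mean-field estimate of <s_i s_j>, i != j
```

In the paper's scheme such cheap deterministic estimates replace sampling in the positive phase, while the p-computer's massive MCMC throughput is reserved for the harder negative phase.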
Sat 9:25 a.m. - 10:30 a.m.
|
Emergent learning in physical systems as feedback-based aging in a glassy landscape
(
Poster
)
>
link
By training linear physical networks to learn linear transformations, we discern how their physical properties evolve under weight-update rules. Our findings highlight a striking similarity between the learning behaviors of such networks and the processes of aging and memory formation in disordered and glassy systems. We show that the learning dynamics resembles an aging process, where the system relaxes in response to repeated application of the feedback boundary forces in the presence of an input force, thereby encoding a memory of the input-output relationship. With this relaxation comes an increase in the correlation length, as indicated by the two-point correlation function for the components of the network. We also observe that the square root of the mean-squared error as a function of epoch takes on a non-exponential form, a typical feature of glassy systems. This physical interpretation suggests that by encoding more detailed information into the input and feedback boundary forces, emergent learning can be rather ubiquitous and thus serve as a very early physical mechanism, from an evolutionary standpoint, for learning in biological systems. |
Vidyesh Anisetti · Ananth Kandala · Jennifer Schwarz 🔗 |
Sat 9:25 a.m. - 10:30 a.m.
|
Unleashing Hyperdimensional Computing with Nyström Method based Data Adaptive Encoding
(
Poster
)
>
link
Hyperdimensional Computing (HDC) performs machine learning tasks by first encoding data into high-dimensional distributed representations called hypervectors. Learning tasks can then be performed on those hypervectors with a set of computationally efficient and simple operations. HDC has gained significant attention in recent years due to its excellent hardware efficiency. The core of every HDC algorithm is the encoding function, which determines the expressive power of the hypervectors and is thus the critical bottleneck for performance. However, existing HDC encoding methods are task-dependent and often capture only a very basic notion of similarity, which can limit the accuracy of HDC models. To unleash the potential of HDC on arbitrary tasks, we propose a novel encoding method inspired by the Nyström method for kernel approximation. Our approach generates an encoding function that approximates any user-defined positive-definite similarity function on the data via dot products between encodings in HD-space. This allows HDC to tackle a broader range of tasks with better learning accuracy while still retaining its hardware efficiency. We empirically evaluate our proposed encoding method against existing HDC encoding methods commonly used in various classification tasks. Our results show that our encoding method achieves better accuracy across a variety of learning tasks. On graph and string datasets, our method achieves 10\%-37\% and 3\%-18\% better classification accuracy, respectively. |
Quanling Zhao · Anthony Thomas · Xiaofan Yu · Tajana S Rosing 🔗 |
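The core mechanism, building an encoder whose dot products approximate a chosen positive-definite similarity, follows the standard Nyström construction: project onto kernel values at landmark points and whiten with $K_{mm}^{-1/2}$. A generic sketch (our illustration of the textbook Nyström method, not the paper's exact encoder; the RBF kernel, bandwidth, and landmark count are assumed choices):

```python
import numpy as np

def rbf(X, Y, gamma=0.1):
    """A user-defined positive-definite similarity (here an RBF kernel)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_encoder(landmarks, kernel):
    """Build an encoding phi such that phi(x) . phi(y) ~= kernel(x, y),
    by whitening kernel values at the landmarks with K_mm^{-1/2}."""
    K_mm = kernel(landmarks, landmarks)
    vals, vecs = np.linalg.eigh(K_mm)
    vals = np.clip(vals, 1e-10, None)             # guard tiny eigenvalues
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T     # K_mm^{-1/2}
    return lambda X: kernel(X, landmarks) @ W

rng = np.random.default_rng(0)
landmarks = rng.standard_normal((50, 5))   # m landmark points
encode = nystrom_encoder(landmarks, rbf)

X = rng.standard_normal((8, 5))
Phi = encode(X)                # hypervector-style encodings
approx = Phi @ Phi.T           # dot products in encoding space
exact = rbf(X, X)              # target similarity matrix
```

The encoding dimension equals the number of landmarks, so accuracy can be traded against hypervector size simply by choosing more or fewer landmarks.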
Sat 9:25 a.m. - 10:30 a.m.
|
Thermodynamic AI and Thermodynamic Linear Algebra
(
Poster
)
>
link
Many Artificial Intelligence (AI) algorithms are inspired by physics and employ stochastic fluctuations, such as generative diffusion models, Bayesian neural networks, and Monte Carlo inference. These algorithms are currently run on digital hardware, ultimately limiting their scalability and overall potential. Here, we propose a novel computing device, called Thermodynamic AI hardware, that could accelerate such algorithms. Thermodynamic AI hardware can be viewed as a novel form of computing, since it uses novel fundamental building blocks, called stochastic units (s-units), which naturally evolve over time via stochastic trajectories. In addition to these s-units, Thermodynamic AI hardware employs a Maxwell's demon device that guides the system to produce non-trivial states. We provide a few simple physical architectures for building these devices, such as RC electrical circuits. Moreover, we show that this same hardware can be used to accelerate various linear algebra primitives. We present simple thermodynamic algorithms for (1) solving linear systems of equations, (2) computing matrix inverses, (3) computing matrix determinants, and (4) solving Lyapunov equations. Under reasonable assumptions, we rigorously establish asymptotic speedups for our algorithms, relative to digital methods, that scale linearly in dimension. Numerical simulations also suggest a speedup is achievable in practical scenarios. |
Patrick Coles · Maxwell Aifer · Kaelan Donatella · Denis Melanson · Max Hunter Gordon · Thomas Ahle · Daniel Simpson · Gavin Crooks · Antonio Martinez · Faris Sbahi 🔗 |
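One of the linear algebra primitives mentioned above, solving $Ax = b$, has a compact thermodynamic interpretation: an overdamped stochastic system with drift $b - Ax$ has stationary mean $A^{-1}b$, so time-averaging its trajectory yields the solution. A digital simulation of that idea (our sketch of the general principle, not the authors' hardware or algorithm; step size, noise level, and dimensions are assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)      # symmetric positive definite system
b = rng.standard_normal(n)

# Euler-Maruyama simulation of dx = (b - A x) dt + sqrt(2 T) dW:
# an Ornstein-Uhlenbeck process whose stationary mean is A^{-1} b.
dt, steps, burn_in, T = 0.01, 20000, 2000, 0.1
x = np.zeros(n)
samples = []
for step in range(steps):
    x = x + dt * (b - A @ x) + np.sqrt(2 * T * dt) * rng.standard_normal(n)
    if step >= burn_in:          # discard the transient before averaging
        samples.append(x.copy())

x_est = np.mean(samples, axis=0)   # time average ~= solution of A x = b
x_exact = np.linalg.solve(A, b)
```

On physical hardware the "simulation" is free: the device relaxes on its own, which is where the claimed asymptotic speedups over digital methods come from.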
Sat 9:25 a.m. - 10:30 a.m.
|
Nonlinear Classification Without a Processor
(
Poster
)
>
link
Computers, as well as most neuromorphic hardware systems, use central processing and top-down algorithmic control to train for machine learning tasks. In contrast, brains are ensembles of 100 billion neurons working in tandem, giving them tremendous advantages in power efficiency and speed. Many physical systems `learn' through history dependence, but training a physical system to perform arbitrary nonlinear tasks without a processor has not been possible. Here we demonstrate the successful implementation of such a system - a learning meta-material. This nonlinear analog circuit is composed of identical copies of a single simple element, each following the same local update rule. By applying voltages to our system (inputs), inference is performed by physics in microseconds. When labels are properly enforced (also via voltages), the system's internal state evolves in time, approximating gradient descent. Our system $\textit{learns on its own}$; it requires no processor. Once trained, it performs inference passively, requiring approximately 100~$\mu$W of total power dissipation across its edges. We demonstrate the flexibility and power efficiency of our system by solving nonlinear 2D classification tasks. Learning meta-materials have immense potential as fast, efficient, robust learning systems for edge computing, from smart sensors to medical devices to robotic control.
|
Sam Dillavou · Benjamin Beyer · Menachem Stern · Marc Miskin · Andrea Liu · Douglas Durian 🔗 |
Sat 9:25 a.m. - 10:30 a.m.
|
Neural Deep Operator Networks representation of Coherent Ising Machine Dynamics
(
Poster
)
>
link
Coherent Ising Machines (CIMs) are optical devices that employ parametric oscillators to tackle binary optimization problems; their simplified dynamics are described by a series of coupled ordinary differential equations. In this study, we learn the deterministic dynamics of CIMs using neural Deep Operator Networks (DeepONets). After successfully training the system over multiple initial conditions and problem instances, we benchmark the comparative performance of the neural network. In our tests, the network delivers solutions of quality comparable to the exact dynamics for up to 175 spins, and we identify no roadblocks to going further: given sufficient training resources, CIM solvers could be successfully represented by a neural network at large scale. |
Arsalan Taassob · Davide Venturelli · Paul Lott 🔗 |
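The DeepONet architecture referenced above pairs a branch net, which encodes the input function sampled at fixed sensor points, with a trunk net, which encodes the query coordinate; the operator output is the dot product of the two embeddings. A minimal untrained forward-pass sketch (our generic illustration of the architecture, not the paper's model; all sizes and weights are assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, width = 32, 16, 64     # sensor points, embedding size, hidden width

def mlp_params(sizes, rng):
    """Random weights/biases for a small tanh MLP with the given layer sizes."""
    return [(rng.standard_normal((a, b)) / np.sqrt(a), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.tanh(x)
    return x

branch = mlp_params([m, width, p], rng)   # encodes the sampled input function
trunk = mlp_params([1, width, p], rng)    # encodes the query coordinate

u = np.sin(np.linspace(0, np.pi, m))      # input function at the sensor points
y = np.array([[0.25]])                    # query coordinate
G_uy = mlp(branch, u[None, :]) @ mlp(trunk, y).T   # operator output G(u)(y)
```

For the CIM application, the input function would be an initial condition/problem instance and the trunk coordinate the simulation time, so one trained network replaces many ODE integrations.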
Sat 9:25 a.m. - 10:30 a.m.
|
Exploiting Symmetric Temporally Sparse BPTT for Efficient RNN Training
(
Poster
)
>
link
Recurrent Neural Networks (RNNs) are useful for temporal sequence tasks. However, training RNNs involves dense matrix multiplications, which require hardware that can support a large number of arithmetic operations and memory accesses. Implementing online training of RNNs on the edge calls for algorithms optimized for efficient deployment on hardware. Inspired by the spiking neuron model, the Delta RNN exploits temporal sparsity during inference by skipping the update of hidden states for inactivated neurons, i.e., those whose change of activation across two timesteps is below a defined threshold. This work describes a training algorithm for Delta RNNs that exploits temporal sparsity in the backward propagation phase to reduce the computational requirements of training on the edge. Because the computation graphs of forward and backward propagation are symmetric during training, the gradient computation of inactivated neurons can be skipped. Results show a reduction of ∼80% in matrix operations for training a 56k-parameter Delta LSTM on the Fluent Speech Commands dataset with negligible accuracy loss. Logic simulations of a hardware accelerator designed for the training algorithm show a 2-10X speedup in matrix computations for an activation sparsity range of 50%-90%. Additionally, we show that the proposed Delta RNN training will be useful for online incremental learning on edge devices with limited computing resources. |
Xi Chen · Chang Gao · Zuowen Wang · Longbiao Cheng · Sheng Zhou · Shih-Chii Liu · Tobi Delbruck 🔗 |
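The thresholded-update mechanism at the heart of the Delta RNN can be sketched in isolation: a neuron propagates an update only when its activation has drifted more than a threshold from the last value it propagated (our simplified illustration with synthetic slowly varying states; threshold and sizes are assumed values):

```python
import numpy as np

def delta_updates(h_seq, threshold=0.1):
    """Boolean mask of which neuron updates a Delta RNN would propagate:
    a neuron 'fires' only when its activation has changed by more than
    `threshold` since its last propagated value."""
    last = np.zeros_like(h_seq[0])          # last propagated activations
    mask = np.zeros(h_seq.shape, dtype=bool)
    for t, h in enumerate(h_seq):
        changed = np.abs(h - last) > threshold
        mask[t] = changed
        last = np.where(changed, h, last)   # update memory only where fired
    return mask

rng = np.random.default_rng(0)
# Slowly varying hidden states: small increments keep most per-step
# changes below threshold, so most updates are skipped.
h_seq = np.cumsum(0.02 * rng.standard_normal((100, 32)), axis=0)
mask = delta_updates(h_seq, threshold=0.1)
sparsity = 1.0 - mask.mean()   # fraction of skipped updates
```

Because the forward and backward graphs share this mask, the same skipped columns translate directly into skipped gradient computations during training, which is the source of the reported ∼80% reduction.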
Sat 9:25 a.m. - 10:30 a.m.
|
Poster Session I
(
Poster Session
)
>
|
🔗 |
Sat 10:30 a.m. - 11:30 a.m.
|
Lunch Break
(
Break
)
>
|
🔗 |
Sat 11:30 a.m. - 11:50 a.m.
|
The Optimization Conundrum: Balancing Data, Speed, and Accuracy in Real-Time Machine Learning
(
Invited Talk
)
>
link
SlidesLive Video |
Mihaela van der Schaar 🔗 |
Sat 11:50 a.m. - 12:00 p.m.
|
A Lagrangian Perspective on Dual Propagation
(
Oral
)
>
link
SlidesLive Video The search for "biologically plausible" learning algorithms has converged on the idea of representing gradients as activity differences. However, most approaches require a high degree of synchronization (distinct phases during learning) and introduce high computational overhead, which raises doubt regarding their biological plausibility as well as their potential usefulness for neuromorphic computing. Furthermore, they commonly rely on applying infinitesimal perturbations (nudges) to output units, which is impractical in noisy environments. Recently it has been shown that by modelling artificial neurons as dyads with two oppositely nudged compartments, it is possible for a fully local learning algorithm to bridge the performance gap to backpropagation, without requiring separate learning phases, while also being compatible with significant levels of nudging. However, the algorithm, called dual propagation, has the drawback that convergence of its inference method relies on symmetric nudging of the output units, which may be infeasible in biological and analog implementations. Starting from a modified version of LeCun's Lagrangian approach to backpropagation, we derive a slightly altered variant of dual propagation, which is robust to asymmetric nudging. |
Rasmus Høier · Christopher Zach 🔗 |
Sat 12:00 p.m. - 12:10 p.m.
|
Scaling of Optical Transformers
(
Oral
)
>
link
SlidesLive Video The rapidly increasing size of deep-learning models has renewed interest in alternatives to digital-electronic computers as a means to dramatically reduce the energy cost of running state-of-the-art neural networks. Optical matrix-vector multipliers are best suited to performing computations with very large operands, which suggests that large Transformer models could be a good target for them. However, the ability of optical accelerators to run efficiently depends on the model being run, and on whether the model can be run at all when subject to the noise, error, and low precision of analog-optical hardware. Here we investigate whether Transformers meet the criteria to run efficiently on optical hardware, what benefits doing so can bring, and how worthwhile it is at scale. Using small-scale experiments on, and simulations of, a prototype hardware accelerator, we found that Transformers can run on optical hardware, and that elements of their design --- the ability to parallel-process data using the same weights, and trends in scaling them to enormous widths --- allow them to achieve an asymptotic energy-efficiency advantage running optically compared to on digital hardware. Based on a model of a full optical accelerator system, we predict that well-engineered, large-scale optical hardware should be able to achieve a 100× energy-efficiency advantage over current digital-electronic processors in running some of the largest current Transformer models, and if both the models and the optical hardware are scaled to the quadrillion-parameter regime, optical accelerators could have a > 8,000× energy-efficiency advantage. |
Maxwell Anderson · Shi-Yuan Ma · Tianyu Wang · Logan Wright · Peter McMahon 🔗 |
Sat 12:10 p.m. - 12:30 p.m.
|
Low-Precision & Analog Computational Techniques for Sustainable & Accurate AI Inference & Training
(
Invited Talk
)
>
link
SlidesLive Video The recent rise of Generative AI has led to a dramatic increase in the sizes and computational needs for AI models. This compute explosion has raised serious cost and sustainability concerns in both the training & deployment phases of these large models. Low-precision techniques, that lower the precision of the weights, activations, and gradients, have been successfully employed to reduce the training-precision from 32-bits down to 8-bits (FP8) and the inference-precision down to 4-bits (INT4). These advances have enabled more than a 10-fold improvement in compute efficiency over the past decade – however, it is expected that further gains may be limited. Recent developments in analog computational techniques offer the promise of achieving an additional 10-100X enhancement in crucial metrics, including energy efficiency and computational density. In this presentation, we will provide an overview of these significant recent breakthroughs, which are likely to play a pivotal role in advancing Generative AI and making it more sustainable and accessible to a wider audience. |
Kailash Gopalakrishnan 🔗 |
Sat 12:30 p.m. - 2:00 p.m.
|
Poster Session II
(
Poster Session
)
>
|
🔗 |
Sat 2:00 p.m. - 3:00 p.m.
|
Panel Discussion
(
Panel
)
>
SlidesLive Video |
🔗 |