Expo
GRAID: Synthetic Data Generation with Geometric Constraints and Multi-Agentic Reflection for Harmful Content Detection
The rapid integration of LLMs for synthetic data generation offers a powerful solution for data scarcity, a critical issue for training machine learning models. However, the resulting data quality is often questionable, requiring costly and time-consuming manual review. This need for oversight is especially challenging in the crucial domain of trustworthy and safe AI.
A significant need here is the automated identification of adversarial and harmful inputs, a process known as red-teaming, to improve guardrailing systems before deployment. Developing an effective, fully automated red-teaming approach capable of generating diverse, out-of-domain harmful content has been a long-standing challenge.
To address data scarcity in harmful text detection and the challenge of automated red-teaming, we introduce GRAID (Geometric and Reflective AI-Driven Data Augmentation), a novel, versatile, and dynamic multi-agent framework.
GRAID operates in two stages:
1) Geometric Generation: A constrained, fine-tuned LLM generates geometrically controlled examples to diversify content within the original embedding space.
2) Reflective Augmentation: A multi-agentic reflective process promotes stylistic diversity and uncovers difficult edge cases.
This combination ensures both reliable coverage of the input space and nuanced exploration of harmful content. We demonstrate that augmenting a harmful text classification dataset with GRAID significantly improves the performance of downstream real-world guardrail models. Furthermore, GRAID captures data variability in new geometric domains while preserving data relationships.
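As a rough illustration of what stage 1's geometric control can look like, the sketch below samples synthetic points as convex combinations of existing class embeddings, so that they stay inside the region of embedding space the original data covers. The sampling rule and function names are our own illustration, not GRAID's actual algorithm.

```python
import numpy as np

def sample_in_hull(embeddings: np.ndarray, n_samples: int, seed: int = 0) -> np.ndarray:
    """Sample new points inside the convex hull of existing class embeddings.

    Illustrative stand-in for geometrically controlled generation: each
    synthetic point is a random convex combination of real examples, so it
    remains within the region the original data already spans.
    """
    rng = np.random.default_rng(seed)
    n, _ = embeddings.shape
    # Dirichlet weights sum to 1, guaranteeing a convex combination.
    weights = rng.dirichlet(alpha=np.ones(n), size=n_samples)
    return weights @ embeddings

# Toy usage: 100 harmful-class embeddings in a 768-dim space.
class_embs = np.random.randn(100, 768)
synthetic = sample_in_hull(class_embs, n_samples=10)
print(synthetic.shape)  # (10, 768)
```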
While initially focused on harmful text detection, GRAID’s modular design makes it an inherently domain-agnostic framework adaptable to various applications beyond classification. In this talk, we will detail how GRAID distinguishes itself from existing solutions, discuss its building blocks, and share insights on its easy adaptation for diverse synthetic data generation needs.
Multimodal Data Foundation at Industry-Scale
Pre-training is fundamental to foundation models, enabling them to acquire the broad knowledge that gives rise to emergent capabilities at later training stages, and scaling is the key to pre-training. In this talk, we present a recipe for building and curating multimodal image-text paired pre-training data from scratch on a global scale, enabling mutual benefits between English and non-English data. We would like to share our key observations and insights with the community on: (1) why scaling matters, including the foundational role of data and the key principles to hold to when scaling; (2) how to design simple yet scalable data algorithms that enable industry-scale data collection and training without data filters, serving both research and production needs; and (3) how this scaling improves Meta’s products in both conventional and frontier machine learning areas.
Data Scout: “From Prompt to Corpus: Accelerating Domain-Specific Data Collection with LLM-Guided Discovery”
Training large language models for specialized disciplines such as advanced mathematics, molecular biology, or legal reasoning is limited by the scarcity of large, high-quality, domain-specific corpora. Most publicly available datasets are dominated by general-purpose web text. When available, specialized data are fragmented across diverse sources such as preprints, conference papers, forums, lecture notes, and digitized books. No single source offers comprehensive real-world coverage across scientific domains. Consequently, scaling up authentic domain data remains a bottleneck: collecting a subset of relevant tokens often requires downloading and filtering hundreds of terabytes of raw web material, a process that is both time consuming and costly.
We introduce Data Scout, a modular, LLM-powered pipeline that turns a high-level user intent (e.g., “I need data for advanced mathematics”) into a vetted list of seed URLs in minutes. The system first expands the original intent using an LLM that generates a hierarchical subgraph of related concepts; this taxonomy drives a diversified set of search queries that systematically cover the target domain while respecting known licensing signals. Candidate URLs are then filtered by the same LLM using chain-of-thought prompting based on topical relevance, licensing clarity, and crawlability. Our results show that crawling the selected candidate URLs yields a high percentage of relevant pages (40%+) for the user's intended topic, compared to less than 1% in general web-scale corpora. Data Scout is available with both CLI and GUI front ends. By democratizing domain-specific data acquisition, Data Scout enables researchers without dedicated crawling infrastructure to bootstrap large, high-fidelity corpora, accelerating the development of specialized LLMs across niche domains.
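A condensed sketch of the three stages described above (intent expansion, query diversification, chain-of-thought URL filtering). The `llm` callable, prompts, and helper names are illustrative assumptions, not Data Scout's actual API.

```python
import json
from typing import Callable

def expand_intent(llm: Callable[[str], str], intent: str, depth: int = 2) -> list[str]:
    """Ask the LLM for a shallow taxonomy of subtopics under the user's intent."""
    prompt = (
        f"List subtopics (up to depth {depth}) under: '{intent}'. "
        "Return a JSON array of strings."
    )
    return json.loads(llm(prompt))

def build_queries(subtopics: list[str]) -> list[str]:
    """Turn each taxonomy node into several diversified search queries."""
    templates = ["{} lecture notes", "{} open access papers", "{} problem sets"]
    return [t.format(s) for s in subtopics for t in templates]

def filter_urls(llm: Callable[[str], str], intent: str, candidates: list[str]) -> list[str]:
    """Keep URLs the LLM judges relevant, clearly licensed, and crawlable."""
    kept = []
    for url in candidates:
        verdict = llm(
            f"Intent: {intent}\nURL: {url}\n"
            "Think step by step about topical relevance, licensing clarity, "
            "and crawlability, then answer KEEP or DROP on the last line."
        )
        if verdict.strip().splitlines()[-1].strip().upper() == "KEEP":
            kept.append(url)
    return kept
```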
Toward General Full Autonomy: Open Research Challenges in Scalable Self-Driving
Speaker: Ben Snyder, Director of AI Research for Autonomous Vehicles, General Motors
GM’s driver-assistance systems now power millions of vehicles across North America, logging over half a billion hands-free miles with zero reported crashes—demonstrating the safety and scalability of autonomy at unprecedented consumer scale. Following GM’s acquisition of Cruise earlier this year, the combined teams now bring together a decade of experience to accelerate progress toward full, generalized autonomy. This talk will dive into the open research challenges ahead—from model architecture and the balance between imitation and reinforcement learning, to leveraging vision-language models for long-tail, common-sense reasoning, and advancing the training, deployment, and simulation infrastructure needed to scale truly generalized autonomy.
Telling Stories at Scale: Multimodal ML in the Global Media Landscape
Netflix has long been recognized as a pioneer in personalization, leveraging member preference data to recommend the most engaging shows, movies, and games. In recent years, we have expanded our use of machine learning and data-driven approaches to support a wide array of upstream creative and operational workflows. In this talk, we will discuss how modern AI methods – contrastive learning, transformers, cross-modal retrieval, graph neural networks, etc. – are transforming the curation, localization, promotion, and launch of stories on a global scale. A distinctive aspect of our work is the integration of highly creative assets—text, images, video, and speech—alongside traditional tabular datasets. We will highlight unique challenges that arise at the intersection of multimedia, personalization, and web-scale products, and share how advanced ML/AI techniques are addressing these challenges to connect great stories with our worldwide audience.
Beyond Benchmarks: Rethinking Reasoning in Language Models
Reasoning is often described as the next frontier for AI, but what does it really mean for a model to “reason”, and how should we measure it? Popular benchmarks like GSM8K suggest steady progress, yet controlled studies reveal that models can fail dramatically under small changes—such as swapping numbers or adding irrelevant details. Large Reasoning Models (LRMs), which generate explicit chains of thought, raise new optimism but also expose clear limits: they often underperform standard models on simple tasks, improve briefly at medium complexity, and then collapse on harder ones despite having unused compute. Crucially, reasoning is not the same as knowledge recall, tool use, or agent-like behavior. True reasoning involves solving novel problems, decomposing them into steps, generalizing to new contexts, recombining partial results, and finally generating novel hypotheses—capabilities current systems largely lack. Today’s evaluations, focused on final answers and contaminated benchmarks, risk giving a misleading sense of progress. This talk will provide a critical review of reasoning in language models, highlight why current evaluations can be deceptive, and emphasize that reasoning is not just about “what” models answer, but “how” they solve problems.
Foundational Generative Recommendations for E-Commerce
Modern commerce platforms face the challenge of delivering personalized recommendations across billions of items to users with diverse intents, temporal dynamics, and cold-start scenarios. We present a generative foundation model for commerce built on Hierarchical Sequential Transduction Units (HSTU) that integrates Liquid Foundation Models (LFM) and custom CUDA kernels developed in collaboration with Nvidia for efficient training and online serving. Our approach demonstrates that generative methods unlock substantial gains through three key innovations: (1) large-scale contrastive learning with hard negative sampling; (2) temporal mechanisms that fuse multi-scale time signals (session, day, season) with commerce-specific features; and (3) optimized training and inference kernels. While results are promising, significant challenges remain in handling non-stationary preferences, growing product catalogue, and multi-objective optimization—we discuss our roadmap toward truly foundational commerce models that generalize across domains and market conditions.
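As one concrete instance of the contrastive objective in (1), the sketch below shows an InfoNCE-style loss where each user representation is contrasted against one positive item and a set of mined hard negatives. Tensor shapes, the temperature, and function names are illustrative; this is not the production training code.

```python
import torch
import torch.nn.functional as F

def infonce_with_hard_negatives(user, pos_item, hard_negs, temperature=0.07):
    """InfoNCE loss with mined hard negatives (illustrative sketch).

    user:      (B, D) user/sequence embeddings
    pos_item:  (B, D) embeddings of the interacted item
    hard_negs: (B, K, D) embeddings of K hard negatives per user
    """
    user = F.normalize(user, dim=-1)
    pos_item = F.normalize(pos_item, dim=-1)
    hard_negs = F.normalize(hard_negs, dim=-1)

    pos_logit = (user * pos_item).sum(-1, keepdim=True)        # (B, 1)
    neg_logits = torch.einsum("bd,bkd->bk", user, hard_negs)   # (B, K)
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature
    labels = torch.zeros(user.size(0), dtype=torch.long)       # positive sits at index 0
    return F.cross_entropy(logits, labels)

# Toy usage with random embeddings.
B, K, D = 32, 8, 64
loss = infonce_with_hard_negatives(
    torch.randn(B, D), torch.randn(B, D), torch.randn(B, K, D))
print(loss.item())
```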
Pushing the boundaries of chemical synthesis with RetroChimera
Retrosynthesis - the task of planning chemical reaction recipes to synthesize complex molecules - remains a bottleneck in the discovery of novel pharmaceuticals. We recently released RetroChimera, a model for predicting chemical reactions, which demonstrated robustness well outside its training distribution by transferring zero-shot to internal reaction data at a major pharmaceutical company. We also found that industrial organic chemists prefer predictions from RetroChimera over real patented reactions in terms of quality, revealing a high degree of alignment. In this demo, we will showcase the model, let attendees query it live, and show them how to interpret the results.
BeeAI
The BeeAI Framework is an open-source project for building reliable AI agents that combine autonomy with control. Current agent frameworks focus primarily on prompting and orchestration, leaving critical questions of predictability and safety unaddressed. BeeAI fills this gap with a lightweight framework that enables developers to build agents whose reasoning abilities are preserved while execution is constrained by declarative, rule-based requirements. At the core of the framework is the RequirementAgent, a novel agent design that enforces deterministic, controlled behaviors across heterogeneous language models. With RequirementAgent, developers can ensure consistent and reliable execution patterns regardless of differences in model reasoning, tool-calling abilities, or stochastic variation. This approach provides practitioners with a unified abstraction layer that simplifies the deployment of complex AI systems into production settings. As an incubating Linux Foundation AI project, BeeAI is gaining adoption in open source and enterprise contexts as organizations seek robust ways to operationalize AI agents at scale. At NeurIPS EXPO, we will showcase BeeAI’s architecture, real-world use cases, and lessons learned from applying declarative control to agent autonomy.
Learning to Steer LLMs with AI Steerability 360 and In-Context Explainability 360
Current algorithms for aligning LLM behavior are often implemented for narrow settings, making it difficult for researchers and developers to understand their effectiveness across model architectures, datasets, and tasks. To help provide a more informed and principled approach to steering model behavior, we present the AI Steerability 360 (AISteer360) and In-Context Explainability 360 (ICX360) toolkits. Participants will first be guided through a conceptual overview of how model behavior can be influenced across four model control surfaces: input (prompting), structural (weights/architecture), state (activations/attentions), and output (decoding). After the conceptual overview, we will guide attendees through applying recently developed explainability tools (from ICX360) to understand why models produce given, potentially undesirable, outputs and how this information is used to design targeted steering interventions (via AISteer360). Closing the loop, we will evaluate whether the undesired baseline behavior (of the original, unsteered model) was successfully mitigated by the selected steering interventions and investigate whether steering introduced any unintended behavioral side effects. All of the experiments throughout the demonstration will be run solely with the tools in the two toolkits, illustrating their power for designing end-to-end steering workflows. Attendees will come away with a practical understanding of how to apply these toolkits to their own alignment challenges.
LLM-Powered Intelligent Data Engineering: From Workflow Design to Ingestion and Quality Assurance
Modern enterprises depend on efficient data engineering pipelines to unlock value from diverse and large-scale datasets. Yet, current processes for workflow design, schema ingestion, and data quality validation remain complex, error-prone, and dependent on technical expertise. This creates barriers for non-expert users, slows down development, and introduces risks of data inconsistency.
We present a suite of LLM-powered frameworks that reimagine enterprise data engineering across three critical dimensions: (i) From Natural Language to Executable ETL Flows, enabling intuitive pipeline creation with natural language specifications and automatic operator/property inference; (ii) All You Can Ingest, an end-to-end schema mapping and transformation framework that unifies semantic alignment, code synthesis, and robust validation; and (iii) Quality Assessment of Tabular Data, a scalable approach for auto-generating interpretable quality rules and executable validators tailored to specific datasets.
Together, these innovations demonstrate how Large Language Models (LLMs), augmented with retrieval, code synthesis, reasoning, and guardrails, can transform the data engineering lifecycle into a more accessible, adaptive, and trustworthy process, reducing manual effort, accelerating time-to-value, and ensuring data fidelity at enterprise scale.
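To make dimension (i) concrete, here is a minimal sketch of natural-language-to-ETL translation with a validation guardrail. The operator allowlist, prompt, and `llm` callable are illustrative assumptions, not the framework's actual interface.

```python
import json
from typing import Callable

# Only operators on this allowlist may appear in a generated plan.
ALLOWED_OPS = {"read_csv", "filter", "join", "aggregate", "write_parquet"}

def nl_to_etl(llm: Callable[[str], str], request: str) -> list[dict]:
    """Turn a natural-language request into a validated ETL operator list.

    Illustrative only: the real system infers operators and properties with
    retrieval and guardrails; here we just parse and validate a JSON plan.
    """
    prompt = (
        "Translate this request into a JSON list of ETL steps, each an "
        f"object with an 'op' and 'params': {request!r}"
    )
    plan = json.loads(llm(prompt))
    for step in plan:
        if step.get("op") not in ALLOWED_OPS:
            raise ValueError(f"Unsupported operator: {step.get('op')}")
    return plan
```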
Who Needs Attention Anyway? Elastic State Models for Real-Time Streaming Tasks
Large attention models are great for offline reasoning, but their cost grows with context and their behavior is hard to bound. For systems that must react under tight latency and safety constraints - robots, simulators, industrial and chemical processes - that compute model is a poor fit.
We explore an Elastic State Model (ESM): a streaming state-space backbone with a small geometric correction block that only “wakes up” when the dynamics get fragile. At each step, a fast SSM predicts the next state, estimates how sensitive it is to small perturbations, and - when needed - takes a few extra preconditioned steps in latent space to correct the trajectory. Compute stays cheap on easy stretches and increases only around junctions, shocks, and high-stakes events, keeping latency and compute per step tightly bounded and enabling online adaptation at inference time, without retraining.
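A minimal numerical sketch of the elastic step on a toy linear state-space model, assuming a finite-difference fragility probe and fixed-point refinement toward an implicit update. All names, thresholds, and the correction rule are our illustration rather than the actual ESM design.

```python
import numpy as np

def esm_step(A, B, u, x, dt=0.1, eps=1e-3, frag_threshold=5.0, max_extra=4):
    """One elastic step on a toy continuous-time linear SSM.

    Cheap path: a single explicit Euler step. Fragile path: a few extra
    fixed-point iterations toward the implicit (backward Euler) update,
    which is much more stable near stiff or fragile dynamics.
    """
    drift = lambda s: A @ s + B @ u
    x_next = x + dt * drift(x)  # fast explicit prediction

    # Sensitivity probe: response of the drift to a small random perturbation.
    d = np.random.default_rng(0).standard_normal(x.shape)
    d *= eps / np.linalg.norm(d)
    sensitivity = np.linalg.norm(drift(x + d) - drift(x)) / eps

    extra = 0
    if sensitivity > frag_threshold:  # dynamics look fragile: spend more compute
        for _ in range(max_extra):
            x_next = x + dt * drift(x_next)  # fixed-point iteration on the implicit step
            extra += 1
    return x_next, extra

# Toy usage: a lightly damped oscillator.
A = np.array([[0.0, 1.0], [-4.0, -0.1]])
B = np.eye(2)
x_next, extra_steps = esm_step(A, B, u=np.zeros(2), x=np.array([1.0, 0.0]))
print(x_next, extra_steps)
```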
We illustrate this with two contrasting demos: a maze exploration agent that automatically spends more compute at new junctions and tight passages, and a protein “repair” scenario where extra effort is focused only on damaged or unstable regions of a molecule. Together they show how the same ESM block can power responsive, budget-aware decision-making in both robotics-style navigation and molecular simulation.
Building Safe, Compliant, and Observable Agentic Systems for Generative AI Applications
SCOPE: Enterprise Agent Governance Framework
As Large Language Model (LLM) agents are increasingly deployed in mission-critical applications, ensuring their safety, compliance, and observability becomes paramount. We present SCOPE, a comprehensive governance framework designed for regulated environments like banking and healthcare.
The SCOPE acronym represents our five core pillars:
S – Safety (Multi-layer Safety Guardrails)
C – Compliance (Policy-as-Code)
O – Observability (Measurable Observability & Audit Trails)
P – Permissions (Identity-Aware Permissions / IAM)
E – Escalation (Human-in-the-Loop Escalation)
Built on Google's Agent Development Kit (ADK), SCOPE implements a "Defense in Depth" architecture. It combines fast ML-based classification (~50ms) with LLM-based contextual analysis for robust protection. It features Role-Based Access Control (RBAC) baked into the agent's core and enables hot-patching of business rules without code changes.
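A minimal sketch of that two-tier check: a fast classifier resolves clear-cut inputs, and only the ambiguous middle band pays for the slower LLM-based contextual analysis. Thresholds and function names here are illustrative, not SCOPE's actual API.

```python
from typing import Callable

def guard(text: str,
          fast_clf: Callable[[str], float],
          llm_judge: Callable[[str], bool],
          low: float = 0.2, high: float = 0.8) -> bool:
    """Two-tier guardrail sketch: cheap classifier first, LLM only on doubt.

    fast_clf returns a risk score in [0, 1] in milliseconds; llm_judge is
    the slower contextual analysis. Thresholds are illustrative.
    """
    risk = fast_clf(text)
    if risk < low:
        return True             # clearly safe: allow without LLM cost
    if risk > high:
        return False            # clearly unsafe: block immediately
    return llm_judge(text)      # ambiguous band: escalate to contextual review

# Toy usage with stub components.
allow = guard("transfer $100 to my savings",
              fast_clf=lambda t: 0.1,
              llm_judge=lambda t: True)
print(allow)  # True: the fast path resolved it
```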
We demonstrate SCOPE's effectiveness through a live Banking Customer Service Agent that handles account inquiries, transactions, and fraud reports while maintaining compliance with PCI-DSS and SOC2 requirements. The framework is open-source and production-ready, offering a practical blueprint for trustworthy agent deployment.
Build verifiable apps using Generative AI and Automated Reasoning
Recent advancements in Generative AI have enabled customers to use LLMs to generate infrastructure code using AWS CLI commands. However, just as humans can make mistakes, LLM-generated infrastructure code can contain errors that, once deployed, have negative impacts, including on security.
Motivated by this challenge, this demonstration introduces participants to automated reasoning tooling that enhances security in production for Amazon Q Chat.
Amazon Q Chat enables natural language interaction with AWS resources while employing automated reasoning to verify every generated API call against comprehensive semantic logic models. This prevents potentially harmful operations before execution and suggests corrections, creating a feedback loop that iterates until verifiably correct code is produced. Through this work, we demonstrate how organizations can leverage GenAI's efficiency while maintaining the rigorous verification standards required for production environments, and participants will learn how to integrate these tools into their workflows to prevent security regressions and ensure reliable infrastructure management. This tutorial is intended for scientists, engineers, security professionals, and anyone interested in applying formal verification to their infrastructure.
ContextForge
The rapid rise of autonomous AI agents across enterprises is creating a new class of security and governance challenges that are not adequately addressed with today’s technology. Context Forge MCP Gateway is an open-source, security-focused middleware that provides fine-grained control and extensibility for agent operations. With over 2.6k GitHub stars and a rapidly growing user community, Context Forge addresses emerging threat classes including prompt injection, data leakage, and misuse of sensitive resources. At its core, Context Forge introduces a plugin architecture modeled after Linux Security Modules, embedding reusable security hooks at critical points in agent execution (e.g., prompt handling, tool invocation, data transformation). This modular foundation enables organizations to enforce contextual policies at scale—ranging from PII redaction and provenance tagging to prompt injection detection and policy-based access control. With 39 plugins already available, Context Forge is establishing a standards-aligned ecosystem for securing agent workflows in real-world enterprise deployments. By blending research-driven design with open-source adoption it creates a practical path for organizations to advance agent trustworthiness, safety, and compliance.
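The hook-based design can be pictured as follows; this is a generic illustration of LSM-style hooks, with hypothetical class and method names rather than ContextForge's actual plugin API.

```python
class SecurityPlugin:
    """Base class with hook points an agent runtime might expose.

    Hypothetical interface for illustration; the actual ContextForge
    plugin API differs.
    """
    def on_prompt(self, prompt: str) -> str:
        return prompt  # e.g., redact PII before the model sees the prompt

    def on_tool_call(self, tool: str, args: dict) -> bool:
        return True    # return False to veto the tool invocation

class BlockShellCommands(SecurityPlugin):
    def on_tool_call(self, tool: str, args: dict) -> bool:
        return tool != "shell"  # policy: agents may never run shell commands

def invoke_tool(plugins: list[SecurityPlugin], tool: str, args: dict) -> None:
    """Run every plugin's hook before dispatching to the real tool."""
    if not all(p.on_tool_call(tool, args) for p in plugins):
        raise PermissionError(f"tool '{tool}' blocked by policy")
    print(f"dispatching {tool}({args})")

invoke_tool([BlockShellCommands()], "search", {"q": "quarterly report"})
```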
ALICE: Agentic Logic for Incident and Codebug Elimination
Modern incident root-cause analysis (RCA) is constrained by partial observability, symptom-centric signals, and the overwhelming noise present in logs, traces, and metrics. Diagnosing production failures often depends on instrumentation quality and human expertise, while latent software defects, configuration errors, and zero-day failure modes remain difficult to pinpoint. To address these challenges, we demonstrate a multi-agent system for incident diagnostics that augments observability data with application source code and static analysis signals.
Our system introduces two cooperating agents: the Code Context Agent (COCOA), which builds a knowledge graph of program dependencies, control/data flows, and caller–callee relationships; and the Incident Diagnostics Agent (IDA), which performs agentic reasoning over an entity topology graph enriched with observability streams. Together, these agents extend topology-aware planning (TAP) to simultaneously operate on program dependency graphs and infrastructure entity graphs, thereby linking runtime symptoms with underlying code-level causes.
This demo showcases how multi-agent collaboration enables deeper, context-sensitive RCA. We walk through real-world inspired scenarios—including incidents where critical log lines are hidden in noisy observability streams or where latent defects emerge only after system updates—illustrating how the system surfaces root causes that would otherwise remain invisible. By bridging program analysis with runtime observability, our approach moves beyond symptom-driven diagnostics toward a more reliable, automated framework for incident management.
Efficient LiDAR Processing with AI Models Leveraging Heterogeneous Compute
This demo showcases heterogeneous compute execution of a LiDAR model running in real time on an edge device. The LiDAR processing, specifically the 3D sparse convolution (spconv3d) network, runs on the Qualcomm Adreno GPU, while the Region Proposal Network (RPN) executes on the Qualcomm Hexagon NPU. This division of labor across specialized processors reduces on-device inference latency and maximizes overall efficiency. Additionally, a lightweight, learnable voxel removal layer that hierarchically prunes redundant voxels further reduces inference time without compromising detection accuracy.
Implementation challenge that we tackle
LiDAR models often combine different types of operations: irregular, sparse computations (e.g., SpConv3D) and dense convolutional layers (e.g., CNNs). These operations have distinct hardware affinities—SpConv3D is better suited for SIMT-style GPUs, while CNNs benefit from SIMD-style NPUs. Efficient execution requires mapping each part of the model to the most appropriate compute unit.
Another challenge is the variability in voxel density across LiDAR frames. Not all voxels contribute meaningfully to object detection; many represent ground planes or distant background and can be safely discarded. However, identifying and removing these in a lightweight, learnable way is non-trivial.
Parallel generation with verification on device
In this work, we address the challenges of efficiently generating and verifying multiple responses from large language models (LLMs) directly on device. While sampling with non-zero temperature often yields improved responses compared to greedy approaches, selecting the best response requires generating several candidates and evaluating them without incurring significant latency or resource overhead. Cloud-based solutions often rely on separate verification models, which are impractical for on-device deployment due to resource constraints. Our proposed solution leverages multi-stream execution graphs and parallel LLM generation, enabling joint generation and verification within a unified framework. Combined with post-processing techniques such as majority voting, this approach minimizes latency and optimizes the selection of high-quality responses, paving the way for more effective on-device LLM inference.
Specific challenge that we tackle (research/implementation-wise)
Using non-zero temperature sampling with language models can result in higher-quality responses compared to greedy sampling, although this is not always assured. Achieving optimal output often requires generating multiple candidate responses and selecting the most suitable one for the user. This technique is widely adopted to enhance inference-time performance. When implemented on device, however, it presents two primary challenges: minimizing the latency associated with generating several responses and determining a resource-efficient method for selecting the best response from the generated set.
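A toy sketch of the generate-and-select step described above, using majority voting over extracted answers. The candidates are generated sequentially here for clarity, whereas the on-device system runs them concurrently via multi-stream execution; all names are illustrative.

```python
import random
from collections import Counter
from typing import Callable

def best_of_n(generate: Callable[[str, float], str],
              extract_answer: Callable[[str], str],
              prompt: str, n: int = 5, temperature: float = 0.8) -> str:
    """Sample n candidates and return one that agrees with the majority answer."""
    candidates = [generate(prompt, temperature) for _ in range(n)]
    votes = Counter(extract_answer(c) for c in candidates)
    majority_answer, _ = votes.most_common(1)[0]
    # Keep the first full candidate whose answer matches the majority vote.
    return next(c for c in candidates if extract_answer(c) == majority_answer)

# Toy usage with a stub sampler that is usually, but not always, right.
gen = lambda p, t: f"reasoning... answer={random.choice(['4', '4', '5'])}"
print(best_of_n(gen, lambda c: c.split('=')[-1], "2+2?", n=5))
```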
Soft Prompts for On-Device Content Moderation
We demonstrate the first on-device integration of a safety-aligned large language model (LLM) using soft prompt distillation, powered by our proposed TV-DiSP framework. Our system showcases how a mobile device can run a quantized LLM equipped with learned soft prompts to moderate harmful or toxic content in real-time. The demo highlights the difference in LLM outputs with and without our soft prompts when subjected to adversarial or unsafe inputs, enabling efficient and safe deployment of LLMs on edge devices.
LLMs are known to produce unsafe or toxic outputs when prompted harmfully. Traditional safety mechanisms rely on dual-model architectures—pairing a base LLM with a separate guard model—which are memory- and compute-intensive and unsuitable for deployment on resource-constrained devices like smartphones. The challenge is to achieve robust safety alignment without compromising latency, memory, or model utility in edge environments.
Generating group photos of multiple people from text and reference images
Reference-based multi-human image generation is emerging as a critical capability for personalization, synthetic data creation, and benchmarking generative models. Unlike single-subject generation, this task requires compositional reasoning to place multiple individuals—each with distinct identities—into a coherent scene guided by a text prompt. Existing models often fail to preserve identities or maintain spatial fidelity, which limits their applicability for real-world scenarios such as social content creation or training vision systems.
Our demo addresses these challenges by showcasing a state-of-the-art system for reference-based multi-human generation. The system takes reference images of multiple individuals and a text description of the desired scene, then produces a high-quality image featuring all participants in context. Built on the Flux-Kontext backbone and trained using synthetic data from DisCo (arXiv:2510.01399), our RL-based approach optimizes multiple rewards including Human Preference Score (HPS3) and Average ID Similarity. Evaluation on MultiHuman-Testbench (arXiv:2506.20879) confirms state-of-the-art performance.
This demo showcases fast generation on a laptop powered by a Snapdragon processor, highlighting the efficiency and scalability of our solution.
Disaggregated LLM Serving on AI Accelerators
This demo showcases disaggregated serving on the Qualcomm Cloud AI 100 Ultra card, a power-efficient AI inference accelerator purpose-built for serving large language models (LLMs). The accelerator has been deployed across multiple cloud service providers (CSPs) globally and is actively serving state-of-the-art LLMs and other generative AI workloads.
LLM inference typically involves two distinct stages: prefill and decode. The prefill stage is compute bound, while the decode stage is memory bound. Applying uniform parallelism strategies across both stages often results in suboptimal performance, particularly in key metrics such as Time to First Token (TTFT) and Requests Per Minute (RPM) at the cluster level.
This demo highlights the performance benefits of disaggregated parallelism strategies tailored to the unique characteristics of each stage. By optimizing the execution of prefill and decode independently, we demonstrate significant improvements in TTFT and overall throughput.
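A back-of-envelope sketch of why the two stages stress different resources; the numbers below are illustrative assumptions, not Cloud AI 100 specifications.

```python
# Illustrative arithmetic: prefill is compute bound, decode is memory bound.
params = 7e9           # assumed 7B-parameter model
prompt_len = 2048      # tokens processed in the prefill pass
bytes_per_param = 1    # e.g., 8-bit weights

# Prefill processes the whole prompt in one pass: ~2 * params FLOPs per
# token, with each weight read amortized over many tokens (compute bound).
prefill_flops = 2 * params * prompt_len

# Decode emits one token per step, so every step re-reads all weights:
# throughput is capped by memory bandwidth, not compute (memory bound).
decode_bytes_per_token = params * bytes_per_param

print(f"prefill: {prefill_flops:.1e} FLOPs over the prompt")
print(f"decode:  {decode_bytes_per_token:.1e} bytes read per generated token")
```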
Key benefits:
Improved TTFT: Faster initial response times for LLM queries.
Higher throughput: Increased number of requests served per minute at the cluster level.
Optimized resource utilization: Efficient mapping of compute and memory resources to match workload characteristics.
SLA-adherent performance: Maintains service quality and responsiveness within strict latency and throughput requirements.
Introduction to Generative Computing
This hands-on workshop introduces a proposal that treats LLMs as computing elements governed by established software development principles—particularly task decomposition and modularization—at both the programming model (Mellea) and model level (LLM intrinsics).
LLM outputs are often unpredictable and incorrect. Agentic frameworks and prompt optimization libraries attempt to manage this by giving control to the LLM, but this leads to systems that are hard to debug, maintain, and scale. Mellea offers an alternative: a programming model that restores developer control through modular design, information hiding, and compositional contracts. This enables predictable fault models, better portability, and lower inference costs. Attendees will gain hands-on experience building applications using the Melleaic approach.
Extending these principles to the model level, the workshop introduces a modularization framework for LLMs using activated LoRAs. These produce components—LLM intrinsics—that match fine-tuned model accuracy for specific tasks but with significantly lower inference costs and latency, thanks to KV cache reuse. Participants will build applications using a pre-built library of RAG LLM intrinsics and learn how to train their own.
Presented by the creators of Mellea and the inventors of LLM intrinsics and aLoRA, this workshop equips attendees with foundational skills for scalable model/application co-design.
Checkmate: Fine-tune your own small language model for real-time chess reasoning and gameplay on AWS Trainium
In this hands-on workshop, participants will leverage AWS Trainium to fine-tune and deploy their own chess-playing language models. Building on recent research showing language models' effectiveness in reasoning, attendees will work with various chess datasets to create AI models that not only play chess but explain their strategic thinking through natural language. The 90-minute session will cover model fine-tuning techniques, optimization strategies specific to Trainium's architecture, and real-time deployment to a chess engine. The workshop culminates in a live tournament where participants' models compete against each other, providing immediate feedback on their implementations. Participants will leave with a working chess reasoning model, practical experience in fine-tuning language models on Trainium, and transferable skills for similar tasks. Python programming experience and familiarity with LLM concepts are required, in addition to a basic understanding of the rules of chess. Workshop materials and AWS credits will be provided.
Interpretable AI for Risk-Based Assessment in Global Supply Chains
Managing operational and compliance risks across a large, diverse supplier base is increasingly complex. Traditional audit-based approaches are resource-intensive and limited in scope, making a risk-based strategy essential to focus attention where potential issues are most likely to arise. To address this challenge, Amazon developed PRISM AI (Predictive Risk Intelligence for Supplier Management), an interpretable machine learning system that predicts and explains supplier-level risk across global supply chains. Trained on tens of thousands of audit and assessment records, PRISM integrates multiple data sources including self-assessment questionnaires, incident reports, external media signals, and geo-sector indicators. These inputs enable near–real-time identification of elevated risk patterns and emerging concerns across supplier networks. The model supports suppliers with varying data availability—those with extensive records, limited information, or none—by combining transfer learning, rule-based heuristics, and domain-specific indicators. Each prediction is accompanied by transparent attribution, showing which factors, such as certification gaps or regional exposure, most influenced the score. Built with monotonic constraints, the system ensures logically consistent and explainable outputs suitable for regulatory and operational contexts.
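One ingredient mentioned above, monotonic constraints, can at least be verified with a simple probe like the following. The toy model and feature semantics are illustrative, not PRISM's implementation.

```python
import numpy as np

def check_monotone(predict, X: np.ndarray, feature: int, delta: float = 1.0) -> bool:
    """Verify a model's score never decreases when a risk indicator grows.

    Illustrative check: perturb one feature upward and assert predictions
    move in the logically required direction, e.g. more certification gaps
    must never lower the risk score.
    """
    X_up = X.copy()
    X_up[:, feature] += delta
    return bool(np.all(predict(X_up) >= predict(X)))

# Toy model: risk rises with feature 0 (certification gaps), falls with feature 1.
toy_predict = lambda X: 0.7 * X[:, 0] - 0.2 * X[:, 1]
X = np.random.rand(100, 2)
assert check_monotone(toy_predict, X, feature=0)
```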
This demo provides NeurIPS participants with a hands-on view of how AI research can be operationalized for large-scale, real-world impact. PRISM helps compliance teams prioritize reviews, streamline supplier onboarding, and enhance oversight efficiency. For researchers, it illustrates techniques for building interpretable models under data imbalance and for integrating structured and unstructured signals. For practitioners, it demonstrates how AI can advance responsible sourcing and sustainability objectives across complex global ecosystems.
Multimodal AI Forensic Search for Video Surveillance
Video surveillance often requires searching for specific targets from long-duration videos using multiple cameras. Traditional tracking‑and‑detection pipelines demand heavy manual filtering, and even recent multimodal approaches such as using CLIP remain limited to shallow visual attributes (e.g., clothing color) and weak temporal reasoning. This makes forensic search labor‑intensive.
We present ForeSea, a novel AI forensic search system that supports rich multimodal queries (text + image) and returns timestamped evidence of key events. ForeSea is organized as a multi-stage pipeline that couples tracking and retrieval with time-aware VideoLMM reasoning: (1) a tracking model filters out irrelevant segments (e.g., frames without people) and produces person-centric clips; (2) a retrieval stage constructs an index over tracked clips to form a searchable database; and (3) during inference, the multimodal query is embedded to retrieve the top-N candidate clips, which are then fed into a time-aware VideoLMM that performs temporal grounding and generates precise answers from concise input. Through ForeSea's multi-stage pipeline, we can search for targets using both image and text queries (e.g., asking 'When does this person get involved in a fight?' with an image of the person). This approach eliminates the need for detailed textual descriptions and enables effective temporal understanding across long videos.
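A sketch of the retrieval step in stages (2) and (3): cosine similarity between a fused text+image query embedding and the index of person-centric clips, returning the top-N candidates for the VideoLMM. Shapes and names are illustrative, not ForeSea's actual code.

```python
import numpy as np

def top_n_clips(query_emb: np.ndarray, clip_index: np.ndarray, n: int = 5) -> np.ndarray:
    """Return indices of the n clips most similar to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    index = clip_index / np.linalg.norm(clip_index, axis=1, keepdims=True)
    scores = index @ q                  # cosine similarity per clip
    return np.argsort(-scores)[:n]      # top-N candidates for the VideoLMM

clip_index = np.random.randn(1000, 512)  # embeddings of tracked clips
query = np.random.randn(512)             # fused text+image query embedding
print(top_n_clips(query, clip_index, n=5))
```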
To evaluate LMM-based forensic search, we introduce AI Forensic-QA, a benchmark for multimodal video question answering with temporal grounding. On this benchmark, ForeSea achieves an 8.6% accuracy improvement and a 6.9-point IoU gain over strong baselines. To the best of our knowledge, this is the first benchmark in this domain to support multimodal query evaluation. Our live demo showcases multimodal search, timestamped evidence visualization, and side-by-side comparisons with SOTA models.
Reasoning through Multimodal End-to-End Decision Transformer Networks and Vision Language Action (VLA) models
This demonstration showcases the live output and visualization capabilities of an edge-integrated VLA model for path planning in automated driving scenarios. By harnessing raw multimodal sensor inputs, including visual and voice data, the VLA model processes information in real time to generate safe, explainable, and repeatable driving trajectories. The system operates on a Snapdragon Ride Elite SoC platform and incorporates safety guardrails, enabling robust decision-making and transparent reasoning. Attendees will observe how end-to-end AI networks interpret complex environmental cues to deliver actionable driving paths, with a special focus on complex use cases involving vulnerable road users and other actors on the road. This demonstration highlights advances in multimodal reasoning and edge deployment for next-generation intelligent mobility solutions.
SwiftEdit: Fast Text-guided Image Editing via One-step Diffusion on a Mobile Device
In this demo, we show on-device inference of our one-step diffusion image editing model (SwiftEdit) [1], which performs interactive image editing based on the user's source image and text prompt, running on an Android smartphone powered by Qualcomm Technologies' latest Snapdragon Mobile Platform. On A100 GPUs, this technique runs in real time at 0.23s per edit operation. We expect SwiftEdit to perform each edit operation in seconds on the smartphone, demonstrating efficient and responsive on-device diffusion inference.
Scientific Challenge that we tackle
Existing text-guided image editing methods fall short of the speed demands of real-world and on-device applications due to the costly multi-step inversion and sampling process involved. In response, we developed SwiftEdit, which performs image editing using just one-step inversion and one-step image reconstruction.
Efficiently running SwiftEdit requires concurrently on-boarding multiple deep models, including IP-Adapter (Vision Encoder and Image Projection), SwiftBrush (U-Net, VAE, Text Encoder), and SwiftBrush-based Inversion Network. This poses significant challenges for efficient execution and inter-module communication, while enabling an interactive image editing experience for the user — with all computation performed entirely on the edge device.
Mobile Video Diffusion Transformers
We demonstrate Neodragon, the first video diffusion transformer (DiT) designed to run on low-power NPUs in mobile devices, such as phones and laptops. Despite DiTs' huge memory and computation cost due to the quadratic attention over thousands of video tokens, we show that mobile devices can run these models when they are designed for efficiency. To achieve this level of efficiency:
We replace the original large text encoder with a much smaller one at minimal quality loss through our novel distillation framework, which doesn’t require any image or video data.
We propose an asymmetric decoder distillation approach, which allows us to replace the native codec-latent-VAE decoder with a more efficient one without disturbing the generative latent space of the video generation pipeline.
With our block pruning strategy, we remove entire blocks from the MMDiT denoiser based on their relative importance and recover the original performance through a two-stage distillation process.
We reduce the diffusion sampling cost using our novel extended version of DMD (distribution matching distillation) for the pyramidal flow-matching objective.
Neodragon generates 49 frames at 640x1024 resolution within 7.6 seconds on the Qualcomm Hexagon NPU with a VBench total score of 81.61, setting a new state of the art for mobile video generation.
Large-Scale Real-World Physical AI Systems
Motivation and Scope
Physical AI systems comprise four components: sensors such as cameras and lidar, mechanical and electronic control units, AI models that reason about the environment, and actuators that convert decisions into physical actions. The field marries multiple domains, including sensor design, perception, low-power real-time hardware design, and control-loop action design. Autonomous driving is the most mature physical AI domain, deployed for over 10 years, but it still has many open challenges. Humanoid robots are an emerging physical AI domain with potential for near-term commercial deployment. One of the major challenges in physical AI is scaling safely to all real-world scenarios, including corner cases. A scalable AI data flywheel is the most critical module to achieve this. Traditional physical AI models have a modular decomposition of perception and action tasks, but the community is increasingly moving towards a single end-to-end AI model. Furthermore, recent advancements in LLMs and VLMs are leading to VLA (Vision-Language-Action) based end-to-end models. In the future, there will likely be a convergence of physical AI models across domains like driving and robotics. The proposed workshop covers the latest research and best practices in industrial research of physical AI by leaders in the domain. It also covers emerging technologies like VLA-based foundation models, the AI data flywheel, and cross-embodiment learning focused on physical AI.
Exploring Trust and Reliability in LLM Evaluation
The current paradigm of Large Language Model (LLM) evaluation faces a crisis of reliability. Traditional leaderboards—built on static benchmarks and surface-level metrics—have become increasingly distorted by benchmark contamination, prompt overfitting, and evaluation methodologies that fail to reflect model behavior in real-world use. As reasoning models emerge that generate detailed internal thought processes (e.g., chain-of-thought traces), the gap between what static benchmarks measure and how models actually behave grows even wider.
This lack of rigor not only undermines scientific progress and cross-model comparability, but also poses significant enterprise and societal risks, as evaluation results inform model selection, deployment safety, and governance in high-stakes environments.
This workshop aims to reassert rigor in LLM evaluation by convening researchers and practitioners to address three intertwined challenges: (1) developing fair and consistent evaluation methods for reasoning and non-reasoning models, (2) confronting widespread contamination across public benchmarks and open-weight models, and (3) defining robust data curation and validation practices to prevent future contamination in both pretraining and post-training pipelines.
By combining empirical findings, methodological advances, and practical case studies, this session—led by Capital One in collaboration with leading AI labs—seeks to chart a concrete path toward trustworthy, contamination-proof, and utility-aligned LLM evaluation frameworks.
This 1.5-hour workshop will be structured around three highly focused, 25-minute talks, followed by a moderated discussion aimed at forging actionable paths forward for the community:
Talk 1: Robust Evaluation for Reasoning & Non-Reasoning Models
Talk 2: Benchmark Contamination — Detection, Measurement, & Findings
Talk 3: Preventing Contamination — Building Clean & Reliable Data Pipelines
CausalFairness: An Open-Source Python Library for Causal Fairness Analysis
As machine learning (ML) systems are increasingly deployed in high-stakes domains, the need for robust methods to assess fairness has become more critical. While statistical fairness metrics are widely used due to their simplicity, they are limited in their ability to explain why disparities occur, as they rely on associative relationships in the data. In contrast, causal fairness metrics aim to uncover the underlying data-generating mechanisms that lead to observed disparities, enabling a deeper understanding of the influence of sensitive attributes and their proxies. Despite their promise, causal fairness metrics have seen limited adoption due to their technical and computational complexity. To address this gap, we present CausalFairness, the first open-source Python package designed to compute a diverse set of causal fairness metrics at both the group and individual levels. The metrics implemented are broadly applicable across classification and regression tasks (with easy extensions for intersectional analysis) and were selected for their significance in the fairness literature. We also demonstrate how standard statistical fairness metrics can be decomposed into their causal components, providing a complementary view of fairness grounded in causal reasoning. In this active-learning talk, participants will learn how to quantify bias using CausalFairness at the group (Counterfactual Equalized Odds, Counterfactual Effects) and individual (Counterfactual Fairness) levels by applying each method to three datasets: 1) the Adult Income dataset, 2) the COMPAS dataset, and 3) the Law School Admission Council (LSAC) dataset. The session will build intuition for computing and interpreting each metric and conclude with a discussion of their limitations.
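To give a flavor of the individual-level notion, here is a toy computation of counterfactual (un)fairness under a hand-written structural causal model: a value of zero means predictions are invariant to intervening on the sensitive attribute. The SCM, model, and function names are illustrative, not the package's API.

```python
import numpy as np

def counterfactual_unfairness(model, X: np.ndarray, scm_flip) -> float:
    """Mean |f(x) - f(x')| between factual inputs and their counterfactuals
    under an intervention on the sensitive attribute.

    scm_flip maps each row to its counterfactual, propagating the flipped
    sensitive attribute through downstream (proxy) features via the SCM.
    """
    X_cf = np.apply_along_axis(scm_flip, 1, X)
    return float(np.mean(np.abs(model(X) - model(X_cf))))

# Toy SCM: feature 1 is a proxy, caused by the sensitive attribute (feature 0).
def scm_flip(row):
    a, proxy = row[0], row[1]
    a_cf = 1.0 - a
    return np.array([a_cf, proxy + 0.5 * (a_cf - a)])  # proxy shifts with a

model = lambda X: 0.3 * X[:, 0] + 0.6 * X[:, 1]
X = np.column_stack([np.random.randint(0, 2, 200), np.random.rand(200)]).astype(float)
print(counterfactual_unfairness(model, X, scm_flip))  # 0.0 would mean fair
```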
Creative and Protective AI for Music and Entertainment
Generative AI is reshaping how we create, experience, and safeguard music and entertainment. This workshop presents technologies that expand creative expression while honoring responsibility. On the creative side, we share collaborative artworks with leading sound artists, neural engines for sound design and performance, and automatic mixing that adapts to musical intent. We also present a large multimodal dataset for multishot speech video that supports research on coherent and controllable speech, together with specialized language models that orchestrate camera transitions, gestures, vocal cues, and sound effects. On the protective side, we advance AI methods for data attribution, traceability, and responsible model behavior that safeguard creative data and prevent unintended memorization, ensuring fairness, transparency, and respect for creators’ rights. Together, these threads outline an ecosystem in which AI amplifies artistic practice while preserving the integrity of human contribution.
Multimodal Superintelligence Workshop
Multimodal machine learning is among the most promising directions in artificial intelligence. With remarkable progress on this topic in academia and industry, we are at the cusp of building next-generation multimodal models, i.e., multimodal superintelligence. These models can be defined as being able to observe, think, and act across several modalities. At this important junction, our workshop provides a forum for researchers to align and cross-pollinate ideas. The Workshop on Multimodal Superintelligence will provide a venue where the community can gather to discuss the current state of multimodal machine learning science. We will also focus on topics such as cross-modal reasoning, alignment, fusion, and co-learning.
Building an AI Ecosystem for Multiscale Biological Discovery
Understanding biological systems requires resolving their structure and organization across scales, from tissues to individual molecules. Advances in imaging and molecular profiling now generate vast multimodal datasets that capture biological architecture and dynamics with unprecedented fidelity. Unlocking insights from this data demands computational approaches capable of linking observations across spatial, temporal, and molecular dimensions.
At the Chan Zuckerberg Imaging Institute (CZII), we are building the infrastructure, datasets, and community connections to enable this transformation. Our cryo-electron tomography (cryoET) processing pipeline supports high-throughput reconstruction and standardized metadata integration, forming the foundation for reproducible, machine-learning–ready datasets. The CryoET Data Portal (cryoetdataportal.czscience.com) provides open access to raw data, reconstructions, and curated annotations contributed by leading structural biology labs worldwide. Its programmatic API tools support segmentation, particle picking, and model benchmarking, creating a foundation for AI-driven structural discovery.
To catalyze progress in automated molecular identification, the CZ Imaging Institute recently organized a Kaggle challenge inviting participants to develop models for detecting and labeling macromolecular complexes in real-world cryoET data. Building on this success, upcoming challenges organized by the CZI & CZ Biohub Network will extend this approach to datasets spanning different biological scales, from tissue architecture and cellular organization to subcellular and molecular structure.
Together, these efforts form an open, interoperable ecosystem for machine learning in biological imaging. By combining standardized data infrastructure, scalable computation, and community-driven innovation, we aim to bridge the worlds of imaging and AI and accelerate the discovery of life’s organization across all scales.
Recent developments in embodied AI
Embodied AI is the study of systems that can perceive and interact with the physical world in real time. Real-world interactions pose unique challenges for AI systems since they naturally require a deep understanding of the physical world and/or its inhabitants. This understanding is often taken for granted in humans, where it is typically labelled as “intuitive physics” or “common sense”. It is widely agreed that solving this challenge would be as rewarding as it is hard, since it would be equivalent to creating truly capable “world models”, with countless applications in robotics, human-computer interaction, and even in advancing language modeling through concept grounding. Like other areas in AI, embodied AI has seen dramatic advances in recent years, fueled by the success of using pre-trained large language models as a central ingredient to allow for end-to-end training. While this development stands as one of many examples of the power of pre-trained language models, recently the converse has come true as well: embodied AI is increasingly being drawn on to understand real-world common sense and concept grounding in language models themselves, bringing back its early vision as a way to understand human-like cognition and world models.
This talk will provide an in-depth discussion of embodied AI, with a focus on recent advances based on multi-modal large language models. It will discuss how end-to-end training has made it possible to instill key aspects of real-world common sense in a model and how this has enabled highly ambitious use cases, such as generalist (“common sense”) robot control and real-world visual interaction (“chatbots that can see and hear you”). The talk will also discuss practical considerations, such as streaming inference at the edge, end-to-end training data generation and the role of reinforcement learning, as well as open challenges in state tracking and long-term memory.
Ring-1T, Ring-linear and Ming-Flash-Omni: Scaling Knowledge-Enhanced Large Language Models for Reasoning and Efficiency
The Ling 2.0 series represents a new generation of large language models designed around knowledge enhancement, reasoning efficiency, and scalable architecture innovation. Built upon trillion-scale sparse MoE foundations, Ling-1T achieves ~50B active parameters per token with FP8 mixed-precision pipelines and 1F1B interleaved scheduling, realizing over 40% training-throughput gains with negligible accuracy loss (<0.1%).
This talk presents the technical journey behind Ling-mini, Ling-flash, and Ling-1T, focusing on (1) efficient large-scale training systems for trillion-parameter models; (2) the Ling Scaling Law and its implications for cross-domain reasoning; (3) hybrid attention and RL-based alignment strategies that enable both concise reasoning and long-context understanding; and (4) how these architectural and algorithmic advances empower industrial applications such as financial risk modeling and knowledge-grounded agents.
We will conclude with open-sourced implementations (inclusionAI on Hugging Face and ModelScope) and future research directions toward trustworthy, efficient, and domain-enhanced LLMs.
Session 1 : Ring-1T: Scaling Reinforcement Learning for Trillion-Scale Thinking Model
Session 2: Ring-linear: An Efficient Hybrid Architecture for Long-Context Reasoning
Session 3: Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation
Agentic AI/RL
The transition from static language models to agentic AI systems driven by reinforcement learning (RL) places environments at the center of research and deployment. Environments provide the substrate where agents act, learn, and are evaluated—ranging from lightweight simulators and synthetic tasks to rich multi-agent ecosystems and real-world interfaces. Building and scaling these environments requires specialized tools and systems: standardized hubs for discovery and sharing, interfaces for reproducibility, and infrastructure that connects environments seamlessly to trainers, inference engines, and evaluation pipelines.
This workshop will highlight the tools, environments, and system innovations enabling the next generation of agentic AI. Topics will include scalable RL environment frameworks, benchmarks for safety and robustness, high-performance simulators optimized for heterogeneous hardware, and environment–trainer integration at scale. We will also explore how environments interface with large-model post-training workflows, providing the data and feedback loops necessary for reward shaping, alignment, and deployment in production systems.
By convening researchers, environment developers, and systems engineers, the workshop will create a venue to examine how environments, tools, and infrastructure together shape the future of agentic AI.
Distributed Orthonormal Updates for Large-Scale Training
We propose a 50-minute technical talk on recent advances in orthonormal update methods for large-scale AI model training. This topic is rapidly gaining attention in the community, emerging as a strong successor to AdamW following the success of orthonormal optimizers in training production-scale models such as Kimi-K2 and GLM-4.5.
The talk will center on the design and practice of orthonormal updates, focusing on optimizers such as Muon and Dion. While we will briefly discuss their theoretical foundations, the emphasis will be on practical usage: how to integrate these optimizers into modern training pipelines, interpret their algorithmic components, and leverage the implementation guidelines provided in our open-source codebase at github.com/microsoft/dion.
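For orientation, the heart of a Muon-style orthonormal update is replacing the raw gradient matrix with an approximately orthonormalized one, typically via a few Newton-Schulz iterations. The sketch below uses the quintic coefficients found in public Muon implementations; it is a didactic illustration, not the distributed production code in the repository above.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Map a gradient matrix to an approximately (semi-)orthogonal one.

    Quintic Newton-Schulz iteration with the coefficients used in public
    Muon implementations.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)            # scale so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                          # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

G = torch.randn(256, 512)                # e.g., a weight-gradient matrix
O = newton_schulz_orthogonalize(G)
print((O @ O.T - torch.eye(256)).norm())  # small: rows are roughly orthonormal
```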
The talk is designed to engage both researchers and practitioners in the NeurIPS community:
Academic perspective: presents a new class of optimizers grounded in theory, along with how they interact with distributed training.
Industrial perspective: highlights how orthonormal updates are implemented in practice and what best practices are.
This topic lies at the intersection of optimization theory, scalable systems, and large-model training—an area of growing importance for both the research and applied machine learning communities.
Building Foundational Models for Robotics at Tesla
Tesla's robots, both wheeled and legged, are developed with the goal of achieving general-purpose capability, analogous to the versatility observed in humans and animals. These systems rely primarily on scalable sensing modalities such as vision and audio, enabling robust performance within stringent power and cost constraints.
This talk will describe the principles and methodology behind constructing foundation models for robotics at Tesla. We will discuss the architecture, data and training of large-scale multimodal models that control these robots in an end-to-end pixels-to-actuation fashion. We will also examine evaluation protocols, safety considerations, and strategies for reliable real-world deployment. Finally, we project the transformational benefits to society that widespread deployment of such advanced robotic systems can deliver.
AI Assistants in the Wild: Agents, Adaptation, and Memory-Augmented Deployment
Motivation and Scope
Generative AI is evolving from offline, single modality models into interactive agentic systems that perceive, decide, and act in the real world. This shift marks a transition from static generation to dynamic, context-aware interaction. As these systems move toward deployment on edge devices such as mobile phones, augmented reality glasses, and robots, they face constraints in compute, memory, and latency. Beyond efficiency and responsiveness, a new frontier is emerging: agents equipped with persistent memory that enables long-term adaptation and personalization.
This workshop explores a timely and focused question: how do we build generative agents that are not only efficient and responsive but also able to accumulate, recall, and adapt based on personal memory over time? We aim to bring together perspectives from generative modeling, agentic learning, efficient model design, and memory systems to close the gap between lab-scale prototypes and real-world deployment.
Key Themes
Personal Memory Systems for AI Assistants: Architectures for persistent memory, retrieval-augmented generation, and long-term personalization (see the sketch after this list).
Real-World Adaptation: Few-shot generalization, continual learning, and task inference for evolving agent behavior.
Grounded and Trustworthy Generation: Techniques for hallucination mitigation, constraint-aware generation, and safety under uncertainty.
Deployment on Edge Platforms: Challenges and solutions for deploying generative agents on mobile, AR, and robotics platforms.
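As a minimal sketch of the first theme, the snippet below shows a retrieval-style personal memory: past interactions are embedded, stored, and recalled by similarity to condition the agent's next response. The hashed bag-of-words embedding is a toy stand-in; a deployed assistant would use a learned encoder and a persistent vector store.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy stand-in embedding (hashed bag of words), normalized to unit length."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

class PersonalMemory:
    """Minimal persistent-memory store: append interactions, recall the
    k most similar ones to condition the next generation."""
    def __init__(self):
        self.texts, self.vecs = [], []

    def add(self, text: str) -> None:
        self.texts.append(text)
        self.vecs.append(embed(text))

    def recall(self, query: str, k: int = 3) -> list:
        if not self.texts:
            return []
        sims = np.stack(self.vecs) @ embed(query)   # cosine sim on unit vectors
        top = np.argsort(sims)[::-1][:k]
        return [self.texts[i] for i in top]

mem = PersonalMemory()
mem.add("User prefers vegetarian recipes.")
mem.add("User's AR glasses have 4 GB of memory.")
print(mem.recall("what should I cook tonight?"))
```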
This focused workshop aligns with emerging themes at NeurIPS including agentic learning, trustworthy AI, efficient multimodal generation, and embodied intelligence. It will spotlight the systems, algorithms, and design decisions needed to make generative AI truly adaptive and persistent, beyond the data center and into the wild.
Using the Virtual Cell Platform to Accelerate Machine Learning in Biology
Biology presents some of the most complex and high-impact challenges for machine learning, and single-cell transcriptomics is at the frontier of this work. In this workshop, we introduce the Virtual Cell Platform (VCP), a unified environment designed to accelerate model development and evaluation in biology. Using single-cell transcriptomics as a case study, we will demonstrate how the VCP enables researchers to train, benchmark, and interpret models in a reproducible and biologically meaningful way.
Participants will gain a primer on single-cell transcriptomics and learn how to evaluate models with cz-benchmarks, an open-source Python package providing standardized, community-driven tasks and metrics. Through the VCP CLI, attendees will pull datasets, run packaged models, and compare results programmatically. Hands-on exercises will guide participants through interactive visualizations, side-by-side model comparisons, and deep dives into model behavior using VCP’s no-code interface and BYOD (Bring Your Own Data) module.
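The comparison workflow can be pictured with a generic harness like the one below. The names and metric are hypothetical illustrations of the standardized task-and-metric pattern, not the actual cz-benchmarks or VCP CLI APIs; the hands-on exercises will use the real tooling.

```python
import numpy as np

def separation_score(emb: np.ndarray, labels: np.ndarray) -> float:
    """Crude cluster-separation metric: mean inter-centroid distance over
    mean within-cluster spread (higher is better)."""
    cents = {l: emb[labels == l].mean(axis=0) for l in np.unique(labels)}
    within = np.mean([np.linalg.norm(emb[labels == l] - c, axis=1).mean()
                      for l, c in cents.items()])
    keys = list(cents)
    between = np.mean([np.linalg.norm(cents[a] - cents[b])
                       for i, a in enumerate(keys) for b in keys[i + 1:]])
    return float(between / (within + 1e-9))

def compare_models(models: dict, cells: np.ndarray, cell_types: np.ndarray) -> dict:
    """Run each embedding model on the same cells and report one shared metric."""
    return {name: separation_score(fn(cells), cell_types)
            for name, fn in models.items()}

rng = np.random.default_rng(0)
cells = rng.normal(size=(200, 50))          # stand-in expression matrix
cell_types = rng.integers(0, 3, size=200)   # stand-in cell-type labels
print(compare_models({"identity": lambda x: x,
                      "first-10-dims": lambda x: x[:, :10]}, cells, cell_types))
```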
By the end of the session, attendees will understand how to use the VCP to actively test and refine models during development, ensure biological relevance, and contribute models and benchmarks back to the community. This workshop highlights how the Virtual Cell Platform transforms ML infrastructure into a one-stop, researcher-friendly ecosystem, empowering the NeurIPS community to push the boundaries of AI in biology.
On Device/Edge AI
From smartphones and wearables to autonomous vehicles, robots, and AR/VR systems, the demand for models that are efficient, private, and adaptive in real-time has never been higher. Yet deploying state-of-the-art AI at the edge remains challenging: researchers and practitioners must navigate heterogeneous hardware, memory and power constraints, compression and distillation trade-offs, as well as privacy, safety, and reliability requirements.
This workshop will bring together researchers, practitioners, and industry leaders to explore the frontiers of Edge AI. Topics will include lightweight model architectures, compiler/toolchain optimizations (e.g., quantization, pruning, sparsity), advances in frameworks such as ExecuTorch and TensorRT, distributed learning across devices, privacy-preserving training, and emerging applications where latency and trust are critical. Beyond technical advances, we will examine the broader implications for democratizing AI—enabling billions of devices to act as intelligent, personalized agents while reducing dependence on the cloud.
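As one concrete instance of the compression toolbox mentioned above, here is a minimal post-training dynamic quantization sketch using PyTorch's built-in API. It is illustrative only; real edge deployments through frameworks such as ExecuTorch or TensorRT involve export and compilation steps beyond this.

```python
import torch
import torch.nn as nn

# Toy model standing in for an edge workload.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# Post-training dynamic quantization: Linear weights stored in int8,
# activations quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)   # same interface, smaller weight footprint
```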
Multi-Agent Systems in Industry: From Research to Real-World Impact
This workshop will bridge the gap between the theoretical advancements in multi-agent systems and their practical applications in industry. The session will feature a series of poster presentations showcasing state-of-the-art, real-world multi-agent systems that are driving innovation across various sectors. We will delve into the challenges and opportunities of deploying these systems at scale, covering topics such as:
Human-in-the-loop collaboration: Designing systems where AI agents and human experts work in synergy.
Scalability and efficiency: Architecting multi-agent systems for large-scale industrial applications.
Safety and reliability: Ensuring the robustness and predictability of autonomous agents in critical systems.
Domain-specific applications: Highlighting successful implementations in areas such as software engineering, scientific research, and creative content generation.
The goal of this workshop is to foster a discussion on the practical challenges and future directions of multi-agent systems, providing attendees with actionable insights and a deeper understanding of how these technologies are shaping the future of industry.
The Co-X Framework: Versatile AI Agents for Automating and Augmenting Professional Workflows
Beyond monolithic models, the future of AI in industry lies in specialized agents that collaborate with human experts. This talk introduces the "Co-X" framework, a novel approach for creating a diverse ecosystem of collaborative agents tailored to specific professional domains. We will present four key agents built on this framework: the Co-AI Researcher, the Co-ML Engineer for automating software development cycles, the Co-Data Scientist for automating data analysis and insight generation, and the Co-Director for augmenting creative content generation. We will discuss the foundational technologies that enable this versatility—including long-term memory, tool use, and human-in-the-loop feedback—and demonstrate how the Co-X framework is poised to redefine productivity and innovation across industries.
Juries, Not Judges! Industry-Scale Evaluation of Trustworthy AI via Dynamic LLM Panels
As Large Language Models (LLMs) become central to high-stakes applications, the reliability of their evaluation systems is under intense scrutiny, especially in the financial industry. Traditional approaches - human annotation, single LLM judges, and static model juries - struggle to balance scalability, cost, and trustworthiness. We will discuss a promising framework: LLM Jury-on-Demand, a dynamic, learning-based framework that assembles an optimal panel of LLM evaluators for each task instance, leveraging predictive modeling to select and weight judges based on context-specific reliability. Our system adapts in real time, outperforming static ensembles and single judges in alignment with human expert judgment across summarization and retrieval-augmented generation benchmarks. This talk will showcase how adaptive LLM juries can transform evaluation of AI systems, offering robust, scalable, and context-aware solutions for industry and research. Attendees will gain practical insights into building trustworthy LLM evaluation pipelines, see live demos, and discuss future directions for reliable AI assessment in critical domains.
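The core mechanism can be sketched generically as follows: a reliability model predicts, per task instance, how trustworthy each candidate judge is; the top-k judges are convened; and their verdicts are combined by a reliability-weighted average. All names here are hypothetical illustrations, not the actual LLM Jury-on-Demand implementation.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class Judge:
    name: str
    evaluate: Callable[[str], float]   # returns a quality score in [0, 1]

def convene_jury(task_features, judges, reliability_model, k=3):
    """Select the k judges predicted to be most reliable for this instance."""
    scores = np.array([reliability_model(task_features, j.name) for j in judges])
    top = np.argsort(scores)[::-1][:k]
    return [judges[i] for i in top], scores[top]

def jury_verdict(panel, weights, candidate_output):
    """Combine per-judge scores by a reliability-weighted average."""
    votes = np.array([j.evaluate(candidate_output) for j in panel])
    return float(weights @ votes / weights.sum())

# Toy usage: a fixed reliability table stands in for the learned predictor.
table = {"gpt-judge": 0.9, "small-judge": 0.6, "finance-judge": 0.8}
judges = [Judge(n, lambda text, n=n: 0.5) for n in table]
panel, w = convene_jury({"domain": "finance"}, judges,
                        lambda feats, name: table[name], k=2)
print([j.name for j in panel], jury_verdict(panel, w, "candidate summary"))
```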
Cosmos World Foundation Model Platform for Physical AI
In this talk, I will introduce NVIDIA Cosmos, our World Foundation Model platform designed to advance Physical AI. Cosmos is built around three core pillars: Predict, Transfer, and Reason. I will provide updates on the latest releases, Predict 2.5 and Transfer 2.5, highlighting key improvements in generalization, efficiency, and scalability. In addition, I will share a preview of ongoing research directions that extend Cosmos toward richer world modeling and reasoning capabilities. Together, these developments aim to push the boundaries of how AI perceives, simulates, and interacts with complex real-world environments.
Building the Open AI Ecosystem: Models, Infrastructure and Community Adoption
Openness is becoming as important as scale in advancing AI. This panel brings together leaders from research, industry, and infrastructure to examine the rise of open-weight frontier models and what they enable.
Discussion will focus on the challenges of training and aligning these systems, their impact on reproducibility and safety, and the new forms of collaboration they support across academia, government, and enterprise.
Attendees will gain a clear view of how open access is reshaping AI development and deployment worldwide.