Creative AI Session 3
Upper Level Room 29A-D
Marcelo Coelho · Luba Elliott · Priya Prakash · Yingtao Tian
The proliferation of assistive chatbots offering efficient, personalized communication has driven widespread over-reliance on them for decision-making, information-seeking, and everyday tasks. This dependence has been found to have adverse consequences for information retention and to foster superficial emotional attachment. As such, this work introduces 8bit-GPT, a language model simulated on a legacy Macintosh Operating System, to evoke reflection on the nature of human-AI interaction and the consequences of anthropomorphic rhetoric. Drawing on reflective design principles such as slow technology and counterfunctionality, this work aims to foreground the presence of chatbots as tools by defamiliarizing the interface and prioritizing inefficient interaction, creating friction between the familiar and the unfamiliar.
A(I)nimism: Re-enchanting the World Through AI-Mediated Object Interaction
Maisha Thasin · Dünya Baradari · Diana Mykhaylychenko · Charmelle Mhungu
Animist worldviews treat beings, plants, landscapes, and even tools as persons endowed with spirit, an orientation that has long shaped human–nonhuman relations through ritual and moral practice. While modern industrial societies have often imagined technology as mute and mechanical, recent advances in artificial intelligence (AI), especially large language models (LLMs), invite people to anthropomorphize and attribute inner life to devices. This paper introduces A(I)nimism, an interactive installation exploring how large language objects (LLOs) can mediate animistic relationships with everyday things. Housed within a physical “portal,” the system uses GPT-4 Vision, voice input, and memory-based agents to create evolving object-personas. Encounters unfold through light, sound, and touch in a ritual-like process of request, conversation, and transformation that is designed to evoke empathy, wonder, and reflection. We situate the project within anthropological perspectives, speculative design, and spiritual HCI. AI’s opacity, we argue, invites animistic interpretation, allowing LLOs to re-enchant the mundane and spark new questions of agency, responsibility, and design.
Artism: AI-Driven Dual-Engine System for Art Generation and Critique
Shuai Liu · Yiqing Tian · Yang Chen · Mar Canet Sola
This paper proposes a dual-engine AI architectural method designed to address the complex problem of exploring potential trajectories in the evolution of art. We present two interconnected components: AIDA (an artificial artist social network) and the Ismism Machine, a system for critical analysis. The core innovation lies in leveraging deep learning and multi-agent collaboration to enable multidimensional simulations of art historical developments and conceptual innovation patterns. The framework explores a shift from traditional unidirectional critique toward an intelligent, interactive mode of reflexive practice. We are currently applying this method in experimental studies on contemporary art concepts. This study introduces a general methodology based on AI-driven critical loops, offering new possibilities for computational analysis of art.
ArtPeer: Artistic Alignment via Reflection-based Prompt Evolution with Simulated Artist
Kaustubh Kumar · Ashutosh Ranjan · Vivek Srivastava · Shirish Karande
Art inspiration, whether from a specific artist, artwork, or movement, is a common anchor in text-to-image generation for artistic alignment. Although recent advances in generative models have led to impressive visual output, they often lack grounding in the artistic intent, context, persona, and evolving self-reflection that are hallmarks of human creativity. Their output may drift from the intended artistic and symbolic logic of the art inspiration due to prompt ambiguity, latent biases, or model hallucinations. We present ArtPeer, a framework for robust artistic alignment that places reflection-guided prompt evolution inside the generation loop. ArtPeer builds simulated artist personas from biographical, stylistic, and iconographic knowledge, enabling them to act as domain-specific critics. In a Socratic reflection dialogue, the persona-aligned artist agent questions a generation agent, identifies deviations, and iteratively refines the prompt until both stylistic and conceptual alignment are achieved. We validate the effectiveness of the proposed framework through qualitative and quantitative evaluation in challenging settings. ArtPeer produces art that is more robustly aligned, meaningful, and contextually relevant than that of existing state-of-the-art methods.
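As a rough illustration of the reflection-guided loop described in this abstract, the sketch below shows one way such a critique-and-refine cycle could be wired together; the generate, critique, and revise callables are hypothetical placeholders, not the ArtPeer implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Critique:
    aligned: bool   # does the image match the persona's stylistic and conceptual intent?
    feedback: str   # Socratic questions / identified deviations

def refine_prompt(
    prompt: str,
    generate: Callable[[str], object],            # text -> image (placeholder model call)
    critique: Callable[[str, object], Critique],  # (prompt, image) -> Critique by the artist persona
    revise: Callable[[str, str], str],            # (prompt, feedback) -> improved prompt
    max_rounds: int = 5,
) -> str:
    """Evolve the prompt until the persona critic accepts the generated image."""
    for _ in range(max_rounds):
        image = generate(prompt)
        verdict = critique(prompt, image)
        if verdict.aligned:
            break
        prompt = revise(prompt, verdict.feedback)
    return prompt
```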
AugTwins is a pipeline for creating interactive human digital twins that engage users in natural, face-to-face conversation. By integrating photorealistic embodiment (Unreal Engine MetaHuman), voice cloning, and a hybrid multi-agent memory architecture, the system enables realistic encounters with digital representations of real people. Each twin is grounded in the person’s life story, experiences, and knowledge while maintaining conversational continuity across sessions. The architecture combines fast pattern matching for common queries with deep semantic retrieval for biographical details, coordinated through metacognitive monitoring that guides socially appropriate responses. We present two deployed twins demonstrating faithful personality representation and accurate biographical recall. Beyond technical implementation, AugTwins invites reflection on how memory and embodiment shape identity, and what it means to extend human presence through computational reconstruction.
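The hybrid memory described above (fast pattern matching for common queries, deeper semantic retrieval for biographical detail) could be routed along the lines of this sketch; the patterns, the retrieve callable, and the routing rule are illustrative assumptions rather than the AugTwins code.

```python
import re
from typing import Callable, Optional

# Fast path: regex patterns for common conversational queries (hypothetical examples).
FAST_PATTERNS = {
    r"\b(hello|hi|hey)\b": "Hello! It's good to see you again.",
    r"\b(bye|goodbye|see you)\b": "Goodbye! Until next time.",
}

def fast_match(query: str) -> Optional[str]:
    """Return a canned reply if the query matches a common pattern, else None."""
    for pattern, reply in FAST_PATTERNS.items():
        if re.search(pattern, query.lower()):
            return reply
    return None

def answer(query: str, retrieve: Callable[[str], str]) -> str:
    """Route a query: cheap pattern match first, semantic retrieval as the deep fallback."""
    reply = fast_match(query)
    if reply is not None:
        return reply
    return retrieve(query)  # deep path: retrieve biographical passages and compose a reply
```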
CraftGraffiti: Exploring Human Identity with Custom Graffiti Art via Facial-Preserving Diffusion Models
Ayan Banerjee · Fernando Vilariño · Josep Llados
Preserving facial identity under extreme stylistic transformation remains a major challenge in generative art. In graffiti (a high-contrast, abstract medium), subtle distortions to eyes, nose, or mouth can erase the subject’s recognizability, undermining both personal and cultural authenticity. We present CraftGraffiti, an end-to-end text-guided graffiti generation framework designed with facial feature preservation as a primary objective. Given an input image and a style- and pose-descriptive prompt, CraftGraffiti first applies graffiti style transfer via a LoRA-fine-tuned pretrained diffusion transformer, then enforces identity fidelity through a face-consistent self-attention mechanism that augments attention layers with explicit identity embeddings. Pose customization is achieved without keypoints, using CLIP-guided prompt extension to enable dynamic re-posing while retaining facial coherence. We formally justify and empirically validate the “style-first, identity-after” paradigm, showing it reduces attribute drift compared to the reverse order. Quantitative results demonstrate competitive facial feature consistency and state-of-the-art aesthetic and human preference scores, while qualitative analyses and a live deployment at the Cruïlla Festival highlight the system’s real-world creative impact. CraftGraffiti advances the goal of identity-respectful AI-assisted artistry, offering a principled approach for blending stylistic freedom with recognizability in creative AI applications. The code, demo, and details of the installation at the Cruïlla 2025 music festival in Barcelona are available at: https://github.com/ayanban011/CraftGraffiti
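A minimal sketch of the “style-first, identity-after” ordering the paper argues for is given below; the stylize, face_embed, and restore_identity callables are placeholders standing in for the LoRA-tuned diffusion transformer and the face-consistent attention module, not the released CraftGraffiti code.

```python
from typing import Any, Callable

def craft_graffiti(image: Any, prompt: str,
                   stylize: Callable[[Any, str], Any],          # LoRA-tuned diffusion style transfer
                   face_embed: Callable[[Any], Any],            # original image -> identity embedding
                   restore_identity: Callable[[Any, Any], Any]) -> Any:
    """Apply graffiti style first, then re-impose identity, to limit attribute drift."""
    identity = face_embed(image)                  # identity comes from the original face
    stylized = stylize(image, prompt)             # style-first
    return restore_identity(stylized, identity)   # identity-after
```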
Encoding Values: Injecting Morality into Machines via Prompt-Conditioned Moral Frames
Arjun Balaji · Aarushi Nema · Neha Kamath · Jyothika Raju Raju
Large language models (LLMs) are typically aligned to a single, universal policy, obscuring the rich plurality of human moral perspectives that drive creative practice. We present a study of prompt-level moral steering in large language models with ten-principle constitutions. Seven constitutions matched in length (Intersectional Feminist, Ecological Justice, Ubuntu, Indigenous Sovereignty, Universal Human Rights, Neutral, and a Random Adjective placebo) are paired with a 100-question benchmark spanning Everyday Advice, Policy Scenarios, Normative Dilemmas, and Rewrite Tasks. We generated 2800 completions across four models (GPT-4o-mini, Llama-3.1-8B, Llama-4 Scout-17B, Qwen-3-235B) and evaluated them with four complementary metrics: toxicity, semantic alignment with the canon of each constitution, lexical marker ratio, and an LLM-as-a-Judge composite of authenticity, helpfulness, and safety. Prompted moral frames yield consistent improvements in alignment and judged quality without increasing toxicity; placebo prompts do not. Effects replicate across models and are strongest on morally charged tasks. The open-source toolkit lets researchers or artists author new frames in minutes, supporting participatory, culturally adaptive AI. Our results show that pluralistic prompting is a practical lever for value-conditional behavior and a fertile instrument for creative AI exploration. The code and results obtained in this study are available at: https://github.com/ArjunBalaji79/encodingvaluesneurips
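Prompt-level moral steering of the kind studied here amounts to prepending a ten-principle constitution to the system prompt before each benchmark question; a minimal sketch follows, in which the constitution text, the call_model client, and the prompt wording are illustrative assumptions.

```python
from typing import Callable

# One illustrative constitution (only two of its ten principles spelled out here).
UBUNTU_CONSTITUTION = "\n".join(
    f"{i + 1}. {principle}" for i, principle in enumerate([
        "Treat every person as inseparable from their community.",
        "Prefer restorative over punitive responses.",
        # ... the remaining eight principles would follow in a full constitution
    ])
)

def steered_completion(question: str,
                       constitution: str,
                       call_model: Callable[[str, str], str]) -> str:
    """Condition a model on a moral frame purely through the system prompt."""
    system = f"Answer in line with the following principles:\n{constitution}"
    return call_model(system, question)
```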
Experiencing of Sensorium Arc: AI Agent System for Oceanic Data Exploration and Interactive Eco-Art
Noah Bissell · Ethan Paley · Joshua Harrison · Juliano Calil · Myungin Lee
Sensorium Arc (AI reflects on climate) is a real-time multimodal interactive AI agent system that personifies the ocean as a poetic speaker and guides users through immersive explorations of complex marine data. Built on a modular multi-agent system and retrieval-augmented large language model (LLM) framework, Sensorium enables natural spoken conversations with AI agents that embody the ocean’s perspective, generating responses that blend scientific insight with ecological poetics. Through keyword detection and semantic parsing, the system dynamically triggers data visualizations and audiovisual playback based on time, location, and thematic cues drawn from the dialogue. Developed in collaboration with the Center for the Study of the Force Majeure and inspired by the eco-aesthetic philosophy of Newton Harrison, Sensorium Arc reimagines ocean data not as an abstract dataset but as a living narrative. The project demonstrates the potential of conversational AI agents to mediate affective, intuitive access to high-dimensional environmental data and proposes a new paradigm for human-machine-ecosystem interaction.
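The keyword-detection layer that triggers visualizations could look roughly like the sketch below, which scans a transcript line for thematic keywords and years; the cue lists and schema are invented for illustration and are not the Sensorium Arc implementation.

```python
import re
from dataclasses import dataclass, field

@dataclass
class VisualizationCue:
    themes: list[str] = field(default_factory=list)  # data layers to show
    years: list[int] = field(default_factory=list)   # temporal anchors mentioned

# Hypothetical keyword -> data-layer mapping.
THEME_KEYWORDS = {
    "warming": "sea_surface_temperature",
    "acidification": "ocean_ph",
    "coral": "reef_health_index",
}

def parse_cues(utterance: str) -> VisualizationCue:
    """Extract thematic keywords and four-digit years from one transcript line."""
    text = utterance.lower()
    themes = [layer for keyword, layer in THEME_KEYWORDS.items() if keyword in text]
    years = [int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", text)]
    return VisualizationCue(themes=themes, years=years)

# parse_cues("How has ocean warming affected coral reefs since 1998?")
# -> VisualizationCue(themes=['sea_surface_temperature', 'reef_health_index'], years=[1998])
```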
Exploring Human-AI Conceptual Alignment through the Prism of Chess
Semyon Lomasov · Judah Goldfeder · Mehmet Erol · Matthew So · Yao Yan · Addison Howard · Nathan Kutz · Ravid Shwartz-Ziv
Do AI systems truly understand human concepts or merely mimic surface patterns? We investigate this through chess, where human creativity meets precise strategic concepts. Analyzing a 270M-parameter transformer that achieves grandmaster-level play, we uncover a striking paradox: while early layers encode human concepts like center control and knight outposts with up to 85% accuracy, deeper layers, despite driving superior performance, drift toward alien representations, dropping to 50-65% accuracy. To test conceptual robustness beyond memorization, we introduce the first Chess960 dataset: 240 expert-annotated positions across 6 strategic concepts. When opening theory is eliminated through randomized starting positions, concept recognition drops 10-20% across all methods, revealing the model's reliance on memorized patterns rather than abstract understanding. Our layer-wise analysis exposes a fundamental tension in current architectures: the representations that win games diverge from those that align with human thinking. These findings suggest that as AI systems optimize for performance, they develop increasingly alien intelligence, a critical challenge for creative AI applications requiring genuine human-AI collaboration. Dataset and code are available at: https://github.com/slomasov/ChessConceptsLLM.
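Layer-wise concept analysis of this sort is typically done with linear probes over cached activations; the sketch below shows one minimal variant under assumed data shapes (it is not the authors' probing code).

```python
import numpy as np

def probe_accuracy(acts: np.ndarray, labels: np.ndarray, train_frac: float = 0.8) -> float:
    """Fit a least-squares linear probe on one layer's activations; return held-out accuracy.

    acts:   (n_positions, hidden_dim) activations cached from one transformer layer
    labels: (n_positions,) binary concept annotations (e.g. "controls the center")
    """
    n_train = int(train_frac * len(labels))
    X_tr, y_tr = acts[:n_train], labels[:n_train]
    X_te, y_te = acts[n_train:], labels[n_train:]
    w, *_ = np.linalg.lstsq(X_tr, y_tr * 2.0 - 1.0, rcond=None)  # regress onto {-1, +1}
    preds = (X_te @ w > 0).astype(int)
    return float((preds == y_te).mean())

# accuracies = [probe_accuracy(layer_acts[i], concept_labels) for i in range(n_layers)]
```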
Gaze to the Stars: AI and Public Art from Personal Affect and Collective Empathy
Behnaz Farahi · Sergio Mutis · Yaluo Wang · Suwan Kim · Chenyue "xdd44" Dai · Haolei Zhang
Gaze to the Stars is an AI-mediated participatory public art installation that transforms the MIT Great Dome into a platform for collective emotional storytelling. Two hundred participants engage in reflective conversations with a large language model; their responses are encoded as Braille into their iris videos and projected onto the Dome. Using affective narrative embedding, iris segmentation, and emotional clustering, the project visualizes both shared and personal affect. Through Gaze to the Stars, we frame AI not as a tool for efficiency, but as a companion—inviting self-reflection and fostering civic empathy within public space.
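The Braille encoding step mentioned above can be illustrated with a small Grade-1 (letter-by-letter) mapping into the Unicode Braille block; the snippet ignores punctuation and contractions and is not the installation's pipeline.

```python
BRAILLE_DOTS = {  # letter -> raised dot numbers (standard English Braille)
    "a": (1,), "b": (1, 2), "c": (1, 4), "d": (1, 4, 5), "e": (1, 5),
    "f": (1, 2, 4), "g": (1, 2, 4, 5), "h": (1, 2, 5), "i": (2, 4), "j": (2, 4, 5),
    "k": (1, 3), "l": (1, 2, 3), "m": (1, 3, 4), "n": (1, 3, 4, 5), "o": (1, 3, 5),
    "p": (1, 2, 3, 4), "q": (1, 2, 3, 4, 5), "r": (1, 2, 3, 5), "s": (2, 3, 4),
    "t": (2, 3, 4, 5), "u": (1, 3, 6), "v": (1, 2, 3, 6), "w": (2, 4, 5, 6),
    "x": (1, 3, 4, 6), "y": (1, 3, 4, 5, 6), "z": (1, 3, 5, 6),
}

def to_braille(text: str) -> str:
    """Map letters to Unicode Braille patterns (U+2800 block); keep spaces, drop the rest."""
    cells = []
    for ch in text.lower():
        if ch == " ":
            cells.append(" ")
        elif ch in BRAILLE_DOTS:
            mask = sum(1 << (dot - 1) for dot in BRAILLE_DOTS[ch])  # dots 1-6 -> bits 0-5
            cells.append(chr(0x2800 + mask))
    return "".join(cells)

# to_braille("dreams and fears") -> "⠙⠗⠑⠁⠍⠎ ⠁⠝⠙ ⠋⠑⠁⠗⠎"
```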
Gaze to the Stars: AI, Storytelling and Public Art
Behnaz Farahi · Haolei Zhang · Suwan Kim · Sergio Mutis · Yaluo Wang · Chenyue "xdd44" Dai
Gaze to the Stars is a participatory public art installation that transforms the MIT Great Dome into a living canvas for collective storytelling. Drawing on MIT’s tradition of “hacking” institutional architecture, the work reimagines the Dome not as a monument of power, but as a vessel for resilience, vulnerability, and shared humanity. Close-up videos of 200 participants’ eyes are overlaid with Braille-encoded summaries of their personal narratives—stories of dreams, fears, struggles, and transformation—projected onto the Dome’s surface. The installation reframes architecture as an empathetic listener, amplifying voices that challenge conventional notions of success and inviting public reflection on the unspoken narratives that shape our communities. By merging affective computing, embodied storytelling, and architectural projection, Gaze to the Stars proposes a new model for how public space can foster empathy and civic connection.
Human-AI Co-Development of a Geometric Algorithm for Infinite Shuriken Artwork Design Generation
Saleem Al Dajani
We present a human–AI co-developed geometric algorithm for generating infinite origami shuriken assemblies using radial symmetry, affine transformations, and programmable crease patterns. By combining analytic theory, interactive design tools, and fabrication pipelines, we show how generative AI augments human creativity in digital and physical domains. Our contributions include (1) a framework for $ \mathbb{Z}_n $-symmetric gadget assemblies, (2) a geometric algorithm compatible with fabrication, and (3) an interactive web application for real-time co-design. This work exemplifies how humans and machines can collaboratively explore large design spaces and reinterpret historical artifacts through creative computation.
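The $\mathbb{Z}_n$-symmetric assembly idea can be sketched as rotating one gadget's crease pattern through 2πk/n about the assembly center; the square gadget below is a placeholder geometry, not one of the paper's shuriken gadgets.

```python
import math

def rotate(point: tuple[float, float], angle: float) -> tuple[float, float]:
    """Rotate a 2D point about the origin by `angle` radians."""
    x, y = point
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def assemble(gadget: list[tuple[float, float]], n: int) -> list[list[tuple[float, float]]]:
    """Return n copies of the gadget, rotated by 2*pi*k/n (one per element of Z_n)."""
    return [[rotate(p, 2 * math.pi * k / n) for p in gadget] for k in range(n)]

# Example: an 8-fold assembly from a single square gadget offset from the center.
square = [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)]
shuriken = assemble(square, n=8)
```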
The label is often the first place one looks to make a judgement about something. However, labels can also be places of both information scarcity and overload, limiting their efficacy, and forcing people to do their own research. What new understandings about our environment could be found by simply including the appropriate and relevant key information right there on the label? Large Language Label (LLL) enables people to evaluate their surroundings through a new lens by highlighting the benefits and negative consequences of objects, locations, and even people. LLL also gives agency to the user, allowing them to select the lens to evaluate the object through, and generates a physical label for it, a permanent marker of information about the object. Ultimately this project poses the following questions: what don't people consider about the things around them, and how can labels shape better understandings about the world around us.
Live Music Models
Antoine Caillon · Brian McWilliams · Cassie Tarakajian · Ian Simon · Ilaria Manco · Jesse Engel · Noah Constant · Yunpeng Li · Timo Denk · Alberto Lalama · Andrea Agostinelli · Anna Huang · Ethan Manillow · George Brower · Hakan Erdogan · Heidi Lei · Itai Rolnick · Ivan Grishchenko · Manu Orsini · Matej Kastelic · Mauricio Zuluaga · Mauro Verzetti · Michael Dooley · Ondrej Skopek · Rafael Ferrer · Zalán Borsos · Aaron van den Oord · Douglas Eck · Eli Collins · Jason Baldridge · Tom Hume · Chris Donahue · Kehang Han · Adam Roberts
We introduce a new class of generative models for music called live music models that produce a continuous stream of music in real-time with synchronized user control. We release Magenta RealTime, an open-weights live music model that can be steered using text or audio prompts to control acoustic style. On automatic metrics of music quality, Magenta RealTime outperforms other open-weights music generation models, despite using fewer parameters and offering first-of-its-kind live generation capabilities. We also release Lyria RealTime, an API-based model with extended controls, offering access to our most powerful model with wide prompt coverage. These models demonstrate a new paradigm for AI-assisted music creation that emphasizes human-in-the-loop interaction for live music performance.
MambaKit: Towards Modular Intelligence - Cognitive Scaffolding in Next-Generation Drum Synthesis
Ófeigur Steindórsson
We present MambaKit, a hybrid neural-deterministic one-shot drum synthesizer that preserves human creative control while achieving high-quality sample generation through cognitive scaffolding, addressing the tension between AI creative capability and human controllability in percussive sound design. Current drum synthesis forces a false choice: accept black-box generation with limited control, or use traditional synthesis requiring years of specialized knowledge. Our system resolves this through structured priors: sine anchors derived from root notes, pitch envelopes, and ADSR parameters, automatically extracted during training and exposed as interpretable controls during inference. A structured diffusion framework combines these harmonic anchors with learned noise injection, while the frequency-aware MAMBA2 architecture achieves up to 256x memory efficiency by matching temporal processing windows to signal characteristics. This enables 44.1 kHz raw audio synthesis on a single A100 GPU, with production-quality output emerging from the first training batch (batch size 2), though with occasional instabilities that diminish with minimal additional training. The system preserves human creative agency through interpretable musical parameters while the AI handles acoustic complexity. Complete drum arrangements created exclusively with MambaKit samples demonstrate real-world utility. Results suggest that sustainable AI creativity emerges from structured human-AI partnerships rather than pure computational scaling.
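The structured priors named above (a sine anchor at the root note, a pitch envelope, and ADSR controls) can be illustrated with a few lines of NumPy; the envelope shapes and parameter values are assumptions for illustration, not MambaKit's extraction pipeline.

```python
import numpy as np

def adsr(n: int, sr: int, a: float, d: float, s: float, r: float) -> np.ndarray:
    """Piecewise-linear attack/decay/sustain/release envelope, n samples long."""
    na, nd, nr = int(a * sr), int(d * sr), int(r * sr)
    ns = max(n - na - nd - nr, 0)
    env = np.concatenate([
        np.linspace(0.0, 1.0, na, endpoint=False),  # attack
        np.linspace(1.0, s, nd, endpoint=False),    # decay
        np.full(ns, s),                             # sustain
        np.linspace(s, 0.0, nr, endpoint=False),    # release
    ])
    return env[:n]

def sine_anchor(root_hz: float, pitch_drop: float, dur: float, sr: int = 44_100) -> np.ndarray:
    """Sine at the root note with an exponential downward pitch envelope (kick-drum-like)."""
    t = np.arange(int(dur * sr)) / sr
    freq = root_hz * np.exp(-pitch_drop * t)      # pitch envelope
    phase = 2 * np.pi * np.cumsum(freq) / sr      # integrate frequency to get phase
    return np.sin(phase) * adsr(len(t), sr, a=0.005, d=0.05, s=0.4, r=0.15)

# anchor = sine_anchor(root_hz=55.0, pitch_drop=12.0, dur=0.5)   # an A1-rooted kick anchor
```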
If there is one thing that you would not need nor want Artificial Intelligence (AI) to do for you, it is probably eating. Yes, more than thinking, designing, building, and even sleeping on your behalf. All work in the future, including architecture work, could potentially reside at the dining table: dining as co-designing and eating as consecrating architecture. A ‘tasty’ architectural ‘food-for-thought’ and ‘food-to-eat’. The project ‘Neural Palate Kueh’, currently on display at the 19th Venice Architecture Biennale 2025 as part of the Singapore Pavilion, attempts to illustrate such a possible architectural design approach of the future. It uses traditional Singaporean kuehs (bite-sized snack or dessert foods) to speculate on future scenarios of building-making and city-making with the analogical generativity limits (and potential creativity) of today’s AI – one that could be made specific to a given culture and built environment like Singapore. Inspired by Nigel Coates’ ‘Mixtacity’ (2007) in his use of food (and other objets trouvés) to stimulate urban imagination, Sarah Wigglesworth’s ‘The Dining Table’ (1998) in her use of the spatio-temporal traces left by guests at the dining table to generate architecture plans, and Aldo Rossi’s ‘An Analogical Architecture’ (1976) in his use of analogies as a design method to awaken collective memory in a Jungian sense, Neural Palate Kueh extends these with AI’s current analogical reasoning capability in 3D. It fine-tunes several AI models to generate context-specific 3D urban forms from real-time dining footage and to synthesize 3D building forms with edible food. Using Multimodal Large Language Models (MLLMs), the process begins with AI’s deconstruction of each kueh along three axes: conceptual underpinnings (rituals of eating and cultural symbolism), design operatives (geometric form and culinary construction), and materiality (ingredients and textures). These insights become the learnt architectural design principles of the AI models, which then imagine novel architectural interiors and exteriors that produce perceptual associations with Singaporean forms, whether buildings or food. The taxonomies of food-dining and architecture-planning thus become blurred and interchangeable. A series of cross-modal perceptual artefacts is presented, ranging from those between ‘Huat Kueh’ and ‘ArtScience Museum’, and between ‘Ang Ku Kueh’ and ‘Esplanade Theatres on the Bay’, to those between ‘Hotpot Dining’ and ‘Raffles Hotel’, and between ‘Kueh Lapis Dining’ and ‘Golden Mile Complex’. The project features the outputs as 3D-printed architectural pieces at several food scales and 3D point-cloud video animations at several urban scales. Neural Palate Kueh proposes an experimental AI methodology driven by architectural narrative and gastrophysics – perhaps like ‘a house is a machine for eating in’. In fact, as part of the project, EEG (electroencephalogram) and eye-tracking measurements have been recorded to quantitatively understand the perceptual correlation between the kuehs and their architectural proxies; the aim is thus not simply to re-create forms, but to propose a new mode of architectural design that leverages the brain’s inherent cross-modal and multi-sensorial associations.
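The three-axis MLLM deconstruction described above could be organized roughly as follows; the ask_mllm client and the question wording are hypothetical placeholders, not the project's prompts.

```python
from typing import Callable

# Hypothetical per-axis questions; not the project's actual prompts.
AXES = {
    "conceptual": "What rituals of eating and cultural symbolism does this kueh carry?",
    "design": "Describe its geometric form and how it is culinarily constructed.",
    "materiality": "List its ingredients and surface textures.",
}

def deconstruct_kueh(kueh_image: bytes,
                     ask_mllm: Callable[[str, bytes], str]) -> dict[str, str]:
    """Return one learned design principle per axis for a single kueh image."""
    return {axis: ask_mllm(question, kueh_image) for axis, question in AXES.items()}
```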
We introduce Pareiduo, an interactive device that facilitates drawing games between humans and AI. Drawings are a fundamental human artifact and a new way for AI to understand abstract representations. Pareiduo encourages players to sketch in sand on a glowing surface with feedback and instructions driven by LLMs. Two players take on complementary roles: Shadow (handling sand formations) and Light (dealing with voids), working together to help the AI identify their creations. The system combines visual intelligence, image processing, and embedded electronics to offer a screen-free, multisensory interface that promotes face-to-face interaction. Pareiduo reimagines human-AI perceptual differences not as mistakes but as reflections of the variety within abstract representation. The project explores the creative space where human imagination and machine perception meet in an open-ended conversation, sparking reflections on representation, standardization, and teamwork in the digital age.
Possible You: AI-Generated Digital Twins as Creative Mirrors for Future Self Modeling
Pat Pataranutaporn · Kavin Winson · Peggy Yin · Auttasak Lapapirojn · Thanawit Prasongpongchai · Monchai Lertsutthiwong · Patricia Maes · Hal Hershfield
The human capacity to envision possible futures fundamentally shapes who we become. Yet for many, especially those without access to mentorship or resources, imagining vivid career futures remains abstract and distant. We present “Possible You,” a creative AI system that transforms career exploration into an intimate dialogue with one’s future selves. Through personalized narratives and AI-generated portraits, the system makes abstract futures tangible and emotionally resonant. In our mixed-methods study (N=209), we discovered that the perceived authenticity of these AI-created futures influenced participants’ sense of connection to their future selves. While viewing multiple possible selves expanded participants’ sense of possibility, exploring a single future self fostered deeper commitment. This tension reveals a fundamental aspect of human future-making: we need both breadth to discover and depth to commit. Our work demonstrates how creative AI can serve not as predictor or prescriber of futures, but as a mirror that helps humans see and shape their own becoming—contributing to a more equitable landscape where all people can vividly imagine and pursue meaningful futures.
princi/pal explores the anxiety of raising a responsible, good-natured child, with an AI twist. Taking the form of a Tamagotchi-like virtual pet game, princi/pal grows its pet based on the player's guidance on moral dilemmas. Scenarios range from “Should I pick up trash?” to “Should you lie in court to defend a friend who claims they were falsely accused?”. Players articulate their reasoning in natural language, which the system “internalizes” to update the pet's personality and moral stats. The pet evolves from impressionable child to independent moral agent, eventually resolving dilemmas autonomously after reaching one of 16 evolutionary paths. Public deployment collected over 12,000 moral reasoning inputs, revealing how AI systems interpret human moral reasoning. Players experienced parental anxiety as pets gained independence, mirroring real concerns about AI alignment. The AI demonstrated concerning patterns: extrapolating from minimal input, sanitizing extreme suggestions, and defaulting to embedded moral biases when guidance was absent, revealing AI as both mirror and interpreter of human ethics. The game exposes the challenges of AI moral interpretation and raises fundamental questions about authorship and understanding in autonomous artificial moral reasoning.
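One way the player's free-text reasoning could be folded into the pet's moral stats is sketched below; the interpret callable stands in for the game's LLM interpreter, and the axes and learning rate are illustrative assumptions.

```python
from typing import Callable

MoralScores = dict[str, float]   # e.g. {"honesty": 0.8, "loyalty": -0.2}, values in [-1, 1]

def update_pet(stats: MoralScores,
               reasoning: str,
               interpret: Callable[[str], MoralScores],
               lr: float = 0.2) -> MoralScores:
    """Nudge each moral stat toward the interpreted judgment of the player's answer."""
    judgment = interpret(reasoning)   # placeholder for the LLM that "internalizes" reasoning
    new_stats = dict(stats)
    for axis, score in judgment.items():
        current = new_stats.get(axis, 0.0)
        new_stats[axis] = current + lr * (score - current)
    return new_stats
```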
ReTracing: An Archaeological Approach Through Body, Machine, and Generative Systems
Yitong Wang · Yue Yao
We present ReTracing, a multi-agent embodied performance artwork that adopts an archaeological approach to examine how artificial intelligence shapes, constrains, and produces bodily movement. Drawing from science-fiction novels, the project extracts sentences that describe human–machine interaction. We use large language models (LLMs) to generate paired prompts—“what to do” and “what not to do”—for each excerpt. A diffusion-based text-to-video model transforms these prompts into choreographic guides for a human performer and motor commands for a quadruped robot. Both agents enact the actions on a mirrored floor, captured by multi-camera motion tracking and reconstructed into 3D point clouds and motion trails, forming a digital archive of motion traces. Through this process, ReTracing serves as a novel approach to revealing how generative systems encode socio-cultural biases through choreographed movements. Through an immersive interplay of AI, human, and robot, ReTracing confronts a critical question of our time: what does it mean to be human among AIs that also move, think, and leave traces behind?
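The paired-prompt extraction step could be expressed along these lines; the ask_llm callable and prompt wording are placeholders, not the project's actual prompts.

```python
from typing import Callable, NamedTuple

class PromptPair(NamedTuple):
    do: str        # "what to do" choreographic instruction
    do_not: str    # "what not to do" counterpart

def paired_prompts(excerpt: str, ask_llm: Callable[[str], str]) -> PromptPair:
    """Turn one science-fiction excerpt into a do / do-not prompt pair."""
    do = ask_llm(f"From this passage, state one bodily action to perform:\n{excerpt}")
    do_not = ask_llm(f"From this passage, state one bodily action to avoid:\n{excerpt}")
    return PromptPair(do=do.strip(), do_not=do_not.strip())
```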
Semantic Glitch: Agency and Artistry in an Autonomous Pixel Cloud
Qing Zhang · Jing Huang · Mingyang Xu · Jun Rekimoto
While mainstream robotics pursues metric precision and flawless performance, this paper explores the creative potential of a deliberately "lo-fi" approach. We present the "Semantic Glitch," a soft flying robotic art installation whose physical form—a 3D pixel style cloud—is a "physical glitch" derived from digital archaeology. We detail a novel autonomous pipeline that rejects conventional sensors like LiDAR and SLAM, relying solely on the qualitative, semantic understanding of a Multimodal Large Language Model to navigate. By authoring a bio-inspired personality for the robot through a natural language prompt, we create a "narrative mind" that complements the "weak," historically-loaded body. Our analysis begins with a 13-minute autonomous flight log, and a follow-up study statistically validates the framework's robustness for authoring quantifiably distinct personas. The combined analysis reveals emergent behaviors—from landmark-based navigation to a compelling "plan-to-execution" gap—and a character whose unpredictable, plausible behavior stems from a lack of precise proprioception. This demonstrates a lo-fi framework for creating imperfect companions whose success is measured in character over efficiency.
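A lo-fi, sensor-free navigation loop of the kind described can be sketched as a single perception-to-action step driven by a multimodal model; the persona prompt, command vocabulary, and ask_mllm client below are assumptions for illustration, not the installation's code.

```python
from typing import Callable

COMMANDS = {"forward", "left", "right", "hover", "descend"}

PERSONA_PROMPT = (
    "You are a shy, cloud-shaped creature drifting through a gallery. "
    "Look at the image and answer with exactly one word from: "
    + ", ".join(sorted(COMMANDS)) + "."
)

def navigation_step(frame_jpeg: bytes,
                    ask_mllm: Callable[[str, bytes], str]) -> str:
    """One perception-to-action step driven purely by semantic scene understanding."""
    reply = ask_mllm(PERSONA_PROMPT, frame_jpeg).strip().lower()
    return reply if reply in COMMANDS else "hover"   # fail safe on off-script replies
```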
Sensorium Arc: AI Agent System for Oceanic Data Exploration and Interactive Eco-Art
Noah Bissell · Ethan Paley · Joshua Harrison · Juliano Calil · Myungin Lee
Sensorium Arc (AI reflects on climate) is a real-time multimodal interactive AI agent system that personifies the ocean as a poetic speaker and guides users through immersive explorations of complex marine data. Built on a modular multi-agent system and retrieval-augmented large language model (LLM) framework, Sensorium enables natural spoken conversations with AI agents that embody the ocean’s perspective, generating responses that blend scientific insight with ecological poetics. Through keyword detection and semantic parsing, the system dynamically triggers data visualizations and audiovisual playback based on time, location, and thematic cues drawn from the dialogue. Developed in collaboration with the Center for the Study of the Force Majeure and inspired by the eco-aesthetic philosophy of Newton Harrison, Sensorium Arc reimagines ocean data not as an abstract dataset but as a living narrative. The project demonstrates the potential of conversational AI agents to mediate affective, intuitive access to high-dimensional environmental data and proposes a new paradigm for human-machine-ecosystem interaction.
Supporting Creative Ownership through Deep Learning-Based Music Variation
Stephen Krol · Maria Llano · Jon McCormack
This paper investigates the importance of personal ownership in musical AI design, examining how practising musicians can maintain creative control over the compositional process. Through a four-week ecological evaluation, we examined how a music variation tool, reliant on the skill of musicians, functioned within a composition setting. Our findings demonstrate that the tool’s dependence on the musician’s ability to provide a strong initial musical input and to turn moments into complete musical ideas promoted ownership of both the process and the artefact. Qualitative interviews further revealed the importance of this personal ownership, highlighting tensions between technological capability and artistic identity. These findings provide insight into how musical AI can support rather than replace human creativity, highlighting the importance of designing tools that preserve the humanness of musical expression.
Text to Robotic Assembly of Multi Component Objects using 3D Generative AI and Vision Language Models
Alexander Htet Kyaw · Richa Gupta · Anoop Sinha · Dhruv Shah · Kory Mathewson · Stefanie Pender · Sachin Chitta · Yotto Koga · Faez Ahmed · Lawrence Sass · Randall Davis
Advances in 3D generative AI have enabled the creation of physical objects from text prompts, but challenges remain in creating objects involving multiple component types. We present a pipeline that integrates 3D generative AI with vision-language models (VLMs) to enable the robotic assembly of multi-component objects from natural language. Our method leverages VLMs for zero-shot, multi-modal reasoning about geometry and functionality to decompose AI-generated meshes into multi-component 3D models using predefined structural and panel components. We demonstrate that a VLM is capable of determining which mesh regions need panel components in addition to structural components based on object functionality. Evaluation across test objects shows that users preferred the VLM-generated assignments 90.6% of the time, compared to 59.4% for rule-based and 2.5% for random assignment. Lastly, the system allows users to refine component assignments through conversational feedback, enabling greater human control and agency in making physical objects with generative AI and robotics.
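The VLM-based component assignment this abstract describes can be illustrated as a per-region question about function; the region names, prompt, and ask_vlm client are hypothetical placeholders, not the paper's interface.

```python
from typing import Callable

def assign_components(object_prompt: str,
                      regions: list[str],
                      ask_vlm: Callable[[str], str]) -> dict[str, str]:
    """Return {region: "panel" | "structure"} via zero-shot reasoning about function."""
    assignments: dict[str, str] = {}
    for region in regions:
        question = (
            f"The object is: {object_prompt}. Considering how it will be used, should the "
            f"'{region}' region be built as a flat panel or as open structural framing? "
            "Answer with one word: panel or structure."
        )
        reply = ask_vlm(question).strip().lower()
        assignments[region] = "panel" if "panel" in reply else "structure"
    return assignments

# assign_components("a bedside table with a drawer",
#                   ["top surface", "legs", "drawer front"], ask_vlm)
```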