Data Generation for Benchmarking Deep Learning on Materials Images via Noise Injection and CycleGAN
Abstract
Manual annotation of materials microscopy images is time-consuming, costly, and requires domain expertise. This annotation bottleneck limits both model training and fair benchmarking. Prior cycle-consistent generative adversarial network (CycleGAN)-based data generation, although promising, often relied on computationally expensive simulations and struggled to capture the diverse noise characteristics of real images, making it task-specific. In this study, we introduce an automated pipeline that simplifies dataset generation and improves generality by combining parametric simulations, diverse modality-specific noise injection, and CycleGAN-based texture transfer while preserving the ground-truth masks. A case study on optical microscopy images of rubber materials exhibiting stripe-like noise highlights the pipeline's versatility. We further evaluated the pipeline on a public transmission electron microscopy (TEM) nanoparticle dataset to obtain a quantitative comparison with manual annotations. Models trained on the generated data achieved segmentation accuracy approaching that of models trained on human-labeled data, and the generated images reproduced characteristic imaging artifacts. This framework reduces dataset construction cost, explicitly addresses noise diversity, and enables customized, reproducible, and noise-aware benchmarks aligned with real experimental settings.
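To make the first two pipeline stages concrete, the following is a minimal sketch of parametric simulation followed by modality-specific noise injection. All function names, intensity values, and noise parameters here are illustrative assumptions, not the paper's actual implementation, and the CycleGAN texture-transfer stage is omitted. Note that the ground-truth mask is fixed before any noise is applied, so labels remain exact.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_particles(size=256, n_particles=20, r_range=(8, 20)):
    """Parametric simulation: random disks on a blank canvas.
    Returns a grayscale image and its ground-truth binary mask."""
    img = np.full((size, size), 0.2, dtype=np.float32)   # background intensity
    mask = np.zeros((size, size), dtype=np.uint8)
    yy, xx = np.mgrid[0:size, 0:size]
    for _ in range(n_particles):
        cy, cx = rng.integers(0, size, 2)
        r = rng.integers(*r_range)
        disk = (yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2
        img[disk] = 0.8                                  # particle intensity
        mask[disk] = 1
    return img, mask

def inject_stripe_noise(img, amplitude=0.05, period=7):
    """Modality-specific artifact: a horizontal stripe pattern,
    as in the optical-microscopy case study on rubber materials."""
    rows = np.arange(img.shape[0])[:, None]
    stripes = amplitude * np.sin(2 * np.pi * rows / period)
    return np.clip(img + stripes, 0.0, 1.0)

def inject_shot_noise(img, scale=200.0):
    """Signal-dependent (Poisson-like) noise, typical of electron imaging."""
    return np.clip(rng.poisson(img * scale) / scale, 0.0, 1.0).astype(np.float32)

# Masks are created before noise injection, so the labels stay exact.
clean, gt_mask = simulate_particles()
noisy = inject_shot_noise(inject_stripe_noise(clean))
# `noisy` would then be passed through a CycleGAN generator for texture transfer.
```

In this sketch, noise injectors are composable functions, so new modality-specific artifacts can be added without changing the simulation or the masks.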