Poster in Workshop: Machine Learning for Systems

Mycroft: Towards Effective and Efficient External Data Augmentation

Zain Sarwar · Van Tran · Arjun Bhagoji · Nicholas Feamster · Ben Zhao · Supriyo Chakraborty


Abstract: In data-scarce domains like networked systems, external data augmentation may often be necessary to improve training data quality, as model trainers usually only have visibility into limited portions of the underlying data distribution. However, relevant data is often privately owned, making it both difficult and expensive for trainers to identify and acquire the needed training data. In this study, we introduce Mycroft, a data-efficient approach that allows model trainers to evaluate the utility of private data from various owners while operating under a limited data-sharing budget. Mycroft leverages feature space distances to identify small, high-utility data subsets from each data owner, which serve as indicators of the overall dataset's utility. In domains with differentiable models, Mycroft can effectively apply gradient matching techniques to identify these valuable data subsets. Our experiments, including novel threat detection in IoT networks and image classification in the vision domain, show that Mycroft quickly reaches performance levels comparable to the baseline where all the data is shared.
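
To make the idea of selecting a small subset by feature space distance concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the trainer supplies feature vectors for a few samples of interest (for instance, ones the current model handles poorly; that choice is an assumption here), that the data owner holds feature vectors in the same space, and that plain Euclidean distance is an adequate measure. The function name `select_nearest_subset` and the `budget` parameter are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_nearest_subset(trainer_samples, owner_samples, budget):
    """Return indices of up to `budget` owner samples that lie closest
    to any of the trainer's samples in feature space.

    Illustrative only: the distance metric, the choice of trainer samples,
    and the ranking rule are assumptions, not details from the abstract.
    """
    scored = []
    for idx, candidate in enumerate(owner_samples):
        # Score each owner sample by its distance to the nearest trainer sample.
        nearest = min(euclidean(candidate, t) for t in trainer_samples)
        scored.append((nearest, idx))
    scored.sort()
    return [idx for _, idx in scored[:budget]]


# Toy feature vectors (made up for illustration).
trainer = [[0.0, 0.0], [1.0, 1.0]]
owner = [[5.0, 5.0], [0.1, 0.0], [0.9, 1.2], [3.0, 3.0]]
print(select_nearest_subset(trainer, owner, budget=2))
```

In this toy setup the owner would share only the two selected rows, which the trainer could then use to judge whether acquiring the rest of the owner's data is worthwhile.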
