Workshop: Workshop on Machine Learning Safety

Measuring Robustness with Black-Box Adversarial Attack using Reinforcement Learning

Soumyendu Sarkar · Sajad Mousavi · Ashwin Ramesh Babu · Vineet Gundecha · Sahand Ghorbanpour · Alexander Shmakov


A measure of robustness against naturally occurring distortions is key to the trustworthiness, safety, and success of machine learning models on deployment. We investigate an adversarial black-box attack that adds minimum Gaussian noise distortions to input images to make deep learning models misclassify. We used a Reinforcement Learning (RL) agent as a smart hacker to explore the input images to add minimum distortions to the most sensitive regions to induce misclassification. The agent employs a smart policy also to remove noises introduced earlier, which has less impact on the trained model at a given state. This novel approach is equivalent to doing a deep tree search to add noises without an exhaustive search, leading to faster and optimal convergence. Also, this adversarial attack method effectively measures the robustness of image classification models with the misclassification inducing minimum L2 distortion of Gaussian noise similar to many naturally occurring distortions. Furthermore, the proposed black-box L2 adversarial attack tool beats state-of-the-art competitors in terms of the average number of queries by a significant margin with a 100\% success rate while maintaining a very competitive L2 score, despite limiting distortions to Gaussian noise. For the ImageNet dataset, the average number of queries achieved by the proposed method for ResNet-50, Inception-V3, and VGG-16 models are 42%, 32%, and 31% better than the state-of-the-art "Square-Attack" approach while maintaining a competitive L2.Demo:

Chat is not available.