Machine learning models commonly exhibit unexpected failures post-deployment, caused either by data shifts or by situations that were uncommon in the training environment. Domain experts typically go through the tedious process of inspecting the failure cases manually, identifying failure modes and then attempting to fix the model. In this work, we aim to standardise this process and put it on a principled footing by answering a critical question: how do we know that we have identified meaningful and distinct failure types? We suggest that the quality of the identified failure types can be validated by measuring intra- and inter-type generalisation after fine-tuning, and we introduce metrics to compare different subtyping methods. In addition, we propose a data-driven method for identifying failure types based on clustering in the gradient space. We evaluate its utility on a classification task and an object detection task, and show that gradient clustering not only identifies the failure types of highest quality according to our metrics but also identifies clinically important failures, such as undetected catheters close to the ultrasound probe in intracardiac echocardiography.
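To make the idea of clustering failures in the gradient space concrete, the following is a minimal illustrative sketch and not the implementation evaluated in this work. It assumes a toy logistic-regression model, unit-normalised per-sample loss gradients, and a small k-means with deterministic farthest-point initialisation; the function names (`per_sample_gradients`, `kmeans`, `cluster_failures`) and the toy data are hypothetical.

```python
import numpy as np


def per_sample_gradients(w, X, y):
    """Gradient of the logistic loss w.r.t. w, one row per sample."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return (p - y)[:, None] * X


def kmeans(points, k, n_iter=20):
    """Small k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    for _ in range(1, k):
        # Squared distance of every point to its nearest existing centre.
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels


def cluster_failures(w, X, y, k):
    """Return indices of misclassified samples and a cluster label for each."""
    pred = (X @ w > 0).astype(int)
    idx = np.flatnonzero(pred != y)
    g = per_sample_gradients(w, X[idx], y[idx].astype(float))
    # Assumption: compare gradient directions only, so normalise to unit length.
    g = g / (np.linalg.norm(g, axis=1, keepdims=True) + 1e-12)
    return idx, kmeans(g, k)


# Toy data (hypothetical): two kinds of false positives, driven by different features.
w = np.array([1.0, 1.0])
X = np.array([[3.0, 0.0], [2.5, 0.1], [0.0, 3.0], [0.1, 2.5], [3.0, 0.2], [-3.0, -3.0]])
y = np.array([0, 0, 0, 0, 1, 0])
idx, labels = cluster_failures(w, X, y, k=2)
```

In this toy setting the failures caused by the first feature and those caused by the second have nearly orthogonal gradients, so they fall into separate clusters. For a deep network, the per-sample gradients would typically be taken with respect to a chosen subset of parameters, which is a design choice this sketch does not address.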