Deep learning models have reached or surpassed human-level performance in the field of medical imaging, especially in disease diagnosis using chest X-rays. However, prior work has found that such classifiers can exhibit biases in the form of gaps in predictive performance across protected groups. In this paper, we benchmark the performance of nine methods for improving the fairness of these classifiers. We adopt the minimax definition of fairness, which focuses on maximizing the performance of the worst-case group. Our experiments show that certain methods are able to improve worst-case performance for selected metrics and protected attributes. However, we find that the magnitude of such gains is limited. Finally, we provide best practices for selecting fairness definitions for use in the clinical setting.
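The minimax criterion described above can be sketched as follows: among candidate models, prefer the one whose worst per-group score is highest. This is a minimal illustration under assumed inputs; the group names, model names, and scores are hypothetical and not taken from the paper.

```python
# Minimax fairness sketch: a model's fairness score is its minimum
# performance over protected groups, and model selection maximizes
# that minimum. All names and numbers below are hypothetical.

def worst_group_score(scores_by_group):
    """Return the minimum score over protected groups."""
    return min(scores_by_group.values())

def select_minimax(models):
    """Pick the model maximizing the worst-case group score."""
    return max(models, key=lambda name: worst_group_score(models[name]))

# Hypothetical per-group AUCs for two candidate models.
models = {
    "baseline":  {"group_a": 0.86, "group_b": 0.74},
    "mitigated": {"group_a": 0.83, "group_b": 0.79},
}

best = select_minimax(models)
print(best)  # "mitigated": its worst group (0.79) beats the baseline's (0.74)
```

Note that the minimax choice can differ from the choice that maximizes average performance: the baseline above has the higher mean score, but the mitigated model has the better worst-case group.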