Recent developments have shown multiple ways to tackle whole-slide image classification with weak labels, a challenging task due to memory constraints. A recent example is Clustering-constrained Attention Multiple instance learning (CLAM), which encodes whole-slide images (WSIs) into a smaller set of features. The downside of this approach is that the encoder uses ImageNet pre-trained weights for feature extraction, which might result in suboptimal features for downstream classification tasks. In this study, we propose training the CLAM model end-to-end using streaming stochastic gradient descent, which can train deep neural networks at a near-static memory cost regardless of input image size. This way, the encoder can learn task-specific feature representations of whole-slide images. We show that it is possible to train with images of 65536 × 65536 pixels at 0.5 µm per pixel, and obtain improved results on public datasets for metastasis detection in breast cancer.