In recent years, generative adversarial networks (GANs) have been established as state-of-the-art in various medical image synthesis tasks, owing to their sensitivity to detailed structure. However, GAN models built on convolutional neural network (CNN) backbones perform local processing with compact filters, an inductive bias that compromises the learning of long-range spatial dependencies. Here, we propose ResViT, a novel generative approach for medical image synthesis that combines the local precision of convolution operators with the contextual sensitivity of vision transformers. ResViT employs an encoder-decoder architecture with a central bottleneck composed of novel aggregated residual transformer (ART) blocks that are expressive for both local and long-range interactions among image features. Demonstrations on MRI datasets indicate the superiority of ResViT in both qualitative observations and quantitative metrics.
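To make the hybrid design concrete, the following is a minimal, purely illustrative sketch of the core idea behind an ART-style bottleneck block: a residual combination of a local convolutional path and a global self-attention path. All names here (`art_block`, `conv3x3`, `self_attention`) are hypothetical simplifications; the actual ResViT blocks operate on multi-channel feature maps with learned query/key/value projections, normalization, and channel-compression modules not shown.

```python
import numpy as np

def conv3x3(x, w):
    # Naive "same"-padded 3x3 cross-correlation on a single-channel map:
    # stands in for the local convolutional path of the block.
    H, W = x.shape
    p = np.pad(x, 1)
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(p[i:i + 3, j:j + 3] * w)
    return out

def softmax(z):
    # Numerically stable softmax along the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    # Single-head self-attention over flattened spatial positions
    # (one scalar feature per token): every pixel attends to every
    # other pixel, capturing long-range context in one step.
    tokens = x.reshape(-1, 1)             # (H*W, 1) token matrix
    attn = softmax(tokens @ tokens.T)     # (H*W, H*W) attention weights
    return (attn @ tokens).reshape(x.shape)

def art_block(x, w):
    # Hypothetical ART-style block: residual sum of the local conv path
    # (spatial precision) and the global attention path (context).
    return x + conv3x3(x, w) + self_attention(x)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))          # toy 8x8 feature map
w = rng.standard_normal((3, 3)) * 0.1    # toy conv kernel
y = art_block(x, w)
print(y.shape)
```

In this sketch the residual connection lets the block default to an identity mapping while the two parallel paths add complementary local and global corrections, which is the intuition behind combining CNN and transformer components in one bottleneck.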