Industrial Wave Energy Converters (WECs) have evolved into complex multi-generator designs, but a lack of effective control has limited their energy capture efficiency. A Multi-Agent Reinforcement Learning (MARL) controller can handle this complexity and support multiple objectives: energy capture efficiency, reduction of structural stress, and proactive protection against high waves. However, even with well-trained agents using algorithms such as Proximal Policy Optimization (PPO), the performance of MARL is limited. In this paper, we explore different function approximators for the policy and critic networks to model the sequential nature of the system dynamics, and find that the choice of approximator is key to better performance. We investigate fully connected neural networks (FCNs), LSTMs, and Transformer variants of varying depths. We propose a novel transformer architecture, Skip Transformer-XL (STrXL), with gated residual connections around the multi-head attention, the multi-layer perceptron, and the transformer block. Our results suggest that STrXL performs best, beating the state-of-the-art GTrXL with faster training convergence. Over the entire wave spectrum and for waves at different angles, STrXL improves energy efficiency by an average of 25% to 28% relative to the existing spring damper (SD) controller. Furthermore, unlike the default SD controller, the transformer controller almost eliminates the mechanical stress from rotational yaw motion.
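
To make the gated residual wiring concrete, the following minimal Python sketch shows one way three gated skip connections could be arranged: one around the attention sublayer, one around the multi-layer perceptron, and one around the block. It is an illustration under stated assumptions, not the STrXL implementation. The names `gated_residual` and `skip_block` are hypothetical, the sublayers are placeholder callables, and the gate is assumed to be a fixed scalar sigmoid, whereas a trained model would typically use a learned, input-dependent gate.

```python
import math
from typing import Callable, List

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def sigmoid(z: float) -> float:
    """Logistic function used for the gate value."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_residual(x: Vector, y: Vector, gate_bias: float) -> Vector:
    """Blend the skip input x with the sublayer output y.

    Assumed gate form (not taken from the paper): a scalar gate
    g = sigmoid(gate_bias), with output g * x + (1 - g) * y.
    A large positive bias keeps the connection close to the identity.
    """
    g = sigmoid(gate_bias)
    return [g * xi + (1.0 - g) * yi for xi, yi in zip(x, y)]


def skip_block(
    x: Vector,
    attention: Sublayer,
    mlp: Sublayer,
    bias_attn: float = 2.0,
    bias_mlp: float = 2.0,
    bias_block: float = 2.0,
) -> Vector:
    """Apply attention and MLP sublayers with three gated skips."""
    h = gated_residual(x, attention(x), bias_attn)  # gate around attention
    h = gated_residual(h, mlp(h), bias_mlp)         # gate around the MLP
    return gated_residual(x, h, bias_block)         # gate around the block


if __name__ == "__main__":
    # Placeholder sublayers standing in for attention and the MLP.
    double = lambda v: [2.0 * a for a in v]
    inc = lambda v: [a + 1.0 for a in v]
    print(skip_block([1.0, -2.0, 0.5], double, inc))
```

With strongly positive gate biases the block passes its input through almost unchanged, while strongly negative biases reduce it to the plain composition of the two sublayers. The default bias of 2.0 is an arbitrary value chosen for illustration.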