Traditional seismic inversion (SI) maps hundreds of terabytes of raw-field data to gigabytes of subsurface properties. This inversion process is expensive, requiring over a year of human and computational effort. Recently, data-driven approaches based on deep learning (DL) have been proposed to improve SI efficiency. However, these improvements are restricted to data of greatly reduced scale and complexity. To extend these approaches to real-scale seismic data, researchers must first process raw nav-merge seismic data into an image and then apply convolution. We argue that this convolution-based approach to SI is not only computationally expensive but also conceptually problematic: seismic data are not naturally images and need not be processed as such. In this work, we go beyond convolution and propose a novel SI method. We address the scalability of SI with a new auxiliary learning paradigm for SI (Aux-SI). This paradigm breaks SI into local inversion tasks, each of which predicts a small chunk of subsurface properties from the surrounding seismic data. Aux-SI then combines these local predictions to obtain the entire subsurface model. However, even local inversion remains challenging because (1) the seismic data are high-dimensional, spatially irregular, and multi-modal, and (2) there is no concrete spatial mapping (or alignment) between subsurface properties and raw data. To handle these challenges, we propose an all-MLP architecture, the Multi-Modal Information Unscrambler (MMI-Unscrambler), which unscrambles seismic information by ingesting all available multi-modal data. Experiments show that MMI-Unscrambler outperforms both state-of-the-art U-Net and Transformer models on simulation data. We also scale MMI-Unscrambler to raw-field nav-merge data from the Gulf of Mexico and obtain a geologically sound velocity model with an SSIM score of 0.8. To the best of our knowledge, this is the first successful demonstration of a DL approach to SI on real, large-scale, and complicated raw-field data.
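As a rough illustration of the decompose-and-combine idea behind Aux-SI, the following Python sketch splits a 2D subsurface grid into chunks, calls a local predictor for each chunk, and stitches the results into one model. It rests on several assumptions not stated in the abstract: the names `chunk_origins`, `assemble_model`, and `toy_predictor` are hypothetical, the chunks are taken to be non-overlapping rectangles, and the predictor is a toy placeholder that ignores seismic data entirely. It is not the MMI-Unscrambler.

```python
import numpy as np


def chunk_origins(model_shape, chunk_shape):
    """Yield the (depth, lateral) start index of each non-overlapping chunk."""
    for z0 in range(0, model_shape[0], chunk_shape[0]):
        for x0 in range(0, model_shape[1], chunk_shape[1]):
            yield z0, x0


def assemble_model(model_shape, chunk_shape, local_predictor):
    """Predict every chunk independently and stitch them into one grid."""
    model = np.zeros(model_shape, dtype=np.float32)
    for z0, x0 in chunk_origins(model_shape, chunk_shape):
        # Clip the last chunk in each direction to the model boundary.
        z1 = min(z0 + chunk_shape[0], model_shape[0])
        x1 = min(x0 + chunk_shape[1], model_shape[1])
        model[z0:z1, x0:x1] = local_predictor(z0, x0, (z1 - z0, x1 - x0))
    return model


def toy_predictor(z0, x0, shape):
    """Placeholder for a learned local model (assumption): returns a
    velocity that increases linearly with depth and ignores seismic data."""
    depth = np.arange(z0, z0 + shape[0], dtype=np.float32)[:, None]
    return np.broadcast_to(1500.0 + 10.0 * depth, shape).astype(np.float32)


# Assemble a 10 x 7 model from chunks of at most 4 x 3 cells.
velocity = assemble_model((10, 7), (4, 3), toy_predictor)
```

In an actual system the placeholder would be replaced by a trained model that takes the seismic data surrounding each chunk as input. Because each chunk is predicted independently, the loop could in principle be distributed across workers.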