Multimodal few-shot learning is challenging due to the large domain gap between vision and language modalities. As an effort to bridge this gap, we introduce a meta-learning approach for multimodal few-shot learning, to leverage its strong ability of accruing knowledge across tasks. The full model is based on frozen foundation vision and language models to use their already learned capacity. To translate the visual features into the latent space of the language model, we introduce a light-weight meta-mapper, acting as a meta-learner. By updating only the parameters of the meta-mapper, our model learns to quickly adapt to unseen samples with only a few gradient updates. Unlike prior multimodal few-shot learners, which need a hand-engineered task induction, our model is able to induce the task in a completely data-driven manner. The experiments on recent multimodal few-shot benchmarks demonstrate that our meta-learning approach yields better multimodal few-shot learners while being computationally more efficient compared to its counterparts.