WebQA is a new benchmark for multimodal multihop reasoning in which systems are presented with the same kinds of data humans encounter when searching the web: text snippets and images. Given a question, a system must first identify, from a pool of candidates, the sources that potentially inform the answer; it is then expected to aggregate information from the selected candidates and reason over them to generate an answer in natural language. Each datum pairs a question with a set of potentially long snippets or images that serve as "knowledge carriers" over which to reason. Systems are evaluated on both supporting-fact retrieval and answer generation, measuring correctness and interpretability. To demonstrate multihop multimodal reasoning, a model must be able to 1) understand and represent knowledge from different modalities, 2) identify and aggregate relevant knowledge fragments scattered across multiple sources, and 3) perform inference and generate answers in natural language.
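The two-stage task structure described above (retrieval over a candidate pool, then answer generation, with retrieval scored against gold supporting facts) can be sketched as follows. This is a minimal illustrative sketch: the schema field names, the `scorer` interface, and the 0.5 threshold are assumptions for exposition, not the official WebQA release format or evaluation code.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical schema: field names are illustrative, not the official format.
@dataclass
class Source:
    source_id: str
    modality: str                # "text" (snippet) or "image"
    content: str                 # snippet text, or an image caption/reference
    is_supporting: bool = False  # gold label: does this source inform the answer?

@dataclass
class WebQAExample:
    question: str
    candidates: List[Source]     # pool of potential knowledge carriers
    answer: str                  # free-form natural-language answer

def retrieve(example: WebQAExample,
             scorer: Callable[[str, Source], float]) -> List[Source]:
    """Stage 1: keep candidates whose relevance score clears a threshold
    (0.5 here is an arbitrary illustrative choice)."""
    return [s for s in example.candidates if scorer(example.question, s) > 0.5]

def retrieval_f1(selected: List[Source], example: WebQAExample) -> float:
    """Score supporting-fact retrieval as F1 against the gold labels;
    answer generation would be evaluated separately."""
    gold = {s.source_id for s in example.candidates if s.is_supporting}
    pred = {s.source_id for s in selected}
    if not pred or not gold:
        return 0.0
    precision = len(pred & gold) / len(pred)
    recall = len(pred & gold) / len(gold)
    return 2 * precision * recall / (precision + recall)
```

A real system would replace the toy `scorer` with a learned cross-modal relevance model and feed the selected sources to a generator; the sketch only fixes the interfaces between the stages.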