NIPS 2016


Large Scale Computer Vision Systems

Manohar Paluri · Lorenzo Torresani · Gal Chechik · Dario Garcia · Du Tran

Room 111

Computer Vision is a mature field with a long history of academic research, but recent advances in deep learning have given machine learning models new capabilities for understanding visual content. There have been tremendous improvements on problems like classification, detection, and segmentation, which serve as basic proxies for a model's ability to understand visual content. These have been accompanied by a steep rise in the industrial adoption of Computer Vision at scale, and by more complex tasks such as Image Captioning and Visual Q&A, which go well beyond the classical problems and open the door to a whole new world of possibilities. As industrial applications mature, the challenges gradually shift toward data, scale, and the move from purely visual data to multi-modal data.

The unprecedented adoption of Computer Vision in numerous real-world applications, processing billions of pieces of "live" media content daily, raises a new set of challenges, including:

1. Efficient Data Collection (Smart sampling, weak annotations, ...)
2. Evaluating performance in the wild (long tails, embarrassing mistakes, calibration)
3. Incremental learning: evolving systems incrementally in complex environments (new data, new categories, federated architectures, ...)
4. Handling tradeoffs: Computation vs Accuracy vs Supervision
5. Heterogeneous output types (binary predictions, embeddings, etc.)
6. Machine learning feedback loops
7. Minimizing technical debt as systems mature
8. On-device vs On-cloud vs Split
9. Multi-modal content understanding

We will bring together researchers and practitioners interested in addressing this new set of challenges, and provide a venue for sharing how industry and academia approach these problems. We will invite prominent speakers from academia and industry to give their perspectives on these challenges. In addition, we will have 5-minute spotlights for selected papers submitted to the workshop and a poster session for all accepted submissions. Submissions should relate to the challenges listed above. We will close the session with a panel discussion, including the speakers, on the future of large-scale vision and its applications in the wild.

In the second part we will look at how this applies specifically to video understanding. Video understanding aims to develop computational methods that can interpret videos at different semantic levels. Applications include video categorization, event detection, semantic segmentation, description, summarization, tagging, content-based retrieval, surveillance, and many more. Although the field of video analytics has witnessed significant progress over the last two decades, most problems in this area remain largely unsolved. In recent years video understanding has become an even more critical and timely problem to address because of the tremendous growth of video on the Internet, most of which carries no tags or descriptions and thus requires automatic analysis to become searchable or browsable. At the same time, the rise of online video repositories represents an opportunity to create pivotal new large-scale datasets for research in this area. Given the recent breakthroughs achieved by deep learning in other big-data domains, we believe that video understanding may well be on the verge of a technical revolution that will spur significant advances in this area.

In order to foster further progress by the research community, we propose to organize a one-day workshop to discuss emerging innovations and ideas about the problems and challenges of video understanding. The workshop will consist of a series of invited talks from researchers in this area. In addition, we will publicly announce and present a new large-scale benchmark for video comprehension [1] that has the potential to become an instrumental resource for future research in this field. Compared to existing video datasets, our proposed benchmark is much larger in scale and casts video understanding in the novel form of multiple-choice tests that assess an algorithm's ability to comprehend the semantics of a video.

This workshop will be the first in a series of annual meetings that we will organize to stimulate steady progress in this area. In each subsequent edition, we will host a challenge on our continuously expanding video comprehension benchmark to motivate students and researchers to push the envelope on this problem. We hope to bring together researchers with common interests in video analysis to share, learn, and make progress toward better video understanding methods.

[1] D. Tran, M. Paluri, and L. Torresani, "ViCom: Benchmark and Methods for Video Comprehension," CoRR, abs/1606.07373, July 2016.
