

Workshop

Learning and privacy with incomplete data and weak supervision

Giorgio Patrini · Tony Jebara · Richard Nock · Dimitrios Kotzias · Felix Xinnan Yu

Location: Room 512 dh

Sat 12 Dec, 5:30 a.m. PST

Can we learn to locate objects in images only from the list of objects those images contain? Or the sentiment of a phrase in a review from the overall score? Can we tell who voted for Obama in 2012? Or which population strata are more likely to be infected by Ebola, looking only at geographical incidence and census data? Are large corporations able to infer sensitive traits of their customers, such as sexual preferences, unemployment or ethnicity, based only on state-level statistics?

Conversely, how can we publicly release data containing personal information to the research community while guaranteeing that individuals' sensitive information will not be compromised? How realistic is the idea of outsourcing machine-learning tasks without sharing the datasets themselves, but only a few statistics sufficient for training?

Despite their diversity, solutions to these problems can be surprisingly alike, as they all play with the same elements: variables without a clear one-to-one mapping, and the search for, or the protection against, models and statistics sufficient to recover the relevant variables.

Aggregate statistics and obfuscated data are abundant, as they are released much more frequently than plain individual-level information; the latter is often too sensitive, because of privacy constraints or business value, or simply too expensive to collect. Learning in those scenarios has been conceptualized, for example, as multiple instance learning, learning from label proportions, and learning from noisy labels, and it is common in a variety of application fields, such as computer vision, sentiment analysis and bioinformatics, whenever labels for single image patches, sentences or proteins are unknown, while higher-level supervision is possible.
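
As an illustration of one of these settings, the sketch below fits a logistic model from label proportions alone on synthetic data. It is a minimal, illustrative construction, not the method of any particular workshop contribution; the model, loss and optimizer are assumptions made for the example.

```python
# Minimal sketch of learning from label proportions (illustrative only):
# each bag of instances comes with the fraction of positives it contains,
# and a logistic model is fit so that its mean predicted probability per
# bag matches that observed fraction.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 20 bags of 50 two-dimensional instances each, with
# shifted bag means so that bag-level proportions actually vary.
n_bags, bag_size, dim = 20, 50, 2
true_w = np.array([2.0, -1.0])
bag_shift = rng.normal(scale=1.5, size=(n_bags, 1, dim))
X = rng.normal(size=(n_bags, bag_size, dim)) + bag_shift
y = (X @ true_w + rng.normal(scale=0.1, size=(n_bags, bag_size)) > 0)
proportions = y.mean(axis=1)          # the only supervision we observe

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient descent on the squared gap between predicted and observed
# bag-level proportions.
w = np.zeros(dim)
lr = 0.5
for _ in range(500):
    p = sigmoid(X @ w)                      # (n_bags, bag_size)
    bag_pred = p.mean(axis=1)               # predicted proportion per bag
    residual = bag_pred - proportions       # (n_bags,)
    # d(bag_pred)/dw = mean over the bag of p * (1 - p) * x
    grad_bag = np.einsum('bi,bij->bj', p * (1 - p), X) / bag_size
    grad = 2 * (residual[:, None] * grad_bag).mean(axis=0)
    w -= lr * grad

print("recovered direction:", w / np.linalg.norm(w))
print("true direction:     ", true_w / np.linalg.norm(true_w))
```

The point of the sketch is only that instance-level parameters can be estimated even though no instance-level labels are ever observed, which is exactly why the same statistics raise the privacy questions above.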

This problem is not limited to computer science, though. In fact, natural, social and medical disciplines have studied the problem of inference from aggregates for a long time, for example as so-called ecological inference in political science, econometrics and epidemiology, and as the modifiable areal unit problem in spatial statistics.

But as those approaches prove effective in practice, to the point that the available statistics reveal sensitive attributes with high accuracy, the question is turned around into a search for privacy guarantees. Traditional statistics has long studied the problem of confidential data release. Research on k-anonymity, l-diversity and, more recently, differential privacy has proposed procedures to mask data in a way that lets one trade off protection against usefulness for statistical analysis.
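
As a concrete illustration of that trade-off, the sketch below applies the Laplace mechanism, a standard building block of differential privacy, to a counting query on synthetic data. The dataset, threshold and epsilon values are illustrative assumptions, not figures from any workshop paper.

```python
# Minimal sketch of the Laplace mechanism: release a count with noise
# calibrated to sensitivity / epsilon. Smaller epsilon means stronger
# protection but a noisier, less useful release.
import numpy as np

rng = np.random.default_rng(0)

incomes = rng.lognormal(mean=10, sigma=1, size=1_000)   # synthetic data
true_count = np.sum(incomes > 50_000)                   # statistic to release

def laplace_count(count, epsilon):
    # A counting query changes by at most 1 when one individual is added
    # or removed, so its sensitivity is 1.
    sensitivity = 1.0
    return count + rng.laplace(scale=sensitivity / epsilon)

for epsilon in (0.01, 0.1, 1.0):
    noisy = laplace_count(true_count, epsilon)
    print(f"epsilon={epsilon:<5} true={true_count} released={noisy:.1f}")
```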
