PyCon 2020

April 15-23, 2020

Machine learning is hungry for data, usually collected from users of a product and often including a lot of personal or sensitive information. What if we could build accurate machine learning models while still preserving user privacy? There’s a growing number of tools in Python to help us achieve this, ranging from federated learning, where a user’s data remains on their own device, to algorithms for training models on encrypted data. In this talk, I’ll tour the landscape of these tools and review what works, what doesn’t work, and where they fit in a machine learning pipeline.
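As a rough illustration of the federated idea mentioned above, here is a minimal sketch of federated averaging in plain Python. It is not taken from the talk or from any of the libraries it covers: the function names (`local_update`, `federated_average`, `train`), the tiny linear model, and the toy data are all assumptions made for this example. Each simulated client trains on its own data and shares only model weights, which a coordinator then averages.

```python
# Minimal federated averaging sketch (illustrative only; all names are hypothetical).
# Model: y ~ w * x + b. Raw (x, y) pairs never leave the client that owns them.

def local_update(weights, data, lr=0.3, epochs=10):
    """Run a few steps of gradient descent on one client's private data."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)

def federated_average(client_weights, client_sizes):
    """Combine client weights, weighting each client by its number of examples."""
    total = sum(client_sizes)
    w = sum(cw[0] * n for cw, n in zip(client_weights, client_sizes)) / total
    b = sum(cw[1] * n for cw, n in zip(client_weights, client_sizes)) / total
    return (w, b)

def train(clients, rounds=100):
    """Coordinator loop: send weights out, collect updated weights, average them."""
    weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_update(weights, data) for data in clients]
        weights = federated_average(updates, [len(d) for d in clients])
    return weights

# Toy data: two clients whose private examples all follow y = 2x + 1.
clients = [
    [(0.0, 1.0), (0.5, 2.0)],
    [(1.0, 3.0), (0.25, 1.5), (0.75, 2.5)],
]

w, b = train(clients)
print(f"learned w={w:.3f}, b={b:.3f}")
```

The coordinator in this sketch only ever sees weights, never the `(x, y)` pairs. Shared weights can still leak information about the underlying data, which is one reason federated learning is often combined with other privacy techniques.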

Data privacy is a huge concern for everyone in tech these days, thanks both to legislation such as the GDPR and to user opinion shaped by scandals in the media. Machine learning is at the forefront of this concern because it’s hungry for large amounts of training data, but it’s also an area with lots of research into solutions that protect user privacy.

When I started learning about privacy-preserving machine learning, I found a bewildering number of research papers introducing some really cool solutions, but very little practical advice on how to apply them in a real-world situation. This is the talk I wish I could have attended at the start of my learning journey! I’ll review the landscape of Python solutions for privacy-preserving ML and show how they fit into a machine learning pipeline. I’ll explain the tradeoffs of each method and also talk a little about the ethics of using personal data to train ML models. Tools and packages covered will include TensorFlow Privacy, TensorFlow Encrypted and PySyft.