Large-scale distributed training has become essential to scaling the productivity of ML engineers. ML models are growing larger and more complex in their compute and memory requirements, and the amount of data we train on at Facebook is huge. In this talk, we will learn about the Distributed Training Platform, built to support large-scale data and model parallelism. We will touch on distributed training support for PyTorch and how we offer a flexible training platform that increases ML engineers' productivity at Facebook scale.
Dwarak Rajagopal is a Senior Engineering Manager and Technical Lead in AI Infrastructure at Facebook. He currently leads the core development of PyTorch 1.0, an open source deep learning platform and the center of Facebook's effort to scale research to production in deep learning. Prior to Facebook, as the head of Core Platforms at Uber ATG, he led the Onboard Infra, ML, and Data Platforms for the self-driving software stack and built out the engineering team in SF.
Mohamed Fawzy is a senior manager at Facebook. In his six years at the company, he’s worked on its distributed storage system and was part of the team that developed cold storage, Facebook’s exabyte-scale archival storage system that keeps your memories safe. Mohamed started the Distributed AI Group to build large-scale distributed training infrastructure for deep learning and to support all use cases within the company, including large-scale ranking and recommendation, computer vision, machine translation, and speech.