This blog post is co-written with Chaoyang He and Salman Avestimehr from FedML.
Analyzing real-world healthcare and life sciences (HCLS) data poses several practical challenges, such as distributed data silos, insufficient data at any single site for rare events, regulatory guidelines that prohibit data sharing, infrastructure requirements, and the cost of creating a centralized data repository. Because they are in a highly regulated domain, HCLS partners and customers seek privacy-preserving mechanisms to manage and analyze large-scale, distributed, and sensitive data.
To mitigate these challenges, we propose using an open-source federated learning (FL) framework called FedML, which enables you to analyze sensitive HCLS data by training a global machine learning model from distributed data held locally at different sites. FL doesn’t require moving or sharing data across sites or with a centralized server during the model training process.
In this two-part series, we demonstrate how you can deploy a cloud-based FL framework on AWS. In this first post, we describe FL concepts and the FedML framework. In the second post, we present the use cases and dataset to show its effectiveness in analyzing real-world healthcare datasets, such as the eICU data, which comprises a multi-center critical care database collected from over 200 hospitals.
Although the volume of HCLS-generated data has never been greater, the challenges and constraints associated with accessing such data limit its utility for future research. Machine learning (ML) presents an opportunity to address some of these concerns and is being adopted to advance data analytics and derive meaningful insights from diverse HCLS data for use cases like care delivery, clinical decision support, precision medicine, triage and diagnosis, and chronic care management. Because ML algorithms are often not adequate in protecting the privacy of patient-level data, there is growing interest among HCLS partners and customers in using privacy-preserving mechanisms and infrastructure for managing and analyzing large-scale, distributed, and sensitive data. [1]
We have developed an FL framework on AWS that enables analyzing distributed and sensitive health data in a privacy-preserving manner. It involves training a shared ML model without moving or sharing data across sites or with a centralized server during the model training process, and can be implemented across multiple AWS accounts. Participants can choose to maintain their data either in their on-premises systems or in an AWS account that they control. Therefore, it brings analytics to data, rather than moving data to analytics.
In this post, we show how you can deploy the open-source FedML framework on AWS. We test the framework on eICU data, a multi-center critical care database collected from over 200 hospitals, to predict in-hospital patient mortality. We can use this FL framework to analyze other datasets, including genomic and life sciences data. It can also be adopted by other domains that are rife with distributed and sensitive data, including finance and education sectors.
Advancements in technology have led to an explosive growth of data across industries, including HCLS. HCLS organizations often store data in silos. This poses a major challenge in data-driven learning, which requires large datasets to generalize well and achieve the desired level of performance. Moreover, gathering, curating, and maintaining high-quality datasets incur significant time and cost.
Federated learning mitigates these challenges by collaboratively training ML models that use distributed data, without the need to share or centralize them. It allows diverse sites to be represented within the final model, reducing the potential risk for site-based bias. The framework follows a client-server architecture, where the server shares a global model with the clients. The clients train the model based on local data and share parameters (such as gradients or model weights) with the server. The server aggregates these parameters to update the global model, which is then shared with the clients for the next round of training, as shown in the following figure. This iterative process of model training continues until the global model converges.
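The round-based loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the FedML implementation: it assumes a one-parameter linear model, sample-count-weighted averaging as the aggregation rule, and three hypothetical sites with toy data. The function names (`local_train`, `aggregate`, `federated_training`) are made up for this example.

```python
"""Toy federated averaging loop for a 1-D linear model y = w * x (illustrative only)."""
from typing import List, Tuple

Dataset = List[Tuple[float, float]]  # (x, y) pairs held privately by one site


def local_train(w: float, data: Dataset, lr: float = 0.05, epochs: int = 5) -> float:
    """Client step: start from the global weight and run gradient descent on local data."""
    for _ in range(epochs):
        # Gradient of the mean squared error for the model y_hat = w * x.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def aggregate(updates: List[Tuple[float, int]]) -> float:
    """Server step: average client weights, weighted by local sample count."""
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total


def federated_training(sites: List[Dataset], rounds: int = 30) -> float:
    """Repeat: broadcast the global weight, train locally, aggregate on the server."""
    w_global = 0.0
    for _ in range(rounds):
        # Only (weight, sample count) leaves each site; raw (x, y) pairs never do.
        updates = [(local_train(w_global, data), len(data)) for data in sites]
        w_global = aggregate(updates)
    return w_global


# Three hypothetical sites whose data all follow y = 3x (assumed toy data).
SITES = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(0.5, 1.5), (1.5, 4.5), (2.5, 7.5)],
    [(3.0, 9.0)],
]

if __name__ == "__main__":
    print(f"learned weight: {federated_training(SITES):.3f}")
```

Each site contributes only its updated weight and its sample count, so the server can recover the shared slope without seeing any site's records. Real deployments typically layer secure communication and privacy mechanisms on top of this basic loop.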
In recent years, this new learning paradigm has been successfully adopted to address the concern of data governance in training ML models. One such effort is MELLODDY, an Innovative Medicines Initiative (IMI)-led consortium, powered by AWS. It’s a 3-year program involving 10 pharmaceutical companies, 2 academic institutions, and 3 technology partners. Its primary goal is to develop a multi-task FL framework to improve the predictive performance and chemical applicability of drug discovery-based models. The platform comprises multiple AWS accounts, with each pharma partner retaining full control of their respective accounts to maintain their private datasets, and a central ML account coordinating the model training tasks.
The consortium trained models on billions of data points, consisting of over 20 million small molecules in over 40,000 biological assays. Based on experimental results, the collaborative models demonstrated a 4% improvement in categorizing molecules as either pharmacologically or toxicologically active or inactive. They also showed a 10% increase in their ability to yield confident predictions when applied to new types of molecules. Finally, the collaborative models were typically 2% better at estimating values of toxicological and pharmacological activities.
FedML is an open-source library to facilitate FL algorithm development. It supports three computing paradigms: on-device training for edge devices, distributed computing, and single-machine simulation. It also offers diverse algorithmic research with flexible and generic API design and comprehensive reference baseline implementations (optimizer, models, and datasets). For a detailed description of the FedML library, refer to FedML [2].
The following figure presents the open-source library architecture of FedML.
As seen in the preceding figure, from the application point of view, FedML shields details of the underlying code and complex configurations of distributed training. At the application level, such as computer vision, natural language processing, and data mining, data scientists and engineers only need to write the model, data, and trainer in the same way as a standalone program and then pass it to the FedMLRunner object to complete all the processes, as shown in the following code. This greatly reduces the overhead for application developers to perform FL.
The FedML algorithm is still a work in progress and is constantly being improved. To this end, FedML abstracts the core trainer and aggregator and provides users with two abstract objects, FedML.core.ClientTrainer and FedML.core.ServerAggregator. Users only need to inherit the interfaces of these two abstract objects and pass them to FedMLRunner. Such customization provides ML developers with maximum flexibility. You can define arbitrary model structures, optimizers, loss functions, and more. With the help of FedMLRunner, these customizations can also be seamlessly connected with the open-source community, open platform, and application ecology mentioned earlier, which addresses the long lag between innovative algorithms and commercialization.
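The inherit-and-pass pattern described here can be illustrated with self-contained stand-ins. The classes below (`ClientTrainerSketch`, `ServerAggregatorSketch`, `RunnerSketch`) are hypothetical and are not FedML's API; they only mirror the idea that a developer subclasses an abstract trainer and an abstract aggregator, then hands both to a runner that drives the rounds. The toy task (estimating a shared mean `mu`) is likewise an assumption made for brevity.

```python
"""Stand-in sketch of the trainer / aggregator / runner pattern (not the FedML API)."""
from abc import ABC, abstractmethod
from typing import Dict, List

Params = Dict[str, float]


class ClientTrainerSketch(ABC):
    """Hypothetical abstract trainer: subclasses define how a client updates the model."""

    @abstractmethod
    def train(self, params: Params, local_data: List[float]) -> Params:
        ...


class ServerAggregatorSketch(ABC):
    """Hypothetical abstract aggregator: subclasses define how updates are combined."""

    @abstractmethod
    def aggregate(self, client_params: List[Params]) -> Params:
        ...


class MeanTrainer(ClientTrainerSketch):
    """Custom trainer: moves 'mu' halfway toward the mean of the local data."""

    def train(self, params: Params, local_data: List[float]) -> Params:
        local_mean = sum(local_data) / len(local_data)
        return {"mu": params["mu"] + 0.5 * (local_mean - params["mu"])}


class AverageAggregator(ServerAggregatorSketch):
    """Custom aggregator: unweighted average of every parameter across clients."""

    def aggregate(self, client_params: List[Params]) -> Params:
        keys = client_params[0].keys()
        return {k: sum(p[k] for p in client_params) / len(client_params) for k in keys}


class RunnerSketch:
    """Hypothetical runner: owns the round loop so custom classes stay small."""

    def __init__(self, trainer: ClientTrainerSketch, aggregator: ServerAggregatorSketch,
                 client_data: List[List[float]], rounds: int = 20) -> None:
        self.trainer = trainer
        self.aggregator = aggregator
        self.client_data = client_data
        self.rounds = rounds

    def run(self, init: Params) -> Params:
        params = dict(init)
        for _ in range(self.rounds):
            # Each client trains from the current global parameters on its own data.
            updates = [self.trainer.train(params, d) for d in self.client_data]
            params = self.aggregator.aggregate(updates)
        return params


if __name__ == "__main__":
    runner = RunnerSketch(MeanTrainer(), AverageAggregator(), [[1.0, 2.0, 3.0], [10.0]])
    print(runner.run({"mu": 0.0}))
```

Because the runner depends only on the two abstract interfaces, swapping in a different optimizer or aggregation rule means writing one new subclass and leaving the orchestration untouched.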
Finally, as shown in the preceding figure, FedML supports distributed computing processes, such as complex security protocols and distributed training as a Directed Acyclic Graph (DAG) flow computing process, making the writing of complex protocols similar to standalone programs. Based on this idea, the security protocol Flow Layer 1 and the ML algorithm process Flow Layer 2 can be easily separated so that security engineers and ML engineers can operate while maintaining a modular architecture.
The FedML open-source library supports federated ML use cases for edge as well as cloud. On the edge, the framework facilitates training and deployment of edge models to mobile phones and internet of things (IoT) devices. In the cloud, it enables global collaborative ML, including multi-Region and multi-tenant public cloud aggregation servers, as well as private cloud deployment in Docker mode. The framework addresses key concerns with regard to privacy-preserving FL, such as security, privacy, efficiency, weak supervision, and fairness.
In this post, we showed how you can deploy the open-source FedML framework on AWS. This allows you to train an ML model on distributed data, without the need to share or move it. We set up a multi-account architecture, where in a real-world scenario, organizations can join the ecosystem to benefit from collaborative learning while maintaining data governance. In the next post, we use the multi-hospital eICU dataset to demonstrate its effectiveness in a real-world scenario.
Please review the presentation at re:MARS 2022 focused on “Managed Federated Learning on AWS: A case study for healthcare” for a detailed walkthrough of this solution.
[1] Kaissis, G.A., Makowski, M.R., Rückert, D. et al. Secure, privacy-preserving and federated machine learning in medical imaging. Nat Mach Intell 2, 305–311 (2020). https://doi.org/10.1038/s42256-020-0186-1
[2] FedML https://fedml.ai