Google Cloud Biotech Acceleration Tooling

Bio-pharma organizations can now leverage quick start tools and setup scripts to begin running scalable workloads in the cloud today.

This capability is a boon for research scientists and organizations in the bio-pharma space, from those developing treatments for diseases to those creating new synthetic biomaterials. Google Cloud’s solutions teams continue to shape products with customer feedback and contribute to platforms on which Google Cloud customers can build.

This guide provides a way to get started with simplified cloud architectures for specific workloads. Cutting edge research and biotechnology development organizations are often science first and can therefore save valuable resources by leveraging existing technology infrastructure starting points embedded with Google’s best practices. Biotech Acceleration Tooling frees up scientist and researcher bandwidth, while still enabling flexibility. The majority of the tools outlined in this guide come with quick start Terraform scripts to automate the stand up of environments for biopharma workloads.

Solution overview

This deployment creates the underlying infrastructure in accordance with Google’s best practices, configuring appropriate networking including VPC networking, security, data access, and analytics notebooks. All environments are created with Terraform scripts, which define cloud and on-prem resources in configuration files. A consistent workflow can be used to provision infrastructure.

If beginning from scratch, you will need to first consider security, networking, and identity access management set up to keep your organization’s computing environment safe. To do this, follow the steps below:

Login to Google Cloud Platform
Use Terraform Automation Repository within Security Foundations Blueprint to deploy your new environment

Workloads needed can vary, and so should solutions tooling. We offer easy to deploy code and workflows for various biotech use cases including AlphaFold, genomics sequencing, cancer data analysis, clinical trials, and more.

AlphaFold

AlphaFold is an AI system developed by DeepMind that predicts a protein’s 3D structure from its amino acid sequence. It regularly achieves accuracy competitive with experiments. It is useful for researchers doing drug discovery and protein design, often computational biologists and chemists. To get started running AlphaFold batch inference on your own protein sequences, leverage these setup scripts. To better understand the batch inference solution, see this explanation of optimized inference pipeline and video explanation. If your team does not need to run AlphaFold at scale and is comfortable running structures one at a time on less optimized hardware, see the simplified AlphaFold run guide.

Genomics Tooling

Researchers today have the ability to generate an incredible amount of biological data. Once you have this data, the next step is to refine and analyze it for meaning. Whether you are developing your own algorithms or running common tools and workflows, you now have a large number of software packages to help you out.

Here we make a few recommendations for what technologies to consider. Your technology choice should be based on your own needs and experience. There is no “one size fits all” solution.

Genomics tools that may be of assistance for your organization include generalized genomics sequencing pipelines, Cromwell genomics, Databiosphere dsub genomics, and DeepVariant.

Cromwell

The Broad Institute has developed the Workflow Definition Language (WDL) and an associated runner called Cromwell. Together these have allowed the Broad to build, run at scale, and publish its recommended practices pipelines. If you want to run the Broad’s published GATK workflows or are interested in using the same technology stack, take a look at this deployment of Cromwell.

Dsub

This module is packaged to use databiosphere dsub as a Workflow engine, containerized tools (FastQC) and Google cloud lifescience API to automate execution of pipeline jobs. The function can be easily modified to adopt to other bioinformatic tools out there.

Dsub is a command-line tool that makes it easy to submit and run batch scripts in the cloud. The cloud function has embedded dsub libraries to execute pipeline jobs in Google cloud.

DeepVariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.

Cancer Data Analysis

ISB-CGC (ISB Cancer Gateway in the Cloud) enables researchers to analyze cloud-based cancer data through a collection of powerful web-based tools and Google Cloud technologies. It is one of three National Cancer Institute (NCI) Cloud Resources tasked with bringing cancer data and computation power together through cloud platforms.

Interactive web-based Cancer Data Analysis & Exploration

Explore and analyze ISB-CGC cancer data through a suite of graphical user interfaces (GUIs) that allow users to select and filter data from one or more public data sets (such as TCGA, CCLE, and TARGET), combine these with your own uploaded data and analyze using a variety of built-in visualization tools.

Cancer data analysis using Google BigQuery

Processed data is consolidated by data type (ex. Clinical, DNA Methylation, RNAseq, Somatic Mutation, Protein Expression, etc.) from sources including the Genomics Data Commons (GDC) and Proteomics Data Commons (PDC) and transformed into ISB-CGC Google BigQuery tables. This allows users to quickly analyze information from thousands of patients in curated BigQuery tables using Structured Query Language (SQL). SQL can be used from the Google BigQuery Console but can also be embedded within Python, R and complex workflows, providing users with flexibility. The easy, yet cost effective, “burstability” of BigQuery allows you to, within minutes (as compared to days or weeks on a non-cloud based system), calculate statistical correlations across millions of combinations of data points.

Available Cancer Data Sources

Clinical Trials Studies

The FDA’s MyStudies platform enables organizations to quickly build and deploy studies that interact with participants through purpose-built apps on iOS and Android. MyStudies apps can be distributed to participants privately or made available through the App Store and Google Play.

This open-source repository contains the code necessary to run a complete FDA MyStudies instance, inclusive of all web and mobile applications.

Open-source deployment tools are included for semi-automated deployment to Google Cloud Platform (GCP). These tools can be used to deploy the FDA MyStudies platform in just a few hours. These tools follow compliance guidelines to simplify the end-to-end compliance journey. Deployment to other platforms and on-premise systems can be performed manually.

Data Science

For generalized data science pipelines to build custom predictive models or do interactive analysis within notebooks, check out our data science workflow setup scripts to get to work immediately. These include database connections and setup, virtual private cloud enablement, and notebooks.

Reference material

Life sciences public datasets
Drug discovery and in silico virtual screening on GCP
Semantic scientific literature search
Research workloads on GCP

Genomics and Secondary Analysis
Patient Monitoring
Variant Analysis
Healthcare API for Machine Learning and Analytics
Radiological Image Extraction

RAD Lab – a secure sandbox for innovation

During research, scientists are often asked to spin up research modules in the cloud to create more flexibility and collaboration opportunities for their projects. However, lacking the necessary cloud skills, many projects never get off the ground.

To accelerate innovation, RAD Lab is a Google Cloud-based sandbox environment which can help technology and research teams advance quickly from research and development to production. RAD Lab is a cloud-native research, development, and prototyping solution designed to accelerate the stand-up of cloud environments by encouraging experimentation, without risk to existing infrastructure. It’s also designed to meet public sector and academic organizations’ specific technology and scalability requirements with a predictable subscription model to simplify budgeting and procurement. You canfind the repository here.

RAD Lab delivers a flexible environment to collect data for analysis, giving teams the liberty to experiment and innovate at their own pace, without the risk of cost overruns. Key features include:

Open-source environment that runs on the cloud for faster deployment—with no hardware investment or vendor lock-in.
Built on Google Cloud tools that are compliant with regulatory requirements like FedRAMP, HIPAA, and GDPR security policies.
Common IT governance, logging, and access controls across all projects.
Integration with analytics tools like BigQuery, Vertex AI, and pre-built notebook templates.
Best-practice operations guidance, including documentation and code examples, that accelerate training, testing, and building cloud-based environments.
Optional onboarding workshops for users, conducted by Google Cloud specialists.

The next generation of RAD Lab includes RAD Lab UI, which provides a modern interface for less technical users to deploy Google Cloud resources – in just three steps.

^{This guide would not have been possible without the contributions of Alex Burdenko, Emily Du, Joan Kallogjeri, Marshall Worster, Shweta Maniar, and the RAD Lab team.}