Blazing Fast Secondary Analysis

Testing Clara Parabricks on GCP

Nov 26, 2022

A Google engineer recently shared that NVIDIA’s Clara Parabricks is now available in the GCP Marketplace. Having worked with Parabricks on AWS previously, I was eager to try it out. Here’s how I did it and what I found.

Using Parabricks on GCP

Go to the GCP Marketplace and search for ‘Parabricks’. Shown below is the deployment page for the NVIDIA Clara Parabricks offer, which includes pricing and other information. Click the offer to load it into the GCP Deployment Manager.

Set up the deployment

Here are my test configuration details. I set up a larger boot disk (250 GB rather than the suggested 100 GB), selected the `us-central1-a` zone for testing, and restricted access to the GCE instance to my laptop’s IP address (for SSH, etc.). I used the suggested `n1-standard-32` GCE instance type with 1 NVIDIA T4 GPU. Once the VM was ready, I connected using the SSH-in-browser tool.
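
For reference, the same settings can be approximated from the command line. This is only a sketch: it creates a plain GCE instance and a firewall rule, and it does not select the Parabricks Marketplace image, so it is not a substitute for the Marketplace deployment. The instance name, the firewall rule name, and the IP address are hypothetical placeholders. Commands are printed rather than executed unless `DRY_RUN=0` is set.

```shell
# Hypothetical names and address; replace with your own values.
INSTANCE_NAME="${INSTANCE_NAME:-parabricks-test}"
MY_IP="${MY_IP:-203.0.113.10}"   # placeholder address, not a real laptop IP

# Print commands by default; set DRY_RUN=0 to execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Instance matching the configuration described above (no image selected).
run gcloud compute instances create "$INSTANCE_NAME" \
  --zone=us-central1-a \
  --machine-type=n1-standard-32 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --maintenance-policy=TERMINATE \
  --boot-disk-size=250GB

# Allow SSH only from a single address.
run gcloud compute firewall-rules create allow-ssh-from-laptop \
  --allow=tcp:22 \
  --source-ranges="$MY_IP/32"
```

Printing first makes it easy to review the machine type, GPU, and disk size before anything billable is created.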

On connect, the instance prompted me to update the NVIDIA drivers. I typed ‘Y’ to confirm.

Set up the Test Runs

Using NVIDIA’s Parabricks documentation, I downloaded and unzipped the 14 GB of sample data using the two commands shown below:

wget -O parabricks_sample.tar.gz \
"https://s3.amazonaws.com/parabricks.sample/parabricks_sample.tar.gz"

tar xvf parabricks_sample.tar.gz
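
Because the sample data is large, it can help to confirm there is enough free disk space before downloading. The sketch below is one way to do that. The 30 GB threshold is an assumption chosen to leave headroom for both the archive and the extracted files, and `parabricks_sample` is assumed to be the directory the archive unpacks into.

```shell
# Free space, in whole GB, on the filesystem holding the given directory.
free_gb() {
  df -Pk "$1" | awk 'NR==2 {print int($4 / 1024 / 1024)}'
}

# Assumption: 30 GB leaves headroom for the archive plus extracted files.
REQUIRED_GB=30

if [ "$(free_gb .)" -lt "$REQUIRED_GB" ]; then
  echo "Only $(free_gb .) GB free; the sample data may not fit"
else
  echo "$(free_gb .) GB free; enough room for the sample data"
fi

# After extraction, report the size of the unpacked directory (if present).
if [ -d parabricks_sample ]; then
  du -sh parabricks_sample
fi
```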

There are three example pipelines that I then ran:

FQ2BAM Tutorial - this tool/pipeline aligns reads, sorts them (by coordinate), and marks duplicates in paired-end FASTQ data.
Haplotype Caller - this tool/pipeline calls variants from a BAM file and writes them to a VCF file.
VCF QC By Bam - this tool/pipeline allows you to inspect the results of aligned reads and a suitable VCF.
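
As a rough sketch, the first two pipelines can be run and timed with a small wrapper like the one below. The `timed` helper is hypothetical (not part of Parabricks), and the file paths are assumed to match the layout of the extracted sample data, so check them against NVIDIA’s tutorials before running. The `pbrun` calls are skipped when `pbrun` is not on the PATH.

```shell
# Run a command and report its wall-clock time in minutes and seconds.
timed() {
  label=$1; shift
  start=$(date +%s)
  "$@"
  status=$?
  end=$(date +%s)
  elapsed=$((end - start))
  echo "$label: $((elapsed / 60))m $((elapsed % 60))s (exit $status)"
  return "$status"
}

# Assumed paths inside the extracted sample data directory.
REF=parabricks_sample/Ref/Homo_sapiens_assembly38.fasta
FQ1=parabricks_sample/Data/sample_1.fq.gz
FQ2=parabricks_sample/Data/sample_2.fq.gz

if command -v pbrun >/dev/null 2>&1; then
  # Align, sort, and mark duplicates, producing a BAM file.
  timed fq2bam pbrun fq2bam --ref "$REF" --in-fq "$FQ1" "$FQ2" \
    --out-bam output.bam
  # Call variants from that BAM file, producing a VCF file.
  timed haplotypecaller pbrun haplotypecaller --ref "$REF" \
    --in-bam output.bam --out-variants variants.vcf
else
  echo "pbrun not found on PATH; skipping pipeline runs"
fi
```

Wrapping each step this way gives per-pipeline timings that are easy to compare across clouds.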

Pipeline Run Results

Wow - these tests ran quickly! Each of the tests ran in minutes, which supports the claim on NVIDIA’s website about Parabricks (shown below).

45 minutes for 30x WGS Analysis with Parabricks!

All three of the pipelines ran using the commands in the tutorials (with one exception - there is a missing library in the third pipeline source files, so the final HTML report isn’t produced [although all of the other output files are produced]).

I can’t remember ever being able to run a GATK HaplotypeCaller pipeline job in just 7 minutes! Shown below is the output from my test run on GCP.

Interestingly, NVIDIA’s documentation includes AWS test run results in its examples. Below, I summarize those results and compare them with the results of my test runs on GCP.

| Pipeline   | GCP       | AWS       |
|------------|-----------|-----------|
| FQ2BAM     | 7 minutes | 6 minutes |
| Haplotype  | 7 minutes | 6 minutes |
| VCF QC     | 5 minutes | 3 minutes |

Next Steps

I am excited to share my results with my current bioinformatics partners. Given that there are a number of pipeline types (shown below) that are already available, this GCP Marketplace solution is a compelling tool for many use cases.

Also convenient for testing: deleting the deployment cleanly removes the GCE VM.
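
The same cleanup can be sketched from the command line with Deployment Manager. The deployment name below is a hypothetical placeholder, and the helper only prints what it would do unless `CONFIRM=yes` is set, since deletion removes the deployed resources.

```shell
# Hypothetical deployment name; use the name shown in Deployment Manager.
DEPLOYMENT_NAME="${DEPLOYMENT_NAME:-parabricks-test}"

# Delete a deployment only when CONFIRM=yes is set; otherwise just report.
delete_deployment() {
  if [ "${CONFIRM:-no}" != "yes" ]; then
    echo "Would delete deployment '$1'; rerun with CONFIRM=yes to proceed"
    return 0
  fi
  gcloud deployment-manager deployments delete "$1" --quiet
}

delete_deployment "$DEPLOYMENT_NAME"
```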

Lynn Langit's Cloud World
