Building Genomics Pipelines with AWS Lambda and Apache Spark

schedule 01:15 PM - 02:00 PM place Legends Ballroom people 81 Attending

Lynn Langit shares lessons learned and cloud data pipeline patterns via examples from work she’s doing with CSIRO Bioinformatics Australia. The team there, led by Dr. Denis Bauer, is analyzing a number of large genomic datasets.

First, Lynn examines real-time analysis with cloud-based solutions. Keeping runtime constant can be challenging for problems that vary in complexity, such as genome engineering. The CSIRO GT-Scan2 tool works by instantaneously recruiting additional Lambda functions as the complexity increases. It was built using a microservices pattern (serverless) using AWS services.

Next, Lynn will demo a Jupyter notebook which shows how genomic research can leverage Apache Spark to massively parallelize the generation of random forests to identify disease genes efficiently.She’ll discuss the pipeline’s use of an OSS library written by the team at CSIRO (VariantSpark).

VariantSpark can analyze 3,000 samples with 80 million features in under 30 minutes. This pipeline enables real-time diagnosis by finding similar patients. This platform is contributing to motor neuron disease research (publicized by the Ice Bucket Challenge) in Australia.

 
1 favorite thumb_down thumb_up 0 comments visibility_off  Remove from Watchlist visibility  Add to Watchlist
 

Target Audience

Data analysts, software engineers, architects and anyone with an interest in cloud based solutions, lambda functions or genome research.

schedule Submitted 2 months ago

Comments Subscribe to Comments

comment Comment on this Proposal