Working with Large Numbers of Non-Trivial ETL Pipelines

May 7th, 11:25 - 11:55 AM, Green Room

Data pipelines need to be flexible, modular and easily monitored. They are not just set-and-forget. The team that monitors a pipeline might not have developed it and may not be experts on the dataset. End users must have confidence in the output.

This talk is a practical walkthrough of a suggested pipeline architecture on AWS using Step Functions, Spot Instances, AWS Batch, Glue, Lambda, and Datadog.

I'll be covering techniques using AWS and Datadog, but many of the approaches apply equally in an Apache Airflow/Kibana environment.

 
Outline/Structure of the Case Study

The talk will start with an overview of AWS Step Functions, including recent changes, as well as advice on when to use a pipelining tool.
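
To make the Step Functions part concrete, here is a minimal sketch of a two-stage pipeline (extract with Lambda, transform with AWS Batch), defined in Amazon States Language and registered with boto3. All ARNs, names, and the IAM role are placeholders for illustration, not the setup from the talk.

```python
"""Sketch of a two-stage ETL state machine: extract via Lambda, transform via AWS Batch."""
import json
import boto3

DEFINITION = {
    "Comment": "Hypothetical ETL pipeline: extract with Lambda, transform with Batch",
    "StartAt": "Extract",
    "States": {
        "Extract": {
            "Type": "Task",
            # Placeholder Lambda ARN for the extract stage
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:extract-raw-data",
            "Retry": [{"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 2}],
            "Next": "Transform",
        },
        "Transform": {
            "Type": "Task",
            # Built-in Batch integration; waits for the job to finish (.sync)
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobName": "transform-job",
                "JobQueue": "arn:aws:batch:eu-west-1:123456789012:job-queue/spot-queue",
                "JobDefinition": "arn:aws:batch:eu-west-1:123456789012:job-definition/transform:1",
            },
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="etl-pipeline-example",
    definition=json.dumps(DEFINITION),
    roleArn="arn:aws:iam::123456789012:role/etl-step-functions-role",
)
```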

I'll then show how to configure your pipelines and wrap each stage to emit JSON-formatted logs that include the state machine identifier and the execution id. This lets DevOps read the logs as if they came from one continuous program.
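
As a rough illustration of the wrapping idea, the sketch below decorates a Lambda stage so every log line is a JSON object stamped with the state machine name and execution id. It assumes the state machine passes those values in the stage input, for example via the context object (`"execution_id.$": "$$.Execution.Id"`, `"state_machine.$": "$$.StateMachine.Name"`); the field names and decorator are illustrative, not a prescribed convention.

```python
"""Sketch of a JSON-logging wrapper for a Lambda pipeline stage."""
import functools
import json
import logging

logger = logging.getLogger("pipeline")
logger.setLevel(logging.INFO)


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object so it can be filtered and aggregated."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "message": record.getMessage(),
            "state_machine": getattr(record, "state_machine", None),
            "execution_id": getattr(record, "execution_id", None),
        }
        return json.dumps(entry)


handler_ = logging.StreamHandler()
handler_.setFormatter(JsonFormatter())
logger.addHandler(handler_)


def pipeline_stage(func):
    """Decorator that stamps every log record with the Step Functions execution context."""

    @functools.wraps(func)
    def wrapper(event, context):
        extra = {
            "state_machine": event.get("state_machine", "unknown"),
            "execution_id": event.get("execution_id", "unknown"),
        }
        log = logging.LoggerAdapter(logger, extra)
        log.info("stage started")
        try:
            result = func(event.get("payload", event), log)
            log.info("stage finished")
            return result
        except Exception:
            log.exception("stage failed")
            raise

    return wrapper


@pipeline_stage
def handler(payload, log):
    # The actual ETL work for this stage would go here.
    log.info("processed %d records", len(payload.get("records", [])))
    return payload
```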

Finally, I'll show how to examine the data using Datadog measures generated directly from log data, as well as custom metrics pushed to Datadog via an SNS topic.
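
The custom-metric path could look something like the sketch below: a pipeline stage publishes a small JSON payload to an SNS topic, and a subscriber Lambda forwards it to Datadog with the `datadog` Python package. The topic ARN, metric names, and tags are assumptions for illustration.

```python
"""Sketch of pushing a custom pipeline metric to Datadog via SNS."""
import json
import time

import boto3
from datadog import api, initialize

# Placeholder topic ARN
TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:pipeline-metrics"


def publish_metric(name, value, tags):
    """Called from a pipeline stage: fire-and-forget a metric payload to SNS."""
    boto3.client("sns").publish(
        TopicArn=TOPIC_ARN,
        Message=json.dumps(
            {"metric": name, "value": value, "tags": tags, "timestamp": time.time()}
        ),
    )


def forward_to_datadog(event, context):
    """Subscriber Lambda: turn each SNS record into a Datadog metric point."""
    initialize()  # picks up the API key from the environment (e.g. DATADOG_API_KEY)
    for record in event["Records"]:
        payload = json.loads(record["Sns"]["Message"])
        api.Metric.send(
            metric=payload["metric"],
            points=[(payload["timestamp"], payload["value"])],
            tags=payload["tags"],
        )


# Example usage inside a stage:
# publish_metric("etl.rows_loaded", 125_000, ["pipeline:orders", "env:prod"])
```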

Learning Outcome

As a result of this session, the audience should:

  • Understand how to use AWS Step Functions.
  • See the advantages of JSON logging.
  • Have new ideas about how to monitor their data pipelines.

Target Audience

Data Engineers, Data Architects, DevOps

Prerequisites for Attendees

Familiarity with the concepts of ETL.
