Mastering Data with Spark and Machine Learning

Enterprise data on customers, vendors, products, etc., is siloed and represented differently across diverse systems, hurting analytics, compliance, regulatory reporting, and 360-degree views. Traditional rule-based MDM systems built on legacy architectures struggle to unify this growing data. This talk covers a modern master data management application built with Spark, Cassandra, machine learning, and Elastic.


Outline/Structure of the Talk

This talk is about the design and architecture of a master data management system using Spark, Cassandra, and Elastic. Our application unifies non-transactional master data across multiple domains, such as customer, organization, and product, drawn from multiple systems such as ERP, CRM, and the custom applications of different business units. The challenges here are that:

  • each source and data type has its own schema and format
  • data volumes run into millions of records
  • linking similar records is a fuzzy matching and computationally expensive exercise.
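To make the last point concrete, a common way to tame the quadratic comparison space is blocking: only records that share a cheap key are compared against each other. The sketch below is plain Python, using a hypothetical name-prefix blocking key and a simple string similarity; it illustrates the technique in general, not the talk's actual matcher.

```python
from difflib import SequenceMatcher
from collections import defaultdict

def blocking_key(record):
    """Hypothetical blocking key: first three letters of the name, lowercased."""
    return record["name"][:3].lower()

def candidate_pairs(records):
    """Group records by blocking key and only compare within a block,
    avoiding the O(n^2) cost of comparing every pair of records."""
    blocks = defaultdict(list)
    for r in records:
        blocks[blocking_key(r)].append(r)
    for block in blocks.values():
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                yield block[i], block[j]

def similarity(a, b):
    """Simple string similarity in [0, 1]; a production system would
    combine several field-level measures instead."""
    return SequenceMatcher(None, a["name"], b["name"]).ratio()

records = [
    {"id": 1, "name": "Acme Corporation"},
    {"id": 2, "name": "Acme Corp"},
    {"id": 3, "name": "Globex Inc"},
]

# Keep only candidate pairs whose similarity clears an illustrative threshold.
matches = [(a["id"], b["id"]) for a, b in candidate_pairs(records)
           if similarity(a, b) > 0.7]
```

With the three records above, only the two "Acme" records share a block, so a single comparison is made instead of three.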

To unify and master this diverse data, we use the Spark Data Source API and machine learning. We discuss how the abstraction offered by the Data Source API allows us to consume and manipulate the different datasets easily. After aligning the required attributes, we use Spark to cluster and classify probable matches using a human-in-the-loop feedback system. These matched and clustered records are persisted to Cassandra and exposed to data stewards through an AJAX-based GUI. The Spark job also indexes the records into Elastic, which lets the data steward query and search clusters more effectively.
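The human-in-the-loop step can be sketched as score-based routing: a classifier scores candidate pairs, confident scores are decided automatically, and the uncertain middle band is queued for a data steward whose labels feed back into training. The thresholds and identifiers below are illustrative assumptions, not the application's actual logic.

```python
def route_pairs(scored_pairs, auto_threshold=0.9, reject_threshold=0.3):
    """Route scored candidate pairs: high scores are auto-matched, low
    scores auto-rejected, and the uncertain band goes to a steward's
    review queue, whose decisions later become new training labels."""
    auto_match, review_queue, auto_reject = [], [], []
    for pair, score in scored_pairs:
        if score >= auto_threshold:
            auto_match.append(pair)
        elif score <= reject_threshold:
            auto_reject.append(pair)
        else:
            review_queue.append(pair)
    return auto_match, review_queue, auto_reject

# Hypothetical classifier output: (record-id pair, match probability).
scored = [(("A1", "A2"), 0.95), (("B1", "B7"), 0.55), (("C3", "D4"), 0.10)]
matched, review, rejected = route_pairs(scored)
```

The design point is that steward effort is spent only on the uncertain band, which shrinks as the retrained model improves.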

The talk will cover the end-to-end flow, design, and architecture of the different components, as well as the per-source, per-type configuration that supports diverse and previously unknown datasets and schemas. I will also talk about the performance gains from Spark, the machine learning used for data matching and stewardship, and the roles of Cassandra and Elastic in the application.
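As an illustration of what per-source configuration can look like, the sketch below maps source-specific attribute names onto a canonical schema before matching. The source names, field names, and mapping shape are hypothetical, not the talk's actual configuration format.

```python
# Hypothetical per-source mappings from native field names to a
# canonical schema shared by all downstream matching steps.
SOURCE_MAPPINGS = {
    "crm": {"cust_name": "name", "cust_email": "email"},
    "erp": {"party_nm": "name", "email_addr": "email"},
}

def align(source, record):
    """Rename source-specific attributes to canonical names so the
    matcher sees a uniform schema regardless of origin system."""
    mapping = SOURCE_MAPPINGS[source]
    return {canonical: record[src]
            for src, canonical in mapping.items() if src in record}

row = align("erp", {"party_nm": "Acme Corp", "email_addr": "ap@acme.example"})
```

Adding a new source then means adding a mapping entry rather than changing pipeline code.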

Learning Outcome

This talk will benefit master data management professionals, data quality analysts, and practitioners of record linkage, entity resolution, data stewardship, and governance. In addition, engineers and data scientists can learn from the complexities of our application and take away key design and architecture principles. ML and AI engineers will take away approaches for learning, building, and deploying data-labelling techniques.

Target Audience

Data scientists and engineers, ML practitioners, and data management, MDM, data governance, and stewardship practitioners

Prerequisites for Attendees

Basic familiarity with distributed technologies like Spark is welcome but not required.
