Mastering Data with Spark and Machine Learning

Enterprise data on customers, vendors, products, etc. is siloed and represented differently across diverse systems, hurting analytics, compliance, regulatory reporting and 360-degree views. Traditional rule-based MDM systems with legacy architectures struggle to unify this growing data. This talk covers a modern master data application built with Spark, Cassandra, ML and Elastic.


Outline/Structure of the Talk

This talk is about the design and architecture of a master data management system using Spark, Cassandra and Elastic. Our application unifies non-transactional master data in multiple data domains, such as customer, organization and product, across multiple systems such as ERP, CRM and the custom applications of different business units. The challenges here are that

  • each source and data type has its own schema and format
  • data volumes run into millions of records
  • linking similar records is a fuzzy matching and computationally expensive exercise.
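
To make the cost concrete: comparing every record with every other one needs n(n-1)/2 comparisons, which is roughly 5 x 10^11 for one million records. The Python sketch below shows that arithmetic together with blocking, a common record-linkage technique for shrinking the candidate set. Blocking is an assumption here, not something this proposal says the application uses, and the function names and the blocking key are purely illustrative.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n):
    """Number of unordered record pairs in a naive all-vs-all comparison."""
    return n * (n - 1) // 2


def blocked_pairs(records, block_key):
    """Yield candidate pairs only among records that share a blocking key.

    `block_key` is a hypothetical, caller-supplied function (for example,
    the first letter of a name). Records in different blocks are never
    compared, which trades some recall for far fewer comparisons.
    """
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    for members in blocks.values():
        yield from combinations(members, 2)
```

With four names and a first-letter key, `blocked_pairs` would compare only records that start with the same letter, instead of all six possible pairs.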

To unify and master this diverse data, we use the Spark Data Source API and machine learning. We discuss how the abstraction offered by the Data Source API allows us to consume and manipulate the different datasets easily. After aligning the required attributes, we use Spark to cluster and classify probable matches using a human-in-the-loop feedback system. These matched and clustered records are persisted to Cassandra and exposed to data stewards through an AJAX-based GUI. The Spark job also indexes the records to Elastic, which lets the data steward query and search clusters more effectively.
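
As a rough illustration of the match, review and cluster steps described above, here is a small single-machine Python sketch. It is not the application's actual algorithm: the string similarity from the standard library's `difflib`, the two thresholds and the union-find clustering are all assumptions chosen to keep the example self-contained. The "review" band stands in for the edge cases a data steward would label.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] from difflib (an assumed metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify(score, match_at=0.9, review_at=0.7):
    """Label a pair score; the thresholds are hypothetical defaults."""
    if score >= match_at:
        return "match"
    if score >= review_at:
        return "review"  # uncertain pair, to be routed to a human reviewer
    return "non-match"


def cluster(names, match_at=0.9):
    """Group names into clusters by linking every pair scored as a match."""
    parent = list(range(len(names)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        if similarity(names[i], names[j]) >= match_at:
            parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())
```

In a distributed setting the pairwise loop would be replaced by a parallel job over candidate pairs, and labelled "review" pairs would feed back into the classifier as training data.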

The talk will cover the end-to-end flow, design and architecture of the different components, as well as the per-source and per-type configuration that supports different and previously unknown datasets and schemas. I will also talk about the performance gain from using Spark, the machine learning for data matching and stewardship, and the role of Cassandra and Elastic in the application.
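
One way to picture per-source configuration is a mapping from each source's field names to a shared set of attributes. The Python sketch below assumes such a mapping; the source names, field names and the `align` helper are hypothetical and only illustrate the idea, not the application's real configuration format.

```python
# Hypothetical per-source configuration: common attribute -> source field.
SOURCE_CONFIG = {
    "crm": {"name": "full_name", "email": "email_address"},
    "erp": {"name": "CUST_NM", "email": "EMAIL"},
}


def align(record, source, config=SOURCE_CONFIG):
    """Project a raw source record onto the common attributes.

    Fields not listed in the mapping are dropped; mapped fields missing
    from the record come back as None.
    """
    mapping = config[source]
    return {target: record.get(field) for target, field in mapping.items()}
```

Supporting a new source then means adding one entry to the configuration rather than changing the matching code.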

Learning Outcome

This talk will benefit master data management professionals, data quality analysts, and practitioners of record linkage, entity resolution, data stewardship and governance. In addition, engineers and data scientists can learn from the complexities of our application and take away key design and architecture principles. ML and AI engineers will take away lessons on learning, building and deploying data labelling techniques.

Target Audience

Data Scientists and Engineers, ML practitioners, Data Management, MDM, Data Governance and Stewardship practitioners

Prerequisites for Attendees

Basic familiarity with distributed technologies like Spark is welcome but not required.

Submitted 4 months ago

Public Feedback

  • Dr. Vikas Agrawal
    By Dr. Vikas Agrawal  ~  3 months ago

    Dear Sonal: Can you please add details of how the 45-minute talk will be broken up, as Naresh mentioned below?

    Warm Regards


  • Kuldeep Jiwani
    By Kuldeep Jiwani  ~  4 months ago

    Hi Sonal,

    Thanks for submitting a proposal on the various architectures used for productionisation of ML, especially covering Spark, the de facto Big Data tool. It definitely looks good.

    Also it is clear that audiences will take away key architecture techniques from this talk.

    Can you also elaborate a bit more on the other takeaways, such as data labelling and fuzzy matching techniques?

    • Sonal Goyal
      By Sonal Goyal  ~  4 months ago

      Hi Kuldeep,

      Thanks for your comment and also for liking my proposal. Training data is always a challenge for ML adoption, and I will give an overview of how data stewards label edge cases and how we use those labels as the building block for our algorithm.

      • Naresh Jain
        By Naresh Jain  ~  4 months ago

        Hi Sonal,

        Thank you for clarifying the details.

        Can you please help me understand how you plan to spend the 45 mins?

        I'm hoping you will spend the majority of your time on the details of the data labelling and fuzzy matching techniques used by your algo. IMHO the overall problem context and architecture have been well addressed historically, and we don't need to spend much time on them.