Spark is a general-purpose distributed computing platform, designed to handle both batch and streaming applications. It extends the MapReduce paradigm introduced by Google in its 2004 research paper, and it leverages the functional programming paradigm to perform transformations on datasets residing in the cluster's memory.
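The functional style Spark builds on can be seen with plain Scala collections: a pipeline of higher-order functions transforms immutable data step by step. This is a local sketch only; Spark's RDD API exposes the same `map`/`flatMap`/reduce-by-key shape, but distributed across a cluster. The names `lines` and `counts` are illustrative, not from Spark.

```scala
// Word count expressed with Scala's built-in higher-order functions.
// Spark's RDDs offer the same style of transformation pipeline,
// only evaluated across a cluster instead of a local collection.
object WordCountSketch {
  val lines = List("spark extends map reduce", "spark is functional")

  val counts: Map[String, Int] =
    lines
      .flatMap(_.split(" "))              // split each line into words
      .map(word => (word, 1))             // pair each word with a count of 1
      .groupMapReduce(_._1)(_._2)(_ + _)  // sum the counts per word (akin to reduceByKey)
}
```

Because the transformations are pure functions over immutable data, the same pipeline can be partitioned and re-executed safely, which is what makes this style a natural fit for a distributed engine.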

Matei Zaharia, the creator of Spark, mentioned the importance of using a functional programming language: "At the time we started, I really wanted a PL that supports a language-integrated interface (where people write functions inline, etc) because I thought that was the way people would want to program these applications after seeing research systems that had it ..."



Outline/Structure of the Demonstration

  • Introduction to Spark
  • How Spark builds and manages distributed datasets as Scala collections
  • Higher order functions in Spark vs Scala
  • Demonstration
    • Transformations on RDD in Spark using functions
    • Dataframes API in Spark
    • Typed transformations on Datasets
  • Questions
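One difference the outline's "Higher order functions in Spark vs Scala" point hinges on: Scala's default collections evaluate each transformation eagerly, whereas Spark transformations are lazy and nothing runs until an action (such as `collect` or `count`) forces the pipeline. A hedged local analogue uses Scala's `LazyList`; the object and variable names here are illustrative.

```scala
// Spark transformations only build a plan; an action forces execution.
// Scala's LazyList behaves similarly, while a plain List would run
// the mapped function immediately at each step.
object LazinessSketch {
  var evaluations = 0

  val pipeline: LazyList[Int] =
    LazyList.from(1).take(5).map { n => evaluations += 1; n * 2 }

  val before = evaluations      // still 0: nothing has been demanded yet
  val result = pipeline.toList  // the "action": forces the whole pipeline
  val after  = evaluations      // now 5, one evaluation per element
}
```

This deferral is what lets Spark inspect the whole chain of transformations and optimize it before any work is scheduled on the cluster.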

Learning Outcome

  • Get an overview of Spark, the Big data processing engine
  • Understand why lazy collections need to be functional in nature
  • Learn how to optimize iterative algorithms in the Big data world
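The last outcome above usually comes down to caching: in an iterative algorithm, an uncached Spark lineage is recomputed from scratch on every iteration, while a cached dataset is materialized once and reused. A minimal local sketch of the same trade-off, using a Scala view (recomputed on each traversal) versus a forced `Vector` (computed once); the names and the `expensive` function are hypothetical.

```scala
// Recomputation vs caching, sketched with Scala collections.
// A view re-runs its transformations on every traversal, much like
// an uncached RDD lineage; a materialized Vector is computed once,
// much like a dataset after .cache().
object CachingSketch {
  var computations = 0

  def expensive(n: Int): Int = { computations += 1; n * n }

  val data = Vector(1, 2, 3)

  // "Uncached": the view re-applies `expensive` on each traversal.
  val uncached = data.view.map(expensive)
  val s1 = uncached.sum                // 3 computations
  val s2 = uncached.sum                // 3 more
  val afterUncached = computations     // 6 in total

  computations = 0
  // "Cached": force the result once, then reuse it.
  val cached = data.map(expensive)     // 3 computations, materialized
  val t1 = cached.sum
  val t2 = cached.sum
  val afterCached = computations       // still 3
}
```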

Target Audience

People interested in Big data and functional programming

Submitted 3 years ago

Public Feedback

    • Eric Torreborre - Streams, effects and beautiful folds, a winning trilogy
      Sr. Software Engineer
      45 Mins

      Most applications are just reading data, transforming it and writing it somewhere else. There are great libraries in the Scala ecosystem to support these use cases: Akka-Stream, fs2, Monix, ... But if you look under the hood and try to understand how those libraries work, you might be a bit scared by their complexity!

      In this talk you will learn how to build a very minimal "streaming library" where all the difficult concerns are left to other libraries: eff for asynchronous computations and resource management, origami for extracting useful data out of the stream. Then you will decide how to spend your complexity budget and when you should pay for more powerful abstractions.