Topological space creation and Clustering at BigData scale

Location: Bengaluru | Schedule: Sep 1st, 12:15 - 01:00 PM | Place: Grand Ball Room 2 | 42 Interested

Every dataset has an inherent natural geometry. We are strongly influenced by how the world appears to us visually, so we tend to apply the same flat Euclidean geometry to data. But the geometry of data may be curved, may have holes, and may not admit a well-defined distance between every pair of points. Imposing Euclidean geometry on such data distorts the data space and destroys information it contains.
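As a small illustration of how a flat metric can understate distance on a curved space, the sketch below compares the straight-line (chord) distance between two points on a sphere with the distance measured along the surface. The sphere, the radius constant and the function names are assumptions chosen for illustration; they are not taken from the talk.

```python
import math

EARTH_RADIUS_KM = 6371.0  # illustrative radius; any sphere would do


def to_xyz(lat_deg, lon_deg, r=EARTH_RADIUS_KM):
    """Embed a (latitude, longitude) pair in flat 3-D space."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (
        r * math.cos(lat) * math.cos(lon),
        r * math.cos(lat) * math.sin(lon),
        r * math.sin(lat),
    )


def chord_distance(p, q):
    """Flat Euclidean distance: a straight line cutting through the sphere."""
    return math.dist(to_xyz(*p), to_xyz(*q))


def great_circle_distance(p, q, r=EARTH_RADIUS_KM):
    """Distance along the curved surface (haversine formula)."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * r * math.asin(math.sqrt(min(1.0, a)))  # clamp guards against rounding


# Two antipodal points on the equator: the flat metric sees a diameter,
# the surface metric sees half a circumference.
a, b = (0.0, 0.0), (0.0, 180.0)
print(chord_distance(a, b), great_circle_distance(a, b))
```

For nearby points the two distances almost agree, which is why the Euclidean assumption often goes unnoticed until the data spans a large part of a curved space.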

In the BigData world we regularly handle terabytes of data and must extract meaningful information from it, often by applying unsupervised Machine Learning techniques. Two important steps in this process are building a topological space that captures the natural geometry of the data, and then clustering in that topological space to obtain meaningful clusters.
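One common way to approximate the natural geometry of a point cloud is to connect each point to its nearest neighbours and measure distance along the resulting graph, as Isomap-style methods do. The sketch below is an assumption about what such a step could look like, not the method presented in the talk; the names `knn_graph` and `graph_geodesic` and the toy arc data are hypothetical.

```python
import heapq
import math


def knn_graph(points, k):
    """Link every point to its k nearest neighbours (edges made symmetric)."""
    graph = {i: [] for i in range(len(points))}
    for i, p in enumerate(points):
        nearest = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        for d, j in nearest:
            graph[i].append((j, d))
            graph[j].append((i, d))
    return graph


def graph_geodesic(graph, src, dst):
    """Shortest path length along graph edges (Dijkstra's algorithm)."""
    best = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > best.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < best.get(v, math.inf):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf


# Toy data: 64 points on an open arc covering 7/8 of a unit circle.
# The two ends are close in flat space but far apart along the curve.
ARC = 1.75 * math.pi
points = [(math.cos(ARC * i / 63), math.sin(ARC * i / 63)) for i in range(64)]

euclidean = math.dist(points[0], points[-1])
geodesic = graph_geodesic(knn_graph(points, k=2), 0, len(points) - 1)
print(euclidean, geodesic, ARC)
```

The graph distance tracks the arc length closely, while the Euclidean distance jumps across the gap. The all-pairs neighbour search here is quadratic and only suitable for toy sizes.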

This talk will walk through "Data Geometry" discovery techniques, first analytically and then via applied Machine Learning methods, so that listeners can take away hands-on techniques for discovering the real geometry of their data. Attendees will be presented with various BigData techniques, along with Apache Spark code showing how to build data geometry over massive data lakes.

 
 

Outline/Structure of the Talk

The talk will cover the following topics, in order:

  • Understanding the need for data geometry
  • Importance of geometry as dimensionality and inter-relations amongst data increase
  • Why flat Euclidean geometry cannot handle non-linear relations
  • Defining topological spaces, metric spaces
  • Distance measurement over curved surfaces
  • Merits and demerits of global and local manifolds
  • Applied Machine Learning techniques on discovering data geometry
  • Building a distance matrix for a given distance function
  • Challenges in building a 10 million x 10 million distance matrix
  • Performance tips on building a trillion-entry matrix over Apache Spark
  • Distributed clustering over the obtained distance matrix
  • Performance tips on big data clustering over Apache Spark
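To put the 10 million x 10 million figure in perspective: a dense matrix of that size has 10^14 entries, which is 800 TB at 8 bytes per entry, so it cannot be materialised naively. A common workaround, assumed here rather than taken from the talk, is to compute the matrix tile by tile, visit only the upper triangle (the matrix is symmetric), and keep only the pairs that fall under a distance threshold. The sketch is plain Python with no Spark calls; each tile is the kind of unit that could be handed to one distributed task. All function names are hypothetical.

```python
import math


def dense_matrix_bytes(n, bytes_per_entry=8):
    """Storage needed for a dense n x n matrix."""
    return n * n * bytes_per_entry


def upper_triangle_blocks(n, block):
    """Yield the (row_start, col_start) of every tile on or above the diagonal."""
    for i0 in range(0, n, block):
        for j0 in range(i0, n, block):
            yield i0, j0


def block_distances(points, i0, j0, block, dist=math.dist):
    """Distances between one slice of rows and one slice of columns."""
    rows = points[i0:i0 + block]
    cols = points[j0:j0 + block]
    return [[dist(p, q) for q in cols] for p in rows]


def sparse_neighbors(points, block, eps, dist=math.dist):
    """Keep only pairs (i, j) with i < j whose distance is at most eps."""
    edges = {}
    for i0, j0 in upper_triangle_blocks(len(points), block):
        tile = block_distances(points, i0, j0, block, dist)
        for a, row in enumerate(tile):
            for b, d in enumerate(row):
                i, j = i0 + a, j0 + b
                if i < j and d <= eps:
                    edges[(i, j)] = d
    return edges


print(dense_matrix_bytes(10_000_000) / 1e12, "TB for a dense 10M x 10M matrix")

toy_points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(sparse_neighbors(toy_points, block=2, eps=1.5))
```

Passing a different `dist` function swaps in a non-Euclidean metric without changing the tiling logic.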

Learning Outcome

Attendees should take away the following from the talk:

  • How to find the true geometry of the data
  • Understanding of topological spaces and metrics
  • Manifold construction
  • Practical methods of capturing inherent data geometry
  • Clustering techniques to capture natural data structures
  • Methodology and tips on creating huge distance matrices
  • Apache Spark code snippets on scaling distance matrix creation
  • Apache Spark code snippets on big data clustering
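As a rough idea of what clustering over precomputed distances can look like, the sketch below groups points by merging every pair whose distance is at most a threshold, which amounts to single-linkage clustering cut at that threshold, or equivalently the connected components of the thresholded neighbour graph. This is a stand-in technique chosen for brevity, not necessarily the clustering method used in the talk, and it is plain Python rather than Spark code. The function name and the toy data are hypothetical.

```python
def cluster_by_threshold(n, pairs, eps):
    """Label n points; points share a label if linked by a chain of pairs with d <= eps.

    `pairs` is any iterable of (i, j, distance) triples, for example the
    surviving entries of a sparse distance matrix.
    """
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, d in pairs:
        if d <= eps:
            ri, rj = find(i), find(j)
            if ri != rj:
                # Attach the larger root to the smaller, so each cluster
                # ends up labelled by its smallest member index.
                parent[max(ri, rj)] = min(ri, rj)

    return [find(i) for i in range(n)]


# Toy 1-D data with an absolute-difference distance.
values = [0.0, 0.4, 0.9, 5.0, 5.3, 9.0]
pairs = [
    (i, j, abs(values[i] - values[j]))
    for i in range(len(values))
    for j in range(i + 1, len(values))
]
labels = cluster_by_threshold(len(values), pairs, eps=0.6)
print(labels)
```

Because the function only consumes (i, j, distance) triples, it works with any distance function and never needs the full dense matrix in memory.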

Target Audience

BigData Machine Learning developers and data scientists

Prerequisites for Attendees

Basic knowledge of distance functions and clustering

Submitted 2 years ago
