Topological Space Creation and Clustering at BigData Scale
Every dataset has an inherent natural geometry. We are generally influenced by how the world appears to us visually, so we apply the same flat Euclidean geometry to data. But the geometry of the data may be curved, may have holes, and may not admit a well-defined distance between every pair of points. If we impose Euclidean geometry anyway, we risk distorting the data space and destroying the information it contains.
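As a small illustration of this point, the sketch below compares the straight-line (Euclidean) distance with the along-the-surface (geodesic) distance between two points on a unit sphere. The sphere and the helper names (`to_unit_vector`, `euclidean_distance`, `geodesic_distance`) are assumptions chosen for this example and are not taken from the talk.

```python
import math

def to_unit_vector(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to a point on the unit sphere."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )

def euclidean_distance(p, q):
    """Flat straight-line distance, which cuts through the sphere."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def geodesic_distance(p, q):
    """Great-circle distance measured along the surface of the unit sphere."""
    dot = sum(a * b for a, b in zip(p, q))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.acos(max(-1.0, min(1.0, dot)))

north_pole = to_unit_vector(90.0, 0.0)
south_pole = to_unit_vector(-90.0, 0.0)

chord = euclidean_distance(north_pole, south_pole)  # about 2
arc = geodesic_distance(north_pole, south_pole)     # about pi
print(f"Euclidean: {chord:.4f}, geodesic: {arc:.4f}")
```

For two opposite poles the flat distance understates the surface distance by roughly a third, which is the kind of distortion the paragraph above describes.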
In the BigData world we regularly handle terabytes of data and must extract meaningful information from it, often by applying unsupervised machine learning techniques. Two important steps in this process are building a topological space that captures the natural geometry of the data and then clustering within that space to obtain meaningful clusters.
This talk will walk through "Data Geometry" discovery techniques, first analytically and then via applied machine learning methods, so that listeners can take away hands-on techniques for discovering the real geometry of their data. Attendees will be presented with various BigData techniques, along with Apache Spark code showing how to build data geometry over massive data lakes.
Outline/Structure of the Talk
The talk will proceed in the following order:
- Understanding the need for data geometry
- Importance of geometry as dimensionality and inter-relations amongst data increase
- Euclidean flat geometry cannot handle non-linear relations
- Defining topological spaces, metric spaces
- Distance measurement over curved surfaces
- Merits and demerits of global and local manifolds
- Applied machine learning techniques for discovering data geometry
- Building a distance matrix for a given distance function
- Challenges in building a 10 million x 10 million distance matrix
- Performance tips on building a trillion-point matrix over Apache Spark
- Distributed clustering over the obtained distance matrix
- Performance tips on big data clustering over Apache Spark
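To make the scale of the distance matrix item concrete, here is a minimal sketch in plain Python. It works out the raw size of a dense 10 million x 10 million matrix and shows one possible way to split the work into blocks that cover only the upper triangle, since a distance function is symmetric. The 8-byte entry size, the block layout, and the helper names (`dense_matrix_bytes`, `upper_triangle_entries`, `block_pairs`, `block_distances`) are assumptions for illustration, not the method presented in the talk.

```python
import math
from itertools import combinations_with_replacement

def dense_matrix_bytes(n_points, bytes_per_entry=8):
    """Bytes needed to store every entry of an n x n matrix (assumes 8-byte floats)."""
    return n_points * n_points * bytes_per_entry

def upper_triangle_entries(n_points):
    """Number of distinct unordered pairs, which is all a symmetric distance needs."""
    return n_points * (n_points - 1) // 2

def block_pairs(n_points, block_size):
    """Return (row_block, col_block) index ranges covering the upper triangle."""
    blocks = [
        range(start, min(start + block_size, n_points))
        for start in range(0, n_points, block_size)
    ]
    # Diagonal blocks are included so that pairs inside one block are not lost.
    return combinations_with_replacement(blocks, 2)

def block_distances(points, rows, cols, distance):
    """Distances for one block pair, keeping only i < j to avoid duplicates."""
    return {
        (i, j): distance(points[i], points[j])
        for i in rows
        for j in cols
        if i < j
    }

print(dense_matrix_bytes(10_000_000))      # 800000000000000 bytes, i.e. 800 TB
print(upper_triangle_entries(10_000_000))  # 49999995000000 distinct pairs

# Tiny demonstration with five 2-D points and blocks of two points each.
points = [(0, 0), (3, 4), (6, 8), (0, 1), (1, 0)]
all_distances = {}
for rows, cols in block_pairs(len(points), block_size=2):
    all_distances.update(block_distances(points, rows, cols, math.dist))
```

Each block pair is independent of the others, so in a distributed setting it could plausibly be handed to a separate worker; how that maps onto Spark tasks is left open here.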
Learning Outcome
Attendees should take away the following from the talk:
- How to find the true geometry of the data
- Understanding of topological spaces and metrics
- Manifold construction
- Practical methods of capturing inherent data geometry
- Clustering techniques to capture natural data structures
- Methodology and tips on creating huge distance matrices
- Apache Spark code snippets on scaling distance matrix creation
- Apache Spark code snippets on big data clustering
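As a rough sketch of what clustering over a precomputed distance matrix can look like, the example below links any two points whose distance is at or below a threshold and returns the resulting connected groups, which is equivalent to cutting single-linkage clustering at that threshold. The choice of technique, the `threshold` value, and the function name `threshold_clusters` are assumptions for illustration; the talk may use a different algorithm.

```python
def threshold_clusters(n_points, distances, threshold):
    """Group points into connected components of the 'distance <= threshold' graph.

    `distances` maps (i, j) index pairs to a precomputed distance.
    """
    parent = list(range(n_points))

    def find(i):
        # Follow parent links to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), d in distances.items():
        if d <= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i  # merge the two groups

    clusters = {}
    for i in range(n_points):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Five points on a line; distances are stored once per unordered pair.
positions = [0.0, 1.0, 2.0, 10.0, 11.0]
distances = {
    (i, j): abs(positions[i] - positions[j])
    for i in range(len(positions))
    for j in range(i + 1, len(positions))
}
clusters = threshold_clusters(len(positions), distances, threshold=1.5)
print(clusters)  # [[0, 1, 2], [3, 4]]
```

Because the function only reads pairwise distances, it works with any distance function, including non-Euclidean ones, as long as the matrix has been built beforehand.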
Target Audience
BigData machine learning developers and data scientists
Prerequisites for Attendees
Basic knowledge of distance functions and clustering
Links
Previous talk given at OracleCode Delhi 2017:
"Performance Diagnostic Techniques for Big Data Solutions Using Machine Learning"
https://developer.oracle.com/code/newdelhi