#### Topological space creation and Clustering at BigData scale

Every dataset has an inherent natural geometry. We are generally influenced by how the world appears to us visually, and so we apply the same flat Euclidean geometry to data. But the data's geometry may be curved, may have holes, and in some cases distances cannot even be defined. If we impose Euclidean geometry on it anyway, we may distort the data space and destroy the information content inside it.
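As a small illustration (a sketch of the idea, not taken from the talk itself), consider points constrained to a unit circle: the straight-line Euclidean chord between two points understates the distance along the curve itself. The function names below are purely illustrative.

```python
import math

def euclidean_chord(theta1, theta2):
    """Straight-line (Euclidean) distance between two points on the unit circle."""
    x1, y1 = math.cos(theta1), math.sin(theta1)
    x2, y2 = math.cos(theta2), math.sin(theta2)
    return math.hypot(x2 - x1, y2 - y1)

def geodesic_arc(theta1, theta2):
    """Arc-length distance along the circle -- the intrinsic geometry of the data."""
    d = abs(theta1 - theta2) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

# Two antipodal points: the Euclidean chord is 2.0, but travelling
# along the circle itself takes pi (~3.14) -- the flat metric
# understates the intrinsic separation.
print(euclidean_chord(0.0, math.pi))  # 2.0
print(geodesic_arc(0.0, math.pi))     # ~3.14159
```

This is exactly the gap that manifold methods try to close: they recover the intrinsic (geodesic-like) distances rather than the ambient flat ones.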

In the BigData world we regularly handle terabytes of data and must extract meaningful information from it, applying many unsupervised machine learning techniques to do so. Two important steps in this process are building a topological space that captures the natural geometry of the data, and then clustering in that topological space to obtain meaningful clusters.

**This talk will walk through "Data Geometry" discovery techniques, first analytically and then via applied machine learning methods, so that listeners can take away hands-on techniques for discovering the real geometry of their data. Attendees will be presented with various BigData techniques, along with Apache Spark code showing how to build data geometry over massive data lakes.**

#### Outline/structure of the Session

The key outline would be in the following order:

- Understanding the need for data geometry
- Importance of geometry as dimensionality and inter-relations amongst data increase
- Why flat Euclidean geometry cannot handle non-linear relations
- Defining topological spaces and metric spaces
- Distance measurement over curved surfaces
- Merits and demerits of global and local manifolds
- Applied machine learning techniques for discovering data geometry
- Building a distance matrix for a given distance function
- Challenges in building a 10 million x 10 million distance matrix
- Performance tips on building a trillion-point matrix over Apache Spark
- Distributed clustering over the obtained distance matrix
- Performance tips on big data clustering over Apache Spark
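To give a taste of the distance-matrix step in the outline, here is a minimal single-machine sketch in plain Python; the Spark version in the talk distributes the same pairwise computation, and all names here are illustrative rather than the talk's actual code.

```python
from math import sqrt

def distance_matrix(points, dist):
    """Pairwise distance matrix for any user-supplied distance function.

    Only the upper triangle is computed and then mirrored, since a
    distance function is symmetric -- the same trick halves the work
    when the computation is distributed at Spark scale.
    """
    n = len(points)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(points[i], points[j])
            m[i][j] = m[j][i] = d
    return m

# Any metric can be plugged in; Euclidean is just one choice.
euclidean = lambda a, b: sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

pts = [(0, 0), (3, 4), (6, 8)]
m = distance_matrix(pts, euclidean)
print(m[0][1], m[0][2])  # 5.0 10.0
```

The O(n^2) cost of this nested loop is exactly why a 10 million x 10 million matrix is challenging and needs the distributed techniques the talk covers.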

#### Learning Outcome

Attendees should take away the following from the talk:

- How to find the true geometry of the data
- Understanding of topological spaces and metrics
- Manifold construction
- Practical methods of capturing inherent data geometry
- Clustering techniques to capture natural data structures
- Methodology and tips on creating huge distance matrices
- Apache Spark code snippets on scaling distance matrix creation
- Apache Spark code snippets on big data clustering

#### Target Audience

BigData machine learning developers and data scientists

#### Prerequisites

Basic knowledge of distance functions and clustering

#### Links

**Previous talk given in OracleCode Delhi 2017:**

"Performance Diagnostic Techniques for Big Data Solutions Using Machine Learning"

https://developer.oracle.com/code/newdelhi

## Comments

~5 months ago

Thanks for the proposal, Kuldeep! :-)

This is an interesting topic.

It is true that Euclidean distance does not always capture the notion of closeness from a domain standpoint.

However, I have a gut feeling that if suitable transformations are applied to the features, Euclidean distance can still be used.

From the audience perspective, it would be nice if you can unearth the limitations of using the above approach.

Please share your thoughts.

~5 months ago

Hi Vishal,

Good to hear your thoughts on Euclidean geometry :)

Yes, I agree that in some cases we can apply transformations to make Euclidean distance work. For example, when two or more variables are correlated, a combination of them can be used, and then Euclidean geometry works fine.

But often the data has many correlated variables with non-linear relations. A simple example is a sphere, where the axes are tied together. In such cases we construct a manifold embedding with methods like MDS, t-SNE, or LLE, which is itself the creation of a new topological space, and then apply Euclidean distance on the transformed space; the manifold construction has already taken care of the non-linearity.

Moreover, Euclidean distance can only be applied to numerical data, leaving categorical and ordinal data out. And being numerical alone doesn't suffice: even if a feature is a decimal number, if it represents a probability we need other distance functions such as KL-divergence or mutual information. For example, say we have a group of words like basketball, game, and play, all belonging to a sports topic; we can find the association probability of each word to a topic, and similarity between two words then means likeliness to the same topic. In such cases we can also use SimRank as a distance between words.
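To make the KL-divergence point concrete, here is a minimal sketch with made-up word-topic probabilities (the numbers and names are illustrative only):

```python
import math

def kl_divergence(p, q):
    """KL divergence D(P || Q) between two discrete distributions.

    Unlike Euclidean distance, it compares the vectors as probability
    distributions; note that it is asymmetric, so it is not a true metric.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical association probabilities of each word to the topics
# [sports, politics, music]:
basketball = [0.8, 0.1, 0.1]
game = [0.6, 0.2, 0.2]

# Identical distributions diverge by 0; similar topic profiles
# give a small positive value.
print(kl_divergence(basketball, game))
```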

So, all in all, there are many cases where we need to create a distance matrix using a non-Euclidean distance function and cluster over it. This talk is focused on how to do that when the data size is in terabytes.

~5 months ago

This is a very well-thought-out proposal. Thank you, Kuldeep.

For the program committee to gain more confidence in your expertise, can you please share links to past video presentations and/or articles you've published on this topic?

~5 months ago

Thanks, Naresh, for the comment.

I gave a talk on a related topic, data science at big data scale, in May 2017. I have provided the link above for last year's Oracle Code Conference 2017 (https://developer.oracle.com/code/newdelhi); Oracle's website should have the details. Somehow they uploaded the other participants' slides and videos but not mine, so I don't have a video link. But in case it helps, here is an interview video taken just after the talk: https://twitter.com/OracleDevs/status/862279591380918275?s=09