Generation of Locality Polygons using Open Source Road Network Data and Non-Linear Multi-classification Techniques
One of the principal problems in the developing world is the poor localization of its addresses. This inhibits discoverability of local trade, reduces access to amenities such as bank-account creation and delivery of goods and services (e.g., e-commerce), and delays emergency services such as fire brigades and ambulances. In general, people in the developing world identify an address by neighbourhood/locality names and points of interest (POIs). These names are not standardized, and no official records exist that can help locate them systematically. In this paper, we describe an approach to build accurate geographical boundaries (polygons) for such localities.
As training data, we are provided with two pieces of information for millions of address records: (i) a geocode, which is captured by a human for the given address, and (ii) the set of localities present in that address. The latter is determined either by manual tagging or by an algorithm that takes a raw address string as input and outputs the meaningful locality information present in that address. For example, for the address “A-161 Raheja Atlantis Sector 31 Gurgaon 122002”, the geocode is given as (28.452800, 77.045903), and the set of localities present in the address is given as (Raheja Atlantis, Sector 31, Gurgaon, Pin-code 122002). Development of this algorithm is part of another project we are working on; details about the same can be found here.
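The record layout described above can be sketched as a small data structure. This is only an illustration: the class name `AddressRecord`, its field names, and the helper `training_pairs` are hypothetical and not taken from the authors' code, and expanding a record into one row per tagged locality is an assumption about how the data might be prepared.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AddressRecord:
    """One training record (hypothetical layout): raw address, geocode, tagged localities."""
    raw_address: str
    geocode: Tuple[float, float]   # (latitude, longitude) captured by a human
    localities: Tuple[str, ...]    # from manual tagging or the address-parsing algorithm


# The example record from the text above.
record = AddressRecord(
    raw_address="A-161 Raheja Atlantis Sector 31 Gurgaon 122002",
    geocode=(28.452800, 77.045903),
    localities=("Raheja Atlantis", "Sector 31", "Gurgaon", "Pin-code 122002"),
)


def training_pairs(rec: AddressRecord) -> List[Tuple[float, float, str]]:
    """Expand a record into (latitude, longitude, locality) rows, one per tagged locality."""
    lat, lon = rec.geocode
    return [(lat, lon, name) for name in rec.localities]


rows = training_pairs(record)
```

Each row pairs the same geocode with one locality name, so rows for one family of non-overlapping localities (for example, all pin-codes) can later be selected by filtering on the label.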
Many industries, such as the food-delivery industry, the courier-delivery industry and the KYC (know-your-customer) data-collection industry, are likely to have huge amounts of such data. Such crowdsourced data usually contain a large amount of noise, acquired either through machine/human error in capturing the geocode, or through error in identifying the correct set of localities from a poorly written address. For example, for the address “Plot 1000, Sector 31 opposite Sector 40 road, Gurgaon 122002”, a machine may output the set of localities present in this address as (Sector 31, Sector 40, Gurgaon, Pin-code 122002), even though it is clear that the address does not lie in Sector 40.
The solution described in this paper is expected to consume the provided data and output a polygon for each of the localities identified in the address data. We assume that the localities for which we must build polygons are non-overlapping; this assumption holds, for example, for pin-codes. The problem is solved in two phases.
In the first phase, we separate the noisy points from the points that lie within a locality. This is done by formulating the problem as a non-linear multi-class classification problem. In the training data, the latitudes and longitudes of the points tagged with any of the non-overlapping localities act as features, and the corresponding locality name acts as the label. The classifier is expected to partition the 2D latitude-longitude space covered by the union of all non-overlapping localities into disjoint regions, one per locality. These partitions are defined by non-linear boundaries, which are obtained by optimizing for two objectives: (i) the area enclosed by a boundary should contain as many points of the corresponding locality, and as few points of other localities, as possible, and (ii) the separation boundary should be smooth. We compare two algorithms, decision trees and neural networks, for creating such partitions.
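The first phase can be illustrated with a minimal sketch, assuming a small hand-written decision tree over (latitude, longitude) with Gini impurity. This is not the authors' implementation: the function names, the synthetic coordinates, and the use of `max_depth` and `min_leaf` as a crude proxy for boundary smoothness are all assumptions made for illustration. A point is treated as noise when the region it falls in belongs to a different locality than its tag.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(points, labels, min_leaf):
    """Return the (feature, threshold) with the lowest weighted Gini, or None if nothing improves."""
    best, best_score, n = None, gini(labels), len(points)
    for f in (0, 1):  # 0 = latitude, 1 = longitude
        values = sorted({p[f] for p in points})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [l for p, l in zip(points, labels) if p[f] <= t]
            right = [l for p, l in zip(points, labels) if p[f] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue  # refuse tiny regions carved out around single points
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score - 1e-12:
                best, best_score = (f, t), score
    return best


def build_tree(points, labels, depth=0, max_depth=3, min_leaf=2):
    """Grow the tree; a leaf is the majority locality name of its region."""
    split = None
    if depth < max_depth and len(points) >= 2 * min_leaf:
        split = best_split(points, labels, min_leaf)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(p, l) for p, l in zip(points, labels) if p[f] <= t]
    right = [(p, l) for p, l in zip(points, labels) if p[f] > t]
    return (f, t,
            build_tree([p for p, _ in left], [l for _, l in left], depth + 1, max_depth, min_leaf),
            build_tree([p for p, _ in right], [l for _, l in right], depth + 1, max_depth, min_leaf))


def predict(tree, point):
    """Walk from the root to a leaf and return the locality owning that region."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if point[f] <= t else right
    return tree


# Synthetic coordinates (not real geocodes): two localities plus one mis-tagged point.
points = [(0.0, 0.0), (0.2, 0.5), (0.4, 0.1), (0.6, 0.9), (0.8, 0.3),
          (2.0, 0.0), (2.2, 0.5), (2.4, 0.1), (2.6, 0.9), (2.8, 0.3),
          (0.5, 0.5)]
labels = ["A"] * 5 + ["B"] * 5 + ["B"]  # the last point lies inside A but is tagged B

tree = build_tree(points, labels)
clean = [(p, l) for p, l in zip(points, labels) if predict(tree, p) == l]
noisy = [(p, l) for p, l in zip(points, labels) if predict(tree, p) != l]
```

A decision tree of this kind produces axis-aligned, piecewise boundaries; a neural network, the other option named above, would yield curved boundaries instead. The points in `clean` that carry a given locality's label are the candidate points for that locality.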
In the second phase, we extract all the points that satisfy the partition constraints, i.e., lie within the boundary of a locality L, as candidate points, for generating the polygon for locality L. The resulting polygon must contain all candidate points and should have the minimum possible area while maintaining the smoothness of the polygon boundary. This objective can be achieved by algorithms such as concave hull. However, since localities are always bounded by roads, we have further enhanced our locality polygons by leveraging open source data of road networks. To achieve this, we solve a non-linear optimisation problem which decides the set of roads to be selected, so that the enclosed area is minimized, while ensuring that all the candidate points lie within the enclosed area. The output of this optimisation problem is a set of roads, which represents the boundary of a locality L.
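As a rough illustration of the second phase, the sketch below encloses a set of candidate points in a polygon and measures its area. It deliberately swaps in a convex hull (Andrew's monotone chain) as a simpler stand-in for the concave hull mentioned above, and it does not attempt the road-selection optimisation. The coordinates are synthetic, latitude and longitude are treated as planar coordinates (an approximation that is only reasonable over small areas), and all function names are hypothetical.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive when o -> a -> b turns counter-clockwise."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(poly):
    """Shoelace formula for the area of a simple polygon."""
    n = len(poly)
    twice = sum(poly[i][0] * poly[(i + 1) % n][1] - poly[(i + 1) % n][0] * poly[i][1]
                for i in range(n))
    return abs(twice) / 2


def contains(poly, p):
    """True if p is inside or on the boundary of a counter-clockwise convex polygon."""
    n = len(poly)
    return all(cross(poly[i], poly[(i + 1) % n], p) >= 0 for i in range(n))


# Synthetic candidate points for one locality L (not real geocodes).
candidates = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0)]
hull = convex_hull(candidates)
area = polygon_area(hull)
all_inside = all(contains(hull, p) for p in candidates)
```

A convex hull is the smallest convex polygon containing all candidate points; a concave hull can shrink the area further by following indentations in the point cloud, which is why it fits the minimum-area objective better. The `contains` check mirrors the constraint that every candidate point must lie within the final boundary.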
Outline/Structure of the Case Study
- Description of problem statement and impact of solving the problem
- Description of available data sources to solve the problem
- Description of any supporting work that has helped in development of this work
- Description of the algorithms that solve the problem
- Illustration of results and their application in the logistics industry
Learning Outcome
- Overview of the logistics industry and its challenges with the problem of unstructured address data in developing countries
- Application of graph-based generative machine learning techniques that enable disambiguation of a raw address string into a structured list containing city, locality, sub-locality, etc.
- Application of (multi-label) classification techniques to create decision boundaries between what we consider noise in the data and clean data
- Application of optimisation techniques to draw geographical boundaries of cities, localities, sub-localities, etc.
Target Audience
ML enthusiasts, product managers, data scientists, people working in the logistics industry
Links
- Learning to Decode Unstructured Indian Addresses (related blog)
- What is the right addressing scheme for India? (related article)
- Economic Impact of Discoverability of Localities and Addresses in India (related blog)
Submitted 4 years ago
People who liked this proposal, also liked:
Indranil Chandra - Data Science Project Governance Framework
45 Mins
Talk
Executive
Data Science Project Governance Framework is a framework that can be followed by any new Data Science business or team. It will help in formulating strategies around how to leverage Data Science as a business, how to architect Data Science based solutions and team formation strategy, ROI calculation approaches, typical Data Science project lifecycle components, commonly available Deep Learning toolsets and frameworks and best practices used by Data Scientists. I will use an actual use case while covering each of these aspects of building the team and refer to examples from my own experiences of setting up Data Science teams in a corporate/MNC setup.
A lot of research is happening around the world in various domains to leverage Deep Learning, Machine Learning and Data Science based solutions to solve problems that would otherwise be impossible to solve using simple rule-based systems. All the major players in the market and businesses are also setting up new Data Science teams to take advantage of modern state-of-the-art ML/DL techniques. Even though most Data Scientists have strong knowledge of mathematical modeling techniques, they often lack the business acumen and management knowledge to drive Data Science based solutions in a corporate/MNC setup. On the other hand, management executives in most corporates/MNCs do not have first-hand knowledge of setting up a new Data Science team or of approaching business problems using Data Science. This session will help bridge this gap and give Executives and Data Scientists a common ground on which they can build any Data Science business/team from ground zero.
GitHub Link -> https://github.com/indranildchandra/DataScience-Project-Governance-Framework
Karthik Bharadwaj T - Failure Detection using Driver Behaviour from Telematics
45 Mins
Case Study
Beginner
Telematics data have the potential to unlock revenue of 1.5 trillion. Unfortunately, this data has not been tapped by many users.
In this case study, Karthik Thirumalai discusses how we can use telematics data to identify driver behaviour and perform preventive maintenance in automobiles.
Karthik Bharadwaj T - 7 Habits to Ethical AI
45 Mins
Talk
Beginner
While AI is being put to use in solving great problems of the world, it is subject to questions about the morality of how it is constructed and used. Karthik Thirumalai addresses the 7 habits of building ethical AI solutions and how they could be put to use for a better world. These habits (Data Governance, Fairness, Privacy and Security, Accountability, Transparency, Education) help organizations successfully implement an AI strategy that reflects fundamental human principles and moral values.
Lakshya - The Natural Language Decathlon: A Multitask Challenge for NLP
45 Mins
Talk
Intermediate
Deep learning has significantly improved state-of-the-art performance for natural language processing (NLP) tasks, but each one is typically studied in isolation. The Natural Language Decathlon (decaNLP) is a new benchmark for studying general NLP models that can perform a variety of complex, natural language tasks. By requiring a single system to perform ten disparate natural language tasks, decaNLP offers a unique setting for multitask, transfer, and continual learning.
The benchmark is publicly available on GitHub for use in tasks such as Question Answering, Machine Translation, Summarization and Sentiment Analysis.
Dr Hari Krishna Maram - Future of Technology
20 Mins
Talk
Executive
Future of Technology covers trends in technology across the globe and the innovations changing the future.