One of the principal problems in the developing world is the poor localization of its addresses. This inhibits discoverability of local trade, reduces availability of amenities such as creation of bank accounts and delivery of goods and services (e.g., e-commerce) and delays emergency services such as fire brigades and ambulances. In general, people in the developing World identify an address based on neighbourhood/locality names and points of interest (POIs), which are neither standardized nor any official records exist that can help in locating them systematically. In this paper, we describe an approach to build accurate geographical boundaries (polygons) for such localities.
As training data, we are provided with two pieces of information for millions of address records: (i) a geocode, which is captured by a human for the given address, (ii) set of localities present in that address. The latter is determined either by manual tagging or by using an algorithm which is able to take a raw address string as input and output meaningful locality information present in that address. For example, for the address, “A-161 Raheja Atlantis Sector 31 Gurgaon 122002”, its geocode is given as (28.452800, 77.045903), and the set of localities present in that address is given as (Raheja Atlantis, Sector 31, Gurgaon, Pin-code 122002). Development of this algorithm are part of any other project we are working on; details about the same can be found here.
Many industries, such as the food-delivery industry, courier-delivery industry, KYC (know-your-customer) data-collection industry, are likely to have huge amounts of such data. Such crowdsourced data usually contain large a amount of noise, acquired either due to machine/human error in capturing the geocode, or due to error in identifying the correct set of localities from a poorly written address. For example, for the address, “Plot 1000, Sector 31 opposite Sector 40 road, Gurgaon 122002”, a machine may output the set of localities present in this address as (Sector 31, Sector 40, Gurgaon, Pin-code 122002), even though it is clear that the address does not lie in Sector 40.
The solution described in this paper is expected to consume the provided data and output polygons for each of the localities identified in the address data. We assume that the localities for which we must build polygons are non-overlapping, e.g., this assumption is true for pin-codes. The problem is solved in two phases.
In the first phase, we separate the noisy points from the points that lie within a locality. This is done by formulating the problem as a non-linear multi-classification problem. The latitudes and longitudes of all non-overlapping localities act as features, and their corresponding locality name acts as a label, in the training data. The classifier is expected to partition the 2D space containing the latitudes and longitudes of the union of all non-overlapping localities into disjoint regions corresponding to each locality. These partitions are defined as non-linear boundaries, which are obtained by optimizing for two objectives: (i) the area enclosed by the boundaries should maximize the number of points of the corresponding locality and minimize the number of points of other localities, (ii) the separation boundary should be smooth. We compare two algorithms, decision trees and neural networks for creating such partitions.
In the second phase, we extract all the points that satisfy the partition constraints, i.e., lie within the boundary of a locality L, as candidate points, for generating the polygon for locality L. The resulting polygon must contain all candidate points and should have the minimum possible area while maintaining the smoothness of the polygon boundary. This objective can be achieved by algorithms such as concave hull. However, since localities are always bounded by roads, we have further enhanced our locality polygons by leveraging open source data of road networks. To achieve this, we solve a non-linear optimisation problem which decides the set of roads to be selected, so that the enclosed area is minimized, while ensuring that all the candidate points lie within the enclosed area. The output of this optimisation problem is a set of roads, which represents the boundary of a locality L.