Banks and financial institutes in India over the last few years have increasingly faced defaults by corporates. In fact, NBFC stocks have suffered huge losses in recent times. It has triggered a contagion which spilled over to other financial stocks too and adversely affected benchmark indices resulting in short term bearishness. This makes it imperative to investigate ways to prevent rather than cure such situations. However, the banks face a twin-faced challenge in terms of identifying the probable wilful defaulters from the rest and moral hazard among the bank employees who are many a time found to be acting on behest of promoters of defaulting firms. The first challenge is aggravated by the fact that due diligence of firms before the extension of loan is a time-consuming process and the second challenge hints at the need for placement of automated safeguards to reduce mal-practises originating out of the human behaviour. To address these challenges, the automation of loan sanctioning process is a possible solution. Hence, we identified important firmographic variables viz. financial ratios and their historic patterns by looking at the firms listed as dirty dozen by Reserve Bank of India. Next, we used k-means clustering to segment these firms and label them into various categories viz. normal, distressed defaulter and wilful defaulter. Besides, we utilized text and sentiment analysis to analyze the annual reports of all BSE and NSE listed firms over the last 10 years. From this, we identified word tags which resonate well with the occurrence of default and are indicators of financial performance of these firms. A rigorous analysis of these word tags (anagrams, bi-grams and co-located words) over a period of 10 years for more than 100 firms indicate the existence of a relation between frequency of word tags and firm default. Lift estimation of firmographic financial ratios namely Altman Z score and frequency of word tags for the first time uncovers the importance of text analysis in predicting financial performance of firms and their default. Our investigation also reveals the possibility of using neural networks as a predictor of firm default. Interestingly, the neural network developed by us utilizes the power of open source machine learning libraries and throws open possibilities of deploying such a neural network model by banks with a small one-time investment. In short, our work demonstrates the ability of machine learning in addressing challenges related to prevention of wilful default. We envisage that the implementation of neural network based prediction models and text analysis of firm-specific financial reports could help financial industry save millions in recovery and restructuring of loans.


Outline/Structure of the Case Study

1. Introduction: Wilful default by firms in the last 10 years and challenges faced by the financial industry in preventing it (5 minutes)

2. How data speaks for different firms: (15 minutes)

2a. Clustering of 3500 firms based on 12 different financial ratios and evolution of clusters over the last 10 years

2b. Analysis of financial reports of BSE, NSE listed and dirty dozen firms: Findings from Sentiment and Lift analysis

3. Deep learning basics and some tips on how to use open source python libraries for data cleaning and analysis (5 minutes)

4. Development of the neural network model for default prediction: feature selection, selection of transfer function and number of hidden layers. This will elaborate on one use case of ANN to predict default and discuss the model accuracy(10 minutes)

5. Conclusion: Implications of machine learning techniques viz. text analysis and neural networks for the financial industry. Short term and long term perspective on the importance of machine learning in fintech space. (10 minutes)

Learning Outcome

1. Understanding of current challenges faced by the financial industry and how technology can alleviate some of these challenges

2. Understanding of drivers behind wilful corporate default and how machine learning can make the prediction using these drivers

3. Understanding of various machine learning techniques viz. sentiment analysis, lift analysis, feature selection and deep learning

4. Understanding of how financial industry will be disrupted by ever-growing machine learning techniques and power

5. Understanding of methods of data collection, data cleaning and imputation techniques

6. Understanding of how to leverage academic research environment to maximize learning from an MBA programme

7. Financial engineers and analysts will be able to learn how they can leverage machine learning in their day to day work

8. Data scientists will learn how to gather, clean and analyse financial data

Target Audience

Data scientists, fintech firms, financial engineers, academic researchers, regulators, banks.

Prerequisites for Attendees

The presentation requires general awareness of machine learning, challenges being faced by fintech companies, banks and recent performance of financial institutions. It will touch upon the basics of finance and machine learning before getting into more in-depth technical analysis carried out. Hence, a person exposed to these areas via news, print media and blogs will also be able to understand the matter equally well as a seasoned financial expert and data scientist.


schedule Submitted 3 years ago

  • 45 Mins

    Since we originally proposed the need for a first-class language, compiler and ecosystem for machine learning (ML) - a view that is increasingly shared by many, there have been plenty of interesting developments in the field. Not only have the tradeoffs in existing systems, such as TensorFlow and PyTorch, not been resolved, but they are clearer than ever now that both frameworks contain distinct "static graph" and "eager execution" interfaces. Meanwhile, the idea of ML models fundamentally being differentiable algorithms – often called differentiable programming – has caught on.

    Where current frameworks fall short, several exciting new projects have sprung up that dispense with graphs entirely, to bring differentiable programming to the mainstream. Myia, by the Theano team, differentiates and compiles a subset of Python to high-performance GPU code. Swift for TensorFlow extends Swift so that compatible functions can be compiled to TensorFlow graphs. And finally, the Flux ecosystem is extending Julia’s compiler with a number of ML-focused tools, including first-class gradients, just-in-time CUDA kernel compilation, automatic batching and support for new hardware such as TPUs.

    This talk will demonstrate how Julia is increasingly becoming a natural language for machine learning, the kind of libraries and applications the Julia community is building, the contributions from India (there are many!), and our plans going forward.

  • Dipanjan Sarkar

    Dipanjan Sarkar - Explainable Artificial Intelligence - Demystifying the Hype

    45 Mins

    The field of Artificial Intelligence powered by Machine Learning and Deep Learning has gone through some phenomenal changes over the last decade. Starting off as just a pure academic and research-oriented domain, we have seen widespread industry adoption across diverse domains including retail, technology, healthcare, science and many more. More than often, the standard toolbox of machine learning, statistical or deep learning models remain the same. New models do come into existence like Capsule Networks, but industry adoption of the same usually takes several years. Hence, in the industry, the main focus of data science or machine learning is more ‘applied’ rather than theoretical and effective application of these models on the right data to solve complex real-world problems is of paramount importance.

    A machine learning or deep learning model by itself consists of an algorithm which tries to learn latent patterns and relationships from data without hard-coding fixed rules. Hence, explaining how a model works to the business always poses its own set of challenges. There are some domains in the industry especially in the world of finance like insurance or banking where data scientists often end up having to use more traditional machine learning models (linear or tree-based). The reason being that model interpretability is very important for the business to explain each and every decision being taken by the model.However, this often leads to a sacrifice in performance. This is where complex models like ensembles and neural networks typically give us better and more accurate performance (since true relationships are rarely linear in nature).We, however, end up being unable to have proper interpretations for model decisions.

    To address and talk about these gaps, I will take a conceptual yet hands-on approach where we will explore some of these challenges in-depth about explainable artificial intelligence (XAI) and human interpretable machine learning and even showcase with some examples using state-of-the-art model interpretation frameworks in Python!

  • 90 Mins

    Machine learning and deep learning have been rapidly adopted in various spheres of medicine such as discovery of drug, disease diagnosis, Genomics, medical imaging and bioinformatics for translating biomedical data into improved human healthcare. Machine learning/deep learning based healthcare applications assist physicians to make faster, cheaper and more accurate diagnosis.

    We have successfully developed three deep learning based healthcare applications and are currently working on two more healthcare related projects. In this workshop, we will discuss one healthcare application titled "Deep Learning based Craniofacial Distance Measurement for Facial Reconstructive Surgery" which is developed by us using TensorFlow. Craniofacial distances play important role in providing information related to facial structure. They include measurements of head and face which are to be measured from image. They are used in facial reconstructive surgeries such as cephalometry, treatment planning of various malocclusions, craniofacial anomalies, facial contouring, facial rejuvenation and different forehead surgeries in which reliable and accurate data are very important and cannot be compromised.

    Our discussion on healthcare application will include precise problem statement, the major steps involved in the solution (deep learning based face detection & facial landmarking and craniofacial distance measurement), data set, experimental analysis and challenges faced & overcame to achieve this success. Subsequently, we will provide hands-on exposure to implement this healthcare solution using TensorFlow. Finally, we will briefly discuss the possible extensions of our work and the future scope of research in healthcare sector.

  • Favio Vázquez

    Favio Vázquez - Complete Data Science Workflows with Open Source Tools

    90 Mins

    Cleaning, preparing , transforming, exploring data and modeling it's what we hear all the time about data science, and these steps maybe the most important ones. But that's not the only thing about data science, in this talk you will learn how the combination of Apache Spark, Optimus, the Python ecosystem and Data Operations can form a whole framework for data science that will allow you and your company to go further, and beyond common sense and intuition to solve complex business problems.

  • Anupam Purwar

    Anupam Purwar - An Industrial IoT system for wireless instrumentation: Development, Prototyping and Testing

    45 Mins

    The next generation machinery viz. turbines, aircraft and boilers will rely heavily on smart data acquisition and monitoring to meet their performance and reliability requirements. These systems require the accurate acquisition of various parameters like pressure, temperature and heat flux in real time for structural health monitoring, automation and intelligent control. This calls for the use of sophisticated instrumentation to measure these parameters and transmit them in real time. In the present work, a wireless sensor network (WSN) based on a novel high-temperature thermocouple cum heat flux sensor has been proposed. The architecture of this WSN has been evolved keeping in mind its robustness, safety and affordability. WiFi communication protocol based on IEEE 802.11 b/g/n specification has been utilized to create a secure and low power WSN. The thermocouple cum heat flux sensor and instrumentation enclosure have been designed using rigorous finite element modelling. The sensor and wireless transmission unit have been housed in an enclosure capable of withstanding temperature and pressure in the range of 100 bars and 2500K respectively. The sensor signal is conditioned before being passed to the wireless ESP8266 based ESP12E transmitter, which transmits data to a web server. This system uploads the data to a cloud database in real time. Thus, providing seamless data availability to decision maker sitting across the globe without any time lag and with ultra-low power consumption. The real-time data is envisaged to be used for structural health monitoring of hot structures by identifying patterns of temperature rise which have historically resulted in damage using Machine learning (ML). Such type of ML application can save millions of dollars wasted in the replacement and maintenance of industrial equipment by alerting the engineers in real time.

  • Maryam Jahanshahi

    Maryam Jahanshahi - Applying Dynamic Embeddings in Natural Language Processing to Analyze Text over Time

    Maryam Jahanshahi
    Maryam Jahanshahi
    Research Scientist
    schedule 4 years ago
    Sold Out!
    45 Mins
    Case Study

    Many data scientists are familiar with word embedding models such as word2vec, which capture semantic similarity of words in a large corpus. However, word embeddings are limited in their ability to interrogate a corpus alongside other context or over time. Moreover, word embedding models either need significant amounts of data, or tuning through transfer learning of a domain-specific vocabulary that is unique to most commercial applications.

    In this talk, I will introduce exponential family embeddings. Developed by Rudolph and Blei, these methods extend the idea of word embeddings to other types of high-dimensional data. I will demonstrate how they can be used to conduct advanced topic modeling on datasets that are medium-sized, which are specialized enough to require significant modifications of a word2vec model and contain more general data types (including categorical, count, continuous). I will discuss how my team implemented a dynamic embedding model using Tensor Flow and our proprietary corpus of job descriptions. Using both categorical and natural language data associated with jobs, we charted the development of different skill sets over the last 3 years. I will specifically focus the description of results on how tech and data science skill sets have developed, grown and pollinated other types of jobs over time.

  • Dr. Neha Sehgal

    Dr. Neha Sehgal - Open Data Science for Smart Manufacturing

    45 Mins

    Open Data offers a tremendous opportunity in transformation of today’s manufacturing sector to smarter manufacturing. Smart Manufacturing initiatives include digitalising production processes and integrating IoT technologies for connecting machines to collect data for analysis and visualisation.

    In this talk, an understanding of linkage between various industries within manufacturing sector through lens of Open Data Science will be illustrated. The data on manufacturing sector companies, company profiles, officers and financials will be scraped from UK Open Data API’s. The work I plan to showcase in ODSC is part of UK Made Smarter Project, where the work has been useful for major aerospace alliances to find out the champions and strugglers (SMEs) within manufacturing sector based on the open data gathered from multiple sources. The talk includes discussion on data extraction, data cleaning, data transformation - transforming raw financial information about companies to key metrics of interest - and further data analytics to create clusters of manufacturing companies into "Champions" and "Strugglers". The talk showcased examples of powerful R Shiny based dashboards of interest for suppliers, manufacturer and other key stakeholders in supply chain network.

    Further analysis includes network analysis for industries, clustering and deploying the model as an API using Google Cloud Platform. The presenter will discuss about the necessity of 'Analytical Thinking' approach as an aid to handle complex big data projects and how to overcome challenges while working with real-life data science projects.