CLASSIEfier: Using Machine Learning to Paint a Picture of Social Sector Trends
Tracking the flow of funding and other support to social sector organisations in Australia has historically been difficult because of inconsistencies in categorisation, or the absence of categorisation entirely. Our Community (a Melbourne-based social enterprise) developed CLASSIE to serve as a universal classification system for Australian social sector initiatives and entities. We are now developing a machine learning algorithm to reduce or remove the need for manual (human) classification. Once released, CLASSIEfier will allow us to classify historical records on behalf of grantmakers and other social sector supporters, and reduce the need for human intervention in the classification of current and future records. In the long term, it will allow us to answer fundamental questions such as: Where is the money going? Are we helping the areas in most need?
I will present the project scope and development of CLASSIEfier, highlighting my experiences using machine learning in the social sector. I will also discuss the difficulties of working with text and sensitive data, and the methods we used to identify and mitigate algorithmic biases.
Outline/Structure of the Case Study
The talk will run for 15 to 17 minutes, leaving around 13 minutes for the introduction and questions. It is structured as follows:
- Personal Introduction (1min): Introducing Our Community’s data initiatives. Our Community has created two main enterprises to propel the Australian social sector towards the data science ecosystem: a) the Innovation Lab and b) OC house. I will briefly explain their main objectives.
- Project introduction (1min): I will introduce CLASSIE and CLASSIEfier. CLASSIE is a taxonomy that provides a standard classification across the social sector (Subjects and Beneficiaries). CLASSIEfier is the machine learning algorithm that will automatically classify grant applications (and other records) using the CLASSIE taxonomy.
- How did we scope CLASSIEfier? (5min). I will explain the initial data challenges we faced with this project. Examples: a) our database held 300,000 grant applications, of which only 6,000 had been classified by users. These were too few labels to train a supervised machine learning model. b) Data in the social sector is highly susceptible to biases, so we could not rely on publicly available libraries to train the models.
- How did CLASSIEfier evolve? (5min) CLASSIEfier went through two successive stages: a) The search for labels. For example, we implemented a keyword matching component to pre-classify the grant applications, but this needed human input to reject misclassifications.
b) Model training. I will explain the differences that we found between supervised, semi-supervised, binary and multilabel models.
- Reflection on the Data Science for social good concept (1min). The Data Science for social good concept is growing rapidly. It follows the motto: first, identify the problem and then find the right tool to solve it.
- Results and conclusions (2min). Comparison between the initial plan and the final results. CLASSIEfier became an iterative process involving pre-classification, re-training and cross-evaluation between the data and the taxonomy.
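To make the keyword matching step in the outline above more concrete, here is a minimal sketch of how a pre-classification pass with human rejection of misclassifications might look. It is an illustration only, not the CLASSIEfier implementation: the keyword lists, the category names, and the helper names `pre_classify` and `apply_review` are all hypothetical and are not taken from the CLASSIE taxonomy.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Hypothetical keyword lists for illustration; NOT the real CLASSIE taxonomy.
KEYWORDS: Dict[str, List[str]] = {
    "Education": ["school", "literacy", "scholarship"],
    "Health": ["hospital", "mental health", "clinic"],
    "Environment": ["conservation", "wildlife", "tree planting"],
}


@dataclass
class PreClassification:
    text: str
    labels: List[str] = field(default_factory=list)
    needs_review: bool = True  # every keyword match is provisional until a human checks it


def pre_classify(text: str, keywords: Dict[str, List[str]] = KEYWORDS) -> PreClassification:
    """Assign every category whose keywords appear as whole words or phrases."""
    lowered = text.lower()
    labels = []
    for label, terms in keywords.items():
        # \b anchors stop "school" from matching inside a longer word.
        if any(re.search(r"\b" + re.escape(term) + r"\b", lowered) for term in terms):
            labels.append(label)
    return PreClassification(text=text, labels=labels)


def apply_review(item: PreClassification, rejected: Set[str]) -> PreClassification:
    """Drop the labels a human reviewer rejected and mark the record as reviewed."""
    kept = [label for label in item.labels if label not in rejected]
    return PreClassification(text=item.text, labels=kept, needs_review=False)


if __name__ == "__main__":
    draft = pre_classify("Funding for a school literacy program at the local hospital")
    print(draft.labels)
    reviewed = apply_review(draft, rejected={"Health"})
    print(reviewed.labels, reviewed.needs_review)
```

A record can receive several labels at once, which is why the reviewed output is a natural starting point for a multilabel training set under these assumptions.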
The audience will learn about project scoping, how to work with text data and how to avoid biases.
Prerequisites for Attendees
The audience should know the basic steps to build a machine learning model.