As part of QCRI’s Artificial Intelligence for Monitoring Elections (AIME) project, I liaised with Kaggle to work with a top-notch Data Scientist to carry out a proof-of-concept study. As I’ve blogged in the past, crowdsourced election monitoring projects are starting to generate “Big Data” which cannot be managed or analyzed manually in real time. Using the crowdsourced election reporting data recently collected by Uchaguzi during Kenya’s elections, we therefore set out to assess whether one could use machine learning to automatically tag user-generated reports according to topic, such as election violence. The purpose of this post is to share the preliminary results from this innovative study, which we believe is the first of its kind.
The aim of this initial proof-of-concept study was to create a model to classify short messages (crowdsourced election reports) into several predetermined categories. The classification models were developed by applying a machine learning technique called gradient boosting to word features extracted from the text of the election reports along with their titles. Unigrams, bigrams and the number of words in the text and title were considered in the model development. We adopted the tf-idf weighting function following internal validation of the model.
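To make the feature extraction step more concrete, here is a minimal sketch in plain Python. It is not the code used in the study: the whitespace tokenizer, the particular tf-idf variant (relative term frequency times log inverse document frequency), the `__n_words_*` feature keys, and the function names `ngrams` and `tfidf_features` are all assumptions made for illustration. The gradient boosting step itself is not shown.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Return lowercase unigrams and bigrams from whitespace-separated tokens."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def tfidf_features(reports):
    """Build one feature dict per report from a list of (title, text) pairs.

    Assumed weighting: (count / n-grams in report) * log(n_reports / doc_freq).
    """
    # Title and text are tokenized separately so no bigram spans the boundary.
    docs = [ngrams(title) + ngrams(text) for title, text in reports]
    n_docs = len(docs)
    # Number of reports in which each n-gram appears at least once.
    doc_freq = Counter(g for doc in docs for g in set(doc))

    features = []
    for (title, text), doc in zip(reports, docs):
        counts = Counter(doc)
        row = {
            g: (c / len(doc)) * math.log(n_docs / doc_freq[g])
            for g, c in counts.items()
        }
        # Word counts of the title and the text, as mentioned in the post.
        row["__n_words_title"] = len(title.split())
        row["__n_words_text"] = len(text.split())
        features.append(row)
    return features
```

With this weighting, an n-gram that appears in every report gets a weight of zero, while n-grams that are specific to a few reports are weighted more heavily. The resulting feature dicts would then be handed to a classifier of your choice.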
The results depicted above confirm that classifiers can be developed to automatically classify short election observation reports crowdsourced from the public. These results were obtained using 10-fold cross-validation. For example, our classifier correctly predicts whether a report is related to violence with an accuracy of 91%. It also correctly classifies 89% of reports that relate to “Voter Issues,” such as registration issues, and 86% of reports that indicate positive events (“Fine”).
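For readers unfamiliar with 10-fold cross-validation, the sketch below shows the general idea in plain Python: the reports are split into ten folds, and each fold is held out once while a model is trained on the other nine. This is an illustration only, not the study's code. The `train` and `predict` hooks are hypothetical, and the majority-class baseline is just a stand-in so the example runs; it is not the gradient boosting model.

```python
import random
from collections import Counter


def k_fold_accuracy(examples, labels, train, predict, k=10, seed=0):
    """Estimate accuracy by k-fold cross-validation.

    train(examples, labels) -> model and predict(model, example) -> label
    are caller-supplied hooks (hypothetical names, for illustration).
    """
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds; each index lands in exactly one.
    folds = [indices[i::k] for i in range(k)]

    correct = 0
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in indices if i not in held]
        model = train([examples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        # Score only on the reports the model did not see during training.
        correct += sum(predict(model, examples[i]) == labels[i]
                       for i in held_out)
    return correct / len(examples)


# Stand-in classifier: always predicts the most common training label.
def train_majority(examples, labels):
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, example):
    return model
```

Because every report is scored exactly once, by a model that never saw it, the resulting accuracy is a fairer estimate of performance on new reports than accuracy measured on the training data.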
The plan for this summer and fall is to replicate this work for other crowdsourced election datasets from Ghana, Liberia, Nigeria and Uganda. We hope the insights gained from this additional research will reveal which classifiers and/or “super classifiers” are portable across certain countries and election types. Our hypothesis, based on related crisis computing research, is that classifiers for certain types of events will be highly portable. However, we also hypothesize that the application of most classifiers across countries will result in lower accuracy scores. To this end, our Artificial Intelligence for Monitoring Elections platform will allow election monitoring organizations (end users) to create their own classifiers on the fly and thus meet their own information needs.
Big thanks to Nao for his excellent work on this predictive modeling project.
This is a great project, Patrick! I believe one application of this is also to make different crowdsourced monitoring platforms comparable, since you can extract the same categories from all their reports, regardless of what predefined categories the project had.
I assume that machine learning and NLP are only usable on English language reports at this time though, right?
Another question that crossed my mind: Will the Artificial Intelligence for Monitoring Elections platform also tackle the issue of validating reports at some point?
Hi Max, thanks for reading. Good point re comparability. NLP and machine learning are not language dependent, which is yet another advantage over keyword search. Artificial intelligence and machine learning have been used to predict credible reports. See:
Patrick, this is fantastic! This approach could be applied to any other sector, no? For example, crowdsourced data on local level communal issues (potholes, graffiti, waste, etc.). Instead of using it to point the authorities in the right direction for problem solving, this way it could provide a layer of analytics… may come back to you with some ideas on how we could possibly apply this in some other cases. P.S. Great to have you back from vacation, missed you!!
Hey Millie, great to hear from you! Yes, machine learning could be applied to many other sectors and not only event-data. Please do come back with any ideas you may have–yours tend to always be on the brilliant side 🙂