Lessons and Progress: Building the Data Analysis and Collection Tool for OCHA ROMENA
A conversation with a member of the development team:
Q: This project began with the goal of providing the OCHA ROMENA team with data-driven insight into the current need for humanitarian aid, and of helping them create more accurate and effective plans to respond to these challenges. How is the team going about that?
The hypothesis is that if we can give them a view of social stream data that a) mentions certain keywords/phrases and b) targets locations in Libya, this would be of help. If, further, we perform sentiment analysis and let them see how both the mentions and the sentiment trend over time, this would be of further help. Under that hypothesis, we’ve built a dashboard to give views across that data, and a pipeline to populate those views.
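The keyword-and-location filter at the heart of that hypothesis can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the keyword and location lists are placeholders, and tweets are assumed to arrive as dicts with "text" and "place" fields.

```python
# Hypothetical keyword/location relevance filter (illustrative only).
AID_KEYWORDS = {"food", "water", "shelter", "medical", "refugees"}
LIBYA_LOCATIONS = {"libya", "tripoli", "benghazi", "misrata", "sabha"}

def mentions_keywords(text):
    """True if the text mentions any tracked aid-related keyword."""
    return bool(set(text.lower().split()) & AID_KEYWORDS)

def targets_libya(tweet):
    """True if the tweet's text or place metadata references Libya."""
    haystack = (tweet.get("text", "") + " " + tweet.get("place", "")).lower()
    return any(loc in haystack for loc in LIBYA_LOCATIONS)

def relevant(tweet):
    """Part a) keyword mention AND part b) targets a Libyan location."""
    return mentions_keywords(tweet.get("text", "")) and targets_libya(tweet)
```

A real pipeline would of course match phrases and handle punctuation and Arabic script, but the two-condition shape is the core of the hypothesis.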
Q: What, specifically, have you been working on (as compared to the rest of the team) and what progress have you made since the 4-day Sprint in February?
- Build a binary classification model that lets us “smart filter” out Twitter data that is not related to actual problems in Libya but rather to problems with the US political system – colloquially called the Benghazi filter (or, less politically correctly, the deTrumpifier)
- Integrate two different sentiment-analysis tools (one a free Python module, senti_classifier; the other the Microsoft Text Analytics API). This will let us compare the performance of the two, but neither supports Arabic text, so…
- Integrate Microsoft’s text-translation API to auto-translate Arabic text to English
- Build out an Apache-Storm-based pipeline that aggregates the most popular terms over time windows, used to drive the time-series window and overall-sentiment window of the dashboard
- Clean up the heatmap-generation Apache-Storm-based pipeline to work across all keywords/sectors/groups/statuses that Jonathan defined
Items 1 through 4 are all new since the February sprint, and work is ongoing on items 2 through 4.
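Conceptually, items 2 and 3 chain together into a translate-then-analyze step. Below is a hedged sketch of that flow: the real project calls Microsoft's Translator and Text Analytics services, but here both are passed in as plain callables so they can be stubbed without API keys, and the Arabic-detection heuristic and all names are illustrative assumptions rather than the project's actual code.

```python
def is_arabic(text):
    """Rough heuristic: any character in the main Arabic Unicode block."""
    return any("\u0600" <= ch <= "\u06FF" for ch in text)

def to_english(text, translate_fn):
    """Translate Arabic text to English; pass English text through."""
    return translate_fn(text) if is_arabic(text) else text

def score_sentiment(text, translate_fn, sentiment_fn):
    """Normalize to English first, then score sentiment on the result."""
    return sentiment_fn(to_english(text, translate_fn))
```

Keeping the two services behind plain function arguments is also what makes it easy to compare senti_classifier against the Text Analytics API on the same inputs.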
Q: What are the major challenges you have faced along the way? Are facing right now? What have you done to address them?
My biggest challenge – as with most machine-learning projects – is data. We knew that finding social data on Libya was going to be difficult, due to the population's extensive use of WhatsApp and private Facebook and limited use of Twitter. What we did not expect was the outsize role of the Hillary/Benghazi issue in muddying those waters. To build the classifier that filters those out, I had to construct a labeled dataset (essentially by looking for tweets containing both “Hillary” and “Benghazi”) and train on it. Ideally, this could be improved by spending more time tuning the labeled data, but there is too much work to do in other areas to justify the time.
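The weak-labeling rule described above can be sketched in a few lines. This is a deliberate simplification: the real labeled set presumably involved more curation than a single co-occurrence rule, and the function names here are hypothetical.

```python
def weak_label(text):
    """1 = US political noise, 0 = potentially relevant to Libya."""
    t = text.lower()
    return 1 if ("hillary" in t and "benghazi" in t) else 0

def build_training_set(tweets):
    """Pair each tweet with its weak label, ready for classifier training."""
    return [(t, weak_label(t)) for t in tweets]
```

The appeal of this kind of synthetic labeling is speed: it produces a large training set with no manual annotation, at the cost of some label noise that tuning would later reduce.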
Another major challenge is our lack of Arabic knowledge, which makes it difficult to judge the quality of the translator's output or of the sentiment subsequently extracted from it. We have to treat this as a “lower bar” on sentiment-analysis performance, and I’ve been making sure the actual analyzer can be swapped out easily by the UN if and when they find one that works well on Arabic text.
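One way to make the analyzer swappable is to have downstream code depend only on a tiny interface, so that a future Arabic-capable analyzer can be registered without touching the pipeline. A minimal sketch of that design, with illustrative class names and a toy word-list analyzer standing in for senti_classifier or the Text Analytics API:

```python
class SentimentAnalyzer:
    """Interface: score(text) returns a polarity in [-1.0, 1.0]."""
    def score(self, text):
        raise NotImplementedError

class ToyWordListAnalyzer(SentimentAnalyzer):
    """Toy stand-in for a real sentiment service (illustrative only)."""
    POS = {"relief", "aid", "recovery"}
    NEG = {"crisis", "shortage", "attack"}
    def score(self, text):
        words = text.lower().split()
        raw = sum(w in self.POS for w in words) - sum(w in self.NEG for w in words)
        return raw / max(len(words), 1)

# Registering an Arabic-capable analyzer later = adding one entry here.
ANALYZERS = {"toy": ToyWordListAnalyzer}

def make_analyzer(name="toy"):
    return ANALYZERS[name]()
```

With this shape, swapping analyzers is a one-line configuration change rather than a pipeline rewrite.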
Q: What do you expect to see in the coming month?
I expect to see the entire pipeline come together shortly – it’s close, but there is still work to be done. Only once that is complete can we really judge the quality of the decision support we’re providing; at that point we can start iterating and improving. We also need to “incrementalize” the pipeline – right now it “boils the ocean” and reprocesses all data every run, because we’ve been changing so many things. That will need to change so it only processes recent data within the time horizons it cares about. We also want to integrate “smarter” zones than the current heatmap – so, for example, instead of squares you would see Libyan regions, cities, and neighborhoods.
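The "incrementalize" idea boils down to keeping a watermark of the last processed time and selecting only records newer than both that watermark and the oldest time horizon still shown on the dashboard. A sketch under those assumptions (field names and the horizon length are illustrative, not the project's actual schema):

```python
from datetime import datetime, timedelta

def select_batch(records, last_processed, now, horizon_days=7):
    """Return only the records the next pipeline run actually needs."""
    cutoff = max(last_processed, now - timedelta(days=horizon_days))
    return [r for r in records if r["timestamp"] > cutoff]
```

Each run would then advance the watermark to `now`, so work done grows with new data rather than with the whole archive.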
Q: What will the UN team need to be able to do in order to manage and support this tool into the future?
We will be putting together a detailed hand-off plan. But in brief, they’ll need a place to host the ingest servers and to run the aggregation pipeline on the required schedule. They will also need to migrate off our Azure subscription and replace our Twitter and Facebook keys, and our Text Analytics and Translator keys, with their own.
Q: What is the most important, or significant, thing you’ve learned while working on this project?
Everything takes at least twice as long as you think it will. Also, classification models trained from synthetically labeled datasets can actually perform better than expected.