Language and the Data Pipeline: Building the Data Analysis and Collection Tool for OCHA ROMENA
The following is an update from a member of the development team on the project's progress:
The central idea we're building upon is to identify Twitter and Facebook trends across a pre-defined set of keywords and keyword groups. While this tool will never tell you exactly what is going on where, it allows the team to identify trends across different geo-boundaries. This helps to surface anomalies that could be indicators that something is going on.
Building out the data pipeline
I've been working on the data pipeline: ingesting data from Twitter, de-duplicating the messages, performing the natural language processing, and inferring keywords and locations. We made great progress across all of these components and further developed the ability to fuzzy match keywords and locations.
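The pipeline code itself isn't shown here, but the fuzzy-matching step can be sketched with Python's standard-library difflib. The keyword list, function name, and similarity threshold below are illustrative assumptions, not the project's actual configuration:

```python
from difflib import SequenceMatcher

# Illustrative keyword set; the real tool uses pre-defined keyword groups.
KEYWORDS = ["displacement", "checkpoint", "airstrike", "flood"]

def fuzzy_match(token, keywords, threshold=0.8):
    """Return the best-matching keyword for a token, or None.

    Uses a character-level similarity ratio as a simple stand-in;
    the actual pipeline may use a different matcher and threshold.
    """
    best, best_score = None, 0.0
    for kw in keywords:
        score = SequenceMatcher(None, token.lower(), kw).ratio()
        if score > best_score:
            best, best_score = kw, score
    return best if best_score >= threshold else None

# A misspelled token from a tweet still resolves to a known keyword,
# while an unrelated word falls below the threshold.
print(fuzzy_match("displacment", KEYWORDS))  # close to "displacement"
print(fuzzy_match("weather", KEYWORDS))      # no match above threshold
```

The same approach extends to location names, where transliteration variants make exact string matching especially brittle.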
The more data, the more the machine can learn
That said, not understanding Arabic is a major challenge when it comes to NLP and inference. While we worked with translators, we decided to constrain our work to English data streams only for now. It reminded me once more how difficult it is to build solutions for a language one cannot speak or write.
Having more data will help us refine our approach, especially in the areas of machine learning and inference. This will be ongoing work: the UN team needs to run the tool as a service, apply dev-ops practices, and constantly refine and retrain the models.