Automated UN Document Summaries
How might we make the United Nations' historical document archive, and other future UN documentation, more accessible?
The UN's Official Document System Search (ODS), built by the Technical Services Team in UN-OICT, is an open door to the entire collection of UN documents, all the way back to the organization's founding. In many ways, this archive's trail is a record of the world's history over the past several decades. How might we make that record, and other future UN documentation, more accessible?
Our team collaborated with machine intelligence research company Fast Forward Labs (FFL) to improve keyword and topic extraction on official UN documents. Employing machine learning techniques, FFL were able to enrich the set of tags and classifiers applicable to UN documents to cover a greater range within the corpus. This approach offers flexibility across multiple languages and can be more sophisticated than regular expression recognition.
The first experiment, led by FFL's Data Scientist Micha Gorelick, resulted in a system that reorders the sentences in each document according to their potential to act as summaries. These higher-potential sentences are automatically brought to the top, rearranging the document into a more efficient read through a process called "extraction-based summarization."
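To make the idea of extraction-based summarization concrete, here is a minimal sketch in Python. It is not FFL's system: the scoring rule (average word frequency within the document), the tiny stopword list, the naive sentence splitter, and every function name are assumptions chosen for illustration. It only shows the general shape of the technique: score each sentence, then reorder the document by score.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "on", "for",
             "that", "this", "it", "are", "was", "be", "by", "with", "as"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase words with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS]


def score_sentences(text):
    """Return (score, position, sentence) for every sentence in the text.

    Assumed scoring rule: the average document-wide frequency of the
    sentence's non-stopword terms. A real system would likely use a
    richer model.
    """
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))
    scored = []
    for position, sentence in enumerate(sentences):
        words = tokenize(sentence)
        score = sum(freq[w] for w in words) / len(words) if words else 0.0
        scored.append((score, position, sentence))
    return scored


def reorder_by_score(text):
    """Sentences sorted from highest to lowest score (ties keep document order)."""
    scored = score_sentences(text)
    return [s for _, _, s in sorted(scored, key=lambda t: (-t[0], t[1]))]
```

Calling `reorder_by_score(text)` on a document's plain text returns the same sentences, none removed, with the likeliest summary sentences first. Keeping the original position alongside each score means the document can always be restored to its original order.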
For example, the 2007 paper “Arrangements for the Secretary-General’s high-level event entitled ‘The future in our hands: addressing the leadership challenge of climate change’” can be summarized, in part, as follows:
For comparison, the original:
And a sortable, rearrangeable version produced using the described process. Sorting by “score” will rearrange the document into its summary-optimized form:
DISTRIBUTION OF STRONG SUMMARY SENTENCES ACROSS THE COLLECTION
The tool was run across the thousands of existing documents in the UN ODS. This chart shows where in the documents the most important sentences tend to occur across the entire collection. The x-axis shows the percentage of the way into the document. The y-axis shows the number of sentences across the entire corpus that rank among the top 5 sentences of their document.
Many occur near the beginning of the document, but there's quite a wide distribution. This would indicate that there is limited structural consistency from document to document across the collection.
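The counts behind a chart like this could be computed along the following lines. This is a sketch under stated assumptions, not the code used for the chart: it assumes each document is already represented as a list of per-sentence scores, measures a sentence's position as its index divided by the number of sentences, and uses hypothetical function names and a hypothetical default of ten bins.

```python
def top_sentence_positions(scores, top_n=5):
    """Relative positions (0-100%) of the top_n highest-scoring sentences.

    `scores` is a list of per-sentence scores in document order.
    Position is taken as index / sentence count (an assumption).
    """
    n = len(scores)
    if n == 0:
        return []
    ranked = sorted(range(n), key=lambda i: (-scores[i], i))[:top_n]
    return [100.0 * i / n for i in ranked]


def position_histogram(corpus_scores, top_n=5, bins=10):
    """Count, per position bin, how many top sentences fall there corpus-wide."""
    counts = [0] * bins
    for scores in corpus_scores:
        for pct in top_sentence_positions(scores, top_n):
            # Clamp so that a position of exactly 100% lands in the last bin.
            counts[min(int(pct * bins / 100), bins - 1)] += 1
    return counts
```

Each entry of the returned list corresponds to one bar of the chart: the bin index maps to the x-axis (percentage into the document) and the count maps to the y-axis.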
In addition to improving search and providing more efficient access to existing documents, this technique might also help in analyzing long-form text-based data in projects like the OCHA Libya Monitoring Tool, described in the Unite Newsletter here.
Watch Fast Forward Labs' webinar below to learn more about the technology, the process, and its potential applications: