Using machine learning to correctly attribute quotes

Photo: DrAfter123/Getty Images

Anna, Michel, Alice – The Guardian

Why do we care so much about quotes?

As we discussed in Talking sense: using machine learning to understand quotes, there are many good reasons to extract quotes. Quotes enable information to be conveyed directly from a source, accurately capturing the intended sentiment and meaning. Not only are they a vital part of accurate reporting, but they can also bring a story to life. The information extracted from them can be used to check facts and allows us to gain insight into public opinion. For example, attributed quotes can be used to track comments on the same topic over time, or to explore those comments as a function of identity, e.g. gender or race. A comprehensive set of quotes and their sources is therefore a rich data asset that can be used to explore demographic and socio-economic trends and changes.

We’ve already used AI to help extract accurate quotes from the Guardian’s extensive archive, and we thought it could help us again with the next stage: accurate quote attribution. This time, we turned to students from the UCL Centre for Doctoral Training in Data Intensive Science. As part of their PhD program, which involves working on industry projects, we asked these students to explore deep learning approaches that could help with quote attribution. In particular, they looked at applying machine learning tools to a method called coreference resolution.

Tara, Alicja, Paul – UCL

What is coreference resolution?

In everyday language, when we mention the same entity multiple times, we tend to use different expressions to refer to it. The task of coreference resolution is to group together all the references in a piece of text that refer back to the same entity. We call the original entity the antecedent and the subsequent references anaphora. Take the simple example below:

Sarah enjoys a cup of tea in the morning. She takes it with milk.

Here, Sarah is the antecedent and the anaphor referring back to her is ‘she’. The antecedent, the reference, or both can be a group of words rather than a single word. So, in the example, another group of coreferring entities is made up of the phrase ‘cup of tea’ and the word ‘it’.

Why is coreference resolution so hard?

You might think that grouping together references to the same entity is a trivial task for machine learning, but there are many layers of complexity to this problem. The task requires linking ambiguous anaphora (eg “she” or “The First Lady”) with an unambiguous antecedent (eg “Michelle Obama”) that can occur many sentences, or even paragraphs, before the quote in question. Depending on the writing style, there could be many more entities embedded in the text that do not refer to any referent of interest. Combined with the complexity of the quote itself, which may be many words long, this makes the task even more difficult.

Furthermore, the meaning expressed through language is very sensitive to the choice of words we use. For example, see how the referent of the word ‘they’ in the following sentences changes because of the change in the verb that follows it:

The city councilors refused the demonstrators a permit because they feared violence.

The city councilors refused the demonstrators a permit because they advocated violence.

(These two subtly different sentences are in fact part of the Winograd schema challenge, a well-known test of machine intelligence, proposed as an extension of the Turing test, which assesses whether or not a computer can think like a human.)

The example shows us that grammar alone cannot be relied upon to solve this task; it is essential to understand the semantics. This means that it is not possible (without prohibitive difficulty) to devise rule-based methods that tackle this task perfectly. This is what motivated us to look at using machine learning to tackle the coreference resolution problem.

Artificial intelligence to the rescue

A typical machine learning recipe for coreference resolution would follow steps like these:

  • Extract a set of mentions that refer to real-world entities

  • For each mention, compute a set of features

  • Based on those features, find the most likely antecedent for each mention (a minimal sketch of these steps follows below)
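
To make the recipe concrete, here is a minimal, self-contained sketch of those three steps in Python. It is purely illustrative: the mention extraction and hand-written features are naive stand-ins invented for this post, not part of the models discussed later, and they are far too crude for real articles.

```python
# Toy three-step coreference recipe (illustrative only, not the production approach).
import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with tagging, parsing and NER

def extract_mentions(doc):
    """Step 1: collect candidate mentions (named entities, noun chunks and pronouns)."""
    mentions = list(doc.ents) + list(doc.noun_chunks)
    mentions += [doc[i:i + 1] for i, tok in enumerate(doc) if tok.pos_ == "PRON"]
    unique = {(m.start_char, m.end_char): m for m in mentions}  # de-duplicate by offsets
    return sorted(unique.values(), key=lambda m: m.start_char)

def features(anaphor, candidate):
    """Step 2: compute simple features for an (anaphor, candidate antecedent) pair."""
    person_pronoun = anaphor.text.lower() in {"he", "she", "him", "her", "his", "hers"}
    candidate_is_person = any(t.ent_type_ == "PERSON" for t in candidate)
    return {
        "distance": anaphor.start - candidate.end,           # how far back the candidate sits
        "animacy_match": person_pronoun == candidate_is_person,
    }

def score(feats):
    """Step 3: score a candidate antecedent; hand-tuned weights stand in for a learned model."""
    return 1.0 * feats["animacy_match"] - 0.05 * feats["distance"]

doc = nlp("Sarah enjoys a cup of tea in the morning. She takes it with milk.")
mentions = extract_mentions(doc)
for anaphor in (m for m in mentions if m.root.pos_ == "PRON"):
    candidates = [m for m in mentions if m.end <= anaphor.start and m.root.pos_ != "PRON"]
    if candidates:
        best = max(candidates, key=lambda c: score(features(anaphor, c)))
        # NB: hand-written features like these break down quickly on real text,
        # which is exactly why the features are learned from data instead.
        print(f"{anaphor.text!r} -> {best.text!r}")
```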

The AI workhorse used to perform those steps is a language model. Basically, a language model is a probability distribution over a sequence of words. Many of you will probably have come across OpenAI’s ChatGPT, which is powered by a large language model.
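
As a toy illustration of that definition, the snippet below builds a tiny bigram “language model” from a handful of made-up sentences and uses it to score word sequences. Real large language models work on the same principle, just learned from vastly more text; the corpus and smoothing here are invented purely for illustration.

```python
# Toy bigram language model: a probability distribution over sequences of words.
from collections import Counter, defaultdict

corpus = [
    "sarah enjoys a cup of tea",
    "sarah takes tea with milk",
    "the cat enjoys milk",
]

# Count how often each word follows each other word.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    for prev, word in zip(words, words[1:]):
        bigram_counts[prev][word] += 1

def sequence_probability(sentence):
    """P(sentence) as the product of P(word | previous word), with a tiny floor for unseen pairs."""
    words = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        count = bigram_counts[prev][word]
        total = sum(bigram_counts[prev].values())
        prob *= count / total if count else 1e-6  # unseen word pairs get a small floor probability
    return prob

# A fluent sequence scores higher than a garbled one.
print(sequence_probability("sarah enjoys a cup of tea"))
print(sequence_probability("tea cup a enjoys sarah"))
```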

To analyze language and make predictions, language models create and use word embeddings. Word embeddings are essentially a mapping of words to points in a semantic space, where words with similar meanings are placed close to each other. For example, the points corresponding to ‘cat’ and ‘lion’ would be closer together than the points corresponding to ‘cat’ and ‘piano’.

Identically spelled words with different meanings ([river] bank vs bank [financial institution], for example) are used in different contexts and will therefore be located in different places in the semantic space. This distinction is crucial in more sophisticated examples, such as the Winograd schema above. These embeddings are the features mentioned in the recipe above.
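
As a quick illustration of distance in semantic space, the snippet below compares word vectors from a spaCy pipeline (assuming the `en_core_web_md` package, which ships with static word vectors). The exact numbers depend on the model, but ‘cat’ should come out closer to ‘lion’ than to ‘piano’.

```python
# Compare word-embedding similarity: closer in semantic space means more related meaning.
import spacy

nlp = spacy.load("en_core_web_md")  # medium pipeline that includes static word vectors

cat, lion, piano = nlp("cat lion piano")

print("cat vs lion :", cat.similarity(lion))    # expected to be relatively high
print("cat vs piano:", cat.similarity(piano))   # expected to be noticeably lower
```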

Language models use word embeddings to represent a string of text as numbers that capture contextual meaning. We can use this numerical representation to perform analytical tasks; in our case, coreference resolution. We show the language model many labeled examples (see later) that train the model, in conjunction with the word embeddings, to identify coreferring mentions when shown text it has not seen before, based on the meaning of that text.

For this task, we chose language models built by ExplosionAI as they fit well with the Guardian’s existing data science pipeline. To use them, however, they needed to be properly trained, and to do that we needed the right data.

Train the model using labeled data

An AI model can be taught by presenting it with many labeled examples that represent the task we want it to complete. In our case, this involved first manually labeling more than a hundred Guardian articles, drawing links between ambiguous references / anaphors and their antecedents.
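
For illustration, a single labeled example might look something like the structure below. This is a hypothetical format invented for this post, not the Guardian’s actual annotation schema: each cluster simply lists the character spans of mentions that an annotator judged to refer to the same entity.

```python
# Hypothetical labeled example (invented format for illustration only).
example = {
    "text": "Sarah enjoys a cup of tea in the morning. She takes it with milk.",
    "clusters": [
        [  # cluster 1: the person
            {"start": 0,  "end": 5,  "text": "Sarah"},
            {"start": 42, "end": 45, "text": "She"},
        ],
        [  # cluster 2: the drink
            {"start": 13, "end": 25, "text": "a cup of tea"},
            {"start": 52, "end": 54, "text": "it"},
        ],
    ],
}
```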

Although this may not sound like the most glamorous task, the performance of any model is limited by the quality of the data fed to it, so the data labeling step is critical to the value of the final product. Due to the complex nature of language and the resulting subjectivity of labeling, there were many complexities in this task, which required creating a set of rules to standardize the data across human annotators. As a result, a considerable amount of time was spent by Anna, Michel and Alice on this stage of the project, and we were all thankful when it was finished!

Although very informative, producing a hundred annotated articles was time-consuming, and they were still not enough to fully reflect the variety of language that our chosen models could encounter. Therefore, to make the most of our small data set, we selected three off-the-shelf language models, namely Coreferee, the spaCy coreference model and fastcoref, which had already been trained on hundreds of thousands of generic examples. We then ‘fine-tuned’ them to suit our specific needs using our annotated data.

This approach allowed us to produce models that achieved greater accuracy on Guardian-specific data compared to using the models straight out of the box.
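
To give a flavor of what using one of these off-the-shelf models looks like, the sketch below runs fastcoref as a spaCy pipeline component, following the usage pattern documented by the fastcoref project (package and attribute names may differ between versions, and a fine-tuned model would be loaded in the same way with its own weights).

```python
# Run an off-the-shelf coreference model (fastcoref) as a spaCy pipeline component.
import spacy
from fastcoref import spacy_component  # importing this registers the "fastcoref" component

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("fastcoref")

doc = nlp("Sarah enjoys a cup of tea in the morning. She takes it with milk.")

# Each cluster is a list of character spans that the model believes refer to the same entity.
for cluster in doc._.coref_clusters:
    print([doc.text[start:end] for start, end in cluster])
```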

These models should allow us to match quotes with their sources across Guardian articles in a highly automated way, with greater precision than ever before. The next step is to run a large-scale trial on the Guardian archive and see what journalistic questions this approach can help us answer.
