Semantic Search for Background Linking in News Articles
Udhav Sethi
University of Waterloo
Anup Anand Deshmukh
University of Waterloo
ABSTRACT
The task of background linking aims to recommend to a reader the news articles that are most relevant for providing context and background for a given query article. For this task, we propose a two-stage approach, IR-BERT, which combines the retrieval power of BM25 with the contextual understanding gained through a BERT-based model. We further propose the use of a diversity measure to evaluate the effectiveness of background linking approaches in retrieving a diverse set of documents. We provide a comparison of IR-BERT with other participating approaches at TREC 2021. We have open-sourced our implementation on GitHub (https://github.com/Anup-Deshmukh/TREC_background_linking).
Author Keywords
Natural Language Processing; Information Retrieval; BERT;
Background Linking; TREC
INTRODUCTION
Online news services have become key sources of information and have affected the way we consume and share news. While drafting a news article, it is often assumed that the reader has sufficient information about the article's background story. This may not always be the case, which warrants the need to provide the reader with links to useful articles that can set the context for the article in focus. These articles may or may not be by the same author, can be dated before or after the query article, and serve to provide additional information about the article's topic or introduce the reader to its key ideas. However, determining what can be categorized as an article providing background context, and retrieving such documents, is not straightforward.
Motivated by this problem, the background linking task was introduced in the news track of TREC 2018. This task aims to retrieve a list of articles that can be incorporated into an "explainer" box alongside the current article to help the reader understand or learn more about the story or main issues contained therein.
In this paper, we propose a two-stage approach, IR-BERT, to
address the problem of background linking. The first stage
filters the corpus to identify a set of candidate documents that are relevant to the article in focus. This is achieved by combining weighted keywords extracted from the query document into an effective search query and using BM25 [9] to search the corpus. The second stage leverages Sentence-BERT [8] to learn contextual representations of the query in order to perform semantic search over the shortlisted candidates. We hypothesize that employing a language model helps in understanding the context of the query article and in identifying articles that provide useful background information.
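A minimal sketch of this retrieve-then-rerank pipeline is given below. It assumes the rank_bm25 and sentence-transformers packages and the all-MiniLM-L6-v2 model as illustrative stand-ins; the released implementation (linked above) may differ in keyword weighting, query construction, and model choice.

```python
# Minimal sketch of the two-stage pipeline: BM25 retrieval followed by
# Sentence-BERT semantic re-ranking. Package and model choices here are
# illustrative assumptions, not the exact ones used by IR-BERT.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "The senate passed a sweeping climate bill after months of debate.",
    "A local bakery wins a national award for its sourdough bread.",
    "Scientists link rising emissions to more frequent heat waves.",
]
query_article = "Lawmakers debate new climate legislation targeting emissions."

# Stage 1: BM25 search over the corpus. IR-BERT builds the query from
# weighted keywords extracted from the query article; for brevity this
# sketch simply uses all of the article's tokens.
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)
scores = bm25.get_scores(query_article.lower().split())
top_k = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:2]

# Stage 2: semantic re-ranking of the shortlisted candidates using
# Sentence-BERT embeddings and cosine similarity.
model = SentenceTransformer("all-MiniLM-L6-v2")
query_emb = model.encode(query_article, convert_to_tensor=True)
cand_embs = model.encode([corpus[i] for i in top_k], convert_to_tensor=True)
cos_scores = util.cos_sim(query_emb, cand_embs)[0]
reranked = [top_k[int(i)] for i in cos_scores.argsort(descending=True)]
print([corpus[i] for i in reranked])
```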
This paper is structured as follows: In section 2, we provide an overview of prior work that motivates our strategies. In section 3, we describe our retrieval approach in detail, followed by sections 4 and 5, where we describe our experiments and discuss the retrieval performance of our method. Finally, we summarize and conclude our work in section 6.
RELATED WORK
BM25 [9] is one of the most popular ranking functions used by search engines to estimate the relevance of documents to a given search query. It is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of their proximity within the document.
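For reference, the standard Okapi BM25 scoring function for a query $Q = \{q_1, \dots, q_n\}$ and a document $D$ is the textbook form below (shown for context; the cited systems may use variants):

\[
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
\]

where $f(q_i, D)$ is the frequency of term $q_i$ in $D$, $|D|$ is the length of $D$ in words, $\mathrm{avgdl}$ is the average document length in the collection, and $k_1$ and $b$ are free parameters, typically $k_1 \in [1.2, 2.0]$ and $b = 0.75$.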
Several previous approaches to background linking are built using BM25. The Anserini toolkit developed by Yang et al. [13] provides standardized retrieval tooling on top of the open-source Lucene library; along with BM25, it has been used to effectively tackle the background linking problem [12]. Another set of approaches, ICTNET [4] and DMINR [5], leverage named entities in the query article to build a search query for BM25.
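As a rough illustration of such entity-based query construction (a sketch in the spirit of these systems, not the exact ICTNET or DMINR pipeline; spaCy and its en_core_web_sm model are assumptions introduced here):

```python
# Hedged sketch of entity-based query construction: collect named
# entities from the query article and join them into a keyword query
# for BM25. spaCy and en_core_web_sm are illustrative assumptions.
import spacy

nlp = spacy.load("en_core_web_sm")

def build_entity_query(article_text: str, max_terms: int = 20) -> str:
    """Return a BM25-ready keyword query built from named entities."""
    doc = nlp(article_text)
    seen, terms = set(), []
    for ent in doc.ents:
        key = ent.text.lower()
        if key not in seen:  # deduplicate, keeping first-occurrence order
            seen.add(key)
            terms.append(ent.text)
    return " ".join(terms[:max_terms])

print(build_entity_query("Angela Merkel met Emmanuel Macron in Berlin on Tuesday."))
```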
Previous work has also exploited language models such as BERT for the task of ad-hoc retrieval [13, 7]. BERT [2] is pre-trained on large open-domain textual data and has achieved state-of-the-art results on many downstream NLP tasks. It has also proven to be an effective re-ranker in many information retrieval tasks. For example, Dai and Callan [1] showed that employing BERT leads to significant improvements in retrieval tasks where queries are written in natural language, a direct consequence of such models' ability to better leverage language structure. Nogueira and Cho [6] used BERT on top of Anserini to re-rank passages in the TREC Complex Answer Retrieval (CAR) task [3]. Similar re-ranking mechanisms have also shown promise in open-domain question answering [14].
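To make the re-ranking pattern concrete, the following is a hedged sketch in the spirit of these BERT re-rankers: a first-stage retriever supplies candidates, and a cross-encoder scores each (query, passage) pair. The model name below is an assumption, not the model used in the cited work.

```python
# Hedged sketch of cross-encoder re-ranking over first-stage candidates.
# The model name is an assumption, not the one used in the cited work.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "effects of new emissions rules on automakers"
candidates = [
    "Carmakers warn that stricter emissions rules will raise costs.",
    "The city marathon was rescheduled due to heavy rain.",
    "Regulators finalized tailpipe standards for upcoming model years.",
]
# Score each (query, passage) pair and sort candidates by relevance.
pair_scores = reranker.predict([(query, passage) for passage in candidates])
for passage, score in sorted(zip(candidates, pair_scores),
                             key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {passage}")
```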
The task of semantic search is closely related to ad-hoc retrieval where queries are written in natural language. There are two
main issues with using BERT for finding the semantic sim-