CitePaL - Overview

Motivation

The idea behind CitePaL is to build a tool that helps researchers quickly access the relevant background and motivation for a given paper. The main use case is a researcher who encounters a new paper in a line of research that they are less familiar with.

Current Methods

Current methods in practice

Currently, researchers find foundational papers by word-of-mouth (talking to peers/mentors or hearing about papers during lectures), by combing through reference lists, or through keyword searches on paper aggregation sites. We hope CitePaL can be a tool to supplement or replace the last two strategies.

Related work

Our approach (detailed below) relies on clustering the references and on surfacing the most relevant papers for specific clusters. As far as we know, no existing work integrates both of these features into one system. However, we build on a large literature in citation clustering and paper recommendation systems. We note a few key papers below and how they differ from our approach.

Eigenfactor Recommends (West et al. 2016) analyzes a paper's citation graph to recommend similar "Expert" and "Classic" papers. This process is similar to our recommendation algorithm, which surfaces papers using shared references. However, combining our algorithm with clustering allows for a more natural exploration of the different modes of a paper.
SPECTER embeddings (Cohan et al. 2020) form the foundation of our clustering algorithm. They could also be used directly to predict similar papers, as proposed in the original paper. That covers a different use case than ours, as we focus on analyzing a paper's references, which may be on a different topic than the paper itself. Our tool would highlight more classic papers, whereas SPECTER embeddings would uncover similar contemporary papers.
Threddy (Kang et al. 2022) allows exploration of related papers based on specific concepts identified in a paper. This is a powerful system for finding references, but it requires more interaction with a paper from the user than our demo does.

Approach

We built a tool that aids researchers in finding background papers when reading unfamiliar content. Using our tool, a researcher inputs the URL of a single paper and receives a list of three or four key papers that the given paper builds on, along with an interactive interface to explore the corpus of reference papers. The interactive interface consists of two components: a paper hierarchy and a scatter plot. With these, users can explore the corpus of reference papers, examine the main topics related to these references, and navigate to the recommended 'foundational' papers.

Reference Hierarchy

To group papers, we apply hierarchical clustering (Ward's method with Euclidean distance) to the SPECTER embeddings provided by Semantic Scholar. We use GPT-3 (the davinci-003 model) to generate a title for each group based on the titles of the groups or papers below it. We use the following prompt:

I am writing a review paper including the following papers:
  [input list of paper references or group titles]
A review paper summarizing and synthesizing all of the above papers
would have the short but descriptive title of ...
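
The grouping and prompt construction described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes SciPy is used for the Ward clustering, the function names `cluster_papers` and `build_title_prompt` are hypothetical, and the random vectors are toy stand-ins for real SPECTER embeddings. The call to GPT-3 itself is left out.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Prompt text copied from the section above; {items} is a placeholder
# for the list of paper references or group titles.
PROMPT_TEMPLATE = (
    "I am writing a review paper including the following papers:\n"
    "{items}\n"
    "A review paper summarizing and synthesizing all of the above papers\n"
    "would have the short but descriptive title of ..."
)


def cluster_papers(embeddings, n_groups):
    """Build a Ward/Euclidean hierarchy and cut it into at most n_groups clusters."""
    tree = linkage(embeddings, method="ward", metric="euclidean")
    return fcluster(tree, t=n_groups, criterion="maxclust")


def build_title_prompt(titles):
    """Fill the prompt template with one indented title per line."""
    items = "\n".join(f"  {title}" for title in titles)
    return PROMPT_TEMPLATE.format(items=items)


# Toy stand-in data (assumption): two well-separated blobs in 4 dimensions.
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(0.0, 0.1, size=(3, 4)),
    rng.normal(5.0, 0.1, size=(3, 4)),
])
titles = [f"Paper {i}" for i in range(6)]

labels = cluster_papers(embeddings, n_groups=2)

# Collect the titles that fall into each cluster, then build one prompt per group.
groups = {}
for title, label in zip(titles, labels):
    groups.setdefault(int(label), []).append(title)
prompts = {label: build_title_prompt(members) for label, members in groups.items()}
```

Each entry in `prompts` would then be sent to the language model, and the returned titles could be fed back into `build_title_prompt` to name the next level up in the hierarchy.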

Visual Exploration

To facilitate visual exploration of the referenced papers, we used Altair to create an interactive scatter plot. Each point represents a single paper, and the corpus consists of two levels of references: the references of the selected paper, and the references of those papers. This allows users to explore a wider range of papers that may be relevant. The relevance of a paper is represented by three metrics: the date it was published (x axis), its citation count from Semantic Scholar (y axis), and its number of cross references (color and size). The cross reference count is the number of papers in the selected corpus (all the papers shown in the plot) that cite a given paper. For example, a cross reference count of five means that five papers within the corpus cite that paper. The selection of papers is set by the user through the checked boxes in the paper hierarchy above. Finally, we incorporated a tooltip so that users can hover over any point to see the title, first author, and year of publication.
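
As an illustration of the cross reference metric, here is a small sketch of how the count could be computed. The data layout (a dictionary mapping each paper in the corpus to the papers it cites) and the function name `cross_reference_counts` are assumptions made for this example, not the tool's actual implementation.

```python
def cross_reference_counts(cites):
    """Count, for each paper in the corpus, how many other corpus papers cite it.

    `cites` maps a paper id to the list of paper ids it references.
    Only the keys of `cites` are treated as the corpus shown in the plot.
    """
    counts = {paper: 0 for paper in cites}
    for citing, cited_list in cites.items():
        # Use a set so a duplicated reference is only counted once per citing paper.
        for cited in set(cited_list):
            if cited in counts and cited != citing:
                counts[cited] += 1
    return counts


# Hypothetical four-paper corpus; "X" is cited but is not part of the corpus.
corpus = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": [],
    "D": ["C", "X"],
}
counts = cross_reference_counts(corpus)
print(counts)  # {'A': 0, 'B': 1, 'C': 3, 'D': 0}
```

In the plot, a value like the `3` for paper `C` would drive both the color and the size of that paper's point.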

Recommended Papers

The top ten recommended papers, which appear as links after the user selects “Show the graph!”, are ordered first by number of “shared” citations (i.e., how many of the papers in the selected hierarchies reference that paper) and then by total number of citations. A drawback of this method is that it preserves any existing biases in which papers are cited more frequently. We hope that presenting the entire chart in an eye-catching way will encourage users to explore the wider corpus of referenced papers more fully.
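
The two-level ordering can be sketched as a sort on a pair of keys. The `Candidate` record and `recommend` function below are hypothetical names used only for illustration, and descending order on both keys is an assumption about how "ordered by" is meant.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    title: str
    shared_citations: int  # papers in the selected hierarchies that cite this one
    total_citations: int   # overall citation count


def recommend(candidates, top_n=10):
    """Rank by shared citations, breaking ties by total citations (both descending)."""
    ranked = sorted(
        candidates,
        key=lambda c: (c.shared_citations, c.total_citations),
        reverse=True,
    )
    return ranked[:top_n]


# Made-up example values.
candidates = [
    Candidate("P", 3, 10),
    Candidate("Q", 3, 50),
    Candidate("R", 5, 1),
    Candidate("S", 1, 1000),
]
top = [c.title for c in recommend(candidates)]
print(top)  # ['R', 'Q', 'P', 'S']
```

Note how `S` ranks last despite its high total citation count: shared citations dominate, and total citations only break ties (as between `Q` and `P`).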

Example Dataset

The example dataset is made up of assigned readings from CSE599D: The Future of Scholarly Communication. Exploring the auto-generated hierarchies of these papers allows a user to quickly see an overview of the range of topics covered in the course.