3. Methods

This paper describes a methodology for systematically monitoring the use of UNHCR microdata. We develop a workflow for searching literature repositories with an initial focus on academic research, but with plans to include “grey literature” in the future. We take a three-step approach to producing a comprehensive and informative list of the papers that reference a particular dataset.

First, for a given dataset, we combine the metadata fields available for the study into multiple, overlapping search queries; these fields include the primary authoring organization, the year of the survey, and the country in which the study was carried out.
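As a rough illustration of this first step, the sketch below combines metadata fields into overlapping query strings. The field names (`organization`, `year`, `country`), the `build_queries` helper, and the example record are all hypothetical; the actual metadata schema and combination rules used in the workflow may differ.

```python
from itertools import combinations

# Hypothetical field names; the real metadata schema may use different keys.
QUERY_FIELDS = ("organization", "year", "country")


def build_queries(metadata, min_terms=2):
    """Return overlapping query strings built from the available metadata fields."""
    # Keep only the fields that are present and non-empty, in a fixed order.
    terms = [str(metadata[f]) for f in QUERY_FIELDS if metadata.get(f)]
    queries = []
    # Every combination of at least `min_terms` terms becomes one query,
    # so shorter queries overlap with longer, more specific ones.
    for size in range(min_terms, len(terms) + 1):
        for combo in combinations(terms, size):
            queries.append(" ".join(combo))
    return queries


# Illustrative record only, not taken from any real dataset.
example = {"organization": "UNHCR", "year": 2019, "country": "Kenya"}
queries = build_queries(example)
for q in queries:
    print(q)
```

Generating every combination above a minimum size is one simple way to obtain overlapping queries: broad queries favor recall, while the most specific query favors precision.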

Second, we submit the generated query strings to Semantic Scholar and collect a set of search results for each UNHCR microdata set.
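The search step might look something like the following sketch. It assumes the public Semantic Scholar Graph API paper search endpoint and a small set of requested fields; the text does not state which endpoint, fields, or limits the workflow uses, and the helper names `search_url`, `search_papers`, and `merge_results` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the public Semantic Scholar Graph API paper search.
SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"


def search_url(query, limit=20, fields=("title", "year", "abstract")):
    """Build the request URL for one query string."""
    params = {"query": query, "limit": limit, "fields": ",".join(fields)}
    return f"{SEARCH_URL}?{urlencode(params)}"


def search_papers(query, **kwargs):
    """Fetch one page of results for a query (requires network access)."""
    with urlopen(search_url(query, **kwargs), timeout=30) as response:
        payload = json.load(response)
    # Matching papers are returned under the "data" key.
    return payload.get("data", [])


def merge_results(result_sets):
    """Merge results from several queries, keeping one entry per paperId."""
    merged = {}
    for papers in result_sets:
        for paper in papers:
            merged.setdefault(paper["paperId"], paper)
    return list(merged.values())
```

Because the queries overlap, the same paper can appear in several result sets; deduplicating on `paperId` gives a single candidate list per dataset, for example `merge_results(search_papers(q) for q in queries)`.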

Third, once a list of possible references has been compiled for each dataset, we use Natural Language Processing (NLP) to compute, for each reference, a measure of its relevance to the dataset in question. We first generate a list of the topics contained in each reference using the topic modeling and analysis tools made available by NLP4Dev. To classify each possible reference by relevance to a given dataset, we then consider 1) the share of identified topics related to forced displacement, 2) the set of countries mentioned in the reference, and 3) the frequency of a curated set of keyword tags related to forced displacement.
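To make the three signals concrete, the sketch below combines them into a single score. The weights, the keyword saturation threshold, the topic labels, and the `relevance_score` function are hypothetical illustrations; the text does not specify how the signals are actually weighted or combined, and the topic shares would in practice come from NLP4Dev rather than being supplied by hand.

```python
import re


def relevance_score(topic_shares, countries_mentioned, text, dataset_country,
                    displacement_topics, keywords,
                    weights=(0.5, 0.3, 0.2), saturation=5):
    """Combine three relevance signals into a score between 0 and 1.

    The weights and the saturation threshold are illustrative assumptions.
    """
    # 1) Share of identified topics related to forced displacement.
    topic_signal = sum(share for topic, share in topic_shares.items()
                       if topic in displacement_topics)
    # 2) Whether the dataset's country is among the countries mentioned.
    country_signal = 1.0 if dataset_country in countries_mentioned else 0.0
    # 3) Frequency of curated keyword tags, capped at `saturation` hits.
    #    Only single-word keywords are matched in this simplified version.
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = sum(1 for token in tokens if token in keywords)
    keyword_signal = min(1.0, hits / saturation)

    w_topic, w_country, w_keyword = weights
    return (w_topic * topic_signal
            + w_country * country_signal
            + w_keyword * keyword_signal)


# Illustrative inputs only.
score = relevance_score(
    topic_shares={"forced displacement": 0.6, "agriculture": 0.4},
    countries_mentioned={"Kenya", "Uganda"},
    text="Refugee households in Kenya report that asylum procedures are slow.",
    dataset_country="Kenya",
    displacement_topics={"forced displacement"},
    keywords={"refugee", "asylum", "displaced"},
)
print(round(score, 2))
```

A weighted sum is only one possible way to combine the signals; a threshold on each signal or a trained classifier would fit the same description.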

Below we summarize the three primary methods that were used throughout this process. Each is described in greater detail in a link on the left side of the page.

| Step | Input | Output | Benefit |
| --- | --- | --- | --- |
| 1. Query generation | Metadata for an individual dataset | List of query strings | Information about topics and countries included, field of study and nature of use |
| 2. Semantic search | One or more query strings for each dataset | List(s) of citing papers | Information about topics and countries included, field of study and nature of use |
| 3. Topic modeling / sentiment analysis | Text of individual citing papers | Information about topics and countries included, field of study and nature of use | We can determine how the dataset is used in each matching publication without resorting to manual review/analysis |

3.1. Tools

We use two external tools: the APIs provided by Semantic Scholar and NLP4Dev. We load corpora from both and store them in Google Cloud Storage.

3.2. Citations

The following papers informed our approach: [CKPT92].

BK16

Eirik Bakke and David R. Karger. Expressive query construction through direct manipulation of nested relational results. In Proceedings of the 2016 International Conference on Management of Data. ACM, June 2016. URL: https://doi.org/10.1145/2882903.2915210, doi:10.1145/2882903.2915210.

CK06

Harr Chen and David R. Karger. Less is more: probabilistic models for retrieving fewer relevant documents. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, August 2006. URL: https://doi.org/10.1145/1148170.1148245, doi:10.1145/1148170.1148245.

CAvZC19

Arman Cohan, Waleed Ammar, Madeleine van Zuylen, and Field Cady. Structural scaffolds for citation intent classification in scientific publications. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2019. URL: https://doi.org/10.18653/v1/n19-1361, doi:10.18653/v1/n19-1361.

CKPT92

Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey. Scatter/gather: a cluster-based approach to browsing large document collections. In Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '92. ACM Press, 1992. URL: https://doi.org/10.1145/133160.133214, doi:10.1145/133160.133214.

FCB16

Ethan Fast, Binbin Chen, and Michael S. Bernstein. Empath: understanding topic signals in large-scale text. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM, May 2016. URL: https://doi.org/10.1145/2858036.2858535, doi:10.1145/2858036.2858535.

JKH+18

David Jurgens, Srijan Kumar, Raine Hoover, Dan McFarland, and Dan Jurafsky. Measuring the evolution of a scientific field through citation frames. Transactions of the Association for Computational Linguistics, 6:391–406, December 2018. URL: https://doi.org/10.1162/tacl_a_00028, doi:10.1162/tacl_a_00028.

TST06

Simone Teufel, Advaith Siddharthan, and Dan Tidhar. An annotation scheme for citation function. In Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue, 80–87. Sydney, Australia, July 2006. Association for Computational Linguistics. URL: https://aclanthology.org/W06-1312.

VEHE15

Marco Antonio Valenzuela-Escarcega, Vu A. Ha, and Oren Etzioni. Identifying meaningful citations. In AAAI Workshop: Scholarly Big Data. 2015.