3.3. Query Generation

We begin by generating a set of search queries based on the metadata for each dataset.

The full query creation notebook can be found here.

We start with simple methods of combining the metadata. The key metadata components that we use as inputs to the queries are:

  • Core survey name

  • Survey abbreviation (e.g., VaSyR)

  • Lead organization (e.g., UNHCR) - short or long versions may be used

  • Category (e.g., Socioeconomic Assessment of Refugees)

  • Country

  • Year

For our initial exploration, we define six simple query types, each a different combination of the components above:

  • Query type 1: lead_org_short + year + name + shortcode

  • Query type 2: lead_org_short + year + name + full_name

  • Query type 3: lead_org_short + year + name + shortcode_fullname

  • Query type 4: lead_org_long + year + name + shortcode

  • Query type 5: lead_org_long + year + name + full_name

  • Query type 6: lead_org_long + year + name + shortcode_fullname
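The combinations above can be sketched as simple field templates. This is a minimal illustration rather than the notebook's implementation: the metadata values are hypothetical placeholders, the fields are assumed to be joined with single spaces, the helper name build_query is made up for this example, and the composition of type 5 is inferred from the pattern of the other types.

```python
# Field combinations for each query type, using the field names from the list above.
# Type 5 is an assumption inferred from the pattern of types 2 and 4.
QUERY_TEMPLATES = {
    1: ["lead_org_short", "year", "name", "shortcode"],
    2: ["lead_org_short", "year", "name", "full_name"],
    3: ["lead_org_short", "year", "name", "shortcode_fullname"],
    4: ["lead_org_long", "year", "name", "shortcode"],
    5: ["lead_org_long", "year", "name", "full_name"],
    6: ["lead_org_long", "year", "name", "shortcode_fullname"],
}


def build_query(metadata: dict, query_type: int) -> str:
    """Join the fields of one query type with spaces, skipping missing or empty values."""
    fields = QUERY_TEMPLATES[query_type]
    parts = [
        str(metadata[field]).strip()
        for field in fields
        if metadata.get(field) not in (None, "")
    ]
    return " ".join(parts)


# Hypothetical metadata record; every value is a placeholder, not real survey data.
example = {
    "lead_org_short": "ORG",
    "lead_org_long": "Example Organization",
    "year": 2020,
    "name": "Exampleland",
    "shortcode": "EXS",
    "full_name": "Example Survey",
    "shortcode_fullname": "EXS Example Survey",
}

for query_type in sorted(QUERY_TEMPLATES):
    print(query_type, build_query(example, query_type))
```

Keeping the templates in a single dictionary makes it easy to add or reorder fields later without touching the joining logic.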

Defining the query structure is an important step, because the search output can be sensitive to the contents of the query. For example, the six query types above return different numbers of results when entered into Semantic Scholar.

The table below shows the number of results returned by Semantic Scholar for each dataset and query type.

import pandas as pd
df = pd.read_csv("../data/semantic_scholar_query_results_with_web_count.csv")
print(df)
      id  query_type1_pubs  query_type2_pubs  query_type3_pubs  \
0     16                93                 2                 0   
1    114                12                 0                 0   
2    131                 3                 0                 0   
3    132                34                 0                 0   
4    136                15                49                 2   
..   ...               ...               ...               ...   
478  699                 2                 8                 0   
479  700                 2                 8                 0   
480  701                 1                 1                 0   
481  703                 6                 0                 0   
482  704                 0                 0                 0   

     query_type4_pubs  query_type5_pubs  query_type6_pubs  \
0                   0                 0                 0   
1                1978                 2                 2   
2                   0                 0                 0   
3                   0                 0                 0   
4                  15                49                 2   
..                ...               ...               ...   
478                 0                 0                 0   
479                 0                 0                 0   
480                 0                 0                 0   
481                 0                 0                 0   
482                 0                 0                 0   

     no_related_pubs_jdc_website  
0                              0  
1                              1  
2                              0  
3                              0  
4                              2  
..                           ...  
478                            0  
479                            0  
480                            0  
481                            0  
482                            0  

[483 rows x 8 columns]
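One way to compare the query types is to count how many datasets return at least one result under each type, and to find the type with the most results per dataset. The sketch below does this for the first five rows of the printed output, retyped by hand; it assumes the column naming shown above and is not part of the original notebook.

```python
import pandas as pd

# First five rows of the printed output above, retyped by hand for illustration.
sample = pd.DataFrame(
    {
        "id": [16, 114, 131, 132, 136],
        "query_type1_pubs": [93, 12, 3, 34, 15],
        "query_type2_pubs": [2, 0, 0, 0, 49],
        "query_type3_pubs": [0, 0, 0, 0, 2],
        "query_type4_pubs": [0, 1978, 0, 0, 15],
        "query_type5_pubs": [0, 2, 0, 0, 49],
        "query_type6_pubs": [0, 2, 0, 0, 2],
    }
)

query_cols = [col for col in sample.columns if col.startswith("query_type")]

# Number of datasets for which each query type returned at least one result.
datasets_with_hits = (sample[query_cols] > 0).sum()

# Query type with the most results for each dataset (ties go to the first column).
best_type = sample.set_index("id")[query_cols].idxmax(axis=1)

print(datasets_with_hits)
print(best_type)
```

Even in this small sample the counts differ sharply between query types, which is the sensitivity described above.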

Each of these query strings is subsequently passed to the Semantic Scholar API. The following function looks up a single query string, which can then be passed as an argument to another function.

# Look up one query string by dataset ID and query type number.
def query_finder(df_name: pd.DataFrame, dataset_id: int = 189, query_number: int = 1) -> str:
    """Return the query of type `query_number` for the dataset with ID `dataset_id`."""
    # Index the frame by dataset ID so that .loc selects by ID rather than by row position.
    df_indexid = df_name.set_index('id')
    # Select the row for the requested dataset and the column for the requested query type.
    query = df_indexid.loc[dataset_id, f"query_type{query_number}"]
    return query

Future work will draw on more sophisticated methods of constructing queries.

3.3.1. Next steps

After generating these queries, we treat them as inputs to the Semantic Scholar API. See the next section for more details.
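As a rough sketch of what passing a query to the API could look like, the code below builds a request URL and reads the reported number of matches. It assumes the public paper search endpoint of the Semantic Scholar Academic Graph API and its `total` response field. The helper names are hypothetical, rate limits and API keys are not handled, and the project's own calls may differ.

```python
import json
import urllib.parse
import urllib.request

# Paper search endpoint of the Semantic Scholar Academic Graph API (assumed here).
SEARCH_URL = "https://api.semanticscholar.org/graph/v1/paper/search"


def build_search_url(query: str, limit: int = 10) -> str:
    """Return the request URL for a keyword search, with the query URL-encoded."""
    params = {"query": query, "limit": limit, "fields": "title,year"}
    return f"{SEARCH_URL}?{urllib.parse.urlencode(params)}"


def count_results(query: str) -> int:
    """Send the search request and return the reported total number of matches."""
    url = build_search_url(query, limit=1)
    with urllib.request.urlopen(url, timeout=30) as response:
        payload = json.load(response)
    # "total" is assumed to be present in the response; default to 0 if it is not.
    return int(payload.get("total", 0))


# Example usage (requires network access), with a placeholder query string:
# print(count_results("ORG 2020 Exampleland EXS"))
```

Separating URL construction from the network call keeps the query encoding easy to inspect without sending a request.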

3.3.2. Citations

The following papers informed our approach: [BK16].

BK16

Eirik Bakke and David R. Karger. Expressive query construction through direct manipulation of nested relational results. In Proceedings of the 2016 International Conference on Management of Data. ACM, June 2016. URL: https://doi.org/10.1145/2882903.2915210, doi:10.1145/2882903.2915210.

CK06

Harr Chen and David R. Karger. Less is more: probabilistic models for retrieving fewer relevant documents. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, August 2006. URL: https://doi.org/10.1145/1148170.1148245, doi:10.1145/1148170.1148245.

CAvZC19

Arman Cohan, Waleed Ammar, Madeleine van Zuylen, and Field Cady. Structural scaffolds for citation intent classification in scientific publications. In Proceedings of the 2019 Conference of the North. Association for Computational Linguistics, 2019. URL: https://doi.org/10.18653/v1/n19-1361, doi:10.18653/v1/n19-1361.

CKPT92

Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey. Scatter/gather: a cluster-based approach to browsing large document collections. In Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '92. ACM Press, 1992. URL: https://doi.org/10.1145/133160.133214, doi:10.1145/133160.133214.

FCB16

Ethan Fast, Binbin Chen, and Michael S. Bernstein. Empath: understanding topic signals in large-scale text. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM, May 2016. URL: https://doi.org/10.1145/2858036.2858535, doi:10.1145/2858036.2858535.

JKH+18

David Jurgens, Srijan Kumar, Raine Hoover, Dan McFarland, and Dan Jurafsky. Measuring the evolution of a scientific field through citation frames. Transactions of the Association for Computational Linguistics, 6:391–406, December 2018. URL: https://doi.org/10.1162/tacl_a_00028, doi:10.1162/tacl_a_00028.

TST06

Simone Teufel, Advaith Siddharthan, and Dan Tidhar. An annotation scheme for citation function. In Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue, 80–87. Sydney, Australia, July 2006. Association for Computational Linguistics. URL: https://aclanthology.org/W06-1312.

VEHE15

Marco Antonio Valenzuela-Escarcega, Vu A. Ha, and Oren Etzioni. Identifying meaningful citations. In AAAI Workshop: Scholarly Big Data. 2015.