3.3. Query Generation
We begin by generating a set of search queries based on the metadata for each dataset.
The full query creation notebook can be found here.
We start with simple methods of combining the metadata. The key metadata components used as query inputs are:
Core survey name
Survey abbreviation (e.g., VaSyR)
Lead organization (e.g., UNHCR) - short or long versions may be used
Category (e.g., Socioeconomic Assessment of Refugees)
Country
Year
For our initial exploration, we define six simple query types, each a different combination of the information above:
Query type 1: lead_org_short + year + name + shortcode
Query type 2: lead_org_short + year + name + full_name
Query type 3: lead_org_short + year + name + shortcode_fullname
Query type 4: lead_org_long + year + name + shortcode
Query type 5: lead_org_long + year + name + full_name
Query type 6: lead_org_long + year + name + shortcode_fullname
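As a minimal sketch of how the combinations above could be assembled, the helper below joins the metadata fields of one dataset with single spaces. The field names mirror the query definitions; the space-joining rule, the `build_queries` helper, and the example record are assumptions for illustration rather than the notebook's actual implementation. Type 5 is filled in by analogy with type 2.

```python
from typing import Dict, List

# Field combinations for each query type, following the list above.
# Type 5 is an assumption, inferred by analogy with type 2.
QUERY_TEMPLATES: Dict[int, List[str]] = {
    1: ["lead_org_short", "year", "name", "shortcode"],
    2: ["lead_org_short", "year", "name", "full_name"],
    3: ["lead_org_short", "year", "name", "shortcode_fullname"],
    4: ["lead_org_long", "year", "name", "shortcode"],
    5: ["lead_org_long", "year", "name", "full_name"],
    6: ["lead_org_long", "year", "name", "shortcode_fullname"],
}


def build_queries(meta: Dict[str, object]) -> Dict[str, str]:
    """Return one query string per query type for a single dataset record."""
    queries = {}
    for number, fields in QUERY_TEMPLATES.items():
        # Skip missing or empty fields so the query has no stray spaces.
        parts = [str(meta[f]).strip() for f in fields if meta.get(f) not in (None, "")]
        queries[f"query_type{number}"] = " ".join(parts)
    return queries


# Hypothetical record: placeholder values, not a real catalogue entry.
example = {
    "lead_org_short": "ORG",
    "lead_org_long": "Example Organization",
    "year": 2020,
    "name": "Example Survey",
    "shortcode": "EXS",
    "full_name": "Example Survey Full Title",
    "shortcode_fullname": "EXS Example Survey Full Title",
}
queries = build_queries(example)
print(queries["query_type1"])
print(queries["query_type6"])
```

Keeping the combinations in a single mapping makes it easy to add or drop a query type without touching the joining logic.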
Defining the query structure is an important step, because the search output can be sensitive to the contents of the query. For example, the six queries above return different numbers of results when entered into Semantic Scholar.
The table below shows the number of results returned by Semantic Scholar for each dataset.
import pandas as pd
df = pd.read_csv("../data/semantic_scholar_query_results_with_web_count.csv")
print(df)
        id  query_type1_pubs  query_type2_pubs  query_type3_pubs  query_type4_pubs  query_type5_pubs  query_type6_pubs  no_related_pubs_jdc_website
0       16                93                 2                 0                 0                 0                 0                            0
1      114                12                 0                 0              1978                 2                 2                            1
2      131                 3                 0                 0                 0                 0                 0                            0
3      132                34                 0                 0                 0                 0                 0                            0
4      136                15                49                 2                15                49                 2                            2
..     ...               ...               ...               ...               ...               ...               ...                          ...
478    699                 2                 8                 0                 0                 0                 0                            0
479    700                 2                 8                 0                 0                 0                 0                            0
480    701                 1                 1                 0                 0                 0                 0                            0
481    703                 6                 0                 0                 0                 0                 0                            0
482    704                 0                 0                 0                 0                 0                 0                            0
[483 rows x 8 columns]
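To compare query types across datasets, one option is to summarise each count column. The sketch below assumes the column names shown in the output above and uses the first five printed rows in place of the full CSV; the `summarise_query_counts` helper is hypothetical.

```python
import pandas as pd


def summarise_query_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise the result counts of each query type across datasets."""
    # Select only the per-query-type count columns.
    count_cols = [
        c for c in df.columns
        if c.startswith("query_type") and c.endswith("_pubs")
    ]
    counts = df[count_cols]
    has_results = counts > 0
    return pd.DataFrame({
        "datasets_with_results": has_results.sum(),
        "share_with_results": has_results.mean(),
        "median_results": counts.median(),
        "max_results": counts.max(),
    })


# First five rows of the printed output, used here as a small sample.
sample = pd.DataFrame({
    "id": [16, 114, 131, 132, 136],
    "query_type1_pubs": [93, 12, 3, 34, 15],
    "query_type2_pubs": [2, 0, 0, 0, 49],
    "query_type3_pubs": [0, 0, 0, 0, 2],
    "query_type4_pubs": [0, 1978, 0, 0, 15],
    "query_type5_pubs": [0, 2, 0, 0, 49],
    "query_type6_pubs": [0, 2, 0, 0, 2],
})
summary = summarise_query_counts(sample)
print(summary)
```

A summary of this kind makes it easier to see which query structures return nothing for most datasets and which occasionally return very large, possibly noisy, result sets.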
Each of these query strings is subsequently passed to the Semantic Scholar API. The following function returns a single query, which can then be passed as an argument to another function.
# This function takes the query dataframe, the id of the dataset and the query type number, and returns the query.
def query_finder(df_name: pd.DataFrame, dataset_id: int = 189, query_number: int = 1) -> str:
    """Take the query dataframe, the dataset ID and the query type number, and return the query itself."""
    # Index by dataset id so the row can be selected directly.
    df_indexid = df_name.set_index('id')
    # Use the dataset_id argument rather than a hard-coded id.
    query = df_indexid.loc[dataset_id, f"query_type{query_number}"]
    return query
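As a sketch of how a returned query string might be handed to Semantic Scholar, the helper below builds a request URL without sending it. The paper search endpoint of the Academic Graph API and its `query`, `limit`, and `fields` parameters are assumed here; the project may call the API differently, and `build_search_url` is a hypothetical name.

```python
from urllib.parse import urlencode

# Assumed endpoint: Semantic Scholar Academic Graph API paper search.
SEARCH_ENDPOINT = "https://api.semanticscholar.org/graph/v1/paper/search"


def build_search_url(query: str, limit: int = 10, fields: str = "title,year") -> str:
    """Build (but do not send) a paper search URL for one query string."""
    cleaned = query.strip()
    if not cleaned:
        raise ValueError("query must not be empty")
    params = {"query": cleaned, "limit": limit, "fields": fields}
    # urlencode escapes spaces and punctuation in the query text.
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


# Placeholder query string, not a real catalogue entry.
url = build_search_url("ORG 2020 Example Survey EXS")
print(url)
```

Separating URL construction from the network call keeps the query logic testable offline.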
Future work will draw on more sophisticated methods of constructing queries.
3.3.1. Next steps
After generating these queries, we treat them as inputs to the Semantic Scholar API. See the next section for more details.
3.3.2. Citations
The following papers informed our approach: [BK16].
- BK16
Eirik Bakke and David R. Karger. Expressive query construction through direct manipulation of nested relational results. In Proceedings of the 2016 International Conference on Management of Data. ACM, June 2016. URL: https://doi.org/10.1145/2882903.2915210, doi:10.1145/2882903.2915210.
- CK06
Harr Chen and David R. Karger. Less is more: probabilistic models for retrieving fewer relevant documents. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, August 2006. URL: https://doi.org/10.1145/1148170.1148245, doi:10.1145/1148170.1148245.
- CAvZC19
Arman Cohan, Waleed Ammar, Madeleine van Zuylen, and Field Cady. Structural scaffolds for citation intent classification in scientific publications. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2019. URL: https://doi.org/10.18653/v1/n19-1361, doi:10.18653/v1/n19-1361.
- CKPT92
Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey. Scatter/gather: a cluster-based approach to browsing large document collections. In Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '92. ACM Press, 1992. URL: https://doi.org/10.1145/133160.133214, doi:10.1145/133160.133214.
- FCB16
Ethan Fast, Binbin Chen, and Michael S. Bernstein. Empath: understanding topic signals in large-scale text. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM, May 2016. URL: https://doi.org/10.1145/2858036.2858535, doi:10.1145/2858036.2858535.
- JKH+18
David Jurgens, Srijan Kumar, Raine Hoover, Dan McFarland, and Dan Jurafsky. Measuring the evolution of a scientific field through citation frames. Transactions of the Association for Computational Linguistics, 6:391–406, December 2018. URL: https://doi.org/10.1162/tacl_a_00028, doi:10.1162/tacl_a_00028.
- TST06
Simone Teufel, Advaith Siddharthan, and Dan Tidhar. An annotation scheme for citation function. In Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue, 80–87. Sydney, Australia, July 2006. Association for Computational Linguistics. URL: https://aclanthology.org/W06-1312.
- VEHE15
Marco Antonio Valenzuela-Escarcega, Vu A. Ha, and Oren Etzioni. Identifying meaningful citations. In AAAI Workshop: Scholarly Big Data. 2015.