3.4. Semantic search
We use the Semantic Scholar API. Each query string generated in the previous section is passed to the “Paper Lookup” API endpoint, which we access through the SemanticScholar Python library with the following call:
from semanticscholar import SemanticScholar

# dfq is the DataFrame of query strings built in the previous section
sch = SemanticScholar()
query_type1 = []
for query in dfq["query_type1"]:
    # One search per query string; keep the full result object
    results = sch.search_paper(query)
    query_type1.append(results)
The output is an object that includes, among other fields, the following information:
Dataset ID
Paper IDs (Semantic Scholar’s internal ID, as well as external IDs such as DOI, arXiv, PubMed, etc. where available)
Title
Authors
Publication venue
Publication year
Field of Study
Count of inbound and outbound citations
URL
Abstract
PDF availability (indicates whether the full PDF is included in the corpus, and whether the abstract, body, and bibliographic entries are available in parsed format)
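As a rough illustration of the fields listed above, the sketch below flattens one paper record into a single row. It assumes the record is a plain dictionary keyed by Semantic Scholar Graph API field names (`paperId`, `externalIds`, `citationCount`, `referenceCount`, and so on). The helper name `flatten_paper`, the `dataset_id` argument, and the sample values are hypothetical and are not part of our pipeline.

```python
from typing import Any, Dict


def flatten_paper(dataset_id: str, paper: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one paper record (a plain dict) into a single row."""
    # Either field can be missing or None, so fall back to empty containers.
    external_ids = paper.get("externalIds") or {}
    authors = paper.get("authors") or []
    return {
        "dataset_id": dataset_id,
        "paper_id": paper.get("paperId"),
        "doi": external_ids.get("DOI"),
        "arxiv_id": external_ids.get("ArXiv"),
        "title": paper.get("title"),
        "authors": [author.get("name") for author in authors],
        "venue": paper.get("venue"),
        "year": paper.get("year"),
        "fields_of_study": paper.get("fieldsOfStudy") or [],
        "inbound_citations": paper.get("citationCount"),
        "outbound_citations": paper.get("referenceCount"),
        "url": paper.get("url"),
        "abstract": paper.get("abstract"),
    }


# Made-up record, for illustration only.
sample = {
    "paperId": "example-paper-id",
    "externalIds": {"DOI": "10.0000/example", "CorpusId": 12345},
    "title": "An example title",
    "authors": [{"name": "A. Author"}, {"name": "B. Author"}],
    "venue": "Example Venue",
    "year": 2020,
    "fieldsOfStudy": ["Computer Science"],
    "citationCount": 7,
    "referenceCount": 30,
    "url": "https://example.org/paper",
    "abstract": "An example abstract.",
}
row = flatten_paper("dataset-001", sample)
```

Missing keys come back as `None` (or an empty list), so a record without an abstract or external IDs still yields a complete row.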
We extract the abstracts from this output, giving us one abstract for each dataset-query-paper combination where an abstract is available. The abstract, together with the PDF-availability indicator, serves as input to our topic modeling procedure (see the next section).
import pandas as pd
from semanticscholar import SemanticScholar

def semantic_search(query: str) -> pd.DataFrame:
    """Take a query (from the query_finder function) and return a DataFrame of abstracts and corpus IDs."""
    sch = SemanticScholar()
    results = sch.search_paper(query)
    # Metadata
    # TODO: add title
    abstracts = [item.abstract for item in results]
    # externalIds can be missing for some papers, so fall back to an empty dict
    identifiers = [item.externalIds or {} for item in results]
    corpus_ids = [ids.get("CorpusId") for ids in identifiers]  # other identifiers could be returned here
    return pd.DataFrame(list(zip(abstracts, corpus_ids)), columns=["abstract", "corpus_id"])
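The function above keeps a row even when a paper has no abstract. One possible way to restrict the output to dataset-query-paper combinations where an abstract is available is sketched below, in plain Python so that it runs without network access. The input layout (a mapping from a dataset-query pair to a list of paper dictionaries) and the names `collect_abstracts` and `has_pdf` are assumptions made for illustration, not the structure returned by the API.

```python
from typing import Any, Dict, Iterable, List, Tuple


def collect_abstracts(
    results: Dict[Tuple[str, str], Iterable[Dict[str, Any]]],
) -> List[Dict[str, Any]]:
    """Return one row per (dataset, query, paper) with a non-empty abstract."""
    rows = []
    for (dataset_id, query), papers in results.items():
        for paper in papers:
            abstract = (paper.get("abstract") or "").strip()
            if not abstract:
                continue  # skip papers without a usable abstract
            rows.append({
                "dataset_id": dataset_id,
                "query": query,
                "corpus_id": (paper.get("externalIds") or {}).get("CorpusId"),
                "abstract": abstract,
                # Hypothetical flag standing in for the PDF-availability indicator
                "has_pdf": bool(paper.get("has_pdf", False)),
            })
    return rows


# Made-up input, for illustration only.
example = {
    ("dataset-001", "example query"): [
        {"externalIds": {"CorpusId": 1}, "abstract": "An example abstract.", "has_pdf": True},
        {"externalIds": {"CorpusId": 2}, "abstract": None},
        {"externalIds": None, "abstract": "   "},
    ],
}
rows = collect_abstracts(example)
```

Only the first record survives here: the second has no abstract and the third contains only whitespace.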
Where available, we also parse PDF content. When we parse a PDF, the output is an object which includes (among others) the following information:
Paper ID
PDF hash
Abstract (if available)
Body text (if available)
Bibliographic entries
3.4.1. Citations
The following papers informed our approach: [FCB16], [CK06].
- BK16
Eirik Bakke and David R. Karger. Expressive query construction through direct manipulation of nested relational results. In Proceedings of the 2016 International Conference on Management of Data. ACM, June 2016. URL: https://doi.org/10.1145/2882903.2915210, doi:10.1145/2882903.2915210.
- CK06
Harr Chen and David R. Karger. Less is more: probabilistic models for retrieving fewer relevant documents. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, August 2006. URL: https://doi.org/10.1145/1148170.1148245, doi:10.1145/1148170.1148245.
- CAvZC19
Arman Cohan, Waleed Ammar, Madeleine van Zuylen, and Field Cady. Structural scaffolds for citation intent classification in scientific publications. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2019. URL: https://doi.org/10.18653/v1/n19-1361, doi:10.18653/v1/n19-1361.
- CKPT92
Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey. Scatter/gather: a cluster-based approach to browsing large document collections. In Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '92. ACM Press, 1992. URL: https://doi.org/10.1145/133160.133214, doi:10.1145/133160.133214.
- FCB16
Ethan Fast, Binbin Chen, and Michael S. Bernstein. Empath: understanding topic signals in large-scale text. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM, May 2016. URL: https://doi.org/10.1145/2858036.2858535, doi:10.1145/2858036.2858535.
- JKH+18
David Jurgens, Srijan Kumar, Raine Hoover, Dan McFarland, and Dan Jurafsky. Measuring the evolution of a scientific field through citation frames. Transactions of the Association for Computational Linguistics, 6:391–406, December 2018. URL: https://doi.org/10.1162/tacl_a_00028, doi:10.1162/tacl_a_00028.
- TST06
Simone Teufel, Advaith Siddharthan, and Dan Tidhar. An annotation scheme for citation function. In Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue, 80–87. Sydney, Australia, July 2006. Association for Computational Linguistics. URL: https://aclanthology.org/W06-1312.
- VEHE15
Marco Antonio Valenzuela-Escarcega, Vu A. Ha, and Oren Etzioni. Identifying meaningful citations. In AAAI Workshop: Scholarly Big Data. 2015.