3.5. Topic Modeling and Sentiment Analysis

We imported the documents available in the Semantic Scholar and NLP4Dev corpora.

We make use of the [NLP4Dev API](https://www.nlp4dev.org/content-api). Depending on whether a paper's contents are available as parsed text or as a PDF, we use either the “Analyze Text” or the “Analyze File” endpoint. In both cases we use the Mallet topic model, which takes the abstract or full text of a paper as input and returns information about the paper's content.

We run the abstracts or PDF text from the Semantic Scholar results through the NLP4Dev API, which returns the following information (a response sketch follows the list):

  • country counts

  • country group

  • JDC tags (a list of phrases defined by NLP4Dev as being related to forced displacement)

  • tag counts

  • dataset ID

  • corpus ID

  • the country mentioned most often
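The response is a JSON object. A representative shape, reconstructed from the fields our code reads (the doc_topic_words, country_details, and jdc_tag_counts keys and the value, name, and count fields appear in our code below; the remaining key names and all values here are illustrative):

resp = {
    "doc_topic_words": [                  # per-topic weights for the document
        {"topic": 39, "value": 0.42},     # "topic" is an assumed key name
    ],
    "country_details": [                  # countries detected in the text
        {"name": "Uganda", "count": 12},
    ],
    "jdc_tag_counts": [                   # JDC tags found in the text
        {"tag": "refugee", "count": 9},   # "tag" is an assumed key name
    ],
}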

The JDC tags include the following:

  • “asylum seeker”

  • “climate refugee”

  • “country of asylum”

  • “exile”

  • “forced displacement”

  • “host community”

  • “internally displaced population”

  • “ocha”

  • “population of concern”

  • “refugee”

  • “refugee camp”

  • “repatriate”

  • “resettlement area”

  • “returnee”

  • “stateless”

  • “unhcr”

import requests
import pandas as pd

MALLET_MODEL_ID = "6fd8b418cbe4af7a1b3d24debfafa1ee"

def NLP4Dev(abstract_or_fulltext, text_source):
    if abstract_or_fulltext == 'PDF':
        # extract output using the "Analyze File" endpoint;
        # text_source is the path to the PDF file
        url = "https://www.nlp4dev.org/nlp/models/mallet/analyze_file"
        payload = {"model_id": MALLET_MODEL_ID}
        with open(text_source, 'rb') as pdf_file:
            files = [('file', (text_source, pdf_file, 'application/pdf'))]
            response = requests.post(url, data=payload, files=files)
    else:
        # 'parsed_PDF' or 'abstract_only': extract output using the
        # "Analyze Text" endpoint; text_source is the extracted full text
        # or the abstract
        url = "https://www.nlp4dev.org/nlp/models/mallet/analyze_text"
        payload = {"model_id": MALLET_MODEL_ID, "text": text_source}
        headers = {'Content-Type': 'application/x-www-form-urlencoded'}
        response = requests.post(url, headers=headers, data=payload)

    resp = response.json()
    topic_distribution = pd.DataFrame(resp.get('doc_topic_words'))
    country_distribution = pd.DataFrame(resp.get('country_details'))
    tag_distribution = pd.DataFrame(resp.get('jdc_tag_counts'))

    # topic 39's share of the document; it sits at position 2 of the
    # returned topic distribution
    topic_39_percentage = abs(topic_distribution['value'].iloc[2])

    # the country mentioned most often, or 'NONE' if none were detected
    if len(country_distribution) > 0:
        majority_country = country_distribution.loc[
            country_distribution['count'].idxmax(), 'name']
    else:
        majority_country = 'NONE'

    jdc_tag_count = len(tag_distribution)

    return (topic_distribution, country_distribution, tag_distribution,
            majority_country, topic_39_percentage, jdc_tag_count)
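A minimal usage sketch on a single abstract (the abstract text here is a placeholder):

# placeholder abstract; in practice the text comes from the Semantic Scholar results
abstract = "This paper studies the livelihoods of refugees and host communities."

(topics, countries, tags,
 majority_country, topic_39_pct, tag_count) = NLP4Dev('abstract_only', abstract)

print(majority_country, topic_39_pct, tag_count)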

[ include a sample of text showing topic distribution ]

We use the outputs from the NLP4Dev API to calculate a “relevance” score for each dataset-paper combination:

def calculate_relevance(ref, dataset, topic_39_percentage, jdc_tag_count,
                        majority_country, relevance_threshold):
    # ref is the dataset reference table, with one row per dataset ID
    dataset_country = ref.loc[ref['id'] == dataset, 'nation'].iloc[0]
    dataset_title = ref.loc[ref['id'] == dataset, 'title'].iloc[0]

    # fully relevant when topic 39 clears the threshold and the paper's
    # majority country matches the dataset's country; partially relevant
    # when only the topic threshold is met; otherwise not relevant
    if (topic_39_percentage > relevance_threshold) and (majority_country == dataset_country):
        relevance = 1
    elif topic_39_percentage > relevance_threshold:
        relevance = 0.5
    else:
        relevance = 0

    final_output = pd.DataFrame({
        "Dataset Name": [dataset_title],
        "Relevance": [relevance],
        "Topic 39 Percentage": [topic_39_percentage],
        "JDC Tag Counts": [jdc_tag_count],
    })
    return final_output
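For illustration, a sketch combining the two steps with a hypothetical reference table and made-up scores (the dataset ID, titles, and threshold are not real):

# hypothetical one-row reference table; in practice ref is loaded from the dataset catalog
ref = pd.DataFrame({
    'id': ['dataset-001'],
    'nation': ['Uganda'],
    'title': ['Uganda Refugee Household Survey'],
})

# illustrative NLP4Dev outputs for one paper
result = calculate_relevance(
    ref, 'dataset-001',
    topic_39_percentage=0.42,
    jdc_tag_count=7,
    majority_country='Uganda',
    relevance_threshold=0.25,
)
print(result)  # Relevance == 1: topic 39 clears the threshold and the countries match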

The outputs of this relevance function are surfaced to users in our tool.

3.5.1. Citations

The following papers informed our approach: [FCB16], [CK06].

BK16

Eirik Bakke and David R. Karger. Expressive query construction through direct manipulation of nested relational results. In Proceedings of the 2016 International Conference on Management of Data. ACM, June 2016. URL: https://doi.org/10.1145/2882903.2915210, doi:10.1145/2882903.2915210.

CK06

Harr Chen and David R. Karger. Less is more: probabilistic models for retrieving fewer relevant documents. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, August 2006. URL: https://doi.org/10.1145/1148170.1148245, doi:10.1145/1148170.1148245.

CAvZC19

Arman Cohan, Waleed Ammar, Madeleine van Zuylen, and Field Cady. Structural scaffolds for citation intent classification in scientific publications. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2019. URL: https://doi.org/10.18653/v1/n19-1361, doi:10.18653/v1/n19-1361.

CKPT92

Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey. Scatter/gather: a cluster-based approach to browsing large document collections. In Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '92. ACM Press, 1992. URL: https://doi.org/10.1145/133160.133214, doi:10.1145/133160.133214.

FCB16

Ethan Fast, Binbin Chen, and Michael S. Bernstein. Empath: understanding topic signals in large-scale text. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems. ACM, May 2016. URL: https://doi.org/10.1145/2858036.2858535, doi:10.1145/2858036.2858535.

JKH+18

David Jurgens, Srijan Kumar, Raine Hoover, Dan McFarland, and Dan Jurafsky. Measuring the evolution of a scientific field through citation frames. Transactions of the Association for Computational Linguistics, 6:391–406, December 2018. URL: https://doi.org/10.1162/tacl_a_00028, doi:10.1162/tacl_a_00028.

TST06

Simone Teufel, Advaith Siddharthan, and Dan Tidhar. An annotation scheme for citation function. In Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue, 80–87. Sydney, Australia, July 2006. Association for Computational Linguistics. URL: https://aclanthology.org/W06-1312.

VEHE15

Marco Antonio Valenzuela-Escarcega, Vu A. Ha, and Oren Etzioni. Identifying meaningful citations. In AAAI Workshop: Scholarly Big Data. 2015.