Sarker Lab Emory University

Mining Social Media Big Data for Toxicovigilance: Studying Substance Use via NLP and Machine Learning Methods

According to the World Health Organization, toxicovigilance is the active process of identifying and evaluating the toxic risks existing in a community, and evaluating the measures taken to reduce or eliminate them. Our toxicovigilance research focuses on prescription and illicit drug use/misuse and drug use disorder.

Currently, our work is primarily funded by the National Institute on Drug Abuse (NIDA) of the National Institutes of Health (NIH). This project primarily focuses on characterizing prescription drug misuse/abuse/nonmedical use by mining social media big data. We are (i) building close to real-time monitoring systems so that we can forecast potential future crises, (ii) developing methods to characterize prescription drugs based on their reported abuse/misuse, (iii) studying potential long-term impacts of drug use disorder and the natural history of addiction, and (iv) empowering toxicologists with information mined from social media so that they can take the necessary steps to help people suffering from opioid use disorder.

NIH-specific information about the project can be found HERE. We also received a small amount of funding (for annotation) and data from the PA CURE project.


Our work was featured in Popular Science (with some fair criticism and skepticism) and on The Emory Health Science Blog.

Our paper in JAMA Network Open shows that publicly available Twitter data, geospatial information, temporal information, natural language processing, and applied machine learning can potentially be combined to predict the status of the opioid crisis at a specific place (county and substate) within the U.S.

Our MedInfo-2019 paper discusses effective strategies for collecting opioid-related data from Twitter, using NLP methods to generate common misspellings and supervised machine learning to filter out noise.

Our JAMIA paper reviews the literature on social media mining for prescription medication use/misuse. In it, we propose a simple data-centric framework that is suitable for social media data. A particularly important aspect is filtering out irrelevant information via supervised classification.

Our JMIR paper describes in detail the importance of thorough annotation guidelines. The paper also comes with publicly available data, annotation guidelines, and other resources.
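To make the misspelling idea concrete, here is a minimal sketch of keyword expansion for data collection. It is not the method from the MedInfo-2019 paper, whose details are not given on this page; it assumes a plain edit-distance-one generator, and the function names `edit_distance_one` and `find_mentions` are hypothetical. A generator like this overproduces candidates, so a real collection strategy would need to prune variants that collide with ordinary words.

```python
import string


def edit_distance_one(word, alphabet=string.ascii_lowercase):
    """Return every string one deletion, transposition, substitution,
    or insertion away from `word` (the word itself is excluded)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    variants = deletes | transposes | replaces | inserts
    variants.discard(word)
    return variants


def find_mentions(post, keyword, variants):
    """Return the tokens in `post` that match the keyword or one of its variants."""
    tokens = [t.strip(string.punctuation).lower() for t in post.split()]
    return [t for t in tokens if t == keyword or t in variants]


# Illustrative keyword and post (invented for this sketch, not project data).
keyword = "oxycodone"
variants = edit_distance_one(keyword)
mentions = find_mentions("Took some oxycodon last night.", keyword, variants)
print(mentions)
```

Matching on the expanded variant set retrieves posts that an exact keyword search would miss, at the cost of more noise, which is why a supervised filtering step follows collection.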

NIH Abstract

The epidemic of substance use (SU) and substance use disorder (SUD) in the United States has been evolving for decades. Both prescription and illicit drugs have been involved in overdose deaths over the years, with notable increases in synthetic opioids (e.g., fentanyl and analogs) and psychostimulants (e.g., methamphetamine) in recent years. The emergence of high-potency novel psychoactive substances (NPSs), such as fentanyl analogs, has drastically contributed to rising deaths, and adversely impacted treatment engagement and response. The COVID-19 pandemic has further exacerbated the crisis, and recent studies have also highlighted that substantial disparities exist in SUD treatment, research, interest, and response across different subpopulations, with racial/ethnic minorities being disproportionately impacted. A key element to tackling the crisis is improved surveillance. Specifically, there is a need for establishing novel approaches to provide timely insights about the trends, distributions, and trajectories of the SUD epidemic, as traditional surveillance approaches involve considerable lags. Many recent studies have identified social media (SM) as a useful resource for conducting SU/SUD surveillance. Many people use SM to discuss personal experiences, provide advice, or seek answers to questions regarding SU/SUD, resulting in the generation of an abundance of information. Such information can be characterized, aggregated, and analyzed to obtain population- or subpopulation-level insights, at low cost and in near real-time. However, converting SM data into timely, actionable knowledge is non-trivial since the data is big, complex, and noisy, requiring the development of advanced, automated artificial intelligence methods. Funded by the National Institute on Drug Abuse, our past work focused specifically on prescription medications (PM) and established the most sophisticated SM-based data mining pipeline available to date.
In response to the evolution of the SUD epidemic, the proposed project will extend our capabilities to include illicit substances and develop novel methods to conduct surveillance. Specifically, we will (i) extend our machine learning and natural language processing (NLP) classification pipeline to automatically classify all SU-related chatter from Twitter and Reddit (rather than PMs only), (ii) collect and analyze longitudinal timelines of cohorts self-reporting SU/SUD, (iii) characterize the cohorts in terms of demographic details such as age group, gender identity, race, and geolocation, (iv) develop advanced NLP-driven methods for detecting NPSs and impacts of SU/SUD, (v) study short-term and long-term trends and trajectories of the epidemic, (vi) conduct observational studies on targeted population subsets, including studies focusing on SU and SUD treatment disparities and stigma, and (vii) disseminate developed methodologies via open source code and aggregated findings publicly via a web-based dashboard. Implementation of our data-centric methods and successful execution of the project have the potential to transform SU/SUD surveillance and to complement traditional surveillance measures by providing close to real-time statistics and insights, including those for targeted subpopulations.
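As a rough illustration of the classification step in (i), the sketch below trains a tiny supervised filter that separates relevant substance-use chatter from noise. It assumes a bag-of-words multinomial Naive Bayes model with add-one smoothing, chosen only because it fits in a few lines of standard-library Python; the project's actual classifiers are not described here and are presumably far more advanced. The class name `NaiveBayesFilter`, the labels, and the four training posts are all invented for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


class NaiveBayesFilter:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of posts
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokenize(text):
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy annotated posts (invented; real training data would be manually annotated).
train_texts = [
    "took extra oxycodone to get high",
    "snorted adderall before the party",
    "new documentary about the oxycodone epidemic",
    "pharmacy stock prices fell today",
]
train_labels = ["relevant", "relevant", "noise", "noise"]

clf = NaiveBayesFilter().fit(train_texts, train_labels)
incoming = ["got high on adderall", "documentary about pharmacy prices"]
kept = [post for post in incoming if clf.predict(post) == "relevant"]
print(kept)
```

Only the posts the filter keeps would be passed on to later stages such as cohort timelines or trend analysis, so the quality of the annotated training data largely determines what downstream statistics mean.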

Public Health Relevance Statement

Prescription Medication (PM) abuse is a major epidemic in the United States, and monitoring and studying the characteristics of the PM abuse problem requires the development of novel approaches. Social media encapsulates an abundance of data about PM abuse from different demographics, but extracting that data and converting it to knowledge requires advanced natural language processing and data-centric artificial intelligence systems. Our proposed social media mining framework will automate the process of big data to knowledge conversion for PM abuse, providing crucial insights to toxicologists about targeted populations and enabling the future development of directed intervention strategies.

Funding and Disclosure


To view the data for this project, click on the download link to be redirected to Google Drive.

Download Data
