Social media plays a vital role in bringing research findings to a wider audience, including other researchers and the general public. But social media engagement is complex. How do we go about assessing the quality of engagement that surrounds a research article?
That’s what Michael Johns, a recent master’s graduate at the Harvard Extension School, set out to investigate in his thesis, in collaboration with Elsevier.
With an established career in technology plus an Electrical Engineering degree from West Point, Michael was well placed to carry out this particular research as part of his ALM IT Software Engineering coursework at Harvard. He completed his thesis alongside his day job as Field Engineering Manager for software firm Databricks, where he has worked since 2017.
Originally inspired by the spread of fake news, his research project evolved through several iterations and was completely reshaped by the arrival of the COVID-19 pandemic. The health crisis inevitably accelerated research in related areas, with many preprint articles rushed to publication stage, some without the necessary quality controls. Expanding on this topic of potential misinformation led Michael to concentrate on the dissemination of COVID research via Twitter.
Michael found that honing his research question through the constantly changing lens of the COVID crisis was the most challenging part of his project. “I pivoted multiple times to get there, but once I locked in that path, it was fairly clear,” he said, adding that as his industry advisor, Elsevier Labs Director Dr Ron Daniel guided him through that process.
Ron noted that “Michael did a great job in dealing with the balancing act of a master’s degree thesis – dealing with the changes in a topic from an original idea due to data that doesn’t exist, taking advantage of new opportunities from data that does exist, and getting to completion in limited time.”
Michael’s eventual thesis – Distributed Graph Techniques to Quantify Social Media Engagement of Covid-19 Scientific Literature through Incremental Tweet Chain Measurements – set out to quantify social media engagement by applying a novel approach, built on existing data and graph techniques, to more than 8 million tweets about COVID-19 papers. Engagement was scored using several measures, including the number of unique users reached, the number of papers each user posted about, and the number and length of tweet chains about each paper.
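To make those measures concrete, here is a minimal Python sketch of how such raw counts could be derived from tweet records. It is an illustration only, not the method used in the thesis: the record layout, the sample data, the function name `engagement_measures`, and the definition of a "chain" as a reply path of at least two tweets are all assumptions made for this example, and the thesis itself ran distributed graph techniques over millions of tweets rather than an in-memory loop.

```python
from collections import defaultdict

# Hypothetical toy records: (tweet_id, user, paper_id, parent_tweet_id or None).
# Real data would come from a Twitter dataset; these values are invented.
tweets = [
    ("t1", "alice", "paperA", None),
    ("t2", "bob",   "paperA", "t1"),
    ("t3", "carol", "paperA", "t2"),
    ("t4", "alice", "paperB", None),
    ("t5", "dave",  "paperB", "t4"),
    ("t6", "bob",   "paperB", None),
]


def engagement_measures(tweets):
    """Return per-paper and per-user counts for a list of tweet records."""
    parent = {tid: par for tid, _, _, par in tweets}
    paper_of = {tid: paper for tid, _, paper, _ in tweets}
    # Tweets that at least one other tweet replies to (i.e. not leaves).
    replied_to = {par for par in parent.values() if par is not None}

    users_per_paper = defaultdict(set)
    papers_per_user = defaultdict(set)
    for _, user, paper, _ in tweets:
        users_per_paper[paper].add(user)
        papers_per_user[user].add(paper)

    # Assumption: a "chain" is the path from a leaf tweet back to its root,
    # counted only when it contains at least two tweets.
    chain_lengths = defaultdict(list)
    for tid in parent:
        if tid in replied_to:
            continue
        length, node = 1, tid
        while parent.get(node) is not None:
            node = parent[node]
            length += 1
        if length > 1:
            chain_lengths[paper_of[tid]].append(length)

    per_paper = {
        paper: {
            "unique_users": len(users),
            "chain_count": len(chain_lengths[paper]),
            "longest_chain": max(chain_lengths[paper], default=0),
        }
        for paper, users in users_per_paper.items()
    }
    per_user = {user: len(papers) for user, papers in papers_per_user.items()}
    return per_paper, per_user


per_paper, per_user = engagement_measures(tweets)
print(per_paper)
print(per_user)
```

Counts like these are only inputs. How they are weighted and combined into a single score is a separate design decision that this sketch does not attempt to reproduce.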
To carry out his research, Michael used data from Scopus, PlumX Metrics and CORD-19 (an open dataset of COVID-19 articles), along with computational resources from Elsevier’s ICSR Lab. Launched in 2020 by Elsevier’s International Center for the Study of Research, ICSR Lab is a data analytics platform that enables researchers to access Elsevier’s datasets in the cloud at no cost. ICSR Lab has already supported more than 30 collaborative research projects.
With a focus on distinguishing high-quality from low-quality engagement on social media, Michael’s project resulted in the creation of two new metrics: the Social Media Engagement (SME) score and the Social Media Noise (SMN) score. These formulas can help to quantify more complex engagement behavior, going beyond the usual simple counts or PageRank-style metrics. For example, the algorithms can identify behavior that correlates with spammers, bots and self-boosting Twitter accounts, helping to flag those that exhibit “noise.” Michael hopes these SME and SMN scores will provide a baseline for a new era of altmetrics that expands on traditional citation-based metrics:
It would be interesting for follow-on research to compare existing metrics and these newly proposed social media metrics on a corpus of papers to see if they correlate. You might find a divergence where there’s a set of what is tagged as high using existing metric scores, but when you put the new quality score on, they’re not as high. I think it would be beneficial to take a multifaceted look at what is truly driving the impact of a paper.
Another key finding from Michael’s research was that articles included in the abstract and citation database Scopus saw longer social media engagement than those in the CORD-19 dataset. While more research would be needed to determine the precise cause, Michael suggests that this could be the result of peer-reviewed articles in quality journals having more ongoing interest than preprints that have not yet been through the review process.
While Michael’s research focused on a specific set of COVID-related papers, the experiments could be adapted for any social media data relating to scientific research and beyond. “I’m hoping others will pick up the research, especially within Elsevier, as it fits in around the altmetrics work they already do and the services that they could enable,” Michael said.
As Michael’s industry advisor, Ron agrees there is a need for this insight:
The quality of scientific publications has been sustained by the peer review process, which has historically been slow-moving. Social media provide a much faster way of alerting about new publications and commenting on them. However, bots, spammers and self-promotion are very simple to perform in that space.
Michael’s work is a step towards filtering out that noise and determining early leading indicators of the quality of a preprint or new article.
Michael’s thesis was a complicated study, but he benefited from the support of the Elsevier team while sharing his own expertise. He was also assisted by Kristy James, a data scientist working on the ICSR Lab. She explained:
When Michael’s project came to us, we were excited to be able to support his work in ICSR Lab, though it was by far the most technically challenging project that had applied to us so far. He wanted to work with the citation graph, network science libraries, detailed visualizations, as well as a large Twitter dataset that we didn’t provide by default on the platform.
Luckily, with Michael’s unique background and knowledge of Databricks, he was able to help advise our team on how to best set up the new processes and systems for an analysis of this scale, all while respecting our needs around how we run ICSR Lab for other projects.
Michael said the support he received helped him to achieve his project’s objectives:
It was a great experience working with people at Elsevier. They were very helpful and interested in figuring out how to remove obstacles in my way. I really couldn't have asked for a better experience. Partnering with Elsevier significantly enriched the quality of my thesis.
Dr Hongming Wang, Senior Research Advisor at Harvard Extension School and Michael’s thesis director in software engineering and data science, commented on the elements that made this research collaboration successful:
Michael built upon his expertise in software engineering, especially those related to big data, artificial intelligence, and natural language processing, and explored a research problem with impact in both global health and social media.
Innovation often occurs at the crossroads of multiple disciplines, and our partnership with Elsevier through the Harvard Data Science Initiative has been critical for the success of Michael’s research. We’re fortunate to have worked with Anita DeWaard [VP of Research Collaborations at Elsevier] and Jennifer Pamphile [Collaboration Manager at Elsevier] and other colleagues from Elsevier, who were instrumental in making this happen. As we explore more societal challenges through the lens of data science and artificial intelligence, I believe our collaboration will continue to make a real difference and contribute to a better understanding of these kinds of problems.
Michael’s research findings highlight the importance of taking a data-led approach to investigating how research is disseminated on social media, looking beyond the traditional metrics. This has proven especially important during the COVID pandemic, when the need to understand how rapidly evolving scientific information is shared has become clear. As Michael noted:
I think what we’ve witnessed over the last 12 to 18 months is a network effect where the collective scientific community now propagates and absorbs COVID-related papers better, having iteratively learned by necessity to be more efficient in getting quality data and findings out there. We’re all becoming better data citizens and consumers of information, though somewhat over optimized understandably for the current pandemic.
I hope my research will help those working on research tools and services to figure out the right way to more systematically expose higher quality work, while dampening the reach of lower quality or misleading work. With this type of focus we will be prepared to triumph more readily over the next major health crisis.
The way Michael’s research question evolved from its starting point has also highlighted the importance of adaptability when it comes to supporting complex data-based research. As Kristy explained:
This project has taught me that we need to be agile in the way we imagine researchers and students interacting with our products and databases – and build in the flexibility for them to ask questions that we do not expect.