Assessing Technology’s Ability to Control Covid 19 Using Text Mining
July 21, 2020 - Blogs on Text Analytics

The Covid 19 virus has been with us since late 2019 but only spread into a pandemic in early 2020. That is just seven months ago. Since then there have been thousands of journal articles, news reports, editorials, commentaries, blogs and all other manner of text documents dealing with the virus, its spread and how to contain it. Text mining and content analysis are useful tools for exploring these documents to see how our thinking and processes have evolved over time, what we got right and what we got wrong.

In their recently published paper, Mora, Kummitha, and Esposito (2020) examine how governments and public health agencies around the world have used information and communication technologies (ICTs) to detect and control the spread of the virus, and how sociomaterial arrangements moderate the effectiveness of the technological solutions adopted by public authorities to control the Covid-19 pandemic. The answer to this question is important because, while technology may be very useful, its effectiveness can depend on how it is deployed and who deploys it.

Using single keywords and keyword combinations (AND, OR), the authors retrieved 2,187 pertinent documents published between January and April 2020, from which they were able to identify 39 technological solutions. The documents were then screened to ensure the keywords appeared in the body of each document and not only in headlines or references. This left the authors with 515 documents for text mining.
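The screening step can be pictured with a small sketch. This is not the authors' procedure or tooling; it only illustrates the idea of applying an AND/OR keyword query to the body of a document while ignoring headlines and references. The field names (headline, body, references), the sample records and the keywords are all hypothetical.

```python
def contains_keywords(text, all_of=(), any_of=()):
    """Return True if text contains every term in all_of (AND) and, when
    any_of is non-empty, at least one term in any_of (OR).
    Matching is a simple case-insensitive substring test."""
    lowered = text.lower()
    has_all = all(term.lower() in lowered for term in all_of)
    has_any = not any_of or any(term.lower() in lowered for term in any_of)
    return has_all and has_any


def screen_documents(documents, all_of=(), any_of=()):
    """Keep documents whose body satisfies the query.
    Hits in the headline or references are deliberately ignored."""
    return [d for d in documents if contains_keywords(d["body"], all_of, any_of)]


# Hypothetical sample records; field names and keywords are illustrative only.
sample = [
    {"headline": "Covid-19 contact tracing apps",
     "body": "A report on hospital staffing.",
     "references": ""},
    {"headline": "Public health update",
     "body": "Covid-19 contact tracing relies on smartphone apps.",
     "references": ""},
]

kept = screen_documents(
    sample,
    all_of=["covid-19"],
    any_of=["contact tracing", "thermal camera"],
)
print([d["headline"] for d in kept])
```

In this toy example the first record is dropped because its keywords appear only in the headline, which mirrors the kind of filtering that reduced the 2,187 retrieved documents to 515.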

“In order to apply text mining techniques based on co-occurrence data, the source documents were transformed in Rich Text Format (RTF) files. The RTF files were then processed with the content analysis software WordStat (Version 8.0.21). WordStat transformed the source documents into high-dimensional sets of unstructured textual data and made it possible to semi-automatize both data cleaning and data processing. Through the data cleaning process, the dimensionality of the dataset was reduced while preserving quality information. This process consisted in the removal of unnecessary textual information whose presence would have generated noise. In the case of academic literature, for example, the footers, headers, and details about the authors and their institutions were filtered out. Given their little semantic value, stop words were also removed by relying on WordStat dictionaries. In addition, misspellings were corrected, and variant forms of the same word were lemmatized. Following the data cleaning phase, 13,894 textual items were extracted, which include 5,917 words and 7,977 phrases. Phrases are conceptual units composed of minimum two and maximum four words. After being extracted, WordStat was tasked with measuring the strength of association between each couple of words and phrases by calculating their intra-document co-occurrence. The co-occurrence data was then normalized. In alignment with research by Eck and Waltman (2009), to normalize the data, the probabilistic affinity index Association Strength was preferred to set-theoretic measures.” (pp. 6-7)
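To make the last step of the quoted passage more concrete, here is a minimal sketch of intra-document co-occurrence counting followed by association strength normalization. It assumes the commonly cited form of the index, in which the co-occurrence count of two terms is divided by the product of their individual occurrence counts (some formulations add a constant scaling factor). It is not WordStat's implementation, and the function name and sample terms are hypothetical.

```python
from collections import Counter
from itertools import combinations


def association_strength(docs):
    """docs: list of term lists, one per document (already cleaned).

    Returns {(term_a, term_b): score} where
    score = c_ab / (s_a * s_b),
    c_ab = number of documents containing both terms,
    s_a  = number of documents containing term_a.
    """
    occurrences = Counter()    # s_a: document frequency of each term
    cooccurrences = Counter()  # c_ab: document frequency of each term pair
    for doc in docs:
        terms = sorted(set(doc))  # count each term once per document
        occurrences.update(terms)
        cooccurrences.update(combinations(terms, 2))
    return {
        (a, b): count / (occurrences[a] * occurrences[b])
        for (a, b), count in cooccurrences.items()
    }


# Hypothetical cleaned documents, each reduced to its words and phrases.
docs = [
    ["contact tracing", "app", "testing"],
    ["contact tracing", "app"],
    ["testing", "thermal camera"],
]

scores = association_strength(docs)
for pair, score in sorted(scores.items()):
    print(pair, score)
```

Dividing by the product of the occurrence counts keeps very frequent terms from dominating the results simply because they appear in many documents, which is the motivation for normalizing the raw co-occurrence data.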

The authors identified various ICTs used to detect and combat the spread of Covid 19 and, based on the available literature, assessed the relative effectiveness of each technology and of how it was used. As we now know, some were more effective than others. The guidelines on how and when to use methods such as temperature sensing devices, contact tracing, testing and other tactics are being continually updated as we learn more about the virus and how it spreads. You can read the complete article online.

References

Mora, L., Kummitha, R. K. R., & Esposito, G. (2020). Digital technology deployment and pandemic control: how sociomaterial arrangements and technological determinism undermine virus containment measures. Available at SSRN 3612338.