Abstract

Smoking status classification on sparsely labelled Dutch clinical notes with weak supervision and a pre-trained Transformer model

by Myrthe Reuver (VU Amsterdam)

Natural Language Processing (NLP) on clinical text faces research problems that are difficult to study because labelled data is scarce. This work uses one case study, smoking status detection, to explore weak supervision and the data programming paradigm as potential solutions to that scarcity. Smoking status (whether a patient smokes, does not smoke, or has never smoked) is an important clinical variable to record in a patient's electronic medical record (EMR), as it affects treatment plans. Because of the high documentation load on medical professionals, smoking status is often not recorded as a structured label in the EMR; instead, professionals write about smoking in the free text of clinical notes. Automatically detecting mentions of smoking in these notes, and classifying the smoking status described, would let medical professionals do less manual documentation and thus reduce their documentation load. One option for such automation is supervised machine learning. However, training a supervised model requires large quantities of clinical notes labelled with the patients' smoking status, which we do not have because of that same lack of documentation. We therefore attempt to label unlabelled Dutch-language primary care clinical notes with a weakly supervised approach, so that they can serve as training data for smoking status detection. Specifically, we use the data programming paradigm with SNORKEL to label more training examples. We then fine-tune a pre-trained Dutch-language Transformer on smoking status classification, and compare pipelines with and without SNORKEL. We find that the weakly supervised SNORKEL approach does not considerably improve overall performance compared to a classification pipeline without SNORKEL, but does improve performance on some individual classes.
We conclude that state-of-the-art transfer learning methods can successfully be used for smoking status classification, but that the data programming paradigm does not provide a clear performance gain in this case.
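
To make the data programming idea concrete, the following is a minimal sketch in plain Python, not the pipeline used in this work. In data programming, several noisy heuristic "labeling functions" each vote on a label or abstain, and their votes are combined into one weak label per note. The Dutch keyword rules, the function names, and the class constants below are hypothetical illustrations, and a simple majority vote stands in for the learned label model that a data programming framework would normally fit.

```python
import re
from collections import Counter

# Hypothetical class constants for illustration; ABSTAIN means "no opinion".
ABSTAIN, SMOKES, DOES_NOT_SMOKE, NEVER_SMOKED = -1, 0, 1, 2


def lf_never_smoked(note: str) -> int:
    # "nooit gerookt" = "never smoked" (hypothetical keyword rule)
    return NEVER_SMOKED if "nooit gerookt" in note.lower() else ABSTAIN


def lf_does_not_smoke(note: str) -> int:
    # "rookt niet" = "does not smoke"; "gestopt met roken" = "quit smoking"
    text = note.lower()
    if "rookt niet" in text or "gestopt met roken" in text:
        return DOES_NOT_SMOKE
    return ABSTAIN


def lf_smokes(note: str) -> int:
    # "rookt" = "smokes"; the word boundary avoids matching inside "gerookt"
    text = note.lower()
    if re.search(r"\brookt\b", text) and "rookt niet" not in text:
        return SMOKES
    return ABSTAIN


LABELING_FUNCTIONS = [lf_never_smoked, lf_does_not_smoke, lf_smokes]


def weak_label(note: str) -> int:
    """Combine labeling-function votes by majority; abstain on no votes or ties."""
    votes = [lf(note) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting votes with no majority
    return ranked[0][0]
```

Notes that receive a non-abstain weak label could then be added to the training set for the classifier, while abstained notes stay unlabelled.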
