Semantic Textual Similarity Assessment in Chest X-ray Reports Using a Domain-Specific Cosine-Based Metric (2024)

Sayeh GHOLIPOUR PICHA
Univ. Grenoble Alpes,
CNRS, Grenoble INP, GIPSA-lab,
38000 Grenoble, France
sayeh.gholipour-picha@grenoble-inp.fr
Dawood AL CHANTI
Univ. Grenoble Alpes,
CNRS, Grenoble INP, GIPSA-lab,
38000 Grenoble, France
dawood.al-chanti@grenoble-inp.fr
Alice CAPLIER
Univ. Grenoble Alpes,
CNRS, Grenoble INP, GIPSA-lab,
38000 Grenoble, France
alice.caplier@grenoble-inp.fr

(February, 2024)

Abstract

Medical language processing and deep learning techniques have emerged as critical tools for improving healthcare, particularly in the analysis of medical imaging and medical text data. These multimodal data fusion techniques help to improve the interpretation of medical imaging and lead to increased diagnostic accuracy, informed clinical decisions, and improved patient outcomes. The success of these models relies on the ability to extract and consolidate semantic information from clinical text. This paper addresses the need for more robust methods to evaluate the semantic content of medical reports. Conventional natural language processing approaches and metrics were designed to capture semantic context in the general-language domain and in machine translation, and they often fail to capture the complex semantic meaning inherent in medical content. In this study, we introduce a novel approach designed specifically for assessing the semantic similarity between generated medical reports and the ground truth. Our approach is validated, demonstrating its efficiency in assessing domain-specific semantic similarity within medical contexts. By applying our metric to state-of-the-art Chest X-ray report generation models, we obtain results that not only align with conventional metrics but also provide more contextually meaningful scores in the considered medical domain.

Keywords Semantic Similarity, Medical Language Processing, Biomedical Metric

1 INTRODUCTION

Advancements in deep learning for medical language processing have significantly improved healthcare clinical analysis, particularly in the domain of medical imaging applications. Notably, there has been substantial progress in generating chest X-ray reports comparable to those written by radiologists. However, a critical challenge persists for chest X-ray applications: assessing the semantic similarity between generated reports and the ground truth.

Identifying semantic similarities in medical texts is a difficult task within the language processing domain Alam et al. (2020). This task necessitates a comprehensive grasp of the entire medical text corpus, the ability to recognize key content, and a profound understanding of the semantic relationships between these critical keywords at an expert level. While existing metrics and approaches for capturing semantic similarity in natural language are effective, they are not designed for the complexities of medical content. The need for a robust metric to assess semantic similarity in medical texts has become increasingly evident, particularly in applications like chest X-ray report generation, and continues to be an active area of research Endo et al. (2021), Miura et al. (2021), Yu et al. (2022).

State-of-the-art chest X-ray report generation models Chen et al. (2020), Miura et al. (2021), Endo et al. (2021) still rely on conventional Natural Language Processing (NLP) metrics such as BLEU Papineni et al. (2002), METEOR Banerjee and Lavie (2005), and ROUGE Lin (2004) to evaluate the generated reports against ground truth references. However, these metrics produce unreliable results due to their inability to comprehend and compare the semantic similarity of key medical terms. A medical semantic similarity metric would not only provide more meaningful evaluation scores but could also be incorporated into the training process to improve model performance, potentially leading to enhanced diagnostic accuracy and decision-making. Additionally, as part of our ongoing research, our goal is to provide visual interpretations of chest X-ray reports using text-to-image localization. Consequently, a robust semantic similarity evaluation metric suitable for medical content will ensure the reliability of generated reports and will enable more accurate localization and interpretation of image content.

In this context, we propose a new metric designed to assess and score the semantic similarity of medical texts. Our metric consists of two sequential steps: first, we identify the primary clinical entities, and subsequently, we evaluate the similarity between these entities using a domain-specific cosine similarity score. Notably, our approach accounts for the negations and detailed descriptions associated with medical entities during the evaluation process. To this end, we contribute a clinical entity extraction procedure that retains the negations and descriptive terms attached to the primary medical entities, a domain-specific cosine-based similarity score (MCSE) computed over these entities, and a validation of both steps on chest X-ray reports together with an application of the metric to state-of-the-art report generation models.

This paper is structured as follows: Section 2 discusses related works; Section 3 presents the theoretical and mathematical part of the novel metric; Section 4 validates the metric; Section 5 discusses the results; Finally, Section 6 concludes the paper.

2 RELATED WORKS

Recent studies have addressed the challenge of similarity evaluation between generated medical reports and the ground truth through various approaches other than conventional NLP metrics. Researchers have often introduced innovative metrics in the process.

In the CXR-RePaiR model, Endo et al. (2021) propose an approach for automatically evaluating chest X-ray report generation by introducing the CheXbert vector similarity metric, which uses the CheXbert labeler Smit et al. (2020), a specialized tool for chest X-ray report labeling. The process involves extracting labels from generated reports, comparing them with ground truth labels, and computing the final score using cosine similarity. While this approach outperforms the BLEU metric, its applicability is limited to the specific context of chest X-ray reports and does not readily extend to other medical applications. The limitation arises from CheXbert being exclusively trained on chest X-ray reports. Moreover, the CheXpert labels Irvin et al. (2019) (Atelectasis, Cardiomegaly, Consolidation, Edema, Enlarged Cardiomediastinum, Fracture, Lung Lesion, Lung Opacity, No Finding, Pleural Effusion, Pleural Other, Pneumonia, Pneumothorax) are specific to chest X-ray datasets, further limiting the generalizability of the approach to other medical contexts.

In a separate study, Yu et al. (2022) introduced a metric that quantifies the overlap of clinical entities between ground truth and generated reports in chest X-ray report generation. They use the RadGraph model Jain et al. (2021), a language model trained on a limited subset of reports from the MIMIC-CXR dataset Johnson et al. (2019). The MIMIC-CXR dataset consists of chest X-ray images with corresponding reports, and the RadGraph dataset contains medical entities from chest X-ray reports annotated by radiologists. The approach by Yu et al. is similar to the BLEU score in that it exclusively considers exact matches among the primary entities in generated and ground truth reports, overlooking the semantic similarity of these entities. Furthermore, the generalizability of this approach to other medical applications is constrained by the RadGraph model's specialization in extracting only chest X-ray related entities. Moreover, while the RadGraph model acknowledges negations in the texts, they are treated merely as labels attached to the entities, and detailed entity descriptions are not factored into the evaluation process.

In a recent study, Patricoski et al. (2022) evaluated seven BERT models for assessing semantic similarity in clinical trial texts. Notably, the pretrained SciBERT model Beltagy et al. (2019) outperformed the other BERT models, including the standard BERT model, which ranked second in this evaluation. This study underlines the promising potential of BERT models for semantic similarity evaluation. However, it has a drawback associated with using BERT models without preprocessing: BERT models operate at a token-by-token level, evaluating semantic similarity by comparing all tokens with each other, a computationally intensive process that yields relatively low scores. Despite this computational challenge, SciBERT remains attractive, particularly because of its extensive clinical vocabulary. This finding underscores the need for careful consideration of preprocessing strategies to maximize the effectiveness of BERT models in semantic similarity evaluations.

Notably, the absence of a comprehensive, general semantic similarity evaluation metric for medical content persists. Consequently, we introduce a novel metric for Medical Corpus Similarity Evaluation (MCSE) to comprehensively address and resolve these challenges.

3 METHODOLOGY

We developed a novel metric for Medical Corpus Similarity Evaluation (MCSE) by exclusively extracting key medical entities and employing a pretrained BERT model to assess the semantic similarity of these entities within chest X-ray reports. This targeted approach allows BERT to concentrate solely on the important information and reduces the computational load during comparison. Importantly, our methodology goes beyond extracting the main entities: we also consider the negations and detailed descriptions associated with the primary medical entities in chest X-ray reports. Our MCSE metric consists of two essential steps:

  1. Clinical Entity Extraction.

  2. Domain Similarity Evaluation.

3.1 Clinical Entity Extraction

Semantic similarity evaluation in text hinges on identifying the key elements, often referred to as clinical entities, within medical texts. These entities typically fall into categories related to anatomical body parts, symptoms, laboratory equipment, and diagnoses. Each category is typically signaled by certain words within a sentence. However, additional words precede or follow these main entities and provide descriptions of them.

To address these complexities, we employ the ScispaCy model Neumann et al. (2019) for extracting primary clinical entities from medical text, using the clinical dictionary embedded in this model (BC5CDR: a corpus comprising 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases, and 3116 chemical-disease interactions Li et al. (2016)). Subsequently, we automatically process the entire text to identify the negations and adjectives associated with these key entities. These elements are then integrated to provide a comprehensive representation of the considered text. In the context of this research, the category of laboratory equipment is deliberately excluded, in line with the specific focus of our application. Table 1 presents an example of a medical text and the entities extracted using our method and using the ScispaCy model without any cleaning process; a sketch of such a pipeline is given after the table. While we employ the ScispaCy model for initial entity extraction, this model alone does not suffice, and an additional automated post-processing step is needed to refine and integrate related entities. The post-processing steps involve eliminating standalone adjectives and non-medical entities, excluding entities categorized as lab equipment, attaching the relevant adjectives to the remaining medical entities, incorporating any associated negation into these primary entities, and screening out terms that refer to diagnostic procedures. These processes ensure that the final output is presented as a cohesive set of primary medical entities, ready for practical use.

Table 1: Example of a medical text and the entities extracted with our method versus the raw ScispaCy output Neumann et al. (2019).

Medical text: 1. Interval clearance of left basilar consolidation. 2. Patchy right basilar opacities, which could be seen with minor atelectasis, but given the context clinical correlation is suggested regarding any possibility for recurrent or new aspiration pneumonitis at the right lung base. 3. Increased new interstitial abnormality, suggesting recurrence of fluid overload or mild-to-moderate pulmonary edema; aspiration could also be considered. Inflammation associated with atypical infectious process is probably less likely given the waxing and waning presentation.

Extracted entities (our method): fluid overload, inflammation, aspiration pneumonitis, minor atelectasis, mild to moderate pulmonary edema, left basilar consolidation, patchy right basilar opacities, interstitial abnormality

Extracted entities (ScispaCy): Interval, clearance, left basilar, consolidation, Patchy, right basilar, opacities, minor, atelectasis, clinical, recurrent, aspiration, pneumonitis, right lung base, Increased, interstitial abnormality, recurrence, fluid, overload, mild-to-moderate pulmonary edema, aspiration, Inflammation, associated with, atypical, infectious process, waxing, waning, presentation
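To make the extraction step concrete, below is a minimal sketch of such a pipeline, assuming the ScispaCy en_ner_bc5cdr_md model is installed; the post-processing heuristics shown here (attaching adjectival modifiers and negation cues to each detected entity) are illustrative simplifications, not the exact rules of our implementation.

```python
# A minimal sketch of the clinical entity extraction step, assuming the
# ScispaCy en_ner_bc5cdr_md model (trained on BC5CDR) is installed.
# The post-processing heuristics below are illustrative simplifications.
import spacy

nlp = spacy.load("en_ner_bc5cdr_md")  # ScispaCy NER model over BC5CDR

NEGATION_CUES = {"no", "not", "without"}

def extract_entities(text: str) -> list[str]:
    doc = nlp(text)
    entities = []
    for ent in doc.ents:
        # Collect adjectival modifiers of the entity head that sit outside
        # the detected span (e.g. "minor" for "atelectasis").
        modifiers = [tok.text.lower() for tok in ent.root.children
                     if tok.dep_ == "amod" and tok.i < ent.start]
        # Flag the entity as negated if a negation cue governs its head.
        negated = any(tok.dep_ == "neg" or tok.text.lower() in NEGATION_CUES
                      for tok in ent.root.children)
        phrase = " ".join(modifiers + [ent.text.lower()])
        entities.append("no " + phrase if negated else phrase)
    return entities

print(extract_entities(
    "Patchy right basilar opacities, which could be seen with minor atelectasis."))
```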

3.2 Domain Similarity Evaluation

Having extracted the primary entities from the medical corpus, the next step is to assess their semantic similarity and assign corresponding scores.

After the entity extraction, we compute a similarity score over the sequences of entities. Let $T = (t_1, \dots, t_N)$ denote the reference text entities and $\hat{T} = (\hat{t}_1, \dots, \hat{t}_M)$ the candidate (generated) text entities. First, we identify the entities that appear identically in both sequences and count them ($|C^{(i)}|$). For the remaining entities, we construct a similarity matrix whose elements are the pairwise similarity scores between entities, as illustrated in Table 2.

$$S_i = \frac{\max_j y_{i,j}}{\max_j y_{i,j} + \overline{y_{i,j}}}, \qquad i = 0, 1, \dots, M, \quad j = 0, 1, \dots, N \qquad (1)$$

$$y_{i,j} = \mathrm{Similarity}(r_i, \hat{r}_j) \qquad (2)$$

$$\begin{cases} C^{(i)} = t_i, & \text{if } t_i = \hat{t}_j \\ r_i = t_i \ \text{and}\ \hat{r}_j = \hat{t}_j, & \text{if } t_i \neq \hat{t}_j \end{cases} \qquad (3)$$

where $M$ is the total number of candidate entities, $r_i$ and $\hat{r}_j$ are the unmatched entities as defined in equation (3), and $S_i$ is a normalized similarity score between $r_i$ and $\hat{r}_j$. The similarity score $\mathrm{Similarity}(r_i, \hat{r}_j)$ in equation (2) is derived from spaCy Honnibal et al. (2020), using its pretrained word2vec-style embeddings to evaluate a domain cosine similarity.

To evaluate the similarity of the candidate entities with the reference entities, we take the maximum score in each column and normalize it by the column average, yielding $S_i$. We then sum these column scores and add the count of exact matches $|C^{(i)}|$. The final similarity score between the two corpora is obtained by dividing this sum by the total number of candidate entities, as formalized in equation (4).

$$MCSE := \frac{|C^{(i)}| + \sum_{i=1}^{M} S_i}{M} \qquad (4)$$

where $|C^{(i)}|$ is the number of exactly matched entities between the two corpora $T$ and $\hat{T}$.
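As a worked illustration of equations (1)-(4), the following sketch computes the MCSE score from two lists of extracted entities. The general-purpose spaCy model with word vectors used here is a stand-in for the similarity backbone and is an assumption of this sketch, as is the handling of empty columns.

```python
# A worked sketch of equations (1)-(4): count exact entity matches, score the
# remaining candidate entities against the remaining reference entities, and
# normalize each column maximum by (max + column mean). The spaCy model below
# stands in for the paper's similarity backbone.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")  # any spaCy model with word vectors

def mcse(reference_entities: list[str], candidate_entities: list[str]) -> float:
    matched = set(reference_entities) & set(candidate_entities)      # C^(i)
    ref_rest = [e for e in reference_entities if e not in matched]   # r_i
    cand_rest = [e for e in candidate_entities if e not in matched]  # r^_j

    column_scores = []
    for cand in cand_rest:
        sims = np.array([nlp(ref).similarity(nlp(cand)) for ref in ref_rest])
        if sims.size:
            column_scores.append(sims.max() / (sims.max() + sims.mean()))  # Eq. (1)
    # Eq. (4): exact matches plus normalized column scores, over M candidates.
    return (len(matched) + sum(column_scores)) / len(candidate_entities)

reference = ["fluid overload", "mild to moderate pulmonary edema",
             "left basilar consolidation", "interstitial abnormality"]
candidate = ["pulmonary masses", "right middle lobe", "hilar adenopathy"]
print(round(mcse(reference, candidate), 3))
```

As a sanity check against Table 2 below, applying equation (1) to its first column gives 0.78 / (0.78 + 0.64) ≈ 0.548, and with no exact matches, equation (4) averages the three column scores to (0.548 + 0.563 + 0.545) / 3 ≈ 0.55, the score discussed after the table.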

Table 2 provides an example of the similarity scores that two sets of entities can receive; these entities were extracted using our medical entity extraction procedure.

In the table, the two corpora received a score of 0.55 according to our MCSE metric, whereas their BLEU score is approximately zero. Upon analyzing the two medical texts, it becomes evident that although the candidate text refers to the same side of the chest as the reference text, and both texts indicate pulmonary abnormalities (pulmonary edema and pulmonary masses), their overall similarity is relatively limited. The score of 0.55 therefore carries more meaningful information in this context than the nearly zero score produced by BLEU.

Reference: 1. Interval clearance of left basilar consolidation. 2. Patchy right basilar opacities, which could be seen with minor atelectasis, but given the context clinical correlation is suggested regarding any possibility for recurrent or new aspiration pneumonitis at the right lung base. 3. Increased new interstitial abnormality, suggesting recurrence of fluid overload or mild-to-moderate pulmonary edema; aspiration could also be considered. Inflammation associated with atypical infectious process is probably less likely given the waxing and waning presentation.
Candidate: Stable multiple bilateral pulmonary masses and right middle lobe collapse due to hilar adenopathy.
Table 2: Similarity matrix between the reference entities (rows) and candidate entities (columns), with the per-column scores $S_i$ in the last row.

| Reference \ Candidate | pulmonary masses | right middle lobe | hilar adenopathy |
|---|---|---|---|
| fluid overload | 0.61 | 0.49 | 0.45 |
| inflammation | 0.64 | 0.48 | 0.55 |
| aspiration pneumonitis | 0.65 | 0.39 | 0.50 |
| minor atelectasis | 0.62 | 0.47 | 0.53 |
| mild to moderate pulmonary edema | 0.78 | 0.31 | 0.51 |
| left basilar consolidation | 0.52 | 0.66 | 0.32 |
| patchy right basilar opacities | 0.64 | 0.66 | 0.49 |
| interstitial abnormality | 0.69 | 0.63 | 0.59 |
| $S_i$ | 0.548 | 0.563 | 0.545 |

4 VALIDATION

While the underlying logic of this metric is reasonable, it is imperative that we validate the results robustly. Given the use of chest X-ray reports for this particular application, we have conducted an extensive search within existing datasets to identify an appropriate validation method. After a comprehensive review of various datasets, we concluded that it would be more effective to conduct separate validations for the different steps of the proposed metric.

4.1 Clinical Entity Extraction Process

In order to rigorously validate our clinical entity extraction process, we employ the RadGraph dataset Jain et al. (2021). This dataset is a valuable resource in which radiologists thoroughly annotated the primary clinical entities in chest X-ray reports as either "definitely present" within the report or "definitely absent". Importantly, in cases where a negation is associated with a particular entity, it is annotated as "definitely absent".

To achieve our validation objectives, we run our entity extraction process on the reports in this dataset. We then compare the entities extracted by our method with the annotations provided by radiologists, focusing on the two categories "definitely present" and "definitely absent". This systematic comparison allows us to assess the accuracy and effectiveness of our clinical entity extraction methodology on chest X-ray reports, in line with radiological standards. Across all reports in our study, our method consistently achieves a high level of accuracy: on average, it recognizes 75% of entities marked as "definitely present" and 76% of entities labeled as "definitely absent". In our entity extraction process, we deliberately omit anatomical entities such as "chest" or "lung", as they are ubiquitous in the chest X-ray setting and do not contribute significantly to the evaluation. This selective exclusion is one of the factors behind the approximately 75% accuracy of our results. Nevertheless, these results affirm the reliability and consistency of our methodology.
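A sketch of this comparison step is given below. It assumes the RadGraph annotations have already been parsed into sets of "definitely present" and "definitely absent" entity strings, and the substring-overlap matching criterion is an assumption, not the exact rule used in our evaluation.

```python
# Illustrative sketch of the entity-extraction validation against RadGraph:
# for one report, compute the fraction of annotated entities (per category)
# that our pipeline recovered. The overlap heuristic below is an assumption.
def recall_against_radgraph(extracted: set[str],
                            present: set[str],
                            absent: set[str]) -> dict[str, float]:
    def recall(gold: set[str]) -> float:
        if not gold:
            return float("nan")
        # Count an annotated entity as recovered if it appears inside (or
        # contains) one of the extracted entity phrases.
        hits = sum(any(g in e or e in g for e in extracted) for g in gold)
        return hits / len(gold)

    return {"definitely present": recall(present),
            "definitely absent": recall(absent)}

print(recall_against_radgraph(
    extracted={"minor atelectasis", "left basilar consolidation"},
    present={"atelectasis", "consolidation"},
    absent={"pneumothorax"}))
```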

4.2 Domain Similarity Score

In contrast to the initial phase of clinical entity extraction, validating the domain similarity score is more challenging. The scoring itself is more subjective and open to debate, and creating an automated validation method that does not rely on radiologists requires a creative approach. Nevertheless, using the available tools and databases, we establish a dedicated system for validating this scoring method in the chest X-ray application.

For the chest X-ray application, the MIMIC-CXR dataset Johnson et al. (2019) is one of the largest available databases of chest X-ray images and their corresponding reports. Notably, this dataset provides CheXpert labels (medical observations), namely Atelectasis, Cardiomegaly, Consolidation, Edema, Enlarged Cardiomediastinum, Fracture, Lung Lesion, Lung Opacity, No Finding, Pleural Effusion, Pleural Other, Pneumonia, Pneumothorax, and Support Devices Irvin et al. (2019). Each label takes the value 1 (definitely present), 0 (definitely absent), -1 (ambiguous), or is left blank. Table 3 presents a sample of CheXpert labels extracted from the chest X-ray reports of five patients from the MIMIC-CXR database. The reports corresponding to these subjects are presented in Table 4.

Table 3: CheXpert labels extracted from the chest X-ray reports of five subjects (01-05) in MIMIC-CXR. The columns are the fourteen CheXpert observations (Atelectasis, Cardiomegaly, Consolidation, Edema, Enlarged Cardiomediastinum, Fracture, Lung Lesion, Lung Opacity, No Finding, Pleural Effusion, Pleural Other, Pneumonia, Pneumothorax, Support Devices); cell values are 1, 0, -1, or blank.

Table 4: Reports corresponding to the subjects listed in Table 3.

Subject 01: Lung volumes remain low. There are innumerable bilateral scattered small pulmonary nodules which are better demonstrated on recent CT. Mild pulmonary vascular congestion is stable. The cardio mediastinal silhouette and hilar contours are unchanged. Small pleural effusion in the right middle fissure is new. There is no new focal opacity to suggest pneumonia. There is no pneumothorax.

Subject 02: A triangular opacity in the right lung apex is new from prior examination. There is also fullness of the right hilum which is new. The remainder of the lungs are clear. Blunting of bilateral costophrenic angles, right greater than left, may be secondary to small effusions. The heart size is top normal.

Subject 03: Mild to moderate enlargement of the cardiac silhouette is unchanged. The aorta is calcified and diffusely tortuous. The mediastinal and hilar contours are otherwise similar in appearance. There is minimal upper zone vascular redistribution without overt pulmonary edema. No focal consolidation, pleural effusion or pneumothorax is present. The osseous structures are diffusely demineralized.

Subject 04: The endotracheal tube tip is 6 cm above the carina. Nasogastric tube tip is beyond the GE junction and off the edge of the film. A left central line is present in the tip is in the mid SVC. A pacemaker is noted on the right in the lead projects over the right ventricle. There is probable scarring in both lung apices. There are no new areas of consolidation. There is upper zone redistribution and cardiomegaly suggesting pulmonary venous hypertension. There is no pneumothorax.

Subject 05: A moderate left pleural effusion is new. Associated left basilar opacity likely reflect compressive atelectasis. There is no pneumothorax. There are no new abnormal cardiac or mediastinal contour. Median sternotomy wires and mediastinal clips are in expected positions.

Our approach involves two distinct strategies. First, we identify reports sharing the same sequence of labels and values; for instance, we search for reports from subjects whose CheXpert label sequence matches that of Subject_01 in Table 3. For these reports with matching label sequences, we compute similarity scores for each pair of reports. Simultaneously, we identify reports featuring only one or two labels, each with the value "definitely present" (resembling Subject_02 in Table 3), and assess their similarity to reports with entirely different label sequences; for example, we calculate the similarity between the reports of Subject_02 and Subject_05 from Table 3, given their distinct label sequences. This two-fold method allows us to analyze the semantic similarity scores for both similar and contrasting reports in terms of their labels.
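A minimal sketch of this pairing procedure is given below, assuming the CheXpert labels have been loaded as tuples of values (1, 0, -1, or None) mapped to their report texts, and with the MCSE pipeline passed in as a scoring function; both the data layout and the disjointness criterion for "contrasting" groups are assumptions of this sketch.

```python
# Sketch of the two-fold validation: reports are grouped by their CheXpert
# label sequence; pairs within a group are scored as "similar", and pairs
# drawn from groups whose positive findings are disjoint are scored as
# "contrasting". `score_fn` would be the MCSE pipeline from Section 3.
from itertools import combinations
from statistics import mean

def positives(labels: tuple) -> set[int]:
    """Indices of observations marked 1 (definitely present)."""
    return {i for i, v in enumerate(labels) if v == 1}

def two_fold_validation(reports_by_labels: dict[tuple, list[str]], score_fn):
    similar, contrasting = [], []
    for labels, reports in reports_by_labels.items():
        # Pairs of reports sharing the exact same label sequence.
        similar += [score_fn(a, b) for a, b in combinations(reports, 2)]
    for (la, ra), (lb, rb) in combinations(reports_by_labels.items(), 2):
        if positives(la).isdisjoint(positives(lb)):
            # Pairs of reports with entirely different positive findings.
            contrasting += [score_fn(a, b) for a in ra for b in rb]
    return mean(similar), mean(contrasting)

# Tiny demo with a dummy scoring function.
demo = {(1, 0, 0): ["report A", "report B"], (0, 1, 1): ["report C"]}
print(two_fold_validation(demo, lambda a, b: 0.5))
```

Report pairs within a group correspond to the blue dots in Figure 1, while pairs drawn from groups with no shared positive findings correspond to the orange dots.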

Figure 1: Mean semantic similarity scores for report pairs with matching CheXpert label sequences (blue) versus pairs with contrasting labels (orange); the red horizontal line marks the boundary between the two groups.

Figure 1 presents the results of the two-fold validation of our scoring method. Within the figure, blue dots represent the average scores for the semantic evaluation of reports with similar label sequences, while orange dots show the mean scores for reports with contrasting labels. The red horizontal line serves as the boundary between similar and opposite evaluations. Upon reviewing these results, it becomes evident that a distinct boundary exists between reports sharing the same clinical diagnoses and those with entirely dissimilar diagnoses. Notably, no blue dots fall below a 70% similarity threshold, whereas only six orange dots out of the 70 label sequences score above 70%, which is a small fraction. Nevertheless, despite this differentiation, some level of similarity, exceeding 50%, persists within the contrasting category. This can be attributed to the cosine similarity being computed within the medical domain, which introduces a bias towards tokens from the same domain. This bias cannot be entirely eliminated, as it plays a substantial role in the evaluation process; however, a clear boundary remains between similar and contrasting reports.

5 RESULTS AND DISCUSSION

In our original application of chest X-ray report generation, we incorporate our metric to assess the outputs of various models. We compare our results with the BLEU scores reported by these models, specifically the CXR-RePaiR Endo et al. (2021) and R2Gen Chen et al. (2020) models, both state-of-the-art models for generating chest X-ray reports. Our evaluation measures the semantic similarity between the generated reports and the ground truth. Table 5 presents the BLEU scores obtained from these models alongside our metric's semantic evaluation. As anticipated, the BLEU scores are relatively low, signifying a substantial dissimilarity between the generated results and the ground truth for both the CXR-RePaiR and R2Gen models, despite their being regarded as state-of-the-art models for chest X-ray report generation. These models still employ the BLEU metric for evaluation, primarily due to the scarcity of more suitable metrics and the need for a standardized evaluation process for comparative purposes. Conversely, our metric produces more promising results for both models. While our metric's scores align with the BLEU scores, with both BLEU and MCSE ranking the R2Gen model above CXR-RePaiR, our metric provides a deeper evaluation. It indicates a degree of similarity to the ground truth rather than the outright dissimilarity suggested by BLEU, making the generated reports more reliable and trustworthy, which is a crucial advancement in the field.

Table 5: BLEU and MCSE scores for state-of-the-art chest X-ray report generation models.

| Models | BLEU | Our MCSE |
|---|---|---|
| R2Gen Chen et al. (2020) | 0.212 | 0.71 |
| CXR-RePaiR Endo et al. (2021) | 0.069 | 0.64 |

Table 6 provides an example of generated medical text evaluated with both the BLEU score and our MCSE metric. According to the BLEU score, the two texts appear vastly different, even though they share the same primary medical entities. However, when we delve into the context, we notice that "moderately severe" serves as a description of the main entity, "pulmonary edema", in the generated text. Similarly, in the second part of the text, the main medical entity is "pleural effusions", and terms like "likely" and "no large" describe this entity; the descriptions are not identical but are semantically related. This subtle contextual evaluation is precisely what our metric captures, yielding a similarity score of 0.64 for these texts, which we argue is a more accurate reflection than the BLEU score.

Table 6: Example of a reference and a generated sentence scored with BLEU and with our MCSE metric.

| Reference sentence | Generated sentence | BLEU | MCSE |
|---|---|---|---|
| Pulmonary edema, cardiomegaly, likely pleural effusions. | Moderately severe bilateral pulmonary edema with no large pleural effusion. | 0.047 | 0.64 |
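To illustrate why the BLEU score in Table 6 is close to zero, the following snippet scores the same sentence pair with NLTK's sentence-level BLEU, used here as a stand-in for the paper's BLEU implementation; the whitespace tokenization is a rough assumption.

```python
# Scoring the Table 6 sentence pair with NLTK's BLEU: the near-zero score
# comes from the lack of overlapping n-grams, even though the key clinical
# entities agree semantically.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "pulmonary edema , cardiomegaly , likely pleural effusions .".split()
candidate = ("moderately severe bilateral pulmonary edema "
             "with no large pleural effusion .").split()

bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {bleu:.3f}")  # near zero, unlike the MCSE score of 0.64
```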

Lastly, the significant benefit of employing this metric lies in its capacity for comparative analysis alongside other evaluation measures. For instance, when examining the outcomes of the BLEU score, with its word-by-word analysis, situations may arise where the results are totally inaccurate, casting doubt on their reliability, despite the models performing well overall. Integrating the results of our novel MCSE metric into the evaluation process allows us to semantically analyze and ascertain the dependability of the models’ textual outputs within the context of medical content.

6 CONCLUSION

In our research, we tackle the challenge of semantic similarity scoring in medical corpora, driven by the inadequacy of existing metrics that, while suitable for machine translation evaluation, fall short in the field of medical semantic assessment. Our metric draws inspiration from how humans comprehend text, centering on the extraction of key terms and their relational context. It introduces a novel approach for extracting clinical entities from medical text, considering not only the entities themselves but also the associated descriptions and negations. Additionally, we created a new method for scoring the semantic relationships between these entities using the domain cosine similarity. The validation process allowed us to analyze and validate each of these steps individually, revealing a clear distinction between reports sharing the same diagnosis and those diverging in this regard.

For our research, we focused on the application of chest X-rays, a critical domain where a robust semantic evaluation metric is highly valuable. We applied our metric to some of the latest state-of-the-art models, and the results harmonized with other evaluation metrics, affirming their reliability.

While our validation process and implementation yielded successful outcomes, we encountered the challenge of an inherent bias in domain cosine similarity. This challenge has illuminated a promising direction for our future research, as we explore ways to mitigate this bias and advance the field of medical semantic evaluation.

Materials, Code, and Acknowledgements:

Results can be reproduced using the code available in the GitHub repository https://github.com/sayeh1994/Medical-Corpus-Semantic-Similarity-Evaluation.git. All the computations presented in this paper were performed using the Gricad infrastructure (https://gricad.univ-grenoble-alpes.fr), which is supported by Grenoble research communities.

References

  • Alam et al. [2020] Fakhare Alam, Muhammad Afzal, and Khalid Mahmood Malik. Comparative analysis of semantic similarity techniques for medical text. In 2020 International Conference on Information Networking (ICOIN), pages 106–109, 2020. doi:10.1109/ICOIN48656.2020.9016574.
  • Endo et al. [2021] Mark Endo, Rayan Krishnan, Viswesh Krishna, Andrew Y. Ng, and Pranav Rajpurkar. Retrieval-based chest x-ray report generation using a pre-trained contrastive language-image model. In Proceedings of Machine Learning for Health, volume 158 of Proceedings of Machine Learning Research, pages 209–219, 2021.
  • Miura et al. [2021] Yasuhide Miura, Yuhao Zhang, Emily Tsai, Curtis Langlotz, and Dan Jurafsky. Improving factual completeness and consistency of image-to-text radiology report generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5288–5304, Online, June 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.naacl-main.416. URL https://aclanthology.org/2021.naacl-main.416.
  • Yu et al. [2022] Feiyang Yu, Mark Endo, Rayan Krishnan, Ian Pan, Andy Tsai, Eduardo Pontes Reis, Eduardo Kaiser Ururahy Nunes Fonseca, Henrique Min Ho Lee, Zahra Shakeri Hossein Abad, Andrew Y. Ng, Curtis P. Langlotz, Vasantha Kumar Venugopal, and Pranav Rajpurkar. Evaluating progress in automatic chest x-ray radiology report generation. Preprint, Radiology and Imaging, August 2022. URL http://medrxiv.org/lookup/doi/10.1101/2022.08.30.22279318.
  • Chen et al. [2020] Zhihong Chen, Yan Song, Tsung-Hui Chang, and Xiang Wan. Generating radiology reports via memory-driven transformer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1439–1449, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.emnlp-main.112. URL https://aclanthology.org/2020.emnlp-main.112.
  • Papineni et al. [2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics. doi:10.3115/1073083.1073135. URL https://aclanthology.org/P02-1040.
  • Banerjee and Lavie [2005] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics. URL https://aclanthology.org/W05-0909.
  • Lin [2004] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/W04-1013.
  • Smit et al. [2020] Akshay Smit, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, A. Ng, and Matthew P. Lungren. CheXbert: Combining automatic labelers and expert annotations for accurate radiology report labeling using BERT. In Conference on Empirical Methods in Natural Language Processing, 2020. URL https://api.semanticscholar.org/CorpusID:215827807.
  • Irvin et al. [2019] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. Langlotz, Bhavik N. Patel, Matthew P. Lungren, and Andrew Y. Ng. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):590–597, July 2019. doi:10.1609/aaai.v33i01.3301590. URL https://ojs.aaai.org/index.php/AAAI/article/view/3834.
  • Jain et al. [2021] Saahil Jain, Ashwin Agrawal, Adriel Saporta, Steven Truong, Du Nguyen Duong, Tan Bui, Pierre Chambon, Yuhao Zhang, Matthew P. Lungren, Andrew Y. Ng, Curtis Langlotz, and Pranav Rajpurkar. RadGraph: Extracting clinical entities and relations from radiology reports. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021. URL https://openreview.net/forum?id=pMWtc5NKd7V.
  • Johnson et al. [2019] Alistair E. W. Johnson, Tom J. Pollard, Seth J. Berkowitz, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-ying Deng, Roger G. Mark, and Steven Horng. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 6(1):317, December 2019. ISSN 2052-4463. doi:10.1038/s41597-019-0322-0. URL https://doi.org/10.1038/s41597-019-0322-0.
  • Patricoski et al. [2022] Jessica Patricoski, Kory Kreimeyer, Archana Balan, Kent Hardart, Jessica Tao, Valsamo Anagnostou, Taxiarchis Botsis, Johns Hopkins Molecular Tumor Board Investigators, et al. An evaluation of pretrained BERT models for comparing semantic similarity across unstructured clinical trial texts. Stud Health Technol Inform, 289:18–21, 2022.
  • Beltagy et al. [2019] Iz Beltagy, Kyle Lo, and Arman Cohan. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China, November 2019. Association for Computational Linguistics. doi:10.18653/v1/D19-1371. URL https://aclanthology.org/D19-1371.
  • Neumann et al. [2019] Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. ScispaCy: Fast and robust models for biomedical natural language processing. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 319–327, Florence, Italy, August 2019. Association for Computational Linguistics. doi:10.18653/v1/W19-5034. URL https://www.aclweb.org/anthology/W19-5034.
  • Li et al. [2016] Jiao Li, Yueping Sun, Robin J. Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J. Mattingly, Thomas C. Wiegers, and Zhiyong Lu. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database, 2016:baw068, May 2016. ISSN 1758-0463. doi:10.1093/database/baw068. URL https://doi.org/10.1093/database/baw068.
  • Honnibal et al. [2020] Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. spaCy: Industrial-strength natural language processing in Python. 2020. doi:10.5281/zenodo.1212303.
  • Gricad. Infrastructure supported by Grenoble research communities. URL https://gricad.univ-grenoble-alpes.fr.