Our team developed medBERT.de, a German BERT model designed specifically for the medical domain. We pre-trained the model on a corpus of 4.7 million German medical documents spanning a wide range of disciplines and document types. To evaluate medBERT.de’s performance, we compared it against both general-purpose German language models and other German medical models on eight medical benchmarks.
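For readers who want to try the model, the sketch below loads it with the Hugging Face transformers library. The model ID GerMedBERT/medbert-512 is an assumption on our part (it is not stated in this summary), so substitute the actual repository name if it differs.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed Hugging Face model ID; replace with the actual repository
# name if it differs.
MODEL_ID = "GerMedBERT/medbert-512"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

# Encode a German clinical sentence ("The patient was discharged on
# the third postoperative day.") and run a forward pass.
inputs = tokenizer(
    "Der Patient wurde am dritten postoperativen Tag entlassen.",
    return_tensors="pt",
)
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch, sequence length, vocabulary size)
```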
Our findings, published in Expert Systems with Applications (https://doi.org/10.1016/j.eswa.2023.121598), show that medBERT.de outperforms existing models on most of these benchmarks, underscoring the importance of domain-specific pre-training. Its advantage was most pronounced on tasks involving longer texts, such as discharge summaries and surgical reports. However, contrary to previous research, we found that deduplicating the training data did not consistently improve performance.
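As a quick qualitative probe of what the domain-specific pre-training captured, the masked-language-modeling head can be queried on a clinical sentence. This is a minimal sketch under the same assumed model ID; [MASK] is the standard BERT mask token.

```python
from transformers import pipeline

# Fill-mask pipeline on a German clinical sentence
# ("The patient suffers from an acute [MASK].").
fill_mask = pipeline("fill-mask", model="GerMedBERT/medbert-512")

for pred in fill_mask("Der Patient leidet an einer akuten [MASK]."):
    print(f"{pred['token_str']:>20}  {pred['score']:.3f}")
```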
