S1000: a better taxonomic name corpus for biomedical information extraction

Research output: Contribution to journal; Journal article; Research; peer-review


Motivation: The recognition of mentions of species names in text is a critically important task for biomedical text mining. While deep learning-based methods have made great advances in many named entity recognition tasks, results for species name recognition remain poor. We hypothesize that this is primarily due to the lack of appropriate corpora.

Results: We introduce the S1000 corpus, a comprehensive manual re-annotation and extension of the S800 corpus. We demonstrate that S1000 makes highly accurate recognition of species names possible (F-score = 93.1%), both for deep learning and dictionary-based methods.

Original language: English
Article number: btad369
Issue number: 6
Number of pages: 8
Publication status: Published - 2023

Bibliographical note

Publisher Copyright:
© 2023 The Author(s).
