A Guide to Dictionary-Based Text Mining
Research output: Chapter in Book/Report/Conference proceeding › Book chapter › Research › peer-review
PubMed contains more than 27 million documents, and this number is growing at an estimated 4% per year. Even within specialized topics, it is no longer possible for a researcher to read the literature of any field in its entirety, and thus nobody has a complete picture of the scientific knowledge in any given field at any time. Text mining provides a means to automatically read this corpus and to extract the relations found therein as structured information. Having data in a structured format is a huge boon for computational efforts to access, cross-reference, and mine that data. This is increasingly useful as biological research is becoming more focused on systems and multi-omics integration. This chapter provides an overview of the steps that are required for text mining: tokenization, named entity recognition, normalization, event extraction, and benchmarking. It discusses a variety of approaches to these tasks and then goes into detail on how to prepare data for use specifically with the JensenLab tagger. This software uses a dictionary-based approach and provides the text mining evidence for STRING and several other databases.
Original language | English |
---|---|
Title of host publication | Bioinformatics and Drug Discovery |
Editors | Richard S. Larson, Tudor I. Oprea |
Number of pages | 17 |
Volume | 1939 |
Publisher | Humana Press |
Publication date | 2019 |
Edition | 3 |
Pages | 73-89 |
ISBN (Print) | 978-1-4939-9088-7 |
ISBN (Electronic) | 978-1-4939-9089-4 |
Publication status | Published - 2019 |
Series | Methods in Molecular Biology |
ISSN | 1064-3745 |