A Guide to Dictionary-Based Text Mining

Novo Nordisk Foundation Center for Protein Research

Research output: Chapter in Book/Report/Conference proceeding › Book chapter › Research › peer-review

Helen V. Cook
Lars Juhl Jensen

PubMed contains more than 27 million documents, and this number is growing at an estimated 4% per year. Even within specialized topics, it is no longer possible for a researcher to read the literature of any field in its entirety, and thus nobody has a complete picture of the scientific knowledge in any given field at any time. Text mining provides a means to automatically read this corpus and to extract the relations found therein as structured information. Having data in a structured format is a huge boon for computational efforts to access, cross-reference, and mine it. This is increasingly useful as biological research becomes more focused on systems and multi-omics integration. This chapter provides an overview of the steps that are required for text mining: tokenization, named entity recognition, normalization, event extraction, and benchmarking. It discusses a variety of approaches to these tasks and then goes into detail on how to prepare data for use specifically with the JensenLab tagger. This software uses a dictionary-based approach and provides the text mining evidence for STRING and several other databases.
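As a rough illustration of what a dictionary-based approach involves, the following minimal sketch tokenizes a sentence, finds the longest dictionary names in it (named entity recognition), and maps each match to an identifier (normalization). It is not the JensenLab tagger or its input format: the dictionary entries, the identifiers, and the function names `tokenize` and `tag` are all hypothetical, chosen only for this example.

```python
import re

# Hypothetical toy dictionary: lowercased name (as a tuple of tokens) -> identifier.
# Both the names and the identifiers are made up for illustration.
DICTIONARY = {
    ("tumor", "protein", "p53"): "GENE:0001",
    ("p53",): "GENE:0001",
    ("mdm2",): "GENE:0002",
}
MAX_LEN = max(len(key) for key in DICTIONARY)  # longest name, in tokens

TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Return (token, start, end) triples for each run of word characters."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def tag(text):
    """Return (start, end, matched_text, identifier) for each dictionary hit.

    At each token position the longest matching name wins, and matching is
    case-insensitive. Matched tokens are consumed, so hits never overlap.
    """
    tokens = tokenize(text)
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shrink it.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tok.lower() for tok, _, _ in tokens[i:i + n])
            if key in DICTIONARY:
                start = tokens[i][1]
                end = tokens[i + n - 1][2]
                matches.append((start, end, text[start:end], DICTIONARY[key]))
                i += n
                break
        else:
            # No name starts at this token; move on.
            i += 1
    return matches


if __name__ == "__main__":
    for hit in tag("MDM2 binds tumor protein p53."):
        print(hit)
```

Because both "tumor protein p53" and "p53" map to the same identifier in this toy dictionary, the synonyms are normalized to one entity. Preferring the longest match keeps the full name from being reported as the shorter one.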

Original language	English
Title of host publication	Bioinformatics and Drug Discovery
Editors	Richard S. Larson, Tudor I. Oprea
Number of pages	17
Volume	1939
Publisher	Humana Press
Publication date	2019
Edition	3
Pages	73-89
ISBN (Print)	978-1-4939-9088-7
ISBN (Electronic)	978-1-4939-9089-4
DOIs	https://doi.org/10.1007/978-1-4939-9089-4_5
Publication status	Published - 2019

Series	Methods in Molecular Biology
ISSN	1064-3745

ID: 223876548
