Detecting selection in low-coverage high-throughput sequencing data using principal component analysis

Research output: Contribution to journal › Journal article › Research › peer-review

Standard

Detecting selection in low-coverage high-throughput sequencing data using principal component analysis. / Meisner, Jonas; Albrechtsen, Anders; Hanghøj, Kristian.

In: BMC Bioinformatics, Vol. 22, 470, 2021.

Harvard

Meisner, J, Albrechtsen, A & Hanghøj, K 2021, 'Detecting selection in low-coverage high-throughput sequencing data using principal component analysis', BMC Bioinformatics, vol. 22, 470. https://doi.org/10.1186/s12859-021-04375-2

APA

Meisner, J., Albrechtsen, A., & Hanghøj, K. (2021). Detecting selection in low-coverage high-throughput sequencing data using principal component analysis. BMC Bioinformatics, 22, [470]. https://doi.org/10.1186/s12859-021-04375-2

Vancouver

Meisner J, Albrechtsen A, Hanghøj K. Detecting selection in low-coverage high-throughput sequencing data using principal component analysis. BMC Bioinformatics. 2021;22:470. https://doi.org/10.1186/s12859-021-04375-2

Author

Meisner, Jonas ; Albrechtsen, Anders ; Hanghøj, Kristian. / Detecting selection in low-coverage high-throughput sequencing data using principal component analysis. In: BMC Bioinformatics. 2021 ; Vol. 22.

Bibtex

@article{faff4adc0c014c18ba8a22b383da2fe6,
title = "Detecting selection in low-coverage high-throughput sequencing data using principal component analysis",
abstract = "BACKGROUND: Identification of selection signatures between populations is often an important part of a population genetic study. Leveraging high-throughput DNA sequencing larger sample sizes of populations with similar ancestries has become increasingly common. This has led to the need of methods capable of identifying signals of selection in populations with a continuous cline of genetic differentiation. Individuals from continuous populations are inherently challenging to group into meaningful units which is why existing methods rely on principal components analysis for inference of the selection signals. These existing methods require called genotypes as input which is problematic for studies based on low-coverage sequencing data. MATERIALS AND METHODS: We have extended two principal component analysis based selection statistics to genotype likelihood data and applied them to low-coverage sequencing data from the 1000 Genomes Project for populations with European and East Asian ancestry to detect signals of selection in samples with continuous population structure. RESULTS: Here, we present two selections statistics which we have implemented in the PCAngsd framework. These methods account for genotype uncertainty, opening for the opportunity to conduct selection scans in continuous populations from low and/or variable coverage sequencing data. To illustrate their use, we applied the methods to low-coverage sequencing data from human populations of East Asian and European ancestries and show that the implemented selection statistics can control the false positive rate and that they identify the same signatures of selection from low-coverage sequencing data as state-of-the-art software using high quality called genotypes. CONCLUSION: We show that selection scans of low-coverage sequencing data of populations with similar ancestry perform on par with that obtained from high quality genotype data. Moreover, we demonstrate that PCAngsd outperform selection statistics obtained from called genotypes from low-coverage sequencing data without the need for ad-hoc filtering.",
keywords = "Genetics, Population, Genome, Genotype, High-Throughput Nucleotide Sequencing, Humans, Polymorphism, Single Nucleotide, Principal Component Analysis",
author = "Jonas Meisner and Anders Albrechtsen and Kristian Hangh{\o}j",
note = "{\textcopyright} 2021. The Author(s).",
year = "2021",
doi = "10.1186/s12859-021-04375-2",
language = "English",
volume = "22",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central Ltd.",
pages = "470",
}

RIS

TY - JOUR

T1 - Detecting selection in low-coverage high-throughput sequencing data using principal component analysis

AU - Meisner, Jonas

AU - Albrechtsen, Anders

AU - Hanghøj, Kristian

N1 - © 2021. The Author(s).

PY - 2021

Y1 - 2021

N2 - BACKGROUND: Identification of selection signatures between populations is often an important part of a population genetic study. Leveraging high-throughput DNA sequencing larger sample sizes of populations with similar ancestries has become increasingly common. This has led to the need of methods capable of identifying signals of selection in populations with a continuous cline of genetic differentiation. Individuals from continuous populations are inherently challenging to group into meaningful units which is why existing methods rely on principal components analysis for inference of the selection signals. These existing methods require called genotypes as input which is problematic for studies based on low-coverage sequencing data. MATERIALS AND METHODS: We have extended two principal component analysis based selection statistics to genotype likelihood data and applied them to low-coverage sequencing data from the 1000 Genomes Project for populations with European and East Asian ancestry to detect signals of selection in samples with continuous population structure. RESULTS: Here, we present two selections statistics which we have implemented in the PCAngsd framework. These methods account for genotype uncertainty, opening for the opportunity to conduct selection scans in continuous populations from low and/or variable coverage sequencing data. To illustrate their use, we applied the methods to low-coverage sequencing data from human populations of East Asian and European ancestries and show that the implemented selection statistics can control the false positive rate and that they identify the same signatures of selection from low-coverage sequencing data as state-of-the-art software using high quality called genotypes. CONCLUSION: We show that selection scans of low-coverage sequencing data of populations with similar ancestry perform on par with that obtained from high quality genotype data. Moreover, we demonstrate that PCAngsd outperform selection statistics obtained from called genotypes from low-coverage sequencing data without the need for ad-hoc filtering.

KW - Genetics, Population

KW - Genome

KW - Genotype

KW - High-Throughput Nucleotide Sequencing

KW - Humans

KW - Polymorphism, Single Nucleotide

KW - Principal Component Analysis

U2 - 10.1186/s12859-021-04375-2

DO - 10.1186/s12859-021-04375-2

M3 - Journal article

C2 - 34587903

VL - 22

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

M1 - 470

ER -
