Publication
Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases
Type of publication
Peer-reviewed
Form of publication
Original article (peer-reviewed)
Author
Tørresen Ole K, Star Bastiaan, Mier Pablo, Andrade-Navarro Miguel A, Bateman Alex, Jarnot Patryk, Gruca Aleksandra, Grynberg Marcin, Kajava Andrey V, Promponas Vasilis J, Anisimova Maria, Jakobsen Kjetill S, Linke Dirk
Project
C16.0072: Discovering evolutionary innovations by assessing variation and natural selection in protein tandem repeats
Journal
Nucleic Acids Research
Volume (Issue)
47(21)
Page(s)
10994 - 11006
Title of proceedings
Nucleic Acids Research
DOI
10.1093/nar/gkz841
Open Access
URL
http://doi.org/10.1093/nar/gkz841
Type of Open Access
Publisher (Gold Open Access)
Abstract
The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others.