P0004 Cwalking – A Script to Explain Unassembled DNA Sequences

Jose F Barbosa-Neto, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
Cristian Chaparro, University of Perpignan, Perpignan, France
Olivier Panaud, University of Perpignan, Perpignan, France
Highly repeated sequences are the major limitation in sequencing repetitive regions and assembling scaffolds. This is particularly true when a complex genome is sequenced using NGS data. Sequencing projects have to deal with millions of unassembled sequences, and an increase in genome coverage does not solve the problem. These unassembled sequences contain repeated elements that the different assembly algorithms cannot easily resolve. Cwalking is a Perl script that builds new sequences from these unassembled sequences. It is based on a step-by-step approach and on the identification of overlapping sequences. At each extension step, all matching sequences are eliminated from the database in order to reduce subsequent analysis. The sequence obtained must then be tested for transposons, retrotransposons, or other highly repeated genomic structures. The process may be repeated several times in order to explain the whole set of unassembled sequences. Cwalking was tested in two sequencing projects, cacao (Theobroma cacao) and banana (Musa paradisiaca). The analysis in the cacao project identified five different repeated structures, which explained 1,845,360 sequences, representing 57% of the total. A major retrotransposon (Gaucho) was estimated to have over 1100 copies, covering around 13 Mb of the cacao genome. In the banana analysis, a total of 1,645,810 unassembled sequences were used and 58% were related to repeated elements. Cwalking proved to be an efficient and fast script for analyzing unassembled sequences, making it an important tool for genome annotation projects.
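
The step-by-step extension described above can be illustrated with a minimal sketch. This is not the Cwalking source, which is a Perl script and is not reproduced here. It is a simplified Python illustration that rests on several assumptions: overlaps are exact string matches, only the right end of the seed is extended, reverse complements are ignored, and the longest overhang among the matching reads is chosen at each step. The function name `extend_right` and the parameter `min_overlap` are hypothetical.

```python
def extend_right(seed, reads, min_overlap=20):
    """Greedily extend `seed` to the right using overlapping reads.

    Illustrative assumptions (not taken from Cwalking itself):
    exact matching only, one strand only, longest overhang wins.
    Returns the extended sequence, the reads left unexplained,
    and the number of reads removed from the pool.
    """
    contig = seed
    pool = list(reads)
    explained = 0

    while True:
        # The current end of the growing sequence is the search key.
        tail = contig[-min_overlap:]
        extension = ""
        remaining = []

        for read in pool:
            pos = read.find(tail)
            if pos == -1:
                if read in contig:
                    # Read lies fully inside the sequence built so far.
                    explained += 1
                else:
                    remaining.append(read)
                continue
            # Matching read: remove it from the pool and keep the
            # longest stretch that runs past the current end.
            explained += 1
            overhang = read[pos + len(tail):]
            if len(overhang) > len(extension):
                extension = overhang

        # Matching reads are eliminated before the next step.
        pool = remaining
        if not extension:
            break
        contig += extension

    return contig, pool, explained


if __name__ == "__main__":
    toy_reads = ["GTACGG", "ACGGTT", "TTTTTT"]
    contig, left, n = extend_right("ACGTAC", toy_reads, min_overlap=4)
    print(contig, left, n)
```

With the toy reads, the seed is extended in two steps and one read remains unexplained. In a real analysis the extended sequence would then be checked against known transposons and retrotransposons, and the procedure would be restarted from a new seed drawn from the remaining reads.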