BiteSized Bioinformatics: De novo assembly

CLIMB address: (also, will work!)

Some past BSB sessions introduced concepts that are useful for this one: NGS file formats (in particular FASTA and FASTQ), and how to run and interpret a quality control with FastQC.

From the Quadram you can access a shared directory containing relevant material, such as:

  • What does FastQC do.pptx
  • Bite-size-talks/Bitesize_ngs_formats.pdf

There is also a blog dedicated to the BSB activity (only available from QIB computers).

From reads to contigs

  1. Introduce yourself - This informal training room can be a place to meet other bioinformaticians and/or learners!
  2. Setting up the environment - We will use a remote Linux server. This part will guide you through accessing the machine via SSH and setting up a program (called screen) that keeps the session alive.
  3. What datasets do we have? - The datasets have been downloaded to a shared location. Let's see how to count the reads that we will eventually assemble.
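
As a taste of the read-counting step, here is a minimal Python sketch. It assumes the standard four-lines-per-record FASTQ layout (no wrapped sequence lines) and treats any file ending in .gz as gzip-compressed. The function names are our own, chosen for illustration, and are not part of the workshop material.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed file in text mode."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_fastq_reads(path):
    """Count reads, assuming every record spans exactly four lines."""
    with open_text(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        # A line count that is not a multiple of 4 suggests a truncated
        # file or a FASTQ variant with wrapped lines.
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

Calling count_fastq_reads on a FASTQ file returns the number of reads it contains. For paired-end data, both files of a pair should give the same count.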

Multiple platforms, multiple datasets

  1. More Datasets - Explore WGS experiments performed using different sequencing machines
  2. Quality Check - Let's try a QC on a dataset
  3. bsbdenovo-examples - Some assembly results (in progress)

Advanced tracks

Given the amount of data, you can explore different aspects of de novo assembly. We are therefore not preparing a single all-inclusive track, but we can discuss an advanced tutorial with you. Here are some examples:

Performing De novo assemblies

  • Check the effect of coverage on the results. You'll learn to randomly subsample a FASTQ file: this produces multiple datasets (for example, using 30%, 50%, and 70% of the total reads) that can be assembled to check the effect of increasing coverage on a particular dataset.
  • Compare multiple assemblers. You can, for example, compare Velvet and SPAdes on an Illumina dataset.
  • Assemble long reads. You can try assembling very long reads (PacBio/Nanopore) using a dedicated assembler: canu. Another possible experiment is mixing Illumina and Nanopore reads to see how the result is affected.
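
The random subsampling mentioned in the first item can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the workshop's own procedure: it expects uncompressed FASTQ with four lines per record, keeps each read independently with probability equal to the requested fraction (so the output size is approximate), and uses a function name and default seed that are our own choices.

```python
import random


def read_fastq(handle):
    """Yield (header, sequence, separator, quality) line tuples."""
    while True:
        record = tuple(handle.readline() for _ in range(4))
        if not record[0]:
            return  # clean end of file
        if not all(record):
            raise ValueError("truncated FASTQ record at end of file")
        yield record


def subsample_fastq(in_path, out_path, fraction, seed=42):
    """Write roughly `fraction` of the reads to out_path; return how many were kept."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed makes the selection reproducible
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for record in read_fastq(src):
            # One random draw per record: each read is kept independently.
            if rng.random() < fraction:
                dst.writelines(record)
                kept += 1
    return kept
```

Because exactly one random number is drawn per record, running the function on the two files of a paired-end dataset with the same seed and fraction keeps the same read pairs in both outputs, provided the two files list their reads in the same order. Running it with fractions 0.3, 0.5, and 0.7 would give the three datasets described above.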

Inspecting the output

Some ideas for future workshops:

  • Align reads against contigs and view the output with IGV. This gives an idea of coverage fluctuations and of read quality compared with the consensus.
  • Align reads and/or contigs against a known, finished reference genome, and import the results into IGV. Here we can see whether some regions of the reference genome are missing from our assembly. Why do we want to align reads, and not only contigs, for this?
  • How do we check the opposite, that is, whether our isolate has novel regions? We want to use only alignments and IGV ;)


  • Ekblom, R., & Wolf, J. B. W. (2014). A field guide to whole-genome sequencing, assembly and annotation. Evolutionary Applications, 7(9), 1026–1042.
  • Schirmer, M., Ijaz, U. Z., D'Amore, R., Hall, N., Sloan, W. T., & Quince, C. (2015). Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform. Nucleic Acids Research, 43(6), e37.
  • Lee, H. C., Lai, K., Lorenc, M. T., Imelfort, M., Duran, C., & Edwards, D. (2012). Bioinformatics tools and databases for analysis of next-generation sequence data. Briefings in Functional Genomics, 11(1), 12–24.