Analysis of 16S data with lOTUs

We are going to analyze raw data from a 16S experiment using the lOTUs pipeline (paper).

We'll use the 16S reads from the paper Gut barrier impairment by high‐fat diet in mice depends on housing conditions (Mueller 2015). The raw data can be downloaded from ENA.


This is a fraction of the metadata (we'll focus on Diet, where CD is Control Diet and HFD is High Fat Diet):

#SampleID Facility Diet Facility_Diet Cholic_acid Muricholic_acid
13.SPF.CD SPF CD SPF_CD 342.5568553 1026.617105
14.SPF.CD SPF CD SPF_CD 227.4375746 1436.135551
15.SPF.CD SPF CD SPF_CD 20.17057537 862.6286804
16.SPF.CD SPF CD SPF_CD 65.58988185 1139.587569

How many reads? How good?

To start we need to check the sequencing depth (how many reads per sample), and the overall quality of the sequencing reads.

A common tool to perform this analysis is FastQC, that produces a full report for generic NGS output (mainly whole genome sequencing). For amplicons a lot of sections will not be very useful (e.g. overrepresented k-mers are totally expected!).

The beginning of the report will include the total number of sequences. For this tutorial we subsampled approximately 5k reads for each sample.

This is the QC profile for the first pair (_R1) of a sample of the dataset:

An this is it's companion pair (_R2):

It's common to see a stronger quality drop at the end of _R2 sequences.

Before running the lOTUs pipeline, that is automatic, we can manually perform some steps to better understand how the input data is manipulated. To keep the procedure simple

Merge R1 and R2

Most 16S experiments are designed to amplify a target region shorter than the sum of the read length of the two paired end reads. This allows merging the two reads into a single longer sequence, that also mitigate some sequencing errors producing the consensus.

gunzip *.gz
usearch -fastq_mergepairs *R1* -relabel @ -fastq_maxdiffs 20 -fastq_pctid 80 -fastqout merge.fq

The program will print the number of pairs merged, and the reason why some pairs weren't merged at all (usually too many errors).

Preparing the mapping file

The first step for most 16S analysis pipelines it to prepare a file connecting the files to some form of sample identifier. Lotus provides a script to prepare a minimal mapping file.