Analysis of 16S data with lOTUs
Introduction
We are going to analyze raw data from a 16S experiment using the lOTUs pipeline (paper).
Dataset
We'll use the 16S reads from the paper Gut barrier impairment by highâfat diet in mice depends on housing conditions (Mueller 2015). The raw data can be downloaded from ENA.
Metadata
This is a fraction of the metadata (we'll focus on Diet, where CD is Control Diet and HFD is High Fat Diet):
#SampleID | Facility | Diet | Facility_Diet | Cholic_acid | Muricholic_acid |
---|---|---|---|---|---|
13.SPF.CD | SPF | CD | SPF_CD | 342.5568553 | 1026.617105 |
14.SPF.CD | SPF | CD | SPF_CD | 227.4375746 | 1436.135551 |
15.SPF.CD | SPF | CD | SPF_CD | 20.17057537 | 862.6286804 |
16.SPF.CD | SPF | CD | SPF_CD | 65.58988185 | 1139.587569 |
How many reads? How good?
To start we need to check the sequencing depth (how many reads per sample), and the overall quality of the sequencing reads.
A common tool to perform this analysis is FastQC, that produces a full report for generic NGS output (mainly whole genome sequencing). For amplicons a lot of sections will not be very useful (e.g. overrepresented k-mers are totally expected!).
The beginning of the report will include the total number of sequences. For this tutorial we subsampled approximately 5k reads for each sample.
This is the QC profile for the first pair (_R1) of a sample of the dataset:
An this is it's companion pair (_R2):
It's common to see a stronger quality drop at the end of _R2 sequences.
Preliminary analysis
Before running the lOTUs pipeline, that is automatic, we can manually perform some steps to better understand how the input data is manipulated. To keep the procedure simple
Merge R1 and R2
Most 16S experiments are designed to amplify a target region shorter than the sum of the read length of the two paired end reads. This allows merging the two reads into a single longer sequence, that also mitigate some sequencing errors producing the consensus.
gunzip *.gz usearch -fastq_mergepairs *R1* -relabel @ -fastq_maxdiffs 20 -fastq_pctid 80 -fastqout merge.fq
The program will print the number of pairs merged, and the reason why some pairs weren't merged at all (usually too many errors).
Using lOTUs
Preparing the mapping file
The first step for most 16S analysis pipelines it to prepare a file connecting the files to some form of sample identifier. Lotus provides a script to prepare a minimal mapping file.
autoMap.pl