FROGS_3 Remove chimera
Context
Chimeras are artifact sequences formed by two or more biological sequences incorrectly joined together. This often occurs during PCR reactions using mixed templates (i.e., uncultured environmental samples). Incomplete extensions during PCR allow subsequent PCR cycles to use a partially extended strand to bind to the template of a different, but similar, sequence. This partially extended strand then acts as a primer to extend and form a chimeric sequence. Once created, the chimeric sequence is then further amplified in subsequent cycles. The end result is a PCR artifact that does not represent a sequence that exists in nature.
This phenomenon is particularly common in amplicon sequencing, where closely related sequences are amplified.
Source: EZBioCloud
The chimera rate can range from 5% to 45% of the sequences, typically for 16S data (Haas et al., 2011).
How it works
This tool removes chimeric sequences by sample.
De novo detection: In this algorithm, a chimera-free reference database is automatically generated for each NGS dataset. Initially, the reference database is empty. Then, NGS reads are considered in order of decreasing abundance. If a sequence is classified as chimeric, it is discarded; otherwise, it is added to the reference database (so the reference database grows). Candidate parents (PCR templates, strains A and B in the previous figure) are required to be more abundant than the query sequence, on the assumption that a chimera has undergone fewer rounds of amplification and will, therefore, be less abundant than its parents (Edgar et al., 2011).
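The de novo procedure described above can be sketched in a few lines of Python. This is a minimal illustration of the abundance-ordered loop only, not the VSEARCH implementation: `denovo_filter` and `toy_is_chimera` are hypothetical names, and the toy detector (an exact prefix of one parent joined to an exact suffix of another) is an assumption standing in for the real alignment-based scoring.

```python
from typing import Callable, Dict, List, Tuple


def denovo_filter(
    abundances: Dict[str, int],
    is_chimera: Callable[[str, List[str]], bool],
) -> Tuple[List[str], List[str]]:
    """Split sequences into (reference, chimeras) by greedy de novo screening."""
    reference: List[str] = []  # starts empty and grows with each accepted sequence
    chimeras: List[str] = []
    # Most abundant first; ties are broken by sequence for a deterministic order.
    ordered = sorted(abundances, key=lambda seq: (-abundances[seq], seq))
    for seq in ordered:
        # Only references that are more abundant than the query may be parents.
        parents = [ref for ref in reference if abundances[ref] > abundances[seq]]
        if is_chimera(seq, parents):
            chimeras.append(seq)
        else:
            reference.append(seq)
    return reference, chimeras


def toy_is_chimera(query: str, parents: List[str]) -> bool:
    """Hypothetical detector: prefix of one parent joined to suffix of another."""
    for a in parents:
        for b in parents:
            if a == b:
                continue
            for cut in range(1, len(query)):
                if a.startswith(query[:cut]) and b.endswith(query[cut:]):
                    return True
    return False


# Toy data: two abundant "real" sequences and one rare hybrid of both.
reference, chimeras = denovo_filter(
    {"AAAAAAAA": 100, "CCCCCCCC": 80, "AAAACCCC": 5},
    toy_is_chimera,
)
print(reference)  # ['AAAAAAAA', 'CCCCCCCC']
print(chimeras)   # ['AAAACCCC']
```

Passing the detector as a callable keeps the ordering and parent-selection logic separate from the scoring, which is the part a real tool implements with alignments.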
Chimera detection is performed with VSEARCH, combined with an in-house strategy: detection is run sample by sample, and a cross-validation step then removes only the sequences identified as chimeric in every sample where they are present.
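The cross-validation rule can be illustrated with a short Python sketch. It assumes the per-sample detection results are already available; the function name `cross_validate` and the two input dictionaries are hypothetical and do not reflect the internal data structures of `remove_chimera.py`.

```python
from typing import Dict, Set


def cross_validate(
    counts: Dict[str, Dict[str, int]],
    flagged: Dict[str, Set[str]],
) -> Set[str]:
    """Return the clusters flagged as chimeric in every sample where they occur.

    counts:  cluster -> {sample -> abundance}
    flagged: sample  -> set of clusters called chimeric in that sample
    """
    removed: Set[str] = set()
    for cluster, per_sample in counts.items():
        # Samples in which the cluster is actually present.
        present_in = [sample for sample, n in per_sample.items() if n > 0]
        # Remove only if the call is unanimous across those samples.
        if present_in and all(cluster in flagged.get(s, set()) for s in present_in):
            removed.add(cluster)
    return removed


counts = {
    "c1": {"s1": 10, "s2": 0},  # present in s1 only
    "c2": {"s1": 3, "s2": 4},   # present in both samples
    "c3": {"s1": 2, "s2": 1},   # present in both samples
}
flagged = {"s1": {"c1", "c2", "c3"}, "s2": {"c3"}}
print(sorted(cross_validate(counts, flagged)))  # ['c1', 'c3']
```

In this toy case, `c2` is kept because it is called chimeric in `s1` but not in `s2`, where it is also present.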
See here for more details about VSEARCH chimera removal.
Command line
v4.1.0
remove_chimera.py --help
usage: remove_chimera.py [-h] [-p NB_CPUS] [--debug] [-v] -f INPUT_FASTA
[-b INPUT_BIOM | -c INPUT_COUNT] [-n NON_CHIMERA]
[-a OUT_ABUNDANCE] [--summary SUMMARY] [-l LOG_FILE]
Removes PCR chimera.
optional arguments:
-h, --help show this help message and exit
-p NB_CPUS, --nb-cpus NB_CPUS
The maximum number of CPUs used. [Default: 1]
--debug Keep temporary files to debug program.
-v, --version show program's version number and exit
Inputs:
-f INPUT_FASTA, --input-fasta INPUT_FASTA
The cluster sequences (format: FASTA).
-b INPUT_BIOM, --input-biom INPUT_BIOM
The abundance file for clusters by sample (format:
BIOM).
-c INPUT_COUNT, --input-count INPUT_COUNT
The counts file for clusters by sample (format: TSV).
Outputs:
-n NON_CHIMERA, --non-chimera NON_CHIMERA
sequences file without chimera (format: FASTA).
[Default: remove_chimera.fasta]
-a OUT_ABUNDANCE, --out-abundance OUT_ABUNDANCE
Abundance file without chimera (format: BIOM or TSV).
[Default: remove_chimera_abundance.biom or
remove_chimera_abundance.tsv]
--summary SUMMARY The HTML file containing the graphs. [Default:
remove_chimera.html]
-l LOG_FILE, --log-file LOG_FILE
This output file will contain several informations on
executed commands.
Example of command line:
remove_chimera.py --input-biom clustering.biom --input-fasta clustering.fasta --non-chimera remove_chimera.fasta --out-abundance remove_chimera.biom --summary remove_chimera.html
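When the tool is called from a script, the same command can be assembled programmatically. The sketch below uses only options listed in the help above and chooses between `--input-biom` and `--input-count` from the file extension; the helper name `build_remove_chimera_cmd`, the extension-based choice, and the output naming are assumptions made for illustration.

```python
import shlex
from typing import List


def build_remove_chimera_cmd(
    fasta: str,
    abundance: str,
    out_prefix: str = "remove_chimera",
    nb_cpus: int = 1,
) -> List[str]:
    """Build the argument list for remove_chimera.py (hypothetical helper)."""
    # The two abundance options are mutually exclusive in the usage line.
    if abundance.endswith(".biom"):
        abundance_flag, out_ext = "--input-biom", "biom"
    elif abundance.endswith(".tsv"):
        abundance_flag, out_ext = "--input-count", "tsv"
    else:
        raise ValueError("abundance file must end with .biom or .tsv")
    return [
        "remove_chimera.py",
        "--nb-cpus", str(nb_cpus),
        "--input-fasta", fasta,
        abundance_flag, abundance,
        "--non-chimera", f"{out_prefix}.fasta",
        "--out-abundance", f"{out_prefix}_abundance.{out_ext}",
        "--summary", f"{out_prefix}.html",
    ]


cmd = build_remove_chimera_cmd("clustering.fasta", "clustering.tsv", nb_cpus=4)
print(shlex.join(cmd))
# To execute it, pass the list to subprocess.run(cmd, check=True).
```

Building the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.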
Galaxy
Sequences file: the input file in FASTA format (contains the OTU sequences)
Abundance type: BIOM or TSV (BIOM if you follow the FROGS guidelines, TSV in some other cases)
Abundance file: the abundance file, in the format selected just above
Outputs
HTML report
The HTML file summarizes important information about the chimera removal process.
How many clusters/sequences are kept after the process?
In this example, 5,945 OTUs are kept and 13,968 OTUs have been detected as chimeras and removed. These 5,945 OTUs correspond to 558,062 sequences kept, i.e. 97.5% of the total sequences.
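The figures in this example can be cross-checked with a little arithmetic. The calculation below uses only the numbers reported above; the total and removed sequence counts are estimates, because the 97.5% figure is rounded.

```python
kept_otus = 5_945
removed_otus = 13_968
kept_sequences = 558_062
kept_sequence_fraction = 0.975  # as reported in the summary, rounded

total_otus = kept_otus + removed_otus
kept_otu_fraction = kept_otus / total_otus

# Estimates only: derived from the rounded percentage above.
estimated_total_sequences = round(kept_sequences / kept_sequence_fraction)
estimated_removed_sequences = estimated_total_sequences - kept_sequences

print(f"OTUs kept: {kept_otus} of {total_otus} ({kept_otu_fraction:.1%})")
print(f"Estimated total sequences: {estimated_total_sequences}")
print(f"Estimated sequences removed: {estimated_removed_sequences}")
```

Roughly 30% of the OTUs are kept, yet they carry 97.5% of the sequences: the removed chimeric OTUs are numerous but individually rare, which is consistent with the assumption that chimeras are less abundant than their parents.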
BIOM file
The BIOM file contains information about the clusters after the chimera removal process.
FASTA file
The FASTA file contains sequences of the non-chimeric clusters.
A work by FROGS team