FROGS_4 Cluster Filters

Context

Once the clusters have been reconstructed, it is essential to filter them. Most software does this internally without the user being aware of it, but in FROGS it is a user-controlled step.
This step is performed with the FROGS_4 Cluster Filters tool.

What it does

This tool deletes clusters according to conditions entered by the user. If a cluster meets at least one criterion, it is deleted.

This tool filters the clusters inside an abundance table according to:

Cluster prevalence: how often the cluster occurs in the environment, i.e. the minimum number of samples in which the cluster must be present.
Cluster size: a cluster that does not reach a given proportion or count of sequences is removed.
Biggest clusters: only the X biggest clusters are kept.
Contaminant: a cluster is removed if its sequence matches
- one of the proposed contaminants: phiX (a control added in Illumina sequencing technologies) or the chloroplastic/mitochondrial 16S of A. thaliana,
- or your own contaminant sequences (a FASTA file containing the contaminants of your choice).

Once the filters of your choice have been set, the clusters kept are those that satisfy the specified thresholds in the input BIOM file. The BIOM abundance table and the FASTA file are rewritten to contain only the kept clusters, and the discarded clusters are listed in the excluded file.

Command line

v4.1.0

cluster_filters.py --help
usage: cluster_filters.py [-h] [-p NB_CPUS] [--debug] [-v]
                          [--nb-biggest-clusters NB_BIGGEST_CLUSTERS]
                          [-s MIN_SAMPLE_PRESENCE] [-r MIN_REPLICATE_PRESENCE]
                          [--replicate_file REPLICATE_FILE] [-a MIN_ABUNDANCE]
                          --input-biom INPUT_BIOM --input-fasta INPUT_FASTA
                          [--contaminant CONTAMINANT]
                          [--output-biom OUTPUT_BIOM]
                          [--output-fasta OUTPUT_FASTA] [--summary SUMMARY]
                          [--excluded EXCLUDED] [--log-file LOG_FILE]

Filters an abundance file

optional arguments:
  -h, --help            show this help message and exit
  -p NB_CPUS, --nb-cpus NB_CPUS
                        The maximum number of CPUs used. [Default: 1]
  --debug               Keep temporary files to debug program.
  -v, --version         show program's version number and exit

Filters:
  --nb-biggest-clusters NB_BIGGEST_CLUSTERS
                        Number of most abundant clusters you want to keep.
  -s MIN_SAMPLE_PRESENCE, --min-sample-presence MIN_SAMPLE_PRESENCE
                        Keep cluster present in at least this number of
                        samples.
  -r MIN_REPLICATE_PRESENCE, --min-replicate-presence MIN_REPLICATE_PRESENCE
                        Keep cluster present in at least this proportion of
                        replicates in at least one group (please indicate a
                        proportion between 0 and 1). Replicates must be
                        defined with --replicate_file REPLICATE FILE
  --replicate_file REPLICATE_FILE
                        Replicate file must be specified if --min-replicate-
                        presence is set. First column of the file must
                        indicate the sample name, and the second column the
                        group name of this replicate. Exemple: TEM1_L0001_R
                        Temoin.
  -a MIN_ABUNDANCE, --min-abundance MIN_ABUNDANCE
                        Minimum percentage/number of sequences, comparing to
                        the total number of sequences, of a cluster (between 0
                        and 1 if percentage desired).

Inputs:
  --input-biom INPUT_BIOM
                        The input BIOM file. (format: BIOM)
  --input-fasta INPUT_FASTA
                        The input FASTA file. (format: FASTA)
  --contaminant CONTAMINANT
                        Use this databank to filter sequence before
                        affiliation. (format: FASTA)

Outputs:
  --output-biom OUTPUT_BIOM
                        The BIOM file output. (format: BIOM) [Default:
                        cluster_filters_abundance.biom]
  --output-fasta OUTPUT_FASTA
                        The FASTA output file. (format: FASTA) [Default:
                        cluster_filters.fasta]
  --summary SUMMARY     The HTML file containing the graphs. [Default:
                        cluster_filters.html]
  --excluded EXCLUDED   The TSV file that summarizes all the clusters
                        discarded. (format: TSV) [Default:
                        cluster_filters_excluded.tsv]
  --log-file LOG_FILE   This output file will contain several information on
                        executed commands.

Example of command line:

cluster_filters.py \
--input-biom remove_chimera.biom \
--input-fasta remove_chimera.fasta \
--min-abundance 0.00005 \
--min-sample-presence 2 \
--output-biom cluster_filters_abundance.biom \
--output-fasta cluster_filters.fasta \
--summary cluster_filters.html
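
The same call can also be assembled from a script. The sketch below builds the argument list in Python using only options that appear in the help output above; the helper name `build_cluster_filters_cmd` and the file name `replicates.tsv` used in the usage note are illustrative assumptions, and the command is only printed, not executed. Note that `--replicate_file` is spelled with an underscore, unlike the other long options.

```python
import shlex


def build_cluster_filters_cmd(input_biom, input_fasta, min_abundance=None,
                              min_sample_presence=None,
                              min_replicate_presence=None,
                              replicate_file=None):
    """Assemble the argument list for cluster_filters.py (hypothetical helper)."""
    if min_replicate_presence is not None and replicate_file is None:
        # Per the help text, the replicate file is required with this filter.
        raise ValueError("--min-replicate-presence requires --replicate_file")
    cmd = ["cluster_filters.py",
           "--input-biom", input_biom,
           "--input-fasta", input_fasta]
    if min_abundance is not None:
        cmd += ["--min-abundance", str(min_abundance)]
    if min_sample_presence is not None:
        cmd += ["--min-sample-presence", str(min_sample_presence)]
    if min_replicate_presence is not None:
        cmd += ["--min-replicate-presence", str(min_replicate_presence),
                "--replicate_file", replicate_file]
    return cmd


# Thresholds are passed as strings so that 0.00005 is not rendered as 5e-05.
cmd = build_cluster_filters_cmd("remove_chimera.biom", "remove_chimera.fasta",
                                min_abundance="0.00005",
                                min_sample_presence="2")
print(shlex.join(cmd))
```

Passing `min_replicate_presence="0.5"` together with `replicate_file="replicates.tsv"` would append the replicate-based filter to the same argument list.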

Parameters

1st criterion: prevalence filter

Option 1: example

Option 2: example

How to build the file of replicated sample names?
The file must consist of exactly 2 columns, separated by a tab.

The first column contains the exact sample names (exactly as they appear in the BIOM file).
The second column contains the name of the group to which each sample belongs. Group names must not contain accents, spaces or special characters.
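
As a minimal sketch, such a file can be written with Python's `csv` module. The sample names, group names and the output name `replicates.tsv` are hypothetical, and the pattern used to validate group names (ASCII letters, digits and underscores only) is a conservative assumption about what counts as a "special character", not a rule taken from FROGS.

```python
import csv
import re

# Hypothetical mapping; sample names must match those in the BIOM file exactly.
replicates = {
    "sampleA_rep1": "control",
    "sampleA_rep2": "control",
    "sampleB_rep1": "treated",
}

# Assumption: only ASCII letters, digits and underscores are considered safe.
VALID_GROUP = re.compile(r"[A-Za-z0-9_]+")


def write_replicate_file(mapping, path):
    """Write a two-column, tab-separated file: sample name, group name."""
    for sample, group in mapping.items():
        if not VALID_GROUP.fullmatch(group):
            raise ValueError(f"Invalid group name for {sample}: {group!r}")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(mapping.items())


write_replicate_file(replicates, "replicates.tsv")
```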

Results of the FROGS_4 Cluster Filters tool with this metadata file:
To keep the clusters that are present in at least 50% of the samples of a same group, we set the threshold to 0.5. The process will therefore keep the clusters present in at least 2 “rich” samples, 3 “richAB” samples, 1 “lowAB” sample or 1 “april21” sample, as well as all clusters in sample9, since it is the only representative of the “low” condition.
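
The rule "present in at least this proportion of replicates in at least one group" can be sketched as follows. This is an illustration of the logic only, not the FROGS implementation: the function name, the sample and group names and the group sizes are hypothetical, and the exact rounding FROGS applies at the threshold is not specified here.

```python
from collections import defaultdict


def passes_replicate_presence(counts, groups, min_proportion):
    """Return True if the cluster is present in at least `min_proportion`
    of the samples of at least one group.

    counts: sample name -> abundance of one cluster
    groups: sample name -> group name
    """
    present = defaultdict(int)
    total = defaultdict(int)
    for sample, group in groups.items():
        total[group] += 1
        if counts.get(sample, 0) > 0:
            present[group] += 1
    return any(present[g] / total[g] >= min_proportion for g in total)


# Hypothetical design: four samples in one group, a single sample in another.
groups = {"s1": "groupA", "s2": "groupA", "s3": "groupA", "s4": "groupA",
          "s5": "groupB"}

in_two_of_four = {"s1": 10, "s2": 3}  # 2/4 of groupA: kept at 0.5
in_one_of_four = {"s1": 10}           # 1/4 of groupA, absent from groupB: removed
only_in_single = {"s5": 1}            # 1/1 of groupB: kept

for name, counts in [("in_two_of_four", in_two_of_four),
                     ("in_one_of_four", in_one_of_four),
                     ("only_in_single", only_in_single)]:
    print(name, passes_replicate_presence(counts, groups, 0.5))
```

The last case shows why every cluster of a sample that is the sole member of its group passes this filter.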

Mistakes to avoid:

2nd criterion: cluster size filter

The “minimum proportion/number” parameter should always be used unless you are focusing on extremely rare clusters and do not mind false-positive clusters. To ensure a good description of the community, a 0.005% abundance filter should be applied to your data.

To justify this threshold, read this paper: Bokulich NA, Subramanian S, Faith JJ, Gevers D, Gordon JI, Knight R, Mills DA, Caporaso JG. Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing. Nat Methods. 2013 Jan;10(1):57-9. doi: 10.1038/nmeth.2276. Epub 2012 Dec 2. PMID: 23202435; PMCID: PMC3531572.
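
To make the threshold concrete: 0.005% corresponds to a proportion of 0.00005, the value passed to `--min-abundance` in the example command. The sketch below converts it into a sequence count for a hypothetical dataset of 1,000,000 sequences. The cluster names and sizes are invented, and treating a cluster exactly at the cutoff as kept is an assumption, not documented FROGS behaviour.

```python
from fractions import Fraction


def abundance_cutoff(total_sequences, proportion):
    """Return the sequence count that corresponds to a proportion of the total."""
    return total_sequences * Fraction(proportion)


# 0.005 % expressed as a proportion between 0 and 1.
proportion = Fraction("0.005") / 100
assert proportion == Fraction("0.00005")

total_sequences = 1_000_000  # hypothetical dataset size
cutoff = abundance_cutoff(total_sequences, proportion)
print(cutoff)  # 50

# Hypothetical cluster sizes; clusters at or above the cutoff are kept (assumption).
cluster_sizes = {"Cluster_1": 12000, "Cluster_2": 180, "Cluster_3": 7}
kept = sorted(name for name, size in cluster_sizes.items() if size >= cutoff)
print(kept)  # ['Cluster_1', 'Cluster_2']
```

Exact fractions are used instead of floats so that the percentage-to-proportion conversion is not affected by rounding.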

3rd criterion: keep the biggest clusters

4th criterion: contaminant filter

This normalization makes it possible to compare samples with one another. However, for more precise statistical analyses, some tools such as DESeq2 need the non-normalized abundance table because they perform the normalization themselves. Be careful to choose the right table for further analysis.

Outputs

HTML report

The HTML file summarizes the filtering results for each sample.

Sample name: names of all samples of the run.

Initial: number of clusters present in each sample before filtering.

Kept: number of clusters present in each sample after filtering.

Other columns: one column per chosen filter. In this example, the user has chosen to apply 3 filters. Filters are applied independently. If a cluster matches at least one filter, it is removed.

To find out how many clusters match several filters, explore the Venn diagram.
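
Because filters are applied independently, the per-filter columns can add up to more than the number of clusters actually removed. The following sketch illustrates this with Python sets; the filter labels and cluster names are hypothetical and do not come from a real FROGS run.

```python
from collections import Counter

# Hypothetical sets of clusters flagged by each of three filters.
removed_by_filter = {
    "min_abundance": {"Cluster_3", "Cluster_4", "Cluster_5"},
    "min_sample_presence": {"Cluster_4", "Cluster_5", "Cluster_6"},
    "contaminant": {"Cluster_5", "Cluster_7"},
}
all_clusters = {f"Cluster_{i}" for i in range(1, 8)}

# A cluster is removed as soon as at least one filter flags it.
removed = set().union(*removed_by_filter.values())
kept = all_clusters - removed

# Count how many filters flag each cluster (the overlaps of a Venn diagram).
hits = Counter(c for flagged in removed_by_filter.values() for c in flagged)
flagged_by_several = sorted(c for c, n in hits.items() if n > 1)

print(sorted(kept))                                     # ['Cluster_1', 'Cluster_2']
print(len(removed))                                     # 5
print(sum(len(s) for s in removed_by_filter.values()))  # 8
print(flagged_by_several)                               # ['Cluster_4', 'Cluster_5']
```

Here the per-filter counts sum to 8 although only 5 distinct clusters are removed, because two clusters are flagged by more than one filter.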

Excluded file

List of removed clusters per filter:

Abundance Table and sequence file

Since clusters have been removed, the abundance table (BIOM format) and the sequence file (FASTA format) are updated.

A work by FROGS team
