FROGS_4 Cluster Filters
Context
Once the clusters have been reconstructed, it is essential to filter these data. Most software does this internally without the user being aware of it, but in FROGS this is a user-controlled step.
This step is done with the FROGS_4 Cluster Filters tool.
What it does
This tool deletes clusters according to conditions entered by the user. If a cluster matches at least one removal criterion, it is deleted.
This tool filters the clusters inside an abundance table according to:
- Cluster prevalence: the number of times the cluster is present in the environment, i.e. the minimum number of samples in which the cluster must be present.
- Cluster size: a cluster that does not reach a given proportion or count of sequences is removed.
- Biggest clusters: only the X biggest clusters are kept.
- Contaminant: a cluster is removed if its sequence matches
  - one of the proposed contaminants: phiX (a control added in Illumina sequencing technologies) or the chloroplastic/mitochondrial 16S of A. thaliana,
  - or your own contaminant sequences (a FASTA file containing the contaminants of your choice).
Once the filters of your choice have been set, the kept clusters are those that satisfy the specified thresholds in the input BIOM file. The BIOM abundance table and the FASTA file are rewritten with the kept clusters only, and the discarded clusters are listed in the excluded file.
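The removal rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the FROGS implementation: the abundance table is a plain dictionary, and the names `filter_clusters`, `min_sample_presence`, `min_abundance` and `nb_biggest` are chosen here to mirror the tool's options. The contaminant filter is left out because it requires a sequence comparison.

```python
def filter_clusters(table, min_sample_presence=None, min_abundance=None, nb_biggest=None):
    """Return (kept, excluded); excluded maps a cluster to the filters it matched.

    table: dict mapping cluster name -> dict of sample name -> sequence count.
    """
    sizes = {name: sum(counts.values()) for name, counts in table.items()}
    total = sum(sizes.values())
    # Rank clusters from the most to the least abundant.
    ranked = sorted(sizes, key=sizes.get, reverse=True)
    biggest = set(ranked[:nb_biggest]) if nb_biggest is not None else set(ranked)

    excluded = {}
    for name, counts in table.items():
        reasons = []
        prevalence = sum(1 for count in counts.values() if count > 0)
        if min_sample_presence is not None and prevalence < min_sample_presence:
            reasons.append("prevalence")
        if min_abundance is not None:
            # Assumption: a value below 1 is a proportion of all sequences,
            # any other value is an absolute number of sequences.
            threshold = min_abundance * total if min_abundance < 1 else min_abundance
            if sizes[name] < threshold:
                reasons.append("size")
        if name not in biggest:
            reasons.append("biggest")
        # One matched filter is enough to discard the cluster.
        if reasons:
            excluded[name] = reasons
    kept = [name for name in table if name not in excluded]
    return kept, excluded


# Toy abundance table (hypothetical clusters and samples).
table = {
    "Cluster_1": {"s1": 500, "s2": 450, "s3": 300},
    "Cluster_2": {"s1": 40, "s2": 0, "s3": 0},
    "Cluster_3": {"s1": 1, "s2": 1, "s3": 0},
}
kept, excluded = filter_clusters(table, min_sample_presence=2, min_abundance=0.01)
print(kept)
print(excluded)
```

In this toy table, Cluster_2 is discarded by the prevalence filter only and Cluster_3 by the size filter only, which is the kind of information the excluded file reports.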
Command line
v4.1.0
cluster_filters.py --help
usage: cluster_filters.py [-h] [-p NB_CPUS] [--debug] [-v]
[--nb-biggest-clusters NB_BIGGEST_CLUSTERS]
[-s MIN_SAMPLE_PRESENCE] [-r MIN_REPLICATE_PRESENCE]
[--replicate_file REPLICATE_FILE] [-a MIN_ABUNDANCE]
--input-biom INPUT_BIOM --input-fasta INPUT_FASTA
[--contaminant CONTAMINANT]
[--output-biom OUTPUT_BIOM]
[--output-fasta OUTPUT_FASTA] [--summary SUMMARY]
[--excluded EXCLUDED] [--log-file LOG_FILE]
Filters an abundance file
optional arguments:
-h, --help show this help message and exit
-p NB_CPUS, --nb-cpus NB_CPUS
The maximum number of CPUs used. [Default: 1]
--debug Keep temporary files to debug program.
-v, --version show program's version number and exit
Filters:
--nb-biggest-clusters NB_BIGGEST_CLUSTERS
Number of most abundant clusters you want to keep.
-s MIN_SAMPLE_PRESENCE, --min-sample-presence MIN_SAMPLE_PRESENCE
Keep cluster present in at least this number of
samples.
-r MIN_REPLICATE_PRESENCE, --min-replicate-presence MIN_REPLICATE_PRESENCE
Keep cluster present in at least this proportion of
replicates in at least one group (please indicate a
proportion between 0 and 1). Replicates must be
defined with --replicate_file REPLICATE FILE
--replicate_file REPLICATE_FILE
Replicate file must be specified if --min-replicate-
presence is set. First column of the file must
indicate the sample name, and the second column the
group name of this replicate. Exemple: TEM1_L0001_R
Temoin.
-a MIN_ABUNDANCE, --min-abundance MIN_ABUNDANCE
Minimum percentage/number of sequences, comparing to
the total number of sequences, of a cluster (between 0
and 1 if percentage desired).
Inputs:
--input-biom INPUT_BIOM
The input BIOM file. (format: BIOM)
--input-fasta INPUT_FASTA
The input FASTA file. (format: FASTA)
--contaminant CONTAMINANT
Use this databank to filter sequence before
affiliation. (format: FASTA)
Outputs:
--output-biom OUTPUT_BIOM
The BIOM file output. (format: BIOM) [Default:
cluster_filters_abundance.biom]
--output-fasta OUTPUT_FASTA
The FASTA output file. (format: FASTA) [Default:
cluster_filters.fasta]
--summary SUMMARY The HTML file containing the graphs. [Default:
cluster_filters.html]
--excluded EXCLUDED The TSV file that summarizes all the clusters
discarded. (format: TSV) [Default:
cluster_filters_excluded.tsv]
--log-file LOG_FILE This output file will contain several information on
executed commands.
Example of command line:
cluster_filters.py \
--input-biom remove_chimera.biom \
--input-fasta remove_chimera.fasta \
--min-abundance 0.00005 \
--min-sample-presence 2 \
--output-biom cluster_filters_abundance.biom \
--output-fasta cluster_filters.fasta \
--summary cluster_filters.html
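When the tool is called from a script, the same command can be assembled programmatically. The sketch below only builds and prints the argument list; the helper name `build_cluster_filters_command` is hypothetical, and the flags are limited to those shown in the help output above.

```python
import shlex


def build_cluster_filters_command(input_biom, input_fasta, min_abundance=None,
                                  min_sample_presence=None, nb_biggest_clusters=None,
                                  contaminant=None):
    """Build the argument list for cluster_filters.py from the chosen filters."""
    cmd = ["cluster_filters.py", "--input-biom", input_biom, "--input-fasta", input_fasta]
    optional = [
        ("--min-abundance", min_abundance),
        ("--min-sample-presence", min_sample_presence),
        ("--nb-biggest-clusters", nb_biggest_clusters),
        ("--contaminant", contaminant),
    ]
    for flag, value in optional:
        # Only filters that were explicitly set are added to the command.
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


# The abundance is passed as text: str(0.00005) would give '5e-05'.
cmd = build_cluster_filters_command(
    "remove_chimera.biom", "remove_chimera.fasta",
    min_abundance="0.00005", min_sample_presence=2,
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where FROGS is installed.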
Parameters
1st criterion: prevalence filter
option 1: example
option 2: example
How to build the file of replicated sample names?
The file must consist of only 2 columns, separated by a tab.
- The first column contains the exact names of the samples (exactly as they appear in the BIOM file).
- The second column contains the name of the group to which they belong. Please note that group names must not contain accents, spaces or special characters.
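Such a file can be written by hand or generated. The following sketch writes the two tab-separated columns and rejects unsuitable group names. The helper `write_replicate_file` and the sample names are hypothetical, and the rule "letters, digits and underscores only" is one strict reading of the constraint above, not a rule taken from FROGS itself.

```python
import csv
import re
import tempfile
from pathlib import Path

# Assumption: only ASCII letters, digits and underscores are accepted in group names.
GROUP_NAME = re.compile(r"[A-Za-z0-9_]+")


def write_replicate_file(path, sample_to_group):
    """Write 'sample<TAB>group' lines after checking every group name."""
    for sample, group in sample_to_group.items():
        if not GROUP_NAME.fullmatch(group):
            raise ValueError(f"Invalid group name for {sample}: {group!r}")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(sample_to_group.items())


# Sample names must be exactly those of the BIOM file (hypothetical names here).
samples = {"sample1": "rich", "sample2": "rich", "sample9": "low"}
with tempfile.TemporaryDirectory() as tmp:
    replicate_path = Path(tmp) / "replicates.tsv"
    write_replicate_file(replicate_path, samples)
    content = replicate_path.read_text(encoding="utf-8")
print(content)
```

The file produced this way is the one to pass to --replicate_file together with --min-replicate-presence.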
Results of the FROGS_4 Cluster Filters tool with this metadata file:
If we want to keep the clusters that are present in at least 50% of the samples of the same group, we set the threshold to 0.5. The process therefore keeps the clusters present in at least 2 “rich” samples, 3 “richAB” samples, 1 “lowAB” sample or 1 “april21” sample, and all clusters in sample9 since it is the only representative of the “low” condition.
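The arithmetic behind these numbers can be checked with a short sketch. The group sizes below are assumptions chosen to be consistent with the counts quoted above (the metadata file itself is not reproduced here), and rounding up with `math.ceil` is one plausible reading of "at least this proportion", not a confirmed detail of the tool.

```python
import math


def min_samples_per_group(group_sizes, min_replicate_presence):
    """Smallest number of samples per group that reaches the proportion."""
    return {group: math.ceil(min_replicate_presence * size)
            for group, size in group_sizes.items()}


def passes_replicate_filter(present_samples, groups, min_replicate_presence):
    """True if the cluster is present in enough replicates of at least one group."""
    for members in groups.values():
        present = sum(1 for sample in members if sample in present_samples)
        if present / len(members) >= min_replicate_presence:
            return True
    return False


# Hypothetical group sizes, compatible with the thresholds listed in the text.
group_sizes = {"rich": 4, "richAB": 6, "lowAB": 2, "april21": 2, "low": 1}
required = min_samples_per_group(group_sizes, 0.5)
print(required)

# Hypothetical sample names for two of the groups.
groups = {"rich": ["sample1", "sample2", "sample3", "sample4"], "low": ["sample9"]}
print(passes_replicate_filter({"sample1"}, groups, 0.5))
print(passes_replicate_filter({"sample9"}, groups, 0.5))
```

A cluster seen only in sample9 passes because that single sample represents 100% of its group, which is why single-sample groups effectively bypass this filter.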
Mistakes to avoid:
2nd criterion: cluster size filter
The “minimum proportion/number” parameter should always be used unless you are focusing on extremely rare clusters and do not mind false-positive clusters. To ensure a good community description, a 0.005% abundance filter (a proportion of 0.00005) should be applied to your data.
This threshold is justified in: Bokulich NA, Subramanian S, Faith JJ, Gevers D, Gordon JI, Knight R, Mills DA, Caporaso JG. Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing. Nat Methods. 2013 Jan;10(1):57-9. doi: 10.1038/nmeth.2276. Epub 2012 Dec 2. PMID: 23202435; PMCID: PMC3531572.
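To see what the 0.005% threshold means in practice, it helps to convert it into a number of sequences. The sketch below assumes a hypothetical dataset of 1,000,000 sequences and treats any value below 1 as a proportion, following the help text of --min-abundance; the function names are illustrative. `Decimal` is used to avoid floating-point rounding on such small values.

```python
from decimal import Decimal


def proportion_from_percent(percent):
    """Convert a percentage given as text (e.g. '0.005') into a proportion."""
    return Decimal(percent) / 100


def min_cluster_size(min_abundance, total_sequences):
    """Minimum number of sequences a cluster needs in order to be kept."""
    value = Decimal(min_abundance)
    if value < 1:
        # Read as a proportion of the total number of sequences.
        return value * total_sequences
    # Read as an absolute number of sequences.
    return value


proportion = proportion_from_percent("0.005")
total_sequences = 1_000_000  # hypothetical dataset size
threshold = min_cluster_size(proportion, total_sequences)
print(f"--min-abundance {proportion} keeps clusters of at least {threshold} sequences")
```

With these assumed numbers, a cluster needs 50 sequences to survive the filter; on a smaller dataset the same proportion corresponds to a proportionally smaller count.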
3rd criterion: keep biggest clusters
4th criterion: contaminant filter
This normalization makes it possible to compare samples with one another. However, for more precise statistical analyses, some tools such as DESeq2 need the non-normalized abundance table because they perform the normalization themselves. Be careful which table you use for further analysis.
Outputs
HTML report
The HTML file summarizes the filtering results for each sample.
Sample name: names of all samples of the run.
Initial: number of clusters present in each sample before filtering.
Kept: number of clusters present in each sample after filtering.
Other columns: one column per chosen filter. In this example, the user has chosen to apply 3 filters. Filters are applied independently: if a cluster matches at least one filter, it is removed.
To see how many clusters match several filters, explore the Venn diagram in the report.
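Because filters are applied independently, the same cluster can be counted in several filter columns. The sketch below shows how the regions of such a Venn diagram can be counted from per-filter lists of removed clusters. The cluster names and filter labels are invented for the illustration and are not taken from a real report.

```python
from collections import Counter


def venn_counts(removed_by):
    """Count clusters for each exact combination of filters that removed them."""
    membership = {}
    for filter_name, clusters in removed_by.items():
        for cluster in clusters:
            membership.setdefault(cluster, set()).add(filter_name)
    return Counter(frozenset(filters) for filters in membership.values())


# Hypothetical removed clusters for three filters.
removed_by = {
    "prevalence": {"Cluster_2", "Cluster_4", "Cluster_5", "Cluster_7"},
    "size": {"Cluster_3", "Cluster_4", "Cluster_5", "Cluster_7"},
    "contaminant": {"Cluster_5", "Cluster_6"},
}
counts = venn_counts(removed_by)
for filters, number in sorted(counts.items(), key=lambda item: sorted(item[0])):
    print(" + ".join(sorted(filters)), number)
print("total removed:", sum(counts.values()))
```

The total over all regions is the number of distinct clusters removed, which is smaller than the sum of the per-filter columns whenever filters overlap.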
Excluded file
The excluded file lists the removed clusters for each filter.
Abundance Table and sequence file
Since clusters are removed, the abundance table (BIOM format) and the sequence file (FASTA format) are updated accordingly.
A work by FROGS team