nf-core/proteinfamilies
Generation and updating of protein families
metagenomics, protein-families, proteomics
Version history
Added
- #124
  - Added new subworkflow MERGE_FAMILIES that can optionally merge similar (but not redundant) generated protein families. (by @vagkaratzas)
  - Added new functionality to the local module IDENTIFY_REDUNDANT_FAMS, which now also detects and outputs the identifiers of similar families that can optionally be merged downstream. These identifiers are written to “/remove_redundancy/<samplename>/similar_fam_ids.txt”, and the corresponding family pairwise similarity scores to “/remove_redundancy/<samplename>/similarities.csv”. (by @vagkaratzas)
  - Added new local module POOL_SIMILAR_COMPONENTS that generates family clusters from a family-similarity edgelist. (by @vagkaratzas)
  - Added new local module MERGE_SEEDS that merges seed alignments of similar families before restarting the family generation subworkflow. (by @vagkaratzas)
- #118
  - Added preprint citation to the repo. (by @vagkaratzas)
  - Added separate metro map files for dark and light browser modes. (by @vagkaratzas)
  - Added new local module EXTRACT_FAMILY_MEMBERS which outputs a two-column TSV file containing the final family identifiers and their corresponding member sequence identifiers. The file is saved at “/family_reps/<samplename>/<samplename>.tsv”. (by @vagkaratzas)
- #117
  - Added SEQKIT_SEQ for optional sequence preprocessing in the quality check subworkflow. (by @vagkaratzas)
  - Added SEQKIT_REPLACE for optional sequence name parsing in the quality check subworkflow. (by @vagkaratzas)
  - Added SEQKIT_RMDUP for optional removal of duplicate names and sequences in the quality check subworkflow. (by @vagkaratzas)
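The POOL_SIMILAR_COMPONENTS entry above describes turning a family-similarity edgelist into family clusters. The following is a minimal sketch of one way to do that, assuming that a cluster is a connected component of the similarity graph and that the edgelist is a list of family-identifier pairs. The function name, the input shape and the family ids are hypothetical, and the actual module may work differently.

```python
from collections import defaultdict


def pool_components(edges):
    """Group family ids into clusters.

    Assumption: a cluster is a connected component of the undirected
    similarity graph. This is an illustration, not the module's code.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    clusters = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Iterative depth-first traversal collects one component.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Hypothetical family ids: fam1, fam2 and fam3 are chained; fam4 and fam5 form a pair.
example_edges = [("fam1", "fam2"), ("fam2", "fam3"), ("fam4", "fam5")]
print(pool_components(example_edges))
# [['fam1', 'fam2', 'fam3'], ['fam4', 'fam5']]
```

Under this assumption, fam1 and fam3 end up in the same cluster even though they share no direct edge, because similarity is followed transitively through fam2.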
Changed
- #128 - nf-core tools template update to 3.4.1.
- #124
  - Conditional workflow flags switched to their skip opposites: --trim_msa to --skip_msa_trimming, --recruit_sequences_with_models to --skip_additional_sequence_recruiting, --remove_family_redundancy to --skip_family_redundancy_removal, --remove_sequence_redundancy to --skip_sequence_redundancy_removal. (by @vagkaratzas)
- #118
  - Swapped the local CHECK_QUALITY subworkflow with the new nf-core one, FAA_SEQFU_SEQKIT. (by @vagkaratzas)
  - Based on protein family reproducibility benchmarks (i.e., computationally reproducing manually curated protein family resources), the cluster_seq_identity and cluster_coverage parameter default values have been updated to 0.3 and 0.5 (down from 0.5 and 0.9), respectively. (by @vagkaratzas)
- #117 - Swapped the local SEQKIT_STATS and the local SEQKIT_STATS_TO_MQC modules with the SEQFU_STATS one, which runs a bit faster and produces a MultiQC-ready output without the need for manual parsing. (by @vagkaratzas)
Dependencies
Tool | Previous version | New version |
---|---|---|
seqfu | - | 1.20.3 |
multiqc | 1.30 | 1.31 |
Deprecated
- #124 - Deprecated --trim_msa, --recruit_sequences_with_models, --remove_family_redundancy and --remove_sequence_redundancy. (by @vagkaratzas)
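For existing parameter files that still use the deprecated flags, the renaming can be applied mechanically. The sketch below assumes that each old flag was a plain boolean that enabled a step, so its skip opposite takes the inverted value. The helper name and the example parameters are hypothetical, and the flag defaults are not taken into account.

```python
# Old flag -> new skip flag, as listed in the #124 entries of this changelog.
OLD_TO_SKIP = {
    "trim_msa": "skip_msa_trimming",
    "recruit_sequences_with_models": "skip_additional_sequence_recruiting",
    "remove_family_redundancy": "skip_family_redundancy_removal",
    "remove_sequence_redundancy": "skip_sequence_redundancy_removal",
}


def migrate_params(params):
    """Return a copy of params with deprecated flags renamed and inverted.

    Assumption: the old flags were booleans where True enabled the step,
    so the matching skip flag is the logical negation.
    """
    migrated = {}
    for name, value in params.items():
        if name in OLD_TO_SKIP:
            migrated[OLD_TO_SKIP[name]] = not value
        else:
            migrated[name] = value
    return migrated


# Hypothetical parameter set written for an older release.
old_params = {"input": "samplesheet.csv", "trim_msa": True, "remove_family_redundancy": False}
print(migrate_params(old_params))
# {'input': 'samplesheet.csv', 'skip_msa_trimming': False, 'skip_family_redundancy_removal': True}
```

Parameters that are not in the mapping pass through unchanged, so the helper can be run on a full parameter set.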
Special Thanks
To @jfy133, @erikrikarddaniel and @chrisAta for this version’s PR code reviews.
Fixed
- #112 - Fixed a bug in EXTRACT_FAMILY_REPS, where all sequences were pasted into the family representative one, and updated the relevant local nf-test. (by @vagkaratzas)
Changed
- #106 - Swapped the local EXECUTE_CLUSTERING subworkflow with the new nf-core MMSEQS_FASTA_CLUSTER one. (by @vagkaratzas)
Dependencies
Tool | Previous version | New version |
---|---|---|
multiqc | 1.29 | 1.30 |
Changed
- #104 - Pulled params from local subworkflows into the main workflow.
- #103 - Parallelized execution for the EXTRACT_FAMILY_REPS local module and changed its input from full_msa to fasta.
- #100 - CAT_CAT module replaced with FIND_CONCATENATE to avoid large-scale "Argument list too long" errors.
- #98 - nf-core tools template update to 3.3.2.
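An "Argument list too long" error typically appears when a single command receives more file paths than the operating system allows on one command line. As a rough illustration of the general idea behind avoiding it (not the FIND_CONCATENATE implementation), the sketch below discovers the files inside the process and streams them into one output, so no long argument list is ever built. The function name, glob pattern and paths are hypothetical.

```python
import shutil
from pathlib import Path


def concatenate_files(input_dir, pattern, output_path):
    """Append every file in input_dir matching pattern to output_path.

    File names are discovered by globbing inside the process instead of
    being passed as command-line arguments. Returns the number of files
    that were concatenated.
    """
    output_path = Path(output_path)
    count = 0
    with open(output_path, "wb") as out:
        for path in sorted(Path(input_dir).glob(pattern)):
            if path.resolve() == output_path.resolve():
                continue  # never read the file that is being written
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)  # stream in chunks
            count += 1
    return count


# Hypothetical usage:
# concatenate_files("chunks", "*.faa", "all_families.faa")
```

Files are appended in sorted name order, which keeps the output deterministic across runs.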
Added
- #105 - CHECK_QUALITY subworkflow added at the start of the pipeline. It utilizes the seqkit/stats nf-core module to generate a MultiQC-ready report with statistics for the input amino acid sequences. The metro-map has been updated to reflect this change.
Added
- #93
  - Added nf-test and meta.yml file for local subworkflow GENERATE_FAMILIES.
  - Added nf-test and meta.yml file for local subworkflow REMOVE_REDUNDANCY.
  - Added nf-test and meta.yml file for local subworkflow UPDATE_FAMILIES.
- #88
  - Added nf-test and meta.yml file for local module BRANCH_HITS_FASTA.
  - Added nf-test and meta.yml file for local module FILTER_NON_REDUNDANT_FAMS.
  - Added nf-test and meta.yml file for local module IDENTIFY_REDUNDANT_FAMS.
  - Added nf-test and meta.yml file for local module EXTRACT_FAMILY_REPS.
  - Added the default pipeline end-to-end nf-test.
Changed
- #81 - nf-core tools template update to 3.3.1.
Fixed
- #80 - Fixed a bug where, due to a missing check for equal family sizes, non-redundant families were erroneously marked as redundant through transitive relationships and were removed.
Changed
- #77 - Default branch changed from master to main.
- #73 - Changed the fasta parsing library of the CHUNK_CLUSTERS local module, from pyfastx back to the latest version of biopython, and parallelized its writing mechanism, achieving decreased execution time.
Dependencies
Tool | Previous version | New version |
---|---|---|
biopython | 1.84 | 1.85 |
pyfastx | 2.2.0 | - |
Removed
- #73 - Deprecated pyfastx module version of CHUNK_CLUSTERS, since it was struggling performance-wise with larger datasets.
Added
- #69 - Added the hhsuite/reformat nf-core module to reformat .sto alignments to .fas when in-family sequence redundancy is not removed. Also added the option to save intermediate and final family fasta files throughout the workflow with various save parameters.
- #58 - Added nf-test and meta.yml file for local module REMOVE_REDUNDANCY_SEQS (Hackathon 2025)
- #56 - Added nf-test and meta.yml file for local module FILTER_RECRUITED (Hackathon 2025)
- #55 - Added nf-test and meta.yml file for local module CHUNK_CLUSTERS (Hackathon 2025)
- #54 - Added nf-test for local subworkflow ALIGN_SEQUENCES (Hackathon 2025)
- #53 - Added nf-test for local subworkflow EXECUTE_CLUSTERING (Hackathon 2025)
- #51 - Added nf-test and meta.yml file for local module CALCULATE_CLUSTER_DISTRIBUTION (Hackathon 2025)
- #34 - Added the EXTRACT_UNIQUE_CLUSTER_REPS module, which calculates initial MMseqs clustering metadata for each sample, to print with MultiQC (Id, Cluster Size, Number of Clusters)
Fixed
- #69 - Fixed a bug where redundant family alignments were not published properly, if intra-family redundancy removal mechanism was switched off #68
- #65 - Fixed a bug in CHUNK_CLUSTERS, where pipeline would crash if the module filtered out all clusters, due to a high membership threshold #64
- #35 - Fixed a bug in remove_redundant_fams.py, where comparison was between strings instead of integers to keep larger family
- #33 - Fixed an always-true condition at the filter_non_redundant_hmms.py script, by adding missing parentheses
- #29 - Fixed hmmalign empty input crash error, by preventing the FILTER_RECRUITED module from creating an empty output .fasta.gz file, when there are no remaining sequences after filtering the hmmsearch results #28
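The string-versus-integer comparison fixed in #35 is a common pitfall when sizes are read from text files. The snippet below is a generic illustration of why it matters when picking the larger family. It is not the code from remove_redundant_fams.py, and the helper name and values are hypothetical.

```python
def larger_family(size_a, size_b):
    """Return "a" or "b" for the larger family, comparing sizes as integers."""
    return "a" if int(size_a) >= int(size_b) else "b"


# Sizes parsed from a text file arrive as strings.
size_a, size_b = "9", "10"

print(size_a > size_b)                # True: strings compare character by character, and "9" > "1"
print(int(size_a) > int(size_b))      # False: 9 is smaller than 10
print(larger_family(size_a, size_b))  # b
```

Converting to integers before comparing ensures that a family with 10 members is kept over one with 9.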
Changed
- #69 - Changed the publish directory architecture for HMMs, seed MSAs, full MSAs and family FASTA files, to make it more intuitive. REMOVE_REDUNDANT_FAMS local module converted to IDENTIFY_REDUNDANT_FAMS to extract redundant family ids which will then be used downstream. FILTER_NON_REDUNDANT_HMMS local module converted to FILTER_NON_REDUNDANT_FAMS and reused four times (HMM, seed MSA, full MSA, FASTA). Changed the output format of the EXTRACT_FAMILY_REPS and REMOVE_REDUNDANT_SEQS local modules from .fa to .faa. Metro map updated with new hhsuite/reformat module.
- #57 - Slight improvements of nextflow_schema.json (Hackathon 2025)
- #57 - Slight improvements of assets/schema_input.json (Hackathon 2025)
- #34 - Swapped the SeqIO python library with pyfastx for the CHUNK_CLUSTERS module, quartering its duration
- #32 - Updated ClipKIT 2.4.0 -> 2.4.1, which now also allows ends-only trimming, to completely replace the custom CLIP_ENDS module. Users can now also define its output format by setting the --clipkit_out_format parameter (default: clipkit)
Dependencies
Tool | Previous version | New version |
---|---|---|
ClipKIT | 2.4.0 | 2.4.1 |
pyfastx | - | 2.2.0 |
hhsuite | - | 3.3.0 |
multiqc | 1.27 | 1.28 |
Deprecated
- #32 - Deprecated CLIP_ENDS module and --clipping_tool parameter. The only option now is ClipKIT, covering both previous modes, via setting --trim_ends_only
Initial release of nf-core/proteinfamilies, created with the nf-core template.
Added
- Amino acid sequence clustering (mmseqs)
- Multiple sequence alignment (famsa, mafft, clipkit)
- Hidden Markov Model generation (hmmer)
- Between families redundancy removal (hmmer)
- In-family sequence redundancy removal (mmseqs)
- Family updating (hmmer, seqkit, mmseqs, famsa, mafft, clipkit)
- Family statistics presentation (multiqc)
By @vagkaratzas and @mberacochea.