Version history

Added

  • #124
    • Added new subworkflow MERGE_FAMILIES that can optionally merge similar (but not redundant) generated protein families. (by @vagkaratzas)
    • Added new functionality to the local module IDENTIFY_REDUNDANT_FAMS which now also detects and outputs the identifiers of similar families that can optionally be merged downstream. These identifiers are written to “/remove_redundancy/<samplename>/similar_fam_ids.txt”, and the corresponding family pairwise similarity scores to “/remove_redundancy/<samplename>/similarities.csv”. (by @vagkaratzas)
    • Added new local module POOL_SIMILAR_COMPONENTS that generates family clusters from a family-similarity edge list. (by @vagkaratzas)
    • Added new local module MERGE_SEEDS that merges seed alignments of similar families, before restarting the family generation subworkflow. (by @vagkaratzas)
  • #118
    • Added preprint citation to the repo. (by @vagkaratzas)
    • Added separate metro map files for dark and light browser modes. (by @vagkaratzas)
    • Added new local module EXTRACT_FAMILY_MEMBERS which outputs a two-column TSV file containing the final family identifiers and their corresponding member sequence identifiers. The file is saved at “/family_reps/<samplename>/<samplename>.tsv”. (by @vagkaratzas)
  • #117
    • Added SEQKIT_SEQ for optional sequence preprocessing in the quality check subworkflow. (by @vagkaratzas)
    • Added SEQKIT_REPLACE for optional sequence name parsing in the quality check subworkflow. (by @vagkaratzas)
    • Added SEQKIT_RMDUP for optional removal of duplicate names and sequences in the quality check subworkflow. (by @vagkaratzas)
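
The pooling step described for POOL_SIMILAR_COMPONENTS under #124 (family clusters derived from a family-similarity edge list) can be pictured as a connected-components computation. The following is only a minimal sketch under that assumption, not the module's actual code: the function name `pool_similar_components` and the `fam_*` identifiers are hypothetical, and union-find is one of several ways to group linked families.

```python
from collections import defaultdict


def pool_similar_components(edges):
    """Group family ids into clusters (connected components) from an edge list."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up to the root,
        # shortening the path along the way.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for left, right in edges:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            # Merge the two components.
            parent[root_right] = root_left

    clusters = defaultdict(set)
    for node in parent:
        clusters[find(node)].add(node)
    # Sort members and clusters so the output is deterministic.
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical similarity pairs: fam_1~fam_2 and fam_2~fam_3 end up in one cluster.
edges = [("fam_1", "fam_2"), ("fam_2", "fam_3"), ("fam_7", "fam_9")]
print(pool_similar_components(edges))
# [['fam_1', 'fam_2', 'fam_3'], ['fam_7', 'fam_9']]
```

Note that similarity is treated as transitive here: fam_1 and fam_3 are pooled together even without a direct edge between them.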

Changed

  • #128 - nf-core tools template update to 3.4.1.
  • #124
    • Conditional workflow flags switched to their skip opposites: --trim_msa to --skip_msa_trimming, --recruit_sequences_with_models to --skip_additional_sequence_recruiting, --remove_family_redundancy to --skip_family_redundancy_removal, and --remove_sequence_redundancy to --skip_sequence_redundancy_removal. (by @vagkaratzas)
  • #118
    • Swapped the local CHECK_QUALITY subworkflow with the new nf-core one FAA_SEQFU_SEQKIT. (by @vagkaratzas)
    • Based on protein family reproducibility benchmarks (i.e., computationally reproducing manually curated protein family resources), the cluster_seq_identity and cluster_coverage parameter default values have been updated to 0.3 and 0.5 (down from 0.5 and 0.9) respectively. (by @vagkaratzas)
  • #117 - Swapped the local SEQKIT_STATS and the local SEQKIT_STATS_TO_MQC modules with the SEQFU_STATS one, which runs a bit faster and produces a MultiQC-ready output without the need for manual parsing. (by @vagkaratzas)
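
Because the #124 flags were renamed to their skip opposites, an existing parameter set needs both the new names and inverted values. The sketch below shows one way to translate old settings. It assumes the old flags were plain booleans and that the parameters are available as a dictionary (for example, loaded from a params file). `migrate_params` is a hypothetical helper and not part of the pipeline.

```python
# Old flag -> new flag, as listed in the changelog entry for #124.
RENAMED_FLAGS = {
    "trim_msa": "skip_msa_trimming",
    "recruit_sequences_with_models": "skip_additional_sequence_recruiting",
    "remove_family_redundancy": "skip_family_redundancy_removal",
    "remove_sequence_redundancy": "skip_sequence_redundancy_removal",
}


def migrate_params(old_params):
    """Return a copy of old_params with renamed flags and inverted booleans."""
    new_params = {}
    for name, value in old_params.items():
        if name in RENAMED_FLAGS:
            # Assumption: old flags were booleans, so "do X" becomes "skip X" negated.
            new_params[RENAMED_FLAGS[name]] = not value
        else:
            # Unrelated parameters are carried over unchanged.
            new_params[name] = value
    return new_params


old = {"trim_msa": True, "remove_family_redundancy": False, "outdir": "results"}
print(migrate_params(old))
# {'skip_msa_trimming': False, 'skip_family_redundancy_removal': True, 'outdir': 'results'}
```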

Dependencies

Tool      Previous version   New version
seqfu     -                  1.20.3
multiqc   1.30               1.31

Deprecated

  • #124 - Deprecated --trim_msa, --recruit_sequences_with_models, --remove_family_redundancy and --remove_sequence_redundancy. (by @vagkaratzas)

Special Thanks

To @jfy133, @erikrikarddaniel and @chrisAta for this version’s PR code reviews.

Fixed

  • #112 - Fixed a bug in EXTRACT_FAMILY_REPS, where all sequences were pasted into the family representative one, and updated the relevant local nf-test. (by @vagkaratzas)

Changed

  • #106 - Swapped the local EXECUTE_CLUSTERING subworkflow with the new nf-core MMSEQS_FASTA_CLUSTER one. (by @vagkaratzas)

Dependencies

Tool      Previous version   New version
multiqc   1.29               1.30

Changed

  • #104 - Pulling params from local subworkflows into main workflow.
  • #103 - Parallelized execution for the EXTRACT_FAMILY_REPS local module and changed its input from full_msa to fasta.
  • #100 - CAT_CAT module replaced with FIND_CONCATENATE to avoid “Argument list too long” errors when concatenating very large numbers of files.
  • #98 - nf-core tools template update to 3.3.2.
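
The “Argument list too long” error behind #100 arises when a very large number of file paths is passed on a single command line. A common way around it is to discover the files inside the process and stream them one by one. The sketch below illustrates that idea in Python and is not the FIND_CONCATENATE module's implementation: `concatenate_matching` is a hypothetical helper, and the `chunk_*.faa` file names are made up for the demonstration.

```python
import shutil
import tempfile
from pathlib import Path


def concatenate_matching(directory, pattern, output_path):
    """Stream every file matching pattern in directory into output_path."""
    output_path = Path(output_path)
    count = 0
    with output_path.open("wb") as out:
        # Files are discovered here instead of being passed as arguments;
        # sorting keeps the concatenation order reproducible.
        for path in sorted(Path(directory).glob(pattern)):
            if path.resolve() == output_path.resolve():
                continue  # never read the file we are writing
            with path.open("rb") as src:
                shutil.copyfileobj(src, out)
            count += 1
    return count


# Small demonstration with three made-up FASTA chunks.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        Path(tmp, f"chunk_{i}.faa").write_text(f">seq{i}\nMKV\n")
    merged = Path(tmp, "all.fasta")
    n_files = concatenate_matching(tmp, "*.faa", merged)
    merged_text = merged.read_text()

print(n_files)  # 3
```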

Added

  • #105 - CHECK_QUALITY subworkflow added at the start of the pipeline. It utilizes the seqkit/stats nf-core module to generate a MultiQC-ready report with statistics for the input amino acid sequences. The metro-map has been updated to reflect this change.

Added

  • #93
    • Added nf-test and meta.yml file for local subworkflow GENERATE_FAMILIES.
    • Added nf-test and meta.yml file for local subworkflow REMOVE_REDUNDANCY.
    • Added nf-test and meta.yml file for local subworkflow UPDATE_FAMILIES.
  • #88
    • Added nf-test and meta.yml file for local module BRANCH_HITS_FASTA.
    • Added nf-test and meta.yml file for local module FILTER_NON_REDUNDANT_FAMS.
    • Added nf-test and meta.yml file for local module IDENTIFY_REDUNDANT_FAMS.
    • Added nf-test and meta.yml file for local module EXTRACT_FAMILY_REPS.
    • Added the default pipeline end-to-end nf-test.

Changed

  • #81 - nf-core tools template update to 3.3.1.

Fixed

  • #80 - Fixed a bug where, due to a missing check for equal family sizes, non-redundant families were erroneously marked as redundant through transitive relationships and were removed.

Changed

  • #77 - Default branch changed from master to main.
  • #73 - Changed the fasta parsing library of the CHUNK_CLUSTERS local module, from pyfastx back to the latest version of biopython, and parallelized its writing mechanism, achieving decreased execution time.

Dependencies

Tool        Previous version   New version
biopython   1.84               1.85
pyfastx     2.2.0              -

Removed

  • #73 - Removed the pyfastx version of the CHUNK_CLUSTERS module, since it was struggling performance-wise with larger datasets.

Added

  • #69 - Added the hhsuite/reformat nf-core module to reformat .sto alignments to .fas when in-family sequence redundancy is not removed. Also added the option to save intermediate and final family fasta files throughout the workflow with various save parameters.
  • #58 - Added nf-test and meta.yml file for local module REMOVE_REDUNDANCY_SEQS (Hackathon 2025)
  • #56 - Added nf-test and meta.yml file for local module FILTER_RECRUITED (Hackathon 2025)
  • #55 - Added nf-test and meta.yml file for local module CHUNK_CLUSTERS (Hackathon 2025)
  • #54 - Added nf-test for local subworkflow ALIGN_SEQUENCES (Hackathon 2025)
  • #53 - Added nf-test for local subworkflow EXECUTE_CLUSTERING (Hackathon 2025)
  • #51 - Added nf-test and meta.yml file for local module CALCULATE_CLUSTER_DISTRIBUTION (Hackathon 2025)
  • #34 - Added the EXTRACT_UNIQUE_CLUSTER_REPS module, which calculates initial MMseqs clustering metadata for each sample, to print with MultiQC (Id, Cluster Size, Number of Clusters)

Fixed

  • #69 - Fixed a bug where redundant family alignments were not published properly if the intra-family redundancy removal mechanism was switched off #68
  • #65 - Fixed a bug in CHUNK_CLUSTERS, where the pipeline would crash if the module filtered out all clusters due to a high membership threshold #64
  • #35 - Fixed a bug in remove_redundant_fams.py, where family sizes were compared as strings instead of integers when deciding which (larger) family to keep
  • #33 - Fixed an always-true condition in the filter_non_redundant_hmms.py script, by adding missing parentheses
  • #29 - Fixed hmmalign empty input crash error, by preventing the FILTER_RECRUITED module from creating an empty output .fasta.gz file, when there are no remaining sequences after filtering the hmmsearch results #28
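
The #35 fix concerns a general Python pitfall: numbers read from text files are strings, and strings compare lexicographically rather than numerically. The sketch below only illustrates the pitfall and is not the code from remove_redundant_fams.py. The `larger_family` helper and the example sizes are hypothetical.

```python
def larger_family(size_a, size_b):
    """Return 'a' or 'b' for the larger family, comparing sizes numerically."""
    return "a" if int(size_a) >= int(size_b) else "b"


# Sizes parsed from a text file arrive as strings.
size_a, size_b = "9", "10"

print(size_a > size_b)                # True: "9" sorts after "1" lexicographically
print(int(size_a) > int(size_b))      # False: 9 is smaller than 10
print(larger_family(size_a, size_b))  # b
```

Converting to integers before comparing ensures that the family with 10 members, not the one with 9, is the one that is kept.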

Changed

  • #69 - Changed the publish directory architecture for HMMs, seed MSAs, full MSAs and family FASTA files, to make it more intuitive. REMOVE_REDUNDANT_FAMS local module converted to IDENTIFY_REDUNDANT_FAMS to extract redundant family ids which will then be used downstream. FILTER_NON_REDUNDANT_HMMS local module converted to FILTER_NON_REDUNDANT_FAMS and reused four times (HMM, seed MSA, full MSA, FASTA). Changed the output format of the EXTRACT_FAMILY_REPS and REMOVE_REDUNDANT_SEQS local modules from .fa to .faa. Metro map updated with new hhsuite/reformat module.
  • #57 - Slight improvements to nextflow_schema.json (Hackathon 2025)
  • #57 - Slight improvements to assets/schema_input.json (Hackathon 2025)
  • #34 - Swapped the SeqIO python library with pyfastx for the CHUNK_CLUSTERS module, quartering its duration
  • #32 - Updated ClipKIT from 2.4.0 to 2.4.1, which now also allows ends-only trimming, to completely replace the custom CLIP_ENDS module. Users can now also define its output format by setting the --clipkit_out_format parameter (default: clipkit)

Dependencies

Tool      Previous version   New version
ClipKIT   2.4.0              2.4.1
pyfastx   -                  2.2.0
hhsuite   -                  3.3.0
multiqc   1.27               1.28

Deprecated

  • #32 - Deprecated CLIP_ENDS module and --clipping_tool parameter. The only option now is ClipKIT, covering both previous modes, via setting --trim_ends_only

Initial release of nf-core/proteinfamilies, created with the nf-core template.

Added

  • Amino acid sequence clustering (mmseqs)
  • Multiple sequence alignment (famsa, mafft, clipkit)
  • Hidden Markov Model generation (hmmer)
  • Between families redundancy removal (hmmer)
  • In-family sequence redundancy removal (mmseqs)
  • Family updating (hmmer, seqkit, mmseqs, famsa, mafft, clipkit)
  • Family statistics presentation (multiqc)

By @vagkaratzas and @mberacochea.