Introduction

This is trimAl's information page. You can also find information related to readAl, an MSA format converter.
trimAl is a tool for the automated removal of spurious sequences or poorly aligned regions from a multiple sequence alignment.

trimAl can consider several parameters, alone or in multiple combinations, in order to select the most-reliable positions in the alignment.
These include the proportion of sequences with a gap, the level of residue similarity and, if several alignments for the same set of sequences are provided, the consistency level of columns among alignments.
Moreover, trimAl allows you to manually select a set of columns and sequences to be removed from the alignment.

trimAl implements a series of automated algorithms that search for optimal trimming thresholds based on inherent characteristics of the input alignment, so that the signal-to-noise ratio increases after the trimming phase.

trimAl's additional features include getting the complementary alignment (the columns that were trimmed), computing statistics from the alignment, selecting the output file format, and getting a summary of trimAl's trimming in HTML and SVG formats, among many other options.

Installation

The simplest way to compile this package is:

 
    1.- Move to the project folder
    
    2.- Configure the project. 
        This will also create the file "include/ReadWriteMS/formats_header.h" needed to handle formats.
            > cmake .
    
    3.- Compile the project. 
            > make
        
        It is possible to increase the speed of compilation by compiling in multiple threads.
        Specify the number of threads providing the -jN argument, where N is the number of threads.
            > make -j2 
            > make -j3 
            > ...
 
    4.- [Optional] Move or copy binaries from folder './bin/' to '/usr/local/bin' or '/usr/bin'
        This will allow you to use trimAl directly without specifying the path to the binary.

Publications

- trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses.
- Supplementary material
- Other benchmarks

Usage

For a more detailed description of each argument and its use, please refer to What Can I Do With trimAl?

trimAl 2.rev0 build 2019-08-05
2009-2019. Víctor Fernández-Rodríguez, Salvador Capella-Gutierrez and Toni Gabaldón.
trimAl webpage: http://trimal.cgenomics.org
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, the last available version.

Please cite:
trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses.
Salvador Capella-Gutierrez; Jose M. Silla-Martinez; Toni Gabaldon.
Bioinformatics 2009, 25:1972-1973.

Basic usage
    trimal -in <inputfile> -out <outputfile> -(other options).

Common options:
For a complete list please see the User Guide or visit http://trimal.cgenomics.org

Help Options
    --help
        Print this information and show some examples.
    --version
        Print the trimAl version.
    -v <n> / --verbose <n>
        Specify the verbose level of the program.
        Available options: error / 3, warning / 2, info / 1, none / 0
        Default value: info / 1

Input-Output Options
    -in <inputfile>
        Input file in several formats.
        Available input formats: clustal, fasta, mega_interleaved, mega_sequential, nexus, phylip32, phylip40, phylip_paml, pir
 
    -formats <format1, ...>
        Specify one or more formats, separated by spaces, in which to save the resulting MSA.
        Combining this with specific format arguments (-fasta, -nexus, etc) is allowed.
        Available output formats: clustal, fasta_m10, fasta, html, mega_sequential, nexus_m10, nexus, phylip32_m10, phylip32, phylip40_m10, phylip40, phylip_paml_m10, phylip_paml, pir
 
    -out <outputfile>
        Output path pattern. (default stdout).
 
         Following tags will be replaced if present on filename:
               [in]        -> Original filename without extension.
               [format]    -> Output's format name.
               [extension] -> Output's extension.
               [contig]    -> Contig name. Only applied if using VCF.
 
         This makes it possible to save several formats without overwriting the same file,
            as each format replaces the corresponding tags with its own values.
         It also allows the original filename of the input alignment to be stored in
            the result, so the same pattern can be reused for all inputs.
         The tag [contig] allows one MSA to be stored for each contig found.
 
         This can be used to store alignments in different folders, depending on the
            format, the input alignment, etc.
         Keep in mind that trimAl DOES NOT create new folders.
 
         Note: Format and extension may be the same depending on format.
 
         Examples:
            trimal -in ./ali1.fasta -out [in].clean.[extension]
                -> ./ali1.clean.fasta
 
            trimal -in ./alignment2.fasta -out [in].clean.[extension] -clustal
                -> ./alignment2.clean.clw
 
            trimal -in ./file1.fasta -out alig.[format].[extension] -formats clustal fasta pir
                -> ./alig.clustal.clw
                -> ./alig.pir.pir
                -> ./alig.fasta.fasta
 
            trimal -in file1.fasta -out ./[in]/trimmed.[format] -formats fasta
                -> ./file1/trimmed.fasta (ONLY if folder file1 already exists)
 
    -lf / --listformats
        List available formats to load from and save to.
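
The tag replacement described for -out can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not trimAl's actual code: the function name `expand_out_pattern` and the `EXTENSIONS` table are hypothetical, and the table only contains the format-to-extension pairs that appear in the examples above.

```python
import os

# Hypothetical format -> extension table, limited to the pairs shown in the
# help text examples (clustal -> clw, fasta -> fasta, pir -> pir).
EXTENSIONS = {"clustal": "clw", "fasta": "fasta", "pir": "pir"}


def expand_out_pattern(pattern, in_path, fmt, contig=None):
    """Replace the [in], [format], [extension] and [contig] tags in a pattern."""
    # [in] is the original filename without its extension.
    stem = os.path.splitext(os.path.basename(in_path))[0]
    result = (pattern.replace("[in]", stem)
                     .replace("[format]", fmt)
                     .replace("[extension]", EXTENSIONS.get(fmt, fmt)))
    # [contig] only makes sense when MSAs are built from VCF files.
    if contig is not None:
        result = result.replace("[contig]", contig)
    return result


if __name__ == "__main__":
    for fmt in ("clustal", "fasta", "pir"):
        print(expand_out_pattern("alig.[format].[extension]", "./file1.fasta", fmt))
```

Because every format expands the tags differently, a single pattern yields one distinct path per requested format. As noted above, any folder named in the pattern must already exist.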

Report Output
    -htmlout <outputfile>
        Get a summary of trimal's work in an HTML file.
    -svgout <outputfile>
        Get a summary of trimal's work in a SVG file.
    -sgvstats <outputfile>
        Get a summary of trimal's calculated stats in a SVG file.
 
    -colnumbering
        Get the relationship between the columns in the old and new alignment.

Compare Set Options
    -compareset <inputfile>
        Input list of paths for the files containing the alignments to compare.
    -forceselect <inputfile>
        Force selection of the given input file in the files comparison method.

Backtranslation Options
    -backtrans <inputfile>
        Use a Coding Sequences file to get a backtranslation for a given AA alignment
    -ignorestopcodon
        Ignore stop codons in the input coding sequences
    -splitbystopcodon
        Split input coding sequences up to first stop codon appearance
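
As a rough illustration of what backtranslation means here, the sketch below threads an ungapped coding sequence onto a gapped amino acid sequence: each residue becomes its codon and each gap becomes three gap characters. The function name `backtranslate`, the standard stop codon set, and the choice to drop only a trailing stop codon are assumptions made for this example. trimAl's own handling of -ignorestopcodon and -splitbystopcodon may differ.

```python
# Assumption: stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def backtranslate(aligned_protein, cds, ignore_stop_codon=False):
    """Map an ungapped CDS onto a gapped protein sequence, codon by codon."""
    # Split the CDS into complete codons, ignoring a trailing partial codon.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # Assumption: "ignoring" a stop codon means dropping a trailing one.
    if ignore_stop_codon and codons and codons[-1].upper() in STOP_CODONS:
        codons = codons[:-1]
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if n_residues != len(codons):
        raise ValueError("CDS length does not match the protein sequence")
    out = []
    next_codon = 0
    for aa in aligned_protein:
        if aa == "-":
            out.append("---")          # a protein gap spans three nucleotides
        else:
            out.append(codons[next_codon])
            next_codon += 1
    return "".join(out)


if __name__ == "__main__":
    print(backtranslate("M-K", "ATGAAA"))
```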

Trimming Parameters
    --alternative_matrix degenerated_nt_identity
        Specify the degenerated nt identity matrix as the similarity matrix to use.
        If a matrix is not specified, the best suited among a set will be selected.
    -matrix <inputfile>
        Input file for user-defined similarity matrix (default is Blosum62).
    -block <n>
        Minimum column block size to be kept in the trimmed alignment.
        Available with manual and automatic (gappyout) methods
 
    -keepheader
        Keep original sequence header including non-alphanumeric characters.
        Only available for input FASTA format files.
        (future versions will extend this feature)
    -keepseqs
        Keep sequences even if they are composed only of gaps.
 
    -complementary
        Get the complementary alignment in residues.
        Reverses the effect of residue trimming:
            All residues that were to be removed are kept and vice versa.
    -complementaryseq
        Get the complementary alignment in sequences.
        Reverses the effect of sequence trimming:
            All sequences that were to be removed are kept and vice versa.
 
    -terminalonly
        Only columns outside the internal boundaries
        (first and last column without gaps) are
        candidates to be trimmed depending on the applied method
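
The idea behind -terminalonly can be sketched as follows: find the first and last gap-free columns, then discard any removal candidate that lies between them. This is a simplified reading of the description above. The helper names are hypothetical, and the alignment is assumed to be a list of equal-length strings with "-" as the gap character.

```python
def internal_boundaries(alignment):
    """Return (first, last) indices of gap-free columns, or None if none exist."""
    n_cols = len(alignment[0])
    gap_free = [c for c in range(n_cols)
                if all(seq[c] != "-" for seq in alignment)]
    if not gap_free:
        return None
    return gap_free[0], gap_free[-1]


def apply_terminal_only(alignment, columns_to_remove):
    """Keep only the removal candidates outside the internal boundaries."""
    bounds = internal_boundaries(alignment)
    if bounds is None:
        # Assumption: without any gap-free column, nothing is protected.
        return sorted(columns_to_remove)
    first, last = bounds
    return sorted(c for c in columns_to_remove if c < first or c > last)


if __name__ == "__main__":
    aln = ["--ACG-T-",
           "A-AC-GTA",
           "-AACGGT-"]
    print(internal_boundaries(aln))
    print(apply_terminal_only(aln, {0, 1, 4, 5, 7}))
```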

Trimming Methods
Manual Selection

    -selectcols { n,l,m-k }
        Selection of columns to be removed from the alignment.
        Range: [0 - (Number of Columns - 1)]. (see User Guide).
    -selectseqs { n,l,m-k }
        Selection of sequences to be removed from the alignment.
        Range: [0 - (Number of Sequences - 1)]. (see User Guide).
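
A selection such as `{ 0,2,3,10,45-60 }` can be read as a list of 0-based indices and ranges. The sketch below parses that notation and removes the selected columns. The function names are hypothetical, and treating `m-k` as inclusive of both ends is an assumption of this example.

```python
def parse_selection(text, size):
    """Parse a '{ n,l,m-k }' selection into a sorted list of 0-based indices."""
    body = text.strip().lstrip("{").rstrip("}")
    selected = set()
    for item in body.split(","):
        item = item.strip()
        if not item:
            continue
        if "-" in item:
            start, end = (int(x) for x in item.split("-"))
        else:
            start = end = int(item)
        # Valid indices run from 0 to size - 1, as stated in the help text.
        if not 0 <= start <= end < size:
            raise ValueError(f"selection {item!r} is outside [0 - {size - 1}]")
        selected.update(range(start, end + 1))  # assumption: inclusive range
    return sorted(selected)


def remove_columns(alignment, columns):
    """Return the alignment without the given column indices."""
    drop = set(columns)
    return ["".join(ch for i, ch in enumerate(seq) if i not in drop)
            for seq in alignment]


if __name__ == "__main__":
    print(parse_selection("{ 2,4,8-12 }", 20))
    print(remove_columns(["ACGT", "A-GT"], [1]))
```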

 
Automated

    -nogaps
        Remove all positions with gaps in the alignment.
    -noallgaps
        Remove columns composed only of gaps.

    -noduplicateseqs
        Removes sequences that are equal on the alignment.
        It will keep the latest sequence in the alignment.

    -gappyout
        Use automated selection on "gappyout" mode.
        This method only uses information based on gaps' distribution.
    -strict
        Use automated selection on "strict" mode.
    -strictplus
        Use automated selection on "strictplus" mode.
        Optimized for Neighbour Joining phylogenetic tree reconstruction.

    -automated1
        Use a heuristic selection of the automatic method
            based on similarity statistics. (see User Guide).
        Optimized for Maximum Likelihood phylogenetic tree reconstruction.

    -clusters <n>
        Get the N most representative sequences from a given alignment.
        Range: [1 - (Number of sequences)]
    -maxidentity <n>
        Get the representative sequences for a given identity threshold.
        Range: [0 - 1].
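
The three simplest automated filters above (-nogaps, -noallgaps and -noduplicateseqs) can be illustrated with a short sketch. It assumes an alignment stored as a list of equal-length strings with "-" as the gap character. The function names are hypothetical, and "equal" sequences are taken to mean identical aligned strings.

```python
def columns_without_gaps(alignment):
    """Columns kept by a -nogaps style filter: no sequence has a gap there."""
    n_cols = len(alignment[0])
    return [c for c in range(n_cols)
            if all(seq[c] != "-" for seq in alignment)]


def columns_not_all_gaps(alignment):
    """Columns kept by a -noallgaps style filter: at least one non-gap."""
    n_cols = len(alignment[0])
    return [c for c in range(n_cols)
            if any(seq[c] != "-" for seq in alignment)]


def drop_duplicates_keep_latest(names, sequences):
    """Among identical sequences, keep only the one appearing last."""
    last_index = {}
    for i, seq in enumerate(sequences):
        last_index[seq] = i  # later occurrences overwrite earlier ones
    keep = sorted(last_index.values())
    return [names[i] for i in keep]


if __name__ == "__main__":
    names = ["a", "b", "c"]
    seqs = ["AC-", "AG-", "AC-"]
    print(columns_without_gaps(seqs))
    print(columns_not_all_gaps(seqs))
    print(drop_duplicates_keep_latest(names, seqs))
```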

 
Overlap Trimming

    Overlap is defined as having a gap in both positions,
        an indetermination in both positions, or a residue in both positions.
    Its main purpose is to remove sequences which share only a reduced region,
        whereas the other regions are not shared with the rest of sequences
        in the alignment and filled with gaps.
    Both arguments must be provided jointly.

    Ex: Sp8 may be removed from the alignment depending on the thresholds.

    Sp8    -----GLG-----------TKSD---NNNNNNNNNNNNNNNNWV-----------------
    Sp17   --FAYTAPDLLL-IGFLLKTV-ATFG-----------------DTWFQLWQGLDLNKMPVF
    Sp10   ------DPAVL--FVIMLGTI-TKFS-----------------SEWFFAWLGLEINMMVII
    Sp26   AAAAAAAAALLTYLGLFLGTDYENFA-----------------AAAANAWLGLEINMMAQI

    -resoverlap <n>
        Minimum overlap of a position with other positions in the column
            to be considered a "good position".
        Range: [0 - 1]. (see User Guide).
    -seqoverlap <n>
        Minimum percentage of "good positions" that a sequence must have
            in order to be conserved.
        Range: [0 - 100](see User Guide).
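
One possible reading of the two overlap thresholds is sketched below. Each position is classified as a gap, an indetermination or a residue. Its overlap is the fraction of the other sequences with the same class in that column, and a sequence is kept when its percentage of "good positions" reaches -seqoverlap. The function names, the use of "X" and "?" as indetermination symbols, and the use of >= at both thresholds are assumptions. The User Guide remains the reference for the exact computation.

```python
def classify(ch):
    """Classify a character as gap, indetermination or residue."""
    if ch == "-":
        return "gap"
    if ch in "Xx?":  # assumption: symbols treated as indeterminations
        return "indetermination"
    return "residue"


def sequences_to_keep(alignment, resoverlap, seqoverlap):
    """Return indices of sequences passing both overlap thresholds."""
    n_seqs, n_cols = len(alignment), len(alignment[0])
    classes = [[classify(ch) for ch in seq] for seq in alignment]
    kept = []
    for i in range(n_seqs):
        good = 0
        for c in range(n_cols):
            # Count the other sequences sharing the same class in this column.
            same = sum(1 for j in range(n_seqs)
                       if j != i and classes[j][c] == classes[i][c])
            if same / (n_seqs - 1) >= resoverlap:
                good += 1
        if 100.0 * good / n_cols >= seqoverlap:
            kept.append(i)
    return kept


if __name__ == "__main__":
    aln = ["AAAA", "AAAA", "AAAA", "--A-"]
    print(sequences_to_keep(aln, resoverlap=0.6, seqoverlap=75))
```

In this toy alignment the last sequence shares its class with the others in only one of four columns, so it falls below the 75% requirement and is dropped.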

 
Manual Trimming - Thresholds

    -gt -gapthreshold <n>
        1 - (fraction of gaps in the column).
        Range: [0 - 1]
        Not compatible with -gat
    -gat -gapabsolutethreshold <n>
        Max number of gaps allowed on a column to keep it.
        Range: [0 - (number of sequences - 1)]
        Not compatible with -gt
    -st -simthreshold <n>
        Minimum average similarity required.
        Range: [0 - 1]
    -ct -conthreshold <n>
        Minimum consistency value required.
        Range: [0 - 1]
    -cons <n>
        Minimum percentage of positions
            in the original alignment to conserve.
        Range: [0 - 100]
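
The interaction between -gt and -cons can be illustrated with a small sketch: the gap score of a column is 1 - (fraction of gaps), columns at or above the threshold are kept, and if fewer than the -cons percentage survive, the best-scoring columns are added back until that minimum is reached. The function names, the >= comparison and the tie-breaking by column index are assumptions of this example, and trimAl may choose the recovered columns differently.

```python
import math


def gap_scores(alignment):
    """Per-column score: 1 - (fraction of gaps in the column)."""
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    return [1 - sum(seq[c] == "-" for seq in alignment) / n_seqs
            for c in range(n_cols)]


def select_columns(scores, threshold, cons=0):
    """Keep columns scoring >= threshold, but never fewer than cons percent."""
    keep = {c for c, s in enumerate(scores) if s >= threshold}
    minimum = math.ceil(len(scores) * cons / 100)
    if len(keep) < minimum:
        # Fall back to the best-scoring columns (ties broken by position).
        ranked = sorted(range(len(scores)), key=lambda c: (-scores[c], c))
        keep = set(ranked[:minimum])
    return sorted(keep)


if __name__ == "__main__":
    aln = ["ACGT-",
           "AC-T-",
           "A--T-",
           "ACGTA"]
    scores = gap_scores(aln)
    print(scores)
    print(select_columns(scores, 0.9))
    print(select_columns(scores, 0.9, cons=60))
```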

Half Windows
Half window size: the score of position i is the average over the window from (i - n) to (i + n).
Only compatible with manual methods.
-w <n>
    (half) General window size, applied to all stats.
        Not compatible with specific sizes.
-gw <n>
    (half) Window size applied to Gaps.
-sw <n>
    (half) Window size applied to Similarity.
-cw <n>
    (half) Window size applied to Consistency.
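
The window averaging described above can be sketched as follows. The function name is hypothetical, and truncating the window at the ends of the alignment is an assumption, since the help text does not say how the borders are handled.

```python
def window_average(scores, half_window):
    """Average each score over positions (i - n) .. (i + n), clipped to the ends."""
    n_cols = len(scores)
    averaged = []
    for i in range(n_cols):
        lo = max(0, i - half_window)           # assumption: clip at the start
        hi = min(n_cols - 1, i + half_window)  # assumption: clip at the end
        window = scores[lo:hi + 1]
        averaged.append(sum(window) / len(window))
    return averaged


if __name__ == "__main__":
    print(window_average([1.0, 0.0, 1.0, 1.0], 1))
```

With a half window of 0 the scores are returned unchanged. Larger values smooth isolated low-scoring columns into their neighbourhood.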

Statistics Output
Statistics to be calculated and output by trimAl
-sgc
    Print gap scores for each column in the input alignment.
-sgt
    Print accumulated gap scores for the input alignment.
-ssc
    Print similarity scores for each column in the input alignment.
-sst
    Print accumulated similarity scores for the input alignment.
-sfc
    Print sum-of-pairs scores for each column from the selected alignment
-sft
    Print accumulated sum-of-pairs scores for the selected alignment
-sident
    Print identity scores for all sequences in the input alignment.
    (see User Guide).
-soverlap
    Print overlap scores matrix for all sequences in the input alignment.
    (see User Guide).

NGS Support - VCF SNP MSA creator
Support for VCF files. Providing a reference genome,
    and one or more VCF, multiple MSA are created.
One MSA for each contig present on the whole VCF-dataset.
Each MSA contains the reference sequence
    and a sequence for each donor, with their SNP applied.
 
    -vcf <inputfile, ...>
        Specify one or more VCF files to produce MSAs
            using the input file (-in <n>) as reference genome.
        It will produce a MSA for each sequence on the original alignment.
        Each MSA will contain the same number of sequences:
            Number of donors + 1 (reference).
             
        If output file is given, it is recommended to use
            the tag "[contig]" in the filename.
            (See -out explanation)
        Otherwise, the alignments will be stacked
            one upon another on the same file.
        This is valid on formats like fasta or pir,
            but will yield a non-valid file for other formats, such as clustal.
             
        If no output file pattern is given (-out <outputfile>)
            or it doesn't contain the tag "[contig]",
            the sequences names will have the name of their contig prepended.
             
    -minquality <n>
        Specify the min quality of a SNP in VCF to apply it.
        Only valid in combination with -vcf.
         
    -mincoverage <n>
        Specify the min coverage of a SNP in VCF to apply it.
        Only valid in combination with -vcf.
         
    -ignoreFilter
        Ignore vcf-filtered variants in VCF.
        Only valid in combination with -vcf.
        Still applies min-quality and min-coverage when provided.
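
Conceptually, the VCF mode builds one MSA per contig, where each donor row is a copy of the reference with that donor's SNPs applied. The toy sketch below shows only that idea. It does not parse VCF files, the record layout (`contig`, `pos`, `alt`, `donor`, `qual`, `depth`) is a hypothetical simplification, and the filtering only mimics the intent of -minquality and -mincoverage.

```python
def build_contig_msas(reference, variants, donors,
                      min_quality=0.0, min_coverage=0):
    """Build {contig: {row name: sequence}} from a reference and SNP records.

    reference: dict mapping contig name -> reference sequence
    variants:  list of dicts with keys contig, pos (1-based), alt, donor,
               qual and depth (hypothetical, simplified record layout)
    donors:    list of donor names
    """
    msas = {}
    for contig, ref_seq in reference.items():
        rows = {"reference": list(ref_seq)}
        for donor in donors:
            rows[donor] = list(ref_seq)  # every donor starts as the reference
        msas[contig] = rows
    for var in variants:
        # Skip variants below the requested quality or coverage.
        if var["qual"] < min_quality or var["depth"] < min_coverage:
            continue
        if len(var["alt"]) != 1:
            continue  # this sketch only handles single-nucleotide changes
        msas[var["contig"]][var["donor"]][var["pos"] - 1] = var["alt"]
    return {contig: {name: "".join(chars) for name, chars in rows.items()}
            for contig, rows in msas.items()}


if __name__ == "__main__":
    ref = {"chr1": "ACGT", "chr2": "GG"}
    snps = [
        {"contig": "chr1", "pos": 2, "alt": "T", "donor": "d1", "qual": 50, "depth": 10},
        {"contig": "chr2", "pos": 1, "alt": "A", "donor": "d2", "qual": 5, "depth": 10},
    ]
    for contig, rows in build_contig_msas(ref, snps, ["d1", "d2"], min_quality=20).items():
        print(contig, rows)
```

Each resulting MSA has the same number of rows, the donors plus the reference, which matches the description of -vcf above.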

Legacy Options
These options are included for back-compatibility with older versions of trimAl.
New formats will not be added to this list of output format arguments.
The new formats argument "-formats <format1, format2, etc>" should be used instead.
-nbrf
    Output file in NBRF/PIR format
-mega
    Output file in MEGA format
-nexus
    Output file in NEXUS format
-clustal
    Output file in CLUSTAL format
-fasta
    Output file in FASTA format
-fasta_m10
    Output file in FASTA format.
    Sequences name length up to 10 characters.
-phylip
    Output file in PHYLIP/PHYLIP4 format
-phylip_m10
    Output file in PHYLIP/PHYLIP4 format.
    Sequences name length up to 10 characters.
-phylip_paml
    Output file in PHYLIP format compatible with PAML
-phylip_paml_m10
    Output file in PHYLIP format compatible with PAML.
    Sequences name length up to 10 characters.
-phylip3.2
    Output file in PHYLIP3.2 format
-phylip3.2_m10
    Output file in PHYLIP3.2 format.
    Sequences name length up to 10 characters.

Some Examples:
 
1) Removes all positions in the alignment with gaps in 10% or more of
   the sequences, unless this leaves less than 60% of the original alignment.
   In that case, print the best 60% of positions (those with the fewest gaps).
    
   trimal -in <inputfile> -out <outputfile> -gt 0.9 -cons 60
         
2) As above, but the gap score is averaged over a window starting
   3 positions before and ending 3 positions after each column.
    
   trimal -in <inputfile> -out <outputfile> -gt 0.9 -cons 60 -w 3
         
3) Use an automatic method to decide optimal thresholds, based on the gap scores
   from input alignment. (see User Guide for details).
    
   trimal -in <inputfile> -out <outputfile> -gappyout
         
4) Use automatic methods to decide optimal thresholds, based on the combination
   of gap and similarity scores. (see User Guide for details).
    
   trimal -in <inputfile> -out <outputfile> -strictplus
         
5) Use a heuristic to decide the optimal method for trimming the alignment.
   (see User Guide for details).
    
   trimal -in <inputfile> -out <outputfile> -automated1
         
6) Use residue and sequence overlap thresholds to delete some sequences from the
   alignment. (see User Guide for details).
    
   trimal -in <inputfile> -out <outputfile> -resoverlap 0.8 -seqoverlap 75
         
7) Selection of columns to be deleted from the alignment. The selection can
   be a column number or a column number interval. Start from 0
    
   trimal -in <inputfile> -out <outputfile> -selectcols { 0,2,3,10,45-60,68,70-78 }
         
8) Get the complementary alignment from the alignment previously trimmed.
 
   trimal -in <inputfile> -out <outputfile> -selectcols { 0,2,3,10,45-60,68,70-78 } -complementary
         
9) Selection of sequences to be deleted from the alignment. Start from 0
 
   trimal -in <inputfile> -out <outputfile> -selectseqs { 2,4,8-12 }
         
10) Select the 5 most representative sequences from the alignment
 
   trimal -in <inputfile> -out <outputfile> -clusters 5

Citing this tool

    trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses.
    Salvador Capella-Gutierrez; Jose M. Silla-Martinez; Toni Gabaldon.
    Bioinformatics 2009, 25:1972-1973.

Understanding the code

You can start by taking a look at:
- trimAl Manager
- Alignment

Acknowledgments

Alan Zucconi: his post explaining bitwise operators is used as a reference when explaining SequenceTypes.