gene orthology software behind OrthoDB

OrthoLoger v3.0.5 is the current stable version!
It is available as a Docker container and on GitLab.

Orthologs are genes in different species that evolved from a common ancestral gene by speciation.

The LEMMI-style benchmarking shows its state-of-the-art performance.

Cite us to help support the project

OrthoDB v11: annotation of orthologs in the widest sampling of organismal diversity. D Kuznetsov, F Tegenfeldt, M Manni, M Seppey, M Berkeley, EV Kriventseva, EM Zdobnov. NAR, Nov 2022. doi:10.1093/nar/gkac996, PMID:36350662

License

OrthoLoger software is licensed under the GNU General Public License.

Getting OrthoLoger software

OrthoLoger, the OrthoDB standalone pipeline for delineation of orthologs, is freely available

  • as a ready-to-run docker image
docker pull ezlabgva/orthologer:v3.0.5
docker run -u $(id -u) -v ${where}:/odbwork ezlabgva/orthologer:v3.0.5 setup_odb.sh
docker run -u $(id -u) -v ${where}:/odbwork ezlabgva/orthologer:v3.0.5 ./orthologer.sh "command" "options"

  • or you can build the docker image yourself
git clone https://gitlab.com/ezlab/orthologer_container.git
cd orthologer_container
docker build -t orthologer .
  • or build a local instance of orthologer manually
curl https://data.orthodb.org/v11/download/software/orthologer_3.0.5.tgz -O
curl https://data.orthodb.org/v11/download/software/orthologer_3.0.5.md5sum -O
# check md5sum
md5sum -c orthologer_3.0.5.md5sum

# if previous md5sum checks out OK, then unpack the package
tar -xzf orthologer_3.0.5.tgz

and follow the instructions in orthologer_3.0.5/README.

Issues

For any issues related to orthologer, see the issues board.

Overview

The orthologer package contains two different modules:

  1. orthologer - computes orthologs given a set of fasta files
  2. orthomapper - maps a single fasta file to OrthoDB clusters

Both are run in the same work directory. orthomapper runs on top of orthologer with an additional configuration file.

Setting up orthologer

Assume DIR_PIPELINE is the install location of orthologer.

If the orthologer package is installed locally, create the basic setup using the following procedure:

  1. Create a new empty directory
  2. From within this new directory run $DIR_PIPELINE/bin/setup_empty.sh
  3. It will give instructions for how to proceed and/or whether or not something is missing

If the orthologer_container repo is installed, use the following to get further help:

docker run -u $(id -u) ezlabgva/orthologer:v3.0.5 orthologer

The configuration file orthologer_conf.sh may need an edit. Variables are described in comments.

The number of threads (or parallel processes) is set automatically. The value depends on whether hyperthreading (HT) is enabled:

  • with HT : N(cores)
  • without HT : N(cores)/2
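
The rule above can be written as a few lines of shell. This is only a sketch of the stated arithmetic: the function name default_njobs and its two arguments are hypothetical, and how orthologer itself detects the core count and HT is not covered here.

```shell
# default_njobs <n_cores> <ht>   (hypothetical helper, not part of orthologer)
# <ht> is 1 when hyperthreading is enabled and 0 otherwise.
# N(cores) is taken as given by the caller.
default_njobs() {
    ncores=$1
    ht=$2
    if [ "$ht" -eq 1 ]; then
        echo "$ncores"             # with HT: N(cores)
    else
        echo $(( ncores / 2 ))     # without HT: N(cores)/2 (integer division)
    fi
}

default_njobs 16 1
default_njobs 16 0
```

The value actually used by a run is the one in orthologer_conf.sh, so this is useful mainly as a sanity check before overriding it.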

In orthologer_conf.sh the main variable is OP_NJOBMAX_LOCAL. For all steps that run over multiple files, OP_STEP_NPARALLEL can be set to OP_NJOBMAX_LOCAL. For single-file steps like CLUSTER, set OP_STEP_NTHREADS instead.

How best to configure a run in orthologer depends strongly on the number of files and the CPU architecture.

Configuration

The configuration is in the file orthologer_conf.sh where each variable is briefly described in comments. A few may need adjustment:

Variable              Description
OP_NJOBMAX_LOCAL      max number of jobs submitted in local mode (OP_STEP_NPARALLEL)
OP_NJOBMAX_BATCH      max number of jobs submitted in batch mode
SCHEDULER_LABEL       scheduler to be used, NONE or SLURM
ODB_RUN_MODE          labels for different preset parameter settings
TREE_ENABLED          run in tree mode - requires a newick tree
TREE_INPUT_FILE       the newick tree
ALIGNMENT_MATRIX      compute full (0) or half (1) matrix for homology
MAKEBRH_NALT          if set to > 1 it will allow for fuzzy BRHs
MAKEBRH_FSEPMAX       max separation relative to the best BRH [0..1] (fuzzy BRHs)
POSTPROC_LABELS       labels for various postprocess tools
OP_STEP_NPARALLEL[S]  step S : number of jobs launched in parallel
OP_STEP_NTHREADS[S]   step S : number of threads per job
OP_STEP_NMERGE[S]     step S : number of single jobs merged into one
OP_STEP_RUNMODE[S]    step S : run locally (LOCAL) or using scheduler (BATCH)
OP_STEP_SCHEDOPTS     step S : options for scheduler
OP_LABELA_START/END   pairwise steps: this selects the range of keys A
OP_LABELB_START/END   pairwise steps: this selects the range of keys B

Including fuzzy BRHs will add a maximum of MAKEBRH_NALT homologies that are nearly BRHs. The candidates must not differ from the best one by more than MAKEBRH_FSEPMAX.
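
To make the fuzzy BRH rule more concrete, here is a minimal sketch of such a filter. It is not the orthologer implementation. It assumes that higher scores are better, that the separation is computed as (best - score) / best, and that at most NALT candidates are added on top of the best hit. The function name select_fuzzy and the "<id> <score>" input layout are hypothetical.

```shell
# select_fuzzy <nalt> <fsepmax>   (hypothetical helper, illustration only)
# stdin : one candidate per line, "<id> <score>", all for the same query gene
# stdout: the best hit plus up to <nalt> candidates whose relative
#         separation from the best score is at most <fsepmax>
select_fuzzy() {
    nalt=$1
    fsepmax=$2
    LC_ALL=C sort -k2,2nr | awk -v nalt="$nalt" -v fsep="$fsepmax" '
        NR == 1 { best = $2; print $1, $2, "best"; next }
        kept < nalt && (best - $2) / best <= fsep {
            print $1, $2, "fuzzy"
            kept++
        }
    '
}

# toy scores, not real data
printf 'geneB 200.0\ngeneC 195.0\ngeneD 150.0\ngeneE 198.0\n' | select_fuzzy 2 0.05
```

With these toy scores, geneE and geneC lie within 5% of the best score and are kept as fuzzy candidates, while geneD is dropped.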

More help can be obtained using

./orthologer.sh -h                 # orthologer commands
./orthologer.sh -H <step>          # description of a given step
./orthologer.sh -H .               # extra help
./orthologer.sh -H .variables      # help on variables
./orthologer.sh -H .examples       # a few examples

Import fasta files

Fasta files are imported using:

./orthologer.sh manage -f fastafiles.txt

The file fastafiles.txt contains two columns, first a label and the second a file name:

+HAMA   data/myfasta.fs
+SIRI   data/urfasta.fs

The '+' sign before a label indicates that the sequence IDs should be relabeled using that label. Without the '+', the base of the filename is used for the internal sequence IDs. In general it is recommended to relabel to something simple. Only alphanumeric characters (case-insensitive, [a-z0-9]) and '_' are allowed. The sequence IDs in the fasta file will be remapped to

<taxid label>:<hexadecimal nr>

Note that the label TPA is not allowed as it has a special meaning in segmasker used for masking.
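
The label rules above can be checked before importing. The sketch below is not part of orthologer: valid_label and internal_id are hypothetical helper names, rejecting TPA regardless of case is an assumption, and the 6-digit zero-padded hexadecimal width is inferred from the examples in this document rather than from a documented rule.

```shell
# valid_label <label>: succeeds if the label only uses [A-Za-z0-9_]
# and is not the reserved label TPA (hypothetical helper)
valid_label() {
    case "$1" in
        ""|*[!A-Za-z0-9_]*) return 1 ;;   # empty, or has a disallowed character
    esac
    upper=$(printf '%s' "$1" | tr 'a-z' 'A-Z')
    [ "$upper" != "TPA" ]                 # assumption: reject TPA in any case
}

# internal_id <label> <n>: print "<label>:<hexadecimal nr>"
# (the width of 6 hex digits is an assumption based on the examples)
internal_id() {
    printf '%s:%06x\n' "$1" "$2"
}

for label in HAMA SIRI TPA my-label; do
    if valid_label "$label"; then echo "$label ok"; else echo "$label rejected"; fi
done
internal_id hobbit 21898
```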

Importing also creates a corresponding todo file at todo/fastafiles.todo. Ensure that all directories are created by running

./orthologer.sh -C -t todo/fastafiles.todo

Run orthologer

If everything is set up and PL_TODO is set in orthologer_conf.sh, the following will start a run:

./orthologer.sh -r ALL                        # run over all steps
./orthologer.sh -r MAKEBRH -t todo/my.todo    # run over one step using a 

Adding the option -d triggers a dry run: each step is printed without actually being run.

Orthologer tree mode

By setting TREE_ENABLED=1 the pipeline will run using a user-provided taxonomy tree. It can be defined in one of three ways:

  1. set TREE_INPUT_FILE to a newick file defining the tree
  2. set TREE_ROOT_CLADE to a clade NCBI taxid present in OrthoDB (e.g. 33208 for Metazoa)
  3. if neither is set, the given todo file is used to construct a tree file name (todo/<label>.nw)

Mapping

The orthologer can also be used to map new fasta files to an existing project or OrthoDB (orthomapper).

On an existing user project

Import the new fasta file as described above.

In order to map against an existing project, create a todo file with the new label as well as the other taxids you want to map against. Set PL_TODO to this file or add the option -t <todofile> to the orthologer.sh call.

Then run the following, assuming the taxid label of the imported fasta is mylabel and the source cluster is Cluster/source.og:

./orthologer.sh -r all -R mylabel -I Cluster/source.og

It will ensure that all pre-cluster steps will only involve mylabel. The cluster step will merge those BRHs with the source cluster.

The -R option takes a space-separated list of labels, so more than one label can be given. However, if there are two or more labels, BRHs will also be computed within the group of extra labels. This is not equivalent to mapping each extra label in a separate run.

On OrthoDB data

In order to run on OrthoDB data, there is the orthomapper tool. It runs on top of orthologer and thus shares the orthologer_conf.sh configuration file (see above). The easiest way to run it is via docker. With docker, further instructions can be obtained using:

docker run -u $(id -u) ezlabgva/orthologer:v3.0.5 orthomapper

The -u $(id -u) option tells docker to run the image as your current user.

In case of a local install, run the following from an empty directory:

<DIR_PIPELINE>/bin/setup_orthomapper.sh

In any case, the configuration file generated (orthomapper_conf.sh) may require some editing, in particular the BUSCO related parameters.

Variable          Description
DBI_DOWNLOAD      temporary storage for downloaded tar files, default is /tmp
MAP_ORTHODB_DATA  data location - where the downloaded OrthoDB files are installed, default ./data
BUSCO_NCPUS       nr of threads to be used by BUSCO
BUSCO_OFFLINE     set to 1 if BUSCO should run offline, otherwise 0
BUSCO_DATA        BUSCO data install directory

If BUSCO_OFFLINE=1 then it is assumed that BUSCO_DATA already contains the required files.

In addition there are a number of BUSCO-related variables. If a mapping is run without giving a target node (as an NCBI taxid), BUSCO is used in auto-lineage mode in order to establish a node.

With a local install of orthologer, BUSCO can either be run via a docker image (BUSCO_DOCKER_IMAGE) or using a local install (BUSCO_CMD).

If orthologer is run via docker, then BUSCO is already available in the docker image.

For more information, see the BUSCO documentation.

OrthoDB taxids are referred to below. They are identical to NCBI IDs but with a version appended, e.g. 9606_0.
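
As a small illustration of this naming scheme, an OrthoDB taxid can be split into its NCBI ID and version with plain shell parameter expansion. The helper name odb_taxid_parts is hypothetical. It also accepts an ID with a ":<gene>" suffix, as seen in the file format examples of this document.

```shell
# odb_taxid_parts <id>: print "<NCBI id> <version>" (hypothetical helper)
# Accepts "9606_0" as well as "63155_0:003794".
# An input without '_' is not handled specially in this sketch.
odb_taxid_parts() {
    taxid=${1%%:*}                      # drop an optional ":<gene>" suffix
    echo "${taxid%_*} ${taxid#*_}"      # part before and after the last '_'
}

odb_taxid_parts 9606_0
odb_taxid_parts 63155_0:003794
```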

Note that the commands are capitalized below. This is not required.

If running using a docker image, create a new directory (e.g. odb), copy your fasta file into that directory and run:

docker run -u $(id -u) -v $(pwd)/odb:/odbwork ezlabgva/orthologer:v3.0.5 orthomapper -c run -p <label> -f <fasta file> [-n <OrthoDB node>]

This will create the environment in ./odb and run the mapping on the given fasta file. You might want to edit the orthologer_conf.sh before running. If so, before running the mapper, call orthomapper with the create command:

docker run -u $(id -u) -v $(pwd)/odb:/odbwork ezlabgva/orthologer:v3.0.5 orthomapper -c create

Edit the conf file and then proceed with running the mapper.

The option -v tells docker to mount your directory ($(pwd)/odb) as /odbwork inside the docker image.

If orthologer is locally installed:

./orthomapper.sh MAP <label> <fasta file> [<OrthoDB node>]

If you do not know which node to use, you can get the full lineage using the ete3 tool:

ete3 ncbiquery --search <ncbi taxid> --info

See the ete3 documentation for installation instructions. Note that not all nodes are available in OrthoDB.

If no node is given, the tool will use BUSCO auto-lineage mode to find a node.

Note that all the orthomapper.sh calls below can be run from docker by prefixing them with

docker run -u $(id -u) -v $(pwd)/odb:/odbwork ezlabgva/orthologer:v3.0.5

A full list of all nodes available for download can be obtained by

./orthomapper.sh DOWNLOAD ?

Further help can be obtained using

./orthomapper.sh HELP

Example

Below is an example where a sample fasta file is mapped against an OrthoDB node.

# load a sample fasta file
curl https://data.orthodb.org/v11/download/orthologer/fasta/unknown.fs.gz -O
gunzip unknown.fs.gz

# create job label 'myproj' and map unknown.fs to node enterobacterales (91347)
./orthomapper.sh MAP myproj unknown.fs 91347

# create job label 'myproj' and map unknown.fs, use BUSCO to find node
./orthomapper.sh MAP myproj unknown.fs

The results are written to several files:

  1. <base>.annotations - contains annotation for each cluster with mapped genes (tab separated columns)
  2. <base>.hits - contains the mapped genes, OrthoDB cluster id's and metrics
  3. <base>.desc - contains annotation for each cluster
  4. <base>.odb - contains FULL clusters including OrthoDB genes
  5. <base> - final cluster result with internal ID's

where <base> is pipeline/Results/myproj.og
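
The list above translates into a quick check that a mapping has produced all expected files. This is a sketch run from the work directory: result_files is a hypothetical helper, and it assumes that <base> follows the same pipeline/Results/<label>.og pattern for any job label.

```shell
# result_files <job label>: print the expected result file paths
# (hypothetical helper; assumes <base> = pipeline/Results/<label>.og)
result_files() {
    base="pipeline/Results/$1.og"
    for ext in .annotations .hits .desc .odb ""; do
        echo "$base$ext"
    done
}

# report which of the expected files exist for job label 'myproj'
for f in $(result_files myproj); do
    if [ -e "$f" ]; then echo "found   $f"; else echo "missing $f"; fi
done
```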

All result files can also be obtained using

./orthomapper.sh RESULT myproj

This will produce a list of result files prefixed by an identifier.

The results can also be packaged into a tarball:

./orthomapper.sh PACKAGE myproj

This tar file will also contain additional timing statistics. Note that if the tar file already exists, it will do nothing, just print the file name.

Tests for Orthologer and Orthomapper

For tests, check the test directory at <DIR_PIPELINE>/test.

In docker, start with an empty directory ./odb and run the tests using:

docker run -u $(id -u) -v $(pwd)/odb:/odbwork ezlabgva/orthologer:v3.0.5 run_tests

Alternatively using a local install:

<DIR_PIPELINE>/test/run_test_all.sh

The output should be something like

copy orthologer test directory to /odbwork
-------------------
--- START TESTS ---
-------------------
proj_prot        run_test                       RunLogs/run_test_230222_1005.log ...                        OK       T(secs) = 37
proj_tree        run_test                       RunLogs/run_test_230222_1006.log ...                        OK       T(secs) = 109
proj_map         run_test                       RunLogs/run_test_230222_1008.log ...                        OK       T(secs) = 58
proj_map         run_test_with_busco            RunLogs/run_test_with_busco_230222_1009.log ...             OK       T(secs) = 400
proj_map         run_test_busco                 RunLogs/run_test_busco_230222_1015.log ...                  OK       T(secs) = 14
proj_map         run_test_busco_auto            RunLogs/run_test_busco_auto_230222_1016.log ...             OK       T(secs) = 78
ALL TESTS OK!

If a test fails, look into the corresponding log file. It may very well be that the reference differs slightly from the test result. Look at the end of the log file for more information. The first column is the test category and gives the location of the log files. For example:

$(pwd)/odb/proj_map/RunLogs/run_test_with_busco_230222_1009.log
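
When the summary is long, the log paths of the tests that did not pass can be pulled out with awk. This is a sketch under assumptions: failed_logs is a hypothetical helper, and it assumes a failing test prints something other than OK in the status column (the FAILED line in the sample is invented for illustration).

```shell
# failed_logs <work dir>: read the test summary on stdin and print the
# full log path of every test whose status column is not "OK"
# (hypothetical helper; the failure label itself is an assumption)
failed_logs() {
    awk -v wd="$1" '
        $2 ~ /^run_test/ && $3 ~ /^RunLogs\// {
            if ($5 != "OK") print wd "/" $1 "/" $3
        }
    '
}

# sample summary; the FAILED line is made up
failed_logs "$(pwd)/odb" <<'EOF'
proj_prot        run_test              RunLogs/run_test_230222_1005.log ...             OK       T(secs) = 37
proj_map         run_test_with_busco   RunLogs/run_test_with_busco_230222_1009.log ...  FAILED   T(secs) = 400
ALL TESTS OK!
EOF
```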

File formats

Files generated by orthologer have, when possible, a header and a footer. Those lines are prefixed by '#'.

A header has the format:

#----------------------------------------------------------------------
# Version : ORTHOLOGER-3.0.5 ; <hash>
# Title   : <command used>
# Date    : <date when run>
# System  : <system environment>
# User    : <user id>
# Job id  : <either /bin/bash if local or a scheduler id>
# Work dir: <dir>
#----------------------------------------------------------------------

and the footer

#TIME : <host> <time in seconds>
#END at <full date time>
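
Since header and footer lines all start with '#', they are easy to separate from the data. The sketch below uses two hypothetical helpers, strip_comments and header_field. The sample values for the hash, host and date are placeholders.

```shell
# strip_comments: pass through only the data lines (no leading '#')
strip_comments() {
    grep -v '^#'
}

# header_field <name>: print the value of the first "# <name> : <value>" line
# (<name> is used inside a regular expression, so keep it to plain words)
header_field() {
    sed -n "s/^# *$1 *: *//p" | head -n 1
}

sample() {
    cat <<'EOF'
#----------------------------------------------------------------------
# Version : ORTHOLOGER-3.0.5 ; abc123
# Title   : example command
#----------------------------------------------------------------------
hobbit:005586 318 0.87
hobbit:005587 98 1.00
#TIME : myhost 12
#END at some date
EOF
}

sample | header_field Version
sample | strip_comments
```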

Fasta files

Files

Sequences/*.fs_clean

Sequences/*.fs_masked

Sequences/*.fs_selected

A fasta record consists of two or more lines. The first line is an identifier line starting with '>' followed by the internal ID. The remaining lines (until the next '>') are the sequence.

Example:

>hobbit:00558a
MTYALFLLSVSLVMGFVGFSSKPSPIYGGLV
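
Because a sequence may span several lines, per-record processing has to accumulate lines until the next '>'. A minimal awk sketch that prints the length of each sequence follows. The helper name fasta_lengths is hypothetical, and stray whitespace or carriage returns in sequence lines would be counted.

```shell
# fasta_lengths: read fasta on stdin, print "<id> <sequence length>" per record
fasta_lengths() {
    awk '
        /^>/ {
            if (id != "") print id, len    # flush the previous record
            id = substr($1, 2)             # identifier without the ">"
            len = 0
            next
        }
        { len += length($0) }              # sequence lines are accumulated
        END { if (id != "") print id, len }
    '
}

fasta_lengths <<'EOF'
>hobbit:00558a
MTYALFLLSVSLVMGFVGFSSKPSPIYGGLV
EOF
```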

CD-HIT clusters

Files

Sequences/*.fs_selected.clstr

These files are generated in the SELECT step with the extension .fs_selected.clstr. Each record (cluster) starts with a '>Cluster X'. The following lines are genes that are very similar (97% by default). One of those is selected as representative (marked with a '*'). Example

>Cluster 9
0       7604aa, >hobbit:000047... at 1:7604:137:7732/97.62%
1       7732aa, >hobbit:00008d... *

The representatives are saved in .fs_selected and the remaining in .inpar_selected.
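
The cluster records can be flattened into one line per gene, which makes it easy to list representatives. This sketch relies only on the layout shown in the example above. The helper name clstr_members is hypothetical.

```shell
# clstr_members: read a .clstr file on stdin and print
# "<cluster nr> <gene id> <representative|member>" for each gene
clstr_members() {
    awk '
        NF == 0     { next }
        /^>Cluster/ { cl = $2; next }
        {
            id = $3                       # e.g. ">hobbit:000047..."
            sub(/^>/, "", id)
            sub(/\.\.\.$/, "", id)
            role = ($NF == "*") ? "representative" : "member"
            print cl, id, role
        }
    '
}

clstr_members <<'EOF'
>Cluster 9
0       7604aa, >hobbit:000047... at 1:7604:137:7732/97.62%
1       7732aa, >hobbit:00008d... *
EOF
```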

Fasta statistics

Files

Sequences/*.fs_selected.stats

Sequences/*.fs_selected.seqlen

The .stats files contain global statistics on the sequences.

Example:

#NTOT:20915
#MINLEN:31
#MAXLEN:35653
#AVELEN:570.458
#MEDLEN:422
#TOTLEN:1.19311e+07

The .seqlen files contain 3 columns, first is the sequence id, second its length and third the fraction non-masked.

Example:

hobbit:005586 318 0.87
hobbit:005587 98 1.00
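
Most of the global statistics can be recomputed from a .seqlen file, which is handy for a quick consistency check. The sketch below is not how orthologer produces the .stats file: seqlen_stats is a hypothetical helper, the median is left out, and the number formatting is a choice made here.

```shell
# seqlen_stats: read "<id> <length> <fraction non-masked>" lines on stdin
# and print count, min, max, mean and total length
seqlen_stats() {
    awk '
        /^#/ || NF < 2 { next }
        {
            len = $2 + 0
            if (n == 0 || len < min) min = len
            if (n == 0 || len > max) max = len
            n++
            tot += len
        }
        END {
            if (n > 0)
                printf "#NTOT:%d\n#MINLEN:%d\n#MAXLEN:%d\n#AVELEN:%g\n#TOTLEN:%g\n", n, min, max, tot / n, tot
        }
    '
}

seqlen_stats <<'EOF'
hobbit:005586 318 0.87
hobbit:005587 98 1.00
EOF
```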

Pairwise files (alignment, BRH)

Files:

PWC/*/*.align

PWC/*/*.brh

These files consist of space-separated columns with the following format

<id A> <id B> <score> <e-value> <PID> <startA> <lastA> <startB> <lastB> [in BRH files possibly two more columns related to fuzzy BRHs]

Example:

fish:000000 63155_0:003794 130.45 4.88e-141 65.8 16 420 1 391 0 0 
fish:000021 63155_0:002534 203.45 4.52e-88 85.6 1 174 1 174 0 0
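
With this column layout, pairs can be filtered by identity and e-value using awk. The sketch below assumes the column order given above. The helper name filter_pairs and the thresholds in the example are arbitrary choices for illustration.

```shell
# filter_pairs <min PID> <max e-value>: read .align or .brh lines on stdin
# and print "<id A> <id B> <score>" for pairs passing both thresholds
filter_pairs() {
    awk -v minpid="$1" -v maxe="$2" '
        !/^#/ && NF >= 9 && $5 + 0 >= minpid + 0 && $4 + 0 <= maxe + 0 {
            print $1, $2, $3
        }
    '
}

filter_pairs 80 1e-10 <<'EOF'
fish:000000 63155_0:003794 130.45 4.88e-141 65.8 16 420 1 391 0 0
fish:000021 63155_0:002534 203.45 4.52e-88 85.6 1 174 1 174 0 0
EOF
```

Only the second pair is printed here, since the first has a PID of 65.8, which is below the chosen threshold of 80.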

Inparalogs

Files

PWC/*/*.inpar

Each line contains paralogs with respect to a BRH (the first two columns):

<BRH A> <BRH B> <score> <inpar 1> <inpar 2> ...

Each <inpar> consists of a ';' separated block

<inpar ID>;<score>;<start A>;<last A>;<start inpar>;<last inpar>

Example:

63155_0:0055a4 fish:000163 211.77 63155_0:001168;213.862;23;923;1;902 
63155_0:003ac5 fish:0003a6 49.44 63155_0:00470a;54.4638;35;294;97;359 63155_0:0000bb;49.8592;35;294;139;401
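
Because the number of <inpar> blocks varies per line, it can help to expand the file into one row per inparalog. A minimal sketch based on the format above follows. The helper name expand_inpar is hypothetical.

```shell
# expand_inpar: read .inpar lines on stdin and print one row per inparalog:
# "<BRH A> <BRH B> <inpar ID> <inpar score>"
expand_inpar() {
    awk '
        !/^#/ && NF >= 4 {
            for (i = 4; i <= NF; i++) {
                n = split($i, f, ";")
                if (n == 6) print $1, $2, f[1], f[2]   # skip malformed blocks
            }
        }
    '
}

expand_inpar <<'EOF'
63155_0:0055a4 fish:000163 211.77 63155_0:001168;213.862;23;923;1;902
63155_0:003ac5 fish:0003a6 49.44 63155_0:00470a;54.4638;35;294;97;359 63155_0:0000bb;49.8592;35;294;139;401
EOF
```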

Cluster

Files

Cluster/*.og_raw

Cluster/*.og_inpar

Cluster/*.og

Each line is space-separated, one line per cluster element.

<cluster id> <gene id> <cluster type> <nr of edges> <align start> <align last> <PID> <score> <e-value>

The metrics at the end are evaluated as a mean over all edges.

cluster type encodes how the element was clustered. See table below for more information.

Note that for type 9, the given score is actually the PID (percentage identity) and the e-value is always set to zero.

type description
0 part of a triangle
1 part of a pair
2 element is attached to a triangle but not forming a triangle
3 part of a chain of pairs
4 like 2 but with significantly shorter sequence - split candidate
5 like 2 but with significantly longer sequence - merge candidate
7 (in)paralog wrt BRH
9 (in)paralog wrt a representative

Example:

84327 61853_0:001e1a 0 1 1 717 100 258.55 0
84327 30611_0:00449d 7 1 1 585 100 149.507 0
84327 30611_0:004493 2 1 1 494 100 125.51 7.16e-153
84327 9515_0:004f59 9 1 1 523 100 100 0
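
A cluster file can be summarised per cluster with awk, for instance to count how many elements are (in)paralogs (types 7 and 9 in the table above). This is a sketch based only on the column layout described here. The helper name cluster_sizes is hypothetical.

```shell
# cluster_sizes: read a cluster file on stdin and print
# "<cluster id> <nr of elements> <nr of (in)paralogs>" sorted by cluster id
cluster_sizes() {
    awk '
        !/^#/ && NF >= 3 {
            n[$1]++
            if ($3 == 7 || $3 == 9) p[$1]++
        }
        END { for (c in n) print c, n[c], p[c] + 0 }
    ' | sort -n
}

cluster_sizes <<'EOF'
84327 61853_0:001e1a 0 1 1 717 100 258.55 0
84327 30611_0:00449d 7 1 1 585 100 149.507 0
84327 30611_0:004493 2 1 1 494 100 125.51 7.16e-153
84327 9515_0:004f59 9 1 1 523 100 100 0
EOF
```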