gene orthology software behind OrthoDB
TIP
OrthoLoger v3.5.0 is the current stable version!
Docker container, Gitlab or bioconda
TIP
Orthologs are genes in different species that evolved from a common ancestral gene by speciation.
TIP
The LEMMI-style benchmarking shows its state-of-the-art performance.
TIP
# Cite us to help support the project
OrthoDB v11: annotation of orthologs in the widest sampling of organismal diversity. D Kuznetsov, F Tegenfeldt, M Manni, M Seppey, M Berkeley, EV Kriventseva, EM Zdobnov, NAR, Nov 2022, doi:10.1093/nar/gkac996. PMID:36350662
# License
Orthologer software is licensed under the GNU General Public License.
# Overview
OrthoLoger, the OrthoDB standalone pipeline for delineation of orthologs, is freely available. The package contains two main pipelines:
- ODB-mapper - map fasta file(s) to OrthoDB orthologs
- orthologer - find orthologs in a set of fasta files
The software is built using bash, python and C++. In addition, it uses a few external tools:
- homology: diamond or mmseqs
- low-complexity masking: segmasker from NCBI Blast
- similarity: cd-hit (https://github.com/weizhongli/cdhit)
Clustering is performed by BRHCLUS, an in-house clustering tool whose source code is published as part of the orthologer distribution.
# Getting OrthoLoger software
# Container
The latest docker image can be pulled from DockerHub:
docker pull ezlabgva/orthologer:v3.5.0
Alternatively build the image locally:
git clone https://gitlab.com/ezlab/orthologer_container.git
cd orthologer_container
docker build -t orthologer .
This method automatically includes BUSCO in the pipeline.
# Bioconda
Install the software locally using conda with the bioconda channel.
Instructions and details on the package are available on the bioconda package page.
# From source
# retrieve package
curl https://data.orthodb.org/current/download/software/orthologer_3.5.0.tgz -O
curl https://data.orthodb.org/current/download/software/orthologer_3.5.0.md5sum -O
# check md5sum
md5sum -c orthologer_3.5.0.md5sum
# if previous md5sum checks out OK, then unpack the package
tar -xzf orthologer_3.5.0.tgz
cd orthologer_3.5.0
# Install at PREFIX, by default '/usr/local'
# export PREFIX=<install root>
./install_pkg.sh
# In order to use the pipelines, make sure that PREFIX/bin is in the PATH.
export PATH=$PREFIX/bin:$PATH
# Running
If using a container, the generic command for running is as follows:
# Run orthologer 3.5.0 in directory ./odb
docker run -u $(id -u) -v $(pwd)/odb:/odbwork [-v <fasta dir>:/odbdata] -t orthologer:v3.5.0
If the work directory (./odb) does not exist, docker will create it using the root user. Hence, make sure the directory exists beforehand.
Note that mounting /odbdata is not required and is only used by the orthologer pipeline.
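The advice about creating the work directory first can be sketched as a small wrapper. This is only an illustration: `odb_docker_cmd` is a hypothetical helper, not part of OrthoLoger, and it assumes the image pulled in the Container section (`ezlabgva/orthologer:v3.5.0`). It prints the command instead of running it, so it can be inspected first.

```shell
# Hypothetical helper (not part of OrthoLoger): prepare the work directory
# and print the docker command rather than executing it.
odb_docker_cmd() {
    workdir="${1:-$(pwd)/odb}"
    # Create the directory as the current user, so docker does not create it as root.
    mkdir -p "$workdir" || return 1
    echo "docker run -u $(id -u) -v $workdir:/odbwork -t ezlabgva/orthologer:v3.5.0"
}

# Usage: odb_docker_cmd "$(pwd)/odb"
```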
# Mapping on OrthoDB
Whether in a container or locally, run using:
# with no args it will output default work directory etc
ODB-mapper
# get quickstart help
ODB-mapper HELP
# map your fasta file against a given OrthoDB node
ODB-mapper MAP mylabel example.fs 1489911
# get result file
ODB-mapper RESULT mylabel
The working directory is by default ./odbmapper under the current directory.
If another directory is required, set it using the following:
export ODBMAPPER_WORK=<your dir>
The mapper will create one directory per available OrthoDB version.
Currently the versions are v11 and v12.
The user may select a version using ODB-mapper_<version>.
With no extension, the current version is selected.
If no OrthoDB target node is given in the 'MAP' command above, ODB-mapper will use the BUSCO auto-lineage option to find an appropriate node. This requires busco to be available.
When running the container, busco is always available. Otherwise it has to be installed and ODB-mapper needs to know where to find it.
If busco is installed locally, make sure busco is in the PATH.
Otherwise, if docker is running, ODB-mapper will look for a BUSCO image. Pull it using:
docker pull ezlabgva/busco:v5.7.1_cv1
Note that this should match the variable BUSCO_DOCKER_IMAGE in the orthomapper_conf.sh file found in $ODBMAPPER_WORK/v12/pipeline (assuming version v12 is being used).
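Before running a mapping without a target node, it may help to check which of the two BUSCO routes described above is usable on the machine. The sketch below is a hypothetical check (`busco_source` is not an OrthoLoger command) and only mirrors the order given in the text; it does not reproduce how ODB-mapper itself performs the detection.

```shell
# Hypothetical check, not an OrthoLoger command: report how BUSCO could be reached.
busco_source() {
    if command -v busco >/dev/null 2>&1; then
        echo "local"     # a busco executable is in the PATH
    elif command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
        echo "docker"    # docker daemon is reachable; a BUSCO image can be pulled
    else
        echo "none"      # neither route is available
    fi
}

# Usage: busco_source
```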
# Orthologer
Before the first run, the pipeline environment needs to be created.
# setup environment
orthologer -c create
By default it will create the environment in the current directory.
The process may produce a few warnings related to tool directories. This should not happen when running through a container or a conda install.
If errors occur, they are usually related to tool directories.
Examine the config file (orthologer_conf.sh) and look for e.g. DIR_BRHCLUS.
# Import fasta
To find orthology in a set of fasta files, use the command orthologer.
By default orthologer will look for files in /odbdata (note that it is at the root).
Otherwise the fasta source can be given using the -d option.
orthologer -c import [-d <fasta source>]
The fasta source can be given as
- a directory
- a file listing one fasta file per line
If a directory is given (default), ALL files in that directory are used.
Note that if running in a container, the file names given must be visible inside the container, either by mounting or relative to /odbwork.
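Since a directory source picks up ALL files, a list file is the way to restrict the import to a subset. A minimal sketch of building such a list follows; `make_fasta_list` is a hypothetical helper, and the `.fs` extension is an assumption taken from the example file names on this page.

```shell
# Hypothetical helper: write one fasta path per line for every *.fs file in a directory.
make_fasta_list() {
    srcdir="$1"
    listfile="$2"
    for f in "$srcdir"/*.fs; do
        # Skip the unexpanded pattern when no file matches.
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
        fi
    done > "$listfile"
}

# Usage:
#   make_fasta_list /path/to/fasta fasta_list.txt
#   orthologer -c import -d fasta_list.txt
```

When running in a container, the paths written to the list must be the ones visible inside the container.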
# Run the pipeline
orthologer -c run
The results will be stored in ./Results.
# Issues
For any issues related to orthologer, see the issues board.
# Advanced
WARNING
The following describes, to some extent, what is going on behind the scenes. It is only needed for larger orthology runs.
The orthologer and ODB-mapper tools can be run and controlled directly using scripts in the working directory:
- orthologer.sh
- orthomapper.sh
In addition there are two config files:
- orthologer_conf.sh
- orthomapper_conf.sh - contains mainly BUSCO-related options.
The environment can also be created by entering an empty directory and issuing
$DIR_ORTHOLOGER/bin/setup.sh
This will create the full environment. Config files may need to be updated.
# Configuration
Some of the variables in the orthologer config file:
Variable | Description |
---|---|
OP_NJOBMAX_LOCAL | max number of jobs submitted in local mode (OP_STEP_NPARALLEL ) |
OP_NJOBMAX_BATCH | max number of jobs submitted in batch mode |
SCHEDULER_LABEL | scheduler to be used, NONE or SLURM |
ODB_RUN_MODE | labels for different preset parameter settings |
TREE_ENABLED | run in tree mode - requires a newick tree |
TREE_INPUT_FILE | the newick tree |
POSTPROC_LABELS | labels for various postprocess tools |
OP_STEP_NPARALLEL[S] | step S : number of jobs launched in parallel |
OP_STEP_NTHREADS[S] | step S : number of threads per job |
OP_STEP_NMERGE[S] | step S : number of single jobs merged into one |
OP_STEP_RUNMODE[S] | step S : run locally (LOCAL) or using scheduler (BATCH) |
OP_STEP_SCHEDOPTS | step S : options for scheduler |
OP_LABELA_START/END | pairwise steps: this selects the range of keys A |
OP_LABELB_START/END | pairwise steps: this selects the range of keys B |
More help can be obtained using
# orthologer commands
./orthologer.sh -h
# description of a given step
./orthologer.sh -H <step>
# extra help
./orthologer.sh -H .
# help on variables
./orthologer.sh -H .variables
# a few examples
./orthologer.sh -H .examples
# Import fasta files
Fasta files are imported using:
./orthologer.sh manage -f fastafiles.txt
The file fastafiles.txt contains two columns, first a label and second a file name:
+HAMA data/myfasta.fs
+SIRI data/urfasta.fs
The '+' sign before the labels indicates that the sequence id's should be relabeled using that label. If not, it will use the base of the filename for the internal sequence id's. In general it is recommended to relabel to something simple. Only case-insensitive alphanumerical characters are allowed [a-z,0-9] and '_'. The sequence id's in the fasta file will be remapped to
<taxid label>:<hexadecimal nr>
DANGER
Note that the label TPA is not allowed as it has a special meaning in segmasker, which is used for masking.
When importing, it will also create a corresponding todo file at todo/fastafiles.todo.
Ensure that all directories are created by
./orthologer.sh -C -t todo/fastafiles.todo
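The label rules for fastafiles.txt described above can be checked before importing. The sketch below is a hypothetical helper (`fasta_entry` is not part of orthologer): it accepts only letters, digits and '_', rejects TPA, and prints one line with the leading '+'. Rejecting TPA in any letter case is an assumption made here to stay on the safe side.

```shell
# Hypothetical helper: validate a label and print one fastafiles.txt line.
fasta_entry() {
    label="$1"
    file="$2"
    case "$label" in
        ''|*[!A-Za-z0-9_]*)
            echo "invalid label: $label" >&2
            return 1 ;;
    esac
    # TPA has a special meaning in segmasker; any letter case is rejected (assumption).
    case "$label" in
        [Tt][Pp][Aa])
            echo "label TPA is not allowed" >&2
            return 1 ;;
    esac
    # The leading '+' requests relabeling of the sequence ids with the label.
    printf '+%s %s\n' "$label" "$file"
}

# Usage: fasta_entry HAMA data/myfasta.fs >> fastafiles.txt
```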
# Run orthologer
If everything is set up and PL_TODO is set in orthologer_conf.sh, the following will start a run:
# dry run over all steps
./orthologer.sh -dr ALL
# run over all steps
./orthologer.sh -r ALL
# run over one step using a given todo file
./orthologer.sh -r MAKEBRH -t todo/my.todo
# Orthologer tree mode
By setting TREE_ENABLED=1 the pipeline will run using a user-provided taxonomy tree.
It can be defined in one of three ways:
- set TREE_INPUT_FILE to a newick file defining the tree
- set TREE_ROOT_CLADE to a clade NCBI taxid present in OrthoDB (e.g. 33208 for metazoa)
- if neither of the above is set, the given todo file is used to construct a tree file name (todo/<label>.nw)
In tree mode the pipeline will first prepare, for each clade, a todo file based on the tree. Each todo file will contain all leaf organisms below the clade. It will then proceed to run on the root node up to MAKEINPARSEL. Clustering at a given clade is performed using its leaf clades as input, using BRHs between the clades.
# Mapping
The following section describes how to use orthologer to map a set of user fasta files to an existing run.
# On an existing user project
Import the new fasta file(s) as described above.
Make sure there is a todo file in ./todo with the labels of the imported file(s).
Assume that the existing project is named run0 and that the files to be mapped are defined by todo/map.todo. Start the mapping with
./orthologer.sh -r all -t "todo/map.todo:todo/run0.todo" -I "Cluster/run0.og"
The final result will be in ./Results/map.*.
# On OrthoDB data
Note that the recommended way of mapping onto OrthoDB is by using ODB-mapper as described in the section "Mapping on OrthoDB" above.
The following provides more details for the advanced user.
In order to run on OrthoDB data, there is an orthomapper.sh script generated in the project directory.
It runs on top of orthologer and thus shares the orthologer_conf.sh configuration file (see above).
The orthomapper configuration file (orthomapper_conf.sh) may require some editing, in particular the BUSCO-related parameters.
Variable | Description |
---|---|
DBI_DOWNLOAD | temporary storage for downloaded tar files, default is /tmp |
MAP_ORTHODB_DATA | data location - where the downloaded OrthoDB files are installed, default ./data |
BUSCO_NCPUS | nr of threads to be used by BUSCO |
BUSCO_OFFLINE | set to 1 if BUSCO should run offline - if so it will look in BUSCO_DATA for files |
BUSCO_DATA | BUSCO data install directory |
If BUSCO_OFFLINE=1 then it is assumed that BUSCO_DATA already contains the required files.
In addition there are a number of BUSCO-related variables. If a mapping is run without giving a target node (as an NCBI taxid), BUSCO is used in auto-lineage mode in order to establish a node.
With a local install of orthologer, BUSCO can either be run via a docker image (BUSCO_DOCKER_IMAGE) or using a local install (BUSCO_CMD).
If orthologer is run via docker, then BUSCO is already available in the docker image.
For more information on BUSCO, see the BUSCO documentation.
TIP
OrthoDB taxids are referred to below. They are identical to NCBI ids but with a version appended, e.g. 9606_0.
TIP
Note that the commands are capitalized below. This is not required.
If orthologer is locally installed and set up (see above):
./orthomapper.sh MAP <label> <fasta file> [<OrthoDB node>]
If you do not know which node to use, you can get the full lineage using the ete3 tool:
ete3 ncbiquery --search <ncbi taxid> --info
See the ete3 documentation for instructions on how to install ete3. Note that not all nodes are available in OrthoDB.
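Because OrthoDB taxids carry a version suffix (e.g. 9606_0) while the lineage lookup works on NCBI taxids, a small conversion can be handy when moving between the two. The helper below is a hypothetical sketch that simply drops everything from the first underscore onwards.

```shell
# Hypothetical helper: drop the version suffix of an OrthoDB taxid (9606_0 -> 9606).
# A plain NCBI taxid without suffix is returned unchanged.
odb_to_ncbi_taxid() {
    printf '%s\n' "${1%%_*}"
}

# Usage: ete3 ncbiquery --search "$(odb_to_ncbi_taxid 9606_0)" --info
```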
If no node is given, the tool will use BUSCO auto-lineage mode to find a node.
Note that all the orthomapper.sh calls below can be run from docker by prefixing them with
docker run -u $(id -u) -v $(pwd)/odb:/odbwork ezlabgva/orthologer:v3.5.0
A full list of all nodes available for download can be obtained by
./orthomapper.sh DOWNLOAD ?
Further help can be obtained using
./orthomapper.sh HELP
# Example
Below is an example where a sample fasta file is mapped against an OrthoDB node, either given explicitly (here Enterobacterales, taxid 91347) or selected by BUSCO.
# load a sample fasta file
curl https://data.orthodb.org/v11/download/orthologer/fasta/unknown.fs.gz -O
gunzip unknown.fs.gz
# create job label 'myproj' and map unknown.fs to node enterobacterales (91347)
./orthomapper.sh MAP myproj unknown.fs 91347
# create job label 'myproj' and map unknown.fs, use `BUSCO` to find node
./orthomapper.sh MAP myproj unknown.fs
The result should be in multiple files:
- <base>.annotations - contains annotation for each cluster with mapped genes (tab-separated columns)
- <base>.hits - contains the mapped genes, OrthoDB cluster ids and metrics
- <base>.desc - contains annotation for each cluster
- <base>.odb - contains FULL clusters including OrthoDB genes
- <base> - final cluster result with internal IDs
where <base> is pipeline/Results/myproj.og
All result files can also be obtained using
./orthomapper.sh RESULT myproj
This will produce a list of result files prefixed by an identifier.
The results can also be packaged into a tar ball
./orthomapper.sh PACKAGE myproj
This tar file will also contain additional timing statistics. Note that if the tar file already exists, it will do nothing, just print the file name.
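To see at a glance which of the result files listed above were produced, a check along the following lines can be used. `list_results` is a hypothetical helper; the default directory `pipeline/Results` and the file suffixes are taken from the list above.

```shell
# Hypothetical helper: report which of the documented result files exist for a label.
# $1: job label (e.g. myproj); $2: results directory (default: pipeline/Results)
list_results() {
    base="${2:-pipeline/Results}/$1.og"
    for f in "$base.annotations" "$base.hits" "$base.desc" "$base.odb" "$base"; do
        if [ -f "$f" ]; then
            echo "found   $f"
        else
            echo "missing $f"
        fi
    done
}

# Usage: list_results myproj
```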
# Tests for Orthologer and Orthomapper
For tests, check the test directory at <DIR_PIPELINE>/test.
In docker, run the tests using (start with an empty directory ./odb):
docker run -u $(id -u) -v $(pwd)/odb:/odbwork ezlabgva/orthologer:v3.5.0 run_tests
Alternatively using a local install:
<DIR_PIPELINE>/test/run_test_all.sh
The output should be something like
copy orthologer test directory to /odbwork
-------------------
--- START TESTS ---
-------------------
run_test proj_prot/RunLogs/run_test_231107_0914.log ... OK T(secs) = 31
run_test_localmap proj_prot/RunLogs/run_test_localmap_231107_0915.log ... OK T(secs) = 33
run_test proj_tree/RunLogs/run_test_231107_0915.log ... OK T(secs) = 87
run_test proj_map/RunLogs/run_test_231107_0917.log ... OK T(secs) = 46
run_test_with_busco proj_map/RunLogs/run_test_with_busco_231107_0918.log ... OK T(secs) = 331
run_test_busco proj_map/RunLogs/run_test_busco_231107_0923.log ... OK T(secs) = 24
run_test_busco_auto proj_map/RunLogs/run_test_busco_auto_231107_0924.log ... OK T(secs) = 97
run_test_with_subset proj_map/RunLogs/run_test_with_subset_231107_0925.log ... OK T(secs) = 149
ALL TESTS OK!
Note that if a test fails, look into the corresponding log file. It may very well be that the reference differs slightly from the test result. Look at the end of the log file for more information. The first column is the test category and gives the location of the log files. E.g:
$(pwd)/odb/proj_map/RunLogs/run_test_with_busco_230222_1009.log
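When many tests are run, the log files worth opening can be pulled out of the test output. The sketch below is hypothetical: the source only shows passing lines, so it assumes that a failing line has the same layout with something other than OK in the status column (the word FAILED in the usage note is made up for illustration).

```shell
# Hypothetical helper: read test output on stdin and print the log paths of
# tests whose status column is not OK.
# $1: root directory the relative log paths refer to, e.g. "$(pwd)/odb"
failed_test_logs() {
    awk -v root="$1" '
        # Test lines look like: <test> <log path> ... <status> T(secs) = <n>
        $3 == "..." && $2 ~ /\.log$/ && $4 != "OK" { print root "/" $2 }
    '
}

# Usage: <DIR_PIPELINE>/test/run_test_all.sh | failed_test_logs "$(pwd)"
```

A line such as `run_test proj_map/RunLogs/run_test_231107_0917.log ... FAILED T(secs) = 46` would then be reported with its full log path.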
# File formats
Files generated by orthologer have, when possible, a header and a footer. Those lines are prefixed by '#'.
A header has the format:
#----------------------------------------------------------------------
# Version : ORTHOLOGER-3.5.0 ; <hash>
# Title : <command used>
# Date : <date when run>
# System : <system environment>
# User : <user id>
# Job id : <either /bin/bash if local or a scheduler id>
# Work dir: <dir>
#----------------------------------------------------------------------
and the footer
#TIME : <host> <time in seconds>
#END at <full date time>
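Since header and footer lines are all prefixed by '#', they are easy to separate from the data when post-processing. The two helpers below are a hypothetical sketch: one prints only the data lines, the other extracts the version string from the header.

```shell
# Hypothetical helpers for files with a '#'-prefixed header and footer.

# Print only the data lines (everything not starting with '#').
data_lines() {
    grep -v '^#' "$@"
}

# Print the text after "# Version :" from the header, if present.
file_version() {
    sed -n 's/^# Version *: *//p' "$1" | head -n 1
}

# Usage: data_lines Cluster/myrun.og | wc -l
```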
# Fasta files
Files
Sequences/*.fs_clean
Sequences/*.fs_masked
Sequences/*.fs_selected
A fasta record consists of two or more lines. The first line is an identifier line starting with '>' followed by the internal id. The remaining lines (until the next '>') are the sequence.
Example:
>hobbit:00558a
MTYALFLLSVSLVMGFVGFSSKPSPIYGGLV
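The record layout above (one '>' line, then one or more sequence lines) can be read with a few lines of awk. The following is a minimal sketch with a hypothetical helper name, `fasta_lengths`, that prints each internal id with its sequence length.

```shell
# Hypothetical helper: print "<id> <sequence length>" for every fasta record.
# Reads the given files, or stdin when no file is given.
fasta_lengths() {
    awk '
        /^#/ { next }                        # skip any header or footer lines
        /^>/ {
            if (id != "") print id, len      # finish the previous record
            id = substr($1, 2)               # id without the leading ">"
            len = 0
            next
        }
        { len += length($0) }                # sequence may span several lines
        END { if (id != "") print id, len }
    ' "$@"
}

# Usage: fasta_lengths Sequences/mylabel.fs_clean
```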
# CD-hit clusters
Files
Sequences/*.fs_selected.clstr
These files are generated in the SELECT step with the extension .fs_selected.clstr. Each record (cluster) starts with a '>Cluster X'. The following lines are genes that are very similar (97% by default). One of those is selected as representative (marked with a '*'). Example
>Cluster 9
0 7604aa, >hobbit:000047... at 1:7604:137:7732/97.62%
1 7732aa, >hobbit:00008d... *
The representatives are saved in .fs_selected and the remaining in .inpar_selected.
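To list which gene was chosen as representative of each cluster, the '*' marker can be used directly. This is a sketch with a hypothetical helper, `clstr_representatives`, based only on the layout shown in the example above.

```shell
# Hypothetical helper: print "<cluster number> <representative id>" from a .clstr file.
clstr_representatives() {
    awk '
        /^>Cluster/ { cluster = $2; next }
        $NF == "*" {
            id = $3                    # third field looks like ">hobbit:00008d..."
            sub(/^>/, "", id)          # drop the leading ">"
            sub(/\.\.\.$/, "", id)     # drop the trailing "..."
            print cluster, id
        }
    ' "$@"
}

# Usage: clstr_representatives Sequences/mylabel.fs_selected.clstr
```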
# Fasta statistics
Files
Sequences/*.fs_selected.stats
Sequences/*.fs_selected.seqlen
The .stats files contain global statistics on the sequences.
Example:
#NTOT:20915
#MINLEN:31
#MAXLEN:35653
#AVELEN:570.458
#MEDLEN:422
#TOTLEN:1.19311e+07
The .seqlen files contain 3 columns, first is the sequence id, second its length and third the fraction non-masked.
Example:
hobbit:005586 318 0.87
hobbit:005587 98 1.00
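The third column of a .seqlen file makes it straightforward to spot heavily masked sequences. The helper below is a hypothetical sketch; the threshold is chosen by the user and is not an orthologer parameter.

```shell
# Hypothetical helper: print .seqlen rows whose non-masked fraction is below a threshold.
# $1: threshold (e.g. 0.9); remaining arguments: .seqlen files (stdin if none)
low_unmasked() {
    thr="$1"
    shift
    awk -v thr="$thr" '
        /^#/ { next }
        NF >= 3 && $3 + 0 < thr + 0 { print $1, $2, $3 }
    ' "$@"
}

# Usage: low_unmasked 0.9 Sequences/mylabel.fs_selected.seqlen
```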
# Pairwise files (alignment, BRH)
Files:
PWC/*/*.align
PWC/*/*.brh
These files consist of space-separated columns with the following format
<id A> <id B> <score> <e-value> <PID> <startA> <lastA> <startB> <lastB> [in BRH possibly two more columns related to fuzzy BRHs]
Example:
fish:000000 63155_0:003794 130.45 4.88e-141 65.8 16 420 1 391 0 0
fish:000021 63155_0:002534 203.45 4.52e-88 85.6 1 174 1 174 0 0
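The column layout above can be filtered with awk, for instance to keep only pairs above a given percentage identity. In this hypothetical sketch, the aligned length on A is computed as lastA - startA + 1, which assumes inclusive coordinates.

```shell
# Hypothetical helper: print "<id A> <id B> <PID> <aligned length on A>" for
# pairs with PID at or above a minimum.
# $1: minimum PID; remaining arguments: .align or .brh files (stdin if none)
pairs_min_pid() {
    minpid="$1"
    shift
    awk -v minpid="$minpid" '
        /^#/ { next }
        # columns: idA idB score e-value PID startA lastA startB lastB ...
        NF >= 9 && $5 + 0 >= minpid + 0 { print $1, $2, $5, $7 - $6 + 1 }
    ' "$@"
}

# Usage: pairs_min_pid 80 PWC/*/*.brh
```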
# Inparalogs
Files
PWC/*/*.inpar
Each line contains paralogs with respect to a BRH (the first two columns):
<BRH A> <BRH B> <score> <inpar 1> <inpar 2> ...
Each <inpar> consists of a ';' separated block
<inpar ID>;<score>;<start A>;<last A>;<start inpar>;<last inpar>
Example:
63155_0:0055a4 fish:000163 211.77 63155_0:001168;213.862;23;923;1;902
63155_0:003ac5 fish:0003a6 49.44 63155_0:00470a;54.4638;35;294;97;359 63155_0:0000bb;49.8592;35;294;139;401
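Because a line can carry any number of ';'-separated inparalog blocks, it is often easier to work with one inparalog per line. The following hypothetical helper flattens the format accordingly.

```shell
# Hypothetical helper: print one line per inparalog:
# "<BRH A> <BRH B> <inpar ID> <inpar score>"
inpar_long() {
    awk '
        /^#/ { next }
        {
            # Fields 1-3 are the BRH pair and its score; the rest are inpar blocks.
            for (i = 4; i <= NF; i++) {
                n = split($i, part, ";")
                if (n == 6) print $1, $2, part[1], part[2]
            }
        }
    ' "$@"
}

# Usage: inpar_long PWC/*/*.inpar
```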
# Cluster
Files
Cluster/*.og_raw
Cluster/*.og_inpar
Cluster/*.og
Each line is space-separated, one line per cluster element.
<cluster id> <gene id> <cluster type> <nr of edges> <align start> <align last> <PID> <score> <e-value>
The metrics at the end are evaluated as a mean over all edges.
cluster type encodes how the element was clustered. See table below for more information.
Note that for type 9, the given score is actually the PID (percentage identity) and the e-value is always set to zero.
type | description |
---|---|
0 | part of a triangle |
1 | part of a pair |
2 | element is attached to a triangle but not forming a triangle |
3 | part of a chain of pairs |
4 | like 2 but with significantly shorter sequence - split candidate |
5 | like 2 but with significantly longer sequence - merge candidate |
7 | (in)paralog wrt BRH |
9 | (in)paralog wrt a representative |
Example:
84327 61853_0:001e1a 0 1 1 717 100 258.55 0
84327 30611_0:00449d 7 1 1 585 100 149.507 0
84327 30611_0:004493 2 1 1 494 100 125.51 7.16e-153
84327 9515_0:004f59 9 1 1 523 100 100 0
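A quick way to get an overview of a cluster file is to count, per cluster, how many elements fall into each cluster type. The sketch below uses a hypothetical helper name and assumes numeric cluster ids, as in the example above.

```shell
# Hypothetical helper: print "<cluster id> <cluster type> <number of elements>".
cluster_type_counts() {
    awk '
        /^#/ { next }
        # column 1 is the cluster id, column 3 the cluster type
        NF >= 3 { count[$1 " " $3]++ }
        END { for (k in count) print k, count[k] }
    ' "$@" | sort -k1,1n -k2,2n
}

# Usage: cluster_type_counts Cluster/myrun.og
```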