
gene orthology software behind OrthoDB
OrthoLoger v2.8.3 is the current stable version!
Available as a Docker container and on GitLab.
Orthologs are genes in different species that evolved from a common ancestral gene by speciation.
LEMMI-style benchmarking shows its state-of-the-art performance.
Cite us to help support the project:
OrthoDB v11: annotation of orthologs in the widest sampling of organismal diversity. D Kuznetsov, F Tegenfeldt, M Manni, M Seppey, M Berkeley, EV Kriventseva, EM Zdobnov. Nucleic Acids Research, Nov 2022. doi:10.1093/nar/gkac996, PMID:36350662
Getting OrthoLoger software
OrthoLoger, the OrthoDB standalone pipeline for delineation of orthologs, is freely available
- as a ready-to-run docker image (see the concrete sketch after this list)
docker pull ezlabgva/orthologer:v2.8.3
docker run -u $(id -u) -v ${where}:/odbwork ezlabgva/orthologer:v2.8.3 setup_odb.sh
docker run -u $(id -u) -v ${where}:/odbwork ezlabgva/orthologer:v2.8.3 ./orthologer.sh "command" "options"
- or you can build the docker image yourself
git clone https://gitlab.com/ezlab/orthologer_container.git
cd orthologer_container
docker build ./ -t orthologer .
- or build a local instance of orthologer manually
curl https://data.orthodb.org/v11/download/software/orthologer_2.8.3.tgz -O
curl https://data.orthodb.org/v11/download/software/orthologer_2.8.3.md5sum -O
# check md5sum
md5sum -c orthologer_2.8.3.md5sum
# if previous md5sum checks out OK, then unpack the package
tar -xzf orthologer_2.8.3.tgz
and follow the instructions in orthologer_2.8.3/README.
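As a concrete first-run sketch for the docker option above (the work directory path is an arbitrary choice):
# pick a host work directory and mount it into the container as /odbwork
where=$HOME/odbwork
mkdir -p ${where}
docker pull ezlabgva/orthologer:v2.8.3
docker run -u $(id -u) -v ${where}:/odbwork ezlabgva/orthologer:v2.8.3 setup_odb.sh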
Issues
For any issues related to OrthoLoger, see the issues board.
Setting up a project
Assume DIR_PIPELINE is the install location of OrthoLoger.
If the orthologer package is installed locally, create the basic setup using the following procedure:
- Create a new empty directory
- From within this new directory run
$DIR_PIPELINE/bin/setup.sh
and answer the questions - in general, the default responses will suffice
- Run the generated setup script
- It will give instructions for how to proceed and report whether anything is missing
If the orthologer_container repo is installed, use the script docker_run.sh to set up and run the pipeline.
In both cases, the configuration file common.sh is likely to need editing.
Configuration of common.sh
The configuration is in the file common.sh
where each variable is briefly described in comments.
A few may need adjustment:
Variable | Description |
---|---|
OP_NJOBMAX_LOCAL | max number of jobs submitted in local mode |
OP_NJOBMAX_BATCH | max number of jobs submitted in batch mode |
SCHEDULER_LABEL | scheduler to be used, NONE or SLURM |
ODB_RUN_MODE | labels for different preset parameter settings |
TREE_ENABLED | run in tree mode - requires a newick tree |
TREE_INPUT_FILE | the newick tree |
ALIGNMENT_MATRIX | compute full (0) or half (1) matrix for homology |
MAKEBRH_NALT | if set to > 1 it will allow for fuzzy BRHs |
MAKEBRH_FSEPMAX | max separation relative to the best BRH [0..1] (fuzzy BRHs) |
POSTPROC_LABELS | labels for various postprocess tools |
OP_STEP_NPARALLEL[S] | step S : number of jobs launched in parallel |
OP_STEP_NTHREADS[S] | step S : number of threads per job |
OP_STEP_NMERGE[S] | step S : number of single jobs merged into one |
OP_STEP_RUNMODE[S] | step S : run locally (LOCAL) or using scheduler (BATCH) |
OP_STEP_SCHEDOPTS | step S : options for scheduler |
OP_LABELA_START/END | pairwise steps: this selects the range of keys A |
OP_LABELB_START/END | pairwise steps: this selects the range of keys B |
Including fuzzy BRHs will add a maximum of MAKEBRH_NALT homologies that are nearly BRHs. The candidates must not differ from the best one by more than MAKEBRH_FSEPMAX.
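As an illustrative sketch, enabling fuzzy BRHs in common.sh might look like this (the values are examples, not recommendations):
# allow up to 2 alternative near-best hits per BRH
MAKEBRH_NALT=2
# accept candidates whose separation from the best BRH is at most 10%
MAKEBRH_FSEPMAX=0.1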
More help can be obtained using
./orthologer.sh -h # orthologer commands
./orthologer.sh -H <step> # description of a given step
./orthologer.sh -H . # extra help
./orthologer.sh -H .variables # help on variables
./orthologer.sh -H .examples # a few examples
Import fasta files
Fasta files are imported using:
./orthologer.sh manage -f fastafiles.txt
The file fastafiles.txt contains two columns, the first a label and the second a file name:
+HAMA data/myfasta.fs
+SIRI data/urfasta.fs
The '+' sign before a label indicates that the sequence IDs should be relabeled using that label; otherwise the base of the filename is used for the internal sequence IDs. In general it is recommended to relabel to something simple. Only alphanumeric characters [a-z, 0-9] (case-insensitive) and '_' are allowed. The sequence IDs in the fasta file will be remapped to
<taxid label>:<hexadecimal nr>
Note that the label TPA is not allowed as it has a special meaning in segmasker, which is used for masking.
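For example, assuming a file imported with the label +HAMA, a header would be remapped along these lines (the original header and the hexadecimal number below are made up for illustration):
# original fasta header
>sp|Q95WW4| hypothetical protein
# internal id after import
>hama:00000a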
When importing, it will also create a corresponding todo file at todo/fastafiles.todo. Ensure that all directories are created by running
./orthologer.sh -C -t todo/fastafiles.todo
Run a project
If everything is setup and PL_TODO is set in common.sh, the following will start a run:
./orthologer.sh -r ALL # run over all steps
./orthologer.sh -r MAKEBRH -t todo/my.todo # run over one step using a given todo file
Adding the option -d triggers a dry run: each step is printed without actually being run.
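For example, to preview all steps without executing anything:
./orthologer.sh -r ALL -d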
Tree mode
By setting TREE_ENABLED=1 the pipeline will run using a user-provided taxonomy tree. It can be defined in one of three ways (a sketch of the first option follows this list):
- set TREE_INPUT_FILE to a newick file defining the tree
- set TREE_ROOT_CLADE to a clade NCBI taxid present in OrthoDB (e.g. 33208 for Metazoa)
- none of the above: the given todo file is used to construct a tree file name (todo/<label>.nw)
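A minimal sketch of the first option, using the labels from the import example above (the tree file name and the leaf naming are assumptions, not a documented format):
# in common.sh
TREE_ENABLED=1
TREE_INPUT_FILE=todo/mytree.nw
# todo/mytree.nw - a plain newick tree over the imported labels
(HAMA,SIRI);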
Mapping
The orthologer can also be used to map new fasta files to an existing project. Import the new fasta file as described above.
On an existing user project
In order to map against an existing project you need to create a todo file with the new label as well as the other taxids you want to map against.
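A hypothetical todo file for this case might look like the following (the one-label-per-line layout is an assumption; compare with the todo file generated at import for the actual format):
# todo/my.todo
mylabel
63155_0
8128_0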
Run, assuming the taxid label of the imported fasta is mylabel and the source cluster is Cluster/source.og:
./orthologer.sh -r ALL -R mylabel -I Cluster/source.og
It will ensure that all pre-cluster steps only involve mylabel. The cluster step will merge those BRHs with the source cluster.
The -R option takes a space-separated list of labels, so more than one label can be given. However, if there are two or more labels, BRHs will also be computed within the group of extra labels. This is not equivalent to mapping each extra label in a separate run.
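For instance, to map two new labels at once (quoting the space-separated list so the shell passes it as a single argument):
./orthologer.sh -r ALL -R "mylabel otherlabel" -I Cluster/source.og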
On OrthoDB data
In order to run on OrthoDB data, from a new directory run
<DIR_PIPELINE>/bin/setup_mapping.sh
Defaults are OK for small tests, but it is recommended to change the storage locations defined in mapping_conf.sh.
Variable | Description |
---|---|
DBI_DOWNLOAD | temporary storage for downloaded tar files, default is /tmp |
MAP_USERLOC | user location - where pipelines are run per user |
MAP_ORTHODB_DATA | data location - where the downloaded OrthoDB files are installed |
In addition there are a number of BUSCO-related variables. If a mapping is run without giving a target node (as an NCBI taxid), BUSCO is used in auto lineage mode to establish a lineage.
In the current setup, BUSCO is run through the official docker container. Hence, it requires docker to be installed.
Check the mapping test directory at <DIR_PIPELINE>/test/proj_map.
For more information on BUSCO, see https://busco.ezlab.org.
The template_common.sh may require some editing, as described above.
You can check the setup by running
./mapping.sh CHECK
The mapping is set up and run via mapping.sh. Some commands end with 'GO'; if it is not included, the command will just do a dry run.
OrthoDB taxids are referred to below. They are identical to NCBI IDs but with a version suffix appended, e.g. 9606_0.
Note that the commands are capitalized below. This is not required.
Step 1. Running requires creating 'users', which can be seen as arbitrary labels.
./mapping.sh CREATE <user> GO
Step 2. Download the node you want to map against.
./mapping.sh DOWNLOAD <node ncbi taxid>
If you do not know which node to use, you can get the full lineage using the ete3 tool:
ete3 ncbiquery --search <ncbi taxid> --info
Note that not all nodes will be available in OrthoDB.
See the ete3 documentation (http://etetoolkit.org) for instructions on how to install ete3.
A full list of all nodes available for download can be obtained by
./mapping.sh DOWNLOAD ?
Step 3. Import OrthoDB files to your project
./mapping.sh DBIMPORT <user> <node id> GO
A list of all imported DB files is obtained from
./mapping.sh DBINFO <user>
Step 4. Import your fasta file
./mapping.sh IMPORT <user> "<taxid label>;<filename>"
The taxid label is an arbitrary alphanumerical string to be used as a label for the given file. Note the quotes; without them the ';' would be interpreted as a command separator by the shell. It creates an import file as described above in Import fasta files.
Step 5. Run
When mapping against a given node, you may want to select a subset of the node to map against, as it will reduce the compute time. This subset is provided as a CSV list of OrthoDB taxids.
./mapping.sh RUN <user id> <user taxid> [nodeid=<nodeid>] [mode=<busco|map>] [subsample=<N>] [score=<des|asc|sde|sas>] [using=<subnode|taxids>] [check] [go]
Further help can be obtained using
./mapping.sh HELP
Example
Below is an example where a sample fasta file is mapped against a subset of Cichliformes from OrthoDB.
# create user 'myproj'
./mapping.sh CREATE myproj GO
# download OrthoDB data for Cichliformes (NCBI taxid 1489911)
./mapping.sh DOWNLOAD 1489911
# import that node to 'myproj'
./mapping.sh DBIMPORT myproj 1489911 GO
# load a sample fasta file
curl https://data.orthodb.org/v11/download/orthologer/fasta/example.fs.gz -O
gunzip example.fs.gz
# import file
./mapping.sh IMPORT myproj "fish;example.fs"
# start mapping example.fs to node 1489911 using OrthoDB taxids 303518_0, 43689_0 and 8128_0
./mapping.sh RUN myproj fish nodeid=1489911 using=303518_0,43689_0,8128_0 GO
# start mapping example.fs, node is established by BUSCO, default subsampling of fasta files
# for this one, BUSCO_DATA needs to be set to a valid directory - this is where BUSCO will install all its data files
# also make sure that BUSCO_OFFLINE=0 if you don't already have the data in that location
./mapping.sh RUN myproj fish GO
The results should be in four files
- <base>.annotations - contains annotation for each cluster with mapped genes (tab separated columns)
- <base>.hits - contains the mapped genes, OrthoDB cluster id's and metrics
- <base>.desc - contains annotation for each cluster
- <base>.odb - contains FULL clusters including OrthoDB genes
where <base> is users/myproj/pipeline/Results/node_1489911_subnode_303518_0_43689_0_8128_0_taxid_fish.og
File formats
Files generated by orthologer have, when possible, a header and a footer. Those lines are prefixed by '#'.
A header has the format:
#----------------------------------------------------------------------
# Version : ORTHOLOGER-2.8.3 ; bc350680b326889c16c2c5b90c60c6567b670e03
# Title : <command used>
# Date : <date when run>
# System : <system environment>
# User : <user id>
# Job id : <either /bin/bash if local or a scheduler id>
# Work dir: <dir>
#----------------------------------------------------------------------
and the footer:
#TIME : <host> <time in seconds>
#END at <full date time>
Fasta files
Files
Sequences/*.fs_clean
Sequences/*.fs_masked
Sequences/*.fs_selected
A fasta record consists of two or more lines. The first line is an identifier line starting with '>' followed by the internal ID. The remaining lines (until the next '>') are the sequence.
Example:
>hobbit:00558a
MTYALFLLSVSLVMGFVGFSSKPSPIYGGLV
CD-hit clusters
Files
Sequences/*.fs_selected.clstr
These files are generated in the SELECT step with the extension .fs_selected.clstr. Each record (cluster) starts with '>Cluster X'. The following lines are genes that are very similar (97% identity by default). One of them is selected as representative (marked with a '*'). Example:
>Cluster 9
0 7604aa, >hobbit:000047... at 1:7604:137:7732/97.62%
1 7732aa, >hobbit:00008d... *
The representatives are saved in .fs_selected and the remaining in .inpar_selected.
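As a small sketch, assuming the .clstr layout shown above (the file name is illustrative), the representative IDs can be extracted with awk:
# print the sequence ID of each cluster representative (lines ending in '*')
awk '/\*$/ { sub(/^.*>/, ""); sub(/\.\.\..*/, ""); print }' Sequences/myfile.fs_selected.clstr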
Fasta statistics
Files
Sequences/*.fs_selected.stats
Sequences/*.fs_selected.seqlen
The .stats files contain global statistics on the sequences.
Example:
#NTOT:20915
#MINLEN:31
#MAXLEN:35653
#AVELEN:570.458
#MEDLEN:422
#TOTLEN:1.19311e+07
The .seqlen files contain three columns: the sequence ID, its length, and the fraction of non-masked residues.
Example:
hobbit:005586 318 0.87
hobbit:005587 98 1.00
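Since the third column is the non-masked fraction, heavily masked sequences can be spotted with a one-liner like this (file name illustrative):
# list sequences where more than half of the residues are masked, skipping '#' header/footer lines
awk '!/^#/ && $3 < 0.5 { print $1, $3 }' Sequences/myfile.fs_selected.seqlen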
Pairwise files (alignment, BRH)
Files:
PWC/*/*.align
PWC/*/*.brh
These files consist of space-separated columns with the following format:
<id A> <id B> <score> <e-value> <PID> <startA> <lastA> <startB> <lastB> [in BRH files, possibly two more columns related to fuzzy BRHs]
Example:
fish:000000 63155_0:003794 130.45 4.88e-141 65.8 16 420 1 391 0 0
fish:000021 63155_0:002534 203.45 4.52e-88 85.6 1 174 1 174 0 0
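Given that the fifth column is the percent identity, high-identity pairs can be filtered with a sketch like the following (the file path is illustrative):
# keep pairs with percent identity above 80, skipping '#' header/footer lines
awk '!/^#/ && $5 > 80' PWC/mypair/mypair.brh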
Inparalogs
Files
PWC/*/*.inpar
Each line contains paralogs with respect to a BRH (the first two columns):
<BRH A> <BRH B> <score> <inpar 1> <inpar 2> ...
Each <inpar> entry is a ';'-separated block:
<inpar ID>;<score>;<start A>;<last A>;<start inpar>;<last inpar>
Example:
63155_0:0055a4 fish:000163 211.77 63155_0:001168;213.862;23;923;1;902
63155_0:003ac5 fish:0003a6 49.44 63155_0:00470a;54.4638;35;294;97;359 63155_0:0000bb;49.8592;35;294;139;401
Cluster
Files
Cluster/*.og_raw
Cluster/*.og_inpar
Cluster/*.og
Each line is space-separated, one line per cluster element.
<cluster id> <gene id> <cluster type> <nr of edges> <align start> <align last> <PID> <score> <e-value>
The metrics at the end are evaluated as a mean over all edges. Depending on how an element entered the cluster, it gets a different cluster type: in particular, 7 refers to inparalogs and 9 to genes filtered in the SELECT step (cd-hit). See the README.md file for BRHCLUS.
Example:
84327 61853_0:001e1a 0 1 1 717 100 258.55 0
84327 30611_0:00449d 7 1 1 585 0 149.507 0
84327 30611_0:004493 2 1 1 494 100 125.51 7.16e-153
84327 9515_0:004f59 9 1 1 523 0 100 0
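As a final sketch (the cluster file name is illustrative), per-cluster gene counts can be computed directly from the first column:
# count genes per cluster, skipping '#' header/footer lines and blank lines
awk '!/^#/ && NF { n[$1]++ } END { for (c in n) print c, n[c] }' Cluster/myproject.og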