SonicParanoid is a stand-alone software tool for the de novo identification of orthologous relationships among multiple species.
For more details, and for citations, refer to the papers below:
SonicParanoid is able to infer the orthologs for dozens of prokaryotes in minutes, or hours for eukaryotes, using a desktop computer with 8 CPUs. This figure is much smaller when running on HPC servers with dozens of CPUs (e.g. <1h for the QfO benchmark dataset). It is also highly scalable, as it inferred the orthologs for 2000 MAGs in only 1 day using 128 CPUs.
SonicParanoid was tested using a benchmark proteome dataset from the Quest for Orthologs consortium, and the correctness of its predictions was evaluated using the QfO Benchmarking service. SonicParanoid showed the highest accuracy in the aggregated rankings from the three accuracy classification methods in the 2020 QfO benchmark.
SonicParanoid only requires the Python programming language and a GNU GCC compiler to be installed on your laptop or server. The low hardware requirements make it possible to run SonicParanoid on modern laptop computers, while the "update" feature allows users to easily maintain collections of orthologs that can be updated by adding or removing species.
The latest version of SonicParanoid uses machine learning to reduce the time required for all-versus-all alignments.
The following real-life examples show the reduction in execution time for the ML-based alignments (essentials) compared with the normal alignments (complete).
Tests were performed on different datasets, on a high-performance computing (HPC) server and on a desktop computer.
Sensitivity | Execution mode (default is complete) | Ex. time (hours) | Orthologous relationships (millions) | Non-default parameters |
---|---|---|---|---|
Fast | Complete | 0.56 | 14.64 | --mode fast |
Fast | Graph-only | 0.23 | 11.27 | --mode fast -go |
Default | Complete | 0.86 | 15.28 | |
Default | Graph-only | 0.66 | 11.98 | -go |
Sensitive | Complete | 4.36 | 19.27 | --mode sensitive |
Sensitive | Graph-only | 4.17 | 15.65 | --mode sensitive -go |
Sensitivity | Execution mode (default is complete) | Ex. time (hours) | Orthologous relationships (millions) | Non-default parameters |
---|---|---|---|---|
Fast | Complete | 24.33 | 1,692.88 | --mode fast |
Fast | Graph-only | 19.70 | 1,233.64 | --mode fast -go |
Default | Complete | 37.99 | 1,817.39 | |
Default | Graph-only | 34.41 | 1,390.97 | -go |
Computer: | High performance computing (HPC) server |
CPU: | 128 Cores @ 2.25~3.40 GHz from 2 AMD EPYC 7742 (Rome) sockets |
Memory: | 2 Terabytes of DDR4 shared memory |
Storage: | Intel D3-S4610 solid-state disk |
OS: | Ubuntu 20.04.3 LTS (Linux 5.11.0) |
Sensitivity | Execution mode (default is complete) | Ex. time (hours) | Orthologous relationships (millions) | Non-default parameters |
---|---|---|---|---|
Fast | Complete | 3.08 | 14.64 | --mode fast |
Fast | Graph-only | 2.15 | 11.27 | --mode fast -go |
Default | Complete | 6.63 | 15.26 | |
Default | Graph-only | 5.56 | 11.98 | -go |
Sensitive | Complete | 45.86 | 19.27 | --mode sensitive |
Sensitive | Graph-only | 45.49 | 15.65 | --mode sensitive -go |
NOTE: these results are currently being updated!
Computer: | Desktop computer (from 2019) |
CPU: | 8 Cores Intel Core i9 9900K CPU @ 3.60 GHz |
Memory: | 32 Gigabytes of DDR4 |
Storage: | SK Hynix PC601 NVMe (1TB) |
OS: | Manjaro (Sikaris 22.0.0) (Linux 6.0.8-1) |
Dataset | Proteomes | Eukaryotes; Prokaryotes | Sequences (thousands) | Required alignments | Description |
---|---|---|---|---|---|
QfO 2020 | 78 | 50; 28 | 984.14 | 6,084 | Curated proteomes from the Quest for Orthologs consortium |
2000 Microbial MAGs | 2000 | 0; 2000 | 5,091.98 | 4,000,000 | Subset of high quality MAGs obtained from Nayfach et al. (2020) |
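
The "Required alignments" column matches the square of the number of proteomes (78 × 78 = 6,084 and 2000 × 2000 = 4,000,000), i.e. every ordered pair of proteomes, including each proteome against itself. A minimal Python sketch of that count follows; `required_alignments` is a hypothetical helper written for illustration and is not part of SonicParanoid.

```python
def required_alignments(n_proteomes: int) -> int:
    """Count all-vs-all alignment runs for n proteomes.

    Assumes one run per ordered pair (A vs B and B vs A),
    plus each proteome against itself, which gives n * n.
    """
    if n_proteomes < 0:
        raise ValueError("the number of proteomes cannot be negative")
    return n_proteomes * n_proteomes


if __name__ == "__main__":
    # Proteome counts taken from the dataset table above.
    for name, n in (("QfO 2020", 78), ("2000 Microbial MAGs", 2000)):
        print(f"{name}: {required_alignments(n):,} alignments")
```

The quadratic growth is why the 2000-MAG dataset needs several hundred times more alignments than the QfO dataset, even though it has only about 26 times as many proteomes.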
SonicParanoid requires a system with a 64-bit CPU with at least 4 cores and 16 Gigabytes of memory.
Valid input files must contain protein sequences in the FASTA format.
Each FASTA file should have a unique name, and must be different in content from all the others.
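
Before starting a run, it can be convenient to check that no two input files have identical content. The following is a minimal Python sketch under stated assumptions: `find_duplicate_inputs` is a hypothetical helper (not a SonicParanoid command), and it only detects byte-identical files, so two files with the same sequences in a different order would not be flagged.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_inputs(input_dir):
    """Return groups of file names in input_dir whose contents are byte-identical."""
    by_digest = defaultdict(list)
    for path in sorted(Path(input_dir).iterdir()):
        if path.is_file():
            # Hash the raw bytes so that identical files share one digest.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path.name)
    # Keep only the digests shared by more than one file.
    return [names for names in by_digest.values() if len(names) > 1]
```

An empty list means every file in the directory has distinct content. File names within a single directory are already unique, so only the content needs checking.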
It is good practice to keep the species names short (less than 10 letters). For example, an input file named
Homo_sapiens.faa would be better renamed to hsapiens.
This makes the final output tables much easier to read.
The above also applies to protein names. These should be short where possible, as the final ortholog table
contains these protein names multiple times and very long protein names would make the output difficult to read.
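
One possible way to derive short species names is sketched below in Python. The naming rule (first letter of the genus followed by the species epithet, lowercased and cut to 9 characters) is an assumption made for illustration, and `short_species_name` is a hypothetical helper, not something SonicParanoid provides.

```python
from pathlib import Path


def short_species_name(filename: str, max_len: int = 9) -> str:
    """Suggest a short name such as 'hsapiens' for 'Homo_sapiens.faa'."""
    stem = Path(filename).stem  # drop the file extension
    parts = [p for p in stem.replace("-", "_").split("_") if p]
    if len(parts) >= 2:
        # First letter of the genus plus the species epithet.
        name = parts[0][0] + parts[1]
    else:
        name = stem
    # Keep the name under 10 letters, as recommended above.
    return name.lower()[:max_len]


if __name__ == "__main__":
    print(short_species_name("Homo_sapiens.faa"))
```

Truncation can make two different species collide on the same short name, so the suggested names should be checked for uniqueness before renaming any file.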
SonicParanoid requires about 9 Gigabytes of storage for the installation, mainly due to the size of the
PFamA profile DB (~9 GB).
After the installation, and only during the first execution,
SonicParanoid generates the profile DB from PFamA. This file requires about 9 GB of disk space.
To further speed up the computation of all-vs-all alignments, MMseqs2 and Diamond can generate index
files of the input proteome files. This can be done using the parameter --index-db. These
index files are relatively big (about 1 Gigabyte per input proteome), but are automatically removed by
SonicParanoid after the execution is completed.
SonicParanoid automatically avoids the creation of the index files if the available storage is
lower than the amount of disk space required to store the index files.
Using the --index-db parameter can result in a 5~10% speed-up for all-vs-all alignments when
using MMseqs2/Diamond.
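
To get a rough idea of whether a disk has room for the index files, the estimate of about 1 Gigabyte per proteome can be compared with the free space. The Python sketch below is only an approximation under that assumption. It is not the check SonicParanoid performs internally, and both function names are hypothetical.

```python
import shutil

GIB = 1024 ** 3  # bytes in one gibibyte


def index_space_needed(n_proteomes: int, bytes_per_index: int = GIB) -> int:
    """Estimate the disk space (in bytes) for the index files, ~1 GB per proteome."""
    return n_proteomes * bytes_per_index


def can_create_indexes(n_proteomes: int, directory: str = ".",
                       bytes_per_index: int = GIB) -> bool:
    """Return True if the disk holding `directory` has enough free space."""
    free = shutil.disk_usage(directory).free
    return free >= index_space_needed(n_proteomes, bytes_per_index)


if __name__ == "__main__":
    n = 78  # number of proteomes in the QfO 2020 dataset
    print(f"Estimated index size: {index_space_needed(n) / GIB:.0f} GiB")
    print("Enough free space:", can_create_indexes(n))
```

For a large input such as 2000 proteomes the estimate is about 2 Terabytes, which may outweigh the 5~10% speed-up on systems with limited storage.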
Once installed, SonicParanoid can be executed through the command line by running the program sonicparanoid.
The command sonicparanoid --help provides extra information on the command line parameters.
SonicParanoid comes with a test input set composed of 4 bacterial proteomes. To verify that SonicParanoid has been successfully installed type the following commands:
SonicParanoid allows the update of a previously computed set of ortholog relations by adding and/or removing proteome
files from the original input set.
Suppose in the previous example we computed the ortholog relations amongst species A, B,
C, and D and that we now want to remove C from the analysis.
This is done by copying A, B, and D into a new directory my_new_input
(or by removing C from the original input directory) and using the same output directory as follows:
At each execution, SonicParanoid stores the execution information and results in a directory named /output/runs/my_project/, where my_project can optionally be set using the --project-id parameter.
For example, given the following execution of SonicParanoid
The output directory resulting from the above run will have the following structure:
The orthologs shared among the input species are stored in the directory named ortholog_groups under
the main run directory (my_first_run in our example).
Following are the relevant output files related to the ortholog groups:
In addition to the ortholog groups SonicParanoid provides an ortholog table for each pair of
proteomes.
For example, given a run with N input proteomes, the directory pairwise_orthologs
(under
These tables are useful to quickly see the orthologs shared between pairs of species rather than
those shared among multiple species.
For example, if the input consists of proteomes 1, 2, and 3, the
pairwise ortholog tables 1-2, 1-3 and 2-3 will be generated and
stored under the pairwise_orthologs directory.
The tables are stored in sub-directories named after the leftmost species in the pair, as
follows:
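
A small Python sketch of that layout for the three-proteome example is given here. It assumes each table file is simply named after its pair (for example 1-2), so the real file names may differ, and `pairwise_table_paths` is a hypothetical helper used only for illustration.

```python
from itertools import combinations
from pathlib import PurePosixPath


def pairwise_table_paths(species, root="pairwise_orthologs"):
    """List the expected table paths, grouped by the leftmost species of each pair."""
    paths = []
    # combinations() yields each unordered pair once, in input order.
    for left, right in combinations(species, 2):
        paths.append(str(PurePosixPath(root) / str(left) / f"{left}-{right}"))
    return paths


if __name__ == "__main__":
    for path in pairwise_table_paths([1, 2, 3]):
        print(path)
```

With N input proteomes this produces N(N-1)/2 tables. The last species in the input never appears on the left, so it gets no sub-directory of its own.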
You can list all the available parameters by typing sonicparanoid --help in the command line.
Following is a list of SonicParanoid's command line parameters and their usage:
Copyright © 2017, Salvatore Cosentino, The University of Tokyo All rights reserved.
Licensed under the GNU GENERAL PUBLIC LICENSE, Version 3.0 (referred from now on as the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
https://www.gnu.org/licenses/gpl-3.0.en.html
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.