User guide
The Wendellor package builds an NCBI query, performs the search, downloads FASTA sequences and, finally, runs muscle and tabajara!
This guide covers:
- Installing dependencies
- Installing Wendellor
- Upgrading Wendellor
- Uninstalling Wendellor
- Running an example
Installing dependencies
The Wendellor package is written in Python 3, so you will need to install a few system packages first:
sudo apt-get install git \
    python-dev \
    python-pip \
    python3-dev \
    python3-pip
Here you're installing:
Package | Description |
---|---|
git | Version control system for tracking changes in computer files and coordinating work on those files among multiple people. Here it is used to clone our repository |
python-dev | Header files, a static library and development tools for building Python modules, extending the Python 2.7 interpreter or embedding Python 2.7 in applications. |
python-pip | pip is the Python 2.7 package installer. It integrates with virtualenv, doesn't do partial installs, can save package state for replaying, can install from non-egg sources, and can install from version control repositories. |
python3-dev | Header files, a static library and development tools for building Python modules, extending the Python 3 interpreter or embedding Python 3 in applications. |
python3-pip | pip is the Python 3 package installer. It integrates with virtualenv, doesn't do partial installs, can save package state for replaying, can install from non-egg sources, and can install from version control repositories. |
There are two more dependencies: muscle and tabajara.
Installing
After installing the dependencies, you can install the Wendellor package with the following command:
sudo pip3 install Wendellor
Run with sudo permissions, this command executes pip3 and installs Wendellor for all users on the machine.
If you don't have permission to do that, or you don't want to share it with your friends:
pip3 install --user Wendellor
Upgrading
To upgrade Wendellor's package, execute:
pip3 install --upgrade Wendellor
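To confirm which version is installed before or after an upgrade, a small check like the one below may help. It is only a sketch: it assumes Python 3.8 or newer (for importlib.metadata) and that the distribution is registered under the name Wendellor, as in the pip3 commands above. The helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string, or None when Wendellor is not installed.
print(installed_version("Wendellor"))
```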
Uninstalling
If you don't like this project anymore :(
Use this command to uninstall:
pip3 uninstall Wendellor
If you've installed with sudo permissions, you'll have to uninstall with sudo as well.
Running an example
Here we're going to execute wendellor.py, understand the anatomy of its configuration file and see what you get.
Cloning the repository
The configuration files are not yet installed together with the package (I'll figure out how to do that one day), so for now you need to clone our GitHub repository with the following command:
git clone https://github.com/WendelHime/Wendellor.git
And change directory with:
cd Wendellor/conf_examples
Please ignore any files that don't have the .conf suffix; they are used by developers during development.
Update
I've added the configuration files to the /usr/share/Wendellor/config directory, but they are only installed there if you installed Wendellor for all users (with sudo).
Understanding configuration file and how this project works
This is the most boring, but most fundamental, part of this project. We're going to use example_both_discrimination.conf. If you haven't cloned the Wendellor repository and don't want the other examples, create example_both_discrimination.conf with the following content:
[Taxonomic group]
taxon = 1980416
groups = all
output_file = ids.txt
[Downloader]
database = ipg
ids_file = ids.txt
positive_terms = polymerase;L protein;RNA;RDRP
negative_terms = nucleocapsid;nucleoprotein;N protein;M protein;glycoprotein;envelope
min_length = 2000
max_length = 10000
[Unifier]
ids_file = ids.txt
discriminate_groups = both
[Wendellor]
log = wendellor.log
ids_file = ids.txt
output_dir = output
muscle_parameters =
tabajara_parameters = -t 0.5 -p 50 -w 15 -b 15
muscle_filepath = /usr/bin/muscle
tabajara_filepath = /home/liliane/ProjetoDr/tabajara/tabajara_v1.43.pl
build_hmm = yes
So now you have the configuration file, with its content separated into sections, each one handled by a small application inside the Wendellor package. This project is composed of four applications:
Application | Description |
---|---|
organism.py | Finds taxonomic IDs by sending requests to the NCBI Taxonomy Database |
downloader.py | Downloads FASTA sequences from NCBI, using the organism.py result to build its requests |
unifier.py | Unifies the files into one file for the main organism and renames it |
wendellor.py | Main script, which runs the other scripts using the config file |
From the table above, you can see how this project works:
wendellor.py executes as a pipeline: organism.py -> downloader.py -> unifier.py -> muscle -> tabajara
With this pipeline design in mind, each section of a configuration file like example_both_discrimination.conf is consumed by one component of the Wendellor package.
I haven't explained what a section is yet: a section starts with a header such as [Taxonomic group] or [Downloader], and it groups all the parameters that follow until the next section header appears.
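To make the idea of sections concrete, here is a minimal sketch that parses a shortened copy of the example file with Python's standard configparser module. Whether Wendellor itself uses configparser is an assumption; the sketch only shows how the section headers group the parameters.

```python
import configparser

# A shortened copy of example_both_discrimination.conf, inlined so the
# sketch runs without any file on disk.
EXAMPLE_CONF = """
[Taxonomic group]
taxon = 1980416
groups = all
output_file = ids.txt

[Downloader]
database = ipg
ids_file = ids.txt
positive_terms = polymerase;L protein;RNA;RDRP
min_length = 2000
max_length = 10000

[Unifier]
ids_file = ids.txt
discriminate_groups = both

[Wendellor]
log = wendellor.log
output_dir = output
muscle_parameters =
build_hmm = yes
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE_CONF)

# Each section header groups the parameters that follow it.
for section in config.sections():
    print(section, dict(config[section]))

# Values are plain strings; lists such as positive_terms are split by hand.
positive_terms = config["Downloader"]["positive_terms"].split(";")
min_length = config.getint("Downloader", "min_length")
```

Note that an empty value such as muscle_parameters is read as an empty string, not as a missing key.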
In the configuration file, we've defined:
Section | Application |
---|---|
Taxonomic group | organism.py |
Downloader | downloader.py |
Unifier | unifier.py |
Wendellor | wendellor.py |
Now you understand that the sections are separated so that each one can be used by one of the small components of the Wendellor package. Unfortunately, I still have to explain the parameters (this section never ends).
Starting with [Taxonomic group]:
Attribute | Description |
---|---|
taxon | Organism name or taxonomic ID |
groups | Descendant groups of the taxon parameter to be considered; should be all, or specific group names or taxonomic IDs separated by ; |
output_file | Output filepath |
So organism.py reads this section, gets the parameters, finds the taxonomic ID by sending requests to the NCBI Taxonomy Database and collects the taxonomic IDs of the descendant groups selected for the given taxon. For example:
Peribunyaviridae
|
|- Orthobunyavirus
|- Herbevirus
It will write the taxonomic IDs of Orthobunyavirus and Herbevirus to output_file.
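The selection controlled by the groups parameter can be sketched as follows. This is an illustration only: select_groups is a hypothetical helper, it works on group names rather than taxonomic IDs, and the real organism.py obtains the descendants from NCBI instead of a hard-coded list.

```python
def select_groups(groups_value, descendants):
    """Return the descendant group names selected by the `groups` parameter."""
    if groups_value.strip().lower() == "all":
        return list(descendants)
    # Otherwise the value is a ;-separated list of group names.
    wanted = {name.strip() for name in groups_value.split(";") if name.strip()}
    return [name for name in descendants if name in wanted]


# Descendants of Peribunyaviridae from the example tree above.
descendants = ["Orthobunyavirus", "Herbevirus"]
print(select_groups("all", descendants))
print(select_groups("Herbevirus", descendants))
```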
In [Downloader]:
Attribute | Description |
---|---|
database | NCBI database to search; should be protein or ipg |
ids_file | organism.py output filepath |
positive_terms | Terms to search for in the NCBI query, separated by ; |
negative_terms | Terms to exclude from the NCBI query, separated by ; |
min_length | Minimum sequence length |
max_length | Maximum sequence length |
So downloader.py reads this section, gets the parameters, builds a query from positive_terms, negative_terms, min_length and max_length, searches the NCBI database named by database and downloads the results as FASTA files.
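The exact query that downloader.py sends is not shown in this guide, so the following is only a sketch of how such a query could be assembled. The helper build_query is hypothetical, and several choices are assumptions: the positive terms are joined with OR, the taxon is restricted with the Entrez [Organism:exp] field, and the length range uses the [SLEN] field of the protein database (whether ipg accepts the same field has not been verified here).

```python
def build_query(taxon_id, positive_terms, negative_terms, min_length, max_length):
    """Assemble an Entrez-style search string (hypothetical helper)."""

    def quoted(terms):
        # Quote each term so multi-word terms such as "L protein" stay together.
        return " OR ".join('"{}"'.format(term) for term in terms)

    query = "txid{}[Organism:exp]".format(taxon_id)
    if positive_terms:
        query += " AND ({})".format(quoted(positive_terms))
    if negative_terms:
        query += " NOT ({})".format(quoted(negative_terms))
    query += " AND {}:{}[SLEN]".format(min_length, max_length)
    return query


example = build_query(
    1980416,
    "polymerase;L protein".split(";"),
    "nucleocapsid".split(";"),
    2000,
    10000,
)
print(example)
```

For the values above this prints: txid1980416[Organism:exp] AND ("polymerase" OR "L protein") NOT ("nucleocapsid") AND 2000:10000[SLEN]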
In [Unifier]:
Attribute | Description |
---|---|
ids_file | organism.py output filepath |
discriminate_groups | Whether to discriminate your groups by adding a prefix; should be yes, no or both |
So unifier.py reads this section, gets the parameters and reads ids_file. If discriminate_groups is yes or both, it adds a prefix with the scientific name to every sequence; if it is no, it adds no prefix. Finally, it creates a main FASTA file of sequences, named after the main organism.
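The prefixing step can be pictured with the sketch below. It is not the code of unifier.py: add_prefix is a hypothetical helper, the underscore separator is an assumption, and the two records are made-up placeholders rather than real NCBI entries.

```python
def add_prefix(fasta_lines, prefix):
    """Prepend `prefix` to every FASTA header line; sequence lines are untouched."""
    result = []
    for line in fasta_lines:
        if line.startswith(">"):
            # Header line: keep ">" first, then the prefix, then the old header.
            result.append(">{}_{}".format(prefix, line[1:]))
        else:
            result.append(line)
    return result


# Placeholder records, invented for this illustration only.
records = [">ABC123 RNA polymerase", "MKTLLV", ">XYZ789 L protein", "MDQYRE"]
prefixed = add_prefix(records, "Orthobunyavirus")
for line in prefixed:
    print(line)
```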
In [Wendellor]:
Attribute | Description |
---|---|
log | Output log filename |
ids_file | organism.py output filepath |
output_dir | Output dir name |
muscle_parameters | Muscle optional parameters |
tabajara_parameters | Tabajara optional parameters |
muscle_filepath | Muscle executable filepath |
tabajara_filepath | Tabajara executable filepath |
build_hmm | If build_hmm = no, Wendellor will not run muscle and tabajara; if it is yes or not defined, it will run both |
So wendellor.py reads this section, gets the parameters, executes the other components and, at the end, runs muscle and feeds its result to tabajara.
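The effect of build_hmm on the pipeline can be sketched like this. The helper planned_steps is hypothetical and only lists step names; it does not run anything, and the use of configparser is an assumption about how the file is read.

```python
import configparser


def planned_steps(config):
    """List the pipeline steps that would run for a given configuration."""
    steps = ["organism.py", "downloader.py", "unifier.py"]
    # build_hmm defaults to "yes" when the key (or the section) is missing.
    build_hmm = config.get("Wendellor", "build_hmm", fallback="yes").strip().lower()
    if build_hmm != "no":
        steps += ["muscle", "tabajara"]
    return steps


config = configparser.ConfigParser()
config.read_string("[Wendellor]\nbuild_hmm = no\n")
print(planned_steps(config))
```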
Execute wendellor.py
Now that you have an idea of how wendellor.py works, you can run it:
wendellor.py example_both_discrimination.conf &
While wendellor.py is running, you can follow the log using:
tail -f wendellor.log
Looking at results
In the output_dir defined in [Wendellor], you'll have something like:
drwxrwxr-x 2 wendelhlc wendelhlc 4,0K Abr 3 18:03 fasta_sequences
drwxrwxr-x 2 wendelhlc wendelhlc 4,0K Abr 3 18:12 logs
drwxrwxr-x 2 wendelhlc wendelhlc 4,0K Abr 3 18:06 muscle_output
drwxrwxr-x 4 wendelhlc wendelhlc 4,0K Abr 3 18:10 tabajara_conserv
drwxrwxr-x 4 wendelhlc wendelhlc 4,0K Abr 3 18:12 tabajara_discrim
In the fasta_sequences dir, you'll find the FASTA sequences.
In the logs dir, you'll find the logs.
In the muscle_output dir, you'll find the muscle result.
In the tabajara* dirs, you'll find the tabajara results.
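For a quick overview of what a run produced, a short script can count the entries in each of these directories. This is just a convenience sketch: summarize_output is a hypothetical helper, and it assumes output_dir is set to output, as in the example configuration.

```python
from pathlib import Path


def summarize_output(output_dir):
    """Map each subdirectory of output_dir to the number of entries in it."""
    root = Path(output_dir)
    return {
        child.name: sum(1 for _ in child.iterdir())
        for child in sorted(root.iterdir())
        if child.is_dir()
    }


# Only report when the directory from the example configuration exists.
if Path("output").is_dir():
    for name, count in summarize_output("output").items():
        print("{}: {} entries".format(name, count))
```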