User guide

The Wendellor package builds an NCBI query, performs the search, downloads the FASTA sequences and, finally, runs muscle and tabajara!

Installing dependencies

The Wendellor package was written in Python 3, so you will need to install a few system packages first:

sudo apt-get install git \
    python-dev\
    python-pip\
    python3-dev\
    python3-pip

Here you're installing:

Package Description
git Version control system for tracking changes in files and coordinating work on them among multiple people. Here it is used to clone our repository.
python-dev Header files, a static library and development tools for building Python modules, extending the Python 2.7 interpreter or embedding Python 2.7 in applications.
python-pip pip is the Python 2.7 package installer. It integrates with virtualenv, doesn't do partial installs, can save package state for replaying, can install from non-egg sources, and can install from version control repositories.
python3-dev Header files, a static library and development tools for building Python modules, extending the Python 3 interpreter or embedding Python 3 in applications.
python3-pip pip is the Python 3 package installer. It integrates with virtualenv, doesn't do partial installs, can save package state for replaying, can install from non-egg sources, and can install from version control repositories.

There are two more dependencies: muscle and tabajara.

Installing

After installing the dependencies, you can install the Wendellor package with the following command:

sudo pip3 install Wendellor

Run with sudo, this command executes pip3 and installs Wendellor for all users on the machine.
If you don't have permission to do that, or you don't want to share it with your friends, install it for your user only:

pip3 install --user Wendellor

Upgrading

To upgrade the Wendellor package, execute:

pip3 install --upgrade Wendellor
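
To confirm which version is installed before or after an upgrade, a small check like the one below may help. It is only a sketch: it assumes Python 3.8 or newer (for importlib.metadata) and that the distribution is registered under the name Wendellor, as the pip3 commands above suggest. The helper name installed_version is made up for this example.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "Wendellor" is assumed to be the distribution name used by pip3.
    version = installed_version("Wendellor")
    if version is None:
        print("Wendellor is not installed for this interpreter")
    else:
        print("Wendellor", version)
```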

Uninstalling

If you don't like this project anymore :(
use this command to uninstall it:

pip3 uninstall Wendellor

If you installed it with sudo, you'll have to uninstall it with sudo as well.

Running an example

Here we're going to run wendellor.py, go through the anatomy of its configuration file and look at what you get.

Cloning the repository

The configuration files are not installed together with the package yet (I'll see how to do that one day), so for now you need to clone our GitHub repository with the following command:

git clone https://github.com/WendelHime/Wendellor.git

And change directory with:

cd Wendellor/conf_examples

Please ignore the files that don't have the .conf suffix; they're used by developers during development.

Update: I've added the configuration files to the /usr/share/Wendellor/config directory, but they are only installed if you installed Wendellor for all users (with sudo).

Understanding configuration file and how this project works

This is the most boring, but also the most fundamental, part of this project. We're going to use example_both_discrimination.conf. If you haven't cloned the Wendellor repository and don't want the other examples, create example_both_discrimination.conf with the following content:

[Taxonomic group]
taxon = 1980416
groups = all
output_file = ids.txt

[Downloader]
database = ipg
ids_file = ids.txt
positive_terms = polymerase;L protein;RNA;RDRP
negative_terms = nucleocapsid;nucleoprotein;N protein;M protein;glycoprotein;envelope
min_length = 2000
max_length = 10000

[Unifier]
ids_file = ids.txt
discriminate_groups = both

[Wendellor]
log = wendellor.log
ids_file = ids.txt
output_dir = output
muscle_parameters = 
tabajara_parameters = -t 0.5 -p 50 -w 15 -b 15
muscle_filepath = /usr/bin/muscle
tabajara_filepath = /home/liliane/ProjetoDr/tabajara/tabajara_v1.43.pl
build_hmm = yes

So now you have the configuration file, with its content separated into sections, each one used by a small application inside the Wendellor package. The project is composed of four applications:

Application Description
organism.py Finds taxonomic IDs by querying the NCBI Taxonomy database
downloader.py Downloads FASTA sequences from NCBI using the organism.py result
unifier.py Unifies the downloaded files into one main FASTA file named after the main organism
wendellor.py Main script that runs the other scripts using the config file

Considering the description table above, you can see how this project works:

wendellor.py runs as a pipeline: organism.py -> downloader.py -> unifier.py -> muscle -> tabajara

With this pipeline design in mind, each section of a configuration file such as example_both_discrimination.conf is consumed by one component of the Wendellor package.

I haven't explained what a section is yet: a section is something like [Taxonomic group] or [Downloader]. It groups the parameters that follow it, until the next section header appears.
The configuration file defines:

Section Application
Taxonomic group organism.py
Downloader downloader.py
Unifier unifier.py
Wendellor wendellor.py
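
The section-to-application mapping can be illustrated with Python's standard configparser module. This is only a sketch of how such a file could be read, not necessarily how the Wendellor package reads it; the names SECTION_TO_APPLICATION, EXAMPLE_CONF and parameters_by_application are made up for this example, and the embedded configuration is a shortened copy of example_both_discrimination.conf.

```python
import configparser

# Section-to-component mapping, copied from the table above.
SECTION_TO_APPLICATION = {
    "Taxonomic group": "organism.py",
    "Downloader": "downloader.py",
    "Unifier": "unifier.py",
    "Wendellor": "wendellor.py",
}

# Shortened copy of example_both_discrimination.conf.
EXAMPLE_CONF = """
[Taxonomic group]
taxon = 1980416
groups = all
output_file = ids.txt

[Downloader]
database = ipg
ids_file = ids.txt
min_length = 2000
max_length = 10000

[Unifier]
ids_file = ids.txt
discriminate_groups = both

[Wendellor]
log = wendellor.log
ids_file = ids.txt
output_dir = output
muscle_parameters =
build_hmm = yes
"""


def parameters_by_application(conf_text):
    """Return {application: {parameter: value}} for every known section."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return {
        SECTION_TO_APPLICATION[name]: dict(parser[name])
        for name in parser.sections()
        if name in SECTION_TO_APPLICATION
    }


if __name__ == "__main__":
    for application, parameters in parameters_by_application(EXAMPLE_CONF).items():
        print(application, parameters)
```

Note that configparser returns every value as a string, so an empty entry such as muscle_parameters comes back as an empty string and numeric values such as min_length still need converting.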

Now you understand that the sections are split up so that each small component of the Wendellor package uses its own. Unfortunately, I still have to explain the parameters (this section never ends).

Starting by [Taxonomic group]:

Attribute Description
taxon Organism name or taxonomic ID
groups Descendant groups of the taxon parameter to be considered; either all or specific group names or taxonomic IDs, separated by ;
output_file Output filepath

So organism.py reads this section, gets the parameters, finds the taxonomic ID by querying the NCBI Taxonomy database and collects the taxonomic IDs of the considered descendant groups of the given taxon. For example:

Peribunyaviridae
|
|- Orthobunyavirus
|- Herbevirus

It will write the taxonomic IDs of Orthobunyavirus and Herbevirus to output_file.
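
The way the groups parameter narrows down the descendants can be sketched as follows. This is an illustration under assumptions, not the code of organism.py: the function names are hypothetical, matching is assumed to be case-insensitive, and the taxonomic IDs 111 and 222 are placeholders rather than real NCBI identifiers.

```python
def parse_groups(raw_value):
    """Return None for 'all', else the list of group names or taxonomic IDs."""
    value = raw_value.strip()
    if value.lower() == "all":
        return None  # None means: keep every descendant group
    return [item.strip() for item in value.split(";") if item.strip()]


def select_groups(descendants, raw_value):
    """Keep the descendants (name -> taxonomic ID) selected by `groups`."""
    wanted = parse_groups(raw_value)
    if wanted is None:
        return dict(descendants)
    wanted_lower = {item.lower() for item in wanted}
    return {
        name: tax_id
        for name, tax_id in descendants.items()
        if name.lower() in wanted_lower or str(tax_id) in wanted_lower
    }


if __name__ == "__main__":
    # Placeholder IDs, NOT the real NCBI taxonomic IDs of these genera.
    descendants = {"Orthobunyavirus": 111, "Herbevirus": 222}
    print(select_groups(descendants, "all"))
    print(select_groups(descendants, "Herbevirus"))
```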

In [Downloader]:

Attribute Description
database NCBI database to search; should be protein or ipg
ids_file organism.py output filepath
positive_terms Terms that results must match in the NCBI query, separated by ;
negative_terms Terms to exclude in the NCBI query, separated by ;
min_length Minimum sequence length
max_length Maximum sequence length

So downloader.py reads this section, gets the parameters, builds a query from positive_terms, negative_terms, min_length and max_length, runs the search on the NCBI database given by database and downloads the results as FASTA files.
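
To make the query-building step more concrete, here is one way such a query string could be composed. This is a guess at the general shape, not the query downloader.py actually sends: combining positive terms with OR and negative terms with NOT is an assumption, the [Organism:exp] and [SLEN] field tags follow Entrez search conventions for the protein database and may not apply to ipg, and build_query is a hypothetical name.

```python
def build_query(tax_id, positive_terms, negative_terms, min_length, max_length):
    """Compose an Entrez-style query string from the [Downloader] parameters."""

    def split_terms(raw):
        # Terms are separated by ';' in the configuration file.
        return [term.strip() for term in raw.split(";") if term.strip()]

    def any_of(terms):
        # Quote each term so multi-word terms stay together.
        return "(" + " OR ".join('"{}"'.format(term) for term in terms) + ")"

    query = "txid{}[Organism:exp]".format(tax_id)
    positives = split_terms(positive_terms)
    negatives = split_terms(negative_terms)
    if positives:
        query += " AND " + any_of(positives)
    if negatives:
        query += " NOT " + any_of(negatives)
    query += " AND {}:{}[SLEN]".format(min_length, max_length)
    return query


if __name__ == "__main__":
    print(build_query(1980416, "polymerase;L protein", "nucleocapsid;N protein", 2000, 10000))
```

With the values above, the sketch prints txid1980416[Organism:exp] AND ("polymerase" OR "L protein") NOT ("nucleocapsid" OR "N protein") AND 2000:10000[SLEN].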

In [Unifier]:

Attribute Description
ids_file organism.py output filepath
discriminate_groups Whether to discriminate your groups by adding a prefix; should be yes, no or both

So unifier.py reads this section, gets the parameters and reads ids_file. If discriminate_groups is yes or both, it adds a prefix with the scientific name to every sequence; if it is no, no prefix is added. In the end, it creates one main FASTA file of sequences, named after the main organism.
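
The prefixing rule can be sketched like this. It is a simplified illustration, not the code of unifier.py: the function names are hypothetical, the prefix format (scientific name, underscore, original header) is an assumption, and the sketch works on in-memory strings instead of files.

```python
def prefix_fasta_headers(fasta_text, prefix):
    """Add `prefix_` to every FASTA header line (lines starting with '>')."""
    lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            line = ">{}_{}".format(prefix, line[1:])
        lines.append(line)
    return "\n".join(lines) + "\n"


def unify(groups, discriminate_groups):
    """Concatenate {scientific_name: fasta_text} into one FASTA text.

    Headers are prefixed when discriminate_groups is 'yes' or 'both'.
    """
    add_prefix = discriminate_groups.strip().lower() in ("yes", "both")
    parts = []
    for name, fasta_text in groups.items():
        if add_prefix:
            fasta_text = prefix_fasta_headers(fasta_text, name)
        elif not fasta_text.endswith("\n"):
            fasta_text += "\n"
        parts.append(fasta_text)
    return "".join(parts)


if __name__ == "__main__":
    # Toy sequences, invented for this example.
    groups = {"Orthobunyavirus": ">seq1\nMKL\n", "Herbevirus": ">seq2\nMTT\n"}
    print(unify(groups, "both"), end="")
```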

In [Wendellor]:

Attribute Description
log Output log filename
ids_file organism.py output filepath
output_dir Output dir name
muscle_parameters Muscle optional parameters
tabajara_parameters Tabajara optional parameters
muscle_filepath Muscle executable filepath
tabajara_filepath Tabajara executable filepath
build_hmm If build_hmm = no, wendellor.py will not run muscle and tabajara; if it is yes or not defined, it will run both

So wendellor.py reads this section, gets the parameters, runs the other components and, at the end, runs muscle and feeds its result to tabajara.
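
The build_hmm rule (run muscle and tabajara unless the value is explicitly no) can be expressed with configparser's fallback argument. This is a sketch of the described behaviour, not the package's own code; should_build_hmm and planned_steps are hypothetical names, and treating the value case-insensitively is an assumption.

```python
import configparser


def should_build_hmm(parser):
    """Return False only when [Wendellor] build_hmm is explicitly 'no'."""
    value = parser.get("Wendellor", "build_hmm", fallback="yes")
    return value.strip().lower() != "no"


def planned_steps(parser):
    """List the pipeline stages that would run for this configuration."""
    steps = ["organism.py", "downloader.py", "unifier.py"]
    if should_build_hmm(parser):
        steps += ["muscle", "tabajara"]
    return steps


if __name__ == "__main__":
    parser = configparser.ConfigParser()
    parser.read_string("[Wendellor]\nlog = wendellor.log\nbuild_hmm = no\n")
    print(planned_steps(parser))
```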

Execute wendellor.py

Now that you have an idea about wendellor.py, you can run it:

wendellor.py example_both_discrimination.conf &

While wendellor.py is running, you can follow the log with:

tail -f wendellor.log

Looking at results

In the output_dir defined in [Wendellor], you'll have something like:

drwxrwxr-x  2 wendelhlc wendelhlc 4,0K Abr  3 18:03 fasta_sequences
drwxrwxr-x  2 wendelhlc wendelhlc 4,0K Abr  3 18:12 logs
drwxrwxr-x  2 wendelhlc wendelhlc 4,0K Abr  3 18:06 muscle_output
drwxrwxr-x  4 wendelhlc wendelhlc 4,0K Abr  3 18:10 tabajara_conserv
drwxrwxr-x  4 wendelhlc wendelhlc 4,0K Abr  3 18:12 tabajara_discrim

In the fasta_sequences dir, you'll find the FASTA sequences.
In the logs dir, you'll find the logs.
In the muscle_output dir, you'll find the muscle results.
In the tabajara* dirs, you'll find the tabajara results.