User guide

The Wendellor package builds an NCBI query, performs the search, downloads FASTA sequences and, finally, runs muscle and tabajara!

This guide covers:

Installing dependencies
Installing Wendellor
Upgrading Wendellor
Uninstalling Wendellor
Running an example

Installing dependencies

Wendellor is written in Python 3, so you will need to install a few system packages first:

sudo apt-get install git \
    python-dev \
    python-pip \
    python3-dev \
    python3-pip

Here you're installing:

Package	Description
git	Version control system for tracking changes in computer files and coordinating work on those files among multiple people. Here it is used to manage our repository.
python-dev	Header files, a static library and development tools for building Python modules, extending the Python 2.7 interpreter or embedding Python 2.7 in applications.
python-pip	pip is the Python 2.7 package installer. It integrates with virtualenv, doesn't do partial installs, can save package state for replaying, can install from non-egg sources, and can install from version control repositories.
python3-dev	Header files, a static library and development tools for building Python modules, extending the Python 3 interpreter or embedding Python 3 in applications.
python3-pip	pip is the Python 3 package installer. It integrates with virtualenv, doesn't do partial installs, can save package state for replaying, can install from non-egg sources, and can install from version control repositories.

There are two more dependencies: muscle and tabajara.

Installing

After installing the dependencies, you can install the Wendellor package with the following command:

sudo pip3 install Wendellor

This command, run with sudo permissions, executes pip3 and installs Wendellor for all users on the machine.
If you don't have permission to do that, or you don't want to share it with your friends:

pip3 install --user Wendellor

Upgrading

To upgrade Wendellor's package, execute:

pip3 install --upgrade Wendellor
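
To confirm which version ended up installed after an install or upgrade, a small check like the following may help. It is only a sketch using the standard library's importlib.metadata (Python 3.8+), and it assumes the distribution is registered under the same name used with pip3 above; installed_version is a hypothetical helper, not part of Wendellor.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Assumption: the distribution name matches the one passed to pip3.
print(installed_version("Wendellor"))
```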

Uninstalling

If you don't like this project anymore :(
Use this command to uninstall it:

pip3 uninstall Wendellor

If you installed it with sudo permissions, you'll have to uninstall it with sudo as well.

Running an example

Here we're going to execute wendellor.py, understand the anatomy of its configuration file and look at the results.

Cloning the repository

I haven't added the configuration files to be installed together with the package (I'll see how to do it one day), so for the moment you need to clone our GitHub repository with the following command:

git clone https://github.com/WendelHime/Wendellor.git

And change directory with:

cd Wendellor/conf_examples

Please ignore the other files that don't have the .conf suffix; they are used by developers during development.

Update: I've added the configuration files to the /usr/share/Wendellor/config directory, but they are only installed if you installed Wendellor for all users (with sudo).

Understanding configuration file and how this project works

This is the most boring, but fundamental, part of this project. We're going to use example_both_discrimination.conf. If you haven't cloned the Wendellor repository and don't want the other examples, create example_both_discrimination.conf with the following content:

[Taxonomic group]
taxon = 1980416
groups = all
output_file = ids.txt

[Downloader]
database = ipg
ids_file = ids.txt
positive_terms = polymerase;L protein;RNA;RDRP
negative_terms = nucleocapsid;nucleoprotein;N protein;M protein;glycoprotein;envelope
min_length = 2000
max_length = 10000

[Unifier]
ids_file = ids.txt
discriminate_groups = both

[Wendellor]
log = wendellor.log
ids_file = ids.txt
output_dir = output
muscle_parameters = 
tabajara_parameters = -t 0.5 -p 50 -w 15 -b 15
muscle_filepath = /usr/bin/muscle
tabajara_filepath = /home/liliane/ProjetoDr/tabajara/tabajara_v1.43.pl
build_hmm = yes

So now you have the configuration file, with its content separated into sections to be processed by the small applications inside the Wendellor package. This project is composed of four applications:

Application	Description
`organism.py`	Finds taxonomic IDs by querying the NCBI Taxonomy database
`downloader.py`	Downloads FASTA sequences from NCBI, using the result of organism.py
`unifier.py`	Unifies the files into one for the main organism and renames it
`wendellor.py`	Main script; runs the other scripts using the configuration file

Considering the description table above, you can see how this project works:

wendellor.py runs as a pipeline: organism.py -> downloader.py -> unifier.py -> muscle -> tabajara

With this pipeline design in mind, each section of a configuration file such as example_both_discrimination.conf is consumed by one component of the Wendellor package.

I haven't explained what a section is yet: a section is something like [Taxonomic group] or [Downloader]; it groups the parameters that follow it until another section appears.
In the configuration file, we've defined:

Section	Application
Taxonomic group	`organism.py`
Downloader	`downloader.py`
Unifier	`unifier.py`
Wendellor	`wendellor.py`
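
The table above can be written down as a small lookup, which is handy for checking that a configuration file has every section before a run. This is only an illustrative sketch; SECTION_TO_APP and missing_sections are hypothetical names, not part of the package.

```python
# Section name -> component that reads it, in the order of the table above.
SECTION_TO_APP = {
    "Taxonomic group": "organism.py",
    "Downloader": "downloader.py",
    "Unifier": "unifier.py",
    "Wendellor": "wendellor.py",
}


def missing_sections(present_sections):
    """Return the expected section names that are absent from a config file."""
    return [name for name in SECTION_TO_APP if name not in present_sections]


# Example: a file that only defines the first two sections.
for name in missing_sections(["Taxonomic group", "Downloader"]):
    print(f"missing [{name}] (read by {SECTION_TO_APP[name]})")
```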

Now you understand that the sections are separated so that each one is used by one of the small components of the Wendellor package. Unfortunately, I still have to explain the parameters (this section never ends).

Starting by [Taxonomic group]:

Attribute	Description
taxon	Organism name or taxonomic ID
groups	Descendant groups of the taxon parameter to consider; should be `all` or specific group names or taxonomic IDs, separated by `;`
output_file	Output filepath

So organism.py reads this section, gets its parameters, finds the taxonomic ID by querying the NCBI Taxonomy database and collects the taxon IDs of the descendant groups of the given taxon that should be considered. For example:

Peribunyaviridae
|
|- Orthobunyavirus
|- Herbevirus

It will add the taxonomic IDs of Orthobunyavirus and Herbevirus to output_file.
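
How the groups parameter narrows down the descendants can be sketched as follows. This is an assumption about the behaviour based on the table above, not organism.py's actual code: it works on group names held in memory instead of querying NCBI, and parse_groups and select_descendants are hypothetical helpers.

```python
def parse_groups(groups_value):
    """Return None for `all`, otherwise the list of requested names or IDs."""
    value = groups_value.strip()
    if value.lower() == "all":
        return None
    return [item.strip() for item in value.split(";") if item.strip()]


def select_descendants(descendants, groups_value):
    """Keep every descendant for `all`, or only the ones listed in groups."""
    wanted = parse_groups(groups_value)
    if wanted is None:
        return list(descendants)
    return [name for name in descendants if name in wanted]


# Descendants of Peribunyaviridae from the tree above.
children = ["Orthobunyavirus", "Herbevirus"]
print(select_descendants(children, "all"))
print(select_descendants(children, "Herbevirus"))
```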

In [Downloader]:

Attribute	Description
database	NCBI database to search; should be `protein` or `ipg`
ids_file	`organism.py` output filepath
positive_terms	Terms the NCBI query should match, separated by `;`
negative_terms	Terms the NCBI query should exclude, separated by `;`
min_length	Minimum sequence length
max_length	Maximum sequence length

So downloader.py reads this section, gets its parameters, builds a query from positive_terms, negative_terms, min_length and max_length, searches the NCBI database given by database and downloads the results as FASTA files.
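
To give an idea of what such a query can look like, here is a rough sketch that assembles an Entrez search string from these parameters. It is not the query downloader.py actually builds: combining the positive terms with OR, quoting each term, and the [Organism:exp] and [SLEN] field tags (standard Entrez tags for the protein database, not checked here against ipg) are all assumptions, and build_query is a hypothetical helper.

```python
def split_terms(value):
    """Split a `;`-separated option into a list of non-empty terms."""
    return [term.strip() for term in value.split(";") if term.strip()]


def build_query(taxon_id, positive_terms, negative_terms, min_length, max_length):
    """Assemble an Entrez-style query string (assumed layout, see lead-in)."""
    positives = split_terms(positive_terms)
    negatives = split_terms(negative_terms)

    # Restrict to the taxon and everything below it.
    parts = [f"txid{taxon_id}[Organism:exp]"]
    if positives:
        parts.append("(" + " OR ".join(f'"{t}"' for t in positives) + ")")
    # Sequence length range.
    parts.append(f"{min_length}:{max_length}[SLEN]")

    query = " AND ".join(parts)
    if negatives:
        query += " NOT (" + " OR ".join(f'"{t}"' for t in negatives) + ")"
    return query


print(build_query("1980416", "polymerase;L protein", "nucleocapsid", 2000, 10000))
```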

In [Unifier]:

Attribute	Description
ids_file	`organism.py` output filepath
discriminate_groups	Whether to discriminate groups by adding a prefix; should be `yes`, `no` or `both`

So unifier.py reads this section, gets its parameters and reads ids_file. If discriminate_groups is yes or both, it adds a prefix with the scientific name to every sequence; if no, it adds no prefix. Finally, it creates a main FASTA sequence file named after the main organism.
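
The prefixing step can be pictured with the sketch below. It is only an illustration of the idea, not unifier.py's code: the `|` separator, the replacement of spaces with underscores and the helper names should_prefix and prefix_headers are assumptions.

```python
def should_prefix(discriminate_groups):
    """Prefixes are added when the option is `yes` or `both`."""
    return discriminate_groups.strip().lower() in ("yes", "both")


def prefix_headers(fasta_text, scientific_name):
    """Prepend the scientific name to every FASTA header line."""
    prefix = scientific_name.replace(" ", "_")
    lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Assumed header layout: >ScientificName|original header
            lines.append(f">{prefix}|{line[1:]}")
        else:
            lines.append(line)
    return "\n".join(lines) + "\n"


# Made-up record, for illustration only.
record = ">seq1 example protein\nMKVLA\n"
if should_prefix("both"):
    print(prefix_headers(record, "Orthobunyavirus"), end="")
```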

In [Wendellor]:

Attribute	Description
log	Output log filename
ids_file	`organism.py` output filepath
output_dir	Output dir name
muscle_parameters	Muscle optional parameters
tabajara_parameters	Tabajara optional parameters
muscle_filepath	Muscle executable filepath
tabajara_filepath	Tabajara executable filepath
build_hmm	If `build_hmm` is `no`, Wendellor will not run muscle and tabajara; if it is `yes` or not defined, it will run both

So wendellor.py reads this section, gets its parameters, runs the other components and, at the end, runs muscle and feeds its result to tabajara.
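
The build_hmm rule and the optional parameter strings can be handled roughly as in the sketch below. It is not wendellor.py's code: it only shows the decision and the splitting of the extra arguments with shlex, without launching muscle or tabajara, and should_build_hmm and extra_args are hypothetical helpers.

```python
import configparser
import shlex

# Only the options this sketch needs, copied from the example file.
CONF = """
[Wendellor]
muscle_parameters =
tabajara_parameters = -t 0.5 -p 50 -w 15 -b 15
build_hmm = yes
"""


def should_build_hmm(section):
    """Run muscle and tabajara unless build_hmm is explicitly `no`."""
    return section.get("build_hmm", fallback="yes").strip().lower() != "no"


def extra_args(section, key):
    """Split an optional parameter string into a list of arguments."""
    return shlex.split(section.get(key, fallback=""))


parser = configparser.ConfigParser()
parser.read_string(CONF)
wendellor = parser["Wendellor"]

if should_build_hmm(wendellor):
    print("muscle extra args:", extra_args(wendellor, "muscle_parameters"))
    print("tabajara extra args:", extra_args(wendellor, "tabajara_parameters"))
```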

Execute wendellor.py

Now that you have an idea about wendellor.py, you can run it:

wendellor.py example_both_discrimination.conf &

While wendellor.py is running, you can follow the log with:

tail -f wendellor.log

Looking at results

In output_dir defined in [Wendellor], you'll have something like:

drwxrwxr-x  2 wendelhlc wendelhlc 4,0K Abr  3 18:03 fasta_sequences
drwxrwxr-x  2 wendelhlc wendelhlc 4,0K Abr  3 18:12 logs
drwxrwxr-x  2 wendelhlc wendelhlc 4,0K Abr  3 18:06 muscle_output
drwxrwxr-x  4 wendelhlc wendelhlc 4,0K Abr  3 18:10 tabajara_conserv
drwxrwxr-x  4 wendelhlc wendelhlc 4,0K Abr  3 18:12 tabajara_discrim

In the fasta_sequences dir, you'll find the FASTA sequences.
In the logs dir, you'll find the logs.
In the muscle_output dir, you'll find the muscle results.
In the tabajara* dirs, you'll find the tabajara results.
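
A quick way to check that a run produced the directories listed above is sketched here. It is an illustrative helper, not part of Wendellor; summarize_output is a hypothetical name, and the demo builds a throwaway directory instead of reading a real output_dir.

```python
import tempfile
from pathlib import Path

# Fixed sub-directories shown in the listing above.
EXPECTED = ["fasta_sequences", "logs", "muscle_output"]


def summarize_output(output_dir):
    """Report which expected dirs exist and which tabajara* dirs were created."""
    root = Path(output_dir)
    found = {name: (root / name).is_dir() for name in EXPECTED}
    tabajara = sorted(p.name for p in root.glob("tabajara*") if p.is_dir())
    return found, tabajara


# Demo on a temporary directory that mimics a partial run.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["fasta_sequences", "logs", "tabajara_conserv"]:
        (Path(tmp) / name).mkdir()
    found, tabajara = summarize_output(tmp)

print(found)
print(tabajara)
```

For a real run, pass the output_dir value from the [Wendellor] section instead of the temporary directory.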