downloader.py

Builds an NCBI query and downloads FASTA sequences from an NCBI database (IPG or Protein). Using the IDs file (produced by organism.py) and the given parameters, this script builds a query, runs the search against the NCBI database, and downloads the matching FASTA sequences.

Usage

You can run this script from the command line, either passing arguments directly or using a configuration file.

Command line

You can read the help message by running downloader.py -h:

usage: downloader.py [-h] [--database {ipg,protein}] [--log LOG]
                     [--organisms ORGANISMS] [--positive_terms POSITIVE_TERMS]
                     [--negative_terms NEGATIVE_TERMS]
                     [--min_length MIN_LENGTH] [--max_length MAX_LENGTH]
                     [--ids_file IDS_FILE] [--output_dir OUTPUT_DIR]
                     [--conf CONF]

Downloader - Download FASTA sequences from NCBI

optional arguments:
  -h, --help            show this help message and exit
  --database {ipg,protein}
                        Database to be used
  --log LOG             Log output
  --organisms ORGANISMS
                        Organism name or taxonomic ID separated by ';'
  --positive_terms POSITIVE_TERMS
                        Positive terms to be used in NCBI query separated by
                        ';'
  --negative_terms NEGATIVE_TERMS
                        Negative terms to be used in NCBI query separated by
                        ';'
  --min_length MIN_LENGTH
                        Min sequence length
  --max_length MAX_LENGTH
                        Max sequence length
  --ids_file IDS_FILE   IDs filepath, product from organism.py
  --output_dir OUTPUT_DIR
                        Output dir
  --conf CONF           Configuration file

create a query based on positive-terms, negative-terms, min-length and max-
length; execute query in NCBI database; and download files in output
directory.

An example command:

downloader.py --database ipg --positive_terms "polymerase;L protein;RNA;RDRP" --negative_terms "nucleocapsid;nucleoprotein;N protein;M protein;glycoprotein;envelope" --min_length 2000 --max_length 10000 --ids_file ids_file.txt

Config file

You can use a configuration file with this content:

[Downloader]
database = ipg
log = downloader.log
ids_file = ids.txt
positive_terms = polymerase;L protein;RNA;RDRP
negative_terms = nucleocapsid;nucleoprotein;N protein;M protein;glycoprotein;envelope
min_length = 2000
max_length = 10000

The first line sets the section name between "[]"; this script reads its parameters from the "Downloader" section. Every following line, until another section begins, has the form parameter = value. The parameters are the same as in the help message, without the leading "--".

You can run this example using:

downloader.py --conf example.conf
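
As a rough illustration of how a file like this can be read, the sketch below parses the example configuration with Python's standard configparser module and splits the ';'-separated terms into lists. The helper names (split_terms, read_downloader_section) are hypothetical and are not taken from downloader.py; the script's own parsing may differ.

```python
import configparser

# Same content as the example configuration file above.
EXAMPLE_CONF = """\
[Downloader]
database = ipg
log = downloader.log
ids_file = ids.txt
positive_terms = polymerase;L protein;RNA;RDRP
negative_terms = nucleocapsid;nucleoprotein;N protein;M protein;glycoprotein;envelope
min_length = 2000
max_length = 10000
"""


def split_terms(value):
    """Split a ';'-separated string into a list, dropping empty items."""
    return [term.strip() for term in value.split(";") if term.strip()]


def read_downloader_section(text):
    """Hypothetical helper: parse the [Downloader] section into a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["Downloader"]
    return {
        "database": section.get("database"),
        "log": section.get("log"),
        "ids_file": section.get("ids_file"),
        "positive_terms": split_terms(section.get("positive_terms", "")),
        "negative_terms": split_terms(section.get("negative_terms", "")),
        "min_length": section.getint("min_length"),
        "max_length": section.getint("max_length"),
    }


if __name__ == "__main__":
    for key, value in read_downloader_section(EXAMPLE_CONF).items():
        print(f"{key} = {value!r}")
```

To read a file from disk instead of a string, parser.read("example.conf") can replace the read_string call.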

Functions

This script has the following functions:

create_query()

Function used to generate the NCBI query.

Parameter       Type  Description
organism        str   Taxon ID
positive_terms  list  List of positive terms separated by ';'
negative_terms  list  List of negative terms separated by ';'
min_length      int   Min sequence length
max_length      int   Max sequence length
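
To make the role of these parameters concrete, here is a minimal sketch of how such a query could be assembled. It is not the actual create_query() implementation. It assumes standard Entrez search syntax (the txid…[Organism] and [SLEN] field tags), assumes positive terms are combined with OR, and assumes each negative term is excluded with NOT; the function name create_query_sketch and the taxon ID in the usage lines are placeholders chosen for illustration.

```python
def create_query_sketch(organism, positive_terms, negative_terms,
                        min_length, max_length):
    """Hypothetical query builder; the real create_query() may differ."""
    # Restrict the search to one taxon ID (assumed Entrez syntax).
    parts = [f"txid{organism}[Organism]"]

    # Assumption: a record matches if it contains ANY positive term.
    if positive_terms:
        joined = " OR ".join(f'"{term}"' for term in positive_terms)
        parts.append(f"({joined})")

    # Sequence length range, using the Entrez SLEN field tag.
    parts.append(f"{min_length}:{max_length}[SLEN]")

    query = " AND ".join(parts)

    # Assumption: every negative term is excluded with NOT.
    for term in negative_terms:
        query += f' NOT "{term}"'
    return query


if __name__ == "__main__":
    # Placeholder taxon ID and terms, for illustration only.
    print(create_query_sketch(
        organism="11157",
        positive_terms=["polymerase", "RDRP"],
        negative_terms=["nucleocapsid"],
        min_length=2000,
        max_length=10000,
    ))
```

Whether the [SLEN] filter is honoured by the ipg database is not confirmed here; it is a documented field for the protein database.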

download_sequences_from_query()

Function used to download sequences from NCBI using the query.

Parameter  Type  Description
database   str   Database name (ipg, protein)
query      str   Query generated from create_query()
organism   str   Taxon ID
url        str   URL to download sequences
query_key  int   Query key used by the session; remember to increment it for each request
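
The query_key parameter suggests that the script uses the NCBI E-utilities history server, although that is not confirmed here. Under that assumption, the sketch below shows how a download URL for a stored search could be built. The helper build_fetch_url and its web_env argument are hypothetical and do not appear in the parameter list above; the efetch endpoint and its db, query_key, WebEnv, rettype and retmode parameters are part of the public E-utilities interface. The code only builds the URL and makes no network request.

```python
from urllib.parse import urlencode

# Public E-utilities endpoint for fetching records.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_fetch_url(database, query_key, web_env, url=EFETCH_URL):
    """Hypothetical helper: build an efetch URL for a stored search.

    query_key identifies one search inside the history session named by
    web_env, which is why it changes from one request to the next.
    """
    params = {
        "db": database,
        "query_key": query_key,
        "WebEnv": web_env,
        "rettype": "fasta",  # ask for FASTA records
        "retmode": "text",   # as plain text
    }
    return f"{url}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder WebEnv value; a real one is returned by an earlier search.
    print(build_fetch_url("protein", 1, "WEBENV_PLACEHOLDER"))
```

The example uses the protein database, where FASTA output is a documented return type; the return types available for ipg should be checked against the E-utilities documentation.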