downloader.py
Creates an NCBI query and downloads FASTA sequences from NCBI databases (IPG or Protein).
Using the IDs file (produced by organism.py) and the given parameters, this script builds a query, runs the search against the NCBI database, and downloads the FASTA sequences.
Usage
You can run this script from the command line, either by passing arguments or by using a configuration file.
Command line
You can read the help message by running downloader.py -h:
usage: downloader.py [-h] [--database {ipg,protein}] [--log LOG]
[--organisms ORGANISMS] [--positive_terms POSITIVE_TERMS]
[--negative_terms NEGATIVE_TERMS]
[--min_length MIN_LENGTH] [--max_length MAX_LENGTH]
[--ids_file IDS_FILE] [--output_dir OUTPUT_DIR]
[--conf CONF]
Downloader - Download FASTA sequences from NCBI
optional arguments:
-h, --help show this help message and exit
--database {ipg,protein}
Database to be used
--log LOG Log output
--organisms ORGANISMS
Organism name or taxonomic ID separated by ';'
--positive_terms POSITIVE_TERMS
Positive terms to be used in NCBI query separated by
';'
--negative_terms NEGATIVE_TERMS
Negative terms to be used in NCBI query separated by
';'
--min_length MIN_LENGTH
Min sequence length
--max_length MAX_LENGTH
Max sequence length
--ids_file IDS_FILE IDs filepath, product from organism.py
--output_dir OUTPUT_DIR
Output dir
--conf CONF Configuration file
create a query based on positive-terms, negative-terms, min-length and max-
length; execute query in NCBI database; and download files in output
directory.
An example command:
downloader.py --database ipg --positive_terms "polymerase;L protein;RNA;RDRP" --negative_terms "nucleocapsid;nucleoprotein;N protein;M protein;glycoprotein;envelope" --min_length 2000 --max_length 10000 --ids_file ids_file.txt
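
The term options take several values in one string, separated by ';'. As a minimal sketch of how such a string could be turned into a list of terms (the helper name `split_terms` is hypothetical and is not taken from downloader.py):

```python
def split_terms(raw):
    """Split a ';'-separated string into a list of stripped, non-empty terms.

    Hypothetical helper for illustration; downloader.py may parse its
    arguments differently.
    """
    return [term.strip() for term in raw.split(";") if term.strip()]


positive = split_terms("polymerase;L protein;RNA;RDRP")
print(positive)  # ['polymerase', 'L protein', 'RNA', 'RDRP']
```

Quoting the whole value on the command line, as in the example above, keeps terms that contain spaces (such as "L protein") in a single argument.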
Config file
You can use a configuration file with this content:
[Downloader]
database = ipg
log = downloader.log
ids_file = ids.txt
positive_terms = polymerase;L protein;RNA;RDRP
negative_terms = nucleocapsid;nucleoprotein;N protein;M protein;glycoprotein;envelope
min_length = 2000
max_length = 10000
The first line sets the section name between "[]"; this script reads its parameters from the "Downloader" section. Every following line has the form parameter = value (until another section starts). The parameters are the same ones listed in the help message, without the leading "--".
You can run this example using:
downloader.py --conf example.conf
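
To make the section/key layout concrete, here is a small sketch that reads the example above with Python's standard `configparser` module. It assumes the file is plain INI; the function name `read_downloader_section` and the returned dictionary are illustrative only and may not match how downloader.py loads its configuration.

```python
import configparser

# Same content as the example configuration file above.
EXAMPLE_CONF = """\
[Downloader]
database = ipg
log = downloader.log
ids_file = ids.txt
positive_terms = polymerase;L protein;RNA;RDRP
negative_terms = nucleocapsid;nucleoprotein;N protein;M protein;glycoprotein;envelope
min_length = 2000
max_length = 10000
"""


def read_downloader_section(text):
    """Parse the [Downloader] section into a plain dict (illustrative only)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["Downloader"]
    return {
        "database": section.get("database"),
        "log": section.get("log"),
        "ids_file": section.get("ids_file"),
        # Term lists are stored as one ';'-separated string.
        "positive_terms": section.get("positive_terms", fallback="").split(";"),
        "negative_terms": section.get("negative_terms", fallback="").split(";"),
        # Lengths are converted from text to integers.
        "min_length": section.getint("min_length"),
        "max_length": section.getint("max_length"),
    }


settings = read_downloader_section(EXAMPLE_CONF)
print(settings["database"], settings["min_length"], settings["max_length"])
```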
Functions
This script has the following functions:
create_query()
Generates the NCBI query.
Parameters | Type | Description |
---|---|---|
organism | str | Taxon ID |
positive_terms | list | List of positive terms separated by ';' |
negative_terms | list | List of negative terms separated by ';' |
min_length | int | Min sequence length |
max_length | int | Max sequence length |
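
The exact query string that create_query() produces is not shown here. The sketch below shows one plausible way to combine these parameters into an Entrez search expression. It assumes that positive terms are joined with OR, that negative terms are excluded with NOT, and that the `[Organism]` and `[SLEN]` (sequence length) field tags of the Protein database are used. The function name `build_query_sketch` and the taxon ID `12345` are placeholders, not part of the script.

```python
def build_query_sketch(organism, positive_terms, negative_terms,
                       min_length, max_length):
    """Build an Entrez-style query string (assumed layout, for illustration)."""
    # Restrict the search to one taxon ID.
    query = f"txid{organism}[Organism]"

    # Require at least one of the positive terms.
    if positive_terms:
        joined = " OR ".join(f'"{term}"' for term in positive_terms)
        query += f" AND ({joined})"

    # Exclude records matching any of the negative terms.
    if negative_terms:
        joined = " OR ".join(f'"{term}"' for term in negative_terms)
        query += f" NOT ({joined})"

    # Keep only sequences within the requested length range.
    query += f" AND {min_length}:{max_length}[SLEN]"
    return query


example = build_query_sketch(
    "12345",  # placeholder taxon ID
    ["polymerase", "RDRP"],
    ["nucleocapsid"],
    2000,
    10000,
)
print(example)
# txid12345[Organism] AND ("polymerase" OR "RDRP") NOT ("nucleocapsid") AND 2000:10000[SLEN]
```

Entrez evaluates boolean operators from left to right, so the parentheses around each term group keep the OR lists together.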
download_sequences_from_query()
Downloads sequences from NCBI using the query.
Parameters | Type | Description |
---|---|---|
database | str | Database name (ipg, protein) |
query | str | Query generated from create_query() |
organism | str | Taxon ID |
url | str | URL to download sequences |
query_key | int | Query key used by the session; remember to increment it for each request |
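
As a rough illustration of how a query key is used, the sketch below builds an NCBI E-utilities EFetch URL that retrieves FASTA records from a search stored on the Entrez history server. It assumes the search was run beforehand with ESearch and `usehistory=y`, which returns a `WebEnv` string and a `query_key`. The function name `build_fetch_url`, the `web_env` argument, and the paging defaults are hypothetical and do not describe the actual signature of download_sequences_from_query(). The example uses the Protein database only.

```python
from urllib.parse import urlencode

# Public EFetch endpoint of the NCBI E-utilities.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_fetch_url(database, web_env, query_key,
                    retstart=0, retmax=500, url=EFETCH_URL):
    """Return an EFetch URL for one page of FASTA records (illustrative)."""
    params = {
        "db": database,          # e.g. "protein"
        "WebEnv": web_env,       # history session returned by ESearch
        "query_key": query_key,  # identifies one stored search in that session
        "rettype": "fasta",
        "retmode": "text",
        "retstart": retstart,    # index of the first record to fetch
        "retmax": retmax,        # number of records per request
    }
    return f"{url}?{urlencode(params)}"


# "EXAMPLE_WEBENV" is a placeholder, not a real session identifier.
fetch_url = build_fetch_url("protein", "EXAMPLE_WEBENV", 1)
print(fetch_url)
```

Each new search posted to the same WebEnv receives the next query key, which is why the key has to be tracked when several searches share one session.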