Ashish Jain edited this page Mar 7, 2016 · 7 revisions

Wiki For GBEER Pipeline:

Download and filter bacterial genomes:

  1. We download all available bacterial genomes from the NCBI website. The FTP link is Bacterial Genome Download.
  2. After downloading the genomes, we filter out the bacterial genomes that have more than one chromosome file. For the time being, we do not consider those genomes in our analysis. In the same step, we remove the plasmid files from our genome data. For this step we use the “filtering_genomes_with_multiple_chr_file.py” Python script, which takes the full path of the genomic data folder as its argument. The sample command is “./filtering_genomes_with_multiple_chr_file.py /home/jain/genomes_folder”
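
The filtering rule in step 2 can be sketched as follows. This is only an illustration of the logic, not the code of “filtering_genomes_with_multiple_chr_file.py”: the input layout (a dictionary of genome names mapped to file names and descriptions) and the rule of recognising a plasmid by the word "plasmid" in its description are assumptions made for the example.

```python
def filter_genomes(genome_files):
    """Keep genomes that have exactly one chromosome file, ignoring plasmids.

    genome_files: dict mapping a genome name to a list of
    (file_name, description) tuples (hypothetical layout).
    """
    kept = {}
    for genome, files in genome_files.items():
        # Assumption: a file is a plasmid if its description says so.
        chromosomes = [name for name, desc in files
                       if "plasmid" not in desc.lower()]
        # Genomes with several chromosome files are left out of the analysis.
        if len(chromosomes) == 1:
            kept[genome] = chromosomes[0]
    return kept


# Hypothetical example data.
example = {
    "genome_A": [("a_chr.gbk", "complete genome"),
                 ("a_p1.gbk", "plasmid pA1, complete sequence")],
    "genome_B": [("b_chr1.gbk", "chromosome 1"),
                 ("b_chr2.gbk", "chromosome 2")],
}
filtered = filter_genomes(example)
print(filtered)
```

Here genome_A is kept with its single chromosome file, while genome_B is dropped because it has two chromosome files.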

Download and map bacterial families from PATRIC to local database:

  1. We download the list of bacterial species in a family, together with their accession numbers, from the PATRIC database. The downloaded data is in the tab-delimited file “GenomeFinder.txt”. Steps to download the “GenomeFinder.txt” file:

    i. Go to the home page of the PATRIC database (PATRIC Database Home) and select the bacterial family you need to study.
    ii. Click the Genome List tab.
    iii. Click the Download button and download the genome list as a .txt file.

  2. After that, we select the species that have complete genome sequences by checking their GenBank accession numbers, and we map those genomes to our local filtered genome databank using the accession numbers. Strains are also filtered out in this step, and the filtered genome databank of the given family is stored. For this step, we use the “list_of_species_from_patric.py” Python script. This script takes three arguments: the full path of the GenomeFinder.txt file, the bacterial genomes folder, and the output folder for storing the filtered family genomes. The sample command is “./list_of_species_from_patric.py /home/jain/Gram_Positive_Bacteria_Study/Firmicutes/GenomeFinder.txt /home/jain/genomes_folder/ /home/jain/family_genome_folder/”
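
A minimal sketch of the mapping idea in step 2 is given below. It is not the code of “list_of_species_from_patric.py”. The column names ("Genome Name", "GenBank Accessions"), the placeholder accessions, and the rule that a genome counts as complete when its accession field is non-empty are all assumptions for illustration; check the header of your own GenomeFinder.txt before adapting it.

```python
import csv
import io


def complete_genomes(handle, name_col="Genome Name",
                     acc_col="GenBank Accessions"):
    """Return (genome name, accession) pairs for rows that carry an accession.

    The column names are hypothetical defaults.
    """
    genomes = []
    for row in csv.DictReader(handle, delimiter="\t"):
        accession = (row.get(acc_col) or "").strip()
        if accession:  # assumption: empty field means no complete sequence
            genomes.append((row[name_col], accession))
    return genomes


def map_to_local(genomes, local_accessions):
    """Keep only the genomes whose accession exists in the local databank."""
    local = set(local_accessions)
    return [(name, acc) for name, acc in genomes if acc in local]


# Placeholder data; the accessions are not real identifiers.
sample = (
    "Genome Name\tGenBank Accessions\n"
    "Species one\tACC_0001\n"
    "Species two (draft)\t\n"
    "Species three\tACC_0002\n"
)
mapped = map_to_local(complete_genomes(io.StringIO(sample)),
                      ["ACC_0001", "ACC_0009"])
print(mapped)
```

Only "Species one" survives: "Species two (draft)" has no accession, and the accession of "Species three" is not in the local list.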

Download and filter operon dataset:

  1. For our analysis, we take the operon dataset of the reference species from ProOpDB. For Firmicutes, we downloaded the operon dataset for B. subtilis from the database in tab-delimited format. The web link for the database is ProOpDB.
  2. After that, we keep only the operons that have at least five protein-coding genes. For this, we use the “ProOpDB_Operons_Parser.py” Python script. The sample command is “./ProOpDB_Operons_Parser.py -i /home/jain/family_genome_folder -o /home/jain/ -d /home/jain/Downloads/B.Subtilis_Operons_ProOpDB.txt”
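
The size filter in step 2 can be sketched as below. This is not the code of “ProOpDB_Operons_Parser.py”. It assumes a tab-delimited table with one gene per row and hypothetical column names ("Operon", "Gene", "Type"), where protein-coding genes are marked "CDS"; the real ProOpDB export may be laid out differently.

```python
import csv
import io
from collections import defaultdict


def operons_with_min_genes(handle, min_genes=5, id_col="Operon",
                           gene_col="Gene", type_col="Type"):
    """Group genes by operon and keep operons with enough protein-coding genes."""
    genes = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        # Assumption: protein-coding genes carry the type "CDS".
        if row[type_col] == "CDS":
            genes[row[id_col]].append(row[gene_col])
    return {op: g for op, g in genes.items() if len(g) >= min_genes}


# Hypothetical table: op1 has five coding genes plus one tRNA, op2 has three.
rows = ["Operon\tGene\tType"]
rows += [f"op1\tg{i}\tCDS" for i in range(1, 6)]
rows += ["op1\tt1\ttRNA"]
rows += [f"op2\th{i}\tCDS" for i in range(1, 4)]
sample = "\n".join(rows) + "\n"

selected = operons_with_min_genes(io.StringIO(sample))
print(selected)
```

Only op1 is kept, and its tRNA gene does not count towards the minimum of five.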

Select bacterial species for analysis using Phylogenetic Distance Analysis (PDA) tool:

  1. In this step, we create the phylogenetic tree of the bacterial family using the create_newick_tree.py script of GBEER. The sample command is “./create_newick_tree.py -G /home/jain/family_genome_folder -o /home/jain/test/ -q”. This produces the phylogenetic tree (out_tree.nwk) of the whole family, which is used in the next step.
  2. After that, we upload the Newick file to the web PDA tool (PDA Tool). On that page, we enter the subset size and check the “Rooted using newick input file (for tree only)” option. After running the tool, we get the names of the selected species. We copy those names into a text file, which is used to create the filter file for GBEER.
  3. We use the species name file created above to create the filter file with the “get_filter_file_from_Species_Names.py” Python script. The script takes the species name file, the family genome folder, and the output folder as its arguments. The sample command is “./get_filter_file_from_Species_Names.py /home/jain/test/filter_PDA_names.txt /home/jain/family_genome_folder/ /home/jain/test/”. If the filter file doesn’t contain the accession number of the reference genome, then we manually add it to the filter file.
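
The manual check at the end of step 3 could also be scripted. The sketch below assumes that the filter file holds one accession number per line, which is an assumption about its format, and it uses placeholder accessions. It is not part of GBEER.

```python
def ensure_reference(filter_lines, reference_accession):
    """Return the accession list, adding the reference genome if it is missing.

    filter_lines: lines read from the filter file (assumed one accession
    per line). reference_accession: accession of the reference genome.
    """
    accessions = [line.strip() for line in filter_lines if line.strip()]
    if reference_accession not in accessions:
        accessions.append(reference_accession)
    return accessions


# Placeholder accessions, not real identifiers.
lines = ["ACC_A\n", "ACC_B\n", "\n"]
updated = ensure_reference(lines, "ACC_REF")
print(updated)
```

Running the function a second time on its own output leaves the list unchanged, so it is safe to apply to a filter file that already contains the reference genome.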

Run the GBEER tool:

We run the GBEER tool by using the following command: “./gbeer_standalone.py -i /home/jain/gene_block_names_and_genes.txt -G /home/jain/family_genome_folder/ -f /home/jain/test/filter.txt -o /home/jain/test/Run -q”

Calculate Conservation Score:

  1. To calculate the conservation score, we use the “conservation_score_calculation.py” Python script. This script takes the output folder path of GBEER and a threshold as its arguments. The threshold is the minimum number of species in which an operon must be present to be considered for study; it is used to filter operons with low coverage out of the output. The sample command is “./conservation_score_calculation.py /home/jain/Proteobacteria/Run/ 30”. This gives us the conservedOperonsSorted.txt file, which contains the conservation scores of the operons in tab-delimited format.
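
The role of the threshold can be illustrated with a short sketch. It shows only the coverage filter, not how “conservation_score_calculation.py” computes the score itself. The input structure (operon name mapped to the set of species in which it was found) and the names used are hypothetical.

```python
def operons_above_threshold(presence, threshold):
    """Return {operon: species count} for operons found in enough species.

    presence: dict mapping an operon name to the set of species in which
    it is present (hypothetical structure).
    """
    return {operon: len(species)
            for operon, species in presence.items()
            if len(species) >= threshold}


# Hypothetical presence data.
presence = {
    "operon_1": {"sp_a", "sp_b", "sp_c"},
    "operon_2": {"sp_a"},
}
covered = operons_above_threshold(presence, threshold=2)
print(covered)
```

With a threshold of 2, operon_2 is dropped because it is present in only one species.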

Note: Currently, GBEER only supports operons with at most 14 genes. Please remove operons that have more than 14 genes from the operon list.
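
One way to apply this limit before running GBEER is sketched below. It assumes that each line of the operon list is tab-delimited, with the operon name first and one gene per following field; this layout is an assumption, so adjust the parsing to your actual file.

```python
MAX_GENES = 14  # limit stated in the note above


def split_by_gene_limit(lines, max_genes=MAX_GENES):
    """Split operon names into (kept, removed) according to their gene count."""
    kept, removed = [], []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank lines or lines without genes
        gene_count = len(fields) - 1  # first field is the operon name
        (kept if gene_count <= max_genes else removed).append(fields[0])
    return kept, removed


# Hypothetical operon list: one operon with 14 genes and one with 15.
lines = [
    "small_operon\t" + "\t".join(f"g{i}" for i in range(14)) + "\n",
    "large_operon\t" + "\t".join(f"g{i}" for i in range(15)) + "\n",
    "\n",
]
kept, removed = split_by_gene_limit(lines)
print(kept, removed)
```

The operon with exactly 14 genes is kept, and the one with 15 genes is reported as removed.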
