wangqion committed Mar 28, 2015
1 parent 4db7c96 commit cc758b4
Showing 1 changed file with 19 additions and 17 deletions.
36 changes: 19 additions & 17 deletions README
@@ -1,6 +1,6 @@
FrameBot
-version 1.0
-Last Update Date: 08/27/2013
+version 1.2
+Last Update Date: 03/28/2015
Contact: Qiong Wang at [email protected], RDP Staff at [email protected]

-----------------------
@@ -59,29 +59,31 @@ One protein reference fasta file or index file, and one DNA query fasta file are
-x,--scoring-matrix <arg> the protein scoring matrix for no-metric-search ONLY. Default is Blosum62
-z,--de-novo Enable de novo mode to add abundant query seqs to refset

[Note]
FrameBot uses a kmer pre-filtering heuristic for no-metric-search. This pre-filtering may increase the speed by one to two orders of magnitude.
Using this heuristic may cause FrameBot to return a different equally high-scoring (or occasionally almost as high) reference sequence.
Use option "-P" to disable the pre-filtering if necessary. To use metric-search, see step 3.

The option "Add de novo References" may help with genes that have high diversity or that lack closely related reference sequences in the reference set (such as biphenyl dioxygenase). The idea of de novo mode came from Dr. Ondrej Uhlík's group at the Institute of Chemical Technology Prague. It is based on the assumption that abundant sequences are more likely to be correct. The experimental sequences are first dereplicated and sorted by abundance in descending order. Each query is then tested against the reference set. If a query doesn't have a close reference with above 70% aa identity, the corresponding protein sequence of the query is added to the reference set if the following criteria are met:

The query satisfies the length cutoff and the identity cutoff.
The abundance is above a certain cutoff (default is 10).
No frameshifts or stop codons are present.

The framebot step produces six output files:
_framebot.txt - the alignment to the nearest match satisfying the minimum length and protein identity cutoff.
_nucl_corr.fasta and all_seqs_derep_prot_corr.fasta - the frameshift-corrected nucleotide and protein sequences satisfying the minimum length and protein identity cutoff.
_failed_framebot.txt - the alignment to the nearest match that failed the minimum length and protein identity cutoff.
_nucl_failed.fasta - fasta file containing the nucleotide sequences that failed the minimum length and protein identity cutoff.


[Example command from a terminal]
java -jar /PATH/dist/FrameBot.jar framebot -N -o nifH_test example/nifH_test_refseq_prot.fa example/nifH_test_query.fa

java -jar /PATH/dist/FrameBot.jar framebot -o nifH_test nifH_test.index example/nifH_test_query.fa

[Note]
FrameBot uses a kmer pre-filtering heuristic for no-metric-search. This pre-filtering may increase the speed by one to two orders of magnitude. Using this heuristic may cause FrameBot to return a different equally high-scoring (or occasionally almost as high) reference sequence. Pre-filtering is the default setting. Use option "-P" to disable the pre-filtering if necessary.
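
As a rough illustration of what a kmer pre-filter does in general, the Python sketch below ranks references by the number of kmers they share with a query and keeps only the top candidates for full alignment. This is NOT FrameBot's actual implementation: the function names, the kmer size, the number of references kept, and the assumption that the query is already translated to protein are all hypothetical choices made for the example.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def prefilter(query_prot, references, k=3, keep=2):
    """Return the names of the `keep` references sharing the most kmers.

    references: dict mapping reference name -> protein sequence.
    Ties are broken by name so the result is deterministic.
    (Hypothetical helper; not part of FrameBot.)
    """
    query_kmers = kmers(query_prot, k)
    shared = {name: len(query_kmers & kmers(seq, k))
              for name, seq in references.items()}
    ranked = sorted(shared, key=lambda name: (-shared[name], name))
    return ranked[:keep]


# Toy reference set: refA and refB are near-identical, refC is unrelated.
refs = {
    "refA": "MKTAYIAKQR",
    "refB": "MKTAYIAKQL",
    "refC": "GGGGSGGGGS",
}
candidates = prefilter("MKTAYIAKQR", refs)
print(candidates)
```

Only the surviving candidates would then go through the expensive alignment. Because references are dropped on shared-kmer counts rather than alignment scores, a filter of this kind can occasionally discard the reference that a full search would have picked, which is consistent with the caveat above.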

The option "Add de novo References" may help with genes that have high diversity or that lack closely related reference sequences in the reference set (such as biphenyl dioxygenase). The de novo strategy came from Dr. Ondrej Uhlík's group at the Institute of Chemical Technology Prague. It is based on the assumption that abundant sequences are more likely to be correct. The experimental sequences are first dereplicated and sorted by abundance in descending order. Each query is then tested against the reference set. If a query doesn't have a close reference with above 70% aa identity, the corresponding protein sequence of the query is added to the reference set if the following criteria are met:
The query satisfies the length cutoff and the identity cutoff.
The abundance is above a certain cutoff (default is 10).
No frameshifts or stop codons are present.
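
The de novo decision rule described above can be sketched in Python as follows. This is a minimal illustration, not FrameBot's code: the Hit record, the de_novo function, and the best_hit callback (standing in for the real frameshift-aware alignment) are hypothetical names, the length and identity cutoffs are left as required arguments, and treating the abundance cutoff as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Result of aligning one query to its nearest reference (hypothetical)."""
    identity: float    # aa identity to the nearest reference, 0..1
    length: int        # protein length of the query
    frameshifts: int   # number of frameshifts in the alignment
    has_stop: bool     # True if a stop codon is present
    protein: str       # protein sequence of the query


def de_novo(queries, references, best_hit, *, min_length, min_identity,
            close_identity=0.7, min_abundance=10):
    """Grow the reference set with abundant, clean, novel queries.

    queries: iterable of (name, abundance, sequence), already dereplicated.
    best_hit: callable(sequence, refs) -> Hit; stands in for the aligner.
    Returns (updated reference dict, list of added query names).
    """
    refs = dict(references)
    added = []
    # Most abundant first, as described in the text.
    for name, abundance, seq in sorted(queries, key=lambda q: -q[1]):
        hit = best_hit(seq, refs)
        if hit.identity > close_identity:
            continue  # a close reference already exists
        if (hit.length >= min_length
                and hit.identity >= min_identity
                and abundance >= min_abundance  # inclusive bound is an assumption
                and hit.frameshifts == 0
                and not hit.has_stop):
            refs[name] = hit.protein
            added.append(name)
    return refs, added


def toy_best_hit(protein, refs):
    """Toy stand-in for alignment: best position-wise identity, no frameshifts."""
    def ident(a, b):
        return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))
    best = max(ident(protein, r) for r in refs.values())
    return Hit(best, len(protein), 0, "*" in protein, protein)


queries = [
    ("q1", 50, "MKTAWWWWWW"),  # novel, abundant, clean: added
    ("q2", 20, "MKTAWWWWWH"),  # 90% identical to q1 once q1 is a reference
    ("q3", 3,  "MKTGGGGGGG"),  # novel but below the abundance cutoff
    ("q4", 15, "MKT*GGGGGH"),  # novel but contains a stop codon
]
refs, added = de_novo(queries, {"ref1": "MKTAYIAKQR"}, toy_best_hit,
                      min_length=8, min_identity=0.25)
print(added)
```

In this toy run q2 is skipped only because q1 was added to the reference set before q2 was tested, which is why processing the queries in descending order of abundance matters.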

[To run FrameBot using the de novo strategy, use the following commands from a terminal]
java -jar /PATH/Clustering.jar derep --sorted -o all_seqs_derep.fasta all_seqs.ids all_seqs.samples query_nucl.fasta
java -jar /PATH/FrameBot.jar framebot -N -o bpha --de-novo bpha_ref.fasta all_seqs_derep.fasta
mkdir filtered_mapping
java -jar /PATH/Clustering.jar refresh-mappings bpha_corr_prot.fasta all_seqs.ids all_seqs.samples filtered_mapping/filtered_ids.txt filtered_mapping/filtered_samples.txt
mkdir filtered_sequences
java -jar /PATH/Clustering.jar explode-mappings -w -o filtered_sequences filtered_mapping/filtered_ids.txt filtered_mapping/filtered_samples.txt bpha_corr_prot.fasta


3. Building index for metric indexed search
This step builds an index file based on the input DNA sequences using global pairwise alignment mode. The parameters and metric scoring matrix are also stored in the index file. The reference DNA sequences should cover the exact same protein-coding region.