Log.txt

> 09-10-2022 20:03
Created revTranslate() that reverse translates amino acid sequence to rna sequence with string and codon usage table dictionary.
Tested revTranslate() against https://www.bioinformatics.org/sms2/rev_trans.html and got the same output!



> 09-10-2022 23:45
For the start the plan is to tailor for e_coli.
I therefore used the "average" codon usage table from https://www.genscript.com/tools/codon-frequency-table as standard codon usage table.
For the reason of comparing tailored vs raw.
As there was no download option i had to copy paste and edit. 
I only wanted the codons, aa and fraction. So i manually removed freq and number.
In the end i had this
  TTT
  TTC
  TTA
  TTG	
  F
  F
  L
  L	
  0.59
  0.41
  0.28
  0.29	
I then created a script (standardCUT.py) to create a dictionary in the same format as the python_codon_table package by Edinburgh-Genome-Foundry.
dict{aminoacid:{codon:fraction,codon2;fraction}}
I also added a standard codon usage table for yeast.



> 12-10-2022 09:37
Created transcribe() that transcribes cDNA to mRNA



> 18-10-2022 11:54
Created uniprot_seq() that retrieves the amino acid sequence from Uniprot.org.
It extracts the sequence by requesting the fasta format as a string from uniprot's api.
With a Uniprot accession id one can thus access both SwissProt reviewd or unreviewd sequences.
If one wants the isoform, one must thus enter which isoform of the accession by writing the specific number.

Example: uniprot_seq("Q9JK66", isoform = 4)
It will thus compile accession with isoform to Q9JK66-4 then search through the record.id of the fasta.
sp|Q9JK66|PRKN_RAT
sp|Q9JK66-5|PRKN_RAT
sp|Q9JK66-6|PRKN_RAT
sp|Q9JK66-4|PRKN_RAT    # Match! Will therefore return this amino acid sequence.
sp|Q9JK66-3|PRKN_RAT



> 18-10-2022 13:02
Updated on the revTranslate()
Made the standard CUT's that i created 09-10-2022 available.
The second argument (cut =) can be set as a own CUT dictionary, 1 to access standard e_coli or 2 to access standard yeast.
EXAMPLE: revTranslate(aa_seq, cut = 2) would reverse translate using standard yeast CUT.



> 18-10-2022 16:28
Created the calc_cai() that calculates the codon adaption index (CAI) of sequence by extracting the f(i) and fmax(j) 
from the codon usage table dictionary. 
Tried comparing it but can not understand what format they want the CUT in. 
However, when i use the example CUT at http://genomes.urv.es/CAIcal/.
It does calculate my test seq to 0.75 and calc_cai() calculates it to 0.77. So it seems to be working!



> 19-10-2022 12:00 
Implemented some basic statistics functions such as nucleotide counter and GC content distribution.
Nucleotide counter was easy but GC content distribution across sequence was trickier.
Had to learn how to plot using matplotlib and Seaborn by creating dataframes with numpy and pandas.
But i made it worked, however it needs to be tweaked.
Tweaked in the sense that i need to start asking potential users of this program of what they want to see.
What stats and such.
Ohh, well. For now i have the basic function at least so will be easy to revise it later.



> 19-10-2022 19:37
In an effort to create the codon optimization function i noticed that my test sequenced (mRNA) used codons that e_coli did not use.
I cant just skip that codon and leave it in the sequence, because e_coli wont have the complementary tRNA.
Resulting in a faulty translation.
I therefore need to overlook the whole input handeling of the program.

IDEA:
  User needs to also set taxID for organism input sequence came from.
  That way i can translate it to aa seq and use the aa seq to recive the most optimal codon for that aa.

The codon optimization function worked when input mRNA seq used codons that exist in the target host CUT.
I therefore believe it is time for me to start using object oriented programming and creating a class.
The fundamental functions are working pre tailor.
The tailoring functions can be looked into after.



> 23-10-2022 17:30
I redid the codon optimization following the idea i mentioned before.
So now, user needs the taxid for both organism where GOI was derived from as well as the host cell.
I made it as a class.
So for the moment.
"""Codon optimises the sequence, if codon not existent in host CUT, then the aa from organism CUT will be used to extract the codon.
        .seq        = seq
        .ct_seq     = codon tailored seq
        .missed     = number of missed codons
        .na_codons  = set of missed codons

        Args:
            seq (str): mRNA seq
            org_cut (dict): Organism codon usage table
            host_cut (dict): Host codon usage table


> 25-10-2022 23:19
I have created the main UI for the widget using PyQt5.
What is next is making it actionable.
My idea at the moment is to have three python files. One for GUI (viewer), one for program (main) , one with functions and classes (module).
So the viewer will execute main, where main can call functions from module.
Exciting!


> 31-10-2022
Using only widget was a faulty move as i could not implement the "File" or "Help" tabs.
I want the file tab to be able to open fasta files so multiple sequences can be added.
Thus, i used QT Designer to create an application.
The main UI application is now built.
My idea now is that the when you press the "Tailor" button, you will either get a PDF or a HTML file
with plots and statistics. Maybe sequence allignment as well but dont know how computational that will be...
More thoughts have to be put into this.
The goal now is to make the buttons actionable. Rather, the input from GUI will be parameters for the Tailor program.
We will see....



> 04-11-2022 14:03
After some tutorials and struggles in regards how to connect the GUI with the ProteinTailor program it is finally live!
At the moment it is able to give outputs to the terminal by asserting the parameters in the GUI.
Next step now is to finalize the ProteinTailor program by implementing the harmonizing step.
As well as figure out the output format.
I also still have in my scope to have user be able to upload fasta/txt files as well as download the output in same format.


> 30-11-2022 09:00
November became a boiling point of activities and deadlines, thus this project fell to the side.
As i was out of town i did changes that i did not log, so here is my effort trying to summarize what happend between the 4th and now.

I noticed that the CodonTailor and FinalFit classes would end up using same functions. So instead of repeating code i split CodonTailors built in functions into their own functions.

list_codons() splits the sequence into a ordered list where each element is the codon. AUGGGA becomes [AUG,GGA].

codon_run(): takes the list of codons, the host and organism CUT and returns a list holding alot of good information + codon optimizes the sequence.


    Args:
        codons (list): Ordered list of codons
        hcut (dict): host codon usage table dictionary
        ocut (dict): organism codon usage table dictionary

    Returns:
        list: List with optimised sequence, number of codons that did not exist in host, list of noped codons, number of codons tailored,
        List of lists holding each codons f(i) and f(j).  

So here we aquire the codon optimized sequence and statistics in regards to how poorly the raw sequence was for the host organism.
We also get the means to calculate the CAI using what i call the freq_list, example: [['codon', f(i), f(j)], ['AUU', 0.36, 0.73]]. 

calc_cai() which did what list_codon and codon_run did was thus scrapped and created into caizen(). Caizen only calculates the CAI by using the freq_list.

These functions are then used in the class CodonTailor, which orchestrates the optimization using the functions mentioned.

For the FinalFit class it checks that the codon optimization does not add any stop codons into reading fram as well as false initiations.
rouge_start_stops() returns the start index of start and stop codons of sequence.
shine_dalgarno() locates all the shine dalgarno sites in sequence.
sense_nonsense() finds (senses) the stop codons that are in the normal reading frame. These are important as they otherwise cause nonsense mutations.
make_sense() just straight up removes the stop codon causing a nonsense mutation. it splits the sequence (seq u ence), removes the stop codon (u) and adds the remaining sequences back (seqence).
false_initiation() finds SD sites that are to close to start codons (range of 12nt). This is checked across all reading frames as we dont want to have ribosomes binding on different reading frames or in the middle of the coding region.
shoeshine() removes SD sites that could cause false initiations by replacing one of the codons in the SD site with CGC, GAA or GGG depending on which reading frame.

""" Start codons are OK in the gene as long as they are not near a SD (8-10nt).
    Stop codons are not OK, they will terminate."""

"""
  R       R
AGG     AGG

EKQ      E     ADEGV
XAG     GAG     GXX              11 look ahead

AEGIKLPQRSTVW    G       G
    XXA         GGA     GGX      

posG GAA GXX XXX    
so you switch GGA in +2 to GGG and GAG in +1 to GAA
AGG -> CGU
sq = "xxxAGGAGGxAGGAGGxxxxAGGAGGx"

XAG
 """ 

shoe_shoeshined() checks that the shoeshine did not implement a new sd site. So it check if the shoe is shoeshined.

FinalFit then utilizes these functions and returns the final fitted tailored sequence!

> 06-12-2022 10:25
Need finalize output. I have decided to make it as a html report.

- Tailored Sequence
    - Copy All or maybe Download??
- Statistics
- Distribution Graph
- Alignment

> 07-12-2022 10:25

# Frame
sq = "123AGGAGG123123"                              # 123 AGG AGG 123 123
print(sq[:3+0])                                     # 123
print(sq[3+0+3:])                                   #         AGG 123 123
print(sq[:3+0] + "CGC" + sq[3+0+3:])                # 123 CGC AGG 123 123
print()
# +1
sq = "1231AGGAGG23123"                              # 123 1AG GAG G23 123
print(sq[:4+1])                                     # 123 1A
print(sq[4+1+3:])                                   #           G G23 123
print(sq[:4+1] + "GAA" + sq[4+1+3:])                # 123 1AG AAG G23 123
print()
# +2
sq = "12312AGGAGG3123"                              # 123 12A GGA GG3 123
print(sq[:5+2])                                     # 123 12A G
print(sq[5+2+3:])                                   #              G3 123
print(sq[:5+2] + "GGG" + sq[5+2+3:])                # 123 12A GGG GG3 123
print()