This is a parser that reads MHC files from ANHIG / IMGTHLA and converts them to specific format requirements.
This project uses the HLA alignment files from ANHIG / IMGTHLA. It takes the alignment files, combines the subsets of the gen and nuc files for a given allele, writes a clean output file of the combined data, and provides a count.
- merge_file_B.txt - merged version of B_gen and B_nuc, run 09/06
- merge_file_C.txt - merged version of C_gen and C_nuc, run 09/06
- merged_file_DMA.txt - merged version of DMA_gen and DMA_nuc, run 09/06
- merged_files_A.txt - merged version of A_gen and A_nuc, run 09/06
- mergefiles.py - final version as of 09/06; reads in the nuc and gen files and merges them
- HLAtest.py - takes the alignments/gen file from the IMGTHLA database and uses pandas to count the unique values
- DataQ&A.pptx - an attempted walkthrough of the gen and nuc data sets and how they need to be combined
The Jupyter notebooks contain versions of the gen file import used for cleaning and for counting unique columns.
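As a rough illustration of the counting step, here is a minimal sketch. HLAtest.py uses pandas; this sketch swaps in `collections.Counter` from the standard library so it runs without extra installs. The function name `count_unique_sequences` and the toy data are hypothetical and are not taken from the project.

```python
from collections import Counter


def count_unique_sequences(alleles):
    """Count how many alleles share each distinct sequence string."""
    return Counter(alleles.values())


# Hypothetical toy data: allele name -> compressed sequence
alleles = {
    "A*01:01:01:01": "ATGGCC",
    "A*01:01:01:02": "ATGGCC",
    "A*02:01:01:01": "ATGGTC",
}

counts = count_unique_sequences(alleles)
print(len(counts))            # number of distinct sequences
print(counts.most_common(1))  # most frequent sequence and how often it occurs
```

With pandas, a similar count could be obtained from `Series.value_counts()` on the sequence column.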
- Identify which HLA gene you would like to work with, then grab its nuc and gen txt files from the IMGTHLA alignments folder.
- example: A_gen.txt and A_nuc.txt
- Anywhere you see the comment ###### ~ UPDATE, update that line before running, as in the following lines:
###### ~ UPDATE
if("C*" in line):
    # Remove the whitespaces from the
    # beginning and end from name
    name = line[0:16].strip()
    if name not in nameList:
        nameList.append(name)
    # we are stripping out the beginning / end whitespace
    value = line[19:].strip()
    # let's get rid of internal whitespace and compress sequence
    value = value.replace(" ", "")
    # we want to be able to distinguish
    # between intron/exon, but need the
    # full value - we will do this after
    # the entire allele value is captured
    if name in alleles:
        # we're appending to the existing
        alleles[name] += value
    else:
        alleles[name] = value
line = fp.readline()

# replace bars with first allele values
###### ~ UPDATE
first = alleles["C*01:02:01:01"]

## Read in Nuc and Gen files
###### ~ UPDATE
nuc_text = input("Enter a nuc file path or hit enter to accept default:") or \
    "/Users/drjacobs/Documents/python_MHC_parser/mergefiles/C_nuc.txt"
# nuc file parsing
nameList_nuc, alleles_nuc = parsetodic(nuc_text)
# gen file parsing
###### ~ UPDATE
gen_text = input("Enter a gen file path or hit enter to accept default:") or \
    "/Users/drjacobs/Documents/python_MHC_parser/mergefiles/C_gen.txt"
nameList_gen, alleles_gen = parsetodic(gen_text)
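The excerpts above are fragments of a larger loop. For orientation, here is a self-contained sketch of the same parsing idea. It is not the project's `parsetodic`: the function name `parse_alignment`, the `gene` parameter (which stands in for the hard-coded `"C*"` check), and the sample rows are all hypothetical. It keeps the excerpt's column slices (`line[0:16]` for the name, `line[19:]` for the sequence) and assumes your file follows that layout.

```python
import io


def parse_alignment(fp, gene):
    """Collect allele name -> concatenated sequence from an open text file."""
    prefix = gene + "*"   # e.g. "C*"; stands in for the hard-coded check
    name_list = []        # allele names in first-seen order
    alleles = {}          # allele name -> sequence collected so far
    for line in fp:
        if prefix not in line:
            continue      # skip headers, rulers, and blank lines
        name = line[0:16].strip()                   # same slice as the excerpt
        value = line[19:].strip().replace(" ", "")  # drop internal spaces
        if name not in alleles:
            name_list.append(name)
            alleles[name] = ""
        alleles[name] += value  # later blocks extend the same allele
    return name_list, alleles


def row(name, seq):
    """Build a sample line: name starts at column 1, sequence at column 19."""
    return f" {name:<18}{seq}\n"


# Hypothetical two-block sample, not real IMGTHLA data
sample = (
    row("C*01:02:01:01", "ATG CGG")
    + row("C*01:02:01:02", "--- ---")
    + "\n"
    + row("C*01:02:01:01", "GTC")
    + row("C*01:02:01:02", "-A-")
)

names, alleles = parse_alignment(io.StringIO(sample), "C")
print(names)
print(alleles["C*01:02:01:01"])
```

Passing the gene name as an argument is one way to avoid editing the `if` line for each gene. To read a real file, open it with `open(path)` and pass the file object in place of the `io.StringIO` wrapper.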
Update these sections with the material relevant to the allele and file you are working with:
- if("C*" in line): update C* with the HLA gene name:
    - A
    - B
    - C
    - DQA1
    - DQA2
    - DQB1
    - DQB2
    - DRA
    - DRB
- first = alleles["C*01:02:01:01"]: change C*01:02:01:01 to the first allele listed in your file.
- nuc_text = input("Enter a nuc file path or hit enter to accept default:") or "/Users/drjacobs/Documents/python_MHC_parser/mergefiles/C_nuc.txt": update with the path to your nuc file.
- gen_text = input("Enter a gen file path or hit enter to accept default:") or "/Users/drjacobs/Documents/python_MHC_parser/mergefiles/C_gen.txt": update with the path to your gen file.
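The `first = alleles[...]` line selects the reference allele whose values replace the placeholder characters in the other alleles. Below is a minimal sketch of that replacement. It assumes the placeholders are the `-` characters that IMGT/HLA alignments use to mark identity with the reference sequence, and that `.`, `*`, and `|` should be left untouched. The helper name `expand_against_reference` and the toy sequences are hypothetical.

```python
def expand_against_reference(reference, sequence):
    """Replace each '-' in sequence with the reference base at that position."""
    expanded = []
    for i, base in enumerate(sequence):
        if base == "-" and i < len(reference):
            expanded.append(reference[i])  # identical to the reference here
        else:
            expanded.append(base)          # keep real bases, '.', '*', '|'
    return "".join(expanded)


# Hypothetical toy data in first-seen order, as a parser might return it
name_list = ["C*01:02:01:01", "C*01:02:01:02"]
alleles = {
    "C*01:02:01:01": "ATG|CGG.TC",
    "C*01:02:01:02": "---|-A-.-*",
}

# Assumption: the reference is the first allele seen, so no name is hard-coded
first = alleles[name_list[0]]
full = {name: expand_against_reference(first, seq) for name, seq in alleles.items()}
print(full["C*01:02:01:02"])
```

Taking the reference from the first parsed name is one possible way to avoid editing the allele name by hand. Check that the first allele in your file really is the reference before relying on it.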
You should now be set to run! Just make sure you have Python 3. 😃