#!/usr/bin/env python
'''
Consensus Maker
Version 2.0
By Brendan Kohrn and Scott Kennedy(1)
(1) Department of Pathology, University of Washington School of Medicine, Seattle, WA 98195
Based on work by Scott Kennedy
January 21, 2014
Written for Python 2.7.3
Required modules: Pysam, Samtools
Inputs:
A position-sorted paired-end BAM file containing reads with a duplex tag in the header.
Outputs:
1: A paired-end BAM file containing SSCSs
2: A single-end BAM file containing unpaired SSCSs (if --read_type is 'd')
3: A single-end BAM file containing reads with less common cigar strings
4: A single-end BAM file containing reads not in --read_type
5: A tagcounts file
Note that quality scores in outputs 1, 2, and 3 are just space fillers and do not signify anything about the quality of the sequence.
The program starts at the position of the first good read, determined by the type of read specified on startup. It then goes through the file until it finds a new position, saving all reads as it goes. When it finds a new position, it sends the saved reads to the consensus maker, one tag at a time, until it runs out of tags. Consensus sequences are saved until their mates come up, at which point both are written to the output BAM file, read 1 first. After making consensuses with the reads from the first position, it continues on through the original file until it finds another new position, sends those reads to the consensus maker, and so on until the end of the file. At the end of the file, any remaining reads are sent through the consensus maker, and any unpaired consensuses are written to a file ending in _UP.bam.
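In outline, the position-grouping loop works as follows (a simplified sketch; the real loop below also handles filters, read pairing, and end-of-file bookkeeping, and tagOf and processFamily stand in for logic that is written in-line):
    currentPos = None
    readDict = {}
    for read in inBam:                            # reads are position-sorted
        if read.pos != currentPos:                # reached a new position
            for tag in readDict.keys():           # one consensus per duplex tag
                processFamily(tag, readDict[tag])
            readDict = {}
            currentPos = read.pos
        readDict.setdefault(tagOf(read), []).append(read.seq)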
In the future, this program may be able to autodetect read length.
usage: ConsensusMaker.py [-h] [--infile INFILE] [--tagfile TAGFILE]
[--outfile OUTFILE] [--rep_filt REP_FILT]
[--minmem MINMEM] [--maxmem MAXMEM] [--cutoff CUTOFF]
[--Ncutoff NCUTOFF] [--readlength READ_LENGTH]
[--read_type READ_TYPE] [--isize ISIZE]
[--read_out ROUT] [--filt FILT]
optional arguments:
-h, --help show this help message and exit
--infile INFILE input BAM file
--tagfile TAGFILE output tagcounts file
--outfile OUTFILE output BAM file
--rep_filt REP_FILT Remove tags with homomeric runs of nucleotides of
length x. [9]
--minmem MINMEM Minimum number of reads allowed to comprise a
consensus. [3]
--maxmem MAXMEM Maximum number of reads allowed to comprise a
consensus. [1000]
--cutoff CUTOFF Percentage of nucleotides at a given position in a
read that must be identical in order for a consensus
to be called at that position. [0.7]
--Ncutoff NCUTOFF With --filt 'n', maximum fraction of Ns allowed in a
consensus [1.0]
--readlength READ_LENGTH
Length of the input read that is being used. [84]
--read_type READ_TYPE
A string specifying which types of read to consider.
Read types: n: Neither read 1 nor read 2 mapped. m:
Either read 1 or read 2 mapped, but not both. p: Both
read 1 and read 2 mapped, not a proper pair. d: Both
read 1 and read 2 mapped, proper pair. s: Single-ended
reads ['dpm']
--isize ISIZE maximum distance between read pairs
--read_out ROUT How often you want to be told what the program is
doing. [1000000]
--filt FILT A string indicating which filters should be
implemented. Filters: s: Softclipping filter. o:
Overlap filter. n: N filter. ['osn']
Details of different arguments:
--minmem and --maxmem set the range of family sizes (counted within the most common cigar string) that can be used to make a consensus sequence. The examples below use a --minmem of 3 and a --maxmem of 1000; a code sketch follows the examples.
Example 1:
Ten reads (readlength = 80) have a particular barcode. Of these ten, nine have a cigar string of 80M, while one has a cigar string of 39M1I40M. Only the nine with a cigar string of 80M are sent on to be made into an SSCS.
Example 2:
Three reads (readlength 80) have a particular barcode. Of these, two have a cigar string of 80M, and one has a cigar string of 20M1D60M. No SSCS results.
Example 3:
A family with over 1000 members exists. A random sample of 1000 reads from that family is used to make an SSCS.
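In code, the family-size rule amounts to the following sketch, where familySize is the read count for the most common cigar string in the family:
    if familySize >= minmem:                        # Example 2 fails here
        if familySize <= maxmem:
            members = family                        # Example 1: the nine 80M reads
        else:
            members = random.sample(family, maxmem) # Example 3: subsample to 1000
        consensus = consensusMaker(members, cutoff, readlength)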
--cutoff sets the strictness of the consensus making; a worked sketch follows the example below.
Example (--cutoff = 0.7):
Four reads (readlength = 10) are as follows:
Read 1: ACTGATACTT
Read 2: ACTGAAACCT
Read 3: ACTGATACCT
Read 4: ACTGATACTT
The resulting SSCS is:
ACTGATACNT
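Per position, the rule is: call the base whose fraction exceeds --cutoff, otherwise call N. The example works out as follows (illustrative Python, separate from the program itself):
    reads = ['ACTGATACTT', 'ACTGAAACCT', 'ACTGATACCT', 'ACTGATACTT']
    consensus = ''
    for i in xrange(10):
        counts = {}
        for r in reads:
            counts[r[i]] = counts.get(r[i], 0) + 1
        base, n = max(counts.items(), key=lambda kv: kv[1])
        consensus += base if float(n) / len(reads) > 0.7 else 'N'
    # consensus == 'ACTGATACNT': position 6 is 'T' at 3/4 = 0.75 > 0.7, while
    # position 9 ties T and C at 2/4 = 0.5 each, so it becomes 'N'.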
--Ncutoff, with --filt 'n' enabled, sets the maximum fraction of Ns allowed in an SSCS; the check itself is shown after the example.
Example (--Ncutoff = .1, --readlength = 20):
Two SSCSs are generated as follows:
SSCS 1: ACGTGANCTAGTNCTNTACC
SSCS 2: GATCTAGTNCATGACCGATA
SSCS 2 passes the n filter (10%) with 1/20 = 5% Ns, while SSCS 1 does not with 3/20 = 15% Ns.
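The check is a single fraction comparison (sketch; the float cast matters under Python 2 integer division):
    passes = float(consensus.count('N')) / len(consensus) <= Ncutoff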
--readlength sets the length of the input reads. If this value is set incorrectly, the program will often crash with an error message about sequence length not matching quality score length, or will output an empty SSCS BAM file.
--read_type sets which reads are considered to have 'good' flags. Options are:
d: Paired-end reads where both reads in the pair map, and where the two are properly paired (read 2 maps in the opposite direction and on the opposite strand from read 1). Flags are 99, 83, 163, and 147.
p: Paired-end reads where both reads in the pair map, but the two are not properly paired. Flags are 97, 81, 161, 145, 129, 65, 177, and 113.
m: Paired-end reads where only one read in the pair maps. Flags are 181, 117, 137, 133, 73, 89, 69, and 153.
n: Paired-end reads where neither read in the pair maps, and single-end unmapped reads. Flags are 141, 77, and 4.
s: Single-end mapped reads. Flags are 0 and 16.
--filt sets which filters are used. Options are:
o: Overlap filter. Filters out any read pairs which overlap. Only works on reads of type d (see above).
s: Softclipping filter. Filters out any reads which have been soft-clipped in alignment. This avoids later problems with hard-clipping.
n: N filter. Filters out consensus sequences with a higher percentage of Ns than the threshold imposed by --Ncutoff. Without this option, --Ncutoff doesn't do anything.
--isize
If not -1, sets the maximum distance allowed between read 1 and read 2 before the pair is considered unpaired. Only works if --read_type is 'd'.
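An example invocation (file names and thresholds are illustrative):
    python ConsensusMaker.py --infile duplex.sorted.bam --tagfile duplex.tagcounts --outfile duplex_SSCS.bam --minmem 3 --maxmem 1000 --cutoff 0.7 --Ncutoff 0.3 --readlength 84 --read_type dpm --filt osn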
'''
import sys
import pysam
import re
import random
from collections import defaultdict
from argparse import ArgumentParser
def printRead(readIn):
sys.stderr.write("%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n" % (readIn.qname, readIn.flag, readIn.tid, readIn.pos, readIn.mapq, readIn.cigar, readIn.mrnm, readIn.mpos, readIn.isize, readIn.seq, readIn.qual, readIn.tags))
def consensusMaker(groupedReadsList, cutoff, readLength):
	'''The consensus maker uses a simple "majority rules" algorithm to make a consensus at each base position. If no nucleotide majority reaches above the minimum threshold (--cutoff), the position is considered undefined and an 'N' is placed at that position in the read.'''
nucIdentityList=[0, 0, 0, 0, 0, 0] # In the order of T, C, G, A, N, Total
nucKeyDict = {0:'T', 1:'C', 2:'G', 3:'A', 4:'N'}
consensusRead = ''
for i in xrange(readLength) : # Count the types of nucleotides at a position in a read. i is the nucleotide index within a read in groupedReadsList
for j in xrange(len(groupedReadsList)): # Do this for every read that comprises a SMI group. j is the read index within groupedReadsList
try:
if groupedReadsList[j][i] == 'T' :
nucIdentityList[0] += 1
elif groupedReadsList[j][i] == 'C':
nucIdentityList[1] += 1
elif groupedReadsList[j][i] == 'G':
nucIdentityList[2] += 1
elif groupedReadsList[j][i] == 'A':
nucIdentityList[3] += 1
elif groupedReadsList[j][i] == 'N':
nucIdentityList[4] += 1
else:
nucIdentityList[4] += 1
nucIdentityList[5] += 1
			except IndexError:
				# Read j is shorter than readLength; stop counting this group at this position
				break
		try:
			for j in [0, 1, 2, 3, 4]:
				if float(nucIdentityList[j])/float(nucIdentityList[5]) > cutoff:
					consensusRead += nucKeyDict[j]
					break
				elif j == 4:
					consensusRead += 'N'
		except ZeroDivisionError:
			# No nucleotides were counted at this position; call it undefined
			consensusRead += 'N'
nucIdentityList=[0, 0, 0, 0, 0, 0] # Reset for the next nucleotide position
return consensusRead
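# Illustrative check of consensusMaker, mirroring the --cutoff example in the
# module docstring (left as a comment; not executed by the pipeline):
#   consensusMaker(['ACTGATACTT', 'ACTGAAACCT', 'ACTGATACCT', 'ACTGATACTT'], 0.7, 10)
#   returns 'ACTGATACNT'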
def tagStats(tagCountsFile):
familySizeCounts=defaultdict( lambda: 0 )
fIn = open(tagCountsFile, 'r')
fOut = open(tagCountsFile.replace('.tagcounts', '.tagstats'), 'w')
for line in fIn:
familySizeCounts[int(line.strip().split()[1].split(":")[0])] += 1
fIn.close()
totals = 0
for size in familySizeCounts.keys():
familySizeCounts[size] *= int(size)
totals += int(familySizeCounts[size])
for size in sorted(familySizeCounts.keys()):
fOut.write("%s\t%s\n" % (size, float(familySizeCounts[size])/float(totals)))
fOut.close()
	return True
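# The .tagstats file gives, for each family size, the fraction of all reads that
# belong to families of that size. For example (hypothetical counts), a tagcounts
# file whose tags have counts 1, 1, and 3 yields a .tagstats file of:
#   1	0.4
#   3	0.6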
def main():
#Parameters to be input.
parser=ArgumentParser()
parser.add_argument("--infile", action="store", dest="infile", help="input BAM file", required=True)
parser.add_argument("--tagfile", action="store", dest="tagfile", help="output tagcounts file", default='sys.stdout', required=True)
parser.add_argument("--outfile", action="store", dest="outfile", help="output BAM file", required=True)
parser.add_argument("--rep_filt", action="store", type=int, dest='rep_filt', help="Remove tags with homomeric runs of nucleotides of length x. [9]", default=9 )
parser.add_argument('--minmem', type=int, default=3, dest='minmem', help="Minimum number of reads allowed to comprise a consensus. [3] ")
parser.add_argument('--maxmem', type=int, default=1000, dest='maxmem', help="Maximum number of reads allowed to comprise a consensus. [1000]")
parser.add_argument('--cutoff', type=float, default=.7, dest='cutoff', help="Percentage of nucleotides at a given position in a read that must be identical in order for a consensus to be called at that position. [0.7]")
parser.add_argument('--Ncutoff', type=float, default=1, dest='Ncutoff', help="With --filt 'n', maximum fraction of Ns allowed in a consensus [1.0]")
parser.add_argument('--readlength', type=int, default=84, dest='read_length', help="Length of the input read that is being used. [84]")
parser.add_argument('--read_type', type=str, action="store", dest='read_type', default="dpm", help="A string specifying which types of read to consider. Read types: n: Neither read 1 nor read 2 mapped. m: Either read 1 or read 2 mapped, but not both. p: Both read 1 and read 2 mapped, not a proper pair. d: Both read 1 and read 2 mapped, proper pair. s: Single-ended reads\n\t\t['dpm']")
parser.add_argument('--isize', type = int, default=-1, dest='isize', help="maximum distance between read pairs")
parser.add_argument('--read_out', type = int, default = 1000000, dest = 'rOut', help = 'How often you want to be told what the program is doing. [1000000]')
parser.add_argument('--filt', type=str, default='osn', dest='filt', help="A string indicating which filters should be implemented. Filters: s: Softclipping filter. o: Overlap filter. n: N filter. ['osn']")
o = parser.parse_args()
# Initialization of all global variables, main input/output files, and main iterator and dictionaries.
goodFlag=[]
if 'd' in o.read_type:
goodFlag.extend((99, 83, 163, 147))
if 'm' in o.read_type:
goodFlag.extend((181, 117, 137, 133, 73, 89, 69, 153))
if 'p' in o.read_type:
goodFlag.extend((97, 81, 161, 145, 129, 65, 177, 113))
if 'n' in o.read_type:
goodFlag.extend((141, 77, 4))
if 's' in o.read_type:
goodFlag.extend((0, 16))
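	# goodFlag now holds every SAM flag value accepted under --read_type; the
	# sets above mirror the flag table in the module docstring.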
inBam = pysam.Samfile( o.infile, "rb" ) # Open the input BAM file
outBam = pysam.Samfile( o.outfile, "wb", template = inBam ) # Open the output BAM file
outNC1 = pysam.Samfile( o.outfile.replace(".bam","_LCC.bam"),"wb", template = inBam )
nonMap = pysam.Samfile( o.outfile.replace(".bam","_NM.bam"), "wb", template = inBam ) # File for reads with strange flags
if o.read_type == 'd':
extraBam = pysam.Samfile(o.outfile.replace(".bam","_UP.bam"), "wb", template = inBam)
readNum = 0
nM = 0
bF = 0
oL = 0
sC = 0
rT = 0
nC = 0
LCC = 0
ConMade = 0
if o.read_type == 'd':
UP = 0
fileDone=False # Initialize end of file bool
finished=False
readOne=False
qualScore = 'J'*o.read_length # Set a dummy quality score
bamEntry = inBam.fetch( until_eof = True ) # Initialize the iterator
readWin = [bamEntry.next(), ''] # Get the first read
winPos = 0
readDict = {} # Initialize the read dictionary
tagDict = defaultdict( lambda: 0 ) # Initialize the tag dictionary
consensusDict={}
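	# readDict maps duplex tag -> [flag, rname, pos, mrnm, mpos, isize,
	#   {str(cigar): [count, cigar, seq1, seq2, ...]}]; consensusDict holds
	# finished SSCSs that are waiting for their mates before being written.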
#Start going through the input BAM file, one position at a time.
for line in bamEntry:
winPos += 1
readWin[winPos%2] = line
# Reinitialize first line
if readOne==True:
winPos -= 1
while (readWin[winPos%2].pos == readWin[(winPos-1)%2].pos and fileDone==False and readOne==False) or readOne == True:
if readNum % o.rOut == 0:
sys.stderr.write("Reads processed:" + str(readNum) + "\n")
try:
tag = readWin[winPos%2].qname.split('|')[1].split('/')[0] + (":1" if readWin[winPos%2].is_read1 == True else (":2" if readWin[winPos%2].is_read2 == True else ":se"))
tagDict[tag] += 1
except:
print readNum
raise
# Overlap filter: filters out overlapping reads (with --filt o)
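			# A proper-pair read (flag 83/99/147/163) is treated as overlapping when
			# it starts within one read length of its mate, in either order, or when
			# read and mate share the same start position.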
overlap=False
if 'o' in o.filt:
if readWin[winPos%2].pos < readWin[winPos%2].mpos and readWin[winPos%2].mpos < readWin[winPos%2].pos + o.read_length and int(readWin[winPos%2].flag) in (83, 99, 147, 163):
overlap=True
elif readWin[winPos%2].pos > readWin[winPos%2].mpos and readWin[winPos%2].pos < readWin[winPos%2].mpos + o.read_length and int(readWin[winPos%2].flag) in (83, 99, 147, 163):
overlap=True
elif readWin[winPos%2].pos==readWin[winPos%2].mpos and int(readWin[winPos%2].flag) in (83, 99, 147, 163):
overlap=True
readNum +=1
# Softclip filter: filters out softclipped reads (with --filt s)
softClip=False
if 's' in o.filt:
				if readWin[winPos%2].cigar is not None:
					for cigarTuple in readWin[winPos%2].cigar:
						if cigarTuple[0] == 4: # CIGAR operation 4 = soft clip
							softClip = True
# Check if the given read is good data
if int( readWin[winPos%2].flag ) in goodFlag and overlap==False and softClip==False:
				if ('A'*o.rep_filt in tag) or ('C'*o.rep_filt in tag) or ('G'*o.rep_filt in tag) or ('T'*o.rep_filt in tag):
					# Bad barcode: homopolymer run of length >= --rep_filt in the tag
					nM += 1
					nonMap.write(readWin[winPos%2])
					rT += 1
else :
# Add the sequence to the read dictionary
if tag not in readDict:
readDict[tag] = [readWin[winPos%2].flag, readWin[winPos%2].rname, readWin[winPos%2].pos, readWin[winPos%2].mrnm, readWin[winPos%2].mpos, readWin[winPos%2].isize,{str(readWin[winPos%2].cigar):[0,readWin[winPos%2].cigar]}]
if str(readWin[winPos%2].cigar) not in readDict[tag][6]:
readDict[tag][6][str(readWin[winPos%2].cigar)]=[0,readWin[winPos%2].cigar]
readDict[tag][6][str(readWin[winPos%2].cigar)].append(readWin[winPos%2].seq)
readDict[tag][6][str(readWin[winPos%2].cigar)][0]+=1
else:
nM += 1
nonMap.write(readWin[winPos%2])
if int(readWin[winPos%2].flag) not in goodFlag:
bF += 1
elif overlap == True:
oL += 1
elif softClip == True:
sC += 1
winPos += 1
if readOne == False:
try: # Keep StopIteration error from happening at the end of a file
readWin[winPos%2] = bamEntry.next() # Iterate the line
				except StopIteration:
					fileDone = True # Tell the program that it has reached the end of the file
else:
readOne = False
else:
# Send reads to consensusMaker
readOne=True
for dictTag in readDict.keys(): # Extract sequences to send to the consensus maker
# Cigar string filtering
cigComp={}
				for cigStr in readDict[dictTag][6].keys(): # Determine the most common cigar string
					cigComp[cigStr]=readDict[dictTag][6][cigStr][0]
				maxCig=max(cigComp, key=cigComp.get) # The cigar string with the highest count, not the lexicographically largest key
if cigComp[maxCig] >= o.minmem:
if cigComp[maxCig] <= o.maxmem:
ConMade += 1
consensus = consensusMaker( readDict[dictTag][6][maxCig][2:], o.cutoff, o.read_length )
else:
ConMade += 1
consensus = consensusMaker(random.sample(readDict[dictTag][6][maxCig][2:], o.maxmem), o.cutoff, o.read_length)
for cigStr in readDict[dictTag][6].keys():
if cigStr != maxCig:
							for n in xrange(2, len(readDict[dictTag][6][cigStr])): # Sequences are stored from index 2 onward
a = pysam.AlignedRead()
a.qname = dictTag
a.flag = readDict[dictTag][0]
a.seq = readDict[dictTag][6][cigStr][n]
a.rname = readDict[dictTag][1]
a.pos = readDict[dictTag][2]
a.mapq = 255
a.cigar = readDict[dictTag][6][cigStr][1]
a.mrnm = readDict[dictTag][3]
a.mpos=readDict[dictTag][4]
a.isize = readDict[dictTag][5]
a.qual = qualScore
outNC1.write(a)
LCC += 1
cigComp={}
# Filter out consensuses with too many Ns in them
					if (float(consensus.count("N"))/len(consensus) <= o.Ncutoff and 'n' in o.filt) or ('n' not in o.filt): # Float cast avoids Python 2 integer division
# Write a line to the consensusDictionary
a = pysam.AlignedRead()
a.qname = dictTag
a.flag = readDict[dictTag][0]
a.seq = consensus
a.rname = readDict[dictTag][1]
a.pos = readDict[dictTag][2]
a.mapq = 255
a.cigar = readDict[dictTag][6][maxCig][1]
a.mrnm = readDict[dictTag][3]
a.mpos=readDict[dictTag][4]
a.isize = readDict[dictTag][5]
a.qual = qualScore
# Write SSCSs to output BAM file in read pairs.
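						# altTag is the mate's dictionary key: the same duplex tag
						# with the read-number suffix swapped (:1 <-> :2)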
altTag=dictTag.replace(("1" if "1" in dictTag else "2"),("2" if "1" in dictTag else "1"))
if altTag in consensusDict:
if a.is_read1 == True:
outBam.write(a)
outBam.write(consensusDict.pop(altTag))
else:
outBam.write(consensusDict.pop(altTag))
outBam.write(a)
else:
consensusDict[dictTag]=a
else:
nC += 1
readDict={} # Reset the read dictionary
if o.read_type == 'd':
if o.isize != -1:
for consTag in consensusDict.keys():
if consensusDict[consTag].pos + o.isize < readWin[winPos%2].pos:
extraBam.write(consensusDict.pop(consTag))
UP += 1
# Write unpaired SSCSs
for consTag in consensusDict.keys():
if o.read_type == 'd':
extraBam.write(consensusDict.pop(consTag))
UP += 1
else:
outBam.write(consensusDict.pop(consTag))
# Close BAM files
inBam.close()
outBam.close()
nonMap.close()
outNC1.close()
if o.read_type == 'd':
extraBam.close()
# Write summary statistics
sys.stderr.write("Summary Statistics: \n")
sys.stderr.write("Reads processed:" + str(readNum) + "\n")
sys.stderr.write("Bad reads: %s\n" % nM)
sys.stderr.write("\tReads with Bad Flags: %s\n" % bF)
sys.stderr.write("\tOverlapping Reads: %s\n" % oL)
sys.stderr.write("\tSoftclipped Reads: %s\n" %sC)
sys.stderr.write("\tRepetitive Duplex Tag: %s\n" % rT)
sys.stderr.write("Reads with Less Common Cigar Strings: %s\n" % LCC)
sys.stderr.write("Consensuses Made: %s\n" % ConMade)
#sys.stderr.write("Unpaired Consensuses: %s\n" % UP)
sys.stderr.write("Consensuses with Too Many Ns: %s\n\n" % nC)
# Write the tag counts file.
tagFile = open( o.tagfile, "w" )
tagFile.write ( "\n".join( [ "%s\t%d" % ( SMI, tagDict[SMI] ) for SMI in sorted( tagDict.keys(), key=lambda x: tagDict[x], reverse=True ) ] ))
tagFile.close()
tagStats(o.tagfile)
if __name__ == "__main__":
main()