-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simulate linked depth by simulating reads #26
base: main
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## main #26 +/- ##
==========================================
- Coverage 72.86% 67.32% -5.55%
==========================================
Files 2 2
Lines 129 153 +24
==========================================
+ Hits 94 103 +9
- Misses 35 50 +15
Continue to review full report at Codecov.
|
simGL/simGL.py
Outdated
@@ -55,7 +55,43 @@ def refalt_int_encoding(gm, ref, alt): | |||
refalt_int[refalt_str == "T"] = 3 | |||
return refalt_int[gm.reshape(-1), np.repeat(np.arange(gm.shape[0]), gm.shape[1])].reshape(gm.shape) | |||
|
|||
def sim_allelereadcounts(gm, mean_depth, e, ploidy, seed = None, std_depth = None, ref = None, alt = None): | |||
def linked_depth(rng, DPh, read_length, sites_n): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is quite right. Try looking at the mean depth as a function of genomic position and you'll see why.
rng = np.random.default_rng()
DPh = np.array([5] * 500) # 500 haplotypes each with depth 5
linked = linked_depth(rng, DPh, 100, 300)
plt.plot(np.mean(linked, axis=0), label="linked")
plt.show()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me, that's correct. This is the effect of having reads of a certain length mapping in a limited region. Then, because you have fewer reads spanning the edges of the sequence, you have less coverage there. Thus, the mean coverage for the rest of the sequence needs to increase to compensate for this edge effect.
This can be alleviated if the sequence is longer
rng = np.random.default_rng()
DPh = np.array([5] * 500) # 500 haplotypes each with depth 5
linked = linked_depth(rng, DPh, 100, 30000)
plt.plot(np.mean(linked, axis=0), label="linked")
plt.show()
If you were to calculate the mean coverage per individual, you'd see that it matches quite well with the expected. For example:
rng = np.random.default_rng()
DPh = np.array([5] * 50) # 500 haplotypes each with depth 5
linked = linked_depth(rng, DPh, 100, 300)
linked.mean(axis = 1)
> array([5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5.,
5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5.,
5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5.])
At this point I realized that the dimensions of the matrix returned are reversed, and thus, I need to transpose the output in the original function with
return np.array(DP).T
. I'll change that later!
Even if the mean coverage inputed per individual is not "5" (I apply some variance sampling coverage values from a norm distribution with sd = 1), the mean coverage per individual outputted is quite close to the inputed:
rng = np.random.default_rng()
DPh = rng.normal(loc=5, scale=1.0, size=50)
DPh
> array([5.30418801, 5.24747888, 4.36645145, 3.67576215, 3.75636558,
4.6398584 , 4.61245799, 5.30115935, 4.57175466, 5.89714514,
5.62581872, 5.52430759, 5.58910025, 5.42643025, 3.917686 ,
4.76126296, 3.69152416, 5.16517386, 4.76283951, 5.22580705,
5.16908739, 4.58332057, 5.6806694 , 6.03509974, 5.98801939,
4.95102787, 5.80498872, 4.72375886, 5.77273645, 6.67349389,
4.46026638, 5.43974589, 4.0169985 , 5.08819149, 5.58219182,
5.0738201 , 4.78218583, 3.85927317, 6.9650079 , 6.02060105,
4.07702383, 4.27175255, 4.88942867, 5.6765129 , 4.03695654,
4.62093238, 5.05252965, 4.4394403 , 4.40764166, 4.23782541])
linked = linked_depth(rng, DPh, 100, 30000)
linked.mean(axis = 1)
> array([5.30333333, 5.24666667, 4.36333333, 3.67333333, 3.75333333,
4.63666667, 4.61 , 5.3 , 4.57 , 5.89666667,
5.62333333, 5.52333333, 5.58666667, 5.42333333, 3.91666667,
4.76 , 3.69 , 5.16333333, 4.76 , 5.22333333,
5.16666667, 4.58 , 5.68 , 6.03333333, 5.98666667,
4.95 , 5.80333333, 4.72333333, 5.77 , 6.67333333,
4.46 , 5.43666667, 4.01666667, 5.08666667, 5.58 ,
5.07333333, 4.78 , 3.85666667, 6.96333333, 6.02 ,
4.07666667, 4.27 , 4.88666667, 5.67333333, 4.03666667,
4.62 , 5.05 , 4.43666667, 4.40666667, 4.23666667])
plt.scatter(DPh, linked.mean(axis = 0))
plt.plot(np.arange(10)[2:], np.arange(10)[2:])
plt.xlabel("Input")
plt.ylabel("output")
plt.show()
I guess you could argue that you want to get rid of this edge effect and have the same mean coverage for the whole simulated genomic region. To this end, I modified the original function to:
- simulate a longer genomic region : original genomic region + 2*read length (line 02). This was to capture the edge effect observed before in "fake genomic regions" that later will be removed.
- simulated reads as before (line 03 to 10).
- trim "read length" bp from each edge (line 11). This way I remove the increasing and decreasing depth pattern observed before. However, the mean coverage of the remaining sequence is greater than the one inputed.
- correct coverage per site per individual to the input coverage by subtracting the difference between the mean coverage of the region and the mean coverage inputed (line 12).
01 def linked_depth(rng, DPh, read_length, sites_n):
02 seq_length = sites_n+(2*read_length)
03 DP = []
04 print(sites_n+(2*read_length))
05 read_n = (DPh*seq_length/read_length).astype("int")
06 for r in read_n:
07 dp = np.zeros((seq_length,), dtype=int)
08 for p in rng.integers(low=0, high=seq_length-read_length+1, size=r):
09 dp[p:p+read_length] += 1
10 DP.append(dp.tolist())
11 DP = (np.array(DP).T)[(1*read_length):(-1*read_length), :]
12 return np.round(DP-(DP.mean(axis = 0)-5).repeat(DP.shape[0]).reshape(DP.shape))
13
14 rng = np.random.default_rng()
15 DPh = np.array([5] * 500) # 500 haplotypes each with depth 5
16 linked = linked_depth(rng, DPh, 100, 300)
Now the coverage per site is similar across sites in the region of interest
plt.plot(np.mean(linked, axis=1), label="linked")
plt.show()
However, because I need to round the coverage per site (coverage needs to be an integer, not a float), the mean coverage per individual does not match the one inputed:
linked.mean(axis = 1)
> array([5.112, 5.318, 4.896, 4.704, 4.704, 5.89 , 4.682, 4.69 , 5.292,
4.946, 4.964, 4.352, 4.758, 5.354, 5.34 , 4.75 , 5.152, 4.94 ,
4.944, 5.146, 4.728, 5.708, 5.72 , 6.31 , 4.302, 5.688, 5.272,
5.286, 5.896, 5.278, 5.286, 5.064, 4.864, 4.878, 4.256, 4.268,
4.698, 4.726, 5.336, 4.956, 5.162, 5.372, 4.354, 5.358, 4.862,
4.34 , 4.33 , 4.922, 5.322, 5.142, 4.948, 5.306, 5.29 , 5.882,
4.66 , 5.25 , 5.846, 5.268, 6.038, 5.646, 4.254, 5.02 , 5.842,
5.856, 5.448, 5.046, 5.02 , 5.438, 5.226, 4.628, 4.638, 4.612,
4.796, 4.79 , 4.788, 4.778, 4.984, 4.556, 4.752, 4.954, 5.766,
5.156, 4.564, 4.58 , 5.186, 5.398, 4.982, 4.79 , 5.594, 4.582,
5.196, 4.63 , 5.242, 5.242, 5.23 , 5.212, 5.21 , 5.214, 5.202,
5.226, 5.236, 5.234, 4.836, 4.856, 4.248, 5.424, 4.25 , 4.846,
5.05 , 6.032, 4.614, 5.62 , 4.842, 5.264, 4.664, 5.266, 5.264,
5.258, 5.274, 5.276, 5.684, 4.688, 4.89 , 5.284, 5.888, 5.292,
5.314, 4.504, 5.478, 4.888, 4.492, 4.714, 5.716, 4.91 , 5.352,
5.356, 4.758, 5.356, 5.54 , 5.138, 5.336, 4.734, 5.336, 5.332,
4.732, 4.746, 5.336, 4.728, 4.71 , 4.71 , 6.11 , 5.732, 5.314,
5.316, 5.318, 5.322, 5.322, 4.72 , 5.35 , 4.748, 4.566, 4.976,
5.364, 4.554, 4.344, 4.33 , 4.328, 4.306, 5.912, 5.304, 5.306,
5.334, 5.142, 4.942, 5.118, 4.33 , 5.346, 4.748, 5.152, 4.94 ,
4.564, 5.57 , 4.752, 4.934, 4.32 , 4.906, 4.324, 5.122, 5.916,
4.924, 5.114, 5.294, 5.298, 5.888, 4.682, 4.264, 4.468, 5.286,
4.89 , 4.474, 5.282, 4.682, 5.28 , 4.648, 5.446, 4.848, 4.234,
5.206, 5.6 , 4.786, 5.806, 4.612, 5.186, 4.162, 4.778, 4.178,
5.604, 4.63 , 4.808, 4.794, 4.778, 5.392, 4.59 , 4.812, 5.012,
5.228, 4.388, 5.374, 5.182, 4.782, 5.594, 5.19 , 5.166, 5.17 ,
4.954, 5.034, 4.524, 5.092, 5.696, 5.09 , 4.692, 4.302, 5.108,
5.11 , 4.728, 6.108, 4.728, 4.308, 5.122, 5.498, 5.286, 5.074,
3.69 , 4.704, 5.106, 5.112, 4.908, 4.692, 5.666, 4.474, 5.076,
5.078, 4.87 , 4.676, 5.694, 5.098, 5.126, 4.74 , 5.554, 5.138,
5.502, 5.678, 4.086, 4.078, 4.28 , 5.088, 4.468, 4.48 , 4.47 ,
4.702, 5.062, 5.462, 4.466, 4.87 , 4.086, 4.486, 4.282, 4.704,
4.678, 4.884, 4.496, 4.894, 5.274, 5.686, 5.486, 4.506, 4.69 ,
5.272, 4.882, 5.098])
plt.hist(linked.mean(axis = 1))
So, if a long sequence is simulated, the edge effect in the former strategy is reduced and might not have a huge effect on the downstream results. Moreover, this strategy also reflects what happens in reality since you might have this edge effect in real data. The latter approximation gets rid of the edge effect and provides a similar mean coverage over all sites simulated (with stochastic variation), but the mean coverage per individual is not guaranteed to be the same as the one inputed.
Thus, I'd rather stick to the former approximation. What do you think? Do you have another idea to get rid of the edge effect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can imagine circumstances where one would want edge effects, and where one wouldn't want them. So stick with your first implementation. Just make sure the behaviour is well documented and it won't be a problem---If edge effects are important to someone, they can always ask for a longer sequence than they need and then truncate the output.
…ing other functions such as calculating Major and minor alleles based on Genotype Likelihoods
I resolved the merge conflict here by rebasing against main, and manually editing the conflicting files. To continue with your work here, you'll need to be on your |
…alization its own function and rearanging the indeces in the GL output in the subset_GL function to correspond to the indeces inputed in the alleles_per_site argument
No description provided.