Should Subsampling be Recommended? #156
Comments
Hi @fjclark, I haven't thought about this in a while and I may misremember. My understanding is that, at least for WHAM and MBAR, uncorrelated samples are needed for the mean and uncertainty estimates to converge asymptotically. I haven't looked at other estimators in detail.
From practical experience, I have usually opted for no subsampling for the mean and subsampling for the uncertainty. I have compared subsampled and non-subsampled means from real simulation data at some point, and when there are insufficient samples subsampling can make things worse. Maybe that just means we should sample more? In that case I would typically favour uncertainties from different replicas over estimated uncertainties. I can take a look at references again for more recent papers on this. |
Hi @ppxasjsm, Thanks very much for the comments.
From appendix A of Shirts and Chodera, 2008:
My understanding from this is that subsampling is not necessarily required for the mean to converge, and that it is only required to converge the uncertainty because of the use of the estimator from Kong et al., 2003 (and that subsampling will slow the convergence of both the mean and the uncertainty estimate due to the reduction in effective sample size).
I also use no subsampling for the mean and subsampling for the uncertainty, but this seems to contrast with the tick-box recommendation in the article: "Make sure you subsample the data in your free energy estimation protocol". Should this be updated? My understanding, which seems to match my quick test here, is that subsampling always increases the variance of the mean, but that the issue becomes particularly pronounced when there are very few effective samples. I completely agree that more sampling / comparing multiple replicas is best, but even then subsampling would increase the variance of the mean estimate.
Thanks! |
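The claim that subsampling inflates the variance of the mean is easy to check numerically. Below is a minimal sketch (not from the thread) using a synthetic AR(1) series as a stand-in for correlated simulation output; the function names and parameter values are illustrative assumptions:

```python
import random
import statistics

def ar1_series(n, phi=0.9, seed=0):
    """Synthetic correlated data: AR(1) with x_t = phi * x_{t-1} + noise."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out

def compare_mean_variances(n_reps=200, n=5000, stride=50):
    """Spread of the estimated mean over replicas: all data vs subsampled."""
    full_means, sub_means = [], []
    for rep in range(n_reps):
        xs = ar1_series(n, seed=rep)
        full_means.append(statistics.fmean(xs))
        sub_means.append(statistics.fmean(xs[::stride]))
    return statistics.pvariance(full_means), statistics.pvariance(sub_means)

var_full, var_sub = compare_mean_variances()
# For phi = 0.9 the statistical inefficiency is ~19, so a stride of 50
# yields nearly uncorrelated samples -- yet the subsampled mean is noisier.
print(var_sub > var_full)
```

Even with a "perfect" stride well past the correlation time, the subsampled estimate of the mean has a visibly larger replica-to-replica spread, consistent with the comment above.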
I am generally in agreement with this, though I don't know that I have perfect data. Using all the data will allow you to recover some information from the partially correlated samples; I don't know that I've ever seen a great analysis of exactly how much. For example, if you have 1,000,000 samples of which only 1000 are effectively uncorrelated, I bet you could reclaim essentially all of the missing information with, say, 10,000 samples. I have never studied this, though. Bootstrap samples must be uncorrelated, though, or you suppress the variance. |
I agree with all the comments here. Subsampling throws away data and, in my opinion, is not generally advisable or necessary. The bias of an estimator does not (and should not) depend on the level of correlation of the data. In particular, the estimate of the mean of a time series is asymptotically independent of the correlation time of the data. For correlated time series, knowledge of the autocorrelation times is sufficient to obtain an asymptotically unbiased estimate of the variance of estimates (https://doi.org/10.1021/ct0502864). Determining the subsampling level requires knowledge of the autocorrelation time of the data anyway. Yes, block bootstrapping is a good practical alternative for complex multi-dimensional datasets. However, it requires varying the size of the blocks to match the correlation time. |
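For reference, the autocorrelation-corrected variance estimate mentioned above can be sketched as follows. This is a simplified, illustrative implementation (it truncates the autocorrelation sum at the first non-positive term; `pymbar.timeseries` provides a more careful estimator):

```python
import random
import statistics

def statistical_inefficiency(xs):
    """Estimate g = 1 + 2 * sum_t C(t), truncating the normalized
    autocorrelation C(t) at its first non-positive value."""
    n = len(xs)
    mu = statistics.fmean(xs)
    d = [x - mu for x in xs]
    c0 = sum(v * v for v in d) / n
    g = 1.0
    for t in range(1, n // 2):
        ct = sum(d[i] * d[i + t] for i in range(n - t)) / ((n - t) * c0)
        if ct <= 0.0:
            break
        g += 2.0 * ct
    return g

def corrected_sem(xs):
    """SEM for correlated data: sqrt(var(x) * g / N), i.e. N_eff = N / g."""
    g = statistical_inefficiency(xs)
    return (statistics.pvariance(xs) * g / len(xs)) ** 0.5

# Example on a correlated AR(1) series, where the exact statistical
# inefficiency is (1 + phi) / (1 - phi) = 19 for phi = 0.9:
rng = random.Random(0)
xs, x = [], 0.0
for _ in range(20000):
    x = 0.9 * x + rng.gauss(0.0, 1.0)
    xs.append(x)
print(statistical_inefficiency(xs), corrected_sem(xs))
```

The point of the comment above is that once g is known, the variance of the mean can be corrected directly (var(x) * g / N) without discarding any samples.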
Eh, I think this is a little harsh. Subsampling (if it works properly and there is enough data compared to the autocorrelation time) does potentially add noise to the value itself, but always within the statistical error. However, you do potentially lose some information, especially if the autocorrelation time is poorly calculated. Better would be to estimate with either all the data, or data sampled with something like tau/10 (that's a handwave number, I haven't done the experiments yet). Agreed that subsampling does not change the bias. It's often complicated to incorporate correlation time into estimators, especially if there are multiple timeseries involved, whereas subsampling allows one to use the uncorrelated error estimates. Subsampling + bootstrapping makes all error estimates trivial if you can afford it. Block bootstrapping (which I think could be replaced with "bootstrapping with different effective taus", though I haven't tested it) is a good, effective way to determine correlation time. To some extent, we are looking at 4-5 method variations that all give correct, equivalent answers in the limit of large N and simulation time >> autocorrelation time, and looking for what fails least badly when we are not in those scenarios. So we probably need to be more specific about what "fails least badly" means in order to decide what to recommend. |
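A moving-block bootstrap of the kind discussed here can be sketched in a few lines. This is illustrative only; the block length and replicate count are assumptions to be tuned against the correlation time:

```python
import random
import statistics

def block_bootstrap_sem(xs, block_len, n_boot=100, seed=1):
    """Moving-block bootstrap SEM: resample whole contiguous blocks so
    that correlation *within* a block is preserved in each resample."""
    rng = random.Random(seed)
    n = len(xs)
    n_blocks = max(1, n // block_len)
    starts = range(n - block_len + 1)
    boot_means = []
    for _ in range(n_boot):
        resample = []
        for _ in range(n_blocks):
            s = rng.choice(starts)
            resample.extend(xs[s:s + block_len])
        boot_means.append(statistics.fmean(resample))
    return statistics.stdev(boot_means)

# On correlated data, block_len = 1 (ordinary bootstrap) suppresses the
# error estimate; growing block_len past the correlation time recovers it.
rng = random.Random(0)
xs, x = [], 0.0
for _ in range(5000):
    x = 0.9 * x + rng.gauss(0.0, 1.0)
    xs.append(x)
print(block_bootstrap_sem(xs, 1), block_bootstrap_sem(xs, 100))
```

Varying block_len and looking for a plateau in the estimate is the usual way to pick the block size, matching the point above that the blocks must be sized to the correlation time.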
Thanks for the comments.

Quick Analysis
I've done a quick analysis of this, and of how subsampling affects variance in general:
Summary: It seems like subsampling according to a perfectly estimated [...]

Code: subsampling.tar.gz

Possible Recommendations

These possible recommendations reflect my understanding. Please let me know if I'm misunderstanding:
[...] more theoretically justified than estimating both the mean and the uncertainty from the subsampled data. Practically, this makes the mean estimate more robust in cases where [...]
If these make sense, I'm happy to draft some minor changes to the manuscript. However, I would also be very happy if someone more qualified would like to do this. Thanks. |
v1.0 states that "Most estimators require an uncorrelated set of samples from the equilibrium distribution to produce (relatively) unbiased estimates of the free energy difference and its statistical uncertainty." A discussion of subsampling is then given and the checklist states: "Make sure you subsample the data in your free energy estimation protocol".
My questions are:
Is "Most estimators require an uncorrelated set of samples from the equilibrium distribution to produce (relatively) unbiased estimates of the free energy difference" true?
Is "Most estimators require an uncorrelated set of samples from the equilibrium distribution to produce (relatively) unbiased estimates of statistical uncertainty" true?
Should subsampling be recommended?
I've already started a discussion of this in choderalab/pymbar#545, but wanted to raise it here as I found this confusing when I first read the article. My understanding, given in more detail in the PyMBAR issue, is:
This is not generally true. For example, when discussing bridge sampling, Gelman and Meng, 1998 state "the answer [the optimal weighting function] is easily obtained when we have independent draws from both $p_0$ and $p_1$; although this assumption is typically violated in practice, it permits useful theoretical explorations and in fact the optimal estimator obtained under this assumption performs rather well in general" (e.g. when we do have correlated samples).
This is true by definition when using estimates derived for uncorrelated samples, such as Equation 4.2 of Kong et al., 2003 for MBAR, but a better approach might be to use an uncertainty estimate which allows for correlation, such as the asymptotic estimates from Li et al., 2023, or block bootstrapping such as in Tan, Gallicchio et al., 2012. Alternatively, to keep the simple and fast uncorrelated-data asymptotic estimate, could the mean be estimated from the unsubsampled data, while the uncertainty is estimated from the subsampled data (and a slight increase in the uncertainty of the uncertainty tolerated)?
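The "mean from all data, uncertainty from subsampled data" option could look like this in practice. This is a sketch with a synthetic series; the statistical inefficiency g is taken as known here, whereas in practice it would be estimated, e.g. with pymbar.timeseries:

```python
import random
import statistics

# Synthetic correlated AR(1) series standing in for simulation output.
rng = random.Random(0)
xs, x = [], 0.0
for _ in range(50000):
    x = 0.9 * x + rng.gauss(0.0, 1.0)
    xs.append(x)

g = 19                    # statistical inefficiency (assumed known here)
stride = int(round(g))

# Mean from ALL samples: no information is discarded.
mean_full = statistics.fmean(xs)

# Uncertainty from the subsampled, (nearly) uncorrelated series, so the
# simple independent-sample SEM formula applies.
sub = xs[::stride]
sem_sub = statistics.stdev(sub) / len(sub) ** 0.5

print(f"mean = {mean_full:.3f} +/- {sem_sub:.3f}")
```

The mean keeps the full effective sample size, while the error bar pays only a modest penalty (the "uncertainty of the uncertainty" mentioned above) from being computed on ~N/g points.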
Subsampling increases the variance of the mean estimate and the variance of the variance estimate, and isn't helpful unless the cost of storing/using samples is non-negligible (Geyer, 1992 (Section 3.6)); i.e. correlated samples still contain additional information (just less than uncorrelated samples), and discarding them is a waste of information.
Maybe @mrshirts @jchodera @ppxasjsm @egallicc can comment? Even if I'm misunderstanding, it would be great to add some more references to clarify things.
Thanks.