Should Subsampling be Recommended? #156

Open · fjclark opened this issue Jan 10, 2025 · 6 comments

@fjclark (Collaborator) commented Jan 10, 2025

v1.0 states that "Most estimators require an uncorrelated set of samples from the equilibrium distribution to produce (relatively) unbiased estimates of the free energy difference and its statistical uncertainty." A discussion of subsampling is then given and the checklist states: "Make sure you subsample the data in your free energy estimation protocol".
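
For concreteness, here is a minimal numpy-only sketch of what that subsampling step typically looks like in practice (my own illustration, not code from the article or from pymbar; `x` is assumed to be a 1D array of correlated samples, e.g. reduced potential differences or dH/dλ values, and truncating the autocorrelation sum at the first negative lag is just one common heuristic):

```python
import numpy as np

def statistical_inefficiency(x):
    """Estimate g = 1 + 2*tau_int from the normalized autocorrelation function,
    truncating the sum at the first negative lag (a simple, common heuristic)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    dx = x - x.mean()
    var = dx.dot(dx) / n
    g = 1.0
    for t in range(1, n):
        c_t = dx[:-t].dot(dx[t:]) / ((n - t) * var)   # normalized ACF at lag t
        if c_t <= 0.0:
            break
        g += 2.0 * (1.0 - t / n) * c_t
    return max(g, 1.0)

def subsample_indices(x):
    """Indices of approximately uncorrelated frames: every ceil(g)-th frame."""
    stride = int(np.ceil(statistical_inefficiency(x)))
    return np.arange(0, len(x), stride)
```

The retained frames are then fed to the estimator as if they were independent, which is exactly the practice questioned below.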

My questions are:

  1. Is "Most estimators require an uncorrelated set of samples from the equilibrium distribution to produce (relatively) unbiased estimates of the free energy difference" true?

  2. Is "Most estimators require an uncorrelated set of samples from the equilibrium distribution to produce (relatively) unbiased estimates of statistical uncertainty" true?

  3. Should subsampling be recommended?

I've already started a discussion of this in choderalab/pymbar#545, but wanted to raise it here as I found this confusing when I first read the article. My understanding, given in more detail in the PyMBAR issue, is:

  1. This is not generally true. For example, when discussing bridge sampling, Gelman and Meng, 1998 state "the answer [the optimal weighting function] is easily obtained when we have independent draws from both $p_0$ and $p_1$; although this assumption is typically violated in practice, it permits useful theoretical explorations and in fact the optimal estimator obtained under this assumption performs rather well in general" (e.g. when we do have correlated samples).

  2. This is true by definition when using estimates derived for uncorrelated samples, such as Equation 4.2 of Kong et al., 2003 for MBAR, but a better approach might be to use an uncertainty estimate which allows for correlation, such as the asymptotic estimates from Li et al., 2023, or block bootstrapping as in Tan, Gallicchio et al., 2012. Alternatively, to keep the simple and fast uncorrelated-data asymptotic estimate, could the mean be estimated from the unsubsampled data, while the uncertainty is estimated from the subsampled data (and a slight increase in the uncertainty of the uncertainty tolerated)?

  3. Subsampling increases the variance of the mean estimate and the variance of the variance estimate, and isn't helpful unless the cost of storing/using samples is non-negligible (Geyer, 1992, Section 3.6); i.e. correlated samples still contain additional information (just less than uncorrelated samples), and discarding them is a waste of information (see the toy numerical check sketched just below this list).
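
To make point 3 concrete, here is a toy numerical check (my own, and separate from the analysis code attached later in this thread; it assumes numpy and scipy are available): the variance of the mean of an AR(1) series estimated from all frames versus from frames subsampled at interval $g$.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
phi, n, n_reps = 0.99, 100_000, 500        # AR(1): x_t = phi * x_{t-1} + noise_t
g = (1 + phi) / (1 - phi)                  # exact statistical inefficiency (= 199)
stride = int(np.ceil(g))

means_all, means_sub = [], []
for _ in range(n_reps):
    x = lfilter([1.0], [1.0, -phi], rng.standard_normal(n))   # one correlated series
    means_all.append(x.mean())             # mean from every (correlated) frame
    means_sub.append(x[::stride].mean())   # mean from ~uncorrelated frames only

print("variance of the mean, all frames :", np.var(means_all, ddof=1))
print("variance of the mean, subsampled :", np.var(means_sub, ddof=1))
```

With these settings the subsampled estimate should come out roughly 30% noisier (up to Monte Carlo noise), consistent with the 1.31 factor discussed further down the thread.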

Maybe @mrshirts @jchodera @ppxasjsm @egallicc can comment? Even if I'm misunderstanding, it would be great to add some more references to clarify things.

Thanks.

@ppxasjsm (Contributor)

Hi @fjclark,

I haven't thought about this in a while and I may misremember. My understanding is that, asymptotically, uncorrelated samples are needed for the mean and uncertainty to converge, at least for WHAM and MBAR. I haven't looked at other estimators in detail.

This is true by definition when using estimates derived for uncorrelated samples, such as Equation 4.2 of Kong et al., 2003 for MBAR, but a better approach might be to use an uncertainty estimate which allows for correlation, such as the asymptotic estimates from Li et al., 2023, or block bootstrapping as in Tan, Gallicchio et al., 2012. Alternatively, to keep the simple and fast uncorrelated-data asymptotic estimate, could the mean be estimated from the unsubsampled data, while the uncertainty is estimated from the subsampled data (and a slight increase in the uncertainty of the uncertainty tolerated)?

From practical experience, I have usually opted for no subsampling for the mean and subsampling for the uncertainty. I have compared subsampled and non-subsampled means from real simulation data at some point, and if there are insufficient samples, subsampling can make things worse. Maybe it just means we should sample more? Then I would typically favour different replicas over estimated uncertainties.

I can take a look at references again to look at more recent papers on this.

@fjclark (Collaborator, Author) commented Jan 20, 2025

Hi @ppxasjsm,

Thanks very much for the comments.

My understanding is that, asymptotically, uncorrelated samples are needed for the mean and uncertainty to converge, at least for WHAM and MBAR.

From appendix A of Shirts and Chodera, 2008:

While estimating equations based on Eq. 6 can be applied to correlated or uncorrelated datasets, provided that the empirical estimator remains asymptotically unbiased, the asymptotic covariance matrix estimator (6) only produces sensible estimates when applied to uncorrelated datasets.

My understanding from this is that subsampling is not necessarily required for the mean to converge, and that it is only required for the uncertainty because of the use of the estimator from Kong et al., 2003 (and that it will slow the convergence of both the mean and uncertainty estimates due to the reduction in effective sample size).

From practical experience, I have usually opted for no subsampling for the mean and subsampling for the uncertainty. I have compared subsampled and non-subsampled means from real simulation data at some point, and if there are insufficient samples, subsampling can make things worse. Maybe it just means we should sample more? Then I would typically favour different replicas over estimated uncertainties.

I also use no subsampling for the mean and subsampling for the uncertainty, but this seems to contrast with the tick-box recommendation in the article: "Make sure you subsample the data in your free energy estimation protocol". Should this be updated? My understanding, which seems to match my quick test here, is that subsampling always increases the variance of the mean, but the issue becomes particularly pronounced when there are very few effective samples. I completely agree that more sampling/comparing multiple replicas is best, but even then subsampling would increase the variance of the mean estimate.
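
For what it's worth, a minimal sketch of that protocol for a generic time-series observable (my own illustration; `g` is the statistical inefficiency estimated however you like, e.g. as in the sketch near the top of the thread, and the plain mean stands in for whatever is actually being estimated):

```python
import numpy as np

def mean_and_error(x, g):
    """Point estimate from every (correlated) frame; standard error from the
    approximately uncorrelated subset obtained by striding with ceil(g)."""
    x = np.asarray(x, dtype=float)
    x_sub = x[::int(np.ceil(max(g, 1.0)))]          # ~uncorrelated frames
    return x.mean(), x_sub.std(ddof=1) / np.sqrt(len(x_sub))
```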

I can take a look at references again to look at more recent papers on this.

Thanks!

@mrshirts (Collaborator)

I am generally in agreement with this, though I don't know that I have perfect data. Using all the data will allow you to recover some information from the partially correlated samples. I don't know that I've ever seen a great analysis of exactly how much. For example, if you have 1,000,000 samples and 1000 uncorrelated samples, I bet you could reclaim essentially all of the missing information with, say, 10,000 samples. I have never studied this, though.

Bootstrap samples must be uncorrelated, though, or you suppress the variance.
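
A quick toy check of that point (my own example, assuming numpy and scipy are available): naively bootstrapping individual frames of a correlated AR(1) series reproduces the uncorrelated-data error, which is far too small.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(1)
phi, n = 0.99, 100_000
x = lfilter([1.0], [1.0, -phi], rng.standard_normal(n))        # AR(1) series, g = 199

# Naive bootstrap: resampling single frames destroys the correlation structure.
naive_se = np.std([rng.choice(x, n, replace=True).mean() for _ in range(500)], ddof=1)

# Correlation-aware standard error: sqrt(g / N) * sigma, with g = (1 + phi) / (1 - phi).
aware_se = np.sqrt((1 + phi) / (1 - phi) / n) * x.std(ddof=1)

print(f"naive bootstrap SE: {naive_se:.4f}   correlation-aware SE: {aware_se:.4f}")
```

The naive bootstrap standard error comes out smaller by roughly a factor of $\sqrt{g}$.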

@egallicc (Contributor)

I agree with all the comments here. Subsampling throws away data and, in my opinion, is not generally advisable or necessary.

The bias of an estimator does/should not depend on the level of correlation of the data. In particular, the estimate of the mean of a time series is asymptotically independent of the correlation time of the data.

For correlated time series, knowledge of the autocorrelation times is sufficient to obtain an asymptotically unbiased estimate of the variance of estimates (https://doi.org/10.1021/ct0502864). Determining the subsampling level requires knowledge of the autocorrelation time of the data anyway.
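
In equation form (this is the standard result behind that statement; notation mine), the variance of a sample mean over $N$ correlated frames is

$$
\operatorname{Var}(\bar{x}) \;\approx\; \frac{\sigma^2}{N}\left[1 + 2\sum_{t=1}^{N-1}\left(1 - \frac{t}{N}\right)\rho(t)\right] \;=\; \frac{g\,\sigma^2}{N} \;=\; \frac{\sigma^2}{N_\mathrm{eff}},
$$

where $\rho(t)$ is the normalized autocorrelation function, $g = 1 + 2\tau_\mathrm{int}$ is the statistical inefficiency, and $N_\mathrm{eff} = N/g$, so once $g$ (equivalently $\tau_\mathrm{int}$) is known, the variance can be corrected without discarding any frames.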

Yes, block bootstrapping was a good practical alternative for complex multi-dimensional datasets. However, it requires varying the size of the blocks to match the correlation time.
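
For completeness, a minimal circular block bootstrap sketch (my own illustration, not the implementation from Tan, Gallicchio et al., 2012; the block length is left as a parameter precisely because, as noted above, it has to be varied against the correlation time):

```python
import numpy as np

def circular_block_bootstrap_error(x, block, n_boot=1000, seed=0):
    """Standard error of the mean via a circular block bootstrap: resample whole
    blocks (wrapping around the end of the series) so within-block correlation is
    preserved; `block` should be varied and chosen to exceed the correlation time."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    n_blocks = int(np.ceil(n / block))
    estimates = np.empty(n_boot)
    for i in range(n_boot):
        starts = rng.integers(0, n, size=n_blocks)         # random block start indices
        idx = (starts[:, None] + np.arange(block)) % n     # circular wrap-around
        estimates[i] = x[idx.ravel()[:n]].mean()           # trim to the original length
    return estimates.std(ddof=1)
```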

@mrshirts (Collaborator) commented Jan 21, 2025

I agree with all the comments here. Subsampling throws away data and, in my opinion, is not generally advisable or necessary.

Eh, I think this is a little harsh. Subsampling (if it works properly and there is enough data compared to the autocorrelation time) does potentially add noise to the value itself, but always within the statistical error. However, you do potentially lose some information, especially if the autocorrelation time is poorly calculated. Better would be to estimate with either all the data or data sampled with something like tau/10 (that's a handwave number, I haven't done the experiments yet). Agreed that subsampling does not change the bias.

It's often complicated to incorporate correlation time into estimators, especially if there are multiple time series involved, whereas subsampling allows one to use the uncorrelated error estimates. Subsampling + bootstrapping makes all error estimates trivial if you can afford it. Block bootstrapping (which I think could be replaced with "bootstrapping with different effective taus", though I haven't tested it) is a good, effective way to determine correlation time.

To some extent, we are looking at the 4-5 method variations that all give correct, equivalent answers in the limit of large N and simulation time >> autocorrelation time, and looking for what fails the least badly when we are not in those scenarios. So we probably need to be more specific about what "fails least badly" means in order to decide what to recommend.

@fjclark (Collaborator, Author) commented Feb 12, 2025

Thanks for the comments.

Quick Analysis

if you have 1,000,000 samples and have 1000 uncorrelated samples, I bet you could reclaim essentially all of the missing information with, say, 10,000 samples.

I've done a quick analysis of this, and of how subsampling affects variance in general:

  • With purely exponential autocorrelation, my understanding is that subsampling according to a perfectly estimated statistical inefficiency, $g$, increases the variance of the mean by a factor of 1.31. My working is here.
  • I've taken a purely exponential correlation function and a more realistic one with a faster initial decay, fit to gradients from an absolute binding free energy calculation using Geyer's initial convex sequence estimator. I've scaled these so that $g$ is 1000 for both functions, so that 1,000,000 samples contain 1000 uncorrelated samples:

[Figure: the two autocorrelation functions (purely exponential, and fit to simulation data), both scaled so that $g$ = 1000]

  • I've plotted the number of effective samples as a function of the number of subsamples drawn from a complete set of 1,000,000 samples. Subsampling has very little effect until the subsampling interval approaches ~$g/10$ (i.e. until only ~10,000 subsamples are kept), as per @mrshirts' comment.

[Figure: number of effective samples as a function of the number of subsamples drawn from the 1,000,000 samples]

  • Equivalently, I've plotted the factor by which the variance of the mean increases as a function of the number of subsamples drawn. Subsampling according to $g$ produces a variance increase factor of 1.31 for the exponential autocorrelation, but 1.87 for the autocorrelation function fit to simulation. This makes sense, as the correlations are "more long range" and less easily reduced by subsampling. I've also plotted the variance increase factor for the case where you start from 1000 uncorrelated samples.

[Figure: factor by which the variance of the mean increases as a function of the number of subsamples drawn]

Summary: It seems like subsampling according to a perfectly estimated $g$, when simulation time >> autocorrelation time, will increase the variance of the mean by a minimum factor of 1.31 (since true autocorrelation functions are expected to have a fast initial drop followed by an exponential tail); a short derivation of this factor is sketched below.
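
For reference, here is where the 1.31 comes from (my own derivation, under the stated assumption of a purely exponential autocorrelation function with $\tau \gg 1$):

$$
\rho(t) = e^{-t/\tau} \;\Rightarrow\; g \approx 2\tau, \qquad \rho(g) \approx e^{-2} \approx 0.135,
$$

so the frames kept after subsampling at interval $g$ are still geometrically correlated, with lag-one correlation $e^{-2}$, giving them a residual statistical inefficiency of

$$
g_\mathrm{sub} = \frac{1 + e^{-2}}{1 - e^{-2}} \approx 1.31, \qquad \frac{\operatorname{Var}(\bar{x}_\mathrm{sub})}{\operatorname{Var}(\bar{x}_\mathrm{all})} = \frac{g_\mathrm{sub}\,\sigma^2/(N/g)}{g\,\sigma^2/N} = g_\mathrm{sub} \approx 1.31.
$$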

Code: subsampling.tar.gz

Possible Recommendations

These possible recommendations reflect my understanding. Please let me know if I'm misunderstanding:

  • Don't subsample when estimating the mean, unless the cost of analysis is non-negligible. If the cost of analysis is significant, subsample with an interval of $g/10$ or less to avoid appreciably increasing the variance.
  • Subsampling is not generally required for uncertainty estimates, but it can allow the use of simpler uncertainty estimators (at the expense of a possibly negligible increase in the variance of the variance estimate). When applying an uncorrelated error estimate to data subsampled according to $g$, the uncertainty estimate applies to the unsubsampled data (and underestimates the variance of the mean estimated from the subsampled data by a factor of at least 1.31). My understanding is that this makes @ppxasjsm's protocol of

I have usually opted for no subsampling for the mean and subsampling for the uncertainty

more theoretically justified than estimating both the mean and the uncertainty from the subsampled data. Practically, this makes the mean estimate more robust in cases where $g$ is large or overestimated.

  • Estimation of uncertainty requires the robust estimation of the autocorrelation time, which can be tricky. This can either be done directly, by estimating the autocorrelation function (e.g. the Chodera ref. from @egallicc), or indirectly, by (overlapping or non-overlapping) block averaging with increasing block sizes (keeping in mind that non-overlapping block averaging is a higher-variance version of overlapping block averaging, which is effectively equivalent to estimating the autocorrelation function with a Bartlett kernel; Meketon and Schmeiser, 1984). A minimal block-averaging sketch is given after this list.
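
As a concrete sketch of the indirect, block-averaging route (my own illustration, non-overlapping blocks for simplicity): compute $g(b) = b \cdot \mathrm{Var}(\text{block means of size } b) / \mathrm{Var}(x)$ for increasing block sizes $b$ and read off the plateau.

```python
import numpy as np

def block_averaging_g(x, n_sizes=30):
    """Estimate the statistical inefficiency g from non-overlapping block averages:
    g(b) = b * Var(block means of size b) / Var(x) plateaus at g once the block
    size b exceeds the correlation time. Returns (block sizes, g(b)) for inspection."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    var = x.var(ddof=1)
    sizes = np.unique(np.logspace(0, np.log10(n // 10), n_sizes).astype(int))
    g_of_b = []
    for b in sizes:
        n_blocks = n // b
        block_means = x[: n_blocks * b].reshape(n_blocks, b).mean(axis=1)
        g_of_b.append(b * block_means.var(ddof=1) / var)
    return sizes, np.array(g_of_b)
```

In practice one would plot $g(b)$ against $b$ and take the plateau value; overlapping blocks reduce the noise of this curve, as the bullet above notes.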

If these make sense, I'm happy to draft some minor changes to the manuscript. However, I would also be very happy if someone more qualified would like to do this.

Thanks.
