*ERROR* Error when loading samples: The sum of logpriors in the sample is not consistent. #306

Closed
mishakb opened this issue Aug 1, 2023 · 17 comments · Fixed by #378

Comments

@mishakb

mishakb commented Aug 1, 2023

I get this error when I try to resume a job. I was able to resume it at least once, but this second time it fails with the error below. I tried several times and got the same message. This is the content of the job.out file:

[0 : output] Found existing info files with the requested output prefix: 'results/ow0waCDM_all'
[0 : output] Let's try to resume/load.
[2 : jax._src.xla_bridge] Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
[2 : jax._src.xla_bridge] Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
[2 : jax._src.xla_bridge] Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
[2 : jax._src.xla_bridge] *WARNING* No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
[0 : jax._src.xla_bridge] Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
[0 : jax._src.xla_bridge] Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
[0 : jax._src.xla_bridge] Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
[0 : jax._src.xla_bridge] *WARNING* No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
[1 : jax._src.xla_bridge] Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
[1 : jax._src.xla_bridge] Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
[1 : jax._src.xla_bridge] Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
[1 : jax._src.xla_bridge] *WARNING* No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
[3 : jax._src.xla_bridge] Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
[3 : jax._src.xla_bridge] Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
[3 : jax._src.xla_bridge] Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
[3 : jax._src.xla_bridge] *WARNING* No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
[0 : output] Found an old sample. Resuming.
[0 : prior] *WARNING* External prior 'SZ' loaded. Mind that it might not be normalized!
[0 : camb] `camb` module loaded successfully from /global/cfs/cdirs/desicollab/users/adematti/perlmutter/cosmodesiconda/20221205-1.0.0/conda/lib/python3.10/site-packages/camb
[0 : StandardCompressionObservable] Found quantities ['DM_over_rd', 'DH_over_rd', 'fsigmar'].
[0 : StandardCompressionObservable] Found quantities ['DM_over_rd', 'DH_over_rd', 'fsigmar'].
[0 : StandardCompressionObservable] Found quantities ['DM_over_rd', 'DH_over_rd', 'fsigmar'].
[0 : StandardCompressionObservable] Found quantities ['DM_over_rd', 'DH_over_rd', 'fsigmar'].
[0 : StandardCompressionObservable] Found quantities ['fsigmar', 'DV_over_rd'].
[0 : planck_2018_highl_plik.ttteee] `clik` module loaded successfully from /global/cfs/cdirs/desicollab/science/cpe/perlmutter/cosmodesiconda/20221205-1.0.0/cobaya/code/planck/code/plc_3.0/plc-3.1/lib/python/site-packages/clik
[0 : planck_2018_lensing.clik] `clik` module loaded successfully from /global/cfs/cdirs/desicollab/science/cpe/perlmutter/cosmodesiconda/20221205-1.0.0/cobaya/code/planck/code/plc_3.0/plc-3.1/lib/python/site-packages/clik
[0 : mcmc] Resuming from previous sample!
[0 : prior] *WARNING* There are unbounded parameters (['A_planck', 'calib_100T', 'calib_217T', 'gal545_A_100', 'gal545_A_143', 'gal545_A_143_217', 'gal545_A_217', 'galf_TE_A_100', 'galf_TE_A_100_143', 'galf_TE_A_100_217', 'galf_TE_A_143', 'galf_TE_A_143_217', 'galf_TE_A_217', 'DES_DzL1', 'DES_DzL2', 'DES_DzL3', 'DES_DzL4', 'DES_DzL5', 'DES_DzS1', 'DES_DzS2', 'DES_DzS3', 'DES_DzS4', 'DES_m1', 'DES_m2', 'DES_m3', 'DES_m4']). Prior bounds are given at 0.9999995 confidence level. Beware of likelihood modes at the edge of the prior
[1 : samplecollection] Loaded 990 sample points from 'results/ow0waCDM_all.2.txt'
[2 : samplecollection] Loaded 1011 sample points from 'results/ow0waCDM_all.3.txt'
[0 : samplecollection] Loaded 1079 sample points from 'results/ow0waCDM_all.1.txt'
[3 : samplecollection] Loaded 1084 sample points from 'results/ow0waCDM_all.4.txt'
[0 : samplecollection] *ERROR* The sum of logpriors in the sample is not consistent.
[0 : samplecollection] *ERROR* Error when loading samples: The sum of logpriors in the sample is not consistent.
[1 : mcmc] Initial point: ombh2:0.02261121, omch2:0.1181356, H0:69.84661, logA:3.045463, ns:0.971925, omk:-0.0009965226, w:-0.9430333, wa:-0.4804295, tau:0.05782186, A_planck:1.001588, calib_100T:0.9993421, calib_217T:0.9989519, A_cib_217:51.14609, xi_sz_cib:0.3915068, A_sz:4.471394, ksz_norm:3.81948, gal545_A_100:7.050906, gal545_A_143:13.35773, gal545_A_143_217:18.55076, gal545_A_217:94.86781, ps_A_100_100:319.5084, ps_A_143_143:37.68866, ps_A_143_217:35.88573, ps_A_217_217:105.5367, galf_TE_A_100:0.128669, galf_TE_A_100_143:0.1368194, galf_TE_A_100_217:0.4279111, galf_TE_A_143:0.2070875, galf_TE_A_143_217:0.6186202, galf_TE_A_217:1.842039, DES_DzL1:0.004783368, DES_DzL2:-0.003013851, DES_DzL3:0.0008851392, DES_DzL4:0.004369828, DES_DzL5:0.003481381, DES_b1:1.477709, DES_b2:1.738489, DES_b3:1.620947, DES_b4:1.962905, DES_b5:2.061378, DES_DzS1:0.003615505, DES_DzS2:-0.02467024, DES_DzS3:0.02731843, DES_DzS4:-0.05860599, DES_m1:0.04670242, DES_m2:0.01681293, DES_m3:-0.003576742, DES_m4:0.01273669, DES_AIA:0.6885432, DES_alphaIA:-0.008803587
[2 : mcmc] Initial point: ombh2:0.02244094, omch2:0.1181104, H0:67.736, logA:3.040276, ns:0.9709289, omk:-0.0006007525, w:-0.7968705, wa:-0.7655362, tau:0.05525625, A_planck:0.9993413, calib_100T:0.9996387, calib_217T:0.9981446, A_cib_217:44.63784, xi_sz_cib:0.3801157, A_sz:6.045817, ksz_norm:5.675437, gal545_A_100:6.166859, gal545_A_143:10.48994, gal545_A_143_217:10.14862, gal545_A_217:76.86048, ps_A_100_100:239.4902, ps_A_143_143:31.59264, ps_A_143_217:40.76925, ps_A_217_217:121.607, galf_TE_A_100:0.1155986, galf_TE_A_100_143:0.1540269, galf_TE_A_100_217:0.544674, galf_TE_A_143:0.2837667, galf_TE_A_143_217:0.7849412, galf_TE_A_217:2.363021, DES_DzL1:0.003117714, DES_DzL2:0.002392154, DES_DzL3:0.002103641, DES_DzL4:-0.00591887, DES_DzL5:-0.008232313, DES_b1:1.440227, DES_b2:1.685149, DES_b3:1.630987, DES_b4:1.979471, DES_b5:2.105889, DES_DzS1:-0.004751653, DES_DzS2:-0.0317832, DES_DzS3:-0.0001454839, DES_DzS4:-0.03830876, DES_m1:0.003314337, DES_m2:-0.005635238, DES_m3:-0.02677006, DES_m4:0.02435357, DES_AIA:0.521304, DES_alphaIA:-1.325487
[3 : mcmc] Initial point: ombh2:0.02253404, omch2:0.1177752, H0:66.68511, logA:3.058207, ns:0.9679595, omk:-0.001886929, w:-0.8575862, wa:-0.4354515, tau:0.06036522, A_planck:1.003104, calib_100T:0.9999663, calib_217T:0.9988349, A_cib_217:51.09076, xi_sz_cib:0.3083462, A_sz:3.599204, ksz_norm:7.452705, gal545_A_100:7.437093, gal545_A_143:12.5047, gal545_A_143_217:16.44311, gal545_A_217:88.90734, ps_A_100_100:245.5505, ps_A_143_143:31.21603, ps_A_143_217:24.33498, ps_A_217_217:100.7772, galf_TE_A_100:0.0861488, galf_TE_A_100_143:0.1955448, galf_TE_A_100_217:0.509976, galf_TE_A_143:0.3648059, galf_TE_A_143_217:0.7208691, galf_TE_A_217:1.722586, DES_DzL1:0.00756885, DES_DzL2:-0.01112213, DES_DzL3:-0.002029036, DES_DzL4:-0.0009077926, DES_DzL5:-0.008044257, DES_b1:1.473136, DES_b2:1.710627, DES_b3:1.674257, DES_b4:1.994127, DES_b5:2.184813, DES_DzS1:-0.0211984, DES_DzS2:-0.008531034, DES_DzS3:-0.003726577, DES_DzS4:-0.0205448, DES_m1:-0.02914757, DES_m2:-0.02931022, DES_m3:-0.005824037, DES_m4:-0.01277812, DES_AIA:0.3168567, DES_alphaIA:2.935892
[0 : run] Aborting MPI due to error
----
clik version plc_3.1
  smica
Checking likelihood '/global/cfs/cdirs/desi/science/cpe/perlmutter/cosmodesiconda/20221205-1.0.0/cobaya/data/planck_2018/baseline/plc_3.0/hi_l/plik/plik_rd12_HM_v22b_TTTEEE.clik' on test data. got -1172.47 expected -1172.47 (diff -4.34054e-07)
----
Checking lensing likelihood '/global/cfs/cdirs/desi/science/cpe/perlmutter/cosmodesiconda/20221205-1.0.0/cobaya/data/planck_2018/baseline/plc_3.0/lensing/smicadx12_Dec5_ftl_mv2_ndclpp_p_teb_consext8.clik_lensing' on test data. got -4.42102
@cmbant
Collaborator

cmbant commented Aug 1, 2023

This looks similar to the temperature-checking issue that was fixed, and probably comes from the recent temperature-related changes. One for @JesusTorrado to check when back.

As a workaround, you can just comment out these checks.

@mishakb
Author

mishakb commented Aug 2, 2023

@cmbant, sure. Do you know where I could find and comment that out? Thanks.

@cmbant
Collaborator

cmbant commented Aug 2, 2023

Just search for the error message (The sum of logpriors in the sample is not consist)
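
A minimal sketch of that search in Python, for readers who do not have `grep` at hand. The helper name `find_message` is hypothetical and not part of Cobaya; it simply scans `.py` files under a directory for a text fragment. Since a long message may be split across several source lines, searching for a short fragment is probably safer than searching for the full sentence.

```python
from pathlib import Path


def find_message(root, fragment):
    """Return (path, line_number, line) for every line under `root` containing `fragment`.

    Hypothetical helper for illustration only; it is not part of Cobaya.
    """
    hits = []
    for py_file in sorted(Path(root).rglob("*.py")):
        try:
            text = py_file.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file: skip it
        for line_no, line in enumerate(text.splitlines(), start=1):
            if fragment in line:
                hits.append((py_file, line_no, line.strip()))
    return hits


# Example usage, assuming Cobaya is importable in the current environment:
#
#   import cobaya
#   for path, line_no, line in find_message(Path(cobaya.__file__).parent,
#                                           "sum of logpriors"):
#       print(f"{path}:{line_no}: {line}")
```

The traceback later in this thread points at `cobaya/collection.py`, so that file is the likely place to look first.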

@Uendert

Uendert commented Aug 10, 2023

Hi @cmbant,

I have a similar issue and I would like to confirm if it is safe to deactivate the following check as well:

    self.collection = SampleCollection(
  File "/global/common/software/desi/users/adematti/perlmutter/cosmodesiconda/20230725-1.0.0/conda/lib/python3.10/site-packages/cobaya/collection.py", line 289, in __init__
    raise LoggedError(
cobaya.log.LoggedError: Error when loading samples: The sample seems to have an inconsistent temperature.

@cmbant
Collaborator

cmbant commented Sep 25, 2023

The temperature error should be fixed/worked around in latest Cobaya master - were you using that?

@JesusTorrado, have you had a chance to look at a fix for all these new read-accuracy errors?

@JesusTorrado
Contributor

Not yet. I was doing some I/O experiments. I'll get to it very soon!

@mishakb
Author

mishakb commented Sep 25, 2023 via email

@JesusTorrado
Contributor

@mishakb could you please check if the new branch fix_post_prior_test fixes your issue?

The easiest way is to install with pip from that branch with

pip install git+https://github.com/CobayaSampler/cobaya.git@fix_post_prior_test

@JesusTorrado
Contributor

Probably fixed by #322. Please reopen if it can still be reproduced.

@SukanB

SukanB commented Aug 14, 2024

Hello,
I am getting the following error, related to inconsistent temperature and tolerances, in one of my Cobaya runs.

2024-08-07 14:24:51,806 [0 : samplecollection] ERROR The sample seems to have an inconsistent temperature.
2024-08-07 14:24:51,806 [0 : samplecollection] WARNING Needed to relax tolerances when checking consistency of log probabilities and temperature (if present).
2024-08-07 14:24:51,808 [0 : samplecollection] ERROR The sample seems to have an inconsistent temperature.
2024-08-07 14:24:51,808 [0 : samplecollection] ERROR Error when loading samples: The sample seems to have an inconsistent temperature.

Is it related to this issue? Can it be solved also by installing with the following?
pip install git+https://github.com/CobayaSampler/cobaya.git@fix_post_prior_test

@cmbant
Collaborator

cmbant commented Aug 14, 2024

I think that's already merged. Can you attach chains/code to reproduce the issue?

@cmbant
Collaborator

cmbant commented Aug 16, 2024

@SukanB can you share the files?

@SukanB

SukanB commented Aug 19, 2024

Hi @cmbant, thanks for your response. I only made changes in the file classy/source/background.c, to modify the existing scalar field potential for dark energy. I attach the modified code and the output file here. Also, please note that this happens after I resume a previous run that had stopped.
ftoutput.txt
backgroundft.txt

@cmbant
Collaborator

cmbant commented Aug 19, 2024

Thanks, but could you attach a zip of the actual offending chain files (FTPLDU/ftpdu*)?

@cmbant
Collaborator

cmbant commented Aug 19, 2024

@SukanB or email directly if you don't want it public

@cmbant
Collaborator

cmbant commented Aug 22, 2024

Thanks for emailing the file. OK, so the temperature thing is a bit of a red herring: the issue is that the last line of the chain files does not have a complete set of columns, and is hence filled with NaN when loaded into the collection (presumably because a walltime kill happened during a file write or before a flush).
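
A rough sketch of how one might check a chain file for this kind of truncated last row before resuming. This is not the fix in the linked pull request. It assumes the chain is a plain-text file with `#`-prefixed header lines and whitespace-separated columns, and both function names are hypothetical. Note that it only catches rows with the wrong number of columns; a row cut off in the middle of its final number would go undetected.

```python
import shutil
from pathlib import Path


def find_incomplete_last_row(path):
    """Return the 1-based line number of the last data row if its column
    count differs from that of the first data row, else None.

    Assumes '#'-prefixed header lines and whitespace-separated columns.
    """
    lines = Path(path).read_text().splitlines()
    data = [
        (line_no, line.split())
        for line_no, line in enumerate(lines, start=1)
        if line.strip() and not line.lstrip().startswith("#")
    ]
    if len(data) < 2:
        return None  # nothing to compare against
    expected = len(data[0][1])
    last_no, last_fields = data[-1]
    return last_no if len(last_fields) != expected else None


def drop_incomplete_last_row(path, backup_suffix=".bak"):
    """Remove a truncated last row in place, keeping a backup copy.

    Returns True if a row was removed, False if the file looked fine.
    """
    path = Path(path)
    bad_line = find_incomplete_last_row(path)
    if bad_line is None:
        return False
    # Keep the original next to the chain file before modifying it.
    shutil.copy2(path, path.with_name(path.name + backup_suffix))
    lines = path.read_text().splitlines(keepends=True)
    path.write_text("".join(lines[: bad_line - 1]))
    return True
```

Running `find_incomplete_last_row` on each chain file of the run (for example `results/ow0waCDM_all.1.txt` to `.4.txt` in the first report) would show whether any of them ends in a short row. Dropping that row loses at most one sample point per chain.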

@cmbant
Collaborator

cmbant commented Aug 22, 2024

@SukanB, can you try #378?

@cmbant cmbant linked a pull request Aug 22, 2024 that will close this issue