
Database changes #179

Closed
s-good opened this issue Jul 29, 2016 · 35 comments

Comments

@s-good
Contributor

s-good commented Jul 29, 2016

Hi @BillMills, just been looking over the database code. Looks really good and will be a big step forward by releasing us from the memory issues! One thing that occurred to me: does the database actually need to contain the data? We could just store the unique ID, the file the profile came from, and the location in the file. Then this information can be used to read the data from file just prior to the QC. That way there wouldn't need to be any modification of the QC tests (apart from maybe the ones that need to use profiles other than the one being tested). This would avoid the database becoming too large (which might cause performance issues?), and it also means a QC test we add in the future could use extra information that is not in the database.

If you think this is a good idea I could try to make some changes over the next few days to implement this.

@bkatiemills
Member

Hi @s-good - that definitely has some advantages, particularly in terms of dodging the complete rebuild of how the qc tests consume data; the only wrinkle is that we don't want to forgo the ability to do database queries to get groups of profiles - say all the profiles on a given track for EN_track, or whatever the criteria were for identifying buddies in buddy checks. Any variable you want to put a condition on in a query needs its own column.

That said, that's not really a problem; we can just tease out the bare minimum number of extra such columns when building the DB, and otherwise do as you suggest. I wouldn't worry about the size of the database; that's all sitting on disk, and furthermore what we've described here amounts to a string, a few ints, and a bunch of bools per profile; a million profile stubs would only be like 10s of MB - nothing really, in this context.
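
To make that concrete, a profile-stub table along these lines might look like the sketch below; this is hypothetical, not the actual schema - column names and connection details are assumptions:

```python
import psycopg2

conn = psycopg2.connect(dbname='autoqc')  # assumed local postgres db
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS profiles (
        uid         INTEGER PRIMARY KEY,  -- unique profile ID
        file        TEXT,                 -- source ascii file
        byte_offset BIGINT,               -- where the profile starts in that file
        cruise      INTEGER,              -- queryable, for grouping tracks
        lat         DOUBLE PRECISION,     -- queryable, for buddy searches
        long        DOUBLE PRECISION
        -- ...plus one boolean column per qc test for results
    );
""")
conn.commit()
```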

So I say let's do it, at least for this version, since it gets us a product ASAP. When you say 'the next few days', does that include your weekend? Because I think I can pull this together by Monday - unless you're keen to jump on it; let me know so we aren't both doing the same thing. After that we'll need to validate that both old and new branches give the same answer on a large-ish number of profiles, then merge, then do some speed tests on AWS so I can hopefully give a better estimate of how long it will take to run the whole production run. None of this should take too long, modulo any substantial changes to track and buddy checks to take advantage of the new database.

@bkatiemills
Member

Actually, one more addendum: we may consider just dumping the raw profile text into a column in the database, and forgo re-reading the ascii files. While this seems like unnecessary duplication, the reason I chose Postgres is that it is very efficient at simultaneous reads and writes; if we pull two rows out of the database in parallel and then have to go extract both profiles from the same text file, that will hurt our parallelization.

So, to summarize: we should have columns for the full raw profile, any individual variables we want to query on, and the qc results; then we can construct a wodpy object out of the raw text found in the db row, and hardly have to rebuild any of the qc tests.

@s-good
Contributor Author

s-good commented Jul 30, 2016

Sounds good! I will struggle to get anywhere over the weekend so better for you to work on it if you are happy to - thanks for doing this. I will try running it at work during this week to get an estimate of how fast it will process there.

@s-good
Contributor Author

s-good commented Jul 30, 2016

Just to add that I don't think wodpy can create a WodProfile object from raw text; I think it only works from a file. Could a WodProfile object be pickled and stored in the database rather than raw text?

@bkatiemills
Member

Alright - the postgres branch now nominally does what we discussed above:

  • the table has columns for lat/long, cruise, uid, raw profile text, true qc flag, and qc from every available test.
  • @s-good, I solved your concern about wodpy ingesting only files by using the tempfile package (a minimal sketch follows this list); it seems to work on cursory inspection, but it's worth checking that there are no collisions or anything similar at scale.
  • postgres and master branches produce the same results on the first 10 profiles in quota_subset for all tests except track and buddy, which need to be fundamentally re-written to capitalize on the new database; @s-good, feel free to take charge of buddy if you have some time, while I deal with track.
  • Tests are still failing, but only for buddy and track (which of course will be the case at the moment).
  • Next steps will be to get things running on all of quota_subset (at a minimum we're still seeing the woa_normbias bug #173 even on master, so this isn't a database problem), ensure everything matches there, then run on a larger collection of profiles (a few thousand), check for consistency there, and then we should be set to run at scale. Also, I guess we'll want some way to extract the postgres table in a useful form afterwards so it doesn't just disappear with our Docker container, but this is a simple matter (easy enough to just dump the qc results to a csv, for example).
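
For reference, the tempfile trick in the second bullet amounts to something like this sketch; it assumes `raw` holds a profile's WOD ascii text as stored in a db row:

```python
import tempfile
from wodpy import wod

def profile_from_text(raw):
    """Rebuild a wodpy profile object from raw ascii text pulled out of the db."""
    with tempfile.NamedTemporaryFile(mode='w+') as f:
        f.write(raw)
        f.seek(0)                 # rewind so wodpy reads from the top of the file
        return wod.WodProfile(f)  # profile is fully parsed before the tempfile closes
```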

@s-good, not exploding the profiles into database columns, but rather just keeping something available to re-create the wodpy profile objects, was a stroke of brilliance! We may consider doing something more intricate later for possible performance improvements, but on the run-up to October that is definitely the right call for now - thanks for the suggestion!

@s-good
Contributor Author

s-good commented Jul 31, 2016

Fantastic! That's really fast work! I will look at the buddy check ASAP. What is the best way to access the database from within a QC test?

Bug #173 is still waiting on merging of a pull request to CoTeDe, but once that is put in it should go away.

@bkatiemills
Member

Have a look in AutoQC.py for an example of how to interact with the database; you need to create a database connection, then a cursor, and then pass SQL queries into that cursor - specific syntax is all in AutoQC.py, let me know if there's anything unclear.
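
Roughly, the pattern is as follows; this is only a sketch, with assumed credentials, table and column names - the real syntax is all in AutoQC.py:

```python
import psycopg2

# create a database connection and a cursor
conn = psycopg2.connect(dbname='root', user='root')
cur = conn.cursor()

# read: e.g. fetch everything on one cruise, for a track-style check
cruise_id = 900508  # an example cruise from our testing data
cur.execute('SELECT uid, raw FROM profiles WHERE cruise = %s;', (cruise_id,))
rows = cur.fetchall()

# write: record a qc result, and commit so other workers can see it
cur.execute('UPDATE profiles SET en_track_check = %s WHERE uid = %s;', (True, 1))
conn.commit()
```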

@s-good
Contributor Author

s-good commented Aug 1, 2016

#180 is my attempt at the EN standard level checks. I really like the new database structure!

@bkatiemills
Member

@s-good #180 does a lot of good stuff, but as I mention there I think sqlite might go really wrong at full parallelization. If you're comfortable getting either postgres or docker working locally, I can merge #180 and take care of getting things in line with a postgres-only approach, then finish track to match; let me know and I'll take care of it right away.

@bkatiemills
Member

Update: all qc tests are nominally working under postgres on the first 10 quota profiles, but the run now hangs and fails on the final ICDC test when it tries to load its netcdf file; since that test is fine when run in isolation, this suggests a memory leak somewhere is blowing up our overhead. Will investigate Wednesday.

@bkatiemills
Member

Today's push leaves all qc tests running quickly and correctly on the first 10 profiles in quota_subset. Changes of note:

  • ICDC_aqc_09_local_climatology_check no longer attempts to store parameters at global scope, but reads them from the netcdf file every time (a sketch of the pattern follows this list). This is very fast, and in general I'm nervous about what 'global scope' means for such a highly parallelized framework; we should strictly avoid globals going forward, and remove existing ones as soon as possible.
  • EN_track doesn't try to write the results for all profiles on track to the db from within the test. This has the obviously undesirable consequence of running the full track for every profile on a track, but something was going wrong with my first naive implementation of a more complete solution. This is certainly something to fix in time, but right now doesn't seem to actually produce that large a slow-down.
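
For the record, the per-call read in the first bullet amounts to something like the sketch below; the filename and variable name are placeholders, not the real ICDC parameter file:

```python
from netCDF4 import Dataset

def load_icdc9_parameters():
    # re-read (and internally decompress) the climatology on every call,
    # rather than caching it in a module-level global
    nc = Dataset('icdc9_climatology.nc', 'r')       # placeholder filename
    median = nc.variables['temperature_median'][:]  # placeholder variable name
    nc.close()
    return median
```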

That leaves us ready to iron out the code tests and start scaling up validation on Thursday, in preparation to merge postgres and begin running on the full testing data shortly thereafter.

@s-good
Contributor Author

s-good commented Aug 4, 2016

Looks really good! I've been trying it out too and it seems to run at a good speed, except the EN track check, which seemed a bit slow. The nice thing about this new framework is that we can run that separately, so I've set everything running without the track check to see how far it gets over the weekend, and will run the track check on its own next week if the first stage is successful. First results for the seal tag data are below. This dataset has no bad data (they were removed in the original quality control), so in theory there should be no rejections at all. Some tests are flagging a lot, which is cause for concern; alternatively, something may be going wrong in the processing.

NAME OF TEST FAILS TPR FPR TNR FNR
Argo_global_range_check 0 nan% 0.0% 100.0% nan%
Argo_gradient_test 0 nan% 0.0% 100.0% nan%
Argo_impossible_date_test 0 nan% 0.0% 100.0% nan%
Argo_impossible_location_test 0 nan% 0.0% 100.0% nan%
Argo_pressure_increasing_test 0 nan% 0.0% 100.0% nan%
Argo_regional_range_test 0 nan% 0.0% 100.0% nan%
Argo_spike_test 0 nan% 0.0% 100.0% nan%
CSIRO_constant_bottom 0 nan% 0.0% 100.0% nan%
CSIRO_depth 0 nan% 0.0% 100.0% nan%
CSIRO_long_gradient 10301 nan% 77.8% 22.2% nan%
CSIRO_short_gradient 1295 nan% 9.8% 90.2% nan%
CSIRO_surface_spikes 0 nan% 0.0% 100.0% nan%
CSIRO_wire_break 0 nan% 0.0% 100.0% nan%
CoTeDe_Argo_density_inversion 6387 nan% 48.3% 51.7% nan%
CoTeDe_GTSPP_WOA_normbias 0 nan% 0.0% 100.0% nan%
CoTeDe_GTSPP_global_range 221 nan% 1.7% 98.3% nan%
CoTeDe_GTSPP_gradient 0 nan% 0.0% 100.0% nan%
CoTeDe_GTSPP_profile_envelop 221 nan% 1.7% 98.3% nan%
CoTeDe_GTSPP_spike_check 0 nan% 0.0% 100.0% nan%
CoTeDe_Morello2014 180 nan% 1.4% 98.6% nan%
CoTeDe_WOA_normbias 0 nan% 0.0% 100.0% nan%
CoTeDe_anomaly_detection 5699 nan% 43.1% 56.9% nan%
CoTeDe_digit_roll_over 0 nan% 0.0% 100.0% nan%
CoTeDe_fuzzy_logic 193 nan% 1.5% 98.5% nan%
CoTeDe_gradient 0 nan% 0.0% 100.0% nan%
CoTeDe_location_at_sea_test 440 nan% 3.3% 96.7% nan%
CoTeDe_rate_of_change 0 nan% 0.0% 100.0% nan%
CoTeDe_spike 0 nan% 0.0% 100.0% nan%
CoTeDe_tukey53H_norm 333 nan% 2.5% 97.5% nan%
EN_background_available_check 624 nan% 4.7% 95.3% nan%
EN_background_check 0 nan% 0.0% 100.0% nan%
EN_constant_value_check 0 nan% 0.0% 100.0% nan%
EN_increasing_depth_check 0 nan% 0.0% 100.0% nan%
EN_range_check 0 nan% 0.0% 100.0% nan%
EN_spike_and_step_check 0 nan% 0.0% 100.0% nan%
EN_spike_and_step_suspect 0 nan% 0.0% 100.0% nan%
EN_stability_check 781 nan% 5.9% 94.1% nan%
EN_std_lev_bkg_and_buddy_check 111 nan% 0.8% 99.2% nan%
EN_track_check 0 nan% 0.0% 100.0% nan%
ICDC_aqc_01_level_order 0 nan% 0.0% 100.0% nan%
ICDC_aqc_02_crude_range 0 nan% 0.0% 100.0% nan%
ICDC_aqc_04_max_obs_depth 113 nan% 0.9% 99.1% nan%
ICDC_aqc_05_stuck_value 0 nan% 0.0% 100.0% nan%
ICDC_aqc_06_n_temperature_extrema 0 nan% 0.0% 100.0% nan%
ICDC_aqc_07_spike_check 0 nan% 0.0% 100.0% nan%
ICDC_aqc_08_gradient_check 0 nan% 0.0% 100.0% nan%
ICDC_aqc_09_local_climatology_check 1087 nan% 8.2% 91.8% nan%
WOD_gradient_check 8 nan% 0.1% 99.9% nan%
WOD_range_check 0 nan% 0.0% 100.0% nan%
loose_location_at_sea 0 nan% 0.0% 100.0% nan%

@bkatiemills
Member

@s-good great! Still a bunch of validation to do (I'm on that today), but I don't anticipate that causing any slow-downs in run time. About how long did this calculation take, over how many profiles and how many cores?

@s-good
Contributor Author

s-good commented Aug 4, 2016

There are about 13,000 profiles in this dataset. I ran this on 2 cores in about an hour, I think. QuOTA is about 350,000 profiles, so quite a lot larger. I was managing to get through ~800 profiles in 5 minutes over 4 cores last night, so the whole dataset should take < 2 days. I think it might be faster if I can get postgres working. The QC tests themselves take fractions of a second, but writing to the database with sqlite is more like 0.5 seconds.

@bkatiemills
Member

bkatiemills commented Aug 4, 2016

Okay, but that still sets a lower bound estimate of 6,500 profiles / hour / core (13,000 profiles over 2 cores in an hour). So on a c3.8xlarge AWS instance (32 vCPUs), assuming their processors are comparable, that's like 200k profiles an hour, or two hours for QuOTA - approximately $4. We'll see if these estimates hold up as we finish validation, but if nothing blows up, that's pretty cheap and easy.

@s-good
Contributor Author

s-good commented Aug 4, 2016

That's pretty impressive speed! BTW I found that creating the database takes 5-6 hours (in sqlite).

I had to make a few modifications to get things to work. I don't know if these will be necessary for postgres in the docker container, so am just going to make a note of them here.

  • I ended up storing each profile's location in the source file in the database, rather than the raw profile text, as sqlite didn't seem able to handle access at any speed otherwise.
  • I also adjusted the way the qc results are written to the database: it now builds one long query string setting all the qc results for a single profile, and writes the whole thing in one go.
  • I used process.imap(process_row, range(nuids)) rather than the loop with the async processes, which just wouldn't work for me (no idea why). process.imap is quite neat as it processes each profile in order, so it is possible to print out where you are in the processing and get an idea of when it will finish. A sketch of these last two changes follows this list.
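
Roughly, sqlite-flavoured; run_all_tests is an assumed stand-in for the per-profile qc driver, and the db filename is a placeholder:

```python
import sqlite3
from multiprocessing import Pool

nuids = 13000  # number of profiles in the database

def process_row(uid):
    results = run_all_tests(uid)  # assumed helper returning {test_name: True/False}
    # build one long query setting every qc result for this profile...
    settings = ', '.join('{} = {}'.format(name, int(flag)) for name, flag in results.items())
    conn = sqlite3.connect('iquod.db')  # placeholder db filename
    conn.execute('UPDATE profiles SET {} WHERE uid = {};'.format(settings, uid))
    conn.commit()  # ...and write it in a single transaction
    conn.close()

pool = Pool(4)
# imap yields results in submission order, so progress reporting is meaningful
for i, _ in enumerate(pool.imap(process_row, range(nuids))):
    if i % 100 == 0:
        print('processed {} of {} profiles'.format(i, nuids))
```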

@bkatiemills
Member

Re your points:

  • As I mentioned, loading the database is in theory something we could do once and bake into the Docker image to save those hours in future - if people are comfortable with that data sitting on Dockerhub.
  • Re: profile position instead of raw text: whatever works! I was worried about a bunch of processes trying to read the same file at the same time, but it doesn't seem so bad.
  • Writing all in one go is almost certainly better, we should do that in the official release.
  • And imap is still asynchronous? If so then it might be worth checking this out in the official release as well.

@bkatiemills
Member

Alright - postgres build is now passing, but note EN_track's integration tests have been temporarily suppressed due to #181. Next up: validating that no regressions have crept in by looking at more than just the first 10 profiles in quota_subset.

@bkatiemills
Member

Today:

  • postgres and master match for all of quota_subset, except EN_track; currently investigating.
  • Slowest tests are EN_track (redundant re-running) and ICDC-9 (redundant parameter loading); both problems are very fixable, but will require some subtle db interactions that we may or may not have time for before we run. I will look into one or both over the weekend.

@bkatiemills
Member

Update: been running postgres for >24h on 3k profiles from our testing data on my local machine; not running EN_track or ICDC-9 since they are well known to be our slowest; next slowest are EN_background and EN_std_lev. Planning to compare this run to the same on master and hopefully sign off on no regressions from moving to our new database scheme.

@bkatiemills
Member

Today:

  • EN_track now no longer re-runs the full track for every profile on a cruise, dramatically improving speed. The issue seemed to be that it's important to commit or rollback after executing any query - even one that doesn't write to the db (see the sketch after this list).
  • The discrepancy in EN_track before and after the new database upgrade is that previously, profiles reporting None for time were sorted to the top of the list, whereas the sql-native sorting now puts None at the bottom. @s-good, I need your judgement call on how to resolve this; in my opinion, since time ordering of profiles is so important to the track check, any profile failing to report all of year, month, day and time must be omitted; otherwise, a profile from the middle of the track with a corrupt timestamp will be moved to an artificial place on the track, causing a big mess - but it's your call.
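
A sketch of both points, with assumed credentials and column names:

```python
import psycopg2

conn = psycopg2.connect(dbname='root', user='root')  # assumed credentials
cur = conn.cursor()

# even a read-only query opens a transaction; commit (or rollback) to close it,
# otherwise the connection idles in-transaction and later queries suffer
cur.execute('SELECT uid FROM profiles WHERE cruise = %s;', (900508,))
track = cur.fetchall()
conn.commit()

# and rather than letting sql sort NULL timestamps to the bottom of the track,
# exclude profiles with incomplete time reporting entirely:
cur.execute("""
    SELECT uid FROM profiles
    WHERE cruise = %s
      AND year IS NOT NULL AND month IS NOT NULL
      AND day IS NOT NULL AND time IS NOT NULL
    ORDER BY year, month, day, time;
""", (900508,))
```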

@s-good
Contributor Author

s-good commented Aug 9, 2016

That's great about the speed up to the track check. I agree, let's omit any profile without complete time/date information from the track check.

@bkatiemills
Member

Today:

  • QC results on postgres at 620ff0 match master at 707340 for all QC tests when run over 1000 profiles from Tim's IQUOD_quota set, except:
    • EN track, which might be due to the ordering problem above
    • EN standard level (currently investigating for more details)
    • ICDC-9 (didn't run, too slow atm).

Once these and #183 and #184 are resolved, I think we should be good to go to production.

@bkatiemills
Member

Today:

  • satisfied that the standard level check is working correctly; the previous discrepancy was due to the main loop on master only going through the first 1000 profiles, while buddies were being sought in the full 2000+ profile file. Trimming correctly makes the master and postgres branches report the same number of flags.
  • en_track mismatches by a single profile; investigating.
  • still need to validate ICDC-9, but would prefer to find a way to speed it up first; any way we could trim the size of the parameter files down, or load them into the db once and consume them from there?

@s-good
Contributor Author

s-good commented Aug 11, 2016

That's great! It won't be possible to trim the ICDC files down, but I would revert to the original method of loading the data into module variables once. It is a lot of data to read, and the variables are compressed internally in the netCDF file, so a decompression also has to be done each time they are read from file. I think it is likely that this is the main cause of it running slow.

@bkatiemills
Member

Today:

  • Satisfied that EN_track is working correctly on postgres; discrepancy was due to a bug on master where probe type wasn't being extracted correctly by the isBuoy function.
  • Agree with @s-good that ICDC-9 needs its parameters loaded exactly once - will look into this hopefully tonight or certainly by tomorrow.

@bkatiemills
Member

Today:

  • Sped up parameter loading for ICDC-9 by creating a global parameter store that is read in before any tests are run and passed to all qc tests; the standard signature for all qc functions is now test(profile, parameters), where parameters is the global parameter store containing info for all qc tests (see the sketch after this list).
  • ICDC-9 confirmed producing the same results on master and postgres
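
In sketch form, with placeholder file and variable names:

```python
from netCDF4 import Dataset

def loadParameters():
    # read (and decompress) everything exactly once, before any tests run
    parameters = {}
    nc = Dataset('icdc9_climatology.nc', 'r')  # placeholder filename
    parameters['icdc9_median'] = nc.variables['temperature_median'][:]
    nc.close()
    return parameters

def test(profile, parameters):
    # the shared qc signature: no per-call file I/O, just a dictionary lookup
    climatology = parameters['icdc9_median']
    return check_against(profile, climatology)  # assumed comparison helper
```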

Final to-do before running over all data:

@bkatiemills
Member

bkatiemills commented Aug 15, 2016

Today:

  • I think I can speed up ICDC-9 a bunch more by not making unnecessary copies of the parameter arrays (see the sketch after this list); it seems to run a lot quicker, and I'm validating overnight that it still works properly.
  • Ran into some funny business when trying to run over a new block of 1000 profiles, but got distracted speeding up ICDC-9; will look into this more once I'm confident ICDC-9 is as good as it can be.
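
The copy-avoidance in the first bullet boils down to preferring numpy views over fresh copies, e.g.:

```python
import numpy as np

params = np.arange(180 * 360, dtype=float).reshape(180, 360)  # stand-in parameter grid

window = params[10:20, 50:60]   # basic slice: a view, no data copied
assert window.base is params    # proof that it's a view

copied = params[params > 1e4]   # boolean indexing: allocates a brand new array
```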

@s-good, all tests are passing on postgres except for a couple of CoTeDe tests that broke as you mentioned in #183; you know more about what's happening there, so I think you're in the best position to sort that along with #183 if you don't mind. I'll try to tackle #184 and any other problems that arise in our final validation round.

@bkatiemills
Member

Today:

Investigated the 'funny business' mentioned above when trying to do a second 1000-profile validation:

  • Appears to have been a memory problem caused when EN_track encounters a track with hundreds of profiles on it; EN_track would frequently crash the worker, and other tests complained about missing data.
  • Our IQUOD testing data contains cruise 900508, which as I think we discussed in an email thread or another issue some time ago has thousands of profiles on track. In both 1000-profile validation batches, there were hundreds of profiles from this track being considered.
  • This problem didn't crop up in the original 1000-profile batch, since I ran EN_track on its own then, as I was still working on smartening that test up while simultaneously validating the other qc tests. Running the full batch of tests on the first 1000-profile batch shows the same symptoms as the second.
  • Last push dodges this problem by opting out of EN_track for cruises with more than 100 profiles on track. Relaxing this constraint can be a future target.

Hopefully this will let me do my last validation run; finishing #183 and #184 should put us in a good place to run over data at scale.

@s-good
Contributor Author

s-good commented Aug 19, 2016

As it runs in the EN processing system the track check only QCs a month of data at a time (although something like a week of data is also read in from the month either side to help with the track checking). If there are memory issues, it would be a good option to only consider a month of data at a time. My worry with the 100 profile limit is that it could exclude a lot of data that have been collected over many years.

@bkatiemills
Member

Sounds good - we could also semi-relax that constraint by running the entire track regardless of dates if there are fewer than 100 (or some less arbitrarily-selected number of) profiles on track; the fewer artificial divisions the better, for the track check. Shouldn't be too difficult; I'll look into it over the weekend.

@bkatiemills
Member

On second thought, we may not need to slice anything up like this; if we go back to using only database columns for en_track (and not reconstructing entire profiles from the raw text), things run very quickly and smoothly; 7ef6fa4 produces the same en_track result for my first batch of 1000 testing profiles.
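
In other words, something like the sketch below (credentials and column names assumed), pulling only the lightweight fields the track check actually needs:

```python
import psycopg2

conn = psycopg2.connect(dbname='root', user='root')  # assumed credentials
cur = conn.cursor()
cur.execute("""
    SELECT uid, lat, long, year, month, day, time
    FROM profiles
    WHERE cruise = %s
    ORDER BY year, month, day, time;
""", (900508,))         # example cruise from our testing data
track = cur.fetchall()  # plain tuples: no wodpy reconstruction, no ascii parsing
conn.commit()
```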

@bkatiemills
Member

Today: I'm satisfied that everything is working correctly on postgres and it is ready for merge. I've completed another 1000-profile validation between this branch and master (with the exception of CoTeDe, since #184 is still not resolved) and no new discrepancies have emerged; since #184 is not actually postgres-related, and CoTeDe looked fine on the first batch of 1000, I think we're good to merge now and fix that downstream. @s-good, do you have any concerns before we wrap up here?

@s-good
Contributor Author

s-good commented Aug 24, 2016

Fantastic! No concerns from me.

@bkatiemills mentioned this issue Aug 24, 2016
@bkatiemills
Member

Alright, that's that - once #184 wraps up, we can move to speed tests on AWS to try and get an understanding of what profiles / $ is going to look like.
