Database changes #179
Comments
Hi @s-good - that definitely has some advantages, particularly in terms of dodging a complete rebuild of how the qc tests consume data; the only wrinkle is that we don't want to forgo the ability to run database queries that return groups of profiles - say all the profiles on a given track for EN_track, or whatever the criteria were for identifying buddies in the buddy checks. Any variable we want to put a condition on in a query needs its own column. That said, that's not really a problem; we can tease out the bare minimum of extra columns when building the DB, and otherwise do as you suggest. I wouldn't worry about the size of the database; it all sits on disk, and what we've described here amounts to a string, a few ints, and a bunch of bools per profile - a million profile stubs would only be tens of MB, which is nothing in this context. So I say let's do it, at least for this version, since it gets us a product ASAP. When you say 'the next few days', does that include your weekend? I think I can pull this together by Monday, but if you're keen to jump on it, let me know so we aren't both doing the same thing. After that we'll need to validate that the old and new branches give the same answers on a largish number of profiles, then merge, then do some speed tests on AWS so I can give a better estimate of how long the whole production run will take. None of this should take too long, modulo any substantial changes to the track and buddy checks to take advantage of the new database.
One more addendum: we might consider just dumping the raw profile text into a column in the database, and forgo re-reading the ascii files a second time. While this looks like unnecessary duplication, the reason I chose Postgres is that it is very efficient at simultaneous reads and writes; if we pull two rows out of the database in parallel and then both have to go and extract those profiles from the same text file, that will hurt our parallelization. So, to summarize: we should have columns for the full raw profile, for any individual variables we want to query on, and for the qc results; then we can construct a wodpy object out of the raw text found in the db row, and hardly have to rebuild any of the qc tests.
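For concreteness, here is a minimal sketch of the kind of table being described, assuming postgres accessed via psycopg2; the database, table, and column names are illustrative only, not necessarily what the code ends up using:

```python
# Sketch only: one row per profile, with a handful of queryable columns,
# the full raw ascii text, and one boolean per qc test result.
import psycopg2

conn = psycopg2.connect(dbname='iquod', user='postgres')  # connection details are placeholders
cur = conn.cursor()
cur.execute('''
    CREATE TABLE IF NOT EXISTS profiles (
        uid       INTEGER PRIMARY KEY,  -- WOD unique cast id
        cruise    INTEGER,              -- lets EN_track pull a whole track in one query
        latitude  REAL,
        longitude REAL,
        year      INTEGER,
        month     INTEGER,
        raw       TEXT,                 -- full raw ascii profile, re-parsed by wodpy at qc time
        en_track  BOOLEAN,              -- one column like this per qc test
        en_std_level_check BOOLEAN
    )
''')
conn.commit()
cur.close()
conn.close()
```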
Sounds good! I will struggle to get anywhere over the weekend, so better for you to work on it if you are happy to - thanks for doing this. I will try running it at work during the week to get an estimate of how fast it will process there.
Just to add that I don't think wodpy can create a WodProfile object from raw text; I think it only works from a file. Could a WodProfile object be pickled and stored in the database instead of the raw text?
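A rough sketch of the two options, assuming WodProfile will accept any file-like object with the usual read interface (if it insists on a real file on disk, only the pickle route applies); the helper names are made up for illustration:

```python
import io
import pickle

from wodpy import wod

def profile_from_text(raw):
    # Rebuild a WodProfile from the raw ascii text stored in a db row;
    # assumes WodProfile is happy with any file-like object.
    return wod.WodProfile(io.StringIO(raw))

def profile_to_blob(profile):
    # Alternative raised above: serialise the parsed object so it can go in a
    # BYTEA (postgres) / BLOB (sqlite) column instead of the raw text. This
    # only works if the object does not hang on to an open file handle.
    return pickle.dumps(profile)

def profile_from_blob(blob):
    return pickle.loads(blob)
```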
Alright - the
@s-good, not fiddling around exploding the profiles into database columns, but rather just keeping enough around to re-create the wodpy profile objects, was a stroke of brilliance! We may consider something more intricate later for possible performance improvements, but in the run-up to October it is definitely the right call for now - thanks for the suggestion!
Fantastic! That's really fast work! I will look at the buddy check ASAP. What is the best way to access the database from within a QC test? Bug #173 is still waiting on a pull request to CoTeDe being merged, but once that is in it should go away.
Have a look in
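Purely as a hypothetical illustration of the shape this could take (the actual entry point is whatever the file referenced above provides; the test signature, table, and column names here are placeholders, following the illustrative schema earlier in the thread):

```python
# Hypothetical sketch: a qc test that queries the database for companion
# profiles (e.g. everything on the same cruise) and rebuilds them via wodpy.
import io

import numpy as np
import psycopg2
from wodpy import wod

def test(p):
    conn = psycopg2.connect(dbname='iquod', user='postgres')
    cur = conn.cursor()
    cur.execute('SELECT raw FROM profiles WHERE cruise = %s', (p.cruise(),))
    companions = [wod.WodProfile(io.StringIO(row[0])) for row in cur.fetchall()]
    cur.close()
    conn.close()
    # ...real flagging logic would compare p against its companions here...
    return np.zeros(p.n_levels(), dtype=bool)  # one flag per level, all False in this stub
```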
#180 is my attempt at the EN standard level checks. I really like the new database structure!
@s-good, #180 does a lot of good stuff, but as I mention there, I think sqlite might go really wrong at full parallelization. If you're comfortable getting either postgres or docker working locally, I can merge #180 and take care of bringing things in line with a postgres-only approach, then finish track to match; let me know and I'll take care of it right away.
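For context on that parallelization worry, a brief illustrative sketch: sqlite takes a database-wide lock for every write, so parallel workers recording qc results can collide, and the usual mitigation is only a busy timeout; postgres handles concurrent writers natively. Table and column names below follow the earlier illustrative schema:

```python
# sqlite mitigation sketch: wait on the lock rather than failing immediately.
import sqlite3

conn = sqlite3.connect('iquod.db', timeout=30)   # block up to 30 s on a locked database
conn.execute('PRAGMA busy_timeout = 30000')      # equivalent pragma, set explicitly
conn.execute('UPDATE profiles SET en_track = ? WHERE uid = ?', (True, 12345))
conn.commit()
conn.close()
```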
Update: things are nominally working under postgres for all qc tests on the first 10 QuOTA profiles, but the run now hangs and fails at the final ICDC test when it tries to load its netcdf file; since that test is fine when I run it in isolation, this suggests a memory leak somewhere is blowing up our overhead. Will investigate Wednesday.
Today's push leaves all qc tests running quickly and correctly on the first 10 profiles in
That leaves us ready to iron out the code tests and start scaling up validation on Thursday, in preparation to merge |
Looks really good! I've been trying it out too and it seems to run at good speed, except that the EN track check seemed a bit slow. The nice thing about this new framework is that we can run that separately, so I've set everything running without the track check to see how far it gets over the weekend, and will run the track check on its own next week if the first stage is successful. First results for the seal tag data are below. This dataset has no bad data (they were removed in the original quality control), so in theory there should be no rejections at all. Some tests are flagging a lot, which is cause for concern; alternatively, something may be going wrong in the processing.
@s-good great! Still a bunch of validation to do (I'm on that today), but I don't anticipate that causing any slow-downs in run time. About how long did this calculation take, over how many profiles and how many cores?
There are about 13,000 profiles in this dataset. I ran this on 2 cores in about an hour, I think. QuOTA is about 350,000, so quite a lot larger. I was managing to get through ~800 profiles in 5 mins over 4 cores last night, so the whole dataset should take < 2 days. I think it might be faster if I can get postgres working. The QC tests themselves take fractions of a second, but writing to the database with sqlite is more like 0.5 seconds.
Okay, but that still gives a lower-bound estimate of 6500 profiles / hour / core. So on a c3.8xlarge AWS instance, assuming its processors are comparable, that's something like 200k profiles an hour, or two hours for QuOTA - approximately $4. We'll see if these estimates hold up as we finish validation, but if nothing blows up, that's pretty cheap and easy.
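The arithmetic behind those round numbers, using only figures from this thread plus the 32 vCPUs of a c3.8xlarge:

```python
# back-of-envelope check of the throughput and runtime estimates above
profiles_per_hour_per_core = 13000 / 2 / 1.0   # seal tag run: 13,000 profiles, 2 cores, ~1 hour
vcpus = 32                                     # c3.8xlarge
throughput = profiles_per_hour_per_core * vcpus
quota_hours = 350000 / throughput

print(throughput)    # 208000.0 profiles / hour, i.e. 'like 200k'
print(quota_hours)   # ~1.7 hours, i.e. 'two hours for QuOTA'
```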
That's pretty impressive speed! BTW, I found that creating the database takes 5-6 hours (in sqlite). I had to make a few modifications to get things to work. I don't know if they will be necessary for postgres in the docker container, so I am just going to make a note of them here.
Re your points:
Alright - |
Today:
Update: been running |
Today:
That's great about the speed-up to the track check. I agree, let's omit any profile without complete time/date information from the track check.
Today:
Once these and #183 and #184 are resolved, I think we should be good to go to production. |
Today:
That's great! It won't be possible to trim the ICDC files down, but I would revert to the original method of loading the data into module variables once. It is a lot of data to read, and it is compressed internally in the netCDF file, so there is also a decompression that has to be done every time it is read from file. I think it is likely that this is the main cause of the slow running.
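A minimal sketch of that module-variable approach, assuming netCDF4 is the reader in use; the file name and variable names are placeholders rather than the actual ICDC contents:

```python
# Load (and decompress) the ICDC climatology once per process, then reuse it.
from netCDF4 import Dataset

_icdc_data = None  # module-level cache

def get_icdc_data():
    global _icdc_data
    if _icdc_data is None:
        nc = Dataset('icdc_climatology.nc')           # placeholder file name
        _icdc_data = {
            't_median': nc.variables['t_median'][:],  # placeholder variable names
            't_amd': nc.variables['t_amd'][:],
        }
        nc.close()
    return _icdc_data
```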
Today:
Today:
Final to-do before running over all data:
Today:
@s-good, all tests are passing on
Today: Investigated the 'funny business' mentioned above when trying to do a second 1000-profile validation:
Hopefully this will let me do my last validation run; finishing #183 and #184 should put us in a good place to run over data at scale.
As it runs in the EN processing system, the track check only QCs a month of data at a time (although something like a week of data from the month on either side is also read in to help with the track checking). If there are memory issues, it would be a good option to only consider a month of data at a time. My worry with the 100-profile limit is that it could exclude a lot of data collected over many years.
Sounds good - we could also semi-relax that constraint by just running the entire track regardless of dates if there are fewer than 100 (or some less arbitrarily selected number of) profiles on the track; the fewer artificial divisions, the better for the track check. Shouldn't be too difficult; I'll look into it over the weekend.
On second thought, we may not need to slice anything up like this; if we go back to using only database columns for en_track (and not reconstructing entire profiles from the raw text), things run very quickly and smoothly; 7ef6fa4 produces the same en_track result for my first batch of 1000 testing profiles.
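A sketch of what "only database columns" could mean for en_track, following the illustrative schema from earlier in the thread (names are placeholders):

```python
# Pull just the lightweight per-profile fields a track check needs, rather
# than reconstructing full wodpy profiles from the raw text.
import psycopg2

def track_rows(cruise):
    conn = psycopg2.connect(dbname='iquod', user='postgres')
    cur = conn.cursor()
    cur.execute('''
        SELECT uid, latitude, longitude, year, month
        FROM profiles
        WHERE cruise = %s
        ORDER BY year, month
    ''', (cruise,))
    rows = cur.fetchall()
    cur.close()
    conn.close()
    return rows  # small tuples: positions and times for every profile on the track
```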
Today: I'm satisfied that everything is working correctly on
Fantastic! No concerns from me.
Alright, that's that - once #184 wraps up, we can move on to speed tests on AWS to try to get an understanding of what profiles / $ is going to look like.
Hi @BillMills, just been looking over the database code. It looks really good and will be a big step forward by releasing us from the memory issues! One thing that occurred to me: I was wondering whether the database needs to contain the data at all. We could just store the unique ID, the file the profile came from, and its location in that file. Then this information can be used to read the data from file just prior to the QC. That way there wouldn't need to be any modification of the QC tests (apart from maybe the ones that need to use profiles other than the one being tested). This would avoid the database becoming too large (which might cause performance issues?), and a further reason is that a QC test we add in the future may need extra information that is not in the database.
If you think this is a good idea I could try to make some changes over the next few days to implement this.
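For reference, a minimal sketch of that index-only idea, assuming the byte offset of each profile is recorded (e.g. via fid.tell()) while the database is built; the function and its arguments are illustrative:

```python
# Re-read a single profile from its source ascii file just before QC,
# using only the (filename, offset) pair stored in the database.
from wodpy import wod

def read_profile(filename, offset):
    with open(filename) as fid:
        fid.seek(offset)              # jump to where this profile starts
        return wod.WodProfile(fid)    # wodpy parses one profile from the current position
```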