Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migration to Go implementation of DBS server #10843

Closed
vkuznet opened this issue Sep 29, 2021 · 17 comments
Closed

Migration to Go implementation of DBS server #10843

vkuznet opened this issue Sep 29, 2021 · 17 comments

Comments

@vkuznet
Copy link
Contributor

vkuznet commented Sep 29, 2021

Impact of the new feature
Align DMWM/WMCore clients with new Go implementation of DBS server

Is your feature request related to a problem? Please describe.
We need to resolve outlined issue in order to perform migration to new DBS server implementation

Describe the solution you'd like
WMCore team should review that their usage of DBS clients is aligned with new behavior of DBS dbs2go server

UPDATE (Alan): this issue is meant to be tackled once WMCore applications are ready to work with the Golang implementation of the DBS server.

Describe alternatives you've considered
None

Additional context

To ensure migration from DBS python server to Go implementation there are several issues should be resolved/confirmed by the clients (DMWM tools, CRAB, DAS). Here I provide brief details of each use-case and refer to individual tickets for more details:

  • to reduce CPU/RAM footprints on a server side the dbs2go implementation does not perform aggregation of output results coming out from ORACLE (i.e. it streams from the ORACLE directly). As such few DBS reader APIs output have slightly different data-format (instead of aggregated lists they provide individual records),
    see WMCore#10841

  • dbs2go writer API follow strong data-types checking, while python DBS writer server allows data-types mismatch, e.g. pass '123' and treat it as integer. I found that the following DBS attributes are affected: run_num, lumi_section_num, processing_version, event_count, file_size, for details see DBS#657 and associate WMCore#10844 feature request

  • DBS python server provides inconsistent data-type outputs, e.g. the exception is returned as dict, the results as list, and POST injection as null. In dbs2go implementation all results are returned as list data type, for details see WMCore#10842

  • We should decide on injection POST API output. So far, DBS returns 200 OK HTTP code and null string (since its clients parse JSON outputs), but neither is required by HTTP standard. For data injection we should return 201 OK HTTP code, see here, and return nothing (or return empty list of we'd like to have consistent output). Therefore, even though it is not critical, it is desired to adjust to HTTP standard. As such WMCore tools (e.g. pycurl_managert) should be adjusted to handle 201 HTTP code similar to 200 one.

For more details on these and other potential issues related to this migration please refer to the following document

@amaltaro
Copy link
Contributor

@vkuznet Valentin, do I understand it right that this issue is more like a checklist and supposed to track the final migration in the production system?

dbs2go writer API follow strong data-types checking, while python DBS writer server allows data-types mismatch, e.g. pass '123' and treat it as integer. I found that the following DBS attributes are affected: run_num, lumi_section_num, processing_version, event_count, file_size, for details see DBS#657

Regarding this point. I think we should have a WMCore issue to correct those data types on our side. It should be transparent, since DBSServer is already doing the casting step when needed. So the sooner we have it implemented, the better for the Go implementation migration. Can you open yet another issue please?

@vkuznet
Copy link
Contributor Author

vkuznet commented Sep 29, 2021

Yes, it is something we must check before the migration to avoid hidden failures in production workflows. And, yes, I opened dedicated feature request WMCore#10844 for data-types issue.

@vkuznet
Copy link
Contributor Author

vkuznet commented Sep 29, 2021

@belforte please review this issue as it relates to CRAB too. This is high-level issue where you can find details about concrete issues related to migration, therefore I only provide this one to you for your review, but I expect that CRAB team can evaluate the usage of new DBS server and ensure that CRAB tools will be complaint with it.

@belforte
Copy link
Member

as reference, I have created a GH project for CRAB on transitioning to new DBS-Go server, feel free to review/comment/add issues there https://github.com/dmwm/CRABServer/projects/8

@vkuznet
Copy link
Contributor Author

vkuznet commented Sep 29, 2021

@belforte , @amaltaro somehow your comments din't appear in a ticket. So, I'll paste them here and answer to both:

From @stefano:
usually we do not make HTTP calls to DBS server, but use DBS python API. Can't the format differences be hidden inside the client ?

From @amaltaro
WMCore is the same, Stefano. All the communication with DBSServer goes through the DBS3Reader/DBS3Writer wrapper (which uses dbs3-client). However, with time and migration from DBS2 to DBS3, we accumulated "temporary" code that was never refactored/removed, like this: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/DBS/DBS3Reader.py#L32

I'll answer you both. If you use DBS client which does not perform JSON data-type validation and can only work with DBS custom Python server because it does not follow standard and perform bizzarre things it does not mean that you should not use proper data-types. I will repost my reply from email thread explaining why it is dangerous:

In python I can do int(1), int("1"), and int("00001") and all of them will convert to number 1. Does these values are the same to you? If so, I can construct int("0"*100000000000000000+"1") which is a string with gozzillion number of zeros and 1 at the end then I'll get

>>> int("0"*100000000000000000+"1")
Python(81587,0x111bdbe00) malloc: can't allocate region
:*** mach_vm_map(size=100000000000004096, flags: 100) failed (error code=3)
Python(81587,0x111bdbe00) malloc: *** set a breakpoint in malloc_error_break to debug
Python(81587,0x111bdbe00) malloc: can't allocate region
:*** mach_vm_map(size=100000000000004096, flags: 100) failed (error code=3)
Python(81587,0x111bdbe00) malloc: *** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError

Now, please explain to me why run number, file size or lumi number should be a string? Does 0123 is the same to you as 123? Don't you think that at least it is logically wrong? At most the code can cause memory error on a server and in practice DBS server should perform extra cycles of CPU and RAM to perform casting for every single value you send with a wrong data-type? Said that, DBS server should inject valid JSON data regardless if you use Python or curl, if you use DBS client or plain httplib library in Python, or if you use Go/C/C++ or any other languages. If data is correct all of these possibilities are equal.

@belforte
Copy link
Member

I am confused. Is the plan to use existing DBS Python Client with new server, or to update/rewrite it ? Or do you suggest that we replace it with direct http calls to new server ?
Or is this simply:
since DBS Client does not do input sanitization, please make sure that you check things with Lexicon first and that you send proper JSON.

AFAIK current DBS Client does not take JSON as input, but python dictionaries,
which can surely be part of the confusion here. IIUC DBS client streams those
dictionaries to HTTP instead of converting to JSON.
If client-server protocal needs improving, and do proper exception rising
client-side before attempting to send junk to the server, the best way should
be to add checks in the DBS Python client.

So let's make it clear what the procedure is, then we can dwell on various lumi formats
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/DataStructs/LumiList.py
which may surely be "too much user friendly" but are currently in use in various
places in all their formats. The best I can imagine is that we enforce
compact format before sending to DBS.

BTW, I though we (well you!) were getting this chance also to rework the
DBS client-server communication protocol to turn those extra long lists of lists of...
into some compact representation which reduced memory needs, speeds up
xfer, and overall solves the famous problem that posting new blocks can take
so long to hit the 5min FE timeout. E.g. I have seen cases where block is
actually published but still FE times out and I get an http 502 error, and have
no way to tell the difference from cases where block publication failed, and I
still get 502 http error.

I tend to agree with Yuyi that maybe original DBS specifications were too lax, but
since all existing code is functional, we need not to worry too much and it is
not worth IMHO to go through code which use DBS and change it. If at some
point a new developer will do something stupid and the new server will
reject it, so much the better. The new server should definitely assume and enforce
proper format, and since (as you say) it will be ready to acces input
from python client, direct http calls, Go/C/C++ clients etc, there is no alternative
to having proper validation in the server.

Is this as simple as "current server may accept junk and try to send it to Oracle,
but be ware that next one will not" ? I am sure we will agree on that, and yes, wherever
possible we will try to catch those things early in the application side.

@vkuznet
Copy link
Contributor Author

vkuznet commented Sep 30, 2021

@belforte let's break things piece by piece:

  • DBS is HTTP RESTFul server, i.e. it can accept GET/POST/PUT requests
  • DBS also provide DBS client which talks to DBS HTTP server, internally DBS client takes python dictionary and composes out of it HTTP request
  • I think you're aware that python dictionary is not (necessary) a JSON, for instance {u'key':1} is a valid Python dictionary but it is not proper JSON since key is a unicode, while {"key":1} is proper JSON. In order to properly convert Python dict to JSON you must use proper software, e.g. json.dumps() which will convert Python data into JSON format, e.g. Python None will be converted to null in JSON, etc.
  • JSON as well as Python dictionary supports multiple data types, strings or ints
  • DBS Server accepts JSON payload in POST request, and it casts necessary attributes from incoming JSON values to int using int(value) function. As such someone can send "123" in JSON which is a string data and it will be converted in DBS server to 123 which is an integer value
  • To send data to DBS server we may use DBS client or not since it is HTTP server, I can easily use curl or any other client to send proper JSON to DBS server, i.e. there is no requirement to use DBS client per-se, but we (historically) use it
  • The Go HTTP server enforces proper data-types because it declares them, e.g. run is defined as integer, therefore if someone in JSON sends run value as a string Go server will refuse to parse it since it is wrong data-type for run.

Now about issues you listed. They are part of the issue I explain here and contribute to DBS performance degradation. For instance, if you send gigabtic list of strings which should be converted to ints, DBS server performs casting for every single value which leads to extra CPU cycles, extra memory usage, etc. As part of Go server implementation I did consider all known issues with DBS server, this is why I demonstrated that DBS memory footprint can drop from 10GB to less than 200MB, etc. I'd like to avoid discussing here all different issues I find and only concentrate on discussion for a specific topic.

And, as I pointed out there is no need to fix DBS client or DBS server, since if client will send proper data-types all components, the legacy DBS server, DBS Client API, and other tools (like curl or other HTTP clients) will work just fine.

The bottom line is that clients should send proper data-types to DBS server regardless if client is DBS client API or curl or anything else. The large JSON can be easily compacted using gzip, and data format should be revisited to minimize CPU/memory footprint both on a client and server sides. But this will be possible once we move to Go server.

@belforte
Copy link
Member

thanks @vkuznet . All fine.
I fail to find any action item for me :-) I will keep using DBS Python client, send dictionary and validate with Lexicon the few data which do not come from DBS itself or other automated tools like CMSSW. We never parsed/validated the list of lumis in CMSSW FrameworkJobReport, and I'd rather not add this.

@vkuznet
Copy link
Contributor Author

vkuznet commented Sep 30, 2021

@belforte , it would be extremely useful if you'll find out how CRAB software stack get values into dict which is passed to DBS client API, e.g. do run, lumi, etc. are integers or strings. I think they are created somewhere in production workflow and I'll be interested to know who create them and which data-type are assigned to them. Then, if CRAB creates a dict it would be nice if it will perform conversion in place, like {"run_num": int(var), ...} where var is a variable name which holds run number, etc. The DBS client API takes a dict from upstream, but it is WMCore/CRAB software which creates those dicts.

@belforte
Copy link
Member

Well... all that goes into DBS comes from WMCore's parsing of cmsRun's FJR where xml is turned into json:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/FwkJobReport/Report.py
in particular CRAB jobs wrapper calls the Parse and the to_json methods from there.

OTOH those run/lumi lists are first stored in internal CRAB DB, then read back and sent to DBS.
I do not want to digress too much here, but if there's any question with data format in/out of
CRAB's Oracle I would defer to after we move to do that in python3 with modern cx_oracle,
current stuff works but is nothing I want to leave with longer and suffers of gigantic
string vs. byte-array vs. unicode vs. new-string issues. Getting data representation there
under control is needed regardless of DBS migration and brings us straight to old foe

class DatabaseRESTApi(RESTApi):

@vkuznet
Copy link
Contributor Author

vkuznet commented Sep 30, 2021

It seems to me that WMCore is doing right thing, thanks to @stefano pointer to FwkJobReport file I see, e.g. https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/FwkJobReport/Report.py#L811, that proper data-type is enforced by this code.

@amaltaro could you please provide an example of fwjr JSON file which WMCore use to send data to DBS that I can inspect for data-types correctness. May be everything is in a good shape.

@amaltaro
Copy link
Contributor

amaltaro commented Oct 1, 2021

@vkuznet Valentin, you can find here an old'ish json dump POSTed to the DBSServer in 2019:
http://amaltaro.web.cern.ch/amaltaro/forYuyi/dbsuploader_block.json

Code is still the same though, so it should still be a representative example.

@vkuznet
Copy link
Contributor Author

vkuznet commented Oct 11, 2021

Here is a list of requirements/actions on DBS client side: dmwm/DBS#658

@vkuznet
Copy link
Contributor Author

vkuznet commented Oct 12, 2021

I switched requirements on DBS side to this ticket: dmwm/DBSClient#19

@amaltaro
Copy link
Contributor

amaltaro commented Feb 7, 2022

This ticket is waiting for:

  • old WMAgent version to be completely drained and shut off (1.5.4 cycle)
  • mid of February, when we planned to have the switch over to the Go-based implementation for DBSReader.

@amaltaro
Copy link
Contributor

Now that all the old agent releases have been stopped, and that the Go-based server (DBSReader) has been in production for 5 days, shall we close this issue out? @vkuznet

@vkuznet
Copy link
Contributor Author

vkuznet commented Feb 28, 2022

I think we can close this ticket, @amaltaro please do so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants