Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CRABRestInterface parses curl stdout and stderr at same time #5159

Open
belforte opened this issue Mar 17, 2022 · 12 comments
Open

CRABRestInterface parses curl stdout and stderr at same time #5159

belforte opened this issue Mar 17, 2022 · 12 comments
Assignees

Comments

@belforte
Copy link
Member

in CRABRestInerface the output of curl is piped to stderr,
this makes the code wrongly assume that the call to serve failed if the output contains a string reporting an HTTP error, like when task submission failed due to an HTTP error in TaskWorker.
In following example task submission failed with the "typical" proxy error 502

tm_task_failure | Problem handling 220317_170502:cmsbot_crab_outLFNDirBase because of DBSReaderError Message: Can't contact DBS at https://cmsweb-prod.cern.ch/dbs/prod/global/DBSReader, got errors Instantiating DBS3Reader failed with HTTP Error 502: URL=https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader/serverinfo Code=502 Message=Proxy Error Header=HTTP/1.1 502 Proxy Error Date: Thu, 17 Mar 2022 17:14:10 GMT Server: Apache Content-Length: 497 Content-Type: text/html; charset=iso-8859-1 Body= Proxy Error The proxy server received an invalid response from an upstream server. The proxy server could not handle the request GET /auth/complete/dbs/prod/global/DBSReader/serverinfo.

but crab status just prints:

belforte@lxplus785/bot> crab status 
Fatal error trying to connect to https://cmsweb-test1.cern.ch:8443/crabserver/dev/task?subresource=search&workflow=220317_170502%3Acmsbot_crab_outLFNDirBase using subresource=search&workflow=220317_170502%3Acmsbot_crab_outLFNDirBase.
The server answered with an error.
belforte@lxplus785/bot> 

problematic code is (I believe)

command += ' "%s" | tee /dev/stderr ' % url
# retries this up at least 3 times, or up to self['retry'] times for range of exit codes
# retries are counted AFTER 1st try, so call is made up to nRetries+1 times !
nRetries = max(2, self['retry'])
for i in range(nRetries + 1):
stdout, stderr, curlExitCode = execute_command(command=command, logger=self.logger)
http_code, http_reason = parseResponseHeader(stderr)

combined with
def parseResponseHeader(response):
"""
Parse response header and return HTTP code with reason
Example taken from WMCore pycurl_manager
"""
startRegex = r"HTTP\/\d.\d\s\d{3}[^\n]*"
continueRegex = r"HTTP\/\d.\d\s100[^\n]*" # Continue: client should continue its request
replaceRegex = r"HTTP\/\d.\d"
reason = ''
code = 9999
for row in response.split('\r'):
row = row.replace('\n', '')
if not row:
continue
response = re.search(startRegex, row)
if response:
if re.search(continueRegex, row):
continue
res = re.sub(replaceRegex, "", response.group(0)).strip()
code, reason = res.split(' ', 1)
code = int(code)

I am surprised that this did not appear earlier. Maybe it is another side effect of new message handling in REST (the b64 saga) ? the message is 11.8K characters, maybe earlier it was truncated and this made it formatted in such a way not to tool parseResponseHeader ?

In any case we should not mix HTTP Header and Body in this way, need to understand why we have |tee /dev/stderr in

command += ' "%s" | tee /dev/stderr ' % url

@belforte
Copy link
Member Author

the |tee /dev/stderr was added in ddaina@80688c5 in order to solve
#5067

It was a tricky point. And I am afraid we need to revisit.
@mapellidario do you feel like helping ?

@belforte
Copy link
Member Author

rats !! this happens also on tasks on production server using production client like
https://cmsweb.cern.ch/crabserver/ui/task/220317_212302%3Amstojano_crab_K892_pT30
where I get after crab remake :

belforte@lxplus785/bug> crab --version
CRAB client v3.220210
belforte@lxplus785/bug> crab status crab_K892_pT30/
Fatal error trying to connect to https://cmsweb.cern.ch:8443/crabserver/prod/task?subresource=search&workflow=220317_212302%3Amstojano_crab_K892_pT30 using subresource=search&workflow=220317_212302%3Amstojano_crab_K892_pT30.
The server answered with an error.
belforte@lxplus785/bug> 

increasing priority :-(

@belforte
Copy link
Member Author

I am of course pretty sure that in the past we had "sane outputs from crab status" also when DBS was returning HTTP 502. So is this maybe really due to change in TW ?
Surely changing TW is a much faster fix than deploying new cllient.

@belforte
Copy link
Member Author

belforte commented Mar 17, 2022

hmm... the cmsbot example above was from crab-dev-tw01 where message is truncated to 7500 char. Also the truncate message is in any case added to the message, would not override the reporting of HTTP 502 at the beginning which is the problem here. I am puzzled.

@belforte
Copy link
Member Author

if I remove the |tee /dev/stderr I get:

belforte@lxplus785/bot> crab status 
CRAB project directory:		/afs/cern.ch/work/b/belforte/CRAB3/TC3/dbg/bug/crab_K892_pT30
Task name:			220317_212302:mstojano_crab_K892_pT30
Grid scheduler - Task Worker:	N/A yet - crab-prod-tw01
Status on the CRAB server:	SUBMITFAILED
Task URL to use for HELP:	https://cmsweb.cern.ch/crabserver/ui/task/220317_212302%3Amstojano_crab_K892_pT30
Dashboard monitoring URL:	https://monit-grafana.cern.ch/d/cmsTMDetail/cms-task-monitoring-task-view?orgId=11&var-user=mstojano&var-task=220317_212302%3Amstojano_crab_K892_pT30&from=1647548582000&to=now
Failure message from server:	Problem handling 220317_212302:mstojano_crab_K892_pT30 because of DBSReaderError
				Message: Can't contact DBS at https://cmsweb-prod.cern.ch/dbs/prod/global/DBSReader, got errors Instantiating DBS3Reader failed with HTTP Error 502: 
				URL=https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader/serverinfo
				Code=502
				Message=Proxy Error
				Header=HTTP/1.1 502 Proxy Error
				Date: Thu, 17 Mar 2022 21:23:35 GMT
				Server: Apache
				Content-Length: 497
				Content-Type: text/html; charset=iso-8859-1
				
				
				Body=<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
				<html><head>
				<title>502 Proxy Error</title>
				</head><body>
				<h1>Proxy Error</h1>
				<p>The proxy server received an invalid
				response from an upstream server.<br />
				The proxy server could not handle the request <em><a href="/auth/complete/dbs/prod/global/DBSReader/serverinfo">GET&nbsp;/auth/complete/dbs/prod/global/DBSReader/serverinfo</a></em>.<p>
				Reason: <strong>DNS lookup failure for: cmsweb-k8s-prodsrv.cern.ch</strong></p></p>
				</body></html>
				
				
				
					ClassName : None
					ModuleName : WMCore.Services.DBS.DBSErrors
					MethodName : __init__
					ClassInstance : None
					FileName : /data/srv/TaskManager/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/slc7_amd64_gcc630/cms/crabtaskworker/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/lib/python3.8/site-packages/WMCore/Services/DBS/DBSErrors.py
					LineNumber : 38
					ErrorNr : 1002
				 failure, traceback follows
				Traceback (most recent call last):
				  File "/data/srv/TaskManager/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/slc7_amd64_gcc630/cms/py3-dbs3-client/4.0.7/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 445, in __callServer
				    self.http_response = method_func(self.url, method, params, data, request_headers)
				  File "/data/srv/TaskManager/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/slc7_amd64_gcc630/cms/py3-dbs3-pycurl/3.17.7-comp/lib/python3.8/site-packages/RestClient/RestApi.py", line 36, in get
				    return http_request(self._curl)
				  File "/data/srv/TaskManager/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/slc7_amd64_gcc630/cms/py3-dbs3-pycurl/3.17.7-comp/lib/python3.8/site-packages/RestClient/RequestHandling/HTTPRequest.py", line 62, in __call__
				    raise HTTPError(effective_url, http_code, http_response.msg, http_response.raw_header, http_response.body)
				RestClient.ErrorHandling.RestClientExceptions.HTTPError: HTTP Error 502: Proxy Error
				
				During handling of the above exception, another exception occurred:
				
				Traceback (most recent call last):
				  File "/data/srv/TaskManager/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/slc7_amd64_gcc630/cms/crabtaskworker/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/lib/python3.8/site-packages/WMCore/Services/DBS/DBSReader.py", line 19, in DBSReader
				    dbs.dbs.serverinfo()
				  File "/data/srv/TaskManager/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/slc7_amd64_gcc630/cms/py3-dbs3-client/4.0.7/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 1838, in serverinfo
				    return self.__callServer("serverinfo")
				  File "/data/srv/TaskManager/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/slc7_amd64_gcc630/cms/py3-dbs3-client/4.0.7/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 457, in __callServer
				    self.__parseForException(data)
				  File "/data/srv/TaskManager/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/slc7_amd64_gcc630/cms/py3-dbs3-client/4.0.7/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 483, in __parseForException
				    raise http_error
				RestClient.ErrorHandling.RestClientExceptions.HTTPError: HTTP Error 502: 
				URL=https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader/serverinfo
				Code=502
				Message=Proxy Error
				Header=HTTP/1.1 502 Proxy Error
				Date: Thu, 17 Mar 2022 21:23:35 GMT
				Server: Apache
				Content-Length: 497
				Content-Type: text/html; charset=iso-8859-1
				
				
				Body=<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
				<html><head>
				<title>502 Proxy Error</title>
				</head><body>
				<h1>Proxy Error</h1>
				<p>The proxy server received an invalid
				response from an upstream server.<br />
				The proxy server could not handle the request <em><a href="/auth/complete/dbs/prod/global/DBSReader/serverinfo">GET&nbsp;/auth/complete/dbs/prod/global/DBSReader/serverinfo</a></em>.<p>
				Reason: <strong>DNS lookup failure for: cmsweb-k8s-prodsrv.cern.ch</strong></p></p>
				</body></html>
				
				
				
				During handling of the above exception, another exception occurred:
				
				Traceback (most recent call last):
				  File "/data/srv/TaskManager/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/slc7_amd64_gcc630/cms/crabtaskworker/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/lib/python3.8/site-packages/TaskWorker/Actions/Handler.py", line 80, in executeAction
				    output = work.execute(nextinput, task=self._task, tempDir=self.tempDir)
				  File "/data/srv/TaskManager/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/slc7_amd64_gcc630/cms/crabtaskworker/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/lib/python3.8/site-packages/TaskWorker/Actions/DBSDataDiscovery.py", line 243, in execute
				    result = self.executeInternal(*args, **kwargs)
				  File "/data/srv/TaskManager/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/slc7_amd64_gcc630/cms/crabtaskworker/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/lib/python3.8/site-packages/TaskWorker/Actions/DBSDataDiscovery.py", line 261, in executeInternal
				    self.dbs = DBSReader(dbsurl)
				  File "/data/srv/TaskManager/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/slc7_amd64_gcc630/cms/crabtaskworker/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/lib/python3.8/site-packages/WMCore/Services/DBS/DBSReader.py", line 23, in DBSReader
				    raise DBSReaderError("Can't contact DBS at %s, got errors %s" % (endpoint, msg))
				WMCore.Services.DBS.DBSErrors.DBSReaderError: DBSReaderError
				Message: Can't contact DBS at https://cmsweb-prod.cern.ch/dbs/prod/global/DBSReader, got errors Instantiating DBS3Reader failed with HTTP Error 502: 
				URL=https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader/serverinfo
				Code=502
				Message=Proxy Error
				Header=HTTP/1.1 502 Proxy Error
				Date: Thu, 17 Mar 2022 21:23:35 GMT
				Server: Apache
				Content-Length: 497
				Content-Type: text/html; charset=iso-8859-1
				
				
				Body=<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
				<html><head>
				<title>502 Proxy Error</title>
				</head><body>
				<h1>Proxy Error</h1>
				<p>The proxy server received an invalid
				response from an upstream server.<br />
				The proxy server could not handle the request <em><a href="/auth/complete/dbs/prod/global/DBSReader/serverinfo">GET&nbsp;/auth/complete/dbs/prod/global/DBSReader/serverinfo</a></em>.<p>
				Reason: <strong>DNS lookup failure for: cmsweb-k8s-prodsrv.cern.ch</strong></p></p>
				</body></html>
				
				
				
					ClassName : None
					ModuleName : WMCore.Services.DBS.DBSErrors
					MethodName : __init__
					ClassInstance : None
					FileName : /data/srv/TaskManager/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/slc7_amd64_gcc630/cms/crabtaskworker/v3.220314patch1-3391c4ccadfae468fd63c78475580c2d/lib/python3.8/site-packages/WMCore/Services/DBS/DBSErrors.py
					LineNumber : 38
					ErrorNr : 1002
				
				
Log file is /afs/cern.ch/work/b/belforte/CRAB3/TC3/dbg/bug/crab_K892_pT30/crab.log
belforte@lxplus785/bot> 

@belforte
Copy link
Member Author

belforte commented Mar 17, 2022

It really looks like TW used to upload a much more terse message :-(
But at least in this way it prints the much needed:
Status on the CRAB server: SUBMITFAILED

@belforte
Copy link
Member Author

FWIW, the message in #5159 (comment) is "only" 7003 char long (from the DB, i.e. measured at

failure = getColumn(crabDBInfo, 'tm_task_failure')
)

@belforte
Copy link
Member Author

belforte commented Mar 17, 2022

a quick test indicates that truncating message at 1K char is plenty.
FYI @mapellidario I suggest to cut at 1K also in the famous PR for b64 dmwm/CRABServer#7106

@belforte
Copy link
Member Author

hmm... maybe WMCore changed and now reports the full http details instead of only one line like
Message: Can't contact DBS at https://cmsweb-prod.cern.ch/dbs/prod/global/DBSReader, got errors Instantiating DBS3Reader failed with HTTP Error 502
?

@belforte
Copy link
Member Author

Dario and myself debugged this. The change is not due to WMCore, but to new DBSClient which prints a new exception message which (besides being too long dmwm/DBSClient#68 ) contains the text from the http header Header=HTTP/1.1 402 or whatever exit code. This is eventually parsed by crab client and triggers the failure.

@belforte
Copy link
Member Author

We have deployed a patched TaskWorker which does not upload such messages anymore. It replaces HTTP/1.1 with HTTP 1.1 which is safe for crab status.
So this is not critical anymore. It may still be good to avoid parsing curl stdout for HTTP code, but we need to test well any change to CRABRestInterface, since all in all Daina may have had good reasons to use this

command += ' "%s" | tee /dev/stderr ' % url

@belforte belforte changed the title CRABRestInterface bad stderr parsing CRABRestInterface parses curl stdout and stderr at same time Mar 21, 2022
@belforte
Copy link
Member Author

I am inclined to do the following:

  • close pending PR and make new CRABClient version in crab-dev (and shortly after crab-prod)
  • make this change (do not pipe stdout to stderr in (go)curl in crab-dev and keep it in IB for a while and try to use it extensively
  • see if we get any odd issue

@mapellidario I welcome your opinion

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant