Coverity builds broken #3723

Closed
richardlau opened this issue May 14, 2024 · 19 comments
@richardlau
Member

The two most recent node-daily-coverity builds have failed. There's an error about the agent going offline, but no other obvious error.

e.g. https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3010/console

make[1]: *** Deleting file '/home/iojs/build/workspace/node-daily-coverity/out/Release/obj.target/v8_compiler/deps/v8/src/compiler/graph.o'
make[1]: *** Deleting file '/home/iojs/build/workspace/node-daily-coverity/out/Release/obj.target/v8_compiler/deps/v8/src/compiler/graph-visualizer.o'
make[1]: *** Deleting file '/home/iojs/build/workspace/node-daily-coverity/out/Release/obj.target/v8_compiler/deps/v8/src/compiler/graph-reducer.o'
make[1]: *** Deleting file '/home/iojs/build/workspace/node-daily-coverity/out/Release/obj.target/v8_compiler/deps/v8/src/compiler/frame-states.o'
FATAL: Unable to delete script file /tmp/jenkins5048327983852336012.sh
java.nio.channels.ClosedChannelException
	at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:155)
	at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:143)
	at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:789)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
	at jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0(ErrorLoggingExecutorService.java:51)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@67405022:JNLP4-connect connection from 147.75.72.255/147.75.72.255:58626": Remote call on JNLP4-connect connection from 147.75.72.255/147.75.72.255:58626 failed. The channel is closing down or has closed down
	at hudson.remoting.Channel.call(Channel.java:996)
	at hudson.FilePath.act(FilePath.java:1230)
	at hudson.FilePath.act(FilePath.java:1219)
	at hudson.FilePath.delete(FilePath.java:1766)
	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:163)
	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:92)
	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:818)
	at hudson.model.Build$BuildExecution.build(Build.java:199)
	at hudson.model.Build$BuildExecution.doRun(Build.java:164)
	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:526)
	at hudson.model.Run.execute(Run.java:1895)
	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
	at hudson.model.ResourceController.execute(ResourceController.java:101)
	at hudson.model.Executor.run(Executor.java:442)
[Agent went offline during the build](https://ci.nodejs.org/computer/test%2Dequinix%2Dubuntu2204%2Dx64%2D1/log)
ERROR: Connection was broken
@targos
Member

targos commented May 14, 2024

When I set up the new Jenkins workspace machine, I downloaded a more recent version of Coverity, which I also deployed on the existing machines (and I updated the Jenkins job).

This is probably the compilation running out of memory. I saw the same symptoms on Fedora hosts: when it happens, the kernel kills the entire process tree, including the Jenkins agent.

@richardlau
Member Author

richardlau commented May 14, 2024

The Equinix machines (where the job was running) have more CPU/RAM (#3597 (comment)) than the IBM machine (#3597 (comment)), which I put back online a few hours ago.

The job is running

V=1 cov-build --dir cov-int make -j $(getconf _NPROCESSORS_ONLN)

which for the Equinix machine was 16 (apparently 2 threads per each of the 8 cores). We could possibly set server_jobs in the inventory to a lower number, which should set the JOBS environment variable, and then change the job to use JOBS. I'm a bit wary of touching test-equinix-ubuntu2204-x64-1 at the moment, as it's the machine that also hosts the binary temp git repository used in the fanned jobs and the other Equinix workspace machine is currently down (#3721).
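
For illustration, the build step could fall back to the current behaviour when JOBS isn't set (a rough sketch, not the actual job configuration):

  # use the Ansible-controlled JOBS value if set, otherwise all online processors
  V=1 cov-build --dir cov-int make -j "${JOBS:-$(getconf _NPROCESSORS_ONLN)}"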

The IBM machine has 2 vCPUs/4 GB RAM, which is more like the regular test machines -- maybe adding 2 GB of swap like we did for the test machines would be sufficient, although the job tends to prefer running on test-equinix-ubuntu2204-x64-1.
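
For reference, one common way to add a 2 GB swap file on Ubuntu (not necessarily exactly how it was done for the test machines):

  sudo fallocate -l 2G /swapfile
  sudo chmod 600 /swapfile
  sudo mkswap /swapfile
  sudo swapon /swapfile
  # add an /etc/fstab entry to make it persist across reboots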

Or maybe we could be more drastic and shift the job to the Hetzner benchmark machines? I forget whether there's a reason these builds had to run on the jenkins-workspace machines, other than needing the Coverity build tool installed, which I've now automated in #3722.

@targos
Member

targos commented May 15, 2024

Here's a run with hardcoded -j 6: https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3013/

@targos
Member

targos commented May 15, 2024

That build passed. I suggest keeping the hardcoded value until a better solution is implemented.
+1 on using the Hetzner machines.

richardlau added a commit to richardlau/build that referenced this issue Jun 5, 2024
Install the Coverity Scan build tool on the `benchmark` machines
instead of the `jenkins-workspace` machines.

Refs: nodejs#3723
@richardlau
Member Author

richardlau commented Jun 5, 2024

I've updated the job to run on the benchmark machines instead of jenkins-workspace (after running #3752 against the benchmark machines to install the Coverity Scan build tool). I've undone the workaround to hardcode -j 6 (it now uses ${JOBS} which we can control via the Ansible inventory).

https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3037/ looks okay apart from an expected failure to upload/submit since we're limited to one upload per day. The next scheduled daily run would be expected to pass.

@richardlau
Member Author

Hmm, the scheduled build failed to upload:
https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3038/console

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   183    0     0  100   183      0    139  0:00:01  0:00:01 --:--:--   139
100   183    0     0  100   183      0     79  0:00:02  0:00:02 --:--:--    79
100   199  100    16  100   183      4     56  0:00:04  0:00:03  0:00:01    61
100   199  100    16  100   183      4     56  0:00:04  0:00:03  0:00:01    61
error code: 1016parse error: Invalid numeric literal at line 1, column 6

Rerunning: https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3039/console

@targos
Member

targos commented Jun 6, 2024

Do we use jq in the script? This error message seems to come from it.
It could be useful to print the response in case we're unable to parse it.
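
Something like this, perhaps (a hypothetical sketch, assuming the init response is written to a file named response; the actual job script isn't shown in this thread):

  # print the raw response if it isn't the JSON we expect
  if ! upload_url=$(jq -er '.url' response); then
    echo "Unexpected response from Coverity:"
    cat response
    exit 1
  fi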

@richardlau
Member Author

richardlau commented Jun 6, 2024

Yes, we use jq -- the upload is a two-step process where the first step is an API call to get a JSON response that contains the temporary URL to upload to.

We are already printing the response -- in this case

error code: 1016

I just logged into test-hetzner-ubuntu2204-x64-1 and checked the response file in the workspace, which has that content.

@richardlau
Member Author

https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3039/ passed.

@richardlau
Member Author

richardlau commented Jun 7, 2024

https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3040/ failed 😞

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   365    0   182  100   183    303    304 --:--:-- --:--:-- --:--:--   608
500 Internal Server Error
If you are the administrator of this website, then please read this web application's log file and/or the web server's log file to find out what went wrong.jq: error (at response:1): Cannot index number with string "url"
parse error: Invalid numeric literal at line 1, column 13

i.e. the first call to the Coverity Scan API returned

500 Internal Server Error
If you are the administrator of this website, then please read this web application's log file and/or the web server's log file to find out what went wrong.

I guess we'll need to monitor this for a while.

FWIW we only have a small sample size, but the successful run was on test-hetzner-ubuntu2204-x64-2 while the two failing runs were on test-hetzner-ubuntu2204-x64-1.

richardlau added a commit that referenced this issue Jun 7, 2024
Install the Coverity Scan build tool on the `benchmark` machines
instead of the `jenkins-workspace` machines.

Refs: #3723
@richardlau
Member Author

https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3041/ succeeded.

https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3042/ looks like it succeeded at first glance, but the second stage of the upload failed:

100    44  100    16  100    28      4      8  0:00:04  0:00:03  0:00:01    13
error code: 1016

https://scan.coverity.com/projects/node-js?tab=overview is currently showing "Version: v23.0.0-pre-50695e5de1" which is from 3041, but the page also says "Last Build Status: In-queue. Your build is currently being analyzed."

Both builds ran on test-hetzner-ubuntu2204-x64-1.

@richardlau
Member Author

richardlau commented Jun 10, 2024

https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3043/console failed to upload:

Your build is already in the queue for analysis. \
       Please wait till analysis finishes before uploading another build.
parse error: Invalid numeric literal at line 1, column 5

https://scan.coverity.com/projects/node-js?tab=overview:
[screenshot of the project's build status on scan.coverity.com]

I wonder if the failed upload from build 3042 is now blocking further uploads. I've clicked "Terminate build", which responded:

The build has been scheduled for termination. There may be a delay before a new build can be resubmitted.

Retrying: https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3044/

@richardlau
Member Author

Retrying: https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3044/

100    44  100    16  100    28      4      8  0:00:04  0:00:03  0:00:01    13
error code: 1016

😞

@richardlau
Member Author

I logged into test-hetzner-ubuntu2204-x64-1 and manually ran the curl command to enqueue the build (the one in the job config that hits the URL ending in /enqueue). The first time I tried, I got the same error:

iojs@test-hetzner-ubuntu2204-x64-1:~/build/workspace/node-daily-coverity$ curl --fail-with-body -X PUT -d token=<redacted> https://scan.coverity.com/projects/<redacted>/enqueue
curl: (22) The requested URL returned error: 530
error code: 1016

I immediately ran it again and it succeeded:

iojs@test-hetzner-ubuntu2204-x64-1:~/build/workspace/node-daily-coverity$ curl --fail-with-body -X PUT -d token=<redacted> https://scan.coverity.com/projects/<redacted>/enqueue
{"project_id":6507,"id":619487}

(I've added --fail-with-body to the command in the job in the hope that it will make the failure actually fail the build.)
This has changed https://scan.coverity.com/projects/node-js from saying the build is queued to

Last Build Status: Running. Your build is currently being analyzed

@richardlau
Member Author

Yes, we use jq -- the upload is a two step process where the first step is an API call to get a JSON response that contains the temporary URL to upload to.

Small correction: the upload is a three-step process (a rough sketch follows the list below):

  1. POST request to Coverity /init endpoint to get back JSON response containing temporary upload URL and build ID.
  2. POST to temporary upload URL with build ID and artifacts from the build.
  3. POST to /enqueue endpoint with build ID.
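
Roughly, something like this in shell (a sketch only: TOKEN, PROJECT, INIT_URL, the artifact filename and the JSON/form field names are placeholders or assumptions, and the exact arguments live in the job configuration):

  # 1. POST to the Coverity /init endpoint; save the JSON response to a file
  curl -s --fail-with-body -X POST -d "token=${TOKEN}" "${INIT_URL}" > response
  url=$(jq -er '.url' response)
  build_id=$(jq -er '.build_id' response)   # hypothetical field name

  # 2. POST the build artifacts (and build ID) to the temporary upload URL;
  #    the form field names here are guesses
  curl -s --fail-with-body -X POST -F "buildId=${build_id}" -F "file=@cov-int.tgz" "${url}"

  # 3. enqueue the uploaded build for analysis (mirrors the manually-run command above)
  curl -s --fail-with-body -X PUT -d "token=${TOKEN}" \
    "https://scan.coverity.com/projects/${PROJECT}/enqueue"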

Of the observed failures so far:

@richardlau
Member Author

https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3045/console failed at step 1:

100   199  100    16  100   183      4     53  0:00:04  0:00:03  0:00:01    57
07:49:03 curl: (22) The requested URL returned error: 530
07:49:03 error code: 1016

@richardlau
Member Author

https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3046 failed at step 3:

100    44  100    16  100    28      4      8  0:00:04  0:00:03  0:00:01    13
07:53:23 curl: (22) The requested URL returned error: 530
07:53:23 error code: 1016

I've manually run step 3 on the machine to unstick the analysis queue in Coverity. I'll put a loop around the first and third steps so that each retries a few times (with a pause between attempts).
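
Something along these lines for each of those two steps (the attempt count, pause and enqueue command here are illustrative, not the exact change to the job):

  # retry the enqueue call a few times before giving up
  ok=
  for attempt in 1 2 3; do
    if curl --fail-with-body -X PUT -d "token=${TOKEN}" \
        "https://scan.coverity.com/projects/${PROJECT}/enqueue"; then
      ok=1
      break
    fi
    echo "enqueue attempt ${attempt} failed; retrying in 60 seconds"
    sleep 60
  done
  [ -n "${ok}" ] || exit 1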

@richardlau
Member Author

Typically, the three most recent Coverity builds since I added the retry loops have all succeeded without having to retry 😆.

@richardlau
Member Author

richardlau commented Jun 24, 2024

Since I put in the retry loop, we've only had one build failure, which occurred during the build (possibly a resource issue or agent failure): https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-coverity/3056/

All other builds have succeeded and were able to submit the results to Coverity without needing to go through the retry loop, so we have no validation that the loop works or makes things better. Since the builds are succeeding at the moment and we're getting the static analysis run daily, I'm going to close this issue.
