Fix Flaky CI - Kill and Run #4455

Draft · wants to merge 30 commits into master
Conversation

@AndrewJGaut (Contributor) commented Apr 28, 2023

Fix flaky kill and run test (see #4433).

Still digging into the issue... it appears to be related to the bundle metadata being altered before the bundle-manager can acknowledge a run bundle as finished...

@AndrewJGaut changed the title from "Make kill tests pass more often. We could also wait until the websock…" to "Fix Flaky CI - Kill and Run" on Apr 28, 2023
@epicfaace closed this May 11, 2023
@AndrewJGaut reopened this May 11, 2023
@AndrewJGaut (Contributor Author) commented:

Hmm... interestingly enough, we're now not getting any failures in a full 25 runs! (See the GHA runs after the PR was closed.) Very interesting...

@AndrewJGaut (Contributor Author) commented:

The most recent run shows that it only fails when we run multiple tests -- e.g. search read kill -- rather than just run kill in a single test setup. Hmm... I'm now re-running with even more debug statements and only those test cases.

@AndrewJGaut (Contributor Author) commented May 17, 2023

Looking at the logs of a test failure, here's what we see:

At the bottom of the rest-server logs (where transition_bundle_finalizing is called), we see:

2023-05-17 05:02:22,851 {'uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e', 'run_status': 'Finalizing bundle.', 'bundle_start_time': 1684299727.1436937, 'container_time_total': 3.138313, 'container_time_user': 0, 'container_time_system': 0, 'docker_image': 'python@sha256:b45ed695466dfbab748f4b2f73e3f9dc43236deda8c1a6f9eeb5fd3c861e5354', 'state': 'finalizing', 'remote': 'fv-az221-714', 'exitcode': None, 'failure_message': 'Kill requested', 'bundle_profile_stats': {'RUN_STAGE.PREPARING': {'start': 1684299727.1437242, 'end': 1684299732.7472246, 'elapsed': 5.6035003662109375}, 'RUN_STAGE.RUNNING': {'start': 1684299732.747232, 'end': 1684299737.7799726, 'elapsed': 5.032740592956543}, 'RUN_STAGE.CLEANING_UP': {'start': 1684299737.7799826, 'end': 1684299737.8078065, 'elapsed': 0.027823925018310547}, 'RUN_STAGE.UPLOADING_RESULTS': {'start': 1684299737.8078125, 'end': 1684299737.8927565, 'elapsed': 0.08494400978088379}, 'RUN_STAGE.FINALIZING': {'start': 1684299737.892763, 'end': None, 'elapsed': None}}, 'cpu_usage': 0.0, 'memory_usage': 0.070703125, 'disk_utilization': 4096}
2023-05-17 05:02:22,851 In transition_bundle_finalizing...

So, in particular, we see that 'failure_message': 'Kill requested' is logged within the worker_run dictionary in transition_bundle_finalizing. Yet we see this at the bottom of the bundle-manager logs (where transition_bundle_finished runs):

2023-05-17 05:02:23,170 In transition_bundle_finished
2023-05-17 05:02:23,171 {'uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e', 'bundle_type': 'run', 'command': 'sleep 100000;', 'state': 'finalizing', 'owner_id': '0', 'frozen': None, 'is_anonymous': False, 'storage_type': 'disk', 'is_dir': True, 'metadata': [{'metadata_key': 'name', 'metadata_value': 'run-sleep', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'description', 'metadata_value': '', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'created', 'metadata_value': '1684299727', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'data_size', 'metadata_value': '4096', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'allow_failed_dependencies', 'metadata_value': 'False', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'request_docker_image', 'metadata_value': 'python:3.6.10-slim-buster', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'request_time', 'metadata_value': '', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'request_memory', 'metadata_value': '10m', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'request_disk', 'metadata_value': '1m', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'request_cpus', 'metadata_value': '1', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'request_gpus', 'metadata_value': '0', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'request_queue', 'metadata_value': '', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'request_priority', 'metadata_value': '0', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'request_network', 'metadata_value': 'True', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'cpu_usage', 'metadata_value': '0.0', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'memory_usage', 'metadata_value': '0.070703125', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'store', 'metadata_value': '', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'actions', 'metadata_value': 'kill', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'time', 'metadata_value': '3.138313', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'time_user', 'metadata_value': '0.0', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'time_system', 'metadata_value': '0.0', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'started', 'metadata_value': '1684299727', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'last_updated', 'metadata_value': '1684299742', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'run_status', 'metadata_value': 'Finalizing bundle.', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'staged_status', 'metadata_value': "Bundle's dependencies are all ready. 
Waiting for the bundle to be assigned to a worker to be run.", 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'time_preparing', 'metadata_value': '5.6035003662109375', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'time_running', 'metadata_value': '5.032740592956543', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'time_cleaning_up', 'metadata_value': '0.027823925018310547', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'time_uploading_results', 'metadata_value': '0.08494400978088379', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'docker_image', 'metadata_value': 'python@sha256:b45ed695466dfbab748f4b2f73e3f9dc43236deda8c1a6f9eeb5fd3c861e5354', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'remote', 'metadata_value': 'fv-az221-714', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'remote_history', 'metadata_value': 'f4a4bd80c899', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}, {'metadata_key': 'on_preemptible_worker', 'metadata_value': 'False', 'bundle_uuid': '0xd31378a9fd6b4cfc9b9f893ccf04aa7e'}], 'dependencies': []}
2023-05-17 05:02:23,171 {'actions': ['kill'], 'allow_failed_dependencies': False, 'cpu_usage': 0.0, 'created': 1684299727, 'data_size': 4096, 'description': '', 'docker_image': 'python@sha256:b45ed695466dfbab748f4b2f73e3f9dc43236deda8c1a6f9eeb5fd3c861e5354', 'exclude_patterns': [], 'last_updated': 1684299742, 'memory_usage': 0.070703125, 'name': 'run-sleep', 'on_preemptible_worker': False, 'remote': 'fv-az221-714', 'remote_history': ['f4a4bd80c899'], 'request_cpus': 1, 'request_disk': '1m', 'request_docker_image': 'python:3.6.10-slim-buster', 'request_gpus': 0, 'request_memory': '10m', 'request_network': True, 'request_priority': 0, 'request_queue': '', 'request_time': '', 'run_status': 'Finalizing bundle.', 'staged_status': "Bundle's dependencies are all ready. Waiting for the bundle to be assigned to a worker to be run.", 'started': 1684299727, 'store': '', 'tags': [], 'time': 3.138313, 'time_cleaning_up': 0.027823925018310547, 'time_preparing': 5.6035003662109375, 'time_running': 5.032740592956543, 'time_system': 0.0, 'time_uploading_results': 0.08494400978088379, 'time_user': 0.0}
2023-05-17 05:02:23,171 None
2023-05-17 05:02:23,172 0
2023-05-17 05:02:23,172 ready

What we see is that the bundle.metadata dict has no failure message, and the logged failure message is None. Interesting!
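
For context on why that matters: the terminal state is presumably derived from the failure message and exit code, so a wiped failure_message would flip a killed bundle from failed to ready. Here's a minimal sketch of that presumed decision rule, purely for illustration (not the actual transition_bundle_finished code):

from typing import Optional

def compute_final_state(failure_message: Optional[str], exitcode: Optional[int]) -> str:
    # Presumed rule: any failure message or nonzero exit code means the bundle
    # failed; otherwise it is marked ready. If failure_message has been wiped by
    # the time the bundle manager runs, a killed bundle would end up 'ready'.
    if failure_message is not None or (exitcode is not None and exitcode != 0):
        return "failed"
    return "ready"

print(compute_final_state("Kill requested", None))  # failed -- what we expect for a killed run
print(compute_final_state(None, 0))                 # ready  -- what the flaky failure actually produces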

@AndrewJGaut (Contributor Author) commented May 17, 2023

OK, now that I've added more logging, I see that at the end of transition_bundle_finalizing the bundle.metadata is:

{
   "actions":[
      "kill"
   ],
   "allow_failed_dependencies":false,
   "cpu_usage":0.0,
   "created":1684306433,
   "data_size":4096,
   "description":"",
   "docker_image":"python@sha256:b45ed695466dfbab748f4b2f73e3f9dc43236deda8c1a6f9eeb5fd3c861e5354",
   "exclude_patterns":[
      
   ],
   "failure_message":"Kill requested",
   "last_updated":1684306448,
   "memory_usage":0.0703125,
   "name":"run-sleep",
   "on_preemptible_worker":false,
   "remote":"fv-az592-777",
   "remote_history":[
      "3f7310a8fd23"
   ],
   "request_cpus":1,
   "request_disk":"1m",
   "request_docker_image":"python:3.6.10-slim-buster",
   "request_gpus":0,
   "request_memory":"10m",
   "request_network":true,
   "request_priority":0,
   "request_queue":"",
   "request_time":"",
   "run_status":"Finalizing bundle.",
   "staged_status":"Bundle's dependencies are all ready. Waiting for the bundle to be assigned to a worker to be run.",
   "started":1684306433,
   "store":"",
   "tags":[
      
   ],
   "time":3.096053,
   "time_cleaning_up":0.025354385375976562,
   "time_preparing":5.472713470458984,
   "time_running":5.025737524032593,
   "time_system":0,
   "time_uploading_results":0.12300682067871094,
   "time_user":0
}

And, at the beginning of transition_bundle_finished, the bundle metadata is:

{
   "actions":[
      "kill"
   ],
   "allow_failed_dependencies":false,
   "created":1684306433,
   "description":"",
   "exclude_patterns":[
      
   ],
   "name":"run-sleep",
   "on_preemptible_worker":false,
   "remote_history":[
      "3f7310a8fd23"
   ],
   "request_cpus":1,
   "request_disk":"1m",
   "request_docker_image":"python:3.6.10-slim-buster",
   "request_gpus":0,
   "request_memory":"10m",
   "request_network":true,
   "request_priority":0,
   "request_queue":"",
   "request_time":"",
   "staged_status":"Bundle's dependencies are all ready. Waiting for the bundle to be assigned to a worker to be run.",
   "started":1684306433,
   "store":"",
   "tags":[
      
   ]
}

Now we're getting somewhere: the failure message is wiped from the metadata, and the metadata itself changes significantly. It looks like something else is updating the bundle prematurely, before transition_bundle_finished gets to it!
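
To pin down exactly which keys disappear between the two transitions, one option is to diff the two logged snapshots. A small, self-contained sketch (the dicts below are truncated stand-ins for the metadata dumps above, not the real bundle objects):

def diff_metadata(before: dict, after: dict) -> dict:
    # Report which keys were removed, added, or changed between two metadata snapshots.
    removed = {k: before[k] for k in before.keys() - after.keys()}
    added = {k: after[k] for k in after.keys() - before.keys()}
    changed = {k: (before[k], after[k]) for k in before.keys() & after.keys() if before[k] != after[k]}
    return {"removed": removed, "added": added, "changed": changed}

# Truncated stand-ins for the snapshots logged above.
end_of_finalizing = {"failure_message": "Kill requested", "run_status": "Finalizing bundle.", "name": "run-sleep"}
start_of_finished = {"name": "run-sleep"}

print(diff_metadata(end_of_finalizing, start_of_finished))
# removed: failure_message and run_status; nothing added or changed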

@AndrewJGaut (Contributor Author) commented May 17, 2023

Note: I thought it might be due to transition_bundle_running running again before transition_bundle_finished, but the bundle metadata at the end of that function looks like this:

{
   "actions":[
      "kill"
   ],
   "allow_failed_dependencies":false,
   "cpu_usage":0.0,
   "created":1684309050,
   "data_size":4096,
   "description":"",
   "docker_image":"python@sha256:b45ed695466dfbab748f4b2f73e3f9dc43236deda8c1a6f9eeb5fd3c861e5354",
   "exclude_patterns":[
      
   ],
   "last_updated":1684309071,
   "memory_usage":0.069140625,
   "name":"run-sleep",
   "on_preemptible_worker":false,
   "remote":"fv-az842-109",
   "remote_history":[
      "fd7b126df007"
   ],
   "request_cpus":1,
   "request_disk":"1m",
   "request_docker_image":"python:3.6.10-slim-buster",
   "request_gpus":0,
   "request_memory":"10m",
   "request_network":true,
   "request_priority":0,
   "request_queue":"",
   "request_time":"",
   "run_status":"Finalizing bundle.",
   "staged_status":"Bundle's dependencies are all ready. Waiting for the bundle to be assigned to a worker to be run.",
   "started":1684309050,
   "store":"",
   "tags":[
      
   ],
   "time":3.109395,
   "time_cleaning_up":0.022008895874023438,
   "time_preparing":5.538380861282349,
   "time_running":5.032529592514038,
   "time_system":0,
   "time_uploading_results":10.242918252944946,
   "time_user":0
}

which makes me think that's not the issue. Hmm... it's definitely a race condition, since it fails only intermittently, and it's caused by a bundle update happening somewhere else...

@AndrewJGaut (Contributor Author) commented May 17, 2023

I don't see where that extra bundle update is occurring:

  1. No REST endpoint is being hit (nothing in the worker logs indicates it's doing anything, and obviously there's no action from a client).
  2. I don't see anywhere the metadata update could occur, either in bundle_checkin or in the four tasks the bundle manager runs every iteration (see here).

So I'm a bit perplexed right now... Where else could a bundle metadata update come from?
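
One way to catch the hidden writer would be to log a stack trace at every metadata write, so whichever code path wipes failure_message shows up with a full traceback. A hedged sketch, assuming a hypothetical update function to wrap (the real model method name and signature in the codebase may differ):

import functools
import logging
import traceback

logger = logging.getLogger(__name__)

def log_metadata_writes(update_fn):
    # Debugging aid: wrap a metadata-update function so every call logs its
    # arguments and the call stack that led to it.
    @functools.wraps(update_fn)
    def wrapper(*args, **kwargs):
        logger.info(
            "metadata write: args=%s kwargs=%s\n%s",
            args,
            kwargs,
            "".join(traceback.format_stack(limit=15)),
        )
        return update_fn(*args, **kwargs)
    return wrapper

# Hypothetical usage -- the actual model object/method is likely named differently:
# model.update_bundle_metadata = log_metadata_writes(model.update_bundle_metadata)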

@AndrewJGaut (Contributor Author) commented:

Note: For a correct run, we have the following:

bundle metadata at end of transition_bundle_finalizing:

{
   "actions":[
      "kill"
   ],
   "allow_failed_dependencies":false,
   "cpu_usage":0.0,
   "created":1684308916,
   "data_size":4096,
   "description":"",
   "docker_image":"python@sha256:b45ed695466dfbab748f4b2f73e3f9dc43236deda8c1a6f9eeb5fd3c861e5354",
   "exclude_patterns":[
      
   ],
   "failure_message":"Kill requested",
   "last_updated":1684308936,
   "memory_usage":0.069921875,
   "name":"run-sleep",
   "on_preemptible_worker":false,
   "remote":"fv-az397-676",
   "remote_history":[
      "3246b8975815"
   ],
   "request_cpus":1,
   "request_disk":"1m",
   "request_docker_image":"python:3.6.10-slim-buster",
   "request_gpus":0,
   "request_memory":"10m",
   "request_network":true,
   "request_priority":0,
   "request_queue":"",
   "request_time":"",
   "run_status":"Finalizing bundle.",
   "staged_status":"Bundle's dependencies are all ready. Waiting for the bundle to be assigned to a worker to be run.",
   "started":1684308916,
   "store":"",
   "tags":[
      
   ],
   "time":7.116782,
   "time_cleaning_up":5.061987400054932,
   "time_preparing":0.8744297027587891,
   "time_running":9.032560110092163,
   "time_system":0,
   "time_uploading_results":0.1380014419555664,
   "time_user":0
}

bundle metadata at end of transition_bundle_finished:

{
   "actions":[
      "kill"
   ],
   "allow_failed_dependencies":false,
   "cpu_usage":0.0,
   "created":1684308916,
   "data_size":4096,
   "description":"",
   "docker_image":"python@sha256:b45ed695466dfbab748f4b2f73e3f9dc43236deda8c1a6f9eeb5fd3c861e5354",
   "exclude_patterns":[
      
   ],
   "failure_message":"Kill requested",
   "last_updated":1684308936,
   "memory_usage":0.069921875,
   "name":"run-sleep",
   "on_preemptible_worker":false,
   "remote":"fv-az397-676",
   "remote_history":[
      "3246b8975815"
   ],
   "request_cpus":1,
   "request_disk":"1m",
   "request_docker_image":"python:3.6.10-slim-buster",
   "request_gpus":0,
   "request_memory":"10m",
   "request_network":true,
   "request_priority":0,
   "request_queue":"",
   "request_time":"",
   "run_status":"Finalizing bundle.",
   "staged_status":"Bundle's dependencies are all ready. Waiting for the bundle to be assigned to a worker to be run.",
   "started":1684308916,
   "store":"",
   "tags":[
      
   ],
   "time":7.116782,
   "time_cleaning_up":5.061987400054932,
   "time_preparing":0.8744297027587891,
   "time_running":9.032560110092163,
   "time_system":0.0,
   "time_uploading_results":0.1380014419555664,
   "time_user":0.0
}

@AndrewJGaut (Contributor Author) commented:

OK, one observation: the bundle manager logs differ between successful and failed runs. Around the bundle that gets killed (and, in the failing case, ultimately fails to be registered as killed), the successful runs log:

Staging <uuid>
Pinged worker through websockets, worker id: <id>
Starting run bundle <uuid> on worker <id>
Pinged worker through websockets, worker id: <id>
Acknowledged finalization of run bundle <uuid> on worker <id>

while the failed runs log:

Staging <uuid>
Pinged worker through websockets, worker id: <id>
Starting run bundle <uuid> on worker <id>
Pinged worker through websockets, worker id: <id>
socket error when calling send_json_message: [Errno 111] Connection refused
Sleeping for 0.1 seconds.
Acknowledged finalization of run bundle <uuid> on worker <id>

Every failed run I've seen has at least one Connection refused error at that point in the logs, and none of the successful runs do.
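
For reference, the "Sleeping for 0.1 seconds." line suggests a retry-with-sleep loop around the websocket send. A rough sketch of that pattern, with illustrative names only (not the bundle manager's actual retry code):

import logging
import time

logger = logging.getLogger(__name__)

def send_with_retry(send_fn, message, retries=3, delay_seconds=0.1):
    # Call send_fn(message), retrying on socket errors such as ECONNREFUSED and
    # sleeping briefly between attempts, mirroring the log lines above.
    for attempt in range(retries):
        try:
            return send_fn(message)
        except OSError as e:  # socket.error is an alias of OSError in Python 3
            logger.warning("socket error when calling send_json_message: %s", e)
            if attempt == retries - 1:
                raise
            logger.warning("Sleeping for %s seconds.", delay_seconds)
            time.sleep(delay_seconds)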

@AndrewJGaut (Contributor Author) commented Jul 23, 2023

I was hoping that #4506 would fix this, but it appears it doesn't: that change has been merged to master, and after merging master into this branch we still get errors. So I'm still looking for the culprit...

The fact that we fixed that error makes me more confident that this issue isn't due to a race condition from DB transaction ordering, and is more likely due to an update to the bundle happening somewhere I haven't found yet... I've just added some more logging.

One other thing I'm trying: I added a sleep to try to induce the "socket error when calling send_json_message: [Errno 111] Connection refused / Sleeping for 0.1 seconds." error every time. If that makes the error occur on every run, then I'll be able to reliably reproduce it locally, which will make it WAY easier to debug.
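
A minimal sketch of what that injected delay could look like: an env-var-gated sleep placed just before the websocket send, so the Connection refused window can be widened on demand. The variable name and call site below are made up for illustration:

import os
import time

def maybe_inject_delay(label: str) -> None:
    # When CODALAB_FAULT_INJECTION_SECONDS (hypothetical variable name) is set,
    # sleep before the critical section to widen the race window and make the
    # flaky failure reproducible locally.
    delay = float(os.environ.get("CODALAB_FAULT_INJECTION_SECONDS", "0"))
    if delay > 0:
        print(f"[fault-injection] sleeping {delay}s before {label}")
        time.sleep(delay)

# Hypothetical call site, just before the send in the bundle manager:
# maybe_inject_delay("send_json_message")
# send_json_message(...)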
