TaskWorker Canary Deployment

Dario Mapelli edited this page Aug 20, 2024 · 4 revisions

Here is a strategy for testing a new TW tag on only a fraction of the user workflows, so that we can do the equivalent of a "canary" deployment in K8s.

The main idea is from Wa; the details below were tentatively worked out by Dario and Stefano.

Design

  • Have two independent TaskWorkers, both pointing to the same database but with different configurations
    • Primary TW (identified as such in the configuration)
    • Secondary TW (identified as such in the configuration), AKA Canary TW
  • The Primary TW configuration will contain the Canary TW name (e.g. crab-prod-tw02) and the fraction of tasks to assign to it
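
The design above could be captured in configuration roughly as follows. This is only a sketch: the `TWConfig` class and its field names (`is_primary`, `canary_name`, `canary_fraction`) are hypothetical and do not correspond to existing TaskWorker configuration parameters, and the primary host name `crab-prod-tw01` is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TWConfig:
    """Hypothetical per-TaskWorker settings for the canary scheme."""
    name: str                          # name of this TW, used to tag tasks
    is_primary: bool                   # True for the Primary, False for the Canary
    canary_name: Optional[str] = None  # only meaningful on the Primary
    canary_fraction: float = 0.0       # fraction of tasks sent to the Canary

    def __post_init__(self):
        if not 0.0 <= self.canary_fraction <= 1.0:
            raise ValueError("canary_fraction must be between 0 and 1")
        if not self.is_primary and (self.canary_name or self.canary_fraction):
            raise ValueError("only the Primary TW carries canary settings")
        if self.canary_fraction > 0 and not self.canary_name:
            raise ValueError("canary_fraction > 0 requires canary_name")


# Assumed names: only crab-prod-tw02 appears in the text above.
primary = TWConfig(name="crab-prod-tw01", is_primary=True,
                   canary_name="crab-prod-tw02", canary_fraction=0.1)
canary = TWConfig(name="crab-prod-tw02", is_primary=False)
```

Keeping the canary settings only on the Primary means the Canary needs nothing beyond its own name and role, which matches the idea that it never decides task assignment itself.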

Implementation

  • Leverage the new WAITING --> NEW status transition introduced for TaskScheduling
  • The Primary TW looks at all tasks in WAITING. When it decides to move a task to NEW, it also tags the task with the name of the TW which will work on it, i.e. the tagging moves from _lockWork to the new _selectWork
  • The Primary TW only works on NEW tasks which are already tagged with its name
    • Modify _lockWork to also select tasks by TW name when changing status from NEW to HOLDING
  • The Secondary TW simply does not run the _selectWork step and works only on tasks which the Primary tagged with its name
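
The flow described above can be sketched with an in-memory stand-in for the task table. The functions `pick_worker`, `select_work` and `lock_work` below are illustrative only and are not the actual `_selectWork` / `_lockWork` code, which operates on the database. The hash-based split is one possible way to honour the configured fraction and is an assumption, as is promoting every WAITING task (the real selection is decided by TaskScheduling).

```python
import hashlib


def pick_worker(task_name, primary_name, canary_name, canary_fraction):
    """Choose which TW gets a task, stably for a given task name."""
    digest = hashlib.sha256(task_name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    if canary_name and bucket < canary_fraction * 10_000:
        return canary_name
    return primary_name


def select_work(tasks, primary_name, canary_name, canary_fraction):
    """Primary only: move WAITING tasks to NEW and tag them with a TW name."""
    for task in tasks:
        if task["status"] == "WAITING":
            task["tw_name"] = pick_worker(
                task["name"], primary_name, canary_name, canary_fraction)
            task["status"] = "NEW"


def lock_work(tasks, my_name):
    """Any TW: move NEW tasks tagged with my name to HOLDING."""
    locked = []
    for task in tasks:
        if task["status"] == "NEW" and task["tw_name"] == my_name:
            task["status"] = "HOLDING"
            locked.append(task)
    return locked


# Assumed TW names; the task names are made up for the example.
tasks = [{"name": f"task_{i}", "status": "WAITING", "tw_name": None}
         for i in range(50)]
select_work(tasks, "crab-prod-tw01", "crab-prod-tw02", 0.2)  # Primary only
primary_tasks = lock_work(tasks, "crab-prod-tw01")           # Primary
canary_tasks = lock_work(tasks, "crab-prod-tw02")            # Canary
```

Because each TW locks only tasks carrying its own name, the Canary never needs to know the fraction, and a task is never picked up by both workers.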

Questions

  • Do we need to protect against two TWs deployed as primary at the same time? How?
    • No, we don't.
  • How do we monitor the secondary TW? New dashboard?
    • Add the TW name to the stats reported by GenerateMONIT and make new panels.
    • "TaskWorker internal" dashboard: data from https://monit-timberprivate.cern.ch, index monit_private_crab_logs_crabtaskworker-*, use metadata.hostname or data.filebeat_name
    • "GenerateMONIT": not easy, we need to change the code (getCountTaskByStatus function) to get the TW name as well; a new query is likely required
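
As a rough idea of what such a new query might look like, the sketch below counts tasks grouped by TW name and status. It runs against an in-memory SQLite table purely for illustration; the table and column names (`tasks`, `status`, `tw_name`), the function name `count_tasks_by_status_and_tw` and the host name `crab-prod-tw01` are assumptions and do not reflect the real schema or the existing getCountTaskByStatus code.

```python
import sqlite3


def count_tasks_by_status_and_tw(conn):
    """Return one record per (tw_name, status) pair with its task count."""
    rows = conn.execute(
        "SELECT tw_name, status, COUNT(*) "
        "FROM tasks GROUP BY tw_name, status "
        "ORDER BY tw_name, status"
    ).fetchall()
    return [{"tw_name": tw, "status": status, "count": n}
            for tw, status, n in rows]


# Toy data standing in for the task database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks (name TEXT PRIMARY KEY, status TEXT, tw_name TEXT)")
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?, ?)",
    [
        ("t1", "NEW", "crab-prod-tw01"),
        ("t2", "HOLDING", "crab-prod-tw01"),
        ("t3", "NEW", "crab-prod-tw02"),
        ("t4", "NEW", "crab-prod-tw01"),
    ],
)
stats = count_tasks_by_status_and_tw(conn)
```

Each record carries the TW name alongside the status, so a dashboard panel could filter or split on it to show Primary and Canary side by side.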