TRD: remove lock automatically, do not upload to bucket, push slack alert
1. When run interactively after a previous run crashed, TRD asks whether
   to remove the lock. This causes problems with k8s cronjobs, which have
   to be fixed manually by shelling in to remove the lock. We now remove
   the lock automatically. We already have a policy that forbids
   concurrent runs, so the lock is useless.
2. We remove the bucket upload functionality, which has been useless for
   us. Besides, it was using the snapshot engine container, which no
   longer exists in the repo.
3. We prevent the job from failing when TRD fails. Instead, we push a
   Slack notification. This removes the need to clean up failed jobs
   within the cluster.
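
The behaviour described in points 1 and 3 can be sketched in plain POSIX sh as below. This is a minimal illustration, not the script from this commit: the helper names `remove_stale_lock` and `run_with_alert`, and the `LOCK_FILE` and `NOTIFY_CMD` variables, are hypothetical. Only the lock path `/trd/cfg/lock` and `BAKER_ALIAS` come from the chart.

```shell
#!/bin/sh
# Hypothetical helpers; the default lock path mirrors the one used in run.sh.
LOCK_FILE="${LOCK_FILE:-/trd/cfg/lock}"

# Remove a stale lock left behind by a crashed run.
# rm -f is a no-op when the file does not exist.
remove_stale_lock() {
    rm -f "$LOCK_FILE"
}

# Run the given command. If it fails, call the notifier named in
# $NOTIFY_CMD (an assumption of this sketch) with a short message and
# return the notifier's status, so a delivered alert does not fail the job.
# If no notifier is configured, return 1 so the failure stays visible.
run_with_alert() {
    if "$@"; then
        return 0
    fi
    if [ -z "${NOTIFY_CMD:-}" ]; then
        echo "command failed and no notifier is configured, failing job" >&2
        return 1
    fi
    "$NOTIFY_CMD" "Payout failed for ${BAKER_ALIAS:-unknown}"
}
```

With a notifier in place, a failing payout produces an alert and a zero exit status. Without one, the failure still propagates to the job.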
nicolasochem committed Jul 28, 2024
1 parent 5d36473 commit 7b83908
Showing 6 changed files with 33 additions and 62 deletions.
9 changes: 0 additions & 9 deletions charts/tezos-reward-distributor/scripts/bucket_upload.sh

This file was deleted.

3 changes: 0 additions & 3 deletions charts/tezos-reward-distributor/scripts/bucket_upload_secrets

This file was deleted.

22 changes: 15 additions & 7 deletions charts/tezos-reward-distributor/scripts/run.sh
@@ -1,15 +1,14 @@
#!/bin/sh
# remove lock if present
if [ -f /trd/cfg/lock ]; then
rm /trd/cfg/lock
fi

if [ "${DRY_RUN}" = "false" ]; then
dry_run_arg=""
else
dry_run_arg="--dry_run"
fi
if [ "${ADJUSTED_EARLY_PAYOUTS}" == "false" ]; then
aep_arg=""
else
aep_arg="--adjusted_early_payouts"
fi
python src/main.py \
-M 2 \
--reward_data_provider ${REWARD_DATA_PROVIDER} \
@@ -20,5 +19,14 @@ python src/main.py \
--initial_cycle ${INITIAL_CYCLE} \
-N ${NETWORK} \
${EXTRA_TRD_ARGS} \
${dry_run_arg} \
${aep_arg}
${dry_run_arg}

# if TRD fails, send a slack alert
if [ $? -ne 0 ]; then
# check if webhook is set
if [ -z "${SLACK_WEBHOOK}" ]; then
echo "TRD failed, but SLACK_WEBHOOK is not set, failing job"
exit 1
fi
curl -X POST -H 'Content-type: application/json' --data "{\"text\":\"Payout failed for $BAKER_ALIAS\"}" ${SLACK_WEBHOOK}
fi
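
The alert above builds its JSON body by interpolating `$BAKER_ALIAS` directly, so an alias containing a double quote or a backslash would produce invalid JSON. One possible hardening is sketched below. The helper names `json_escape` and `slack_payload` are hypothetical and not part of this commit, and the sketch escapes only backslashes and double quotes, not newlines or other control characters.

```shell
#!/bin/sh
# Escape backslashes first, then double quotes, so the result can be
# embedded in a JSON string literal. Control characters are not handled.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build a JSON body of the form {"text":"..."} from a plain message.
slack_payload() {
    printf '{"text":"%s"}' "$(json_escape "$1")"
}
```

It could then be used with the same curl flags as above, for example `curl -X POST -H 'Content-type: application/json' --data "$(slack_payload "Payout failed for $BAKER_ALIAS")" "$SLACK_WEBHOOK"`.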
30 changes: 8 additions & 22 deletions charts/tezos-reward-distributor/templates/cronjob.yaml
@@ -48,6 +48,7 @@ spec:
volumeMounts:
- mountPath: /trd
name: storage
containers:
- name: tezos-reward-distributor-cron-job
image: {{ .Values.images.tezos_reward_distributor }}
imagePullPolicy: {{ .Values.images_pull_policy }}
@@ -80,26 +81,11 @@ spec:
value: "{{ .Values.initial_cycle }}"
- name: DRY_RUN
value: "{{ .Values.dry_run }}"
containers:
- name: report-uploader
image: {{ .Values.tezos_k8s_images.snapshotEngine }}
volumeMounts:
- mountPath: /trd
name: storage
- mountPath: /trd/cfg/bucket_upload_secrets
name: secret-volume
subPath: bucket_upload_secrets
command:
- /bin/sh
args:
- "-c"
- |
{{ tpl ($.Files.Get (print "scripts/bucket_upload.sh")) $ | indent 16 }}
env:
- name: BUCKET_ENDPOINT_URL
value: "{{ .Values.bucket_upload.bucket_endpoint_url }}"
- name: BUCKET_NAME
value: "{{ .Values.bucket_upload.bucket_name }}"
- name: BAKER_NAME
value: {{ include "tezos-reward-distributor.fullname" . }}
- name: BAKER_ALIAS
value: "{{ .Values.baker_alias | default "unknown" }}"
- name: SLACK_WEBHOOK
valueFrom:
secretKeyRef:
name: {{ include "tezos-reward-distributor.fullname" . }}-secret
key: slack_webhook
restartPolicy: OnFailure
2 changes: 1 addition & 1 deletion charts/tezos-reward-distributor/templates/secrets.yaml
@@ -3,4 +3,4 @@ kind: Secret
metadata:
name: {{ include "tezos-reward-distributor.fullname" . }}-secret
data:
bucket_upload_secrets: {{ tpl (.Files.Get "scripts/bucket_upload_secrets") . | b64enc | quote }}
slack_webhook: {{ .Values.slack_webhook | b64enc | quote }}
29 changes: 9 additions & 20 deletions charts/tezos-reward-distributor/values.yaml
@@ -3,10 +3,8 @@ images:
images_pull_policy: IfNotPresent

tezos_k8s_images:
# snapshotEngine is needed for upload of logs to bucket
# since it already exists, we do not create a new container
# just for this task.
snapshotEngine: ghcr.io/tacoinfra/tezos-k8s-snapshotengine:main
# container with python for TRD
utils: ghcr.io/tacoinfra/tezos-k8s-utils:main

# The node endpoint. It must be an archive node.
# May start with https://
@@ -25,17 +23,12 @@ signer_addr: tezos-signer-0.tezos-signer:6732
schedule: "0 */6 * * *"

# Where TRD gets its payout data from.
# Defaults to rpc. When using rpc, you must set `tezos_node_addr` to an archive node.
#
# Pick one of "rpc", "tzstats", "tzkt"
reward_data_provider: "rpc"
# Defaults to tzkt (the only option)
reward_data_provider: "tzkt"

# Tezos Network. Can be MAINNET or GHOSTNET
network: MAINNET

# Enable adjusted early payouts. Pay out 6-9 days after delegation instead of 18-21 days.
adjusted_early_payouts: false

# Set initial cycle to pay rewards from. Set to -1 to start from just finished cycle.
initial_cycle: -1

@@ -114,12 +107,8 @@ trd_config:
# Rewards for cycle %CYCLE% are completed.
# We paid out %TREWARDS% tez in rewards to %NDELEGATORS% delegators.

# optionally upload all TRD state to a bucket. This allows all data to be examined
# when the cronjob is not running.
bucket_upload:
bucket_endpoint_url:
bucket_name:
bucket_upload_secrets:
access_key_id:
default_region:
secret_access_key:
# slack webhook to be alerted when TRD fails
# slack_webhook: "https://hooks.slack.com/services/XXXXXXXXX/XXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXX"

# baker alias to push to slack webhook
# baker_alias: "mybaker"
