Prism long-term storage (LTS) consumes messages from the specified topics and stores them in AWS S3, partitioned by topic and metric date.
For example, messages written to the saas-usage-metrics topic with a metric date of 2018-06-16 end up in s3://prod-prism-lts/topics/saas-usage-metrics/year=2018/month=06/day=16/saas-usage-metrics+<kafka-partition>+<kafka-offset>.avro
NOTE: Kafka messages and .avro files in S3 do not map one-to-one. A single .avro file in S3 may contain 1 to N messages, up to 5 MB of messages in total.
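The path layout above can be sketched as a small shell helper, which is handy when you need to locate the files for a given topic and day. This is a minimal sketch: the function name `lts_s3_uri` is hypothetical, the metric date is assumed to be passed as YYYY-MM-DD, and the offset is passed through exactly as given (the real writer may format it differently).

```shell
#!/bin/sh
# Build the S3 URI for one .avro file, following the layout described above.
# Arguments: topic, metric date (YYYY-MM-DD), Kafka partition, Kafka offset.
# The body runs in a subshell "( ... )" so its variables do not leak out.
lts_s3_uri() (
  metric_date="$2"
  year="${metric_date%%-*}"     # text before the first "-"
  rest="${metric_date#*-}"      # text after the first "-"
  month="${rest%%-*}"
  day="${rest#*-}"
  printf 's3://prod-prism-lts/topics/%s/year=%s/month=%s/day=%s/%s+%s+%s.avro\n' \
    "$1" "$year" "$month" "$day" "$1" "$3" "$4"
)
```

For example, `lts_s3_uri saas-usage-metrics 2018-06-16 3 1839407` prints the URI for partition 3 at offset 1839407 (both values are made up for illustration).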
- No other services are impacted by an outage in Prism LTS.
- The Kafka message retention policy allows Prism-LTS to be down for ~7 days before it starts missing messages.
- No customers need to be notified of a Prism LTS outage.
- Check the Number of Active Tasks in Grafana
- 12/12 indicates that Prism LTS is healthy
- < 12/12 indicates that tasks are dying
- 0/12 indicates no messages are being written to S3
- Check the Prism-LTS REST API for a stack trace
- This is much easier than searching Loggly; sometimes errors are not logged
- SSH into the Prod Jumpgate
- Do a
kubectl port-forward <prism-lts pod> 8083:8083
to expose the REST API
- Check the status of the tasks with
curl 127.0.0.1:33250/connectors/s3-connector/status
- All 12 tasks should be in the RUNNING state
- Check the logs to ensure Prism-LTS is writing and committing message offsets
- example log message:
INFO Files committed to S3. Target commit offset for ac-user-event-2 is 1839407
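The task-state check can also be scripted instead of read by eye. The sketch below counts RUNNING tasks in the status response and maps the count onto the three cases listed for Grafana. It rests on several assumptions: the status endpoint returns compact JSON in which each task object starts with `{"id":N,"state":"..."`, the running-task count corresponds to the active-task number shown in Grafana, and the function names are hypothetical. If `jq` is available on the jumpgate, it is a more robust way to parse the response.

```shell
#!/bin/sh
# Read a connector status document on stdin and print how many tasks
# report the RUNNING state.
# Assumption: compact JSON where each task object begins {"id":N,"state":"X"
running_task_count() {
  tr '{' '\n' |
    sed -n 's/^"id":[0-9][0-9]*,"state":"\([A-Z_]*\)".*/\1/p' |
    awk '$0 == "RUNNING" { n++ } END { print n + 0 }'
}

# Map a running-task count onto the three cases from the checklist.
# Arguments: running count, expected count (defaults to 12).
lts_health() {
  if [ "$1" -ge "${2:-12}" ]; then
    echo "healthy"
  elif [ "$1" -gt 0 ]; then
    echo "degraded: tasks are dying"
  else
    echo "down: no messages are being written to S3"
  fi
}
```

With the port-forward open, something like `lts_health "$(curl -s 127.0.0.1:33250/connectors/s3-connector/status | running_task_count)"` prints a one-line verdict. The address must match the local port exposed by the port-forward.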
Manually Restart the Failed Tasks:
- SSH into the Prod Jumpgate
- Do a
kubectl port-forward <prism-lts pod> 8083:8083
to expose the REST API
- Check the status of the tasks with
curl 127.0.0.1:33250/connectors/s3-connector/status
- Restart a failed Task with
curl -X POST 127.0.0.1:33250/connectors/s3-connector/tasks/<task-id>/restart
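When several tasks have failed, the status check and the restart call above can be chained. The following is a sketch under stated assumptions, not a vetted tool: the function names and the `CONNECT_URL` variable are hypothetical, the status response is assumed to be compact JSON with each task object starting `{"id":N,"state":"..."`, and the base URL must match the local port exposed by the port-forward.

```shell
#!/bin/sh
# Assumption: the REST API is reachable here through the port-forward.
CONNECT_URL="${CONNECT_URL:-http://127.0.0.1:33250}"

# Read a connector status document on stdin and print the id of every
# task whose state is FAILED, one per line.
failed_task_ids() {
  tr '{' '\n' |
    sed -n 's/^"id":\([0-9][0-9]*\),"state":"\([A-Z_]*\)".*/\1 \2/p' |
    awk '$2 == "FAILED" { print $1 }'
}

# Build the restart endpoint for one task id.
task_restart_url() {
  printf '%s/connectors/s3-connector/tasks/%s/restart\n' "$CONNECT_URL" "$1"
}

# POST a restart for each task id given as an argument.
restart_tasks() {
  for id in "$@"; do
    curl -fsS -X POST "$(task_restart_url "$id")" ||
      echo "restart of task $id failed" >&2
  done
}
```

A typical invocation would be `restart_tasks $(curl -fsS "$CONNECT_URL/connectors/s3-connector/status" | failed_task_ids)`. Check the status again afterwards to confirm that the restarted tasks have returned to RUNNING.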