Azure Incident Response with Datadog

1. Intro

In this quick section we'll set up a test, assign a metric as a monitor, then go through an incident.

If you're reading this as a readme on GitHub, you can download the notebook's JSON file from json_import.json file in the repository, and import it.

First create a new notebook in Datadog.
After creating a new notebook, import it from the top-right share icon.

Note: if ever you need to export a notebook, from the same menu you can download as PDF or markdown (.md) or export the JSON file.

Overview

1. Intro
2. Synthetic Testing
3. Monitors
4. Creating an Incident
5. Updating an Incident
6. Resolving an Incident
7.Blameless Postmortem
8. Links and Docs
9. Markdown Fun!

2. Synthetic Testing

In order to test the site we created earlier in the lab, we'll set up a synthetic monitor.

You can do that by following this link to create a multi-step syntethic test.

In a different tab, we'll need to retrieve the URL for our app service from the Azure App Services page.

After running a few tests - you can export the metrics or graphs to a dashboard (new or existing) or even add one to a notebook, as seen below!

Once you see the Network Timings graph below fill up, please proceed to the next section.

Network timings (averaged)

3. Monitors

Next we'll head over to the Monitors section of Datadog in order to have a look at the automatically-created monitor from our synthetic test.

It might be red, but don't panic - we're only setting things up in our development environment 😅

Because we set our alert to @all, everyone in our company would have received this alert. That could have been via Teams, email, or other services you've set up in order to receive alerts or notifications.

Synthetics Response Time by URL

4. Creating an Incident

From the dashboard we created by exporting the Synthetics metric, we'll declare an incident.

In the incident declaration, you can set a title/summary, the severity level, pick an audience for notifications as well as context and signals (ours will be pre-filled as we created the incident from a graph).

Once the incident has been created, it will appear in the following graph of Active Incidents.

Active Incidents

5. Updating the Incident

Throughout the incident lifecycle, we'll want to update the status in order to keep team team and stakeholders up-to-date on the progress.

Note that you can also link to both live chat as well as video chat - say for example you've set up a new Teams channel programmatically in order to deal with the incident, but also a live Teams video meeting muster point (or "war room"). Both can be added as links so that others can join and get updated with one click from the header on the incident's page. ADD IMAGE HERE

Next we'll go over adding an update, as well as sending out a notification from within the Incident Response section of Datadog.

To do so, first add an update from the Incident Response timeline.

Once it has been updated, on the top-right of the Incident Response page, we'll send out a notification: ADD IMAGE HERE

Next we'll add a task to the incident. This is like a to-do list for the team. You can assign tasks to team members as well as add a deadline, if required.

After adding a task to the incident, the Datadog Events query should show an entry for the Incident update.

6. Resolving the Incident

Once we've addressed the causes of the incident, for example via a subsequent deployment, we can then resolve it, via the status on the top-left of the Incident Resolution page.

Resolved Incidents

7. Blameless Postmortem

Once the incident has been resolved, you would normally start the blameless postmortem process.

This means collecting a timeline of events, things that were tried, any dashboards related to the incident, etc. Since we collected these as we went along, when the postmortem is created in Datadog, it will collect all of these for you, and collect it in a notebook. Once the postmortem notebook has been created, you can then export it as markdown, JSON and/or PDF.

Link to first postmortem

8. Markdown Fun!

While working with Datadog notebooks, sometimes having a cheat-sheet handy can be helpful for those unfamiliar with Markdown, as well as to serve as a quick reminder to you while on-call.

Some Handy Markdown

Title

Sub-title

Sub-sub-title

Note

Emphasis

Use either _ or *:

bold / bold

italics / italics

Code Snippets

Single line / inline code

Multiline

for i in {1..100}; 
  do echo "hi from Datadog!"; 
done

Links

link

Tables

Azure Service	Monitor
App Service	Throughput
VM	Uptime

Checklists

Checklist Item
Unchecked

Bullets

Bullet
List

Numbered Lists

This is the first item
This is the second!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Azure Incident Response with Datadog

1. Intro

Overview

2. Synthetic Testing

Network timings (averaged)

3. Monitors

Synthetics Response Time by URL

4. Creating an Incident

Active Incidents

5. Updating the Incident

6. Resolving the Incident

Resolved Incidents

7. Blameless Postmortem

8. Markdown Fun!

Some Handy Markdown

Title

Sub-title

Sub-sub-title

Emphasis

Code Snippets

Links

Tables

Checklists

Bullets

Numbered Lists

Files

README.md

Latest commit

History

README.md

File metadata and controls

Azure Incident Response with Datadog

1. Intro

Overview

2. Synthetic Testing

Network timings (averaged)

3. Monitors

Synthetics Response Time by URL

4. Creating an Incident

Active Incidents

5. Updating the Incident

6. Resolving the Incident

Resolved Incidents

7. Blameless Postmortem

8. Markdown Fun!

Some Handy Markdown

Title

Sub-title

Sub-sub-title

Emphasis

Code Snippets

Links

Tables

Checklists

Bullets

Numbered Lists