Creating A Culture of Observability at Stripe

@gphat

Stripe had problems with observability.
- Lack of confidence, super reactive
This isn't about tech, it's about people.
Did it work? see my resume at...
Observability is a team. 3 + 1 intern
Where to begin:
- Spend time with the tools in place currently
- improve them if possible
- Replace if not possible
- leverage past knowledge.
Empathy and Respect:
- People are not (generally) evil. But they are busy.
- Stressed, doing best with what they have
- Being a hater is lazy
- Help People Be Great At Their Jobs!
Replaced existing System
- Maybe a bad call, technically better
- overcoming momentum is hard, adds work
- Declaring Bankruptcy
- Saved us ops headaches
- still going
Tip: Nemawashi (google it)
- Start small, you're a great guinea pig
- Quietly lay a foundation and gather feedback
- ask how you can improve, follow up!
- Engage discontent! usually fine. Sometimes you need whisky.
Identify Power Users
- Find interested parties
- talk to them, give them what they need
- Empower them to help others
- watch them grow.
Value
- What are you improving?
- How can you measure it?
- is this the best way?
- You have to think about this daily. Success is hitting the real goal, not delivering something you're bent on delivering
Control Theory and Observability.
- Reference -> Programmer -> System/sensor(s) -> Work ^----------------/
Flat org work ethic:
- Shave the preposterous line of yaks.
- Stigmergy - Disparate systems working together.
- Strike when good opportunities arise (incidents, etc)
Advertise
- Don't be afraid!
- Promote accomplishments
- Promote OTHER PEOPLE'S accomplishments
- Humbly ask to help then learn
- We send monthly 'State of' addresses.
Make it easy & good
- Much harder than it sounds. (Email is easy, but not good)
- Make it easy/automatic to do things right, and hard to do wrong.
Automated Monitors:
- Baseline monitoring
- common problems, common solutions
- users have no state, and are surprised
- people care when you show them failure and how to fix it.
- IF YOU PAGE, YOU NEED TO TELL RECIPIENT WHY.
- Stripe built a dashboard for each alert.. including a runbook of how to fix it.
Getting Feedback: how we improve
Teach the basics:
- Company curriculum: Teach 'em early.
- Measuring work metrics
- ...
- Visualizations
Ownership
- Poor story for this
- Org was ready for this, management was on board.
- Evolving, tools are lacking.
Did it work?
- Yes, but not done.
  - Some teams, heck yes. Strong champions...
  - some teams, kinda the same
  - Some other teams? What is observability... why do I care? (rare) (Not on call)
  - Usage? Getting better.
  - All positive feedback from automation (avg 4.5 of 5 stars) (2.5% response rate to alert feedback)
- Tools:
  - Dozens of OSS PRs, OSS StatsD library (scala)
  - vast improvement over old pipeline, no loss
  - New styles, better naming, more consistency
- Adjustments
  - Embracing other tools (log analysis, error catching)
  - Begging to work on strategic things( global timers, histograms and sets)
  - Need to improve metrics on our own work (we got by easy for a while)
  - Writing new monitors is super hard. Watch for things going to zero too!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mon04Creating.A.Culture.of.Observability.at.Stripeby_Cory.Watson.md

Mon04Creating.A.Culture.of.Observability.at.Stripeby_Cory.Watson.md

Creating A Culture of Observability at Stripe

Files

Mon04__Creating.A.Culture.of.Observability.at.Stripe__by_Cory.Watson.md

Latest commit

History

Mon04__Creating.A.Culture.of.Observability.at.Stripe__by_Cory.Watson.md

File metadata and controls

Creating A Culture of Observability at Stripe

Mon04Creating.A.Culture.of.Observability.at.Stripeby_Cory.Watson.md

Mon04Creating.A.Culture.of.Observability.at.Stripeby_Cory.Watson.md