Customer Postmortems

So you’ve had some downtime. Your customers were impacted for an extended period of time. You’re most likely already writing a postmortem to circulate internally, but what about writing one for your customers, too?

 

A History Lesson

I work on a legacy product that’s been around for over a decade. When our product was new, we had a tendency to interact with our customers on a regular basis, both within the product, and on venues like web forums. Our customers felt like we were a part of the community, and that we cared about them.

But as the product aged, our involvement grew less and less. What was a steady flow of information slowly fell to just a trickle, and then eventually stopped. We went several years without any real communication with our customers after outages, other than a post on our status blog that things were broken, and then another when they were fixed.

This lack of communication made us seem like we didn’t care about our customers. That wasn’t actually the case; we cared about them very much, but they had no way to know it. Contempt grows in a vacuum. Our product caters to creative individuals, and their imaginations were left to run wild.

This brings us to about a year ago. After an extended period of downtime, I decided the right thing to do was to tell our customers what happened. I took a risk and wrote a blog post explaining what happened (and saying sorry for the mess), got it approved by all the right people, and it was posted on our website.

The response from our customers was overwhelmingly positive.

Suddenly, the same people that loved to come up with conspiracy theories about what happened (often saying we’d done it on purpose) were happy! They welcomed the information with open arms. It got reposted on other blogs, and customers even wrote their own blog posts on what happened.

By taking just a few hours to write a postmortem and post it for the world to see, we were able to generate a huge amount of goodwill towards not only the ops team, but the company in general.

We were suddenly seen as human again. We weren’t some huge corporation that didn’t care about our customers, we were a group of engineers that went to battle to keep our product going. They saw us as people. It was amazing.

This trend has continued after every major period of downtime. During the last round of it, I even saw someone mention on Twitter that they were looking forward to our write-up of what happened.

It’s a good feeling knowing that ops has been able to make such a big difference in how our customers perceive our company. It’s all about goodwill, and saying we’re sorry and explaining what went wrong has been a great source of it.

 

How to Get Started?

Writing customer postmortems is a bittersweet thing for ops. Writing one means that downtime occurred. We don’t like downtime. I’m always happy to write one, but hate that I have to do it.

The way I got started was to simply write one after a major outage and start passing it around. This worked well at the small company I work for.

If you work for a larger organization, I’d start by having a conversation with your coworkers and manager about why you feel better communication is important. See if you can figure out where it would go on your website. Figure out what kind of buy-in you need to post something publicly before there’s an outage, who would need to approve it, and who would be the person to actually get it posted.

But simply put, the best way I found was just to write one following an outage. Show people what’s possible and what you have in mind.

 

Timing

Timing is important. Obviously you shouldn’t write about it until after the situation is fully resolved, but you don’t want it to linger for weeks, either.

My goal is to write a customer postmortem the day of the outage, by the end of the day. This gives me a day or two to make changes and get approvals from all of the right people. Getting approvals often takes longer than writing the thing in the first place. Usually we post the customer postmortem within 24 hours of the situation being resolved.

Is it more important than your internal technical postmortem? That depends on your organization. Often a customer postmortem is easier to write than an internal technical one, since you don’t need a full list of everything that happened, but it’s easier to write the external one after the internal one has been written.

 

When to Write a Postmortem

So when should you write a postmortem? After every outage?

This is something you need to figure out as an ops team. If you write too much, you might give the impression that things are worse than they are, but if you don’t write enough, you appear to lack empathy.

I have a benchmark I use for when to write a customer postmortem: when more than half of the customer base is impacted, for more than a few minutes.

Twitter (and your support team) is a great source of intelligence for gauging how upset your customers are. If there’s more than just a few tweets about an outage, I use that as a sign that we need to tell folks what happened.

 

Blameless

If you haven’t read John Allspaw’s piece on Blameless Postmortems, take a few minutes and do it now. John Allspaw is one of the greats in our industry.

In my opinion, it’s even more essential that customer postmortems are blameless. How something happened (or who caused it) is completely unimportant to your customers, and it’s inappropriate to mention it. Your customers don’t care about the structure of your company – they care about the product they’re using.

Anything that’s posted on the Internet is going to be indexed by Google and saved forever. Never mention a person’s name. You could be messing with someone’s career if a future employer searches a candidate’s name and finds something damaging.

Just. Don’t. Do. It.

 

Blameless Extends to Vendors, too

As tempting as it might seem, don’t throw your vendors under the bus. We don’t like downtime. When our product goes down, and it’s due to a fault in a vendor’s product or service, it can be pretty upsetting. (These are the types of outages I like the least, myself!)

Here’s a few things to keep in mind:

  • Once the outage is over, that relationship with the vendor will continue. It’s in the best interest of your team, and ultimately your customers, to keep it a healthy one.
  • As ops, it’s our job to engineer things such that the failure of a single vendor doesn’t cause downtime… and if a vendor failure does take us down, some of that is on us, too.

Write as if your vendors are going to be reading what you’re writing, because odds are, they’re going to.

 

Learn to Speak Marketing

This is hard for opsen.

Your organization most likely has a certain way they want to communicate with your customers. For example, some product names might always need to be written a certain way. Or, some terms are considered proper nouns and need to be capitalized. Remember that you’re speaking on behalf of the company.

Most organizations have a style guide, or maybe even people you can ask for help. My employer certainly does. They’re usually very willing and happy to work with you, so don’t be afraid to talk to them.

This was hard for me! The first time I wrote a customer-facing postmortem, our Communications Director had me make dozens of changes to my post to match the corporate style. Every time I do it there have been fewer and fewer corrections to make… to the point that on the most recent one, there weren’t any! Yay.

 

How Much Detail?

This can be tricky. I keep some things in mind when I write.

First, I’m speaking (largely) to a non-technical audience. They’re very smart, but they aren’t necessarily technical. So while it might be tempting to write something like “a slave MySQL host encountered an unexpected foreign key constraint conflict, halting replication, which in turn made reads to the cluster fall out of sync with the master,” bear in mind that the only people that are going to understand that are your co-workers. Instead, I’d say something like “there was an issue with the database.” Save the detailed explanation for your internal postmortem.

Second, and perhaps most importantly, remember that not everyone that’s going to read this has the best interests of your customers at heart. I try really hard not to tip our hand on how the problem was created, out of fear that someone may use this information against us.

You have to be really careful here. For example, if you lose a transit link, don’t tell people what capacity was lost. Circuits only come in so many sizes, and you might have accidentally just told someone that’s out to DDoS you how much bandwidth they need to knock you out. (Never, ever do this.)

There was one time when I specifically mentioned the CVE number that we had to interrupt services to patch against. This was because it was a hot issue (that was being mentioned on the news), and I wanted our customers to know “we got it, don’t worry about it.”

I try to explain what happened, while not giving specifics. Terms like “database cluster” are usually pretty safe. Just use your head and pretend you’re reading it with the intentions of doing harm. Ask your security team if you have any doubts!

 

Say How You’re Making it Better

Be sure to mention what your team is doing to keep this from happening in the future.

Just like before, a lot of detail isn’t needed. It could be as simple as “we’re going to dig into the database logs and figure out what happened.”

You most likely list “next steps” on your internal technical postmortem. This might be a good place to start, if you can do so without compromising too much information.

During a string of outages at my company last year, we also had problems with the blog system our support team uses to communicate with customers going down. I made sure to mention this in my postmortem as a problem, even though it wasn’t directly tied to the outage itself. Customers being unable to get information during an outage is a big deal to them… maybe even more than the outage itself. (Vacuums breed contempt!)

In every postmortem, I gave a follow-up on that issue, telling how it had (or hadn’t) gotten better. When it was finally resolved, I once again mentioned that we’d been listening to our customers’ feedback on it, and we were happy that things had improved.

 

It’s All About Empathy

This is really what it’s all about.

The student edition of the Merriam-Webster dictionary defines empathy as:

a being aware of and sharing another person’s feelings, experiences, and emotions; also : the ability for this

I don’t know of a single person that works in ops that enjoys downtime. We take pride in our work. We don’t like it when our customers are impacted. Downtime to us represents failure.

Part of being good at what we do is understanding that while we’re upset that things went down, our customers are impacted and upset, too. By publicly acknowledging to your customers that you understand they’re upset, and letting them know that you’re upset too, you help build a bond between you. It’s that feeling of “we’re all in this together” that can drive empathy and help form a community.

 
Thanks for listening. Hopefully I’ve given you enough motivation to help make your ops team seem a little more human to your customers, too.

April ❤


Tools of the Trade: The Basics

This is the first in what I think will be a regular thing on Ops n’ Lops, a look at the toolset that I use! This post is going to feature the basic tools that I use every single day to access servers.

Opsen are a fickle group when it comes to our tools, and I am no exception!

Every professional has a well-worn and loved set of tools they’ve acquired and worked with over their career. Just as carpenters have beloved hand tools, and musicians have well-worn and crafted instruments, operations engineers all have a set of configurations and workflow items that we’ve molded and shaped to fit our fancy.

Basic Workflow

My day-to-day workflow looks like this:

My workflow from local machine to remote host.

I work locally off a Mac, connected to our remote environment over a VPN, accessing my “home base” machine within our environment over mosh.

I do all my work off of the remote VM, and not on my local machine. I try to keep as little confidential data on my local machine as possible in case it’s ever stolen. (I commute via public transportation, so this is actually a fairly large concern!) Basically my local workstation is nothing more than a battery-powered dumb terminal with an amazing screen and a great keyboard.

At work each engineer (developer, ops, or anyone that wants one, really) is given their own VM to work in. It’s ours and we can do with it what we want. (Truth be told, I have four VMs at the moment, but that’s because I work with machine images and need places to test them!)

A common question we debate endlessly at work is: “Where is the best place to run your ssh-agent?” There are good arguments on each side, but I’m firmly on the side of “keep as little confidential data locally as possible,” and an SSH key is no exception. (It can be a mostly moot point, because SSH assumes your servers have a way to share public keys, which is not trivial when running at scale!)
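For what it’s worth, here’s a rough sketch of what the “keep the key off the laptop” approach looks like in practice. The host alias and key path are just examples, not our actual setup:

    # On the remote "home base" VM (not the laptop): create a key that
    # never leaves that machine.
    ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519

    # Start an agent inside the long-running session and load the key once.
    eval "$(ssh-agent -s)"
    ssh-add ~/.ssh/id_ed25519

    # From the laptop, connect with agent forwarding explicitly disabled (-a),
    # so no key material ever lives on the machine that rides the train with me.
    ssh -a homebase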

mosh

So why mosh and not SSH? Mosh is designed to work on high-latency, lossy connections. It maintains state locally so I don’t have to wait for keystrokes to echo back, which is great when using cellular-based connections. But the best part of mosh is that it can handle connections changing over time. I can be connected at the office, and close the lid on my laptop when it’s time go, and leave without a care. When I get home, the session just resumes as I’d left it, even if my IP address changed!

Technical explanation: it uses SSH to establish the remote connection (it starts a process on the remote host that maintains the session state), but then drops the SSH connection and transmits display updates over UDP instead. The authors claim it’s just as secure as SSH (if not more so), but I run it over a VPN just to play it safe.
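If you want to try it, day-to-day usage is about as simple as it gets. The hostname below is made up, and the extra options are only there to show that mosh will pass settings through to the underlying SSH connection:

    # Connect to the home-base VM; mosh authenticates over SSH, starts
    # mosh-server on the far end, then switches to its own UDP protocol.
    mosh april@homebase.example.com

    # SSH options (and the path to mosh-server) can be overridden if needed.
    mosh --ssh="ssh -p 2222" --server=/usr/local/bin/mosh-server april@homebase.example.com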

A Closer Look at Some of my Tools

Now I’ll talk a little bit about each of the tools I use, and why.

Local Computer

My local machine is a MacBook Air. I use it mostly because that’s what I was handed on my first day at work, but also because it’s a good machine. Apple laptops tend to have great displays, and keyboards that are good to work with all day. (Many hours of battery life is nice, too!)

I adore OSX because it’s basically UNIX with a very nice and workable UI. When I open up a command line I’m instantly at home, and the GUI is very nice to work with all day and into the night.

The folks on my team are mostly split between OSX and Linux for their local workstations. Both work well! I actually gave Linux a shot for a few months before going back to OSX. The reasons I stuck with OSX had nothing to do with the tools of my job… it was the little things that the apps just do better on OSX. (The OS itself is great.)

What about Windows?

I suppose it’s possible to use Windows in ops, but I don’t know anyone that does. It’s missing some pretty basic functionality that I depend on – like a good, workable (aka POSIX) command line. It’s also lacking a lot of things that I take for granted on the UNIX side, like SSH. (I mean the entire suite of tools – including things like scp.) Windows is sorta the land that ops forgot… go look at the mosh download page for a Windows binary. You won’t find it.

iTerm 2

I use iTerm 2 as my local terminal for a shockingly simple reason – it allows me to set up key macros. I have a few that make interacting with tmux much faster… like Cmd-<LeftArrow> and Cmd-<RightArrow> to change between windows. I’m so used to having those macros that I get really grumpy when I’m working on a machine without them. 🙂

Some of my keyboard macros in iTerm 2.
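If you want to set up something similar, the macros boil down to sending the tmux prefix plus a window command as a hex sequence. This is just an illustration assuming tmux’s default Ctrl-b prefix; the exact bindings are whatever fits your fingers:

    # iTerm 2 → Preferences → Keys → Key Bindings, action "Send Hex Code"
    # (0x02 is Ctrl-b, the default tmux prefix):
    #
    #   Cmd-<LeftArrow>   sends  0x02 0x70   # prefix + p = previous tmux window
    #   Cmd-<RightArrow>  sends  0x02 0x6e   # prefix + n = next tmux window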

Fonts

There are a lot of strong feelings about the font your terminal uses. I am a HUGE fan of Anonymous Pro. It’s one of the very first things I install when I’m setting up a new machine. If you’re looking for a good font, try it! Magic happens around size 15pt, when the OSX anti-aliasing kicks in just right: O looks different than 0, and l looks different than 1.

tmux

I use a long-running tmux session as my main workspace. I picked tmux over screen because I’ve had better luck with it over time, but some of my co-workers swear by screen. To each their own. 🙂
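As a rough sketch of how that long-running session works (the session name here is just an example):

    # Attach to the session named "work" if it already exists, otherwise
    # create it. The session keeps running on the VM even when my
    # connection goes away.
    tmux new-session -A -s work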

Putting it all Together

Here’s a sample screenshot of what I look at all day, every day:

An example of bash running in a tmux window.

That’s bash, running in tmux, running in iTerm 2 with my font set to Anonymous Pro (15pt) on MacOS X. 🙂

 

I hope this has been an interesting look at the tools I use all day, every day. Thanks for reading, and I think I’ll be doing more “Tools of the Trade” features in the future!

April ❤