Recruiting via GitHub and Unintentional Biases

My GitHub Commit History

I would like to talk about unintentional biases for a moment.

The above chart is my own commit history (on my personal account) on GitHub right now. Not a single contribution. Pretty terrible, right? Some people in our industry think so. I linked to this blog post only because it’s the most recent one I’ve read – I’ve read others like it recently, too.

Now, while I take it with a grain of salt because it’s written by a recruiter with a decade of experience at Red Hat, I want to talk about two points.

The first, and least important, is that GitHub is just one of many version control systems in wide use today. To claim that it’s the “social network of code” is very misleading. It’s popular with open source projects… but lots of really talented and amazing engineers work on proprietary systems that aren’t tracked on GitHub. I happen to be one of them.

VCSes are, by their very nature, supposed to be boring. They’re just a tool for tracking changes and promoting collaboration between engineers. Tools like this date back to the 1970s. There are a bunch of them on the market. They’re designed to get out of the engineer’s way and let them focus on code.

I work on a legacy codebase at work that pre-dates GitHub by many years. We have a well established set of processes and scripts built around the VCS we use, and they’re working for us. It’s fun to think about adopting a new one, but let’s be honest, our current one works, and we’d rather focus our efforts on making cool things. We don’t want to take the productivity hit fixing something that isn’t really broken.

I work with some amazing engineers who create awesome things, but you wouldn’t know it by looking at our GitHub commit history. (I’d rather you look at the stuff we produce instead!)

 

The second point, however, is far more important.

This blog post dances around it (most likely wisely, because this comes up a lot), but I’ve read several posts suggesting that “always be coding” (ABC) means it doesn’t matter if you work on a proprietary codebase that’s under lock and key – all good engineers will spend their weekends and evenings writing code as a side project.

This is crazy.

I work with some really talented engineers whose “side projects” include things like “raising a family.”

Just in my circle of friends and coworkers, I know people that do things after work hours like participate in community theater, volunteer at a suicide hotline, be a parent to multiple children, care for animals in a shelter, contribute to grassroots political campaigns, write novels, and help create safe spaces for closeted LGBT people that are in need of someone to talk to. Each of these examples is from an amazingly talented and awesome engineer that any company would love to have on staff – but you’d never know it by their GitHub contributions.

 

Don’t let a tool like GitHub inject unintentional biases into your recruiting processes. It’s a pretty slick VCS, yeah, but remember that it’s just one of many… and you’re going to miss out on some amazing people.

April ❤


Customer Postmortems

So you’ve had some downtime. Your customers were impacted for an extended period of time. You’re most likely already writing a postmortem to circulate internally, but what about writing one for your customers, too?

 

A History Lesson

I work on a legacy product that’s been around for over a decade. When our product was new, we had a tendency to interact with our customers on a regular basis, both within the product, and on venues like web forums. Our customers felt like we were a part of the community, and that we cared about them.

But as the product aged, our involvement grew less and less. What was a steady flow of information slowly fell to a trickle, and then eventually stopped. We went several years without any real communication to our customers after outages, other than a post on our status blog that things were broken, and then another when they were fixed.

This lack of communication made it seem like we didn’t care about our customers. That wasn’t actually the case; we cared about them very much, but they had no way to know it. Contempt grows in a vacuum. Our product caters to creative individuals, whose imaginations were left to run wild.

This brings us to about a year ago. After an extended period of downtime, I decided the right thing to do was to tell our customers what happened. I took a risk and wrote a blog post explaining what went wrong (and saying sorry for the mess), got it approved by all the right people, and it was posted on our website.

The response from our customers was overwhelmingly positive.

Suddenly, the same people that loved to come up with conspiracy theories about what happened (often saying we’d done it on purpose) were happy! They welcomed the information with open arms. It got reposted on other blogs, and customers even wrote their own blog posts about what happened.

By taking just a few hours to write a postmortem and post it for the world to see, we were able to generate a huge amount of goodwill towards not only the ops team, but the company in general.

We were suddenly seen as human again. We weren’t some huge corporation that didn’t care about its customers; we were a group of engineers who went to battle to keep our product going. They saw us as people. It was amazing.

This trend has continued after every major period of downtime. During the last round of it, I even saw someone mention on Twitter that they were looking forward to our write-up of what happened.

It’s a good feeling knowing that ops has been able to make such a big difference in how our customers perceive our company. It’s all about goodwill, and saying we’re sorry and explaining what went wrong has been a great source of it.

 

How to Get Started?

Writing customer postmortems is a bittersweet thing for ops. Writing one means that downtime occurred. We don’t like downtime. I’m always happy to write one, but hate that I have to do it.

The way I got started was to simply write one after a major outage and start passing it around. This worked well at the small company I work for.

If you work for a larger organization, I’d start by having a conversation with your coworkers and manager about why you feel better communication is important. See if you can figure out where it would go on your website. Figure out, before there’s an outage, what kind of buy-in you need to post something publicly, who would need to approve it, and who would be the person to actually get it posted.

But simply put, the best way I found was just to write one following an outage. Show people what’s possible and what you have in mind.

 

Timing

Timing is important. Obviously you shouldn’t write about it until after the situation is fully resolved, but you don’t want it to linger for weeks, either.

My goal is to have the customer postmortem written by the end of the day the outage happens. This gives me a day or two to make changes and get approvals from all of the right people. Getting approvals often takes longer than writing the thing in the first place. Usually we post the customer postmortem within 24 hours of the situation being resolved.

Is it more important than your internal technical postmortem? That depends on your organization. A customer postmortem is often easier to write than an internal technical one, since you don’t need a full account of everything that happened, but I find it easiest to write the external one after the internal one has been written.

 

When to Write a Postmortem

So when should you write a postmortem? After every outage?

This is something you need to figure out as an ops team. If you write too much, you might give the impression that things are worse than they are, but if you don’t write enough, you appear to lack empathy.

I have a benchmark I use for when to write a customer postmortem: when more than half of our customer base is impacted for more than a few minutes.

Twitter (and your support team) is a great source of intelligence for gauging how upset your customers are. If there’s more than just a few tweets about an outage, I take that as a sign that we need to tell folks what happened.

 

Blameless

If you haven’t read John Allspaw’s piece on Blameless Postmortems, take a few minutes and do it now. John Allspaw is one of the greats in our industry.

In my opinion, it’s even more essential that customer postmortems are blameless. How something happened (or who caused it) is completely unimportant to your customers, and it’s inappropriate to mention it. Your customers don’t care about the structure of your company – they care about the product they’re using.

Anything that’s posted on the Internet is going to be indexed by Google and saved forever and ever. Never mention a person’s name. You could be messing with someone’s career if a future employer searches for a candidate’s name and finds something damaging.

Just. Don’t. Do. It.

 

Blameless Extends to Vendors, too

As tempting as it might seem, don’t throw vendors under the bus. We don’t like downtime. When our product goes down, and it’s due to a fault in a vendor’s product or service, it can be pretty upsetting. (These are the types of outages I like the least, myself!)

Here’s a few things to keep in mind:

  • Once the outage is over, that relationship with the vendor will continue. It’s in the best interest of your team, and ultimately your customers, to keep it a healthy one.
  • As ops, it’s our job to engineer things such that the failure of a single vendor doesn’t cause downtime… and if that happens… yeah.

Write as if your vendors are going to be reading what you’re writing, because odds are, they’re going to.

 

Learn to Speak Marketing

This is hard for opsen.

Your organization most likely has a certain way they want to communicate with your customers. For example, some product names might always need to be written a certain way. Or, some terms are considered proper nouns and need to be capitalized. Remember that you’re speaking on behalf of the company.

Most organizations have a style guide, or maybe even people you can ask for help. My employer certainly does. They’re usually very willing and happy to work with you, so don’t be afraid to talk to them.

This was hard for me! The first time I wrote a customer-facing postmortem, our Communications Director had me make dozens of changes to my post to match the corporate style. Every time I’ve done it since, there have been fewer and fewer corrections to make… to the point that on the most recent one, there weren’t any! Yay.

 

How Much Detail?

This can be tricky. I keep some things in mind when I write.

First, I’m speaking (largely) to a non-technical audience. They’re very smart, but they aren’t necessarily technical. So while it might be tempting to write something like “a slave MySQL host encountered an unexpected foreign key constraint conflict, halting replication, which in turn made reads to the cluster fall out of sync with the master,” bear in mind that the only people that are going to understand that are your co-workers. Instead, I’d say something like “there was an issue with the database.” Save the detailed explanation for your internal postmortem.

Second, and perhaps most importantly, remember that not everyone that’s going to read this has the best interests of your customers at heart. I try really hard not to tip our hand on how the problem was created, out of fear that someone may use this information against us.

You have to be really careful here. For example, if you lose a transit link, don’t tell people what capacity was lost. Circuits only come in so many sizes, and you might have accidentally just told someone that’s out to DDoS you how much bandwidth they need to knock you out. (Never, ever do this.)

There was one time when I specifically mentioned the CVE number that we had to interrupt services to patch against. This was because it was a hot issue (that was being mentioned on the news), and I wanted our customers to know “we got it, don’t worry about it.”

I try to explain what happened, while not giving specifics. Terms like “database cluster” are usually pretty safe. Just use your head and pretend you’re reading it with the intention of doing harm. Ask your security team if you have any doubts!

 

Say How You’re Making it Better

Be sure to mention what your team is doing to keep this from happening in the future.

Just like before, a lot of detail isn’t needed. It could be as simple as “we’re going to dig into the database logs and figure out what happened.”

You most likely list “next steps” on your internal technical postmortem. This might be a good place to start, if you can do so without compromising too much information.

During a string of outages at my company last year, we had problems with the blog system our support team uses to communicate with customers going down as well. I made sure to mention this in my postmortem as a problem, even though it wasn’t directly tied to the outage itself. Customers being unable to get information during an outage is a big deal to them… maybe even more than the outage itself. (Vacuums breed contempt!)

In every postmortem, I gave a follow-up on that issue, telling how it had (or hadn’t) gotten better. When it was finally resolved, I once again mentioned that we’d been listening to our customers’ feedback on it, and that we were happy things had improved.

 

It’s All About Empathy

This is really what it’s all about.

The student edition of the Merriam-Webster dictionary defines empathy as:

a being aware of and sharing another person’s feelings, experiences, and emotions; also : the ability for this

I don’t know of a single person that works in ops that enjoys downtime. We take pride in our work. We don’t like it when our customers are impacted. Downtime to us represents failure.

Part of being good at what we do is understanding that while we’re upset that things went down, our customers are impacted and upset, too. By publicly acknowledging to your customers that you understand they’re upset, and letting them know that you’re upset too, you help build a bond between you. It’s that feeling of “we’re all in this together” that can drive empathy and help form a community.

 
Thanks for listening. Hopefully I’ve given you enough motivation to help make your ops team seem a little more human to your customers, too.

April ❤

Tools of the Trade: The Basics

This is the first in what I think will be a regular thing on Ops n’ Lops, a look at the toolset that I use! This post is going to feature the basic tools that I use every single day to access servers.

Opsen are a fickle group when it comes to our tools, and I am no exception!

Every professional has a well-worn and loved set of tools they’ve acquired and worked with over their career. Just as carpenters have beloved hand tools, and musicians have well-worn and crafted instruments, operations engineers all have a set of configurations and workflow items that we’ve molded and shaped to fit our fancy.

Basic Workflow

My day-to-day workflow looks like this:

My workflow from local machine to remote host.

I work locally off a Mac, connected to our remote environment over a VPN, accessing my “home base” machine within our environment over mosh.

I do all my work off of the remote VM, and not on my local machine. I try to keep as little confidential data on my local machine as possible in case it’s ever stolen. (I commute via public transportation, so this is actually a fairly large concern!) Basically my local workstation is nothing more than a battery powered dumb terminal with an amazing screen and great keyboard.

At work each engineer (developer, ops, or anyone that wants one, really) is given their own VM to work in. It’s ours and we can do with it what we want. (Truth be told, I have four VMs at the moment, but that’s because I work with machine images and need places to test them!)

A common question we debate endlessly at work is: “Where is the best place to run your ssh-agent?” There are good arguments on each side, but I’m firmly on the side of “keep as little confidential data locally as possible,” and an SSH key is no exception. (It can be a moot point, mostly, because SSH assumes that your servers have a way to share public keys, which is not trivial when running at scale!)
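
For the curious, here’s a minimal sketch of the “keep it remote” approach, assuming the key already lives on the home-base VM (the internal hostname below is made up):

    # On the home-base VM: start an agent for this login shell and load
    # the default key(s) from ~/.ssh.
    eval "$(ssh-agent -s)"
    ssh-add

    # Hops to other hosts inside the environment now use the agent on the
    # VM; the private key never has to live on the laptop.
    ssh app01.internal.example.com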

mosh

So why mosh and not plain SSH? Mosh is designed to work on high-latency, lossy connections. It maintains state locally so I don’t have to wait for keystrokes to echo back, which is great when using cellular-based connections. But the best part of mosh is that it can handle connections changing over time. I can be connected at the office, close the lid on my laptop when it’s time to go, and leave without a care. When I get home, the session just resumes as I’d left it, even if my IP address changed!

Technical explanation: it uses SSH to establish the remote connection (it starts a process on the remote host that maintains the session state), but then drops the SSH connection and transmits display updates over UDP instead. The authors claim it’s just as secure as SSH (if not more so), but I run it over a VPN just to play it safe.
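
If you want to give it a try, a hypothetical connection looks something like this (the username and hostname are invented, and your VPN setup will differ):

    # The initial login happens over SSH; after that, mosh switches to its
    # own UDP channel for the session.
    mosh april@devbox.internal.example.com

    # If a firewall or VPN only permits a specific UDP port, you can pin
    # the server side to one explicitly.
    mosh -p 60001 april@devbox.internal.example.com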

A Closer Look at Some of my Tools

Now I’ll talk a little bit about each of the tools I use, and why.

Local Computer

My local machine is a MacBook Air. I use it mostly because that’s what I was handed on my first day at work, but also because it’s a good machine. Apple laptops tend to have great displays, and keyboards that are good to work with all day. (Many hours of battery life is nice, too!)

I adore OSX because it’s basically UNIX with a very nice and workable UI. When I open up a command line I’m instantly at home, and the GUI is very nice to work with all day and into the night.

The folks on my team are mostly split between OSX and Linux for their local workstations. Both work well! I actually gave Linux a shot for a few months before going back to OSX. The reasons I stuck with OSX had nothing to do with the tools of my job… it was the little things that the apps just do better on OSX. (The OS itself is great.)

What about Windows?

I suppose it’s possible to use Windows in ops, but I don’t know anyone that does. It’s missing some pretty basic functionality that I depend on – like a good, workable (aka POSIX) command line. It’s also lacking a lot of things that I take for granted on the UNIX side, like SSH. (I mean the entire suite of tools – including things like scp.) Windows is sorta the land that ops forgot… go look at the mosh download page for a Windows binary. You won’t find it.

iTerm 2

I use iTerm 2 as my local terminal for a shockingly simple reason – it allows me to set up key macros. I have a few built that make interacting with tmux much faster… like Cmd-<LeftArrow> and Cmd-<RightArrow> to change between windows. I’m so used to having those macros that I get really grumpy when I’m working on a machine without them. 🙂

Some of my keyboard macros in iTerm 2.
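
For anyone curious how those macros work, here’s roughly what the mappings look like, assuming tmux’s default Ctrl-b prefix (yours may differ):

    # Hypothetical iTerm 2 key mappings (Preferences > Profiles > Keys),
    # using the "Send Hex Code" action to type the tmux prefix plus a
    # window command:
    #
    #   Cmd-<LeftArrow>   ->  0x02 0x70    # Ctrl-b, then "p" (previous-window)
    #   Cmd-<RightArrow>  ->  0x02 0x6e    # Ctrl-b, then "n" (next-window)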

Fonts

There are a lot of strong feelings about the font your terminal uses. I am a HUGE fan of Anonymous Pro. It’s one of the very first things I install when I’m setting up a new machine. If you’re looking for a good font, try it! Magic happens around size 15pt, when the OSX anti-aliasing kicks in just right. O looks different than 0, and l looks different than 1.

tmux

I use a long-running tmux session as my main workspace. I picked tmux over screen because I’ve had better luck with it over time, but some of my co-workers swear by screen. To each their own. 🙂
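
If you’ve never tried the long-running-session workflow, it’s just a couple of commands (the session name here is arbitrary):

    # On the home-base VM, start a named session once.
    tmux new-session -s work

    # After a disconnect (or from a fresh mosh connection), re-attach and
    # pick up exactly where things were left off.
    tmux attach-session -t work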

Putting it all Together

Here’s a sample screenshot of what I look at all day, every day:

An example of bash running in a tmux window.

That’s bash, running in tmux, running in iTerm 2 with my font set to Anonymous Pro (15pt) on MacOS X. 🙂

 

I hope this has been an interesting look at the tools I use all day, every day. Thanks for reading, and I think I’ll be doing more “Tools of the Trade” features in the future!

April ❤

Heya!

Hi!

Something I have been wanting to do for YEARS now is start a blog where I can talk about my professional life! I have lots of outlets for personal things, but nothing about my career. Thus, this blog has been born.

I’m a Systems Engineer that works for a Bay Area tech company in operations. Ops are the folks that keep your favorite “thing” up and running 24/7/365.25. No one really thinks about us when things are working… but the moment things go down a giant spotlight is pointed in our direction.

Operations is more than a career, it’s a lifestyle. Like most other opsen I know, I go on call for weeks at a time (in my case I’m secondary on-call for a week, and then primary). I also try to be available to answer questions about technology I’m a subject matter expert in anytime I’m needed, even if it’s at 3am on a Sunday. It takes a special kind of person to enjoy being a production engineer (and its often-crazy lifestyle), and I just happen to be one of those nutso people!

The focus of this blog isn’t going to be the cool latest tech, or things like the latest release of Docker. (There are a ton of other blogs out there that can provide that information way better than I can!) Instead, I’m going to focus on the more human side of operations, and maybe talk about what I’m personally doing as an engineer.

Thanks for reading! I hope I can be entertaining and educational. 🙂

April ❤