On my first day at my first full-time corporate tech job, I unleashed an email virus. I wasn’t the only one, but still… great way to make a first impression. Fortunately for me, the IT guys mostly just made fun of me, and soon became friends (RIP Dr. Dave).
Everyone in tech has one of those stories. A tiny mistake in a line of code blew up your whole platform. A typo resulted in a phone sex line number getting sent to customers instead of your company’s 1-800 number. A confidential email got forwarded to the entire company…
This week a misconfigured database exposed nearly 200 million American voters’ personal information.
If you haven’t experienced one of these Very Bad Days, you will. You may screw up; you may have to clean it up; or you may watch from the sidelines (this time).
Tech is full of humans, typically working very hard, very fast and under a lot of pressure. That’s a recipe for screwing up.
How a company responds when it happens says a great deal about its leadership, business intelligence and culture.
A couple of weeks ago there was a Reddit thread about a junior developer who screwed up big time. “Accidentally destroyed production database on first day of a job…”
The fallout was very unpleasant. Though, tellingly, the many answers the person didn’t get seemed almost worse than what s/he was told.
Also interesting was the number of people who immediately noticed a lot of issues, not just with how the screw-up was handled, but with how it came to be possible (to the point of being inevitable) in the first place.
This Quartz piece outlines the situation well. It reiterates that such things happen regularly and mentions some other high profile examples: British Airways, GitLab, Amazon, etc.
The last two examples — GitLab and Amazon — have become industry case studies of sorts in how to handle and publicize big tech blowups. They also tell us a lot about organizational culture.
In short, the GitLab and Amazon incidents were handled transparently and used as learning opportunities. Not as the lead-up to a firing squad.
Not only could the person/people who screwed up learn from it, so could their team, the rest of the company and the external developer community. It completely changed the focus from, “Oh sh*t! This guy screwed us!” to figuring out what happened and why, and how to fix it so it can’t happen again.
To paraphrase the Quartz article, that Redditor made one mistake. The company made several. The database s/he destroyed wasn’t backed up. Security procedures were poor. The system overall was poorly organized. The company’s developer orientation also sounded sorely lacking.
The Redditor made a technical mistake but, more explosively, exposed the company’s many operational failings, for which s/he bore zero responsibility. Frankly, it was going to happen eventually. Because humans.
The company’s response was kneejerk and benefited no one, including themselves. Their operational failings didn’t stay hidden. They showed little interest in figuring out how the mistake happened or in repairing those weaknesses. They didn’t prevent it from happening again, at least not right away. And they wasted an expensive hiring cycle. That says a lot about the C-level leadership that made these calls.
They did publicize themselves as a company where no sane developer would want to work. (Even though the company name wasn’t mentioned, devs could likely find it out easily enough.) Developers know code is never perfect. That’s why there are levels of testing. But at that company, I guess every developer’s days are numbered. Because humans.
Founders have a million and one things to create, manage and hustle for when building a company. Particularly if you’re new to the game, realizing the need for and knowing how to build a “safe” company for your employees may not be in your skill set. Yet.
But people are the most important resource you’ll have, so it’s not something that can be retrofitted later. (Especially if you manage to lose trust then have to try to rebuild it.)
As the Quartz article outlines, “psychological safety” empowers people to take risks and make mistakes. That may sound scary, but this type of culture doesn’t doom companies. It boosts their success.
Developing this culture requires more honest and open communication, which is rarely a bad thing. It sheds light on wobbly ideas, faulty logic and embryonic screw-ups before they have a chance to become catastrophes.
Open and honest discourse also means that it’s considered a good thing to express yourself — at any level. It’s fine — and encouraged — to call out mistakes, but the intent is not shaming, it’s to correct them and enable everyone to learn.
It’s fine to disagree, because different viewpoints provide a more detailed perspective on projects and issues, leading to more robust solutions, and helping prevent echo chambers from developing.
Junior staff will grow and become more valuable much faster if it’s safe to experiment and learn, to ask any question, no matter how dumb or off the wall it may seem, and to contribute by exposing errors, even ones made by people far more senior than they are.
And if a catastrophe does hit, nothing binds a team like slogging through the trenches together and emerging battered but victorious. Not to mention doing so without throwing anyone under the bus.
Learning from direct experience that the sky won’t fall, or, on the rare occasion it does, that you now know how to put it back up, will build ever-greater confidence, creativity and psychological safety among teams.
So to that young developer: You haven’t thrown away your career. You’ve probably done a lot of other developers (and yourself) a solid. One day, when you’re out with your team and people start sharing tech disaster stories, you’ll be able to throw down with the best of them. And live to code another day.