This article is the tale of what caused some strands of gray hair to magically appear on my head overnight. As an accountant who finally ventured into tech, I can beat my chest and say that I never make mistakes :)
Gring gring was the loud sound coming from my rather unique mobile phone at exactly 2:45am, a day before a crucial board meeting.
It was the CFO making an SOS call: all the financial details meant for the board meeting had vanished into thin air, leaving no trace whatsoever.
My heart sank into the lower part of my stomach before gradually finding its way back to its normal position. Did I hear you ask why the panic? I headed the team of developers responsible for building the bespoke accounting software currently in use by the company.
In a very shaky voice, I asked what had happened – as if he hadn't already told me. Well, this article is a postmortem of the outage that would have escalated into a serious incident if we had not arrested the situation by waving a magic wand :(
Re: postmortem of lost access to crucial financial reporting system
Timeline:
- Issue detected at 2:35am, a day before the board meeting
- The CFO's adrenaline was pumping in preparation for the board meeting, so he decided to make an eleventh-hour check – as if he had not already done so on several occasions. So glad he did.
- It was initially assumed that he had forgotten his password – boy, was I wrong
- We first reset his password, then placed a call to AWS (the cloud provider hosting the application). What was I even thinking when I agreed to that email being sent?
- All members of the development team and the finance team gathered at HQ within 12 minutes
- At the end of the day, it was not a problem with the application but with our IT security policy. Our own defenses decided we had been breached and locked everyone out.
- The problem was resolved at exactly 7:13am. Phew, we saved the day.
Root cause:
Failure to update a critical, security-sensitive piece of software is what led to what would have been the most embarrassing moment of our lives. We had configured certain applications to trigger a total shutdown if an important update or patch was postponed beyond a certain number of times. In the frenzied heat of focusing entirely on making sure the finance team gained uninterrupted access to all the systems they needed to meet their deadline, we instructed the IT/InfoSec guys to override the mechanism that triggers the shutdown.
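For readers curious what such a deferral limit looks like in practice, here is a minimal sketch in Python. Our actual tooling is internal, so the threshold, the state-file location, and the record_deferral/trigger_lockdown names below are all hypothetical stand-ins:

```python
import json
from pathlib import Path

MAX_DEFERRALS = 3  # hypothetical limit; the real threshold is internal
STATE_FILE = Path("patch_deferrals.json")  # hypothetical state location

def trigger_lockdown(patch_id: str) -> None:
    """Placeholder for the total shutdown our configuration enforced."""
    raise SystemExit(f"patch {patch_id} deferred too many times; shutting down")

def record_deferral(patch_id: str, override: bool = False) -> int:
    """Count a postponed patch; lock the system once the limit is exceeded.

    override=True models the exception we asked IT/InfoSec to apply:
    the counter still grows, but the shutdown check is skipped.
    """
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    state[patch_id] = state.get(patch_id, 0) + 1
    STATE_FILE.write_text(json.dumps(state))
    if not override and state[patch_id] > MAX_DEFERRALS:
        trigger_lockdown(patch_id)
    return state[patch_id]
```

Note the design point: the override skips only the shutdown check, not the counting, so the deferral debt keeps accumulating quietly in the background.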
The problem was that we forgot about the interdependencies (we in the IT world love our redundancies) that exist between systems. To cut a long story short, our backup security infrastructure kicked in when it sensed that the first line of defense had been breached.
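To make the failure mode concrete, here is a toy illustration – not our actual infrastructure, and every name is made up – of the interdependency that bit us: a backup watchdog that only knows "primary defense not enforcing = breach" and has no concept of an authorized override:

```python
from dataclasses import dataclass

@dataclass
class PrimaryDefense:
    enforcing: bool = True

def backup_watchdog(primary: PrimaryDefense) -> str:
    # The watchdog cannot tell "overridden by admins" from "defeated by an attacker".
    if not primary.enforcing:
        return "BREACH ASSUMED: locking out all accounts"
    return "all clear"

primary = PrimaryDefense()
primary.enforcing = False        # the override we requested during the deadline crunch
print(backup_watchdog(primary))  # -> BREACH ASSUMED: locking out all accounts
```

Our lockout had exactly this shape: to the backup layer, the override on the primary was indistinguishable from a breach.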
Corrective measure:
We learned our lesson and implemented another system that logs all mission-critical interdependencies. Sometimes the causes of our IT downtime are simple things that do not require massive work to resolve.
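As a rough idea of what that logging system captures, here is an illustrative sketch; the entries, names, and storage format are simplified stand-ins for our internal registry:

```python
import datetime
import json

# Illustrative entries; the real registry is internal.
DEPENDENCIES = [
    # (system, depends_on, what happens if the dependency is disabled or changed)
    ("backup-security-watchdog", "primary-patch-agent",
     "treats a disabled primary agent as a breach and locks all accounts"),
    ("financial-reporting-app", "sso-gateway",
     "no logins possible while the gateway is in lockdown"),
]

def log_dependencies(path: str = "interdependencies.json") -> None:
    """Write the registry with a timestamp so every override review can consult it."""
    record = {
        "generated_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "dependencies": [
            {"system": system, "depends_on": dep, "impact": impact}
            for system, dep, impact in DEPENDENCIES
        ],
    }
    with open(path, "w") as fh:
        json.dump(record, fh, indent=2)

if __name__ == "__main__":
    log_dependencies()
```

Even something this simple would have surfaced the watchdog-to-patch-agent link before we approved the override.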