Alarming

This past weekend I got an email from a team member saying that an event had occurred at a customer facility that should have triggered an alarm in a system we wrote, but didn't. Shortly after, I got an email from our project manager informing me that this error had most likely cost the customer a very large amount in fines. Cue minor panic attack.

The two of them (both higher than me on the chain and closer to the customer) exchanged more emails and calls with the customer over the weekend about the issue, but the general plan was that we'd analyze it and work out a path forward on Monday.

We ended up coming out of the situation pretty well, despite the circumstances, partly because of good fortune and partly because we handled it well. I give kudos to my team member and PM for that. First things first, we did a root cause analysis. We looked through all available logs and database records until we found the broken link (or, more accurately, didn't find one), then recreated the events as they were described to have occurred, and then recreated the data flow.

We found the root cause, which, in my PM's judgement and with the customer's agreement, was a specific scenario that would have taken a very long time to find in testing, if it was found at all. Additionally, the system behaved according to spec, but the spec didn't account for this scenario. It was then determined that about half of the fines incurred were the result of operator error, and that our system had performed correctly for the events involved.

After we found the root cause, my PM also discussed the procedural issues that contributed to this situation. The completion and testing status had not been clearly communicated between us and the customer, and there was no sign-off to decommission the old system and start using the new one for this mission-critical functionality. In summary, we found a technical shortcoming in the system and some procedural missteps. We communicated all of this to the customer and began development to fix the shortcomings.

All in all, I learned a lot about handling major problems at a customer site, along with lessons about communication, testing, and signing off on mission-critical system releases.