Getting to the root of a problem

Supporting a software product takes a lot of work – and it is not always clear where the effort needs to be applied to keep that product healthy and running smoothly. In this blog entry let’s talk about problems (in the IT sense of the term) and how we go about proactively solving them.

When you host a SaaS (Software as a Service) product on behalf of customers, you come across the occasional bug or issue that needs resolving, a server that needs some love, or a customer suggestion for a useful enhancement or minor tweak that would make the product more effective for their particular business needs. Taking our GRID web mapping product as an example, these support items are generally covered by the annual support and maintenance agreements – or support blocks – we have with Natural Resource Management (NRM) groups.

What I wanted to talk about in this article is something a bit different from bugs and enhancements, called Problem Management.

A problem, in the IT or software development sense, is a special type of issue where the cause is unknown at the time the issue is raised, and a process is needed to investigate and manage the ‘root cause’. With most stock-standard issues, the support team has a good sense – once they are aware of the issue’s existence – of what caused it, and can therefore quickly formulate a plan of attack for fixing it. This involves creating a ticket in our issue tracking system, assigning a team member to ‘replicate’ the issue and write some code (a ‘fix’), and a second team member to quality control the result before it is deployed.

As we learned in a recent experience with a problem that surfaced in GRID, we needed more brain power and a plan of attack. The problem manifested as recurring missing attribute data across more than one GRID instance – but despite a couple of rounds of testing by one of our team, it was unclear what the cause could be. We had some great collaboration from Tilo and Ray at the South Coast NRM office, who helped us recognise when and how the problem was happening, and the specific layer it was affecting. It seemed impossible to replicate this mysterious phenomenon, and the best information we had consisted of the dates of the last daily backups where that data was still present.

Tony, James and Serge at work in our new Flux shared workspace environment*

We recognised the critical importance of confidence and reliability in the data for our GRID customers, and so took this problem very seriously despite knowing that the data could be restored where necessary. As part of this process, we piled the brains trust into a “war room” – one of the benefits of our new offices in Perth at FLUX – with a bunch of markers and whiteboards, and discussed a plan of attack that focused on what we call the band-aid (short-term), medicine (medium-term) and surgery (long-term) approaches:

  • Band-aid: when we know the problem has occurred, restore the offending data and notify the customer.
  • Medicine: put a range of diagnostic logging in place to recognise when the problem has occurred, so we can apply the band-aid before customers are negatively impacted (this is something we now have running on a daily basis – see the sketch after this list).
  • Surgery: identify the root cause of the problem and remove it.
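To give a flavour of what the ‘medicine’ looks like in practice, here is a minimal sketch of a daily check – this is not the actual GRID code, and the database, layer names and alerting approach are all assumptions for illustration only:

```python
# Minimal sketch of a daily diagnostic check - not the actual GRID code.
# The database, layer names and thresholds are hypothetical.
import logging
import sqlite3

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("grid-diagnostics")

# Per-layer attribute counts recorded by yesterday's run; in practice these
# would come from a monitoring table or the most recent backup manifest.
PREVIOUS_COUNTS = {"revegetation_sites": 1520, "weed_observations": 8843}


def check_attribute_counts(db_path, previous_counts):
    """Compare today's per-layer row counts against yesterday's.

    Any drop is flagged as a possible instance of the missing-data problem,
    so the 'band-aid' (a targeted restore) can be applied before customers
    are affected.
    """
    conn = sqlite3.connect(db_path)
    try:
        for layer, old_count in previous_counts.items():
            new_count = conn.execute(f"SELECT COUNT(*) FROM {layer}").fetchone()[0]
            if new_count < old_count:
                log.warning("Layer %s lost rows: %d -> %d", layer, old_count, new_count)
            else:
                log.info("Layer %s looks healthy (%d rows)", layer, new_count)
    finally:
        conn.close()


if __name__ == "__main__":
    check_attribute_counts("grid_instance.db", PREVIOUS_COUNTS)
```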

Problem management is more than assigning a resource and saying “You sir, find out what’s wrong… and fix it!” Our workshop also came up with three strong leads on what to investigate, clarity about who was responsible, and a commitment to meet on a regular basis until the root cause was understood. If those leads did not bear fruit, the team would come together again and discuss another line of investigation.

Within two days of focused attention, involving three of our developers, we had successfully cracked the nut. Along with better monitoring of GRID and tighter processes, we found that complex layer settings around symbology – and a bug in the code – were causing the back-end server to time out and not save properly. This resulted in certain data not being displayed in the GRID application. It was our new diagnostic tools (and help from our NRM testers) that pinpointed the root cause. From there the solution was actually straightforward, and a fix was applied that can handle the more complex symbology settings.
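We won’t reproduce the actual fix here, but the general failure pattern is worth illustrating: a slow operation (such as rendering complex symbology) exceeds the server’s time budget and the save quietly never completes. A hedged sketch of one way to make that failure loud rather than silent – with save_layer() and every other name being a hypothetical stand-in, not GRID internals – might look like this:

```python
# Sketch of the general pattern only - not the GRID fix. save_layer() and the
# timeout budget are hypothetical stand-ins used for illustration.
import logging
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

log = logging.getLogger("grid-save")

SAVE_TIMEOUT_SECONDS = 30  # assumed time budget for a single save request


def save_with_timeout(layer_name, attributes, save_layer):
    """Run a potentially slow save and fail loudly if it exceeds the budget.

    The point is that a save slowed down by complex symbology settings raises
    a visible, logged error instead of timing out silently and leaving the
    layer's attribute data missing.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(save_layer, layer_name, attributes)
    try:
        return future.result(timeout=SAVE_TIMEOUT_SECONDS)
    except FutureTimeout:
        log.error("Saving layer %s exceeded %ss and was flagged for follow-up",
                  layer_name, SAVE_TIMEOUT_SECONDS)
        raise
    finally:
        # Don't block waiting for a runaway save; let it finish (or fail) in
        # the background while the error is surfaced to the caller.
        pool.shutdown(wait=False)
```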

The experience has had a few positive outcomes for GRID and for Gaia Resources as a business. Firstly, we have identified the root cause of this particular problem and resolved it. Secondly, we have transformed our Problem Management process, and are better equipped as a team to recognise when an issue is “more than meets the eye.” Thirdly, the diagnostic logging we now have in place is going to pay dividends for future troubleshooting, which we hope will result in increased responsiveness and faster issue resolution for our customers – not just for GRID, but across our other solutions too.

We’d be keen to hear your feedback on this – and whether your organisation has processes to deal with problem management – so please get in touch via email or start a conversation with us on our Facebook, Twitter or LinkedIn feeds.

Chris

* Editor’s note: I thought James was the “dance guy”, not Serge?
