Several years ago I worked with a prominent bank with approximately 1,700 branches, making it one of the largest in its country. They were receiving seemingly random complaints about an application used to determine the creditworthiness of commercial loan applicants. A common scenario was that someone would walk into a branch to discuss a loan, the loan officer would look up their information in the application, and a pop-up window with the applicant's history would look like this:
Not exactly what they wanted to see, to say the least. Some reports from the field indicated that if you waited long enough (reportedly somewhere between 20 and 60 minutes) the page would finally render. In other cases the session would eventually time out. Some people claimed a refresh brought the page back right away; others said refreshing did nothing. The only additional information we had was that some branches reported consistent application slowness.
Twelve months earlier, a member of the application support team had travelled to many of the branches to try to observe the problem. Using traffic-capture tools like Fiddler and Wireshark, he was able to record some baseline performance metrics, but found nothing conclusive. When I looked at his results, plus HTTP session information captured using HP Real User Monitor, two things jumped out:
1) HTTP 404 and 500 errors
2) Network spikes that correlated closely with the complaining branches.
We also used a combination of CA Wily Introscope and Dynatrace to determine that server-side application performance was excellent, averaging sub-second response times.
What was interesting was how the problems were hiding in plain sight. We had previously been assured that the network was excellent. When we presented our findings, though, the capacity planning team sheepishly admitted that many of the complaining branches were already scheduled for a network upgrade, but that it would be another six months before it could happen. Essentially, it had become a political football.
This is a really common scenario: seemingly innocuous HTTP errors or application exceptions are ignored because the application “seems to work.” I’ve yet to find a case where we couldn’t significantly improve real-world performance just by fixing some of those “unimportant” errors.
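One practical way to stop ignoring those errors is to make them visible. The sketch below is not the tooling we used at the bank; it's a minimal, hypothetical example of tallying status codes from a common-log-format web access log so that 404s and 500s show up as a concrete error rate rather than background noise:

```python
import re
from collections import Counter

# In common log format, the status code is the first 3-digit number
# after the quoted request line, e.g. ... "GET /app HTTP/1.1" 200 512
STATUS_RE = re.compile(r'"\s(\d{3})\s')

def status_counts(log_lines):
    """Tally HTTP status codes across access-log lines."""
    counts = Counter()
    for line in log_lines:
        m = STATUS_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

def error_rate(counts):
    """Fraction of responses that were 4xx or 5xx."""
    total = sum(counts.values())
    errors = sum(n for code, n in counts.items()
                 if code.startswith(("4", "5")))
    return errors / total if total else 0.0

# Hypothetical sample lines, for illustration only.
sample = [
    '10.0.0.1 - - [01/Jan/2020:00:00:01] "GET /app HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2020:00:00:02] "GET /img/x.png HTTP/1.1" 404 0',
    '10.0.0.3 - - [01/Jan/2020:00:00:03] "POST /credit HTTP/1.1" 500 0',
]

counts = status_counts(sample)
print(counts, error_rate(counts))
```

Run against a real log, a surprisingly high 4xx/5xx fraction for certain branches or URLs is exactly the kind of hiding-in-plain-sight signal described above.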