Performance Problems Hiding in the Open

Several years ago I worked with a prominent bank with approximately 1,700 branches, making it one of the largest in its country. They were receiving seemingly random complaints about an application used to determine the creditworthiness of commercial loan applicants. A common scenario was that someone would walk into a branch to discuss a loan, the loan officer would look up their information in the application, and the pop-up window with the applicant's history would look like this:

[Screenshot: a blank Firefox browser window]


Not exactly what they wanted to see, to say the least. Some reports from the field indicated that if you waited long enough (reportedly somewhere between 20 and 60 minutes) the page would finally render. In other cases the session would eventually time out. Some people claimed a refresh brought it back right away; others said refreshing did nothing. The only additional information we had was that some branches reported consistent application slowness.

Twelve months earlier, a member of the application support team had travelled to many of the branches to try to observe the problem. Using traffic-capture tools like Fiddler and Wireshark, he was able to capture some baseline performance metrics, but found nothing conclusive. When I looked at the results, plus HTTP session information captured using HP Real User Monitor, two things jumped out:

1) HTTP 404 and 500 errors

2) Network spikes that corresponded closely to the complaining branches

We also used a combination of CA Wily Introscope and Dynatrace to determine that server-side application performance was excellent, averaging sub-second response times.

What was interesting was the way the problems were hiding in plain sight. We had previously been reassured that the network was excellent. Upon presenting our findings, though, the capacity planning team sheepishly admitted that many of the complaining branches were already on an upgrade schedule, but it would be another six months before that could happen. Essentially, it was a political football at that point.

The Fiddler capture taken twelve months prior to our engagement had also shown HTTP errors. Some of these had since been fixed, but the rest were believed to be unimportant. The thinking went that these were just image files or static reference content rather than anything of substance to the application. What had been overlooked was that many of these 404s related to JavaScript files. Thus, under certain circumstances, portions of the application functionality rendered by asynchronous HTTP requests (i.e., AJAX) were failing. All the data relevant to solving the problem was available, but no one had interpreted it successfully.
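This is the kind of check that is easy to automate. As a minimal sketch, assuming you have a HAR export of the traffic (Fiddler and the browser dev tools can both produce one; the file name and threshold here are illustrative, not from the original investigation), a few lines of Python will surface every error response and flag the ones that hit script files:

```python
import json

# Assumed file name for illustration: a HAR export from Fiddler or the
# browser dev tools covering one problematic page load.
with open("capture.har", encoding="utf-8") as f:
    har = json.load(f)

for entry in har["log"]["entries"]:
    url = entry["request"]["url"]
    status = entry["response"]["status"]
    # Flag every 4xx/5xx response, and mark script resources specifically:
    # a 404 on a .js file can silently break AJAX-rendered functionality
    # even though the rest of the page "seems to work".
    if status >= 400:
        marker = "SCRIPT" if url.split("?")[0].endswith(".js") else "other "
        print(f"{marker} {status} {url}")
```

Had something like this been run against the original capture, the failing JavaScript requests would have stood out immediately instead of being written off as cosmetic.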

This is a really common scenario, where seemingly innocuous HTTP errors or application exceptions are ignored because the application "seems to work." I've yet to find a case where we couldn't significantly improve real-world performance just by fixing some "unimportant" errors.