Assuming the Obvious

“We know the network is slow and there are probably a thousand problems in the application.”

I heard this from a customer recently. The problem was, it was assumption and hearsay – they had no tooling, no analysis, and had in no way quantified those statements. Even worse, parts of the application environment were considered “out of scope” for the analysis project, because “obviously those parts *should* work.” Let me tell you a story about assumptions.

In December of 2008, I did a project in a hardware vendor’s showcase lab with a major national rental car chain. The assembled team had been working for more than a month to make the rental pricing application perform, with no success. It was an ideal setting – access to multi-million dollar top-of-the-line hardware systems, a complete clean room environment. I was called in because the performance was miserable – the application could barely handle 5 users and transactions were running for minutes at a time.

In less than a week, we were done. There were two glaringly obvious issues where assumptions had been made.

1) The customer’s application was running on IBM WebSphere application server, but no one had sanity-checked that the installation was even functional – they just kept hitting it with traffic and assumed that since it responded, it was “working.” A quick reinstall fixed what turned out to be a failed installation – because “obviously” that wouldn’t fail, right? Checking that assumption made all the difference in the world.

2) Once the application was actually running, we observed a lot of exceptions being thrown related to a static value lookup. These had been observed previously, but no one thought they were important. Just by caching that static value and ending the stream of exceptions, we cut several seconds off every request – a major performance improvement.
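The second fix can be sketched in a few lines. This is a minimal Python illustration rather than the customer's code (their application was Java on WebSphere); the names slow_lookup and cached_lookup and the fallback value are hypothetical. The point is only that the failing lookup, and the exception it raises, happens once instead of on every request.

```python
import functools

# Counts how often the expensive lookup is actually executed.
CALLS = {"count": 0}

def slow_lookup(key):
    """Hypothetical stand-in for the static value lookup that kept throwing."""
    CALLS["count"] += 1
    raise KeyError(key)  # simulate the lookup failing every time

@functools.lru_cache(maxsize=None)
def cached_lookup(key, default="fallback"):
    """Resolve a static value once and remember the answer.

    The exception is handled on the first call only; later calls with the
    same arguments are served from the cache and never reach slow_lookup.
    """
    try:
        return slow_lookup(key)
    except KeyError:
        return default

if __name__ == "__main__":
    for _ in range(1000):
        cached_lookup("rate_table")
    print(CALLS["count"])  # 1
```

A thousand requests now cost one lookup and one exception, which is the same trade the real fix made.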

Assume nothing you can’t or haven’t measured.

Performance Problems Hiding in the Open

Several years ago I worked with a prominent bank with approximately 1700 branches, making it one of the largest in its country. They were receiving seemingly random complaints about an application used to determine the creditworthiness of commercial loan applicants. A common scenario was that someone would walk into a branch to discuss a loan, the loan officer would look up their information in the application, and a pop-up window with the applicant’s history would look like this:

[Image: a blank Firefox browser window]

Not exactly what they wanted to see, to say the least. Some reports from the field indicated that if you waited long enough (reportedly somewhere between 20-60 minutes) it would finally render. In other cases it would eventually time out the session. Some people claimed if they refreshed it would come back right away, some said it did nothing. The only additional information we had was that some branches reported consistent application slowness.

Twelve months earlier, a member of the application support team had travelled to many of the branches to try to observe the problem. Using basic traffic capture tools like Fiddler and Wireshark, he was able to capture some baseline performance metrics, but found nothing conclusive. When I looked at the results, plus HTTP session information captured using HP Real User Monitor, two things jumped out:

1) HTTP 404 and 500 errors

2) Network spikes that corresponded closely to the complaining branches.

We also used a combination of CA Wily Introscope and Dynatrace to determine that server-side application performance was excellent, averaging sub-second response times.

What was interesting was the way the problems were hiding in plain sight. We had previously been reassured that the network was excellent. Upon presenting our findings though, the capacity planning team sheepishly admitted that many of the complaining branches were already on an upgrade schedule, but it would be another six months before that could happen. Essentially it was a political football at that point.

The Fiddler capture taken twelve months prior to our engagement had also shown HTTP errors. Some of these had been fixed, but the rest were believed to be unimportant. The thinking went that these were just image files or static reference content rather than anything of substance to the application. What had been overlooked was that many of these 404s related to JavaScript files. Thus, under certain circumstances, portions of the application functionality rendered by asynchronous HTTP requests (i.e., “AJAX”) were failing. All the data relevant to solving the problem was available, but no one had interpreted it successfully.

This is a really common scenario, where seemingly innocuous HTTP errors or application exceptions are ignored because the application “seems to work.” I’ve yet to find a scenario where we couldn’t significantly improve the real-world performance just by fixing some “unimportant” errors.
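One way to keep “unimportant” errors from being waved away is to break them down by resource type before deciding what they mean. Here is a minimal Python sketch, assuming the capture has been exported as (URL, status) pairs; the function name errors_by_extension and the sample entries are made up for illustration and are not data from the engagement.

```python
from collections import Counter
from os.path import splitext
from urllib.parse import urlsplit

def errors_by_extension(entries):
    """Count HTTP 4xx/5xx responses per (file extension, status code)."""
    counts = Counter()
    for url, status in entries:
        if status >= 400:
            # Use only the path, so query strings do not hide the extension.
            ext = splitext(urlsplit(url).path)[1].lower() or "(none)"
            counts[(ext, status)] += 1
    return counts

# Hypothetical capture export; real data would come from the capture tool.
SAMPLE = [
    ("http://app.example/static/logo.png", 404),
    ("http://app.example/js/history.js", 404),
    ("http://app.example/js/history.js?v=2", 404),
    ("http://app.example/loan/lookup", 500),
    ("http://app.example/index.html", 200),
]

if __name__ == "__main__":
    for (ext, status), n in errors_by_extension(SAMPLE).most_common():
        print(f"{status} {ext}: {n}")
```

A 404 on a .png is cosmetic; a 404 on a .js file can mean part of the page never renders. Grouping the errors this way makes that difference hard to miss.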

 

 

How Children Changed My View of People

It happened again this week. I was in a meeting when someone with questionable authority on the project stormed in and began berating the team, mostly spewing hypocrisy designed for political posturing to cover their own mistakes and weaknesses.

In short, it was a temper tantrum; classic 3-year-old behavior.

Once I had children of my own, a transformation took place where I came to realize that few adults are reflective enough to move beyond the mental state of a child. They might be physically grown and hopefully have some greater knowledge of the world, but emotionally and behaviorally, they are no different than children on the playground arguing over who gets to go down the slide first. It’s sad their parents didn’t teach them some basic lessons on respect, sharing, and communicating with others, even in cases of disagreement.

This is honestly a kinder way to view people, because the alternative is that they are completely in control of their behavior and believe that the bullying, the idle threats, and the yelling are acceptable in a professional setting. It’s a technique they have used successfully because there is no good way to give them a spanking and send them to “time-out.” Disturbingly, too often this behavior is encouraged as it is viewed as a path through the corporate bureaucracy – “that guy” may be wrong, he may be awful, but he gets things done. I’ve yet to see this be reality – instead, “that guy” uses that behavior to overshadow his deficits, and will produce a dysfunctional, ineffective organizational culture.

Please, fire “that guy” before he ruins you.

Basic Caching Math

I’ve seen this really simple problem now at two software companies, both filled with really bright developers. In both cases, the customer wanted to cache commonly-viewed images in memory at application startup. The number of images was known or could be approximated and would have well-defined categories. For example, perhaps they would want to cache pictures of Animals, Sports teams, etc. In both cases, the customers were complaining of long application startup time and significant memory issues, causing application outages. Can you guess what was broken? It’s simple math:

- A host with 2 GB of RAM.

- A Java process consuming 1 GB of RAM.

- 600 images requiring approximately 2 MB of RAM each.

See the problem? 600 images at roughly 2 MB each is about 1.2 GB, which is more than the memory left on the host once the Java process has taken its 1 GB. If your cache loading “algorithm” is to blindly go load the cache with the 600 images, your host will either run out of RAM or will thrash about trying to load those images into insufficient memory. Now, to be fair, where this gets complicated is if we can’t accurately estimate the size of the images required in RAM. Even in this era of virtualization and container-based deployment, understanding the physical limits of systems determines architectural choices.
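The arithmetic is simple enough to check before the cache loader is ever written. A minimal Python sketch using the numbers from the list above; the function name cache_fits is hypothetical, binary units are an assumption, and it ignores whatever the operating system itself needs.

```python
MB = 1024 * 1024
GB = 1024 * MB

def cache_fits(host_ram, already_used, item_count, item_size):
    """Return (fits, required_bytes, available_bytes) for a preloaded cache."""
    required = item_count * item_size
    available = host_ram - already_used
    return required <= available, required, available

if __name__ == "__main__":
    fits, required, available = cache_fits(2 * GB, 1 * GB, 600, 2 * MB)
    print(fits, required // MB, available // MB)  # False 1200 1024
```

If the check fails, the options are the usual ones: more memory, fewer or smaller images, or loading on demand instead of at startup.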

 

Simple Application Performance Problem Example

Let’s say you are using some kind of application performance tool (HP Diagnostics, New Relic, etc.) and you see a graph that looks like this:

[Screenshot: performance graph of the request]

What do we know by looking at this? That the request is taking 20 seconds because of four linear, synchronous HTTP requests. I see this kind of thing frequently when assessing the performance of customer applications, where subsystems are tested and considered to have acceptable performance in isolation. Then someone comes along and wants to tie several services together and doesn’t stop to reflect upon how 4 synchronous calls of 4 seconds each is automatically 16 seconds. How do you fix this? You have two basic choices:

1) Find a way to significantly improve the performance of each subsystem.

2) Find a way to call the subsystems asynchronously such that the overall execution time is reduced.
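The second choice can be sketched with Python's asyncio. This is a toy under stated assumptions: call_subsystem is a hypothetical stand-in that sleeps instead of making a real HTTP request, the subsystem names are invented, and the calls must be independent of each other for this to apply at all.

```python
import asyncio
import time

async def call_subsystem(name, seconds):
    """Hypothetical subsystem call; asyncio.sleep stands in for network I/O."""
    await asyncio.sleep(seconds)
    return name

async def call_sequentially(names, seconds):
    # Each call waits for the previous one: total time is the sum.
    return [await call_subsystem(n, seconds) for n in names]

async def call_concurrently(names, seconds):
    # All calls are in flight together: total time is roughly the slowest one.
    return await asyncio.gather(*(call_subsystem(n, seconds) for n in names))

def timed(coro):
    """Run a coroutine to completion and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start

if __name__ == "__main__":
    names = ["pricing", "inventory", "customer", "billing"]
    # 0.4 s per call instead of 4 s so the demo finishes quickly.
    _, sequential = timed(call_sequentially(names, 0.4))
    _, concurrent = timed(call_concurrently(names, 0.4))
    print(f"sequential: {sequential:.1f}s, concurrent: {concurrent:.1f}s")
```

Scaled back up to 4 seconds per call, the sequential version costs about 16 seconds and the concurrent one about 4. If one call needs the result of another, this does not help and you are back to the first choice.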

Why Tornado Framework?

In our last post (http://mercenarytech.com/2013/07/why-django-framework/) we covered Django framework. Django is a fine choice, with module and configuration organization as its biggest strength. If it fits your requirements then it is a safe choice; it’s certainly popular. But, my experience has been that in practice it doesn’t actually offer much benefit. When given a choice (and based on project requirements) I push customers towards Tornado Framework. In this post I will talk about the reasons.

A Little Strategy
When I evaluate application frameworks technically, I’m generally looking for:

1) Features that will buy time or at least not cost time. Sometimes simplicity is better than sophistication.

2) Assumptions on the part of the framework architect that do not fit my roadmap. These become the corners we paint ourselves into.

3) Deployment or user experience compromises. Are there any the framework is asking me to make? Again, see point #2.

Thus, when I’m thinking about the future technical direction of an application, features like automatically generating template code, say for a Controller class, are interesting, but that is something I can automate from an IDE. It may not buy me much as a part of an application framework. Sure, saving 30 seconds by generating boilerplate code is useful, especially if I will potentially write that same boilerplate 1000 times in the life of the project, but in part I judge a framework on not having to write that boilerplate at all.

Simplicity Matters

Simplicity counts for a lot in my book. Take Controllers again. A Java framework like Spring contains many types of Controllers, where you have to learn something about each kind in order to subclass the right one. Tornado contains two – one for http, one for websockets. The User model in Django is another example of this. Depending on your version of Django, the process for extending the User model is different and requires understanding the internals of Django to modify it correctly. With Tornado, there is a get_current_user() method; you simply override it with whatever representation of a user you wish – it could be as simple as a cookie.

Where Django and Flask don’t make sense to me is they seem to assume I’m building a web application that will serve up html pages, and in the case of Django, have a significant administrative component. There’s a further assumption that my application will be monolithic in nature and that I need an MVC framework tightly coupled to a persistence layer. This is not how applications are being constructed today. Applications have moved on from that model, with real-time push technology (like websockets) becoming commonplace, and ever more functionality pushed to the client side. Thus web applications that a generation ago would have rendered pages from the server are instead feeding json from the server to a user experience rendered entirely in the browser, which in turn is more commonly a mobile device. Django and Flask are not offering anything unique to this style of application apart from a convenient way to organize your source code.

The IOLoop / event model in Tornado is sadly overlooked. Tornado’s detractors would claim the moment you have to do something blocking (fetching from a database is often used as an example), you lose the benefit. This is a false assumption though, for three reasons:

1) There are asynchronous drivers available for many databases and queues.
2) This assumes a monolithic stack, where the app needs to make a direct connection to the database (likely false as you scale).
3) This ignores real data (google it) that Tornado is significantly more performant than Apache with WSGI (for example).

On the last point, I’d argue that setting up Apache with WSGI is not difficult, but it is more effort than Nginx with Tornado, for lesser performance. So why choose Django or Flask when that’s the standard deployment practice?

Tornado provides a basic template system, internationalization, and simple organization, and Python itself gives you parsing (json, XML, etc.) and logging, no framework needed. SQLAlchemy is at least as full-featured as Django ORM and integrates well. On top of this you get a high performance http server and mature websockets support. Throw in a built-in asynchronous http client and you have a framework that doesn’t get in your way but does give you the pieces to build functionality quickly. Thus for me, the power of Tornado over other frameworks is that it offers just enough and does not make any assumptions about the kind of applications I can build.

The Problem with APM / NPM / TPM / OA (and Everything Else Your Vendor Might Call It)

Here’s the question that derails Application Performance Management projects:

“Who is going to use it?”

It’s the people, man! Dynatrace, Introscope, OpNet, Diagnostics, ExtraHop – sure there are differences among products, but the single common difference between success and failure is the person using the tool.

It’s really simple:

    1. If you buy tools, but don’t use them… they aren’t solving any of your problems.
    2. If you buy sophisticated technical tools but don’t have the prerequisite background or invest in training… they might as well be doorstops.
    3. If other teams use the tool but don’t share the information learned… nobody benefits (silos are bad)

Unfortunately, I’ve seen too many customers over the years make major investments in these tools – licensing, professional services, etc – and then ultimately look around and realize that after the consultants leave, there’s no one who can actually use them.

Some might say that Predictive Analytics and Automation will solve that – just put the brains in the software – but we aren’t there yet. With each new vendor into the market (New Relic, AppDynamics, for example), there is a focus on building a better mousetrap – more technology, more metrics. This is not making it any easier on Corporate IT, which is already cash-strapped and down to minimal staffing. For now, the best tool really is the one that will get used.

 

Application Instrumentation Made Simple

There are many good application performance tools on the market today, supporting a variety of languages and in both software-as-a-service and traditional shrinkwrap forms. Whether using HP Diagnostics, CA Wily Introscope, Dynatrace, AppDynamics, New Relic, or something else, it is worthwhile to understand where and why to apply instrumentation.

While I have Java on my mind as I write this, the rules stay the same as we approach C# or even Python or Ruby. With tooling, too much information can be as much of a problem as too little.

What Not to Instrument

My recommendation to not instrument these cases is not a definitive rule. There are legitimate scenarios where instrumenting the list below could make sense, but generally as a secondary step as the location of a problem becomes clear.

1) Model / domain / “getter-setter” objects are seldom a source of performance overhead; they are simple data carriers.

2) Primitive types / built-in types. Imagine instrumenting java.lang.String. It will produce a firehose of data and even in the rare possibility that you find a legitimate issue, how will you get it fixed?

3) Virtual Machine code. If your focus is on the performance of your application, instrumenting the (Java | ruby | python | etc) VM itself is likely to artificially degrade performance and produce copious, unusable data.

4) Vendor code / libraries. This isn’t an absolute “don’t” but be aware that you are taking on a challenge. If you find a problem in your vendor’s code, you will need to take it all the way through to convincing them that the problem is real and requires a fix.

5) Stay away from utility behavior unless you have a really good reason to apply instrumentation. Case in point, logging. Logging involves I/O operations that are already a potential performance drain, so the last thing you want to do is make it worse (unless you’ve got a really good indication that logging is the problem).

[Figure: typical application instrumentation diagram]

What to Instrument
1) Reflect on the generic application diagram above. The first thing to understand is that with many tools, your effort is reduced because common frameworks and APIs are instrumented out of the box.

2) Focus on the business logic – the heart of the custom-built functionality.

3) Within the application, focus on Verb rather than Noun behaviors. Look for not only classes, but also specific methods where there is transactional behavior. Focus on specific classes and methods that interact with external systems or where there is a transition between the modules of the application – those are the places where things break.

4) Both when applying instrumentation and doing your analysis, don’t get too hung up on calculations, memory, or threads until you have an indication that they are the source of a problem. Recognize too that a profiler is different than a static analyzer.

Despite vendor warnings, you can get by with a lot of instrumentation if you know where to apply it. The main thing is to keep it focused on actual transaction components – all those classes in a system that control the workflow.
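To show the Verb-versus-Noun rule in miniature, here is a hand-rolled Python sketch. It is not how any of the commercial agents work; the instrument decorator, the TIMINGS table, and the Quote and QuoteService classes are all hypothetical. The data carrier is left alone, and only the workflow methods that cross a boundary are timed.

```python
import functools
import time
from collections import defaultdict

# Method name -> list of elapsed times in seconds.
TIMINGS = defaultdict(list)

def instrument(func):
    """Record wall-clock time for each call of a transactional method."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Recorded even when the call raises, so failures are visible too.
            TIMINGS[func.__qualname__].append(time.perf_counter() - start)
    return wrapper

class Quote:
    """Noun: a plain data carrier, deliberately left uninstrumented."""
    def __init__(self, amount):
        self.amount = amount

class QuoteService:
    """Verb: workflow code that crosses a boundary to an external system."""

    @instrument
    def fetch_rate(self, customer_id):
        time.sleep(0.01)  # stand-in for a call to a remote system
        return 0.05

    @instrument
    def price(self, customer_id, base):
        return Quote(base * (1 + self.fetch_rate(customer_id)))

if __name__ == "__main__":
    QuoteService().price("c-1", 100.0)
    for name, samples in TIMINGS.items():
        print(name, len(samples), f"{max(samples):.3f}s")
```

Two instrumentation points are enough to say whether the time went into the remote call or into the local logic, which is usually the first question worth answering.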

Installing the HP Diagnostics Java Agent

HP Diagnostics has agents for Java, .NET, and Python. The Java and Python agents support multiple operating systems, for which there are separate installers available. In the following example we will install the Java agent on Windows using Oracle WebLogic.

Step 1: Accept license agreement

Be prepared to hit the “enter” key many times if you are installing using the command-line installer.

Step 2: Choose Installation Directory

After this point, the Setup Module will automatically load, enabling configuration of the agent. If you are installing the agent from the command-line, you will need to navigate to the <installation directory>/bin directory and manually launch the setupModule script for your platform.

Step 3: Choose Configuration Options

Profiler Mode: The Agent can be run in a stand-alone configuration (no integration with the Commander), free of charge.

AD License: Use this if you intend to only integrate the Commander with LoadRunner or Performance Center.

AM License: The Diagnostics Agent is also used as a data collector for HP TransactionVision. Diagnostics can also be deployed in an HP SaaS configuration. If you are installing Diagnostics in your environment and are not using TransactionVision, then select only the “Diagnostics” option.

Step 4: Enter an Agent Name and Agent Group

This step is important, as the names used here will appear in the Diagnostics Commander interface. Agent Group is used where you have multiple agents all performing a similar task, for example “Production,” “Application123,” or “Cluster456.” Both Agent Name and Agent Group will be used by default for any Agent instances executed on this host. By appending “%” to your agent name, a unique incrementing number will be appended.

Step 5: Agent Configuration

In this step we are configuring the Agent to send its data to the Mediator. This may or may not involve a proxy server, depending on your environment. In many cases, the Agent and Mediator will be on the same subnet (good idea), with firewall configuration so that the Mediator and Commander can connect.

Step 6: Complete Installation

We will run the JREInstrumenter in the next step, so no need to run it in this step. If we were to select the checkbox in this step, the JREInstrumenter would run against the first JRE/JDK discovered on the system, which may or may not be the one used by our application. By manually executing it in the next step, we explicitly identify which JRE/JDK we intend to use.

Step 7: Proceed Using the JREInstrumenter

The JREInstrumenter is a separate application, accessible from the Windows program group. If you are installing from the command-line, you will need to navigate to the <installation directory>/bin directory and manually launch the JREInstrumenter script for your platform.

Using the JREInstrumenter, we select the JRE/JDK being used by the application to be monitored. The entire output of the JREInstrumenter is a string parameter we will append to our application startup.

Is running the JREInstrumenter required? It depends on the version of your JRE/JDK and Agent. HP strongly recommends that the JREInstrumenter be run as they reserve the right to apply additional Agent initialization features.

 

 

Step 8: Modify application bootclasspath

This step is application specific, but the summary is that you will append the parameter from the JREInstrumenter to the bootclasspath for your application. In some cases, such as when using IBM WebSphere application server, you may be able to use a graphical user interface. For Oracle WebLogic, there is a startup script where you can append the parameter. This step may take several attempts to work.

Step 9: View the Probe

You will know the probe is functioning when you can view its user interface at http://<host>:35000. Default username and password are both “admin.”
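If you would rather script that check than open a browser, a plain TCP connection test is enough to tell whether anything is listening on the probe port. This is a minimal Python sketch, not part of HP Diagnostics; the function name probe_is_listening is hypothetical, and it only confirms that the port accepts connections, not that the login works.

```python
import socket

def probe_is_listening(host, port=35000, timeout=3.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False

if __name__ == "__main__":
    # "localhost" is a placeholder; use the host where the agent is installed.
    print(probe_is_listening("localhost"))
```

A False here after restarting the application usually means the startup parameter from Step 8 did not take, which is the step most likely to need another attempt.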

 

 

 

 

Free Advice