Friday, June 26, 2009

QA - Load/performance Test Plan

The ultimate goal is to use load testing to find the performance bottleneck and eventually remove it. To find the bottleneck, we would like to collect real performance data from the production site, then try to reproduce the issues/activities offline (in the QA environment) and find the root cause. Then we can run a before/after test to measure the performance improvement before pushing the fix to production.

For a web-based application, the performance might be affected by the following factors:
  • Network Bandwidth - This includes external traffic (Internet, between the hosting servers and the end users) and internal traffic (intranet, between file servers, DB servers, and web servers)
  • Server Capacity - This includes the capacity (CPU, memory, extra load, etc.) of the web servers, file servers, and DB servers
  • Application Efficiency - This includes the response time of the front end (server side) and back end applications
    • Front End - This includes the efficiency of all server-side pages, scripts, pictures, and media files
    • Back End - This includes the efficiency of the database's stored procedures (SPs) and functions, and the quality of the data

So, before we spend effort on any or all of these areas, let's find out where the problem is first. Just as every business owner wants to know who their customers are, what they want, what they buy, how they buy, and how much they buy, we would like to learn the same things about our customers so we can put the items they want in the most convenient places for them. In our case, if we know what 80% of our customers are doing, searching, and/or listing, then we can improve the performance of those activities and raise the satisfaction of 80% of our customers.
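
As a rough illustration of the 80% idea, the sketch below finds the smallest set of most frequent activities that covers a target share of all logged activity. The activity names and counts are made-up sample data, and `top_activities` is a hypothetical helper, not part of any existing tool.

```python
from collections import Counter


def top_activities(activities, coverage=0.8):
    """Return (name, share) pairs for the most frequent activities,
    stopping once their combined share reaches `coverage`."""
    counts = Counter(activities)
    total = sum(counts.values())
    selected = []
    covered = 0
    for name, n in counts.most_common():
        if covered / total >= coverage:
            break
        selected.append((name, n / total))
        covered += n
    return selected


# Hypothetical activity log: one entry per user action.
log = ["search"] * 50 + ["list"] * 30 + ["register"] * 12 + ["report"] * 8

for name, share in top_activities(log):
    print(f"{name}: {share:.0%}")
```

With this sample data, searching and listing together already account for 80% of the activity, so those two would be the first candidates for tuning.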

To illustrate this "Behavior Analysis", let's look at a commonly seen scenario in our application: search and list records across multiple pages.
Possible implementations:
  • Execute the query and put the selected records (usually 10 per page) into a cursor. Because the cursor exists only in memory, we need to execute the whole query every time. However, because it works in memory, it runs faster.
  • Execute the query and put all selected records into a temp table. From here, we have the variations of using a temp table, a common table expression, or a global temp table.
Better implementation?
Rather than deciding which one is more efficient, we want to know which one better matches what we need, which is what the customer is actually doing. For example, if 80% of the transactions are users blindly searching and then paging through the results to look for what they want, then obviously how we handle the "paging" is more important than optimizing the core query they all execute. On the other hand, if most transactions show that the search/list is finished within the first 1-3 pages, then the process of getting that result set becomes critical. Additionally, if our database is already suffering from "write" load, then we want to take the always-multiple-write behavior of the first implementation into consideration. And again, if our profiler data tell us our SPs spend a lot of time "waiting" to read, then dumping the whole result set into a session-based global temp table and reading exclusively from there will give us a better result. From here, the analysis goes on and on until we find a solution that fits our customers' pattern.
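
One way to tell these two cases apart is to measure how deep users actually page. The sketch below assumes we can extract, for each search, the deepest page the user reached; the sample numbers and the `share_within_pages` helper are hypothetical.

```python
def share_within_pages(max_page_per_search, limit=3):
    """Fraction of searches whose deepest viewed page is <= `limit`."""
    if not max_page_per_search:
        return 0.0
    shallow = sum(1 for page in max_page_per_search if page <= limit)
    return shallow / len(max_page_per_search)


# Hypothetical data: deepest page reached in each of ten searches.
depths = [1, 1, 2, 3, 1, 7, 2, 12, 1, 3]

share = share_within_pages(depths)
print(f"{share:.0%} of searches stopped within the first 3 pages")
```

A high share suggests focusing on how quickly the first result set is produced; a low share suggests focusing on how paging itself is handled.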

This technique for analyzing data is called "data mining": finding the pattern in a large amount of data and then providing a solution to the business. In particular, we have noticed that our issues must be related to user behavior, because they always happen on Monday morning. The application itself has no reason to act differently based on the time, so the cause must be how the users are using the application. But how? Let's find out.

First, we want to determine the bottleneck from data, not from assumptions. So we would like to capture and analyze the whole round trip, from the front end (the end user) to the back end and then back to the front end.
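
As a minimal sketch of this kind of analysis, assume each captured request can be broken down into time spent per tier. The tier names and millisecond values below are invented for illustration, and `tier_shares` is a hypothetical helper.

```python
def tier_shares(requests):
    """Sum the time spent per tier and return each tier's share of the total."""
    totals = {}
    for request in requests:
        for tier, ms in request.items():
            totals[tier] = totals.get(tier, 0) + ms
    grand_total = sum(totals.values())
    if not grand_total:
        return {}
    return {tier: ms / grand_total for tier, ms in totals.items()}


# Hypothetical per-request timings in milliseconds.
samples = [
    {"network": 40, "web": 60, "db": 300},
    {"network": 50, "web": 40, "db": 310},
]

shares = tier_shares(samples)
slowest = max(shares, key=shares.get)
print(f"Largest share of round-trip time: {slowest} ({shares[slowest]:.0%})")
```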

Second, we want to identify the transactions/activities happening during the peak hour. This is similar to a shopping site wanting to know how far its potential customers go before they leave the site, so that it can improve the "not-so-attractive" pages. For us, we want to know at which point users move on to another activity (register, take courses, enter data, etc.). Once we have the traffic analysis, we can also provide it to our customers as feedback on usage/popularity so they can adjust their content, length, interaction, etc.
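
To find the peak hour from captured traffic, one simple approach is to bucket request timestamps by weekday and hour. The timestamps below are made-up sample data and `busiest_slots` is a hypothetical helper; a real log would need its own parsing step.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def busiest_slots(timestamps, top=3):
    """Count requests per (weekday, hour) slot and return the busiest ones."""
    buckets = Counter((WEEKDAYS[ts.weekday()], ts.hour) for ts in timestamps)
    return buckets.most_common(top)


# Hypothetical request timestamps (2009-06-22 was a Monday).
requests = [
    datetime(2009, 6, 22, 9, 5),
    datetime(2009, 6, 22, 9, 17),
    datetime(2009, 6, 22, 9, 48),
    datetime(2009, 6, 22, 10, 2),
    datetime(2009, 6, 22, 10, 30),
    datetime(2009, 6, 23, 14, 12),
]

for (day, hour), count in busiest_slots(requests):
    print(f"{day} {hour:02d}:00 - {count} requests")
```

The same bucketing can be repeated per activity type to see what users are doing inside the busiest slot.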

Third, once the pattern is identified, we can create/capture test scripts and reproduce it on our non-production site for verification. When we know the pattern, we can make business decisions on how to address the most impacted areas. Since our QA capacity is different from production, we can adjust the traffic accordingly while still basing it on the pattern.
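
Scaling the traffic down while keeping the pattern could look like the sketch below. The production counts and the 0.25 capacity ratio are assumptions for illustration, and `scale_mix` is a hypothetical helper.

```python
def scale_mix(production_counts, qa_capacity_ratio):
    """Scale per-activity request counts to QA capacity, keeping the same mix.
    Every activity keeps at least one request so rare paths are still exercised."""
    return {
        name: max(1, round(count * qa_capacity_ratio))
        for name, count in production_counts.items()
    }


# Hypothetical peak-hour counts observed in production.
production = {"search": 500, "list": 300, "register": 120, "report": 80}

# Assumption: the QA environment handles about a quarter of production load.
qa_plan = scale_mix(production, 0.25)
print(qa_plan)
```

The resulting counts can then drive however many virtual users or scripted requests the load test tool runs for each activity.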

Fourth, once we have a proposed fix, we can test it in our controllable environment. We not only want to fix the bottleneck caused by the pattern, we also want to prevent another pattern from causing a bottleneck. In this stage, we can increase/decrease the pressure on network traffic, DB response time, system resources on the web servers occupied by other applications, etc. After this, we know whether the fix really fixes the issue (how can we fix an issue without being able to reproduce it?).
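
For the before/after measurement, comparing a high percentile of response time is often more telling than comparing averages, since a few very slow requests are usually what users notice. The sketch below uses the nearest-rank method on invented response times; `percentile` and `improvement` are hypothetical helpers.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def improvement(before, after, pct=95):
    """Return (before, after, relative gain) at the given percentile."""
    b = percentile(before, pct)
    a = percentile(after, pct)
    return b, a, (b - a) / b


# Hypothetical response times in milliseconds for the same test script.
before_ms = [120, 130, 125, 900, 140, 135, 128, 132, 138, 126]
after_ms = [110, 115, 112, 300, 118, 116, 111, 114, 117, 113]

b, a, gain = improvement(before_ms, after_ms)
print(f"p95 before: {b} ms, after: {a} ms, improvement: {gain:.0%}")
```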

Fifth, the behavior/pattern will change. The tools and data we create need to be able to capture and analyze future data to help us identify new patterns. In the future, we just need to inject the data we captured from the troubled client within a certain time frame. We can then either explain the limitations of the application to the client with solidly analyzed data, or tune our application from there.

Sixth, this technique/tooling/methodology is reusable even when we move on to the next phase of ACS Learning. By being able to analyze "our" application, we can play a more professional role in servicing and retailing another vendor's (SumTotal's) product, because we know our clients and we know our business.
