Saturday, September 4, 2010

Performance Problems with Data Processing in Mid-size Businesses


Data is growing by leaps and bounds. Search Google for the world's largest databases and the results range from a few hundred terabytes to Yahoo's 2-petabyte database. These are known as Extremely Large Databases (ELDB). Yahoo and the other giants have enough money to spend on turning that big data pie into intelligent information. The concern of this article is not the ELDBs. It is the second and third tiers of databases, where sizes range from single-digit terabytes (Large Databases, or LDB) to a few hundred terabytes (Very Large Databases, or VLDB). Most enterprises with an IT system have databases at the LDB or VLDB scale.

The big player

According to IDC's Database Market Share Analysis (2008), Oracle has the highest market share (44.3%), followed by IBM (21%), Microsoft (18.5%), Sybase (3.5%) and Teradata (3.3%). Gartner's numbers are different, but even in their report Oracle is the market leader, with a 47.1% share (as of 2007). The problem is that these traditional database vendors, to remain in the ascendancy, keep pumping research budgets into BI for ELDBs (Extremely Large Databases). This article attempts to give an insight into the not-so-big databases of mid-size businesses. And since the Oracle database has the largest share, let us concentrate on Oracle.

The evolution

The IT system of any organization evolves over a period of time. A typical IT system evolves in three stages: an initial stage (capturing transactions), an intermediate stage (process optimization) and an advanced stage (business intelligence). In the initial stage, the enterprise tries to build an application that helps it perform its day-to-day operations and capture events; recording the materials purchased by the purchase department and the payments made to vendors are examples. In the intermediate stage, it tries to find trends in the processes followed in the organization and optimize those processes where required. For example, a vendor's payment clearance from the accounts department might be taking two days because of some batch processing or because of clearances needed from other departments. The third, advanced stage is the one where realistic and intelligent information is squeezed out of the captured data. We know it as business intelligence. Trends in sales during a monsoon sale, or the effect of discounts on customers' buying behaviour, are examples of such intelligence.


In this approach, the advanced stage and, to some extent, the intermediate stage require a large amount of data processing and number crunching. At this point the organization has neither a BI system nor a data warehouse to handle its data-intensive queries. The most the IT department can do is a) replicate the database so that the processing does not affect the production environment and b) merge, scrub, aggregate and organize the data to give users some kind of OLAP-like (Online Analytical Processing) replica of the OLTP databases for running their large queries. On top of this, they may also de-normalize the tables. But these activities carry a lot of complexity if they need to be done daily. Performing the replication and re-arranging the data would be a nightmare for the DBAs. One option is to write a job, but even then, maintaining it and continually tuning it to match the growing volume of data would be tough (one way of automating such a refresh is sketched below). Is there a sustainable and inexpensive solution to this problem? Let us see. Let us study a few measures that organizations at the second and third stages can apply to improve database performance without moving to high-end solutions. Let me remind you, we are referring here to the Oracle database.
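As an illustration of that automation, the sketch below uses an Oracle materialized view that rebuilds a denormalized summary once a day. The schema, table and column names (sales, customers and so on) are invented for the example, not taken from any particular system.

    -- Hypothetical nightly-refreshed summary for analysts to query,
    -- so their heavy aggregations do not run against the OLTP tables all day.
    CREATE MATERIALIZED VIEW sales_summary_mv
      BUILD IMMEDIATE
      REFRESH COMPLETE
      START WITH SYSDATE
      NEXT SYSDATE + 1                  -- rebuild once a day
    AS
      SELECT c.region,
             TRUNC(s.sale_date, 'MM')   AS sale_month,
             SUM(s.amount)              AS total_amount,
             COUNT(*)                   AS order_count
      FROM   sales s
      JOIN   customers c ON c.customer_id = s.customer_id
      GROUP  BY c.region, TRUNC(s.sale_date, 'MM');

The users then point their large queries at the summary, and the scheduled rebuild takes the place of a hand-maintained replication job.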

Some problems and some solutions

First, let us look at how Oracle prefers to handle large tables and the complex queries it has to run. Oracle uses a component called the optimizer. Every time a query is fired at the database, it goes to the optimizer, and the optimizer chooses the best plan to fetch the data. There are two types of optimizer: the rule-based optimizer (RBO) and the cost-based optimizer (CBO). Rule-based optimization is almost obsolete in Oracle databases. Every Oracle query has multiple possible execution plans. An execution plan includes which index to use for a particular column, the order in which the tables are joined, the join method to use (nested loop join, hash join and so on), and whether any parallel execution is possible. These execution plans are chosen by the CBO algorithm. To get a glimpse of an execution plan, Oracle provides several methods, for example the 'explain plan' utility, SQLT or TKPROF.
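For instance, the 'explain plan' utility can be used from SQL*Plus roughly as follows; the table and column names are made up for illustration.

    -- Ask the optimizer for its chosen plan without running the query.
    EXPLAIN PLAN FOR
      SELECT o.order_id, o.amount
      FROM   orders o
      WHERE  o.customer_id = 42;

    -- Format and display the contents of the plan table.
    SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);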

The question here is: how does the CBO build these plans? Every database environment is different, so the CBO algorithm needs some inputs from the environment to generate its execution plans. These inputs are the statistics that Oracle gathers for the tables. The statistics include basic information about a table, such as the number of records, and calculated information such as the 'skewness' of the data (elaborated below). This mechanism of gathering inputs and finding the best execution plan works pretty well most of the time, but there are conditions where it fails to deliver. One flaw in calculating the statistics is that it needs a percentage parameter specifying how much of the data is to be taken as a sample for the calculation. As we would all agree, sampled data is always subject to error. It has been observed that most DBAs prefer a sample of 10-15% of the data. In one experiment we gathered statistics with a 30% sample and there was a drastic difference in the results. Fortunately the change was positive, but be prepared for it to go the other way as well.
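The sampling percentage is supplied when the statistics are gathered, typically through the DBMS_STATS package. A minimal sketch of gathering statistics for one table with a 30% sample (the schema and table names are placeholders) could look like this:

    BEGIN
      DBMS_STATS.GATHER_TABLE_STATS(
        ownname          => 'SALES_APP',   -- placeholder schema
        tabname          => 'ORDERS',      -- placeholder table
        estimate_percent => 30,            -- larger sample than the usual 10-15%
        cascade          => TRUE);         -- gather index statistics as well
    END;
    /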

Another problem arises when the database objects are analyzed: Oracle calculates the skewness of the data. Skewness here is defined as the difference between the largest and the smallest value in the column. If the data is heavily skewed, Oracle tries to build a histogram for that column. The histogram distributes the column values into buckets and helps while retrieving the results. The limitation is that Oracle does not consider the values in between. For example, if the column values fall in the ranges 1-10000 and 100000-200000, Oracle builds a histogram for the range 1-200000; it does not understand that the range 10001-99999 is empty. Oracle is an efficient and intelligent system, but there are limits to its intelligence, and we need to work around them.
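A histogram can also be requested explicitly through the METHOD_OPT argument of DBMS_STATS. The sketch below asks for up to 254 buckets on a hypothetical skewed column; all names are placeholders.

    BEGIN
      DBMS_STATS.GATHER_TABLE_STATS(
        ownname    => 'SALES_APP',
        tabname    => 'ORDERS',
        method_opt => 'FOR COLUMNS SIZE 254 ORDER_VALUE');  -- up to 254 buckets
    END;
    /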

One solution to the problems discussed above is to retain a known-good set of statistics for the database instead of regenerating them blindly (one way of doing this is sketched below). Another is to increase the sample size used to analyze the database objects; this takes more time, but the results improve. Changing environment parameters may also work, but while doing so one needs to take the utmost care and thoroughly understand the impact of each parameter. There are further solutions to such problems which are beyond the scope of this article.
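As one way of retaining statistics, the known-good set can be exported to a statistics table and the table's statistics locked so that a later gather does not silently overwrite them; the names below are placeholders.

    BEGIN
      -- Create a holding table and copy the current statistics into it.
      DBMS_STATS.CREATE_STAT_TABLE(ownname => 'SALES_APP', stattab => 'SAVED_STATS');
      DBMS_STATS.EXPORT_TABLE_STATS(ownname => 'SALES_APP', tabname => 'ORDERS',
                                    stattab => 'SAVED_STATS');
      -- Prevent future gathers from replacing the statistics on this table.
      DBMS_STATS.LOCK_TABLE_STATS(ownname => 'SALES_APP', tabname => 'ORDERS');
    END;
    /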

Any discussion of performance improvement is incomplete without a reference to OCI (Oracle Call Interface). We all know PL/SQL: we write procedures and functions in PL/SQL to process and present the data. OCI can be thought of as a helping hand when PL/SQL objects become a performance bottleneck. OCI is a C-based API provided by Oracle that gives a low-level language such as C an interface to the Oracle database. It provides all the facilities needed to access the database, from administrative tasks to firing queries, and it fully supports the data types, calling conventions, syntax and semantics of C and C++. Let us see how OCI can make a difference in overall performance compared with PL/SQL.

The figure shows the working of a PL/SQL block. The PL/SQL engine has to dismantle the block and fetch the data from the database; the bulkier the PL/SQL blocks, the more work the engine has to do. Here is the key: let such bulky operations be performed by a compiled C program instead of handing them to the PL/SQL engine, the former being faster. With OCI we can bypass the engine altogether and send the queries directly to the Oracle server, so it is apparent that OCI can work faster. But here too we need to take great care when building such programs. OCI is a very powerful tool, and like every powerful thing it can devastate the environment if not used cautiously.
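A complete OCI program is C code and too long to reproduce here, but the cost in question, the PL/SQL engine shuttling to and from the SQL engine for every row, can at least be seen from within PL/SQL. The sketch below is not OCI; it simply uses a bulk fetch to hand the server one large operation instead of many small ones, and every name in it is invented.

    DECLARE
      TYPE amount_tab IS TABLE OF orders.amount%TYPE;
      l_amounts amount_tab;
    BEGIN
      -- One round trip between the PL/SQL and SQL engines for the whole set,
      -- instead of one context switch per row in a cursor loop.
      SELECT amount
      BULK COLLECT INTO l_amounts
      FROM   orders
      WHERE  order_date > ADD_MONTHS(SYSDATE, -1);

      DBMS_OUTPUT.PUT_LINE('Rows fetched: ' || l_amounts.COUNT);
    END;
    /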

These are a few measures that can help when an enterprise does not have a full-fledged BI system. But again, these measures are not long-lasting. Tuning databases is tricky, and so is using programming languages to access the data. As data grows and requirements become more complex, an organization will eventually need one of two things: a BI system, or an enormously efficient team of administrators and programmers. The choice is yours.
