Oracle Index Fundamentals
Indexes are one of the most powerful and most misunderstood aspects of SQL
performance. Consider the following table:
create table students (student_id number, name varchar2(20), department_id number);
and a session issuing a query such as the following:
select name from students where department_id = 12;
Without an index on the department_id column, the session has to scan the
entire table to find all the matching rows. This is not only slow, as the
session has to read every block of the table, but also resource intensive, as
the session consumes a lot of CPU and IO. The time the session takes to execute
the query is directly proportional to the size of the table and to the
available CPU resources and IO bandwidth. The execution plan will be as follows:
and the corresponding execution statistics are as follows:
To understand the concept of an index, a simple analogy is the index at the
front or back of a book. It is much easier to find the subject we are
interested in by referring to the index and flipping to those pages than by
searching each and every page of the book. Similarly, a database index lets us
visit only the table blocks that hold the relevant information. Whenever a user
executes a query, the optimizer generates the possible execution plans and
picks the plan with the lowest cost based on criteria such as the presence of
indexes, the number of blocks, NDV and so on. Without an index, the optimizer
is forced to pick a full table scan. Now let's create an index on the
department_id column of the students table using the following statement:
CREATE INDEX STUDENTS_DEPTID_IDX ON STUDENTS (DEPARTMENT_ID);
The optimizer will consider the index in subsequent executions of the SQL
statement in its cost estimates, and if the cost of accessing the table
through the index is lower than the cost of a full table scan, the optimizer
picks a plan with the index. The execution plan will be as follows:
and the corresponding execution statistics are as follows:
Let’s look at the 10053 trace (level 12) of the session executing the
statement. There is a reason for including the session’s optimizer trace here:
several concepts are at play, and we will look at each of them in the
subsequent sections. Pay special attention to the entries highlighted below:
***************************************
BASE STATISTICAL INFORMATION
***************************************
Table Stats::
  Table: STUDENTS  Alias: STUDENTS
  #Rows: 999999  SSZ: 0  LGR: 0  #Blks: 4906  AvgRowLen: 30.00  NEB: 0  ChainCnt: 0.00
  ScanRate: 0.00  SPC: 0  RFL: 0  RNF: 0  CBK: 0  CHR: 0  KQDFLG: 1
  #IMCUs: 0  IMCRowCnt: 0  IMCJournalRowCnt: 0  #IMCBlocks: 0  IMCQuotient: 0.000000
Index Stats::
  Index: STUDENTS_DEPTID_IDX  Col#: 3
  LVLS: 2  #LB: 2724  #DK: 4801  LB/K: 1.00  DB/K: 203.00
  CLUF: 978571.00  NRW: 999999.00  SSZ: 0.00  LGR: 0.00  CBK: 0.00
  GQL: 0.00  CHR: 0.00  KQDFLG: 8192  BSZ: 1
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for STUDENTS[STUDENTS]
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
  kkecdn: Single Table Predicate:"STUDENTS"."DEPARTMENT_ID"=123
  Column (#3): DEPARTMENT_ID(NUMBER)
    AvgLen: 4  NDV: 4801  Nulls: 0  Density: 0.000208  Min: 0.000000  Max: 4800.000000
  Estimated selectivity: 2.0829e-04 , col: #3
  Table: STUDENTS  Alias: STUDENTS
    Card: Original: 999999.000000  Rounded: 208  Computed: 208.289731  Non Adjusted: 208.289731
  Scan IO Cost (Disk) = 1330.000000
  Scan CPU Cost (Disk) = 224937594.640000
  Cost of predicates:
    io = NOCOST, cpu = 50.000000, sel = 0.000208 flag = 2048  ("STUDENTS"."DEPARTMENT_ID"=123)
  Access Path: TableScan
    Cost: 1337.374257  Resp: 1337.374257  Degree: 0
      Cost_io: 1330.000000  Cost_cpu: 274937545
      Resp_io: 1330.000000  Resp_cpu: 274937545
 ****** Costing Index STUDENTS_DEPTID_IDX
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_SCAN
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_FILTER
  Estimated selectivity: 2.0829e-04 , col: #3
  Access Path: index (AllEqRange)
    Index: STUDENTS_DEPTID_IDX
    resc_io: 207.000000  resc_cpu: 1555648
    ix_sel: 2.0829e-04  ix_sel_with_filters: 2.0829e-04
    Cost: 207.041725  Resp: 207.041725  Degree: 1
  Best:: AccessPath: IndexRange
  Index: STUDENTS_DEPTID_IDX
         Cost: 207.041725  Degree: 1  Resp: 207.041725  Card: 208.289731  Bytes: 0.000000
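The selectivity and cardinality figures in the trace can be reproduced directly from the column statistics it shows: with no histogram, the equality selectivity is 1/NDV, and the estimated cardinality is the table row count times that selectivity. A minimal Python check:

```python
# Reproduce the optimizer's estimates from the statistics in the 10053 trace.
num_rows = 999999   # #Rows from Table Stats
ndv = 4801          # NDV of DEPARTMENT_ID from the column statistics

selectivity = 1 / ndv          # equality predicate, no histogram
cardinality = num_rows * selectivity

print(f"{selectivity:.4e}")    # 2.0829e-04, matching "Estimated selectivity"
print(round(cardinality))      # 208, matching "Rounded: 208"
```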
From the above we can clearly see that the optimizer picks the index, as the
cost of a full table scan is very high compared to accessing the table through
the index STUDENTS_DEPTID_IDX. If we look at the execution statistics, the
session accessing the table through the index had to read only 265 blocks
compared to 4871 blocks with a full table scan, reducing the number of blocks
read by 94.6%. You might wonder why there are no physical reads in the
execution statistics for either the full table scan or the index range scan;
if you have gone through the direct path read section, you will understand
when Oracle caches the blocks of a table during a full table scan.
Before proceeding with index concepts, let’s look at how the optimizer
decides whether to use an index or perform a full table scan.
Storage Level Perspective:
From the storage perspective, let’s look at how Oracle handles full
table scans and index range scans.
Full Table Scan:
During a full table scan, we know that the server process uses multiblock IO,
reading multiple blocks at a time, to read the table contents. This is evident
in the OS calls the server process makes while reading, which can be obtained
from the strace output. The following is the strace output of a server process
performing a full table scan.
As we can see from the above, each read request (pread64) reads a contiguous
1 MB chunk of data from the datafiles (since db_file_multiblock_read_count,
more precisely _db_file_exec_read_count, is set to 128 and the database block
size is 8k, each read request translates to 1 MB). If Async IO is made
available using the filesystemio_options parameter, we would see that the
session issues io_submit OS calls instead of pread64.
Even with Async IO enabled, as highlighted in the above snippet, the maximum
IO size per call is still 1 MB.
Considering factors such as the following:
- Smaller initial extent sizes (extent allocation auto at the tablespace level).
- Extent boundaries.
- The range of formatted blocks between the low HWM and the high HWM.
the server process issues about 3074 OS calls to read a segment about 3 GB in
size. The output of the iostat command during the same period is as follows:
r/s indicates the read IOs per second.
rkB/s indicates the size in KB read per second.
util is the disk utilization rate.
Pay close attention to the rkB/s, r/s and util columns as highlighted. Since
the session issues large IOs, full table scans are more likely to saturate the
bandwidth (the maximum allowed data transfer rate) at the storage layer, even
though the storage is capable of handling many more IOPS.
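As a sanity check on the numbers above: a 3 GB segment read in 1 MB multiblock chunks needs at least 3 × 1024 = 3072 read calls, which lines up with the observed 3074 once the factors listed earlier (extent boundaries, formatted-block ranges) add a few extra, smaller reads. A sketch, assuming an 8 KB block size and a multiblock read count of 128:

```python
import math

block_size = 8 * 1024        # 8 KB database block
mbrc = 128                   # db_file_multiblock_read_count (_db_file_exec_read_count)
max_io = block_size * mbrc   # bytes per multiblock read call

segment_bytes = 3 * 1024**3  # the ~3 GB segment from the example
min_calls = math.ceil(segment_bytes / max_io)

print(max_io)                # 1048576 bytes, i.e. 1 MB per pread64
print(min_calls)             # 3072 -- the strace showed about 3074 calls
```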
Index Scan:
During an index range/unique scan, a session traverses the index B-tree
structure to find the index leaf blocks that contain the matching key value,
and based on the ROWIDs obtained from those leaf blocks the session reads the
corresponding table blocks. The strace output of a session performing an index
range/unique scan is as follows:
As you can see from the above, the session issues pread64 calls that read one
database block at a time (database block size 8 KB); pread64 is a synchronous
read call. Oracle employs various optimizations such as table/index
prefetching and batching during index range scans, where a session ends up
waiting on either db file scattered read (a multiblock read of contiguous
blocks) or db file parallel read (which batches multiple data block read
operations from different regions of the disk into one IO operation) to speed
up index scans. The following snippet of strace output with Async IO enabled
shows the io_submit calls.
As we can see from the above io_submit call, the session batched about 127
block read operations into one call; whether io_submit is used at all depends
on whether Async IO is enabled through filesystemio_options. io_submit is an
asynchronous read call, and its API allows the caller to batch multiple read
requests, or a single one, into one operation. The output of the iostat
command can be seen below.
Pay attention to the rkB/s, r/s and util columns as highlighted: the session
issues multiple small IO requests, or batches multiple small read operations
into one. A session performing an index range scan is therefore more likely to
saturate the IOPS at the storage layer.
Summarizing the concept:
Since they issue large IO requests, sessions performing full table scans are
more likely to saturate the bandwidth at the storage layer, even though the
storage is capable of handling many more IOPS. Sessions performing index range
scans are more likely to saturate the IOPS, as they issue many small IO
requests. An HDD that provides about 150-200 IOPS can only sustain a data
transfer rate of about 1.17 – 1.56 MBps when sessions perform random IOs.
Remember that the IO performed by a session accessing table contents through
an index is mostly random in nature.
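The throughput figure quoted above follows directly from the IOPS and the single-block read size: at 8 KB per random read, 150-200 IOPS translates to roughly 1.17-1.56 MBps. A quick check:

```python
block_kb = 8                        # single-block random read size in KB
for iops in (150, 200):
    mbps = iops * block_kb / 1024   # KB/s -> MBps
    print(f"{iops} IOPS -> {mbps:.2f} MBps")
# 150 IOPS -> 1.17 MBps
# 200 IOPS -> 1.56 MBps
```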
Optimizer’s Decision
The optimizer uses factors such as the following to estimate the cost of
accessing a table using a particular access method:
- Initialization parameters:
  o db_file_multiblock_read_count
  o optimizer_index_cost_adj
  o optimizer_index_caching
  o optimizer_mode
- System statistics:
  o SREADTIM
  o MREADTIM
  o MBRC
  o IOSEEKTIM
  o IOTFRSPEED
- Table statistics:
  o Blocks below the HWM
- Index statistics:
  o Index height (BLEVEL)
  o Selectivity of a value
  o Clustering factor
  o Number of leaf blocks
Let’s look at how Oracle estimates the cost of a full table scan vs an index
range scan.
Cost of full table scan:
Cost of index range scan:
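The two headings above are usually filled in with the classical IO cost model. As a hedged sketch (the traditional formulas; the real costing adds CPU cost and further adjustments, and the noworkload timing values below are assumptions), note that the index formula reproduces the resc_io of 207 seen in the 10053 trace from the index statistics BLEVEL=2, #LB=2724 and CLUF=978571:

```python
import math

def full_table_scan_io_cost(blocks, mbrc, mreadtim, sreadtim):
    """Classical IO cost of a full table scan: one pass over the blocks
    below the HWM in multiblock reads, each multiblock read weighted by
    the MREADTIM/SREADTIM ratio."""
    return 1 + math.ceil(blocks / mbrc) * (mreadtim / sreadtim)

def index_range_scan_io_cost(blevel, leaf_blocks, clustering_factor,
                             ix_sel, ix_sel_with_filters):
    """Classical IO cost of an index range scan plus table access:
    branch levels + leaf blocks visited + table blocks visited."""
    return (blevel
            + math.ceil(leaf_blocks * ix_sel)
            + math.ceil(clustering_factor * ix_sel_with_filters))

# Index statistics from the 10053 trace earlier in this section:
sel = 2.0829e-04
print(index_range_scan_io_cost(2, 2724, 978571, sel, sel))  # 207, matching resc_io

# Full table scan with assumed noworkload values (SREADTIM 12, MREADTIM 26, MBRC 8):
print(round(full_table_scan_io_cost(4906, 8, 26, 12)))      # ~1331, near the trace's 1330
```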
The following are the most important factors, listed in order of magnitude of
impact, that can tip the balance from a full table scan to an index range scan
and vice versa:
- Effective selectivity
- DB_FILE_MULTIBLOCK_READ_COUNT & MBRC
- OPTIMIZER_INDEX_COST_ADJ
- CLUSTERING_FACTOR
- OPTIMIZER_INDEX_CACHING
Effective Selectivity:
Selectivity is the portion of rows, as a fraction of the total rows in a row
source, retrieved by a query for a particular predicate or set of predicates.
Oracle calculates selectivity as a value between 0 and 1, 0 meaning no rows
retrieved and 1 meaning all rows retrieved. If this value is close to 0, the
predicate is more selective and the query is more likely to benefit from an
index-based scan. The formula to calculate the selectivity is as follows.
The selectivity of a particular predicate or set of predicates is very useful
in determining the actual access costs. By default, selectivity is based on
the high value, the low value and the NDV (number of distinct values), with
the assumption that data is evenly distributed between those two points.
Histograms provide better selectivity estimates for unevenly distributed
data, but do remember that histograms may be mandatory even on primary key
columns for queries with range predicates (>, < or BETWEEN), especially when
there are gaps in the ranges. Without histograms, the selectivities for a
predicate are calculated as follows:
Predicate       Selectivity
c1 = '32'       1/NDV
c1 > '32'       (High - Value) / (High - Low)
c1 >= '32'      (High - Value) / (High - Low) + 1/NDV
c1 < '32'       (Value - Low) / (High - Low)
c1 <= '32'      (Value - Low) / (High - Low) + 1/NDV
c1 like '32'    1/NDV
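The rules in the table can be expressed directly in code. A sketch, assuming a numeric column, no histogram, and a value inside the [low, high] range (out-of-range values are handled differently by the optimizer):

```python
def selectivity(op, value, low, high, ndv):
    """Selectivity without histograms, per the table above,
    assuming numeric data and low <= value <= high."""
    if op in ("=", "like"):
        return 1 / ndv
    if op == ">":
        return (high - value) / (high - low)
    if op == ">=":
        return (high - value) / (high - low) + 1 / ndv
    if op == "<":
        return (value - low) / (high - low)
    if op == "<=":
        return (value - low) / (high - low) + 1 / ndv
    raise ValueError(op)

# DEPARTMENT_ID stats from the earlier trace: Min 0, Max 4800, NDV 4801
print(f'{selectivity("=", 123, 0, 4800, 4801):.4e}')  # 2.0829e-04
```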
Unless the predicate is an equality predicate over uniformly distributed
data, bind variables in the query present a challenge to the optimizer,
especially for range predicates, as the optimizer is unaware of the value of
the bind variable during query optimization. In this case the optimizer has
no choice but to assume the selectivities as follows:
Predicate       Selectivity
c1 = :B1        1/NDV
c1 > :B1        Default of 5%
c1 >= :B1       Default of 5%
c1 like :B1     Default of 25%
It’s enough to know that the data histograms provide is more accurate, and
with them the optimizer has a better chance of accurately calculating the
selectivities for a given predicate or set of predicates. In the subsequent
section and in the optimizer statistics part we will look at various scenarios
where histograms are mandatory.
db_file_multiblock_read_count & MBRC:
This is one of the parameters that shifts the balance between a full table
scan and an index range scan toward the full table scan. Oracle recommends not
setting this parameter explicitly; it is better to let Oracle determine the
value. Contrary to popular belief, the optimizer actually uses the value of
_db_file_optimizer_read_count, or the MBRC obtained from system statistics, to
calculate the cost of a full table scan instead of
db_file_multiblock_read_count. When system statistics are not available, or
when the MBRC value is null in aux_stats$ (which stores the system
statistics), Oracle uses _db_file_optimizer_read_count to determine the cost.
This parameter defaults to 8 when Oracle automatically determines the value of
db_file_multiblock_read_count, and if db_file_multiblock_read_count is set
manually, _db_file_optimizer_read_count always equals
db_file_multiblock_read_count. Remember that regardless of the values of
db_file_multiblock_read_count and _db_file_optimizer_read_count, Oracle always
uses MBRC if MBRC and MREADTIM were calculated as part of the system
statistics. The following are the only ways we can influence the optimizer’s
cost estimates for a full table scan:
- Without system statistics:
  o Set DB_FILE_MULTIBLOCK_READ_COUNT to a value above 8.
- With system statistics:
  o Manually increase the MBRC.
  o Reduce the difference between SREADTIM and MREADTIM (keeping MREADTIM/SREADTIM > 1).
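The precedence described above can be summarized as a small decision sketch (an illustration of the rules as stated, not Oracle's actual code):

```python
def optimizer_multiblock_read_count(mbrc_from_system_stats,
                                    dbfmbrc_set_manually,
                                    dbfmbrc_value):
    """Which multiblock read count the optimizer costs a full table scan
    with, per the precedence described above (illustrative only):
    1. MBRC from gathered system statistics, when present;
    2. otherwise _db_file_optimizer_read_count, which follows a manually
       set db_file_multiblock_read_count, or defaults to 8."""
    if mbrc_from_system_stats is not None:
        return mbrc_from_system_stats
    if dbfmbrc_set_manually:
        return dbfmbrc_value
    return 8

print(optimizer_multiblock_read_count(None, False, 128))  # 8
print(optimizer_multiblock_read_count(16, True, 128))     # 16
```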
OPTIMIZER_INDEX_COST_ADJ
This parameter lets us make the optimizer’s access path selection more or
less index friendly, that is, more or less prone to selecting an index access
path over a full table scan. The default value is 100, at which the optimizer
treats the cost of a single IO through an index as equal to the cost of a
single multiblock IO. Setting the parameter to 10 makes the index access path
cost about 1/10th of the original cost.
The following snippet of a 10053 level 12 trace shows the details. Here
optimizer_index_cost_adj is set to 50, indicating that the cost of index
access is half the original cost.
The optimizer initially estimates the cost of the table access through the
index at 8746, and since optimizer_index_cost_adj is set to 50, it simply
adjusts the cost of index access to half the original cost.
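The adjustment itself is a straight percentage of the index access cost. Using the 8746 estimate mentioned above:

```python
def adjusted_index_cost(index_cost, optimizer_index_cost_adj=100):
    """optimizer_index_cost_adj scales the index access path cost:
    100 leaves it unchanged, 50 halves it, 10 makes it one tenth."""
    return index_cost * optimizer_index_cost_adj / 100

print(adjusted_index_cost(8746, 50))   # 4373.0 -- half the original estimate
print(adjusted_index_cost(8746, 10))   # 874.6
```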
As with every feature, there are arguments that oppose the idea and others
that favor it; the same goes for this parameter. Let’s look at both sides of
the discussion, starting with the argument that favors setting this parameter:
Arguments that favor adjusting optimizer_index_cost_adj:
During normal OLTP operations, most of the table and index blocks are already
cached in memory. Consider the default values for the parameters listed below:
optimizer_index_caching = 0
optimizer_index_cost_adj = 100
For each SQL statement executed against the database, the CBO calculates the
IO cost as if it has to read all the blocks from disk; in other words, since
the optimizer is not aware of the blocks that are already cached in memory, it
estimates costs as if every block requires a physical IO. In reality this is
far from the truth, as OLTP workloads demand that queries run as fast as
possible, so the execution plans are mostly index-based lookups and nested
loop joins; we rarely encounter hash joins and full table scans.
That being said, the optimizer uses factors such as the clustering factor,
the selectivity of a predicate and the number of index leaf blocks accessed to
determine the tipping point between a full table scan and an index range scan.
Especially for tables with a high clustering factor (due to the inherent way
it is calculated), the blocks actually read from disk may be far fewer than
the estimates, and the optimizer may end up picking a full table scan. There
are also cases where queries select only a fraction of the rows, or access
only a few blocks more than the CBO’s tipping point allows, but most of those
blocks are already present in the buffer cache, so the queries need only a few
blocks from disk. In such situations, reducing optimizer_index_cost_adj can
tip the scale from a full table scan to an index range scan, and the
performance of the application or the overall TPS can increase considerably.
When the table’s average row length is small, the clustering factor statistic
may be far from the actual reality, and with ASSM this statistic will likely
be high, provided there are enough sessions executing DML statements in
parallel. In these scenarios there is a high chance that range queries (<, >,
or combinations of <= and >=) will hit the same set of blocks; using the
table_cached_blocks statistics preference can help in adjusting the clustering
factor, thereby influencing the optimizer’s cost estimates. Unless that is the
case for all the tables in the application, reducing optimizer_index_cost_adj
makes sense when there is a high chance that queries benefit from re-reading
the same set of already cached blocks and select only a small subset of data.
There is a negative side to this, which is discussed in the following section.
Arguments against setting optimizer_index_cost_adj:
Factors such as the following should be considered before decreasing the
value of optimizer_index_cost_adj:
- How busy is the buffer cache, and is it appropriately sized?
- Is the workload primarily DML oriented?
- The portion of SQL statements that depend on the optimizer’s tipping point between a full table scan and an index range scan.
- Tables with a small average row length and a high clustering factor.
In a typical OLTP application, the queries that depend on the optimizer’s
tipping point between a full table scan and an index range scan are relatively
few, and a properly sized buffer cache contains buffers that are already warm.
Unless properly worked out, reducing the value of optimizer_index_cost_adj to
help a few queries that depend on the tipping point can result in a set of
cascading issues, as a session eventually has to find free buffers to read the
data blocks into. If the buffers are clean, a session can simply reuse them,
but if those blocks are required again they have to be read from disk again,
and this can increase the IOPS on the server significantly. If the buffers are
dirty, then the DBWn process has to flush the dirty blocks to make room for
the new blocks coming in. This also increases latch activity at the instance
level, as buffers are linked and unlinked from the chains more often. In these
cases increasing the buffer cache will help reduce the impact, but to what
extent depends on the variations in the workload.
If the workload is primarily DML oriented, then ASSM, by design, spreads the
DML load across the available free blocks, and sessions inserting data are
likely to end up inserting into different blocks rather than into the same
block that contains free space; it is therefore highly likely that the buffer
cache contains more dirty blocks. Reducing optimizer_index_cost_adj to tip the
balance away from full table scans can then result in SQL statements bringing
more blocks into the buffer cache if they are not already present. Since
batching and prefetching fetch multiple data blocks in one IO request when
Async IO is enabled (the process is more aggressive with Async IO), the demand
for free buffers arises at one point in time or another, and this eventually
adds pressure on the DBWn process to write dirty blocks out to make room for
free buffers. If the redo information for a block has not yet been flushed to
disk, the DBWn process can post/signal the LGWR process to write the redo
entries to disk before it can write the data block.
This entire process can significantly increase the IOPS on the server.
We have to remember that this parameter is global and affects the CBO
estimates for all the queries that execute on the instance. When the table’s
row size is large, the clustering factor statistic of the index tends to be
more accurate. If a SQL statement selects a significant portion, even 2%, of
the rows of a large table, reducing optimizer_index_cost_adj can result in
reading a large portion of the blocks into the buffer cache, which not only
increases latch activity at the instance level but also increases the elapsed
time to the point where a full table scan would be faster; this can impact
other sessions as well. In these cases a full table scan reads each data block
only once and may be faster due to SAN caching and/or features such as
read-ahead.
When the clustering factor statistic is artificially very high, sessions
executing a SQL statement may end up reading the same set of blocks over and
over again; in these cases reducing the optimizer_index_cost_adj parameter
raises the threshold, and the sessions may end up performing more logical IO
than the reads performed by a full table scan.
Remember that this parameter does not apply to user-defined cost functions
for domain indexes. Given all the cases listed above in which reducing
optimizer_index_cost_adj can have a potential negative impact on performance,
in most cases it is better to leave this parameter at its default. Many argue
that comparing the average wait times of db file scattered read and db file
sequential read can be used as a starting point for determining the value of
optimizer_index_cost_adj, but remember that the average wait time of these
events is at the mercy of several factors: the disk utilization rate, the
average response times, and whether there is a sufficient sample (the number
of times sessions waited on db file scattered read). Also, the number of
blocks read per db file scattered read is not constant when there are few
buffered full table scans, as this wait event can also occur during
prefetching, which reads fewer blocks than the value specified by
db_file_multiblock_read_count.
CLUSTERING_FACTOR:
OPTIMIZER_INDEX_CACHING:
OPTIMIZER_INDEX_CACHING lets us adjust the cost of an index probe in favor of
nested loop joins or in-list iterators. The range of values, 0 to 100,
indicates the percentage of index blocks assumed to already be present in the
buffer cache. The default value of 0 indicates that the buffer cache has no
index blocks in it, that Oracle has to read the index blocks into memory from
disk, and that these reads should be treated as physical IOs; this can make
the optimizer favor a full table scan. During the optimization phase, the
optimizer has no clue which blocks are already cached in memory; by default it
assumes it has to read all the blocks from disk and costs the plans
accordingly.
During normal OLTP database operations, most of the index and table blocks
are already cached in the buffer cache, but the optimizer has no clue what
percentage of these blocks is already present there. The only ways to
influence the optimizer’s cost estimates are the parameter
OPTIMIZER_INDEX_CACHING, which indicates the percentage of index blocks that
are likely to be found in the buffer cache, and OPTIMIZER_INDEX_COST_ADJ,
which adjusts the cost of index probes.
Setting this parameter to a higher value makes nested loop joins and in-list
iterators look less expensive to the optimizer; as a result the optimizer is
more likely to pick a nested loop join over a hash or sort merge join, and to
pick an index via an in-list iterator over other indexes or a full table scan.
A value of 100 infers that 100% of the index blocks are likely to be found in
the buffer cache, and the optimizer adjusts the cost of an index probe through
an in-list iterator or nested loop accordingly. Once this parameter is set to
a particular value, the CBO cost algorithm is adjusted as follows:
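The exact internal formula is not published; a plausible sketch (an assumption based on the parameter's documented meaning, not Oracle's actual code) discounts only the IO cost of the index blocks themselves, leaving the table block IOs untouched:

```python
import math

def cached_index_io_cost(blevel, leaf_blocks, ix_sel,
                         optimizer_index_caching=0):
    """Hypothetical sketch: optimizer_index_caching discounts the IO cost
    of the index blocks (branch + leaf visits) by the percentage assumed
    cached; table block IOs are not discounted. Illustration only."""
    index_block_ios = blevel + math.ceil(leaf_blocks * ix_sel)
    return index_block_ios * (1 - optimizer_index_caching / 100)

# Using the index statistics from the earlier trace as sample inputs:
print(cached_index_io_cost(2, 2724, 2.0829e-04, 0))    # 3.0 -- no assumed caching
print(cached_index_io_cost(2, 2724, 2.0829e-04, 50))   # 1.5 -- 50% assumed cached
```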
For example, if we set this parameter to 50, the CBO will consider 50% of all
IOs associated directly with index blocks to be already cached and therefore
reduces the overall cost of the index block IO by 50%. The following snippet
of a 10053 level 12 trace shows the cost of executing a query with an in-list
iterator.
****** Costing Index IDX
OPTIMIZER PERCENT INDEX CACHING = 50
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_SCAN
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_FILTER
  Using prorated density: 1.6414e-04 of col #4 as selectivity of out-of-range/non-existent value pred
  Using prorated density: 1.6247e-04 of col #4 as selectivity of out-of-range/non-existent value pred
  Using prorated density: 1.6079e-04 of col #4 as selectivity of out-of-range/non-existent value pred
  Using prorated density: 1.5912e-04 of col #4 as selectivity of out-of-range/non-existent value pred
  Access Path: index (RangeScan)
    Index: IDX
    resc_io: 7862.993271  resc_cpu: 73793163
    ix_sel: 6.4652e-04  ix_sel_with_filters: 6.4652e-04
    Cost: 7865.064483  Resp: 7865.064483  Degree: 1
****** Costing Index IDX
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_SCAN
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_FILTER
  Using prorated density: 1.6414e-04 of col #4 as selectivity of out-of-range/non-existent value pred
  Using prorated density: 1.6247e-04 of col #4 as selectivity of out-of-range/non-existent value pred
  Using prorated density: 1.6079e-04 of col #4 as selectivity of out-of-range/non-existent value pred
  Using prorated density: 1.5912e-04 of col #4 as selectivity of out-of-range/non-existent value pred
  Access Path: index (RangeScan)
    Index: IDX
    resc_io: 7908.978163  resc_cpu: 74163478
    ix_sel: 6.4652e-04  ix_sel_with_filters: 6.4652e-04
    Cost: 7911.059769  Resp: 7911.059769  Degree: 1
Remember that the amount of caching depends on various factors that the
optimizer cannot predict, such as the load on the system, block access
patterns, variations in workload at different times and so on.
Similar to optimizer_index_cost_adj, this is a global parameter, which means
that any value we set will be an average over all the indexes and all the
situations in which this parameter can have an effect. In an OLTP database,
some indexes are quite small, heavily accessed and likely to be cached in
their entirety in the buffer cache, whereas a few are huge and accessed less
frequently. A single value for this parameter averages out over all of them.
While optimizer_index_caching indicates the percentage of index blocks cached
in the buffer cache, there is no equivalent parameter for table blocks, and as
such Oracle treats all IO for table blocks as physical IO. One way to
influence the behavior is to set the optimizer_index_cost_adj parameter, which
effectively reduces the overall cost of index-based table lookups. As with
optimizer_index_cost_adj, care should be taken when setting this parameter to
a particular value, as it can tip the balance between a nested loop and a hash
join when the hash join could easily have been faster than the nested loop
join. Likewise, the same goes for index-based lookups versus full table scans
for queries involving in-list iterators.
How a Session performs a full table scan:
The following diagram depicts how a session reads the table during a full
table scan.
The optimizer uses the statistics and the availability of indexes to
determine whether it is better to read the table contents using a full table
scan or an index range scan. Since we are talking about full table scans: once
the decision is made to read the table using a full table scan, the session
starts by reading the segment/table header block to get the details regarding
the extent map (how many extents are allocated), their addresses, the low HWM,
the high HWM and so on. It then reads db_file_multiblock_read_count (more
precisely _db_file_exec_read_count) blocks at a time, one extent at a time, up
to the low HWM, as the session is 100% certain that all the blocks below the
low HWM are formatted; it carefully reads only the formatted blocks between
the low HWM and the high HWM, since blocks in that range and above the high
HWM can be either formatted or unformatted. Oracle simply skips unformatted
blocks, as it is known that they do not contain any data. The performance of a
full table scan is influenced by various factors, such as the availability of
Async IO and how busy the disks/buffer cache are (in the case of buffered
reads); we will look at these in the direct path read wait event section.
These reads are basically sequential in nature.
Index Fundamentals:
How Oracle manages a B-Tree index:
In Oracle, the B-Tree index is the default index type; a CREATE INDEX DDL
statement always results in a B-Tree index. In a B-Tree index, the blocks are
arranged in a balanced tree-like structure with branch and leaf nodes.
Contrary to popular belief regarding the root node, Oracle doesn’t have a
special type of block called a root node or root block; all the blocks in an
index are either BMBs (which are used to track the free space), branch blocks
or leaf blocks. The branch block that contains the pointers to all the other
branch blocks can be considered the root node. In terms of performance,
B-Trees are excellent for a wide variety of queries, including those with
equality and range predicates. A B-Tree index is balanced because all leaf
blocks automatically stay at the same depth; thus retrieving a record from
anywhere in the index takes approximately the same amount of time and the same
number of lookups. Every index has a height associated with it: the height is
the number of blocks required to go from the root block to a leaf block. The
following diagram shows the basic structure of a B-Tree index.
Branch Blocks:
Branch blocks are used for traversing/searching the B-Tree structure;
depending on the size of the index, branch blocks are arranged in levels, or
in other words a branch block can have pointers either to leaf blocks or to
other branch blocks. Branch blocks store the minimum key prefix needed to make
a branching decision between two keys. This technique enables the database to
fit as much data as possible on each branch block. The number of keys and
pointers stored in a branch block is limited by the block size and the column
length.
The dump of a branch block contains details such as the following. This is a
dump of the branch root block:
Branch block dump
=================
header address 139999760347212=0x7f543bfbf04c
kdxcolev 2
kdxconco 1
kdxconro 600
kdxcofbo 1228=0x4cc
kdxcofeo 2689=0xa81
kdxcoavs 1461
kdxbrlmc 29369552=0x1c024d0
kdxbrsno 599
kdxbrbksz 8056
kdxbr2urrc 0
The explanation of the relevant fields is as follows:
kdxcolev – indicates the index height
kdxconco – number of columns in the index
kdxconro – number of entries in this block
kdxcofbo – offset to the beginning of the free space within the block
kdxcofeo – offset to the end of the free space within the block
kdxcoavs – available free space in the block
kdxbrlmc – pointer to other branch or leaf blocks
kdxbrbksz – maximum usable space within the block (less than the block size due to overhead)
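The branch dump above shows 600 entries in the root branch block (kdxconro 600); together with the 2724 leaf blocks (#LB) reported in the earlier index statistics, this is consistent with the LVLS: 2 in the trace: a single ~600-way root cannot address 2724 leaf blocks on its own, but two branch levels can. A rough check, treating 600 as a typical branch fan-out (an assumption):

```python
import math

leaf_blocks = 2724   # #LB from the index statistics in the trace
fanout = 600         # entries in the root branch block (kdxconro)

# Branch levels needed so a single root can reach every leaf block:
blevel = math.ceil(math.log(leaf_blocks, fanout))
print(blevel)        # 2, matching LVLS: 2 in the 10053 trace
```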
When a table is initially created, an extent may not be allocated (depending on
the DEFERRED_SEGMENT_CREATION parameter); only when the first row is inserted
are extents allocated to the table and to its corresponding indexes, if any.
Oracle does not create branch blocks until a certain threshold, i.e. a certain
number of rows, has been inserted into the table. So the very first block
immediately following the L3 BMB (segment header) will be a leaf block. Only
when enough rows have been inserted into the table (causing that leaf block to
split) does Oracle create branch nodes: it reformats the first block following
the segment header from a leaf block into a branch block, and once enough
further rows are inserted, Oracle splits that branch block into a root branch
block and other branch blocks. The root branch block is always the first block
following the index's segment header block, that is, the fourth block in the
initial extent allocated to the segment, as shown in the following figure.
Even though the entries for a single column value can span multiple leaf
blocks, the branch blocks store pointers to all of those leaf blocks. Do
remember that a delete operation can cause an update or deletion of an entry in
a branch block.
Leaf blocks:
All the leaf blocks reside at the same level, or height. The leaf blocks
contain the indexed column value and a corresponding ROWID, which points to the
actual location (table data block) where the row is stored. Each entry in a
B-Tree index is sorted by key value. Leaf blocks are arranged as a doubly
linked list, with pointers to the previous and the next leaf block, so that
during a range scan, or for queries with predicates such as > or <, a session
can simply traverse the leaf blocks instead of reading the branch blocks over
and over again.
Leaf block dump
===============
header address 140609502597364=0x7fe2337450f4
kdxconco 1
kdxlenxt 29368846=0x1c0220e
kdxleprv 0=0x0
The relevant fields are as follows:
kdxleprv – pointer to the previous leaf block
kdxlenxt – pointer to the next leaf block
The following is the leaf block format:
There is a reason for specifying the PCTFREE here for the index leaf
block and we will get back to this shortly.
Let’s look at how the entries are stored in an index leaf block:
row#0[8021] flag: -------, lock: 0, len=11
col 0; len 1; (1): 80
col 1; len 6; (6): 01 c0 01 5b 00 5f
row#1[8010] flag: -------, lock: 0, len=11
col 0; len 1; (1): 80
col 1; len 6; (6): 01 c0 01 5d 00 34
As you can see from the above, each entry requires 11 bytes:
2 bytes to store the flag and lock
1 byte to store the length of col 0
1 byte to store the col 0 value
1 byte to store the length of col 1
6 bytes to store the col 1 value – the hex/internal representation of the ROWID
When a B-Tree index is created without the PCTFREE and INITRANS parameters
being specified, Oracle automatically sets the default PCTFREE to 10 and
INITRANS to 2 (for a table, the default INITRANS is 1). Unlike tables, where
PCTFREE specifies the amount of free space reserved for future updates, PCTFREE
for an index block specifies the amount of space reserved for future updates
and inserts of new key values. PCTFREE is only relevant during the initial
index creation or rebuild (we will later see why a rebuild causes the index
size to increase); it causes optimal splitting of the B-Tree in preparation for
subsequent growth. The idea is to do as much splitting as possible during the
initial creation and avoid paying the penalty later, during inserts into the
table. This is what a high PCTFREE setting on an index gives us. An increased
PCTFREE value can significantly increase the bloat in an index and is not
recommended for all workloads; the default value of 10 should suffice in most
cases, and we will look at these scenarios eventually.
Similar to a table, once the index is created, an index block can accommodate
keys until there is no space left in the block (this includes the space for
ITLs as well, unless that space is reserved using the INITRANS parameter). In
order to modify the contents of a particular block, a session must first
acquire an ITL slot, but what happens if a table/index block is completely
full? Sessions succeed in acquiring ITL slots if the slots were reserved using
the INITRANS parameter during table or index creation, but what if INITRANS is
not set, the value defaults to 1, and multiple sessions try to modify the same
table/index block? In this scenario, Oracle handles the situation based on the
type of block. When a table block is completely full and multiple sessions try
to modify it, the block cannot accommodate more ITL slots; the first session
which modifies the block acquires the ITL slot (one is reserved by default),
but the second and subsequent sessions end up waiting on the enq: TX - allocate
ITL entry wait event until the first session releases its slot. Even if the
statement is, say, a delete statement affecting multiple rows, until the first
session commits its transaction, the rest of the sessions will remain blocked
on enq: TX - allocate ITL entry. This is not the case with index blocks: when
an index block is full and multiple sessions try to modify it, since the block
does not have enough free space to hold additional ITL slots, Oracle straight
away performs a block split, which can be either 50-50 or 90-10 depending on
where in the index structure the block resides.
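Whether a segment has actually suffered from ITL pressure can be checked from the segment statistics. The following is a sketch (the schema name VISHNU and the index name are taken from the examples in this chapter):

```sql
-- Segments that have experienced ITL waits:
SELECT object_name, value
FROM   v$segment_statistics
WHERE  statistic_name = 'ITL waits'
AND    owner = 'VISHNU'
AND    value > 0;

-- Reserve more ITL slots up front; existing blocks only pick up the
-- new setting after a rebuild:
ALTER INDEX students_deptid_idx INITRANS 8;
ALTER INDEX students_deptid_idx REBUILD;
```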
How are key – ROWID pairs stored in a leaf block?
By default, if we don't specify a DESC clause in the CREATE INDEX DDL
statement, the index stores all the key-ROWID pairs in ascending order. For
example, if we create a non-unique index on EMP_ID (whose values happen to be
unique, by the way), the index leaf block contains entries similar to the
following:
The order of EMP_ID takes precedence over the ROWID, which makes sense as the
values are stored in ascending order in an index leaf block. So, what if an
index key value maps to multiple ROWIDs? In this case, the database sorts those
entries by ROWID and stores them in that order in the leaf block. In other
words, the entries in the leaf blocks are sorted by key value and then by
ROWID. One might wonder why Oracle has to sort the ROWIDs in the index leaf
blocks; to answer that question, we first need to understand what a ROWID is.
A ROWID is a pseudocolumn that returns the address of a row. Oracle Database
ROWID values contain the following information:
- The datafile in which the row resides (the file number is relative to the
tablespace)
- The data block in which the row resides
- The data object number of the object
- The position of the row in the data block (the first row is 0)
The value of the ROWID increases with the FILE_ID, the location of the data
block in the datafile, and the position of the row in that data block. Remember
that the ROWID of a row only changes when the row is moved to a different
partition by an update operation; in other cases, even if an update results in
a migrated row, the ROWID entries in the index do not change, as Oracle uses
pointers with a forwarding address.
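The components of a ROWID can be inspected with the supplied DBMS_ROWID package. A sketch against the STUDENTS table from the earlier example:

```sql
-- Decompose each ROWID into its parts: data object number, relative
-- file number, block number and row number within the block.
SELECT rowid,
       dbms_rowid.rowid_object(rowid)       AS data_object_id,
       dbms_rowid.rowid_relative_fno(rowid) AS relative_fno,
       dbms_rowid.rowid_block_number(rowid) AS block_no,
       dbms_rowid.rowid_row_number(rowid)   AS row_no
FROM   students
WHERE  department_id = 12;
```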
When the ROWIDs are persisted in sorted order, there is a high probability that
they point to data blocks that reside near each other on disk, so that on
rotating storage the disk spindle does not have to move much, reducing seek
time: the head moves in one direction from the higher to the lower tracks, and
by the time the last ROWID is reached it has made only one complete arc. If the
ROWIDs were not stored in ascending order, there would be a high chance that
the data blocks are scattered randomly across the data file and the head has to
move back and forth multiple times to retrieve the data, which not only
increases seek time but also reduces the life span of the disk. That is one
explanation; the other is that for queries with predicates such as COL1=:A1 AND
COL2=:A2 where both columns are indexed (two separate indexes), since the
ROWIDs are already sorted, it is very easy to intersect the two ROWID lists to
filter the rows; otherwise the session would end up sorting the ROWIDs first.
Index Block Splits:
Leaf Block Splits:
An index leaf block split happens in two ways:
- 50-50
- 90-10 (misleading; it should really be called 100-0)
50-50 splits:
A 50-50 split happens when an index leaf block becomes full. This can happen
for any index, even when the key value is monotonically increasing. In this
case Oracle takes a free index block, links it into the index structure, and
moves approximately half of the index entries of the full block into the new
leaf block. The following snippet of an index tree dump shows the details:
leaf: 0x2c0010e 46137614 (175: row:268.268 avs:3928)
leaf: 0x2c0016a 46137706 (176: row:265.265 avs:3973)
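Tree dumps like the one above are keyed by the OBJECT_ID of the index. A sketch of how to produce one (the OBJECT_ID value 76210 is illustrative; substitute the value the first query returns):

```sql
-- Find the OBJECT_ID of the index:
SELECT object_id
FROM   dba_objects
WHERE  object_name = 'STUDENTS_DEPTID_IDX'
AND    owner = 'VISHNU';

-- Write the index tree dump to the session trace file:
ALTER SESSION SET EVENTS 'immediate trace name treedump level 76210';
```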
90-10 splits:
A 90-10 split happens only for columns with monotonically increasing values,
i.e. for the blocks at the rightmost edge of the index. The 90-10 naming is
really misleading, as the correct name should be 100-0: once the index leaf
block is full, a new leaf block is linked into the index structure, and
subsequent inserts are directed into that new leaf block. The following is a
snippet of an index tree dump.
leaf: 0x2c4726b 46428779 (29: row:533.533 avs:6)
leaf: 0x2c4726c 46428780 (30: row:533.533 avs:7)
leaf: 0x2c4726d 46428781 (31: row:533.533 avs:6)
leaf: 0x2c4726e 46428782 (32: row:199.199 avs:5013)
As we can see from the above, the index leaf blocks are almost full.
Branch/Root Branch Block Splits:
A branch block split is subject to the same rules as a leaf block split; it can
occur in two ways:
- 50-50
- 90-10
(A root block is a special type of branch block and should be treated as a
branch block, as noted before.)
DML behavior
Insert an Index entry:
Once the index segment is created, new rows inserted into the table will
eventually create index entries (key-ROWID pairs). Depending on the index key
values, these entries are inserted into the appropriate index leaf blocks until
those blocks are full; once an index leaf block is full and there is no space
left, a subsequent insert will cause the leaf block to split. An index leaf
block can be split using either the 50-50 or the 90-10 approach, depending on
its location in the index structure. Several factors, such as sequences, the
column order of the index, etc., determine whether the block split is 50-50 or
90-10.
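The relative frequency of the two split types can be observed from the instance statistics. A sketch (these statistic names are standard V$SYSSTAT entries):

```sql
-- "leaf node splits" counts all leaf block splits, while
-- "leaf node 90-10 splits" counts only the splits at the right-hand
-- edge of the index; the difference approximates the 50-50 splits.
SELECT name, value
FROM   v$sysstat
WHERE  name IN ('leaf node splits',
                'leaf node 90-10 splits',
                'branch node splits');
```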
When an index key value maps to multiple ROWIDs (which is typically the case
for most non-unique columns), these entries are stored in the leaf block in
ROWID order. So the entries for that particular index key value are inspected
in turn until one is found with a greater ROWID than the new row's, and the
key-ROWID pair is inserted at that point. This doesn't mean that all the
corresponding leaf blocks will be read. Let's look at various such scenarios:
When a column value is populated as a monotonically increasing value:
In most databases/environments, column values populated with a monotonically
increasing value, using either an application sequence or a database sequence,
are usually never updated, are mostly protected by a unique or primary key
constraint, and stay constant unless the corresponding rows are deleted or
older partitions are dropped (e.g., ORDER_ID/STUDENT_ID/EMPLOYEE_ID remain
constant and are unlikely ever to change). This is very important to remember
before proceeding, as it helps us decide whether or not to rebuild indexes
built on such columns.
Considering an ideal situation in which the column value of each subsequent
insert is higher than the previous value, i.e. the value is increasing
monotonically, the index entries are inserted into the rightmost index leaf
block; as a result, once that leaf block is full it is split using the 90-10
approach (more precisely, the 100-0 approach), in which a new leaf block is
attached to the doubly linked list and subsequent inserts are directed into it.
In this case we see no wastage of space. But remember that since the block is
100% used (unless we have space reserved in the index block by defining
INITRANS during index creation or rebuild), when multiple sessions try to
delete the table rows pointed to by this leaf block, they must first add ITL
entries to the index leaf block header, and this operation requires space (the
default INITRANS for an index is 2); as there is no free space in the leaf
block, Oracle splits the index leaf block further using the 50-50 approach,
which results in wasted space. Also remember that for indexes PCTFREE is
honored only during index segment creation (CREATE INDEX ..., ALTER INDEX ...
REBUILD, etc.); the space is not reserved for future updates or inserts during
normal processing.
The following is a snapshot of what a 90-10 block split looks like:
leaf: 0x2417821 37845025 (83: row:533.533 avs:6)   // space usage is almost full
leaf: 0x2417822 37845026 (84: row:533.533 avs:7)
leaf: 0x2417829 37845033 (85: row:533.533 avs:6)
leaf: 0x241782a 37845034 (86: row:533.533 avs:6)
leaf: 0x241782b 37845035 (87: row:533.533 avs:7)
leaf: 0x241782c 37845036 (88: row:303.303 avs:3454)
The above behavior regarding 90-10 splits (precisely, 100-0 splits) is
optimistic and is not usually what happens in most environments; it occurs only
when the database sequence is created with NOCACHE ORDER, or when the
application guarantees that each subsequent insert produces a higher value for
the column. With application sequences, in a highly concurrent system, we
cannot guarantee that the next insert statement will carry a higher value, as
this depends on factors such as network delays, CPU scheduling, the distributed
cache synchronization mechanism, etc.; the number of sessions inserting data
into the table can widen the gap, which results in 50-50 splits. For example,
if there are 20 sessions inserting data into the table, a thread that is
supposed to insert the value 10000 may get delayed by a few milliseconds due to
OS CPU scheduling, thread synchronization, JVM garbage collection, etc.; by
that time the other sessions have already inserted many rows and a 90-10
(100-0) split has occurred, and when this delayed session finally tries to
insert its data, it finds that the target index leaf block is full, which
ultimately results in a 50-50 block split.
Before proceeding, let's look at the two most common ways that applications
generate monotonically increasing values:
1. Application sequences
2. Database sequences
Application Sequences:
There are several aspects of application, Java, or synchronization-mechanism
behavior that a DBA should be aware of.
Application sequences, or custom code or functions to generate certain values:
Database sequences inherently generate only numeric values. If the application
demands sequence values in a particular format, such as those used for the
ORDER_ID columns in a retail application:
0000-001-012-0000310
0000-001-012-0000321
then using database functions reduces the overall concurrency compared to
generating the sequences in the application, since there are multiple layers
involved; it can also lead to library cache contention, depending on the
complexity of the function and the number of sources from which it retrieves
data, not to mention the interim tables or sequences it requires to hold the
maximum values. In cases such as these, application developers are forced to
write separate code to generate and handle such values.
Multiple sessions inserting data:
When multiple application processes or threads insert data and/or generate
sequence values, several factors come into play:
- CPU scheduling algorithm
- Network delays
- Process synchronization mechanisms used for generating sequence values, etc.
In multi-threaded and/or multiple application server environments (where the
application or middle tier runs on multiple servers), we cannot guarantee that
subsequent column values (rows inserted into the table) will be higher than the
last value, as shown below (this is for string or character-based columns; we
will look into this shortly):
pool-1-thread-19 --- 0000-001-180533
pool-1-thread-19 --- 0000-001-181259
pool-1-thread-50 --- 0000-001-180522
Notice the gap between the values. The scenario is basically the same when we
use application sequences to generate integer values (this is just a snippet
showing how subsequent inserts may result in lower values as well):
pool-1-thread-37 --- 202757
pool-1-thread-5 --- 202259
pool-1-thread-5 --- 202759
pool-1-thread-34 --- 202247
Due to the nature of the 90-10 split (precisely, 100-0 split), a block may be
full, and a subsequent insert statement carrying a lower value will cause that
index leaf block to be split using the 50-50 approach, thereby wasting even
more space. The synchronization mechanisms used, CPU scheduling, the
distributed cache layer, network delays, etc., also contribute to such sequence
gaps. For the reasons specified earlier, when populating columns with
monotonically increasing values for single-column indexes (primary or unique)
or for the leading column of an index, always prefer database sequences, as
there is less risk of space wastage.
Database Sequences:
When using database sequences to populate the value of a particular column, we
cannot guarantee a 90-10 split each time for the underlying index unless:
- the sequence is created with the NOCACHE ORDER clause,
- only one session inserts the data, or
- only a few sessions insert the data, and infrequently at that.
Creating the sequence with NOCACHE ORDER essentially undermines concurrency, as
sessions will be plagued by wait events such as row cache lock, latch: shared
pool, etc. To improve concurrency, it is recommended to alter the sequence to
CACHE and NOORDER; due to the behavior explained earlier, this can result in
50-50 index leaf block splits, adding more space wastage.
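The trade-off above can be sketched in DDL form (the sequence name order_id_seq is illustrative; the cache size is a tunable choice, not a recommendation for every workload):

```sql
-- NOCACHE ORDER guarantees strictly increasing values but serializes
-- sessions on the sequence (row cache lock, latch contention):
CREATE SEQUENCE order_id_seq NOCACHE ORDER;

-- CACHE/NOORDER scales far better, at the price of out-of-order
-- values across sessions/instances, and hence more 50-50 splits:
ALTER SEQUENCE order_id_seq CACHE 1000 NOORDER;
```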
The following snippet of an index treedump reveals the details:
leaf: 0x1c016da 29365978 (212: row:480.480 avs:8)    → 90-10 block split
leaf: 0x1c016e4 29365988 (213: row:480.480 avs:9)    → 90-10 block split
leaf: 0x1c016ab 29365931 (214: row:480.480 avs:9)    → 90-10 block split
leaf: 0x1c016bb 29365947 (215: row:245.245 avs:3531) → result of 50-50 block split
leaf: 0x1c016ea 29365994 (216: row:237.237 avs:3652) → result of 50-50 block split
leaf: 0x1c0169b 29365915 (217: row:480.480 avs:9)    → 90-10 block split
leaf: 0x1c016b7 29365943 (218: row:480.480 avs:8)    → 90-10 block split
leaf: 0x1c016a7 29365927 (219: row:480.480 avs:9)    → 90-10 block split
leaf: 0x1c016b3 29365939 (220: row:245.245 avs:3532) → result of 50-50 block split
leaf: 0x1c016bd 29365949 (221: row:480.480 avs:9)    → 90-10 block split
leaf: 0x1c016f2 29366002 (222: row:480.480 avs:9)    → 90-10 block split
leaf: 0x1c01697 29365911 (223: row:480.480 avs:9)    → 90-10 block split
leaf: 0x1c016fe 29366014 (224: row:242.242 avs:3576) → result of 50-50 block split
leaf: 0x1c016de 29365982 (225: row:239.239 avs:3622) → result of 50-50 block split
leaf: 0x1c016d8 29365976 (226: row:480.480 avs:9)    → 90-10 block split
leaf: 0x1c016f6 29366006 (227: row:480.480 avs:8)    → 90-10 block split
When multiple sessions insert data into the table concurrently, database
sequences perform better compared to application sequences in terms of 90-10
index leaf block splits. There will be fewer 50-50 block splits. For single
column indexes (primary or unique) or when the leading column value of an index
is populated using monotonically increasing values, always prefer database
sequences as there is less wastage.
The above discussion of 90-10 and 50-50 index leaf block splits holds true only
for:
- single-column indexes, or
- concatenated indexes whose leading column is populated using monotonically
increasing values,
and falls apart when the column populated by the sequence or monotonically
increasing value is not the leading column, in which case the index leaf block
splits are mostly 50-50. Regardless of whether we use database sequences or
application sequences, when the data is loaded using bulk or batch inserts (for
example, via the JDBC PreparedStatement batching API), there will be more 50-50
index leaf block splits, and the extent to which they occur is directly linked
to the following:
- Batch size used
- Number of sessions inserting the data
- Index key size and database/tablespace block size
The data type of the column also plays a major role in the index bloat factor.
An indexed column of a numeric data type, populated with monotonically
increasing values from either a database sequence or an application sequence,
will see more 90-10 splits than the same column defined as VARCHAR2 or CHAR.
Using an incorrect data type can increase the bloat in the index.
When the column value is mostly unique but not monotonically increasing, and it
is randomly populated:
In this case, there will be wasted space. But since the data is mostly unique,
rebuilding or shrinking may do more damage than good, as the index leaf blocks
may get split again using the 50-50 method, generating redo, unless the PCTFREE
defined can accommodate the incoming data. As the data is mostly unique, any
potential gains achieved through rebuilding, shrinking, or coalescing will be
minimal, since the number of index blocks read to locate a value will not
drastically improve; hence there are few potential gains for SELECT statements.
Examples of such columns include message digests or message hashes.
When the column value is not unique and not monotonically increasing, a key
value will usually have multiple ROWIDs. By default, when there are multiple
ROWIDs for an indexed key value, Oracle sorts these ROWIDs in the index leaf
blocks. As an index leaf block becomes full, the blocks are usually split using
the 50-50 approach. This plays out in two different scenarios, especially when
the extent sizes differ and the tablespace has data files with enough free
space. On an ASSM tablespace, bitmaps are used to track free space and blocks
are formatted in ranges at effectively random positions within an extent;
because of ASSM's inherent behavior of spreading the load across multiple
blocks, especially when multiple sessions insert into the table, there is a
high chance that the index leaf blocks will be properly utilized following a
50-50 block split, and the utilization rate increases further when the
tablespace has multiple data files with free space across which extents are
allocated.
- When using uniform extent allocation with smaller extent sizes and a single
data file added to the tablespace when a space crunch occurs, the likelihood of
space wastage is somewhat higher as the ROWIDs increase.
- When using autoallocate extent allocation, the extent sizes tend to become
large, and with the default ASSM behavior mentioned above the likelihood of
index block wastage is very low, since multiple sessions end up inserting table
data into different blocks at different positions; with multiple data files in
the tablespace with enough free space, the index leaf blocks often tend to be
better utilized.
The above-mentioned behavior is rare, but it helps us understand Oracle
properly in its entirety.
Update an Index entry:
There is no such concept as updating a key value when it comes to indexes. When
an update to a row results in an indexed column value change, the old index
entry is marked as deleted and a new entry is inserted into the appropriate
index leaf block (at the correct location in the B-Tree).
Deleting an Index entry:
When an index key value is deleted, the corresponding entries in the leaf
blocks are marked as deleted. Remember that even if an index leaf block
contains only deleted rows, it will still be scanned during an index range
scan; in other words, the leaf block is not delinked from the doubly linked
list of leaf blocks.
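The proportion of deleted entries still lingering in an index can be checked via INDEX_STATS. A sketch, again using the index name from the earlier example:

```sql
ANALYZE INDEX students_deptid_idx VALIDATE STRUCTURE;

-- DEL_LF_ROWS counts entries still physically present but marked
-- deleted; a persistently high percentage suggests wasted space.
SELECT lf_rows, del_lf_rows,
       round(del_lf_rows / lf_rows * 100, 2) AS pct_deleted
FROM   index_stats;
```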
Generic DML behavior:
Always create your unique indexes, or the indexes that support constraints,
first, and make sure that the OBJECT_ID of these indexes is always lower than
that of the other indexes on the same table.
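The creation order can be verified from the dictionary. A sketch for the STUDENTS table in the VISHNU schema used throughout this chapter:

```sql
-- List the indexes on a table in OBJECT_ID order; the unique/primary
-- key index should ideally appear first.
SELECT o.object_name, o.object_id, i.uniqueness
FROM   dba_objects o
JOIN   dba_indexes i
       ON i.index_name = o.object_name AND i.owner = o.owner
WHERE  i.table_name = 'STUDENTS'
AND    o.owner = 'VISHNU'
ORDER  BY o.object_id;
```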
Insert behavior:
Oracle uses the following approach while inserting a row into a table:
1. Read the segment header (the L3 BMB, the 3rd block of the 1st extent).
2. Read the L2 BMB and determine which L1 BMB contains details regarding free
space.
3. Read the L1 BMB and find a block with enough space.
   a. If the L1 BMB indicates that all formatted blocks are full and there are
   unformatted blocks, format a range of data blocks (making the session wait
   on enq: FB - contention).
   b. If there is enough space, insert the row into the table block.
4. If there are indexes on the base table:
   a. Start inserting entries into the indexes in ascending OBJECT_ID order; in
   other words, if there are 4 indexes, start with the index with the lowest
   OBJECT_ID and proceed from there.
5. If the inserted value results in a constraint violation, roll back the
changes to all the blocks modified earlier.
An interesting behavior here is that if the OBJECT_ID of the index policing the
unique constraint is higher than that of the other indexes on the table, the
constraint check is delayed, which can cause higher redo and undo generation in
the event of a rollback or constraint violation.
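The redo and undo overhead can be measured from the current session's statistics before and after the failing statement. A sketch (these are standard V$STATNAME statistic names):

```sql
-- Snapshot the session-level statistics; run this before and after the
-- statement under test and compare the deltas.
SELECT st.name, ms.value
FROM   v$mystat   ms
JOIN   v$statname st ON st.statistic# = ms.statistic#
WHERE  st.name IN ('redo size',
                   'undo change vector size',
                   'rollback changes - undo records applied');
```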
For example, consider the following statements executed in the order listed
below:
create table students(roll number, name varchar2(20), dept_id number, sub_id number, mark1 number, mark2 number, mark3 number);
create index mark1_idx on students(mark1);
create index mark2_idx on students(mark2);
create index mark3_idx on students(mark3);
alter table students add constraint students_pk primary key (roll);
insert into students select rownum, dbms_random.string(0,20), round(dbms_random.value()*100), round(dbms_random.value()*100), round(dbms_random.value()*100), round(dbms_random.value()*100), round(dbms_random.value()*100) from dual connect by level < 1000;
commit;
Querying the DBA_OBJECTS results in the following:
The OBJECT_ID of the STUDENTS_PK index is higher because it was created after
the other indexes. Let's look at the implications of this: let's insert a row
into the table that causes a unique key constraint violation and observe the
amount of redo generated.
SQL> insert into students values (10, 'VISHNU', 123, 123, 123, 123, 123);
insert into students values (10, 'VISHNU', 123, 123, 123, 123, 123)
*
ERROR at line 1:
ORA-00001: unique constraint (VISHNU.STUDENTS_PK) violated
The 10046 level 12 trace of the above statement is as follows:
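To capture such a trace yourself, one common approach is the 10046 event at level 12 (binds and waits). A sketch:

```sql
-- Enable extended SQL trace for the current session:
ALTER SESSION SET EVENTS '10046 trace name context forever, level 12';

-- ... run the statement being investigated ...

-- Switch tracing off again:
ALTER SESSION SET EVENTS '10046 trace name context off';

-- Locate the trace file written for this session:
SELECT value FROM v$diag_info WHERE name = 'Default Trace File';
```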
We can clearly see that Oracle starts reading and modifying the index entries
in the order specified by OBJECT_ID. The corresponding redo generation and
rollback changes applied reveal the details.
Now let’s create the primary key immediately following the table
creation and see the benefits:
create table students(roll number, name varchar2(20), dept_id number, sub_id number, mark1 number, mark2 number, mark3 number);
alter table students add constraint students_pk primary key (roll);
create index mark1_idx on students(mark1);
create index mark2_idx on students(mark2);
create index mark3_idx on students(mark3);
insert into students select rownum, dbms_random.string(0,20), round(dbms_random.value()*100), round(dbms_random.value()*100), round(dbms_random.value()*100), round(dbms_random.value()*100), round(dbms_random.value()*100) from dual connect by level < 1000;
commit;
Querying the DBA_OBJECTS results in the following:
Specifying the PRIMARY KEY or UNIQUE keywords in the CREATE TABLE DDL results
in the unique indexes being created immediately after the table, and therefore
in consecutive OBJECT_IDs for those indexes.
The corresponding 10046 level 12 trace is as follows:
From the above we can clearly see that execution halted immediately after the
unique key constraint violation. The redo generation and rollback statistics
are as follows:
The redo generation is significantly lower, and only one undo record had to be
applied, which is basically the rollback of the change applied to the table
block.
Update Behavior:
An update, in terms of an index, is basically marking the existing entry as
deleted and inserting the new entry (based on its value and ROWID) in the
corresponding position in the B-Tree structure. Let's trace the flow using an
update statement. Consider the following:
create table students(roll number, name varchar2(20), mark1 number, mark2 number);
create index mark1_idx on students(mark1);
alter table students add constraint students_pk primary key (roll);
insert into students select rownum, dbms_random.string(0,20), round(dbms_random.value()*100), round(dbms_random.value()*100) from dual connect by level < 1000;
commit;
The OBJECT_IDs are as follows:
Let's see the behavior of the following SQL statement:
update students set mark1 = 100, roll = 999 where roll = 1;
(The above statement may look strange, but it is required to understand how an
update works and how the logic flows. This statement does result in a unique
constraint violation.)
Execution plan is as follows:
10046 level 12 trace is as follows:
The entries related to the unique index are marked in red. The flow is as
follows:
- Read the index root and leaf block to find the corresponding ROWID.
- Once the ROWID is identified, read the corresponding table block and modify
the contents.
- Proceed to update the indexes in OBJECT_ID order.
- Read the MARK1_IDX root block and locate the old value that has to be
updated.
- Mark the old value as deleted in the MARK1_IDX leaf block.
- Read the MARK1_IDX leaf block (new location) where the new entry has to be
inserted.
- Upon reaching the unique index (which has the higher OBJECT_ID), the block
that stores the old value roll = 1 is already in memory, so mark the entry as
deleted.
- Read the block (new location) where the new entry has to be inserted; upon
finding an existing entry, initiate a rollback.
The status of the blocks post rollback is as follows:
The snippet of the redo log dump indicating the changes is as follows:
CHANGE #1 CON_ID:0 TYP:0 CLS:1 AFN:7 DBA:0x01c0015c OBJ:76207 SCN:0x00000000008a80a1 SEQ:2 OP:11.5 ENC:0 RBL:0 FLG:0x0000
... update values in the table block
col 0: [ 3] c2 0a 47
col 2: [ 2] c2 02
CHANGE #2 CON_ID:0 TYP:0 CLS:57 AFN:4 DBA:0x01000228 OBJ:4294967295 SCN:0x00000000008a8055 SEQ:1 OP:5.2 ENC:0 RBL:0 FLG:0x0000
ktudh redo: slt: 0x0000 sqn: 0x000008ad flg: 0x0012 siz: 144 fbi: 0
uba: 0x0106c69f.00a8.07 pxid: 0x0000.000.00000000
CHANGE #3 CON_ID:0 TYP:0 CLS:1 AFN:7 DBA:0x01c0016e OBJ:76208 SCN:0x00000000008a80a1 SEQ:2 OP:10.4 ENC:0 RBL:0 FLG:0x0000
index redo (kdxlde): delete leaf row  // MARK1_IDX
CHANGE #4 CON_ID:0 TYP:0 CLS:1 AFN:7 DBA:0x01c0016f OBJ:76208 SCN:0x00000000008a80a1 SEQ:2 OP:10.2 ENC:0 RBL:0 FLG:0x0000
index redo (kdxlin): insert leaf row  // MARK1_IDX
CHANGE #5 CON_ID:0 TYP:0 CLS:1 AFN:7 DBA:0x01c00166 OBJ:76209 SCN:0x00000000008a80a1 SEQ:2 OP:10.4 ENC:0 RBL:0 FLG:0x0000
index redo (kdxlde): delete leaf row  // unique index
CHANGE #6 CON_ID:0 TYP:0 CLS:58 AFN:4 DBA:0x0106c69f OBJ:4294967295 SCN:0x00000000008a8054 SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
rollback initiated...
Now that you have seen how inserts and updates are performed and how Oracle uses the OBJECT_ID order when modifying the indexes, always make sure that primary key/unique indexes are created immediately after table creation. Also remember that SQL trace never shows details about blocks that are already in the buffer cache; this abstracts away how many times a buffer is accessed over the course of SQL execution.
Now let's take this a step further and see the impact of this when it comes to foreign keys.
Insert Behavior – Constraints
Foreign key:
Foreign key constraint checks are performed after the insert or update statement finishes modifying the table and its dependent indexes. Consider the following:
create table parent (roll number);
alter table parent add constraint parent_pk primary key(roll);
insert into parent select rownum - 1 from dual connect by level < 102;
commit;
create table child (child_id number, roll number, mark1 number, mark2 number);
create index child_mark1_idx on child(mark1);
create index child_mark2_idx on child(mark2);
alter table child add constraint child_pk primary key (child_id);
create index child_fk on child(roll);
alter table child add constraint child_fk foreign key (roll) references parent(roll);
insert into child
select rownum,
       round(dbms_random.value()*100),
       round(dbms_random.value()*100),
       round(dbms_random.value()*100)
from dual connect by level < 20000;
commit;
The OBJECT_IDs for the tables are as follows:
Inserting into the Parent Table:
insert into parent values (1000);
commit;
Inserting into the parent table doesn't require any foreign key checks for data integrity, as is evident from the snippet of the 10046 level 12 trace for the above statement.
Inserting into the child table:
Inserting into the child table triggers a foreign key constraint check (after modifying - inserting into - the child table and its dependent index blocks) to validate that the parent table has the corresponding value present; if not, an exception is thrown. Let's look at the following example:
SQL> insert into child values (1231231, 1231, 123, 123);
insert into child values (1231231, 1231, 123, 123)
*
ERROR at line 1:
ORA-02291: integrity constraint (VISHNU.CHILD_FK) violated - parent key not found
A snippet of the 10046 level 12 trace is as follows:
As we can see from the above, execution halts immediately after the foreign key constraint check fails.
Deleting a row:
Parent Table:
Unless the foreign key constraint is created with the ON DELETE CASCADE clause, Oracle will not allow a row to be deleted from the parent table if corresponding rows are present in the child table.
Child Table:
Deleting a row from the child table is similar to deleting a row from a normal table. The procedure starts by finding the corresponding ROWID, deleting the row from the table, and then marking the index entries in the corresponding indexes as deleted in OBJECT_ID order.
Updating a row:
Parent Table:
Oracle will not allow us to update the referenced key column value if there are any matching values present in the child table (ON UPDATE CASCADE is not supported).
Child Table:
Similar to inserts, foreign key constraint checks are performed after the modifications to the current table and its dependent indexes are done.
The takeaway from this: always index the foreign key columns on the child and parent tables, as missing indexes can significantly slow down the performance of insert/update statements.
Sequence of Steps Involved in an Index Range/Unique Scan:
During an index range scan or a unique scan, based on the key value specified in the predicate, the server process retrieves the ROWID from the index and, based on that ROWID, accesses the corresponding table blocks. The process flow is as follows:
- Read the branch root block and determine the branch block.
- Read the branch block and determine the right leaf block.
- Read the ROWID from the leaf block.
- Read the table data block based on the ROWID.
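The four steps above can be sketched as a toy traversal. This is a simplified model, not Oracle's actual block format: the block names, keys, and ROWID strings below are all hypothetical, and real branch blocks hold hundreds of entries.

```python
# Toy model of an index lookup: root -> branch -> leaf -> table block.
# Branch levels map "highest key served" to a child block name.
root = {50: "branch1", 100: "branch2"}
branches = {
    "branch1": {25: "leaf1", 50: "leaf2"},
    "branch2": {75: "leaf3", 100: "leaf4"},
}
# Leaf blocks map key -> ROWID; the table maps ROWID -> row.
leaves = {
    "leaf1": {10: "AAAR5w"},
    "leaf2": {42: "AAAR5x"},
    "leaf3": {60: "AAAR5y"},
    "leaf4": {99: "AAAR5z"},
}
table = {"AAAR5x": ("Alice", 42), "AAAR5y": ("Bob", 60)}

def index_lookup(key):
    # Step 1: read the branch root block and determine the branch block.
    branch = next(v for k, v in sorted(root.items()) if key <= k)
    # Step 2: read the branch block and determine the right leaf block.
    leaf = next(v for k, v in sorted(branches[branch].items()) if key <= k)
    # Step 3: read the ROWID from the leaf block.
    rowid = leaves[leaf].get(key)
    # Step 4: read the table data block based on the ROWID.
    return table.get(rowid) if rowid else None

print(index_lookup(42))
```

Note that the number of block visits is fixed by the height of the tree (here: root, branch, leaf, table), which is why an index lookup stays cheap regardless of table size.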
Identifying bloat in an Index:
We know that in a tablespace created with segment space management auto (ASSM), Oracle uses bitmaps to track the space utilization of blocks in a segment. Oracle uses different bitmap blocks (L1BMB, L2BMB & L3BMB) to track and manage the status of blocks in an extent, and a block can be in any of the following states:
- FREE
- FULL
- FS1 (0 – 25% free)
- FS2 (25 – 50% free)
- FS3 (50 – 75% free)
- FS4 (75 – 100% free)
- UNFORMATTED
When it comes to indexes, an index block can have only 2 states:
- FS2 – has free space and can be reused.
- FULL – even if it has only a few entries.
We can find the bloat in an index using any of the following:
- Segment advisor
- Analyze index … validate structure (INDEX_STATS.PCT_USED)
- Treedump
Using the segment advisor is by far the best and recommended approach to identify bloat in an index. The ANALYZE INDEX … VALIDATE STRUCTURE command can also be used against an index to populate various details, but this operation locks the table, preventing any DML on it. A treedump can also be used, but calculating the bloat with this approach is not tidy and is not recommended.
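As a rough model of what ANALYZE ... VALIDATE STRUCTURE reports: INDEX_STATS.PCT_USED is the percentage of the allocated B-tree space (BTREE_SPACE) actually occupied by live index data (USED_SPACE). The byte figures below are made up for illustration.

```python
def pct_used(used_space, btree_space):
    # INDEX_STATS.PCT_USED ~ USED_SPACE / BTREE_SPACE * 100, rounded.
    return round(100.0 * used_space / btree_space)

# A freshly rebuilt index typically sits near 90% (pctfree 10);
# a value drifting far lower suggests bloat worth reclaiming.
print(pct_used(used_space=4_702_113, btree_space=5_213_616))  # healthy
print(pct_used(used_space=1_980_000, btree_space=5_213_616))  # bloated
```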
Impact of additional Indexes on DML
Although an index helps speed up queries significantly, there is a performance penalty in maintaining these indexes, and it is clearly visible when performing DML on the table. Since indexes store the column value and its associated ROWID, Oracle has to maintain each index on the table: when rows are inserted, updated, or deleted, all the corresponding indexes need to be updated, which not only slows down performance but also increases undo and redo generation. The following chart gives insight into the increase in the time taken to load 100k rows with each additional index on the table.
By now we know that B-Tree indexes are implemented as balanced trees. Before proceeding further, let's understand the concepts behind the tree data structure, as they are important for what follows and for how this impacts database performance. A tree data structure requires that key-value pairs be unique; no two key-value pairs can be the same. Similarly, in Oracle, regardless of whether the index is unique or non-unique, the entries in a leaf block are always unique; whether the column value and ROWID are stored together as one key or separately depends on whether the index is unique or non-unique. The structure of a unique and a non-unique index is essentially the same, but the difference is evident in how the data is stored in the leaf blocks.
Unique Index on STUDENTS.ROLL_NO column leaf block dump.
Normal B-Tree index on STUDENTS.ROLL_NO column leaf block dump:
Notice that kdxconco, which indicates the number of key columns, is 1 for the unique index, but 2 for a non-unique index on the same column. This is because for a unique index the ROWID is not considered part of the key, whereas for a non-unique index the ROWID is part of the key. Due to this behavior the size of a unique index is always smaller than that of a non-unique index. We can also see in the leaf block dumps that kdxconro, which indicates the number of entries in the leaf block, is higher for the unique index than for the non-unique index. In other words, due to this behavior a unique index leaf block can accommodate more rows than a non-unique one.
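The capacity difference can be sketched with a toy byte-count model. All the per-entry byte costs below are illustrative stand-ins, not Oracle's real row-header layout; the point is only that the extra length byte for the ROWID key column makes each non-unique entry slightly larger.

```python
# Toy model of leaf-block capacity (made-up byte costs, not Oracle's
# actual leaf format).
BLOCK = 8000                       # usable bytes in an 8K leaf block
KEY, ROWID, LENBYTE, HEADER = 4, 6, 1, 2

# Unique index: ROWID stored as data, not as a key column (kdxconco = 1),
# so it carries no extra length byte.
unique_entry = HEADER + (LENBYTE + KEY) + ROWID
# Non-unique index: ROWID is an additional key column (kdxconco = 2)
# with its own length byte.
nonunique_entry = HEADER + (LENBYTE + KEY) + (LENBYTE + ROWID)

print(unique_entry, nonunique_entry)                     # 13 14
print(BLOCK // unique_entry, BLOCK // nonunique_entry)   # 615 571
```

Even a single extra byte per entry compounds into dozens fewer entries per leaf block, and therefore a larger index overall.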
B-Tree indexes and Constraints:
Oracle enforces unique and primary key constraints using B-Tree indexes. A B-Tree index can be either unique or non-unique. Unique indexes guarantee that no two column values in the index are the same. When we add a unique or primary key constraint to a table, Oracle automatically creates a unique B-Tree index. This has rather interesting implications: when we drop or disable the constraint, Oracle automatically drops the underlying index. To prevent this from happening, use the KEEP INDEX clause when disabling or dropping the constraint. Another important aspect is that when we create a unique index, even though we are indirectly enforcing a unique constraint, it doesn't show up in the DBA/USER/ALL_CONSTRAINTS views unless we manually add a constraint using the ALTER TABLE command. A DBA should be aware of the following behavior when it comes to constraints and unique indexes:
- If a suitable B-Tree index doesn't already exist, Oracle creates a unique index automatically when we add a unique or primary key constraint.
- If the columns are already indexed using either a unique or non-unique B-Tree index, Oracle simply uses that index to enforce the primary or unique key constraint.
When it comes to performance, enforcing unique and/or primary key constraints using non-unique indexes presents different challenges:
- Increased buffer gets / consistent gets / latch activity for each execution.
  o Index range scan vs index unique scan.
- Increased redo / undo generation while performing rollback.
Increased Logical IO (LIO)/Latch activity:
When unique or primary key constraints are enforced using a non-unique index, a predicate with an equality condition always results in an index range scan instead of an index unique scan to retrieve the table contents, as shown below:
Index Range Scan:
select * from temp where roll = 123;
Index Unique Scan:
select * from temp where roll = 123;
With a unique index, Oracle is certain that a key value returns at most one row. Due to this it examines the buffer under a single CBC (cache buffers chains) latch get, but with a non-unique index Oracle performs the full two latch gets per buffer for the index leaf block and the data block, resulting in an increase of about 50% in latch activity. In both cases the number of buffer gets doesn't change. Depending on the height of the index, a unique/range scan that returns a single row has the following path:
- Index branch root block
- Index branch block
- Index leaf block
- Table block
When queries have predicates such as <, <=, >, >=, OR, BETWEEN, IN, etc., the latch activity is the same for both unique and non-unique indexes. This behavior has a noticeable effect only when the application issues a lot of queries with equality predicates on primary key or unique columns while those constraints are protected or policed by non-unique indexes.
Increased Redo / Undo Generation during rollback:
When a non-unique index enforces the primary key or unique key constraint, Oracle straightaway modifies the index leaf blocks during an insert; only after modifying the contents is the constraint check performed, and Oracle then raises an ORA-00001: unique constraint violated error. This results in both undo and redo generation. This is not the case when a unique index enforces the constraint: Oracle validates uniqueness before modifying the contents of the index, which results in less redo and undo generation.
This behavior is the same for update statements that update a unique key or primary key column, such as the following:
update temp set roll = 10912 where roll = 1; -- roll column is protected by a unique index
In this case, the end result should be that the entry in the index leaf block containing the value 1 is deleted, and a new entry with the value 10912 is inserted into a different leaf block. With a unique index, Oracle straightaway deletes the entry in the leaf block containing the value 1, and before inserting the value 10912 it checks whether the value already exists; if it does, Oracle throws an ORA-00001 error before modifying the second block. The end result is that only one leaf block gets modified and rolled back. Redo generation in this case is roughly halved by using a unique index rather than a non-unique index to enforce the unique or primary key constraint.
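A back-of-the-envelope sketch of that difference, counting only the leaf-block changes that must be undone (not actual redo byte sizes, which vary):

```python
# update temp set roll = 10912 where roll = 1, when 10912 already exists.

# Non-unique index: the delete AND the insert both land in leaf blocks
# before the constraint check fires -> two changes to roll back.
nonunique_changes = ["delete leaf entry (roll=1)",
                     "insert leaf entry (roll=10912)"]   # then ORA-00001

# Unique index: the existence check runs before the insert touches a
# block -> only the delete has to be rolled back.
unique_changes = ["delete leaf entry (roll=1)"]          # then ORA-00001

print(len(nonunique_changes), len(unique_changes))
```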
Reverse Key indexes:
Contention on the rightmost index leaf block is clearly visible when data is inserted into a table by multiple sessions using either a database or an application sequence to populate columns such as a primary or unique key. This problem is more severe in RAC environments, where application sessions connected to different instances insert data concurrently and the instances spend most of their time transferring the blocks between buffer caches via the interconnect. The severity increases with the number of nodes in the cluster and when the sessions are evenly distributed among the instances. In these cases, implementing a reverse key index can help alleviate the contention on index leaf blocks. We can consider reverse key indexes even in single-instance environments if we notice high contention (buffer busy waits, enq: TX - index contention) on index leaf blocks.
A reverse key index is a type of B-Tree index that physically reverses the bytes of each index key while keeping the column order and preserving the ROWID. For example, if the index key is 123456 and the bytes stored for this key in hexadecimal are 15 2c 46 20 c4 in a standard B-Tree index, then the reverse key index stores the bytes as c4 20 46 2c 15. Consecutive key values such as 123456 and 123457 therefore end up with a huge gap between them instead of 1, and are likely to be stored in index leaf blocks far from each other. Thus, the IO is spread uniformly across the index leaf blocks.
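The effect is easy to simulate. The sketch below uses plain 4-byte big-endian integers rather than Oracle's internal NUMBER encoding (so the hex differs from the dump above), but it shows both properties: consecutive keys scatter once their bytes are reversed, and the logical sort order is lost.

```python
def reverse_key(n: int) -> bytes:
    # Reverse the stored bytes of the key, as a reverse key index does.
    # Illustration only: 4-byte big-endian ints, not Oracle NUMBER format.
    return n.to_bytes(4, "big")[::-1]

for k in (255, 256, 257):
    print(k, reverse_key(k).hex())
# 255 -> ff000000, 256 -> 00010000, 257 -> 01010000: neighbors scatter.

# Byte reversal does not preserve logical key order, which is why range
# predicates (<, >, between) cannot walk adjacent leaf blocks:
print(sorted([255, 256, 257], key=reverse_key))
```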
A reverse key index can be created by just adding the REVERSE clause to the CREATE INDEX DDL statement; an example is as follows:
CREATE INDEX temp_idx ON temp(roll) REVERSE;
Since the logical key order is not preserved across the leaf blocks of a reverse key index, the MIN or MAX functions result in an INDEX FAST FULL SCAN rather than an inexpensive MIN/MAX scan.
Inherent to the way reverse key indexes are implemented, they have the following disadvantages, and we will look at each one of them:
- Inability to perform index range scans.
- Inability to make use of asynchronous IO.
  o Index prefetching.
  o Batched table access.
- Increased IOPS.
- Increased number of buffers in the buffer cache.
- Increased clustering factor.
- More fragmentation in the index.
  o 50-50 block splits.
- Batching in PreparedStatements (application level): bind variables via JDBC remain necessary, but do not batch transactions, especially inserts!
A big disadvantage of reverse key indexes is their inability to perform index range scans. By default, Oracle performs an index range scan on a non-unique index even when the predicate contains an = condition, as a key value in a non-unique index can return multiple ROWIDs. Reverse key non-unique indexes do support index range scans in a limited way when the predicate contains conditions such as IN, OR, or =, but range-based predicates such as <, <=, >, >=, and BETWEEN are not supported: adjacent leaf blocks may contain key values unrelated to the range specified in the predicate, so the session always ends up performing either a full table scan or a full index scan. Remember that index full scans and index fast full scans are still supported.
When a key value retrieves multiple rows from the table, Oracle does a pretty good job of utilizing features such as prefetching and batching, in which a session batches multiple IO requests to read the table blocks into one io_submit call, provided that asynchronous IO is available, so that all the blocks are read in parallel instead of serially. This improves performance considerably, provided the IO subsystem has enough IOPS to spare and the disk layout is proper. Implementing reverse key indexes in cases where most of the SQL statements query this table alone (single table access, no joins) with an equality predicate will not have much performance impact. Summarizing: if the queries access this table alone, we may not see any performance effect.
When the indexed key column has a high number of distinct values, or the columns are protected by primary key or unique key constraints, and queries use this indexed column in a join condition, Oracle's ability to effectively make use of prefetching, batching, and asynchronous IO diminishes, regardless of whether the table is the driving or driven table. Oracle resorts to the old behavior of reading one entry from the index and then reading the corresponding table block. This increases the IOPS on the server considerably if the data blocks are not already cached. If the query executes frequently and the blocks are already in the buffer cache, we might not see a noticeable performance impact unless blocks have to be read from disk.
When the table's average row length is very small and the table is extremely large, there is a high chance that the dependent structures such as indexes will also be large. Because a reverse key index distributes the keys (not necessarily evenly) among the available leaf blocks, even an application that works only with the latest data touches leaf blocks across the whole index. Converting the existing index into a reverse key index, or rebuilding it as one, will eventually bring more blocks into the buffer cache as the data is spread across several leaf blocks, and the buffer cache size requirement will increase considerably depending on the size of the table and the amount of activity against it. Remember, all it takes is an additional 100MB of buffer cache requirement to hold the active working set of the segments to cause a cascading effect, resulting in waits such as free buffer waits, write complete waits, buffer exterminate, db file sequential read, etc. Before implementing a reverse key index, always perform the following checks:
- The number of index buffers in the buffer cache and the ratio of clean/XCUR vs CR buffers.
- How busy is the buffer cache?
- Is the buffer cache properly sized?
Since the byte order is reversed, subsequent insert statements update index leaf blocks at various locations, so the block splits are 50-50. Due to this, the space requirement for a reverse key index is usually larger than for a regular index, and the ability of a leaf block to hold more key values is reduced. Periodically monitor the size of the index and see whether rebuilding can help reduce the bloat.
There is a general recommendation floating around that says: when you see high waits on log file sync due to a huge number of commits, advise the application team to batch the transactions. This can prove deadly when inserting into tables that have reverse key indexes, and the impact increases with the batch size. As sessions batch insert statements, the likelihood of a session hitting the same set of index blocks increases considerably, and since the block splits are 50-50, more time is wasted moving entries between leaf blocks.
The clustering factor of a reverse key index is inherently higher than that of a regular index and will usually be very close to the number of rows in the table, so consider this when implementing reverse key indexes. The higher clustering factor may not have a noticeable impact compared with a regular index when the table's average row length is high, but it will have a significant impact when the average row length is small, as the index is then large relative to the table.
Invisible Indexes:
Starting with 11g Release 1, an index can be marked as visible or invisible. By default, the optimizer ignores invisible indexes unless we explicitly set the OPTIMIZER_USE_INVISIBLE_INDEXES initialization parameter to TRUE at either the session or system level. Unlike an unusable index, an invisible index is still maintained by Oracle during DML; the visibility status is designed exclusively to hide certain indexes from the optimizer during cost estimation. Both visible and invisible indexes are maintained by Oracle in the same fashion. We can even make indexes that support integrity or referential integrity constraints invisible, but Oracle internally still maintains them unless we make them unusable. Although we can make a partitioned index invisible, we cannot make an individual index partition invisible while leaving the other partitions visible. The invisible index feature is primarily used for the following:
- Testing the removal of an index before dropping it.
- Testing new indexes without affecting the overall application.
Making an index invisible to the optimizer is less invasive than making it unusable or dropping it. Once an index is made unusable, the only way to make the optimizer use it again is to rebuild it, which can take a huge amount of time depending on the size of the index. Another important use case is testing the performance of a new index without affecting the existing application queries. The introduction of a new index may cause existing plans to change, for better or for worse; in these cases we can create the index as invisible so that the optimizer is not aware of it, test the performance of the queries in a session with OPTIMIZER_USE_INVISIBLE_INDEXES set to TRUE, and once the performance is known to have improved or the necessary benefits are attained, make the index visible.
Making an index visible or invisible invalidates all the dependent cursors on that table, and these cursors have to be reparsed, resulting in new child cursors. If, immediately after the visibility change, there are a large number of such cursors in the shared pool, or a huge number of sessions concurrently executing SQL against the table, then expect to see library cache contention: sessions may end up waiting on events such as cursor: pin S or library cache: mutex X. The extent of the impact is directly proportional to the number of cursors and the total number of sessions. The reason a cursor was invalidated is recorded in V$SQL_SHARED_CURSOR.REASON. The following is an example:
<ChildNode>
<ChildNumber>0</ChildNumber>
<ID>41</ID>
<reason>Marked
for Purge(5)</reason>
<size>1x1</size>
<unsafe_ddl_code>11</unsafe_ddl_code>
</ChildNode>
Composite Indexes:
Creating an index on multiple columns results in a composite or multicolumn index. We can create either a B-Tree or a bitmap composite index. Consider the following table:
create table temp(A number, B number, C number, D number);
and we create a composite index on the table as follows:
create index comp_idx on temp(A, B, C);
The optimizer considers this index in its cost estimates for all the following sets of predicates:
Index Range Scans:
where a=? and b=? and c=?
where a=? and b=?
where a=? and c=?
where a=?
where (a between ? and ?) or (a > ?) or (a < ?) or (a in (...)) or (a = ? or a = ?)
etc.
Index Skip Scans:
where b=?
where c=? -- depending on the NDV of columns a and b
where b=? and c=?
Oracle can still utilize the index when the predicates contain an OR condition, but only when the predicates contain the leading column of the concatenated index, as follows; this is due to the restriction imposed by index skip scans, which we will look into later.
where a=? or b=?
In this case the plan will appear as follows:
In older versions (pre-11g), the optimizer considered a concatenated index only when the leading column or columns of the index were specified in the predicate clause. Starting with 11g, Oracle can consider the index for skip scans as well. Column order in a composite index doesn't matter when all the columns of the index are specified in the query: the number of lookups or reads required to execute the query doesn't change regardless of the column order. But there is an important catch here: the size of a composite index increases when the number of distinct values in the leading column or columns is high. It is always better to have the most frequently used column or columns as the leading columns of a concatenated index. A good starting point is SYS.COL_USAGE$, which gives a high-level overview of the types of predicates used against each column and the number of times they were used.
Consider a situation where both columns A and B are specified in the predicate clause as follows:
where a=? and b=?
Column A has only 100 distinct values and column B has only 90 distinct values, and the column values are uniformly or randomly distributed. The optimizer may not pick an index on either column A or column B alone, because it would have to read too many table blocks. But if there exists a relationship between columns A and B, or the combined cardinality of the two columns is really high, the optimizer is likely to consider an index created on (A, B), thereby reducing the number of lookups required. For example, there are 50 states and 3242 counties; with a table of over 1 billion rows, querying the table on an indexed state or county column alone will always result in a full table scan, as the number of rows retrieved through the index lookup is very high. But we know that a county lies in only one state, so there exists a clear relationship: creating an index on (state, county) results in the optimizer selecting an index scan over a costly full table scan, as the combined cardinality of the state and county columns can be as high as 50 * 3242 = 162100, and listing the employees in a state department using an index on the county and state columns will result in an index scan. Similarly, if the cardinality of column A is 200 and that of column B is 200, the combined cardinality of the two columns is 40000; in such cases, when queries use both columns in their predicates, creating an index on both columns is very beneficial. Therefore, by creating a multicolumn or concatenated index on multiple low-cardinality columns, we improve the odds of the optimizer choosing an index scan over a costly full table scan.
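The selectivity arithmetic behind this can be sketched as follows (the row count comes from the example above; the uniform-distribution assumption is the optimizer's default in the absence of histograms):

```python
rows = 1_000_000_000  # table size from the states/counties example

# Expected rows per equality predicate, assuming uniform distribution
# (selectivity = 1/NDV):
per_state  = rows / 50      # 20,000,000 rows -> full table scan wins
per_county = rows / 3242    # ~308,452 rows   -> still far too many

# Treating (state, county) as independent gives the product of the NDVs:
combined_ndv = 50 * 3242
print(combined_ndv)                 # 162100
print(round(rows / combined_ndv))   # ~6169 rows -> index scan wins

# Likewise for two independent 200-value columns:
print(200 * 200)                    # 40000
```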
There are several aspects that we have to understand regarding multicolumn indexes. For example, let's take a students table which records the marks a student gained in a particular subject, and load it with 1000 rows:
create table students (student_id number, entry_id number, dept_id number, subject_id number, mark1 number);
create index students_idx on students(subject_id, mark1, student_id);
Depending on the distribution of the data and the NDV of the leading columns in a multicolumn index, Oracle can use either the leading column alone or several columns in the index branch blocks to uniquely identify the leaf block containing the corresponding index entry. As rows are inserted into the table, if Oracle determines that the NDV of the leading column is not enough to uniquely identify the leaf block, it includes the second column of the multicolumn index in the branch block entries, and this process can continue until all the columns are included in the branch blocks. So, don't be surprised to find branch blocks containing entries with only a leading column, or more. Let's demonstrate this.
We load 1000 rows into the above table as follows:
insert into students (student_id, dept_id, subject_id, mark1)
select rownum, round(dbms_random.value(0,100)), round(dbms_random.value(0,300)), round(dbms_random.value(0,100))
from dual connect by level < 1000;
commit;
Dumping the B-Tree structure gives us the following:
branch: 0x1c00173
29360499 (0: nrow: 3, level: 1)
leaf: 0x1c00174 29360500 (-1: row:344.344
avs:819)
leaf: 0x1c00175 29360501 (0: row:328.328
avs:824)
leaf: 0x1c00176 29360502 (1: row:327.327
avs:842)
Now let's look at the branch block dump:
row#0[8048] dba:
29360501=0x1c00175
col 0; len 2;
(2): c1 64
col 1; TERM
row#1[8039] dba:
29360502=0x1c00176
col 0; len 3;
(3): c2 03 07
col 1; TERM
----- end of
branch block dump -----
In this case, subject_id has 294 distinct values (the exact number varies depending on how dbms_random.value generates values). Now let's load 10000 rows into the table and look at the contents of the branch block:
row#0[8045] dba:
29360485=0x1c00165
col 0; len 2; (2): c1 0a
col 1; len 2;
(2): c1 4e
col 2; TERM
row#1[8037] dba:
29360486=0x1c00166
col 0; len 2;
(2): c1 14
col 1; TERM
row#2[8023] dba:
29360487=0x1c00167
col 0; len 2;
(2): c1 1d
col 1; len 2;
(2): c1 43
col 2; len 2;
(2): c2 2d
col 3; TERM
We can clearly observe that for a particular range of values Oracle uses only the leading column, the leading two columns, or all the columns of the index. This is done for efficiency: it reduces the number of branch blocks and increases the number of entries in each branch block. A leading column with high NDV requires a smaller number of branch blocks, as Oracle includes only the leading column and the branch block can therefore hold more entries. Having a high-NDV leading column also increases the chances of branch blocks being subjected to 50-50 splits, but this is dependent on various factors such as the following:
- Whether the value is monotonically increasing and strict ordering is used.
- The data type of the leading column.
- The nature of the data load.
But following an index rebuild or creation, the number of branch blocks in a multicolumn index with a high-NDV leading column will be significantly lower than in a multicolumn index with a low-NDV leading column, where Oracle stores two or more columns in the branch blocks.
There are several takeaways from this:
- Regardless of the column order, when all the indexed columns are specified in the predicates, the number of lookups required to execute the query doesn't change.
- Regardless of the column order, following an index rebuild or creation the size of the concatenated index will be almost the same, with the exception of the branch blocks (unless index key compression is enabled).
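The branch-block behavior described above amounts to storing, for each leaf boundary, the shortest leading prefix of the key that separates one leaf block from the next. The sketch below is a simplified model of that idea (Oracle's actual separator-key logic is internal); keys are (subject_id, mark1, student_id) tuples as in students_idx, with made-up values.

```python
def separator(prev_key, next_key):
    # Shortest leading-column prefix of next_key that distinguishes it
    # from the last key of the previous leaf block (simplified model).
    for i in range(len(next_key)):
        if prev_key[i] != next_key[i]:
            return next_key[: i + 1]
    return next_key

# High-NDV leading column: one column is enough for the branch entry.
print(separator((7, 55, 901), (9, 12, 14)))   # (9,)
# Leading column ties: more columns go into the branch entry.
print(separator((9, 60, 120), (9, 66, 44)))   # (9, 66)
print(separator((9, 66, 44), (9, 66, 310)))   # (9, 66, 310)
```

This matches the dumps above: branch rows end at `col 1; TERM` when one column suffices and carry extra columns only where adjacent leaves share a leading-column value.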
Size of a multicolumn index – column order:
Consider the following table and its indexes:
create table students(student_id number primary key, dept_id number not null, mark1 number not null, mark2 number);
create index sid_did_mark1_idx on students(student_id, dept_id, mark1);
create index did_sid_mark1_idx on students(dept_id, student_id, mark1);
create index did_mark1_sid_idx on students(dept_id, mark1, student_id);
The cardinality of each column is as follows:
student_id → unique
dept_id → 300 (1-300)
mark1 → 300 (1-300)
After an initial load from multiple sessions, the index structure and the space usage are as follows:
Remember that branch blocks can also be split using either the 50-50 or the 90-10 approach. The bloat in the index leaf blocks and branch blocks is completely at the mercy of how the data is loaded into the table. If we observe closely, USED_SPACE is almost the same for all the indexes; this is because the leaf blocks store all the indexed column values along with the ROWID, and the small variation in USED_SPACE can be due to the presence of NULLs. The variation in BTREE_SPACE is due to the nature of the splits. Following a rebuild of these indexes, the corresponding structure is as follows:
Notice that the index SID_DID_MARK1_IDX, which has a high-NDV (in our case unique) leading column, requires only a small number of branch blocks, since the student_id column alone is enough to traverse down the index to a particular leaf block/index entry without additional leaf block reads.
Now let's take a quick look at the benefits of index key compression and how it can reduce the overall size of an index. Let's rebuild the index DID_MARK1_SID_IDX with the additional clause COMPRESS 2 and check the resulting size.
Notice that both USED_SPACE and BTREE_SPACE are reduced significantly. We will see more details regarding how this is achieved in the key-compressed indexes section.
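A rough sketch of why COMPRESS 2 helps here: with low-NDV leading columns (dept_id, mark1), many consecutive leaf entries share the same two-column prefix, so storing each distinct prefix once per leaf block saves space. This is a toy byte-count model with made-up column sizes, not Oracle's actual prefix/suffix leaf format.

```python
def leaf_bytes(entries, prefix_cols=0, col_size=4):
    # Toy model: every column costs col_size bytes. With "COMPRESS n",
    # each distinct n-column prefix is stored once; the remaining suffix
    # columns are stored per entry.
    if prefix_cols == 0:
        return len(entries) * len(entries[0]) * col_size
    prefixes = {e[:prefix_cols] for e in entries}
    suffix = len(entries[0]) - prefix_cols
    return (len(prefixes) * prefix_cols * col_size
            + len(entries) * suffix * col_size)

# did_mark1_sid_idx keys: (dept_id, mark1, student_id). student_id is
# unique, but many rows repeat the same (dept_id, mark1) pair.
entries = [(d, m, s)
           for s, (d, m) in enumerate([(1, 10)] * 50 + [(1, 20)] * 50)]
print(leaf_bytes(entries))                  # uncompressed
print(leaf_bytes(entries, prefix_cols=2))   # COMPRESS 2
```

The same model shows why a unique leading column like student_id defeats compression: every prefix is distinct, so nothing is deduplicated.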
A high-NDV leading column in a concatenated index deprives us of:
- Reaping the benefits of index key compression, which
not only reduces the overall size of the index but also reduces the number of
lookups required (this matters when the table’s average row length is small, in which
case the index is large relative to the table). This improves the efficiency of
the buffer cache, as fewer blocks are required, and reduces the
overall query elapsed time (though the savings with regard to
buffer cache and elapsed time are subject to many factors such as the
number of SQL executions, the space savings achieved, etc.).
- The ability of the optimizer to use or consider that index
for index skip scans.
Having low-cardinality columns as the leading columns in a multi-column
or composite index has the following advantages:
- Ability to use index key compression.
- Ability to perform an index skip scan, even when the
predicates refer only to the third column.
Having multiple indexes on the same set of columns but with a different column
order can result in the optimizer computing the same index access cost, and in certain
situations this can cause problems. For example, consider the following scenario:
create table
students (student_id number, name varchar2(20), dept_id number, sub_id number,
marks number);
create index dept_sub_marks_idx
on students(dept_id, sub_id, marks);
create index marks_dept_sub_idx
on students(marks, dept_id, sub_id);
Consider the following query:
select * from
students where dept_id = 1 and sub_id = 10 and marks = 100;
The corresponding 10053 level 12 trace is as follows:
****** Costing
Index DEPT_SUB_MARKS_IDX
SPD: Return code in qosdDSDirSetup: NOCTX,
estType = INDEX_SCAN
ColGroup Usage:: PredCnt: 3 Matches Full: #1 Partial:
Sel: 1.1259e-07
SPD: Return code in qosdDSDirSetup: NOCTX,
estType = INDEX_FILTER
ColGroup Usage:: PredCnt: 3 Matches Full: #1 Partial:
Sel: 1.1259e-07
Estimated selectivity: 0.010000 , col: #2
Estimated selectivity: 0.010000 , col: #4
Estimated selectivity: 3.1250e-04 , col: #5
Access Path: index
(AllEqRange)
Index:
DEPT_SUB_MARKS_IDX
resc_io: 5.000000 resc_cpu:
36467
ix_sel: 1.1259e-07
ix_sel_with_filters: 1.1259e-07
Cost: 5.000978 Resp:
5.000978 Degree: 1
****** Costing Index MARKS_DEPT_SUB_IDX
SPD: Return code in qosdDSDirSetup: NOCTX,
estType = INDEX_SCAN
ColGroup Usage:: PredCnt: 3 Matches Full: #1 Partial:
Sel: 1.1259e-07
SPD: Return code in qosdDSDirSetup: NOCTX,
estType = INDEX_FILTER
ColGroup Usage:: PredCnt: 3 Matches Full: #1 Partial:
Sel: 1.1259e-07
Estimated selectivity: 3.1250e-04 , col: #5
Estimated selectivity: 0.010000 , col: #2
Estimated selectivity: 0.010000 , col: #4
Access
Path: index (AllEqRange)
Index:
MARKS_DEPT_SUB_IDX
resc_io: 5.000000 resc_cpu:
36467
ix_sel: 1.1259e-07
ix_sel_with_filters: 1.1259e-07
Cost: 5.000978 Resp:
5.000978 Degree: 1
The final plan is as follows:
One might wonder how Oracle chose MARKS_DEPT_SUB_IDX. The answer is that Oracle uses the
following statistics/metrics, in order, when dealing with equally
ranked indexes:
- The index with the higher number of DISTINCT_KEYS.
- The index with the lower number of leaf blocks.
- The index whose name sorts lower in ASCII order
(for example, an index whose name starts with A is preferred over one starting with Z).
For the above indexes, the NDV and the number of leaf blocks are as
follows:
Since MARKS_DEPT_SUB_IDX had more distinct keys, it is clearly the choice. But wait a minute, how can the number
of distinct keys vary between the two? This is because, while gathering statistics using the DBMS_STATS.GATHER_*_STATS procedures
with estimate_percent => dbms_stats.auto_sample_size, Oracle estimates the number of distinct keys
and leaf blocks instead of computing the actual
numbers. We will learn more about this in the statistics gathering
section. Once you gather the statistics using estimate_percent
=> 100, we will get the actual
results.
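A sketch of how the estimated versus actual figures might be compared (note that a 100% sample can be expensive on large tables):

```sql
-- Estimated statistics (gathered with the default auto sample size)
SELECT index_name, distinct_keys, leaf_blocks
  FROM user_indexes
 WHERE table_name = 'STUDENTS';

-- Regather with a full compute to get the actual figures
EXEC dbms_stats.gather_table_stats(user, 'STUDENTS', estimate_percent => 100, cascade => TRUE);
```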
When the predicates contain only a subset of the indexed columns
(for example, dept_id = ? and sub_id = ?), this behavior can cause the optimizer to pick an index or plan that appears
good for the first execution but may not be good for subsequent
executions.
Key-Compressed Indexes:
Index key compression was introduced in Oracle 8i. The COMPRESS option of the CREATE or ALTER INDEX statement enables
key compression, which eliminates the need to store
repeated occurrences of column values in the index leaf blocks, resulting
in a reduction of the index space requirements (there are cases where key compression
can result in increased space usage; we will look into them as we progress).
Consider the following index:
create index idx
on randomload(mark5);
Here mark5 has about 3200 distinct values with uniform distribution. As
an indexed column value maps to multiple rows, Oracle stores the column
value for each corresponding ROWID in the index leaf blocks as follows:
row#402[2793]
flag: -------, lock: 0, len=13
col 0; len 3;
(3): c2 13 43
col 1; len 6;
(6): 00 80 55 8a 00 1a
row#403[2780]
flag: -------, lock: 0, len=13
col 0; len 3;
(3): c2 13 43
col 1; len 6;
(6): 00 80 55 8c 00 34
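The rebuild with compression, discussed next, might be issued as follows (a sketch; with no prefix length specified, the single indexed column becomes the shared prefix):

```sql
-- Rebuild the single-column index with key compression enabled
ALTER INDEX idx REBUILD COMPRESS;
```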
Don’t you think that’s a bit of a waste of space? Now if we rebuild the
same index using the COMPRESS option, then instead of storing the key value again and
again for each entry, Oracle stores the column value only once in the leaf
block. The entries in the leaf blocks appear as follows:
prefix row#0[8027] flag:
-P-----, lock: 0, len=5
col 0; len 2;
(2): c1 3c
prc 651
row#0[8018] flag: -------,
lock: 0, len=9
col 0; len 6;
(6): 01 c0 da e2 00 0a
psno 0
row#1[8009] flag: -------,
lock: 0, len=9
col 0; len 6;
(6): 01 c0 da ea 00 33
psno 0
From the above leaf block dump, we can notice that the column key value is
stored only once and the subsequent index entries store only the ROWID. Running
analyze index … validate structure displays the details before and after
compression is enabled for the index mentioned earlier.
NAME       LF_BLKS    BR_BLKS    DISTINCT_KEYS
---------- ---------- ---------- -------------
IDX            107696        201          3200
IDX             79299        149          3200   <-- COMPRESS
So, to reiterate: the index key compression feature eliminates the need
to store the indexed column value multiple times in an index leaf block.
Rebuilding an index using the COMPRESS option will reduce the number of
index leaf and/or branch blocks if the column values are uniformly distributed
or if a column key value usually maps to multiple rows.
We can use the COMPRESS option for both B-Tree indexes and IOTs. The compression is
achieved by splitting the index key into two parts: the prefix part, which contains the column value(s) depending on whether the index is
a single-column or multi-column index, and the suffix part, which usually
contains the ROWID or, if we have manually specified the prefix length, the rest
of the columns of a multi-column (concatenated) index. When an indexed key value maps to
multiple rows, the prefix part (the column values) is shared across multiple
suffix entries, as the suffix entries are usually unique. We get better
compression when an index key value maps to multiple rows.
The compression is usually done in the leaf blocks of a B-Tree index.
The index entry (key value – ROWID) are compressed locally within the same block.
As each one of these compressed rows (suffix part) refers to the corresponding
prefix, the prefix is also stored in the same block. By storing the prefix and
suffix locally in the same block, each index leaf block is self-contained and
in order to reconstruct the complete key no additional IO is required.
A single-column unique index is not a candidate for the key compression
feature, as it is not supported in the first place; the following explains
why key compression is not good for unique (and mostly unique)
columns:
- As all the key values are unique, and key compression
relies on repeated column values, there are no potential benefits to
compressing such an index. In fact, utilizing key compression would not only
increase the space requirements but also the CPU utilization.
- Unique indexes are special in the way they store the
entries in the leaf blocks: they consider the ROWID as part of the
key. As key compression splits an index entry into a prefix and a
suffix (the prefix containing the column value and the suffix containing
the ROWID), ROWIDs would no longer be considered part of the key.
For a multi-column unique index, if the leading columns are not unique
(i.e., each value maps to multiple rows), key compression can help. Even in this case,
unless the leading-column values repeat often, the potential
benefits of key compression can be small. We need to find the right balance
between the distribution of values and the effects of the prefix/suffix split.
By default, the COMPRESS option of CREATE/ALTER INDEX includes all the columns of a non-unique index as part
of the prefix. If we want only certain leading columns, known to have
repeating values, to be part of the prefix, the COMPRESS clause also lets us
specify the prefix length, i.e., the number of leading columns to include in the prefix,
as follows:
create index
temp_idx on temp(mark1, mark2, mark3) compress 2;
In this case, Oracle includes only the columns mark1 and mark2 as part of
the prefix.
Remember that although key compression reduces the size of an index
by sharing parts of keys across multiple entries, there is a CPU overhead to
reconstruct the column values during index scans, even though the
prefixes are stored locally in the block. If a SQL statement runs
frequently but touches only a few index blocks, the CPU overhead may be on
the order of < 1%, and the elapsed time of each execution
may not increase at all. It is only when the SQL statement has to read or
process many index blocks per execution that the impact becomes clearly
visible. The overhead grows with the number of index leaf blocks read, and the
elapsed times of the SQL statement increase considerably along with the
CPU utilization.
The following is an example where the SQL statement touches more than
39 index blocks. We can clearly see the impact on the CPU utilization and
elapsed times for each execution (the shared pool is flushed so that the impact
of the disk reads is not considered). We can see that even though
the consistent reads are far fewer for the SQL statement that used the compressed
index, the elapsed times are significantly higher.
Similarly, the following is a situation which basically shows the impact
on the elapsed times and CPU utilization for a SQL statement that basically
touches very few index blocks.
There is a popular misconception that key compression is effective only
when the number of distinct values is small. This is not true: even if there are
100k distinct values in a column, uniformly distributed across a table with over 1
million rows, key compression can still help reduce the size of the index.
Finally, summarizing, index key compression can be considered in the
following cases:
- A non-unique index in which an index key
value maps to multiple rows.
- A multi-column unique index in which the NDV
of the leading columns is very low.
- An IOT (index-organized table).
Some points to remember:
- Key compression becomes less effective
when the NDV of a column or set of leading columns is very high and the data is
randomly distributed; for example, when there are 100k distinct values of a
column (or columns) in a table with just 200k rows, only a few column values map
to multiple rows, and the select statements end up
reading the index uniformly. So, even before implementing index
key compression, investigate how the data is organized in the table.
- By default, using the COMPRESS clause without a prefix length
compresses all columns of the index.
- Similar to unique indexes, ROWIDs are included as part
of the key in a compressed multi-column unique index.
Advanced key compression:
Consider a scenario in which the leading column of a multi-column index
has a low NDV and the trailing column has randomly distributed, largely
unique values. With the default key compression feature (COMPRESS), the DBA has no
option but to include only the leading column as the prefix. Enter advanced
index key compression to the rescue. Advanced key compression was introduced in
12.1 and is available only with the Advanced Compression license. This feature
allows Oracle to automatically determine, for each leaf block, the column or set of columns to
compress:
compress for
advanced low
Consider the following composite index:
Create index idx
on students (department_id, address_gps);
Here we are sure that there are many students in each department, and
the address of a student in GPS coordinates is mostly unique; only for students
who stay in hostels will the GPS coordinates be the same. (No real application
requires such an index, but it gives an overall idea of the situations in which we
can consider such a composite index.) In cases such as these, advanced key
compression will help in reducing the size of the index. Without the advanced
compression feature, the DBA has no option but to specify department_id as the prefix, even
though there is great scope for compressing other common values:
Create index idx
on students (department_id, address_gps) compress 1;
To use this feature, create or rebuild the index as follows:
create index idx
on students (department_id, address_gps) compress for advanced low;
alter index idx
rebuild compress for advanced low;
As is the case with plain key compression, COMPRESS FOR ADVANCED LOW
does not allow us to compress a single-column unique index; COMPRESS FOR
ADVANCED HIGH, however, allows it, as it uses an altogether different
compression algorithm (similar in spirit to hybrid columnar compression).
Oracle introduced COMPRESS FOR ADVANCED HIGH in 12.2; it uses a
completely new algorithm, and the compression ratios are very high even when the
column contains unique values. This is intended for warehouse or OLAP workloads,
not for OLTP.
alter index idx
rebuild compress for advanced high;
The index key compression feature doesn’t alter the optimizer’s ability to
consider index range/full/skip/fast full/unique scans; all of these remain supported.
Function-Based Indexes:
If we use predicates such as lower(name) like
'asdf%' on normally indexed columns,
Oracle performs a full table scan, because Oracle has no way of knowing
the selectivity of the predicate, and applying the function to every
indexed column value in the table would be much costlier than performing a full
table scan. Enter function-based indexes to the rescue. A function-based index
computes the value of a function (predefined or user-defined) or an
expression (+, -, /, case, etc.) involving one or more columns
and stores it in the index. Oracle supports both B-Tree and Bitmap
function-based indexes. Function-based indexes (FBIs) are specialized
indexes: once they are created, Oracle automatically adds a virtual
column to the table, and this can be seen in views such as DBA_TAB_COLS,
DBA_TAB_COL_STATISTICS, and USER_TAB_COLS (DATA_DEFAULT). Remember that descending indexes
are internally implemented as function-based indexes. Oracle will not allow us
to create a function-based index unless the function is deterministic
(one that returns the same value for the same input every time; for example, the value
returned by TO_CHAR in TO_CHAR(SYSDATE,'DD-MON-YYYY HH24:MI:SS') differs each time we invoke it and hence
is not deterministic).
Since the values stored in a function-based index are precomputed from
a function or expression on a column or set of columns, a server
process can use the index directly instead of computing the value at
run time. This not only saves CPU and I/O (as Oracle would otherwise perform a full table scan)
but also reduces the elapsed time of the query significantly (provided the
predicate results in fewer rows returned). Function-based indexes are
predominantly used for indexing case-sensitive character columns
such as CHAR and VARCHAR2.
Once we create a function-based index, the optimizer will consider that
index in its cost estimates provided the function or expression used in the
predicate is identical to the one used when creating the index. This is very important, and
we will see why it matters. Consider the following query:
select * from
students where upper(name) = 'ROBERT';
Unless the index is created as follows, the optimizer will not consider
the index in the cost estimates.
create index
stud_name_idx on students(upper(name));
We can find the expression or function used to create a function-based
index using either DBA_IND_EXPRESSIONS.COLUMN_EXPRESSION or by extracting the DDL of the index using DBMS_METADATA.GET_DDL. When
indexing foreign key columns on a child table, make sure those indexes
are not created as function-based indexes, as this causes locking issues; if a
function-based index is truly required, create it in addition to the plain
B-Tree index on the foreign key.
Oracle allows us to selectively index column values using either a user-defined
function or a CASE expression. Consider the following index:
create index
mark1_idx on randomload(case mark1 when 1 then null else mark1 end);
In this case, Oracle will index all values other than 1, since we are
using a CASE expression to store NULL when the mark1 value is 1. Unless we
specify the entire expression in the WHERE clause, Oracle will not use the
index:
select avg(mark2)
from randomload where case mark1 when 1 then null else mark1 end=98
Function-based indexes are not restricted to applying functions or
expressions to single columns; we can build either single-column or
multi-column function-based indexes using multiple table columns. Consider the
following:
create index
average on randomload (mark1 + mark2 + mark3 + mark4 + mark5 + mark6, round ( (mark1
+ mark2 + mark3 + mark4 + mark5 + mark6)/6));
A function-based index is used only when the predicate contains the
exact expression or function used while creating the index. For example,
considering the index created above, unless we write the SQL statement as follows,
the optimizer will not consider the index:
select roll from
randomload where mark1 + mark2 + mark3 + mark4 + mark5 + mark6 = 123;
Instead of specifying the entire expression or function used while
creating the index, we can also use, in the WHERE clause, the virtual column that
Oracle creates automatically along with the function-based index, as
follows:
SQL> select *
from randomload where "SYS_NC00009$" = 5747;
Predicate
Information (identified by operation id):
---------------------------------------------------
2 -
access(ROUND(("MARK1"+"MARK2"+"MARK3"+"MARK4"+"MARK5"+"MARK6")/6)=5747)
If we notice the column name, it starts with SYS_*; this is because it is
automatically generated by Oracle. As Oracle treats this as a hidden column,
histograms on such columns are not collected by default, unless we specify
either FOR ALL HIDDEN COLUMNS in the METHOD_OPT parameter or name the hidden column explicitly in
METHOD_OPT. Unless the values
returned by the function-based index's expression are uniformly distributed, collecting histograms on these
virtual/hidden columns is essential.
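A sketch of gathering histograms on such hidden columns (table name from the example; the SIZE value is an assumption, any valid histogram size works):

```sql
BEGIN
  dbms_stats.gather_table_stats(
    ownname    => user,
    tabname    => 'RANDOMLOAD',
    -- also collect histograms on the hidden columns backing FBIs
    method_opt => 'FOR ALL HIDDEN COLUMNS SIZE 254');
END;
/
```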
There is a specific reason for pointing out that we can use the virtual
column in the predicate. As the expressions or functions used in the
predicate become large, there is a higher probability of mistakes
while writing the code. An even better alternative is to
add virtual columns to the table and index those columns, which makes coding
simpler at the application level. Continuing the same example,
instead of creating the function-based index, we could do the following:
alter table
randomload add (total number generated always as (mark1 + mark2 + mark3 + mark4
+ mark5 + mark6) virtual, average number generated always as (round((mark1 + mark2
+ mark3 + mark4 + mark5 + mark6)/6)) virtual);
create index
total_avg_idx on randomload(total, average);
And now the SQL statements can simply refer to these columns in the
predicate as follows:
select roll from
randomload where total = 123123;
We can disable the use of function-based indexes using the _disable_function_based_index
hidden parameter. MIN/MAX functions on function-based indexes are supported
(descending indexes are an exception) and will result in an INDEX FULL SCAN (MIN/MAX) for
numeric columns.
Following are the limitations of function-based indexes:
- Skip scans are not supported (remember reverse key indexes).
- The expression cannot include aggregate functions such as SUM, AVG, etc.
- The database ignores a function-based index while doing an OR-expansion.
- The function or expression should produce values of deterministic
length (operations on strings should result in bounded-length output).
- Direct path loads are not supported on a table with a function-based
index that depends on a user-defined function.
- Compared with a regular index, a table with a function-based
index experiences slower inserts and updates.
Descending Index:
Starting from Oracle 8i, we can create an index on a column in
descending order. Oracle internally implements descending indexes as
function-based indexes (remember that skip scans are not supported with
function-based or reverse key indexes). By default, when we create an index,
the entries are stored in ascending order. Descending indexes are useful in
cases where we order the data in descending order. To create a descending index
on a column, all we need to do is add the DESC keyword after the column name in the CREATE INDEX statement.
Consider the following example:
create table
students (student_id number, name varchar2(20), marks number);
create index
marks_idx on students(marks desc);
insert into
students select rownum, dbms_random.string('a', 20), round(dbms_random.value * 1000)
from dual connect by level < 100000;
commit;
Since Oracle creates a function-based index, it adds a virtual (hidden)
column to the table:
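Assuming the example table above, the automatically added hidden virtual column might be inspected with a query such as:

```sql
-- The SYS_NC...$ column backing the descending index shows up here
SELECT column_name, virtual_column, hidden_column, data_default
  FROM user_tab_cols
 WHERE table_name = 'STUDENTS';
```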
Remember that for descending indexes, the actual function used is not
displayed in the DATA_DEFAULT column, but it can be seen in the predicate
section of the explain plan. Consider the following query and the
predicate section of its plan:
select * from
students where marks = 9;
Oracle internally applies the SYS_OP_DESCEND function to the column value and stores the result in
the index.
Technically, the following two queries are equivalent:
select * from
students where marks = 9;
select * from
students where sys_op_descend(marks) = sys_op_descend(9);
Now that we know that the column values are stored in the descending
order, how about ROWIDs? ROWIDs are
stored in ascending order as usual.
Creating a descending index on a single column may not be beneficial,
since Oracle can perform an INDEX RANGE SCAN DESCENDING when we use an ORDER BY clause with the DESC keyword, eliminating the
sort operation. For normal multi-column indexes, things work a little differently:
since the data is sorted in ascending order across all columns in the index leaf
blocks, sorting is eliminated when the ORDER BY clause
contains the leading indexed column and/or trailing columns in index order.
E.g.: consider the following index:
create index
mark1_idx on randomload(mark1, mark2);
Sorting will not happen for the following clauses:
order by mark1;
order by mark1 desc;
order by mark1, mark2;
order by mark2; (if we use an = predicate on the mark1 column)
For the rest, sorting of the results happens, and this can be clearly seen
in the explain plan as SORT ORDER BY.
Oracle can utilize either an INDEX SKIP SCAN
DESCENDING or an INDEX RANGE SCAN DESCENDING
to avoid a sort operation when the ORDER BY clause includes the leading column. But
when we use an ORDER BY clause on the trailing columns with the DESC keyword (without an =
predicate on the leading column), sorting occurs. Descending indexes help in
avoiding the sort operation after fetching the results, but since
they are implemented as function-based indexes, they cannot support skip scans.
Another interesting point to remember here is that sys_op_descend(null) returns 00; in other words, NULLs are
stored in a descending index.
Limitations of a descending index:
-
Skip scans are not supported.
Partitioning Basics:
Partitioning allows us to logically divide a large table, index, or IOT
into smaller pieces called partitions. Each partition has its own name and can
optionally have its own storage characteristics. Partitions can be
managed either collectively or individually. A row in a partitioned table is assigned to
exactly one partition via the partition key, which can consist
of one or more columns; Oracle automatically directs insert, update, and
delete operations to the appropriate partition using the partitioning key. There
are several benefits of partitioning, such as:
- Partition pruning.
- The ability to drop old partitions instead of deleting
data (as in a traditional table).
- The ability to move old data to slower disks.
- Reducing index skew, etc.
There are downsides to partitioning as well, and these depend on how well
the schema and/or application is designed. Although table/index partitions are
completely transparent to the application, even with correct predicates in the
WHERE clause Oracle has to perform checks to ensure that the DML or SELECT
statement is directed to the right partition; this involves cost and
increases the elapsed time of a query. The delay also tends to increase with
the number of partitions. Consider the following table and insert statement:
create table
students(student_id number, dept_id number, name varchar2(30), sub_id number,
mark1 number, mark2 number, mark3 number, mark4 number, mark5 number, mark6
number, mark7 number, mark8 number) partition by range (dept_id) interval(1)
(partition p1 values less than (1));
create index
did_sid_idx on students (dept_id,
student_id) local;
insert into
students (student_id, dept_id, name, sub_id, mark1, mark2, mark3, mark4, mark5,
mark6, mark7, mark8) values (?,?,?,?,?,?,?,?,?,?,?,?);
For each insert, Oracle must compare partition boundaries to determine
the partition the row maps to, and only then insert the row. Similarly,
for operations such as DELETE and UPDATE, even when the predicates
contain the partitioning key, as in the following, additional cost is involved in
finding the correct partition:
delete from
students where dept_id = ? and student_id=?
Update students
set mark7 = ? where student_id = ? and dept_id = ?
Even when the workload is highly concurrent and you are pushing the
limits of I/O in terms of IOPS and log file sync, you may not notice this effect
for row-by-row inserts, updates, and deletes (the additional cost is on
the order of < 0.02-0.03 ms). Things become more apparent when you use
batch inserts or bulk loads, as they slow down considerably because Oracle
must check each and every row. The following chart gives an overview of the
time it takes to insert millions of records into a partitioned and a
non-partitioned table.
The test environment is as follows:
1. Database is cached in memory and in noarchivelog mode.
2. Segments were pre-allocated with extents.
3. Redo logs were more than 6GB in size (to reduce the frequency of incremental checkpoints).
4. Tables had no indexes.
Partitioned Indexes:
Partitioned indexes are of two types:
- Local
  o Local Prefixed Indexes
  o Local Non-Prefixed Indexes
- Global
Local Index:
In a local index, all the index partitions are equipartitioned with the
underlying table. Oracle partitions the index on the same columns as the
underlying table, creates the same number of partitions and/or sub-partitions
and gives them the same partition bounds as corresponding partitions of the
underlying table. Oracle also maintains the index partitioning automatically
when the partitions in the underlying table are added, dropped, merged, or
split or when hash partitions or sub-partitions are added or coalesced. This
ensures that the index remains equipartitioned with the table. Remember that a
local index can be created UNIQUE only when the partitioning columns form a subset
of the index columns. This restriction guarantees that rows with identical keys
always map to the same partition, where uniqueness violations can be detected.
Consider the following table:
create table
students(student_id number, dept_id number, name varchar2(20)) partition by
range (dept_id) interval (1) (partition p1 values less than (1));
By default, when we create an index on a partitioned table without the
LOCAL keyword, Oracle creates a regular non-partitioned (global) index.
create unique
index students_pk on students(student_id);
alter table
students add constraint students_pk primary key (student_id);
To create a unique local index, we must include the partitioning key
column dept_id in the index along with the student_id column.
create unique
index students_pk on students(student_id, dept_id ) local;
alter table
students add constraint students_pk primary key (student_id, dept_id) using
index local;
Because Oracle treats each index partition as a separate segment, enforcing
uniqueness on student_id alone would require checking that a given value does not
exist in any other index partition; scanning every index partition for each
insert causes additional reads, which would significantly reduce insert performance.
Even if the student_id column is populated by a database or application
sequence that guarantees a unique value each time,
Oracle cannot rely on this from the partition's perspective, since
database sequences can CYCLE and Oracle knows nothing about
application sequences. If you really want to guarantee that student_id is unique, you
have no choice but to create a regular non-partitioned (global) index and rely on
features such as asynchronous global index maintenance for partition
maintenance operations; in some cases where you cannot modify the SQL
statements, partitioning may not be the right choice at all.
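A sketch of the non-partitioned alternative (the index name students_sid_uk is hypothetical), along with a partition maintenance operation that keeps the global index usable; from 12c onwards, asynchronous global index maintenance defers the actual index cleanup:

```sql
-- Global (non-partitioned) unique index enforces uniqueness across all partitions
CREATE UNIQUE INDEX students_sid_uk ON students (student_id);

-- Partition maintenance that keeps the global index usable
ALTER TABLE students DROP PARTITION p1 UPDATE GLOBAL INDEXES;
```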
Let’s start with the very basics: partition pruning.
Partition pruning allows the optimizer to analyze the FROM and WHERE clauses
of a SQL statement to eliminate unneeded partitions when building the
partition access list. Depending on the actual SQL statement, Oracle
may use static or dynamic pruning. Static pruning occurs at compile time, when
the information about the partitions accessed is available beforehand. Dynamic
pruning occurs at run time, meaning that the exact partitions to be accessed by
a statement are not known beforehand (this happens mostly with joins and bind
variables). Partition pruning depends on the partitioning key being specified
in the predicates; without the partitioning key column, partition pruning will
not happen. Consider the following table and index:
create table
students(student_id number, dept_id number, name varchar2(20)) partition by
range (dept_id) interval (1) (partition p1 values less than (1));
alter table
students add constraint students_pk primary key (student_id, dept_id) using index local;
student_id is populated by a database sequence with monotonically increasing
values and is guaranteed to be unique across all the partitions.
For a query such as the following:
select * from
students where student_id = :1 and
dept_id=:2;
With the partition key specified in the predicate, Oracle can quickly
determine which index partition this value maps to and redirect the reads to
that particular partition and eventually to the corresponding table partition
using the ROWID:
The execution statistics are as follows:
Without the partition key column specified in the predicate, Oracle has
no option but to scan every index partition (segment), as student_id = 1 can be in any
partition, so Oracle ends up traversing each index partition to find the
corresponding rows. The number of reads (index leaf blocks) is directly
proportional to the number of partitions: (blevel + 1) * number
of partitions. For example, with a blevel of 2 and 100 partitions, that is
roughly 300 block gets just to probe the index partitions.
select * from students
where student_id = :1
The explain plan is as follows:
Oracle performs an index range scan rather than an index unique
scan because both student_id and dept_id make up the primary
key and dept_id is not specified in the predicate. Execution statistics
with 100 underlying table partitions:
Execution statistics with 10 underlying partitions:
Even though student_id is unique and is guaranteed to be present in only one index partition, the SQL:
select * from students where student_id = 1
results in traversing the B-Tree structure of each index partition, from root to leaf, even though the data is stored in the first index partition.
The following is the snippet of the 10046 level 12 trace indicating the reads:
To understand why Oracle requires traversing down the B-Tree structure of each index partition, we need to understand how the entries are stored and organized in the root branch block, branch blocks, and leaf blocks.
The contents of the root branch block and branch blocks are basically the same: entries in a root branch block or branch block point to the DBA of either leaf blocks or next-level branch blocks. The contents of the branch blocks depend on various factors such as
- Whether the index is unique.
- NDV of the indexed column.
- NDV of the leading column of a multi-column index (as we have seen with multi-column indexes).
Unique Index:
row#0[8047] dba: 29435353=0x1c125d9 // DBA of either branch or leaf
col 0; len 4; (4): c3 24 0c 37
row#1[8038] dba: 29436088=0x1c128b8 // DBA of either branch or leaf
col 0; len 4; (4): c3 47 07 4c
Non-unique index: high NDV
row#0[8046] dba: 29434551=0x1c122b7 // DBA of either branch or leaf
col 0; len 4; (4): c3 1f 1d 07 //Starting value
col 1; TERM
row#1[8036] dba: 29434552=0x1c122b8 // DBA of either branch or leaf
col 0; len 4; (4): c3 1f 21 38
col 1; TERM
Non-unique index: low NDV
row#0[8044] dba: 29428848=0x1c10c70 // DBA of either branch or leaf
col 0; len 3; (3): c2 0d 09 //Starting value
col 1; len 3; (3): 01 c0 1b //portion of ROWID
row#1[8032] dba: 29429412=0x1c10ea4
col 0; len 3; (3): c2 19 0a //Starting value
col 1; len 3; (3): 01 c0 5c
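The col 0 byte strings in these dumps are Oracle's internal NUMBER encoding. For positive integers, the first byte is 0xC1 plus the base-100 exponent, and each following byte is a base-100 digit stored as digit + 1. A minimal decoder sketch (negative numbers and decimals use extra rules not handled here):

```python
def decode_oracle_number(raw):
    """Decode the positive-integer NUMBER byte strings seen in block dumps.

    raw[0]  : exponent byte, 0xC1 + base-100 exponent of the first digit
    raw[1:] : base-100 mantissa digits, each stored as digit + 1
    """
    exponent = raw[0] - 0xC1
    value = 0
    for i, b in enumerate(raw[1:]):
        value += (b - 1) * 100 ** (exponent - i)
    return value
```

For example, c1 02 decodes to 1 (which is why student_id = 1 appears as "c1 02" later in this chapter), and c2 03 30 decodes to 2*100 + 47 = 247.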
There is a reason for highlighting the "col 1; TERM" entries; we will see why shortly. We will start with the minimum column value in the index, which is guaranteed to be present in the left-most index leaf block. The following depicts the actual procedure involved, starting from the root block. As things get more complicated later, let's start with the basic flow:
Oracle interprets the entries in the Branch blocks (root or branch) as
follows:
No condition … just leaf block pointer …
>= 1000 … leaf block pointer … with optional TERM … 1000 is the starting value in the leaf block
>= 2000 … leaf block pointer … with optional TERM
When we issue a query such as the following:
select * from students where student_id = 1; (student_id is the primary key, populated with a monotonically increasing value starting at 1 and incremented by 1)
The corresponding 10046 level 12 trace is as follows:
The relevant section of the tree dump is as follows:
The root block is always the first block following the segment header and L3 BMB (as stated earlier); you can confirm this using the following query:
Oracle starts by reading the root block and you can see dump of the root
block as follows:
The value of interest, student_id = 1, in byte representation is c1 02. Oracle starts by reading the entries in the root block beginning at row#0. The value there is c3 24 0c 37, which is higher than the value we are looking for. Now that Oracle is sure this branch doesn't contain the value, it reads the branch block 0x1c0cc72 pointed to by kdxbrlmc, which may contain the value we are looking for.
Dump of index branch block is as follows:
Upon reaching row#0 it determines that the corresponding leaf block doesn't contain the value we are looking for, and reads the leaf block 0x1c0025c pointed to by kdxbrlmc, which may contain the value.
Dump of the index leaf block is as follows:
It starts reading the entries in the leaf block, and since the value we are looking for, "c1 02", is present, it gets the corresponding ROWID, reads the table block, and finally returns the results to the user. Since the column value (in byte representation) forms the starting value of the entries in the block pointed to by the DBA, Oracle has to read the next entry (ROW#) to determine whether the value of interest falls in the range, and only then read the pointed-to block and proceed further. Due to this, even if the maximum column value is 999999, when we search for 99999999, which is not indexed, Oracle ends up reading the right-most blocks of the B-Tree structure (the last branch and leaf block of the treedump).
There is another aspect we should be aware of: an optimization that Oracle employs in the branch blocks, not only to save space but also to reduce the number of branch blocks required. Oracle can use either a ROWID portion or TERM in the branch blocks depending on various factors such as the following:
Use of TERM:
- When the indexed column value and the corresponding ROWIDs are fully contained in the leaf block, with the exception of the last entry when that value becomes the starting value of the entries in the next leaf block.
- To indicate the starting leaf block when a column value spans multiple leaf blocks.
- Not used in unique single-column indexes.
Use of ROWID:
- When the entries of a particular value in the leaf block span multiple leaf or branch blocks.
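The decision above can be modelled in a couple of lines. This is only a sketch of the rule as described, not Oracle's implementation: a branch entry carries the leaf's starting value plus either TERM or a ROWID portion, and the ROWID portion is only needed when the starting value also appears as the last value of the previous leaf (i.e. the value spans blocks):

```python
def branch_entry(leaf_start_value, prev_leaf_last_value, rowid_portion):
    """Model of the branch-block space optimization.

    If the leaf's starting value continues from the previous leaf, part of
    the ROWID is needed to disambiguate the range; otherwise the trailing
    key column can be truncated to TERM.
    """
    if leaf_start_value == prev_leaf_last_value:
        return (leaf_start_value, rowid_portion)  # value spans leaf blocks
    return (leaf_start_value, 'TERM')             # value fully contained
```

This mirrors the branch dump examined later in this section, where a leaf whose starting value is unique to it gets TERM, while a leaf whose starting value spills over from its neighbour gets a ROWID portion.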
Let’s look at a very basic scenario:
create table students (student_id number);
insert into students select rownum from dual connect by level < 1000;
insert into students select rownum from dual connect by level < 1000;
create index students_pk on students(student_id);
commit;
Since we load the same data twice, we are sure that any value of student_id will return exactly two rows. The treedump of the index (students_pk) is as follows:
----- begin tree dump
branch: 0x1c00163 29360483 (0: nrow: 5, level: 1)
leaf: 0x1c00164 29360484 (-1: row:492.492 avs:818) // avs → free space in the leaf block
leaf: 0x1c00165 29360485 (0: row:478.478 avs:830) // due to the default PCTFREE 10
leaf: 0x1c00166 29360486 (1: row:479.479 avs:817)
leaf: 0x1c00167 29360487 (2: row:478.478 avs:830)
leaf: 0x1c00168 29360488 (3: row:71.71 avs:6931)
----- end tree dump
We can obtain a treedump of an index using the following statement:
alter session set events 'immediate trace name treedump level &[OBJECT_ID]';
Remember that X$BH.OBJ always points to the DATA_OBJECT_ID.
Oracle uses the same format/structure for all the branch blocks, and the entry kdxbrlmc always points to either a leaf or a next-level branch block depending on its position in the index structure. The treedump displays the index leaf blocks starting from -1 because Oracle uses the pointer kdxbrlmc to point to the first leaf block; the numbering of the remaining leaf blocks is based on the ROW# in the branch block. Now let's look at the branch block dump:
Branch block dump
=================
header address 139695531102284=0x7f0d6682104c
kdxcolev 1
KDXCOLEV Flags = - - -
kdxcolok 0
kdxcoopc 0x80: opcode=0: iot flags=--- is converted=Y
kdxconco 2 //number of columns in the leaf block
kdxcosdc 0
kdxconro 4 //number of rows/entries in the branch block starting from ROW#0
kdxcofbo 36=0x24
kdxcofeo 8012=0x1f4c
kdxcoavs 7976 //available free space
kdxbrlmc 29360484=0x1c00164 //points to the leaf block
kdxbrsno 0
kdxbrbksz 8056
kdxbr2urrc 3
row#0[8047] dba: 29360485=0x1c00165
col 0; len 3; (3): c2 03 30
col 1; TERM
row#1[8038] dba: 29360486=0x1c00166
col 0; len 3; (3): c2 05 57
col 1; TERM
row#2[8025] dba: 29360487=0x1c00167
col 0; len 3; (3): c2 08 1a
col 1; len 4; (4): 01 c0 01 5f
row#3[8012] dba: 29360488=0x1c00168
col 0; len 3; (3): c2 0a 41
col 1; len 4; (4): 01 c0 01 5f
----- end of branch block dump -----
Notice the use of the keyword TERM for col 1 in the first two entries of the branch block. As all the entries in the leaf block (pointed to by the branch block dba) are fully contained in it and don't span multiple blocks, Oracle uses TERM instead of a ROWID or ROWID portion. Use of TERM in the branch block for a particular leaf block doesn't necessarily mean that the entries in that leaf block don't span multiple blocks; in certain cases the last entry of that leaf block becomes the starting value of the next leaf block under the same branch block. This can happen due to various reasons:
- Immediately following an index build/rebuild.
Let's start with the first leaf block 0x1c00164; the tail section of the index leaf block dump is as follows:
row#490[1851] flag: -------, lock: 0, len=13
col 0; len 3; (3): c2 03 2f //Column value
col 1; len 6; (6): 01 c0 01 5e 00 f5
row#491[1838] flag: -------, lock: 0, len=13
col 0; len 3; (3): c2 03 2f //Column value
col 1; len 6; (6): 01 c0 01 5f 02 48
----- end of leaf block Logical dump -----
----- end of leaf block dump -----
End dump data blocks tsn: 4 file#: 7 minblk 356 maxblk 356
Here Oracle is sure that the rows in the leaf block are fully contained and don't span multiple leaf blocks, so it uses the keyword TERM in the branch block. Things get interesting from here, starting with the second and third entries in the branch block:
row#1[8038] dba: 29360486=0x1c00166
col 0; len 3; (3): c2 05 57
col 1; TERM //Use of TERM
row#2[8025] dba: 29360487=0x1c00167
col 0; len 3; (3): c2 08 1a
col 1; len 4; (4): 01 c0 01 5f //Use of portion of ROWID
Snippet of index leaf block 0x1c00166 dump is as follows:
kdxlenxt 29360487=0x1c00167
kdxleprv 29360485=0x1c00165
kdxledsz 0
kdxlebksz 8032
row#0[8019] flag: -------, lock: 0, len=13
col 0; len 3; (3): c2 05 57
col 1; len 6; (6): 01 c0 01 5b 00 a4
row#1[8006] flag: -------, lock: 0, len=13
col 0; len 3; (3): c2 05 57
col 1; len 6; (6): 01 c0 01 5e 01 e5
…
row#477[1824] flag: -------, lock: 0, len=13
col 0; len 3; (3): c2 08 19
col 1; len 6; (6): 01 c0 01 5f 00 3f
row#478[1811] flag: -------, lock: 0, len=13
col 0; len 3; (3): c2 08 1a
col 1; len 6; (6): 01 c0 01 5b 01 93
----- end of leaf block Logical dump -----
The last entry of leaf block 0x1c00166, the column value c2 08 1a, becomes the starting value for leaf block 0x1c00167. Since the value c2 08 1a spans multiple leaf blocks, Oracle uses a ROWID portion in the index branch block. This basic example demonstrates that even when an indexed column value results in only 2 rows, Oracle may have to read 2 index leaf blocks to get the corresponding ROWIDs. Similarly, following an index rebuild a value can span multiple branch blocks, but this will not cause any additional branch block reads, as Oracle traverses the index leaf block chain once the initial block is found.
Due to this optimization, a 50-50 block split will not always result in 50% of the entries being moved to the new leaf block. The following scenario illustrates this:
drop table students;
create table students (roll number);
create index idx on students(roll);
insert into students select 1 from dual connect by level < 292;
insert into students select 2 from dual connect by level < 281;
commit;
insert into students values (1);
commit;

drop table students;
create table students (roll number);
create index idx on students(roll);
insert into students select 1 from dual connect by level < 291;
insert into students select 2 from dual connect by level < 282;
commit;
insert into students values (1);
commit;
The algorithm is basically as follows (simplified, for understanding certain aspects):
Read the root/branch block and read the entries starting from ROW#0.
Step A: If the entry value is higher than the search value:
    Action 1: Read the branch/leaf block pointed to by the previous ROW#, or by kdxbrlmc if ROW#=0.
Step B: If the entry value is lower, read the next entry until a higher value is found.
    If higher, perform Action 1.
    If there is no subsequent entry (it is the last entry), read the block pointed to by the current entry.
Step C: If the entry value is equal, check whether a ROWID portion or TERM is included:
    If a ROWID portion is in the branch entry, check the previous entry for TERM.
        If TERM is present, read the previous block and proceed along the chain until a higher value is found.
        If no TERM, read the current block and continue along the chain until a higher value is found.
    If TERM, read the leaf block.
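The core of the descent (Steps A and B, ignoring the TERM/ROWID refinement of Step C) can be sketched as a simple loop over the branch entries; this is a model of the described behaviour, not Oracle's code. The separator values 247, 486, 725 and 964 are the decimal decodings of the starting values in the branch dump shown earlier (c2 03 30, c2 05 57, c2 08 1a, c2 0a 41):

```python
def descend(kdxbrlmc, entries, key):
    """Pick the child block DBA for `key` in a branch block.

    entries   : list of (starting_value, child_dba) pairs, as in the dump
    kdxbrlmc  : left-most child, covering keys below entries[0]'s start
    """
    child = kdxbrlmc
    for starting_value, dba in entries:
        if key < starting_value:
            break            # previous child covers this key (Action 1)
        child = dba          # keep scanning: key may be further right
    return child
```

Note that a key beyond the indexed maximum (e.g. 99999999) never breaks out of the loop and lands in the right-most child, which is exactly why such a search reads the last branch and leaf block of the treedump.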
Before even considering partitioning as an option, evaluate whether the SQL statements depend mostly on primary or unique key lookups, and whether joins (the joining column is usually under a primary key or foreign key constraint) can be made aware of the partitioning by including the partitioning key in the predicates. Most ORM solutions such as Hibernate are not partition aware, and development/application teams should consider this carefully before implementing or accepting partitioning as a solution. If the workload depends mostly on primary key lookups and the application cannot be made partition aware, then consider using non-partitioned indexes for the primary/foreign keys. Oracle does support asynchronous global index maintenance.
The same concepts apply to delete and update statements as well if the predicate doesn't contain the partitioning key column.
Local Prefixed Indexes:
Local indexes are of two types:
- Prefixed indexes
- Non-prefixed indexes
Local prefixed indexes are indexes that have the partitioning key as the leading column in the index definition. Consider the following table:
create table students(student_id number, name varchar2(30), dept_id number, sub_id number, mark1 number, mark2 number, mark3 number, mark4 number, mark5 number, mark6 number, mark7 number, mark8 number, mark9 number) partition by range(dept_id) interval(1) (partition p1 values less than (1));
create index dept_sub_marks_idx on students(dept_id, sub_id, mark9) local; //prefixed index
create index sub_dept_mark_idx on students(sub_id, dept_id, mark9) local; //non-prefixed index
If the SQL predicate contains the partitioning key, then regardless of whether the local index is prefixed or non-prefixed, partition pruning occurs, and the performance of a prefixed local index will be almost identical to that of a non-prefixed local index.
Let's consider the following query (the table students is loaded with a uniform distribution of data across a total of 100 partitions, dept_id 1-100):
select * from students where dept_id = ? and sub_id = ? and mark9 = ?
Remember that when all the predicate columns are specified in the query and there are multiple indexes on the table with these columns (in different order), the optimizer estimates the cost of index access to be the same for both indexes, and ends up choosing the index with lower DISTINCT_KEYS, lower LEAF_BLOCKS, or the alphabetically smaller (ASCII) name.
The execution plans using both the indexes are as follows:
Non-Prefixed Local Index
Prefixed Local Index
The execution statistics for a random set of bind values using both the
indexes is as follows:
The case is basically the same even when the query results in traversing the B-Tree structures of multiple partitions, as with the following queries:
select * from students where dept_id in (?,?) and sub_id = ? and mark9 = ?
select * from students where dept_id in (?,?,?) and sub_id = ? and mark9 = ?
IOT:
Consider the following table and its corresponding index:
create table shares(share_id number, share_name varchar2(20), conditions varchar2(200));
create index shares_idx on shares (share_id);
And when a session issues a query such as the following:
select share_id, share_name from shares where share_id = 19;
Oracle starts by working down the index shares_idx from the root branch block to the leaf block containing the key that matches the range. Assuming a blevel of 2 (three levels), this requires 3 block reads:
- Root branch block
- Branch block
- Leaf block
Once Oracle obtains the ROWIDs from the index block, it reads the corresponding table block. This requires an additional table block read to get the extra relevant details. What if we could avoid the table lookup altogether, even for queries such as the following?
select * from shares where share_id = 19;
To accomplish this, we would end up creating an index including all the columns of the table, which is basically redundant considering the same data would be stored in two segments: the table segment and the index segment. Enter the Index Organized Table, or IOT, to the rescue.
Index Organized Tables were introduced in Oracle 8i. An IOT stores data ordered by the primary key column values, as if the entire table were created inside an index. A normal index stores only the indexed columns, but an IOT can store either extra columns or all columns of the table in the index. As all or most of the table columns are stored in the index, the rows of the table do not have a physical ROWID, though we can still select the ROWID pseudo column from an IOT (it is a computed value based on a guess). These ROWIDs are logical ROWIDs (we will look into this soon when we discuss secondary indexes). An IOT is appropriate when the queries contain predicates that involve the primary key columns (be it an index range scan, index skip scan, or index unique scan). For example, if the primary key is based on columns (a, b, c), an IOT is appropriate for the following types of predicates:
where a=?
where a=? and b=?
where a=? and c=?
where a=? and b=? and c=?
where b=? //skip scan when the NDV of column a is very low
where b=? and c=? //same as above
(several combinations of range-based predicates also work)
In the case of an IOT, even though the optimizer considers skip scans when the NDV of the leading column is low, this should be evaluated carefully. As additional columns are included as part of the B-Tree structure, the benefits of having an IOT will quickly disappear with a steady increase in that NDV, in which case a full index scan, or falling back to a secondary index, will make things even worse.
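Why skip-scan cost grows with the leading column's NDV can be shown with a very rough logical-read model; this is an illustrative approximation only (it counts root-to-leaf descents and ignores leaf-chain reads):

```python
def probe_cost(blevel, ndv_leading, skip_scan):
    """Rough model of index logical reads.

    A range/unique scan performs one root-to-leaf descent of
    blevel + 1 blocks; a skip scan effectively repeats the descent
    once per distinct value of the skipped leading column.
    """
    descents = ndv_leading if skip_scan else 1
    return descents * (blevel + 1)
```

With blevel 2, a skip scan over a leading column with 200 distinct values costs roughly 600 block visits versus 3 for a scan that supplies the leading column, which is why the benefit evaporates as the NDV climbs.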
An IOT's DDL is similar to that of a normal heap table, but requires additional keywords such as
- ORGANIZATION INDEX (mandatory)
- PCTTHRESHOLD (optional)
- INCLUDING (optional)
- OVERFLOW (optional)
Let’s look at the DDL:
create table students (
student_id number,
name varchar2(20),
contact_details varchar2(200),
primary key (student_id))
organization index
pctthreshold 50
including student_id
overflow;
The ORGANIZATION INDEX clause specifies that the table we are creating is an Index Organized Table. INCLUDING names the column up to which the non-primary-key columns defined in the table are stored in the B-Tree index structure. PCTTHRESHOLD defines the percentage of the block that is reserved for an IOT row; if the row size exceeds this threshold, the non-key columns of the row are stored in the overflow segment and a pointer is stored to locate the tail piece. Valid values for PCTTHRESHOLD are 1-50; if we don't specify a value while creating an IOT, Oracle defaults it to 50. OVERFLOW TABLESPACE specifies the tablespace that stores the overflow segment (which stores the columns of the table that were not included via the INCLUDING clause, along with any columns that didn't fit in the index leaf block due to the size restriction).
Oracle imposes a restriction on the maximum length of an index key, and this depends on db_block_size or the tablespace block size. For an 8k block size, the maximum allowed index key length is 6398 bytes. For an IOT things work a little differently: since the rows of an IOT are stored based on primary key values and PCTTHRESHOLD is capped at 50, the maximum allowed index key length is only 3218 bytes for an 8KB data block size. This restriction has the following implications:
- Oracle doesn't allow us to create an IOT when the total length of the primary key columns is > ~3218.
- When the total length of the primary key columns plus the columns included in the IOT index structure crosses this ~3218 threshold, Oracle stores the additional columns (included via the INCLUDING clause) in the overflow segment.
This restriction should give us a clear indication of whether it is even worth considering a bigger block size, or an IOT for that matter, considering several factors such as
- Index segment bloat due to 50-50 block splits (when the leading column of the primary key index is non-unique), or a combination of 90-10 and 50-50 splits when multiple sessions insert data (when the leading column is unique and monotonically increasing).
- The number of rows retrieved per SQL execution (when the leading column is not unique).
- The range of predicates used.
Consider the following table:
create table students(
student_id number,
name varchar2(20),
phone number,
address varchar2(20),
primary key (student_id))
organization index
pctthreshold 50
including phone
overflow tablespace users;
In this case the columns student_id, name and phone will be stored in the index leaf block. There is yet another interesting behavior to notice here; consider the following DDL:
create table students(
student_id number,
name varchar2(20),
phone number,
address varchar2(20),
primary key (student_id, phone))
organization index
pctthreshold 50
including phone
overflow tablespace users;
Here, even though we have specified the column "phone" in the INCLUDING clause, since it is part of the primary key, Oracle does not store the column "name" in the index structure. To include the name column in the B-Tree structure of the IOT, specify name in the INCLUDING clause. The INCLUDING clause is used to specify which columns that do not make up the primary key are to be stored in the B-Tree structure. Consider another statement:
create table students(
student_id number,
name varchar2(20),
phone number,
address varchar2(20),
primary key (student_id))
organization index
pctthreshold 50
including phone
overflow tablespace users;
In this case columns name and phone will be stored as part of the index structure.
Now let’s look at how the snippet of index leaf block trace for the
below table:
create table
temp(roll number, name varchar2(40), mark1 number, mark2 number, mark3 number,
mark4 number, mark5 number, mark6 number, primary key (roll, name))
organization index including mark3 overflow;
Here we have specified that the columns mark1, mark2 and mark3 be
included as part of the index structure.
row#115[981] flag: K------, lock: 0, len=65
col 0; len 3; (3): c2 0d 62 //Columns that make up the primary key
col 1; len 40; (40): 63 76 74 78 66 69 73 6b 6e 72 64 79 64 71 65 6f 72 64 73 63 6f 63 71 70 62 71 66 72 6f 61 6b 72 65 6a 63 6e 6f 6c 6c 68
tl: 18 fb: --H-F--- lb: 0x0 cc: 3
nrid: 0x01c223e4.72 //pointer to the overflow segment data block that contains the rest of the columns (chained row)
col 0: [ 2] c1 04 //mark1, mark2, mark3 columns included using the INCLUDING clause
col 1: [ 2] c1 06
col 2: [ 2] c1 09
row#116[1046] flag: K------, lock: 0, len=65
When queries such as the following refer only to the columns stored in the IOT segment and the predicates include the primary key columns, data retrieval is very fast, as all the required data is already present in the index leaf blocks:
select mark1, mark2 from temp where roll = ? and name = ?
select mark1, mark2 from temp where roll = ?
select mark1, mark2, mark3 from temp where name = ? //if the NDV of roll is very low, an index skip scan is possible; otherwise the optimizer may prefer an index full scan or index fast full scan. Remember that skip scans require more blocks to be read.
But when queries refer to columns that do not make up the primary key and are not included in the B-Tree structure, additional IO is required to read the corresponding data block from the overflow segment using the forwarding addresses stored in the index leaf blocks:
select mark4 from temp where roll = ?
select mark4, mark5, mark6 from temp where roll = ?
select * from temp where roll = ?
If queries frequently refer to the additional columns (part of the overflow segment) in the SELECT list or predicates of a SQL statement, the benefits of having an IOT subside gradually with the number of executions. In such cases, consider adding those columns to the index structure; if that is not possible (PCTTHRESHOLD may be reached, or creating a bigger block size tablespace doesn't make sense), consider converting the IOT into a normal heap table.
The statistic table fetch continued row is incremented for every row/column that is read from the overflow segment, which makes sense as the rows are chained to the overflow segment, as indicated in the IOT leaf block trace file. Due to the nature of an IOT and the way the data is stored in the overflow segment, chained rows are expected. To identify the number of rows or entries stored in the overflow segment, query DBA_TABLES.CHAIN_CNT or DBA_TAB_STATISTICS.CHAIN_CNT, but remember that CHAIN_CNT is not populated unless we run the analyze table … compute statistics command.
Even though we include certain columns as part of the B-Tree structure, these columns cannot be used for tree traversal. In other words, specifying these additional columns in the predicate without the primary key columns results in an index fast full scan. Consider the following table:
create table temp (order_id number, product_id number, customer_id number, details varchar2(40), status varchar2(30), primary key (order_id)) organization index including status overflow;
In the above table we have included all the columns as part of the index, with ORDER_ID being the primary key. Only queries that have ORDER_ID in the predicate result in index unique/range scans; specifying any other column will result in an index fast full scan, as shown below:
Depending on whether we choose to include all or only certain columns of the table in the B-Tree structure, Oracle creates up to two segments in the tablespace when we create an IOT:
- An index segment, to hold all the primary key columns and any additional columns specified with the INCLUDING clause in the table DDL. We can store all the columns of the table in the index structure if necessary, provided the maximum row size does not cross the PCTTHRESHOLD.
- An overflow segment, to hold the remaining columns (apart from the primary key columns and those covered by the INCLUDING clause).
Oracle internally uses the naming scheme SYS_IOT_TOP_<OBJECT_ID> / SYS_IOT_OVER_<OBJECT_ID> for the corresponding index and overflow segments. Consider the following CREATE TABLE statement (deferred_segment_creation = false):
create table RANDOMLOAD(roll number, name varchar2(40), mark1 number, mark2 number, mark3 number, mark4 number, mark5 number, mark6 number, primary key (roll, name)) organization index including mark3 overflow;
Notice that the IOT RANDOMLOAD doesn't have an associated DATA_OBJECT_ID; this is because the data is actually stored in the index and its associated overflow segment. Since the contents, or portions of the rows, are chained from the index segment to the overflow segment, even when a query reads the overflow segment or references only the columns present in the overflow segment, we will never see SYS_IOT_OVER_74542 or RANDOMLOAD in any of the explain plans. The following is an example explain plan:
SQL> select * from randomload;
Even though Oracle generates the names for both the index and overflow segments, and these are visible in views such as DBA_INDEXES/DBA_TABLES, we cannot refer to them directly in SQL statements (different from virtual columns); Oracle simply doesn't allow us to do that. It's not surprising that statements such as the following fail:
select count(*) from SYS_IOT_OVER_75889;
alter index SYS_IOT_TOP_75889 rebuild online;
alter index SYS_IOT_TOP_75889 move;
with errors such as
ORA-28650: Primary index on an IOT cannot be rebuilt
ORA-02243: invalid ALTER INDEX or ALTER MATERIALIZED VIEW option
ORA-25191: cannot reference overflow table of an index-organized table
IOTs serve a specific purpose: to retrieve data faster when the queries mostly refer to primary key columns in their predicates. Since the primary data segment of an IOT is the index segment, IOTs suffer the same fate as B-Tree indexes when it comes to index leaf block splits. When the leading column of an IOT is populated with a monotonically increasing value, always prefer a database sequence, and avoid batch inserts (the JDBC PreparedStatement.addBatch()/executeBatch() API) as this leads to more bloat. If the additional columns included in the B-Tree structure using the INCLUDING clause are updated regularly, this can result in 50-50 leaf block splits, adding more bloat to the IOT. The size of an IOT will be higher than that of a regular table. Due to this, DBAs should constantly monitor the space utilization of an IOT, either using the segment advisor or by running analyze index … validate structure (INDEX_STATS.PCT_USED); remember that analyze index … validate structure locks the table, preventing the application from performing DML.
As the data is stored based on primary key values, the column order of the primary key is very critical when it comes to exploiting features such as key compression to reduce the size of the IOT, thereby reducing the number of lookups required to read the corresponding index blocks. Until now we have only seen cases where the leading column of the index or IOT (primary key constraint) is unique. Let's look at a different case. Consider the following table:
create table test (
b1 number, //b1: 200 NDV
b2 number, //b2: almost unique
b3 number,
b4 varchar2(20),
primary key (b1, b2))
organization index compress 1 including b4 overflow;
In this case we can exploit the index key compression to reduce the
overall size of the index structure. For the predicates such as
b1 = ? //significant reduction in the number of buffer gets compared to a normal heap table
b1 = ? and b2 = ? //results in a single row since b1 and b2 make up the primary key
b2 = ? //skip scans possible, but usually a higher number of buffer gets
Oracle allows us to create bitmap and other secondary indexes on an IOT. As the rows in an IOT physically move from one block to another due to 50-50 index leaf block splits, the row movement feature must always be turned on; in fact, due to this behavior, disabling row movement for an IOT is not supported. As rows in an IOT lack a physical ROWID (the data is stored in an index), Oracle uses the concept of logical ROWIDs with guesses to support secondary indexes and locate the corresponding rows.
Secondary Indexes:
The DDL to create secondary indexes on an IOT is the same as for normal tables; the differences lie within the leaf blocks, in how the data is stored and organized. Consider the following table:
create table students(student_id number, department_id number, name varchar2(20), year number, primary key (student_id, department_id)) organization index including year overflow;
and the secondary index DDL as follows:
create index students_dept_idx on students(department_id);
In the above example, we have included all the columns of the table as part of the IOT structure. In order to uniquely identify a row in the IOT (due to the absence of a physical ROWID), a secondary index stores all the columns that make up the primary key of the IOT, plus a guess regarding the last known location of the row. A snippet of the students_dept_idx leaf block dump reveals the details:
row#94[6228] flag: K------, lock: 2, len=19
col 0; len 3; (3): c2 05 0c //department_id column
col 1; len 4; (4): c3 5f 05 0c //primary key column student_id
tl: 8 fb: --H-FL-- lb: 0x0 cc: 1
col 0: [ 4] 01 c0 2c 4b //last known physical location of the row
row#95[6209] flag: K------, lock: 2, len=19
Consider the following query:
select * from
students where department_id = 123;
The server process starts by traversing the students_dept_idx index to find the entries that match the predicate. Although a physical ROWID is not present, the last known location (the guess) lets the session read the corresponding IOT leaf blocks directly, without traversing the B-Tree structure to locate the rows. Only when leaf block splits occur at a later point in time, moving table rows to different locations, does Oracle end up traversing the index structure to find the corresponding row based on the primary key values obtained from the secondary index.
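The guess-then-fallback lookup described above can be sketched in a few lines. This is a toy model in Python, not Oracle internals; the block layout and helper names are invented for illustration:

```python
# Toy model of an IOT secondary-index lookup. Each secondary-index entry
# stores the IOT primary key plus a "guess": the data block address (DBA)
# where the row was last seen. If the guessed block still holds the key we
# read the row directly; otherwise we fall back to a primary-key traversal
# of the IOT B-tree.

def lookup_via_secondary(pk, guess_dba, leaf_blocks, traverse_by_pk):
    leaf = leaf_blocks.get(guess_dba)
    if leaf is not None and pk in leaf:
        return leaf[pk], "guess"          # guess valid: one direct block read
    # guess stale (row moved after a leaf split): walk the IOT by primary key
    return traverse_by_pk(pk), "traverse"
```

In this picture, PCT_DIRECT_ACCESS is simply the fraction of lookups that manage to take the "guess" path.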
Over time, due to IOT leaf block splits (50-50), entries in the index leaf blocks move to different leaf blocks; the physical ROWID guesses in the secondary indexes then become inaccurate and require additional IO to find the corresponding blocks, which increases the elapsed time of a query (though not significantly). DBA_INDEXES.PCT_DIRECT_ACCESS (1-100) indicates the percentage of valid guesses; it is recommended to refresh the guesses once PCT_DIRECT_ACCESS falls below 85% (this is not a hard threshold and depends on how the data is organized). Use the command alter index … update block references to update all the stale guess data block addresses stored in the secondary index.
Since the primary key columns are included in the secondary index, queries that reference only those columns will not require a lookup into the IOT; only references to the other columns stored in the IOT do. The following explain plans reveal the details.
select student_id from students where department_id = 123;
select name from students where department_id = 123;
How does a full table scan perform?
Even though an IOT is treated as a table, the data is actually stored in an index segment, with pointers to an optional overflow segment that is stored as a table. Despite the name Index Organized Table, an IOT should not be thought of as a table; think of it as an index with optional pointers to the overflow segment (chained rows). There is no such thing as a full table scan; a full scan is instead performed using an INDEX FAST FULL SCAN. Consider the following table:
create table RANDOMLOAD (roll number, name varchar2(40), mark1 number, mark2 number, mark3 number, mark4 number, mark5 number, mark6 number, primary key (roll, name)) organization index including mark3 overflow;
If we issue the following query:
select avg(mark1) from randomload;
Depending on the size of the table and various other factors, the server process will perform an index fast full scan using either db file scattered read or direct path read; the main point is that most of the IO is multiblock. It is when we reference columns stored in the overflow segment that things start to slow down: the row pieces stored in the overflow segment are chained to the IOT's primary B-Tree structure, so the full scan behavior changes completely in terms of IO. Consider the following query:
select avg(mark6) from randomload;
The above query results in a full scan. Oracle issues multiblock IO for the index segment, and because the rows are chained, it issues single block IO (db file sequential read; the blocks are not cached in the buffer cache) to read the contents of the overflow segment. If the table is very large, this increases the IOPS on the server considerably.
Rebuilding an IOT:
Over time, an IOT can bloat, and since we cannot reference the IOT's underlying index by name in SQL statements, the only way to rebuild the index structure of an IOT is the statement alter table … move online; to move the overflow segment, use alter table … move overflow.
The following are various restrictions of an IOT:
- All key columns must fit within the specified threshold.
- If the maximum size of the row exceeds 50% of the index block size and no overflow segment is specified, the CREATE TABLE statement fails.
- An IOT cannot have virtual columns, so function-based, reverse key, and descending indexes are not supported.
- The maximum number of columns that can make up the primary key is 32.
- The maximum number of columns in the index portion of the row is 255, including both key and non-key columns. If more than 255 columns are required, we must use an overflow segment (similar to the chained rows restriction).
- The maximum number of columns is 1000.
Index Access Paths:
B-Tree Access Paths:
Index Unique Scans:
The database performs a unique scan when a predicate references all columns of a unique key or primary key using an equality operator, such as where student_id = 123. Remember that the unique or primary key constraint must be enforced/policed by a unique index for the index unique scan to be used; if the constraint is enforced/policed by a non-unique index, Oracle performs an index range scan instead of an index unique scan (even though the lookups may not change, the CBC latches required, the redo/undo generation in case of a rollback, etc. increase).
Consider the following table:
create table students (student_id number primary key, name varchar2(30));
For queries such as the following, Oracle performs an index unique scan:
select * from students where student_id = ?
select * from students where student_id in (?,?,?)
select * from students where student_id = ? or student_id = ?
The explain plan will be as follows:
-------------------------------------------------------------------------------------------
| Id  | Operation                   | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |             |     1 |    35 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| STUDENTS    |     1 |    35 |     2   (0)| 00:00:01 |
|*  2 |   INDEX UNIQUE SCAN         | STUDENTS_PK |     1 |       |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------
There is a reason for specifying the SQL statements with predicates containing IN and OR clauses: Oracle considers both statements equivalent. The scan searches the index in order for the specified key. An index unique scan stops processing as soon as it finds the first record, because no second record is possible. The database obtains the ROWID from the index entry and then retrieves the row specified by that ROWID. An index unique scan with an equality predicate uses the optimization technique named consistent gets examination (fastpath), which requires only one CBC latch to read a buffer. For the predicates student_id in (?,?) and student_id = ? or student_id = ?, however, this optimization does not apply in the same way and works for the index leaf blocks alone. Remember that Oracle continuously refines these features, so this behavior can change.
The execution flow for a unique scan is as follows:
- Read the index root block.
- Read the branch block.
- Read the leaf block and obtain the corresponding ROWID.
- Based on the ROWID, read the table block.
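The steps above can be sketched as a toy in-memory B-tree lookup. This is a Python illustration with invented block structures (real index blocks obviously look nothing like Python dicts); the point it demonstrates is that one block is visited per level, strictly in order:

```python
import bisect

# Toy index unique scan: root -> branch -> leaf -> table block, in order.
# Each level must be read before the next target block is known, which is
# why these reads cannot be issued asynchronously in parallel.

def index_unique_scan(root, branches, leaves, table, key):
    reads = ["root"]
    sep, children = root                     # separator keys -> child blocks
    branch_id = children[bisect.bisect_right(sep, key)]
    reads.append(branch_id)
    sep, children = branches[branch_id]
    leaf_id = children[bisect.bisect_right(sep, key)]
    reads.append(leaf_id)
    rowid = leaves[leaf_id].get(key)         # stop at first match: key is unique
    if rowid is None:
        return None, reads                   # no table block visit on a miss
    reads.append(rowid[0])                   # rowid = (table block, slot)
    return table[rowid], reads
```

A successful lookup touches exactly four "blocks" (root, branch, leaf, table); a miss stops after the leaf.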
Storage/OS Level Perspective:
Unless the SQL statement includes an IN clause, an index unique scan always results in a single row. Since a predicate value yields only a single row, the server process makes use of the pread64 OS call to read a block from disk into memory. Don't be fooled into thinking that Oracle uses only pread64 calls: after a database restart, another mechanism can kick in which makes the session use readv system calls.
One might wonder why Oracle cannot utilize asynchronous IO while performing an index unique scan. Once the optimizer comes up with a plan that accesses the table through the unique index using INDEX UNIQUE SCAN, the server process executing the SQL statement performs the following:
- The server process reads the index root block (the block immediately following the header block – L3 BMB). Until it reads the root block, it cannot determine which branch block to read next, so there is no scope for asynchronous IO here; also note that this block can be either a root block or a leaf block (there are no branch blocks until the first leaf block allocated to the index splits).
- Reads the branch block to determine the correct leaf block; until it reads the branch block, it cannot determine the correct leaf block.
- Reads the leaf block to obtain the corresponding ROWID, or to determine whether the row even exists. Here checks are performed to see if there is an open ITL slot, whether the row we are looking for is locked by a particular transaction, and CR fabrication is done if required.
- If a ROWID is obtained, reads the corresponding table block. Various possibilities may arise here:
o Read the corresponding table block, performing a delayed commit cleanout if necessary.
o If there is an open ITL slot and the transaction is still in progress, read the undo transaction table and the corresponding undo block containing the changes, and fabricate a CR block.
o If the table block contains a forwarding address due to row migration, read the table block containing the corresponding row.
o If the SQL statement references columns stored in a different row piece and the row is chained across multiple table blocks, read those table blocks as well.
During the entire process described above, Oracle has no way of knowing which block to read in advance; it must follow the path step by step, so no asynchronous IO is used. Since an index unique scan results in only one row, the client's fetch size (default 15) is usually large enough not to cause additional latch activity while reading the block (this may not hold when the statement contains an IN clause with a list larger than the fetch size). Remember that since it's not possible to accurately determine the next leaf block beforehand, and the branch blocks contain only the boundaries, db file parallel read on index leaf blocks is nearly impossible.
Index Range Scan:
An index range scan occurs when the SQL statement contains predicates such as =, <, <=, >, >=, between, in, or like. Remember that a range scan occurs even for unique or primary key indexes in the following situations:
- The predicate refers to only some columns of a multi-column unique or primary key. For example, if the primary key is (student_id, admission_id), then referring to student_id alone in the query (where student_id = ?) results in a range scan.
- The SQL statement contains range-based predicates such as >, >=, <=, <, between, and like.
For an index range scan, multiple values are possible for the index key (if the index is not unique, Oracle uses an index range scan even when the predicate results in only one row). By now we know that a (non-unique) index entry consists of a column value and its corresponding ROWIDs. By default, when an indexed column value maps to multiple rows, the index stores the column value along with the corresponding ROWIDs in ascending order. Let's start with the basics.
Single Column Indexes:
Equality Predicates:
Consider the following table and query:
create table students (student_id number, department_id number, student_name varchar2(20));
create index students_dept_idx on students (department_id);
Since students_dept_idx is a non-unique index, an index key value can map to multiple rows. We may not realize it yet, but storing the ROWIDs in sorted order offers several benefits. When executing queries with equality predicates, because the ROWIDs are stored in sorted order, the server process reads/accesses each table block only once (we will subsequently find out why this is not always the case). For example, consider the following query:
select * from students where department_id = 1679;
This is clearly evident from the following:
select roll, rowid, DBMS_ROWID.ROWID_TO_ABSOLUTE_FNO(rowid, 'VISHNU', 'STUDENTS') "FILE_ID", dbms_rowid.rowid_block_number(rowid) "BLOCK" from students where department_id = 1679;
The corresponding SQL trace (10046 level 12) is as follows (a very bad clustering factor): 3 index blocks (root block, branch block, leaf block) and 13 different table data blocks.
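The effect of the clustering factor on table-block visits can be shown with a toy count. In this Python sketch, ROWIDs are reduced to (block, slot) pairs for illustration:

```python
# Count how many times the server process has to move to a different table
# block while walking a list of ROWIDs in index order. With a good
# clustering factor, the sorted ROWIDs hit each block once; with a bad one,
# almost every row lands on a different block.

def table_block_visits(rowids):
    visits, last_block = 0, None
    for block, _slot in rowids:
        if block != last_block:
            visits, last_block = visits + 1, block
    return visits
```

For 13 rows spread across 13 blocks this counts 13 visits; for 13 rows packed into one block it counts just 1, which is exactly the difference the two traces show.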
Now consider the same scenario when all the matching table rows reside in the same block:
The corresponding SQL trace is as follows:
Before proceeding further, we need to understand a few very important concepts regarding the index range scan. When a query contains an equality predicate and the column value maps to multiple rows spanning multiple index leaf blocks, the server process always works from the left-most index leaf block containing the value to the right-most one in sequence, in order to retrieve all the corresponding rows from the table. The same holds true for range-based predicates (>, <, <=, >=, between); the case is different for in and or, as we will soon see. Here the left-most leaf block contains the minimum value and the right-most leaf block contains the highest value. Since the index leaf blocks are linked together in a doubly linked list, Oracle can also scan them in descending order, from the right-most index leaf block to the left-most; we will see this in the INDEX RANGE SCAN DESCENDING section.
Consider the following query:
select * from students where department_id = 1679;
The corresponding 10046 level 12 trace is as follows (don't be surprised by the number of index blocks read; the index was bloated on purpose to demonstrate this).
If we dump the leaf blocks we get the chain as follows:
kdxlenxt 29459780=0x1c18544 --> 99652
kdxlenxt 29459781=0x1c18545 --> 99653
kdxlenxt 29459782=0x1c18546 --> 99654
kdxlenxt 29459783=0x1c18547 --> 99655
kdxlenxt 29459784=0x1c18548 --> 99656
A treedump of the index reveals the details:
branch: 0x1c1854a 29459786 (12: nrow: 17, level: 1)
   leaf: 0x1c18537 29459767 (-1: row:13.13 avs:1158)
   leaf: 0x1c18539 29459769 (0: row:13.13 avs:1158)
   ....
   leaf: 0x1c18542 29459778 (9: row:13.13 avs:1158)
   leaf: 0x1c18543 29459779 (10: row:13.13 avs:1158)
   leaf: 0x1c18544 29459780 (11: row:13.13 avs:1158)
   leaf: 0x1c18545 29459781 (12: row:13.13 avs:1158)
   leaf: 0x1c18546 29459782 (13: row:13.13 avs:1158)
   leaf: 0x1c18547 29459783 (14: row:13.13 avs:1158)
   leaf: 0x1c18548 29459784 (15: row:13.13 avs:1158)
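The left-to-right walk over the kdxlenxt chain can be modeled as follows. This is a Python sketch; each leaf is simplified to a tuple of sorted entries plus a next-pointer:

```python
# Toy range scan over the leaf-block chain. Each leaf holds sorted
# (key, rowid) entries plus a kdxlenxt-style pointer to the next leaf.
# Starting from the left-most leaf that can contain the low bound, the
# scan follows the chain and stops as soon as a key exceeds the high
# bound -- the branch blocks are never revisited.

def range_scan(leaves, start_leaf, lo, hi):
    rowids, leaf = [], start_leaf
    while leaf is not None:
        entries, kdxlenxt = leaves[leaf]
        for key, rowid in entries:
            if key > hi:
                return rowids             # past the range: stop the scan
            if key >= lo:
                rowids.append(rowid)
        leaf = kdxlenxt                   # follow the chain to the next leaf
    return rowids
```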
Predicates containing IN and/or OR:
Oracle uses the In-List iterator whenever the predicate contains an IN clause with literal values, such as the following:
select * from students where student_id in (4,2,1,3);
If these values are instead fed by another SQL statement, the resultant plan becomes a join and the In-List iterator is not used:
select * from students where student_id in (select a.roll from temp a, temp2 b where a.roll = b.roll);
with tempa as (select a.roll from temp a, temp2 b where a.roll = b.roll) select * from students where student_id in (select roll from tempa);
Whenever the predicate contains an IN clause, Oracle internally expands the statement as follows and estimates the selectivity of each individual value specified in the IN list; this is evident in a 10053 trace as well.
Final query after transformations:******* UNPARSED QUERY IS *******
SELECT … FROM "VISHNU"."STUDENTS" "STUDENTS" WHERE "STUDENTS"."STUDENT_ID"=1 OR "STUDENTS"."STUDENT_ID"=2 OR "STUDENTS"."STUDENT_ID"=3 OR "STUDENTS"."STUDENT_ID"=4
The execution plan is as follows:
There are several steps involved before the final query is generated post transformations, and these are as follows:
- Eliminate duplicate values in the list using various optimizations.
- Sort the values in ascending order (regardless of whether the index is ascending or descending).
One might wonder why no HASH UNIQUE or SORT ORDER BY step is displayed in the explain plan; this is because this optimization is performed even before the final query (after transformations) is generated and the costs of index or table access are calculated. We will eventually see why this trivial (insignificant) detail is mentioned here. The same behavior applies irrespective of the underlying index. Sorting the values in the IN clause has several benefits: if the values are close enough, it can reduce the number of latches required, since otherwise the same branch/leaf blocks would be accessed multiple times over the course of the execution.
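The two normalization steps above, and the latch-saving effect of probing in ascending order, can be sketched in Python (the key-to-leaf mapping here is a made-up toy, not how Oracle assigns keys to blocks):

```python
# The in-list normalization: drop duplicates, sort ascending.
def normalize_in_list(values):
    return sorted(set(values))

# Why sorted probe order helps: adjacent values tend to fall in the same
# leaf block, so consecutive probes can reuse the block just visited
# instead of taking a fresh latch on it.
def leaf_visits(values, keys_per_leaf):
    visits, last = 0, None
    for v in normalize_in_list(values):
        leaf = v // keys_per_leaf         # toy key -> leaf-block mapping
        if leaf != last:
            visits, last = visits + 1, leaf
    return visits
```

With close values such as (1,2,3,4) all four probes land in one leaf; widely spread values cost one visit each.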
There is an important aspect that a DBA/developer should be aware of with regard to the IN or OR clause when working with very close or contiguous values.
Consider the following statement:
select * from students where student_id in (1,2,3,4,5,6,7,8);
The corresponding execution plan is as follows:
If we observe closely, Oracle didn't use batched table access, as it knows that each value in the IN clause results in only one row. The statistics in terms of average latch gets per execution are as follows:
The average number of CBC latch gets for each execution is 29. Now consider the same statement with the predicate modified as follows:
select * from students where student_id between 1 and 9;
The average number of CBC latch gets for each execution is 21. The difference in latch gets can grow considerably depending on the following factors:
- Lower clustering factor.
- Number of values specified in the IN clause.
- Average row length – short or long tables.
- Block size.
- Extent sizes.
Now that we have seen the behavior in terms of latch gets, let's look at a few aspects specific to the IN clause. An ORDER BY clause on an indexed column doesn't necessarily mean that Oracle avoids a sort; it can still use SORT ORDER BY to sort the resultant rows. This is especially true when it comes to joins – even if the ORDER BY clause refers to columns that belong to the outermost table in the join.
Consider the following tables:
create table students (student_id number, name varchar2(20), primary key (student_id));
create table marks (student_id number, subject_id number, mark1 number, primary key (student_id, subject_id));
Oracle uses additional steps such as SORT UNIQUE / HASH UNIQUE to eliminate the duplicate values and, if possible, the need to sort the final results from the outermost query (students in our case). Consider the following statement:
select * from students where student_id in (select student_id from marks where mark1 = 4) order by student_id;
In this case, since the primary key on the marks table is (student_id, subject_id), Oracle has no way to determine whether the rows returned by the subquery contain duplicate values, so it uses SORT UNIQUE to both sort and eliminate the duplicate values returned by the subquery. Since those values are already sorted, Oracle can skip sorting the results obtained from the students table again. Consider the following statement:
select * from students where student_id in (select a.student_id from marks a, departments b where a.student_id = b.student_id and a.mark1 > 90) order by student_id;
Even though students.student_id is indexed, Oracle performs a sort of the values at the end: because the HASH UNIQUE step used to eliminate the duplicates can return the values in any order, Oracle must effectively sort at the end to satisfy the ORDER BY clause. This case demonstrates that, in the presence of an ORDER BY clause, Oracle may still perform a sort at the end even if the column is indexed; we will see more about this in Joins, and how a non-unique column used as a join condition can result in a sort.
If queries depend on an IN clause, developers often make the mistake of creating separate PreparedStatements, generating one per number of entries in the IN clause, such as the following:
select * from students where student_id in (?);
select * from students where student_id in (?,?);
select * from students where student_id in (?,?,?);
select * from students where student_id in (?,?,?,?);
This often results in multiple SQL_IDs (more SQLs in the shared pool), which is not optimal. A simple fix is to use a subquery in the IN clause that makes use of either a REGEXP function or a PL/SQL block that produces multiple values from a single bind variable. Using dual is far better than using a custom PL/SQL block to parse the values. For example, the above statements can be converted as follows:
select * from students where student_id in (select regexp_substr(:bindval, '[^,]+', 1, level) from dual connect by regexp_substr(:bindval, '[^,]+', 1, level) is not null);
Similar approaches can be employed; the only downside is that Oracle tends to use HASH UNIQUE (which merely eliminates the duplicates) rather than SORT UNIQUE (which not only eliminates duplicate values but also sorts them). Unless the values are extremely close (numerically), the resulting increase in the number of latches required to complete the execution may not be significant.
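What the regexp_substr/CONNECT BY trick computes can be modeled client-side as a simple split. In this Python sketch, the empty-fragment filter mirrors the `is not null` condition in the CONNECT BY:

```python
# Split a single comma-separated bind value into the individual IN-list
# members, skipping empty fragments just as the "is not null" filter does.
def split_in_list(bindval):
    return [v for v in bindval.split(",") if v]
```

The application binds one string ("1,2,3") regardless of how many members the list has, so one SQL_ID serves every list length.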
Range Based Predicates (>, <, <=, >=, between):
An index range scan is an ordered scan of values. A predicate can be bounded on both sides or unbounded on one or both sides, and the values that fall in the range are usually continuous (numbers – the data type can be integer, float, etc.). We already know that unless the index is a descending index, Oracle stores column values and their corresponding ROWIDs in ascending order. How does the behavior differ from equality predicates? It is essentially the same as performing a range scan with an equality predicate on a non-unique index: Oracle starts by scanning the root block, reads the corresponding branch blocks that cover the range, and begins scanning from the left-most index leaf block holding the minimum value until it reaches the right-most index leaf block holding the maximum value; it does not go back to the branch blocks or access them again. There is a reason for reiterating this – refer to the equality predicates discussion.
For example, when we issue a query with the predicate student_id < 10, Oracle uses a range scan to return the rows sorted by the index keys 0, 1, 2, 3, etc. Unless ORDER BY <INDEXED_COLUMN> DESC is specified in the query, Oracle always starts scanning from the left-most index leaf block to the right-most index leaf block that satisfies the range specified in the predicates. Consider the following query:
select * from students where student_id < 10;
For demonstration purposes the index is bloated purposefully using the following DDL:
create unique index students_idx on students (student_id, '<random junk 2500 bytes in length>');
The corresponding trace is as follows:
From the trace we can notice that the scan started at index leaf block 476 (which held the minimum student_id value) and continued until block 479 (which held the maximum student_id value). There is a reason for highlighting the wait events that correspond to reading the table blocks (conventional, prefetching, batching); we will look into this shortly. We can see the following chain from the index leaf block dumps and the treedump.
kdxlenxt 29360605=0x1c001dd (kdxlenxt points to the next index leaf block in the chain)
kdxlenxt 29360606=0x1c001de
kdxlenxt 29360607=0x1c001df
kdxlenxt 29360608=0x1c001e0
Index treedump:
branch: 0x1c001db 29360603 (0: nrow: 100, level: 2)
   branch: 0x1c00bf4 29363188 (-1: nrow: 734, level: 1)
      leaf: 0x1c001dc 29360604 (-1: row:3.3 avs:1951)
      leaf: 0x1c001dd 29360605 (0: row:3.3 avs:1951)
      leaf: 0x1c001de 29360606 (1: row:3.3 avs:1951)
      leaf: 0x1c001df 29360607 (2: row:3.3 avs:1951)
      leaf: 0x1c001e0 29360608 (3: row:3.3 avs:1951)
      leaf: 0x1c001e1 29360609 (4: row:3.3 avs:1951)
Even though Oracle stores entries in the index leaf blocks in ascending order of the column value and retrieves the corresponding rows from the table accordingly, that doesn't mean the results of the SQL statement, or the rows the application fetches, arrive sorted by the predicate column. For example, the following statements have identical plans, but the order in which the application sees the table rows may not be the same for the first SQL, as it depends on various factors that we will discuss shortly.
select * from students where student_id < 10;
select * from students where student_id < 10 order by student_id;
Let’s look at the corresponding execution plans:
select student_id, name from students where student_id < 100
Plan hash value: 3459646009
select student_id, name from students where student_id < 100 order by student_id
Plan hash value: 1738957801
Do observe the table access methods used in the two cases – TABLE ACCESS BY INDEX ROWID BATCHED (which can be batched, prefetch, or generic) and TABLE ACCESS BY INDEX ROWID (which can be prefetch or generic). We will discuss the batched, prefetch, and generic access methods in the subsequent sections.
The situation is different when we select only the indexed columns and the execution plan results in either an INDEX RANGE SCAN or an INDEX FULL SCAN without fetching results from the table (no TABLE ACCESS BY INDEX ROWID): the server process always follows the exact, precise order when reading the index blocks, reading the index leaf blocks and the corresponding entries in order, one by one.
select student_id from students where student_id < 100;
Plan hash value: 3503402594
select student_id from students where student_id < 100 order by student_id;
Plan hash value: 3503402594
We can confirm the results using the following query:
with temp as (select /*+ MATERIALIZE */ rownum "ROLL", student_id from students where student_id < 1000), temp1 as (select /*+ MATERIALIZE */ rownum "ROLL", student_id from students where student_id < 1000 order by student_id) select count(*) from temp a, temp1 b where a.roll = b.roll and a.student_id = b.student_id;
This holds true only for single-table lookups, and only in specific cases for multi-table joins. When the SQL statement joins multiple tables, we must specify the ORDER BY clause to retrieve the results in ordered/sorted form. The order in which the rows are retrieved may also differ when the optimizer chooses an INDEX FAST FULL SCAN.
Another interesting point: if an index leaf block contains several column values and all of them satisfy the predicate (>, <, between, etc.), Oracle effectively batches the table block reads, moving on to the next index leaf block even before fetching the table blocks for the last few entries of the current leaf block.
Situations in which a SQL execution has higher buffer gets / consistent gets than physical reads:
How Application Result Set Fetch Size affects Buffer Gets / CBC latches:
When a SQL statement is executed, factors such as the JDBC fetchsize or SQL*Plus arraysize have a significant effect on the number of buffer gets (and consequently CBC latches) required for each execution, depending on whether the query uses TABLE ACCESS BY INDEX ROWID/BATCHED or just an INDEX RANGE SCAN (index-only scan). The extent of the impact depends on factors such as the following:
- Clustering factor of the index.
- Whether the PGA is used to store the results – during sorting, aggregate functions, etc.
- Index-only scans.
- The table and index structure, to which the following contribute:
o Table average row length and whether the table is short.
o Datatype of the indexed column and/or the free space status of the index leaf blocks.
- How the data is loaded into the table.
An INDEX UNIQUE SCAN, which usually results in a single row, is not affected by this; but if the predicate contains an IN clause with many values, we can expect elevated buffer gets and latch activity (the database still performs an INDEX UNIQUE SCAN for each value of the IN clause). This situation can occur for equality and range-based predicates on single-column or multi-column indexes whenever a set of predicates returns more rows than the fetch size and the server process doesn't use the PGA to store intermediate results.
Consider an ideal scenario in which the clustering factor of the index equals the number of blocks in the table and the table is short (there is a reason for specifying the clustering factor here, and we will soon see why). The following statement retrieves 999 rows from the table, starting from student_id = 1:
select * from students where student_id < 1000;
The 10046 level 12 trace is as follows. (During an index range scan, Oracle employs various optimizations to retrieve the data more efficiently, such as prefetching, batching, and buffer warmup, which can result in additional blocks being read into memory. To depict the exact number of blocks read, prefetching, batching, buffer warmup, etc. were disabled for the following trace. We will soon see why V$SQLAREA.BUFFER_GETS, or the consistent gets from autotrace, cannot be relied upon to accurately depict the number of data or index blocks read from disk.)
And the explain plan is as follows:
As we can see from the above, the SQL statement results in 14 blocks read from disk into memory. Executing the SQL statement from the application with various fetch sizes gives us the following statistics:
We can clearly infer that the number of buffer gets increases considerably as the application's resultset fetch size decreases. The execution statistics for SQL_ID 4cxpytntkdb9z are optimal because the application-level fetch size (the SQL*Plus arraysize) is 1000, and the buffer gets are precisely what we would expect, as the client fetches the entire result set at once. Ignore the CPUmsPerX and ElapmsPerX statistics for now; we will go through them in the network tuning section.
When a query results in more rows than the client's fetchsize (or SQL*Plus arraysize), the client receives the results in batches. For example, if the client's fetchsize is only 10 and the query results in 100 rows, the client fetches the results 10 times over the course of the SQL execution. As soon as the execution produces 10 rows, the server process halts – more precisely, pauses – the SQL execution, releases all the CBC latches, and switches context to sending the results to the client. Once the results are sent, it resumes from the point where it paused and reacquires the latches to access the data/index blocks or buffers again. Another way of looking at this: Oracle uses a dedicated server process for each client connection, and as /proc/<server_process_pid>/status shows, each process runs with only one thread; unless a process is multi-threaded or uses techniques such as asynchronous APIs, it cannot perform two activities at once.
Consider the following SQL trace:
As we can see from the sections highlighted in red, the server process reads the table blocks, sends the data to the client, and goes back to reading the data again from where it left off (here the fetchsize is small – the JDBC driver's default); now enter prefetch/batched table access, which reads multiple table blocks at once and subsequently sends the results to the client.
So far we have seen how the resultset fetchsize and/or SQL*Plus arraysize affects the buffer gets (and inherently the CBC latches) for queries executed using TABLE ACCESS BY INDEX ROWID/BATCHED. Index-only scans are also affected, as there is a high chance that the leaf blocks contain many entries. Consider the following SQL statement:
select student_id from students where dept_id = 123;
and the corresponding index as:
create index stud_dept_std_idx on students (dept_id, student_id);
The explain plan is as follows:
Snippet of 10046 level 12 trace of the above SQL is as follows:
From the application, executing the above SQL statement with various
fetch sizes will give us the following statistics:
As we can see from the above, buffer gets increase with the number of
fetches. Unless configured otherwise, the Oracle JDBC fetchsize defaults to 10. Increasing
the fetchsize to an arbitrarily high value has its own consequences: as more data is sent in
one fetch operation, sessions end up waiting on SQL*Net more data to
client, so make sure the SDU sizes etc. are configured properly. We will see more
on this when we deal with SQL*Net wait events.
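To make the arithmetic concrete, here is a minimal illustrative sketch (not Oracle code): the number of fetch round-trips, and hence the number of times the server process pauses and re-acquires CBC latches to resume the scan, is roughly the row count divided by the fetchsize, rounded up.

```java
// Illustrative only: fetch round-trips ~= ceil(totalRows / fetchSize).
public class FetchTrips {
    static long roundTrips(long totalRows, int fetchSize) {
        return (totalRows + fetchSize - 1) / fetchSize; // ceiling division
    }
    public static void main(String[] args) {
        // 1900 rows with the JDBC default fetchsize of 10 -> 190 round-trips;
        // raising the fetchsize to 256 cuts that to 8.
        System.out.println(roundTrips(1900, 10));
        System.out.println(roundTrips(1900, 256));
    }
}
```

Each round-trip is one pause/resume cycle on the server side, which is why a larger fetchsize reduces buffer gets for the same resultset.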
Oracle recommends setting the resultset fetchsize to 256, but this recommendation may not
suit every workload; we will soon see why. The following are the situations
in which the resultset fetchsize has minimal to no effect on the number of buffer gets of a SQL
statement.
-
Optimizer decides to use SORT ORDER BY to sort the results in which case the rows obtained
from the SQL execution are initially stored in the session’s PGA and then
sorted based on ORDER BY clause
-
Use of aggregate functions or clauses such as GROUP BY, AVG, SUM,
COUNT, DISTINCT, etc.
-
The effect of reacquiring the latches decreases as the
row size increases, but a low fetchsize eventually
results in more IOPS on the server.
-
Since index leaf blocks can accommodate many
entries, there is a high chance that a leaf block will be accessed
multiple times (provided an indexed column value results in multiple rows).
Considering 90-10 and 50-50 splits, rebuilding an index that had many 50-50
leaf block splits will elevate the number of fetches, as each leaf block
now accommodates more rows. In such cases it is better not to rebuild the index,
or increasing the PCTFREE of the index should help; but
remember that increasing PCTFREE reduces
the efficiency of the buffer cache, as the blocks contain more free space.
We have now seen how fetchsize or arraysize affects buffer gets and latches from the
database perspective. Let's look at things from the application perspective and
how the default value of 10, or a low fetchsize, helps reduce buffer gets and latches. Before
proceeding further, we need to understand the concept of partial query
execution.
Consider the following query
select * from products where name like 'SSD%';
snippet of 10046 level 12 trace for the above statement is as follows:
The execution statistics for the above statement are as follows:
Here the clustering factor of the index is very high, and the buffer gets
mentioned here are the actual number of buffers accessed by the SQL statement. We
can expect the statistics to be the same each time we execute the above SQL
statement, provided the fetches per execution is only 1, i.e. the client
fetches all the rows in one operation (one fetch of the resultset). But this is not
always the case. An SQL execution can be either
-
Full or
-
Partial
Full execution occurs when an SQL statement executes and all the
corresponding rows that result from that execution are sent to the client. Partial
execution occurs when the client closes the cursor after fetching only a certain
number of rows, because the requirement of the workflow is satisfied. For example,
if you shop for an SSD on a retail site such as eBay or Amazon, you
perform a search with the keyword SSD and the subsequent page
displays the first 10 or 40 results, depending on whether you are using a web
browser or a mobile app; once you scroll down or move to the next page, the
next set of results is displayed. The same happens when you search the tickets in
your team's queue in Remedy or any other ticketing tool. You will most likely
find the SSD you want in the first 1-3 pages, and it is highly
unlikely that you browse through all the pages. Even though the SQL statement
shown above returns over 1900 rows, the chances are slim that
you look past the first 200 results before making a purchase. Considering the
above, if you select a product from the first page itself, there
is no point in executing or fetching the rest of the results. In this case the
application closes the cursor and the SQL is said to be partially executed
(though successfully executed).
Suppose the fetchsize in this case is 40 and you select
a product from the first page of search results.
The execution statistics will be as follows:
If you observe closely, the metrics elapsed time per execution, CPU per
execution, and buffer gets per execution have reduced significantly, and the client
fetched only once: the first 40 rows from the SQL. Another important metric that
we should observe closely is V$SQLAREA.END_OF_FETCH_COUNT (or V$SQL.END_OF_FETCH_COUNT).
This value is incremented on each full execution and is not incremented for
partial executions. If END_OF_FETCH_COUNT/EXECUTIONS is less than 1, the SQL is being
executed partially. In such cases, applications benefit from reduced
response times and the database can service more clients.
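Partially executed statements can be spotted directly in the shared pool by comparing these two columns; a sketch (to be run against a live instance):

```sql
select sql_id, executions, end_of_fetch_count,
       round(end_of_fetch_count / nullif(executions, 0), 2) as full_exec_ratio
from   v$sql
where  end_of_fetch_count < executions;
```

A ratio well below 1 flags statements whose cursors are routinely closed before the last row is fetched.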
Now suppose you browse through the first 3 pages, which is unlikely in most
cases, resulting in 120 rows; the execution statistics will be as follows:
This is the typical pattern in many front-end and a few back-end
applications. Whenever you are troubleshooting a particular SQL statement,
always be on the lookout for the metric END_OF_FETCH_COUNT. The same query can become incredibly slow at
times even with the same bind variables. One might wonder why. The reason is
that, by the time the client fetches the next set of rows, blocks
that are already in the cache may have been modified, resulting in CR
block fabrication.
This may raise another thought: why can't we simply re-execute the query with features such as FETCH FIRST 5 ROWS ONLY, or
limit the result set using clauses such as ROWNUM <
40? This is because
-
We would have to execute the SQL statement each time, and
keeping track of the records already fetched requires additional work and is
highly inefficient.
-
Depending on operations such as GROUP BY, ORDER BY,
etc., each execution of the query may not return the same rows in the
same order.
Now that we know the basics of partial execution and how buffer gets are
affected, let's look at various aspects from a DBA perspective:
As soon as the session returns the first batch of results, the session status
becomes INACTIVE, and its state indicates the true nature: waiting on SQL*Net message
from client until the client or
application requests another batch of results.
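Such sessions are easy to spot; a sketch against V$SESSION:

```sql
-- Sessions that have returned a batch and are idle waiting for the client
-- show STATUS = 'INACTIVE' on the wait 'SQL*Net message from client'.
select sid, status, event, seconds_in_wait
from   v$session
where  event = 'SQL*Net message from client';
```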
This has some rather interesting implications; let's look at each. Due to
the nature of partial execution, it is recommended only on tables
that contain static data. If the rows in the table constantly change, then by the time
the client fetches another batch of results the blocks may
already have changed, and fresh CR copies of the blocks have to be constructed. This is
one of the cases in which a session ends up creating CR versions of the same
block multiple times before it finishes execution. One might wonder: if a
CR version of the block is already present in memory, why does the session
need to build another copy of the same block? It depends on various factors
such as the following:
-
An inadequately sized buffer cache.
-
Remember that a freshly constructed CR block has a
touch count of 1 and is highly likely to be flushed from the buffer cache;
flushing here refers to invalidating the buffer and reusing it to
store a new block coming in.
If you still need the benefits of partial execution even though the
table contents are modified heavily, you can recommend one of the following:
-
Use an ORDER BY clause to temporarily store the results,
in which case the session returns the results from the PGA instead of accessing
the blocks over and over again.
-
Use data structures such as ArrayLists at the
application level to hold the results, instead of fetching them over and over again.
In both cases the query results in a full execution, and you will not
reap the benefits of partial execution. Note that V$SQL.USERS_EXECUTING
(also in V$SQLAREA) does indicate the number of users
currently executing the SQL statement even though the status of the session is
INACTIVE.
Clustering Factor and other
factors that influence buffer gets/latches:
When dealing with range-based predicates, there is a chance that a
server process (user session) ends up reading/accessing the same table blocks multiple
times over the course of SQL execution. This occurs when one or a combination of
the following cases is true:
-
The way the data is loaded into the table and the number
of sessions inserting the data.
-
When the predicates contain bounded or unbounded ranges, or IN
clauses.
-
When extent sizes are small.
-
When the average row length of the table is small
and/or table is small.
-
NDV and distribution of the column plays a vital role.
-
The predicate refers to the leading column of a multi-column
index.
Consider the following query, where student_id is the
primary key (single-column index):
select * from students where student_id between 30 and 40;
10046 level 12 trace is as follows:
As you can see, all the rows that match the predicates are contained in
only two table blocks, and the corresponding index entries are stored in only one
index leaf block. The execution statistics for the above SQL are as follows:
We have 14 buffer gets even though the client fetches the entire resultset at once. Now let's
look at how the data is organized in the index leaf block.
select student_id,
       dbms_rowid.rowid_to_absolute_fno(rowid, 'VISHNU', 'STUDENTS') "FILE_NO",
       dbms_rowid.rowid_block_number(rowid) "BLOCK_NO",
       rowid
from   students
where  student_id between 30 and 40;
STUDENT_ID    FILE_NO   BLOCK_NO ROWID
---------- ---------- ---------- ------------------
        30          5       1038 AAASPNAAFAAAAQOAAM
        31          5        469 AAASPNAAFAAAAHVAAJ
        32          5       1038 AAASPNAAFAAAAQOAAN
        33          5        469 AAASPNAAFAAAAHVAAK
        34          5       1038 AAASPNAAFAAAAQOAAO
        35          5        469 AAASPNAAFAAAAHVAAL
        36          5       1038 AAASPNAAFAAAAQOAAP
Since adjacent entries in the index leaf block point to different table
blocks, Oracle accesses each block multiple times; remember that each time it
accesses a block in the buffer cache, it has to acquire the CBC latches.
When the predicates refer only to the leading column of a multi-column
index, things are a little different. Consider the following table and the
corresponding index:
create table orders (order_id number, store_id number, item_id number, volume number);
create index orders_store_items_idx on orders(store_id, item_id);
and the SQL Query
select sum(volume) from orders where store_id = 12312;
In this case it is certain that Oracle will not perform any repeated reads of
the index blocks, as it walks from the left-most relevant index leaf block
to the right-most; but it can read the same table data block over
and over again. For example, a seller on Amazon or eBay gives heavy discounts
on a few products to clear piled-up stock in the warehouse, and the stock is
limited. After noticing the reduced prices, customers start ordering. In this
case, at the database end, inserts will primarily contain the store_id of that particular
seller. Consider the following scenario:
Session A inserts (store_id, item_id) = (12312, 1651) into table block 416.
Session B inserts (store_id, item_id) = (12312, 1651) into table block 417.
Session A inserts (store_id, item_id) = (12312, 1652) into table block 416.
Session B inserts (store_id, item_id) = (12312, 1652) into table block 417.
From the Oracle perspective, since index entries are stored in ascending order
of the indexed column values and, within each distinct combination of values,
the ROWIDs, Oracle ends up reading or accessing the same data blocks
416 and 417 again as it processes the entries from the left-most index leaf
block to the right-most index leaf block. Let's see this in action:
The execution plan and the corresponding statistics are as follows:
Notice the mismatch between the number of physical reads and consistent reads;
this is because Oracle accesses the same table blocks over and over again.
The snippet of the SQL trace is as follows (entries are truncated as the
trace is long):
We can clearly observe this behavior from the above 10046 level 12
trace:
The session starts by reading the index root block and branch block, then
works through the index from the left-most leaf block to the right-most
leaf block, reading the corresponding rows from the table blocks. By the
tail section of the trace, only index blocks appear and no table blocks;
this is because all the table blocks were
already brought into memory by previous IO requests. Remember that
a 10046 trace contains entries each time a block is read
from disk, not from memory.
In such cases an index on the store_id column alone significantly reduces the number
of buffer gets. To demonstrate this, let's drop the old multi-column index and
create a single-column index with just store_id as follows:
create index store_id_idx on orders(store_id);
The execution plan for the statement is as follows:
The number of consistent gets reduced by a factor of 82X. Remember that this behavior is
common only in certain application types in which the data load is time
dependent, such as a shift in which only a particular department works, or
tickets raised to a particular department during outages.
CR block fabrication:
Whenever a user or application issues a DML statement against the database, Oracle
modifies the contents of the table block and the corresponding index blocks. This
modification is done even before the user/application commits the transaction.
(Oracle maintains the SCN at the block level, not at the row level.) Each data
block has a header portion that contains ITL (Interested Transaction List) slots
holding information about transactions: whether in progress,
committed, etc. The following is a snippet of a table block dump.
The LCK portion indicates the number of rows locked by a transaction and the
FLAG indicates the status of the transaction. From the above, we can see that the
transaction is still in progress (Flag empty). Unless we specified MAXTRANS
during the initial table creation, Oracle can add as many ITL entries in the
header as needed, provided there is enough space in the block and
multiple sessions concurrently update different rows in the same block, up to
a maximum of 255 (it is unlikely to have 255 transactions modifying the same
block).
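Block dumps such as the one shown above can be produced for any block; a sketch, with hypothetical file and block numbers (locate the block of interest first with dbms_rowid, as shown elsewhere in this chapter):

```sql
-- The dump is written to the session's trace file; its ITL section
-- shows the XID, UBA, and Flag for each slot.
alter system dump datafile 5 block 1038;
```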
The XID for a transaction (in the format stored in the block) can be
obtained as follows:
select to_char(xidusn, 'XXXXXX'), to_char(xidslot, 'XXX'), to_char(xidsqn, 'XXXXXXXX')
from   v$transaction
where  addr in (select taddr from v$session where username = 'VISHNU');
The rows (in the block) that are locked by the transaction will have the
flags set as follows:
lb → indicates the ITL slot.
The UBA, or Undo Block Address (in hex format), section contains the details
of the undo block that contains the undo entries (transaction table). Oracle
can sometimes write a dirty block with uncommitted changes to disk as a
result of one or a combination of the following situations:
-
manual checkpoint
-
memory pressure (buffer cache)
-
if the transaction is very large.
-
Incremental checkpoint or a checkpoint during a log
file switch.
Since the block is modified and no longer contains the original (committed)
data, and yet the transaction that modified the data is not yet
committed (the session may be stuck, or the transaction is very large), a session
that issues a SELECT statement will not be able to see the
most recent committed data (or the data as of the SCN or point in time
at which the SELECT statement started, in the case of a long-running query).
The session ends up cloning the block and rolling back the changes applied
to it, thereby creating a consistent read copy of the block. We have already
seen the path of an insert or update before.
Before proceeding further, you need to understand how Oracle marks a
block as dirty:
-
If the transaction is still in progress, Oracle
doesn't necessarily set the dirty flag even if the block/buffer is modified in
memory (there are various cases: insert, update, etc.).
-
Oracle always sets the dirty flag immediately following
a commit, a rollback, or a DML failure due to a constraint violation or some other
issue that results in an automatic rollback.
Let’s take a look at how Oracle generates CR blocks: consider the
following scenario:
create table students (roll number, name varchar2(300));
alter table students add constraint students_pk primary key (roll);
create index name_idx on students(name);
insert into students select rownum, dbms_random.string(0,300) from dual connect by level < 50000;
update students set name='SRAVYZZ' where roll = 10000;
commit;
Now that the data is populated, let's execute the following DML statement
in session A:
update students set name='VISHNU' where roll = 10000;
The states of the buffers are as follows:
Notice that only the index block is marked as dirty.
From a different session execute the following query:
select * from students where name='SRAVYZZ'; -- old value
The states of the buffers are as follows:
Since the buffers are already cached, the CR block fabrication is pretty
straightforward. But to look precisely at how this is done, perform
the following steps:
-
Flush the buffer cache.
-
Disable features such as (table prefetch, batch
access, index prefetch, buffer warmup).
-
Enable 10046 level 12 trace.
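The flush and trace steps above can be sketched as follows (the prefetch-related settings are deliberately omitted here, as they involve version-specific parameters):

```sql
alter system flush buffer_cache;
alter session set events '10046 trace name context forever, level 12';
-- ...run the SELECT statement...
alter session set events '10046 trace name context off';
```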
The corresponding execution plan is as follows:
The snippet of the trace is as follows:
Let’s take a look at the SQL statements we executed:
update students set name='VISHNU' where roll = 10000;
select * from students where name='SRAVYZZ';
The update statement updated not only the table but also the NAME_IDX index leaf block.
Remember that for an index UPDATE operation is typically a DELETE and INSERT.
Let’s look at the ITL portion of both the table and the index leaf
blocks:
Table Block:
Index block (where the entry is deleted)
Index block(where the new entry is inserted)
Notice that the XID, or transaction ID, is the same for all three of these
operations. A snippet of the undo block containing the details is as follows:
Now consider the case where session A, which initiated the update, is
stuck:
update students set name='VISHNU' where roll = 10000;
And other sessions/new sessions issue the following statement:
select * from students where name='VISHNU';
Technically the entry in the leaf block exists, but the data is not yet
committed, and prior to this transaction there is no matching row. In
this case, since the entry is exclusively locked, the server process ends up
fabricating a CR block (index). One might wonder why.
Consider a scenario in which an update to the partition key causes the
row to be moved to a different partition, and the underlying index is a global
index that does not contain the partitioning key. The indexed column
value never changes, but the ROWID does; since the entries in the leaf blocks are
sorted by column value and then ROWID, the position of this particular
entry can move up or down, or even to a different leaf block altogether. The
server process that initiated the query cannot be sure
whether the index entry existed before, or whether it moved into this block as part of
an ongoing transaction. To confirm that this value existed before, and also to
find the corresponding table block that previously held the value (assuming the
transaction is still in progress), Oracle generates a CR block for that index
block alone; once the CR block is read and it determines that the row never
existed, it returns immediately.
Remember that if there is an open ITL slot prior to executing the SQL statement,
the server process ends up fabricating a CR block (be it an index or a table
block) even though the LB flag for the row we are interested in is not set;
in other words, even though the row of interest is not locked by
another transaction.
So far we have seen how CR blocks are generated when the transaction
is already in progress; now let's look at a scenario in which the blocks are
modified after the SQL statement has started.
Considering the same scenario as before:
create table students (roll number, name varchar2(300));
alter table students add constraint students_pk primary key (roll);
create index name_idx on students(name);
insert into students select rownum, dbms_random.string(0,300) from dual connect by level < 50000;
commit;
Instead of writing a complex SQL statement that touches exactly the blocks
we need, we can simulate this using a simple Java program as follows:
void selectLoad() {
    ExecutorService pool = Executors.newFixedThreadPool(1);
    pool.submit(new Load());
}

// run() method of the Load class, which implements Runnable
public void run() {
    try {
        Connection oraCon = DBConnection.getOraConn();
        PreparedStatement pstmt = oraCon.prepareStatement(
                "select * from students where roll < 100");
        pstmt.setFetchSize(1); // force one row per fetch round-trip
        ResultSet rs = pstmt.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getInt(1));
            Thread.sleep(5000); // pause between fetches
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Since setFetchSize is set to 1, the server process must
access these data blocks again to get each next batch of results; in
our case, the client waits 5 seconds between fetches, allowing us ample time to modify the data
blocks and observe the behavior. X$BH.CR_SCN_BAS indicates the SCN of the CR
block required by the statement, or could also indicate the SCN at which the
statement started.
After starting the program (SCN 96332535), the states of the buffers are as follows:
In another session, try modifying the contents of the table as follows:
update students set name='SRAVYZZ' where roll = 20;
The states of the buffers will be as follows:
Oracle generates a CR block for the table only if the change results in a
modified row. Yet suppose we modify a row whose roll number is 101 (outside the
predicate): Oracle still generates a CR block. Does that mean Oracle always
generates a CR block? No; if the distance between the boundaries of the modified
row and the rows it is processing is ~20, it doesn't generate a CR block even though
the block's SCN is higher than the SELECT statement's SCN.
In order to comply with the Read Committed transaction isolation level (the
default), when a SQL statement is issued against the database Oracle must ensure
that the resultset contains only rows that were committed as of
the precise SCN at which the SQL statement was issued. This behavior is the same for both DML
and SELECT statements. Let's start with the basics:
Consider the following table:
create table test (roll number primary key);
insert into test select rownum from dual connect by level < 1000;
commit;
There is a reason for creating a table with a single column; we will see
it shortly.
Now consider the query:
select * from test where roll in (1, 2, 3);
Plan is as follows:
The states of the corresponding index blocks (buffers) is as follows:
Block 355 is the root branch block and 357 is the leaf block. These
buffers are clean (not dirty). The query used to get the above result
is as follows:
select a.object_name,
       b.dbarfil "FILE_ID",
       b.dbablk "BLOCK",
       decode(b.state, 0, 'free', 1, 'xcur', 2, 'scur', 3, 'cr', 4, 'read',
              5, 'mrec', 6, 'irec', 7, 'write', 8, 'pi', 9, 'memory',
              10, 'mwrite', 11, 'donated', 12, 'protected', 13, 'securefile',
              14, 'siop', 15, 'recckpt', 16, 'flashfree', 17, 'flashcur',
              18, 'flashna') "STATE",
       decode(bitand(b.flag, 1), 0, 'NO', 'YES') "DIRTY",
       count(*)
from   x$bh b, dba_objects a
where  b.obj = a.data_object_id
and    a.owner = 'VISHNU'
group by a.object_name, b.dbarfil, b.dbablk,
       decode(b.state, 0, 'free', 1, 'xcur', 2, 'scur', 3, 'cr', 4, 'read',
              5, 'mrec', 6, 'irec', 7, 'write', 8, 'pi', 9, 'memory',
              10, 'mwrite', 11, 'donated', 12, 'protected', 13, 'securefile',
              14, 'siop', 15, 'recckpt', 16, 'flashfree', 17, 'flashcur',
              18, 'flashna'),
       decode(bitand(b.flag, 1), 0, 'NO', 'YES');
Now let's update roll number 1 to 10001 and see the status of the
blocks:
update test set roll = 10001 where roll = 1;
commit;
The index leaf block and the table block that originally contained the
value 1 are updated, resulting in dirty blocks. Remember that for an index an
update is basically a delete and an insert: the update statement resulted in the index
entry holding the value 1 being marked as deleted, and then the value 10001
being inserted into index leaf block 358. The on-disk and in-memory versions of the
index leaf block will have the following entries:
(When you dump a block, Oracle first dumps the in-memory version
followed by the on-disk version.) These blocks will have the DIRTY flag set until they
are written to disk; once they are written, due to either
memory pressure or an incremental checkpoint, the DIRTY flag is cleared.
Now in another session, if we issue the following SQL statement without committing the transaction:
update test set roll = 1 where roll = 10001;
the entry in block 358 (10001) will be marked as deleted, and the
deleted flag on the entry in block 357, which was previously marked as deleted,
will be cleared. The following is the dump of index leaf block 357.
From another session issue the following statement:
select * from test where roll = 10001;
Since the transaction is still in progress and the block has an open ITL
slot with uncommitted changes, the session ends up building a CR image of the
block.
Remember that Oracle maintains SCNs at the block level: all the rows in a
table block, and all the entries in an index leaf block, are considered to have the same
ORA_ROWSCN. This additional CR block fabrication requires reading entries from
the transaction table in the undo tablespace, cloning the block, and rolling back the
changes. This process increases the number of consistent gets for a SQL statement.
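The block-level granularity of ORA_ROWSCN is easy to verify; a sketch (on a table created without ROWDEPENDENCIES, rows co-located in one block report the same value):

```sql
select roll, ora_rowscn,
       dbms_rowid.rowid_block_number(rowid) as block_no
from   test
where  roll in (2, 3);
```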
When it comes to DML statements, blocking sessions also result in CR block
generation. The following are scenarios in which Oracle ends up
creating multiple (index) CR blocks.
CR block generation:
Multiple sessions execute the same SQL (update or delete) and end up
blocking one another.
Unique index:
Plan involves a unique scan; update the single column of a single-column unique
index: CR block generation within the limit.
Plan involves a unique scan; update a different column (other than the unique
index column): CR block generation within the limit.
Plan involves a unique scan; update the trailing column of a multi-column unique
index: CR block generation within the limit.
Plan involves a unique scan; update both columns of a multi-column index:
CR block generation within the limit.
Plan involves a unique scan; update the leading column of a multi-column unique
index: CR block generation within the limit.
Plan involves a range scan on a multi-column unique index; update the trailing
column of the unique index: more than 6 CR blocks generated.
Plan involves a range scan on a multi-column unique index; update the leading
column of the unique index: more than 6 CR blocks generated.
Non-Unique index:
Plan includes a range scan; update the same column: more than 6 CR blocks
generated.
Plan includes a range scan; update a different column: no CR block
generation.
Plan includes a range scan; update a different column (a unique index column):
no CR block generation.
Delete on a non-unique index: more than 6 CR blocks generated.
Insert statement:
For inserts, CR blocks are typically generated for BMB (bitmap space
management) blocks, as the data in those blocks gets updated; this can also
happen during checkpoints.
Delete generates more than 6 CR blocks if a range scan over a unique index is used.
Delayed / Commit Cleanout:
(processing / undo lookup overhead)
The database writer process can write a modified (dirty) block that
contains uncommitted changes to disk for various reasons, such as:
-
Checkpoints. (manual or automatic)
-
Insufficient buffer cache or (buffer cache flush).
-
Aged out of the buffer cache.
-
Small redo log file sizes.
-
Small value for fast_start_mttr_target etc.
In these cases, the ITL slot in the header portion of the on-disk
version of the block indicates that the transaction is still in progress (the
flags are not cleared), and the session that initiated the transaction (or
modified the blocks) doesn't read the blocks into memory again just to update
the ITL flags to indicate that the transaction is committed. This is done
for efficiency. To demonstrate this, let's look at a very basic example:
create table students(roll number primary key, name varchar2(20));
insert into students select rownum, dbms_random.string(0,20) from dual connect by level < 100000;
commit;
Session A modifies a row as follows:
update students set name='VISHNU' where roll = 100;
The state of the buffers is as follows:
Now suppose the DBWn process writes this dirty block to disk (for any of the
reasons above) and the session later commits the transaction.
The ITL portion of the header for the table block 351 for the on-disk
version will be as follows:
Now suppose if another session selects the same row:
select * from students where roll = 100;
The server process reads the block from disk and performs a delayed
commit cleanout (marking the transaction as committed); the ITL portion after
the commit cleanout will be as follows:
It so happens that if multiple
transactions modify the block and the block is written to disk before the
sessions can commit, a later session ends up performing additional IO
against the undo tablespace to find the status of these transactions and perform
the commit cleanout.
A session may also fail to perform a commit cleanout when another
session has the buffer pinned; in this case the server process gives up on the
block, and the other session that is reading/modifying the block ends up
performing the commit cleanout. Remember that a commit cleanout
modifies the block (updating the ITL slots and the corresponding rows), thereby
making the block dirty again and causing the DBWn process to write it to
disk once more.
Impact Due to Row Chaining:
Every block in
the database has a header area that stores information such as segment type,
SCN, transaction information, etc. Oracle database uses this information to
manage the block itself. For an 8KB database block, the space required for this
overhead depends on various factors such as INITRANS, average row length,
the number of columns, the size of columns, etc. For a statement inserting
data into a table, the effective space left in the block can be calculated as
follows:
Space Left (inserts) ~= DB_BLOCK_SIZE - BLOCK_HEADER - (DB_BLOCK_SIZE * PCTFREE / 100)
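The formula above can be sketched as a small calculation; the header size used here (~150 bytes) is an assumption for illustration, since the real overhead varies with INITRANS and the other factors listed.

```java
// Sketch of the space-left formula; headerBytes is an assumed value,
// as the real block overhead varies with INITRANS etc.
public class BlockSpace {
    static int spaceLeftForInserts(int blockSize, int headerBytes, int pctFree) {
        // DB_BLOCK_SIZE - BLOCK_HEADER - (DB_BLOCK_SIZE * PCTFREE / 100)
        return blockSize - headerBytes - (blockSize * pctFree / 100);
    }
    public static void main(String[] args) {
        // 8KB block, ~150-byte header (assumed), default PCTFREE of 10
        System.out.println(spaceLeftForInserts(8192, 150, 10));
    }
}
```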
Rows are always
inserted from the tail end of the block, filling upward, as shown in the
following figure:
As you go
through the rest of this section, you will realize how this behaviour is
critical to performance, and what to consider while designing the layout
of tables and the column order.
Oracle will not
split a row into multiple row pieces if it can fit the row entirely in a newly
formatted block. Row chaining is the process of splitting a row into several
row pieces and storing those pieces either in the same block or in
different blocks, depending on the size of the row. Row chaining can occur
for the following reasons:
-
Row size greater than the allowable user data size in a block (after
accounting for BLOCK_HEADER size and PCTFREE).
-
Table
has more than 255 columns.
-
Tables containing LONG and LONG RAW columns (Oracle recommends converting
them to LOBs).
Let’s look at
each case:
Scenario 1: Row size greater than allowable user data
size:
In this case
Oracle automatically splits the row into several row pieces and stores each row
piece in a different data block. An interesting point to note is that Oracle
chains the row pieces starting from the tail end of the row. To illustrate
this behaviour, let's look at an example; the following table is created in a
tablespace with the standard block size (8KB).
create table students (student_id number, t1 varchar2(3000), t2 varchar2(3000), dept_id number, t3 varchar2(3000), t4 varchar2(3000), t5 varchar2(3000));
alter table students add constraint students_pk primary key (student_id);
create index dept_idx on students (dept_id);
insert into students select rownum, dbms_random.string(0,3000), dbms_random.string(0,3000), round(dbms_random.value()*100), dbms_random.string(0,3000), dbms_random.string(0,3000), dbms_random.string(0,3000) from dual connect by level < 10000;
commit;
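As a quick check (not shown in the original output), the chaining introduced by this insert can be confirmed from CHAIN_CNT after analyzing the table; the statements below are a sketch of that follow-up:

```sql
-- ANALYZE (not DBMS_STATS) refreshes CHAIN_CNT
analyze table students compute statistics;
select chain_cnt, num_rows from user_tables where table_name = 'STUDENTS';
```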
From the table definition it is evident that a block cannot accommodate an entire row, and Oracle requires at least 3 blocks to store a single row. The row piece containing the columns (t4, t5) is stored in one block, the columns (t2, dept_id, t3) in another, and finally the columns (student_id, t1) in one block. Graphically this is represented as follows:
Looking at the block dumps:
nrid points to the block
that contains the next row piece. The row piece containing the columns (student_id, t1) is the head
row piece and all the indexes on this table basically point to the head row
piece. We can confirm this using the following:
student_id
IDX:
Dept_idx:
Remember that the row is split into multiple row pieces and the trailing columns are inserted first.
How this scenario Impacts Performance:
Impact on index range scans:
Since the index entries always point to the head row piece, the impact on performance due to additional reads depends on the columns that are included in the SELECT part of the query. If we are selecting only columns that are present in the head row piece, then no additional IO is required. Due to this, it is recommended to have all the indexed columns and the most frequently accessed columns in the first row piece in order to avoid extra IO to read the subsequent row pieces.
Similarly, for a full table scan, if the select part includes only columns that are present in the head row piece, the server process doesn't need single block reads (db file sequential read) and instead uses multiblock reads (direct path read). But if the select part includes columns that are not in the head row piece, Oracle has to read the rest of the row pieces using single block IO. This presents a very interesting issue, as Oracle can end up reading the same block multiple times. The same holds true for index scans as well.
WAIT
#140051776711336: nam='db file sequential read' ela= 195 file#=7 block#=95478 blocks=1
obj#=74680 tim=5026470481
WAIT
#140051776711336: nam='db file sequential read' ela= 205 file#=7 block#=95479 blocks=1
obj#=74680 tim=5026470758
WAIT
#140051776711336: nam='db file sequential read' ela= 209 file#=7 block#=95480 blocks=1
obj#=74680 tim=5026471148
WAIT
#140051776711336: nam='db file sequential read' ela= 207 file#=7 block#=95484
blocks=1 obj#=74680 tim=5026471470
WAIT
#140051776711336: nam='db file sequential read' ela= 199 file#=7 block#=95406
blocks=1 obj#=74680 tim=5026472151
WAIT
#140051776711336: nam='db file sequential read' ela= 210 file#=7 block#=95390
blocks=1 obj#=74680 tim=5026472429
WAIT
#140051776711336: nam='direct path read' ela= 296 file number=7 first dba=95744 block cnt=128
obj#=74680 tim=5026473385
WAIT
#140051776711336: nam='db file sequential read' ela= 227 file#=7 block#=94378
blocks=1 obj#=74680 tim=5026474069
WAIT
#140051776711336: nam='db file sequential read' ela= 145 file#=7 block#=94362
blocks=1 obj#=74680 tim=5026474276
WAIT
#140051776711336: nam='db file sequential read' ela= 197 file#=7 block#=95730
blocks=1 obj#=74680 tim=5026475029
WAIT
#140051776711336: nam='db file sequential read' ela= 140 file#=7 block#=95714
blocks=1 obj#=74680 tim=5026475230
WAIT
#140051776711336: nam='db file sequential read' ela= 223 file#=7 block#=95733
blocks=1 obj#=74680 tim=5026475524
WAIT
#140051776711336: nam='db file sequential read' ela= 165 file#=7 block#=95669
blocks=1 obj#=74680 tim=5026475749
WAIT
#140051776711336: nam='db file sequential read' ela= 207 file#=7 block#=95737
blocks=1 obj#=74680 tim=5026476076
WAIT
#140051776711336: nam='db file sequential read' ela= 258 file#=7 block#=95721
blocks=1 obj#=74680 tim=5026476406
Try to place all the optional columns at the tail end of the table, so that if we are not storing any data (or storing nulls) in these columns there is a high chance that the rows will not be chained. Each time we hit a chained row, the statistic table fetch continued row is incremented.
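One way to watch this statistic for your own session is to join V$STATNAME and V$MYSTAT; a sketch:

```sql
-- Chained/migrated row fetches for the current session
select sn.name, ms.value
  from v$statname sn, v$mystat ms
 where sn.statistic# = ms.statistic#
   and sn.name = 'table fetch continued row';
```

Run it before and after the query of interest; a rising value confirms chained or migrated rows are being hit.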
Scenario 2: Table has more than 255 columns.
If the number of columns in a table is greater than 255, Oracle automatically splits the row into multiple row pieces and stores each row piece either in the same block or in different data blocks depending on the size of the row. An interesting point here is that the split happens from the tail end of the table. Suppose we have a 260-column table: columns 6-260 will be stored in a chained row piece (depending on the size of the columns; the above holds if all the columns are integers), so the first row piece stores columns 1-5 and the second row piece contains the rest of the columns. If Oracle fits all the row pieces of the row in a single block, we have something called intra-block row chaining.
How this scenario Impacts Performance:
While intra-block row chaining, where the row pieces are stored in the same block, may not result in additional IO, there is CPU overhead involved as we have to read both row pieces. If a row piece gets migrated to a different block as part of an update operation, this could result in additional IO not only during index range or unique scans but also during full table scans. Intra-block row chaining will not cause the table fetch continued row statistic to be incremented, and details such as CHAIN_CNT in DBA_TABLES/DBA_TAB_STATISTICS are not reflected. Oracle treats a chained row as such only if the row pieces span multiple blocks, and the statistic table fetch continued row is incremented only when reading a row results in reading two table blocks.
Row Migration:
Row migration is a form of row chaining. During an update, if it is found that the current block does not have enough free space to accommodate the new size of the row, the entire row is migrated to a new block and a forwarding address with the new ROWID is left behind in the original row piece. After the row piece is migrated to a new block, the corresponding index entries are not updated with the new location, because it can take a significant amount of time to update all the indexes and their corresponding index leaf blocks. With row migration, each time we access a particular row using an index lookup we incur an additional IO, as the index keys point to the original row piece, which contains only the forwarding address. A chained row piece can also be migrated to a different block.
How this scenario Impacts Performance:
Since the entire row is migrated from its current location to a new block and a forwarding address is left behind, an additional I/O is always incurred. A low or default value of PCTFREE on tables with heavy update activity makes them prone to more migrated rows due to space issues.
Monitoring the Row Chaining and Migration Impact:
Increased db file sequential reads during a SQL execution, especially single block reads appearing during index range or unique scans or interspersed in a full table scan, can be an indicator of row chaining or migration. Monitoring the statistic table fetch continued row for a session during the SQL execution is a good starting point to confirm the presence of chained or migrated rows. The CHAIN_CNT column in DBA_TABLES or DBA_TAB_STATISTICS can be used to find the number of chained rows, but the contents of these views may not be up to date; to get an accurate value for the number of chained rows use the ANALYZE TABLE ... COMPUTE STATISTICS command. The DBA_TAB_MODIFICATIONS view provides insight into the workload profile for a particular table; lots of updates along with chained rows in a table is a clear indication that the PCTFREE value might not be sufficient. The ANALYZE TABLE ... LIST CHAINED ROWS command gets the list of chained rows in a table. This command populates the CHAINED_ROWS table (created using the ?/rdbms/admin/utlchain.sql script). The HEAD_ROWID column in CHAINED_ROWS indicates the ROWID of a row in the table that is spread across multiple blocks.
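Putting the above together, a typical sequence to list the chained rows of the students table would be a sketch like this:

```sql
-- Create the CHAINED_ROWS table once per schema
@?/rdbms/admin/utlchain.sql

-- Populate it and inspect the affected rows
analyze table students list chained rows into chained_rows;
select head_rowid from chained_rows where table_name = 'STUDENTS';
```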
Fixing Row Chaining / Migration:
If the number of chained rows is very small, we can delete the corresponding rows and insert them back into the table in a single transaction when the load on the table is low. If there are many chained rows, it can be a clear indication that either the PCTFREE setting for the underlying table is not sufficient or the tablespace block size is too small. If we change the existing value of PCTFREE for a segment, only the new blocks allocated to the segment inherit the changed value; the existing blocks retain the old value. If we have to ensure a consistent value of PCTFREE for all the blocks in the table after modifying the PCTFREE value, we can make use of features such as:
- Online table move.
- Online table redefinition (in earlier versions).
- Export and import (if the table is static).
All of these result in the creation of a new segment. Since the maximum size for LONG or LONG RAW columns is very high, Oracle's recommendation is to convert these columns into the corresponding LOB data types, as LOBs are stored in a different data segment. Remember that the ultimate goal of any performance tuning DBA is to reduce the number of disk reads, which is basically the choking point in any OLTP database that prevents us from attaining the peak TPS on the system.
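As an illustration of the first option, an online move after raising PCTFREE rebuilds every block with the new setting. The value 20 below is an assumption, not a recommendation, and MOVE ONLINE requires release 12.2 or later (on earlier releases use a plain MOVE followed by an index rebuild):

```sql
-- Raise PCTFREE (affects newly formatted blocks only) ...
alter table students pctfree 20;
-- ... then rebuild the segment so all blocks honour the new value
alter table students move online;
```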
From the Storage Perspective:
If the select statement references only the columns that are present in the head row piece, then no additional reads are required, and depending on various factors such as the number of ROWIDs for a particular column value, fetch size, use of aggregate functions, and order by (not included in the SQL), Oracle can efficiently batch as many table block reads as possible into one single read. So what happens if the columns that are stored in the trailing row pieces are included in the select statement? To answer this we have to understand that Oracle has no way of determining which columns are stored in the tail end (trailing row piece) from the ROWIDs it obtains from the index lookup, and moreover only a single row or a few rows may have resulted in chained rows. So when a user executes a query such as the following:
select
contact1 from students where marks = 100
and if there are 9 chained rows, Oracle proceeds as if no chained rows are present, batching as many table blocks as it can and issuing an io_submit call to read the corresponding blocks; as it accesses each row based on the ROWID, if it encounters a chained row, it immediately issues a separate IO request using pread64 to read the corresponding row piece. This behaviour is the same for both row chaining and migration (for range based or bound based predicates as well).
Is there a way to Prevent Row
Migration altogether?
Introduction – Hakan Factor:
Hakan Factor specifies the maximum number of rows that can fit in a
block. During the table creation time Hakan Factor is calculated based on
various factors such as:
- Database block size.
- Data type and length of the columns.
- Number of columns.
- NOT NULL constraints.
Unfortunately, factors that indirectly restrict the number of rows in a block, such as the following, are not used in the calculation (we will look into this shortly):
- INITRANS
- PCTFREE
We can find the Hakan factor value for a table using the following
query.
select
mod(t.spare1, 32768) hakan from tab$ t where obj# in (select object_id from
dba_objects where object_name='&OBJECT_NAME' and owner='&OWNER');
or by issuing the following Query:
select
max(sys_op_rpb(rowid)) from owner.object_name;
The Hakan factor can be set manually by issuing the following statement:
alter table table_name minimize records_per_block;
Once you execute the statement, Oracle internally uses the above-mentioned SQL statement to calculate the Hakan factor (by performing a full table scan).
The Hakan factor is primarily used:
- To restrict the number of rows in a particular table.
- To map the rows to the corresponding ROWID entries for bitmap indexes.
By default, when a table is created (without a not null constraint, or
in many cases even with the presence of not null constraint), Oracle sets the
value to 736 (which is a bit unrealistic) for a database block size of 8KB.
Let’s look at an example:
create
table students(roll number, name varchar2(300));
The Hakan value for this table is as follows:
Oracle estimates that this block can hold as many as 736 rows. Now let’s
populate this table as follows (ensuring all varchar data is stored):
insert into
students select rownum, dbms_random.string(0,300) from dual connect by level
< 10000;
commit;
and update the Hakan value manually by issuing the following statement:
alter table
students minimize records_per_block;
Oracle now populates the Hakan Factor as follows:
This indicates that a block cannot hold more than 22 rows. Now that the Hakan value is populated, let's truncate the table and repopulate it with a smaller row size.
insert into
students select rownum, dbms_random.string(0,1) from dual connect by level <
10000;
commit;
Let’s look at the average row length, blocks etc., using the following
query:
select table_name, avg_row_len, num_rows, blocks from dba_tables where table_name = 'STUDENTS';
As we can see, even though the average row length is small, Oracle allocated a greater number of blocks than required; this is because of the Hakan factor. Let's look at the number of rows stored in each block:
select cnt, count(*) from
(select dbms_rowid.rowid_block_number(rowid) "BLOCK", count(*) "CNT"
from students group by dbms_rowid.rowid_block_number(rowid))
group by cnt;
As we can see from the above, Oracle did restrict the number of rows per block to 23 (the Hakan factor, +/- 1) even though the block can accommodate more rows. This is a way in which we can restrict the number of rows per block.
One might wonder if PCTFREE can be used to restrict the number of rows or reserve more space for future updates. One disadvantage of PCTFREE is that it doesn't restrict the number of rows per block, and there is always a chance that the row size can grow as a result of an update statement, resulting in row migration. So how do we set a proper value for the Hakan factor?
Load the table with approximate (average) sizes for each column (by talking to the developers), issue the statement alter table ... minimize records_per_block, and then truncate the table. Do this initially, when the table is created, so that you can prevent row migration from occurring. Remember that Oracle doesn't insert a row into a block if it finds that the block doesn't have enough free space. The default value of the Hakan factor may not be appropriate, as the value is always an overestimate and doesn't consider INITRANS (which actually reserves space for ITL slots in a block).
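A sketch of that workflow against the earlier two-column students table, assuming a representative average width of around 250 bytes for the name column (the seed data and row count are assumptions):

```sql
-- Seed with representative row sizes (assumed averages)
insert into students
  select rownum, rpad('x', 250) from dual connect by level <= 1000;
commit;

-- Capture the Hakan factor from the seeded data, then empty the table;
-- the Hakan factor survives the truncate
alter table students minimize records_per_block;
truncate table students;
```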
Remember that when you use statements such as CTAS (Create Table as Select) Hakan factor gets carried over to the new table.
Impact of Additional Predicates:
Use of additional predicates on columns other than the indexed columns
in an SQL statement can undermine the Batching or Async IO capabilities of a
database, depending on the ROW’s position that matches all the predicates in
the SQL in the leaf block. This behavior is common for both prefetching and
batching. Let’s look at an example:
Consider the following scenario:
create
table students (student_id number, name varchar2(30), dept_id number, mark1
number, mark2 number);
create
index mark1_mark2_idx on students(mark1,mark2);
Populate the table with random data.
SQL>
select dept_id,count(*) From students where mark1=113 and mark2 = 3 group by
dept_id;
The position of the corresponding ROWIDs of interest in the leaf blocks
for the entry mark1=113 and mark2=3 is as follows:
Let’s look at 10046 level 12 trace for the following statements:
select *
from students where mark1 = 113 and mark2 = 3 and dept_id =1;
select *
from students where mark1 = 113 and mark2 = 3 and dept_id = 2;
select *
from students where mark1 = 113 and mark2 = 3 and dept_id = 3;
In all the cases, the number of buffer gets or consistent reads is the same, but it is the number of single block read requests, and the inability to effectively make use of asynchronous IO through batching or prefetching, that adds delay to the elapsed times of the queries. If you are working on a true OLTP system where even millisecond delays cannot be tolerated, you should be really careful in selecting the right columns for a multi-column index. If all the table blocks are cached, this behaviour will not impact the response times at all, and this is evident from the following.
This situation (in which all the table blocks are in the buffer cache) may be the case on many databases, but as buffer caches are considerably smaller than the data on disk, the session may eventually have to read the blocks from disk; there are simply too many variables to rely on caching.
Working with Dates/Timestamps:
Developers often make the mistake of mapping the wrong data type when it comes to dates, or use functions that are not compatible, due to which Oracle will not be able to use the index on the date column. Let's start with a basic example:
create
table students(roll number, name varchar2(30), day date);
create
index day_idx on students(day);
Populate it with random data.
The application issues the following query (in this case the developer maps the bind variable as TIMESTAMP instead of DATE):
select *
from students where day = ?
The execution plan is as follows:
The bind value supplied by the application is of type TIMESTAMP. Even though the column day is indexed, Oracle ignores the index and performs a full table scan. This is because Oracle applies the function INTERNAL_FUNCTION on the day column (this INTERNAL_FUNCTION performs an implicit datatype conversion) to expand/convert the column data to the Oracle TIMESTAMP data type, rather than truncating/converting the supplied TIMESTAMP value into the Oracle DATE data type.
In this case you may think that creating an index as follows will help, but it doesn't, because Oracle doesn't allow us to create an index using a non-deterministic function, and secondly INTERNAL_FUNCTION cannot be used directly (19c):
create
index day_idx on students(internal_function(day));
ERROR at
line 1:
ORA-00904:
"INTERNAL_FUNCTION": invalid identifier
In this case the solution is to ask the developer to fix the code and change the data type from TIMESTAMP to DATE. This issue is more prevalent with Hibernate. Oracle Database release 12c R2 addresses this issue by automatically picking the index even though the supplied data type is TIMESTAMP; in this case Oracle doesn't convert the data type and uses the index directly. The following execution plan is from 12c Release 2.
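On releases that still apply the implicit conversion, one possible workaround (if the application's bind type cannot be changed) is to cast the bind explicitly so the conversion happens on the bind side rather than on the indexed column; :b1 below is a hypothetical bind variable:

```sql
-- Convert the TIMESTAMP bind to DATE instead of expanding the indexed column
select * from students where day = cast(:b1 as date);
```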
But if OPTIMIZER_FEATURES_ENABLE is set to <=12.2.0.1, Oracle reverts to the old behaviour and applies the INTERNAL_FUNCTION. Before going further, let's look at the data types returned by the Oracle supplied functions:
sysdate – DATE
systimestamp – TIMESTAMP WITH TIME ZONE
trunc(systimestamp) – DATE
Be cautious when using
these functions in the predicates. Consider the following table:
create
table students (roll number, name varchar2(30), when timestamp);
create
index when_idx on students(when);
A query such as the following:
select * from students where when = systimestamp - interval '1' day;
will result in a TABLE ACCESS FULL rather than an INDEX
RANGE SCAN.
This is because Oracle expands the WHEN column's data type from TIMESTAMP into TIMESTAMP WITH TIME ZONE and then applies the SYS_EXTRACT_UTC function, which extracts the UTC (coordinated universal time, formerly Greenwich mean time) from a datetime value with a time zone offset or time zone region name. In other words, SYS_EXTRACT_UTC takes a TIMESTAMP WITH TIME ZONE as input and returns a TIMESTAMP.
Always compare a TIMESTAMP column with either sysdate or trunc(systimestamp), or use the java.sql.Timestamp API while setting a value for the bind variable at the application level (PreparedStatement). An interesting point to remember here: since Oracle automatically applies the SYS_EXTRACT_UTC function whenever a predicate contains columns of type TIMESTAMP WITH TIME ZONE, creating an index on a column with data type TIMESTAMP WITH TIME ZONE automatically creates a function-based index.
Consider the following:
create
table students(roll number, name varchar2(20), when2 timestamp with time zone);
create
index when2_idx on students(when2);
Even though the DDL suggests that we are creating a normal B-Tree index, Oracle internally creates a function-based index.
Extracted DDL will be as follows:
CREATE INDEX
"VISHNU"."WHEN2_IDX" ON
"VISHNU"."STUDENTS" (SYS_EXTRACT_UTC("WHEN2"))
PCTFREE 10 INITRANS 2 MAXTRANS 255 COMPUTE STATISTICS
TABLESPACE "USERS" ;
Remember that since the index created on a TIMESTAMP WITH TIME ZONE column is a function-based index, the limitations of function-based indexes apply here as well.
Null Values:
Null values hold a special place in Oracle: if a column in a row has no value, then the column value is said to be null, or to contain null. We cannot compare a null with a null using the relational operator =, since nulls are not treated as equal values (null represents the absence of a value); that being said, the condition null = null is always false, while null is null always evaluates to true. Nulls can appear in columns of any data type that are not restricted by NOT NULL or PRIMARY KEY integrity constraints. Oracle treats null as an unknown value, so if we have a unique constraint on a column, we can insert many rows containing null values; in this case Oracle treats each null value as distinct, and since we are not storing any value, Oracle doesn't store these entries in the index. Considering only B-Tree indexes, which we are discussing here: by default, Oracle doesn't store an entry in the index when all the indexed columns contain null values. Oracle does store an entry if at least one column of a composite (multi-column) index contains a non-null value. Let's look at scenarios where we can leverage this to improve response time.
Single Column B-Tree indexes:
Indexes built on a single column do not store entries for null column values. As these values are not stored in the index, NUM_ROWS and DISTINCT_KEYS in DBA_IND_STATISTICS do not reflect them. So, for queries with predicates that depend on null, such as :b1 is null or :b1 is not null, Oracle has no way of determining from the index how many rows in the table contain null values; similarly, when the predicate condition is not null, this equates to all the rows in the table. In both cases, the optimizer simply prefers a full table scan, a fast full index scan, or a full index scan.
When we update a particular value to null, Oracle simply marks the flag for that particular entry in the index leaf block as deleted. Even though this behaviour is not of much significance on its own, it becomes a starting point for explaining certain behaviour when we proceed to multi-column indexes. The corresponding leaf block dump showing the details is as follows:
row#0[8020] flag: ---D---, lock: 2, len=12
col 0; len 2;
(2): c1 02
col 1; len 6;
(6): 01 c0 01 66 00 00
row#1[8008] flag: ---D---, lock: 2, len=12
col 0; len 2;
(2): c1 03
col 1; len 6;
(6): 01 c0 01 67 00 00
Since nulls are not stored in a single-column index, queries that depend on null are particularly affected, such as the following:
select * from students where marks is null;
A simple workaround in this case is to add a constant column to the index and leverage the fact that Oracle does store index entries when only the leading column values contain null. An example is as follows:
Create index
students_idx on students(marks, 1);
Here, we added an extra column to the index which stores the constant value 1. Oracle adds a virtual column to the table, which can be viewed using ALL_TAB_COLS or DBA_TAB_COL_STATISTICS. The end result is a function-based index on the table with columns (marks, virtual_column). The storage requirement for the value 1 is 2 bytes, which can be seen in the leaf block dump as follows:
Characters such as "." or a space " " require only 1 byte of storage. If we are indexing numeric or character-based columns, consider using such characters as the constant, since this maximizes the number of entries that can be stored in the leaf blocks and also reduces the size of the index (the space savings in this case might be small, but for OLTP workloads each additional block access counts when it comes to response times).
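For example, using a single-character constant instead of the number 1 (a sketch; either form works, but the character constant stores in one byte per entry instead of two):

```sql
-- ' ' occupies 1 byte per index entry; the number 1 occupies 2
create index students_idx2 on students(marks, ' ');
```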
The leaf block dump is as follows:
Multi-Column B-Tree indexes:
For multi-column or composite indexes things are a little different: Oracle persists the nulls in the index even if the leading column or trailing column contains nulls. The block dump of an index leaf block shows the details. We know that Oracle by default stores the column values in ascending order, but since the value of NULL is unknown, the corresponding entries are always stored at the last, or right-most, part of the B-Tree structure, following the entries that contain values.
- If the leading column has nulls, these entries are stored in either the last branch block or the last leaf block, depending on the number of rows that contain nulls in the leading indexed column.
- If the trailing column has nulls, the corresponding entries are stored in the last index leaf block, or immediately following the last entry that contains the maximum value for the trailing column (for a particular value of the leading column).
It is only when all the indexed columns are null that Oracle doesn't persist these entries in the index leaf blocks. This has rather strange implications for how Oracle retrieves the data from this index when the predicate uses a null or not null condition on a column or set of columns in the composite index. Let's look at the various cases where an index will be used and the tips that we can apply:
We create the table as follows:
Create table temp
(roll number, name varchar2(20), mark1 number, mark2 number, mark3 number);
Create index idx
on temp(mark1, mark2);
We carefully load the data into the table such that the number of rows
in the table is large and selectively populate mark1 and mark2 to contain nulls
as follows:
Populating 100 rows so that mark1 column value is null
Insert into temp
select rownum ,dbms_random.string(0,20), null, 112312, 123 from dual connect by
level < 100;
Populating 100 rows so that mark2 column value is null
Insert into temp
select rownum ,dbms_random.string(0,20), 112312, null, 123 from dual connect by
level < 100;
For the following query:
Select * from temp
where mark1 is null;
Even though the number of rows that match the condition is very small, Oracle will end up performing a full table scan. As the mark2 column is not protected by a NOT NULL constraint, even though we know that the data doesn't contain any rows with null values in both mark1 and mark2, Oracle has no way of guaranteeing this based on the statistics of the individual columns from dba_tab_col_statistics; it assumes there is a chance that the null values in mark1 may overlap with null values in mark2, and that the index entries may not contain those rows. Due to this, Oracle prefers a full table scan. In this case, to make sure that the optimizer picks up the index, we have two solutions:
1. Add a NOT NULL constraint to the mark2 column.
2. Recreate the index with columns (mark1, mark2, 1), so that even when the mark1 and mark2 column values are null the entry is still stored; due to the presence of the constant column, which always contains a value, Oracle prefers the index scan.
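The second solution might look as follows (a sketch against the temp table above):

```sql
-- Recreate with a constant trailing column so fully-null keys are still indexed
drop index idx;
create index idx on temp(mark1, mark2, 1);
```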
For the following query:
Select * from temp
where mark2 is null;
In this case as well, since the mark1 column is not protected by a NOT NULL constraint, Oracle infers that there is a chance that the null values in mark1 may overlap with null values in mark2, and prefers a full table scan. Here too we have two solutions:
1. Add a NOT NULL constraint to the mark1 column.
2. Add an extra trailing column to the index which is not a constant and which is protected by a NOT NULL constraint.
In this particular case, adding an extra constant column will not work because function-based indexes don't support skip scans; this is one of the limitations of skip scans.
Both the following queries ensure an index scan as Oracle is
deterministic that the values are persisted in the index columns because the
predicates guarantee that.
Select * from temp
where mark1 is not null and mark2 is null;
Select * from temp
where mark1 is null and mark2 is not null;
For the following query:
select * from temp where mark1 = 12 and mark2 is null;
Even though the trailing column is null, in order to satisfy queries that contain just the predicate mark1 = ?, Oracle has to store all the entries in the leaf blocks. Since all the entries with mark2 as null are stored last, Oracle reads the last branch block entry for mark1 = 12 and reads the corresponding index leaf blocks.
When it comes to Oracle's ability to perform a FAST FULL INDEX SCAN and/or INDEX FULL SCAN, having a NOT NULL constraint is mandatory on at least one column of a concatenated or multi-column index. There is yet another important aspect that a DBA should be aware of when it comes to NOT NULL constraints. Consider the following example:
create table
students(student_id number primary key, name varchar2(20) not null, marks
number not null);
create index
name_idx on students(name);
insert into
students select rownum, dbms_random.string(0,20),trunc(dbms_random.value()*100)
from dual connect by level < 1000000;
commit;
In the above table all the columns student_id, name, and marks are protected by NOT NULL constraint.
For the queries such as the following:
select * from
students where student_id is null;
select * from
students where student_id is null order by student_id;
select * from
students where name is null order by student_id;
select * from
students where marks is null;
Oracle doesn’t even bother to read the segment (table or index blocks)
as optimizer finds a conflict between the WHERE-clause predicates and the constraints declared
on the table provided that the constraint is VALIDATED (DBA/USER_CONSTRAINTS.VALIDATED). This is a scenario in which Oracle executes
the SQL query without even touching the table or index segment.
There are many
takeaways from this. Let’s start with the first SQL statement:
select * from
students where student_id is null;
Here student_id
column is protected by a primary key constraint (UNIQUE – NOT
NULL) and it is evident from the predicates
used (student_id is null). Once the optimizer detects a conflict between the NOT NULL constraint and the predicates used, the predicate section of the explain plan shows filter(NULL IS NOT NULL). If such a predicate (one that conflicts with a constraint) is used along with other predicates, Oracle will or will not ignore it depending on whether the predicates are connected by AND or OR. Let's look at the execution statistics:
From the above we can clearly see that Oracle didn’t perform a single
block read.
Remember that if the constraint is marked as NOT VALIDATED, Oracle treats it as if the constraint never existed and performs IO to check the rows, causing unnecessary IO (provided that there are no rows with the corresponding column values as null). Another important aspect of this behaviour is that the explain plan can indicate an index that doesn't make sense. Consider the following query:
select * from
students where name is null order by student_id;
The execution/explain plan for the above statement is as follows:
Instead of wondering why Oracle picked the index students_pk2, check the predicate information section: filter(NULL IS NOT NULL) is a pure giveaway of the above-mentioned behavior. When all the columns of an index entry are NULL, Oracle doesn't store that entry in the index. Don't take this as an ironclad rule: it holds true for a normal B-Tree index, but not for descending, function-based, reverse key or bitmap indexes. Bitmap indexes do store nulls.
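A quick way to see this (a sketch; object names are illustrative) is that an IS NULL predicate on a single-column B-Tree indexed column cannot be answered from that index, since rows with a NULL key have no entry in it:

```sql
create table students (student_id number, name varchar2(20), marks number);
create index marks_idx on students (marks);

-- Rows whose MARKS value is NULL have no entry in MARKS_IDX,
-- so the optimizer cannot use the index for this predicate:
explain plan for select * from students where marks is null;
select * from table(dbms_xplan.display);
-- Expect a TABLE ACCESS FULL rather than an index access path.
```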
Descending Indexes and nulls:
By now we already know that Oracle doesn't persist entries in the index when all the columns of the index are null, and that Oracle internally uses the function SYS_OP_DESCEND() on the column value to get the form that is stored in a descending index. The use of sys_op_descend has strange implications when it comes to null. When null is passed, this function returns 00, as shown below:
SQL> select sys_op_descend(null) from dual;

SY
--
00

Since this function returns a value for null, the value 00 is stored in the index leaf blocks. In other words, nulls are stored in a descending index, but retrieving the rows with nulls presents a challenge. Let's create a schema as follows:
create table students(student_id number, name varchar2(30), marks number);

insert into students select rownum, dbms_random.string(0,30), round(dbms_random.value()*1000) from dual connect by level < 1000000;

insert into students select rownum, dbms_random.string(0,30), null from dual connect by level < 100;

commit;

create index marks_idx on students(marks) desc;
Consider the following query:
select * from students where marks is null;
Oracle performs a full table scan (even though the nulls are present),
in this case Oracle doesn’t even apply the function SYS_OP_DESCEND.
Interestingly when we modify the predicates section as follows, Oracle
uses the index.
select * from students where sys_op_descend(marks) = sys_op_descend(null);
Another way to address this issue is to add a constant expression as a trailing column of the index, as follows:
create index marks_desc_idx on students(marks desc, 1);
Interestingly even in this Oracle ends up with a full table scan for the
following query:
select * from students where marks is null;
Unless we modify the predicate section as follows:
select * from students where sys_op_descend(marks) is null;
So, when it comes to single-column descending indexes, Oracle does store nulls. In a descending index, nulls are stored at the left-most section of the B-Tree structure.
Reverse Key indexes and Nulls:
When it comes to reverse key indexes, things are quite different. Even though the predicate section doesn't show a specific function being applied, and only the bytes of the column value are reversed, things work much like a regular index: for a single-column index, NULLs are not stored, and for multi-column indexes the behavior is the same as a normal index.
For single-column indexes, even after adding a constant trailing column to the index, Oracle doesn't prefer an index range scan.
create index marks_idx on students(marks, 1);
When it comes to multi-column indexes, if the trailing column is null and the application issues queries such as the following:
select * from students where marks = 10 and dept_id is null;
Oracle starts by scanning the entire portion of the index and the corresponding table blocks for MARKS = 10, even though the nulls are stored at the right-most end (this is not the case with a regular index).
For a multi-column index, nulls are usually stored at the right-most part of the B-Tree structure for the corresponding value.
Function Based Indexes:
Similar to descending indexes, unless we specify either the entire function expression or the virtual column, the optimizer may not prefer an index range scan. Whether nulls are stored in the index also depends on the function used.
Predicates used and column order in Multi-Column indexes:
ORDER BY clause and use of PGA:
Whether the PGA is used for sorting the final result set depends on whether the column named in the ORDER BY clause is indexed and whether the optimizer ends up using that index. Use of any non-indexed column (one not used in the predicates or not covered by the chosen index) in the ORDER BY clause results in the PGA being used for sorting.
Remember that if Oracle decides to use the PGA to store the intermediate results, then it will utilize ASYNC IO or batch the table block reads (through index lookups) as much as it can, based on several factors.
Use of an ORDER BY clause with ASC or DESC severely undermines the ASYNC IO capabilities of the database, and the scans are mostly sequential in nature, reading one table block at a time. (You can see batching or ASYNC IO even with an ORDER BY clause, provided that we specifically select a single index key value.)
Index Range Scan Descending:
An Index Range Scan Descending is identical to an index range scan except that the database returns rows in descending order. The optimizer can choose a descending scan when the SQL statement contains an ORDER BY clause with the DESC keyword, or when seeking a value less than a specified value. There is more to it than just returning rows in descending order. Let's start with very basic examples and the schema as follows:
create table students (student_id number, dept_id number, sub_id number, marks number);

alter table students add constraint students_pk primary key (student_id);

create index sub_marks_idx on students(sub_id,marks);

insert into students select rownum, round(dbms_random.value()*100), round(dbms_random.value()*100), round(dbms_random.value()*100) from dual connect by level < 10000000;

commit;
Consider the following query:
select * from students where student_id < 10 order by student_id desc;
The SQL Trace is as follows:
WAIT
#140603161896376: nam='db file sequential read' ela= 271 file#=7 block#=475
blocks=1 obj#=74652 tim=14585256046
WAIT
#140603161896376: nam='db file sequential read' ela= 248 file#=7 block#=3399
blocks=1 obj#=74652 tim=14585256472
WAIT
#140603161896376: nam='db file sequential read' ela= 219 file#=7 block#=479
blocks=1 obj#=74652 tim=14585256742
WAIT
#140603161896376: nam='db file sequential read' ela= 243 file#=7 block#=40206
blocks=1 obj#=74651 tim=14585257042
WAIT
#140603161896376: nam='db file sequential read' ela= 217 file#=7 block#=99345
blocks=1 obj#=74651 tim=14585257914
WAIT
#140603161896376: nam='db file sequential read' ela= 253 file#=7 block#=105735
blocks=1 obj#=74651 tim=14585258252
WAIT
#140603161896376: nam='db file sequential read' ela= 212 file#=7 block#=30681
blocks=1 obj#=74651 tim=14585258517
WAIT
#140603161896376: nam='db file sequential read' ela= 208 file#=7 block#=11531
blocks=1 obj#=74651 tim=14585258770
WAIT
#140603161896376: nam='db file sequential read' ela= 228 file#=7 block#=155448
blocks=1 obj#=74651 tim=14585259042
WAIT
#140603161896376: nam='db file sequential read' ela= 209 file#=7 block#=31423
blocks=1 obj#=74651 tim=14585259300
WAIT
#140603161896376: nam='db file sequential read' ela= 205 file#=7 block#=89683
blocks=1 obj#=74651 tim=14585259558
WAIT
#140603161896376: nam='db file sequential read' ela= 207 file#=7 block#=168747
blocks=1 obj#=74651 tim=14585259813
If we look at which table block each corresponding student_id is stored in, we get the following results:
select student_id, dbms_rowid.rowid_block_number(rowid) "BLOCK" from students where student_id < 10;
Oracle works its way from the last entry in reverse until the lower bound is reached, eliminating the need to sort the rows. The extent to which ASYNC IO can be used is completely undermined by the ORDER BY clause, because BATCHING or PREFETCH could cause any block (one that may contain a high value) to come into the buffer cache first. Remember that even if an indexed column value spans multiple leaf blocks, Oracle always starts from the last entry and works its way in reverse order. One might wonder why. One way of looking at this is that, due to the inherent structure of the index, a leaf block can contain multiple column values and their corresponding ROWIDs, and these column values can span multiple leaf blocks; Oracle has no way of knowing in which particular leaf block of the doubly linked list the previous value is stored, and determining that would require traversing the B-Tree structure or using the branch blocks, which is expensive considering the number of latch gets involved. When it comes to multi-column indexes, if the predicate on the leading column contains an = condition and there is an ORDER BY clause with DESC on the trailing column, Oracle can still consider an index range scan descending.
Since Oracle can perform an index range scan in descending order, the usefulness of a descending index is limited for single-column indexes; its benefits show up clearly in multi-column indexes.
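As a sketch of the multi-column case (using the students/sub_marks_idx schema created above), an equality predicate on the leading column plus ORDER BY ... DESC on the trailing column lets the optimizer pick a descending range scan and avoid the sort:

```sql
-- Equality on the leading column, descending order on the trailing column:
explain plan for
  select * from students where sub_id = 10 order by marks desc;
select * from table(dbms_xplan.display);
-- Expect INDEX RANGE SCAN DESCENDING on SUB_MARKS_IDX
-- and no SORT ORDER BY step in the plan.
```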
Now that we have seen various scenarios that can lead to additional IO
while performing an index range scan, let’s look at various optimizations that
Oracle uses while performing an index range scan. These optimizations depend on
various factors such as
- Number of ROWIDs for a particular index key value.
- Client fetch size.
- Operations that use PGA such as aggregate, group by etc.
- Use of ORDER BY clause.
Index Skip Scan:
Prior to 9i, the optimizer ignored the index when the predicates didn't contain the leading column of a multi-column index; in other words, the index was used only when the predicates contained the leading subset, or all, of the columns of a multi-column index. Starting with 9i, with the introduction of skip scans, Oracle can consider a multi-column index even when the leading column is not included in the predicates.
Consider the following example:
create table students (student_id number, name varchar2(300), dept_id number, sub_id number, marks number);

create index sub_marks_idx on students(sub_id, marks);

create index marks_idx on students(marks);

insert into students select rownum, dbms_random.string(0,300), round(dbms_random.value()*100), round(dbms_random.value()*100), round(dbms_random.value()*10000) from dual connect by level < 1000000;

commit; -- gather statistics post commit
Let's look at the execution plan/statistics for the following statement, toggling marks_idx between visible and invisible so that we get an index range scan and an index skip scan respectively:
select * from students where marks = 0;
The execution plans are as follows:
It is obvious from the above that the plan with index skip scan resulted
in higher buffer gets, elapsed time and CPU per execution than the plan with
index range scan. But this doesn’t tell us the entire story and concepts behind
a skip scan.
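The comparison above can be driven explicitly, either by toggling index visibility or with the standard INDEX and INDEX_SS hints (a sketch against the schema created earlier):

```sql
-- Hide the single-column index so only the multi-column index is available:
alter index marks_idx invisible;
select * from students where marks = 0;  -- optimizer may now pick INDEX SKIP SCAN on SUB_MARKS_IDX

alter index marks_idx visible;
select * from students where marks = 0;  -- back to INDEX RANGE SCAN on MARKS_IDX

-- The same comparison driven by hints instead of visibility:
select /*+ index_ss(students sub_marks_idx) */ * from students where marks = 0;
select /*+ index(students marks_idx) */ * from students where marks = 0;
```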
Let’s look from the storage perspective:
10046 level 12 trace of plan with index skip scan:
10046 level 12 trace of plan with index range scan:
For the purpose of understanding, think of a multi-column or concatenated index as a partitioned table with the leading column as the partitioning key. When querying a partitioned table, if the predicate doesn't contain the partitioning key, partition pruning cannot happen, and since the data can be in any partition, Oracle has to read each and every partition to get the corresponding rows. Similarly, in a multi-column index, when the leading column is not specified, rows that match the predicate (on only the trailing columns) can be in any branch of the B-Tree structure. Do recall that when the cardinality or NDV of the leading column is low, Oracle includes the trailing columns in the branch blocks to efficiently traverse the B-Tree to find a particular key value. INDEX SKIP SCAN exploits this aspect of the B-Tree structure by reading all the corresponding branch and leaf blocks that may contain the value of interest. This flow is not as simple as it appears. Let's look at how this works in more detail:
kdxbrlmc 29371974=0x1c02e46 //no starting value stored
kdxbrsno 6
row#0[8018] dba: 29397625=0x1c09279
col 0; len 2; (2): c1 0f //starting column
value in 0x1c09279 branch block. All less values are
stored in 0x1c02e46
col 1; len 2; (2): c2 61
col 2; TERM
row#1[8037] dba: 29381124=0x1c05204
col 0; len 2; (2): c1 1d
col 1; len 2; (2): c2 05
col 2; TERM
row#2[8002] dba: 29403110=0x1c0a7e6
col 0; len 2; (2): c1 29
col 1; TERM
We know that a B-Tree branch/root block doesn't store the starting value for the first leaf/branch block pointed to by kdxbrlmc, and the starting value of col 1 indicated by the first branch block entry is higher than 80 (the internal representation of the decimal value 0), as shown below:
row#0[8018] dba: 29397625=0x1c09279
col 0; len 2; (2): c1 0f
col 1; len 2; (2): c2 61 //starting value higher than 80
col 2; TERM
Oracle has to
mandatorily read the block as there is a chance that the relevant index entries
may be present. Let’s look at the dump
of the first branch block 0x1c02e46.
kdxbrlmc 29360494=0x1c0016e
kdxbrsno 286
kdxbrbksz 8056
kdxbr2urrc 0
row#0[4466] dba: 29397613=0x1c0926d
col 0; len 1; (1): 80
col 1; len 2; (2): c2 08 //starting value higher than 80
col 2; TERM
row#1[4476] dba: 29381211=0x1c0525b
col 0; len 1; (1): 80
col 1; len 3; (3): c2 0d 5b
col 2; TERM
…
row#17[4571] dba: 29381155=0x1c05223
col 0; len 2; (2): c1 02
col 1; len 3; (3): c2 03 2d
col 2; TERM
The same logic applies here: since the first entry has a higher starting value, Oracle has to read the leaf block, which may contain the values of interest. Let's look at the dump of the leaf block 0x1c0016e.
row#0[4459] flag: -------, lock: 0,
len=14
col 0; len 1; (1): 80
col 1; len 2; (2): c1 02 // First entry
indicates a higher value so backtrack
col 2; len 6; (6): 01 c0 90 36 00 12
row#1[4473] flag: -------, lock: 0,
len=14
col 0; len 1; (1): 80
col 1; len 2; (2): c1 03
col 2; len 6; (6): 01 c0 20 50 00 11
Since the first entry in the leaf block has a higher value for col 1, Oracle knows that in this section of the B-Tree there are no matching values for the predicate marks = 0, so it moves to the next higher value of col 0 (the leading column) and proceeds ahead; here, TERM and the ROWID section are used to determine whether the values of interest are present in the previous index entries. This process repeats until the right-most leaf block is read (remember why the right-most leaf block in the B-Tree structure is read). If the NDV of the leading column is high, then Oracle will store only the leading column in the branch blocks, in which case performing a skip scan involves reading all leaf blocks, which is highly undesirable. Due to this, several branch and leaf blocks must be visited for each distinct value of the leading column to determine whether the actual values of interest (those matching the specified predicate) exist.
Following are few interesting points to note:
- A limitation of skip scan is the lack of support for an IN-clause, though bounded and unbounded predicates such as <, >, <=, >=, between are supported.
- As previously stated, Oracle keeps adding multiple leading columns to the branch blocks once it determines that two columns may not be enough to efficiently traverse the B-Tree structure; due to this, Oracle can perform a skip scan even when a third or sixth column of a multi-column index is specified in the predicates.
- Immediately following an index rebuild the number of reads on branch blocks will be reduced, as Oracle tends towards having separate entries for each distinct leading column value following index leaf block splits as the data gets loaded.
Index Fast Full Scan:
An index fast full scan is similar to a full table scan except that it reads the index as if it were a table. Oracle reads the index blocks as in a full table scan, in an orderly fashion, extent by extent. Remember that these index blocks are read as they are stored on disk and not in order from the left-most index leaf block to the right-most index leaf block (which is what happens with an Index Full Scan). Due to this, an Index Fast Full Scan cannot eliminate a sort operation.
Oracle uses multiblock IO to read the index segment (even though the root and branch blocks are read into the PGA, they are ignored, as they contain only the information needed to traverse the B-Tree structure). Bitmap indexes do support Index Fast Full Scans. In order for the optimizer to pick an Index Fast Full Scan the following conditions must be met.
- All the columns referenced in the query must form either a subset of the index columns or all of the index columns. It doesn't matter which position a column occupies in the index; Oracle can still perform an Index Fast Full Scan if the remaining conditions are met.
Eg: marks_idx(mark1, mark2, mark3, dept_id)
select dept_id, count(*) from students group by dept_id;
- A NOT NULL constraint:
o If the index is a single-column index, then that column has to be protected by a NOT NULL constraint.
o If the index is a multi-column index, at least one of the indexed columns has to be protected by a NOT NULL constraint; it doesn't matter which position that column occupies in the multi-column index.
- In the absence of a NOT NULL constraint, a predicate with an IS NOT NULL condition on any indexed column, or on any column referenced in the query, is enough to trigger an Index Fast Full Scan.
- The operation has to be either a SELECT or a DELETE.
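As a sketch of the NOT NULL point (illustrative names): if MARKS carries no NOT NULL constraint, an explicit IS NOT NULL predicate guarantees that every qualifying row has an index entry, which makes the index a legal substitute for the table:

```sql
create index marks_idx on students (marks);

-- May be answered with an INDEX FAST FULL SCAN: the predicate
-- guarantees no qualifying row is missing from the index.
select marks from students where marks is not null;

-- Without the predicate (and without a NOT NULL constraint on MARKS),
-- the optimizer must use the table, since rows with NULL marks
-- have no entry in the index.
select marks from students;
```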
When it comes to an Index Fast Full Scan, Oracle uses the value of _very_large_table_threshold (which is ~5X _db_block_buffers, or 5 times the size of the buffer cache) as the upper limit to decide whether to perform the scan using db file scattered read or direct path read. If the object size in blocks is less than the threshold specified by _very_large_table_threshold, Oracle always performs the Index Fast Full Scan using db file scattered read, i.e. a buffered read. Note that _very_large_table_threshold is used to decide between a direct path read and a db file scattered read for a full segment scan only for indexes and IOTs.
You can observe this behavior by tracing the session using the following
statement:
alter session set events 'trace [NSMTIO] disk highest';
The following is a snippet of the trace indicating the same behavior:
NSMTIO: kcbivlo: nblks 7048 vlot 500 pnb 51728 kcbisdbfc 0 is_large 0
NSMTIO: qerixFetchFastFullScan:[MTT < OBJECT_SIZE <
VLOT]:NSMTIO: AdditionalInfo: Object_size: 7048 (blocks), vlot=258640
SqlId=4wupv26yztx6x, plan_hash_value=3862345769,Object#=74746
NSMTIO: kcbzib: Cache Scan triggered for tsn = 4, rdba=0x0142b780, fn = 5,
kobjd = 74747, block cnt = 0, noncontig = TRUE pre-warm = FALSE, prefetch =
FALSE
This behavior can become problematic in environments with large buffer caches. For example, if your buffer cache is 100GB, Oracle will choose a buffered read for any index smaller than ~500GB, and remember that each time you bring a buffer into the buffer cache there is a penalty in terms of latches, not to mention the need to find a free buffer. Unless the blocks in the buffer cache are already hot (their touch count is high), this can significantly increase the IO activity on the server if buffers are aged out due to this scan. In such environments it's better to reduce the value of _very_large_object_threshold (default value 500, indicating 5 times the buffer cache) to an appropriate value (usually 100, meaning the same size as the buffer cache). We can also override the behavior using _serial_direct_read (always, never, auto); this parameter acts as a master switch that controls whether direct path reads should always be used, never be used, or be chosen automatically by Oracle based on the number of blocks. Now that we have looked at one aspect of this, let's look at the direct path read decision and how other factors can impact performance. On a side note, a Fast Full Index Scan has the ability to flood the buffer cache.
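A sketch of the override (these are hidden underscore parameters; changing them is generally something to try in a test environment, not production):

```sql
-- Force direct path reads for serial full segment scans:
alter session set "_serial_direct_read" = always;

-- Force buffered reads instead:
alter session set "_serial_direct_read" = never;

-- Restore the default, size-based decision:
alter session set "_serial_direct_read" = auto;
```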
By now we already know that Oracle performs a full object checkpoint
before performing a full table or fast full index scan using direct path read.
Let’s see the downside of this.
CR block generation/Delayed
commit cleanout.
An object level checkpoint results in flushing all the dirty blocks that belong to a particular segment to disk, even though the transaction that modified the blocks may still be in progress (blocks modified, changes uncommitted). Once the object level checkpoint is complete, Oracle reads the table or index blocks into the PGA, during which the server process checks the ITL portion of each block to see if there are any open transactions or ITL slots. If there are open ITL slots or transactions in progress, Oracle rolls back the uncommitted changes by reading the corresponding undo blocks (those that contain the relevant entries) into the SGA (buffer cache) and proceeds ahead.
The following snippet of 10046 level 12 trace reveals the details:
This does impact the response time of the SQL statement, as the session reads undo blocks into the SGA (if they are not already present) and rolls back the changes applied to each individual block. Remember that these undo blocks are read into the SGA (buffer cache) during this process. So, does a direct path read result in reading blocks into the buffer cache?
-
It doesn’t for the table/index blocks.
-
But it does result in reading undo blocks into the
buffer cache.
It sometimes happens that multiple transactions modify the same set of index blocks before the start of the Index Fast Full Scan (which flushed these dirty blocks to disk, thereby marking the buffers as clean), and in the meanwhile, due to space pressure in the buffer cache, these buffers are replaced by new blocks coming in; the session that started the transaction will then not clean the ITL slots post commit, for efficiency purposes. Due to this behavior Oracle has to clean up the ITL slots the next time the blocks are read.
So now enter the Index Fast Full Scan again: if Oracle detects that it has to perform a delayed commit cleanout, it performs the cleanout on the blocks by reading the corresponding undo blocks. But since these index blocks are read into the PGA, the cleanout is performed in the PGA, and as these blocks are discarded once they are read or accessed, subsequent sessions performing an Index Fast Full Scan will repeatedly perform the commit cleanout over and over again. The following is a snippet of trace depicting the same behavior:
following is a snippet of trace depicting the same behavior:
Remember that the optimizer does consider and perform costing for an Index Full Scan whenever it performs an Index Fast Full Scan. Oracle never accesses the table blocks using an INDEX FAST FULL SCAN for a SELECT; the exception is a DELETE statement. For a DELETE, if the optimizer determines that accessing the table blocks through an index range scan is costlier, Oracle can access the table blocks through an INDEX FAST FULL SCAN when the SQL statement deletes
- most/all rows from the table, or
- when the SQL statement doesn't contain any predicates.
The explain plan for the DELETE statement will be as follows:
Once the session starts scanning the index using an INDEX FAST FULL SCAN, it scans the entire segment from the initial extent to the last extent, unless we restrict the number of rows returned from the query using clauses such as ROWNUM or FETCH FIRST N ROWS ONLY. Do understand that a NOT NULL constraint on one of the indexed columns is mandatory for Oracle to consider an INDEX FAST FULL SCAN (Exadata - query offloading).
You can force an INDEX FAST FULL SCAN using the INDEX_FFS SQL hint. Since the behavior is the same as a full table scan, the availability of ASYNC IO does improve performance (depending on how the extents are allocated and how evenly the data file/files are spread across multiple disks).
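For example (table alias and index name are illustrative), the INDEX_FFS hint takes the table alias and the index name:

```sql
-- Request an index fast full scan of MARKS_IDX explicitly;
-- the IS NOT NULL predicate keeps the index a legal substitute
-- for the table in the absence of a NOT NULL constraint.
select /*+ index_ffs(s marks_idx) */ count(*)
from   students s
where  marks is not null;
```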
Index Full Scans:
The server process traverses down the B-Tree structure by reading the root block first, navigating down to the left-most leaf block, and then proceeding by reading the subsequent index leaf blocks until all the leaf blocks are read (one block at a time). In other words, the server process starts by reading the root block and obtains the address of the left-most branch or leaf block from the kdxbrlmc entry; it repeats the process with the block pointed to by kdxbrlmc until it reaches the left-most index leaf block, and then works through the index by reading the blocks pointed to by the kdxlenxt entry in each index leaf block header. Since the next leaf block can reside at any position in the index segment and its location cannot be determined until the current leaf block is read, the database uses single block IO.
For SELECT statements Oracle will never access the table blocks through the INDEX FULL SCAN access path; table access through this path happens only for DELETE or UPDATE statements. The optimizer considers an INDEX FULL SCAN when
- A predicate in the query references a column in the index.
- When no predicates are used, and all the following conditions are met:
o All columns in the index are specified in the query.
o At least one indexed column is protected by a NOT NULL constraint.
- The query includes an ORDER BY clause on the NOT NULL columns.
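A sketch of the ORDER BY case (assuming a primary key index STUDENTS_PK on STUDENT_ID, as created earlier): an index full scan returns the rows already in key order, so the sort disappears from the plan:

```sql
explain plan for
  select student_id from students order by student_id;
select * from table(dbms_xplan.display);
-- Expect INDEX FULL SCAN on STUDENTS_PK and no SORT ORDER BY step.
```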
Following are the main differences between INDEX FULL SCAN and INDEX FAST FULL SCAN.

| Operation             | INDEX FULL SCAN           | INDEX FAST FULL SCAN      |
| Sort Elimination      | Yes                       | No                        |
| IO type               | Single block              | Multi-block               |
| Access Table contents | Select - NO, Delete - YES | Select - NO, Delete - YES |
| Async IO              | No                        | Yes                       |
| NOT NULL req          | Yes                       | Yes                       |
INDEX FULL SCAN can also be
performed in descending order or in other words INDEX FULL SCAN DESCENDING is
also supported.
Index Join Scans:
The optimizer can consider a hash join between multiple indexes that together contain all the columns referenced in the query, without the need to access the table, as all the data is retrieved from the indexes.
Consider the following example:
create table students(student_id number, dept_id number, name varchar2(30), sub_id number, marks number, details varchar2(3000), details2 varchar2(3000));

create index dept_student_idx on students(dept_id, student_id);

create index sub_student_idx on students(sub_id, student_id);

Populate the data into the table, and issue the following query:

select student_id, sub_id, dept_id from students where dept_id between 1 and 10 and sub_id between 1 and 10;
The table has been purposely bloated (why? We will get to this shortly).
The explain plan is as follows:
Let's consider a best-case scenario in which there are only 10 rows with column values dept_id = 4 and sub_id = 4 that match the above predicates, and no other rows fall in the range even for a single predicate alone. Because the session has to perform operations such as traversing both indexes, applying the hash function and building a hash table, performance in terms of elapsed time and maximum achievable executions per second will still be slow compared to having an index on (dept_id, sub_id, student_id) or even on (dept_id, sub_id). That said, an index join scan can also improve performance by avoiding access to the table blocks; this depends on the selectivity of the predicates and the clustering factor.
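The plan above can be requested explicitly with the INDEX_JOIN hint (a sketch against the schema created above):

```sql
select /*+ index_join(s dept_student_idx sub_student_idx) */
       student_id, sub_id, dept_id
from   students s
where  dept_id between 1 and 10
and    sub_id  between 1 and 10;
-- Expect a HASH JOIN of the two index scans under a VIEW step,
-- with no TABLE ACCESS operation in the plan.
```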
Index join scans are pretty useful in the following scenarios:
- When dealing with large tables (a greater number of columns and/or high average row length).
- When adding additional indexes would impact DML performance.
- When the elapsed times of the SQL queries matter only up to a certain extent, i.e. you do not have a very strict threshold.
Index Full/Range Scan (MIN/MAX):
When selecting data from only a particular indexed column, or using aggregate functions such as sum, avg etc. that require scanning every row or several entries from the index (functions that require reading more than one index entry), the optimizer has options such as the following to choose from:
- INDEX RANGE SCAN
- INDEX FULL SCAN
- INDEX FAST FULL SCAN
- FULL TABLE SCAN
The functions MIN and MAX typically result in only a single row, and since the data is sorted either in ascending or descending order in the index, finding a minimum or maximum value is pretty straightforward: traverse either the left-most or right-most part of the index, which requires only a few lookups (in the best case the number of reads or consistent gets is equal to the height of the index, BLEVEL + 1). To find either a minimum or maximum indexed column value, Oracle can utilize either an INDEX FULL SCAN (MIN/MAX) or an INDEX RANGE SCAN (MIN/MAX) depending on the predicates.
Consider the following schema:
create table students(student_id number primary key, dept_id number, sub_id number, marks number);

create index marks_idx on students(marks);

create index dept_sub_marks_idx on students(dept_id, sub_id, marks);

-- populate some random data
When we issue a query such as the following:
select min(student_id) from students;
Let’s look at the plan:
The corresponding statistics are as follows:
The trace is as follows:
We can confirm the same using the treedump of the index as
follows:
The procedure is basically this:
- Read the root/branch block.
- Read the left-most (first) or right-most (last) entry depending on MIN or MAX.
- Read the first/last leaf block.
The optimizer considers an INDEX FULL SCAN (MIN/MAX) when the SQL uses either the MIN or MAX function on the leading index column, and prefers an INDEX RANGE SCAN (MIN/MAX) when the SQL uses MIN or MAX on a trailing column with an equality predicate on the leading column. Consider the following query:
select max(marks) from students where dept_id = 19 and sub_id = 24;
The corresponding explain plan is as follows:
Remember that an INDEX FULL SCAN (MIN/MAX) or INDEX RANGE SCAN (MIN/MAX) works only if it is possible to find the value by traversing down the left or right path of the index.
Does an INDEX FULL SCAN (MIN/MAX) or INDEX RANGE SCAN (MIN/MAX) require only a few block reads (equal to the height of the index, BLEVEL + 1) all the time? Unfortunately, the answer is no. Before proceeding further, let's take a step back and look at a behavior that is vital to the understanding: even after all the index entries in an index leaf block are deleted, that leaf block still exists in the index structure unless you shrink, coalesce or rebuild the index.
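A sketch of the cleanup options just mentioned (the index name is illustrative):

```sql
alter index marks_idx coalesce;      -- merge sparse/empty leaf blocks in place
alter index marks_idx shrink space;  -- compact the segment and release space
alter index marks_idx rebuild;       -- recreate the index structure entirely
```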
That being said, if the index is on a column that is populated using a monotonically increasing value, be it from an application or a database sequence, and we delete the old rows or column values from the table, the emptied leaf blocks stay linked into the index structure. When a user then issues a query such as min(column_name), due to the way the index is structured and the absence of a starting value, there is always a possibility that the left-most index leaf block contains the minimum value; if the value doesn't exist there, Oracle proceeds along the chain until it finds one. The same applies when finding a max value.
As data is being loaded into the table at the extreme right end, issuing a select max(value) can cause CR blocks to be generated.
Oracle cannot perform an INDEX FULL SCAN (MIN/MAX) on function-based indexes (including descending indexes, as these are internally implemented as function-based indexes) or reverse key indexes, or even an INDEX RANGE SCAN (MIN/MAX) on the trailing column with the predicate specified on the leading column.
Automatic Indexing:
Bitmap Index:
The structure of a bitmap index is essentially the same as a normal index, with the exception of how the data is stored in the leaf and branch blocks (the fundamentals of how entries are organized in the branch blocks, and how Oracle uses these entries to traverse down the B-Tree structure to reach the leaf block that contains the corresponding index key value, remain the same). Instead of storing the ROWID of each corresponding row, Oracle uses a bitmap to represent the ROWIDs for each indexed key value. Let's look at an example:
create table students (student_id number, name varchar2(30), mark1 number, mark2 number);

insert into students select rownum, dbms_random.string(0,30), round(dbms_random.value()*100), round(dbms_random.value()*400) from dual connect by level < 1000000;

create index mark1_idx on students(mark1);

create bitmap index bitmap_mark1_idx on students(mark1) invisible;
Let's peek at the normal (B-Tree) index leaf block (uncompressed):
As we can see from the above, Oracle stores the column value along with the ROWID for each row in the table. Now let’s look at the bitmap index leaf block:
A bitmap index entry stores three additional columns apart from the actual column value:
- Col 0 is the actual column value.
- Col 1 is the start of the ROWID range.
- Col 2 is the end of the ROWID range.
- Col 3 is the actual bitmap (the hex/internal representation of the bitmap).
Each bit in the bitmap corresponds to a particular ROWID (don’t be confused by the hex representation versus the bits of the actual internal bitmap; if the starting and ending ROWIDs are far apart, the bitmap requires more bytes to store). Oracle uses a mapping function to convert the bits into actual ROWIDs. Although the representation of data in the leaf blocks is different, a bitmap index serves the same purpose as a B-Tree (normal) index. Because the bitmap index stores the corresponding ROWIDs in a bitmap representation, the size of a bitmap index is typically smaller than that of a B-Tree index, as shown below:
The following are a few important characteristics that you should be aware of:
- The index key compression feature is not available for bitmap indexes.
- Analyze index … validate structure is supported only for B-Tree indexes.
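The logical shape of a bitmap index entry described above (Col 0 through Col 3) can be sketched in Python. This is an illustrative toy model only, not Oracle’s physical on-disk format; the sample table and values are hypothetical:

```python
# Toy model of bitmap index entries: one entry per distinct key
# value, holding the key, a start/end ROWID range, and a bitmap
# with one bit per row in that range. Not Oracle's actual format.

rows = [  # (rowid, mark1) -- tiny hypothetical sample table
    (0, 7), (1, 3), (2, 7), (3, 7), (4, 3), (5, 9),
]

def build_bitmap_entries(rows):
    entries = {}
    lo, hi = rows[0][0], rows[-1][0]  # covered ROWID range
    for value in sorted({v for _, v in rows}):
        bits = [1 if v == value else 0 for _, v in rows]
        entries[value] = {
            "col0_key": value,       # indexed column value
            "col1_start_rowid": lo,  # start of the ROWID range
            "col2_end_rowid": hi,    # end of the ROWID range
            "col3_bitmap": bits,     # one bit per possible row
        }
    return entries

entries = build_bitmap_entries(rows)
print(entries[7]["col3_bitmap"])  # [1, 0, 1, 1, 0, 0]
```

Note how a single entry covers every row for the key value 7, whereas a B-Tree would store three separate (key, ROWID) pairs for it.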
Let’s look at things from the overhead perspective. For each index key value, since Oracle stores both the starting and the ending ROWID of the range as additional columns in the index, an overhead of
1 (length byte of ROWID) + 6 (starting ROWID of range) + 1 (length byte of ROWID) + 6 (ending ROWID of range) = 14 bytes is mandatory.
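That fixed per-entry arithmetic can be spelled out as follows (the 6-byte figure is the restricted ROWID stored within a segment, per the breakdown above):

```python
# Fixed per-entry overhead of a bitmap index entry: a start ROWID
# and an end ROWID, each preceded by a one-byte length field.
ROWID_LEN_BYTE = 1   # length byte stored before each ROWID
ROWID_BYTES = 6      # restricted ROWID within a segment

overhead = (ROWID_LEN_BYTE + ROWID_BYTES) * 2  # start + end of range
print(overhead)  # 14 bytes, mandatory for every bitmap index entry
```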
Overhead due to the bitmap itself depends on various factors, such as the following:
- Hakan factor
- Clustering factor
Hakan Factor Effect:
We have already seen how the Hakan factor is calculated initially during table creation and how we can set it manually using the clause alter table … minimize records_per_block. The Hakan factor indicates the maximum number of rows that can fit into a particular block, but this value may not always accurately depict the actual number: the greater the row size, the smaller the number of rows that fit inside a block. While creating a bitmap index, Oracle uses the Hakan factor internally to allocate slots in the bitmap so that it can cover all the possible ROWID locations in a particular block. If the difference between the calculated Hakan value and the actual maximum number of rows in a block is high, Oracle will end up allocating extra slots in the bitmap to cover possible ROWIDs that may never exist.
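The wasted-slot effect can be quantified with a back-of-the-envelope sketch. The block count, row size, and Hakan value below are hypothetical numbers chosen for illustration, not values Oracle would report:

```python
# Sketch of why an overestimated Hakan factor inflates a bitmap
# index: one bitmap slot is reserved per possible row slot in each
# block, so slots that can never hold a row still consume bits.

def bitmap_slots(num_blocks, rows_per_block):
    # one bit reserved per (block, row-slot) combination
    return num_blocks * rows_per_block

blocks = 10_000
actual_rows_per_block = 50   # what really fits, given the row size
hakan = 200                  # a (pessimistic) Hakan factor estimate

used = bitmap_slots(blocks, actual_rows_per_block)
reserved = bitmap_slots(blocks, hakan)
print(reserved - used)  # bits reserved for rows that can never exist
```

With these numbers, 1.5 million bits are reserved for row slots that will never be populated, which is exactly the waste that manually lowering the Hakan factor eliminates.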
Let’s look at an example (the DDL is the same as above):

create table students (student_id number primary key, name varchar2(30), mark1 number, mark2 number);

insert into students select rownum, dbms_random.string(0,30), round(dbms_random.value()*100), round(dbms_random.value()*400) from dual connect by level < 1000000;

commit;

create bitmap index mark1_idx on students(mark1);
The Hakan factor Oracle calculates for the above table is as follows:
The size of the bitmap index is as follows:
Because bitmap indexes depend on the Hakan factor, any attempt to modify it using alter table … minimize records_per_block will fail. To modify the Hakan factor, we have to drop any bitmap indexes first and then recreate them afterwards.
SQL> alter table students minimize records_per_block;
alter table students minimize records_per_block
*
ERROR at line 1:
ORA-28602: statement not permitted on tables containing bitmap indexes
After dropping the bitmap index, setting the Hakan factor manually using alter table … minimize records_per_block, and recreating the bitmap index, the index size is reduced as follows:
Effect of Clustering Factor:
When rows with the same column value are stored adjacent to each other in the table blocks (in other words, when the clustering factor is low), Oracle can skip storing some of the additional data required to map the bits to the corresponding ROWIDs, resulting in a significant reduction in the bitmap and ultimately in the index size as well. Let’s look at the index leaf block for the above index:
Let’s reorganize the table as follows, recreate the bitmap index, and look at the contents of the index leaf block (note the truncate before the reload, so the reordered rows replace the originals):

create table temp as select * from students;
drop index mark1_idx;
truncate table students;
insert into students select * from temp order by mark1;
create bitmap index mark1_idx on students(mark1);
As we can see from the above, the length of col 3 is reduced significantly. The size of the bitmap index is now as follows:
The Hakan factor and the clustering factor can significantly influence the size of a bitmap index.
Regardless of the Hakan factor and the clustering factor, the size of a typical bitmap index will be smaller than a regular or compressed B-Tree index (assuming the NDV is low; we will look into this shortly). Now that we have seen the overhead associated with a bitmap index, let’s look at how many distinct values are acceptable.
Structure-wise, a B-Tree index is similar to a bitmap index, and the overhead associated with a bitmap index is basically the inclusion of the additional columns (starting range, ending range, bitmap) in the index leaf blocks (an overhead of 14 bytes is mandatory due to the starting and ending range). If there are enough rows for each distinct value to compensate for the additional overhead, a bitmap index is still preferable.
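A back-of-the-envelope break-even sketch makes this concrete. The constants are illustrative approximations (a length byte plus a 6-byte ROWID per B-Tree entry, and roughly one compressed bit per covered row for the bitmap), not exact Oracle figures:

```python
# Break-even sketch: a bitmap entry pays a fixed ~14-byte ROWID
# range overhead, while a B-Tree pays ~7 bytes of ROWID storage for
# every single row of that key value. Constants are illustrative.

BITMAP_FIXED_OVERHEAD = 14  # start/end ROWID range, from above
BTREE_ROWID_COST = 7        # length byte + 6-byte ROWID per row

def btree_bytes(rows_per_key):
    return rows_per_key * BTREE_ROWID_COST

def bitmap_bytes(rows_per_key):
    # assume roughly one bit per covered row after compression
    return BITMAP_FIXED_OVERHEAD + (rows_per_key + 7) // 8

for rows_per_key in (1, 2, 10, 100):
    print(rows_per_key, btree_bytes(rows_per_key), bitmap_bytes(rows_per_key))
```

With one row per key the bitmap entry is larger than the B-Tree entry, but by ten rows per key it is already several times smaller, which is why a million distinct values over ten million rows can still favor the bitmap.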
Consider the following case:
create table test (t1 number, t2 number);
Load the data into the table such that t2 has about 1 million distinct values and the table has about 10 million rows, and create B-Tree and bitmap indexes as follows:
create index btree_t2_idx on test(t2);
create bitmap index bitmap_t2_idx on test(t2) invisible;
SQL> select count(*), count(distinct t2) From test;
COUNT(*) COUNT(DISTINCT T2)
---------- -----------------
11290010 999985
The sizes of the indexes are as follows:
Even though there are about 1 million distinct values, the bitmap index is still smaller (don’t take this as an ironclad guarantee or rule: if there are many keys with only one or a few rows each and the data is highly skewed, a bitmap index can become larger than a B-Tree index). The compression achieved by a bitmap index (especially via the bitmap itself) is still higher than that of a regular compressed B-Tree index:
drop index btree_t2_idx;
create index btree_comp_t2_idx on test(t2) compress;
The sizes of the indexes are as follows:
DML and Bitmap Indexes:
Let’s look at the bitmap index leaf block again:
By now, we know that a bitmap index has only one index entry for each indexed column value, that the bitmap covers all the associated ROWIDs (the index leaf block contains the hex/internal representation of the bitmap bits), and that there is only one flag to indicate whether the index entry is locked. When we insert a row into the table with a t2 value of 1, Oracle in turn locks the entire index entry, thereby locking all the rows that correspond to the t2 value of 1. This can be seen in an index leaf block dump taken immediately after an insert/update/delete statement.
insert into test (t1, t2) values (1,1);
The index leaf block dump is as follows:
As we can see from the above, the entire index entry is locked. Something similar happens when we update a column value (remember that an update is basically a delete plus an insert), but for a bitmap index an update operation is more complex and can result in two index entries:
- Reconstruct the bitmap of the old index entry by removing the updated row’s bit, and write the updated entry.
- Insert a new index entry into the leaf block.
Surprisingly, even though the bitmap changes apply to the old entry, Oracle preserves the index entry of the old value instead of modifying its bitmap in place.
Nonetheless, bitmap indexes should only be created on tables that have little or no DML activity. (OLTP workloads can also benefit from bitmap indexes if the base tables are static; remember that, beyond the reduced index size, Oracle is more likely to perform a bitmap index single value scan or bitmap index range scan rather than a full table scan, which in reality is less costly and less resource intensive for equality predicates.)
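The locking-granularity difference behind this recommendation can be sketched in Python. This is a conceptual model only (hypothetical table and lock bookkeeping, not Oracle’s enqueue mechanics):

```python
# Conceptual sketch: a B-Tree locks one (key, ROWID) entry per
# modified row, while a bitmap index locks the whole index entry
# for a key value -- effectively every row sharing that value.

def rows_blocked_btree(modified_rowid, table):
    # only the touched row's own index entry is locked
    return {modified_rowid}

def rows_blocked_bitmap(modified_rowid, table):
    # the single entry for this key value covers all its rows
    value = table[modified_rowid]
    return {rid for rid, v in table.items() if v == value}

table = {0: 1, 1: 1, 2: 2, 3: 1}  # rowid -> t2 value
print(rows_blocked_btree(0, table))   # {0}
print(rows_blocked_bitmap(0, table))  # {0, 1, 3}
```

A single-row change under the bitmap index effectively serializes concurrent DML against every row with the same key value, which is why heavy-DML tables are poor candidates.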
Bitmap Conversion To/From ROWIDs:
Bitmap conversion typically involves converting an entry in the bitmap into a ROWID in the table. This conversion can happen both ways:
- Converting an entry in the bitmap to a ROWID: BITMAP CONVERSION TO ROWIDS.
- Converting a ROWID into a bitmap entry: BITMAP CONVERSION FROM ROWIDS.
Types of Bitmap Access Paths:
BITMAP INDEX SINGLE VALUE:
The optimizer uses a Bitmap Index Single Value access path when the predicate contains an equality operator against a bitmap-indexed column. The server process traverses down the B-Tree structure to the leaf block(s) that contain the value and converts the bitmaps into ROWIDs using TABLE ACCESS BY INDEX ROWID BATCHED.
Since the key value is stored in the index leaf blocks in the form of a bitmap, Oracle can effectively make use of batched table access and thereby asynchronous IO, subject to various factors such as the client fetch size. Remember that since we create bitmap indexes on low-cardinality columns, even when the number of rows retrieved from the table is very high, the optimizer still considers an index scan rather than a full table scan (this is not the case with a normal B-Tree index).
BITMAP INDEX RANGE SCAN:
When the SQL statement contains range predicates such as <, >, <=, >=, in, etc., the optimizer can consider a BITMAP INDEX RANGE SCAN.
Bitmap Merge:
If there are bitmap indexes on the individual columns referenced in a SQL statement’s predicates, Oracle has the ability to merge the results obtained from each index (BITMAP INDEX SINGLE VALUE or BITMAP INDEX RANGE SCAN, or both) using Boolean operators such as AND and OR.
Consider the following example:

create table students (student_id number, name varchar2(40), dept_id number, sub_id number, marks number);
//populate the table
create bitmap index sub_idx on students(sub_id);
create bitmap index marks_idx on students(marks);
//gather statistics

Consider the following query:

select * from students where sub_id = 12 and marks < 2;

The explain plan is as follows:
In the above query, the server process first obtains the bitmaps of all the index keys that fall under the range marks < 2 and merges them using a Boolean OR into a single bitmap, and then finally performs a Boolean AND between this resultant bitmap and the bitmap obtained for the key sub_id = 12.
Consider the following query:
select * from students where sub_id = 12 and marks = 2;
As we can see from the above, the server process performs a Boolean AND between two bitmaps and uses the resultant bitmap to generate a list of ROWIDs. The AND operation is similar to the following:
MARKS_IDX    | 1 | 0 | 1 | 0 | 1 | 1 | 0
SUB_IDX      | 1 | 1 | 0 | 1 | 0 | 1 | 0
FINAL BITMAP | 1 | 0 | 0 | 0 | 0 | 1 | 0
Similarly, Oracle can also perform an OR operation, as shown below:
In data warehouse environments, bitmap indexes are quite important for achieving star transformation.
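The BITMAP AND and BITMAP OR merges above can be sketched directly on Python lists, with one bit per candidate row (the bitmaps are the ones from the AND table):

```python
# Bitmap merge sketch: the server process combines per-index
# bitmaps bit by bit, then converts set bits back into ROWIDs.

marks_idx = [1, 0, 1, 0, 1, 1, 0]
sub_idx   = [1, 1, 0, 1, 0, 1, 0]

def bitmap_and(a, b):
    return [x & y for x, y in zip(a, b)]

def bitmap_or(a, b):
    return [x | y for x, y in zip(a, b)]

final = bitmap_and(marks_idx, sub_idx)
print(final)  # [1, 0, 0, 0, 0, 1, 0]

# only rows whose final bit is set are converted back to ROWIDs
matching_rows = [i for i, bit in enumerate(final) if bit]
print(matching_rows)  # [0, 5]
```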
Index Range Scan Optimizations:
Starting from 12c, Oracle predominantly uses batched table access (TABLE ACCESS BY INDEX ROWID BATCHED) and generic table access (TABLE ACCESS BY INDEX ROWID) while accessing table contents through an index. The performance of batched table access depends on whether asynchronous IO capability is available through filesystemio_options and on the file system layout. Remember that you may not notice any performance benefit over traditional blocking IO (pread64) if all the data files are stored on a single disk, or if the data files are not striped across multiple physical disks or RAID volumes. By default, when you create a database, the filesystemio_options parameter is set to none, which prevents database processes such as LGWR, DBWR, and the server processes from utilizing the asynchronous IO capabilities provided by the underlying operating system. If filesystemio_options is set to none, all the read calls (pread64, readv, etc.) initiated by the processes will be synchronous, and if the data files reside on a volume that is striped/mirrored across multiple disks, synchronous IO will generally be slower than an asynchronous read operation.
Generic/Normal/Conventional:
Regardless of the value of the filesystemio_options parameter, Oracle reads one block at a time using the synchronous pread64 OS call in the following cases:
- The ORDER BY clause contains a column that is indexed, and the optimizer decides to use that particular index for accessing the table blocks, provided that:
  - For single-column indexes:
    - The predicate doesn't contain an = condition for non-unique indexes (for unique indexes it doesn't matter).
    - The predicate contains an IN clause or bound-based predicates such as <, >, <=, >=, between, etc., together with an order by on the same column.
  - For multi-column indexes:
- When performing an index unique scan, be it on an index or table block (buffer warmup is a different matter).
- During an index range scan, if an index key value results in fewer than 5 ROWIDs and these ROWIDs point to different table blocks, or if all the ROWIDs of a particular index key value (assuming the fetch size is greater than the number of ROWIDs retrieved) map to fewer than 5 table blocks (this threshold can change to 6 or 7 db file sequential reads initially if the session encounters chained or migrated rows).
Oracle's ability to use large batches (the number of table block reads batched into one single call) is reduced by the following:
- When the clustering factor is very low and the row sizes exceed a particular threshold.
- When using range predicates (bounded or unbounded), if the subsequent entries in the index leaf block point to table blocks that are clustered (in this case, this is directly dependent on fetch sizes as well).
- The client ResultSet fetchsize and/or arraysize.
- The use of additional non-indexed columns as predicates in the SQL statement.
- When operations that make use of the PGA, such as aggregate functions, an order by on a different column, or grouping, are used in the query, the clustering factor for the index is very low, the row sizes are small, and the subsequent ROWIDs in the index leaf block point to the same block or different contiguous blocks, up to a certain point beyond which Oracle starts using db file scattered read (basically a single large IO call that reads two or more blocks in the same call).
Batched Table Access:
Upon retrieving multiple ROWIDs from the index leaf blocks, a session performing table access via TABLE ACCESS BY INDEX ROWID BATCHED can batch multiple table block read requests and wait on the db file parallel read wait event until these blocks are read into memory. A session can batch 2 to 127 blocks, depending on various factors such as those mentioned above. You are most likely to notice Oracle batching 127 table blocks when the SQL statement uses the PGA for operations such as sorting, grouping, or aggregate functions, and when using equality or range predicates on the indexed columns while the clustering factor is high (in this case, if the client fetch size is greater than 127, Oracle can effectively batch up to 127 blocks).
The performance of a query executed using TABLE ACCESS BY INDEX ROWID BATCHED relies heavily on asynchronous IO, and if the data files are spread evenly across multiple disks, the performance of the query improves considerably.
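The batching behavior described above can be sketched as grouping ROWIDs by table block and issuing at most 127 distinct blocks per read call. The rowid-to-block mapping below is hypothetical, purely to make the sketch self-contained; it is not Oracle's algorithm:

```python
# Sketch of batched table access: ROWIDs fetched from the index are
# grouped by table block, and up to 127 distinct blocks are issued
# as one multi-block request (the db file parallel read event).

MAX_BATCH_BLOCKS = 127

def batch_block_reads(rowids, rows_per_block=100):
    batches, current = [], []
    for rid in rowids:
        block = rid // rows_per_block  # hypothetical rowid->block map
        if block not in current:
            current.append(block)
        if len(current) == MAX_BATCH_BLOCKS:
            batches.append(current)   # issue one batched read
            current = []
    if current:
        batches.append(current)       # final, partially-filled batch
    return batches

# every other row qualifies, touching many distinct blocks
batches = batch_block_reads(range(0, 100_000, 50))
print(len(batches[0]))  # 127 blocks in the first batched read
```

When the qualifying ROWIDs touch only a handful of blocks, the batches stay small, which is one reason well-clustered data sees fewer, larger db file parallel read calls.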
Remember that even if an explain plan indicates TABLE ACCESS BY INDEX ROWID BATCHED, Oracle may not actually use db file parallel read (batched IO, i.e., batching multiple table block read requests).
Also remember that loading data into a table using SELECT … FROM DUAL CONNECT BY LEVEL may not accurately depict a load from multiple sessions, which is typical of OLTP systems; since we are loading the data using a single session, the clustering factor usually tends to be much lower than when loading the data from multiple sessions.
Creating an Index:
- alter system dump datafile # block # dumps both the in-memory version of the block and the on-disk version of the block.
- The ORA-01450 (maximum key length exceeded) error is related to your db_block_size: the maximum key length for your database is approximately 40% of the database block size minus some overhead. However, you have a bigger problem getting the CBO to use an index with a large key value… skip scan or expansion…
- serial_direct_read applies only to tables and not to indexes… the FTS algorithm changes when performing index fast full scans… db file scattered read!!!
- Enq: KO – fast object checkpoint… dirty blocks… CR block generation…
- Foreign keys.
- select avg(mark3) from randomload where mark1 in (1,2) and mark2=19 order by mark1 desc: index skip scan descending havoc…
- Table move increases the clustering factor… create table as select can increase the clustering factor.
- Chained rows always result in db file sequential reads; migrated rows don’t share the same fate. Scattered reads, midpoint insertion strategy… db file sequential read (not cached), full table scan with chained rows…
- Index fast full scan – null/not null – character string. Index fast full scan null dependency. Index rebuild – other indexes becoming undesirable.
- select mark6, count(*) from randomload group by mark6 -> index fast full scans.
Batched table access will not work in the following cases:
create table t1 as
select rownum n1
,trunc((rownum-1)/3) n2
,rpad('x',100) v1
from dual
connect by level <= 1e4;
create index t1_ind1 on t1(n2, n1);
select * from t1 where n2 = 3329 order by n1 desc;
select * from table(dbms_xplan.display_cursor);
alter index t1_ind1 invisible;
create index t1_ind2 on t1(n2);
select * from t1 where n2 = 3329 order by n1 desc;
select * from table(dbms_xplan.display_cursor);
- No distinct 1-1 mapping between branch entries and index key values.
- As new blocks are brought in, old blocks get flushed (midpoint insertion strategy)… check LRU chains, latches…
- NOT NULL constraint – index full scan vs. index fast full scan.
- Cursor-duration temporary tables – log file sync, SCN increase.
- db file parallel read – fetch size when fetching multiple rows, PGA allocation…
- Lower fetch sizes increase IOPS on the server. SSD controller cache write performance. In-line view unnesting…
- Why is an insert statement slow (tracing of no use)?
- Scattered reads – table and index while scanning the table using an index range or skip scan.