Oracle Index Fundamentals
Indexes are one of the most powerful and most misunderstood aspects of SQL
performance. Consider the following table:
create table students (student_id number, name varchar2(20), department_id number);
and a session issuing a query such as the following:
select name from students where department_id = 12;
Without an index on the department_id column, the session has to scan the
entire table to find all the matching rows. This is not only slow, as the
session has to read every block of the table, but also resource intensive, as
the session consumes a lot of CPU and IO. The time the session takes to execute
the query is directly proportional to the size of the table and to the
available CPU resources and IO bandwidth. The execution plan will be as follows:
and the corresponding execution statistics are as follows:
To understand the concept of an index, a simple analogy is the index at the
front or back of a book. It is much easier to find the subject we are
interested in by referring to the index and flipping to those pages than by
searching each and every page of the book. Similarly, a database index lets us
visit only the table blocks that hold the relevant information. Whenever a user
executes a query, the optimizer generates the possible execution plans and
picks the plan with the lowest cost based on criteria such as the presence of
indexes, the number of blocks, NDV and so on. Without an index, the optimizer
is forced to pick a full table scan. Now let's create an index on the
department_id column of the students table using the following statement:
CREATE INDEX STUDENTS_DEPTID_IDX ON STUDENTS (DEPARTMENT_ID);
The optimizer will consider the index in subsequent executions of the SQL
statement in its cost estimates, and if the cost of accessing the table
through the index is lower than the cost of a full table scan, the optimizer
picks a plan with the index. The execution plan will be as follows:
and the corresponding execution statistics are as follows:
Let’s look at the 10053 trace (level 12) of the session executing the
statement. There is a reason for including the session’s optimizer trace here:
several concepts are at play, and we will look at each of them in the
subsequent sections. Pay special attention to the entries highlighted below:
***************************************
BASE STATISTICAL INFORMATION
***************************************
Table Stats::
  Table: STUDENTS  Alias: STUDENTS
  #Rows: 999999  SSZ: 0  LGR: 0  #Blks: 4906  AvgRowLen: 30.00  NEB: 0  ChainCnt: 0.00
  ScanRate: 0.00  SPC: 0  RFL: 0  RNF: 0  CBK: 0  CHR: 0  KQDFLG: 1
  #IMCUs: 0  IMCRowCnt: 0  IMCJournalRowCnt: 0  #IMCBlocks: 0  IMCQuotient: 0.000000
Index Stats::
  Index: STUDENTS_DEPTID_IDX  Col#: 3
  LVLS: 2  #LB: 2724  #DK: 4801  LB/K: 1.00  DB/K: 203.00
  CLUF: 978571.00  NRW: 999999.00  SSZ: 0.00  LGR: 0.00  CBK: 0.00
  GQL: 0.00  CHR: 0.00  KQDFLG: 8192  BSZ: 1
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for STUDENTS[STUDENTS]
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
  kkecdn: Single Table Predicate:"STUDENTS"."DEPARTMENT_ID"=123
  Column (#3): DEPARTMENT_ID(NUMBER)
    AvgLen: 4  NDV: 4801  Nulls: 0  Density: 0.000208  Min: 0.000000  Max: 4800.000000
  Estimated selectivity: 2.0829e-04 , col: #3
  Table: STUDENTS  Alias: STUDENTS
    Card: Original: 999999.000000  Rounded: 208  Computed: 208.289731  Non Adjusted: 208.289731
  Scan IO Cost (Disk) = 1330.000000
  Scan CPU Cost (Disk) = 224937594.640000
  Cost of predicates:
    io = NOCOST, cpu = 50.000000, sel = 0.000208 flag = 2048  ("STUDENTS"."DEPARTMENT_ID"=123)
  Access Path: TableScan
    Cost: 1337.374257  Resp: 1337.374257  Degree: 0
      Cost_io: 1330.000000  Cost_cpu: 274937545
      Resp_io: 1330.000000  Resp_cpu: 274937545
 ****** Costing Index STUDENTS_DEPTID_IDX
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_SCAN
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_FILTER
  Estimated selectivity: 2.0829e-04 , col: #3
  Access Path: index (AllEqRange)
    Index: STUDENTS_DEPTID_IDX
    resc_io: 207.000000  resc_cpu: 1555648
    ix_sel: 2.0829e-04  ix_sel_with_filters: 2.0829e-04
    Cost: 207.041725  Resp: 207.041725  Degree: 1
  Best:: AccessPath: IndexRange
  Index: STUDENTS_DEPTID_IDX
         Cost: 207.041725  Degree: 1  Resp: 207.041725  Card: 208.289731  Bytes: 0.000000
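The selectivity and cardinality figures in the trace can be reproduced directly from the column statistics it shows: with no histogram, the equality selectivity is 1/NDV, and the estimated cardinality is the table row count times that selectivity. A minimal Python check:

```python
# Reproduce the optimizer's estimates from the statistics in the 10053 trace.
num_rows = 999999   # #Rows from Table Stats
ndv = 4801          # NDV of DEPARTMENT_ID from the column statistics

selectivity = 1 / ndv          # equality predicate, no histogram
cardinality = num_rows * selectivity

print(f"{selectivity:.4e}")    # 2.0829e-04, matching "Estimated selectivity"
print(round(cardinality))      # 208, matching "Rounded: 208"
```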
From the above we can clearly see that the optimizer picks the index, as the
cost of a full table scan is very high compared to accessing the table through
the index STUDENTS_DEPTID_IDX. If we look at the execution statistics, the
session accessing the table through the index had to read only 265 blocks
compared to 4871 blocks with a full table scan, reducing the number of blocks
read by 94.6%. You might wonder why there are no physical reads in the
execution statistics for either the full table scan or the index range scan;
if you have gone through the direct path read section, you will understand
when Oracle caches the blocks of a table during a full table scan.
Before proceeding with index concepts, let’s look at how the optimizer
decides whether to use an index or perform a full table scan.
Storage Level Perspective:
From the storage perspective, let’s look at how Oracle handles full
table scans and index range scans.
Full Table Scan:
During a full table scan, we know that the server process uses multiblock IO,
reading multiple blocks at a time, to read the table contents. This is evident
in the OS calls the server process makes while reading, which can be obtained
from the strace output. The following is the strace output of a server process
performing a full table scan.
As we can see from the above, each read request (pread64) reads a contiguous
1 MB chunk of data from the datafiles (since db_file_multiblock_read_count,
more precisely _db_file_exec_read_count, is set to 128 and the database block
size is 8k, each read request translates to 1 MB). If Async IO is made
available using the filesystemio_options parameter, we would see that the
session issues io_submit OS calls instead of pread64.
Even with Async IO enabled, as highlighted in the above snippet, the maximum
IO size per call is still 1 MB.
Considering factors such as the following:
- Smaller initial extent sizes (extent allocation auto at the tablespace level).
- Extent boundaries.
- The range of formatted blocks between the low HWM and the high HWM.
the server process issues about 3074 OS calls to read a segment about 3 GB in
size. The output of the iostat command during the same period is as follows:
r/s indicates the read IOs per second.
rkB/s indicates the size in KB read per second.
util is the disk utilization rate.
Pay close attention to the rkB/s, r/s and util columns as highlighted. Since
the session issues large IOs, full table scans are more likely to saturate the
bandwidth (the maximum allowed data transfer rate) at the storage layer, even
though the storage is capable of handling many more IOPS.
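As a sanity check on the numbers above: a 3 GB segment read in 1 MB multiblock chunks needs at least 3 × 1024 = 3072 read calls, which lines up with the observed 3074 once the factors listed earlier (extent boundaries, formatted-block ranges) add a few extra, smaller reads. A sketch, assuming an 8 KB block size and a multiblock read count of 128:

```python
import math

block_size = 8 * 1024        # 8 KB database block
mbrc = 128                   # db_file_multiblock_read_count (_db_file_exec_read_count)
max_io = block_size * mbrc   # bytes per multiblock read call

segment_bytes = 3 * 1024**3  # the ~3 GB segment from the example
min_calls = math.ceil(segment_bytes / max_io)

print(max_io)                # 1048576 bytes, i.e. 1 MB per pread64
print(min_calls)             # 3072 -- the strace showed about 3074 calls
```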
Index Scan:
During an index range/unique scan, a session traverses the index B-tree
structure to find the index leaf blocks that contain the matching key value,
and based on the ROWIDs obtained from those leaf blocks the session reads the
corresponding table blocks. The strace output of a session performing an index
range/unique scan is as follows:
As you can see from the above, the session issues pread64 calls that read one
database block at a time (database block size 8 KB); pread64 is a synchronous
read call. Oracle employs various optimizations such as table/index
prefetching and batching during index range scans, where a session ends up
waiting on either db file scattered read (a multiblock read of contiguous
blocks) or db file parallel read (which batches multiple data block read
operations from different regions of the disk into one IO operation) to speed
up index scans. The following snippet of strace output with Async IO enabled
shows the io_submit calls.
As we can see from the above io_submit call, the session batched about 127
block read operations into one call; whether io_submit is used at all depends
on whether Async IO is enabled through filesystemio_options. io_submit is an
asynchronous read call, and its API allows the caller to batch multiple read
requests, or a single one, into one operation. The output of the iostat
command can be seen below.
Pay attention to the rkB/s, r/s and util columns as highlighted: the session
issues multiple small IO requests, or batches multiple small read operations
into one. A session performing an index range scan is therefore more likely to
saturate the IOPS at the storage layer.
Summarizing the concept:
Since they issue large IO requests, sessions performing full table scans are
more likely to saturate the bandwidth at the storage layer, even though the
storage is capable of handling many more IOPS. Sessions performing index range
scans are more likely to saturate the IOPS, as they issue many small IO
requests. An HDD that provides about 150-200 IOPS can only sustain a data
transfer rate of about 1.17 – 1.56 MBps when sessions perform random IOs.
Remember that the IO performed by a session accessing table contents through
an index is mostly random in nature.
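The throughput figure quoted above follows directly from the IOPS and the single-block read size: at 8 KB per random read, 150-200 IOPS translates to roughly 1.17-1.56 MBps. A quick check:

```python
block_kb = 8                        # single-block random read size in KB
for iops in (150, 200):
    mbps = iops * block_kb / 1024   # KB/s -> MBps
    print(f"{iops} IOPS -> {mbps:.2f} MBps")
# 150 IOPS -> 1.17 MBps
# 200 IOPS -> 1.56 MBps
```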
Optimizer’s Decision
The optimizer uses factors such as the following to estimate the cost of
accessing a table using a particular access method:
- Initialization parameters:
  o db_file_multiblock_read_count
  o optimizer_index_cost_adj
  o optimizer_index_caching
  o optimizer_mode
- System statistics:
  o SREADTIM
  o MREADTIM
  o MBRC
  o IOSEEKTIM
  o IOTFRSPEED
- Table statistics:
  o Blocks below the HWM
- Index statistics:
  o Index height (BLEVEL)
  o Selectivity of a value
  o Clustering factor
  o Number of leaf blocks
Let’s look at how Oracle estimates the cost of a full table scan vs an index
range scan.
Cost of full table scan:
Cost of index range scan:
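The two headings above are usually filled in with the classical IO cost model. As a hedged sketch (the traditional formulas; the real costing adds CPU cost and further adjustments, and the noworkload timing values below are assumptions), note that the index formula reproduces the resc_io of 207 seen in the 10053 trace from the index statistics BLEVEL=2, #LB=2724 and CLUF=978571:

```python
import math

def full_table_scan_io_cost(blocks, mbrc, mreadtim, sreadtim):
    """Classical IO cost of a full table scan: one pass over the blocks
    below the HWM in multiblock reads, each multiblock read weighted by
    the MREADTIM/SREADTIM ratio."""
    return 1 + math.ceil(blocks / mbrc) * (mreadtim / sreadtim)

def index_range_scan_io_cost(blevel, leaf_blocks, clustering_factor,
                             ix_sel, ix_sel_with_filters):
    """Classical IO cost of an index range scan plus table access:
    branch levels + leaf blocks visited + table blocks visited."""
    return (blevel
            + math.ceil(leaf_blocks * ix_sel)
            + math.ceil(clustering_factor * ix_sel_with_filters))

# Index statistics from the 10053 trace earlier in this section:
sel = 2.0829e-04
print(index_range_scan_io_cost(2, 2724, 978571, sel, sel))  # 207, matching resc_io

# Full table scan with assumed noworkload values (SREADTIM 12, MREADTIM 26, MBRC 8):
print(round(full_table_scan_io_cost(4906, 8, 26, 12)))      # ~1331, near the trace's 1330
```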
The following are the most important factors, listed in order of magnitude of
impact, that can tip the balance from a full table scan to an index range scan
and vice versa:
- Effective selectivity
- DB_FILE_MULTIBLOCK_READ_COUNT & MBRC
- OPTIMIZER_INDEX_COST_ADJ
- CLUSTERING_FACTOR
- OPTIMIZER_INDEX_CACHING
Effective Selectivity:
Selectivity is the portion of rows, as a fraction of the total rows in a row
source, retrieved by a query for a particular predicate or set of predicates.
Oracle calculates selectivity as a value between 0 and 1, 0 meaning no rows
retrieved and 1 meaning all rows retrieved. If this value is close to 0, the
predicate is more selective and the query is more likely to benefit from an
index-based scan. The formula to calculate the selectivity is as follows.
The selectivity of a particular predicate or set of predicates is very useful
in determining the actual access costs. By default, selectivity is based on
the high value, the low value and the NDV (number of distinct values), with
the assumption that data is evenly distributed between those two points.
Histograms provide better selectivity estimates for unevenly distributed
data, but do remember that histograms may be mandatory even on primary key
columns for queries with range predicates (>, < or BETWEEN), especially when
there are gaps in the ranges. Without histograms, the selectivities for a
predicate are calculated as follows:
Predicate       Selectivity
c1 = '32'       1/NDV
c1 > '32'       (High - Value) / (High - Low)
c1 >= '32'      (High - Value) / (High - Low) + 1/NDV
c1 < '32'       (Value - Low) / (High - Low)
c1 <= '32'      (Value - Low) / (High - Low) + 1/NDV
c1 like '32'    1/NDV
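The rules in the table can be expressed directly in code. A sketch, assuming a numeric column, no histogram, and a value inside the [low, high] range (out-of-range values are handled differently by the optimizer):

```python
def selectivity(op, value, low, high, ndv):
    """Selectivity without histograms, per the table above,
    assuming numeric data and low <= value <= high."""
    if op in ("=", "like"):
        return 1 / ndv
    if op == ">":
        return (high - value) / (high - low)
    if op == ">=":
        return (high - value) / (high - low) + 1 / ndv
    if op == "<":
        return (value - low) / (high - low)
    if op == "<=":
        return (value - low) / (high - low) + 1 / ndv
    raise ValueError(op)

# DEPARTMENT_ID stats from the earlier trace: Min 0, Max 4800, NDV 4801
print(f'{selectivity("=", 123, 0, 4800, 4801):.4e}')  # 2.0829e-04
```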
Unless the predicate is an equality predicate over uniformly distributed
data, bind variables in the query present a challenge to the optimizer,
especially for range predicates, as the optimizer is unaware of the value of
the bind variable during query optimization. In this case the optimizer has
no choice but to assume the selectivities as follows:
Predicate       Selectivity
c1 = :B1        1/NDV
c1 > :B1        Default of 5%
c1 >= :B1       Default of 5%
c1 like :B1     Default of 25%
It’s enough to know that the data histograms provide is more accurate, and
with them the optimizer has a better chance of accurately calculating the
selectivities for a given predicate or set of predicates. In the subsequent
section and in the optimizer statistics part we will look at various scenarios
where histograms are mandatory.
db_file_multiblock_read_count & MBRC:
This is one of the parameters that shifts the balance between a full table
scan and an index range scan toward the full table scan. Oracle recommends not
setting this parameter explicitly; it is better to let Oracle determine the
value. Contrary to popular belief, the optimizer actually uses the value of
_db_file_optimizer_read_count, or the MBRC obtained from system statistics, to
calculate the cost of a full table scan instead of
db_file_multiblock_read_count. When system statistics are not available, or
when the MBRC value is null in aux_stats$ (which stores the system
statistics), Oracle uses _db_file_optimizer_read_count to determine the cost.
This parameter defaults to 8 when Oracle automatically determines the value of
db_file_multiblock_read_count, and if db_file_multiblock_read_count is set
manually, _db_file_optimizer_read_count always equals
db_file_multiblock_read_count. Remember that regardless of the values of
db_file_multiblock_read_count and _db_file_optimizer_read_count, Oracle always
uses MBRC if MBRC and MREADTIM were calculated as part of the system
statistics. The following are the only ways we can influence the optimizer’s
cost estimates for a full table scan:
- Without system statistics:
  o Set DB_FILE_MULTIBLOCK_READ_COUNT to a value above 8.
- With system statistics:
  o Manually increase the MBRC.
  o Reduce the difference between SREADTIM and MREADTIM (keeping MREADTIM/SREADTIM > 1).
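The precedence described above can be summarized as a small decision sketch (an illustration of the rules as stated, not Oracle's actual code):

```python
def optimizer_multiblock_read_count(mbrc_from_system_stats,
                                    dbfmbrc_set_manually,
                                    dbfmbrc_value):
    """Which multiblock read count the optimizer costs a full table scan
    with, per the precedence described above (illustrative only):
    1. MBRC from gathered system statistics, when present;
    2. otherwise _db_file_optimizer_read_count, which follows a manually
       set db_file_multiblock_read_count, or defaults to 8."""
    if mbrc_from_system_stats is not None:
        return mbrc_from_system_stats
    if dbfmbrc_set_manually:
        return dbfmbrc_value
    return 8

print(optimizer_multiblock_read_count(None, False, 128))  # 8
print(optimizer_multiblock_read_count(16, True, 128))     # 16
```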
OPTIMIZER_INDEX_COST_ADJ
This parameter lets us make the optimizer’s access path selection more or
less index friendly, that is, more or less prone to selecting an index access
path over a full table scan. The default value is 100, at which the optimizer
treats the cost of a single IO through an index as equal to the cost of a
single multiblock IO. Setting the parameter to 10 makes the index access path
cost about 1/10th of the original cost.
The following snippet of a 10053 level 12 trace shows the details. Here
optimizer_index_cost_adj is set to 50, indicating that the cost of index
access is half the original cost.
The optimizer initially estimates the cost of the table access through the
index at 8746, and since optimizer_index_cost_adj is set to 50, it simply
adjusts the cost of index access to half the original cost.
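The adjustment itself is a straight percentage of the index access cost. Using the 8746 estimate mentioned above:

```python
def adjusted_index_cost(index_cost, optimizer_index_cost_adj=100):
    """optimizer_index_cost_adj scales the index access path cost:
    100 leaves it unchanged, 50 halves it, 10 makes it one tenth."""
    return index_cost * optimizer_index_cost_adj / 100

print(adjusted_index_cost(8746, 50))   # 4373.0 -- half the original estimate
print(adjusted_index_cost(8746, 10))   # 874.6
```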
As with every feature, there are arguments that oppose the idea and others
that favor it; the same goes for this parameter. Let’s look at both sides of
the discussion, starting with the argument that favors setting this parameter:
Arguments that favor adjusting optimizer_index_cost_adj:
During normal OLTP operations, most of the table and index blocks are already
cached in memory. Consider the default values for the parameters listed below:
optimizer_index_caching = 0
optimizer_index_cost_adj = 100
For each SQL statement executed against the database, the CBO calculates the
IO cost as if it has to read all the blocks from disk; in other words, since
the optimizer is not aware of the blocks that are already cached in memory, it
estimates costs as if every block requires a physical IO. In reality this is
far from the truth, as OLTP workloads demand that queries run as fast as
possible, so the execution plans are mostly index-based lookups and nested
loop joins; we rarely encounter hash joins and full table scans.
That being said, the optimizer uses factors such as the clustering factor,
the selectivity of a predicate and the number of index leaf blocks accessed to
determine the tipping point between a full table scan and an index range scan.
Especially for tables with a high clustering factor (due to the inherent way
it is calculated), the blocks actually read from disk may be far fewer than
the estimates, and the optimizer may end up picking a full table scan. There
are also cases where queries select only a fraction of the rows, or access
only a few blocks more than the CBO’s tipping point allows, but most of those
blocks are already present in the buffer cache, so the queries need only a few
blocks from disk. In such situations, reducing optimizer_index_cost_adj can
tip the scale from a full table scan to an index range scan, and the
performance of the application or the overall TPS can increase considerably.
When the table’s average row length is small, the clustering factor statistic
may be far from the actual reality, and with ASSM this statistic will likely
be high, provided there are enough sessions executing DML statements in
parallel. In these scenarios there is a high chance that range queries (<, >,
or combinations of <= and >=) will hit the same set of blocks; using the
table_cached_blocks statistics preference can help in adjusting the clustering
factor, thereby influencing the optimizer’s cost estimates. Unless that is the
case for all the tables in the application, reducing optimizer_index_cost_adj
makes sense when there is a high chance that queries benefit from re-reading
the same set of already cached blocks and select only a small subset of data.
There is a negative side to this, which is discussed in the following section.
Arguments against setting optimizer_index_cost_adj:
Factors such as the following should be considered before decreasing the
value of optimizer_index_cost_adj:
- How busy is the buffer cache, and is it appropriately sized?
- Is the workload primarily DML oriented?
- The portion of SQL statements that depend on the optimizer’s tipping point between a full table scan and an index range scan.
- Tables with a small average row length and a high clustering factor.
In a typical OLTP application, the queries that depend on the optimizer’s
tipping point between a full table scan and an index range scan are relatively
few, and a properly sized buffer cache contains buffers that are already warm.
Unless properly worked out, reducing the value of optimizer_index_cost_adj to
help a few queries that depend on the tipping point can result in a set of
cascading issues, as a session eventually has to find free buffers to read the
data blocks into. If the buffers are clean, a session can simply reuse them,
but if those blocks are required again they have to be read from disk again,
and this can increase the IOPS on the server significantly. If the buffers are
dirty, then the DBWn process has to flush the dirty blocks to make room for
the new blocks coming in. This also increases latch activity at the instance
level, as buffers are linked and unlinked from the chains more often. In these
cases increasing the buffer cache will help reduce the impact, but to what
extent depends on the variations in the workload.
If the workload is primarily DML oriented, then ASSM, by design, spreads the
DML load across the available free blocks, and sessions inserting data are
likely to end up inserting into different blocks rather than into the same
block that contains free space; it is therefore highly likely that the buffer
cache contains more dirty blocks. Reducing optimizer_index_cost_adj to tip the
balance away from full table scans can then result in SQL statements bringing
more blocks into the buffer cache if they are not already present. Since
batching and prefetching fetch multiple data blocks in one IO request when
Async IO is enabled (the process is more aggressive with Async IO), the demand
for free buffers arises at one point in time or another, and this eventually
adds pressure on the DBWn process to write dirty blocks out to make room for
free buffers. If the redo information for a block has not yet been flushed to
disk, the DBWn process can post/signal the LGWR process to write the redo
entries to disk before it can write the data block.
This entire process can significantly increase the IOPS on the server.
We have to remember that this parameter is global and affects the CBO
estimates for all the queries that execute on the instance. When the table’s
row size is large, the clustering factor statistic of the index tends to be
more accurate. If a SQL statement selects a significant portion, even 2%, of
the rows of a large table, reducing optimizer_index_cost_adj can result in
reading a large portion of the blocks into the buffer cache, which not only
increases latch activity at the instance level but also increases the elapsed
time to the point where a full table scan would be faster; this can impact
other sessions as well. In these cases a full table scan reads each data block
only once and may be faster due to SAN caching and/or features such as
read-ahead.
When the clustering factor statistic is artificially very high, sessions
executing a SQL statement may end up reading the same set of blocks over and
over again; in these cases reducing the optimizer_index_cost_adj parameter
raises the threshold, and the sessions may end up performing more logical IO
than the reads performed by a full table scan.
Remember that this parameter does not apply to user-defined cost functions
for domain indexes. Given all the cases listed above in which reducing
optimizer_index_cost_adj can have a potential negative impact on performance,
in most cases it is better to leave this parameter at its default. Many argue
that comparing the average wait times of db file scattered read and db file
sequential read can be used as a starting point for determining the value of
optimizer_index_cost_adj, but remember that the average wait time of these
events is at the mercy of several factors: the disk utilization rate, the
average response times, and whether there is a sufficient sample (the number
of times sessions waited on db file scattered read). Also, the number of
blocks read per db file scattered read is not constant when there are few
buffered full table scans, as this wait event can also occur during
prefetching, which reads fewer blocks than the value specified by
db_file_multiblock_read_count.
CLUSTERING_FACTOR:
OPTIMIZER_INDEX_CACHING:
OPTIMIZER_INDEX_CACHING lets us adjust the cost of an index probe in favor of
nested loop joins or in-list iterators. The range of values, 0 to 100,
indicates the percentage of index blocks assumed to already be present in the
buffer cache. The default value of 0 indicates that the buffer cache has no
index blocks in it, that Oracle has to read the index blocks into memory from
disk, and that these reads should be treated as physical IOs; this can make
the optimizer favor a full table scan. During the optimization phase, the
optimizer has no clue which blocks are already cached in memory; by default it
assumes it has to read all the blocks from disk and costs the plans
accordingly.
During normal OLTP database operations, most of the index and table blocks
are already cached in the buffer cache, but the optimizer has no clue what
percentage of these blocks is already present there. The only ways to
influence the optimizer’s cost estimates are the parameter
OPTIMIZER_INDEX_CACHING, which indicates the percentage of index blocks that
are likely to be found in the buffer cache, and OPTIMIZER_INDEX_COST_ADJ,
which adjusts the cost of index probes.
Setting this parameter to a higher value makes nested loop joins and in-list
iterators look less expensive to the optimizer; as a result the optimizer is
more likely to pick a nested loop join over a hash or sort merge join, and to
pick an index via an in-list iterator over other indexes or a full table scan.
A value of 100 infers that 100% of the index blocks are likely to be found in
the buffer cache, and the optimizer adjusts the cost of an index probe through
an in-list iterator or nested loop accordingly. Once this parameter is set to
a particular value, the CBO cost algorithm is adjusted as follows:
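The exact internal formula is not published; a plausible sketch (an assumption based on the parameter's documented meaning, not Oracle's actual code) discounts only the IO cost of the index blocks themselves, leaving the table block IOs untouched:

```python
import math

def cached_index_io_cost(blevel, leaf_blocks, ix_sel,
                         optimizer_index_caching=0):
    """Hypothetical sketch: optimizer_index_caching discounts the IO cost
    of the index blocks (branch + leaf visits) by the percentage assumed
    cached; table block IOs are not discounted. Illustration only."""
    index_block_ios = blevel + math.ceil(leaf_blocks * ix_sel)
    return index_block_ios * (1 - optimizer_index_caching / 100)

# Using the index statistics from the earlier trace as sample inputs:
print(cached_index_io_cost(2, 2724, 2.0829e-04, 0))    # 3.0 -- no assumed caching
print(cached_index_io_cost(2, 2724, 2.0829e-04, 50))   # 1.5 -- 50% assumed cached
```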
For example, if we set this parameter to 50, the CBO will consider 50% of all
IOs associated directly with index blocks to be already cached and therefore
reduces the overall cost of the index block IO by 50%. The following snippet
of a 10053 level 12 trace shows the cost of executing a query with an in-list
iterator.
****** Costing Index IDX
OPTIMIZER PERCENT INDEX CACHING = 50
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_SCAN
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_FILTER
  Using prorated density: 1.6414e-04 of col #4 as selectivity of out-of-range/non-existent value pred
  Using prorated density: 1.6247e-04 of col #4 as selectivity of out-of-range/non-existent value pred
  Using prorated density: 1.6079e-04 of col #4 as selectivity of out-of-range/non-existent value pred
  Using prorated density: 1.5912e-04 of col #4 as selectivity of out-of-range/non-existent value pred
  Access Path: index (RangeScan)
    Index: IDX
    resc_io: 7862.993271  resc_cpu: 73793163
    ix_sel: 6.4652e-04  ix_sel_with_filters: 6.4652e-04
    Cost: 7865.064483  Resp: 7865.064483  Degree: 1
****** Costing Index IDX
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_SCAN
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = INDEX_FILTER
  Using prorated density: 1.6414e-04 of col #4 as selectivity of out-of-range/non-existent value pred
  Using prorated density: 1.6247e-04 of col #4 as selectivity of out-of-range/non-existent value pred
  Using prorated density: 1.6079e-04 of col #4 as selectivity of out-of-range/non-existent value pred
  Using prorated density: 1.5912e-04 of col #4 as selectivity of out-of-range/non-existent value pred
  Access Path: index (RangeScan)
    Index: IDX
    resc_io: 7908.978163  resc_cpu: 74163478
    ix_sel: 6.4652e-04  ix_sel_with_filters: 6.4652e-04
    Cost: 7911.059769  Resp: 7911.059769  Degree: 1
Remember that the amount of caching depends on various factors that the
optimizer cannot predict, such as the load on the system, block access
patterns, variations in workload at different times and so on.
Similar to optimizer_index_cost_adj, this is a global parameter, which means
that any value we set will be an average over all the indexes and all the
situations in which this parameter can have an effect. In an OLTP database,
some indexes are quite small, heavily accessed and likely to be cached in
their entirety in the buffer cache, whereas a few are huge and accessed less
frequently. A single value for this parameter averages out over all of them.
While optimizer_index_caching indicates the percentage of index blocks cached
in the buffer cache, there is no equivalent parameter for table blocks, and as
such Oracle treats all IO for table blocks as physical IO. One way to
influence the behavior is to set the optimizer_index_cost_adj parameter, which
effectively reduces the overall cost of index-based table lookups. As with
optimizer_index_cost_adj, care should be taken when setting this parameter to
a particular value, as it can tip the balance between a nested loop and a hash
join when the hash join could easily have been faster than the nested loop
join. Likewise, the same goes for index-based lookups versus full table scans
for queries involving in-list iterators.
How a Session performs a full table scan:
The following diagram depicts how a session reads the table during a full
table scan.
The optimizer uses the statistics and the availability of indexes to
determine whether it is better to read the table contents using a full table
scan or an index range scan. Since we are talking about full table scans: once
the decision is made to read the table using a full table scan, the session
starts by reading the segment/table header block to get the details regarding
the extent map (how many extents are allocated), their addresses, the low HWM,
the high HWM and so on. It then reads db_file_multiblock_read_count (more
precisely _db_file_exec_read_count) blocks at a time, one extent at a time, up
to the low HWM, as the session is 100% certain that all the blocks below the
low HWM are formatted; it carefully reads only the formatted blocks between
the low HWM and the high HWM, since blocks in that range and above the high
HWM can be either formatted or unformatted. Oracle simply skips unformatted
blocks, as it is known that they do not contain any data. The performance of a
full table scan is influenced by various factors, such as the availability of
Async IO and how busy the disks/buffer cache are (in the case of buffered
reads); we will look at these in the direct path read wait event section.
These reads are basically sequential in nature.
Index Fundamentals:
How Oracle manages a B-Tree index:
In Oracle, the B-Tree index is the default index type; a CREATE INDEX DDL
statement always results in a B-Tree index. In a B-Tree index, the blocks are
arranged in a balanced tree-like structure with branch and leaf nodes.
Contrary to popular belief regarding the root node, Oracle doesn’t have a
special type of block called a root node or root block; all the blocks in an
index are either BMBs (which are used to track the free space), branch blocks
or leaf blocks. The branch block that contains the pointers to all the other
branch blocks can be considered the root node. In terms of performance,
B-Trees are excellent for a wide variety of queries, including those with
equality and range predicates. A B-Tree index is balanced because all leaf
blocks automatically stay at the same depth; thus retrieving a record from
anywhere in the index takes approximately the same amount of time and the same
number of lookups. Every index has a height associated with it: the height is
the number of blocks required to go from the root block to a leaf block. The
following diagram shows the basic structure of a B-Tree index.
Branch Blocks:
Branch blocks are used for traversing/searching the B-Tree structure;
depending on the size of the index, branch blocks are arranged in levels, or
in other words a branch block can have pointers either to leaf blocks or to
other branch blocks. Branch blocks store the minimum key prefix needed to make
a branching decision between two keys. This technique enables the database to
fit as much data as possible on each branch block. The number of keys and
pointers stored in a branch block is limited by the block size and the column
length.
The dump of a branch block contains details such as the following. This is a
dump of the branch root block:
Branch block dump
=================
header address 139999760347212=0x7f543bfbf04c
kdxcolev 2
kdxconco 1
kdxconro 600
kdxcofbo 1228=0x4cc
kdxcofeo 2689=0xa81
kdxcoavs 1461
kdxbrlmc 29369552=0x1c024d0
kdxbrsno 599
kdxbrbksz 8056
kdxbr2urrc 0
The explanation of the relevant fields is as follows:
kdxcolev – indicates the index height
kdxconco – number of columns in the index
kdxconro – number of entries in this block
kdxcofbo – offset to the beginning of the free space within the block
kdxcofeo – offset to the end of the free space within the block
kdxcoavs – available free space in the block
kdxbrlmc – pointer to other branch or leaf blocks
kdxbrbksz – maximum usable space within the block (less than the block size due to overhead)
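The branch dump above shows 600 entries in the root branch block (kdxconro 600); together with the 2724 leaf blocks (#LB) reported in the earlier index statistics, this is consistent with the LVLS: 2 in the trace: a single ~600-way root cannot address 2724 leaf blocks on its own, but two branch levels can. A rough check, treating 600 as a typical branch fan-out (an assumption):

```python
import math

leaf_blocks = 2724   # #LB from the index statistics in the trace
fanout = 600         # entries in the root branch block (kdxconro)

# Branch levels needed so a single root can reach every leaf block:
blevel = math.ceil(math.log(leaf_blocks, fanout))
print(blevel)        # 2, matching LVLS: 2 in the 10053 trace
```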
When a table is initially created, an extent may not be allocated (depending on
the DEFERRED_SEGMENT_CREATION parameter); only when the first row is inserted
are extents allocated to the table and to its corresponding indexes, if any.
Oracle does not create branch blocks until a certain threshold, i.e. a certain
number of rows, has been inserted into the table. So the very first block
immediately following the L3 BMB (segment header) will be a leaf block. Only
when enough rows have been inserted into the table (causing that leaf block to
split) does Oracle create branch nodes: it reformats the first block following
the segment header from a leaf block into a branch block, and once enough
further rows are inserted, Oracle splits that branch block into a root branch
block and other branch blocks. The root branch block is always the first block
following the index's segment header block, that is, the fourth block in the
initial extent allocated to the segment, as shown in the following figure.
Even though the entries for a single column value can span multiple leaf
blocks, the branch blocks store pointers to all of those leaf blocks. Do
remember that a delete operation can cause an update or deletion of an entry in
a branch block.
Leaf blocks:
All the leaf blocks reside at the same level, or height. The leaf blocks
contain the indexed column value and a corresponding ROWID, which points to the
actual location (table data block) where the row is stored. Each entry in a
B-Tree index is sorted by key value. Leaf blocks are arranged as a doubly
linked list, with pointers to the previous and the next leaf block, so that
during a range scan, or for queries with predicates such as > or <, a session
can simply traverse the leaf blocks instead of reading the branch blocks over
and over again.
Leaf block dump
===============
header address 140609502597364=0x7fe2337450f4
kdxconco 1
kdxlenxt 29368846=0x1c0220e
kdxleprv 0=0x0
The relevant fields are as follows:
kdxleprv – pointer to the previous leaf block
kdxlenxt – pointer to the next leaf block
The following is the leaf block format:
There is a reason for specifying the PCTFREE here for the index leaf
block and we will get back to this shortly.
Let’s look at how the entries are stored in an index leaf block:
row#0[8021] flag: -------, lock: 0, len=11
col 0; len 1; (1): 80
col 1; len 6; (6): 01 c0 01 5b 00 5f
row#1[8010] flag: -------, lock: 0, len=11
col 0; len 1; (1): 80
col 1; len 6; (6): 01 c0 01 5d 00 34
As you can see from the above, each entry requires 11 bytes:
2 bytes to store the flag and lock
1 byte to store the length of col 0
1 byte to store the col 0 value
1 byte to store the length of col 1
6 bytes to store the col 1 value – the hex/internal representation of the ROWID
When a B-Tree index is created without the PCTFREE and INITRANS parameters
being specified, Oracle automatically sets the default PCTFREE to 10 and
INITRANS to 2 (for a table, the default INITRANS is 1). Unlike tables, where
PCTFREE specifies the amount of free space reserved for future updates, PCTFREE
for an index block specifies the amount of space reserved for future updates
and inserts of new key values. PCTFREE is only relevant during the initial
index creation or rebuild (we will later see why a rebuild causes the index
size to increase); it causes optimal splitting of the B-Tree in preparation for
subsequent growth. The idea is to do as much splitting as possible during the
initial creation and avoid paying the penalty later, during inserts into the
table. This is what a high PCTFREE setting on an index gives us. An increased
PCTFREE value can significantly increase the bloat in an index and is not
recommended for all workloads; the default value of 10 should suffice in most
cases, and we will look at these scenarios eventually.
Similar to a table, once the index is created, an index block can accommodate
keys until there is no space left in the block (this includes the space for
ITLs as well, unless that space is reserved using the INITRANS parameter). In
order to modify the contents of a particular block, a session must first
acquire an ITL slot, but what happens if a table/index block is completely
full? Sessions succeed in acquiring ITL slots if the slots were reserved using
the INITRANS parameter during table or index creation, but what if INITRANS is
not set, the value defaults to 1, and multiple sessions try to modify the same
table/index block? In this scenario, Oracle handles the situation based on the
type of block. When a table block is completely full and multiple sessions try
to modify it, the block cannot accommodate more ITL slots; the first session
which modifies the block acquires the ITL slot (one is reserved by default),
but the second and subsequent sessions end up waiting on the enq: TX - allocate
ITL entry wait event until the first session releases its slot. Even if the
statement is, say, a delete statement affecting multiple rows, until the first
session commits its transaction, the rest of the sessions will remain blocked
on enq: TX - allocate ITL entry. This is not the case with index blocks: when
an index block is full and multiple sessions try to modify it, since the block
does not have enough free space to hold additional ITL slots, Oracle straight
away performs a block split, which can be either 50-50 or 90-10 depending on
where in the index structure the block resides.
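Whether a segment has actually suffered from ITL pressure can be checked from the segment statistics. The following is a sketch (the schema name VISHNU and the index name are taken from the examples in this chapter):

```sql
-- Segments that have experienced ITL waits:
SELECT object_name, value
FROM   v$segment_statistics
WHERE  statistic_name = 'ITL waits'
AND    owner = 'VISHNU'
AND    value > 0;

-- Reserve more ITL slots up front; existing blocks only pick up the
-- new setting after a rebuild:
ALTER INDEX students_deptid_idx INITRANS 8;
ALTER INDEX students_deptid_idx REBUILD;
```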
How are key – ROWID pairs stored in a leaf block?
By default, if we don't specify a DESC clause in the CREATE INDEX DDL
statement, the index stores all the key-ROWID pairs in ascending order. For
example, if we create a non-unique index on EMP_ID (whose values happen to be
unique, by the way), the index leaf block contains entries similar to the
following:
The order of EMP_ID takes precedence over the ROWID, which makes sense as the
values are stored in ascending order in an index leaf block. So, what if an
index key value maps to multiple ROWIDs? In this case, the database sorts those
entries by ROWID and stores them in that order in the leaf block. In other
words, the entries in the leaf blocks are sorted by key value and then by
ROWID. One might wonder why Oracle has to sort the ROWIDs in the index leaf
blocks; to answer that question, we first need to understand what a ROWID is.
A ROWID is a pseudocolumn that returns the address of a row. Oracle Database
ROWID values contain the following information:
- The datafile in which the row resides (the file number is relative to the
tablespace)
- The data block in which the row resides
- The data object number of the object
- The position of the row in the data block (the first row is 0)
The value of the ROWID increases with the FILE_ID, the location of the data
block in the datafile, and the position of the row in that data block. Remember
that the ROWID of a row only changes when the row is moved to a different
partition by an update operation; in other cases, even if an update results in
a migrated row, the ROWID entries in the index do not change, as Oracle uses
pointers with a forwarding address.
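The components of a ROWID can be inspected with the supplied DBMS_ROWID package. A sketch against the STUDENTS table from the earlier example:

```sql
-- Decompose each ROWID into its parts: data object number, relative
-- file number, block number and row number within the block.
SELECT rowid,
       dbms_rowid.rowid_object(rowid)       AS data_object_id,
       dbms_rowid.rowid_relative_fno(rowid) AS relative_fno,
       dbms_rowid.rowid_block_number(rowid) AS block_no,
       dbms_rowid.rowid_row_number(rowid)   AS row_no
FROM   students
WHERE  department_id = 12;
```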
When the ROWIDs are persisted in sorted order, there is a high probability that
they point to data blocks that reside near each other on disk, so that on
rotating storage the disk spindle does not have to move much, reducing seek
time: the head moves in one direction from the higher to the lower tracks, and
by the time the last ROWID is reached it has made only one complete arc. If the
ROWIDs were not stored in ascending order, there would be a high chance that
the data blocks are scattered randomly across the data file and the head has to
move back and forth multiple times to retrieve the data, which not only
increases seek time but also reduces the life span of the disk. That is one
explanation; the other is that for queries with predicates such as COL1=:A1 AND
COL2=:A2 where both columns are indexed (two separate indexes), since the
ROWIDs are already sorted, it is very easy to intersect the two ROWID lists to
filter the rows; otherwise the session would end up sorting the ROWIDs first.
Index Block Splits:
Leaf Block Splits:
An index leaf block split happens in two ways:
- 50-50
- 90-10 (misleading; it should really be called 100-0)
50-50 splits:
A 50-50 split happens when an index leaf block becomes full. This can happen
for any index, even when the key value is monotonically increasing. In this
case Oracle takes a free index block, links it into the index structure, and
moves approximately half of the index entries of the full block into the new
leaf block. The following snippet of an index tree dump shows the details:
leaf: 0x2c0010e 46137614 (175: row:268.268 avs:3928)
leaf: 0x2c0016a 46137706 (176: row:265.265 avs:3973)
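Tree dumps like the one above are keyed by the OBJECT_ID of the index. A sketch of how to produce one (the OBJECT_ID value 76210 is illustrative; substitute the value the first query returns):

```sql
-- Find the OBJECT_ID of the index:
SELECT object_id
FROM   dba_objects
WHERE  object_name = 'STUDENTS_DEPTID_IDX'
AND    owner = 'VISHNU';

-- Write the index tree dump to the session trace file:
ALTER SESSION SET EVENTS 'immediate trace name treedump level 76210';
```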
90-10 splits:
A 90-10 split happens only for columns with monotonically increasing values,
i.e. for the blocks at the rightmost edge of the index. The 90-10 naming is
really misleading, as the correct name should be 100-0: once the index leaf
block is full, a new leaf block is linked into the index structure, and
subsequent inserts are directed into that new leaf block. The following is a
snippet of an index tree dump.
leaf: 0x2c4726b 46428779 (29: row:533.533 avs:6)
leaf: 0x2c4726c 46428780 (30: row:533.533 avs:7)
leaf: 0x2c4726d 46428781 (31: row:533.533 avs:6)
leaf: 0x2c4726e 46428782 (32: row:199.199 avs:5013)
As we can see from the above, the index leaf blocks are almost full.
Branch/Root Branch Block Splits:
A branch block split is subject to the same rules as a leaf block split; it can
occur in two ways:
- 50-50
- 90-10
(A root block is a special type of branch block and should be treated as a
branch block, as noted before.)
DML behavior
Insert an Index entry:
Once the index segment is created, new rows inserted into the table will
eventually create index entries (key-ROWID pairs). Depending on the index key
values, these entries are inserted into the appropriate index leaf blocks until
those blocks are full; once an index leaf block is full and there is no space
left, a subsequent insert will cause the leaf block to split. An index leaf
block can be split using either the 50-50 or the 90-10 approach, depending on
its location in the index structure. Several factors, such as sequences, the
column order of the index, etc., determine whether the block split is 50-50 or
90-10.
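The relative frequency of the two split types can be observed from the instance statistics. A sketch (these statistic names are standard V$SYSSTAT entries):

```sql
-- "leaf node splits" counts all leaf block splits, while
-- "leaf node 90-10 splits" counts only the splits at the right-hand
-- edge of the index; the difference approximates the 50-50 splits.
SELECT name, value
FROM   v$sysstat
WHERE  name IN ('leaf node splits',
                'leaf node 90-10 splits',
                'branch node splits');
```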
When an index key value maps to multiple ROWIDs (which is typically the case
for most non-unique columns), these entries are stored in the leaf block in
ROWID order. So the entries for that particular index key value are inspected
in turn until one is found with a greater ROWID than the new row's, and the
key-ROWID pair is inserted at that point. This doesn't mean that all the
corresponding leaf blocks will be read. Let's look at various such scenarios:
When a column value is populated as a monotonically increasing value:
In most databases/environments, column values populated with a monotonically
increasing value, using either an application sequence or a database sequence,
are usually never updated, are mostly protected by a unique or primary key
constraint, and stay constant unless the corresponding rows are deleted or
older partitions are dropped (e.g., ORDER_ID/STUDENT_ID/EMPLOYEE_ID remain
constant and are unlikely ever to change). This is very important to remember
before proceeding, as it helps us decide whether or not to rebuild indexes
built on such columns.
Considering an ideal situation in which the column value of each subsequent
insert is higher than the previous value, i.e. the value is increasing
monotonically, the index entries are inserted into the rightmost index leaf
block; as a result, once that leaf block is full it is split using the 90-10
approach (more precisely, the 100-0 approach), in which a new leaf block is
attached to the doubly linked list and subsequent inserts are directed into it.
In this case we see no wastage of space. But remember that since the block is
100% used (unless we have space reserved in the index block by defining
INITRANS during index creation or rebuild), when multiple sessions try to
delete the table rows pointed to by this leaf block, they must first add ITL
entries to the index leaf block header, and this operation requires space (the
default INITRANS for an index is 2); as there is no free space in the leaf
block, Oracle splits the index leaf block further using the 50-50 approach,
which results in wasted space. Also remember that for indexes PCTFREE is
honored only during index segment creation (CREATE INDEX ..., ALTER INDEX ...
REBUILD, etc.); the space is not reserved for future updates or inserts during
normal processing.
The following is a snapshot of what a 90-10 block split looks like:
leaf: 0x2417821 37845025 (83: row:533.533 avs:6)   // space usage is almost full
leaf: 0x2417822 37845026 (84: row:533.533 avs:7)
leaf: 0x2417829 37845033 (85: row:533.533 avs:6)
leaf: 0x241782a 37845034 (86: row:533.533 avs:6)
leaf: 0x241782b 37845035 (87: row:533.533 avs:7)
leaf: 0x241782c 37845036 (88: row:303.303 avs:3454)
The above behavior regarding 90-10 splits (precisely, 100-0 splits) is
optimistic and is not usually what happens in most environments; it occurs only
when the database sequence is created with NOCACHE ORDER, or when the
application guarantees that each subsequent insert produces a higher value for
the column. With application sequences, in a highly concurrent system, we
cannot guarantee that the next insert statement will carry a higher value, as
this depends on factors such as network delays, CPU scheduling, the distributed
cache synchronization mechanism, etc.; the number of sessions inserting data
into the table can widen the gap, which results in 50-50 splits. For example,
if there are 20 sessions inserting data into the table, a thread that is
supposed to insert the value 10000 may get delayed by a few milliseconds due to
OS CPU scheduling, thread synchronization, JVM garbage collection, etc.; by
that time the other sessions have already inserted many rows and a 90-10
(100-0) split has occurred, and when this delayed session finally tries to
insert its data, it finds that the target index leaf block is full, which
ultimately results in a 50-50 block split.
Before proceeding, let's look at the two most common ways that applications
generate monotonically increasing values:
1. Application sequences
2. Database sequences
Application Sequences:
There are several aspects of application, Java, or synchronization-mechanism
behavior that a DBA should be aware of.
Application sequences, or custom code or functions to generate certain values:
Database sequences inherently generate only numeric values. If the application
demands sequence values in a particular format, such as those used for the
ORDER_ID columns in a retail application:
0000-001-012-0000310
0000-001-012-0000321
then using database functions reduces the overall concurrency compared to
generating the sequences in the application, since there are multiple layers
involved; it can also lead to library cache contention, depending on the
complexity of the function and the number of sources from which it retrieves
data, not to mention the interim tables or sequences it requires to hold the
maximum values. In cases such as these, application developers are forced to
write separate code to generate and handle such values.
Multiple sessions inserting data:
When multiple application processes or threads insert data and/or generate
sequence values, several factors come into play:
- CPU scheduling algorithm
- Network delays
- Process synchronization mechanisms used for generating sequence values, etc.
In multi-threaded and/or multiple application server environments (where the
application or middle tier runs on multiple servers), we cannot guarantee that
subsequent column values (rows inserted into the table) will be higher than the
last value, as shown below (this is for string or character-based columns; we
will look into this shortly):
pool-1-thread-19 --- 0000-001-180533
pool-1-thread-19 --- 0000-001-181259
pool-1-thread-50 --- 0000-001-180522
Notice the gap between the values. The scenario is basically the same when we
use application sequences to generate integer values (this is just a snippet
showing how subsequent inserts may result in lower values as well):
pool-1-thread-37 --- 202757
pool-1-thread-5 --- 202259
pool-1-thread-5 --- 202759
pool-1-thread-34 --- 202247
Due to the nature of the 90-10 split (precisely, 100-0 split), a block may be
full, and a subsequent insert statement carrying a lower value will cause that
index leaf block to be split using the 50-50 approach, thereby wasting even
more space. The synchronization mechanisms used, CPU scheduling, the
distributed cache layer, network delays, etc., also contribute to such sequence
gaps. For the reasons specified earlier, when populating columns with
monotonically increasing values for single-column indexes (primary or unique)
or for the leading column of an index, always prefer database sequences, as
there is less risk of space wastage.
Database Sequences:
When using database sequences to populate the value of a particular column, we
cannot guarantee a 90-10 split each time for the underlying index unless:
- the sequence is created with the NOCACHE ORDER clause,
- only one session inserts the data, or
- only a few sessions insert the data, and infrequently at that.
Creating the sequence with NOCACHE ORDER essentially undermines concurrency, as
sessions will be plagued by wait events such as row cache lock, latch: shared
pool, etc. To improve concurrency, it is recommended to alter the sequence to
CACHE and NOORDER; due to the behavior explained earlier, this can result in
50-50 index leaf block splits, adding more space wastage.
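The trade-off above can be sketched in DDL form (the sequence name order_id_seq is illustrative; the cache size is a tunable choice, not a recommendation for every workload):

```sql
-- NOCACHE ORDER guarantees strictly increasing values but serializes
-- sessions on the sequence (row cache lock, latch contention):
CREATE SEQUENCE order_id_seq NOCACHE ORDER;

-- CACHE/NOORDER scales far better, at the price of out-of-order
-- values across sessions/instances, and hence more 50-50 splits:
ALTER SEQUENCE order_id_seq CACHE 1000 NOORDER;
```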
The following snippet of an index treedump reveals the details:
leaf: 0x1c016da 29365978 (212: row:480.480 avs:8)    → 90-10 block split
leaf: 0x1c016e4 29365988 (213: row:480.480 avs:9)    → 90-10 block split
leaf: 0x1c016ab 29365931 (214: row:480.480 avs:9)    → 90-10 block split
leaf: 0x1c016bb 29365947 (215: row:245.245 avs:3531) → result of 50-50 block split
leaf: 0x1c016ea 29365994 (216: row:237.237 avs:3652) → result of 50-50 block split
leaf: 0x1c0169b 29365915 (217: row:480.480 avs:9)    → 90-10 block split
leaf: 0x1c016b7 29365943 (218: row:480.480 avs:8)    → 90-10 block split
leaf: 0x1c016a7 29365927 (219: row:480.480 avs:9)    → 90-10 block split
leaf: 0x1c016b3 29365939 (220: row:245.245 avs:3532) → result of 50-50 block split
leaf: 0x1c016bd 29365949 (221: row:480.480 avs:9)    → 90-10 block split
leaf: 0x1c016f2 29366002 (222: row:480.480 avs:9)    → 90-10 block split
leaf: 0x1c01697 29365911 (223: row:480.480 avs:9)    → 90-10 block split
leaf: 0x1c016fe 29366014 (224: row:242.242 avs:3576) → result of 50-50 block split
leaf: 0x1c016de 29365982 (225: row:239.239 avs:3622) → result of 50-50 block split
leaf: 0x1c016d8 29365976 (226: row:480.480 avs:9)    → 90-10 block split
leaf: 0x1c016f6 29366006 (227: row:480.480 avs:8)    → 90-10 block split
When multiple sessions insert data into the table concurrently, database
sequences perform better compared to application sequences in terms of 90-10
index leaf block splits. There will be fewer 50-50 block splits. For single
column indexes (primary or unique) or when the leading column value of an index
is populated using monotonically increasing values, always prefer database
sequences as there is less wastage.
The above discussion of 90-10 and 50-50 index leaf block splits holds true only
for:
- single-column indexes, or
- concatenated indexes whose leading column is populated using monotonically
increasing values,
and falls apart when the column populated by the sequence or monotonically
increasing value is not the leading column, in which case the index leaf block
splits are mostly 50-50. Regardless of whether we use database sequences or
application sequences, when the data is loaded using bulk or batch inserts (for
example, via the JDBC PreparedStatement batching API), there will be more 50-50
index leaf block splits, and the extent to which they occur is directly linked
to the following:
- Batch size used
- Number of sessions inserting the data
- Index key size and database/tablespace block size
The data type of the column also plays a major role in the index bloat factor.
An indexed column of a numeric data type, populated with monotonically
increasing values from either a database sequence or an application sequence,
will see more 90-10 splits than the same column defined as VARCHAR2 or CHAR.
Using an incorrect data type can increase the bloat in the index.
When the column value is mostly unique but not monotonically increasing, and it
is randomly populated:
In this case, there will be wasted space. But since the data is mostly unique,
rebuilding or shrinking may do more damage than good, as the index leaf blocks
may get split again using the 50-50 method, generating redo, unless the PCTFREE
defined can accommodate the incoming data. As the data is mostly unique, any
potential gains achieved through rebuilding, shrinking, or coalescing will be
minimal, since the number of index blocks read to locate a value will not
drastically improve; hence there are few potential gains for SELECT statements.
Examples of such columns include message digests or message hashes.
When the column value is not unique and not monotonically increasing, a key
value will usually have multiple ROWIDs. By default, when there are multiple
ROWIDs for an indexed key value, Oracle sorts these ROWIDs in the index leaf
blocks. As an index leaf block becomes full, the blocks are usually split using
the 50-50 approach. This plays out in two different scenarios, especially when
the extent sizes differ and the tablespace has data files with enough free
space. On an ASSM tablespace, bitmaps are used to track free space and blocks
are formatted in ranges at effectively random positions within an extent;
because of ASSM's inherent behavior of spreading the load across multiple
blocks, especially when multiple sessions insert into the table, there is a
high chance that the index leaf blocks will be properly utilized following a
50-50 block split, and the utilization rate increases further when the
tablespace has multiple data files with free space across which extents are
allocated.
- When using uniform extent allocation with smaller extent sizes and a single
data file added to the tablespace when a space crunch occurs, the likelihood of
space wastage is somewhat higher as the ROWIDs increase.
- When using autoallocate extent allocation, the extent sizes tend to become
large, and with the default ASSM behavior mentioned above the likelihood of
index block wastage is very low, since multiple sessions end up inserting table
data into different blocks at different positions; with multiple data files in
the tablespace with enough free space, the index leaf blocks often tend to be
better utilized.
The above-mentioned behavior is rare, but it helps us understand Oracle
properly in its entirety.
Update an Index entry:
There is no such concept as updating a key value when it comes to indexes. When
an update to a row results in an indexed column value change, the old index
entry is marked as deleted and a new entry is inserted into the appropriate
index leaf block (at the correct location in the B-Tree).
Deleting an Index entry:
When an index key value is deleted, the corresponding entries in the leaf
blocks are marked as deleted. Remember that even if an index leaf block
contains only deleted rows, it will still be scanned during an index range
scan; in other words, the leaf block is not delinked from the doubly linked
list of leaf blocks.
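The proportion of deleted entries still lingering in an index can be checked via INDEX_STATS. A sketch, again using the index name from the earlier example:

```sql
ANALYZE INDEX students_deptid_idx VALIDATE STRUCTURE;

-- DEL_LF_ROWS counts entries still physically present but marked
-- deleted; a persistently high percentage suggests wasted space.
SELECT lf_rows, del_lf_rows,
       round(del_lf_rows / lf_rows * 100, 2) AS pct_deleted
FROM   index_stats;
```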
Generic DML behavior:
Always create your unique indexes, or the indexes that support constraints,
first, and make sure that the OBJECT_ID of these indexes is always lower than
that of the other indexes on the same table.
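The creation order can be verified from the dictionary. A sketch for the STUDENTS table in the VISHNU schema used throughout this chapter:

```sql
-- List the indexes on a table in OBJECT_ID order; the unique/primary
-- key index should ideally appear first.
SELECT o.object_name, o.object_id, i.uniqueness
FROM   dba_objects o
JOIN   dba_indexes i
       ON i.index_name = o.object_name AND i.owner = o.owner
WHERE  i.table_name = 'STUDENTS'
AND    o.owner = 'VISHNU'
ORDER  BY o.object_id;
```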
Insert behavior:
Oracle uses the following approach while inserting a row into a table:
1. Read the segment header (the L3 BMB, the 3rd block of the 1st extent).
2. Read the L2 BMB and determine which L1 BMB contains details regarding free
space.
3. Read the L1 BMB and find a block with enough space.
   a. If the L1 BMB indicates that all formatted blocks are full and there are
   unformatted blocks, format a range of data blocks (making the session wait
   on enq: FB - contention).
   b. If there is enough space, insert the row into the table block.
4. If there are indexes on the base table:
   a. Start inserting entries into the indexes in ascending OBJECT_ID order; in
   other words, if there are 4 indexes, start with the index with the lowest
   OBJECT_ID and proceed from there.
5. If the inserted value results in a constraint violation, roll back the
changes to all the blocks modified earlier.
An interesting behavior here is that if the OBJECT_ID of the index policing the
unique constraint is higher than that of the other indexes on the table, the
constraint check is delayed, which can cause higher redo and undo generation in
the event of a rollback or constraint violation.
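The redo and undo overhead can be measured from the current session's statistics before and after the failing statement. A sketch (these are standard V$STATNAME statistic names):

```sql
-- Snapshot the session-level statistics; run this before and after the
-- statement under test and compare the deltas.
SELECT st.name, ms.value
FROM   v$mystat   ms
JOIN   v$statname st ON st.statistic# = ms.statistic#
WHERE  st.name IN ('redo size',
                   'undo change vector size',
                   'rollback changes - undo records applied');
```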
For example, consider the following statements executed in the order listed
below:
create table students(roll number, name varchar2(20), dept_id number, sub_id number, mark1 number, mark2 number, mark3 number);
create index mark1_idx on students(mark1);
create index mark2_idx on students(mark2);
create index mark3_idx on students(mark3);
alter table students add constraint students_pk primary key (roll);
insert into students select rownum, dbms_random.string(0,20), round(dbms_random.value()*100), round(dbms_random.value()*100), round(dbms_random.value()*100), round(dbms_random.value()*100), round(dbms_random.value()*100) from dual connect by level < 1000;
commit;
Querying the DBA_OBJECTS results in the following:
The OBJECT_ID of the STUDENTS_PK index is higher because it was created after
the other indexes. Let's look at the implications of this: let's insert a row
into the table that causes a unique key constraint violation and observe the
amount of redo generated.
SQL> insert into students values (10, 'VISHNU', 123, 123, 123, 123, 123);
insert into students values (10, 'VISHNU', 123, 123, 123, 123, 123)
*
ERROR at line 1:
ORA-00001: unique constraint (VISHNU.STUDENTS_PK) violated
The 10046 level 12 trace of the above statement is as follows:
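To capture such a trace yourself, one common approach is the 10046 event at level 12 (binds and waits). A sketch:

```sql
-- Enable extended SQL trace for the current session:
ALTER SESSION SET EVENTS '10046 trace name context forever, level 12';

-- ... run the statement being investigated ...

-- Switch tracing off again:
ALTER SESSION SET EVENTS '10046 trace name context off';

-- Locate the trace file written for this session:
SELECT value FROM v$diag_info WHERE name = 'Default Trace File';
```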
We can clearly see that Oracle starts reading and modifying the index entries
in the order specified by OBJECT_ID. The corresponding redo generation and
rollback changes applied reveal the details.
Now let’s create the primary key immediately following the table
creation and see the benefits:
create table students(roll number, name varchar2(20), dept_id number, sub_id number, mark1 number, mark2 number, mark3 number);
alter table students add constraint students_pk primary key (roll);
create index mark1_idx on students(mark1);
create index mark2_idx on students(mark2);
create index mark3_idx on students(mark3);
insert into students select rownum, dbms_random.string(0,20), round(dbms_random.value()*100), round(dbms_random.value()*100), round(dbms_random.value()*100), round(dbms_random.value()*100), round(dbms_random.value()*100) from dual connect by level < 1000;
commit;
Querying the DBA_OBJECTS results in the following:
Specifying the PRIMARY KEY or UNIQUE keywords in the CREATE TABLE DDL results
in the unique indexes being created immediately after the table, and therefore
in consecutive OBJECT_IDs for those indexes.
The corresponding 10046 level 12 trace is as follows:
From the above we can clearly see that execution halted immediately after the
unique key constraint violation. The redo generation and rollback statistics
are as follows:
The redo generation is significantly lower, and only one undo record had to be
applied, which is basically the rollback of the change applied to the table
block.
Update Behavior:
An update, in terms of an index, is basically marking the existing entry as
deleted and inserting the new entry (based on its value and ROWID) in the
corresponding position in the B-Tree structure. Let's trace the flow using an
update statement. Consider the following:
create table students(roll number, name varchar2(20), mark1 number, mark2 number);
create index mark1_idx on students(mark1);
alter table students add constraint students_pk primary key (roll);
insert into students select rownum, dbms_random.string(0,20), round(dbms_random.value()*100), round(dbms_random.value()*100) from dual connect by level < 1000;
commit;
The OBJECT_IDs are as follows:
Let's see the behavior of the following SQL statement:
update students set mark1 = 100, roll = 999 where roll = 1;
(The above statement may look strange, but it is required to understand how an
update works and how the logic flows. This statement does result in a unique
constraint violation.)
Execution plan is as follows:
10046 level 12 trace is as follows:
The entries related to the unique index are marked in red. The flow is as
follows:
- Read the index root and leaf block to find the corresponding ROWID.
- Once the ROWID is identified, read the corresponding table block and modify
the contents.
- Proceed to update the indexes in OBJECT_ID order.
- Read the MARK1_IDX root block and locate the old value that has to be
updated.
- Mark the old value as deleted in the MARK1_IDX leaf block.
- Read the MARK1_IDX leaf block (new location) where the new entry has to be
inserted.
- Upon reaching the unique index (which has the higher OBJECT_ID), the block
that stores the old value roll = 1 is already in memory, so mark the entry as
deleted.
- Read the block (new location) where the new entry has to be inserted; upon
finding an existing entry, initiate a rollback.
The status of the blocks post rollback is as follows:
The snippet of the redo log dump indicating the changes is as follows:
CHANGE #1 CON_ID:0 TYP:0 CLS:1 AFN:7 DBA:0x01c0015c OBJ:76207 SCN:0x00000000008a80a1 SEQ:2 OP:11.5 ENC:0 RBL:0 FLG:0x0000
... update values in the table block
col 0: [ 3] c2 0a 47
col 2: [ 2] c2 02
CHANGE #2 CON_ID:0 TYP:0 CLS:57 AFN:4 DBA:0x01000228 OBJ:4294967295 SCN:0x00000000008a8055 SEQ:1 OP:5.2 ENC:0 RBL:0 FLG:0x0000
ktudh redo: slt: 0x0000 sqn: 0x000008ad flg: 0x0012 siz: 144 fbi: 0
uba: 0x0106c69f.00a8.07 pxid: 0x0000.000.00000000
CHANGE #3 CON_ID:0 TYP:0 CLS:1 AFN:7 DBA:0x01c0016e OBJ:76208 SCN:0x00000000008a80a1 SEQ:2 OP:10.4 ENC:0 RBL:0 FLG:0x0000
index redo (kdxlde): delete leaf row  // MARK1_IDX
CHANGE #4 CON_ID:0 TYP:0 CLS:1 AFN:7 DBA:0x01c0016f OBJ:76208 SCN:0x00000000008a80a1 SEQ:2 OP:10.2 ENC:0 RBL:0 FLG:0x0000
index redo (kdxlin): insert leaf row  // MARK1_IDX
CHANGE #5 CON_ID:0 TYP:0 CLS:1 AFN:7 DBA:0x01c00166 OBJ:76209 SCN:0x00000000008a80a1 SEQ:2 OP:10.4 ENC:0 RBL:0 FLG:0x0000
index redo (kdxlde): delete leaf row  // unique index
CHANGE #6 CON_ID:0 TYP:0 CLS:58 AFN:4 DBA:0x0106c69f OBJ:4294967295 SCN:0x00000000008a8054 SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
rollback initiated...
Now that you have seen how inserts and updates are performed and how Oracle uses the OBJECT_ID order when modifying the indexes, always make sure that primary key/unique indexes are created immediately after table creation. Also remember that SQL trace never shows details about blocks that are already in the buffer cache; this abstracts away how many times a buffer is accessed over the course of SQL execution.
Now let's take this a step further and see the impact of this when it comes to foreign keys.
Insert Behavior – Constraints
Foreign key:
Foreign key constraint checks are performed after the insert or update statement finishes modifying the table and its dependent indexes. Consider the following:
create table parent (roll number);
alter table parent add constraint parent_pk primary key(roll);
insert into parent select rownum - 1 from dual connect by level < 102;
commit;
create table child (child_id number, roll number, mark1 number, mark2 number);
create index child_mark1_idx on child(mark1);
create index child_mark2_idx on child(mark2);
alter table child add constraint child_pk primary key (child_id);
create index child_fk on child(roll);
alter table child add constraint child_fk foreign key (roll) references parent(roll);
insert into child
select rownum,
       round(dbms_random.value()*100),
       round(dbms_random.value()*100),
       round(dbms_random.value()*100)
from dual connect by level < 20000;
commit;
The OBJECT_IDs for the tables are as follows:
Inserting into the Parent Table:
insert into parent values (1000);
commit;
Inserting into the parent table doesn't require any foreign key checks for data integrity, as is evident from the snippet of the 10046 level 12 trace for the above statement.
Inserting into the child table:
Inserting into the child table triggers a foreign key constraint check (after modifying - inserting into - the child table and its dependent index blocks) to validate that the parent table has the corresponding value present; if not, an exception is thrown. Let's look at the following example:
SQL> insert into child values (1231231, 1231, 123, 123);
insert into child values (1231231, 1231, 123, 123)
*
ERROR at line 1:
ORA-02291: integrity constraint (VISHNU.CHILD_FK) violated - parent key not found
A snippet of the 10046 level 12 trace is as follows:
As we can see from the above, execution halts immediately after the foreign key constraint check fails.
Deleting a row:
Parent Table:
Unless the foreign key constraint is created with the ON DELETE CASCADE clause, Oracle will not allow a row to be deleted from the parent table if corresponding rows are present in the child table.
Child Table:
Deleting a row from the child table is similar to deleting a row from a normal table. The procedure starts by finding the corresponding ROWID, deleting the row from the table, and then marking the index entries in the corresponding indexes as deleted in OBJECT_ID order.
Updating a row:
Parent Table:
Oracle will not allow us to update the referenced key column value if there are any matching values present in the child table (ON UPDATE CASCADE is not supported).
Child Table:
Similar to inserts, foreign key constraint checks are performed after the modifications to the current table and its dependent indexes are done.
The takeaway from this: always index the foreign key columns on the child and parent tables, as missing indexes can significantly slow down the performance of insert/update statements.
Sequence of Steps Involved in an Index Range/Unique Scan:
During an index range scan or a unique scan, based on the key value specified in the predicate, the server process retrieves the ROWID from the index and, based on that ROWID, accesses the corresponding table blocks. The process flow is as follows:
- Read the branch root block and determine the branch block.
- Read the branch block and determine the right leaf block.
- Read the ROWID from the leaf block.
- Read the table data block based on the ROWID.
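The four steps above can be sketched as a toy traversal. This is a simplified model, not Oracle's actual block format: the block names, keys, and ROWID strings below are all hypothetical, and real branch blocks hold hundreds of entries.

```python
# Toy model of an index lookup: root -> branch -> leaf -> table block.
# Branch levels map "highest key served" to a child block name.
root = {50: "branch1", 100: "branch2"}
branches = {
    "branch1": {25: "leaf1", 50: "leaf2"},
    "branch2": {75: "leaf3", 100: "leaf4"},
}
# Leaf blocks map key -> ROWID; the table maps ROWID -> row.
leaves = {
    "leaf1": {10: "AAAR5w"},
    "leaf2": {42: "AAAR5x"},
    "leaf3": {60: "AAAR5y"},
    "leaf4": {99: "AAAR5z"},
}
table = {"AAAR5x": ("Alice", 42), "AAAR5y": ("Bob", 60)}

def index_lookup(key):
    # Step 1: read the branch root block and determine the branch block.
    branch = next(v for k, v in sorted(root.items()) if key <= k)
    # Step 2: read the branch block and determine the right leaf block.
    leaf = next(v for k, v in sorted(branches[branch].items()) if key <= k)
    # Step 3: read the ROWID from the leaf block.
    rowid = leaves[leaf].get(key)
    # Step 4: read the table data block based on the ROWID.
    return table.get(rowid) if rowid else None

print(index_lookup(42))
```

Note that the number of block visits is fixed by the height of the tree (here: root, branch, leaf, table), which is why an index lookup stays cheap regardless of table size.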
Identifying bloat in an Index:
We know that in a tablespace created with segment space management auto (ASSM), Oracle uses bitmaps to track the space utilization of blocks in a segment. Oracle uses different bitmap blocks (L1BMB, L2BMB & L3BMB) to track and manage the status of blocks in an extent, and a block can be in any of the following states:
- FREE
- FULL
- FS1 (0 – 25% free)
- FS2 (25 – 50% free)
- FS3 (50 – 75% free)
- FS4 (75 – 100% free)
- UNFORMATTED
When it comes to indexes, an index block can have only 2 states:
- FS2 – has free space and can be reused.
- FULL – even if it has only a few entries.
We can find the bloat in an index using any of the following:
- Segment advisor
- Analyze index … validate structure (INDEX_STATS.PCT_USED)
- Treedump
Using the segment advisor is by far the best and recommended approach to identify bloat in an index. The ANALYZE INDEX … VALIDATE STRUCTURE command can also be used against an index to populate various details, but this operation locks the table, preventing any DML on it. A treedump can also be used, but calculating the bloat with this approach is not tidy and is not recommended.
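As a rough model of what ANALYZE ... VALIDATE STRUCTURE reports: INDEX_STATS.PCT_USED is the percentage of the allocated B-tree space (BTREE_SPACE) actually occupied by live index data (USED_SPACE). The byte figures below are made up for illustration.

```python
def pct_used(used_space, btree_space):
    # INDEX_STATS.PCT_USED ~ USED_SPACE / BTREE_SPACE * 100, rounded.
    return round(100.0 * used_space / btree_space)

# A freshly rebuilt index typically sits near 90% (pctfree 10);
# a value drifting far lower suggests bloat worth reclaiming.
print(pct_used(used_space=4_702_113, btree_space=5_213_616))  # healthy
print(pct_used(used_space=1_980_000, btree_space=5_213_616))  # bloated
```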
Impact of additional Indexes on DML
Although an index helps speed up queries significantly, there is a performance penalty in maintaining these indexes, and it is clearly visible when performing DML on the table. Since indexes store the column value and its associated ROWID, Oracle has to maintain each index on the table: when rows are inserted, updated, or deleted, all the corresponding indexes need to be updated, which not only slows down performance but also increases undo and redo generation. The following chart gives insight into the increase in the time taken to load 100k rows with each additional index on the table.
By now we know that B-Tree indexes are implemented as balanced trees. Before proceeding further, let's understand the concepts behind the tree data structure, as they are important for what follows and for how this impacts database performance. A tree data structure requires that key-value pairs be unique; no two key-value pairs can be the same. Similarly, in Oracle, regardless of whether the index is unique or non-unique, the entries in a leaf block are always unique; whether the column value and ROWID are stored together as one key or separately depends on whether the index is unique or non-unique. The structure of a unique and a non-unique index is essentially the same, but the difference is evident in how the data is stored in the leaf blocks.
Unique Index on STUDENTS.ROLL_NO column leaf block dump.
Normal B-Tree index on STUDENTS.ROLL_NO column leaf block dump:
Notice that kdxconco, which indicates the number of key columns, is 1 for the unique index, but 2 for a non-unique index on the same column. This is because for a unique index the ROWID is not considered part of the key, whereas for a non-unique index the ROWID is part of the key. Due to this behavior the size of a unique index is always smaller than that of a non-unique index. We can also see in the leaf block dumps that kdxconro, which indicates the number of entries in the leaf block, is higher for the unique index than for the non-unique index. In other words, due to this behavior a unique index leaf block can accommodate more rows than a non-unique one.
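The capacity difference can be sketched with a toy byte-count model. All the per-entry byte costs below are illustrative stand-ins, not Oracle's real row-header layout; the point is only that the extra length byte for the ROWID key column makes each non-unique entry slightly larger.

```python
# Toy model of leaf-block capacity (made-up byte costs, not Oracle's
# actual leaf format).
BLOCK = 8000                       # usable bytes in an 8K leaf block
KEY, ROWID, LENBYTE, HEADER = 4, 6, 1, 2

# Unique index: ROWID stored as data, not as a key column (kdxconco = 1),
# so it carries no extra length byte.
unique_entry = HEADER + (LENBYTE + KEY) + ROWID
# Non-unique index: ROWID is an additional key column (kdxconco = 2)
# with its own length byte.
nonunique_entry = HEADER + (LENBYTE + KEY) + (LENBYTE + ROWID)

print(unique_entry, nonunique_entry)                     # 13 14
print(BLOCK // unique_entry, BLOCK // nonunique_entry)   # 615 571
```

Even a single extra byte per entry compounds into dozens fewer entries per leaf block, and therefore a larger index overall.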
B-Tree indexes and Constraints:
Oracle enforces unique and primary key constraints using B-Tree indexes. A B-Tree index can be either unique or non-unique. Unique indexes guarantee that no two column values in the index are the same. When we add a unique or primary key constraint to a table, Oracle automatically creates a unique B-Tree index. This has rather interesting implications: when we drop or disable the constraint, Oracle automatically drops the underlying index. To prevent this from happening, use the KEEP INDEX clause when disabling or dropping the constraint. Another important aspect is that when we create a unique index, even though we are indirectly enforcing a unique constraint, it doesn't show up in the DBA/USER/ALL_CONSTRAINTS views unless we manually add a constraint using the ALTER TABLE command. A DBA should be aware of the following behavior when it comes to constraints and unique indexes:
- If a suitable B-Tree index doesn't already exist, Oracle creates a unique index automatically when we add a unique or primary key constraint.
- If the columns are already indexed using either a unique or non-unique B-Tree index, Oracle simply uses that index to enforce the primary or unique key constraint.
When it comes to performance, enforcing unique and/or primary key constraints using non-unique indexes presents different challenges:
- Increased buffer gets / consistent gets / latch activity for each execution.
  o Index range scan vs index unique scan.
- Increased redo / undo generation while performing rollback.
Increased Logical IO (LIO)/Latch activity:
When unique or primary key constraints are enforced using a non-unique index, a predicate with an equality condition always results in an index range scan instead of an index unique scan to retrieve the table contents, as shown below:
Index Range Scan:
select * from temp where roll = 123;
Index Unique Scan:
select * from temp where roll = 123;
With a unique index, Oracle is certain that a key value returns at most one row. Due to this it examines the buffer under a single CBC (cache buffers chains) latch get, but with a non-unique index Oracle performs the full two latch gets per buffer for the index leaf block and the data block, resulting in an increase of about 50% in latch activity. In both cases the number of buffer gets doesn't change. Depending on the height of the index, a unique/range scan that returns a single row has the following path:
- Index branch root block
- Index branch block
- Index leaf block
- Table block
When queries have predicates such as <, <=, >, >=, OR, BETWEEN, IN, etc., the latch activity is the same for both unique and non-unique indexes. This behavior has a noticeable effect only when the application issues a lot of queries with equality predicates on primary key or unique columns while those constraints are protected or policed by non-unique indexes.
Increased Redo / Undo Generation during rollback:
When a non-unique index enforces the primary key or unique key constraint, Oracle straightaway modifies the index leaf blocks during an insert; only after modifying the contents is the constraint check performed, and Oracle then raises an ORA-00001: unique constraint violated error. This results in both undo and redo generation. This is not the case when a unique index enforces the constraint: Oracle validates uniqueness before modifying the contents of the index, which results in less redo and undo generation.
This behavior is the same for update statements that update a unique key or primary key column, such as the following:
update temp set roll = 10912 where roll = 1; -- roll column is protected by a unique index
In this case, the end result should be that the entry in the index leaf block containing the value 1 is deleted, and a new entry with the value 10912 is inserted into a different leaf block. With a unique index, Oracle straightaway deletes the entry in the leaf block containing the value 1, and before inserting the value 10912 it checks whether the value already exists; if it does, Oracle throws an ORA-00001 error before modifying the second block. The end result is that only one leaf block gets modified and rolled back. Redo generation in this case is roughly halved by using a unique index rather than a non-unique index to enforce the unique or primary key constraint.
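A back-of-the-envelope sketch of that difference, counting only the leaf-block changes that must be undone (not actual redo byte sizes, which vary):

```python
# update temp set roll = 10912 where roll = 1, when 10912 already exists.

# Non-unique index: the delete AND the insert both land in leaf blocks
# before the constraint check fires -> two changes to roll back.
nonunique_changes = ["delete leaf entry (roll=1)",
                     "insert leaf entry (roll=10912)"]   # then ORA-00001

# Unique index: the existence check runs before the insert touches a
# block -> only the delete has to be rolled back.
unique_changes = ["delete leaf entry (roll=1)"]          # then ORA-00001

print(len(nonunique_changes), len(unique_changes))
```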
Reverse Key indexes:
Contention on the rightmost index leaf block is clearly visible when data is inserted into a table by multiple sessions using either a database or an application sequence to populate columns such as a primary or unique key. This problem is more severe in RAC environments, where application sessions connected to different instances insert data concurrently and the instances spend most of their time transferring the blocks between buffer caches via the interconnect. The severity increases with the number of nodes in the cluster and when the sessions are evenly distributed among the instances. In these cases, implementing a reverse key index can help alleviate the contention on index leaf blocks. We can consider reverse key indexes even in single-instance environments if we notice high contention (buffer busy waits, enq: TX - index contention) on index leaf blocks.
A reverse key index is a type of B-Tree index that physically reverses the bytes of each index key while keeping the column order and preserving the ROWID. For example, if the index key is 123456 and the bytes stored for this key in hexadecimal are 15 2c 46 20 c4 in a standard B-Tree index, then the reverse key index stores the bytes as c4 20 46 2c 15. Consecutive key values such as 123456 and 123457 therefore end up with a huge gap between them instead of 1, and are likely to be stored in index leaf blocks far from each other. Thus, the IO is spread uniformly across the index leaf blocks.
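The effect is easy to simulate. The sketch below uses plain 4-byte big-endian integers rather than Oracle's internal NUMBER encoding (so the hex differs from the dump above), but it shows both properties: consecutive keys scatter once their bytes are reversed, and the logical sort order is lost.

```python
def reverse_key(n: int) -> bytes:
    # Reverse the stored bytes of the key, as a reverse key index does.
    # Illustration only: 4-byte big-endian ints, not Oracle NUMBER format.
    return n.to_bytes(4, "big")[::-1]

for k in (255, 256, 257):
    print(k, reverse_key(k).hex())
# 255 -> ff000000, 256 -> 00010000, 257 -> 01010000: neighbors scatter.

# Byte reversal does not preserve logical key order, which is why range
# predicates (<, >, between) cannot walk adjacent leaf blocks:
print(sorted([255, 256, 257], key=reverse_key))
```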
A reverse key index can be created by just adding the REVERSE clause to the CREATE INDEX DDL statement; an example is as follows:
CREATE INDEX temp_idx ON temp(roll) REVERSE;
Since the logical key order is not preserved across the leaf blocks of a reverse key index, the MIN or MAX functions result in an INDEX FAST FULL SCAN rather than an inexpensive MIN/MAX scan.
Inherent to the way reverse key indexes are implemented, they have the following disadvantages, and we will look at each one of them:
- Inability to perform index range scans.
- Inability to make use of asynchronous IO.
  o Index prefetching.
  o Batched table access.
- Increased IOPS.
- Increased number of buffers in the buffer cache.
- Increased clustering factor.
- More fragmentation in the index.
  o 50-50 block splits.
- Batching in PreparedStatements (application level): bind variables via JDBC remain necessary, but do not batch transactions, especially inserts!
A big disadvantage of reverse key indexes is their inability to perform index range scans. By default, Oracle performs an index range scan on a non-unique index even when the predicate contains an = condition, as a key value in a non-unique index can return multiple ROWIDs. Reverse key non-unique indexes do support index range scans in a limited way when the predicate contains conditions such as IN, OR, or =, but range-based predicates such as <, <=, >, >=, and BETWEEN are not supported: adjacent leaf blocks may contain key values unrelated to the range specified in the predicate, so the session always ends up performing either a full table scan or a full index scan. Remember that index full scans and index fast full scans are still supported.
When a key value retrieves multiple rows from the table, Oracle does a pretty good job of utilizing features such as prefetching and batching, in which a session batches multiple IO requests to read the table blocks into one io_submit call, provided that asynchronous IO is available, so that all the blocks are read in parallel instead of serially. This improves performance considerably, provided the IO subsystem has enough IOPS to spare and the disk layout is proper. Implementing reverse key indexes in cases where most of the SQL statements query this table alone (single table access, no joins) with an equality predicate will not have much performance impact. Summarizing: if the queries access this table alone, we may not see any performance effect.
When the indexed key column has a high number of distinct values, or the columns are protected by primary key or unique key constraints, and queries use this indexed column in a join condition, Oracle's ability to effectively make use of prefetching, batching, and asynchronous IO diminishes, regardless of whether the table is the driving or driven table. Oracle resorts to the old behavior of reading one entry from the index and then reading the corresponding table block. This increases the IOPS on the server considerably if the data blocks are not already cached. If the query executes frequently and the blocks are already in the buffer cache, we might not see a noticeable performance impact unless blocks have to be read from disk.
When the table's average row length is very small and the table is extremely large, there is a high chance that the dependent structures such as indexes will also be large. Because a reverse key index distributes the keys (not necessarily evenly) among the available leaf blocks, even an application that works only with the latest data touches leaf blocks across the whole index. Converting the existing index into a reverse key index, or rebuilding it as one, will eventually bring more blocks into the buffer cache as the data is spread across several leaf blocks, and the buffer cache size requirement will increase considerably depending on the size of the table and the amount of activity against it. Remember, all it takes is an additional 100MB of buffer cache requirement to hold the active working set of the segments to cause a cascading effect, resulting in waits such as free buffer waits, write complete waits, buffer exterminate, db file sequential read, etc. Before implementing a reverse key index, always perform the following checks:
- The number of index buffers in the buffer cache and the ratio of clean/XCUR vs CR buffers.
- How busy is the buffer cache?
- Is the buffer cache properly sized?
Since the byte order is reversed, subsequent insert statements update index leaf blocks at various locations, so the block splits are 50-50. Due to this, the space requirement for a reverse key index is usually larger than for a regular index, and the ability of a leaf block to hold more key values is reduced. Periodically monitor the size of the index and see whether rebuilding can help reduce the bloat.
There is a general recommendation floating around that says: when you see high waits on log file sync due to a huge number of commits, advise the application team to batch the transactions. This can prove deadly when inserting into tables that have reverse key indexes, and the impact increases with the batch size. As sessions batch insert statements, the likelihood of a session hitting the same set of index blocks increases considerably, and since the block splits are 50-50, more time is wasted moving entries between leaf blocks.
The clustering factor of a reverse key index is inherently higher than that of a regular index and will usually be very close to the number of rows in the table, so consider this when implementing reverse key indexes. The higher clustering factor may not have a noticeable impact compared with a regular index when the table's average row length is high, but it will have a significant impact when the average row length is small, as the index is then large relative to the table.
Invisible Indexes:
Starting with 11g Release 1, an index can be marked as visible or invisible. By default, the optimizer ignores invisible indexes unless we explicitly set the OPTIMIZER_USE_INVISIBLE_INDEXES initialization parameter to TRUE at either the session or system level. Unlike an unusable index, an invisible index is still maintained by Oracle during DML; the visibility status is designed exclusively to hide certain indexes from the optimizer during cost estimation. Both visible and invisible indexes are maintained by Oracle in the same fashion. We can even make indexes that support integrity or referential integrity constraints invisible, but Oracle internally still maintains them unless we make them unusable. Although we can make a partitioned index invisible, we cannot make an individual index partition invisible while leaving the other partitions visible. The invisible index feature is primarily used for the following:
- Testing the removal of an index before dropping it.
- Testing new indexes without affecting the overall application.
Making an index invisible to the optimizer is less invasive than making it unusable or dropping it. Once an index is made unusable, the only way to make the optimizer use it again is to rebuild it, which can take a huge amount of time depending on the size of the index. Another important use case is testing the performance of a new index without affecting the existing application queries. The introduction of a new index may cause existing plans to change, for better or for worse; in these cases we can create the index as invisible so that the optimizer is not aware of it, test the performance of the queries in a session with OPTIMIZER_USE_INVISIBLE_INDEXES set to TRUE, and once the performance is known to have improved or the necessary benefits are attained, make the index visible.
Making an index visible or invisible invalidates all the dependent cursors on that table, and these cursors have to be reparsed, resulting in new child cursors. If, immediately after the visibility change, there are a large number of such cursors in the shared pool, or a huge number of sessions concurrently executing SQL against the table, then expect to see library cache contention: sessions may end up waiting on events such as cursor: pin S or library cache: mutex X. The extent of the impact is directly proportional to the number of cursors and the total number of sessions. The reason a cursor was invalidated is recorded in V$SQL_SHARED_CURSOR.REASON. The following is an example:
<ChildNode>
<ChildNumber>0</ChildNumber>
<ID>41</ID>
<reason>Marked
for Purge(5)</reason>
<size>1x1</size>
<unsafe_ddl_code>11</unsafe_ddl_code>
</ChildNode>
Composite Indexes:
Creating an index on multiple columns results in a composite or multicolumn index. We can create either a B-Tree or a bitmap composite index. Consider the following table:
create table temp(A number, B number, C number, D number);
and we create a composite index on the table as follows:
create index comp_idx on temp(A, B, C);
The optimizer considers this index in its cost estimates for all the following sets of predicates:
Index Range Scans:
where a=? and b=? and c=?
where a=? and b=?
where a=? and c=?
where a=?
where (a between ? and ?) or (a > ?) or (a < ?) or (a in (...)) or (a = ? or a = ?)
etc.
Index Skip Scans:
where b=?
where c=? -- depending on the NDV of columns a and b
where b=? and c=?
Oracle can still utilize the index when the predicates contain an OR condition, but only when the predicates contain the leading column of the concatenated index, as follows; this is due to the restriction imposed by index skip scans, which we will look into later.
where a=? or b=?
In this case the plan will appear as follows:
In older versions (pre-11g), the optimizer considered a concatenated index only when the leading column or columns of the index were specified in the predicate clause. Starting with 11g, Oracle can consider the index for skip scans as well. Column order in a composite index doesn't matter when all the columns of the index are specified in the query: the number of lookups or reads required to execute the query doesn't change regardless of the column order. But there is an important catch here: the size of a composite index increases when the number of distinct values in the leading column or columns is high. It is always better to have the most frequently used column or columns as the leading columns of a concatenated index. A good starting point is SYS.COL_USAGE$, which gives a high-level overview of the types of predicates used against each column and the number of times they were used.
Consider a situation where both columns A and B are specified in the predicate clause as follows:
where a=? and b=?
Column A has only 100 distinct values and column B has only 90 distinct values, and the column values are uniformly or randomly distributed. The optimizer may not pick an index on either column A or column B alone, because it would have to read too many table blocks. But if there exists a relationship between columns A and B, or the combined cardinality of the two columns is really high, the optimizer is likely to consider an index created on (A, B), thereby reducing the number of lookups required. For example, there are 50 states and 3242 counties; with a table of over 1 billion rows, querying the table on an indexed state or county column alone will always result in a full table scan, as the number of rows retrieved through the index lookup is very high. But we know that a county lies in only one state, so there exists a clear relationship: creating an index on (state, county) results in the optimizer selecting an index scan over a costly full table scan, as the combined cardinality of the state and county columns can be as high as 50 * 3242 = 162100, and listing the employees in a state department using an index on the county and state columns will result in an index scan. Similarly, if the cardinality of column A is 200 and that of column B is 200, the combined cardinality of the two columns is 40000; in such cases, when queries use both columns in their predicates, creating an index on both columns is very beneficial. Therefore, by creating a multicolumn or concatenated index on multiple low-cardinality columns, we improve the odds of the optimizer choosing an index scan over a costly full table scan.
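The selectivity arithmetic behind this can be sketched as follows (the row count comes from the example above; the uniform-distribution assumption is the optimizer's default in the absence of histograms):

```python
rows = 1_000_000_000  # table size from the states/counties example

# Expected rows per equality predicate, assuming uniform distribution
# (selectivity = 1/NDV):
per_state  = rows / 50      # 20,000,000 rows -> full table scan wins
per_county = rows / 3242    # ~308,452 rows   -> still far too many

# Treating (state, county) as independent gives the product of the NDVs:
combined_ndv = 50 * 3242
print(combined_ndv)                 # 162100
print(round(rows / combined_ndv))   # ~6169 rows -> index scan wins

# Likewise for two independent 200-value columns:
print(200 * 200)                    # 40000
```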
There are several aspects that we have to understand regarding multicolumn indexes. For example, let's take a students table which records the marks a student gained in a particular subject, and load it with 1000 rows:
create table students (student_id number, entry_id number, dept_id number, subject_id number, mark1 number);
create index students_idx on students(subject_id, mark1, student_id);
Depending on the distribution of the data and the NDV of the leading columns in a multicolumn index, Oracle can use either the leading column alone or several columns in the index branch blocks to uniquely identify the leaf block containing the corresponding index entry. As rows are inserted into the table, if Oracle determines that the NDV of the leading column is not enough to uniquely identify the leaf block, it includes the second column of the multicolumn index in the branch block entries, and this process can continue until all the columns are included in the branch blocks. So, don't be surprised to find branch blocks containing entries with only a leading column, or more. Let's demonstrate this.
We load 1000 rows into the above table as follows:
insert into students (student_id, dept_id, subject_id, mark1)
select rownum, round(dbms_random.value(0,100)), round(dbms_random.value(0,300)), round(dbms_random.value(0,100))
from dual connect by level < 1000;
commit;
Dumping the B-Tree structure gives us the following:
branch: 0x1c00173
29360499 (0: nrow: 3, level: 1)
leaf: 0x1c00174 29360500 (-1: row:344.344
avs:819)
leaf: 0x1c00175 29360501 (0: row:328.328
avs:824)
leaf: 0x1c00176 29360502 (1: row:327.327
avs:842)
Now let's look at the branch block dump:
row#0[8048] dba:
29360501=0x1c00175
col 0; len 2;
(2): c1 64
col 1; TERM
row#1[8039] dba:
29360502=0x1c00176
col 0; len 3;
(3): c2 03 07
col 1; TERM
----- end of
branch block dump -----
In this case, subject_id has 294 distinct values (the exact number varies depending on how dbms_random.value generates values). Now let's load 10000 rows into the table and look at the contents of the branch block:
row#0[8045] dba:
29360485=0x1c00165
col 0; len 2; (2): c1 0a
col 1; len 2;
(2): c1 4e
col 2; TERM
row#1[8037] dba:
29360486=0x1c00166
col 0; len 2;
(2): c1 14
col 1; TERM
row#2[8023] dba:
29360487=0x1c00167
col 0; len 2;
(2): c1 1d
col 1; len 2;
(2): c1 43
col 2; len 2;
(2): c2 2d
col 3; TERM
We can clearly observe that for a particular range of values Oracle uses only the leading column, the leading two columns, or all the columns of the index. This is done for efficiency: it reduces the number of branch blocks and increases the number of entries in each branch block. A leading column with high NDV requires a smaller number of branch blocks, as Oracle includes only the leading column and the branch block can therefore hold more entries. Having a high-NDV leading column also increases the chances of branch blocks being subjected to 50-50 splits, but this is dependent on various factors such as the following:
- Whether the value is monotonically increasing and strict ordering is used.
- The data type of the leading column.
- The nature of the data load.
But following an index rebuild or creation, the number of branch blocks in a multicolumn index with a high-NDV leading column will be significantly lower than in a multicolumn index with a low-NDV leading column, where Oracle stores two or more columns in the branch blocks.
There are several takeaways from this:
- Regardless of the column order, when all the indexed columns are specified in the predicates, the number of lookups required to execute the query doesn't change.
- Regardless of the column order, following an index rebuild or creation the size of the concatenated index will be almost the same, with the exception of the branch blocks (unless index key compression is enabled).
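The branch-block behavior described above amounts to storing, for each leaf boundary, the shortest leading prefix of the key that separates one leaf block from the next. The sketch below is a simplified model of that idea (Oracle's actual separator-key logic is internal); keys are (subject_id, mark1, student_id) tuples as in students_idx, with made-up values.

```python
def separator(prev_key, next_key):
    # Shortest leading-column prefix of next_key that distinguishes it
    # from the last key of the previous leaf block (simplified model).
    for i in range(len(next_key)):
        if prev_key[i] != next_key[i]:
            return next_key[: i + 1]
    return next_key

# High-NDV leading column: one column is enough for the branch entry.
print(separator((7, 55, 901), (9, 12, 14)))   # (9,)
# Leading column ties: more columns go into the branch entry.
print(separator((9, 60, 120), (9, 66, 44)))   # (9, 66)
print(separator((9, 66, 44), (9, 66, 310)))   # (9, 66, 310)
```

This matches the dumps above: branch rows end at `col 1; TERM` when one column suffices and carry extra columns only where adjacent leaves share a leading-column value.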
Size of a multicolumn index – column order:
Consider the following table and its indexes:
create table students(student_id number primary key, dept_id number not null, mark1 number not null, mark2 number);
create index sid_did_mark1_idx on students(student_id, dept_id, mark1);
create index did_sid_mark1_idx on students(dept_id, student_id, mark1);
create index did_mark1_sid_idx on students(dept_id, mark1, student_id);
The cardinality of each column is as follows:
student_id → unique
dept_id → 300 (1-300)
mark1 → 300 (1-300)
After an initial load from multiple sessions, the index structure and the space usage are as follows:
Remember that branch blocks can also be split using either the 50-50 or the 90-10 approach. The bloat in the index leaf blocks and branch blocks is completely at the mercy of how the data is loaded into the table. If we observe closely, USED_SPACE is almost the same for all the indexes; this is because the leaf blocks store all the indexed column values along with the ROWID, and the small variation in USED_SPACE can be due to the presence of NULLs. The variation in BTREE_SPACE is due to the nature of the splits. Following a rebuild of these indexes, the corresponding structure is as follows:
Notice that the index SID_DID_MARK1_IDX, which has a high-NDV (in our case unique) leading column, requires only a small number of branch blocks, since the student_id column alone is enough to traverse down the index to a particular leaf block/index entry without additional leaf block reads.
Now let's take a quick look at the benefits of index key compression and how it can reduce the overall size of an index. Let's rebuild the index DID_MARK1_SID_IDX with the additional clause COMPRESS 2 and check the resulting size.
Notice that both USED_SPACE and BTREE_SPACE are reduced significantly. We will see more details regarding how this is achieved in the key-compressed indexes section.
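A rough sketch of why COMPRESS 2 helps here: with low-NDV leading columns (dept_id, mark1), many consecutive leaf entries share the same two-column prefix, so storing each distinct prefix once per leaf block saves space. This is a toy byte-count model with made-up column sizes, not Oracle's actual prefix/suffix leaf format.

```python
def leaf_bytes(entries, prefix_cols=0, col_size=4):
    # Toy model: every column costs col_size bytes. With "COMPRESS n",
    # each distinct n-column prefix is stored once; the remaining suffix
    # columns are stored per entry.
    if prefix_cols == 0:
        return len(entries) * len(entries[0]) * col_size
    prefixes = {e[:prefix_cols] for e in entries}
    suffix = len(entries[0]) - prefix_cols
    return (len(prefixes) * prefix_cols * col_size
            + len(entries) * suffix * col_size)

# did_mark1_sid_idx keys: (dept_id, mark1, student_id). student_id is
# unique, but many rows repeat the same (dept_id, mark1) pair.
entries = [(d, m, s)
           for s, (d, m) in enumerate([(1, 10)] * 50 + [(1, 20)] * 50)]
print(leaf_bytes(entries))                  # uncompressed
print(leaf_bytes(entries, prefix_cols=2))   # COMPRESS 2
```

The same model shows why a unique leading column like student_id defeats compression: every prefix is distinct, so nothing is deduplicated.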
A high-NDV leading column in a concatenated index deprives us of:
- Reaping the benefits of index key compression, which
not only reduces the overall size of the index but also reduces the number of
lookups required (this matters when the table’s average row length is small, in which
case the index is large relative to the table). This improves the efficiency of
the buffer cache, as fewer blocks are required, and reduces the
overall query elapsed time (though the savings with regard to
buffer cache and elapsed time are subject to many factors such as the
number of SQL executions, the space savings achieved, etc.).
- The ability of the optimizer to use or consider that index
for index skip scans.
Having low-cardinality columns as the leading columns in a multi-column
or composite index has the following advantages:
- Ability to use index key compression.
- Ability to perform an index skip scan, even when the
predicates refer only to the third column.
Having multiple indexes on the same set of columns but with a different column
order can result in the optimizer computing the same index access cost, and in certain
situations this can cause problems. For example, consider the following scenario:
create table
students (student_id number, name varchar2(20), dept_id number, sub_id number,
marks number);
create index dept_sub_marks_idx
on students(dept_id, sub_id, marks);
create index marks_dept_sub_idx
on students(marks, dept_id, sub_id);
Consider the following query:
select * from
students where dept_id = 1 and sub_id = 10 and marks = 100;
The corresponding 10053 level 12 trace is as follows:
****** Costing
Index DEPT_SUB_MARKS_IDX
SPD: Return code in qosdDSDirSetup: NOCTX,
estType = INDEX_SCAN
ColGroup Usage:: PredCnt: 3 Matches Full: #1 Partial:
Sel: 1.1259e-07
SPD: Return code in qosdDSDirSetup: NOCTX,
estType = INDEX_FILTER
ColGroup Usage:: PredCnt: 3 Matches Full: #1 Partial:
Sel: 1.1259e-07
Estimated selectivity: 0.010000 , col: #2
Estimated selectivity: 0.010000 , col: #4
Estimated selectivity: 3.1250e-04 , col: #5
Access Path: index
(AllEqRange)
Index:
DEPT_SUB_MARKS_IDX
resc_io: 5.000000 resc_cpu:
36467
ix_sel: 1.1259e-07
ix_sel_with_filters: 1.1259e-07
Cost: 5.000978 Resp:
5.000978 Degree: 1
****** Costing Index MARKS_DEPT_SUB_IDX
SPD: Return code in qosdDSDirSetup: NOCTX,
estType = INDEX_SCAN
ColGroup Usage:: PredCnt: 3 Matches Full: #1 Partial:
Sel: 1.1259e-07
SPD: Return code in qosdDSDirSetup: NOCTX,
estType = INDEX_FILTER
ColGroup Usage:: PredCnt: 3 Matches Full: #1 Partial:
Sel: 1.1259e-07
Estimated selectivity: 3.1250e-04 , col: #5
Estimated selectivity: 0.010000 , col: #2
Estimated selectivity: 0.010000 , col: #4
Access
Path: index (AllEqRange)
Index:
MARKS_DEPT_SUB_IDX
resc_io: 5.000000 resc_cpu:
36467
ix_sel: 1.1259e-07
ix_sel_with_filters: 1.1259e-07
Cost: 5.000978 Resp:
5.000978 Degree: 1
The final plan is as follows:
One might wonder how Oracle chose MARKS_DEPT_SUB_IDX. The answer is that Oracle uses the
following statistics/metrics, in order, when dealing with equally
ranked indexes:
- The index with the higher number of DISTINCT_KEYS.
- The index with the lower number of leaf blocks.
- The index whose name sorts lower in ASCII order
(for example, an index whose name starts with A is preferred over one starting with Z).
For the above indexes, the NDV and the number of leaf blocks are as
follows:
Since MARKS_DEPT_SUB_IDX had more distinct keys, it is clearly the choice. But wait a minute, how can the number
of distinct keys vary between the two? This is because, while gathering statistics using the DBMS_STATS.GATHER_*_STATS procedures
with estimate_percent => dbms_stats.auto_sample_size, Oracle estimates the number of distinct keys
and leaf blocks instead of computing the actual
numbers. We will learn more about this in the statistics gathering
section. Once you gather the statistics using estimate_percent
=> 100, we will get the actual
results.
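A sketch of how the estimated versus actual figures might be compared (note that a 100% sample can be expensive on large tables):

```sql
-- Estimated statistics (gathered with the default auto sample size)
SELECT index_name, distinct_keys, leaf_blocks
  FROM user_indexes
 WHERE table_name = 'STUDENTS';

-- Regather with a full compute to get the actual figures
EXEC dbms_stats.gather_table_stats(user, 'STUDENTS', estimate_percent => 100, cascade => TRUE);
```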
When the predicates contain only a subset of the indexed columns
(for example, dept_id = ? and sub_id = ?), this behavior can cause the optimizer to pick an index or plan that appears
good for the first execution but may not be good for subsequent
executions.
Key-Compressed Indexes:
Index key compression was introduced in Oracle 8i. The COMPRESS option of the CREATE or ALTER INDEX statement enables
key compression, which eliminates the need to store
repeated occurrences of column values in the index leaf blocks, resulting
in a reduction of the index space requirements (there are cases where key compression
can result in increased space usage; we will look into them as we progress).
Consider the following index:
create index idx
on randomload(mark5);
Here mark5 has about 3200 distinct values with uniform distribution. As
an indexed column value maps to multiple rows, Oracle stores the column
value for each corresponding ROWID in the index leaf blocks as follows:
row#402[2793]
flag: -------, lock: 0, len=13
col 0; len 3;
(3): c2 13 43
col 1; len 6;
(6): 00 80 55 8a 00 1a
row#403[2780]
flag: -------, lock: 0, len=13
col 0; len 3;
(3): c2 13 43
col 1; len 6;
(6): 00 80 55 8c 00 34
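The rebuild with compression, discussed next, might be issued as follows (a sketch; with no prefix length specified, the single indexed column becomes the shared prefix):

```sql
-- Rebuild the single-column index with key compression enabled
ALTER INDEX idx REBUILD COMPRESS;
```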
Don’t you think that’s a bit of a waste of space? Now if we rebuild the
same index using the COMPRESS option, then instead of storing the key value again and
again for each entry, Oracle stores the column value only once in the leaf
block. The entries in the leaf blocks appear as follows:
prefix row#0[8027] flag:
-P-----, lock: 0, len=5
col 0; len 2;
(2): c1 3c
prc 651
row#0[8018] flag: -------,
lock: 0, len=9
col 0; len 6;
(6): 01 c0 da e2 00 0a
psno 0
row#1[8009] flag: -------,
lock: 0, len=9
col 0; len 6;
(6): 01 c0 da ea 00 33
psno 0
From the above leaf block dump, we can notice that the column key value is
stored only once and the subsequent index entries store only the ROWID. Running
analyze index … validate structure displays the details before and after
compression is enabled for the index mentioned earlier.
NAME       LF_BLKS    BR_BLKS    DISTINCT_KEYS
---------- ---------- ---------- -------------
IDX            107696        201          3200
IDX             79299        149          3200   <-- COMPRESS
So, to reiterate: the index key compression feature eliminates the need
to store the indexed column value multiple times in an index leaf block.
Rebuilding an index using the COMPRESS option will reduce the number of
index leaf and/or branch blocks if the column values are uniformly distributed
or if a column key value usually maps to multiple rows.
We can use the COMPRESS option for both B-Tree indexes and IOTs. The compression is
achieved by splitting the index key into two parts: the prefix part, which contains the column value(s) depending on whether the index is
a single-column or multi-column index, and the suffix part, which usually
contains the ROWID or, if we have manually specified the prefix length, the rest
of the columns of a multi-column (concatenated) index. When an indexed key value maps to
multiple rows, the prefix part (the column values) is shared across multiple
suffix entries, as the suffix entries are usually unique. We get better
compression when an index key value maps to multiple rows.
The compression is usually done in the leaf blocks of a B-Tree index.
The index entry (key value – ROWID) are compressed locally within the same block.
As each one of these compressed rows (suffix part) refers to the corresponding
prefix, the prefix is also stored in the same block. By storing the prefix and
suffix locally in the same block, each index leaf block is self-contained and
in order to reconstruct the complete key no additional IO is required.
A single-column unique index is not a candidate for the key compression
feature, as it is not supported in the first place; the following explains
why key compression is not good for unique (and mostly unique)
columns:
- As all the key values are unique, and key compression
relies on repeated column values, there are no potential benefits to
compressing such an index. In fact, utilizing key compression would not only
increase the space requirements but also the CPU utilization.
- Unique indexes are special in the way they store the
entries in the leaf blocks: they consider the ROWID as part of the
key. As key compression splits an index entry into a prefix and a
suffix (the prefix containing the column value and the suffix containing
the ROWID), ROWIDs would no longer be considered part of the key.
For a multi-column unique index, if the leading columns are not unique
(i.e., each value maps to multiple rows), key compression can help. Even in this case,
unless the leading-column values repeat often, the potential
benefits of key compression can be small. We need to find the right balance
between the distribution of values and the effects of the prefix/suffix split.
By default, the COMPRESS option of CREATE/ALTER INDEX includes all the columns of a non-unique index as part
of the prefix. If we want only certain leading columns, known to have
repeating values, to be part of the prefix, the COMPRESS clause also lets us
specify the prefix length, i.e., the number of leading columns to include in the prefix,
as follows:
create index
temp_idx on temp(mark1, mark2, mark3) compress 2;
In this case, Oracle includes only the columns mark1 and mark2 as part of
the prefix.
Remember that although key compression reduces the size of an index
by sharing parts of keys across multiple entries, there is a CPU overhead to
reconstruct the column values during index scans, even though the
prefixes are stored locally in the block. If a SQL statement runs
frequently but touches only a few index blocks, the CPU overhead may be on
the order of < 1%, and the elapsed time of each execution
may not increase at all. It is only when the SQL statement has to read or
process many index blocks per execution that the impact becomes clearly
visible. The overhead grows with the number of index leaf blocks read, and the
elapsed times of the SQL statement increase considerably along with the
CPU utilization.
The following is an example where the SQL statement touches more than
39 index blocks. We can clearly see the impact on the CPU utilization and
elapsed times for each execution (the shared pool is flushed so that the impact
of the disk reads is not considered). We can see that even though
the consistent reads are far fewer for the SQL statement that used the compressed
index, the elapsed times are significantly higher.
Similarly, the following is a situation which basically shows the impact
on the elapsed times and CPU utilization for a SQL statement that basically
touches very few index blocks.
There is a popular misconception that key compression is effective only
when the number of distinct values is small. This is not true: even if there are
100k distinct values in a column, uniformly distributed across a table with over 1
million rows, key compression can still help reduce the size of the index.
Finally, summarizing, index key compression can be considered in the
following cases:
- A non-unique index in which an index key
value maps to multiple rows.
- A multi-column unique index in which the NDV
of the leading columns is very low.
- An IOT (index-organized table).
Some points to remember:
- Key compression becomes less effective
when the NDV of a column or set of leading columns is very high and the data is
randomly distributed; for example, when there are 100k distinct values of a
column (or columns) in a table with just 200k rows, only a few column values map
to multiple rows, and the select statements end up
reading the index uniformly. So, even before implementing index
key compression, investigate how the data is organized in the table.
- By default, using the COMPRESS clause without a prefix length
compresses all columns of the index.
- Similar to unique indexes, ROWIDs are included as part
of the key in a compressed multi-column unique index.
Advanced key compression:
Consider a scenario in which the leading column of a multi-column index
has a low NDV and the trailing column has randomly distributed, largely
unique values. With the default key compression feature (COMPRESS), the DBA has no
option but to include only the leading column as the prefix. Enter advanced
index key compression to the rescue. Advanced key compression was introduced in
12.1 and is available only with the Advanced Compression license. This feature
allows Oracle to automatically determine, for each leaf block, the column or set of columns to
compress:
compress for
advanced low
Consider the following composite index:
Create index idx
on students (department_id, address_gps);
Here we are sure that there are many students in each department, and
the address of a student in GPS coordinates is mostly unique; only for students
who stay in hostels will the GPS coordinates be the same. (No real application
requires such an index, but it gives an overall idea of the situations in which we
can consider such a composite index.) In cases such as these, advanced key
compression will help in reducing the size of the index. Without the advanced
compression feature, the DBA has no option but to specify department_id as the prefix, even
though there is great scope for compressing other common values:
Create index idx
on students (department_id, address_gps) compress 1;
To use this feature, create or rebuild the index as follows:
create index idx
on students (department_id, address_gps) compress for advanced low;
alter index idx
rebuild compress for advanced low;
As is the case with plain key compression, COMPRESS FOR ADVANCED LOW
does not allow us to compress a single-column unique index; COMPRESS FOR
ADVANCED HIGH, however, allows it, as it uses an altogether different
compression algorithm (similar in spirit to hybrid columnar compression).
Oracle introduced COMPRESS FOR ADVANCED HIGH in 12.2; it uses a
completely new algorithm, and the compression ratios are very high even when the
column contains unique values. This is intended for warehouse or OLAP workloads,
not for OLTP.
alter index idx
rebuild compress for advanced high;
The index key compression feature doesn’t alter the optimizer’s ability to
consider index range/full/skip/fast full/unique scans; all of these remain supported.
Function-Based Indexes:
If we use predicates such as lower(name) like
'asdf%' on normally indexed columns,
Oracle performs a full table scan, because Oracle has no way of knowing
the selectivity of the predicate, and applying the function to every
indexed column value in the table would be much costlier than performing a full
table scan. Enter function-based indexes to the rescue. A function-based index
computes the value of a function (predefined or user-defined) or an
expression (+, -, /, case, etc.) involving one or more columns
and stores it in the index. Oracle supports both B-Tree and Bitmap
function-based indexes. Function-based indexes (FBIs) are specialized
indexes: once they are created, Oracle automatically adds a virtual
column to the table, and this can be seen in views such as DBA_TAB_COLS,
DBA_TAB_COL_STATISTICS, and USER_TAB_COLS (DATA_DEFAULT). Remember that descending indexes
are internally implemented as function-based indexes. Oracle will not allow us
to create a function-based index unless the function is deterministic
(one that returns the same value for the same input every time; for example, the value
returned by TO_CHAR in TO_CHAR(SYSDATE,'DD-MON-YYYY HH24:MI:SS') differs each time we invoke it and hence
is not deterministic).
Since the values stored in a function-based index are precomputed from
a function or expression on a column or set of columns, a server
process can use the index directly instead of computing the value at
run time. This not only saves CPU and I/O (as Oracle would otherwise perform a full table scan)
but also reduces the elapsed time of the query significantly (provided the
predicate results in fewer rows returned). Function-based indexes are
predominantly used for indexing case-sensitive character columns
such as CHAR and VARCHAR2.
Once we create a function-based index, the optimizer will consider that
index in its cost estimates provided the function or expression used in the
predicate is identical to the one used when creating the index. This is very important, and
we will see why it matters. Consider the following query:
select * from
students where upper(name) = 'ROBERT';
Unless the index is created as follows, the optimizer will not consider
the index in the cost estimates.
create index
stud_name_idx on students(upper(name));
We can find the expression or function used to create a function-based
index using either DBA_IND_EXPRESSIONS.COLUMN_EXPRESSION or by extracting the DDL of the index using DBMS_METADATA.GET_DDL. When
indexing foreign key columns on a child table, make sure those indexes
are not created as function-based indexes, as this causes locking issues; if a
function-based index is truly required, create it in addition to the plain
B-Tree index on the foreign key.
Oracle allows us to selectively index column values using either a user-defined
function or a CASE expression. Consider the following index:
create index
mark1_idx on randomload(case mark1 when 1 then null else mark1 end);
In this case, Oracle will index all values other than 1, since we are
using a CASE expression to store NULL when the mark1 value is 1. Unless we
specify the entire expression in the WHERE clause, Oracle will not use the
index:
select avg(mark2)
from randomload where case mark1 when 1 then null else mark1 end=98
Function-based indexes are not restricted to applying functions or
expressions to single columns; we can build either single-column or
multi-column function-based indexes using multiple table columns. Consider the
following:
create index
average on randomload (mark1 + mark2 + mark3 + mark4 + mark5 + mark6, round ( (mark1
+ mark2 + mark3 + mark4 + mark5 + mark6)/6));
A function-based index is used only when the predicate contains the
exact expression or function used while creating the index. For example,
considering the index created above, unless we write the SQL statement as follows,
the optimizer will not consider the index:
select roll from
randomload where mark1 + mark2 + mark3 + mark4 + mark5 + mark6 = 123;
Instead of specifying the entire expression or function used while
creating the index, we can also use, in the WHERE clause, the virtual column that
Oracle creates automatically along with the function-based index, as
follows:
SQL> select *
from randomload where "SYS_NC00009$" = 5747;
Predicate
Information (identified by operation id):
---------------------------------------------------
2 -
access(ROUND(("MARK1"+"MARK2"+"MARK3"+"MARK4"+"MARK5"+"MARK6")/6)=5747)
If we notice the column name, it starts with SYS_*; this is because it is
automatically generated by Oracle. As Oracle treats this as a hidden column,
histograms on such columns are not collected by default, unless we specify
either FOR ALL HIDDEN COLUMNS in the METHOD_OPT parameter or name the hidden column explicitly in
METHOD_OPT. Unless the values
returned by the function-based index's expression are uniformly distributed, collecting histograms on these
virtual/hidden columns is essential.
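A sketch of gathering histograms on such hidden columns (table name from the example; the SIZE value is an assumption, any valid histogram size works):

```sql
BEGIN
  dbms_stats.gather_table_stats(
    ownname    => user,
    tabname    => 'RANDOMLOAD',
    -- also collect histograms on the hidden columns backing FBIs
    method_opt => 'FOR ALL HIDDEN COLUMNS SIZE 254');
END;
/
```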
There is a specific reason for pointing out that we can use the virtual
column in the predicate. As the expressions or functions used in the
predicate become large, there is a higher probability of mistakes
while writing the code. An even better alternative is to
add virtual columns to the table and index those columns, which makes coding
simpler at the application level. Continuing the same example,
instead of creating the function-based index, we could do the following:
alter table
randomload add (total number generated always as (mark1 + mark2 + mark3 + mark4
+ mark5 + mark6) virtual, average number generated always as (round((mark1 + mark2
+ mark3 + mark4 + mark5 + mark6)/6)) virtual);
create index
total_avg_idx on randomload(total, average);
And now the SQL statements can simply refer to these columns in the
predicate as follows:
select roll from
randomload where total = 123123;
We can disable the use of function-based indexes using the _disable_function_based_index
hidden parameter. MIN/MAX functions on function-based indexes are supported
(descending indexes are an exception) and will result in an INDEX FULL SCAN (MIN/MAX) for
numeric columns.
Following are the limitations of function-based indexes:
- Skip scans are not supported (remember reverse key indexes).
- The expression cannot include aggregate functions such as SUM, AVG, etc.
- The database ignores a function-based index while doing an OR-expansion.
- The function or expression should produce values of deterministic
length (operations on strings should result in bounded-length output).
- Direct path loads are not supported on a table with a function-based
index that depends on a user-defined function.
- Compared with a regular index, a table with a function-based
index experiences slower inserts and updates.
Descending Index:
Starting from Oracle 8i, we can create an index on a column in
descending order. Oracle internally implements descending indexes as
function-based indexes (remember that skip scans are not supported with
function-based or reverse key indexes). By default, when we create an index,
the entries are stored in ascending order. Descending indexes are useful in
cases where we order the data in descending order. To create a descending index
on a column, all we need to do is add the DESC keyword after the column name in the CREATE INDEX statement.
Consider the following example:
create table
students (student_id number, name varchar2(20), marks number);
create index
marks_idx on students(marks desc);
insert into
students select rownum, dbms_random.string('a', 20), round(dbms_random.value * 1000)
from dual connect by level < 100000;
commit;
Since Oracle creates a function-based index, it adds a virtual (hidden)
column to the table:
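Assuming the example table above, the automatically added hidden virtual column might be inspected with a query such as:

```sql
-- The SYS_NC...$ column backing the descending index shows up here
SELECT column_name, virtual_column, hidden_column, data_default
  FROM user_tab_cols
 WHERE table_name = 'STUDENTS';
```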
Remember that for descending indexes, the actual function used is not
displayed in the DATA_DEFAULT column, but it can be seen in the predicate
section of the explain plan. Consider the following query and the
predicate section of its plan:
select * from
students where marks = 9;
Oracle internally applies the SYS_OP_DESCEND function to the column value and stores the result in
the index.
Technically, the following two queries are equivalent:
select * from
students where marks = 9;
select * from
students where sys_op_descend(marks) = sys_op_descend(9);
Now that we know that the column values are stored in the descending
order, how about ROWIDs? ROWIDs are
stored in ascending order as usual.
Creating a descending index on a single column may not be beneficial,
since Oracle can perform an INDEX RANGE SCAN DESCENDING when we use an ORDER BY clause with the DESC keyword, eliminating the
sort operation. For normal multi-column indexes, things work a little differently:
since the data is sorted in ascending order across all columns in the index leaf
blocks, sorting is eliminated when the ORDER BY clause
contains the leading indexed column and/or trailing columns in index order.
E.g.: consider the following index:
create index
mark1_idx on randomload(mark1, mark2);
Sorting will not happen for the following clauses:
order by mark1;
order by mark1 desc;
order by mark1, mark2;
order by mark2; (if we use an = predicate on the mark1 column)
For the rest, sorting of the results happens, and this can be clearly seen
in the explain plan as SORT ORDER BY.
Oracle can utilize either an INDEX SKIP SCAN
DESCENDING or an INDEX RANGE SCAN DESCENDING
to avoid a sort operation when the ORDER BY clause includes the leading column. But
when we use an ORDER BY clause on the trailing columns with the DESC keyword (without an =
predicate on the leading column), sorting occurs. Descending indexes help in
avoiding the sort operation after fetching the results, but since
they are implemented as function-based indexes, they cannot support skip scans.
Another interesting point to remember here is that sys_op_descend(null) returns 00; in other words, NULLs are
stored in a descending index.
Limitations of a descending index:
-
Skip scans are not supported.
Partitioning Basics:
Partitioning allows us to logically divide a large table, index, or IOT
into smaller pieces called partitions. Each partition has its own name and can
optionally have its own storage characteristics. Partitions can be
managed either collectively or individually. A row in a partitioned table is assigned to
exactly one partition via the partition key, which can consist
of one or more columns; Oracle automatically directs insert, update, and
delete operations to the appropriate partition using the partitioning key. There
are several benefits of partitioning, such as:
- Partition pruning.
- The ability to drop old partitions instead of deleting
data (as in a traditional table).
- The ability to move old data to slower disks.
- Reducing index skew, etc.
There are downsides to partitioning as well, and these depend on how well
the schema and/or application is designed. Although table/index partitions are
completely transparent to the application, even with correct predicates in the
WHERE clause Oracle has to perform checks to ensure that the DML or SELECT
statement is directed to the right partition; this involves cost and
increases the elapsed time of a query. The delay also tends to increase with
the number of partitions. Consider the following table and insert statement:
create table
students(student_id number, dept_id number, name varchar2(30), sub_id number,
mark1 number, mark2 number, mark3 number, mark4 number, mark5 number, mark6
number, mark7 number, mark8 number) partition by range (dept_id) interval(1)
(partition p1 values less than (1));
create index
did_sid_idx on students (dept_id,
student_id) local;
insert into
students (student_id, dept_id, name, sub_id, mark1, mark2, mark3, mark4, mark5,
mark6, mark7, mark8) values (?,?,?,?,?,?,?,?,?,?,?,?);
For each insert, Oracle must compare partition boundaries to determine
the partition the row maps to, and only then insert the row. Similarly,
for operations such as DELETE and UPDATE, even when the predicates
contain the partitioning key, as in the following, additional cost is involved in
finding the correct partition:
delete from
students where dept_id = ? and student_id=?
Update students
set mark7 = ? where student_id = ? and dept_id = ?
Even when the workload is highly concurrent and you are pushing the
limits of I/O in terms of IOPS and log file sync, you may not notice this effect
for row-by-row inserts, updates, and deletes (the additional cost is on
the order of < 0.02-0.03 ms). Things become more apparent when you use
batch inserts or bulk loads, as they slow down considerably because Oracle
must check each and every row. The following chart gives an overview of the
time it takes to insert millions of records into a partitioned and a
non-partitioned table.
The test environment is as follows:
1. Database is cached in memory and in noarchivelog mode.
2. Segments were pre-allocated with extents.
3. Redo logs were more than 6GB in size (to reduce the frequency of incremental checkpoints).
4. Tables had no indexes.
Partitioned Indexes:
Partitioned indexes are of two types:
- Local
  o Local Prefixed Indexes
  o Local Non-Prefixed Indexes
- Global
Local Index:
In a local index, all the index partitions are equipartitioned with the
underlying table. Oracle partitions the index on the same columns as the
underlying table, creates the same number of partitions and/or sub-partitions
and gives them the same partition bounds as corresponding partitions of the
underlying table. Oracle also maintains the index partitioning automatically
when the partitions in the underlying table are added, dropped, merged, or
split or when hash partitions or sub-partitions are added or coalesced. This
ensures that the index remains equipartitioned with the table. Remember that a
local index can be created UNIQUE only when the partitioning columns form a subset
of the index columns. This restriction guarantees that rows with identical keys
always map to the same partition, where uniqueness violations can be detected.
Consider the following table:
create table
students(student_id number, dept_id number, name varchar2(20)) partition by
range (dept_id) interval (1) (partition p1 values less than (1));
By default, when we create an index on a partitioned table without the
LOCAL keyword, Oracle creates a regular non-partitioned (global) index.
create unique
index students_pk on students(student_id);
alter table
students add constraint students_pk primary key (student_id);
To create a unique local index, we must include the partitioning key
column dept_id in the index along with the student_id column.
create unique
index students_pk on students(student_id, dept_id ) local;
alter table
students add constraint students_pk primary key (student_id, dept_id) using
index local;
Because Oracle treats each index partition as a separate segment, enforcing
uniqueness on student_id alone would require checking that a given value does not
exist in any other index partition; scanning every index partition for each
insert causes additional reads, which would significantly reduce insert performance.
Even if the student_id column is populated by a database or application
sequence that guarantees a unique value each time,
Oracle cannot rely on this from the partition's perspective, since
database sequences can CYCLE and Oracle knows nothing about
application sequences. If you really want to guarantee that student_id is unique, you
have no choice but to create a regular non-partitioned (global) index and rely on
features such as asynchronous global index maintenance for partition
maintenance operations; in some cases where you cannot modify the SQL
statements, partitioning may not be the right choice at all.
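A sketch of the non-partitioned alternative (the index name students_sid_uk is hypothetical), along with a partition maintenance operation that keeps the global index usable; from 12c onwards, asynchronous global index maintenance defers the actual index cleanup:

```sql
-- Global (non-partitioned) unique index enforces uniqueness across all partitions
CREATE UNIQUE INDEX students_sid_uk ON students (student_id);

-- Partition maintenance that keeps the global index usable
ALTER TABLE students DROP PARTITION p1 UPDATE GLOBAL INDEXES;
```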
Let’s start with the very basics: partition pruning.
Partition pruning allows the optimizer to analyze the FROM and WHERE clauses
of a SQL statement to eliminate unneeded partitions when building the
partition access list. Depending on the actual SQL statement, Oracle
may use static or dynamic pruning. Static pruning occurs at compile time, when
the information about the partitions accessed is available beforehand. Dynamic
pruning occurs at run time, meaning that the exact partitions to be accessed by
a statement are not known beforehand (this happens mostly with joins and bind
variables). Partition pruning depends on the partitioning key being specified
in the predicates; without the partitioning key column, partition pruning will
not happen. Consider the following table and index:
create table
students(student_id number, dept_id number, name varchar2(20)) partition by
range (dept_id) interval (1) (partition p1 values less than (1));
alter table
students add constraint students_pk primary key (student_id, dept_id) using index local;
student_id is populated by a database sequence with monotonically increasing
values and is guaranteed to be unique across all the partitions.
For a query such as the following:
select * from
students where student_id = :1 and
dept_id=:2;
With the partition key specified in the predicate, Oracle can quickly
determine which index partition this value maps to and redirect the reads to
that particular partition and eventually to the corresponding table partition
using the ROWID:
The execution statistics are as follows:
Without the partition key column specified in the predicate, Oracle has
no option but to scan every index partition (segment), as student_id = 1 can be in any
partition, so Oracle ends up traversing each index partition to find the
corresponding rows. The number of reads (index leaf blocks) is directly
proportional to the number of partitions: (blevel + 1) * number
of partitions. For example, with a blevel of 2 and 100 partitions, that is
roughly 300 block gets just to probe the index partitions.
select * from students
where student_id = :1
The explain plan is as follows:
Oracle performs an index range scan rather than an index unique
scan because both student_id and dept_id make up the primary
key and dept_id is not specified in the predicate. Execution statistics
with 100 underlying table partitions:
Execution statistics with 10 underlying partitions:
Even though student_id is unique and is guaranteed to be present in only one index partition, the SQL:
select * from students where student_id = 1
results in traversing the B-Tree structure of each index partition, from root to leaf, even though the data is stored in the first index partition.
The following is the snippet of the 10046 level 12 trace indicating the reads:
To understand why Oracle requires traversing down the B-Tree structure of each index partition, we need to understand how the entries are stored and organized in the root branch block, branch blocks, and leaf blocks.
The contents of the root branch block and branch blocks are basically the same: entries in a root branch block or branch block point to the DBA of either leaf blocks or next-level branch blocks. The contents of the branch blocks depend on various factors such as
- Whether the index is unique.
- NDV of the indexed column.
- NDV of the leading column of a multi-column index (as we have seen with multi-column indexes).
Unique Index:
row#0[8047] dba: 29435353=0x1c125d9 // DBA of either branch or leaf
col 0; len 4; (4): c3 24 0c 37
row#1[8038] dba: 29436088=0x1c128b8 // DBA of either branch or leaf
col 0; len 4; (4): c3 47 07 4c
Non-unique index: high NDV
row#0[8046] dba: 29434551=0x1c122b7 // DBA of either branch or leaf
col 0; len 4; (4): c3 1f 1d 07 //Starting value
col 1; TERM
row#1[8036] dba: 29434552=0x1c122b8 // DBA of either branch or leaf
col 0; len 4; (4): c3 1f 21 38
col 1; TERM
Non-unique index: low NDV
row#0[8044] dba: 29428848=0x1c10c70 // DBA of either branch or leaf
col 0; len 3; (3): c2 0d 09 //Starting value
col 1; len 3; (3): 01 c0 1b //portion of ROWID
row#1[8032] dba: 29429412=0x1c10ea4
col 0; len 3; (3): c2 19 0a //Starting value
col 1; len 3; (3): 01 c0 5c
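The col 0 byte strings in these dumps are Oracle's internal NUMBER encoding. For positive integers, the first byte is 0xC1 plus the base-100 exponent, and each following byte is a base-100 digit stored as digit + 1. A minimal decoder sketch (negative numbers and decimals use extra rules not handled here):

```python
def decode_oracle_number(raw):
    """Decode the positive-integer NUMBER byte strings seen in block dumps.

    raw[0]  : exponent byte, 0xC1 + base-100 exponent of the first digit
    raw[1:] : base-100 mantissa digits, each stored as digit + 1
    """
    exponent = raw[0] - 0xC1
    value = 0
    for i, b in enumerate(raw[1:]):
        value += (b - 1) * 100 ** (exponent - i)
    return value
```

For example, c1 02 decodes to 1 (which is why student_id = 1 appears as "c1 02" later in this chapter), and c2 03 30 decodes to 2*100 + 47 = 247.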
There is a reason for highlighting the "col 1; TERM" entries; we will see why shortly. We will start with the minimum column value in the index, which is guaranteed to be present in the left-most index leaf block. The following depicts the actual procedure involved, starting from the root block. As things get more complicated later, let's start with the basic flow:
Oracle interprets the entries in the Branch blocks (root or branch) as
follows:
No condition … just leaf block pointer …
>= 1000 … leaf block pointer … with optional TERM … 1000 is the starting value in the leaf block
>= 2000 … leaf block pointer … with optional TERM
When we issue a query such as the following:
select * from students where student_id = 1; (student_id is the primary key, populated with a monotonically increasing value starting at 1 and incremented by 1)
The corresponding 10046 level 12 trace is as follows:
The relevant section of the tree dump is as follows:
The root block is always the first block following the segment header and L3 BMB (as stated earlier); you can confirm this using the following query:
Oracle starts by reading the root block and you can see dump of the root
block as follows:
The value of interest, student_id = 1, in byte representation is c1 02. Oracle starts by reading the entries in the root block beginning at row#0. The value there is c3 24 0c 37, which is higher than the value we are looking for. Now that Oracle is sure this branch doesn't contain the value, it reads the branch block 0x1c0cc72 pointed to by kdxbrlmc, which may contain the value we are looking for.
Dump of index branch block is as follows:
Upon reaching row#0 it determines that the corresponding leaf block doesn't contain the value we are looking for, and reads the leaf block 0x1c0025c pointed to by kdxbrlmc, which may contain the value.
Dump of the index leaf block is as follows:
It starts reading the entries in the leaf block, and since the value we are looking for, "c1 02", is present, it gets the corresponding ROWID, reads the table block, and finally returns the results to the user. Since the column value (in byte representation) forms the starting value of the entries in the block pointed to by the DBA, Oracle has to read the next entry (ROW#) to determine whether the value of interest falls in the range, and only then read the pointed-to block and proceed further. Due to this, even if the maximum column value is 999999, when we search for 99999999, which is not indexed, Oracle ends up reading the right-most blocks of the B-Tree structure (the last branch and leaf block of the treedump).
There is another aspect we should be aware of: an optimization that Oracle employs in the branch blocks, not only to save space but also to reduce the number of branch blocks required. Oracle can use either a ROWID portion or TERM in the branch blocks depending on various factors such as the following:
Use of TERM:
- When the indexed column value and the corresponding ROWIDs are fully contained in the leaf block, with the exception of the last entry when that value becomes the starting value of the entries in the next leaf block.
- To indicate the starting leaf block when a column value spans multiple leaf blocks.
- Not used in unique single-column indexes.
Use of ROWID:
- When the entries of a particular value in the leaf block span multiple leaf or branch blocks.
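The decision above can be modelled in a couple of lines. This is only a sketch of the rule as described, not Oracle's implementation: a branch entry carries the leaf's starting value plus either TERM or a ROWID portion, and the ROWID portion is only needed when the starting value also appears as the last value of the previous leaf (i.e. the value spans blocks):

```python
def branch_entry(leaf_start_value, prev_leaf_last_value, rowid_portion):
    """Model of the branch-block space optimization.

    If the leaf's starting value continues from the previous leaf, part of
    the ROWID is needed to disambiguate the range; otherwise the trailing
    key column can be truncated to TERM.
    """
    if leaf_start_value == prev_leaf_last_value:
        return (leaf_start_value, rowid_portion)  # value spans leaf blocks
    return (leaf_start_value, 'TERM')             # value fully contained
```

This mirrors the branch dump examined later in this section, where a leaf whose starting value is unique to it gets TERM, while a leaf whose starting value spills over from its neighbour gets a ROWID portion.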
Let’s look at a very basic scenario:
create table students (student_id number);
insert into students select rownum from dual connect by level < 1000;
insert into students select rownum from dual connect by level < 1000;
create index students_pk on students(student_id);
commit;
Since we load the same data twice, we are sure that any value of student_id will return exactly two rows. The treedump of the index (students_pk) is as follows:
----- begin tree dump
branch: 0x1c00163 29360483 (0: nrow: 5, level: 1)
leaf: 0x1c00164 29360484 (-1: row:492.492 avs:818) // avs → free space in the leaf block
leaf: 0x1c00165 29360485 (0: row:478.478 avs:830) // due to the default PCTFREE 10
leaf: 0x1c00166 29360486 (1: row:479.479 avs:817)
leaf: 0x1c00167 29360487 (2: row:478.478 avs:830)
leaf: 0x1c00168 29360488 (3: row:71.71 avs:6931)
----- end tree dump
We can obtain a treedump of an index using the following statement:
alter session set events 'immediate trace name treedump level &[OBJECT_ID]';
Remember that X$BH.OBJ always points to the DATA_OBJECT_ID.
Oracle uses the same format/structure for all the branch blocks, and the entry kdxbrlmc always points to either a leaf or a next-level branch block depending on its position in the index structure. The treedump displays the index leaf blocks starting from -1 because Oracle uses the pointer kdxbrlmc to point to the first leaf block; the numbering of the remaining leaf blocks is based on the ROW# in the branch block. Now let's look at the branch block dump:
Branch block dump
=================
header address 139695531102284=0x7f0d6682104c
kdxcolev 1
KDXCOLEV Flags = - - -
kdxcolok 0
kdxcoopc 0x80: opcode=0: iot flags=--- is converted=Y
kdxconco 2 //number of columns in the leaf block
kdxcosdc 0
kdxconro 4 //number of rows/entries in the branch block starting from ROW#0
kdxcofbo 36=0x24
kdxcofeo 8012=0x1f4c
kdxcoavs 7976 //available free space
kdxbrlmc 29360484=0x1c00164 //points to the leaf block
kdxbrsno 0
kdxbrbksz 8056
kdxbr2urrc 3
row#0[8047] dba: 29360485=0x1c00165
col 0; len 3; (3): c2 03 30
col 1; TERM
row#1[8038] dba: 29360486=0x1c00166
col 0; len 3; (3): c2 05 57
col 1; TERM
row#2[8025] dba: 29360487=0x1c00167
col 0; len 3; (3): c2 08 1a
col 1; len 4; (4): 01 c0 01 5f
row#3[8012] dba: 29360488=0x1c00168
col 0; len 3; (3): c2 0a 41
col 1; len 4; (4): 01 c0 01 5f
----- end of branch block dump -----
Notice the use of the keyword TERM for col 1 in the first two entries of the branch block. As all the entries in the leaf block (pointed to by the branch block dba) are fully contained in it and don't span multiple blocks, Oracle uses TERM instead of a ROWID or ROWID portion. Use of TERM in the branch block for a particular leaf block doesn't necessarily mean that the entries in that leaf block don't span multiple blocks; in certain cases the last entry of that leaf block becomes the starting value of the next leaf block under the same branch block. This can happen due to various reasons:
- Immediately following an index build/rebuild.
Let's start with the first leaf block 0x1c00164; the tail section of the index leaf block dump is as follows:
row#490[1851] flag: -------, lock: 0, len=13
col 0; len 3; (3): c2 03 2f //Column value
col 1; len 6; (6): 01 c0 01 5e 00 f5
row#491[1838] flag: -------, lock: 0, len=13
col 0; len 3; (3): c2 03 2f //Column value
col 1; len 6; (6): 01 c0 01 5f 02 48
----- end of leaf block Logical dump -----
----- end of leaf block dump -----
End dump data blocks tsn: 4 file#: 7 minblk 356 maxblk 356
Here Oracle is sure that the rows in the leaf block are fully contained and don't span multiple leaf blocks, so it uses the keyword TERM in the branch block. Things get interesting from here, starting with the second and third entries in the branch block:
row#1[8038] dba: 29360486=0x1c00166
col 0; len 3; (3): c2 05 57
col 1; TERM //Use of TERM
row#2[8025] dba: 29360487=0x1c00167
col 0; len 3; (3): c2 08 1a
col 1; len 4; (4): 01 c0 01 5f //Use of portion of ROWID
Snippet of index leaf block 0x1c00166 dump is as follows:
kdxlenxt 29360487=0x1c00167
kdxleprv 29360485=0x1c00165
kdxledsz 0
kdxlebksz 8032
row#0[8019] flag: -------, lock: 0, len=13
col 0; len 3; (3): c2 05 57
col 1; len 6; (6): 01 c0 01 5b 00 a4
row#1[8006] flag: -------, lock: 0, len=13
col 0; len 3; (3): c2 05 57
col 1; len 6; (6): 01 c0 01 5e 01 e5
…
row#477[1824] flag: -------, lock: 0, len=13
col 0; len 3; (3): c2 08 19
col 1; len 6; (6): 01 c0 01 5f 00 3f
row#478[1811] flag: -------, lock: 0, len=13
col 0; len 3; (3): c2 08 1a
col 1; len 6; (6): 01 c0 01 5b 01 93
----- end of leaf block Logical dump -----
The last entry of leaf block 0x1c00166, the column value c2 08 1a, becomes the starting value for leaf block 0x1c00167. Since the value c2 08 1a spans multiple leaf blocks, Oracle uses a ROWID portion in the index branch block. This basic example demonstrates that even when an indexed column value results in only 2 rows, Oracle may have to read 2 index leaf blocks to get the corresponding ROWIDs. Similarly, following an index rebuild a value can span multiple branch blocks, but this will not cause any additional branch block reads, as Oracle traverses the index leaf block chain once the initial block is found.
Due to this optimization, a 50-50 block split will not always result in 50% of the entries being moved to the new leaf block. The following scenario illustrates this:
drop table students;
create table students (roll number);
create index idx on students(roll);
insert into students select 1 from dual connect by level < 292;
insert into students select 2 from dual connect by level < 281;
commit;
insert into students values (1);
commit;

drop table students;
create table students (roll number);
create index idx on students(roll);
insert into students select 1 from dual connect by level < 291;
insert into students select 2 from dual connect by level < 282;
commit;
insert into students values (1);
commit;
The algorithm is basically as follows (simplified, for understanding certain aspects):
Read the root/branch block and read the entries starting from ROW#0.
Step A: If the entry value is higher than the search value:
    Action 1: Read the branch/leaf block pointed to by the previous ROW#, or by kdxbrlmc if ROW#=0.
Step B: If the entry value is lower, read the next entry until a higher value is found.
    If higher, perform Action 1.
    If there is no subsequent entry (it is the last entry), read the block pointed to by the current entry.
Step C: If the entry value is equal, check whether a ROWID portion or TERM is included:
    If a ROWID portion is in the branch entry, check the previous entry for TERM.
        If TERM is present, read the previous block and proceed along the chain until a higher value is found.
        If no TERM, read the current block and continue along the chain until a higher value is found.
    If TERM, read the leaf block.
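The core of the descent (Steps A and B, ignoring the TERM/ROWID refinement of Step C) can be sketched as a simple loop over the branch entries; this is a model of the described behaviour, not Oracle's code. The separator values 247, 486, 725 and 964 are the decimal decodings of the starting values in the branch dump shown earlier (c2 03 30, c2 05 57, c2 08 1a, c2 0a 41):

```python
def descend(kdxbrlmc, entries, key):
    """Pick the child block DBA for `key` in a branch block.

    entries   : list of (starting_value, child_dba) pairs, as in the dump
    kdxbrlmc  : left-most child, covering keys below entries[0]'s start
    """
    child = kdxbrlmc
    for starting_value, dba in entries:
        if key < starting_value:
            break            # previous child covers this key (Action 1)
        child = dba          # keep scanning: key may be further right
    return child
```

Note that a key beyond the indexed maximum (e.g. 99999999) never breaks out of the loop and lands in the right-most child, which is exactly why such a search reads the last branch and leaf block of the treedump.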
Before even considering partitioning as an option, evaluate whether the SQL statements depend mostly on primary or unique key lookups, and whether joins (the joining column is usually under a primary key or foreign key constraint) can be made aware of the partitioning by including the partitioning key in the predicates. Most ORM solutions such as Hibernate are not partition aware, and development/application teams should consider this carefully before implementing or accepting partitioning as a solution. If the workload depends mostly on primary key lookups and the application cannot be made partition aware, then consider using non-partitioned indexes for the primary/foreign keys. Oracle does support asynchronous global index maintenance.
The same concepts apply to delete and update statements as well if the predicate doesn't contain the partitioning key column.
Local Prefixed Indexes:
Local indexes are of two types:
- Prefixed indexes
- Non-prefixed indexes
Local prefixed indexes are indexes that have the partitioning key as the leading column in the index definition. Consider the following table:
create table students(student_id number, name varchar2(30), dept_id number, sub_id number, mark1 number, mark2 number, mark3 number, mark4 number, mark5 number, mark6 number, mark7 number, mark8 number, mark9 number) partition by range(dept_id) interval(1) (partition p1 values less than (1));
create index dept_sub_marks_idx on students(dept_id, sub_id, mark9) local; //prefixed index
create index sub_dept_mark_idx on students(sub_id, dept_id, mark9) local; //non-prefixed index
If the SQL predicate contains the partitioning key, then regardless of whether the local index is prefixed or non-prefixed, partition pruning occurs, and the performance of a prefixed local index will be almost identical to that of a non-prefixed local index.
Let's consider the following query (the table students is loaded with a uniform distribution of data across a total of 100 partitions, dept_id 1-100):
select * from students where dept_id = ? and sub_id = ? and mark9 = ?
Remember that when all the predicate columns are specified in the query and there are multiple indexes on the table with these columns (in different order), the optimizer estimates the cost of index access to be the same for both indexes, and ends up choosing the index with lower DISTINCT_KEYS, lower LEAF_BLOCKS, or the alphabetically smaller (ASCII) name.
The execution plans using both the indexes are as follows:
Non-Prefixed Local Index
Prefixed Local Index
The execution statistics for a random set of bind values using both the
indexes is as follows:
The case is basically the same even when the query results in traversing the B-Tree structures of multiple partitions, as with the following queries:
select * from students where dept_id in (?,?) and sub_id = ? and mark9 = ?
select * from students where dept_id in (?,?,?) and sub_id = ? and mark9 = ?
IOT:
Consider the following table and its corresponding index:
create table shares(share_id number, share_name varchar2(20), conditions varchar2(200));
create index shares_idx on shares (share_id);
And when a session issues a query such as the following:
select share_id, share_name from shares where share_id = 19;
Oracle starts by working down the index shares_idx from the root branch block to the leaf block containing the key that matches the range. Assuming a blevel of 2 (three levels), this requires 3 block reads:
- Root branch block
- Branch block
- Leaf block
Once Oracle obtains the ROWIDs from the index block, it reads the corresponding table block. This requires an additional table block read to get the extra relevant details. What if we could avoid the table lookup altogether, even for queries such as the following?
select * from shares where share_id = 19;
To accomplish this, we would end up creating an index including all the columns of the table, which is basically redundant considering the same data would be stored in two segments: the table segment and the index segment. Enter the Index Organized Table, or IOT, to the rescue.
Index Organized Tables were introduced in Oracle 8i. An IOT stores data ordered by the primary key column values, as if the entire table were created inside an index. A normal index stores only the indexed columns, but an IOT can store either extra columns or all columns of the table in the index. As all or most of the table columns are stored in the index, the rows of the table do not have a physical ROWID, though we can still select the ROWID pseudo column from an IOT (it is a computed value based on a guess). These ROWIDs are logical ROWIDs (we will look into this soon when we discuss secondary indexes). An IOT is appropriate when the queries contain predicates that involve the primary key columns (be it an index range scan, index skip scan, or index unique scan). For example, if the primary key is based on columns (a, b, c), an IOT is appropriate for the following types of predicates:
where a=?
where a=? and b=?
where a=? and c=?
where a=? and b=? and c=?
where b=? //skip scan when the NDV of column a is very low
where b=? and c=? //same as above
(several combinations of range-based predicates also work)
In the case of an IOT, even though the optimizer considers skip scans when the NDV of the leading column is low, this should be evaluated carefully. As additional columns are included as part of the B-Tree structure, the benefits of having an IOT will quickly disappear with a steady increase in that NDV, in which case a full index scan, or falling back to a secondary index, will make things even worse.
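Why skip-scan cost grows with the leading column's NDV can be shown with a very rough logical-read model; this is an illustrative approximation only (it counts root-to-leaf descents and ignores leaf-chain reads):

```python
def probe_cost(blevel, ndv_leading, skip_scan):
    """Rough model of index logical reads.

    A range/unique scan performs one root-to-leaf descent of
    blevel + 1 blocks; a skip scan effectively repeats the descent
    once per distinct value of the skipped leading column.
    """
    descents = ndv_leading if skip_scan else 1
    return descents * (blevel + 1)
```

With blevel 2, a skip scan over a leading column with 200 distinct values costs roughly 600 block visits versus 3 for a scan that supplies the leading column, which is why the benefit evaporates as the NDV climbs.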
An IOT's DDL is similar to that of a normal heap table, but requires additional keywords such as
- ORGANIZATION INDEX (mandatory)
- PCTTHRESHOLD (optional)
- INCLUDING (optional)
- OVERFLOW (optional)
Let’s look at the DDL:
create table students (
student_id number,
name varchar2(20),
contact_details varchar2(200),
primary key (student_id))
organization index
pctthreshold 50
including student_id
overflow;
The ORGANIZATION INDEX clause specifies that the table we are creating is an Index Organized Table. INCLUDING names the column up to which the non-primary-key columns defined in the table are stored in the B-Tree index structure. PCTTHRESHOLD defines the percentage of the block that is reserved for an IOT row; if the row size exceeds this threshold, the non-key columns of the row are stored in the overflow segment and a pointer is stored to locate the tail piece. Valid values for PCTTHRESHOLD are 1-50; if we don't specify a value while creating an IOT, Oracle defaults it to 50. OVERFLOW TABLESPACE specifies the tablespace that stores the overflow segment (which stores the columns of the table that were not included via the INCLUDING clause, along with any columns that didn't fit in the index leaf block due to the size restriction).
Oracle imposes a restriction on the maximum length of an index key, and this depends on db_block_size or the tablespace block size. For an 8k block size, the maximum allowed index key length is 6398 bytes. For an IOT things work a little differently: since the rows of an IOT are stored based on primary key values and PCTTHRESHOLD is capped at 50, the maximum allowed index key length is only 3218 bytes for an 8KB data block size. This restriction has the following implications:
- Oracle doesn't allow us to create an IOT when the total length of the primary key columns is > ~3218.
- When the total length of the primary key columns plus the columns included in the IOT index structure crosses this ~3218 threshold, Oracle stores the additional columns (included via the INCLUDING clause) in the overflow segment.
This restriction should give us a clear indication of whether it is even worth considering a bigger block size, or an IOT for that matter, considering several factors such as
- Index segment bloat due to 50-50 block splits (when the leading column of the primary key index is non-unique), or a combination of 90-10 and 50-50 splits when multiple sessions insert data (when the leading column is unique and monotonically increasing).
- The number of rows retrieved per SQL execution (when the leading column is not unique).
- The range of predicates used.
Consider the following table:
create table students(
student_id number,
name varchar2(20),
phone number,
address varchar2(20),
primary key (student_id))
organization index
pctthreshold 50
including phone
overflow tablespace users;
In this case the columns student_id, name and phone will be stored in the index leaf block. There is yet another interesting behavior to notice here; consider the following DDL:
create table students(
student_id number,
name varchar2(20),
phone number,
address varchar2(20),
primary key (student_id, phone))
organization index
pctthreshold 50
including phone
overflow tablespace users;
Here, even though we have specified the column "phone" in the INCLUDING clause, since it is part of the primary key, Oracle does not store the column "name" in the index structure. To include the name column in the B-Tree structure of the IOT, specify name in the INCLUDING clause. The INCLUDING clause is used to specify which columns that do not make up the primary key are to be stored in the B-Tree structure. Consider another statement:
create table students(
student_id number,
name varchar2(20),
phone number,
address varchar2(20),
primary key (student_id))
organization index
pctthreshold 50
including phone
overflow tablespace users;
In this case columns name and phone will be stored as part of the index structure.
Now let’s look at how the snippet of index leaf block trace for the
below table:
create table
temp(roll number, name varchar2(40), mark1 number, mark2 number, mark3 number,
mark4 number, mark5 number, mark6 number, primary key (roll, name))
organization index including mark3 overflow;
Here we have specified that the columns mark1, mark2 and mark3 be
included as part of the index structure.
row#115[981] flag: K------, lock: 0, len=65
col 0; len 3; (3): c2 0d 62 //Columns that make up the primary key
col 1; len 40; (40): 63 76 74 78 66 69 73 6b 6e 72 64 79 64 71 65 6f 72 64 73 63 6f 63 71 70 62 71 66 72 6f 61 6b 72 65 6a 63 6e 6f 6c 6c 68
tl: 18 fb: --H-F--- lb: 0x0 cc: 3
nrid: 0x01c223e4.72 //pointer to the overflow segment data block that contains the rest of the columns (chained row)
col 0: [ 2] c1 04 //mark1, mark2, mark3 columns included using the INCLUDING clause
col 1: [ 2] c1 06
col 2: [ 2] c1 09
row#116[1046] flag: K------, lock: 0, len=65
When queries such as the following refer only to the columns stored in the IOT segment and the predicates include the primary key columns, data retrieval is very fast, as all the required data is already present in the index leaf blocks:
select mark1, mark2 from temp where roll = ? and name = ?
select mark1, mark2 from temp where roll = ?
select mark1, mark2, mark3 from temp where name = ? //if the NDV of roll is very low, an index skip scan is possible; otherwise the optimizer may prefer an index full scan or index fast full scan. Remember that skip scans require more blocks to be read.
But when queries refer to columns that do not make up the primary key and are not included in the B-Tree structure, additional IO is required to read the corresponding data block from the overflow segment using the forwarding addresses stored in the index leaf blocks:
select mark4 from temp where roll = ?
select mark4, mark5, mark6 from temp where roll = ?
select * from temp where roll = ?
If queries frequently refer to the additional columns (part of the overflow segment) in the SELECT list or predicates of a SQL statement, the benefits of having an IOT subside gradually with the number of executions. In such cases, consider adding those columns to the index structure; if that is not possible (PCTTHRESHOLD may be reached, or creating a bigger block size tablespace doesn't make sense), consider converting the IOT into a normal heap table.
The statistic table fetch continued row is incremented for every row/column that is read from the overflow segment, which makes sense as the rows are chained to the overflow segment, as indicated in the IOT leaf block trace file. Due to the nature of an IOT and the way the data is stored in the overflow segment, chained rows are expected. To identify the number of rows or entries stored in the overflow segment, query DBA_TABLES.CHAIN_CNT or DBA_TAB_STATISTICS.CHAIN_CNT, but remember that CHAIN_CNT is not populated unless we run the analyze table … compute statistics command.
Even though we include certain columns as part of the B-Tree structure, these columns cannot be used for tree traversal. In other words, specifying these additional columns in the predicate without the primary key columns results in an index fast full scan. Consider the following table:
create table temp (order_id number, product_id number, customer_id number, details varchar2(40), status varchar2(30), primary key (order_id)) organization index including status overflow;
In the above table we have included all the columns as part of the index, with ORDER_ID being the primary key. Only queries that have ORDER_ID in the predicate result in index unique/range scans; specifying any other column will result in an index fast full scan, as shown below:
Depending on whether we choose to include all or only certain columns of the table in the B-Tree structure, Oracle creates up to two segments in the tablespace when we create an IOT:
- An index segment, to hold all the primary key columns and any additional columns specified with the INCLUDING clause in the table DDL. We can store all the columns of the table in the index structure if necessary, provided the maximum row size does not cross the PCTTHRESHOLD.
- An overflow segment, to hold the remaining columns (apart from the primary key columns and those covered by the INCLUDING clause).
Oracle internally uses the naming scheme SYS_IOT_TOP_<OBJECT_ID> / SYS_IOT_OVER_<OBJECT_ID> for the corresponding index and overflow segments. Consider the following CREATE TABLE statement (deferred_segment_creation = false):
create table RANDOMLOAD(roll number, name varchar2(40), mark1 number, mark2 number, mark3 number, mark4 number, mark5 number, mark6 number, primary key (roll, name)) organization index including mark3 overflow;
Notice that the IOT RANDOMLOAD doesn't have an associated DATA_OBJECT_ID; this is because the data is actually stored in the index and its associated overflow segment. Since the contents, or portions of the rows, are chained from the index segment to the overflow segment, even when a query reads the overflow segment or references only the columns present in the overflow segment, we will never see SYS_IOT_OVER_74542 or RANDOMLOAD in any of the explain plans. The following is an example explain plan:
SQL> select * from randomload;
Even though Oracle generates the names for both the index and overflow segments, and these are visible in views such as DBA_INDEXES/DBA_TABLES, we cannot refer to them directly in SQL statements (different from virtual columns); Oracle simply doesn't allow us to do that. It's not surprising that statements such as the following fail:
select count(*) from SYS_IOT_OVER_75889;
alter index SYS_IOT_TOP_75889 rebuild online;
alter index SYS_IOT_TOP_75889 move;
with errors such as
ORA-28650: Primary index on an IOT cannot be rebuilt
ORA-02243: invalid ALTER INDEX or ALTER MATERIALIZED VIEW option
ORA-25191: cannot reference overflow table of an index-organized table
IOTs serve a specific purpose: to retrieve data faster when the queries mostly refer to primary key columns in their predicates. Since the primary data segment of an IOT is the index segment, IOTs suffer the same fate as B-Tree indexes when it comes to index leaf block splits. When the leading column of an IOT is populated with a monotonically increasing value, always prefer a database sequence, and avoid batch inserts (the JDBC PreparedStatement.addBatch()/executeBatch() API) as this leads to more bloat. If the additional columns included in the B-Tree structure using the INCLUDING clause are updated regularly, this can result in 50-50 leaf block splits, adding more bloat to the IOT. The size of an IOT will be higher than that of a regular table. Due to this, DBAs should constantly monitor the space utilization of an IOT, either using the segment advisor or by running analyze index … validate structure (INDEX_STATS.PCT_USED); remember that analyze index … validate structure locks the table, preventing the application from performing DML.
As the data is stored based on primary key values, the column order of the primary key is very critical when it comes to exploiting features such as key compression to reduce the size of the IOT, thereby reducing the number of lookups required to read the corresponding index blocks. Until now we have only seen cases where the leading column of the index or IOT (primary key constraint) is unique. Let's look at a different case. Consider the following table:
create table test (
b1 number, //b1: 200 NDV
b2 number, //b2: almost unique
b3 number,
b4 varchar2(20),
primary key (b1, b2))
organization index compress 1 including b4 overflow;
In this case we can exploit the index key compression to reduce the
overall size of the index structure. For the predicates such as
b1 = ? //significant reduction in the number of buffer gets compared to a normal heap table
b1 = ? and b2 = ? //results in a single row since b1 and b2 make up the primary key
b2 = ? //skip scans possible, but usually a higher number of buffer gets
Oracle allows us to create bitmap and other secondary indexes on an IOT. As the rows in an IOT physically move from one block to another due to 50-50 index leaf block splits, the row movement feature must always be turned on; in fact, due to this behavior, disabling row movement for an IOT is not supported. As rows in an IOT lack a physical ROWID (the data is stored in an index), Oracle uses the concept of logical ROWIDs with guesses to support secondary indexes and locate the corresponding rows.
Secondary Indexes:
The DDL to create secondary indexes on an IOT is the same as for normal tables; the differences lie within the leaf blocks, in how the data is stored and organized. Consider the following table:
create table students(student_id number, department_id number, name varchar2(20), year number, primary key (student_id, department_id)) organization index including year overflow;
and the secondary index DDL as follows:
create index students_dept_idx on students(department_id);
In the above example, we have included all the columns of the table as part of the IOT structure. In order to uniquely identify a row in the IOT (due to the absence of a physical ROWID), a secondary index stores all the columns that make up the primary key of the IOT, plus a guess regarding the last known location of the row. A snippet of the students_dept_idx leaf block dump reveals the details:
row#94[6228] flag: K------, lock: 2, len=19
col 0; len 3; (3): c2 05 0c //department_id column
col 1; len 4; (4): c3 5f 05 0c //primary key column student_id
tl: 8 fb: --H-FL-- lb: 0x0 cc: 1
col 0: [ 4] 01 c0 2c 4b //last known physical location of the row
row#95[6209] flag: K------, lock: 2, len=19
Consider the following query:
select * from
students where department_id = 123;
The server process starts by traversing the students_dept_idx index to find the entries that match the predicate. Although a physical ROWID is not present, the last known location (the guess) lets the session read the corresponding IOT leaf blocks directly, without traversing the B-Tree structure to locate the rows. Only when leaf block splits occur at a later point in time, moving table rows to different locations, does Oracle end up traversing the index structure to find the corresponding row based on the primary key values obtained from the secondary index.
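The guess-then-fallback lookup described above can be sketched in a few lines. This is a toy model in Python, not Oracle internals; the block layout and helper names are invented for illustration:

```python
# Toy model of an IOT secondary-index lookup. Each secondary-index entry
# stores the IOT primary key plus a "guess": the data block address (DBA)
# where the row was last seen. If the guessed block still holds the key we
# read the row directly; otherwise we fall back to a primary-key traversal
# of the IOT B-tree.

def lookup_via_secondary(pk, guess_dba, leaf_blocks, traverse_by_pk):
    leaf = leaf_blocks.get(guess_dba)
    if leaf is not None and pk in leaf:
        return leaf[pk], "guess"          # guess valid: one direct block read
    # guess stale (row moved after a leaf split): walk the IOT by primary key
    return traverse_by_pk(pk), "traverse"
```

In this picture, PCT_DIRECT_ACCESS is simply the fraction of lookups that manage to take the "guess" path.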
Over time, due to IOT leaf block splits (50-50), entries in the index leaf blocks move to different leaf blocks; the physical ROWID guesses in the secondary indexes then become inaccurate and require additional IO to find the corresponding blocks, which increases the elapsed time of a query (though not significantly). DBA_INDEXES.PCT_DIRECT_ACCESS (1-100) indicates the percentage of valid guesses; it is recommended to refresh the guesses once PCT_DIRECT_ACCESS falls below 85% (this is not a hard threshold and depends on how the data is organized). Use the command alter index … update block references to update all the stale guess data block addresses stored in the secondary index.
Since the primary key columns are included in the secondary index, queries that reference only those columns will not require a lookup into the IOT; only references to the other columns stored in the IOT do. The following explain plans reveal the details.
select student_id from students where department_id = 123;
select name from students where department_id = 123;
How does a full table scan perform?
Even though an IOT is treated as a table, the data is actually stored in an index segment, with pointers to an optional overflow segment that is stored as a table. Despite the name Index Organized Table, an IOT should not be thought of as a table; think of it as an index with optional pointers to the overflow segment (chained rows). There is no such thing as a full table scan; a full scan is instead performed using an INDEX FAST FULL SCAN. Consider the following table:
create table RANDOMLOAD (roll number, name varchar2(40), mark1 number, mark2 number, mark3 number, mark4 number, mark5 number, mark6 number, primary key (roll, name)) organization index including mark3 overflow;
If we issue the following query:
select avg(mark1) from randomload;
Depending on the size of the table and various other factors, the server process will perform an index fast full scan using either db file scattered read or direct path read; the main point is that most of the IO is multiblock. It is when we reference columns stored in the overflow segment that things start to slow down: the row pieces stored in the overflow segment are chained to the IOT's primary B-Tree structure, so the full scan behavior changes completely in terms of IO. Consider the following query:
select avg(mark6) from randomload;
The above query results in a full scan. Oracle issues multiblock IO for the index segment, and because the rows are chained, it issues single block IO (db file sequential read; the blocks are not cached in the buffer cache) to read the contents of the overflow segment. If the table is very large, this increases the IOPS on the server considerably.
Rebuilding an IOT:
Over time, an IOT can bloat, and since we cannot reference the IOT's underlying index by name in SQL statements, the only way to rebuild the index structure of an IOT is the statement alter table … move online; to move the overflow segment, use alter table … move overflow.
The following are various restrictions of an IOT:
- All key columns must fit within the specified threshold.
- If the maximum size of the row exceeds 50% of the index block size and no overflow segment is specified, the CREATE TABLE statement fails.
- An IOT cannot have virtual columns, so function-based, reverse key, and descending indexes are not supported.
- The maximum number of columns that can make up the primary key is 32.
- The maximum number of columns in the index portion of the row is 255, including both key and non-key columns. If more than 255 columns are required, we must use an overflow segment (similar to the chained rows restriction).
- The maximum number of columns is 1000.
Index Access Paths:
B-Tree Access Paths:
Index Unique Scans:
The database performs a unique scan when a predicate references all columns of a unique key or primary key using an equality operator, such as where student_id = 123. Remember that the unique or primary key constraint must be enforced/policed by a unique index for the index unique scan to be used; if the constraint is enforced/policed by a non-unique index, Oracle performs an index range scan instead of an index unique scan (even though the lookups may not change, the CBC latches required, the redo/undo generation in case of a rollback, etc. increase).
Consider the following table:
create table students (student_id number primary key, name varchar2(30));
For queries such as the following, Oracle performs an index unique scan:
select * from students where student_id = ?
select * from students where student_id in (?,?,?)
select * from students where student_id = ? or student_id = ?
The explain plan will be as follows:
-------------------------------------------------------------------------------------------
| Id  | Operation                   | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |             |     1 |    35 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| STUDENTS    |     1 |    35 |     2   (0)| 00:00:01 |
|*  2 |   INDEX UNIQUE SCAN         | STUDENTS_PK |     1 |       |     1   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------
There is a reason for specifying the SQL statements with predicates containing IN and OR clauses: Oracle considers both statements equivalent. The scan searches the index in order for the specified key. An index unique scan stops processing as soon as it finds the first record, because no second record is possible. The database obtains the ROWID from the index entry and then retrieves the row specified by that ROWID. An index unique scan with an equality predicate uses the optimization technique named consistent gets examination (fastpath), which requires only one CBC latch to read a buffer. For the predicates student_id in (?,?) and student_id = ? or student_id = ?, however, this optimization does not apply in the same way and works for the index leaf blocks alone. Remember that Oracle continuously refines these features, so this behavior can change.
The execution flow for a unique scan is as follows:
- Read the index root block.
- Read the branch block.
- Read the leaf block and obtain the corresponding ROWID.
- Based on the ROWID, read the table block.
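The steps above can be sketched as a toy in-memory B-tree lookup. This is a Python illustration with invented block structures (real index blocks obviously look nothing like Python dicts); the point it demonstrates is that one block is visited per level, strictly in order:

```python
import bisect

# Toy index unique scan: root -> branch -> leaf -> table block, in order.
# Each level must be read before the next target block is known, which is
# why these reads cannot be issued asynchronously in parallel.

def index_unique_scan(root, branches, leaves, table, key):
    reads = ["root"]
    sep, children = root                     # separator keys -> child blocks
    branch_id = children[bisect.bisect_right(sep, key)]
    reads.append(branch_id)
    sep, children = branches[branch_id]
    leaf_id = children[bisect.bisect_right(sep, key)]
    reads.append(leaf_id)
    rowid = leaves[leaf_id].get(key)         # stop at first match: key is unique
    if rowid is None:
        return None, reads                   # no table block visit on a miss
    reads.append(rowid[0])                   # rowid = (table block, slot)
    return table[rowid], reads
```

A successful lookup touches exactly four "blocks" (root, branch, leaf, table); a miss stops after the leaf.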
Storage/OS Level Perspective:
Unless the SQL statement includes an IN clause, an index unique scan always results in a single row. Since a predicate value yields only a single row, the server process makes use of the pread64 OS call to read a block from disk into memory. Don't be fooled into thinking that Oracle uses only pread64 calls: after a database restart, another mechanism can kick in which makes the session use readv system calls.
One might wonder why Oracle cannot utilize asynchronous IO while performing an index unique scan. Once the optimizer comes up with a plan that accesses the table through the unique index using INDEX UNIQUE SCAN, the server process executing the SQL statement performs the following:
- The server process reads the index root block (the block immediately following the header block – L3 BMB). Until it reads the root block, it cannot determine which branch block to read next, so there is no scope for asynchronous IO here; also note that this block can be either a root block or a leaf block (there are no branch blocks until the first leaf block allocated to the index splits).
- Reads the branch block to determine the correct leaf block; until it reads the branch block, it cannot determine the correct leaf block.
- Reads the leaf block to obtain the corresponding ROWID, or to determine whether the row even exists. Here checks are performed to see if there is an open ITL slot, whether the row we are looking for is locked by a particular transaction, and CR fabrication is done if required.
- If a ROWID is obtained, reads the corresponding table block. Various possibilities may arise here:
o Read the corresponding table block, performing a delayed commit cleanout if necessary.
o If there is an open ITL slot and the transaction is still in progress, read the undo transaction table and the corresponding undo block containing the changes, and fabricate a CR block.
o If the table block contains a forwarding address due to row migration, read the table block containing the corresponding row.
o If the SQL statement references columns stored in a different row piece and the row is chained across multiple table blocks, read those table blocks as well.
During the entire process described above, Oracle has no way of knowing which block to read in advance; it must follow the path step by step, so no asynchronous IO is used. Since an index unique scan results in only one row, the client's fetch size (default 15) is usually large enough not to cause additional latch activity while reading the block (this may not hold when the statement contains an IN clause with a list larger than the fetch size). Remember that since it's not possible to accurately determine the next leaf block beforehand, and the branch blocks contain only the boundaries, db file parallel read on index leaf blocks is nearly impossible.
Index Range Scan:
An index range scan occurs when the SQL statement contains predicates such as =, <, <=, >, >=, between, in, or like. Remember that a range scan occurs even for unique or primary key indexes in the following situations:
- The predicate refers to only some columns of a multi-column unique or primary key. For example, if the primary key is (student_id, admission_id), then referring to student_id alone in the query (where student_id = ?) results in a range scan.
- The SQL statement contains range-based predicates such as >, >=, <=, <, between, and like.
For an index range scan, multiple values are possible for the index key (if the index is not unique, Oracle uses an index range scan even when the predicate results in only one row). By now we know that a (non-unique) index entry consists of a column value and its corresponding ROWIDs. By default, when an indexed column value maps to multiple rows, the index stores the column value along with the corresponding ROWIDs in ascending order. Let's start with the basics.
Single Column Indexes:
Equality Predicates:
Consider the following table and query:
create table students (student_id number, department_id number, student_name varchar2(20));
create index students_dept_idx on students (department_id);
Since students_dept_idx is a non-unique index, an index key value can map to multiple rows. We may not realize it yet, but storing the ROWIDs in sorted order offers several benefits. When executing queries with equality predicates, because the ROWIDs are stored in sorted order, the server process reads/accesses each table block only once (we will subsequently find out why this is not always the case). For example, consider the following query:
select * from students where department_id = 1679;
This is clearly evident from the following:
select roll, rowid, DBMS_ROWID.ROWID_TO_ABSOLUTE_FNO(rowid, 'VISHNU', 'STUDENTS') "FILE_ID", dbms_rowid.rowid_block_number(rowid) "BLOCK" from students where department_id = 1679;
The corresponding SQL trace (10046 level 12) is as follows (a very bad clustering factor): 3 index blocks (root block, branch block, leaf block) and 13 different table data blocks.
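The effect of the clustering factor on table-block visits can be shown with a toy count. In this Python sketch, ROWIDs are reduced to (block, slot) pairs for illustration:

```python
# Count how many times the server process has to move to a different table
# block while walking a list of ROWIDs in index order. With a good
# clustering factor, the sorted ROWIDs hit each block once; with a bad one,
# almost every row lands on a different block.

def table_block_visits(rowids):
    visits, last_block = 0, None
    for block, _slot in rowids:
        if block != last_block:
            visits, last_block = visits + 1, block
    return visits
```

For 13 rows spread across 13 blocks this counts 13 visits; for 13 rows packed into one block it counts just 1, which is exactly the difference the two traces show.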
Now consider the same scenario when all the matching table rows reside in the same block:
The corresponding SQL trace is as follows:
Before proceeding further, we need to understand a few very important concepts regarding the index range scan. When a query contains an equality predicate and the column value maps to multiple rows spanning multiple index leaf blocks, the server process always works from the left-most index leaf block containing the value to the right-most one in sequence, in order to retrieve all the corresponding rows from the table. The same holds true for range-based predicates (>, <, <=, >=, between); the case is different for in and or, as we will soon see. Here the left-most leaf block contains the minimum value and the right-most leaf block contains the highest value. Since the index leaf blocks are linked together in a doubly linked list, Oracle can also scan them in descending order, from the right-most index leaf block to the left-most; we will see this in the INDEX RANGE SCAN DESCENDING section.
Consider the following query:
select * from students where department_id = 1679;
The corresponding 10046 level 12 trace is as follows (don't be surprised by the number of index blocks read; the index was bloated on purpose to demonstrate this).
If we dump the leaf blocks we get the chain as follows:
kdxlenxt 29459780=0x1c18544 --> 99652
kdxlenxt 29459781=0x1c18545 --> 99653
kdxlenxt 29459782=0x1c18546 --> 99654
kdxlenxt 29459783=0x1c18547 --> 99655
kdxlenxt 29459784=0x1c18548 --> 99656
A treedump of the index reveals the details:
branch: 0x1c1854a 29459786 (12: nrow: 17, level: 1)
   leaf: 0x1c18537 29459767 (-1: row:13.13 avs:1158)
   leaf: 0x1c18539 29459769 (0: row:13.13 avs:1158)
   ....
   leaf: 0x1c18542 29459778 (9: row:13.13 avs:1158)
   leaf: 0x1c18543 29459779 (10: row:13.13 avs:1158)
   leaf: 0x1c18544 29459780 (11: row:13.13 avs:1158)
   leaf: 0x1c18545 29459781 (12: row:13.13 avs:1158)
   leaf: 0x1c18546 29459782 (13: row:13.13 avs:1158)
   leaf: 0x1c18547 29459783 (14: row:13.13 avs:1158)
   leaf: 0x1c18548 29459784 (15: row:13.13 avs:1158)
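The left-to-right walk over the kdxlenxt chain can be modeled as follows. This is a Python sketch; each leaf is simplified to a tuple of sorted entries plus a next-pointer:

```python
# Toy range scan over the leaf-block chain. Each leaf holds sorted
# (key, rowid) entries plus a kdxlenxt-style pointer to the next leaf.
# Starting from the left-most leaf that can contain the low bound, the
# scan follows the chain and stops as soon as a key exceeds the high
# bound -- the branch blocks are never revisited.

def range_scan(leaves, start_leaf, lo, hi):
    rowids, leaf = [], start_leaf
    while leaf is not None:
        entries, kdxlenxt = leaves[leaf]
        for key, rowid in entries:
            if key > hi:
                return rowids             # past the range: stop the scan
            if key >= lo:
                rowids.append(rowid)
        leaf = kdxlenxt                   # follow the chain to the next leaf
    return rowids
```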
Predicates containing IN and/or OR:
Oracle uses the In-List iterator whenever the predicate contains an IN clause with literal values, such as the following:
select * from students where student_id in (4,2,1,3);
If these values are instead fed by another SQL statement, the resultant plan becomes a join and the In-List iterator is not used:
select * from students where student_id in (select a.roll from temp a, temp2 b where a.roll = b.roll);
with tempa as (select a.roll from temp a, temp2 b where a.roll = b.roll) select * from students where student_id in (select roll from tempa);
Whenever the predicate contains an IN clause, Oracle internally expands the statement as follows and estimates the selectivity of each individual value specified in the IN list; this is evident in a 10053 trace as well.
Final query after transformations:******* UNPARSED QUERY IS *******
SELECT … FROM "VISHNU"."STUDENTS" "STUDENTS" WHERE "STUDENTS"."STUDENT_ID"=1 OR "STUDENTS"."STUDENT_ID"=2 OR "STUDENTS"."STUDENT_ID"=3 OR "STUDENTS"."STUDENT_ID"=4
The execution plan is as follows:
There are several steps involved before the final query is generated post transformations, and these are as follows:
- Eliminate duplicate values in the list using various optimizations.
- Sort the values in ascending order (regardless of whether the index is ascending or descending).
One might wonder why no HASH UNIQUE or SORT ORDER BY step is displayed in the explain plan; this is because this optimization is performed even before the final query (after transformations) is generated and the costs of index or table access are calculated. We will eventually see why this trivial (insignificant) detail is mentioned here. The same behavior applies irrespective of the underlying index. Sorting the values in the IN clause has several benefits: if the values are close enough, it can reduce the number of latches required, since otherwise the same branch/leaf blocks would be accessed multiple times over the course of the execution.
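The two normalization steps above, and the latch-saving effect of probing in ascending order, can be sketched in Python (the key-to-leaf mapping here is a made-up toy, not how Oracle assigns keys to blocks):

```python
# The in-list normalization: drop duplicates, sort ascending.
def normalize_in_list(values):
    return sorted(set(values))

# Why sorted probe order helps: adjacent values tend to fall in the same
# leaf block, so consecutive probes can reuse the block just visited
# instead of taking a fresh latch on it.
def leaf_visits(values, keys_per_leaf):
    visits, last = 0, None
    for v in normalize_in_list(values):
        leaf = v // keys_per_leaf         # toy key -> leaf-block mapping
        if leaf != last:
            visits, last = visits + 1, leaf
    return visits
```

With close values such as (1,2,3,4) all four probes land in one leaf; widely spread values cost one visit each.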
There is an important aspect that a DBA/developer should be aware of with regard to the IN or OR clause when working with very close or contiguous values.
Consider the following statement:
select * from students where student_id in (1,2,3,4,5,6,7,8);
The corresponding execution plan is as follows:
If we observe closely, Oracle didn't use batched table access, as it knows that each value in the IN clause results in only one row. The statistics in terms of average latch gets per execution are as follows:
The average number of CBC latch gets for each execution is 29. Now consider the same statement with the predicate modified as follows:
select * from students where student_id between 1 and 9;
The average number of CBC latch gets for each execution is 21. The difference in latch gets can grow considerably depending on the following factors:
- Lower clustering factor.
- Number of values specified in the IN clause.
- Average row length – short or long tables.
- Block size.
- Extent sizes.
Now that we have seen the behavior in terms of latch gets, let's look at a few aspects specific to the IN clause. An ORDER BY clause on an indexed column doesn't necessarily mean that Oracle avoids a sort; it can still use SORT ORDER BY to sort the resultant rows. This is especially true when it comes to joins – even if the ORDER BY clause refers to columns that belong to the outermost table in the join.
Consider the following tables:
create table students (student_id number, name varchar2(20), primary key (student_id));
create table marks (student_id number, subject_id number, mark1 number, primary key (student_id, subject_id));
Oracle uses additional steps such as SORT UNIQUE / HASH UNIQUE to eliminate the duplicate values and, if possible, the need to sort the final results from the outermost query (students in our case). Consider the following statement:
select * from students where student_id in (select student_id from marks where mark1 = 4) order by student_id;
In this case, since the primary key on the marks table is (student_id, subject_id), Oracle has no way to determine whether the rows returned by the subquery contain duplicate values, so it uses SORT UNIQUE to both sort and eliminate the duplicate values returned by the subquery. Since those values are already sorted, Oracle can skip sorting the results obtained from the students table again. Consider the following statement:
select * from students where student_id in (select a.student_id from marks a, departments b where a.student_id = b.student_id and a.mark1 > 90) order by student_id;
Even though students.student_id is indexed, Oracle performs a sort of the values at the end: because the HASH UNIQUE step used to eliminate the duplicates can return the values in any order, Oracle must effectively sort at the end to satisfy the ORDER BY clause. This case demonstrates that, in the presence of an ORDER BY clause, Oracle may still perform a sort at the end even if the column is indexed; we will see more about this in Joins, and how a non-unique column used as a join condition can result in a sort.
If queries depend on an IN clause, developers often make the mistake of creating separate PreparedStatements, generating one per number of entries in the IN clause, such as the following:
select * from students where student_id in (?);
select * from students where student_id in (?,?);
select * from students where student_id in (?,?,?);
select * from students where student_id in (?,?,?,?);
This often results in multiple SQL_IDs (more SQLs in the shared pool), which is not optimal. A simple fix is to use a subquery in the IN clause that makes use of either a REGEXP function or a PL/SQL block that produces multiple values from a single bind variable. Using dual is far better than using a custom PL/SQL block to parse the values. For example, the above statements can be converted as follows:
select * from students where student_id in (select regexp_substr(:bindval, '[^,]+', 1, level) from dual connect by regexp_substr(:bindval, '[^,]+', 1, level) is not null);
Similar approaches can be employed; the only downside is that Oracle tends to use HASH UNIQUE (which merely eliminates the duplicates) rather than SORT UNIQUE (which not only eliminates duplicate values but also sorts them). Unless the values are extremely close (numerically), the resulting increase in the number of latches required to complete the execution may not be significant.
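What the regexp_substr/CONNECT BY trick computes can be modeled client-side as a simple split. In this Python sketch, the empty-fragment filter mirrors the `is not null` condition in the CONNECT BY:

```python
# Split a single comma-separated bind value into the individual IN-list
# members, skipping empty fragments just as the "is not null" filter does.
def split_in_list(bindval):
    return [v for v in bindval.split(",") if v]
```

The application binds one string ("1,2,3") regardless of how many members the list has, so one SQL_ID serves every list length.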
Range Based Predicates (>, <, <=, >=, between):
An index range scan is an ordered scan of values. A predicate can be bounded on both sides or unbounded on one or both sides, and the values that fall in the range are usually continuous (numbers – the data type can be integer, float, etc.). We already know that unless the index is a descending index, Oracle stores column values and their corresponding ROWIDs in ascending order. How does the behavior differ from equality predicates? It is essentially the same as performing a range scan with an equality predicate on a non-unique index: Oracle starts by scanning the root block, reads the corresponding branch blocks that cover the range, and begins scanning from the left-most index leaf block holding the minimum value until it reaches the right-most index leaf block holding the maximum value; it does not go back to the branch blocks or access them again. There is a reason for reiterating this – refer to the equality predicates discussion.
For example, when we issue a query with the predicate student_id < 10, Oracle uses a range scan to return the rows sorted by the index keys 0, 1, 2, 3, etc. Unless ORDER BY <INDEXED_COLUMN> DESC is specified in the query, Oracle always starts scanning from the left-most index leaf block to the right-most index leaf block that satisfies the range specified in the predicates. Consider the following query:
select * from students where student_id < 10;
For demonstration purposes the index is bloated purposefully using the following DDL:
create unique index students_idx on students (student_id, '<random junk 2500 bytes in length>');
The corresponding trace is as follows:
From the trace we can notice that the scan started at index leaf block 476 (which held the minimum student_id value) and continued until block 479 (which held the maximum student_id value). There is a reason for highlighting the wait events that correspond to reading the table blocks (conventional, prefetching, batching); we will look into this shortly. We can see the following chain from the index leaf block dumps and the treedump.
kdxlenxt 29360605=0x1c001dd (kdxlenxt points to the next index leaf block in the chain)
kdxlenxt 29360606=0x1c001de
kdxlenxt 29360607=0x1c001df
kdxlenxt 29360608=0x1c001e0
Index treedump:
branch: 0x1c001db 29360603 (0: nrow: 100, level: 2)
   branch: 0x1c00bf4 29363188 (-1: nrow: 734, level: 1)
      leaf: 0x1c001dc 29360604 (-1: row:3.3 avs:1951)
      leaf: 0x1c001dd 29360605 (0: row:3.3 avs:1951)
      leaf: 0x1c001de 29360606 (1: row:3.3 avs:1951)
      leaf: 0x1c001df 29360607 (2: row:3.3 avs:1951)
      leaf: 0x1c001e0 29360608 (3: row:3.3 avs:1951)
      leaf: 0x1c001e1 29360609 (4: row:3.3 avs:1951)
Even though Oracle stores entries in the index leaf blocks in ascending order of the column value and retrieves the corresponding rows from the table accordingly, that doesn't mean the results of the SQL statement, or the rows the application fetches, arrive sorted by the predicate column. For example, the following statements have identical plans, but the order in which the application sees the table rows may not be the same for the first SQL, as it depends on various factors that we will discuss shortly.
select * from students where student_id < 10;
select * from students where student_id < 10 order by student_id;
Let’s look at the corresponding execution plans:
select student_id, name from students where student_id < 100
Plan hash value: 3459646009
select student_id, name from students where student_id < 100 order by student_id
Plan hash value: 1738957801
Do observe the table access methods used in the two cases – TABLE ACCESS BY INDEX ROWID BATCHED (which can be batched, prefetch, or generic) and TABLE ACCESS BY INDEX ROWID (which can be prefetch or generic). We will discuss the batched, prefetch, and generic access methods in the subsequent sections.
The situation is different when we select only the indexed columns and the execution plan results in either an INDEX RANGE SCAN or an INDEX FULL SCAN without fetching results from the table (no TABLE ACCESS BY INDEX ROWID): the server process always follows the exact, precise order when reading the index blocks, reading the index leaf blocks and the corresponding entries in order, one by one.
select student_id from students where student_id < 100;
Plan hash value: 3503402594
select student_id from students where student_id < 100 order by student_id;
Plan hash value: 3503402594
We can confirm the results using the following query:
with temp as (select /*+ MATERIALIZE */ rownum "ROLL", student_id from students where student_id < 1000), temp1 as (select /*+ MATERIALIZE */ rownum "ROLL", student_id from students where student_id < 1000 order by student_id) select count(*) from temp a, temp1 b where a.roll = b.roll and a.student_id = b.student_id;
This holds true only for single-table lookups, and only in specific cases for multi-table joins. When the SQL statement joins multiple tables, we must specify the ORDER BY clause to retrieve the results in ordered/sorted form. The order in which the rows are retrieved may also differ when the optimizer chooses an INDEX FAST FULL SCAN.
Another interesting point: if an index leaf block contains several column values and all of them satisfy the predicate (>, <, between, etc.), Oracle effectively batches the table block reads, moving on to the next index leaf block even before fetching the table blocks for the last few entries of the current leaf block.
Situations in which a SQL execution has higher buffer gets / consistent gets than physical reads:
How Application Result Set Fetch Size affects Buffer Gets / CBC latches:
When a SQL statement is executed, factors such as the JDBC fetchsize or SQL*Plus arraysize have a significant effect on the number of buffer gets (and consequently CBC latches) required for each execution, depending on whether the query uses TABLE ACCESS BY INDEX ROWID/BATCHED or just an INDEX RANGE SCAN (index-only scan). The extent of the impact depends on factors such as the following:
- Clustering factor of the index.
- Whether the PGA is used to store the results – during sorting, aggregate functions, etc.
- Index-only scans.
- The table and index structure, to which the following contribute:
o Table average row length and whether the table is short.
o Datatype of the indexed column and/or the free space status of the index leaf blocks.
- How the data is loaded into the table.
An INDEX UNIQUE SCAN, which usually results in a single row, is not affected by this; but if the predicate contains an IN clause with many values, we can expect elevated buffer gets and latch activity (the database still performs an INDEX UNIQUE SCAN for each value of the IN clause). This situation can occur for equality and range-based predicates on single-column or multi-column indexes whenever a set of predicates returns more rows than the fetch size and the server process doesn't use the PGA to store intermediate results.
Consider an ideal scenario in which the clustering factor of the index equals the number of blocks in the table and the table is short (there is a reason for specifying the clustering factor here, and we will soon see why). The following statement retrieves 999 rows from the table, starting from student_id = 1:
select * from students where student_id < 1000;
The 10046 level 12 trace is as follows. (During an index range scan, Oracle employs various optimizations to retrieve the data more efficiently, such as prefetching, batching, and buffer warmup, which can result in additional blocks being read into memory. To depict the exact number of blocks read, prefetching, batching, buffer warmup, etc. were disabled for the following trace. We will soon see why V$SQLAREA.BUFFER_GETS, or the consistent gets from autotrace, cannot be relied upon to accurately depict the number of data or index blocks read from disk.)
And the explain plan is as follows:
As we can see from the above, the SQL statement results in 14 blocks read from disk into memory. Executing the SQL statement from the application with various fetch sizes gives us the following statistics:
We can clearly infer that the number of buffer gets increases considerably as the application's resultset fetch size decreases. The execution statistics for SQL_ID 4cxpytntkdb9z are optimal because the application-level fetch size (the SQL*Plus arraysize) is 1000, and the buffer gets are precisely what we would expect, as the client fetches the entire result set at once. Ignore the CPUmsPerX and ElapmsPerX statistics for now; we will go through them in the network tuning section.
When a query results in more rows than the client's fetchsize (or SQL*Plus arraysize), the client receives the results in batches. For example, if the client's fetchsize is only 10 and the query results in 100 rows, the client fetches the results 10 times over the course of the SQL execution. As soon as the execution produces 10 rows, the server process halts – more precisely, pauses – the SQL execution, releases all the CBC latches, and switches context to sending the results to the client. Once the results are sent, it resumes from the point where it paused and reacquires the latches to access the data/index blocks or buffers again. Another way of looking at this: Oracle uses a dedicated server process for each client connection, and as /proc/<server_process_pid>/status shows, each process runs with only one thread; unless a process is multi-threaded or uses techniques such as asynchronous APIs, it cannot perform two activities at once.
Consider the following SQL trace:
As we can see from the sections highlighted in red, the server process reads the table blocks, sends the data to the client, and goes back to reading the data again from where it left off (here the fetchsize is small – the JDBC driver's default); now enter prefetch/batched table access, which reads multiple table blocks at once and subsequently sends the results to the client.
So far we have seen how the resultset fetchsize and/or SQL*Plus arraysize affects the buffer gets (and inherently the CBC latches) for queries executed using TABLE ACCESS BY INDEX ROWID/BATCHED. Index-only scans are also affected, as there is a high chance that the leaf blocks contain many entries. Consider the following SQL statement:
select student_id from students where dept_id = 123;
and the corresponding index as:
create index stud_dept_std_idx on students (dept_id, student_id);
The explain plan is as follows:
Snippet of 10046 level 12 trace of the above SQL is as follows:
From the application, executing the above SQL statement with various
fetch sizes will give us the following statistics:
As we can see from the above, buffer gets increase with the number of
fetches. Unless configured otherwise, the Oracle JDBC fetchsize defaults to 10. Increasing
the fetchsize to an arbitrarily high value has its own consequences: as more data is sent in
one fetch operation, sessions end up waiting on SQL*Net more data to
client, so make sure the SDU sizes etc. are configured properly. We will see more
on this when we deal with SQL*Net wait events.
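To make the arithmetic concrete, here is a minimal illustrative sketch (not Oracle code): the number of fetch round-trips, and hence the number of times the server process pauses and re-acquires CBC latches to resume the scan, is roughly the row count divided by the fetchsize, rounded up.

```java
// Illustrative only: fetch round-trips ~= ceil(totalRows / fetchSize).
public class FetchTrips {
    static long roundTrips(long totalRows, int fetchSize) {
        return (totalRows + fetchSize - 1) / fetchSize; // ceiling division
    }
    public static void main(String[] args) {
        // 1900 rows with the JDBC default fetchsize of 10 -> 190 round-trips;
        // raising the fetchsize to 256 cuts that to 8.
        System.out.println(roundTrips(1900, 10));
        System.out.println(roundTrips(1900, 256));
    }
}
```

Each round-trip is one pause/resume cycle on the server side, which is why a larger fetchsize reduces buffer gets for the same resultset.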
Oracle recommends setting the resultset fetchsize to 256, but this recommendation may not
suit every workload; we will soon see why. The following are the situations
in which the resultset fetchsize has minimal to no effect on the number of buffer gets of a SQL
statement.
-
Optimizer decides to use SORT ORDER BY to sort the results in which case the rows obtained
from the SQL execution are initially stored in the session’s PGA and then
sorted based on ORDER BY clause
-
Use of aggregate functions or clauses such as GROUP BY, AVG, SUM,
COUNT, DISTINCT, etc.
-
The effect of reacquiring the latches decreases as the
row size increases, but a low fetchsize eventually
results in more IOPS on the server.
-
Since index leaf blocks can accommodate many
entries, there is a high chance that a leaf block will be accessed
multiple times (provided an indexed column value results in multiple rows).
Considering 90-10 and 50-50 splits, rebuilding an index that had many 50-50
leaf block splits will elevate the number of fetches, as each leaf block
now accommodates more rows. In such cases it is better not to rebuild the index,
or increasing the PCTFREE of the index should help; but
remember that increasing PCTFREE reduces
the efficiency of the buffer cache, as the blocks contain more free space.
We have now seen how fetchsize or arraysize affects buffer gets and latches from the
database perspective. Let's look at things from the application perspective and
how the default value of 10, or a low fetchsize, helps reduce buffer gets and latches. Before
proceeding further, we need to understand the concept of partial query
execution.
Consider the following query
select * from products where name like 'SSD%';
snippet of 10046 level 12 trace for the above statement is as follows:
The execution statistics for the above statement are as follows:
Here the clustering factor of the index is very high, and the buffer gets
mentioned here are the actual number of buffers accessed by the SQL statement. We
can expect the statistics to be the same each time we execute the above SQL
statement, provided the fetches per execution is only 1, i.e. the client
fetches all the rows in one operation (one fetch of the resultset). But this is not
always the case. An SQL execution can be either
-
Full or
-
Partial
Full execution occurs when an SQL statement executes and all the
corresponding rows that result from that execution are sent to the client. Partial
execution occurs when the client closes the cursor after fetching only a certain
number of rows, because the requirement of the workflow is satisfied. For example,
if you shop for an SSD on a retail site such as eBay or Amazon, you
perform a search with the keyword SSD and the subsequent page
displays the first 10 or 40 results, depending on whether you are using a web
browser or a mobile app; once you scroll down or move to the next page, the
next set of results is displayed. The same happens when you search the tickets in
your team's queue in Remedy or any other ticketing tool. You will most likely
find the SSD you want in the first 1-3 pages, and it is highly
unlikely that you browse through all the pages. Even though the SQL statement
shown above returns over 1900 rows, the chances are slim that
you look past the first 200 results before making a purchase. Considering the
above, if you select a product from the first page itself, there
is no point in executing or fetching the rest of the results. In this case the
application closes the cursor and the SQL is said to be partially executed
(though successfully executed).
Suppose the fetchsize in this case is 40 and you select
a product from the first page of search results.
The execution statistics will be as follows:
If you observe closely, the metrics elapsed time per execution, CPU per
execution, and buffer gets per execution have reduced significantly, and the client
fetched only once: the first 40 rows from the SQL. Another important metric that
we should observe closely is V$SQLAREA.END_OF_FETCH_COUNT (or V$SQL.END_OF_FETCH_COUNT).
This value is incremented on each full execution and is not incremented for
partial executions. If END_OF_FETCH_COUNT/EXECUTIONS is less than 1, the SQL is being
executed partially. In such cases, applications benefit from reduced
response times and the database can service more clients.
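Partially executed statements can be spotted directly in the shared pool by comparing these two columns; a sketch (to be run against a live instance):

```sql
select sql_id, executions, end_of_fetch_count,
       round(end_of_fetch_count / nullif(executions, 0), 2) as full_exec_ratio
from   v$sql
where  end_of_fetch_count < executions;
```

A ratio well below 1 flags statements whose cursors are routinely closed before the last row is fetched.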
Now suppose you browse through the first 3 pages, which is unlikely in most
cases, resulting in 120 rows; the execution statistics will be as follows:
This is the typical pattern in many front-end and a few back-end
applications. Whenever you are troubleshooting a particular SQL statement,
always be on the lookout for the metric END_OF_FETCH_COUNT. The same query can become incredibly slow at
times even with the same bind variables. One might wonder why. The reason is
that, by the time the client fetches the next set of rows, blocks
that are already in the cache may have been modified, resulting in CR
block fabrication.
This may raise another thought: why can't we simply re-execute the query with features such as FETCH FIRST 5 ROWS ONLY, or
limit the result set using clauses such as ROWNUM <
40? This is because
-
We would have to execute the SQL statement each time, and
keeping track of the records already fetched requires additional work and is
highly inefficient.
-
Depending on operations such as GROUP BY, ORDER BY,
etc., each execution of the query may not return the same rows in the
same order.
Now that we know the basics of partial execution and how buffer gets are
affected, let's look at various aspects from a DBA perspective:
As soon as the session returns the first batch of results, the session status
becomes INACTIVE, and its state indicates the true nature: waiting on SQL*Net message
from client until the client or
application requests another batch of results.
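Such sessions are easy to spot; a sketch against V$SESSION:

```sql
-- Sessions that have returned a batch and are idle waiting for the client
-- show STATUS = 'INACTIVE' on the wait 'SQL*Net message from client'.
select sid, status, event, seconds_in_wait
from   v$session
where  event = 'SQL*Net message from client';
```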
This has some rather interesting implications; let's look at each. Due to
the nature of partial execution, it is recommended only on tables
that contain static data. If the rows in the table constantly change, then by the time
the client fetches another batch of results the blocks may
already have changed, and fresh CR copies of the blocks have to be constructed. This is
one of the cases in which a session ends up creating CR versions of the same
block multiple times before it finishes execution. One might wonder: if a
CR version of the block is already present in memory, why does the session
need to build another copy of the same block? It depends on various factors
such as the following:
-
An inadequately sized buffer cache.
-
Remember that a freshly constructed CR block has a
touch count of 1 and is highly likely to be flushed from the buffer cache;
flushing here refers to invalidating the buffer and reusing it to
store a new block coming in.
If you still need the benefits of partial execution even though the
table contents are modified heavily, you can recommend one of the following:
-
Use an ORDER BY clause to temporarily store the results,
in which case the session returns the results from the PGA instead of accessing
the blocks over and over again.
-
Use data structures such as ArrayLists at the
application level to hold the results, instead of fetching them over and over again.
In both cases the query results in a full execution, and you will not
reap the benefits of partial execution. Note that V$SQL.USERS_EXECUTING
(also in V$SQLAREA) does indicate the number of users
currently executing the SQL statement even though the status of the session is
INACTIVE.
Clustering Factor and other
factors that influence buffer gets/latches:
When dealing with range-based predicates, there is a chance that a
server process (user session) ends up reading/accessing the same table blocks multiple
times over the course of SQL execution. This occurs when one or a combination of
the following cases is true:
-
The way the data is loaded into the table and the number
of sessions inserting the data.
-
When the predicates contain bounded or unbounded ranges, or IN
clauses.
-
When extent sizes are small.
-
When the average row length of the table is small
and/or table is small.
-
NDV and distribution of the column plays a vital role.
-
The predicate refers to the leading column of a multi-column
index.
Consider the following query, where student_id is the
primary key (single-column index):
select * from students where student_id between 30 and 40;
10046 level 12 trace is as follows:
As you can see, all the rows that match the predicates are contained in
only two table blocks, and the corresponding index entries are stored in only one
index leaf block. The execution statistics for the above SQL are as follows:
We have 14 buffer gets even though the client fetches the entire resultset at once. Now let's
look at how the data is organized in the index leaf block.
select student_id,
       dbms_rowid.rowid_to_absolute_fno(rowid, 'VISHNU', 'STUDENTS') "FILE_NO",
       dbms_rowid.rowid_block_number(rowid) "BLOCK_NO",
       rowid
from   students
where  student_id between 30 and 40;
STUDENT_ID    FILE_NO   BLOCK_NO ROWID
---------- ---------- ---------- ------------------
        30          5       1038 AAASPNAAFAAAAQOAAM
        31          5        469 AAASPNAAFAAAAHVAAJ
        32          5       1038 AAASPNAAFAAAAQOAAN
        33          5        469 AAASPNAAFAAAAHVAAK
        34          5       1038 AAASPNAAFAAAAQOAAO
        35          5        469 AAASPNAAFAAAAHVAAL
        36          5       1038 AAASPNAAFAAAAQOAAP
Since adjacent entries in the index leaf block point to different table
blocks, Oracle accesses each block multiple times; remember that each time it
accesses a block in the buffer cache, it has to acquire the CBC latches.
When the predicates refer only to the leading column of a multi-column
index, things are a little different. Consider the following table and the
corresponding index:
create table orders (order_id number, store_id number, item_id number, volume number);
create index orders_store_items_idx on orders(store_id, item_id);
and the SQL Query
select sum(volume) from orders where store_id = 12312;
In this case it is certain that Oracle will not perform any repeated reads of
the index blocks, as it walks from the left-most relevant index leaf block
to the right-most; but it can read the same table data block over
and over again. For example, a seller on Amazon or eBay gives heavy discounts
on a few products to clear piled-up stock in the warehouse, and the stock is
limited. After noticing the reduced prices, customers start ordering. In this
case, at the database end, inserts will primarily contain the store_id of that particular
seller. Consider the following scenario:
Session A inserts (store_id, item_id) = (12312, 1651) into table block 416.
Session B inserts (store_id, item_id) = (12312, 1651) into table block 417.
Session A inserts (store_id, item_id) = (12312, 1652) into table block 416.
Session B inserts (store_id, item_id) = (12312, 1652) into table block 417.
From the Oracle perspective, since index entries are stored in ascending order
of the indexed column values and, within each distinct combination of values,
the ROWIDs, Oracle ends up reading or accessing the same data blocks
416 and 417 again as it processes the entries from the left-most index leaf
block to the right-most index leaf block. Let's see this in action:
The execution plan and the corresponding statistics are as follows:
Notice the mismatch between the number of physical reads and consistent reads;
this is because Oracle accesses the same table blocks over and over again.
The snippet of the SQL trace is as follows (entries are truncated as the
trace is long):
We can clearly observe this behavior from the above 10046 level 12
trace:
The session starts by reading the index root block and branch block, then
works through the index from the left-most leaf block to the right-most
leaf block, reading the corresponding rows from the table blocks. By the
tail section of the trace, only index blocks appear and no table blocks;
this is because all the table blocks were
already brought into memory by previous IO requests. Remember that
a 10046 trace contains entries each time a block is read
from disk, not from memory.
In such cases an index on the store_id column alone significantly reduces the number
of buffer gets. To demonstrate this, let's drop the old multi-column index and
create a single-column index with just store_id as follows:
create index store_id_idx on orders(store_id);
The execution plan for the statement is as follows:
The number of consistent gets reduced by a factor of 82X. Remember that this behavior is
common only in certain application types in which the data load is time
dependent, such as a shift in which only a particular department works, or
tickets raised to a particular department during outages.
CR block fabrication:
Whenever a user or application issues a DML statement against the database, Oracle
modifies the contents of the table block and the corresponding index blocks. This
modification is done even before the user/application commits the transaction.
(Oracle maintains the SCN at the block level, not at the row level.) Each data
block has a header portion that contains ITL (Interested Transaction List) slots
holding information about transactions: whether in progress,
committed, etc. The following is a snippet of a table block dump.
The LCK portion indicates the number of rows locked by a transaction and the
FLAG indicates the status of the transaction. From the above, we can see that the
transaction is still in progress (Flag empty). Unless we specified MAXTRANS
during the initial table creation, Oracle can add as many ITL entries in the
header as needed, provided there is enough space in the block and
multiple sessions concurrently update different rows in the same block, up to
a maximum of 255 (it is unlikely to have 255 transactions modifying the same
block).
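Block dumps such as the one shown above can be produced for any block; a sketch, with hypothetical file and block numbers (locate the block of interest first with dbms_rowid, as shown elsewhere in this chapter):

```sql
-- The dump is written to the session's trace file; its ITL section
-- shows the XID, UBA, and Flag for each slot.
alter system dump datafile 5 block 1038;
```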
The XID for a transaction (in the format stored in the block) can be
obtained as follows:
select to_char(xidusn, 'XXXXXX'), to_char(xidslot, 'XXX'), to_char(xidsqn, 'XXXXXXXX')
from   v$transaction
where  addr in (select taddr from v$session where username = 'VISHNU');
The rows (in the block) that are locked by the transaction will have the
flags set as follows:
lb → indicates the ITL slot.
The UBA, or Undo Block Address (in hex format), section contains the details
of the undo block that contains the undo entries (transaction table). Oracle
can sometimes write a dirty block with uncommitted changes to disk as a
result of one or a combination of the following situations:
-
manual checkpoint
-
memory pressure (buffer cache)
-
if the transaction is very large.
-
Incremental checkpoint or a checkpoint during a log
file switch.
Since the block is modified and no longer contains the original (committed)
data, and yet the transaction that modified the data is not yet
committed (the session may be stuck, or the transaction is very large), a session
that issues a SELECT statement will not be able to see the
most recent committed data (or the data as of the SCN or point in time
at which the SELECT statement started, in the case of a long-running query).
The session ends up cloning the block and rolling back the changes applied
to it, thereby creating a consistent read copy of the block. We have already
seen the path of an insert or update before.
Before proceeding further, you need to understand how Oracle marks a
block as dirty:
-
If the transaction is still in progress, Oracle
doesn't necessarily set the dirty flag even if the block/buffer is modified in
memory (there are various cases: insert, update, etc.).
-
Oracle always sets the dirty flag immediately following
a commit, a rollback, or a DML failure due to a constraint violation or some other
issue that results in an automatic rollback.
Let’s take a look at how Oracle generates CR blocks: consider the
following scenario:
create table students (roll number, name varchar2(300));
alter table students add constraint students_pk primary key (roll);
create index name_idx on students(name);
insert into students select rownum, dbms_random.string(0,300) from dual connect by level < 50000;
update students set name='SRAVYZZ' where roll = 10000;
commit;
Now that the data is populated, let's execute the following DML statement
in session A:
update students set name='VISHNU' where roll = 10000;
The states of the buffers are as follows:
Notice that only the index block is marked as dirty.
From a different session execute the following query:
select * from students where name='SRAVYZZ'; -- old value
The states of the buffers are as follows:
Since the buffers are already cached, the CR block fabrication is pretty
straightforward. But to look precisely at how this is done, perform
the following steps:
-
Flush the buffer cache.
-
Disable features such as (table prefetch, batch
access, index prefetch, buffer warmup).
-
Enable 10046 level 12 trace.
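The flush and trace steps above can be sketched as follows (the prefetch-related settings are deliberately omitted here, as they involve version-specific parameters):

```sql
alter system flush buffer_cache;
alter session set events '10046 trace name context forever, level 12';
-- ...run the SELECT statement...
alter session set events '10046 trace name context off';
```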
The corresponding execution plan is as follows:
The snippet of the trace is as follows:
Let’s take a look at the SQL statements we executed:
update students set name='VISHNU' where roll = 10000;
select * from students where name='SRAVYZZ';
The update statement updated not only the table but also the NAME_IDX index leaf block.
Remember that for an index UPDATE operation is typically a DELETE and INSERT.
Let’s look at the ITL portion of both the table and the index leaf
blocks:
Table Block:
Index block (where the entry is deleted)
Index block(where the new entry is inserted)
Notice that the XID, or transaction ID, is the same for all three of these
operations. A snippet of the undo block containing the details is as follows:
Now consider the case where session A, which initiated the update, is
stuck:
update students set name='VISHNU' where roll = 10000;
And other sessions/new sessions issue the following statement:
select * from students where name='VISHNU';
Technically the entry in the leaf block exists, but the data is not yet
committed, and prior to this transaction there is no matching row. In
this case, since the entry is exclusively locked, the server process ends up
fabricating a CR block (index). One might wonder why.
Consider a scenario in which an update to the partition key causes the
row to be moved to a different partition, and the underlying index is a global
index that does not contain the partitioning key. The indexed column
value never changes, but the ROWID does; since the entries in the leaf blocks are
sorted by column value and then ROWID, the position of this particular
entry can move up or down, or even to a different leaf block altogether. The
server process that initiated the query cannot be sure
whether the index entry existed before, or whether it moved into this block as part of
an ongoing transaction. To confirm that this value existed before, and also to
find the corresponding table block that previously held the value (assuming the
transaction is still in progress), Oracle generates a CR block for that index
block alone; once the CR block is read and it determines that the row never
existed, it returns immediately.
Remember that if there is an open ITL slot prior to executing the SQL statement,
the server process ends up fabricating a CR block (be it an index or a table
block) even though the LB flag for the row we are interested in is not set;
in other words, even though the row of interest is not locked by
another transaction.
So far we have seen how CR blocks are generated when the transaction
is already in progress; now let's look at a scenario in which the blocks are
modified after the SQL statement has started.
Considering the same scenario as before:
create table students (roll number, name varchar2(300));
alter table students add constraint students_pk primary key (roll);
create index name_idx on students(name);
insert into students select rownum, dbms_random.string(0,300) from dual connect by level < 50000;
commit;
Instead of writing a complex SQL statement that touches exactly the blocks
we need, we can simulate this using a simple Java program as follows:
void selectLoad() {
    ExecutorService pool = Executors.newFixedThreadPool(1);
    pool.submit(new Load());
}

// run() method of the Load class, which implements Runnable
public void run() {
    try {
        Connection oraCon = DBConnection.getOraConn();
        PreparedStatement pstmt = oraCon.prepareStatement(
                "select * from students where roll < 100");
        pstmt.setFetchSize(1); // force one row per fetch round-trip
        ResultSet rs = pstmt.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getInt(1));
            Thread.sleep(5000); // pause between fetches
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Since setFetchSize is set to 1, the server process must
access these data blocks again to get each next batch of results; in
our case, the client waits 5 seconds between fetches, allowing us ample time to modify the data
blocks and observe the behavior. X$BH.CR_SCN_BAS indicates the SCN of the CR
block required by the statement, or could also indicate the SCN at which the
statement started.
After starting the program (SCN 96332535), the states of the buffers are as follows:
In another session, try modifying the contents of the table as follows:
update students set name='SRAVYZZ' where roll = 20;
The states of the buffers will be as follows:
Oracle generates a CR block for the table only if the change results in a
modified row. Yet suppose we modify a row whose roll number is 101 (outside the
predicate): Oracle still generates a CR block. Does that mean Oracle always
generates a CR block? No; if the distance between the boundaries of the modified
row and the rows it is processing is ~20, it doesn't generate a CR block even though
the block's SCN is higher than the SELECT statement's SCN.
In order to comply with the Read Committed transaction isolation level (the
default), when a SQL statement is issued against the database Oracle must ensure
that the resultset contains only rows that were committed as of
the precise SCN at which the SQL statement was issued. This behavior is the same for both DML
and SELECT statements. Let's start with the basics:
Consider the following table:
create table test (roll number primary key);
insert into test select rownum from dual connect by level < 1000;
commit;
There is a reason for creating a table with a single column; we will see
it shortly.
Now consider the query:
select * from test where roll in (1, 2, 3);
Plan is as follows:
The states of the corresponding index blocks (buffers) is as follows:
Block 355 is the root branch block and 357 is the leaf block. These
buffers are clean (not dirty). The query used to get the above result
is as follows:
select a.object_name,
       b.dbarfil "FILE_ID",
       b.dbablk "BLOCK",
       decode(b.state, 0, 'free', 1, 'xcur', 2, 'scur', 3, 'cr', 4, 'read',
              5, 'mrec', 6, 'irec', 7, 'write', 8, 'pi', 9, 'memory',
              10, 'mwrite', 11, 'donated', 12, 'protected', 13, 'securefile',
              14, 'siop', 15, 'recckpt', 16, 'flashfree', 17, 'flashcur',
              18, 'flashna') "STATE",
       decode(bitand(b.flag, 1), 0, 'NO', 'YES') "DIRTY",
       count(*)
from   x$bh b, dba_objects a
where  b.obj = a.data_object_id
and    a.owner = 'VISHNU'
group by a.object_name, b.dbarfil, b.dbablk,
       decode(b.state, 0, 'free', 1, 'xcur', 2, 'scur', 3, 'cr', 4, 'read',
              5, 'mrec', 6, 'irec', 7, 'write', 8, 'pi', 9, 'memory',
              10, 'mwrite', 11, 'donated', 12, 'protected', 13, 'securefile',
              14, 'siop', 15, 'recckpt', 16, 'flashfree', 17, 'flashcur',
              18, 'flashna'),
       decode(bitand(b.flag, 1), 0, 'NO', 'YES');
Now let's update roll number 1 to 10001 and see the status of the
blocks:
update test set roll = 10001 where roll = 1;
commit;
The index leaf block and the table block that originally contained the
value 1 are updated, resulting in dirty blocks. Remember that for an index an
update is basically a delete and an insert: the update statement resulted in the index
entry holding the value 1 being marked as deleted, and then the value 10001
being inserted into index leaf block 358. The on-disk and in-memory versions of the
index leaf block will have the following entries:
(When you dump a block, Oracle first dumps the in-memory version
followed by the on-disk version.) These blocks will have the DIRTY flag set until they
are written to disk; once they are written, due to either
memory pressure or an incremental checkpoint, the DIRTY flag is cleared.
Now in another session, if we issue the following SQL statement without committing the transaction:
update test set roll = 1 where roll = 10001;
the entry in block 358 (10001) will be marked as deleted, and the
deleted flag on the entry in block 357, which was previously marked as deleted,
will be cleared. The following is the dump of index leaf block 357.
From another session issue the following statement:
select * from test where roll = 10001;
Since the transaction is still in progress and the block has an open ITL
slot with uncommitted changes, the session ends up building a CR image of the
block.
Remember that Oracle maintains SCNs at the block level: all the rows in a
table block, and all the entries in an index leaf block, are considered to have the same
ORA_ROWSCN. This additional CR block fabrication requires reading entries from
the transaction table in the undo tablespace, cloning the block, and rolling back the
changes. This process increases the number of consistent gets for a SQL statement.
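The block-level granularity of ORA_ROWSCN is easy to verify; a sketch (on a table created without ROWDEPENDENCIES, rows co-located in one block report the same value):

```sql
select roll, ora_rowscn,
       dbms_rowid.rowid_block_number(rowid) as block_no
from   test
where  roll in (2, 3);
```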
When it comes to DML statements, blocking sessions also result in CR block
generation. The following are scenarios in which Oracle ends up
creating multiple (index) CR blocks.
CR block generation:
Multiple sessions execute the same SQL (update or delete) and end up
blocking one another.
Unique index:
Plan involves a unique scan; update the single column of a single-column unique
index: CR block generation within the limit.
Plan involves a unique scan; update a different column (other than the unique
index column): CR block generation within the limit.
Plan involves a unique scan; update the trailing column of a multi-column unique
index: CR block generation within the limit.
Plan involves a unique scan; update both columns of a multi-column index:
CR block generation within the limit.
Plan involves a unique scan; update the leading column of a multi-column unique
index: CR block generation within the limit.
Plan involves a range scan on a multi-column unique index; update the trailing
column of the unique index: more than 6 CR blocks generated.
Plan involves a range scan on a multi-column unique index; update the leading
column of the unique index: more than 6 CR blocks generated.
Non-Unique index:
Plan includes a range scan; update the same column: more than 6 CR blocks
generated.
Plan includes a range scan; update a different column: no CR block
generation.
Plan includes a range scan; update a different column (a unique index column):
no CR block generation.
Delete on a non-unique index: more than 6 CR blocks generated.
Insert statement:
For inserts, CR blocks are typically generated for BMB (bitmap space
management) blocks, as the data in those blocks gets updated; this can also
happen during checkpoints.
Delete generates more than 6 CR blocks if a range scan over a unique index is used.
Delayed / Commit Cleanout:
(processing / undo lookup overhead)
The database writer process can write a modified (dirty) block that
contains uncommitted changes to disk for various reasons, such as:
-
Checkpoints. (manual or automatic)
-
Insufficient buffer cache or (buffer cache flush).
-
Aged out of the buffer cache.
-
Small redo log file sizes.
-
Small value for fast_start_mttr_target etc.
In these cases, the ITL slot in the header portion of the on-disk
version of the block indicates that the transaction is still in progress (the
flags are not cleared), and the session that initiated the transaction (or
modified the blocks) doesn't read the blocks into memory again just to update
the ITL flags to indicate that the transaction is committed. This is done
for efficiency. To demonstrate this, let's look at a very basic example:
create table students(roll number primary key, name varchar2(20));
insert into students select rownum, dbms_random.string(0,20) from dual connect by level < 100000;
commit;
Session A modifies a row as follows:
update students set name='VISHNU' where roll = 100;
The state of the buffers is as follows:
Now suppose the DBWn process writes this dirty block to disk (for any of the
reasons above) and the session later commits the transaction.
The ITL portion of the header for the table block 351 for the on-disk
version will be as follows:
Now suppose if another session selects the same row:
select * from students where roll = 100;
The server process reads the block from disk and performs a delayed
commit cleanout (marking the transaction as committed); the ITL portion after
the commit cleanout will be as follows:
It so happens that if multiple
transactions modify the block and the block is written to disk before the
sessions can commit, a later session ends up performing additional IO
against the undo tablespace to find the status of these transactions and perform
the commit cleanout.
A session may also fail to perform a commit cleanout when another
session has the buffer pinned; in this case the server process gives up on the
block, and the other session that is reading/modifying the block ends up
performing the commit cleanout. Remember that a commit cleanout
modifies the block (updating the ITL slots and the corresponding rows), thereby
making the block dirty again and causing the DBWn process to write it to
disk once more.
Impact Due to Row Chaining:
Every block in
the database has a header area that stores information such as segment type,
SCN, transaction information, etc. Oracle database uses this information to
manage the block itself. For an 8KB database block, the space required for this
overhead depends on various factors such as INITRANS, average row length,
the number of columns, the size of columns, etc. For a statement inserting
data into a table, the effective space left in the block can be calculated as
follows:
Space Left (inserts) ~= DB_BLOCK_SIZE - BLOCK_HEADER - (DB_BLOCK_SIZE * PCTFREE / 100)
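The formula above can be sketched as a small calculation; the header size used here (~150 bytes) is an assumption for illustration, since the real overhead varies with INITRANS and the other factors listed.

```java
// Sketch of the space-left formula; headerBytes is an assumed value,
// as the real block overhead varies with INITRANS etc.
public class BlockSpace {
    static int spaceLeftForInserts(int blockSize, int headerBytes, int pctFree) {
        // DB_BLOCK_SIZE - BLOCK_HEADER - (DB_BLOCK_SIZE * PCTFREE / 100)
        return blockSize - headerBytes - (blockSize * pctFree / 100);
    }
    public static void main(String[] args) {
        // 8KB block, ~150-byte header (assumed), default PCTFREE of 10
        System.out.println(spaceLeftForInserts(8192, 150, 10));
    }
}
```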
Rows are always
inserted from the tail end of the block, filling upward, as shown in the
following figure:
As you go
through the rest of this section, you will realize how this behaviour is
critical to performance, and what to consider while designing the layout
of tables and the column order.
Oracle will not
split a row into multiple row pieces if it can fit the row entirely in a newly
formatted block. Row chaining is the process of splitting a row into several
row pieces and storing those pieces either in the same block or in
different blocks, depending on the size of the row. Row chaining can occur
for the following reasons:
-
Row size greater than the allowable user data size in a block (after
accounting for BLOCK_HEADER size and PCTFREE).
-
Table
has more than 255 columns.
-
Tables containing LONG and LONG RAW columns (Oracle recommends converting
them to LOBs).
Let’s look at
each case:
Scenario 1: Row size greater than allowable user data
size:
In this case
Oracle automatically splits the row into several row pieces and stores each row
piece in a different data block. An interesting point to note is that Oracle
chains the row pieces starting from the tail end of the row. To illustrate
this behaviour, let's look at an example; the following table is created in a
tablespace with the standard block size (8KB).
create table students (student_id number, t1 varchar2(3000), t2 varchar2(3000), dept_id number, t3 varchar2(3000), t4 varchar2(3000), t5 varchar2(3000));
alter table students add constraint students_pk primary key (student_id);
create index dept_idx on students (dept_id);
insert into students select rownum, dbms_random.string(0,3000), dbms_random.string(0,3000), round(dbms_random.value()*100), dbms_random.string(0,3000), dbms_random.string(0,3000), dbms_random.string(0,3000) from dual connect by level < 10000;
commit;
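As a quick check (not shown in the original output), the chaining introduced by this insert can be confirmed from CHAIN_CNT after analyzing the table; the statements below are a sketch of that follow-up:

```sql
-- ANALYZE (not DBMS_STATS) refreshes CHAIN_CNT
analyze table students compute statistics;
select chain_cnt, num_rows from user_tables where table_name = 'STUDENTS';
```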
From the table definition it is evident that a block cannot accommodate an entire row, and Oracle requires at least 3 blocks to store a single row. The row piece containing the columns (t4, t5) is stored in one block, the columns (t2, dept_id, t3) in another, and finally the columns (student_id, t1) in one block. Graphically this is represented as follows:
Looking at the block dumps:
nrid points to the block
that contains the next row piece. The row piece containing the columns (student_id, t1) is the head
row piece and all the indexes on this table basically point to the head row
piece. We can confirm this using the following:
student_id
IDX:
Dept_idx:
Remember that the row is split into multiple row pieces and the trailing columns are inserted first.
How this scenario Impacts Performance:
Impact on index range scans:
Since the index entries always point to the head row piece, the impact on performance due to additional reads depends on the columns that are included in the SELECT part of the query. If we are selecting only columns that are present in the head row piece, then no additional IO is required. Due to this, it is recommended to have all the indexed columns and the most frequently accessed columns in the first row piece in order to avoid extra IO to read the subsequent row pieces.
Similarly, for a full table scan, if the select part includes only columns that are present in the head row piece, the server process doesn't need single block reads (db file sequential read) and instead uses multiblock reads (direct path read). But if the select part includes columns that are not in the head row piece, Oracle has to read the rest of the row pieces using single block IO. This presents a very interesting issue, as Oracle can end up reading the same block multiple times. The same holds true for index scans as well.
WAIT
#140051776711336: nam='db file sequential read' ela= 195 file#=7 block#=95478 blocks=1
obj#=74680 tim=5026470481
WAIT
#140051776711336: nam='db file sequential read' ela= 205 file#=7 block#=95479 blocks=1
obj#=74680 tim=5026470758
WAIT
#140051776711336: nam='db file sequential read' ela= 209 file#=7 block#=95480 blocks=1
obj#=74680 tim=5026471148
WAIT
#140051776711336: nam='db file sequential read' ela= 207 file#=7 block#=95484
blocks=1 obj#=74680 tim=5026471470
WAIT
#140051776711336: nam='db file sequential read' ela= 199 file#=7 block#=95406
blocks=1 obj#=74680 tim=5026472151
WAIT
#140051776711336: nam='db file sequential read' ela= 210 file#=7 block#=95390
blocks=1 obj#=74680 tim=5026472429
WAIT
#140051776711336: nam='direct path read' ela= 296 file number=7 first dba=95744 block cnt=128
obj#=74680 tim=5026473385
WAIT
#140051776711336: nam='db file sequential read' ela= 227 file#=7 block#=94378
blocks=1 obj#=74680 tim=5026474069
WAIT
#140051776711336: nam='db file sequential read' ela= 145 file#=7 block#=94362
blocks=1 obj#=74680 tim=5026474276
WAIT
#140051776711336: nam='db file sequential read' ela= 197 file#=7 block#=95730
blocks=1 obj#=74680 tim=5026475029
WAIT
#140051776711336: nam='db file sequential read' ela= 140 file#=7 block#=95714
blocks=1 obj#=74680 tim=5026475230
WAIT
#140051776711336: nam='db file sequential read' ela= 223 file#=7 block#=95733
blocks=1 obj#=74680 tim=5026475524
WAIT
#140051776711336: nam='db file sequential read' ela= 165 file#=7 block#=95669
blocks=1 obj#=74680 tim=5026475749
WAIT
#140051776711336: nam='db file sequential read' ela= 207 file#=7 block#=95737
blocks=1 obj#=74680 tim=5026476076
WAIT
#140051776711336: nam='db file sequential read' ela= 258 file#=7 block#=95721
blocks=1 obj#=74680 tim=5026476406
Try to place all the optional columns at the tail end of the table, so that if we are not storing any data (or storing nulls) in these columns there is a high chance that the rows will not be chained. Each time we hit a chained row, the statistic table fetch continued row is incremented.
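One way to watch this statistic for your own session is to join V$STATNAME and V$MYSTAT; a sketch:

```sql
-- Chained/migrated row fetches for the current session
select sn.name, ms.value
  from v$statname sn, v$mystat ms
 where sn.statistic# = ms.statistic#
   and sn.name = 'table fetch continued row';
```

Run it before and after the query of interest; a rising value confirms chained or migrated rows are being hit.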
Scenario 2: Table has more than 255 columns.
If the number of columns in a table is greater than 255, Oracle automatically splits the row into multiple row pieces and stores each row piece either in the same block or in different data blocks depending on the size of the row. An interesting point here is that the split happens from the tail end of the table. Suppose we have a 260-column table: columns 6-260 will be stored in a chained row piece (depending on the size of the columns; the above holds if all the columns are integers), so the first row piece stores columns 1-5 and the second row piece contains the rest of the columns. If Oracle fits all the row pieces of the row in a single block, we have something called intra-block row chaining.
How this scenario Impacts Performance:
While intra-block row chaining, where the row pieces are stored in the same block, may not result in additional IO, there is CPU overhead involved as we have to read both row pieces. If a row piece gets migrated to a different block as part of an update operation, this could result in additional IO not only during index range or unique scans but also during full table scans. Intra-block row chaining will not cause the table fetch continued row statistic to be incremented, and details such as CHAIN_CNT in DBA_TABLES/DBA_TAB_STATISTICS are not reflected. Oracle treats a chained row as such only if the row pieces span multiple blocks, and the statistic table fetch continued row is incremented only when reading a row results in reading two table blocks.
Row Migration:
Row migration is a form of row chaining. During an update, if it is found that the current block does not have enough free space to accommodate the new size of the row, the entire row is migrated to a new block and a forwarding address with the new ROWID is left behind in the original row piece. After the row piece is migrated to a new block, the corresponding index entries are not updated with the new location, because it can take a significant amount of time to update all the indexes and their corresponding index leaf blocks. With row migration, each time we access a particular row using an index lookup we incur an additional IO, as the index keys point to the original row piece, which contains only the forwarding address. A chained row piece can also be migrated to a different block.
How this scenario Impacts Performance:
Since the entire row is migrated from its current location to a new block and a forwarding address is left behind, an additional I/O is always incurred. A low or default value of PCTFREE on tables with heavy update activity makes them prone to more migrated rows due to space issues.
Monitoring the Row Chaining and Migration Impact:
Increased db file sequential reads during a SQL execution, especially single block reads appearing during index range or unique scans or interspersed in a full table scan, can be an indicator of row chaining or migration. Monitoring the statistic table fetch continued row for a session during the SQL execution is a good starting point to confirm the presence of chained or migrated rows. The CHAIN_CNT column in DBA_TABLES or DBA_TAB_STATISTICS can be used to find the number of chained rows, but the contents of these views may not be up to date; to get an accurate value for the number of chained rows use the ANALYZE TABLE ... COMPUTE STATISTICS command. The DBA_TAB_MODIFICATIONS view provides insight into the workload profile for a particular table; lots of updates along with chained rows in a table is a clear indication that the PCTFREE value might not be sufficient. The ANALYZE TABLE ... LIST CHAINED ROWS command gets the list of chained rows in a table. This command populates the CHAINED_ROWS table (created using the ?/rdbms/admin/utlchain.sql script). The HEAD_ROWID column in CHAINED_ROWS indicates the ROWID of a row in the table that is spread across multiple blocks.
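Putting the above together, a typical sequence to list the chained rows of the students table would be a sketch like this:

```sql
-- Create the CHAINED_ROWS table once per schema
@?/rdbms/admin/utlchain.sql

-- Populate it and inspect the affected rows
analyze table students list chained rows into chained_rows;
select head_rowid from chained_rows where table_name = 'STUDENTS';
```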
Fixing Row Chaining / Migration:
If the number of chained rows is very small, we can delete the corresponding rows and insert them back into the table in a single transaction when the load on the table is low. If there are many chained rows, it can be a clear indication that either the PCTFREE setting for the underlying table is not sufficient or the tablespace block size is too small. If we change the existing value of PCTFREE for a segment, only the new blocks allocated to the segment inherit the changed value; the existing blocks retain the old value. If we have to ensure a consistent value of PCTFREE for all the blocks in the table after modifying the PCTFREE value, we can make use of features such as:
- Online table move.
- Online table redefinition (in earlier versions).
- Export and import (if the table is static).
All of these result in the creation of a new segment. Since the maximum size for LONG or LONG RAW columns is very high, Oracle's recommendation is to convert these columns into the corresponding LOB data types, as LOBs are stored in a different data segment. Remember that the ultimate goal of any performance tuning DBA is to reduce the number of disk reads, which is basically the choking point in any OLTP database that prevents us from attaining the peak TPS on the system.
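As an illustration of the first option, an online move after raising PCTFREE rebuilds every block with the new setting. The value 20 below is an assumption, not a recommendation, and MOVE ONLINE requires release 12.2 or later (on earlier releases use a plain MOVE followed by an index rebuild):

```sql
-- Raise PCTFREE (affects newly formatted blocks only) ...
alter table students pctfree 20;
-- ... then rebuild the segment so all blocks honour the new value
alter table students move online;
```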
From the Storage Perspective:
If the select statement references only the columns that are present in the head row piece, then no additional reads are required, and depending on various factors such as the number of ROWIDs for a particular column value, fetch size, use of aggregate functions, and order by (not included in the SQL), Oracle can efficiently batch as many table block reads as possible into one single read. So what happens if the columns that are stored in the trailing row pieces are included in the select statement? To answer this we have to understand that Oracle has no way of determining which columns are stored in the tail end (trailing row piece) from the ROWIDs it obtains from the index lookup, and moreover only a single row or a few rows may have resulted in chained rows. So when a user executes a query such as the following:
select
contact1 from students where marks = 100
and if there are 9 chained rows, Oracle proceeds as if no chained rows are present, batching as many table blocks as it can and issuing an io_submit call to read the corresponding blocks; as it accesses each row based on the ROWID, if it encounters a chained row, it immediately issues a separate IO request using pread64 to read the corresponding row piece. This behaviour is the same for both row chaining and migration (for range based or bound based predicates as well).
Is there a way to Prevent Row
Migration altogether?
Introduction – Hakan Factor:
Hakan Factor specifies the maximum number of rows that can fit in a
block. During the table creation time Hakan Factor is calculated based on
various factors such as:
- Database block size.
- Data type and length of the columns.
- Number of columns.
- NOT NULL constraints.
Unfortunately, factors that indirectly restrict the number of rows in a block, such as the following, are not used in the calculation (we will look into this shortly):
- INITRANS
- PCTFREE
We can find the Hakan factor value for a table using the following
query.
select
mod(t.spare1, 32768) hakan from tab$ t where obj# in (select object_id from
dba_objects where object_name='&OBJECT_NAME' and owner='&OWNER');
or by issuing the following Query:
select
max(sys_op_rpb(rowid)) from owner.object_name;
The Hakan factor can be set manually by issuing the following statement:
alter table table_name minimize records_per_block;
Once you execute the statement, Oracle internally uses the above-mentioned SQL statement to calculate the Hakan factor (by performing a full table scan).
The Hakan factor is primarily used:
- To restrict the number of rows in a particular table.
- To map the rows to the corresponding ROWID entries for bitmap indexes.
By default, when a table is created (without a not null constraint, or
in many cases even with the presence of not null constraint), Oracle sets the
value to 736 (which is a bit unrealistic) for a database block size of 8KB.
Let’s look at an example:
create
table students(roll number, name varchar2(300));
The Hakan value for this table is as follows:
Oracle estimates that this block can hold as many as 736 rows. Now let’s
populate this table as follows (ensuring all varchar data is stored):
insert into
students select rownum, dbms_random.string(0,300) from dual connect by level
< 10000;
commit;
and update the Hakan value manually by issuing the following statement:
alter table
students minimize records_per_block;
Oracle now populates the Hakan Factor as follows:
This indicates that a block cannot hold more than 22 rows. Now that the Hakan value is populated, let's truncate the table and repopulate it with a smaller row size.
insert into
students select rownum, dbms_random.string(0,1) from dual connect by level <
10000;
commit;
Let’s look at the average row length, blocks etc., using the following
query:
select table_name, avg_row_len, num_rows, blocks from dba_tables where table_name = 'STUDENTS';
As we can see, even though the average row length is small, Oracle allocated a greater number of blocks than required; this is because of the Hakan factor. Let's look at the number of rows stored in each block:
select cnt, count(*) from
(select dbms_rowid.rowid_block_number(rowid) "BLOCK", count(*) "CNT"
from students group by dbms_rowid.rowid_block_number(rowid))
group by cnt;
As we can see from the above, Oracle did restrict the number of rows per block to 23 (the Hakan factor, +/- 1) even though the block can accommodate more rows. This is a way in which we can restrict the number of rows per block.
One might wonder if PCTFREE can be used to restrict the number of rows or reserve more space for future updates. One disadvantage of PCTFREE is that it doesn't restrict the number of rows per block, and there is always a chance that the row size can grow as a result of an update statement, resulting in row migration. So how do we set a proper value for the Hakan factor?
Load the table with approximate (average) sizes for each column (by talking to the developers), issue the statement alter table ... minimize records_per_block, and then truncate the table. Do this initially, when the table is created, so that you can prevent row migration from occurring. Remember that Oracle doesn't insert a row into a block if it finds that the block doesn't have enough free space. The default value of the Hakan factor may not be appropriate, as the value is always an overestimate and doesn't consider INITRANS (which actually reserves space for ITL slots in a block).
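A sketch of that workflow against the earlier two-column students table, assuming a representative average width of around 250 bytes for the name column (the seed data and row count are assumptions):

```sql
-- Seed with representative row sizes (assumed averages)
insert into students
  select rownum, rpad('x', 250) from dual connect by level <= 1000;
commit;

-- Capture the Hakan factor from the seeded data, then empty the table;
-- the Hakan factor survives the truncate
alter table students minimize records_per_block;
truncate table students;
```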
Remember that when you use statements such as CTAS (Create Table as Select) Hakan factor gets carried over to the new table.
Impact of Additional Predicates:
Use of additional predicates on columns other than the indexed columns
in an SQL statement can undermine the Batching or Async IO capabilities of a
database, depending on the ROW’s position that matches all the predicates in
the SQL in the leaf block. This behavior is common for both prefetching and
batching. Let’s look at an example:
Consider the following scenario:
create
table students (student_id number, name varchar2(30), dept_id number, mark1
number, mark2 number);
create
index mark1_mark2_idx on students(mark1,mark2);
Populate the table with random data.
SQL>
select dept_id,count(*) From students where mark1=113 and mark2 = 3 group by
dept_id;
The position of the corresponding ROWIDs of interest in the leaf blocks
for the entry mark1=113 and mark2=3 is as follows:
Let’s look at 10046 level 12 trace for the following statements:
select *
from students where mark1 = 113 and mark2 = 3 and dept_id =1;
select *
from students where mark1 = 113 and mark2 = 3 and dept_id = 2;
select *
from students where mark1 = 113 and mark2 = 3 and dept_id = 3;
In all the cases, the number of buffer gets or consistent reads is the same, but it is the number of single block read requests, and the inability to effectively make use of asynchronous IO through batching or prefetching, that adds delay to the elapsed times of the queries. If you are working on a true OLTP system where even millisecond delays cannot be tolerated, you should be really careful in selecting the right columns for a multi-column index. If all the table blocks are cached, this behaviour will not impact the response times at all, and this is evident from the following.
This situation (in which all the table blocks are in the buffer cache) may be the case on many databases, but as buffer caches are considerably smaller than the data on disk, the session may eventually have to read the blocks from disk; there are simply too many variables to rely on caching.
Working with Dates/Timestamps:
Developers often make the mistake of mapping the wrong data type when it comes to dates, or use functions that are not compatible, due to which Oracle will not be able to use the index on the date column. Let's start with a basic example:
create
table students(roll number, name varchar2(30), day date);
create
index day_idx on students(day);
Populate it with random data.
The application issues the following query (in this case the developer maps the bind variable as TIMESTAMP instead of DATE):
select *
from students where day = ?
The execution plan is as follows:
The bind value supplied by the application is of type TIMESTAMP. Even though the column day is indexed, Oracle ignores the index and performs a full table scan. This is because Oracle applies the function INTERNAL_FUNCTION on the day column (this INTERNAL_FUNCTION performs an implicit datatype conversion) to expand/convert the column data to the Oracle TIMESTAMP data type, rather than truncating/converting the supplied TIMESTAMP value into the Oracle DATE data type.
In this case you may think that creating an index as follows will help, but it doesn't, because Oracle doesn't allow us to create an index using a non-deterministic function, and secondly INTERNAL_FUNCTION cannot be used directly (19c):
create
index day_idx on students(internal_function(day));
ERROR at
line 1:
ORA-00904:
"INTERNAL_FUNCTION": invalid identifier
In this case the solution is to ask the developer to fix the code and change the data type from TIMESTAMP to DATE. This issue is more prevalent with Hibernate. Oracle Database release 12c R2 addresses this issue by automatically picking the index even though the supplied data type is TIMESTAMP; in this case Oracle doesn't convert the data type and uses the index directly. The following execution plan is from 12c Release 2.
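On releases that still apply the implicit conversion, one possible workaround (if the application's bind type cannot be changed) is to cast the bind explicitly so the conversion happens on the bind side rather than on the indexed column; :b1 below is a hypothetical bind variable:

```sql
-- Convert the TIMESTAMP bind to DATE instead of expanding the indexed column
select * from students where day = cast(:b1 as date);
```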
But if OPTIMIZER_FEATURES_ENABLE is set to <=12.2.0.1, Oracle reverts to the old behaviour and applies the INTERNAL_FUNCTION. Before going further, let's look at the data types returned by the Oracle supplied functions:
sysdate – DATE
systimestamp – TIMESTAMP WITH TIME ZONE
trunc(systimestamp) – DATE
Be cautious when using
these functions in the predicates. Consider the following table:
create
table students (roll number, name varchar2(30), when timestamp);
create
index when_idx on students(when);
A query such as the following:
select * from students where when = systimestamp - interval '1' day;
will result in a TABLE ACCESS FULL rather than an INDEX
RANGE SCAN.
This is because Oracle expands the WHEN column's data type from TIMESTAMP into TIMESTAMP WITH TIME ZONE and then applies the SYS_EXTRACT_UTC function, which extracts the UTC (coordinated universal time, formerly Greenwich mean time) from a datetime value with a time zone offset or time zone region name. In other words, SYS_EXTRACT_UTC takes a TIMESTAMP WITH TIME ZONE as input and returns a TIMESTAMP.
Always compare a TIMESTAMP column with either sysdate or trunc(systimestamp), or use the java.sql.Timestamp API while setting a value for the bind variable at the application level (PreparedStatement). An interesting point to remember here: since Oracle automatically applies the SYS_EXTRACT_UTC function whenever a predicate contains columns of type TIMESTAMP WITH TIME ZONE, creating an index on a column with data type TIMESTAMP WITH TIME ZONE automatically creates a function-based index.
Consider the following:
create
table students(roll number, name varchar2(20), when2 timestamp with time zone);
create
index when2_idx on students(when2);
Even though the DDL suggests that we are creating a normal B-Tree index, Oracle internally creates a function-based index.
Extracted DDL will be as follows:
CREATE INDEX
"VISHNU"."WHEN2_IDX" ON
"VISHNU"."STUDENTS" (SYS_EXTRACT_UTC("WHEN2"))
PCTFREE 10 INITRANS 2 MAXTRANS 255 COMPUTE STATISTICS
TABLESPACE "USERS" ;
Remember that since the index created on a TIMESTAMP WITH TIME ZONE column is a function-based index, the limitations of function-based indexes apply here as well.
Null Values:
Null values hold a special place in Oracle: if a column in a row has no value, then the column value is said to be null, or to contain null. We cannot compare a null with a null using the relational operator =, since nulls are not treated as equal values (null represents the absence of a value); that being said, the condition null = null is always false, while null is null always evaluates to true. Nulls can appear in columns of any data type that are not restricted by NOT NULL or PRIMARY KEY integrity constraints. Oracle treats null as an unknown value, so if we have a unique constraint on a column, we can insert many rows containing null values; in this case Oracle treats each null value as distinct, and since we are not storing any value, Oracle doesn't store these entries in the index. Considering only B-Tree indexes, which we are discussing here: by default, Oracle doesn't store an entry in the index when all the indexed columns contain null values. Oracle does store an entry if at least one column of a composite (multi-column) index contains a non-null value. Let's look at scenarios where we can leverage this to improve response time.
Single Column B-Tree indexes:
Indexes built on a single column do not store entries for null column values. As these values are not stored in the index, NUM_ROWS and DISTINCT_KEYS in DBA_IND_STATISTICS do not reflect them. So, for queries with predicates that depend on null, such as :b1 is null or :b1 is not null, Oracle has no way of determining from the index how many rows in the table contain null values; similarly, when the predicate condition is not null, this equates to all the rows in the table. In both cases, the optimizer simply prefers a full table scan, a fast full index scan, or a full index scan.
When we update a particular value to null, Oracle simply marks the flag for that particular entry in the index leaf block as deleted. Even though this behaviour is not of much significance on its own, it becomes a starting point for explaining certain behaviour when we proceed to multi-column indexes. The corresponding leaf block dump showing the details is as follows:
row#0[8020] flag: ---D---, lock: 2, len=12
col 0; len 2;
(2): c1 02
col 1; len 6;
(6): 01 c0 01 66 00 00
row#1[8008] flag: ---D---, lock: 2, len=12
col 0; len 2;
(2): c1 03
col 1; len 6;
(6): 01 c0 01 67 00 00
Since nulls are not stored in a single-column index, queries that depend on null are particularly affected, such as the following:
select * from students where marks is null;
A simple workaround in this case is to add a constant column to the index and leverage the fact that Oracle does store index entries when only the leading column values contain null. An example is as follows:
Create index
students_idx on students(marks, 1);
Here, we added an extra column to the index which stores the constant value 1. Oracle adds a virtual column to the table, which can be viewed using ALL_TAB_COLS or DBA_TAB_COL_STATISTICS. The end result is a function-based index on the table with columns (marks, virtual_column). The storage requirement for the value 1 is 2 bytes, which can be seen in the leaf block dump as follows:
Characters such as "." or a space " " require only 1 byte of storage. If we are indexing numeric or character-based columns, consider using such characters as the constant, since this maximizes the number of entries that can be stored in the leaf blocks and also reduces the size of the index (the space savings in this case might be small, but for OLTP workloads each additional block access counts when it comes to response times).
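For example, using a single-character constant instead of the number 1 (a sketch; either form works, but the character constant stores in one byte per entry instead of two):

```sql
-- ' ' occupies 1 byte per index entry; the number 1 occupies 2
create index students_idx2 on students(marks, ' ');
```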
The leaf block dump is as follows:
Multi-Column B-Tree indexes:
For multi-column or composite indexes things are a little different: Oracle persists the nulls in the index even if the leading column or trailing column contains nulls. The block dump of an index leaf block shows the details. We know that Oracle by default stores the column values in ascending order, but since the value of NULL is unknown, the corresponding entries are always stored at the last, or right-most, part of the B-Tree structure, following the entries that contain values.
- If the leading column has nulls, these entries are stored in either the last branch block or the last leaf block, depending on the number of rows that contain nulls in the leading indexed column.
- If the trailing column has nulls, the corresponding entries are stored in the last index leaf block, or immediately following the last entry that contains the maximum value for the trailing column (for a particular value of the leading column).
It is only when all the indexed columns are null that Oracle doesn't persist these entries in the index leaf blocks. This has rather strange implications for how Oracle retrieves the data from this index when the predicate uses a null or not null condition on a column or set of columns in the composite index. Let's look at the various cases where an index will be used and the tips that we can apply:
We create the table as follows:
Create table temp
(roll number, name varchar2(20), mark1 number, mark2 number, mark3 number);
Create index idx
on temp(mark1, mark2);
We carefully load the data into the table such that the number of rows
in the table is large and selectively populate mark1 and mark2 to contain nulls
as follows:
Populating 100 rows so that mark1 column value is null
Insert into temp
select rownum ,dbms_random.string(0,20), null, 112312, 123 from dual connect by
level < 100;
Populating 100 rows so that mark2 column value is null
Insert into temp
select rownum ,dbms_random.string(0,20), 112312, null, 123 from dual connect by
level < 100;
For the following query:
Select * from temp
where mark1 is null;
Even though the number of rows that match the condition is very small, Oracle will end up performing a full table scan. As the mark2 column is not protected by a NOT NULL constraint, even though we know that the data doesn't contain any rows with null values in both mark1 and mark2, Oracle has no way of guaranteeing this based on the statistics of the individual columns from dba_tab_col_statistics; it assumes there is a chance that the null values in mark1 may overlap with null values in mark2, and that the index entries may not contain those rows. Due to this, Oracle prefers a full table scan. In this case, to make sure that the optimizer picks up the index, we have two solutions:
1. Add a NOT NULL constraint to the mark2 column.
2. Recreate the index with columns (mark1, mark2, 1), so that even when the mark1 and mark2 column values are null the entry is still stored; due to the presence of the constant column, which always contains a value, Oracle prefers the index scan.
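The second solution might look as follows (a sketch against the temp table above):

```sql
-- Recreate with a constant trailing column so fully-null keys are still indexed
drop index idx;
create index idx on temp(mark1, mark2, 1);
```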
For the following query:
Select * from temp
where mark2 is null;
In this case as well, since the mark1 column is not protected by a NOT NULL constraint, Oracle infers that there is a chance that the null values in mark1 may overlap with null values in mark2, and prefers a full table scan. Here too we have two solutions:
1. Add a NOT NULL constraint to the mark1 column.
2. Add an extra trailing column to the index which is not a constant and which is protected by a NOT NULL constraint.
In this particular case, adding an extra constant column will not work because function-based indexes don't support skip scans; this is one of the limitations of skip scans.
Both the following queries ensure an index scan as Oracle is
deterministic that the values are persisted in the index columns because the
predicates guarantee that.
Select * from temp
where mark1 is not null and mark2 is null;
Select * from temp
where mark1 is null and mark2 is not null;
For the following query:
select * from temp where mark1 = 12 and mark2 is null;
Even though the trailing column is null, in order to satisfy queries that contain just the predicate mark1 = ?, Oracle has to store all the entries in the leaf blocks. Since all the entries with mark2 as null are stored last, Oracle reads the last branch block entry for mark1 = 12 and reads the corresponding index leaf blocks.
When it comes to Oracle's ability to perform a FAST FULL INDEX SCAN and/or INDEX FULL SCAN, having a NOT NULL constraint is mandatory on at least one column of a concatenated or multi-column index. There is yet another important aspect that a DBA should be aware of when it comes to NOT NULL constraints. Consider the following example:
create table
students(student_id number primary key, name varchar2(20) not null, marks
number not null);
create index
name_idx on students(name);
insert into
students select rownum, dbms_random.string(0,20),trunc(dbms_random.value()*100)
from dual connect by level < 1000000;
commit;
In the above table all the columns student_id, name, and marks are protected by NOT NULL constraint.
For the queries such as the following:
select * from
students where student_id is null;
select * from
students where student_id is null order by student_id;
select * from
students where name is null order by student_id;
select * from
students where marks is null;
Oracle doesn’t even bother to read the segment (table or index blocks)
as optimizer finds a conflict between the WHERE-clause predicates and the constraints declared
on the table provided that the constraint is VALIDATED (DBA/USER_CONSTRAINTS.VALIDATED). This is a scenario in which Oracle executes
the SQL query without even touching the table or index segment.
There are many
takeaways from this. Let’s start with the first SQL statement:
select * from
students where student_id is null;
Here student_id
column is protected by a primary key constraint (UNIQUE – NOT
NULL) and it is evident from the predicates
used (student_id is null). Once the optimizer detects a conflict between the NOT NULL constraint and the predicates used, the predicate section of the explain plan shows filter(NULL IS NOT NULL). If such a predicate (one that conflicts with a constraint) is used along with other predicates, Oracle will or will not ignore it depending on whether the predicates are connected by AND or OR. Let's look at the execution statistics:
From the above we can clearly see that Oracle didn’t perform a single
block read.
Remember that if the constraint is marked as NOT VALIDATED, Oracle treats it as if the constraint never existed and performs IO to check the rows, causing unnecessary IO (provided that there are no rows with the corresponding column values as null). Another important aspect of this behaviour is that the explain plan can indicate an index that doesn't make sense. Consider the following query:
select * from
students where name is null order by student_id;
The execution/explain plan for the above statement is as follows:
Instead of wondering why Oracle picked the index students_pk2, check the predicate information section: filter(NULL IS NOT NULL) is a pure giveaway of the above-mentioned behavior. When all the columns of an index entry are NULL, Oracle doesn't store that entry in the index. Don't take this as an ironclad rule: it holds true for a normal B-Tree index, but not for descending, function-based, reverse key or bitmap indexes. Bitmap indexes do store nulls.
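A quick way to see this (a sketch; object names are illustrative) is that an IS NULL predicate on a single-column B-Tree indexed column cannot be answered from that index, since rows with a NULL key have no entry in it:

```sql
create table students (student_id number, name varchar2(20), marks number);
create index marks_idx on students (marks);

-- Rows whose MARKS value is NULL have no entry in MARKS_IDX,
-- so the optimizer cannot use the index for this predicate:
explain plan for select * from students where marks is null;
select * from table(dbms_xplan.display);
-- Expect a TABLE ACCESS FULL rather than an index access path.
```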
Descending Indexes and nulls:
By now we already know that Oracle doesn't persist entries in the index when all the columns of the index are null, and that Oracle internally uses the function SYS_OP_DESCEND() on the column value to get the form that is stored in a descending index. The use of sys_op_descend has strange implications when it comes to null. When null is passed, this function returns 00, as shown below:
SQL> select sys_op_descend(null) from dual;

SY
--
00

Since this function returns a value for null, the value 00 is stored in the index leaf blocks. In other words, nulls are stored in a descending index, but retrieving the rows with nulls presents a challenge. Let's create a schema as follows:
create table students(student_id number, name varchar2(30), marks number);

insert into students select rownum, dbms_random.string(0,30), round(dbms_random.value()*1000) from dual connect by level < 1000000;

insert into students select rownum, dbms_random.string(0,30), null from dual connect by level < 100;

commit;

create index marks_idx on students(marks) desc;
Consider the following query:
select * from students where marks is null;
Oracle performs a full table scan (even though the nulls are present),
in this case Oracle doesn’t even apply the function SYS_OP_DESCEND.
Interestingly when we modify the predicates section as follows, Oracle
uses the index.
select * from students where sys_op_descend(marks) = sys_op_descend(null);
Another way to address this issue is to add a constant expression as a trailing column of the index, as follows:
create index marks_desc_idx on students(marks desc, 1);
Interestingly even in this Oracle ends up with a full table scan for the
following query:
select * from students where marks is null;
Unless we modify the predicate section as follows:
select * from students where sys_op_descend(marks) is null;
So, when it comes to single-column descending indexes, Oracle does store nulls. In a descending index, nulls are stored at the left-most section of the B-Tree structure.
Reverse Key indexes and Nulls:
When it comes to reverse key indexes, things are quite different. Even though the predicate section doesn't show a specific function being applied, and only the bytes of the column value are reversed, things work much like a regular index: for a single-column index, NULLs are not stored, and for multi-column indexes the behavior is the same as a normal index.
For single-column indexes, even after adding a constant trailing column to the index, Oracle doesn't prefer an index range scan.
create index marks_idx on students(marks, 1);
When it comes to multi-column indexes, if the trailing column is null and the application issues queries such as the following:
select * from students where marks = 10 and dept_id is null;
Oracle starts by scanning the entire portion of the index and the corresponding table blocks for MARKS = 10, even though the nulls are stored at the right-most end (this is not the case with a regular index).
For a multi-column index, nulls are usually stored at the right-most part of the B-Tree structure for the corresponding value.
Function Based Indexes:
Similar to descending indexes, unless we specify either the entire function expression or the virtual column, the optimizer may not prefer an index range scan. Whether nulls are stored in the index also depends on the function used.
Predicates used and column order in Multi-Column indexes:
ORDER BY clause and use of PGA:
Whether the PGA is used for sorting the final result set depends on whether the column named in the ORDER BY clause is indexed and whether the optimizer ends up using that index. Use of any non-indexed column (one not used in the predicates or not covered by the chosen index) in the ORDER BY clause results in the PGA being used for sorting.
Remember that if Oracle decides to use the PGA to store the intermediate results, then it will utilize ASYNC IO or batch the table block reads (through index lookups) as much as it can, based on several factors.
Use of an ORDER BY clause with ASC or DESC severely undermines the ASYNC IO capabilities of the database, and the scans are mostly sequential in nature, reading one table block at a time. (You can see batching or ASYNC IO even with an ORDER BY clause, provided that we specifically select a single index key value.)
Index Range Scan Descending:
An Index Range Scan Descending is identical to an index range scan except that the database returns rows in descending order. The optimizer can choose a descending scan when the SQL statement contains an ORDER BY clause with the DESC keyword, or when seeking a value less than a specified value. There is more to it than just returning rows in descending order. Let's start with very basic examples and the schema as follows:
create table students (student_id number, dept_id number, sub_id number, marks number);

alter table students add constraint students_pk primary key (student_id);

create index sub_marks_idx on students(sub_id,marks);

insert into students select rownum, round(dbms_random.value()*100), round(dbms_random.value()*100), round(dbms_random.value()*100) from dual connect by level < 10000000;

commit;
Consider the following query:
select * from students where student_id < 10 order by student_id desc;
The SQL Trace is as follows:
WAIT
#140603161896376: nam='db file sequential read' ela= 271 file#=7 block#=475
blocks=1 obj#=74652 tim=14585256046
WAIT
#140603161896376: nam='db file sequential read' ela= 248 file#=7 block#=3399
blocks=1 obj#=74652 tim=14585256472
WAIT
#140603161896376: nam='db file sequential read' ela= 219 file#=7 block#=479
blocks=1 obj#=74652 tim=14585256742
WAIT
#140603161896376: nam='db file sequential read' ela= 243 file#=7 block#=40206
blocks=1 obj#=74651 tim=14585257042
WAIT
#140603161896376: nam='db file sequential read' ela= 217 file#=7 block#=99345
blocks=1 obj#=74651 tim=14585257914
WAIT
#140603161896376: nam='db file sequential read' ela= 253 file#=7 block#=105735
blocks=1 obj#=74651 tim=14585258252
WAIT
#140603161896376: nam='db file sequential read' ela= 212 file#=7 block#=30681
blocks=1 obj#=74651 tim=14585258517
WAIT
#140603161896376: nam='db file sequential read' ela= 208 file#=7 block#=11531
blocks=1 obj#=74651 tim=14585258770
WAIT
#140603161896376: nam='db file sequential read' ela= 228 file#=7 block#=155448
blocks=1 obj#=74651 tim=14585259042
WAIT
#140603161896376: nam='db file sequential read' ela= 209 file#=7 block#=31423
blocks=1 obj#=74651 tim=14585259300
WAIT
#140603161896376: nam='db file sequential read' ela= 205 file#=7 block#=89683
blocks=1 obj#=74651 tim=14585259558
WAIT
#140603161896376: nam='db file sequential read' ela= 207 file#=7 block#=168747
blocks=1 obj#=74651 tim=14585259813
If we look at which table block each corresponding student_id is stored in, we get the following results:
select student_id, dbms_rowid.rowid_block_number(rowid) "BLOCK" from students where student_id < 10;
Oracle works its way from the last entry in reverse until the lower bound is reached, eliminating the need to sort the rows. The extent to which ASYNC IO can be used is completely undermined by the ORDER BY clause, because BATCHING or PREFETCH could cause any block (one that may contain a high value) to come into the buffer cache first. Remember that even if an indexed column value spans multiple leaf blocks, Oracle always starts from the last entry and works its way in reverse order. One might wonder why. One way of looking at this is that, due to the inherent structure of the index, a leaf block can contain multiple column values and their corresponding ROWIDs, and these column values can span multiple leaf blocks; Oracle has no way of knowing in which particular leaf block of the doubly linked list the previous value is stored, and determining that would require traversing the B-Tree structure or using the branch blocks, which is expensive considering the number of latch gets involved. When it comes to multi-column indexes, if the predicate on the leading column contains an = condition and there is an ORDER BY clause with DESC on the trailing column, Oracle can still consider an index range scan descending.
Since Oracle can perform an index range scan in descending order, the usefulness of a descending index is limited for single-column indexes; its benefits show up clearly in multi-column indexes.
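As a sketch of the multi-column case (using the students/sub_marks_idx schema created above), an equality predicate on the leading column plus ORDER BY ... DESC on the trailing column lets the optimizer pick a descending range scan and avoid the sort:

```sql
-- Equality on the leading column, descending order on the trailing column:
explain plan for
  select * from students where sub_id = 10 order by marks desc;
select * from table(dbms_xplan.display);
-- Expect INDEX RANGE SCAN DESCENDING on SUB_MARKS_IDX
-- and no SORT ORDER BY step in the plan.
```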
Now that we have seen various scenarios that can lead to additional IO
while performing an index range scan, let’s look at various optimizations that
Oracle uses while performing an index range scan. These optimizations depend on
various factors such as
- Number of ROWIDs for a particular index key value.
- Client fetch size.
- Operations that use PGA such as aggregate, group by etc.
- Use of ORDER BY clause.
Index Skip Scan:
Prior to 9i, the optimizer ignored the index when the predicates didn't contain the leading column of a multi-column index; in other words, the index was used only when the predicates contained the leading subset, or all, of the columns of a multi-column index. Starting with 9i, with the introduction of skip scans, Oracle can consider a multi-column index even when the leading column is not included in the predicates.
Consider the following example:
create table students (student_id number, name varchar2(300), dept_id number, sub_id number, marks number);

create index sub_marks_idx on students(sub_id, marks);

create index marks_idx on students(marks);

insert into students select rownum, dbms_random.string(0,300), round(dbms_random.value()*100), round(dbms_random.value()*100), round(dbms_random.value()*10000) from dual connect by level < 1000000;

commit; -- gather statistics post commit
Let's look at the execution plan/statistics for the following statement, toggling marks_idx between visible and invisible so that we get an index range scan and an index skip scan respectively:
select * from students where marks = 0;
The execution plans are as follows:
It is obvious from the above that the plan with index skip scan resulted
in higher buffer gets, elapsed time and CPU per execution than the plan with
index range scan. But this doesn’t tell us the entire story and concepts behind
a skip scan.
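The comparison above can be driven explicitly, either by toggling index visibility or with the standard INDEX and INDEX_SS hints (a sketch against the schema created earlier):

```sql
-- Hide the single-column index so only the multi-column index is available:
alter index marks_idx invisible;
select * from students where marks = 0;  -- optimizer may now pick INDEX SKIP SCAN on SUB_MARKS_IDX

alter index marks_idx visible;
select * from students where marks = 0;  -- back to INDEX RANGE SCAN on MARKS_IDX

-- The same comparison driven by hints instead of visibility:
select /*+ index_ss(students sub_marks_idx) */ * from students where marks = 0;
select /*+ index(students marks_idx) */ * from students where marks = 0;
```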
Let’s look from the storage perspective:
10046 level 12 trace of plan with index skip scan:
10046 level 12 trace of plan with index range scan:
For the purpose of understanding, think of a multi-column or concatenated index as a partitioned table with the leading column as the partitioning key. When querying a partitioned table, if the predicate doesn't contain the partitioning key, partition pruning cannot happen, and since the data can be in any partition, Oracle has to read each and every partition to get the corresponding rows. Similarly, in a multi-column index, when the leading column is not specified, rows that match the predicate (on only the trailing columns) can be in any branch of the B-Tree structure. Do recall that when the cardinality or NDV of the leading column is low, Oracle includes the trailing columns in the branch blocks to efficiently traverse the B-Tree to find a particular key value. INDEX SKIP SCAN exploits this aspect of the B-Tree structure by reading all the corresponding branch and leaf blocks that may contain the value of interest. This flow is not as simple as it appears. Let's look at how this works in more detail:
kdxbrlmc 29371974=0x1c02e46 //no starting value stored
kdxbrsno 6
row#0[8018] dba: 29397625=0x1c09279
col 0; len 2; (2): c1 0f //starting column
value in 0x1c09279 branch block. All less values are
stored in 0x1c02e46
col 1; len 2; (2): c2 61
col 2; TERM
row#1[8037] dba: 29381124=0x1c05204
col 0; len 2; (2): c1 1d
col 1; len 2; (2): c2 05
col 2; TERM
row#2[8002] dba: 29403110=0x1c0a7e6
col 0; len 2; (2): c1 29
col 1; TERM
We know that a B-Tree branch/root block doesn't store the starting value for the first leaf/branch block pointed to by kdxbrlmc, and the starting value of col 1 indicated by the first branch block entry is higher than 80 (the internal representation of the decimal value 0), as shown below:
row#0[8018] dba: 29397625=0x1c09279
col 0; len 2; (2): c1 0f
col 1; len 2; (2): c2 61 //starting value higher than 80
col 2; TERM
Oracle has to
mandatorily read the block as there is a chance that the relevant index entries
may be present. Let’s look at the dump
of the first branch block 0x1c02e46.
kdxbrlmc 29360494=0x1c0016e
kdxbrsno 286
kdxbrbksz 8056
kdxbr2urrc 0
row#0[4466] dba: 29397613=0x1c0926d
col 0; len 1; (1): 80
col 1; len 2; (2): c2 08 //starting value higher than 80
col 2; TERM
row#1[4476] dba: 29381211=0x1c0525b
col 0; len 1; (1): 80
col 1; len 3; (3): c2 0d 5b
col 2; TERM
…
row#17[4571] dba: 29381155=0x1c05223
col 0; len 2; (2): c1 02
col 1; len 3; (3): c2 03 2d
col 2; TERM
The same logic applies here: since the first entry has a higher starting value, Oracle has to read the leaf block, which may contain the values of interest. Let's look at the dump of the leaf block 0x1c0016e.
row#0[4459] flag: -------, lock: 0,
len=14
col 0; len 1; (1): 80
col 1; len 2; (2): c1 02 // First entry
indicates a higher value so backtrack
col 2; len 6; (6): 01 c0 90 36 00 12
row#1[4473] flag: -------, lock: 0,
len=14
col 0; len 1; (1): 80
col 1; len 2; (2): c1 03
col 2; len 6; (6): 01 c0 20 50 00 11
Since the first entry in the leaf block has a higher value for col 1, Oracle knows that in this section of the B-Tree there are no matching values for the predicate marks = 0, so it moves to the next higher value of col 0 (the leading column) and proceeds ahead; here, TERM and the ROWID section are used to determine whether the values of interest are present in the previous index entries. This process repeats until the right-most leaf block is read (remember why the right-most leaf block in the B-Tree structure is read). If the NDV of the leading column is high, then Oracle will store only the leading column in the branch blocks, in which case performing a skip scan involves reading all leaf blocks, which is highly undesirable. Due to this, several branch and leaf blocks must be visited for each distinct value of the leading column to determine whether the actual values of interest (those matching the specified predicate) exist.
Following are few interesting points to note:
- A limitation of skip scan is the lack of support for an IN-clause, though bounded and unbounded predicates such as <, >, <=, >=, between are supported.
- As previously stated, Oracle keeps adding multiple leading columns to the branch blocks once it determines that two columns may not be enough to efficiently traverse the B-Tree structure; due to this, Oracle can perform a skip scan even when a third or sixth column of a multi-column index is specified in the predicates.
- Immediately following an index rebuild the number of reads on branch blocks will be reduced, as Oracle tends towards having separate entries for each distinct leading column value following index leaf block splits as the data gets loaded.
Index Fast Full Scan:
An index fast full scan is similar to a full table scan except that it reads the index as if it were a table. Oracle reads the index blocks as in a full table scan, in an orderly fashion, extent by extent. Remember that these index blocks are read as they are stored on disk and not in order from the left-most index leaf block to the right-most index leaf block (which is what happens with an Index Full Scan). Due to this, an Index Fast Full Scan cannot eliminate a sort operation.
Oracle uses multiblock IO to read the index segment (even though the root and branch blocks are read into the PGA, they are ignored, as they contain only the information needed to traverse the B-Tree structure). Bitmap indexes do support Index Fast Full Scans. In order for the optimizer to pick an Index Fast Full Scan the following conditions must be met.
- All the columns referenced in the query must form either a subset of the index columns or all of the index columns. It doesn't matter which position a column occupies in the index; Oracle can still perform an Index Fast Full Scan if the remaining conditions are met.
Eg: marks_idx(mark1, mark2, mark3, dept_id)
select dept_id, count(*) from students group by dept_id;
- A NOT NULL constraint:
o If the index is a single-column index, then that column has to be protected by a NOT NULL constraint.
o If the index is a multi-column index, at least one of the indexed columns has to be protected by a NOT NULL constraint; it doesn't matter which position that column occupies in the multi-column index.
- In the absence of a NOT NULL constraint, a predicate with an IS NOT NULL condition on any indexed column, or on any column referenced in the query, is enough to trigger an Index Fast Full Scan.
- The operation has to be either a SELECT or a DELETE.
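As a sketch of the NOT NULL point (illustrative names): if MARKS carries no NOT NULL constraint, an explicit IS NOT NULL predicate guarantees that every qualifying row has an index entry, which makes the index a legal substitute for the table:

```sql
create index marks_idx on students (marks);

-- May be answered with an INDEX FAST FULL SCAN: the predicate
-- guarantees no qualifying row is missing from the index.
select marks from students where marks is not null;

-- Without the predicate (and without a NOT NULL constraint on MARKS),
-- the optimizer must use the table, since rows with NULL marks
-- have no entry in the index.
select marks from students;
```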
When it comes to an Index Fast Full Scan, Oracle uses the value of _very_large_table_threshold (which is ~5X _db_block_buffers, or 5 times the size of the buffer cache) as the upper limit to decide whether to perform the scan using db file scattered read or direct path read. If the object size in blocks is less than the threshold specified by _very_large_table_threshold, Oracle always performs the Index Fast Full Scan using db file scattered read, i.e. a buffered read. Note that _very_large_table_threshold is used to decide between a direct path read and a db file scattered read for a full segment scan only for indexes and IOTs.
You can observe this behavior by tracing the session using the following
statement:
alter session set events 'trace [NSMTIO] disk highest';
The following is a snippet of the trace indicating the same behavior:
NSMTIO: kcbivlo: nblks 7048 vlot 500 pnb 51728 kcbisdbfc 0 is_large 0
NSMTIO: qerixFetchFastFullScan:[MTT < OBJECT_SIZE <
VLOT]:NSMTIO: AdditionalInfo: Object_size: 7048 (blocks), vlot=258640
SqlId=4wupv26yztx6x, plan_hash_value=3862345769,Object#=74746
NSMTIO: kcbzib: Cache Scan triggered for tsn = 4, rdba=0x0142b780, fn = 5,
kobjd = 74747, block cnt = 0, noncontig = TRUE pre-warm = FALSE, prefetch =
FALSE
This behavior can become problematic in environments with large buffer caches. For example, if your buffer cache is 100GB, Oracle will choose a buffered read for any index smaller than ~500GB, and remember that each time you bring a buffer into the buffer cache there is a penalty in terms of latches, not to mention the need to find a free buffer. Unless the blocks in the buffer cache are already hot (their touch count is high), this can significantly increase the IO activity on the server if buffers are aged out due to this scan. In such environments it's better to reduce the value of _very_large_object_threshold (default value 500, indicating 5 times the buffer cache) to an appropriate value (usually 100, meaning the same size as the buffer cache). We can also override the behavior using _serial_direct_read (always, never, auto); this parameter acts as a master switch that controls whether direct path reads should always be used, never be used, or be chosen automatically by Oracle based on the number of blocks. Now that we have looked at one aspect of this, let's look at the direct path read decision and how other factors can impact performance. On a side note, a Fast Full Index Scan has the ability to flood the buffer cache.
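A sketch of the override (these are hidden underscore parameters; changing them is generally something to try in a test environment, not production):

```sql
-- Force direct path reads for serial full segment scans:
alter session set "_serial_direct_read" = always;

-- Force buffered reads instead:
alter session set "_serial_direct_read" = never;

-- Restore the default, size-based decision:
alter session set "_serial_direct_read" = auto;
```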
By now we already know that Oracle performs a full object checkpoint
before performing a full table or fast full index scan using direct path read.
Let’s see the downside of this.
CR block generation/Delayed
commit cleanout.
An object level checkpoint results in flushing all the dirty blocks that belong to a particular segment to disk, even though the transaction that modified the blocks may still be in progress (blocks modified, changes uncommitted). Once the object level checkpoint is complete, Oracle reads the table or index blocks into the PGA, during which the server process checks the ITL portion of each block to see if there are any open transactions or ITL slots. If there are open ITL slots or transactions in progress, Oracle rolls back the uncommitted changes by reading the corresponding undo blocks (those that contain the relevant entries) into the SGA (buffer cache) and proceeds ahead.
The following snippet of 10046 level 12 trace reveals the details:
This does impact the response time of the SQL statement, as the session reads undo blocks into the SGA (if they are not already present) and rolls back the changes applied to each individual block. Remember that these undo blocks are read into the SGA (buffer cache) during this process. So, does a direct path read result in reading blocks into the buffer cache?
-
It doesn’t for the table/index blocks.
-
But it does result in reading undo blocks into the
buffer cache.
It sometimes happens that multiple transactions modify the same set of index blocks before the start of the Index Fast Full Scan (which flushed these dirty blocks to disk, thereby marking the buffers as clean), and in the meanwhile, due to space pressure in the buffer cache, these buffers are replaced by new blocks coming in; the session that started the transaction will then not clean the ITL slots post commit, for efficiency purposes. Due to this behavior Oracle has to clean up the ITL slots the next time the blocks are read.
So now enter the Index Fast Full Scan again: if Oracle detects that it has to perform a delayed commit cleanout, it performs the cleanout on the blocks by reading the corresponding undo blocks. But since these index blocks are read into the PGA, the cleanout is performed in the PGA, and as these blocks are discarded once they are read or accessed, subsequent sessions performing an Index Fast Full Scan will repeatedly perform the commit cleanout over and over again. The following is a snippet of trace depicting the same behavior:
following is a snippet of trace depicting the same behavior:
Remember that the optimizer does consider and perform costing for an Index Full Scan whenever it performs an Index Fast Full Scan. Oracle never accesses the table blocks using an INDEX FAST FULL SCAN for a SELECT; the exception is a DELETE statement. For a DELETE, if the optimizer determines that accessing the table blocks through an index range scan is costlier, Oracle can access the table blocks through an INDEX FAST FULL SCAN when the SQL statement deletes
- most/all rows from the table, or
- when the SQL statement doesn't contain any predicates.
The explain plan for the DELETE statement will be as follows:
Once the session starts scanning the index using an INDEX FAST FULL SCAN, it scans the entire segment from the initial extent to the last extent, unless we restrict the number of rows returned from the query using clauses such as ROWNUM or FETCH FIRST N ROWS ONLY. Do understand that a NOT NULL constraint on one of the indexed columns is mandatory for Oracle to consider an INDEX FAST FULL SCAN (Exadata - query offloading).
You can force an INDEX FAST FULL SCAN using the INDEX_FFS SQL hint. Since the behavior is the same as a full table scan, the availability of ASYNC IO does improve performance (depending on how the extents are allocated and how evenly the data file/files are spread across multiple disks).
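For example (table alias and index name are illustrative), the INDEX_FFS hint takes the table alias and the index name:

```sql
-- Request an index fast full scan of MARKS_IDX explicitly;
-- the IS NOT NULL predicate keeps the index a legal substitute
-- for the table in the absence of a NOT NULL constraint.
select /*+ index_ffs(s marks_idx) */ count(*)
from   students s
where  marks is not null;
```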
Index Full Scans:
The server process traverses down the B-Tree structure by reading the root block first, navigating down to the left-most leaf block, and then proceeding by reading the subsequent index leaf blocks until all the leaf blocks are read (one block at a time). In other words, the server process starts by reading the root block and obtains the address of the left-most branch or leaf block from the kdxbrlmc entry; it repeats the process with the block pointed to by kdxbrlmc until it reaches the left-most index leaf block, and then works through the index by reading the blocks pointed to by the kdxlenxt entry in each index leaf block header. Since the next leaf block can reside at any position in the index segment and its location cannot be determined until the current leaf block is read, the database uses single block IO.
For SELECT statements Oracle will never access the table blocks through the INDEX FULL SCAN access path; table access through this path happens only for DELETE or UPDATE statements. The optimizer considers an INDEX FULL SCAN when
- A predicate in the query references a column in the index.
- When no predicates are used, and all the following conditions are met:
o All columns in the index are specified in the query.
o At least one indexed column is protected by a NOT NULL constraint.
- The query includes an ORDER BY clause on the NOT NULL columns.
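A sketch of the ORDER BY case (assuming a primary key index STUDENTS_PK on STUDENT_ID, as created earlier): an index full scan returns the rows already in key order, so the sort disappears from the plan:

```sql
explain plan for
  select student_id from students order by student_id;
select * from table(dbms_xplan.display);
-- Expect INDEX FULL SCAN on STUDENTS_PK and no SORT ORDER BY step.
```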
Following are the main differences between INDEX FULL SCAN and INDEX FAST FULL SCAN.

| Operation             | INDEX FULL SCAN           | INDEX FAST FULL SCAN      |
| Sort Elimination      | Yes                       | No                        |
| IO type               | Single block              | Multi-block               |
| Access Table contents | Select - NO, Delete - YES | Select - NO, Delete - YES |
| Async IO              | No                        | Yes                       |
| NOT NULL req          | Yes                       | Yes                       |
INDEX FULL SCAN can also be
performed in descending order or in other words INDEX FULL SCAN DESCENDING is
also supported.
Index Join Scans:
The optimizer can consider a hash join between multiple indexes that together contain all the columns referenced in the query, without the need to access the table, as all the data is retrieved from the indexes.
Consider the following example:
create table students(student_id number, dept_id number, name varchar2(30), sub_id number, marks number, details varchar2(3000), details2 varchar2(3000));

create index dept_student_idx on students(dept_id, student_id);

create index sub_student_idx on students(sub_id, student_id);

Populate the data into the table, and issue the following query:

select student_id, sub_id, dept_id from students where dept_id between 1 and 10 and sub_id between 1 and 10;
The table has been purposely bloated (why? We will get to this shortly).
The explain plan is as follows:
Let's consider a best-case scenario in which there are only 10 rows with column values dept_id = 4 and sub_id = 4 that match the above predicates, and no other rows fall in the range even for a single predicate alone. Because the session has to perform operations such as traversing both indexes, applying the hash function and building a hash table, performance in terms of elapsed time and maximum achievable executions per second will still be slow compared to having an index on (dept_id, sub_id, student_id) or even on (dept_id, sub_id). That said, an index join scan can also improve performance by avoiding access to the table blocks; this depends on the selectivity of the predicates and the clustering factor.
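The plan above can be requested explicitly with the INDEX_JOIN hint (a sketch against the schema created above):

```sql
select /*+ index_join(s dept_student_idx sub_student_idx) */
       student_id, sub_id, dept_id
from   students s
where  dept_id between 1 and 10
and    sub_id  between 1 and 10;
-- Expect a HASH JOIN of the two index scans under a VIEW step,
-- with no TABLE ACCESS operation in the plan.
```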
Index join scans are pretty useful in the following scenarios:
- When dealing with large tables (a greater number of columns and/or high average row length).
- When adding additional indexes would impact DML performance.
- When the elapsed times of the SQL queries matter only up to a certain extent, i.e. you do not have a very strict threshold.
Index Full/Range Scan (MIN/MAX):
When selecting data from only a particular indexed column, or using aggregate functions such as sum, avg etc. that require scanning every row or several entries from the index (functions that require reading more than one index entry), the optimizer has options such as the following to choose from:
- INDEX RANGE SCAN
- INDEX FULL SCAN
- INDEX FAST FULL SCAN
- FULL TABLE SCAN
The functions MIN and MAX typically result in only a single row, and since the data is sorted either in ascending or descending order in the index, finding a minimum or maximum value is pretty straightforward: traverse either the left-most or right-most part of the index, which requires only a few lookups (in the best case the number of reads or consistent gets is equal to the height of the index, BLEVEL + 1). To find either a minimum or maximum indexed column value, Oracle can utilize either an INDEX FULL SCAN (MIN/MAX) or an INDEX RANGE SCAN (MIN/MAX) depending on the predicates.
Consider the following schema:
create table students(student_id number primary key, dept_id number, sub_id number, marks number);

create index marks_idx on students(marks);

create index dept_sub_marks_idx on students(dept_id, sub_id, marks);

-- populate some random data
When we issue a query such as the following:
select min(student_id) from students;
Let’s look at the plan:
The corresponding statistics are as follows:
The trace is as follows:
We can confirm the same using the treedump of the index as
follows:
The procedure is basically this:
- Read the root/branch block.
- Read the left-most (first) or right-most (last) entry depending on MIN or MAX.
- Read the first/last leaf block.
The optimizer considers an INDEX FULL SCAN (MIN/MAX) when the SQL uses either the MIN or MAX function on the leading index column, and prefers an INDEX RANGE SCAN (MIN/MAX) when the SQL uses MIN or MAX on a trailing column with an equality predicate on the leading column. Consider the following query:
select max(marks) from students where dept_id = 19 and sub_id = 24;
The corresponding explain plan is as follows:
Remember that an INDEX FULL SCAN (MIN/MAX) or INDEX RANGE SCAN (MIN/MAX) works only if it is possible to find the value by traversing down the left or right path of the index.
Does an INDEX FULL SCAN (MIN/MAX) or INDEX RANGE SCAN (MIN/MAX) require only a few block reads (equal to the height of the index, BLEVEL + 1) all the time? Unfortunately, the answer is no. Before proceeding further, let's take a step back and look at a behavior that is vital to the understanding: even after all the index entries in an index leaf block are deleted, that leaf block still exists in the index structure unless you shrink, coalesce or rebuild the index.
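A sketch of the cleanup options just mentioned (the index name is illustrative):

```sql
alter index marks_idx coalesce;      -- merge sparse/empty leaf blocks in place
alter index marks_idx shrink space;  -- compact the segment and release space
alter index marks_idx rebuild;       -- recreate the index structure entirely
```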
That being said, if the index is on a column that is populated using a monotonically increasing value, be it from an application or a database sequence, and we delete the old rows or column values from the table, the emptied leaf blocks stay linked into the index structure. When a user then issues a query such as min(column_name), due to the way the index is structured and the absence of a starting value, there is always a possibility that the left-most index leaf block contains the minimum value; if the value doesn't exist there, Oracle proceeds along the chain until it finds one. The same applies when finding a max value.
As data is being loaded into the table at the extreme right end, issuing a select max(value) can cause CR blocks to be generated.
Oracle cannot perform an INDEX FULL SCAN (MIN/MAX) on function-based indexes (including descending indexes, as these are internally implemented as function-based indexes) or reverse key indexes, or even an INDEX RANGE SCAN (MIN/MAX) on the trailing column with the predicate specified on the leading column.
Automatic Indexing:
Bitmap Index:
The structure of a bitmap index is essentially the same as a normal index, with the exception of how the data is stored in the leaf and branch blocks (the fundamentals of how entries are organized in the branch blocks, and how Oracle uses these entries to traverse down the B-Tree structure to reach the leaf block that contains the corresponding index key value, remain the same). Instead of storing the ROWID of each corresponding row, Oracle uses a bitmap to represent the ROWIDs for each indexed key value. Let's look at an example:
create table students (student_id number, name varchar2(30), mark1 number, mark2 number);

insert into students select rownum, dbms_random.string(0,30), round(dbms_random.value()*100), round(dbms_random.value()*400) from dual connect by level < 1000000;

create index mark1_idx on students(mark1);

create bitmap index bitmap_mark1_idx on students(mark1) invisible;
Let's peek at the normal (B-Tree) index leaf block (uncompressed):
As we can see from the above, Oracle stores the column value along with the ROWID for each row in the table. Now let’s look at the bitmap index leaf block:
A bitmap index entry stores three additional columns apart from the actual column value:
- Col 0 is the actual column value.
- Col 1 is the start of the ROWID range.
- Col 2 is the end of the ROWID range.
- Col 3 is the actual bitmap (the hex/internal representation of the bitmap).
Each bit in the bitmap corresponds to a particular ROWID (don’t be confused by the hex representation versus the bits of the actual internal bitmap; if the starting and ending ROWIDs are far apart, the bitmap requires more bytes to store). Oracle uses a mapping function to convert the bits into actual ROWIDs. Although the representation of data in the leaf blocks is different, a bitmap index serves the same purpose as a B-Tree (normal) index. Because the bitmap index stores the corresponding ROWIDs in a bitmap representation, the size of a bitmap index is typically smaller than that of a B-Tree index, as shown below:
The following are a few important characteristics that you should be aware of:
- The index key compression feature is not available for bitmap indexes.
- Analyze index … validate structure is supported only for B-Tree indexes.
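The logical shape of a bitmap index entry described above (Col 0 through Col 3) can be sketched in Python. This is an illustrative toy model only, not Oracle’s physical on-disk format; the sample table and values are hypothetical:

```python
# Toy model of bitmap index entries: one entry per distinct key
# value, holding the key, a start/end ROWID range, and a bitmap
# with one bit per row in that range. Not Oracle's actual format.

rows = [  # (rowid, mark1) -- tiny hypothetical sample table
    (0, 7), (1, 3), (2, 7), (3, 7), (4, 3), (5, 9),
]

def build_bitmap_entries(rows):
    entries = {}
    lo, hi = rows[0][0], rows[-1][0]  # covered ROWID range
    for value in sorted({v for _, v in rows}):
        bits = [1 if v == value else 0 for _, v in rows]
        entries[value] = {
            "col0_key": value,       # indexed column value
            "col1_start_rowid": lo,  # start of the ROWID range
            "col2_end_rowid": hi,    # end of the ROWID range
            "col3_bitmap": bits,     # one bit per possible row
        }
    return entries

entries = build_bitmap_entries(rows)
print(entries[7]["col3_bitmap"])  # [1, 0, 1, 1, 0, 0]
```

Note how a single entry covers every row for the key value 7, whereas a B-Tree would store three separate (key, ROWID) pairs for it.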
Let’s look at things from the overhead perspective. For each index key value, since Oracle stores both the starting and the ending ROWID of the range as additional columns in the index, an overhead of
1 (length byte of ROWID) + 6 (starting ROWID of range) + 1 (length byte of ROWID) + 6 (ending ROWID of range) = 14 bytes is mandatory.
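That fixed per-entry arithmetic can be spelled out as follows (the 6-byte figure is the restricted ROWID stored within a segment, per the breakdown above):

```python
# Fixed per-entry overhead of a bitmap index entry: a start ROWID
# and an end ROWID, each preceded by a one-byte length field.
ROWID_LEN_BYTE = 1   # length byte stored before each ROWID
ROWID_BYTES = 6      # restricted ROWID within a segment

overhead = (ROWID_LEN_BYTE + ROWID_BYTES) * 2  # start + end of range
print(overhead)  # 14 bytes, mandatory for every bitmap index entry
```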
Overhead due to the bitmap itself depends on various factors, such as the following:
- Hakan factor
- Clustering factor
Hakan Factor Effect:
We have already seen how the Hakan factor is calculated initially during table creation and how we can set it manually using the clause alter table … minimize records_per_block. The Hakan factor indicates the maximum number of rows that can fit into a particular block, but this value may not always accurately depict the actual number: the greater the row size, the smaller the number of rows that fit inside a block. While creating a bitmap index, Oracle uses the Hakan factor internally to allocate slots in the bitmap so that it can cover all the possible ROWID locations in a particular block. If the difference between the calculated Hakan value and the actual maximum number of rows in a block is high, Oracle will end up allocating extra slots in the bitmap to cover possible ROWIDs that may never exist.
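The wasted-slot effect can be quantified with a back-of-the-envelope sketch. The block count, row size, and Hakan value below are hypothetical numbers chosen for illustration, not values Oracle would report:

```python
# Sketch of why an overestimated Hakan factor inflates a bitmap
# index: one bitmap slot is reserved per possible row slot in each
# block, so slots that can never hold a row still consume bits.

def bitmap_slots(num_blocks, rows_per_block):
    # one bit reserved per (block, row-slot) combination
    return num_blocks * rows_per_block

blocks = 10_000
actual_rows_per_block = 50   # what really fits, given the row size
hakan = 200                  # a (pessimistic) Hakan factor estimate

used = bitmap_slots(blocks, actual_rows_per_block)
reserved = bitmap_slots(blocks, hakan)
print(reserved - used)  # bits reserved for rows that can never exist
```

With these numbers, 1.5 million bits are reserved for row slots that will never be populated, which is exactly the waste that manually lowering the Hakan factor eliminates.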
Let’s look at an example (the DDL is the same as above):

create table students (student_id number primary key, name varchar2(30), mark1 number, mark2 number);

insert into students select rownum, dbms_random.string(0,30), round(dbms_random.value()*100), round(dbms_random.value()*400) from dual connect by level < 1000000;

commit;

create bitmap index mark1_idx on students(mark1);
The Hakan factor Oracle calculates for the above table is as follows:
The size of the bitmap index is as follows:
Because bitmap indexes depend on the Hakan factor, any attempt to modify it using alter table … minimize records_per_block will fail. To modify the Hakan factor, we have to drop any bitmap indexes first and then recreate them afterwards.
SQL> alter table students minimize records_per_block;
alter table students minimize records_per_block
*
ERROR at line 1:
ORA-28602: statement not permitted on tables containing bitmap indexes
After dropping the bitmap index, setting the Hakan factor manually using alter table … minimize records_per_block, and recreating the bitmap index, the index size is reduced as follows:
Effect of Clustering Factor:
When rows with the same column value are stored adjacent to each other in the table blocks (in other words, when the clustering factor is low), Oracle can skip storing some of the additional data required to map the bits to the corresponding ROWIDs, resulting in a significant reduction in the bitmap and ultimately in the index size as well. Let’s look at the index leaf block for the above index:
Let’s reorganize the table as follows, recreate the bitmap index, and look at the contents of the index leaf block (note the truncate before the reload, so the reordered rows replace the originals):

create table temp as select * from students;
drop index mark1_idx;
truncate table students;
insert into students select * from temp order by mark1;
create bitmap index mark1_idx on students(mark1);
As we can see from the above, the length of col 3 is reduced significantly. The size of the bitmap index is now as follows:
The Hakan factor and the clustering factor can significantly influence the size of a bitmap index.
Regardless of the Hakan factor and the clustering factor, the size of a typical bitmap index will be smaller than a regular or compressed B-Tree index (assuming the NDV is low; we will look into this shortly). Now that we have seen the overhead associated with a bitmap index, let’s look at how many distinct values are acceptable.
Structure-wise, a B-Tree index is similar to a bitmap index, and the overhead associated with a bitmap index is basically the inclusion of the additional columns (starting range, ending range, bitmap) in the index leaf blocks (an overhead of 14 bytes is mandatory due to the starting and ending range). If there are enough rows for each distinct value to compensate for the additional overhead, a bitmap index is still preferable.
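A back-of-the-envelope break-even sketch makes this concrete. The constants are illustrative approximations (a length byte plus a 6-byte ROWID per B-Tree entry, and roughly one compressed bit per covered row for the bitmap), not exact Oracle figures:

```python
# Break-even sketch: a bitmap entry pays a fixed ~14-byte ROWID
# range overhead, while a B-Tree pays ~7 bytes of ROWID storage for
# every single row of that key value. Constants are illustrative.

BITMAP_FIXED_OVERHEAD = 14  # start/end ROWID range, from above
BTREE_ROWID_COST = 7        # length byte + 6-byte ROWID per row

def btree_bytes(rows_per_key):
    return rows_per_key * BTREE_ROWID_COST

def bitmap_bytes(rows_per_key):
    # assume roughly one bit per covered row after compression
    return BITMAP_FIXED_OVERHEAD + (rows_per_key + 7) // 8

for rows_per_key in (1, 2, 10, 100):
    print(rows_per_key, btree_bytes(rows_per_key), bitmap_bytes(rows_per_key))
```

With one row per key the bitmap entry is larger than the B-Tree entry, but by ten rows per key it is already several times smaller, which is why a million distinct values over ten million rows can still favor the bitmap.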
Consider the following case:
create table test (t1 number, t2 number);
Load the data into the table such that t2 has about 1 million distinct values and the table has about 10 million rows, and create B-Tree and bitmap indexes as follows:
create index btree_t2_idx on test(t2);
create bitmap index bitmap_t2_idx on test(t2) invisible;
SQL> select count(*), count(distinct t2) From test;
COUNT(*) COUNT(DISTINCT T2)
---------- -----------------
11290010 999985
The sizes of the indexes are as follows:
Even though there are about 1 million distinct values, the bitmap index is still smaller (don’t take this as an ironclad guarantee or rule: if there are many keys with only one or a few rows each and the data is highly skewed, a bitmap index can become larger than a B-Tree index). The compression achieved by a bitmap index (especially via the bitmap itself) is still higher than that of a regular compressed B-Tree index:
drop index btree_t2_idx;
create index btree_comp_t2_idx on test(t2) compress;
The sizes of the indexes are as follows:
DML and Bitmap Indexes:
Let’s look at the bitmap index leaf block again:
By now, we know that a bitmap index has only one index entry for each indexed column value, that the bitmap covers all the associated ROWIDs (the index leaf block contains the hex/internal representation of the bitmap bits), and that there is only one flag to indicate whether the index entry is locked. When we insert a row into the table with a t2 value of 1, Oracle in turn locks the entire index entry, thereby locking all the rows that correspond to the t2 value of 1. This can be seen in an index leaf block dump taken immediately after an insert/update/delete statement.
insert into test (t1, t2) values (1,1);
The index leaf block dump is as follows:
As we can see from the above, the entire index entry is locked. Something similar happens when we update a column value (remember that an update is basically a delete plus an insert), but for a bitmap index an update operation is more complex and can result in two index entries:
- Reconstruct the bitmap of the old index entry by removing the updated row’s bit, and write the updated entry.
- Insert a new index entry into the leaf block.
Surprisingly, even though the bitmap changes apply to the old entry, Oracle preserves the index entry of the old value instead of modifying its bitmap in place.
Nonetheless, bitmap indexes should only be created on tables that have little or no DML activity. (OLTP workloads can also benefit from bitmap indexes if the base tables are static; remember that, beyond the reduced index size, Oracle is more likely to perform a bitmap index single value scan or bitmap index range scan rather than a full table scan, which in reality is less costly and less resource intensive for equality predicates.)
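The locking-granularity difference behind this recommendation can be sketched in Python. This is a conceptual model only (hypothetical table and lock bookkeeping, not Oracle’s enqueue mechanics):

```python
# Conceptual sketch: a B-Tree locks one (key, ROWID) entry per
# modified row, while a bitmap index locks the whole index entry
# for a key value -- effectively every row sharing that value.

def rows_blocked_btree(modified_rowid, table):
    # only the touched row's own index entry is locked
    return {modified_rowid}

def rows_blocked_bitmap(modified_rowid, table):
    # the single entry for this key value covers all its rows
    value = table[modified_rowid]
    return {rid for rid, v in table.items() if v == value}

table = {0: 1, 1: 1, 2: 2, 3: 1}  # rowid -> t2 value
print(rows_blocked_btree(0, table))   # {0}
print(rows_blocked_bitmap(0, table))  # {0, 1, 3}
```

A single-row change under the bitmap index effectively serializes concurrent DML against every row with the same key value, which is why heavy-DML tables are poor candidates.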
Bitmap Conversion To/From ROWIDs:
Bitmap conversion typically involves converting an entry in the bitmap into a ROWID in the table. This conversion can happen both ways:
- Converting an entry in the bitmap to a ROWID: BITMAP CONVERSION TO ROWIDS.
- Converting a ROWID into a bitmap entry: BITMAP CONVERSION FROM ROWIDS.
Types of Bitmap Access Paths:
BITMAP INDEX SINGLE VALUE:
The optimizer uses a Bitmap Index Single Value access path when the predicate contains an equality operator against a bitmap-indexed column. The server process traverses down the B-Tree structure to the leaf block(s) that contain the value and converts the bitmaps into ROWIDs using TABLE ACCESS BY INDEX ROWID BATCHED.
Since the key value is stored in the index leaf blocks in the form of a bitmap, Oracle can effectively make use of batched table access and thereby asynchronous IO, subject to various factors such as the client fetch size. Remember that since we create bitmap indexes on low-cardinality columns, even when the number of rows retrieved from the table is very high, the optimizer still considers an index scan rather than a full table scan (this is not the case with a normal B-Tree index).
BITMAP INDEX RANGE SCAN:
When the SQL statement contains range predicates such as <, >, <=, >=, in, etc., the optimizer can consider a BITMAP INDEX RANGE SCAN.
Bitmap Merge:
If there are bitmap indexes on the individual columns referenced in a SQL statement’s predicates, Oracle has the ability to merge the results obtained from each index (BITMAP INDEX SINGLE VALUE or BITMAP INDEX RANGE SCAN, or both) using Boolean operators such as AND and OR.
Consider the following example:

create table students (student_id number, name varchar2(40), dept_id number, sub_id number, marks number);
//populate the table
create bitmap index sub_idx on students(sub_id);
create bitmap index marks_idx on students(marks);
//gather statistics

Consider the following query:

select * from students where sub_id = 12 and marks < 2;

The explain plan is as follows:
In the above query, the server process first obtains the bitmaps of all the index keys that fall under the range marks < 2 and merges them using a Boolean OR into a single bitmap, and then finally performs a Boolean AND between this resultant bitmap and the bitmap obtained for the key sub_id = 12.
Consider the following query:
select * from students where sub_id = 12 and marks = 2;
As we can see from the above, the server process performs a Boolean AND between two bitmaps and uses the resultant bitmap to generate a list of ROWIDs. The AND operation is similar to the following:
MARKS_IDX    | 1 | 0 | 1 | 0 | 1 | 1 | 0
SUB_IDX      | 1 | 1 | 0 | 1 | 0 | 1 | 0
FINAL BITMAP | 1 | 0 | 0 | 0 | 0 | 1 | 0
Similarly, Oracle can also perform an OR operation, as shown below:
In data warehouse environments, bitmap indexes are quite important for achieving star transformation.
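The BITMAP AND and BITMAP OR merges above can be sketched directly on Python lists, with one bit per candidate row (the bitmaps are the ones from the AND table):

```python
# Bitmap merge sketch: the server process combines per-index
# bitmaps bit by bit, then converts set bits back into ROWIDs.

marks_idx = [1, 0, 1, 0, 1, 1, 0]
sub_idx   = [1, 1, 0, 1, 0, 1, 0]

def bitmap_and(a, b):
    return [x & y for x, y in zip(a, b)]

def bitmap_or(a, b):
    return [x | y for x, y in zip(a, b)]

final = bitmap_and(marks_idx, sub_idx)
print(final)  # [1, 0, 0, 0, 0, 1, 0]

# only rows whose final bit is set are converted back to ROWIDs
matching_rows = [i for i, bit in enumerate(final) if bit]
print(matching_rows)  # [0, 5]
```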
Index Range Scan Optimizations:
Starting from 12c, Oracle predominantly uses batched table access (TABLE ACCESS BY INDEX ROWID BATCHED) and generic table access (TABLE ACCESS BY INDEX ROWID) while accessing table contents through an index. The performance of batched table access depends on whether asynchronous IO capability is available through filesystemio_options and on the file system layout. Remember that you may not notice any performance benefit over traditional blocking IO (pread64) if all the data files are stored on a single disk, or if the data files are not striped across multiple physical disks or RAID volumes. By default, when you create a database, the filesystemio_options parameter is set to none, which prevents database processes such as LGWR, DBWR, and the server processes from utilizing the asynchronous IO capabilities provided by the underlying operating system. If filesystemio_options is set to none, all the read calls (pread64, readv, etc.) initiated by the processes will be synchronous, and if the data files reside on a volume that is striped/mirrored across multiple disks, synchronous IO will generally be slower than an asynchronous read operation.
Generic/Normal/Conventional:
Regardless of the value of the filesystemio_options parameter, Oracle reads one block at a time using the synchronous pread64 OS call in the following cases:
- The ORDER BY clause contains a column that is indexed, and the optimizer decides to use that particular index for accessing the table blocks, provided that:
  - For single-column indexes:
    - The predicate doesn't contain an = condition for non-unique indexes (for unique indexes it doesn't matter).
    - The predicate contains an IN clause or bound-based predicates such as <, >, <=, >=, between, etc., together with an order by on the same column.
  - For multi-column indexes:
- When performing an index unique scan, be it on an index or table block (buffer warmup is a different matter).
- During an index range scan, if an index key value results in fewer than 5 ROWIDs and these ROWIDs point to different table blocks, or if all the ROWIDs of a particular index key value (assuming the fetch size is greater than the number of ROWIDs retrieved) map to fewer than 5 table blocks (this threshold can change to 6 or 7 db file sequential reads initially if the session encounters chained or migrated rows).
Oracle's ability to use large batches (the number of table block reads batched into one single call) is reduced by the following:
- When the clustering factor is very low and the row sizes exceed a particular threshold.
- When using range predicates (bounded or unbounded), if the subsequent entries in the index leaf block point to table blocks that are clustered (in this case, this is directly dependent on fetch sizes as well).
- The client ResultSet fetchsize and/or arraysize.
- The use of additional non-indexed columns as predicates in the SQL statement.
- When operations that make use of the PGA, such as aggregate functions, an order by on a different column, or grouping, are used in the query, the clustering factor for the index is very low, the row sizes are small, and the subsequent ROWIDs in the index leaf block point to the same block or different contiguous blocks, up to a certain point beyond which Oracle starts using db file scattered read (basically a single large IO call that reads two or more blocks in the same call).
Batched Table Access:
Upon retrieving multiple ROWIDs from the index leaf blocks, a session performing table access via TABLE ACCESS BY INDEX ROWID BATCHED can batch multiple table block read requests and wait on the db file parallel read wait event until these blocks are read into memory. A session can batch 2 to 127 blocks, depending on various factors such as those mentioned above. You are most likely to notice Oracle batching 127 table blocks when the SQL statement uses the PGA for operations such as sorting, grouping, or aggregate functions, and when using equality or range predicates on the indexed columns while the clustering factor is high (in this case, if the client fetch size is greater than 127, Oracle can effectively batch up to 127 blocks).
The performance of a query executed using TABLE ACCESS BY INDEX ROWID BATCHED relies heavily on asynchronous IO, and if the data files are spread evenly across multiple disks, the performance of the query improves considerably.
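The batching behavior described above can be sketched as grouping ROWIDs by table block and issuing at most 127 distinct blocks per read call. The rowid-to-block mapping below is hypothetical, purely to make the sketch self-contained; it is not Oracle's algorithm:

```python
# Sketch of batched table access: ROWIDs fetched from the index are
# grouped by table block, and up to 127 distinct blocks are issued
# as one multi-block request (the db file parallel read event).

MAX_BATCH_BLOCKS = 127

def batch_block_reads(rowids, rows_per_block=100):
    batches, current = [], []
    for rid in rowids:
        block = rid // rows_per_block  # hypothetical rowid->block map
        if block not in current:
            current.append(block)
        if len(current) == MAX_BATCH_BLOCKS:
            batches.append(current)   # issue one batched read
            current = []
    if current:
        batches.append(current)       # final, partially-filled batch
    return batches

# every other row qualifies, touching many distinct blocks
batches = batch_block_reads(range(0, 100_000, 50))
print(len(batches[0]))  # 127 blocks in the first batched read
```

When the qualifying ROWIDs touch only a handful of blocks, the batches stay small, which is one reason well-clustered data sees fewer, larger db file parallel read calls.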
Remember that even if an explain plan indicates TABLE ACCESS BY INDEX ROWID BATCHED, Oracle may not actually use db file parallel read (batched IO, i.e., batching multiple table block read requests).
Also remember that loading data into a table using SELECT … FROM DUAL CONNECT BY LEVEL may not accurately depict a load from multiple sessions, which is typical of OLTP systems; since we are loading the data using a single session, the clustering factor usually tends to be much lower than when loading the data from multiple sessions.
Creating an Index:
- alter system dump datafile # block # dumps both the in-memory version of the block and the on-disk version of the block.
- The ORA-01450 (maximum key length exceeded) error is related to your db_block_size: the maximum key length for your database is approximately 40% of the database block size minus some overhead. However, you have a bigger problem getting the CBO to use an index with a large key value… skip scan or expansion…
- serial_direct_read applies only to tables and not to indexes… the FTS algorithm changes when performing index fast full scans… db file scattered read!!!
- Enq: KO – fast object checkpoint… dirty blocks… CR block generation…
- Foreign keys.
- select avg(mark3) from randomload where mark1 in (1,2) and mark2=19 order by mark1 desc: index skip scan descending havoc…
- Table move increases the clustering factor… create table as select can increase the clustering factor.
- Chained rows always result in db file sequential reads; migrated rows don’t share the same fate. Scattered reads, midpoint insertion strategy… db file sequential read (not cached), full table scan with chained rows…
- Index fast full scan – null/not null – character string. Index fast full scan null dependency. Index rebuild – other indexes becoming undesirable.
- select mark6, count(*) from randomload group by mark6 -> index fast full scans.
Batched table access will not work in the following cases:
create table t1 as
select rownum n1
,trunc((rownum-1)/3) n2
,rpad('x',100) v1
from dual
connect by level <= 1e4;
create index t1_ind1 on t1(n2, n1);
select * from t1 where n2 = 3329 order by n1 desc;
select * from table(dbms_xplan.display_cursor);
alter index t1_ind1 invisible;
create index t1_ind2 on t1(n2);
select * from t1 where n2 = 3329 order by n1 desc;
select * from table(dbms_xplan.display_cursor);
- No distinct 1-1 mapping between branch entries and index key values.
- As new blocks are brought in, old blocks get flushed (midpoint insertion strategy)… check LRU chains, latches…
- NOT NULL constraint – index full scan vs. index fast full scan.
- Cursor-duration temporary tables – log file sync, SCN increase.
- db file parallel read – fetch size when fetching multiple rows, PGA allocation…
- Lower fetch sizes increase IOPS on the server. SSD controller cache write performance. In-line view unnesting…
- Why is an insert statement slow (tracing of no use)?
- Scattered reads – table and index while scanning the table using an index range or skip scan.