When Even Histograms Can't Help

Here is an interesting optimizer case where up-to-date statistics and histograms cannot solve the performance problem. It might be an uncommon case, but it happened to one of my clients, and this post is the result of some research into it.

The problem

The problem I faced was a query involving a few tables that never finished. I generated the execution plan and realized that the main issue was the table Oracle chose to start with. It was quite a simple table, about 1.5M rows, accessed with a single LIKE predicate on a string column. The optimizer thought it would get 1 row from this table and built the entire plan based on that. In reality it got about 250 rows, and that messed everything up.
When I added a CARDINALITY hint to tell the optimizer to expect 200 rows, the query finished and everything looked OK. So obviously I turned to statistics and histograms, but everything was up to date and looked good. So I reproduced the issue.
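
For reference, this is roughly what the hinted query looked like; the table and column names here are invented placeholders, and the 200 only needs to be in the right ballpark:

-- hypothetical names, not the client's actual objects
select /*+ cardinality(t 200) */ *
from   some_table t
where  t.some_name like 'ABC%';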

Table structure and data

I checked what the table looks like, took a look at the data, and this is what I found:

  • The table had 1.5M rows
  • The column that was used in the query was a varchar2 and had a unique constraint
  • The range of values in this column was huge (strings starting with almost any character: digits, uppercase and lowercase letters, even dots)
  • The LIKE predicate filtered on the first 3 characters, and the distribution of these 3-character prefixes was far from even (a quick way to check this is sketched right after this list)
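
Since I can't share the client's data, here is a quick way to see that kind of prefix skew, run against the reproduction table built in the next section:

-- count rows per 3-character prefix and list the most common ones
select substr(name, 1, 3) prefix, count(*) cnt
from   card_issue
group  by substr(name, 1, 3)
order  by cnt desc
fetch first 5 rows only;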

Reproduction

In order to reproduce it I created a simple table with 620K rows. The script is quite simple: I insert batches of rows, fixing the first character and generating the rest randomly. Then I update a number of rows so that they share the same first 3 characters (in my case it was 108 rows):

create table card_issue(id number, name varchar2(100));
alter table card_issue add constraint card_uk unique (name);

declare
  my_id number := 0;
  my_c  varchar2(1);
begin
  -- 10,000 rows starting with each digit '0'..'9'
  for i in 0..9 loop
    my_c := chr(ascii('0') + i);
    for j in 1..10000 loop
      my_id := my_id + 1;
      insert into card_issue
        values (my_id, my_c || dbms_random.string('p', 6));
    end loop;
    commit;
  end loop;
  -- 10,000 rows starting with each lowercase letter 'a'..'z'
  for i in 0..25 loop
    my_c := chr(ascii('a') + i);
    for j in 1..10000 loop
      my_id := my_id + 1;
      insert into card_issue
        values (my_id, my_c || dbms_random.string('p', 6));
    end loop;
    commit;
  end loop;
  -- 10,000 rows starting with each uppercase letter 'A'..'Z'
  for i in 0..25 loop
    my_c := chr(ascii('A') + i);
    for j in 1..10000 loop
      my_id := my_id + 1;
      insert into card_issue
        values (my_id, my_c || dbms_random.string('p', 6));
    end loop;
    commit;
  end loop;
end;
/

-- make every name that starts with 'EN' share the same first 3 characters
update card_issue set name = 'ENG' || dbms_random.string('p', 3)
where name like 'EN%';
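
Statistics with a histogram on NAME are needed for what follows; one way to gather them, explicitly allowing up to 254 buckets (matching the output further down), is:

begin
  dbms_stats.gather_table_stats(
    ownname    => user,
    tabname    => 'CARD_ISSUE',
    method_opt => 'for columns name size 254',
    cascade    => true);
end;
/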

Now let’s query (after gathering statistics):

SQL> select * from card_issue where name like 'ENG%';
108 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 1973207663
--------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |            |     1 |    10 |     4   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| CARD_ISSUE |     1 |    10 |     4   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | CARD_UK    |     1 |       |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("NAME" LIKE 'ENG%')
       filter("NAME" LIKE 'ENG%')
Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        118  consistent gets
          0  physical reads
          0  redo size
       3518  bytes sent via SQL*Net to client
        684  bytes received via SQL*Net from client
          9  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
        108  rows processed

As you can see, the optimizer thinks it will get 1 row, but we actually have 108 rows starting with 'ENG'.

Why is this happening?

Let’s check the statistics:

SQL> select TABLE_NAME, COLUMN_NAME, NUM_DISTINCT, DENSITY, NUM_NULLS, NUM_BUCKETS
   2 from user_tab_columns
   3 where table_name='CARD_ISSUE' and column_name='NAME';
TABLE_NAME           COLUMN_NAME          NUM_DISTINCT                       DENSITY  NUM_NULLS NUM_BUCKETS
-------------------- -------------------- ------------ ----------------------------- ---------- -----------
CARD_ISSUE           NAME                 620000                 .000001612903225806          0         254
SQL> select TABLE_NAME,COLUMN_NAME,ENDPOINT_NUMBER,ENDPOINT_ACTUAL_VALUE,ENDPOINT_REPEAT_COUNT
   2 from user_histograms
   3 where table_name='CARD_ISSUE'
   4 and column_name='NAME'
   5 order by endpoint_number;
TABLE_NAME           COLUMN_NAME          ENDPOINT_NUMBER ENDPOINT_ACTUAL_VALU ENDPOINT_REPEAT_COUNT
-------------------- -------------------- --------------- -------------------- ---------------------
CARD_ISSUE           NAME                               0 0 5.Ck                                   0
CARD_ISSUE           NAME                               1 06<$NP/                                  0
CARD_ISSUE           NAME                               2 0Mk.]l`                                  0
CARD_ISSUE           NAME                               3 0drFG,&                                  0
CARD_ISSUE           NAME                               4 0|WH(YY                                  0
CARD_ISSUE           NAME                               5 154C2=e                                  0
CARD_ISSUE           NAME                               6 1KB03wL                                  0
CARD_ISSUE           NAME                               7 1b[e/Nx                                  0
CARD_ISSUE           NAME                               8 1z,On<H                                  0
...
CARD_ISSUE           NAME                              58 E.|fOEb                                  0
CARD_ISSUE           NAME                              59 EF EH=6                                  0
CARD_ISSUE           NAME                              60 E]=h<p.                                  0
...

What do we see here?

  • First, the number of distinct values is 620,000. This makes sense, as it is a unique column with no NULL values
  • This is also the reason for the very low density: with every value unique, the density is simply 1/620,000
  • In USER_HISTOGRAMS we see the buckets (there are 254 in total). The ENG% strings fall into the bucket bounded by endpoints 59 ('EF EH=6') and 60 ('E]=h<p.'), but Oracle has no further information about what is inside that bucket
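
To see what that means, we can count the rows between those two endpoint values (taken from the output above) and compare with the 108 ENG rows:

-- on the order of 620,000/254 (roughly 2,400) rows land between these endpoints,
-- yet only 108 of them start with 'ENG'
select count(*) rows_in_bucket,
       count(case when name like 'ENG%' then 1 end) eng_rows
from   card_issue
where  name >  'EF EH=6'
and    name <= 'E]=h<p.';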

At this point, when Oracle tries to figure out how many strings start with 'ENG', it simply doesn't have accurate enough information to reach the correct number, and the calculation leads it to believe there will be only 1 row.
This happens because the ENG% strings behave differently from the rest of the data. Most strings in the column don't share their first 3 characters with many other rows, while the ENG ones do. When many strings share the same prefix throughout the table, the statistics and histogram will reflect that and Oracle can produce a better estimate.
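
A convenient way to see this gap between estimate and reality for any query is to run it with the gather_plan_statistics hint and compare the E-Rows and A-Rows columns:

select /*+ gather_plan_statistics */ *
from   card_issue
where  name like 'ENG%';

-- E-Rows shows the optimizer's estimate (1 here), A-Rows the actual count (108)
select * from table(dbms_xplan.display_cursor(format => 'ALLSTATS LAST'));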

Solutions

I couldn't find a solution that helps Oracle figure this scenario out on its own. I guess this is where the dynamic features come in handy (cardinality feedback in 11.2 and adaptive plans in 12.1). To fix it from the application side, I came up with 2 solutions:

  1. It's funny, but when I confuse Oracle a little it seems to work. If I change "name like 'ENG%'" to "substr(name,1,3)='ENG'" I get a good plan (see the sketch after this list). The reason is that Oracle doesn't really know how many rows will come back from this predicate, so it doesn't assume one row and build a plan around that. Note that if you have an index on name and want to use it, this trick will prevent the index from being used. I didn't try creating extended statistics (and/or a function-based index) on the substr expression to see what happens; maybe in a follow-up post
  
  2. Add a CARDINALITY hint to manually tell Oracle that the cardinality is 200 (a close-enough number is good). This can be done with the hint itself or, if you cannot change the code, using dbms_sqldiag_internal (as described here).
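
Here is what both workarounds look like on the reproduction table; the 200 in the hint is just a ballpark figure:

-- solution 1: the substr rewrite (note that this prevents the index on name from being used)
select * from card_issue where substr(name, 1, 3) = 'ENG';

-- solution 2: tell the optimizer the expected cardinality explicitly
select /*+ cardinality(card_issue 200) */ *
from   card_issue
where  name like 'ENG%';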
