GotoDBA Database Tuning,Database Development,Troubleshooting High CPU With Queries Without Bind Variables

High CPU With Queries Without Bind Variables

Performance problems can be tricky, and everyone has a different method to approach them depending on the exact issue. In this case I’d like to cover a specific case of 100% CPU at the OS level.

Describing the problem

As I said, in this specific scenario I had a server with 100% CPU for some time. This environment usually doesn’t get to 100% CPU so we started investigate.

In some cases, 100% CPU is caused by a change of execution plan for specific queries. If a plan changed, this is how we can find the queries:

  • If this is a long running queriט we can identify the processes that consume the CPU using the linux “top” tool and then drill down to the database and find these sessions and what they do
  • If this is a short running query, we won’t be able to see a specific process using “top” (they will change all the time), but we can identify the query in AWR as it shows the total performance per query. In this case we might see a specific query that was executed many times and consumed a lot of CPU in total

However, in this case we couldn’t see specific processes in “top”, and AWR didn’t help as well. Out of 50K seconds of CPU, the top query was about 1K seconds in total. So it’s clearly not a specific query. What can it be?

When not using bind variables

This is obviously a specific change with the database behavior, so we still assumed that the root cause is execution plans change, but we couldn’t find the query.

The thing with this application is that in many cases it doesn’t use bind variables (by design), so the problem with AWR here is that it can’t aggregate similar queries. The aggregation in AWR is by SQL_ID, but queries with different literals have different SQL_IDs.

How can we find the queries that are very similar, but not identical?

Finding the query

With queries that use literals instead of bind variables the SQL_ID is different, but you know what is the same? The execution plan.

In order to find the queries, I needed to aggregate by execution plan and see what’s going on. I did that on ASH and on the active sessions as well.

Here is the query for the active sessions:

select 
   sq.plan_hash_value,
   count(s.sid)
from 
   v$session s,
   v$sql sq
where
   s.sql_id=sq.sql_id
   and s.SQL_CHILD_NUMBER=sq.CHILD_NUMBER
   and s.username is not null
group by sq.plan_hash_value
order by count(*) desc;

In the result I saw 4 plans with at least 5 concurrent sessions each (and in the ASH there were much more obviously). I checked and all 4 were basically the same query as well.

Once we found this, we could check what was happening and apparently a user went nuts and generated a bunch of reports on a huge time frame (slice by slice) and this caused the entire system to reach 100% CPU. Once we killed this process at the application level, the system went back to its normal ~40% CPU.

Leave a Reply

Your email address will not be published. Required fields are marked *

Related Post