Re: Run-Away query stops system -- MIDRANGE-L

Hi Don,

Assuming that you're on release 7.2 or later, you have a couple of options
for clamping down on runaway queries.

There are job-based controls that limit the amount of CPU or temporary
storage that a job can use. It sounds like CPU probably wasn't the
problem. It's possible, though, that the query engine had to build
temporary structures in memory to as it was processing the query, and this
may have been keeping the I/O subsystem busy. This would show up as disk
activity and as growing temporary storage usage. IBM recently enhanced the
support for job-based temporary storage limits to include storage for SQL
queries (CPU limits have always applied.) I wrote an article about it last
spring:
https://ibm.co/2BbmoQL

Another option is to use the QRYSTGLMT and QRYTIMLMT parameters on the
CHGQRYA command. This is somewhat less useful, since the limit kicks in
based on the optimizer's anticipated estimate of resource usage (as
opposed to the actual measured usage). Depending on how much the optimizer
can figure out about your data, these estimates may be very good or they
may not.

Thank you,

Tim Clark
DB2 for IBM i

From: Don Brown <DBrown@xxxxxxxxxx>
To: "Midrange- L" <midrange-l@xxxxxxxxxxxx>,
Date: 11/12/2018 05:30 PM
Subject: Run-Away query stops system
Sent by: "MIDRANGE-L" <midrange-l-bounces@xxxxxxxxxxxx>

Yesterday the IBMi became totally unresponsive. It took 20 or 30

attempts

to get a console connection to the system

This is a single system using lan console (No HMC) with V7R3 and very
current on PTF's

The cause of the problem was later identified that a user running a very
bad query. This was discovered when performance started to go south

again

and one of the IT people walked around all the departments and found a

user

re-running the query that did not finish when the system failed!!

The application has a user tool for creating adhoc queries.
When the user log on there is a batch job created in in a user

subsystem.

When the query runs a QZDASOINIT jobs provides the database connection

with

user profile QUSER in QUSRWRK

I checked everything I could to see what was chewing up resources and
nothing identified this job.

I believe the joining of tables in this query resulted in many new

indexes

being required over files with millions of records.

I would like to identify what could have been done to identify the
responsible job - any suggestions ?

The system CPU usage from wrkactjob was < 10%
Disk % busy was high between 30 - 50 %
System storage was fine at 62%
No relevant messages were on the system operator queue or history log
Displaying the history log resulted in CPD2537 Not all messages logged -
Also pointing to the high disk activity
Wrksysact did not show any abnormal jobs or any high I/O
The full version of system performance tools is not available on this
system.

How could I have identified this job using the available tools ?

While I am still doing a lot of reading I believe there is a way to

govern

a query that if it uses more than a preset resource or exceeds a run
duration the system can cancel the query - Has anyone implemented this

have any relevant links to how this is implemented ?

Appreciate any suggestion or advice.

Don Brown