On 2/21/11 11:24 PM, Åke Olsson wrote:
> On a few occasions we have seen this happening in a batch subsystem
> (one of many sbs on the machine):
> A job starts using way too many resources and needs to be put out of
> its misery. We try killing the job *IMMED.
Why not CHGJOB RUNPTY(99) or HLDJOB instead?
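A minimal sketch of that approach, using a hypothetical qualified job name:

```
CHGJOB JOB(123456/QUSER/BADJOB) RUNPTY(99)    /* demote the job to the lowest run priority */
HLDJOB JOB(123456/QUSER/BADJOB) SPLFILE(*NO)  /* or hold the job outright */
```

Either choice leaves the job, and its evidence, intact for later debugging, unlike an end request.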
> It still stays on - forever - constantly using up around 30% of the
> CPU on a 570 box. Trying to use ENDJOBABN does not help.
Ending the job abnormally is IMNSHO best avoided until after problem
debug has been performed, especially if the desire is to prevent
recurrence of the failure scenario. ENDJOBABN is much like an attempt
to dismiss\ignore what appears to be a severe problem.
> The subsystem cannot be stopped.
A subsystem monitor job cannot end until all of its jobs have ended.
If having the *SBSD available to process jobs is desirable, then best
not to ENDSBS until the effects of the ENDJOB and\or ENDJOBABN are
complete; i.e. until the job is terminated.
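For example, a controlled subsystem shutdown attempted only after verifying that the problem job has terminated (the subsystem name is a placeholder):

```
WRKSBSJOB SBS(MYBATCHSBS)   /* confirm the ended job is no longer active in the subsystem */
ENDSBS SBS(MYBATCHSBS) OPTION(*CNTRLD) DELAY(300)
```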
> The "undead" job cannot be held or changed in any other way.
After or before ENDJOB? After ENDJOB [and esp. after ENDJOBABN] very
little support exists for external activity against the job, so little
surprise that hold or change might not function then. But even before
ending a job, the hold or change may time out such that the effects are
either pending or ignored; a timeout occurs because the job cannot process
the event, for example if the process is running in the LIC versus above
the LIC.
> One peculiar thing about these "undead" jobs is that they show no
> program stack despite the fact that they use both I/O (loads of open
> files) and CPU.
No visible stack before attempting to end them, or after? Presumably
after, which is a good reason to adjust the run priority and then hold
versus end. If before, the jobs are presumably running a LIC operation
for which the "stack" is visible only by a "task dump" from STRSST [or
if the process is available to process events, then DMPJOBINT; the job
must either already be serviced, or accept the service-job event, and
then PRTINTDTA or STRSST to spool the dump].
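A sketch of that dump sequence, assuming the job will still accept the service-job event (the qualified job name is a placeholder and parameters are kept minimal):

```
STRSRVJOB JOB(123456/QUSER/BADJOB)  /* service the job, if it accepts the event */
DMPJOBINT                           /* request the internal dump of the serviced job */
PRTINTDTA                           /* spool the internal data for review */
ENDSRVJOB                           /* end servicing the job */
```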
> The only solution seems to be to IPL the system, which is highly
> unpopular with the users.
Was the spooled joblog from any of the jobs reviewed? What about the
other spool files produced by the job, e.g. dumps and DSPJOB output which
may possibly have been produced as logging? And the history for the job;
DSPLOG JOB(named)?
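For example (the qualified job name is a placeholder):

```
DSPLOG JOB(123456/QUSER/BADJOB)                /* history log entries for the job */
WRKJOB JOB(123456/QUSER/BADJOB) OPTION(*SPLF)  /* review the joblog and any dumps it spooled */
```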
> The box is on V6R1. As for PTF level I do not know. I simply assume
> that the sysadmins do their job with regard to installing critical
> and cum PTFs.
An existing preventive PTF for such a situation would be in the HIPer
[High Impact\Pervasive] list as recorded in the PSP [Preventive Service
Planning] on the IBM support pages.
> Any ideas? Does this ring a bell?
Instead of all the effort to end the job, why not redirect that effort
into investigating the problem and searching for an origin? Are these jobs
known to be running a specific application, such that a trace or service
can be initiated in advance of their reaching the point of "misery"? If no
stack is available and there was no ability to prepare with
service\debug, and even if a trace was enabled, get the stack from the
service tools [at the point where previous attempts were made to end].
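Where the application is known in advance, preparing a trace might look like the following sketch (the qualified job name is hypothetical and the trace options are kept minimal):

```
STRSRVJOB JOB(123456/QUSER/BADJOB)  /* service the suspect job before it misbehaves */
TRCJOB SET(*ON) TRCTYPE(*ALL)       /* begin tracing program flow and data */
/* ... let the job run until the problem appears ... */
TRCJOB SET(*OFF) OUTPUT(*PRINT)     /* stop the trace and spool the results */
ENDSRVJOB
```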
Regards, Chuck