On 2/21/11 11:24 PM, Åke Olsson wrote:
> On a few occasions we have seen this happening in a batch subsystem
> (one of many sbs on the machine):
> A job starts using way too many resources and needs to be put out of
> its misery. We try killing the job *IMMED.
Why not CHGJOB RUNPTY(99) or HLDJOB instead?
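A minimal sketch of that approach, using a hypothetical qualified job name:

```
CHGJOB JOB(123456/QUSER/BADJOB) RUNPTY(99)    /* demote the job to the lowest run priority */
HLDJOB JOB(123456/QUSER/BADJOB) SPLFILE(*NO)  /* or hold the job outright */
```

Either choice leaves the job, and its evidence, intact for later debugging, unlike an end request.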
> It still stays on - forever - constantly using up around 30% of the
> CPU on a 570 box. Trying to use ENDJOBABN does not help.
Ending the job abnormally is IMNSHO best avoided until after problem
debug has been performed, especially if the desire is to prevent
recurrence of the failure scenario. ENDJOBABN is much like an attempt
to dismiss\ignore what appears to be a severe problem.
> The subsystem cannot be stopped.
A subsystem monitor job cannot end until all of its jobs have ended.
If having the *SBSD available to process jobs is desirable, then best
not to ENDSBS until the effects of the ENDJOB and\or ENDJOBABN are
complete; i.e. until the job is terminated.
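For example, a controlled subsystem shutdown attempted only after verifying that the problem job has terminated (the subsystem name is a placeholder):

```
WRKSBSJOB SBS(MYBATCHSBS)   /* confirm the ended job is no longer active in the subsystem */
ENDSBS SBS(MYBATCHSBS) OPTION(*CNTRLD) DELAY(300)
```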
> The "undead" job cannot be held or changed in any other way.
After or before ENDJOB? After ENDJOB [and esp. after ENDJOBABN] very
little support exists for external activity against the job, so little
surprise that hold or change might not function then. But even before
ending a job, the hold or change may time out such that the effects are
either pending or ignored; a timeout occurs because the job cannot process
the event, for example if the process is running in the LIC versus above
the LIC.
> One peculiar thing about these "undead" jobs is that they show no
> program stack despite the fact that they use both I/O (loads of open
> files) and CPU.
No visible stack before attempting to end them, or after? Presumably
after, which is a good reason to adjust the run priority and then hold
versus end. If before, the jobs are presumably running a LIC operation
for which the "stack" is visible only by a "task dump" from STRSST [or
if the process is available to process events, then DMPJOBINT; the job
must either already be serviced, or accept the service-job event, and
then PRTINTDTA or STRSST to spool the dump].
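A sketch of that dump sequence, assuming the job will still accept the service-job event (the qualified job name is a placeholder and parameters are kept minimal):

```
STRSRVJOB JOB(123456/QUSER/BADJOB)  /* service the job, if it accepts the event */
DMPJOBINT                           /* request the internal dump of the serviced job */
PRTINTDTA                           /* spool the internal data for review */
ENDSRVJOB                           /* end servicing the job */
```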
> The only solution seems to be to IPL the system, which is highly
> unpopular with the users.
Was the spooled joblog from any of the jobs reviewed? What about the
other spool files produced by the job, e.g. dumps and DSPJOB output which
may possibly have been produced as logging? And the history for the job;
DSPLOG JOB(named)?
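For example (the qualified job name is a placeholder):

```
DSPLOG JOB(123456/QUSER/BADJOB)                /* history log entries for the job */
WRKJOB JOB(123456/QUSER/BADJOB) OPTION(*SPLF)  /* review the joblog and any dumps it spooled */
```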
> The box is on V6R1. As for PTF level I do not know. I simply assume
> that the sysadmins do their job with regard to installing critical
> and cum PTFs.
An existing preventive PTF for such a situation would be in the HIPer
[High Impact\Pervasive] list as recorded in the PSP [Preventive Service
Planning] on the IBM support pages.
> Any ideas? Does this ring a bell?
Instead of all the effort to end the job, why not redirect that effort
into investigating the problem and searching for an origin? Are these jobs
known to be running a specific application, such that a trace or service
can be initiated in advance of their reaching the point of "misery"? If no
stack is available and there was no ability to prepare with
service\debug, and even if a trace was enabled, get the stack from the
service tools [at the point where previous attempts were made to end].
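Where the application is known in advance, preparing a trace might look like the following sketch (the qualified job name is hypothetical and the trace options are kept minimal):

```
STRSRVJOB JOB(123456/QUSER/BADJOB)  /* service the suspect job before it misbehaves */
TRCJOB SET(*ON) TRCTYPE(*ALL)       /* begin tracing program flow and data */
/* ... let the job run until the problem appears ... */
TRCJOB SET(*OFF) OUTPUT(*PRINT)     /* stop the trace and spool the results */
ENDSRVJOB
```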
Regards, Chuck