Re: ALCOBJ Fails, WRKOBJLCK Shows No Locks -- MIDRANGE-L

On 11/30/2011 9:44 PM, CRPence wrote:

On 30-Nov-2011 13:54 , Sam_L wrote:

A batch job tried:

ALCOBJ OBJ((xxxxxPF *FILE *SHRNUP *N)) WAIT(0)

It failed with CPF1002.

Is there a spooled joblog taken with LOG(4 0 *SECLVL); presumably the
actual request "RQS message" was logged too? Any QPDSPJOB produced?

Yes, I have the job log. CPF1002 doesn't really say much, though.

I ran WRKOBJLCK OBJ(xxxxxPF) OBJTYPE(*FILE) MBR(*ALL)

Any spooled output of that request to confirm? A spooled WRKJOB taken
for the job doing the WRKOBJLCK?

It showed only *SHRRD locks.

Any spooled WRKJOB for job(s) listed as holding any of those *SHRRD? Any
recollection if only one job, or more than one job was holding the
member locks and if both MBR and DATA? Were the number of each matching?

These were all interactive jobs. As I recall, there were two which had a lock only on the file. All the rest had both member and file *SHRRD locks.

FWiW neither request above had library-qualified the file; or so one
might infer from the lack of 'xxLIB/' preceding the xxxxxPF.

All the jobs were running with the default library list.

I tried the same ALCOBJ command from the green screen. It also
failed with CPF1002. Tried it several times, same result.

Is there a spooled joblog taken with LOG(4 0 *SECLVL)? DSPJOB output?

I doubt if I have that. But I cut and pasted the exact failing instruction from the failing job.
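
For a future recreate, the logging asked about above could be captured roughly as follows. This is only a sketch: xxLIB is a placeholder, since the library was never named in this thread, and the commands are assumed to be run from the same interactive session that issues the ALCOBJ.

```cl
/* Raise the logging level for this job before retrying */
CHGJOB LOG(4 0 *SECLVL) LOGCLPGM(*YES)

/* Retry the failing request; xxLIB is a placeholder library name */
ALCOBJ OBJ((xxLIB/xxxxxPF *FILE *SHRNUP *N)) WAIT(0)

/* Spool this job's log so the CPF1002 and the request are kept */
DSPJOBLOG OUTPUT(*PRINT)
```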

Using WAIT(big_value) instead would have allowed a review of the lock
"status" of WAIT, via a WRKJOB OPTION(*JOBLCK) against the job
attempting that ALCOBJ, issued within that [wait] time; i.e. assuming the
effect had not cleared up, as apparently happened an hour later.
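
A minimal sketch of that approach, using two sessions. The 600-second wait, the xxLIB library, and the qualified job name 123456/USER/JOBNAME are all placeholders, not values from this thread.

```cl
/* Session A: request the lock with a long wait instead of WAIT(0) */
ALCOBJ OBJ((xxLIB/xxxxxPF *FILE *SHRNUP *N)) WAIT(600)

/* Session B: while session A is still waiting, review its job locks. */
/* Substitute the real number/user/name of the job running session A. */
WRKJOB JOB(123456/USER/JOBNAME) OPTION(*JOBLCK)
```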

Hmmm... If the CPF1002 does not have the library name [seems not so in
v5r3], then the joblogs are of less value than I might have assumed in
above comments. The job information, at least the library list, is still
pertinent if the name was not library qualified.

Over an hour later, while attempting to demonstrate the failure to a
coworker, my ALCOBJ worked! None of the other *SHRRD locks went away
(as far as I can recollect.)

User resubmitted the batch job and it worked as expected.

So it appears that something had a lock on the file that WRKOBJLCK
could not show. Any ideas?

The WRKOBJLCK MBR(*ALL) does not show the *FILE level locks when
performed interactively OUTPUT(*). If all of the locks seen were held by
one job, then a *EXCL on the *FILE might have been held by that other
job but not visible as I recall; i.e. a conflicting but unseen lock as
the legitimate origin for the error. The named file being created either
under isolation or as part of a restore would have such a lock; only the
former would also have any data locks... and per exclusive, any other
locks only in the one job. For a database file pending database recovery
[for a failed operation not under isolation\commitment-control], the
*FILE will have a *EXCL lock held by an active job which has not yet
reacted by recovering from the failure; database message(s) [since the
failure leaving the file pending recovery] in the history likely would
record a "recovery completed" or similar, if that were an issue.

Granted this is old code, but I don't think it does any *EXCL type locks.

FWiW I believe there is an oddity for database file member objects
whereby the "allocate object" request must first obtain a space location
lock before being able to attempt to get the actual *FILE [*SHRRD on the
file], *MEM [*SHRRD on the member\MBR], and *QDDS [*SHRNUP on the DATA]
locks. I do not recall if WRKOBJLCK shows any SLL [Space Location
Locks], but IIRC somewhere iNav has a means to include them? A file for
which a clear [CLRPFM] or reorganize [RGZPFM] request is pending
completion, I believe would hold such a lock to prevent any other lock
requester from appearing to "hang" [in RUN status] on a long-held seize.
Such a scenario should show the dataspace lock as *EXCL however.

FWiW: For a recreate scenario [as with the several failed attempts
before the functional attempts an hour later], a repeatable\persistent
failure for the ALCOBJ could be traced and then a breakpoint added to
catch an error condition [MCH5802] that was signaled but not suppressed
[as could be inferred, given MCH5802 is visible in the text of trace
output; I can not recall, nor trace to confirm]. For example, stopped at
the statement '/0001' of the "receive message" program which presumably
runs with the option to remove the message, F10=Command line could be
used to access the error and press F1=Help [and F6=Print] to review the
extra details recorded in that error message. If the message does not
appear in a trace, that means the message is suppressed and debug would
not help to find the MCH5802 as the underlying condition which was
relayed to the user as CPF1002.

During the hour when we could not get the lock there were several attempts made. The program is coded so that if the ALCOBJ fails you can retry or cancel. I did several retries before I canceled it. The user, naturally, submitted the job a couple more times before I convinced her we were trying to figure it out. And I made several tries myself from my interactive session.

We're at V7R1, but just for the last 2 weeks. I'll have to keep an eye out and see if it happens again.
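
If it does happen again, something like the following CL could capture the evidence at the moment of failure. This is a hypothetical sketch, not the existing program: xxLIB is a placeholder, and it assumes the job is allowed to spool output. It prints both the file-level view (MBR defaults to *NONE) and the member-level view, since the MBR(*ALL) display is the one that reportedly omits the *FILE level locks.

```cl
PGM
  /* xxLIB is a placeholder; the real library is not named in the thread */
  ALCOBJ OBJ((xxLIB/xxxxxPF *FILE *SHRNUP *N)) WAIT(0)
  MONMSG MSGID(CPF1002) EXEC(DO)
    /* File-level locks */
    WRKOBJLCK OBJ(xxLIB/xxxxxPF) OBJTYPE(*FILE) OUTPUT(*PRINT)
    /* Member and data locks */
    WRKOBJLCK OBJ(xxLIB/xxxxxPF) OBJTYPE(*FILE) MBR(*ALL) OUTPUT(*PRINT)
    /* Log of the job that received the CPF1002 */
    DSPJOBLOG OUTPUT(*PRINT)
    RETURN
  ENDDO

  /* ... the work that needs the lock would go here ... */

  DLCOBJ OBJ((xxLIB/xxxxxPF *FILE *SHRNUP *N))
ENDPGM
```

The spooled listings only show what the lock displays can show, so a lock that WRKOBJLCK cannot see would still be missing from them.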