CRPence wrote:
but why take the chance?
Chuck:
To see what happens!
Unfortunately, it merely generated a single T/AF and ended the job.
Ah, well...
Using just CHKOBJ against an object, even
QSYSOPR *MSGQ, should cause the authority failure entry to be logged;
presumably with less processing by the OS code, to effect that request.
For example:
pgm /* replace ## to reflect the 'not aut' msg */
again:
chkobj qsys/qsysopr *msgq
monmsg cpf98## exec(rcvmsg ... rmv(*yes))
goto again
endpgm
This is closer to what is being done, but there is more to the
overall "requirements".
The real problem is that we don't know what the "requirements" are.
By reviewing symptoms, it _looks_ as if some internal limit is being
hit that i5/OS reacts to by throwing a very hard error. Initial
guesses (from IBM support) are that something like an "activation
mark" limit is being reached after some 100s of millions of audit
entries are processed.
I'm working on replicating the problem for a customer.
I'm using a model 570 through IBM's VLP. I currently have 20 jobs
looping around a few kinds of commands for different audit entry
types, plus a couple of standalone jobs that do some specific entry generation.
The _basic_ job structure for the 20 jobs doing entry generation is
something like this (a rough CL sketch of it follows the list):
1. CPYLIB to get a unique work library of 1000+ varied objects.
2. List WrkLib objects to a *USRSPC.
3. Start DOFOR loop.
   * Initialize a few vars, such as the pointer into the *USRSPC.
4. CHGAUD OBJAUD(*NONE) generic for all objects in WrkLib.
5. CHGAUD OBJAUD(*ALL) generic for all objects in WrkLib.
6. Get next object name from the space.
7. RTVOBJD TEXT() to generate ZR.
8. CHGOBJD TEXT() to generate ZC.
9. Goto step 6 until the *USRSPC list is exhausted.
10. ENDDO.
11. DLTLIB WrkLib.
(It's not particularly logical, because the structure keeps changing
as I test different techniques.)
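To make that concrete, here's a rough ILE CL sketch of the shape of one
generator job. It is not the real code: the template library AUDTMPL, the
work library name, the TESTDTA data area, and the hard-coded loop limit are
all invented for the sketch; error handling is stripped out; and instead of
walking the *USRSPC object list it just touches one known *DTAARA per pass
to show where the ZR/ZC pair comes from.

pgm
   dcl var(&wrklib)  type(*char) len(10) value('AUDWRK01') /* unique per job   */
   dcl var(&objpath) type(*char) len(32)
   dcl var(&text)    type(*char) len(50)
   dcl var(&limit)   type(*int)  len(4)  value(200000)     /* really a *dtaara */
   dcl var(&i)       type(*int)  len(4)

   /* 1. prime a unique work library -- a CO entry per object created */
   cpylib fromlib(audtmpl) tolib(&wrklib) crtlib(*yes)

   /* 2. the real code lists the &wrklib objects into a *usrspc here  */
   chgvar var(&objpath) value('/QSYS.LIB/' *cat &wrklib *tcat '.LIB/*')

   /* 3. outer loop */
   dofor var(&i) from(1) to(&limit)

      /* 4./5. generic audit changes -- an AD entry per object, twice */
      chgaud obj(&objpath) objaud(*none)
      chgaud obj(&objpath) objaud(*all)

      /* 6.-9. walk the *usrspc list; shown for a single known *dtaara */
      rtvobjd obj(&wrklib/testdta) objtype(*dtaara) text(&text)   /* ZR */
      chgobjd obj(&wrklib/testdta) objtype(*dtaara) text(&text)   /* ZC */

   enddo   /* 10. */

   /* 11. tear down -- a DO entry per object deleted */
   dltlib lib(&wrklib)
endpgm

The real inner loop, of course, runs steps 6-9 for every name in the
*USRSPC list rather than for a single object.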
So, each time one of those jobs runs, it creates a bunch of CO
entries at the start and DO entries at the end via CPYLIB/DLTLIB for
that job's unique WrkLib. Unique WrkLibs avoid a lot of lock issues,
etc. There's a mix of objects in the priming library -- *PGMs,
*FILEs, *DTAARAs, etc. The data files are all cleared since I'm only
accessing the objects themselves in my code.
I get a bunch of AD entries when I change audit values generically
for the entire WrkLib in single commands -- it's not clear that I
can code anything that'll be faster than a generic command. I get
the ZR/ZC combo easily enough via RTVOBJD/CHGOBJD.
With 20 concurrent jobs, there's a fair mix of entries going into
the audit journal. (All QAUDLVL values are set, so some entries are
created just by running stuff.) The code that sets up the DOFOR loop
currently pulls 200000 as the loop limit from a *DTAARA.
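(For anyone who wants to eyeball that mix on their own box, a plain DSPJRN
filter over the attached receiver chain is enough to show what's landing;
the entry types listed here are just the ones mentioned in this note:)

   dspjrn jrn(qsys/qaudjrn) rcvrng(*curchain) jrncde((t)) enttyp(ad zc zr co do af)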
The odd secondary jobs ensure that AFs and a few other entries get
sprinkled in. I create these jobs as ideas come to mind. For
example, one of them creates, manipulates, and deletes an IFS
directory/sub-directory structure with some streamfiles, just in case
the problem is related to the IFS APIs that get called when the
resulting audit entries are processed.
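Conceptually, that IFS job amounts to little more than the loop below. The
directory names, the seed source member, and the loop count are all invented
for this sketch; the real job is messier and varies as ideas get added:

pgm
   dcl var(&i) type(*int) len(4)

   dofor var(&i) from(1) to(100000)
      crtdir dir('/audtest')                            /* object creates     */
      crtdir dir('/audtest/sub')
      cpytostmf frommbr('/qsys.lib/audtmpl.lib/qclsrc.file/seed.mbr') +
                tostmf('/audtest/sub/stmf1.txt') stmfopt(*replace)
      chgaud obj('/audtest/sub/stmf1.txt') objaud(*all)  /* audit-value change */
      rmvlnk objlnk('/audtest/sub/stmf1.txt')            /* object deletes     */
      rmvdir dir('/audtest/sub')
      rmvdir dir('/audtest')
   enddo
endpgm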
In short, I am indeed getting lots of audit journal entries of
different types. It's an LPAR system where I'm the only user on this
partition and it seems "fast enough" for most purposes... just not
this purpose. The customer who's seeing the problem generates
entries faster.
I've set QAUDFRCLVL to 100 to reduce that impact as much as
possible. I'm holding approximately 18 of the top 20 tasks (sometimes 17,
sometimes 19, depending on QDBSRVXR* at any given moment). My
activity levels seem appropriate. I'm not particularly memory- or
DASD-constrained, based on faulting, paging, etc.
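(For reference, that force level is just the QAUDFRCLVL system value; 100 is
the maximum, letting up to 100 audit records be buffered before they're
forced to auxiliary storage:)

   chgsysval sysval(qaudfrclvl) value(100)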
Adding more jobs to my 20 has passed the point of useful returns.
I'll need more CPUs allocated to my LPAR if I want that to make a
difference.
I used a combination of generic commands and individual commands
because I wasn't sure which would give better results. The generics
certainly make for simpler coding and create lots of entries each
time they're executed, but that's not the biggest concern. If
single-object references would give better overall results, I'm
happy to go that route.
Because there's no indication of what causes the error when it
eventually strikes, I'm keeping the entries somewhat mixed. So far,
it has always been past the half-billion mark of processed entries; but the
best count has indicated more like 750,000,000 processed entries.
The error takes maybe a couple of weeks of run-time to appear on the
customer's system.
The test generators are coded in ILE CL, targeting V5R4 in order to
use the V5R4 language support.
I had a system where I could try John Earl's suggestion, and it had
me curious. Unfortunately, i5/OS and its LIC were too smart for it.
Woulda been interesting if it had worked...
So, that's maybe a better description of the problem being
attacked. I'd have a better problem description except that when the
problem hits, job structures get corrupted. There's no useful
visibility when you can't even execute DSPJOB against it to see what
the call stack looks like; DSPJOB crashes. Error handling goes nuts
in the job; forget trying to dump anything; there's no visibility into locks...
After years of running for a number of customers, the problem hasn't
been seen before with this code. For _this_ customer, it seems
guaranteed to show up (after a couple of weeks of run-time).
The problem matches APAR SE24828 almost exactly, but the
customer is properly PTFd for that. (No libl manipulation goes on in
the job, but that doesn't seem relevant anyway.)
Obviously, it _could_ be something like parm mismatches and other
stuff that really screws up everything -- those possibilities have
been gone over and over. But for only one customer? The code isn't
customized. And it sure isn't untried in many different environments.
The major difference _seems_ to be that _this_ customer has massive
numbers of audit entries, even compared to other large customers,
and _this_ customer doesn't quiesce the system very often. That
means the audit entry processing calls its various procs a
whole bunch of times before the job is ended and restarted.
Therefore... the current effort is to generate "many" varied
entries quickly to force the error when the entries are processed. But how?
Any conceptual improvements?
Tom Liotta