CRPence wrote:
but why take the chance?
Chuck:
To see what happens!
Unfortunately, it merely generated a single T/AF and ended the job.
Ah, well...
Using just CHKOBJ against an object, even
QSYSOPR *MSGQ, should cause the authority failure entry to be logged;
presumably with less processing by the OS code, to effect that request.
For example:
pgm /* replace ## to reflect the 'not aut' msg */
again:
chkobj qsys/qsysopr *msgq
monmsg cpf98## exec(rcvmsg ... rmv(*yes))
goto again
endpgm
This is closer to what is being done, but there is more to the
overall "requirements".
The real problem is that we don't know what the "requirements" are.
By reviewing symptoms, it _looks_ as if some internal limit is being
hit that i5/OS reacts to by throwing a very hard error. Initial
guesses (from IBM support) are that something like an "activation
mark" limit is being reached after some 100s of millions of audit
entries are processed.
I'm working on replicating the problem for a customer.
I'm using a model 570 through IBM's VLP. I currently have 20 jobs
looping around a few kinds of commands for different audit entry
types, plus a couple of standalone jobs that do some specific entry generation.
The _basic_ job structure for the 20 jobs doing entry generation is
something like this (a rough CL sketch of it follows the list):
1. CPYLIB to get a unique work library of 1000+ varied objects.
2. List WrkLib objects to a *USRSPC.
3. Start DOFOR loop.
   * Initialize a few vars, such as the pointer into the *USRSPC.
4. CHGAUD OBJAUD(*NONE) generic for all objects in WrkLib.
5. CHGAUD OBJAUD(*ALL) generic for all objects in WrkLib.
6. Get next object name from the space.
7. RTVOBJD TEXT() to generate ZR.
8. CHGOBJD TEXT() to generate ZC.
9. Goto step 6 until the *USRSPC list is exhausted.
10. ENDDO.
11. DLTLIB WrkLib.
(It's not particularly logical, because the structure keeps changing
as I test different techniques.)
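To make that concrete, here's a rough ILE CL sketch of the shape of one
generator job. It is not the real code: the template library AUDTMPL, the
work library name, the TESTDTA data area, and the hard-coded loop limit are
all invented for the sketch; error handling is stripped out; and instead of
walking the *USRSPC object list it just touches one known *DTAARA per pass
to show where the ZR/ZC pair comes from.

pgm
   dcl var(&wrklib)  type(*char) len(10) value('AUDWRK01') /* unique per job   */
   dcl var(&objpath) type(*char) len(32)
   dcl var(&text)    type(*char) len(50)
   dcl var(&limit)   type(*int)  len(4)  value(200000)     /* really a *dtaara */
   dcl var(&i)       type(*int)  len(4)

   /* 1. prime a unique work library -- a CO entry per object created */
   cpylib fromlib(audtmpl) tolib(&wrklib) crtlib(*yes)

   /* 2. the real code lists the &wrklib objects into a *usrspc here  */
   chgvar var(&objpath) value('/QSYS.LIB/' *cat &wrklib *tcat '.LIB/*')

   /* 3. outer loop */
   dofor var(&i) from(1) to(&limit)

      /* 4./5. generic audit changes -- an AD entry per object, twice */
      chgaud obj(&objpath) objaud(*none)
      chgaud obj(&objpath) objaud(*all)

      /* 6.-9. walk the *usrspc list; shown for a single known *dtaara */
      rtvobjd obj(&wrklib/testdta) objtype(*dtaara) text(&text)   /* ZR */
      chgobjd obj(&wrklib/testdta) objtype(*dtaara) text(&text)   /* ZC */

   enddo   /* 10. */

   /* 11. tear down -- a DO entry per object deleted */
   dltlib lib(&wrklib)
endpgm

The real inner loop, of course, runs steps 6-9 for every name in the
*USRSPC list rather than for a single object.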
So, each time one of those jobs runs, it creates a bunch of CO
entries at the start and DO entries at the end via CPYLIB/DLTLIB for
that job's unique WrkLib. Unique WrkLibs avoid a lot of lock issues,
etc. There's a mix of objects in the priming library -- *PGMs,
*FILEs, *DTAARAs, etc. The data files are all cleared since I'm only
accessing the objects themselves in my code.
I get a bunch of AD entries when I change audit values generically
for the entire WrkLib in single commands -- it's not clear that I
can code anything that'll be faster than a generic command. I get
the ZR/ZC combo easily enough via RTVOBJD/CHGOBJD.
With 20 concurrent jobs, there's a fair mix of entries going into
the audit journal. (All QAUDLVL values are set, so some entries are
created just by running stuff.) The code that sets up the DOFOR loop
currently pulls 200000 as the loop limit from a *DTAARA.
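(For anyone who wants to eyeball that mix on their own box, a plain DSPJRN
filter over the attached receiver chain is enough to show what's landing;
the entry types listed here are just the ones mentioned in this note:)

   dspjrn jrn(qsys/qaudjrn) rcvrng(*curchain) jrncde((t)) enttyp(ad zc zr co do af)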
The odd secondary jobs ensure that AFs and a few other entries get
sprinkled in. I create these jobs as ideas come to mind. For
example, one of them creates, manipulates, and deletes an IFS
directory/sub-directory structure with some streamfiles, just in case
the problem is related to the IFS APIs that get called when the
resulting audit entries are processed.
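Conceptually, that IFS job amounts to little more than the loop below. The
directory names, the seed source member, and the loop count are all invented
for this sketch; the real job is messier and varies as ideas get added:

pgm
   dcl var(&i) type(*int) len(4)

   dofor var(&i) from(1) to(100000)
      crtdir dir('/audtest')                            /* object creates     */
      crtdir dir('/audtest/sub')
      cpytostmf frommbr('/qsys.lib/audtmpl.lib/qclsrc.file/seed.mbr') +
                tostmf('/audtest/sub/stmf1.txt') stmfopt(*replace)
      chgaud obj('/audtest/sub/stmf1.txt') objaud(*all)  /* audit-value change */
      rmvlnk objlnk('/audtest/sub/stmf1.txt')            /* object deletes     */
      rmvdir dir('/audtest/sub')
      rmvdir dir('/audtest')
   enddo
endpgm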
In short, I am indeed getting lots of audit journal entries of
different types. It's an LPAR system where I'm the only user on this
partition and it seems "fast enough" for most purposes... just not
this purpose. The customer who's seeing the problem generates
entries faster.
I've set QAUDFRCLVL to 100 to reduce that impact as much as
possible. I'm holding approximately 18 of the top 20 tasks (sometimes 17,
sometimes 19, depending on QDBSRVXR* at any given moment). My
activity levels seem appropriate. I'm not particularly memory- or
DASD-constrained, based on faulting, paging, etc.
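(For reference, that force level is just the QAUDFRCLVL system value; 100 is
the maximum, letting up to 100 audit records be buffered before they're
forced to auxiliary storage:)

   chgsysval sysval(qaudfrclvl) value(100)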
Adding more jobs to my 20 has passed the point of useful returns.
I'll need more CPUs allocated to my LPAR if I want that to make a
difference.
I used a combination of generic commands and individual commands
because I wasn't sure which would give better results. The generics
certainly make for simpler coding and create lots of entries each
time they're executed, but that's not the biggest concern. If
single-object references would give better overall results, I'm
happy to go that route.
Because there's no indication of what causes the error when it
eventually strikes, I'm keeping the entries somewhat mixed. So far,
it has always been past the half-billion mark of processed entries; but the
best count has indicated more like 750,000,000 processed entries.
The error takes maybe a couple of weeks of run-time to appear on the
customer's system.
The test generators are coded in ILE CL, targeting V5R4 in order to
use the V5R4 language support.
I had a system where I could try John Earl's suggestion, and it had
me curious. Unfortunately, i5/OS and its LIC were too smart for it.
Woulda been interesting if it had worked...
So, that's maybe a better description of the problem being
attacked. I'd have a better problem description except that when the
problem hits, job structures get corrupted. There's no useful
visibility when you can't even execute DSPJOB against it to see what
the call stack looks like; DSPJOB crashes. Error handling goes nuts
in the job; forget trying to dump anything; there's no visibility into locks...
After years of running for a number of customers, the problem hasn't
been seen before with this code. For _this_ customer, it seems
guaranteed to show up (after a couple of weeks of run-time).
The problem matches APAR SE24828 almost exactly, but the
customer is properly PTFd for that. (No libl manipulation goes on in
the job, but that doesn't seem relevant anyway.)
Obviously, it _could_ be something like parm mismatches and other
stuff that really screws up everything -- those possibilities have
been gone over and over. But for only one customer? The code isn't
customized. And it sure isn't untried in many different environments.
The major difference _seems_ to be that _this_ customer has massive
numbers of audit entries, even compared to other large customers,
and _this_ customer doesn't quiesce the system very often. That
means the audit entry processing calls its various procs a
whole bunch of times before the job is ended and restarted.
Therefore... the current effort is to generate "many" varied
entries quickly to force the error when the entries are processed. But how?
Any conceptual improvements?
Tom Liotta