Re: 520 (Power5) at V6R1 Goes to Sleep -- MIDRANGE-L

On 12/8/10 11:21 AM, Kirk Goins wrote:

I have a client who initially reported that his 520 at V6R1 was
dropping off the network for 20-40 minutes and then coming back to
life as if nothing had happened. After some questioning we know that
the system stops responding to pings on the Ethernet ports (both on
the MB) AND even the Twinax console stops responding, so I don't see
this as a network issue. During the problem times, the console shows
a sign-on screen, but entering an ID and password just results in an
'X' until everything starts working again. The system has auto tune
on, lots of memory (I think a full 32GB, at least 16GB), and 65 disk
arms. ALL logging appears to stop; there is nothing in QHST or in the
joblogs we checked. The episodes occur on random days and at random
times. There is nothing in WRKPRB or in SST (PAL or Service Action
Logs). I just left a message to check the VLOGs, so I don't know
about them yet.

IBM is having the client install iDoctor, but other than that they
haven't found anything yet. I 'feel' that either the box is
thrashing or some very very low level task gets in a tight loop.
Anyone else seen this before? Any Thoughts?

A poorly established *MACHINE pool is a possible origin for the described effect. There have been a number of auto-tuning fixes over the years; I have no idea whether any apply to V6R1, or which, nor do I recall search keywords for such fixes, but I would guess any existing fixes would be in the HIPer group.
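
As a starting point for looking at the *MACHINE pool and the auto-tuner, something like the following could be entered at a command line. This is only a sketch: the commands and system values are standard, but what counts as a "poorly established" pool depends on the workload, so the numbers still need interpreting.

```cl
DSPSYSVAL SYSVAL(QPFRADJ)    /* Auto-tune setting: 0=off, 1=at IPL,          */
                             /* 2=at IPL and dynamic, 3=dynamic only         */
DSPSYSVAL SYSVAL(QMCHPOOL)   /* Configured size of the machine pool          */
WRKSHRPOOL                   /* Defined vs. current size of *MACHINE, *BASE, */
                             /* and the other shared pools                   */
WRKSYSSTS RESET(*YES)        /* Reset statistics, then F5 to watch faulting  */
                             /* in system pool 1 (the machine pool)          */
```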

Even without iDoctor, given that "coming back to life" means the interactive sessions did not terminate, a WRKACTJOB RESET(*YES) could be left active in an interactive job and reviewed after the "drop": pressing F5=Refresh at that point shows the CPU%, AuxIO, and other statistics averaged over the elapsed time. I believe WRKSYSACT with output to an OutFile supports auto-refresh, so it could similarly log and catch a CPU issue, including one caused by LIC task(s) rather than a particular job or thread.
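
The two approaches above might look roughly like this. Treat it as a sketch under assumptions: WRKSYSACT is part of the Performance Tools licensed program, the WRKSYSACT parameter names are as I recall them and should be checked by prompting the command with F4, the job name SYSACTLOG is made up for illustration, and the interval values are arbitrary (30 seconds x 2880 intervals is about 24 hours).

```cl
/* In an interactive session that is left signed on: reset the      */
/* elapsed statistics now, and press F5 only after the next "drop". */
WRKACTJOB RESET(*YES)

/* Hypothetical batch job that logs system activity, including LIC  */
/* tasks, to the default WRKSYSACT output file at fixed intervals.  */
SBMJOB CMD(WRKSYSACT OUTPUT(*FILE) INTERVAL(30) NBRITV(2880)) +
       JOB(SYSACTLOG)
```

If the collection job itself stalls during the incident, the gap between consecutive intervals in the output file would at least bracket when the system stopped and resumed.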

Since one impact described is on interactive signon [and since the message does not say which joblogs were reviewed], a review of the subsystem monitor joblogs and of the QSYSARB and QCMNARB joblogs seems appropriate, although those jobs would probably also be logging to the history. Note also that reviewing the history logged *after* the incident appears resolved is possibly worthwhile; e.g. CPF3100 messages logged to QHST some time after the apparent recovery could indicate work that transpired over the several minutes prior. Alternatively, the history might merely be delayed as a side effect of a problem with the SCPF job, which processes the history by copying data from the QHST *MSGQ into the QHST##### files.
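
A sketch of that review follows. The date and times are placeholders, not the actual incident window, and the date is written assuming an MDY job date format. As I understand DSPLOG, a message ID ending in 00 such as CPF3100 is treated as generic for CPF31xx, but that is worth confirming in the command help.

```cl
/* History entries from a window that starts before the "drop" and  */
/* runs well past the apparent recovery (placeholder date/times).   */
DSPLOG LOG(QHST) PERIOD((110000 120810) (130000 120810)) +
       OUTPUT(*PRINT)

/* The same window, limited to the CPF31xx messages.                */
DSPLOG LOG(QHST) PERIOD((110000 120810) (130000 120810)) +
       MSGID(CPF3100) OUTPUT(*PRINT)

/* Joblog of the system arbiter, spooled for review.                */
DSPJOBLOG JOB(QSYSARB) OUTPUT(*PRINT)
```

The subsystem monitor and QCMNARB joblogs can be reached the same way once their job names are taken from WRKACTJOB.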

Regards, Chuck
