Why is it that whenever we have a question about something, the topic
seems to magically pop up on these boards? It seems that it now also
works in reverse! I can tell you what happens when a system fills
up. (:
The messages "Your system is filling up!" started at 3:30 AM - after
the night operator had gone home.
The messages arrived at 3:30, 4:30, 5:30, and 6:30.
At about 7:00 the day help desk lady arrived and had trouble signing
on. She called me, and about 7:20 I was attempting to sign on. She
said her session just dropped, as did that of one of our early-bird
programmers.
She got into the computer room, and the console message indicated
that the system was trying to dump main storage, and had no place to
put it. The system was dead.
My pager did receive the 'disk full' messages, but not until I had
left my home, which sits in a 'dead zone.' Not even Verizon could have
helped.
At 8:30 I arrived at the office, and my backup had IBM on the phone.
The alternatives were: A) follow their detailed instructions to save
main storage so IBM could find out what died. This would take a
while. Then IPL. Or B) forget the main storage save and IPL.
We chose to IPL, since I had a good idea which job had done the dirty deed.
IBM told us that the IPL could take up to 3 times normal (mileage may vary).
At 8:40 the IPL was started.
After several steps, SRC C600-4A57 - Database Analyze Pass 1 -
was posted. This step ran forever (i.e., about 3 hours)!
Somewhere in my mind I remember that the system keeps a table of open
or 'dirty' objects, and only checks these during an abnormal IPL.
I'd hate to see how long it'd take if this table were corrupted.
After the system got done with C600-4A57, the IPL proceeded at a normal
pace to the signon screen. Then it was just a matter of checking for
damage and restarting everything (we do NOT have a startup program that
wakes everything up-- after an abnormal end we like to check things
first!).
Total time from the start of IPL to Signon Screen was about 4-5
hours. Our operator informed me that our normal IPLs are running at
least 45 minutes these days, so the time wasn't too far off from
IBM's estimate-- although I thought our IPLs were only 20-30 minutes,
so we got a bit anxious when it took longer than 2 hours!
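
For anyone who wants to run the numbers on their own box, the
comparison above can be sketched in a few lines of Python. This is
purely illustrative: the function name is made up, the 45-minute
baseline is the operator's "at least" figure, and 240-300 minutes is
just "about 4-5 hours" converted. By this rough arithmetic, 3 times a
45-minute IPL is about 2.25 hours, and the elapsed time works out to
somewhere between 5 and 7 times that baseline.

```python
def slowdown_factor(actual_minutes: float, normal_minutes: float) -> float:
    """Return how many times longer an IPL ran than the normal IPL."""
    if normal_minutes <= 0:
        raise ValueError("normal_minutes must be positive")
    return actual_minutes / normal_minutes


NORMAL_IPL_MINUTES = 45   # operator's figure; an assumption, "at least 45"
IBM_MULTIPLIER = 3        # "up to 3 times normal (mileage may vary)"

# IBM's rule of thumb applied to the 45-minute baseline, in hours.
print("IBM estimate (hours):", IBM_MULTIPLIER * NORMAL_IPL_MINUTES / 60)

# "About 4-5 hours" expressed in minutes.
for actual in (240, 300):
    factor = slowdown_factor(actual, NORMAL_IPL_MINUTES)
    print(f"{actual} min is {factor:.1f}x a {NORMAL_IPL_MINUTES}-minute IPL")
```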
We didn't have any damaged objects. We were lucky that at 7 AM users
were just starting to hit the system, so we didn't have to re-do the
entire day. One of our plant applications is designed to switch
between our 2 systems, so they were able to run off of the backup box.
Once we were able to sign on, we deleted the file created by the
application that did the deed. 433 Gigs of DLTF later, we had -lots-
of room. We went from 97% to 84%. I suspect that we were a lot
closer to 100% when the machine died-- the IPL cleared out all of
the QTEMP libraries. [We were retrieving selected journal
transactions-- and a LOT of stuff had gone on!]
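
As a side note, the two percentages and the size of the deleted file
give a rough idea of how much disk is in the picture. A minimal
sketch, assuming both percentages refer to the same pool of disk and
that "Gigs" means GB; the function name is hypothetical, and because
the percentages are rounded the result is only a ballpark (a bit over
3 TB).

```python
def estimate_capacity_gb(freed_gb: float, pct_before: float, pct_after: float) -> float:
    """Estimate total capacity from space freed and the drop in percent used."""
    drop = pct_before - pct_after
    if drop <= 0:
        raise ValueError("pct_before must be greater than pct_after")
    # freed_gb corresponds to `drop` percentage points of the total.
    return freed_gb * 100.0 / drop


capacity = estimate_capacity_gb(433, 97, 84)
print(f"Estimated capacity: about {capacity:.0f} GB")
print(f"Free space at 84% used: about {capacity * (100 - 84) / 100:.0f} GB")
```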
IBM did mention that objects in QTEMP libraries are sometimes
preserved during an abnormal IPL, so if the big object had been in a
QTEMP library we would have had to track it (them) down. I'm not sure
of the details, because that wasn't what had done us in. IIRC, if the
system has trouble waking up, you are presented with a screen that
tells you to delete something so the system can run.
We will be adding additional monitoring and messaging to make sure
the messages get through to those of us who live in
'pager-challenged' neighborhoods!
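
One way to think about "making sure the messages get through" is an
escalation chain: if nobody acknowledges the alert, each interval
brings in one more contact or channel. The sketch below is only an
illustration of that idea in Python, not what we have in place. The
contact names, channels, and the 30-minute step are all made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    name: str      # hypothetical person or role
    channel: str   # hypothetical delivery channel


def contacts_to_notify(minutes_unacknowledged: int,
                       chain: list[Contact],
                       step_minutes: int = 30) -> list[Contact]:
    """Return everyone who should have been notified by now.

    The first contact is notified immediately; one more contact is
    added for each full step that passes without an acknowledgement.
    """
    if minutes_unacknowledged < 0 or step_minutes <= 0:
        raise ValueError("times must be non-negative and step positive")
    levels = minutes_unacknowledged // step_minutes + 1
    return chain[:levels]


# Hypothetical chain; adjust to whoever actually carries the pager.
CHAIN = [
    Contact("primary admin", "pager"),
    Contact("primary admin", "home phone"),
    Contact("backup admin", "pager"),
]

# 3:30 AM first message to roughly 7:00 AM is 210 minutes unacknowledged.
for contact in contacts_to_notify(210, CHAIN):
    print(f"notify {contact.name} via {contact.channel}")
```

With a chain like this, a pager sitting in a dead zone would only
delay things by one step before a second channel was tried.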
--Paul E Musselman
PaulMmn@xxxxxxxxxxxxxxxxxxxx