RE: dasd utilization at 99.56% -- MIDRANGE-L

Why is it that whenever we have a question about something, the topic seems to magically pop up on these boards? It seems it now works in reverse, too! I can tell you what happens when a system fills up. (:

The messages "Your system is filling up!" started at 3:30 AM - after the night operator had gone home.
The messages arrived at 3:30, 4:30, 5:30, and 6:30.

At about 7:00 the day help desk lady arrived and had trouble signing on. She called me, and at about 7:20 I was attempting to sign on. She said her session had just dropped, as had that of one of our early-bird programmers.

She got into the computer room, and the console message indicated that the system was trying to dump main storage, and had no place to put it. The system was dead.

My pager did receive the 'disk full' messages, but not until I had left my home, which sits in a 'dead zone.' Not even Verizon could have helped.

At 8:30 I arrived at the office, and my backup had IBM on the phone. The alternatives were: A) follow their detailed instructions to save main storage so IBM could find out what died (this would take a while), and then IPL; or B) forget the main storage save and just IPL.

We chose to IPL, since I had a good idea which job had done the dirty deed.

IBM told us that the IPL could take up to 3 times as long as normal (mileage may vary).

At 8:40 the IPL was started.

After several steps, SRC C600-4A57 - Database Analyze Pass 1 - was posted. This step ran forever (i.e., about 3 hours)!

Somewhere in the back of my mind I remember that the system keeps a table of open or 'dirty' objects, and only checks these during an abnormal IPL. I'd hate to see how long it'd take if this table were corrupted.

After the system got done with C600-4A57 the IPL proceeded at its normal pace to the signon screen. Then it was just a matter of checking for damage and re-starting everything (we do NOT have a startup program that wakes everything up-- after an abnormal end we like to check things first!).

Total time from the start of IPL to Signon Screen was about 4-5 hours. Our operator informed me that our normal IPLs are running at least 45 minutes these days, so the time wasn't too far off from IBM's estimate-- although I thought our IPLs were only 20-30 minutes, so we got a bit anxious when it took longer than 2 hours!

We didn't have any damaged objects. We were lucky that at 7 AM users were just starting to hit the system, so we didn't have to re-do the entire day. One of our plant applications is designed to switch between our two systems, so the plant was able to run off of the backup box.

Once we were able to sign on, we deleted the file created by the application that did the deed. 433 Gigs of DLTF later, we had -lots- of room. We went from 97% to 84%. I suspect that we were a lot closer to 100% when the machine died-- the IPL cleared out all of the QTEMP libraries. [We were retrieving selected journal transactions-- and a LOT of stuff had gone on!]
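For anyone who wants to sanity-check those numbers, here is a back-of-the-envelope sketch in Python. It assumes the two percentages are both of total disk capacity and that nothing else changed between the readings; the function name is just made up for illustration. On those assumptions, freeing 433 GB for a 13-point drop implies roughly 3.3 TB of total disk.

```python
# Rough capacity estimate from the figures quoted in this post.
# Assumption: 97% and 84% are both percentages of total disk capacity,
# and the 433 GB delete is the only thing that changed between them.

def estimate_total_gb(freed_gb, pct_before, pct_after):
    """Total capacity implied by freeing `freed_gb` and seeing usage drop."""
    drop = (pct_before - pct_after) / 100.0
    if drop <= 0:
        raise ValueError("usage must go down after deleting data")
    return freed_gb / drop

total_gb = estimate_total_gb(433, 97, 84)
free_before_gb = total_gb * (100 - 97) / 100.0   # headroom at 97% used
free_after_gb = total_gb * (100 - 84) / 100.0    # headroom at 84% used

print(f"implied total capacity: {total_gb:,.0f} GB")
print(f"free space at 97%:      {free_before_gb:,.0f} GB")
print(f"free space at 84%:      {free_after_gb:,.0f} GB")
```

The point of the exercise: even at 97% there was only about 100 GB of headroom on a box this size, which a runaway job can eat in a hurry.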

IBM did mention that objects in QTEMP libraries are sometimes preserved during an abnormal IPL, so if the big object had been in a QTEMP library we would have had to track it (them) down. I'm not sure of the details, because that wasn't what had done us in. IIRC, if the system has trouble waking up, you are presented with a screen that tells you to delete something so the system can run.

We will be adding additional monitoring and messaging to make sure the messages get through to those of us who live in 'pager-challenged' neighborhoods!
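A minimal sketch of one way such escalation could work, written in Python purely for illustration. Everything here is hypothetical (the Alert record, the function name, the rule of "escalate after two warnings nobody acknowledged"); it is not what we run, just the idea that a warning nobody answers should trigger a second channel, such as a phone call to a backup contact.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    hour: float          # time the warning was sent, in hours (3.5 means 3:30 AM)
    acknowledged: bool   # did a human confirm they saw it?

def first_escalation_hour(alerts, max_unacked=2):
    """Return the hour at which the `max_unacked`-th unacknowledged alert
    went out (the moment to try another channel), or None if it never did."""
    unacked = 0
    for alert in sorted(alerts, key=lambda a: a.hour):
        if not alert.acknowledged:
            unacked += 1
            if unacked >= max_unacked:
                return alert.hour
    return None

# The four hourly warnings from that morning, none of them seen by anyone.
morning = [Alert(3.5, False), Alert(4.5, False),
           Alert(5.5, False), Alert(6.5, False)]

escalate_at = first_escalation_hour(morning)
print("escalate at hour:", escalate_at)
```

Under that made-up rule, the second unanswered warning at 4:30 would have kicked off the backup channel, well before users started dropping at around 7:00.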

--Paul E Musselman
PaulMmn@xxxxxxxxxxxxxxxxxxxx