RE: System slow-down - disk usage? -- MIDRANGE-L

A couple answers at once:

The previous file doesn't always start at 0 records. It contains months worth of data. A previously processed record may be from the same input file, or may not.

The previous file only contains the key fields to determine a duplicate.

True, the majority of the time a duplicate is an exception to the rule. However, we do have clients that send us 40% duplicates (yes, I know, sigh). If I were to do the Write and catch the Duplicate error, I'd also have to remove the job log entry (maybe not "have" to, but 100,000 job log messages about a duplicate write would be annoying). This is where I wish Alison Butterill would recognize the need to suppress a job log message when monitoring for an error. Sure, I can do it with an API, I know (and have).
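For readers who want to see the "write and catch the duplicate error" pattern in isolation, here is a minimal sketch in Python. It is only an analogy: an in-memory SQLite table with a primary key stands in for the uniquely keyed Previous file, and the names `open_previous` and `write_unless_duplicate` are made up for illustration. It says nothing about job log messages, which are specific to the platform discussed here.

```python
import sqlite3


def open_previous():
    """Create an in-memory stand-in for a uniquely keyed 'previous' file."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE previous (k TEXT PRIMARY KEY)")
    return conn


def write_unless_duplicate(conn, key):
    """Try the write first; report False if the key already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO previous (k) VALUES (?)", (key,))
        return True
    except sqlite3.IntegrityError:
        # The unique key rejected the write: this record is a duplicate.
        return False
```

The trade-off is the one described above: a single write replaces the lookup plus write, but every duplicate becomes an error that has to be handled.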

The Previous file, in this case, is gigabytes in size. We don't currently have enough main storage to accommodate it, much less to allow multiple jobs (using different files) to process at once. We are looking at getting more memory.

Select Distinct may be a good idea for eliminating duplicates within a single input file. I'll definitely consider that.

We are blocking the read on the input file. However, the size of the Previous file is really the issue here. Empty the Previous file, and the program flies. Fill up the Previous file, and the program grinds.

The Input & Previous file will never have deleted records in them.

Thanks for all of the responses,
Kurt

-----Original Message-----
From: midrange-l-bounces@xxxxxxxxxxxx [mailto:midrange-l-bounces@xxxxxxxxxxxx] On Behalf Of Charles Wilt
Sent: Tuesday, June 15, 2010 1:07 PM
To: Midrange Systems Technical Discussion
Subject: Re: System slow-down - disk usage?

Kurt,

It isn't that SETLL "moves through" so much data...it's that the same
block is repeatedly moved into and out of memory.

If I understand you correctly, each time the program runs, the
previous file starts out empty with records added for the specific
run.

Given the existing process, possible improvements
1) More memory, (enough for SETOBJACC to place the whole previous
table in memory?)
2) specify a SIZE(xxxx) ALLOCATE(*YES) on the previous table

How "duplicate" are the records? All fields exactly, or just a given
key in common?

Does the input file change while the process is running?

IMHO, to improve the process, you should only ever look at the first
record for each key; otherwise you're wasting time reading the duplicate
records and checking the previous table. So how to get rid of the need for
the previous table??

SELECT DISTINCT perhaps?

Or using RLA, create a logical over the input file with the appropriate key.
Before starting the process, RGZPFM KEYFILE(MYLF)
  read file
  while not EOF
    process record
    setgt key myfile
    read file
  end while
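
The loop above amounts to "visit the records in key order and handle only the first one for each key". A rough Python sketch of that idea follows, with a sort standing in for the keyed logical file; `first_per_key` and `key_of` are hypothetical names. Like SELECT DISTINCT, it only removes duplicates within a single input file.

```python
from itertools import groupby


def first_per_key(records, key_of):
    """Return the first record for each key, in key order."""
    # sorted() is stable, so within a key the original arrival order is kept.
    ordered = sorted(records, key=key_of)
    # groupby() plays the role of "process record, then SETGT past that key".
    return [next(group) for _, group in groupby(ordered, key=key_of)]
```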

HTH,
Charles

On Tue, Jun 15, 2010 at 1:01 PM, Kurt Anderson
<kurt.anderson@xxxxxxxxxxxxxx> wrote:

Thought I'd bounce a couple things out there to see what people think.

We've identified the slow-down cause, and it has to do with us checking every record we process against a file of records that we've previously processed. If that "previous" file starts off empty, the program starts off screaming along. However, 20 million records later, when the previous file is at 20 million records, the program has slowed down considerably. (Just using 20 million as an example, it's really a gradual decline in performance.)

We check this Previous file by doing a SETLL on it. If we get a hit, we kick out the record we're processing, otherwise we write the record to the Previous file and continue along.
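
To make the check concrete, here is a minimal Python sketch of the logic just described. It is not the actual program: a set held in memory stands in for the keyed Previous file, and `split_duplicates`, `previous_keys` and `key_of` are hypothetical names.

```python
def split_duplicates(records, previous_keys, key_of):
    """Return (accepted, rejected); previous_keys is updated in place."""
    accepted, rejected = [], []
    for rec in records:
        key = key_of(rec)
        if key in previous_keys:
            # Equivalent of the SETLL finding an equal key: kick the record out.
            rejected.append(rec)
        else:
            # Equivalent of writing the key to the Previous file and moving on.
            previous_keys.add(key)
            accepted.append(rec)
    return accepted, rejected
```

Every input record costs one lookup against the full set of previously seen keys, which is why the cost of the check depends on how much of that set can stay in memory.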

I also noticed that when the program is running with a large Previous file, that the page/fault ratio closes in on 3:1. Is this because the SetLL is moving through so much data? (No other batch jobs were running, and we have very little interactive use.)

We are looking at getting more main memory (as well as aux storage), but I'm trying to think outside the box... is there a way for me to perform this previous check without flooding the memory?

Thanks,

-Kurt

-----Original Message-----
From: midrange-l-bounces@xxxxxxxxxxxx [mailto:midrange-l-bounces@xxxxxxxxxxxx] On Behalf Of Kurt Anderson
Sent: Monday, June 14, 2010 8:41 AM
To: 'Midrange Systems Technical Discussion'
Subject: RE: System slow-down - disk usage?

Actually, that wasn't it. Although it was worth discovering anyway.

-----Original Message-----
From: midrange-l-bounces@xxxxxxxxxxxx [mailto:midrange-l-bounces@xxxxxxxxxxxx] On Behalf Of Kurt Anderson
Sent: Friday, June 11, 2010 9:44 PM
To: 'Midrange Systems Technical Discussion'
Subject: RE: System slow-down - disk usage?

Looks like I found the issue. The output file which receives a high volume of records in our "problem" job had a field defined as VARLEN(16), but 90% of the calls had a value that exceeded 16 bytes. This was a new field for this client and the size was determined based on all other clients (before we saw the new client's data). Once I discovered that, I upped the VARLEN to 25 (based on new client's data) and suddenly the job screams in speed. I knew a misuse of VARLEN would cause issues, but didn't realize it would be so bad.

Now, I had mentioned that a completely unrelated job took 24 hours to run (4-5x normal). That still boggles me because that job was not exposed to any of the changes made - which is what made me think there was a system issue.

We did check the Cache Battery, which at least exposed us to the fact that we are 100 days until a warning, so I really appreciate that tidbit. I've also been exposed to some system monitoring tools.

Thank you everyone for responding with ideas and things to check into.

-Kurt

-----Original Message-----
From: midrange-l-bounces@xxxxxxxxxxxx [mailto:midrange-l-bounces@xxxxxxxxxxxx] On Behalf Of Kurt Anderson
Sent: Thursday, June 10, 2010 2:57 PM
To: 'Midrange Systems Technical Discussion'
Subject: RE: System slow-down - disk usage?

I knew I forgot something.

Main Storage: 3885.01MB

QPFRADJ = 2

In WrkDskSts I see Active for all units.

-----Original Message-----
From: midrange-l-bounces@xxxxxxxxxxxx [mailto:midrange-l-bounces@xxxxxxxxxxxx] On Behalf Of DrFranken
Sent: Thursday, June 10, 2010 12:09 PM
To: midrange-l@xxxxxxxxxxxx
Subject: Re: System slow-down - disk usage?

OK Cleaning up never hurts and often helps so that much is good.

Your disk configuration is a bit odd in that the first two units are
mirrored to each other while the rest are RAID protected. Not
unsupported or anything like that but you do lose about 35GB of
available storage this way and the RAID is on only 4 drives rather than
across 8 drives (all drives ARE still protected). In any case, while 'odd',
it shouldn't be the source of your problem.

I forgot to mention that when looking at %busy on the drives or at the
paging/faulting ratios you need to wait until the elapsed time at the
top is about 5 minutes. Much shorter than that and you get 'spikey' data
that isn't real valuable. Much longer than that and all your data
averages into useless mush. It's not bad to know that your average
%busy for the entire morning was say 12% but it doesn't help you
troubleshoot much.

On the WRKDSKSTS screen after you pressed F11 did you see 'Degraded' or
'Unprotected'? That would indicate battery failure/drive failure.
'Active' is what you want to see.

One disk is one arm so you're correct there. How much memory is in the
machine? (Easy find is at top of WRKSHRPOOL screen.)

Definitely watch the paging/faulting and %BUSY numbers while the long
slow job is running.

Also what is your system value QPFRADJ set at?

- DrFranken

After I sent out my earlier email, we buckled down and cleaned up a lot of excess on the system, essentially gaining back 10% of the disk, which put us back to where we started. We had another job running, although this time with less to process, but it did take significantly longer than expected.

I found an article on the Cache Battery and have passed it along to my boss.
http://www.itjungle.com/fhg/fhg050907-story03.html

I checked our paging to fault ratio, and it seems decent. Our "hog" jobs aren't running right now, but I looked at wrksyssts enough while they were running to recall that the ratio was around 1 fault per 50 pages.

Using WRKDSKSTS, we have 7 units. Is each unit considered an arm? (I guess, 1 arm per disk? I'm a software guy doing his best to understand the system side here.) 1 & 2 are Protection Type MRR, 3-7 are DPY. All are Active. I'll have to start up some tests to take a look at the Busy %. At the moment the Busy % is in the low teens or lower.

In regard to our system, it's a 520, 9405, P10.

Thanks for the help,
Kurt

-----Original Message-----
From: midrange-l-bounces@xxxxxxxxxxxx [mailto:midrange-l-bounces@xxxxxxxxxxxx] On Behalf Of DrFranken
Sent: Wednesday, June 09, 2010 11:45 PM
To: midrange-l@xxxxxxxxxxxx
Subject: Re: System slow-down - disk usage?

Absolutely check Richard's suggestion on the Cache Battery. If any of
them are dead this sort of performance WILL result during significant I/O.

Going from 60 to 70% DASD should not cause this dramatic slow down. It
may cause some small fraction but nothing like 4 times plus.

What is the %BUSY in WRKDSKSTS when the long running job is running? If
the disks are 40% or more busy then you likely need more arms, faster
arms, or bigger disk cache but even then that's only a 'probable'. Also
how many disk arms do you currently have? Are they DPY or MRR protected
(from WRKDSKSTS F11)?

You also need to check faulting as you mention. The big thing about
paging and faulting is the ratios. If the pool running the jobs is
paging at say 2500 but faulting at 25 then you're doing exceptionally
well as only 1 in 100 pages results in a fault. If you've got 500 faults
out of 500 pages then you likely have a memory pool that is far too
small or has too many jobs running in it. The reason you don't find
specific numbers is because 'It Depends'. If you have a 32 way 595 you
can have faulting numbers that would make a 520 user cry and not bat an
eye.
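
The ratio arithmetic above can be written out as a small Python sketch. The function name `pages_per_fault` is made up, and no threshold is built in because, as noted, what counts as acceptable depends on the machine.

```python
def pages_per_fault(pages, faults):
    """Pages moved per fault for one pool; higher is better."""
    if faults == 0:
        return float("inf")  # no faults at all in the sample
    return pages / faults
```

With the figures from the text, 2500 pages and 25 faults gives 100 pages per fault, while 500 pages and 500 faults gives 1.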

What is your system CPW (or processor feature code) and how much memory
is installed?

- - DrFranken

On 6/9/2010 6:35 PM, Kurt Anderson wrote:

I'm on v5r4, and we've recently gotten a very large customer and have had some speed issues. At first we thought they were specific to certain new programs, but today we discovered the issue was impacting another job that was completely and absolutely isolated as far as programs go. So, we were looking at things from a system point of view to see what changed to cause this other job to slow down so much. Our guess - that our % system ASP used went from ~60% to ~70%. Is it possible that that would cause us an issue? (We had a job that would normally run 5 hours take almost 24 hours.)

We IPL'd over the weekend as well. Anyway, I realize this email is probably lacking a lot of specific information, but I'm not really a systems guy, and we're kind of grasping at straws, so I thought I'd see if such a change to disk % used should have such a big impact?

I am looking into other performance improving methods, but at this time we'd really like to pin down the cause of our performance crawl before attempting to put in enhancements.

While I'm at it, I'm curious how to quantify "excessive paging." I've seen reference to that phrase online, yet can't seem to find a number.

Thanks,

Kurt Anderson
Sr. Programmer/Analyst
CustomCall Data Systems

--
This is the Midrange Systems Technical Discussion (MIDRANGE-L) mailing list
To post a message email: MIDRANGE-L@xxxxxxxxxxxx
To subscribe, unsubscribe, or change list options,
visit: http://lists.midrange.com/mailman/listinfo/midrange-l
or email: MIDRANGE-L-request@xxxxxxxxxxxx
Before posting, please take a moment to review the archives
at http://archive.midrange.com/midrange-l.
