On 22-Jan-2012 08:50 , Vinay Gavankar wrote:
Thanks everyone for the input. Here are my constraints (which I
should have put down before).
The data (about 75 GB) is on a production box, on which the only
thing I can do is to pull it to another development box (no access
to command line or SQL or IFS on the production box).
If there is access to the flat file by a user, there is very probably
access to that file via the SQL by that same user. While there may be
no ability to access the SQL [ST1] utility via STRSQL on the production
system, there are a variety of means to access the SQL, a language for
which many statements can be performed dynamically such that no compile
is necessary. Any utility that accepts dynamic SQL is a potential path
to the authorized data; e.g. STRSQL and STRQM on the development box may
be able to connect to the production box. The data could even be
ordered and copied within the production box rather than copied to another.
So the limitation as alluded may not be so restrictive. The SQL
could probably easily be run on the production system and the ordered
result "pulled" to the development system using the database instead of
some other means [which leaves the data unordered].
On the development box, I have access to SQL, but not too sure about
IFS. The data is in EBCDIC, so unsure what it will translate to in
ASCII (by the way, the file also has another control character x'1E'
in it. Did not mention it before, as I though it was not relevant to
my problem).
Is DRDA configured to enable CONNECT TO TheProduction from the
TheDevelopment system? If there is very fast sequential\tape media
available to both systems? Offloading the [ordered] data can be a
better choice than communications; e.g. write the ordered result set to
the tape at the production system, and then copy the data from that tape
to the development system.
Once I have the sorted file, I have to chop it up into 500,000
records each, so that those files can be further processed in
parallel. These files will have to be copied back to production
machine by another team (I can pull the data off, not put it back on
production).
More work beyond ordering? Can that other work be done in the same
"pass" as the sort? As noted, the sorted data at the production system
could be sent to tape and [possibly more easily] copied to the separate
files at the development box. If the other work can be done at the same
time, then having copied the transformed\sorted data to media allows
directly replacing the original while also limiting storage
requirements. Since every row must be processed to effect collation,
there may be little reason to delay other work to a separate and more
parallel processing; not knowing what work, that is difficult to guess.
I tried Chuck's solution, on a test file and it works. I will have to
check into the IFS access.
Note: The expression could be enhanced to avoid problems for records
missing Last Name data, and for any records where the Last Name data is
located first or last. The records will not be sorted properly using
the given SQL, if either the data does not follow the strict definition
that was given [wherein the Last Name data resides for any given row] or
if any records are missing the Last Name data "field" designator.
The machine is already at over 90% disk usage and sudden jump of 2-3
percent, may be frowned upon (but there is no way around that).
Two systems were noted as being involved, but which is described in
the above quoted text is omitted.
Depending on how the query is implemented, and what additional
database features are available [e.g. SMP installed], the disk usage
could easily jump dramatically; possibly way beyond just double plus
[just as indeterminate amount of] storage for all key values. If SMP is
available, imposing some limits may be desirable; e.g. CHGQRYA
I am sure the sorting of 18 million records will take some time, and
can you say which one would take less system resources? (There are
scores and scores of other developers on the system who would scream,
if I hog the system for a few hours).
Again, which system? Is the plan to copy the unordered file from
production to development and then do all of the work on the development
system? The SQL could be performed on the production system and the
result set then copied to development, or even possibly ordered and
modified to make the copy entirely on the production system. The SQL
could be run with specific work management attributes that limit the
impact to the rest of the system; e.g. its own storage pool and runpty(65).
As John mentions, I could write a simple program to create an
intermediate file with a key, reorganize it to get it sorted, but I
think the RGZPFM will use as much system resources as the SQL would
(not sure about the Unix sort).
I believe that was Booth. The RGZPFM will use effectively twice the
amount of both the key and the physical data sizes, plus the equivalent
of storing a few hundred records separately; but already by then, there
is a second copy of the data. A known and fixed key length can limit
the required storage. The given SQL might generate a much longer key
than necessary to hold the largest Last Name data, but I really doubt
that the amount of work or storage would be significantly different than
the RGZPFM; each effect about all the same processing. With some
features available to the SQL [like parallel I\O], the SQL could
complete much faster, but presumably with more obvious consumption of
resources.
However physically sorting the data at that stage might no make
sense; i.e. physical ordering could be delayed until later, because an
index of the physical data could be maintained while the data was being
inserted into the file. The physical ordering then could be effected
when copying back to the original flat file using the index ordering,
after all other work is completed; i.e. if the additional work can be
effected distinctly across multiple subsets of the data, then that work
can be done without any known order.
Regards, Chuck
As an Amazon Associate we earn from qualifying purchases.