Re: Sorting file based on variable string -- MIDRANGE-L

On 22-Jan-2012 08:50 , Vinay Gavankar wrote:

Thanks everyone for the input. Here are my constraints (which I
should have put down before).

The data (about 75 GB) is on a production box, on which the only
thing I can do is to pull it to another development box (no access
to command line or SQL or IFS on the production box).

If there is access to the flat file by a user, there is very probably access to that file via the SQL by that same user. While there may be no ability to access the SQL [ST1] utility via STRSQL on the production system, there are a variety of means to access the SQL, a language for which many statements can be performed dynamically such that no compile is necessary. Any utility that accepts dynamic SQL is a potential path to the authorized data; e.g. STRSQL and STRQM on the development box may be able to connect to the production box. The data could even be ordered and copied within the production box rather than copied to another.

So the limitation as alluded may not be so restrictive. The SQL could probably easily be run on the production system and the ordered result "pulled" to the development system using the database instead of some other means [which leaves the data unordered].

On the development box, I have access to SQL, but not too sure about
IFS. The data is in EBCDIC, so unsure what it will translate to in
ASCII (by the way, the file also has another control character x'1E'
in it. Did not mention it before, as I though it was not relevant to
my problem).

Is DRDA configured to enable CONNECT TO TheProduction from the TheDevelopment system? If there is very fast sequential\tape media available to both systems? Offloading the [ordered] data can be a better choice than communications; e.g. write the ordered result set to the tape at the production system, and then copy the data from that tape to the development system.

Once I have the sorted file, I have to chop it up into 500,000
records each, so that those files can be further processed in
parallel. These files will have to be copied back to production
machine by another team (I can pull the data off, not put it back on
production).

More work beyond ordering? Can that other work be done in the same "pass" as the sort? As noted, the sorted data at the production system could be sent to tape and [possibly more easily] copied to the separate files at the development box. If the other work can be done at the same time, then having copied the transformed\sorted data to media allows directly replacing the original while also limiting storage requirements. Since every row must be processed to effect collation, there may be little reason to delay other work to a separate and more parallel processing; not knowing what work, that is difficult to guess.

I tried Chuck's solution, on a test file and it works. I will have to
check into the IFS access.

Note: The expression could be enhanced to avoid problems for records missing Last Name data, and for any records where the Last Name data is located first or last. The records will not be sorted properly using the given SQL, if either the data does not follow the strict definition that was given [wherein the Last Name data resides for any given row] or if any records are missing the Last Name data "field" designator.

The machine is already at over 90% disk usage and sudden jump of 2-3
percent, may be frowned upon (but there is no way around that).

Two systems were noted as being involved, but which is described in the above quoted text is omitted.

Depending on how the query is implemented, and what additional database features are available [e.g. SMP installed], the disk usage could easily jump dramatically; possibly way beyond just double plus [just as indeterminate amount of] storage for all key values. If SMP is available, imposing some limits may be desirable; e.g. CHGQRYA

I am sure the sorting of 18 million records will take some time, and
can you say which one would take less system resources? (There are
scores and scores of other developers on the system who would scream,
if I hog the system for a few hours).

Again, which system? Is the plan to copy the unordered file from production to development and then do all of the work on the development system? The SQL could be performed on the production system and the result set then copied to development, or even possibly ordered and modified to make the copy entirely on the production system. The SQL could be run with specific work management attributes that limit the impact to the rest of the system; e.g. its own storage pool and runpty(65).

As John mentions, I could write a simple program to create an
intermediate file with a key, reorganize it to get it sorted, but I
think the RGZPFM will use as much system resources as the SQL would
(not sure about the Unix sort).

I believe that was Booth. The RGZPFM will use effectively twice the amount of both the key and the physical data sizes, plus the equivalent of storing a few hundred records separately; but already by then, there is a second copy of the data. A known and fixed key length can limit the required storage. The given SQL might generate a much longer key than necessary to hold the largest Last Name data, but I really doubt that the amount of work or storage would be significantly different than the RGZPFM; each effect about all the same processing. With some features available to the SQL [like parallel I\O], the SQL could complete much faster, but presumably with more obvious consumption of resources.

However physically sorting the data at that stage might no make sense; i.e. physical ordering could be delayed until later, because an index of the physical data could be maintained while the data was being inserted into the file. The physical ordering then could be effected when copying back to the original flat file using the index ordering, after all other work is completed; i.e. if the additional work can be effected distinctly across multiple subsets of the data, then that work can be done without any known order.

Regards, Chuck