Leif,

The S/34 had an inquiry utility that did something on the same principle, and I use it for keyword-in-content searching of multi-million record files. You construct an indexed file with the word and record number, and process the index to point you to likely suspects in the text file. You only have to build the index once, and add a trigger to the file to handle the case where someone edits or adds a line.

We build the file with a limit of 8 characters per word and just truncate the index key. This does mean that an index built for the words abcdefgh and abcdefghi creates the same index key for both, so you limit the size of the index file but must add logic to check the full word when interrogating the suspects.

Leif Svalgaard wrote:
>
> Thanks everybody for the variety of suggestions.
> Just in case you wonder where the file came from,
> the real problem was this:
>
> Imagine you have a large text (say all the programs
> of a large application) of about 1,000,000 lines.
>
> You want to make a KWIC index. In case you have never
> heard of a KWIC index, it is constructed in this way:
>
> For each line in the text:
> shift the line left or right until the first word starts in (say)
> position 40, output the line, shift again until the second word
> is in position 40, output, etc. until all words on the line
> have been treated.
>
> The output file will contain about 10 times as many records
> as the input, assuming about 10 words per line.
>
> Sort the output on position 40. Now you have a KWIC
> index.
>
> There are other ways of doing this. The above is very
> simple and easy to get to work if you have a fast sort.
>
> An improvement is to look each word up in a little dictionary
> of common words like "the", "and", etc. and not output them,
> but that is not the issue.
>
> I had some discussion with somebody on the running time
> of this on various platforms. It is clear that the sort dominates.
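For anyone who wants to try the two ideas in this thread, here is a rough Python sketch, not the S/34 utility or anyone's production code. It shows the truncated-key word index with the extra check on the suspects, and the shift-and-sort KWIC build from the quoted message. All names (`build_index`, `search`, `kwic`, `KEY_LEN`, `WORD_RE`) are made up for illustration. The definition of a "word" (letters, digits, underscore) and the case-insensitive matching are assumptions; the thread does not specify either.

```python
import re
from collections import defaultdict

KEY_LEN = 8  # index keys are limited to 8 characters, as described in the reply
WORD_RE = re.compile(r"[A-Za-z0-9_]+")  # assumption: what counts as a "word"


def build_index(lines):
    """Map each truncated, lower-cased word to the record numbers containing it."""
    index = defaultdict(set)
    for recno, line in enumerate(lines, start=1):  # record numbers start at 1
        for word in WORD_RE.findall(line):
            index[word.lower()[:KEY_LEN]].add(recno)
    return index


def search(index, lines, word):
    """Fetch suspects by truncated key, then check each suspect for the full word."""
    word = word.lower()
    suspects = index.get(word[:KEY_LEN], set())
    hits = []
    for recno in sorted(suspects):
        words = {w.lower() for w in WORD_RE.findall(lines[recno - 1])}
        if word in words:  # the additional logic needed because keys are truncated
            hits.append(recno)
    return hits


def kwic(lines, column=40):
    """One output record per word, keyword starting at 1-based `column`, then sorted."""
    records = []
    for line in lines:
        for match in WORD_RE.finditer(line):
            pad = column - 1 - match.start()
            # shift right by padding, or shift left by dropping leading text
            shifted = " " * pad + line if pad >= 0 else line[-pad:]
            records.append(shifted)
    # sort on the keyword position; ignoring case is an assumption
    records.sort(key=lambda r: r[column - 1:].lower())
    return records
```

With lines "the abcdefgh flag" and "an abcdefghi flag", both records land under the key abcdefgh, and `search` narrows the suspects down to the one that really contains the word asked for. The stop-word dictionary and the trigger that keeps the index current are left out.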
+---
| This is the Midrange System Mailing List!
| To submit a new message, send your mail to MIDRANGE-L@midrange.com.
| To subscribe to this list send email to MIDRANGE-L-SUB@midrange.com.
| To unsubscribe from this list send email to MIDRANGE-L-UNSUB@midrange.com.
| Questions should be directed to the list owner/operator: david@midrange.com
+---
This mailing list archive is Copyright 1997-2024 by midrange.com and David Gibbs as a compilation work. Use of the archive is restricted to research of a business or technical nature. Any other uses are prohibited. Full details are available on our policy page. If you have questions about this, please contact [javascript protected email address].