On 18-Dec-2014 16:06 -0600, Thomas Garvey wrote:
We have a program that writes records into an externally defined
database file that has received the following error...
 <<SNIPped joblog: rewritten for compactness, symptoms,
   and adding some explanation>>
MCH3203  DbReuseFind 000FB4   QDBPUT  06B1
    12/18/14  08:52:28.640853
Message . . . . : Function error X'1720' in machine instruction.
 Internal dump identifier (ID) 0300167E.
  msgMCH3203 f/DbReuseFind x/000FB4   T/QDBPUT  x06B1
     rcx1720 rc1720 vl02001720
  The LIC DB code that attempted to locate a position for a [set of] 
deleted record(s) failed; a VLIC Log (VLog) was generated [major code 
0200 for "function check" and minor code 1720 for an effective "failed 
assumption"] to record the failure.  The failed attempt was a "put", 
aka a write or insert; evidently the dataspace supports Reuse Deleted 
Records (REUSEDLT), and thus the LIC DB was looking for a place to 
insert the record(s) into a slot held by a since-deleted row. 
Something about this search went awry; the error likely is an 
assertion that something that should have been, was found not to have 
been [i.e. a failed assumption], so rather than progressing in spite 
of the bad information [which could have led to GIGO], the operation 
was canceled to protect the integrity of the data.
CPF9999  QMHUNMSG    *N       QDBPUT  06B1
Message . . . . : Function check. MCH3203 unmonitored by QDBPUT
 at  statement *N, instruction X'06B1'.
  msgCPF9999 F/QMHUNMSG t/QDBPUT x/06B1
  The invocation of the LIC DB module that implements the /insert-row/ 
method occurs, at that fix-level of the OS DB program QDBPUT, at the 
failing instruction x/06B1 [for whatever specific type of write 
activity, according to the code path].
  The generic "function check" condition, aka the *FC, indicates that 
the OS Database "Put" program did not monitor for the failure... which 
is correct, there is really nothing the DB2 Insert feature can do when 
the underlying database support fails, so the message be manifest as an 
"unmonitored failure" which also enables some internal diagnostic 
messaging support to better record the failure.
CPF3698  QSCPUTR     005E     QMHAPD  0500
Message . . . . : Dump output directed to spooled file 1,
 job 875639/JOBNAME/E206151053 created on system ITEAM on...
  msgCPF3698 F/QSCPUTR x/005E T/QMHAPD x/0500
  As part of Software Logging [SC: ¿Service Component?] for software 
errors, the prior /unmonitored/ condition is further logged as a 
"software problem".  This is really not germane; it is just a record 
that the OS was told to log, so that was done, and the effect was 
noted in the joblog.
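  For anyone wanting to review what was recorded, a minimal sketch in 
CL of how that logged data might be located [assuming the job and its 
spooled files still exist, and using the qualified job name as shown 
in the message text above]:

   WRKJOB JOB(875639/JOBNAME/E206151053) OPTION(*SPLF)
             /* locate the dump spooled file named in msgCPF3698   */
   WRKPRB    /* review the problem-log entry created for the error */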
CPF5257  QDBSIGEX    0539     E20615105R  WORKLIB  STMT/2123
Message . . . . : Failure for device or member E206151053 file
 E206151053 in  library WORKLIB.
Cause . . : An error occurred during a read or write operation.
  msgCPF5257 F/QDBSIGEX x/0539 t/usrpgm
  The user program issuing the failing statement, a WRITE, INSERT, 
whatever, is: E20615105R in WORKLIB  STMT/2123
  The requesting program must be informed of the failed I/O.  The 
I/O error is signaled by the generic "Signal Exception" routine of the 
OS Database component, using the Common Data Management [DM] range of 
messages, because the I/O was via the Open Data Path (ODP) created for 
the Database Open request; the member, file, and library are offered 
up as part of the information, and that the operation was a "write" vs 
a read can be inferred from the OS DB program name.
I examined the file and could not determine if it was damaged. Then
I noticed the 'DbReuseFind' reference (in the original MCH3203
message) which made me check the maximum file size, and number of
deleted records, as I know the file is configured to 'reuse deleted
records'. The file had some 2,500,000 records, and slightly more than
1,000,000 were deleted records, and the file has *NOMAX for Member
Size. So, I reorganized the file, just to see what would happen. It
reorganized just fine, eliminated the deleted records, and I
restarted the job.
  The Reorganize Database File Member (RGZPFM) request will purge the 
deleted records, and thus purge the reuse-table; i.e. there will be no 
table in which deleted rows are tracked, so the database insert\put 
will have neither reason nor ability to replace a deleted row with an 
active row.
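  A minimal sketch, in CL, of how the relevant attributes might be 
reviewed and the reorganize effected [assuming the member name matches 
the file name, per the file and library named in the CPF5257 above]:

   DSPFD FILE(WORKLIB/E206151053) TYPE(*ATR)
             /* file attributes; shows Reuse deleted records (REUSEDLT) */
   DSPFD FILE(WORKLIB/E206151053) TYPE(*MBR) MBR(E206151053)
             /* member description; current vs deleted record counts   */
   RGZPFM FILE(WORKLIB/E206151053) MBR(E206151053)
             /* removes deleted records; requires an exclusive lock    */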
  Thus, the problem was "circumvented"; the origin of the problem is 
not [yet] investigated, and the details of the condition only exist in 
the logging, because the object [the dataspace] against which the 
failure transpired no longer exhibits the problem.
Now the job runs fine.
*What do you think happened?*
  The VLogs, the VL02001720 and possibly some nearby in time [esp. 
just before], plus possibly some from longer ago in the past for the 
object at the same address, may contain information relevant to 
answering that question.  What is clear, however, is that the LIC DB 
was not happy about what transpired at that moment with that dataspace 
with respect to processing for the REUSEDLT(*YES), and the operation 
was failed\canceled with purpose.
BTW, another job, which only deletes records from this file, was
failing with the same Machine Check error, with only a slightly
different Error Code (hex 1716 instead of 1720, as above)
 <<again; rewritten for compactness, symptoms, explanation>>
MCH3203  #dbdelim 00343C   QDBUDR  0C1D
  12/18/14  08:54:09.273373
Message . . . . :   Function error X'1716' in machine instruction.
Internal dump identifier (ID) .
  msgMCH3203 F/#dbdelim x/00343C T/QDBUDR x/0C1D
     rcx1716 rc1716 vl02001716
  Not sure why the DMPID did not appear in the message data; that may 
itself [and seems to] be a defect.
  This time the request was from the OS DB Delete [, Update, Release; 
i.e. the "UDR" of the program naming] processor, a request to the DB 
LIC method to Delete a Dataspace Entry [DB Delete Image presumably 
being the term behind ¿dbdelim?].  Presumably this processing, some 
~100 seconds later, encountered what was effectively /the same/ issue 
with the ReUse feature, but from a different perspective; delete 
instead of insert.  Quite possibly, as alluded for the other failure, 
the condition would have persisted.
Notice the only difference is the reference to #dbdelim on the
MCH3203 and the hex error code, now '1716' instead of '1720'.
  I am now unsure if I correctly recall the assert\assume minor code 
for the Function Check [major code 0200] VLog; perhaps that is 1716 vs 
1720, or perhaps both may be used in that manner.  No matter; this is 
another failure for which the database I/O processing was terminated.
Both jobs are running fine now, after the reorg.
*Any thoughts?*
  Again, the issue that transpired [and presumably had persisted from 
the time of the first logged incident] was circumvented by that 
action.  The origin of the issue, and any underlying issues that might 
lead to the same failures, may persist; e.g. code defects for which 
the preventive [and\or corrective] PTFs await being installed, or for 
which the problem remains unreported, such that the PTF(s) have yet to 
be generated or the problem report [APAR] answered [perhaps as 
unreproducible or as ??].
  I seem to recall that some number of releases ago, files could reach 
a maximum number of deleted rows.  I do not recall if there was a 
corrective fix that did not require RGZPFM, CLRPFM, or re-create of 
the dataspace, but a design enhancement enabled multiple re-use spaces 
to allow for a huge increase in the number of deleted records that 
could be mapped\tracked.  Possibly the dataspace for the file [the 
data portion, associated with the particular member] against which the 
error transpired was created on an old release, and some corrective 
action remained outstanding until the member could support the larger 
number of deleted rows.  That release, and the release of the OS being 
utilized, I do not think were mentioned in the OP.  If the Analyze 
Problem (ANZPRB) alluded to in the messaging were followed, and that 
led to an issue\PMR being opened with the Service Provider [and IBM 
Support looked at the collected VLogs], they might be able to 
determine that the problem was due to the limit imposed on a 
down-level object and that the limit has since been increased, such 
that what was used as the circumvention might even have been 
corrective... or they might determine the issue is a defect [that has 
or awaits a PTF].
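  A minimal sketch, in CL, of how that reporting path might be pursued 
[assuming the problem-log entry for the failure still exists on the 
system]:

   WRKPRB    /* locate the problem-log entry logged for the failure  */
   ANZPRB    /* analyze and optionally report the problem to the     */
             /* service provider, per the prompted analysis path     */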