Re: expat XML parsing problem, CPF9897: Parse error at line 1: unknown encoding -- MIDRANGE-L

Hello Cevdet,

Expat isn't "Scott Klement's tool". While it's true that I wrote some articles about how to use it from RPG, I had no part in its development.

Out of the box, Expat supports 4 different input encodings. They are:

* US-ASCII (deprecated)
* ISO-8859-1
* UTF-8
* UTF-16

Note two things:
1) ISO-8859-9 is NOT one of them.
2) It does not understand _any_ form of EBCDIC at all.

Expat also lets you register your own encoding handler via the XML_SetUnknownEncodingHandler() API. This is Expat's official method of adding support for other encodings besides the 4 listed above.

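However, even if you use XML_SetUnknownEncodingHandler(), Expat cannot handle EBCDIC encodings, because it still expects the basic characters needed to parse the <?xml?> declaration to have the same byte values they have in ASCII/Unicode. (For example, '<' is x'3C' in ASCII but x'4C' in EBCDIC, so Expat can't even recognize the start of the declaration.) So Expat does not support EBCDIC input under any circumstances.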

My solution to this was very simple: If I'm not expecting the data to have one of the four Expat-friendly encodings, I let IBM i translate my input file into UTF-8. Since Expat supports UTF-8, this solves my problem completely.

But you're not doing that! Indeed, your code is telling IBM i to translate the data to EBCDIC, something that can't ever work with Expat. Let's take a look at your code:

/free
   fd = open('/tmp/tcmb_kur.xml': O_RDONLY+O_TEXTDATA);
   if (fd < 0);
      EscErrno(errno);
   endif;

Here you are specifying O_TEXTDATA without specifying which code page or CCSID to translate the data to. The result? It's going to translate the data, using code pages (not CCSIDs), to the job's default CCSID. And the job's CCSID will, undoubtedly, be EBCDIC.

Instead, please consider doing this:

   fd = open( '/tmp/tcmb_kur.xml'
            : O_RDONLY + O_TEXTDATA + O_CCSID
            : 0        // mode: only matters when creating a file
            : 1208 );  // CCSID my program wants: 1208 = UTF-8
   if (fd < 0);
      EscErrno(errno);
   endif;

O_CCSID means you want to use CCSIDs (instead of the default, which is code pages). This is important, since Unicode is not supported with code pages.

I've also told the open() API that my program's data is in CCSID 1208. 1208 is UTF-8, so the IFS APIs will pass data to my program in UTF-8 (not EBCDIC). This means it'll be hard to view the data in the debugger, which is unfortunate, but it'll be understood by Expat (provided that you tell Expat to expect UTF-8).

Strangely, though, you are telling Expat to expect the data in ISO-8859-1. Since there are characters in ISO-8859-9 that don't exist in ISO-8859-1, that's not a good choice. (That's also why I'd use UTF-8: it supports nearly every character known to man.)

Here's what you have:

   p = XML_ParserCreate(XML_ENC_ISO8859_1);
   if (p = *NULL);
      callp close(fd);
      die('Couldn''t allocate memory for parser');
   endif;

Here's what I recommend:

   p = XML_ParserCreate(XML_ENC_UTF8);
   if (p = *NULL);
      callp close(fd);
      die('Couldn''t allocate memory for parser');
   endif;

So now I'm telling Expat to expect the data in UTF-8 format, and since IBM i has been asked to translate the data to 1208, everyone should be happy.

(Unless the data in your file doesn't match the CCSID it has been marked with, but we'll cross that bridge when/if we come to it.)
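
For reference, here's roughly how the read/parse loop fits together once the file is opened with CCSID 1208 and the parser is created with XML_ENC_UTF8. Treat this as a sketch under assumptions, not a drop-in replacement. I'm assuming your program already has a read() prototype from the same IFS copy member that gave you open() and close(), and that XML_Parse, XML_GetCurrentLineNumber, XML_GetErrorCode, XML_ErrorString, XML_ParserFree and the XML_STATUS_ERROR constant are defined in your Expat copy member. Buff, len and done are names I made up for illustration, since they aren't in the code you posted.

```rpgle
   // Assumed declarations (hypothetical names, not in the posted code):
   //   Buff  1024A  : input buffer, filled with UTF-8 bytes by read()
   //   len   10I 0  : number of bytes that read() returned
   //   done  10I 0  : set to 1 on the final XML_Parse() call

   done = 0;
   dou (done = 1);

      // Because of O_CCSID + 1208 on the open(), these bytes are UTF-8
      len = read(fd: %addr(Buff): %size(Buff));
      if (len < 1);
         len = 0;      // end of file (or error): pass no more data
         done = 1;     // and tell Expat this is the final call
      endif;

      // Buff is passed by reference here; if your XML_Parse prototype
      // takes a pointer instead, pass %addr(Buff)
      if (XML_Parse(p: Buff: len: done) = XML_STATUS_ERROR);
         callp close(fd);
         die('Parse error at line '
            + %char(XML_GetCurrentLineNumber(p)) + ': '
            + %str(XML_ErrorString(XML_GetErrorCode(p))));
      endif;

   enddo;

   XML_ParserFree(p);
   callp close(fd);
```

The point is that nothing in the loop deals with encodings. The translation to UTF-8 happens inside read(), based on what you passed to open(), and Expat interprets the bytes according to what you passed to XML_ParserCreate().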