I will attempt to explain only what I find laughable [not to respond to any questions posed], using a simplistic example that may clarify my earlier comment, when viewed within the context of that example:

Given an ASCII text file whose single line of text accurately declares its encoding [the imaginary ASCII-based "ISO-8X"]:
<?xml encoding="ISO-8X"?>

Now transmit only that line of ASCII-encoded data from that text file to a text file on an EBCDIC system. FTP provides just such a text transport. Because the target system uses a different encoding for the data in its text files, a text transport properly effects automatic translation of the ASCII characters to EBCDIC characters. After that _appropriate transport_ of just the text data to the target system, the claim still being made by the [character string] data on the target system, that its encoding is "ISO-8X", is now _false_. That is, the data is now EBCDIC, irrespective of the claim by that [now EBCDIC] character string that its XML encoding is the ASCII "ISO-8X".
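The scenario can be reproduced in a few lines. This is only a minimal sketch: it assumes Python's cp037 codec as a stand-in for the target system's EBCDIC code page, and text_mode_transfer is a hypothetical helper that imitates what a text-mode FTP transfer does, not a real FTP call.

```python
# The declaration line from the example; "ISO-8X" is the imaginary encoding name.
declaration = '<?xml encoding="ISO-8X"?>'

# The bytes as they sit in the ASCII text file on the source system.
ascii_bytes = declaration.encode("ascii")


def text_mode_transfer(data: bytes) -> bytes:
    """Hypothetical stand-in for a text-mode transfer: ASCII in, EBCDIC out.

    Assumption: cp037 represents the EBCDIC code page of the target system.
    """
    return data.decode("ascii").encode("cp037")


# The bytes as they land in the text file on the EBCDIC system.
ebcdic_bytes = text_mode_transfer(ascii_bytes)

# The stored bytes differ, yet the text they spell is unchanged,
# so the line still claims "ISO-8X" while actually being EBCDIC.
print(ascii_bytes.hex())
print(ebcdic_bytes.hex())
print(ebcdic_bytes.decode("cp037"))
```

The characters survive the transfer intact, which is exactly why the claim they make about their own encoding does not.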

Really.... _that_ is what makes me laugh. The data by itself clearly was no longer self-describing in that scenario.

Maybe I laugh only because from that scenario, I am reminded of a quote by Steve Martin as Navin Johnson in the beginning of the movie /The Jerk/. In stating his origins, Navin was similarly unable to properly describe himself.

Regards, Chuck

Scott Klement wrote:
Chuck,

Remember, XML is a format designed for data interchange. When you
copy a file via a tool like Windows Networking (or /QNTC, etc) or
when you copy a file via a tool like FTP, SCP, SFTP, RCP, etc, etc...
how can the encoding possibly be known if it's not part of the
document?

If I do business with 100000 different suppliers, and each one needs
to send me an XML document, surely it's not reasonable to expect that
every one of them will use a proprietary i5/OS FTP extension to set
the CCSID properly when they transfer the XML to me? Especially since
each of those 100000 vendors probably has several thousand other
customers all running different systems. Is it reasonable for them to
have to do something different for everyone's system to set the
proper encoding/CCSID/codepage or whatever?

No. The data needs to be self-describing. Folks need to be able to send the XML data and not worry about doing some extra steps to
notify each system which encoding the data is in.

I can understand you being curious as to how it works, but I don't understand why you'd find such a practical feature to be "laughable".


By the standards, all XML documents must begin with the < character. There's no other character a well-formed XML document can begin with.
Given that, it's really pretty easy to determine if the document is big-endian or little-endian. Does it start with x'003c'? Then it's
big endian. Does it start with x'3c00'? Then it's little-endian. In
all ASCII and Unicode encodings, < is always x'3c', x'3c00' or
x'003c'. In EBCDIC it'd be x'4C' (or x'004C'), though I don't think
EBCDIC is technically supported by the XML standard.

Since all of the characters in the <?xml encoding="whatever"?> are invariant, simply reading the first two bytes of the document should
be enough to let you read the "encoding" information (or to determine
that it's not there, in which case you take the default).
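The detection described above might be sketched as follows. This is not how any particular XML parser does it; sniff_family and declared_encoding are hypothetical names, cp037 is assumed as the probe codec for EBCDIC, the "utf-8" default is an assumption, and byte order marks and 4-byte encodings are ignored.

```python
import re
from typing import Optional

# Assumption: one representative codec per family is enough to read the
# declaration, because its characters are invariant within that family.
PROBE_CODEC = {
    "utf-16-be": "utf-16-be",
    "utf-16-le": "utf-16-le",
    "ascii-compatible": "ascii",
    "ebcdic": "cp037",
}


def sniff_family(data: bytes) -> str:
    """Guess the encoding family from how the leading '<' is stored."""
    if data[:2] == b"\x00\x3c":
        return "utf-16-be"
    if data[:2] == b"\x3c\x00":
        return "utf-16-le"
    if data[:1] == b"\x3c":
        return "ascii-compatible"
    if data[:1] == b"\x4c":
        return "ebcdic"
    return "unknown"


def declared_encoding(data: bytes, default: str = "utf-8") -> Optional[str]:
    """Return the name in encoding="...", the default if absent, or None."""
    codec = PROBE_CODEC.get(sniff_family(data))
    if codec is None:
        return None  # does not start with a recognizable '<'
    # Decode only the head of the document with the probe codec.
    head = data[:200].decode(codec, errors="replace")
    match = re.match(r'<\?xml[^>]*?encoding="([^"]+)"', head)
    return match.group(1) if match else default


sample = '<?xml encoding="ISO-8X"?>'
for codec in ("ascii", "utf-16-be", "utf-16-le", "cp037"):
    raw = sample.encode(codec)
    print(codec, sniff_family(raw), declared_encoding(raw))
```

Note that the cp037 case still reports the declared name as ISO-8X: the sniffing step can tell how the bytes are actually stored, but the declaration itself only repeats whatever name was written into it.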

Not sure why this is laughable.


CRPence wrote:

Without regard to any concerns for how something does or does not work, well or appropriately, I offer in response to the quoted snippet...

I have always thought it laughable that character data could ever be considered self-describing of its encoding. With all the various [and possibly any new] encoding schemes, including endian issues, how could any stream of bits in any particular encoding actually define the encoding of the complete stream of bits? Just as with transporting those bits, what the encoding is must be negotiated _outside_ of the data itself.



This mailing list archive is Copyright 1997-2025 by midrange.com and David Gibbs as a compilation work. Use of the archive is restricted to research of a business or technical nature. Any other uses are prohibited. Full details are available on our policy page. If you have questions about this, please contact [javascript protected email address].
