This was intended to be an inquiry to see if anyone had a little more
expertise in regex with sed than me, rather than devolve into a
discussion on 101 for xml initiates.

As I indicated to Scott, our xml is a bit like legacy data that has
always been delivered to us in its present and sometimes inadequately
formed state, which has not been an issue or even known until now
because the data was never formally parsed with the industry standard
parsers which we have recently introduced to our presentation layer.
Our present issue is essentially a rescue exercise to prevent customers
getting occasional outages caused by such data issues that have only
arisen now that it is clear that some xml documents are not acceptable.
So in a sense, the horse has already bolted.

Thanks for the responses,
Peter


-----Original Message-----
From: rpg400-l-bounces@xxxxxxxxxxxx
[mailto:rpg400-l-bounces@xxxxxxxxxxxx] On Behalf Of Larry Ducie
Sent: Thursday, 13 August 2009 11:42 p.m.
To: RPG400
Subject: RE: sed to translate


Hi Peter,

<snip>
Thanks all,
I think I agree.
Concatenating 2 regexes should suffice.
But it would be nice to tuck away a more arcane single expression for
future reference.
</snip>

With the greatest of respect - I think you are missing the point made by
Scott.

I am sure you know, but I'll say it anyway (for the sake of the archive)
that XML has some reserved characters because they form part of the
language. These are: <, >, ", and '. That is, less than, greater than,
quote, and apostrophe. To ensure a parser knows that one of these
characters is part of the data and not part of the structure of the xml
you have to escape these characters.

The escaped entities are written as follows:

less than: &lt;
greater than: &gt;
quote: &quot;
apostrophe: &apos;

These are pre-defined entity references and are part of the language,
but all escaped references are wrapped in a leading & and trailing ;

This leads inevitably to another issue - the ampersand is now reserved
and part of the language because it denotes the start of an escaped
entity reference. This means we have to escape the ampersand!

ampersand: &amp;

So there are 5 pre-defined entity references in XML.
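
For the archive, here is a minimal sketch of escaping the five characters above, written in Python (nothing like this appeared in the original thread). It leans on the standard library's xml.sax.saxutils.escape, which handles &, < and > by itself. The extra mapping for quote and apostrophe and the helper name escape_xml_text are assumptions made purely for illustration.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping: escape() only handles &, < and > by default,
# so quote and apostrophe are supplied as extra entities.
_EXTRA = {'"': "&quot;", "'": "&apos;"}

def escape_xml_text(data: str) -> str:
    """Escape the five reserved XML characters in a piece of data."""
    # escape() replaces & before anything else, so the ampersands that
    # belong to the other entity references are not escaped a second time.
    return escape(data, _EXTRA)

print(escape_xml_text('Tom & "Jerry" <cat>'))
```

The ordering is the part worth remembering: if the ampersand were escaped last, every &lt; produced earlier would be turned into &amp;lt;.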

The point Scott was clearly making is - well formed xml can easily have
all of these escaped entity references. In fact, well formed xml should
NEVER have any of the five reserved characters in its data unless it is
escaped in the usual way. Further, xml doesn't restrict itself to the 5
pre-defined entity references. You can write any unicode character as a
numeric character reference of the form &#nnnn; (nnnn code point in
decimal form) or &#xhhhh; (hhhh code point in hexadecimal form). This
allows you to place any valid unicode character in your text-based xml
document and not worry about it breaking the parser.
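
As a rough illustration of the two numeric forms, the Python sketch below builds a decimal and a hexadecimal reference for the same character and lets a standard parser (xml.etree.ElementTree) resolve them. The helper name numeric_ref and the sample document are assumptions, not anything from the thread.

```python
import xml.etree.ElementTree as ET

def numeric_ref(ch: str, hexadecimal: bool = False) -> str:
    """Hypothetical helper: numeric character reference for one character."""
    code = ord(ch)
    return f"&#x{code:X};" if hexadecimal else f"&#{code};"

e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# The same character written once in decimal and once in hexadecimal form.
doc = (
    "<name>Caf" + numeric_ref(e_acute)
    + " / Caf" + numeric_ref(e_acute, hexadecimal=True) + "</name>"
)

# A standard parser resolves both forms back to the same character.
text = ET.fromstring(doc).text

print(doc)
print(text)
```

The document itself stays plain ASCII, and the parser hands back the real character in both cases.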

This is all standard, and as an example I believe this is all catered
for when using %XML-SAX. The pre-defined entity references cause a
*XML_PREDEF_REF event to fire and any other character reference will
cause a *XML_UNKNOWN_REF event to fire.

We haven't even got into DTD Entity declarations (internal and
external). Construction of Entity Replacement Text alone is extremely
complicated because what you do can be governed by a DTD and its
hierarchical handling of declarations! Believe me - you really don't
want to go down the road of custom parsing. The XML 1.0 spec is pretty
broad and you find yourself adding one fix after another. Get the XML
fixed and use a standard w3c compliant parser. :-)
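
One hedged way to see which documents need fixing at the source is simply to run them through a standard parser and note the ones it rejects. The Python sketch below assumes xml.etree.ElementTree as the parser and uses a made-up helper name, is_well_formed. It only checks well-formedness and does nothing with DTDs.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Hypothetical helper: True if a standard parser accepts the document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # The parser rejects unescaped reserved characters in data,
        # such as a bare ampersand or a bare less-than sign.
        return False

print(is_well_formed("<a>Tom &amp; Jerry</a>"))
print(is_well_formed("<a>Tom & Jerry</a>"))
```

A check like this only reports the problem. The repair still belongs with whoever produces the XML.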

Please see: http://www.w3.org/TR/REC-xml/ section 4.1 to see the details
of the XML 1.0 specification and how it relates to character references.

Cheers
Larry Ducie



