Re: CCSIDs, Code Page, Character Sets... -- MIDRANGE-L

Now, is someone adding this to the FAQ page at http://faq.midrange.com?

Rob Berendt
-- 
Group Dekko Services, LLC
Dept 01.073
PO Box 2000
Dock 108
6928N 400E
Kendallville, IN 46755
http://www.dekko.com





Bruce Vining <bvining@xxxxxxxxxx>
Sent by: midrange-l-bounces@xxxxxxxxxxxx
12/28/2004 03:19 PM
Please respond to: Midrange Systems Technical Discussion <midrange-l@xxxxxxxxxxxx>
To: Midrange Systems Technical Discussion <midrange-l@xxxxxxxxxxxx>
cc:
Subject: CCSIDs, Code Page, Character Sets...

I don't know that I'm any expert either, but I'll give it a shot for a few
definitions and examples (which I hope come through):

Character set: A finite set of different graphic or control characters that
is complete for a given purpose.  Each character is represented by a
character identifier, such as LA020000 for Latin capital letter A.  One
example of a character set would be Character Set 640, the Syntactic
Character Set:

(Embedded image moved to file: pic01072.gif) An illustration of character set 00640


Another character set would be Character Set 697 - the Country Extended
Character Set

(Embedded image moved to file: pic13559.gif) Country extended character set 00697


Graphic character: A visual representation of a character that is normally
produced by writing, printing, or displaying.  For instance, the number sign
(SM010000) is #.

Control character:  A character whose occurrence specifies a control
function.  For instance, in EBCDIC DBCS environments the Shift Out and Shift
In control characters are used to begin and end a DBCS sequence.

Code point:  A unique bit pattern that can serve as an element of a code
page to which a character can be assigned.  The element is associated with
a binary value.  The assignment of a character to an element of a code page
determines the binary value that will be used to represent each occurrence
of the character in a character string.  Code points are one or more bytes
long.  For instance, in EBCDIC environments x'C1' is the Latin capital
letter A.
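
As a rough illustration of the x'C1' example, here is a minimal Python
sketch.  It assumes that Python's built-in cp037 codec is an acceptable
stand-in for EBCDIC Code Page 37; Python is not part of the original
discussion, and the variable names are made up for the example.

```python
# Assumption: Python's "cp037" codec stands in for EBCDIC Code Page 37.
code_point = bytes([0xC1])          # the single-byte code point x'C1'
char = code_point.decode("cp037")   # look up the character assigned to it
print(char)                         # A

# The reverse lookup: which code point represents 'A' in this code page?
back = "A".encode("cp037")
print(back.hex().upper())           # C1
```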

Code page:  A set of assignments, each of which assigns a code point to a
character.  Each code page has a unique name or identifier.  Within a given
code page, a code point is assigned to one character.  More than one
character set can be assigned code points from the same code page.  For
example, in Code Page 37 (EBCDIC USA/Canada):

(Embedded image moved to file: pic15607.gif)


As an example of more than one character set being assigned to a given code
page, note that Character Set 640 and Character Set 697 are mapped onto Code
Page 37 (640 is a proper subset of 697).

CCSID: A number identifying a specific set of identifiers for encoding
scheme, character set, code page, and additional coding-related required
information (ACRI).  For example CCSID 37 represents EBCDIC encoding, Code
Page 37, Character Set 697, and no special ACRI.  CCSID 5035 represents
EBCDIC Mixed byte encoding, Code Page 1027, Character Set 1172, Code Page
300, and Character Set 370.
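
One way to picture a CCSID is as a small record tying the pieces together.
The following Python sketch is only an illustration: the class and field
names are hypothetical (they are not an IBM API), and the only values
filled in are the ones given above for CCSID 37 and CCSID 5035.  The ACRI
for 5035 is left unset because it is not stated here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CodePageCharSet:
    """One code page paired with the character set mapped onto it."""
    code_page: int
    character_set: int


@dataclass(frozen=True)
class Ccsid:
    """Hypothetical record for what a CCSID number identifies."""
    ccsid: int
    encoding_scheme: str
    pairs: Tuple[CodePageCharSet, ...]
    acri: Optional[str] = None  # None means "not stated in this note"


# Values taken from the description above.
CCSID_37 = Ccsid(37, "EBCDIC", (CodePageCharSet(37, 697),), acri="none")
CCSID_5035 = Ccsid(
    5035,
    "EBCDIC mixed byte",
    (CodePageCharSet(1027, 1172), CodePageCharSet(300, 370)),
)

print(len(CCSID_37.pairs), len(CCSID_5035.pairs))  # 1 2
```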

One problem that is often encountered is that for different CCSIDs the
same graphic character is represented by different code point values.  If,
for example, we look at CCSID 500 (EBCDIC, Code Page 500, Character Set 697)

(Embedded image moved to file: pic10705.gif)

and CCSID 273 (EBCDIC, Code Page 273, Character Set 697)

(Embedded image moved to file: pic32130.gif)

we notice that the commercial at sign @ (SM050000) is x'7C' in CCSID 500
and x'B5' in CCSID 273.  If the system is to accurately represent/process
SM050000 then it is rather important for the system to know what CCSID a
given piece of character data is in.  Ideally, if we were to convert a
string containing SM050000 to CCSID 277

(Embedded image moved to file: pic19113.gif)

from either CCSID 500 or CCSID 273, we would end up with x'80', as that
is the correct code point value for SM050000 (@) in CCSID 277.
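
The same idea can be sketched in Python.  This is an illustration under
assumptions, not how the iSeries does it: Python's cp500 and cp273 codecs
stand in for CCSIDs 500 and 273, cp037 is used as the conversion target
instead of CCSID 277, and convert() is a hypothetical helper written for
this example.

```python
def convert(data: bytes, from_cp: str, to_cp: str) -> bytes:
    """Re-encode data from one code page to another by way of Unicode."""
    return data.decode(from_cp).encode(to_cp)


# The same graphic character has different code points per code page.
at_500 = "@".encode("cp500")
at_273 = "@".encode("cp273")
print(at_500.hex(), at_273.hex())   # 7c b5

# Correctly identified data converts to the same target code point,
# whichever code page it started in.
to_037_from_500 = convert(at_500, "cp500", "cp037")
to_037_from_273 = convert(at_273, "cp273", "cp037")
print(to_037_from_500 == to_037_from_273)   # True
```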

The iSeries will correctly perform these conversions so long as the data is
correctly identified by the CCSID currently in use to represent the data.
All too often, though, when a problem is encountered the data being processed
is not in the CCSID you might think it is.  This can be due to variables
such as the terminal you are entering data from (terminals have their own
Character Set and Code Page values), the CCSID the job is running in, and
the CCSID conversion options in effect for your job.  For this reason it is
often best/easiest to display the hex values of your data rather than
trusting the graphic character you see on your terminal, as the terminal
will simply show you a graphic character based on the assumption that the
data is in the terminal's configured code page.
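
A short Python sketch of why the hex values are the thing to trust.  As
before, the cp500, cp273, and cp037 codecs are assumed stand-ins for the
corresponding CCSIDs, and the sample string is invented for the example.

```python
# Data that really is in code page 500.
raw = "a@b".encode("cp500")
print(raw.hex())                    # 817c82

# A device that assumes code page 273 draws a different character
# for the very same bytes.
seen_on_273_device = raw.decode("cp273")
print(seen_on_273_device)           # a§b

# Converting with the wrong source code page changes the bytes;
# converting with the right one preserves the @.
wrong = raw.decode("cp273").encode("cp037")
right = raw.decode("cp500").encode("cp037")
print(wrong.hex(), right.hex())     # 81b582 817c82
```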



 
qsrvbas@netscape.net (Tom Liotta)
Sent by: midrange-l-bounces@xxxxxxxxxxxx
12/23/2004 07:26 PM
Please respond to: Midrange Systems Technical Discussion
To: midrange-l@xxxxxxxxxxxx
cc:
Subject: RE: Another QtmmSendMail problem?




>> Due to all the possible knobs it may be easiest to simply use debug and
>> look at the hex value of the address on their system prior to the
>> conversion to confirm their @ is indeed x'7C'

This piece of Bruce's advice seems to be the one that gets missed the most.
Not necessarily for this problem; I wouldn't expect Brad to skip over it
lightly.

But confusions often seem to arise simply because we look at the
_characters_ that get displayed or printed and we forget (or never realize)
that the _important_ aspect is the actual bit pattern. The shape of the
displayed or printed character can change simply because we use a different
terminal or printer or emulator. The same data can look like a '@' on one
terminal and like a '§' on another just because the devices interpret the
hex values into different character sets.

Many of us learned years ago to use *CAT or *BCAT instead of the special
characters in CL programs for this same reason. The 'characters' would
change depending on the device settings regardless of the hex values of the
data.

I keep hoping we can work out a decent FAQ coverage of what this all means,
but nobody seems comfortable enough with all the concepts to get it right.
I know how I interpret things, but I'm totally unclear whether I'm very
close to understanding. So...

I'm going to ramble on about how I understand it all just to get something
going for the archives. Maybe some real experts will respond with
corrections, enhancements, clarifications or whatever; and a real FAQ will
start to emerge.

Character set -- The name for the things we see.

To me, a 'character set' represents a list of shapes of characters. Each
'character' in the set is what a device draws when it sees a particular bit
pattern from a given code page.

Code page -- The name for the collating sequence of characters.

Different languages have different characters. German uses a lot of umlauts
for example. You can have the letter 'o' and an umlaut-'o' in German. If
you want to sort words that have those letters, how do you choose which has
the highest collating sequence? When converting to a different code page,
what happens to the collating sequence?

I've kind of pictured the problem of reconciling collating sequences
between different languages as partly being a computer design problem.
Somewhere in the hardware, there's a way to compare one character against
another and get a result that says "greater", "lesser" or "equal". But the
frequency of characters in words in a language also seems to need to be
addressed. This seems partly addressed by converting code pages from
computer to computer.

Because character comparisons happen so often, it should be addressed at a
low level. I'd imagine that code pages are designed to help minimize what
it takes to do comparisons of words and I would expect that conversions
between two code pages would take that into account. If I convert stuff
from Cyrillic to English, maybe I shouldn't expect exact bit-for-bit
matches. Maybe the difference would show up simply because computers that
normally operate on English data can be more efficient with a different
code page than computers that regularly operate on Cyrillic data.

So, the combination of code pages and character sets account for the
differences of collating sequences and visual representations. During code
page conversions, efficiencies are maybe maintained to some degree.

CCSID -- The name for a particular combination of code page with a
character set. For a given code page, a change in character set means it
needs to be a different CCSID.

Rather than keeping track of code pages and character sets separately, we
just use one symbol that stands for the two of them together, the Coded
Character Set IDentifier, CCSID.

Those are basically the vague, nebulous meanings I've come to use for those
three concepts. I'd love it if someone could say they're right, or correct
them where they're wrong. (Ideally, someone should also clear up how
Unicode fits in with CCSID since it almost seems as if Unicode sometimes
encompasses all of "CCSID" within almost a single CCSID. Or something like
that.)

Tom Liotta


midrange-l-request@xxxxxxxxxxxx wrote:

>   7. Re: Another QtmmSendMail problem? (Brad Stone)
>
>On Tue, 21 Dec 2004 12:49:56 -0600
>
>Things are still as they were...  the funny thing is,
>SNDDST  seems to send emails just fine when they use the @
>sign.
>
>When they use § and the email api, it works fine, shows up
>as @ in MIME file header, but § in a PF record that is used
>to log emails that are sent out.
>
>So to me, something is surely converting wrong in this
>particular case.
>
>They did some searching and found that the § and @
>characters are reversed in the different CCSIDs, though.
>
>I'm not sure where else to look.
>
> Bruce Vining <bvining@xxxxxxxxxx> wrote:
>>
>> A job CCSID of 870 and the resulting conversions to 500 would be the same
>> on both systems.  What might be different though are the code points being
>> generated by your keyboards (assuming the address is being provided
>> interactively) and subsequently being processed in the job so that the
>> actual inputs to the conversion are not the same.
>>
>> Due to all the possible knobs it may be easiest to simply use debug and
>> look at the hex value of the address on their system prior to the
>> conversion to confirm their @ is indeed x'7C' (and that yours in test is
>> also).  If debug isn't possible then we would need to know the
>> configuration (CHRID and KBDTYPE) of their workstation (and hope that this
>> does represent what the workstation really is sending, which with all the
>> emulators available is always a big IF these days...); the CCSID, DFTCCSID,
>> and CHRIDCTL of their job; and the CHRID value for their *DSPF (or
>> *PNLGRP).

--
Tom Liotta
The PowerTech Group, Inc.
19426 68th Avenue South
Kent, WA 98032
Phone  253-872-7788 x313
Fax    253-872-7904
http://www.powertech.com



--
This is the Midrange Systems Technical Discussion (MIDRANGE-L) mailing list
To post a message email: MIDRANGE-L@xxxxxxxxxxxx
To subscribe, unsubscribe, or change list options,
visit: http://lists.midrange.com/mailman/listinfo/midrange-l
or email: MIDRANGE-L-request@xxxxxxxxxxxx
Before posting, please take a moment to review the archives
at http://archive.midrange.com/midrange-l.
