Sorry for the slight detour, but this thread has got me wondering: how does
the database implement variable-byte encodings like UTF-8?
To detour even more: across the various DBMSs, it is purely a matter of convention (when you define a field, some take the length to mean bytes, some "chars").
But, at the end of the day, you need to store the thing somewhere, print it somewhere, etc. (despite the "infinite string" illusion of some languages).
IMHO the most direct and simple way a DBMS can implement it is to specify the length in bytes, like DB2 does, because it has fewer surprises ... but ideally one should also be allowed to specify a "max codepoints" limit in the db layer.
10 bytes (the CHAR(10)) can represent a variable number of symbols/glyphs: in UTF-8 a single code point takes anywhere from 1 to 4 bytes.
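To make the bytes-versus-symbols point concrete, here is a minimal Python sketch (not DB2 code). The helper names fits_bytes and fits_codepoints are made up for illustration; the second one is only a guess at what a "max codepoints" limit could look like.

```python
def fits_bytes(text: str, max_bytes: int = 10) -> bool:
    """True if the UTF-8 encoding of text fits in max_bytes bytes (hypothetical helper)."""
    return len(text.encode("utf-8")) <= max_bytes


def fits_codepoints(text: str, max_codepoints: int = 10) -> bool:
    """True if text has at most max_codepoints code points (hypothetical helper)."""
    # In Python, len() of a str counts code points, not bytes.
    return len(text) <= max_codepoints


samples = {
    "ascii": "abcdefghij",           # 10 code points, 1 byte each
    "accented": "\u00e0" * 10,       # 10 x precomposed a-grave, 2 bytes each
    "japanese": "\u65e5\u672c\u8a9e",  # 3 code points, 3 bytes each
    "cuneiform": "\U00012000" * 3,   # 3 code points, 4 bytes each
}

for name, s in samples.items():
    n_bytes = len(s.encode("utf-8"))
    print(name, len(s), n_bytes, fits_bytes(s), fits_codepoints(s))
```

All four samples pass a 10-code-point limit, but only the ASCII and Japanese ones fit in 10 bytes.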
And take into account that in Unicode it's easy to make assumptions about glyph-to-position equivalence, but it is a complex standard.... It needs to allow for Japanese while preserving cuneiform scripts....
For example, "à" can have its own precomposed codepoint (U+00E0) *OR* it can be represented by a combination in the stream.... "a" + a combining grave accent, U+0300 (A PLUS A COMBINING CHARACTER) (the NORMALIZE option in DB2 should take care of it when storing).
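A small Python sketch of the two representations, using the standard unicodedata module. It only illustrates Unicode normalization in general and says nothing about how DB2 does it internally.

```python
import unicodedata

precomposed = "\u00e0"   # a-grave as a single code point
decomposed = "a\u0300"   # "a" followed by COMBINING GRAVE ACCENT

# They render the same but are different code point sequences.
print(precomposed == decomposed)                      # False
print(len(precomposed), len(decomposed))              # 1 2
print(len(precomposed.encode("utf-8")),
      len(decomposed.encode("utf-8")))                # 2 3

# NFC composes where possible, NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed, nfd == decomposed)          # True True
```

So without normalization, the "same" letter can compare unequal and take a different number of bytes in a byte-sized column.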
ḁ̴̧̈́̆n̷̢͍̝̞̊̅d̷̢̖̻̍̎͗̆̔̐ ̴̡̟͖̦͎̓͋ͅẹ̸͆̎͐̈́v̴̤̾̋̐̆e̸̛͚n̷̙̯̼̐̋ ̷͙̣̞͕͊͂m̶̞̦̏͂̆̕͝o̶͇̮̜͍͇̓͝r̶̨̹̗̘̱̦̃̕è̴̬͋ ̴̨̡̛͐̊̄̄̚ṇ̵́͂̌̾̌u̷̥̙̅̎a̸̡̧͈̗̟̣̋́̿͝ņ̸̥̜̂͛̑c̸̣̻̏͛e̴̩͊s̴̳̯̱͔̅....
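Text like the line above is just base letters with many combining marks stacked on each one. As a rough Python sketch, one can count them separately; the helper name base_and_marks is made up, and counting "non-mark code points" is only an approximation of what a reader sees as symbols.

```python
import unicodedata


def base_and_marks(text: str) -> tuple[int, int]:
    """Return (non-mark code points, mark code points) for text (hypothetical helper)."""
    # General categories Mn, Mc and Me are the Unicode "mark" categories.
    marks = sum(1 for ch in text if unicodedata.category(ch).startswith("M"))
    return len(text) - marks, marks


# One visible letter with five stacked combining marks (U+0301..U+0305).
stacked = "e" + "\u0301\u0302\u0303\u0304\u0305"

print(base_and_marks(stacked))          # (1, 5)
print(len(stacked))                     # 6 code points
print(len(stacked.encode("utf-8")))     # 11 bytes
```

Here a single on-screen symbol already needs 11 bytes in UTF-8, so it would not fit in a 10-byte field.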
This mailing list archive is Copyright 1997-2025 by midrange.com and David Gibbs as a compilation work. Use of the archive is restricted to research of a business or technical nature. Any other uses are prohibited. Full details are available on our policy page. If you have questions about this, please contact
[javascript protected email address].