A Short List of Questions to "Unicode-aware" Libraries/Software
=================================================================

(This list is not yet complete! ;-))


 0. Why?
---------

Because there are sooo many libraries/programs that claim to be
"Unicode-aware", but most of them (if not all?) are not, or only
partially. This sucks a lot if you rely on or need a _specific_
Unicode property or feature which is just missing in that
library/program.

So if you want (or have) to "use Unicode" in your program, you
should seriously ask yourself which parts or features of Unicode
you _really_ need.


 1. General questions
----------------------

Do the authors show that they know and understand the difference
between a "Unicode character" and a "code point"?

How are code points stored and handled? As a 16 bit entity?
If yes: how do they handle code points that are outside of the BMP?
(See: http://en.wikipedia.org/wiki/Basic_Multilingual_Plane)
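
A quick way to see why "16 bit" is not enough, sketched here in Python
(the language is my choice for illustration only, this list is not tied
to any language): a code point outside the BMP needs two 16 bit code
units in UTF-16, a so-called surrogate pair.

```python
# U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
clef = "\U0001D11E"

# Python 3 strings are sequences of code points, so this is one element.
code_points = len(clef)

# Encoded as UTF-16 (big-endian, no BOM) it takes two 16 bit code units.
utf16 = clef.encode("utf-16-be")
code_units = len(utf16) // 2

# The two units form a surrogate pair:
# high surrogate in D800..DBFF, low surrogate in DC00..DFFF.
high = int.from_bytes(utf16[0:2], "big")
low = int.from_bytes(utf16[2:4], "big")

print(code_points, code_units, hex(high), hex(low))
```

A library that treats every 16 bit unit as "one character" would count
two characters here and could split the pair in the middle.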

Is the difference between UTF-16 and UCS-2 mentioned?

What about code points designated as "noncharacters" by Unicode?
(U+FDD0...U+FDEF, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, ...
U+10FFFE, U+10FFFF) Are they mentioned, detected, signalled?
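
Detecting them is cheap. A minimal sketch in Python; the function name
is_noncharacter() is made up for this example and is not taken from any
library:

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the Unicode noncharacters."""
    # The contiguous block U+FDD0..U+FDEF (32 code points).
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of each of the 17 planes:
    # U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


# Count all noncharacters in the whole code space.
total = sum(is_noncharacter(cp) for cp in range(0x110000))
print(total)
```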

If there are functions to handle UTF-8-encoded strings:
Are illegal UTF-8 sequences detected (if yes: which ones?) and
signalled?
(see: http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf)
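
To make "which ones?" more concrete, here are a few classic ill-formed
sequences, checked against Python's built-in strict UTF-8 decoder. The
helper is_valid_utf8() and the sample table are only an illustration;
other libraries may accept some of these silently.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")  # error handling defaults to "strict"
        return True
    except UnicodeDecodeError:
        return False


# Ill-formed byte sequences that a careful decoder should reject.
samples = {
    "overlong encoding of '/'": b"\xc0\xaf",
    "encoded surrogate U+D800": b"\xed\xa0\x80",
    "value beyond U+10FFFF": b"\xf4\x90\x80\x80",
    "truncated 3-byte sequence": b"\xe2\x82",
    "stray continuation byte": b"\x80",
}

for label, data in samples.items():
    print(label, is_valid_utf8(data))

# For comparison: the well-formed encoding of U+20AC EURO SIGN.
print(is_valid_utf8(b"\xe2\x82\xac"))
```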

How is a single "Unicode character" stored? As a single code point?
What about characters that need more than one code point (letters
with several stacked accents, complicated East Asian signs etc.)?
(N.B.: That Unicode "feature" breaks any attempt to implement an O(1)
access to a "single character" in a Unicode string! [1])
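
A small Python illustration of the problem (Python strings index by
code point, which is typical for many string types): length and
indexing both disagree with what a reader would call "a character".

```python
import unicodedata

# 'x' + COMBINING DIAERESIS + COMBINING DOT BELOW, followed by 'y'.
# A reader sees two characters; the string holds four code points.
s = "x\u0308\u0323" + "y"

length = len(s)   # counts code points, not perceived characters
first = s[0]      # only the bare 'x', both accents are cut off

# Non-zero canonical combining class marks the two accents.
is_mark = [unicodedata.combining(c) != 0 for c in s]

print(length, first, is_mark)
```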


 2. Handling of different normalization forms
----------------------------------------------

Is the topic of different string normalization forms (NFC, NFD, NFKC,
NFKD) mentioned in the docs?
(see http://en.wikipedia.org/wiki/Unicode_equivalence)

Rough test: Is a

     U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS)

handled equivalently to the combination of

     U+0061 (LATIN SMALL LETTER A) and
     U+0308 (COMBINING DIAERESIS)?

If yes: Are there functions to convert a string from one normalization
form to another? What about non-reversible conversions? Are they
mentioned, signalled, handled?
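
The rough test can be written down with Python's standard unicodedata
module. This is just one way to run it; the variable names are mine.
The last lines show a non-reversible conversion: the compatibility
forms (NFKC/NFKD) throw information away.

```python
import unicodedata

precomposed = "\u00e4"   # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # 'a' + COMBINING DIAERESIS

# A plain comparison works on code points and sees two different strings.
naive_equal = precomposed == decomposed

# After normalizing both sides to the same form they compare equal.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

# Non-reversible: NFKC turns U+00B2 SUPERSCRIPT TWO into a plain '2',
# and no normalization form brings the superscript back.
superscript = "\u00b2"
folded = unicodedata.normalize("NFKC", superscript)
restored = unicodedata.normalize("NFC", folded)

print(naive_equal, nfc_equal, nfd_equal, folded, restored == superscript)
```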

What about accented letters with no precomposed equivalent, e.g.
<U+0062 U+0308>, or with several "stacked" accents? Are these handled
as _one_ letter/character or not?

Are there functions to "strip" accents? (<U+00E4> -> <U+0061>)
If yes: How do they handle ligatures (e.g. U+00E6)? Are these also
stripped (<U+00E6> becomes <U+0061>), decomposed (<U+00E6> becomes 'ae'
<U+0061 U+0065>) or left untouched? Do they strip _all_ accents or only
some/the last one?
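
For comparison, this is what one common naive approach does, sketched
in Python: decompose to NFD, then drop all nonspacing marks (general
category "Mn"). strip_accents() is a name made up for this sketch, not
a library function.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop all nonspacing marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


plain_a = strip_accents("\u00e4")          # a with diaeresis
stacked = strip_accents("x\u0308\u0323")   # two stacked accents on 'x'
ligature = strip_accents("\u00e6")         # LATIN SMALL LETTER AE

print(plain_a, stacked, ligature)
```

With this approach _all_ stacked accents are removed, but U+00E6 is
left untouched because it has no decomposition at all. Whether that is
the behaviour you want is exactly the question to ask.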

Does a substring search find spelling variants? (e.g. a search for
'fi' <U+0066 U+0069> in a text where it is encoded as the ligature
<U+FB01>)
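
In Python, for example, a plain substring search does not find it; the
text has to be folded with NFKC (or NFKD) first. A small sketch:

```python
import unicodedata

# "office" spelled with U+FB01 LATIN SMALL LIGATURE FI.
text = "of\ufb01ce"

# The raw search compares code points and misses the ligature.
found_raw = "fi" in text

# Compatibility normalization replaces the ligature by 'f' + 'i'.
folded = unicodedata.normalize("NFKC", text)
found_folded = "fi" in folded

print(found_raw, found_folded, folded)
```

Note that match positions in the folded string no longer line up with
positions in the original text, so a library has to map them back.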


 2.1. Case conversion
- - - - - - - - - - - -

Are functions like toUpper() and toLower() locale-dependent? If not:
How do they know whether an 'i' (U+0069) must be uppercased to 'I'
(U+0049) or to U+0130 ("LATIN CAPITAL LETTER I WITH DOT ABOVE"), as
would be necessary in Turkish texts?
(see: http://en.wikipedia.org/wiki/Turkish_dotted_and_dotless_I)
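
Python's str.upper() and str.lower() are a handy example of the
locale-independent kind: they always apply the default mappings. The
helpers upper_tr() and lower_tr() below are hypothetical, hand-rolled
workarounds that cover only the dotted/dotless i, nothing else a real
Turkish locale might need.

```python
def upper_tr(text: str) -> str:
    """Hypothetical helper: uppercase with the Turkish i -> U+0130 rule."""
    return text.replace("i", "\u0130").upper()


def lower_tr(text: str) -> str:
    """Hypothetical helper: lowercase with the Turkish I -> U+0131 rule."""
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


default_upper = "istanbul".upper()   # default mapping: plain 'I'
turkish_upper = upper_tr("istanbul")  # starts with U+0130

# The default lowercase of U+0130 is 'i' + U+0307 COMBINING DOT ABOVE,
# i.e. two code points.
default_lower_len = len("\u0130".lower())

print(default_upper, turkish_upper, default_lower_len)
```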

What about
   U+017F (LATIN SMALL LETTER LONG S) and
   U+00DF (LATIN SMALL LETTER SHARP S) ?

The latter has (since Unicode 5.1) a capital counterpart (U+1E9E),
which the official German spelling rules have allowed only since 2017,
and only as an alternative. By default a "sharp s" is still
transformed to 'SS' in uppercase.
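
What the default Unicode case mappings do with these two letters can be
checked quickly, here with Python's built-in string methods (other
libraries may differ, which is the point of asking):

```python
sharp_s = "\u00df"      # LATIN SMALL LETTER SHARP S
long_s = "\u017f"       # LATIN SMALL LETTER LONG S
capital_sharp_s = "\u1e9e"

upper_sharp = sharp_s.upper()            # grows to two letters
upper_long = long_s.upper()              # becomes an ordinary 'S'
lower_capital = capital_sharp_s.lower()  # maps back to U+00DF

# Uppercasing is not reversible: the round trip loses the sharp s.
round_trip = "Stra\u00dfe".upper().lower()

# Case folding (for caseless comparison) also expands it.
folded = sharp_s.casefold()

print(upper_sharp, upper_long, lower_capital == sharp_s, round_trip, folded)
```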


 3. Scripts with different spelling conventions
------------------------------------------------

Does the software have a generalization for "letters", "white spaces"
and "word breaks", which is needed for example in many string matching
algorithms? (Btw, finding "word breaks" by looking for white space
does not work for languages such as Chinese or Japanese, which are
written without spaces between words.)
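
A few probes, using Python's built-in string methods as a stand-in for
"the library under test". The Chinese sample sentence (roughly "I like
cats", three words) is my own example.

```python
import re

# "Letters" and "white space" beyond ASCII.
umlaut_is_letter = "\u00e4".isalpha()
cjk_is_letter = "\u4e2d".isalpha()
ideographic_space_is_space = "\u3000".isspace()  # IDEOGRAPHIC SPACE

# Splitting on white space finds the words of an English phrase ...
english_words = "the quick fox".split()

# ... but a Chinese sentence comes back as one single chunk.
chinese = "我喜欢猫"
chinese_words = chinese.split()
regex_words = re.findall(r"\w+", chinese)

print(len(english_words), len(chinese_words), len(regex_words))
```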

Are right-to-left scripts (such as Arabic, Hebrew) handled properly in
rendering and text processing? (In these scripts an opening bracket
like U+0028 "LEFT PARENTHESIS" is displayed mirrored, i.e. it looks
like ')'!)

Is text with mixed R-to-L and L-to-R scripts handled and rendered
correctly?
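
Whether a library at least exposes the character properties needed for
this can be probed as below, again with Python's unicodedata module.
This only inspects properties; it does not implement the bidirectional
algorithm itself. strong_directions() is a name invented for the sketch.

```python
import unicodedata


def strong_directions(text: str) -> set:
    """Return the strong bidi classes (L, R, AL) that occur in text."""
    classes = {unicodedata.bidirectional(c) for c in text}
    return classes & {"L", "R", "AL"}


latin = unicodedata.bidirectional("a")        # left-to-right
hebrew = unicodedata.bidirectional("\u05d0")  # HEBREW LETTER ALEF
arabic = unicodedata.bidirectional("\u0627")  # ARABIC LETTER ALEF

# Brackets are flagged as "mirrored" characters.
paren_mirrored = unicodedata.mirrored("(")

# Latin letters followed by Hebrew letters: two strong directions.
mixed = strong_directions("abc \u05d0\u05d1\u05d2")

print(latin, hebrew, arabic, paren_mirrored, sorted(mixed))
```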


<TO BE CONTINUED>

------------
[1]: ...unless they use an array of pointers, so each pointer refers to
    one Unicode character (that might consist of more than one code
    point!). But I don't know of any Unicode string library that does so...
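
    A rough sketch of that idea in Python, with a list of substrings
    standing in for the array of pointers. split_clusters() is a made-up
    name, and the rule "a base character plus all following marks" is a
    simplification of my own; real text segmentation has more cases.

```python
import unicodedata


def split_clusters(text: str) -> list:
    """Group each character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me are the combining marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


s = "x\u0308\u0323yz"
index = split_clusters(s)   # built once, in O(n)

# Afterwards each "character" is reachable in O(1) by position.
first = index[0]
second = index[1]

print(len(s), len(index), [len(c) for c in index])
```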