A Short List of Questions to "Unicode-aware" Libraries/Software
=================================================================

(This list is not yet complete! ;-))


0. Why?
---------

Because there are so many libraries/programs that claim to be
"Unicode-aware", but most of them (if not all?) are not, or only
partially. This sucks a lot if you rely on or need a _specific_ Unicode
property or feature which is simply missing in that library/program.

So if you want (or have) to "use Unicode" in your program, you should
seriously ask yourself which parts or features of Unicode you _really_
need.


1. General questions
----------------------

Do the authors show that they know and understand the difference between
a "Unicode character" and a "code point"?

How are code points stored and handled? As a 16-bit entity? If yes: how
do they handle code points that are outside of the BMP?
(See: http://en.wikipedia.org/wiki/Basic_Multilingual_Plane)
Is the difference between UTF-16 and UCS-2 mentioned?

What about code points denoted as "non-character" by Unicode?
(U+FDD0...U+FDEF, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, ...
U+10FFFE, U+10FFFF)
Are they mentioned, detected, signalled?

If there are functions to handle UTF-8-encoded strings:
Are illegal UTF-8 sequences detected (if yes: which ones?) and
signalled? (See: http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf)

How is a single "Unicode character" stored? As a single code point?
What about characters that need more than one code point (letters with
several stacked accents, complicated East Asian signs etc.)?
(N.B.: That Unicode "feature" breaks any attempt to implement O(1)
access to a "single character" in a Unicode string! [1])


2. Handling of different normalization forms
----------------------------------------------

Is the topic of different string normalization forms (NFC, NFD, NFKC,
NFKD) mentioned in the docs?
(See: http://en.wikipedia.org/wiki/Unicode_equivalence)

Rough test: Is U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS) handled
equivalently to the combination of U+0061 (LATIN SMALL LETTER A) and
U+0308 (COMBINING DIAERESIS)?

If yes: Are there functions to convert a string from one normalization
form to another? What about non-reversible conversions? Are they
mentioned, signalled, handled?

What about accented letters with no precomposed equivalent, or with
several "stacked" accents? Are these handled as _one_ letter/character
or not?

Are there functions to "strip" accents? (e.g. U+00E4 -> 'a')
If yes: How do they handle ligatures (e.g. U+00E6)? Are these also
stripped, decomposed (U+00E6 becomes 'ae') or left untouched?
Do they strip _all_ accents or only some/the last one?

Does a substring search find spelling variants? (e.g. 'fi' encoded as
the ligature U+FB01)


2.1. Case conversion
- - - - - - - - - - - -

Are functions like toUpper() and toLower() locale-dependent?
If not: How do they know whether an 'i' (U+0069) must be uppercased to
'I' (U+0049) or to U+0130 ("LATIN CAPITAL LETTER I WITH DOT ABOVE"), as
is necessary in Turkish texts?
(See: http://en.wikipedia.org/wiki/Turkish_dotted_and_dotless_I)

What about U+017F (LATIN SMALL LETTER LONG S) and U+00DF (LATIN SMALL
LETTER SHARP S)? The latter has (since Unicode 5.1) a capital
counterpart (U+1E9E), which the official German spelling rules have
permitted only since 2017. Traditionally a "sharp s" is transformed to
'SS' in uppercase, and that is still Unicode's default uppercase
mapping.


3. Scripts with different spelling conventions
------------------------------------------------

Does the software have a generalization for "letters", "white spaces"
and "word breaks", which is needed for example in many string matching
algorithms? (Btw, a whitespace-based concept of "word breaks" is of
little use for East Asian languages such as Chinese or Japanese, which
do not separate words with spaces.)

Are right-to-left scripts (such as Arabic, Hebrew) handled properly in
rendering and text processing? (A bracket that looks like a 'left
bracket' is a closing bracket in these scripts!)
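To get a feeling for the bracket remark, one can look at the
bidirectional class and the "mirrored" property that Unicode assigns to
each code point. The following is only a rough sketch in Python 3 using
the standard unicodedata module; the helper names describe() and
strong_directions() are made up for this illustration, and the sketch
does not implement the Unicode bidirectional algorithm itself.

```python
import unicodedata

def describe(ch):
    """Return (bidi class, mirrored flag) for a single code point."""
    return unicodedata.bidirectional(ch), bool(unicodedata.mirrored(ch))

def strong_directions(text):
    """Return the set of strong directions ('L' and/or 'R') found in text.

    Bidi classes 'R' (e.g. Hebrew) and 'AL' (Arabic letters) both count
    as right-to-left here; neutral and weak classes are ignored.
    """
    found = set()
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":
            found.add("L")
        elif bidi in ("R", "AL"):
            found.add("R")
    return found

print(describe("("))        # a neutral character that is mirrored in RTL runs
print(describe("a"))        # a strong left-to-right letter
print(describe("\u05d0"))   # HEBREW LETTER ALEF, strong right-to-left
print(strong_directions("abc \u05d0\u05d1\u05d2"))
```

A library that never consults these two properties is unlikely to
display brackets correctly in right-to-left text.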
Is text with mixed R-to-L and L-to-R scripts handled and rendered
correctly?


------------
[1]: ...unless the library uses an array of pointers, in which each
     pointer refers to one Unicode character (which might consist of
     more than one code point!). But I don't know of any Unicode string
     library that does so...
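Several of the rough tests from sections 1, 2 and 2.1 can be tried out
in a few lines. The following is a minimal sketch in Python 3 that uses
only the standard unicodedata module. The helper names equivalent(),
strip_accents() and clusters() are made up for this illustration.
clusters() only approximates the idea of footnote [1] by gluing
combining marks to their base code point; that is an assumption made
for brevity and not a full grapheme segmentation.

```python
import unicodedata

precomposed = "\u00e4"   # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # LATIN SMALL LETTER A + COMBINING DIAERESIS

def equivalent(a, b):
    """Compare two strings after bringing both into NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def strip_accents(text):
    """Decompose to NFD, then drop every combining mark.

    Ligatures such as U+00E6 have no canonical decomposition, so this
    particular approach leaves them untouched.
    """
    nfd = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in nfd if not unicodedata.combining(ch))

def clusters(text):
    """Split text into a list with one entry per base + combining marks."""
    result = []
    for ch in text:
        if result and unicodedata.combining(ch):
            result[-1] += ch      # attach the mark to the preceding base
        else:
            result.append(ch)     # start a new "character"
    return result

# Section 2: a plain comparison fails, a normalized one succeeds.
print(precomposed == decomposed, equivalent(precomposed, decomposed))

# Section 2: all stacked accents are stripped, the ligature is kept.
print(strip_accents("a\u0308\u0301"), strip_accents("\u00e6"))

# Section 2.1: Python's str.upper() is not locale-dependent.
print("i".upper(), "\u00df".upper(), "\u017f".upper())

# Section 1 and footnote [1]: three code points, but two "characters".
text = "a\u0308b"
print(len(text), len(clusters(text)))
```

Indexing into the list returned by clusters() is O(1) per "character",
at the price of building the list first.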