ICU Demos

ICU Converter Explorer Help

One of the advantages of Unicode is its consistent interpretation across many computer systems (also known as platforms). Unfortunately, many legacy codepages are not interpreted consistently from one platform to another. For various reasons, many organizations and computer manufacturers have made small, incompatible changes to some codepage interpretations. This causes portability problems when legacy codepage data is transferred between platforms.

This Converter Explorer will allow you to "explore" the aliases and properties of each ICU converter. More details about ICU converters can be found on our Charset Repository page, the ICU API reference, and the Conversion section of the ICU User's Guide. All data from this explorer comes directly from ICU.

If you are wondering why some alias names or byte sequences are mapped certain ways, you can always view the ICU alias table directly. The alias table is not meant to be easily read by newcomers to ICU, which is the main reason why the Converter Explorer exists, but it does contain comments that some people may find helpful. The alias table from CVS may contain information that is more current than your copy of ICU or what is currently available in Converter Explorer. The bottom of each Converter Explorer page always describes the version of ICU it is using.

Viewing the aliases and standards

IANA is the main source of converter aliases on the Internet. Since IANA does not specify the Unicode mappings for every codepage and alias, and every platform supports other aliases besides the IANA aliases, ICU provides a way to target the codepage conversion based upon the standard or platform. This allows you to use the right converter name and implementation based upon which standard you are targeting.

You can change the view of aliases for each standard by selecting the appropriate standard at the top of the page. This lets you see the subset of aliases that a standard or platform recognizes. For example, if you select IANA and ALL and then select the "View Results" button, you will see all aliases recognized by IANA and by ICU. You will notice that the IANA set of aliases is a subset of all ICU aliases.

The column marked "Internal Converter Name" holds what is also known as the canonical name. The canonical name is a unique ICU converter name, and it is usually based upon the UTR #22 naming scheme. Within a particular ICU release, the canonical name always identifies the correct converter. However, mapping tables are sometimes updated between ICU releases, so the converter behind a given name may change from one release to the next. API functions like ucnv_getCanonicalName() and ucnv_getName() return this value. The ucnv_getStandardName() function requires this name as an argument.
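The alias-to-canonical-name lookup described above can be tried out without an ICU build. The following is only an analogy: it uses Python's built-in codec registry, not ICU, so the names it returns follow Python's conventions rather than the UTR #22 scheme. The helper canonical_name() is a hypothetical name chosen for this sketch. In ICU itself the corresponding call is ucnv_getCanonicalName().

```python
import codecs


def canonical_name(alias):
    """Resolve an alias to the registry's canonical codec name.

    Returns None when the alias is not known to Python's codec registry.
    This mirrors the idea of an "Internal Converter Name", not ICU's data.
    """
    try:
        return codecs.lookup(alias).name
    except LookupError:
        return None


if __name__ == "__main__":
    # Different spellings of one alias should resolve to a single name.
    for alias in ("latin1", "L1", "ISO-8859-1", "windows-1252", "no-such-codepage"):
        print(alias, "->", canonical_name(alias))
```

Several spellings of the same alias resolve to one canonical name, and an unknown alias yields None instead of raising an error.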

The "All Aliases" column is not a real standard. It is just a special way to see all of the aliases for a specific converter regardless of which standards support the converter's alias names.
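The idea behind an "All Aliases" view, grouping every known alias under the converter it belongs to, can be sketched by inverting an alias table. This sketch assumes Python's own alias table (encodings.aliases.aliases) as a stand-in for the ICU alias table. The function name aliases_by_codec() is hypothetical, and the contents will not match what Converter Explorer shows.

```python
from collections import defaultdict
from encodings.aliases import aliases


def aliases_by_codec():
    """Group Python's alias table by target codec.

    The source table maps alias -> codec module name; this inverts it to
    codec module name -> set of aliases, similar in spirit to listing all
    aliases of one converter regardless of standard.
    """
    grouped = defaultdict(set)
    for alias, codec in aliases.items():
        grouped[codec].add(alias)
    return grouped


if __name__ == "__main__":
    table = aliases_by_codec()
    print(sorted(table["latin_1"]))
```

Unlike the ICU alias table, this table carries no per-standard tags, so it cannot reproduce the IANA or platform-specific columns.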

The "Untagged Aliases" column is also not a real standard. It is a special way to see all of the aliases that are not associated with any particular standard. An alias appears in this column either because it names an alternate mapping table that has the same name under a different standard, or because it is a rarely used alias whose use is discouraged.

Viewing the converter details

Once you have selected a converter to view, you can get all of the details about that converter. Here is a list of things that you will find on that page:

Type of converter: The internal converter implementation used. The ucnv_getType() API will return this value. See API reference for details.
Minimum number of bytes: The minimum number of bytes required by this encoding.
Maximum number of bytes: The maximum number of bytes required by this encoding.
Substitution character: This is the byte sequence used when a converter encounters an unmappable Unicode character.
Is ASCII [\x20-\x7E] compatible?: Is the byte range \x20 to \x7E compatible with ASCII? For example, some codepages map the ASCII backslash byte \x5C to the Yen symbol \u00A5. Sometimes special shift bytes are required to display the ASCII range, and ASCII itself has no notion of codepage shifting or escape sequences. Some codepages are EBCDIC based.

Only the range [\x20-\x7E] is used for this comparison because some ISO controls are rotated, and most people are interested in the graphical interpretation of ASCII. More details on this subject can be found in this paper.
Is ASCII [\u0020-\u007E] ambiguous?: This is the value that ucnv_isAmbiguous() returns. When this value is TRUE, it usually implies that this codepage is not ASCII compatible and that an ASCII compatible variant of the codepage is available.
Contains ambiguous aliases?: This means that at least one of the aliases for this converter is also on a different converter.
Converters with conflicting aliases: If there are any converters with conflicting aliases, this will have the list of converters with their conflicting alias and standard. Care should be taken when using any of the aliases on this list when a standard is not specified on the ICU conversion API. This information can usually be queried from ucnv_getCanonicalName() or ucnv_getStandardName().
Always generates Unicode NFC?: When this value is TRUE, then a conversion from this codepage to Unicode will always generate Unicode in Normalization Form Composed (NFC). When this value is UNKNOWN, then there may be a possibility that this converter will generate Unicode text that is not in NFC depending on the input, and applying an NFC transformation may change the original text. This value is derived by creating a Unicode Set with the value "[[:NFC_Quick_Check=yes:]&[:ccc=0:]]", and confirming that it is a full superset of the codepage's Unicode Set. More details about Unicode Normalization can be found in Unicode Standard Annex #15.
Contains BiDi characters?: When this value is TRUE, then a conversion to or from this codepage can contain bidirectional characters. These are right-to-left characters, like Hebrew and Arabic characters. When displaying data from this codepage, you may need to apply the BiDi algorithm described in Unicode Standard Annex #9.
List of Languages Representable By This Codepage: This is a list of languages that are representable by this codepage. The list is built by taking the codepage's set from ucnv_getUnicodeSet() and each language's exemplar set from ulocdata_getExemplarSet(), and then checking that the UnicodeSet returned for the language is a complete subset of the codepage's set. The list of languages comes from uloc_getAvailable().
Set of Unicode Characters Representable By This Codepage: This is a list of Unicode characters that are representable by this codepage. For example, if some text data is written as bytes in this codepage and then converted to Unicode, this is the set of Unicode characters that could appear in the result. This set can contain multi-codepoint characters. This data comes from ucnv_getUnicodeSet().
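Several of the entries above (ASCII compatibility, BiDi content, language coverage, and the representable set) come down to building a set of characters and comparing it with another set. The sketch below illustrates those comparisons under stated assumptions: it uses Python's built-in codecs instead of ICU, it only handles single-byte codepages, and every function name is hypothetical. It is not how Converter Explorer is implemented, and its results may differ from ICU's mapping tables.

```python
import unicodedata


def representable_set(encoding):
    """Characters that the byte values 0x00..0xFF decode to (single-byte codepages only)."""
    chars = set()
    for byte in range(256):
        try:
            chars.add(bytes([byte]).decode(encoding))
        except UnicodeDecodeError:
            pass  # byte is unassigned, or is not a complete character on its own
    return chars


def is_ascii_compatible(encoding):
    """True when every byte in 0x20..0x7E decodes to the same ASCII character."""
    for byte in range(0x20, 0x7F):
        try:
            if bytes([byte]).decode(encoding) != chr(byte):
                return False
        except UnicodeDecodeError:
            return False
    return True


def contains_bidi(encoding):
    """True when the codepage can produce a right-to-left (R or AL) character."""
    return any(
        unicodedata.bidirectional(ch) in ("R", "AL")
        for ch in representable_set(encoding)
    )


def covers(encoding, exemplar_chars):
    """True when every exemplar character is a member of the codepage's set."""
    return set(exemplar_chars) <= representable_set(encoding)


if __name__ == "__main__":
    for name in ("cp1252", "cp037", "cp1255", "koi8-r"):
        print(
            name,
            "ascii-compatible:", is_ascii_compatible(name),
            "bidi:", contains_bidi(name),
            "covers Cyrillic sample:", covers(name, "\u0416\u042f"),
        )
```

The exemplar string passed to covers() is a hand-picked sample for illustration. ICU derives the real exemplar sets from locale data through ulocdata_getExemplarSet().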