ICU Converter Explorer Help
One of the advantages of Unicode is its consistent interpretation across many
computer systems (aka platforms). Unfortunately, the interpretation of many
legacy codepages from various platforms is not consistent. For various reasons,
many organizations and computer manufacturers have made small incompatible
changes to some codepage interpretations. This causes portability problems when
legacy codepage data is transferred between platforms.
This Converter Explorer will allow you to "explore" the aliases and
properties of each ICU converter. More details about ICU converters can be
found on our Charset
Repository page, the ICU API reference, and the Conversion section
of the ICU User's Guide. All data from this explorer comes directly from
ICU.
If you are wondering why some alias names or byte sequences are mapped
certain ways, you can always
view the ICU alias table directly. The alias table is not meant to be
easily read by newcomers to ICU, which is the main reason why the Converter
Explorer exists, but it does contain comments that some people may find
helpful. The alias table from CVS may contain information that is more current
than your copy of ICU or what is currently available in Converter Explorer. The
bottom of each Converter Explorer page always describes the version of ICU it
is using.
Viewing the aliases and standards
IANA is the
main source of converter aliases on the Internet. Since IANA does not specify
the Unicode mappings for every codepage and alias, and every platform supports
other aliases besides the IANA aliases, ICU provides a way to target the
codepage conversion based upon the standard or platform. This allows you to use
the right converter name and implementation based upon which standard you are
targeting.
You can change the view of aliases for each standard by selecting the
appropriate standard at the top of the page. This will allow you to see the
subset of aliases that a standard or platform can recognize. For example, if
you select IANA and ALL and select the "View Results" button, you will see all
aliases recognized by IANA and ICU. You will notice that the IANA set of
aliases is a subset of all ICU aliases.
The column marked as "Internal Converter Name" is also known as a
canonical name. The canonical name is a unique ICU converter name, and it is
usually based upon the UTR
#22 naming scheme. Within a particular ICU release, the canonical name is
guaranteed to identify the correct converter, but mapping tables are sometimes
updated between ICU releases, so the converter behind a name may change at
that time. API functions like ucnv_getCanonicalName()
and ucnv_getName() will return this value. The
ucnv_getStandardName() function requires this name as an
argument.
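To make the alias-to-canonical-name idea concrete, here is a minimal sketch. It is not ICU code: it uses Python's built-in codec registry as a stand-in, and that registry keeps its own alias table and its own canonical names, so the strings it returns will differ from ICU's. The helper name `canonical_name` is hypothetical. In ICU itself, the equivalent lookup is the ucnv_getCanonicalName() call mentioned above.

```python
import codecs


def canonical_name(alias):
    """Return the registry's canonical codec name for an alias, or None.

    Stand-in for the kind of lookup ucnv_getCanonicalName() performs in ICU.
    Python's names and alias table are its own, not ICU's.
    """
    try:
        return codecs.lookup(alias).name
    except LookupError:
        # Unknown alias: no codec is registered under this name.
        return None


if __name__ == "__main__":
    # Several spellings of one encoding resolve to a single canonical name.
    for alias in ("latin1", "ISO-8859-1", "L1", "no-such-codec"):
        print(alias, "->", canonical_name(alias))
```

The point of the sketch is only that many spellings collapse to one unique name, and that code should compare canonical names rather than the aliases a user typed.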
The "All Aliases" column is not a real standard. It is just a special
way to see all of the aliases for a specific converter regardless of which
standards support the converter's alias names.
The "Untagged Aliases" column is also not a real standard. It is a
special way to see all of the aliases that are not associated with any
particular standard. An alias appears in this column either because it names an
alternate mapping table that shares its name with an alias under a different
standard, or because it is a rarely used alias whose use is discouraged.
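The "All Aliases" view is essentially an alias table inverted so that each converter lists every name that resolves to it. The sketch below shows that inversion. It is an illustration only: it reads the alias dictionary from Python's standard `encodings.aliases` module rather than the ICU alias table, and the function name `aliases_by_codec` is hypothetical.

```python
from collections import defaultdict
from encodings.aliases import aliases as ALIAS_TABLE


def aliases_by_codec(table=ALIAS_TABLE):
    """Invert an alias -> codec mapping into codec -> sorted list of aliases."""
    grouped = defaultdict(list)
    for alias, codec in table.items():
        grouped[codec].append(alias)
    return {codec: sorted(names) for codec, names in grouped.items()}


if __name__ == "__main__":
    # All aliases Python registers for its Latin-1 codec module.
    print(aliases_by_codec()["latin_1"])
```

Because the stand-in table is a plain dictionary with no notion of standards, each alias maps to exactly one codec. It therefore cannot reproduce the per-standard columns or the "Untagged Aliases" column, which depend on the standard tags in the ICU alias table.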
Viewing the converter details
Once you have selected a converter to view, you can get all of the details
about that converter. Here is a list of things that you will find on that
page:
- Type of converter
- The internal converter implementation used. The
ucnv_getType()
API will return this value. See API reference for details.
- Minimum number of bytes
- The minimum number of bytes required by this encoding.
- Maximum number of bytes
- The maximum number of bytes required by this encoding.
- Substitution character
- This is the byte sequence used when a converter encounters an unmappable
Unicode character.
- Is ASCII [\x20-\x7E] compatible?
- Is the byte range \x20 to \x7E compatible with ASCII? For example, some
codepages will map the ASCII backslash \x5C to the Yen Symbol \u00A5. Sometimes
special shift bytes are required to display the ASCII range, and ASCII does not
know about codepage shifting or escape sequences. Some codepages are EBCDIC
based.
Only the range [\x20-\x7E] is used for this comparison because some ISO
controls are rotated, and most people are interested in the graphical
interpretation of ASCII. More details on this subject can be found in this
paper.
- Is ASCII [\u0020-\u007E] ambiguous?
- This is the value that
ucnv_isAmbiguous() returns. When this
value is TRUE, it usually implies that this codepage is not ASCII compatible
and that an ASCII-compatible alternative codepage is available.
- Contains ambiguous aliases?
- This means that at least one of the aliases for this converter is also used
by a different converter.
- Converters with conflicting aliases
- If any converters have conflicting aliases, this entry lists those converters
along with each conflicting alias and its standard. Care should be
taken when using any of the aliases on this list when a standard is not
specified on the ICU conversion API. This information can usually be queried
from
ucnv_getCanonicalName() or
ucnv_getStandardName().
- Always generates Unicode NFC?
- When this value is
TRUE, then a conversion from this codepage
to Unicode will always generate Unicode in Normalization Form C (NFC).
When this value is UNKNOWN, this converter may generate Unicode text that is
not in NFC, depending on the input, and applying an NFC transformation may
change the original text. This
value is derived by creating a Unicode Set with the value
"[[:NFC_Quick_Check=yes:]&[:ccc=0:]]", and confirming that it is a full
superset of the codepage's Unicode Set. More details about Unicode
Normalization can be found in Unicode Standard Annex #15.
- Contains BiDi characters?
- When this value is
TRUE, then a conversion to or from this
codepage can contain bidirectional characters. These are right-to-left
characters, such as Hebrew and Arabic characters. When displaying data from this
codepage, you may need to apply the BiDi algorithm described in
Unicode Standard Annex
#9.
- List of Languages Representable By This Codepage
- This is a list of languages that are representable by this codepage. This
data is derived by calling
ucnv_getUnicodeSet() and
ulocdata_getExemplarSet(), and checking that the
UnicodeSet returned for each language is a complete subset of the codepage's
UnicodeSet. The
list of languages comes from uloc_getAvailable().
- Set of Unicode Characters Representable By This Codepage
- This is a list of Unicode characters that are representable by this
codepage. For example, if some text data is written as the bytes of this
codepage and converted to Unicode, this is the set of Unicode characters
to which the text could be converted. This set can
contain multi-codepoint characters. This data comes from
ucnv_getUnicodeSet().
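Several of the checks in the list above reduce to simple set operations over a codepage's mapping. The sketch below illustrates three of them for single-byte codepages only: the ASCII [\x20-\x7E] comparison, the BiDi test, and the language-coverage subset test. It is not how Converter Explorer is implemented. It uses Python's built-in codecs and `unicodedata` module in place of ucnv_getUnicodeSet() and ulocdata_getExemplarSet(), all function names are hypothetical, and the exemplar strings are hand-picked samples rather than real locale data.

```python
import unicodedata


def representable_chars(codec):
    """Characters a single-byte codec can decode, probed one byte at a time."""
    chars = set()
    for byte in range(256):
        try:
            text = bytes([byte]).decode(codec)
        except UnicodeDecodeError:
            continue  # byte is unassigned in this codepage
        if len(text) == 1:
            chars.add(text)
    return chars


def is_ascii_compatible(codec):
    """True if every byte in 0x20-0x7E decodes to the same ASCII character."""
    for byte in range(0x20, 0x7F):
        try:
            if bytes([byte]).decode(codec) != chr(byte):
                return False
        except UnicodeDecodeError:
            return False
    return True


def contains_bidi(chars):
    """True if any character has a right-to-left class (R or AL)."""
    return any(unicodedata.bidirectional(c) in ("R", "AL") for c in chars)


def can_represent(codec, exemplar):
    """True if every exemplar character is in the codec's representable set."""
    return set(exemplar) <= representable_chars(codec)


if __name__ == "__main__":
    for codec in ("iso8859-1", "iso8859-8", "cp037"):
        chars = representable_chars(codec)
        print(codec,
              "ascii-compatible:", is_ascii_compatible(codec),
              "bidi:", contains_bidi(chars),
              "size:", len(chars))
```

An EBCDIC codepage such as `cp037` fails the ASCII comparison because its graphic characters sit at different byte values, which is the situation the "Some codepages are EBCDIC based" remark describes. Multi-byte and stateful encodings need a real converter API to enumerate their sets, so this byte-by-byte probe does not apply to them.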