ICU4J Technical FAQ
This page contains frequently asked questions about the content provided
with the International Components for Unicode for Java as well as basics on
internationalization. It is organized into the following sections:
Common Java Questions
How do I read UTF-8 data in Java?
What version of the JDK is required for
ICU4J?
Comparison between ICU and JDK: What's the
difference?
NumberFormat
Why do I need these classes?
What's the difference between ICU4J NumberFormat
and the standard NumberFormat?
Are there any known limitations of ICU4J
NumberFormat?
RuleBasedNumberFormat
Why don't you have rules for my native
language?
Why don't negative numbers (or fractional
numbers) show up correctly in my language?
You spelled the word for 20,000 wrong in
my language! Don't you know any better?
Why don't the text areas scroll or allow
me to edit them when I'm displaying Japanese?
The Hebrew words are all spelled
backwards! What's wrong with you people?
How come I see rows of boxes (or question
marks) instead of letters?
Why doesn't this program work right on my
browser?
International Calendars
Why do I need these classes?
Which Japanese calendar do you
support?
Do you really support the true lunar
Islamic calendar?
Why don't you have resource data for my
language?
Why don't you support the Chinese
calendar?
Why don't Hebrew and Arabic text look right in
the demo?
StringSearch
Do I have to know anything about Collators to
use StringSearch?
What algorithm are you using to perform the
search?
RuleBasedBreakIterator
Why did you bother to rewrite BreakIterator? Wasn't the
old version working?
What do you mean, performance improvements? It seems WAY
slower to me!
This still has all the same bugs that the old
BreakIterator did! Why would I want to use this one instead?
Why is there no demo?
What's this DictionaryBasedBreakIterator
thing?
Why do you have a Thai dictionary, but no resource data
that actually lets me use it?
What's this BreakIteratorRules_en_US_TEST
thing?
How can I create my own dictionary file?
Common Java Questions
- 1. How do I read UTF-8 data in
Java?
- Java handles UTF-8 the same way it handles any other encoding: pass "UTF-8"
as the encoding name to one of the APIs listed below. Unfortunately, beyond a
small fixed set (which includes UTF-8), JVMs from different vendors are free to
support whatever encodings they choose. Furthermore, JDK 1.3.1 and earlier
provide no public API to query the set of available encodings.
// Given a String, return a byte array containing the text of the String
// in the named encoding
public byte[] getBytes(String enc) throws UnsupportedEncodingException
// Construct a String given a byte array containing text in the named encoding
public String(byte[] bytes, String enc) throws UnsupportedEncodingException
// Construct an InputStreamReader that reads text from bytes in the named encoding
InputStreamReader(InputStream in, String enc) throws UnsupportedEncodingException
// Construct an OutputStreamWriter that writes text as bytes in the named encoding
OutputStreamWriter(OutputStream out, String enc) throws UnsupportedEncodingException
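As a rough illustration of the InputStreamReader approach listed above, the sketch below decodes a byte stream as UTF-8 and checks that the text survives a round trip. It assumes a much newer JDK than the ones discussed in this FAQ: StandardCharsets was added in Java 7, and on older JDKs you would pass the string "UTF-8" instead. The class and method names are illustrative only.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class Utf8ReadDemo {
    // Decode every byte from the stream as UTF-8 and return the text.
    static String readUtf8(InputStream in) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buffer = new char[1024];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Sample text with Latin-1 and CJK characters, written as escapes
        // so the source file encoding does not matter.
        String original = "caf\u00e9 \u65e5\u672c\u8a9e";
        byte[] encoded = original.getBytes(StandardCharsets.UTF_8);
        String decoded = readUtf8(new ByteArrayInputStream(encoded));
        System.out.println("bytes: " + encoded.length);
        System.out.println("round trip ok: " + decoded.equals(original));
    }
}
```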
[Goto Top]
- 2. What version of the JDK is
required for ICU4J?
-
Parts of ICU4J depend on functionality that is only available in JDK 1.4 or
later, although some components work under earlier JVMs. All components should
be compiled using a Java2 compiler, as even components that run under earlier
JVMs can require language features that are only present in Java2. Currently
1.1.x, 1.2.x and 1.3.x JVMs are unsupported and untested, and you use the
components on these JVMs at your own risk.
[Goto Top]
- 3. Comparison between ICU
and JDK: What's the difference?
- This is one of our most popular questions. Please refer to
our comparison chart.
[Goto Top]
NumberFormat
- 1. Why do I need these
classes?
- The NumberFormat classes in the standard JDK do a good job in most cases,
but they have some limitations. If you require support for BigInteger or
BigDecimal numbers, if you require exact values of parsed strings, or if you
need any of the new features (scientific notation, padding, rounding), then you
need ICU4J NumberFormat.
[Goto Top]
- 2. What's the difference
between ICU4J NumberFormat and the standard NumberFormat?
- The ICU4J NumberFormat package contains modified versions of the standard
JDK classes NumberFormat, DecimalFormat, DecimalFormatSymbols, and internal
classes. The modified classes do everything the standard classes do, and then
some.
- ICU4J's NumberFormat formats and parses BigInteger and BigDecimal, in
addition to double and long values.
- If a long overflows in JDK 1.2, it becomes a double. For instance, if you
format something close to Long.MAX_VALUE with a percent format, it will be cast
to a double, losing some precision. With ICU4J NumberFormat, it will
overflow to a BigInteger, with no loss of information.
- Higher performance formatting of longs.
- During parsing, BigDecimal objects are returned instead of Double objects
to represent non-integral values. Double objects are only returned to represent
values which cannot be stored in a BigDecimal. These are NaN, Infinity,
-Infinity, and -0. All other values are returned as Long, BigInteger, or
BigDecimal values. Clients who previously checked for specific types (using
"instanceof Double" and/or by casting to Double) should change their code and
use the Number.doubleValue() method.
- Support for scientific notation.
- Support for rounding.
- Support for padding to a specified width.
- Various bug fixes incorporated into JDK 1.2, but not present in JDK 1.1.x
yet.
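Two of the points above can be sketched with standard JDK classes alone. The example assumes java.text.NumberFormat as a stand-in so that it runs without ICU4J on the classpath; the advice about reading results through Number.doubleValue() applies the same way to whatever Number subclass a formatter returns. The helper names are hypothetical and are not part of either API.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

public class ParsedNumberDemo {
    // Read a parse result through the Number interface, so the code works
    // whether the formatter returns Long, Double, BigInteger, or BigDecimal.
    static double parseAsDouble(NumberFormat format, String text) {
        try {
            Number parsed = format.parse(text);
            return parsed.doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number: " + text, e);
        }
    }

    // True if converting the long to a double keeps its exact value.
    static boolean fitsExactlyInDouble(long value) {
        BigInteger viaDouble = new BigDecimal((double) value).toBigInteger();
        return viaDouble.equals(BigInteger.valueOf(value));
    }

    // Scale by 100 (as a percent format would) with no overflow or rounding.
    static BigInteger exactPercentScale(long value) {
        return BigInteger.valueOf(value).multiply(BigInteger.valueOf(100));
    }

    public static void main(String[] args) {
        NumberFormat format = NumberFormat.getInstance(Locale.US);
        System.out.println(parseAsDouble(format, "1,234.5"));
        System.out.println(fitsExactlyInDouble(Long.MAX_VALUE));
        System.out.println(exactPercentScale(Long.MAX_VALUE));
    }
}
```

The second helper shows why falling back to double loses information near Long.MAX_VALUE, and the third shows the lossless alternative of widening to BigInteger.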
[Goto Top]
- 3. Are there any known
limitations of ICU4J NumberFormat?
-
- The rounding increment is limited to 18 digits (the number of digits fully
representable in a long).
- Not all padding configurations can be represented in a pattern. In other
words, you can make API calls that set up a format with padding, but a pattern
generated from that format will not fully express the padding parameters. This
is generally true of many attributes of many of the formatting classes.
- Not all rounding configurations can be represented in a pattern.
[Goto Top]
RuleBasedNumberFormat
-
1. Why don't you have rules for my native language?
- Because this code isn't part of an official IBM product yet, it hasn't been
through the whole IBM localization process. The rules you see here were
intended to be a representative sampling we could use in designing the
algorithm, and are not 100% complete or 100% accurate. We intend to have
complete and accurate data for all locales supported by the JDK sometime in
early to mid 1999, if not sooner. If you can provide us information we're
missing, we'd be very grateful.
[Goto Top]
-
2. Why don't negative numbers (or fractional numbers) show up correctly in my
language?
- See answer to question 1 above.
Our initial research failed to turn up information on formatting negative
numbers or numbers with fractional parts for any languages other than English.
If you can supply us with the missing information, we'd certainly appreciate
it.
[Goto Top]
-
3. You spelled the word for 20,000 wrong in my language! Don't you know any
better?
- No, we don't. See the answer to question 1. If you have corrections for us,
please let us know.
[Goto Top]
-
4. Why don't the text areas scroll or allow me to edit them when I'm
displaying Japanese?
- The text-display facilities in JDK 1.1 are not totally complete, and won't
show non-Latin text properly on a U.S. English version of Windows. We hacked
around this to permit display of non-Latin text on all systems, but we
implemented only text display and not scrolling or editing. JDK 1.2 should
overcome these limitations, and we'll issue a new version of this demo that
takes advantage of that sometime after a final or near-final of JDK 1.2 ships.
[Goto Top]
-
5. The Hebrew words are all spelled backwards! What's wrong with you
people?
- See the answer to question 4 above. The workaround for the display problems
isn't sophisticated enough to do bi-directional reordering, and we don't
currently bother to check which systems have native support for this. If you're
running on a Hebrew-localized version of Windows, you can change the demo
program to use the native text-editing capabilities fairly easily. Again, this
problem should be fixed in JDK 1.2.
[Goto Top]
-
6. How come I see rows of boxes (or question marks) instead of
letters?
- In order to see non-Latin text, you need to have an appropriate font
installed, and your font.properties file has to be modified to recognize it.
The simplest way to do this is to download and install the Bitstream Cyberbit
font. You can follow this
link to download a copy of the font and find instructions for modifying
your font.properties file. Stock versions of JDK 1.2 (after the final version
ships) should include the capability to display most non-Latin scripts right
out of the box.
[Goto Top]
-
7. Why doesn't this program work right on my browser?
- We have yet to find a browser that will run this demo correctly. The demo
program was tested on JDK 1.1.4 and later, on both the Symantec and Sun JVMs.
Neither Netscape nor MSIE had been updated to the most recent version of the
JDK as this is being written. You can obtain a current version of the Java
Development Kit directly from Sun. The actual RuleBasedNumberFormat framework
should work with any JDK 1.1 version.
[Goto Top]
International Calendars
- 1. Why do I need
these classes?
- If your application displays or manipulates dates and times, and if you
want your application to run in countries outside of North America and western
Europe, you need to support the traditional calendar systems that are still in
use in some parts of the world. These classes provide that support while
conforming to the standard Java Calendar API, allowing you to code your
application once and have it work with any international calendar.
[Goto Top]
- 2. Which Japanese
calendar do you support?
- Currently, our JapaneseCalendar is almost identical to the Gregorian
calendar, except that it follows the traditional conventions for year and era
names. In modern times, each emperor's reign is treated as an era, and years
are numbered from the start of that era. Historically each emperor's reign
would be divided up into several eras, or gengou. Currently, our era
data extends back to Taika, which began in 645 AD. In all other respects
(month and date, all of the time fields, etc.) the JapaneseCalendar class will
give results that are identical to GregorianCalendar.
Lunar calendars similar to the Chinese calendar have also been used in Japan
during various periods in history, but according to our sources they are not in
common use today. If you see a real need for a Japanese lunar calendar, and
especially if you know of any good references on how it differs from the
Chinese calendar, please let us know by posting a note on the mailing list.
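The era-based year numbering described above can be illustrated with the JDK's own java.time.chrono.JapaneseDate, used here only as a stand-in so the sketch runs without ICU4J. It is not the JapaneseCalendar class from this package, and it covers only eras from Meiji onward rather than the full historical era table. The idea is the same: the era and year differ from the Gregorian calendar, while the month and day do not.

```java
import java.time.LocalDate;
import java.time.chrono.JapaneseDate;
import java.time.chrono.JapaneseEra;
import java.time.temporal.ChronoField;

public class JapaneseEraDemo {
    // Convert a Gregorian (ISO) date to the Japanese era-based calendar.
    static JapaneseDate toJapanese(LocalDate gregorian) {
        return JapaneseDate.from(gregorian);
    }

    public static void main(String[] args) {
        LocalDate gregorian = LocalDate.of(2000, 1, 15);
        JapaneseDate japanese = toJapanese(gregorian);

        // Era and year-of-era follow the emperor's reign.
        System.out.println("era is Heisei: "
                + japanese.getEra().equals(JapaneseEra.HEISEI));
        System.out.println("year of era: "
                + japanese.get(ChronoField.YEAR_OF_ERA));

        // Month and day match the Gregorian date exactly.
        System.out.println("month: " + japanese.get(ChronoField.MONTH_OF_YEAR));
        System.out.println("day: " + japanese.get(ChronoField.DAY_OF_MONTH));
    }
}
```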
[Goto Top]
- 3. Do you
really support the true lunar Islamic calendar?
- The Islamic calendar is strictly lunar, and a month begins at the moment
when the crescent of the new moon is visible above the horizon at sunset. It is
impossible to calculate this calendar in advance with 100% accuracy, since moon
sightings are dependent on the location of the observer, the weather, the
observer's eyesight, and so on. However, there are fairly commonly-accepted
criteria (the angle between the sun and the moon, the moon's angle above the
horizon, the position of the moon's bright limb, etc.) that let you predict the
start of any given month with a very high degree of accuracy, except of course
for the weather factor. We currently use a fairly crude approximation that is
still relatively accurate, corresponding with the official Saudi calendar for
all but one month in the last 40-odd years. This will be improved in future
versions of the class.
What all this boils down to is that the IslamicCalendar class does a fairly
good job of predicting the Islamic calendar, and it is good enough for most
computational purposes. However, for religious purposes you should, of course,
consult the appropriate mosque or other authority.
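For readers who want to experiment with a predicted Islamic calendar, here is a small sketch that assumes the JDK's built-in java.time.chrono.HijrahDate rather than the IslamicCalendar class discussed above. By default the JDK chronology follows the Saudi Umm al-Qura calendar, so it is also a prediction and carries the same caveat for religious use. The helper names are illustrative.

```java
import java.time.LocalDate;
import java.time.chrono.HijrahDate;
import java.time.temporal.ChronoField;

public class HijrahDemo {
    // Convert a Gregorian (ISO) date to the JDK's Hijrah calendar.
    static HijrahDate toHijrah(LocalDate gregorian) {
        return HijrahDate.from(gregorian);
    }

    // Convert back; both calendars name the same underlying day.
    static LocalDate toGregorian(HijrahDate hijrah) {
        return LocalDate.from(hijrah);
    }

    public static void main(String[] args) {
        LocalDate gregorian = LocalDate.of(2020, 1, 1);
        HijrahDate hijrah = toHijrah(gregorian);

        System.out.println("Hijrah year: " + hijrah.get(ChronoField.YEAR));
        // Lunar months are 29 or 30 days long.
        System.out.println("days in this month: " + hijrah.lengthOfMonth());
        System.out.println("round trip ok: "
                + toGregorian(hijrah).equals(gregorian));
    }
}
```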
[Goto Top]
- 4. Why don't you
have resource data for my language?
- We've included all of the string resource data that we've managed to
collect so far. The languages we currently provide are Arabic, Dutch, English,
Finnish, French, Hebrew, Hungarian, Japanese, and Thai, though not all
calendars are supported in each language. If you would like to contribute
string translations for other languages, please post a note on the mailing list.
[Goto Top]
- 5. Why don't you
support the Chinese calendar?
- The short answer is, "Because it's extremely complicated." The Chinese lunar
calendar obeys complex rules based upon calculations of solstices, new moons,
and solar longitude, and is much more difficult to implement than the Islamic
and Hebrew calendars. However, we're starting to work on a ChineseCalendar
class and will add it to the International Calendars package when it is ready.
[Goto Top]
- 6. Why don't Hebrew
and Arabic text look right in the demo?
- The short answer is that you're probably running the demo under JDK 1.1
rather than JDK 1.2. Most browsers currently (as of early 1999) do not yet
support JDK 1.2. To see the demo in its full glory, install JDK 1.2 on your
machine if you have not yet done so, download this package, and run the demo
under 1.2. If you want the whole story, read on....
Arabic and Hebrew are both right to left languages. Each line of text
starts at the right side of the page and flows to the left. When right to left
text is mixed with a left to right language such as English (e.g. an Arabic
month name followed by the decimal year number), you get bi-directional
(or bidi) text, which has complicated rules for reordering characters
based on their context. JDK 1.2 and later have good support for bi-directional
text, thanks in part to work done by IBM. However, JDK 1.1.x does not support
bidi text itself and simply passes text strings to the underlying OS. If you
are running on an OS that supports bidi, such as the Arabic version of Windows,
the text may look OK, but on other platforms it will probably be displayed
backwards.
In addition, Arabic text makes heavy use of ligatures. Most letters can be
displayed in several different forms depending on their context, and most
letters are joined to the adjoining letters with ligatures. JDK 1.2 supports
this directly, while support in earlier versions is spotty and depends on the
underlying platform OS.
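To check whether a given string actually needs bidi reordering, one option is the java.text.Bidi class from JDK 1.4 and later. This is a minimal sketch and is not part of the demo; the classify helper and its return strings are made up for illustration. It reports whether a string is purely left to right, purely right to left, or mixed.

```java
import java.text.Bidi;

public class BidiDemo {
    // Classify the directionality of a string. The base direction is taken
    // from the first strong character, defaulting to left-to-right.
    static String classify(String text) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        if (bidi.isMixed()) {
            return "mixed";
        }
        return bidi.isRightToLeft() ? "right-to-left" : "left-to-right";
    }

    public static void main(String[] args) {
        // Hebrew letters are written as escapes so the source file
        // encoding does not matter.
        System.out.println(classify("January 1999"));
        System.out.println(classify("\u05EA\u05E9\u05E8\u05D9"));
        System.out.println(classify("Tishri \u05EA\u05E9\u05E8\u05D9"));
    }
}
```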
[Goto Top]
StringSearch
- 1. Do I have to know
anything about Collators to use StringSearch?
- Since StringSearch uses a RuleBasedCollator to handle the
language-sensitive aspects of searching, understanding how collation works
certainly helps. But the only parts of the Collator API that you really need to
know about are the collation strength values, PRIMARY, SECONDARY, and TERTIARY,
that determine whether case and accents are ignored during a search.
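The effect of the three strength values can be seen with the standard java.text.Collator, which is assumed here so the sketch runs without ICU4J; it does not use StringSearch itself. The matches helper is hypothetical. At PRIMARY strength both case and accents are ignored, at SECONDARY only case is ignored, and at TERTIARY both are significant.

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorStrengthDemo {
    // True if the two strings compare as equal at the given strength.
    static boolean matches(String a, String b, int strength) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        // Decompose accented letters so accents are compared consistently.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        collator.setStrength(strength);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String plain = "resume";
        String upper = "RESUME";
        String accented = "r\u00e9sum\u00e9";

        System.out.println("PRIMARY,   case:    " + matches(plain, upper, Collator.PRIMARY));
        System.out.println("PRIMARY,   accents: " + matches(plain, accented, Collator.PRIMARY));
        System.out.println("SECONDARY, case:    " + matches(plain, upper, Collator.SECONDARY));
        System.out.println("SECONDARY, accents: " + matches(plain, accented, Collator.SECONDARY));
        System.out.println("TERTIARY,  case:    " + matches(plain, upper, Collator.TERTIARY));
    }
}
```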
[Goto Top]
- 2. What algorithm are
you using to perform the search?
- StringSearch uses a version of the Boyer-Moore search algorithm that has
been modified for use with Unicode. Rather than using raw Unicode character
values in its comparisons and shift tables, the algorithm uses collation
elements that have been "hashed" down to a smaller range to make the tables a
reasonable size. An article explaining this algorithm in a fair amount of
detail is scheduled for publication in the February 1999 issue of Java Report.
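The idea of a hashed shift table can be sketched as follows. This is not the StringSearch implementation: it uses the simpler Horspool variant of Boyer-Moore, compares raw UTF-16 char values instead of collation elements, and the table size of 256 and the modulo hash are assumptions chosen for illustration. Hash collisions can only shorten a shift, never lengthen it, so the search stays correct with a small table.

```java
import java.util.Arrays;

public class HashedHorspoolSearch {
    // Assumed table size; far smaller than the 65,536 possible char values.
    private static final int TABLE_SIZE = 256;

    // Hash a char down to a table slot.
    private static int hash(char c) {
        return c % TABLE_SIZE;
    }

    // Return the index of the first occurrence of pattern in text, or -1.
    static int indexOf(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        if (m == 0) {
            return 0;
        }

        // Default shift is the whole pattern length. For each pattern char
        // except the last, record its distance from the end of the pattern.
        // Later (smaller) distances overwrite earlier ones, so colliding
        // chars share the smallest, and therefore safest, shift.
        int[] shift = new int[TABLE_SIZE];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[hash(pattern.charAt(i))] = m - 1 - i;
        }

        int pos = 0;
        while (pos <= n - m) {
            // Compare right to left, as Boyer-Moore does.
            int j = m - 1;
            while (j >= 0 && text.charAt(pos + j) == pattern.charAt(j)) {
                j--;
            }
            if (j < 0) {
                return pos;
            }
            // Shift according to the text char aligned with the pattern's end.
            pos += shift[hash(text.charAt(pos + m - 1))];
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("the quick brown fox", "brown"));
        System.out.println(indexOf("the quick brown fox", "purple"));
    }
}
```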
[Goto Top]
RuleBasedBreakIterator
- 1. Why did you bother to rewrite
BreakIterator? Wasn't the old version working?
- It was working, but we were too constrained by the design. The break-data
tables were hard-coded, and there was only one set of them. This meant you
couldn't customize BreakIterator's behavior, nor could we accommodate languages
with mutually-exclusive breaking rules (Japanese and Chinese, for example, have
different word-breaking rules.) The hard-coded tables were also very
complicated, difficult to maintain, and easy to mess up, leading to mysterious
bugs. And in the original version, there was no way to subclass BreakIterator
and get any implementation at all-- if you wanted different behavior, you had
to rewrite the whole thing from scratch. We undertook this project to fix all
these problems and give us a better platform for future development. In
addition, we managed to get some significant performance improvements out of
the new version.
[Goto Top]
- 2. What do you mean, performance
improvements? It seems WAY slower to me!
- The one thing that's significantly slower is construction. This is because
it actually builds the tables at runtime by parsing a textual description. In
the old version, the tables were hard-coded, so no initialization was
necessary. If this is causing you trouble, it's likely that you're creating and
destroying BreakIterators too frequently. For example, if you're writing code
to word-wrap a document in a text editor, and you create and destroy a new
BreakIterator for every line you process, performance will be unbelievably
slow. If you move the creation out of the inner loop and create a new
BreakIterator only once per word-wrapping operation, or once per document,
you'll find that your performance improves dramatically. If you still have
problems after doing this, let us know-- there may be bugs we need to fix.
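The reuse pattern described above looks roughly like this. The sketch assumes the standard java.text.BreakIterator, which shares the same basic API; the helper names are illustrative. The iterator is constructed once and then pointed at each new line with setText, instead of being rebuilt inside the loop.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class ReusedBreakIteratorDemo {
    // Count the words in one line using an existing iterator.
    static int countWords(BreakIterator words, String line) {
        words.setText(line);
        int count = 0;
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            // Skip segments that are only spaces or punctuation.
            if (Character.isLetterOrDigit(line.codePointAt(start))) {
                count++;
            }
        }
        return count;
    }

    // Count the words in a whole document, creating the iterator only once.
    static int countWordsInLines(String[] lines) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        int total = 0;
        for (String line : lines) {
            total += countWords(words, line);
        }
        return total;
    }

    public static void main(String[] args) {
        String[] document = { "Hello, world!", "", "It works." };
        System.out.println(countWordsInLines(document));
    }
}
```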
[Goto Top]
- 3. This still has all the same bugs
that the old BreakIterator did! Why would I want to use this one
instead?
- Because now you can fix it. The resource data in this package was designed
to mimic as closely as possible the behavior of the original BreakIterator
class (as of JDK 1.2). We did this deliberately to minimize our variables when
making sure the new iterator still passed all the old tests. We haven't updated
it since to avoid the bookkeeping hassles of keeping track of which version
includes which fixes. We're hoping to get this added to a future version of the
JDK, at which time we'll fix all the outstanding bugs relating to breaking in
the wrong places. In the meantime, you can customize the resource data to
modify things to work the way you want them to.
[Goto Top]
- 4. Why is there no demo?
- We haven't had time to write a good demo for this new functionality yet.
We'll add one later.
[Goto Top]
- 5. What's this
DictionaryBasedBreakIterator thing?
- This is a new feature that isn't in the JDK. DictionaryBasedBreakIterator
is intended for use with languages that don't put spaces between words (such as
Thai), or for languages that do put spaces between words, but often combine
lots of words into long compound words (such as German). Instead of looking
through the text for sequences of characters that signal the end of a word, it
compares the text against a list of known words, using this to determine where
the boundaries should go. The algorithm we use for this is fast, accurate, and
error-tolerant.
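To make the general idea concrete, here is a toy dictionary-based segmenter. It is not the algorithm used by DictionaryBasedBreakIterator, whose details are not described here, and unlike the real class it is not error-tolerant: it is a plain dynamic-programming word split over a tiny hypothetical word list, and it gives up if any part of the text is not in the dictionary.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DictionarySegmenterDemo {
    // Split text into dictionary words, or return an empty list if the
    // text cannot be fully covered by words from the dictionary.
    static List<String> segment(String text, Set<String> dictionary) {
        int n = text.length();
        // wordStart[end] is the start index of a word ending at 'end' on
        // some valid segmentation of text[0..end), or -1 if none exists.
        int[] wordStart = new int[n + 1];
        Arrays.fill(wordStart, -1);
        wordStart[0] = 0; // the empty prefix is trivially segmentable

        for (int end = 1; end <= n; end++) {
            for (int start = end - 1; start >= 0; start--) {
                if (wordStart[start] != -1
                        && dictionary.contains(text.substring(start, end))) {
                    wordStart[end] = start;
                    break;
                }
            }
        }
        if (wordStart[n] == -1) {
            return Collections.emptyList();
        }

        // Walk backwards from the end to recover the words in order.
        Deque<String> words = new ArrayDeque<>();
        for (int end = n; end > 0; end = wordStart[end]) {
            words.addFirst(text.substring(wordStart[end], end));
        }
        return new ArrayList<>(words);
    }

    public static void main(String[] args) {
        // Hypothetical miniature dictionary, for illustration only.
        Set<String> dictionary = new HashSet<>(
                Arrays.asList("the", "quick", "brown", "fox"));
        System.out.println(segment("thequickbrownfox", dictionary));
    }
}
```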
[Goto Top]
- 6. Why do you have a Thai dictionary,
but no resource data that actually lets me use it?
- We're not quite done doing the necessary research. We don't currently have
good test cases we can use to verify it's working correctly with Thai, nor are
we completely confident in our dictionary. If you can help us with this, we'd
like to hear from you!
[Goto Top]
- 7. What's this
BreakIteratorRules_en_US_TEST thing?
- This is a resource file that, in conjunction with the "english.dict"
dictionary, we used to test the dictionary-based break iterator. It allows you
to locate word boundaries in English text that has had the spaces taken out.
(The SimpleBITest program demonstrates this.) The dictionary isn't
industrial-strength, however: we included enough words to make for a reasonable
test, but it's by no means complete or anywhere near it.
[Goto Top]
- 8. How can I create my own dictionary
file?
- Right now, you can't. We didn't include the tool we used to create
dictionary files because it's very rough and extremely slow. There's also a
strong likelihood that the format of the dictionary files will change in the
future. If you really want to create your own dictionary file, contact us, and
we'll see what we can do.
[Goto Top]