
International Components for Unicode

Collation: Sun JDK 1.6 vs ICU4J 3.8

The performance test takes a locale and creates a RuleBasedCollator with default options. Each test uses a large list of names as data; the names vary by language. Each collation operation over the whole list is repeated 1000 times. The percentage values in the final three columns are the most useful: they give the relative difference (JAVA - ICU)/ICU, so a positive value is better for ICU4J and a negative value is better for the compared implementation.
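The procedure described above could be sketched roughly as follows. This is not the actual ICU test harness: the class name `CollationTiming`, the choice to compare adjacent names in the list, and the per-comparison normalization are assumptions made for illustration, and a short in-memory list stands in for the TestNames data files. The sketch uses the JDK's `java.text.Collator` so that it runs without extra libraries.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class CollationTiming {
    // Repeat count taken from the text; everything else here is illustrative.
    static final int REPEATS = 1000;

    // Average nanoseconds per compare() call over adjacent pairs of names.
    static double nanosPerCompare(Collator collator, List<String> names, int repeats) {
        if (names.size() < 2 || repeats < 1) {
            throw new IllegalArgumentException("need at least two names and one repeat");
        }
        long comparisons = 0;
        int sink = 0;
        long start = System.nanoTime();
        for (int r = 0; r < repeats; r++) {
            for (int i = 1; i < names.size(); i++) {
                sink += collator.compare(names.get(i - 1), names.get(i));
                comparisons++;
            }
        }
        long elapsed = System.nanoTime() - start;
        // Consume the accumulated result so the compare calls are not optimized away.
        if (sink == Integer.MIN_VALUE) {
            System.out.println("unexpected sink value");
        }
        return (double) elapsed / comparisons;
    }

    public static void main(String[] args) {
        // Stand-in data; the real tests read a large list of names per language.
        List<String> names = Arrays.asList("Smith", "Jones", "Andersen", "Schmidt", "Dupont");
        Collator collator = Collator.getInstance(Locale.US); // default options
        System.out.printf("strcoll: %.0f ns per comparison%n",
                nanosPerCompare(collator, names, REPEATS));
    }
}
```

A serious measurement would also need JIT warm-up runs and a much larger data set; the numbers printed by this sketch are not comparable to the table.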

Key

Operation  Units        Description
strcoll    nanoseconds  Timing for string collation, an incremental comparison of two strings.
keygen     nanoseconds  Timing for generation of sort keys, which 'precompile' collation information so that subsequent operations can use binary comparison.
keylen     bytes/char   The average length of the generated sort keys, in bytes per character (Unicode/ISO 10646 code point). This is generally the most important figure for sort key performance, because it directly affects the time needed for binary comparison as well as the memory and retrieval overhead of storing sort keys.
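To make the three measured operations concrete, here is a minimal sketch using the JDK's `java.text.Collator` and `java.text.CollationKey`. The class `CollationOperations` and its helper names are hypothetical and simply mirror the rows of the key above; ICU4J's `com.ibm.icu.text.Collator` offers similarly named `compare` and `getCollationKey` methods, but this sketch does not depend on it.

```java
import java.text.Collator;
import java.util.Locale;

class CollationOperations {
    // strcoll: incremental comparison of two strings.
    static int strcoll(Collator collator, String a, String b) {
        return collator.compare(a, b);
    }

    // keygen: build a sort key once so later comparisons can be purely binary.
    static byte[] keygen(Collator collator, String s) {
        return collator.getCollationKey(s).toByteArray();
    }

    // keylen: sort key size in bytes per Unicode code point of the input.
    static double keylen(Collator collator, String s) {
        int codePoints = s.codePointCount(0, s.length());
        return (double) keygen(collator, s).length / codePoints;
    }

    // Binary comparison of two sort keys: unsigned bytes, most significant first.
    static int compareKeys(byte[] x, byte[] y) {
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xff) - (y[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return x.length - y.length;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.US);
        String a = "apple";
        String b = "banana";
        System.out.println("strcoll sign: " + Integer.signum(strcoll(collator, a, b)));
        System.out.println("key compare sign: "
                + Integer.signum(compareKeys(keygen(collator, a), keygen(collator, b))));
        System.out.println("keylen for \"" + a + "\": " + keylen(collator, a));
    }
}
```

The two signs agree because a sort key encodes the same ordering that the incremental comparison computes. Shorter keys make both the binary comparison and key storage cheaper, which is why keylen matters.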

Data

                                      -------- ICU --------   ------ JAVA -------      (JAVA - ICU)/ICU
Locale     Data file                  strcoll keygen  keylen  strcoll keygen  keylen strcoll  keygen  keylen
------------------------------------------------------------------------------------------------------------
en_US      TestNames_Latin.txt      |     176   3940    1.59  |  2602   8003    5.92 | 1378%   103%    272%
da_DK      TestNames_Latin.txt      |     676   5629    1.59  |  7700  19744    5.91 | 1039%   251%    272%
de_DE      TestNames_Latin.txt      |     178   3951    1.59  |  2610   8013    5.92 | 1366%   103%    272%
de__PHONEB TestNames_Latin.txt      |     189   5016    1.59  |  2635   8012    5.92 | 1294%    60%    272%
fr_FR      TestNames_Latin.txt      |     177   4111    1.59  |  2616   8826    5.92 | 1378%   115%    272%
ja_JP      TestNames_Latin.txt      |     176   3981    1.59  |  2918   8274    5.92 | 1558%   108%    272%
ja_JP      TestNames_Japanese_h.txt |     486   2744    3.32  |  1253   4782    6.68 |  158%    74%    101%
ja_JP      TestNames_Japanese_k.txt |     489   2765    3.32  |  5145  15923    6.68 |  952%   476%    101%
ja_JP      TestNames_Asian.txt      |     557   2356    3.49  |  2133   7767    6.81 |  283%   230%     95%
zh_CN      TestNames_Latin.txt      |     207   6555    1.59  |  2602   8003    5.92 | 1157%    22%    272%
zh_CN      TestNames_Chinese.txt    |     581   1697    4.50  |  1233   2969    9.43 |  112%    75%    110%
zh_TW      TestNames_Latin.txt      |     207   6555    1.59  |  2618   8022    5.92 | 1165%    22%    272%
zh_TW      TestNames_Chinese.txt    |     502   1393    4.28  |  1590   2747    7.34 |  217%    97%     71%
zh__PINYIN TestNames_Latin.txt      |     204   6545    1.59  |  2604   8033    5.92 | 1176%    23%    272%
zh__PINYIN TestNames_Chinese.txt    |     580   1717    4.50  |  1236   2959    9.43 |  113%    72%    110%
ko_KR      TestNames_Latin.txt      |     175   3951    1.59  |  2756   8284    5.92 | 1475%   110%    272%
ko_KR      TestNames_Korean.txt     |     837   2132    4.66  |  1535   2785    7.33 |   83%    31%     57%
ru_RU      TestNames_Latin.txt      |     177   3941    1.59  |  2614   7992    5.92 | 1377%   103%    272%
ru_RU      TestNames_Russian.txt    |     590   7324    1.66  |  3170  23588    5.85 |  437%   222%    252%
th         TestNames_Latin.txt      |     176   4655    1.67  |  2638   8072    5.90 | 1399%    73%    253%
th         TestNames_Thai.txt       |    1537   6630    1.67  |  2743   8136    5.76 |   78%    23%    245%
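
The percentage columns follow from the raw values through the formula in the table header, (JAVA - ICU)/ICU, rounded to the nearest whole percent. A small sketch, using the hypothetical helper `RelativeDifference.percent`, reproduces the en_US row:

```java
class RelativeDifference {
    // (JAVA - ICU) / ICU as a whole percentage; positive means ICU is smaller (better).
    static long percent(double jdk, double icu) {
        return Math.round(100.0 * (jdk - icu) / icu);
    }

    public static void main(String[] args) {
        // Values from the en_US / TestNames_Latin.txt row of the table.
        System.out.println(percent(2602, 176) + "%");   // strcoll: prints 1378%
        System.out.println(percent(8003, 3940) + "%");  // keygen:  prints 103%
        System.out.println(percent(5.92, 1.59) + "%");  // keylen:  prints 272%
    }
}
```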

Notes

  1. The tests were performed on a Pentium M machine (1.6 GHz, 1 GB RAM) running Windows XP.
  2. We were responsible for the original implementation of collation in Java, and we have learned quite a bit since then.
  3. As with all performance measurements, the results will vary according to the hardware and compiler. The strcoll operation is particularly sensitive; we have found that even slight changes in code alignment can produce 10% differences.
  4. For more information on incremental vs. sort key comparison, the importance of multi-level sorting, and other features of collation, see Unicode Collation (UCA).
  5. For general information on ICU collation, see the User Guide.
  6. For information on the APIs, see the C, C++, or Java API documentation.