Regular Expressions

Overview

ICU's Regular Expressions package provides applications with the ability to apply regular expression matching to Unicode string data. The regular expression patterns and behavior are based on Perl's regular expressions. The C++ programming API for using ICU regular expressions is loosely based on the JDK 1.4 package java.util.regex, with some extensions to adapt it for use in a C++ environment. A plain C API is also provided.

The ICU Regular Expression API supports operations including testing for a pattern match, searching for a pattern match, and replacing matched text. Capture groups allow subranges within an overall match to be identified, and to appear within replacement text.

A Perl-inspired split() function that breaks a string into fields based on a delimiter pattern is also included.

A detailed description of regular expression patterns and pattern matching behavior is not included in this user guide. The best reference for this topic is the book "Mastering Regular Expressions, Second Edition" by Jeffrey E. F. Friedl, O'Reilly & Associates; 2nd edition (July 15, 2002). Matching behavior can sometimes be surprising, and this book is highly recommended for anyone doing significant work with regular expressions.

Using ICU Regular Expressions

The ICU C++ Regular Expression API includes two classes, RegexPattern and RegexMatcher, that parallel the classes from the Java JDK package java.util.regex. A RegexPattern represents a compiled regular expression, while a RegexMatcher associates a RegexPattern with an input string to be matched and provides API for the various find, match and replace operations. In most cases, however, only the class RegexMatcher is needed, and the existence of class RegexPattern can safely be ignored.

The first step in using a regular expression is typically the creation of a RegexMatcher object from the source (string) form of the regular expression.

RegexMatcher holds a pre-processed (compiled) pattern and a reference to an input string to be matched, and provides API for the various find, match and replace operations. RegexMatchers can be reset and reused with new input, thus avoiding object creation overhead when performing the same matching operation repeatedly on different strings.

The following code will create a RegexMatcher from a string containing a regular expression, and then perform a simple find() operation.

#include <unicode/regex.h>

UErrorCode        status    = U_ZERO_ERROR;

  ...

RegexMatcher *matcher = new RegexMatcher("abc+", 0, status);
if (U_FAILURE(status)) {
    // Handle any syntax errors in the regular expression here
    ...
}


UnicodeString stringToTest = "Find the abc in this string";
matcher->reset(stringToTest);

if (matcher->find()) {
   // We found a match.
   int32_t startOfMatch = matcher->start(status);   // string index of start of match.
   ...
}

Several types of matching tests are available:

Function      Description
matches()     True if the pattern matches the entire string, from the start through to the last character.
lookingAt()   True if the pattern matches at the start of the string. The match need not include the entire string.
find()        True if the pattern matches somewhere within the string. Successive calls to find() will find additional matches, until the string is exhausted.

If additional text is to be checked for a match with the same pattern, there is no need to create a new matcher object; just reuse the existing one.

matcher->reset(anotherString);
if (matcher->matches(status)) {
   // We have a match with the new string.
}

Note that matching happens directly in the string supplied by the application. This reduces the overhead when resetting a matcher to an absolute minimum – the matcher need only store a reference to the new string – but it does mean that the application must be careful not to modify or delete the string while the matcher is holding a reference to the string.

After finding a match, additional information is available about the range of the input matched, and the contents of any capture groups. Note that, for simplicity, any error parameters have been omitted. See the API reference for a complete description of the API.

Function   Description
start()    Return the index of the start of the matched region in the input string.
end()      Return the index of the first character following the match.
group()    Return a UnicodeString containing the text that was matched.
start(n)   Return the index of the start of the text matched by the nth capture group.
end(n)     Return the index of the first character following the text matched by the nth capture group.
group(n)   Return a UnicodeString containing the text that was matched by the nth capture group.

Regular Expression Metacharacters

Character                   Description
\a                          Match a BELL, \u0007.
\A                          Match at the beginning of the input. Differs from ^ in that \A will not match after a new line within the input.
\b, outside of a [Set]      Match if the current position is a word boundary. Boundaries occur at the transitions between word (\w) and non-word (\W) characters, with combining marks ignored. For better word boundaries, see ICU Boundary Analysis.
\b, within a [Set]          Match a BACKSPACE, \u0008.
\B                          Match if the current position is not a word boundary.
\cX                         Match a control-X character.
\d                          Match any character with the Unicode General Category of Nd (Number, Decimal Digit).
\D                          Match any character that is not a decimal digit.
\e                          Match an ESCAPE, \u001B.
\E                          Terminates a \Q ... \E quoted sequence.
\f                          Match a FORM FEED, \u000C.
\G                          Match if the current position is at the end of the previous match.
\n                          Match a LINE FEED, \u000A.
\N{UNICODE CHARACTER NAME}  Match the named character.
\p{UNICODE PROPERTY NAME}   Match any character with the specified Unicode Property.
\P{UNICODE PROPERTY NAME}   Match any character not having the specified Unicode Property.
\Q                          Quotes all following characters until \E.
\r                          Match a CARRIAGE RETURN, \u000D.
\s                          Match a white space character. White space is defined as [\t\n\f\r\p{Z}].
\S                          Match a non-white space character.
\t                          Match a HORIZONTAL TABULATION, \u0009.
\uhhhh                      Match the character with the hex value hhhh.
\Uhhhhhhhh                  Match the character with the hex value hhhhhhhh. Exactly eight hex digits must be provided, even though the largest Unicode code point is \U0010ffff.
\w                          Match a word character. Word characters are [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}].
\W                          Match a non-word character.
\x{hhhh}                    Match the character with hex value hhhh. From one to six hex digits may be supplied.
\xhh                        Match the character with two digit hex value hh.
\X                          Match a Grapheme Cluster.
\Z                          Match if the current position is at the end of input, but before the final line terminator, if one exists.
\z                          Match if the current position is at the end of input.
\n (n a number)             Back Reference. Match whatever the nth capturing group matched. n must be a number >= 1 and <= the total number of capture groups in the pattern. Note: Octal escapes, such as \012, are not supported in ICU regular expressions.
[pattern]                   Match any one character from the set. See UnicodeSet for a full description of what may appear in the pattern.
.                           Match any character.
^                           Match at the beginning of a line.
$                           Match at the end of a line.
\                           Quotes the following character. Characters that must be quoted to be treated as literals are * ? + [ ( ) { } ^ $ | \ . /

Regular Expression Operators

Operator              Description
|                     Alternation. A|B matches either A or B.
*                     Match 0 or more times. Match as many times as possible.
+                     Match 1 or more times. Match as many times as possible.
?                     Match zero or one times. Prefer one.
{n}                   Match exactly n times.
{n,}                  Match at least n times. Match as many times as possible.
{n,m}                 Match between n and m times. Match as many times as possible, but not more than m.
*?                    Match 0 or more times. Match as few times as possible.
+?                    Match 1 or more times. Match as few times as possible.
??                    Match zero or one times. Prefer zero.
{n}?                  Match exactly n times.
{n,}?                 Match at least n times, but no more than required for an overall pattern match.
{n,m}?                Match between n and m times. Match as few times as possible, but not less than n.
*+                    Match 0 or more times. Match as many times as possible when first encountered, do not retry with fewer even if overall match fails (Possessive Match).
++                    Match 1 or more times. Possessive match.
?+                    Match zero or one times. Possessive match.
{n}+                  Match exactly n times.
{n,}+                 Match at least n times. Possessive match.
{n,m}+                Match between n and m times. Possessive match.
( ... )               Capturing parentheses. Range of input that matched the parenthesized subexpression is available after the match.
(?: ... )             Non-capturing parentheses. Groups the included pattern, but does not provide capturing of matching text. Somewhat more efficient than capturing parentheses.
(?> ... )             Atomic-match parentheses. First match of the parenthesized subexpression is the only one tried; if it does not lead to an overall pattern match, back up the search for a match to a position before the "(?>".
(?# ... )             Free-format comment (?# comment ).
(?= ... )             Look-ahead assertion. True if the parenthesized pattern matches at the current input position, but does not advance the input position.
(?! ... )             Negative look-ahead assertion. True if the parenthesized pattern does not match at the current input position. Does not advance the input position.
(?<= ... )            Look-behind assertion. True if the parenthesized pattern matches text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators).
(?<! ... )            Negative look-behind assertion. True if the parenthesized pattern does not match text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators).
(?ismwx-ismwx: ... )  Flag settings. Evaluate the parenthesized expression with the specified flags enabled or -disabled.
(?ismwx-ismwx)        Flag settings. Change the flag settings. Changes apply to the portion of the pattern following the setting. For example, (?i) changes to a case insensitive match.

Replacement Text

The replacement text for find-and-replace operations may contain references to capture-group text from the find. References are of the form $n, where n is the number of the capture group.

Character   Description
$n          The text of capture group n will be substituted for $n. n must be >= 0 and not greater than the number of capture groups. A $ not followed by a digit has no special meaning, and will appear in the substitution text as itself, a $.
\           Treat the following character as a literal, suppressing any special meaning. Backslash escaping in substitution text is only required for '$' and '\', but may be used on any other character without bad effects.

Flag Options

The following flags control various aspects of regular expression matching. The flag values may be specified at the time that an expression is compiled into a RegexPattern object, or they may be specified within the pattern itself using the (?ismwx-ismwx) pattern options.

Note: The UREGEX_CANON_EQ option is not yet available.

Flag (pattern)   Flag (API Constant)       Description
(none)           UREGEX_CANON_EQ           If set, matching will take the canonical equivalence of characters into account. NOTE: this flag is not yet implemented.
i                UREGEX_CASE_INSENSITIVE   If set, matching will take place in a case-insensitive manner.
x                UREGEX_COMMENTS           If set, allow use of white space and #comments within patterns.
s                UREGEX_DOTALL             If set, a "." in a pattern will match a line terminator in the input text. By default, it will not. Note that a carriage-return / line-feed pair in text behaves as a single line terminator, and will match a single "." in a RE pattern.
m                UREGEX_MULTILINE          Control the behavior of "^" and "$" in a pattern. By default these will only match at the start and end, respectively, of the input text. If this flag is set, "^" and "$" will also match at the start and end of each line within the input text.
w                UREGEX_UWORD              Controls the behavior of \b in a pattern. If set, word boundaries are found according to the definitions of word found in Unicode UAX 29, Text Boundaries. By default, word boundaries are identified by means of a simple classification of characters as either “word” or “non-word”, which approximates traditional regular expression behavior. The results obtained with the two options can be quite different in runs of spaces and other non-word characters.

Using split()

ICU's split() function is similar in concept to Perl's – it will split a string into fields, with a regular expression match defining the field delimiters and the text between the delimiters being the field content itself.

Suppose you have a string of words separated by spaces:

    UnicodeString s = "dog cat   giraffe";

This code will extract the individual words from the string.

    UErrorCode status = U_ZERO_ERROR;
    RegexMatcher m("\\s+", 0, status);
    const int maxWords = 10;
    UnicodeString words[maxWords];    
    int numWords = m.split(s, words, maxWords, status);

After the split(),

Variable        Value
numWords        3
words[0]        "dog"
words[1]        "cat"
words[2]        "giraffe"
words[3 to 9]   ""

The field delimiters, the spaces from the original string, do not appear in the output strings.

Note that, in this example, “words” is a local, or stack array of actual UnicodeString objects. No heap allocation is involved in initializing this array of empty strings (C++ is not Java!). Local UnicodeString arrays like this are a very good fit for use with split(); after extracting the fields, any values that need to be kept in some more permanent way can be copied to their ultimate destination.

If the number of fields in a string being split exceeds the capacity of the destination array, the last destination string will contain all of the input string data that could not be split, including any embedded field delimiters. This is similar to split() in Perl.

If the pattern expression contains capturing parentheses, the captured data ($1, $2, etc.) will also be saved in the destination array, interspersed with the fields themselves.

If, in the “dog cat giraffe” example, the pattern had been “(\s+)” instead of “\s+”, split() would have produced five output strings instead of three. The strings words[1] and words[3] would have been the spaces.

Find and Replace

Description of appendReplacement() and appendTail(). To be added.



Copyright (c) 2000 - 2008 IBM and Others - PDF Version - Feedback: http://icu-project.org/contacts.html

User Guide for ICU v4.0 Generated 2008-12-17.