
Text Element Boundary Analysis
Overview of Text Boundary Analysis
Text boundary analysis is the process of locating linguistic boundaries while formatting and handling text. Examples of this process include:
Locating appropriate points to word-wrap text to fit within specific margins while displaying or printing.
Locating the beginning of a word that the user has selected.
Counting characters, words, sentences, or paragraphs.
Determining how far to move the text cursor when the user hits an arrow key (Some characters require more than one position in the text store and some characters in the text store do not display at all).
Making a list of the unique words in a document.
Figuring out if a given range of text contains only whole words.
Capitalizing the first letter of each word.
Locating a particular unit of the text (For example, finding the third word in the document).
The BreakIterator classes were designed to support these kinds of tasks. The BreakIterator objects maintain a location between two characters in the text. This location will always be a text boundary. Clients can move the location forward to the next boundary or backward to the previous boundary. Clients can also check if a particular location within a source text is on a boundary or find the boundary which is before or after a particular location.
Four Types of BreakIterator
ICU BreakIterators can be used to locate the following kinds of text boundaries:
Character Boundary
Word Boundary
Line-break Boundary
Sentence Boundary
Each type of boundary is found in accordance with the rules specified by Unicode Standard Annex #29, Text Boundaries (http://www.unicode.org/reports/tr29 )
Character Boundary
The character-boundary iterator locates the boundaries between "characters", where "character" is what the end user of an application would usually expect. For example, the Ä letter can be represented in Unicode either with a single code-point value or with two code-point values (one representing the A and another representing the umlaut). The character-boundary iterator will treat the Ä as a single character regardless of whether or not it is stored using one code point or two. In short, the character-boundary iterator is used to identify sequences that should be treated as single characters from a user's perspective.
End-user characters, as described above, are also called grapheme clusters, in an attempt to limit the confusion caused by multiple meanings for the word "character".
Word Boundary
The word-boundary iterator locates the boundaries of words, for purposes such as double click selection or "Find whole words" operations in an editor.
Here's an example of a sentence, showing the boundary locations that will be identified by a word break iterator:
Word boundary locations are found according to these general principles:
Words themselves are kept together
Numbers are kept together, including any commas, points or currency symbols.
Apostrophes or hyphens within a word are kept with the word. They are not broken out separately like most other punctuation
Punctuation, spaces and other characters that are not part of a word or number, are broken out separately, with a boundary before and after each character.
The rules used for locating word breaks take into account the alphabets and conventions used for different languages.
Locating word breaks for Thai text presents a special challenge, because there are no spaces or other identifiable characters separating the words. To solve the problem of word-breaking Thai text, ICU provides a special dictionary-based break iterator.
Line-break Boundary
The line-break iterator locates positions within the text that would be appropriate points for a text editor to break lines when displaying the text. Line breaks differ from word breaks in that adjoining punctuation and trailing white space are kept with the words instead of being treated as separate "words" on their own (for example, do not wrap a line before a space).
This example shows the differences in the break locations produced by word and line break iterators
Sentence Boundary
A sentence-break iterator locates sentence boundaries.
The exact rules used for locating each type of boundary are described in a pair of documents from the Unicode Consortium. Unicode Standard Annex 14 ( http://www.unicode.org/unicode/reports/tr14/ ) gives the rules for locating line boundaries, while technical report 29 "http://www.unicode.org/unicode/reports/tr29/" ) describe character, word and sentence boundaries.
Usage
To locate boundaries in a document, create a BreakIterator using the BreakIterator::create***Instance family of methods in C++, or the ubrk_open() function (C). "***" is Character, Word, Line or Sentence, depending on the type of iterator wanted. These factory methods also take a parameter that specifies the locale for the language of the text to be processed.
When creating a BreakIterator, a locale is also specified, and the behavior of the BreakIterator obtained may be specialized in some way for that locale. For ICU 2.6, Break Iterators for the Thai locale will make use of a Thai dictionary for finding word and line boundaries; all other locales will use the default boundary rules.
Applications also may register customized BreakIterators for use in specific locales. Once such a break iterator has been registered, any requests for break iterators for the locale will return copies of the registered break iterator
In the general-usage-model, applications will use the following basic steps to analyze a piece of text for boundaries:
Create a BreakIterator with the desired behavior
Use the setText() or adoptText() methods to set the iterator to analyze a particular piece of text. Since BreakIterator uses a CharacterIterator to access the text, it can be stored in any form as long as you provide an appropriate CharacterIterator. There is a convenience method for analyzing a UnicodeString, but the user also can analyze part of a UnicodeString by creating a StringCharacterIterator directly.
Locate the desired boundaries using the appropriate combination of first(), last(), next(), previous(), preceding(), and following() methods.
The setText() or the adoptText() method can be called more than once, allowing a single BreakIterator to be reused to analyze different pieces of text. Because the creation of a BreakIterator can be relatively time-consuming, it makes good sense to cache and reuse BreakIterators within an application.
Set the text to be searched using the following:
adoptText(CharacterIterator) sets the BreakIterator to analyze a new piece of text. The new piece of text is specified with a CharacterIterator, which allows BreakIterator to analyze the text for boundaries no matter how it happens to be stored [it always accesses the text through the CharacterIterator]. The BreakIterator takes ownership of the CharacterIterator and will delete it when the process is completed.
setText(UnicodeString) is a shortcut for the adoptText() method. If the text is a UnicodeString, the user can call setText and pass it the string, rather than creating a StringCharacterIterator and passing it to the adoptText() method. This method will create the StringCharacterIterator. To analyze only part of a UnicodeString, on the other hand, create the StringCharacterIterator manually, specify the substring, and then pass it to the adoptText() method.
getText() method returns a const reference to the CharacterIterator that the BreakIterator is using to access the text.
createText() method returns a clone of the CharacterIterator that the BreakIterator is using to access the text. Ownership of the clone is transferred to the caller. (The caller can seek the returned CharacterIterator without affecting the BreakIterator, but if the actual text underlying the iterator is changed, the adoptText() method must be called again to make sure the BreakIterator does not malfunction.)
The iterator always points to a boundary position between two characters. The numerical value of the position, as returned by current() is the zero-based index of the character following the boundary. Thus a position of zero represents a boundary preceding the first character of the text, and a position of one represents a boundary between the first and second characters.
The first() and last() methods reset the iterator's current position to the beginning or end of the text (the beginning and the end are always considered boundaries). The next() and previous() methods advance the iterator one boundary forward or backward from the current position. If the next() or previous() methods run off the beginning or end of the text, it returns DONE. The current() method returns the current position.
The following() and preceding() methods are used for random access or to reposition the iterator to some arbitrary spot in the middle of the text. Since a BreakIterator always points to a boundary position, the following() and preceding() methods will never set the iterator to point to the position specified by the caller (even if it is, in fact, a boundary position). BreakIterator will, however, set the iterator to the nearest boundary position before or after the specified position. The isBoundary() method returns true or false, based on whether or not the specified position is a boundary position. It does this by calling the preceding() and next() methods, so it also repositions the iterator either at the specified position or the first boundary position after it. If any of these functions is passed an out-or-range offset, it returns DONE and repositions the iterator to the beginning or end of the text.
Reuse
It is slow and wasteful to repeatedly create and destroy a BreakIterator when it is not necessary. For example, do not create a separate BreakIterator for each line in a document that is being word-wrapped. Keep around a single instance of a line BreakIterator and use it whenever a line break iterator is needed.
Accuracy
ICU's break iterators implement the default boundary rules described in the Unicode Consortium Technical Reports 14 and 29 . These are relatively simple boundary rules that can be implemented efficiently, and are sufficient for many purposes and languages. However, some languages and applications will require a more sophisticated linguistic analysis of the text in order to find boundaries with good accuracy. Such an analysis is not directly available from ICU at this time.
Break Iterators based on custom, user-supplied boundary rules can be created and used by applications with requirements that are not met by the standard default boundary rules.
BreakIterator Boundary Analysis Examples
Print out all the word-boundary positions in a UnicodeString:
In C++,
void listWordBoundaries(const UnicodeString& s) { UErrorCode status = U_ZERO_ERROR; BreakIterator* bi = BreakIterator::createWordInstance(Locale::getUS(), status); bi->setText(s); int32_t p = bi->first(); while (p != BreakIterator::DONE) { printf("Boundary at position %d\n", p); p = bi->next(); } delete bi; } |
In C:
void listWordBoundaries(const UChar* s, int32_t len) { UBreakIterator* bi; int32_t p; UErrorCode err = U_ZERO_ERROR; bi = ubrk_open(UBRK_WORD, 0, s, len, &err); if (U_FAILURE(err)) return; p = ubrk_first(bi); while (p != UBRK_DONE) { printf("Boundary at position %d\n", p); p = ubrk_next(bi); } ubrk_close(bi); } |
Get the boundaries of the word that contains a double-click position:
In C++:
void wordContaining(BreakIterator& wordBrk, int32_t idx, const UnicodeString& s, int32_t& start, int32_t& end) { // this function is written to assume that we have an // appropriate BreakIterator stored in an object or a // global variable somewhere-- When possible, programmers // should avoid having the create() and delete calls in // a function of this nature. if (s.isEmpty()) return; wordBrk.setText(s); start = wordBrk.preceding(idx + 1); end = wordBrk.next(); // NOTE: for this and similar operations, use preceding() and next() // as shown here, not following() and previous(). preceding() is // faster than following() and next() is faster than previous() // NOTE: By using preceding(idx + 1) above, we're adopting the convention // that if the double-click comes right on top of a word boundary, it // selects the word that _begins_ on that boundary (preceding(idx) would // instead select the word that _ends_ on that boundary). } |
In C:
void wordContaining(UBreakIterator* wordBrk, int32_t idx, const UChar* s, int32_t sLen, int32_t* start, int32_t* end, UErrorCode* err) { if (wordBrk == NULL || s == NULL || start == NULL || end == NULL) { *err = U_ILLEGAL_ARGUMENT_ERROR; return; } ubrk_setText(wordBrk, s, sLen, err); if (U_SUCCESS(*err)) { *start = ubrk_preceding(wordBrk, idx + 1); *end = ubrk_next(wordBrk); } } |
Check for Whole Words
Use the following to check if a range of text is a "whole word":
In C++:
UBool isWholeWord(BreakIterator& wordBrk, const UnicodeString& s, int32_t start, int32_t end) { if (s.isEmpty()) return FALSE; wordBrk.setText(s); if (!wordBrk.isBoundary(start)) return FALSE; return wordBrk.isBoundary(end);} |
In C:
UBool isWholeWord(UBreakIterator* wordBrk, const UChar* s, int32_t sLen, int32_t start, int32_t end, UErrorCode* err) { UBool result = FALSE; if (wordBrk == NULL || s == NULL) { *err = U_ILLEGAL_ARGUMENT_ERROR; return FALSE; } ubrk_setText(wordBrk, s, sLen, err); if (U_SUCCESS(*err)) { result = ubrk_isBoundary(wordBrk, start) >> ubrk_isBoundary(wordBrk, end); } return result; } |
Although users can check for "whole words" using these methods, it is possible to get better performance (in most cases) with the following algorithm:
bool isWholeWord(BreakIterator *wordBrk, const UnicodeString& s, int32_t start, int32_t end) { wordBrk->setText(s); if (!wordBrk->isBoundary(start)) return false; UTextOffset p = wordBrk->current(); while (p < end) p = wordBrk->next(); return p == end; } |
This algorithm is faster because the next() method is the fastest boundary-detection method in BreakIterator. The following() and isBoundary() method [while it calls following()] is the slowest. Two calls to the isBoundary() method is faster only when the selection range is long and comprises more than roughly four words.
Count the words in a document (C++ only):
int32_t containsLetters(RuleBasedBreakIterator& bi, const UnicodeString& s, int32_t start) { bi.setText(s); int32_t count = 0; while (start != BreakIterator::DONE) { int breakType = bi.getRuleStatus(); if (breakType != UBRK_WORD_NONE) { // Exclude spaces, punctuation, and the like. ++count; } start = bi.next(); } return count; } |
The function getRuleStatus() returns an enum giving additional information on the text preceding the last break position found. Using this value, it is possible to distinguish between numbers, words, words containing kana characters, words containing ideographic characters, and non-word characters, such as spaces or punctuation. The sample uses the break status value to filter out, and not count, boundaries associated with non-word characters.
Word-wrap a document (C++ only):
The sample function below wraps a paragraph so that each line is less than or equal to 72 characters. The function fills in an array passed in by the caller with the starting offsets of each line in the document. Also, it fills in a second array to track how many trailing white space characters there are in the line. For simplicity, it is assumed that an outside process has already broken the document into paragraphs. For example, it is assumed that every string the function is passed has a single newline at the end only.
int32_t wrapParagraph(const UnicodeString& s, const Locale& locale, int32_t lineStarts[], int32_t trailingwhitespace[], int32_t maxLines, UErrorCode &status) { int32_t numLines = 0; int32_t p, q; const int32_t MAX_CHARS_PER_LINE = 72; UChar c; BreakIterator *bi = BreakIterator::createLineInstance(locale, status); if (U_FAILURE(status)) { delete bi; return 0; } bi->setText(s); p = 0; while (p < s.length()) { // jump ahead in the paragraph by the maximum number of // characters that will fit q = p + MAX_CHARS_PER_LINE; // if this puts us on a white space character, a control character // (which includes newlines), or a non-spacing mark, seek forward // and stop on the next character that is not any of these things // since none of these characters will be visible at the end of a // line, we can ignore them for the purposes of figuring out how // many characters will fit on the line) if (q < s.length()) { c = s[q]; while (q < s.length() && (u_isspace(c) || u_charType(c) == U_CONTROL_CHAR || u_charType(c) == U_NON_SPACING_MARK )) { ++q; c = s[q]; } } // then locate the last legal line-break decision at or before // the current position ("at or before" is what causes the "+ 1") q = bi->preceding(q + 1); // if this causes us to wind back to where we started, then the // line has no legal line-break positions. Break the line at // the maximum number of characters if (q == p) { p += MAX_CHARS_PER_LINE; lineStarts[numLines] = p; trailingwhitespace[numLines] = 0; ++numLines; } // otherwise, we got a good line-break position. Record the start of this // line (p) and then seek back from the end of this line (q) until you find // a non-white space character (same criteria as above) and // record the number of white space characters at the end of the // line in the other results array else { lineStarts[numLines] = p; int32_t nextLineStart = q; for (q--; q > p; q--) { c = s[q]; if (!(u_isspace(c) || u_charType(c) == U_CONTROL_CHAR || u_charType(c) == U_NON_SPACING_MARK)) { break; } } trailingwhitespace[numLines] = nextLineStart - q -1; p = nextLineStart; ++numLines; } if (numLines >= maxLines) { break; } } delete bi; return numLines; } |
Most text editors would not break lines based on the number of characters on a line. Even with a monospaced font, there are still many Unicode characters that are not displayed and therefore should be filtered out of the calculation. With a proportional font, character widths are added up until a maximum line width is exceeded or an end of the paragraph marker is reached.
Trailing white space does not need to be counted in the line-width measurement because it does not need to be displayed at the end of a line. The sample code above returns an array of trailing white space values because an external rendering process needs to be able to measure the length of the line (without the trailing white space) to justify the lines. For example, if the text is right-justified, the invisible white space would be drawn outside the margin. The line would actually end with the last visible character.
In either case, the basic principle is to jump ahead in the text to the location where the line would break (without taking word breaks into account). Then, move backwards using the preceding() method to find the last legal breaking position before that location. Iterating straight through the text with next() method will generally be slower.
ICU BreakIterator Data Files
The source code for the ICU break rules for the standard boundary types is located in the directory icu/source/data/brkitr. These files will be built, and the corresponding binary state tables incorporated into ICU's data, by the standard ICU4C build process.
Beginning with ICU 3.0, the same break rule source files and compiled state tables are used for both ICU4C and ICU4J. The state tables are built using ICU4C, and the resulting binary tables are incorporated into ICU4J.
RBBI Rules
ICU locates boundary positions within text by means of rules, which are a variant form of regular expressions. A boundary rule is an expression that matches a section of text - a word or sentence or whatever - that should remain together, with boundaries occurring between the ranges of matched text. A set of rules consists of a series of regular expressions separated by semicolons; the rules, taken together, define regions of text that are kept together between boundaries. Boundaries occur at the end of text ranges matched by the rules.
Forward, Reverse, Safe Point rules
For each type of boundary, four sets of rules are required, as described in the following table.
Forward | Advance (match text) starting from a boundary position and continuing to the next following boundary. |
Reverse | Starting from a boundary, match backwards, until the preceding boundary position. |
Safe Forward | Starting from any arbitrary position in the text, match forward to a safe position, which is a position from which the normal Reverse rule will work correctly. |
Safe Reverse | Starting from any arbitrary position in the text, move backwards to a safe position, which is a position from which the normal Forward rule will work correctly. |
All four rules need to be supplied.
Normal next() or previous() operations use the Forward or Reverse rules, respectively, to move directly from one boundary position to another.
The preceding() and following() functions first apply a safe rule, then apply a normal Forward or Reverse rule. (preceding() and following() can start from any arbitrary location in the input text)
![]() | Note: Earlier versions of ICU (prior to 3.0) worked with only a Forward rule and a safe Reverse rule. While the rule builder will still recognize rules written in this form, their use is deprecated and strongly discouraged. |
A rule input file is divided into sections, one for each type of rule:
# This shows the general layout of a break rule file # # The order of the four sections doesn't matter, so long as they all appear. # # Variable definitions can appear anywhere, so long as they are defined before # their first use in a rule. Variables carry forward across section boundaries. # !!forward # forward rules go here. !!reverse # Reverse rules go here. !!safe_forward # Safe Forward rules go here. !!safe_reverse # Safe Reverse rules go here. |
Variables
A set of break rules may define and use variables, which are convenient when subexpressions reappear more than once, or to simplify complex expressions by allowing parts to be separately defined and named. Use of variables within a set of rules has no effect on the efficiency of the resulting break iterator.
!!chain
ICU boundary rules can be written in two ways: chained or non-chained.
With non-chained rules, each rule (regular expression) stands by itself, matching a segment of text between two boundary positions. When moving to the next boundary, the single rule with the longest match defines the boundary position.
This is very much like traditional regular expression behavior.
Non-chained rule matching behavior is the default for ICU break rules.
Chaining allows boundary positions to be determined by an arbitrary number of the boundary rules, applied in an arbitrary sequence. Any character in the text that completes a match for one rule can function as a chaining point, and simultaneously be the beginning character of a match for any other rule. Matching continues in this way until the longest possible match is obtained.
Chaining from one rule to the next can occur at any point that the first rule of the pair matches. The longest match of each individual rule is not required, and if chaining from a shorter match of an intermediate rule results in a longer overall match, that is what will happen.
Chained rules are closer in flavor to the rules definitions in the Unicode Consortium text boundary specifications. Line Break boundaries, in particular, were not really possible to implement accurately with traditional, non-chained regular expression.
!!chain in a rule file enables rule chaining. !!chain applies to all rule sections, and must appear before the first section.
The !!LBCMNoChain option modifies chaining behavior by preventing chaining from one rule to another from occurring on any character whose Line Break property is Combining Mark. This option is subject to change or removal, and should not be used in general. Within ICU, it is used only with the line break rules. We hope to replace it with something more general.
Rule Status Values
Break rules can be tagged with a number, which is called the rule status. After a boundary has been located, the status number of the specific rule that determined the boundary position is available to the application through the function getRuleStatus().
For the predefined word boundary rules, status values are available to distinguish between boundaries associated with words and those around spaces or punctuation. Similarly for line break boundaries, status values distinguish between mandatory line endings (new line characters) and break opportunities that are appropriate points for line wrapping.
In the source form of the break rules, status numbers appear at end of a rule, and are enclosed in {braces}.
Rule Syntax
Here is the syntax for the boundary rules.
Rule Name | Rule Values | Notes |
---|---|---|
rules | statement+ | |
statement | assignment | rule | control | |
control | (“!!forward” | “!!reverse” | “!!safe_forward” | “!!safe_reverse” | “!!chain” | “!!LBCMNoChain”) ';' | |
assignment | variable '=' expr ';' | 5 |
rule | '!'? expr ('{'number'}')? ';' | 8, 9 |
number | [0-9]+ | 1 |
break-point | '/' | |
expr | expr-q | expr '|' expr | expr expr | 3 |
expr-q | term | term '*' | term '?' | term '+' | |
term | rule-char | unicode-set | variable | quoted-sequence | '(' expr ')' | break-point | |
rule-special | any printing ascii character except letters or numbers | white-space | |
rule-char | any non-escaped character that is not rule-special | '.' | any escaped character except '\p' or '\P' | |
variable | '$' name-start-char name-char* | 7 |
name-start-char | '_' | \p{L} | |
name-char | name-start-char | \p{N} | |
quoted-sequence | ''' (any char except single quote or line terminator or two adjacent single quotes)+ ''' | |
escaped-char | See “Character Quoting and Escaping ” | |
Unicode set | See UnicodeSet | 4 |
comment | unescaped '#' [any char except new-line]* new-line | 2 |
s | unescaped \p{Z}, tab, LF, FF, CR, NEL | 6 |
new-line | LF, CR, NEL | 2 |
Notes:
The number associated with a rule that actually determined a break position is available to the application after the break has been returned. These numbers are not Perl regular expression repeat counts.
Comments are recognized and removed separately from otherwise parsing the rules. They may appear wherever a space would be allowed (and ignored.)
The implicit concatenation of adjacent terms has higher precedence than the '|' operation. "ab|cd" is interpreted as "(ab)|(cd)", not as "a(b|c)d" or "(((ab)|c)d)"
The syntax for UnicodeSet is defined (and parsed) by the UnicodeSet class. It is not repeated here.
For $variables that will be referenced from inside of a UnicodeSet, the definition must consist only of a Unicode Set. For example, when variable $a is used in a rule like [$a$b$c], then this definition of $a is ok “$a=[:Lu:];” while this one “$a=abcd;” would cause an error when $a was used.
Spaces are allowed nearly anywhere, and are not significant unless escaped. Exceptions to this are noted.
No spaces are allowed within a variable name. The variable name $dictionary is special. If defined, it must be a Unicode Set, the characters of which will trigger the use of word dictionary based boundaries.
A leading '!' on a rule is a deprecated syntax for specifying a reverse rule. Putting reverse rules in the !!reverse section is now preferred.
{nnn} appearing at the end of a rule is a Rule Status number, not a repeat count as it would be with conventional regular expression syntax.
EBNF Syntax used for the RBBI rules syntax description
a? | zero or one instance of a |
---|---|
a+ | one or more instances of a |
a* | zero or more instances of a |
a | b | either a or b, but not both |
'a' "a" | the literal string between the quotes |
Additional Sample Code
C/C++: See icu/source/samples/break/ in the ICU source distribution for code samples showing the use of ICU boundary analysis.
Copyright (c) 2000 - 2007 IBM and Others - PDF Version - Feedback: http://icu-project.org/contacts.html
User Guide for ICU v3.8 Generated 2007-09-14.