The regular expressions (or regex for short) used in searches and segmentation rules are those supported by Java. Should you need more specific information, consult the Java Regex documentation. See additional references and examples below.
This chapter is intended for advanced users who need to define their own variants of segmentation rules or devise more complex and powerful search expressions.
Table 16.1. Regex - Flags
| The construct | ...matches the following | 
|---|---|
| (?i) | Enables case-insensitive matching (by default, the pattern is case-sensitive). | 
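
The effect of the `(?i)` flag can be checked outside OmegaT with a few lines of plain Java. This is only a minimal sketch using `java.util.regex` directly; the class name `CaseInsensitiveDemo` and the helper `found` are illustrative and not part of OmegaT.

```java
import java.util.regex.Pattern;

class CaseInsensitiveDemo {
    // Returns true if the regex is found anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "Open the OmegaT project";
        // Without the flag the pattern is case-sensitive.
        System.out.println(found("omegat", text));     // false
        // With (?i) at the start, letter case is ignored.
        System.out.println(found("(?i)omegat", text)); // true
    }
}
```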
Table 16.2. Regex - Character
| The construct | ...matches the following | 
|---|---|
| x | The character x itself, unless x is one of the special constructs or meta characters described below | 
| \uhhhh | The character with hexadecimal value 0xhhhh | 
| \t | The tab character ('\u0009') | 
| \n | The newline (line feed) character ('\u000A') | 
| \r | The carriage-return character ('\u000D') | 
| \f | The form-feed character ('\u000C') | 
| \a | The alert (bell) character ('\u0007') | 
| \e | The escape character ('\u001B') | 
| \cx | The control character corresponding to x | 
| \0n | The character with octal value 0n (0 <= n <= 7) | 
| \0nn | The character with octal value 0nn (0 <= n <= 7) | 
| \0mnn | The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7) | 
| \xhh | The character with hexadecimal value 0xhh | 
Table 16.3. Regex - Quotation
| The construct | ...matches the following | 
|---|---|
| \ | Nothing, but quotes the following character. This is required whenever one of the meta characters !$()*+.<>?[\]^{|} should match as itself. | 
| \\ | For example, this matches the backslash character itself | 
| \Q | Nothing, but quotes all characters until \E | 
| \E | Nothing, but ends quoting started by \Q | 
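
The quoting constructs above can be tried out in plain Java, as in the following sketch. Note that Java string literals require every backslash to be doubled, whereas in an OmegaT search or segmentation field a single backslash is typed. The class name `QuotingDemo` and the helper `found` are illustrative assumptions; `Pattern.quote` is a standard Java method that wraps a string in `\Q...\E`.

```java
import java.util.regex.Pattern;

class QuotingDemo {
    // Returns true if the regex is found anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "Price (net): 10.50";

        // Unquoted parentheses form a group, so "(net)" matches plain "net".
        System.out.println(found("(net)", "net"));        // true
        // Quoted parentheses must appear literally in the text.
        System.out.println(found("\\(net\\)", "net"));    // false
        System.out.println(found("\\(net\\)", text));     // true
        // \Q...\E quotes everything in between.
        System.out.println(found("\\Q(net)\\E", text));   // true

        // An unquoted period matches any character; a quoted one does not.
        System.out.println(found("10.50", "10x50"));                // true
        System.out.println(found(Pattern.quote("10.50"), "10x50")); // false
    }
}
```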
Table 16.4. Regex - Classes for Unicode blocks and categories
| The construct | ...matches the following | 
|---|---|
| \p{InGreek} | A character in the Greek block (simple block) | 
| \p{Lu} | An uppercase letter (simple category) | 
| \p{Sc} | A currency symbol | 
| \P{InGreek} | Any character except one in the Greek block (negation) | 
| [\p{L}&&[^\p{Lu}]] | Any letter except an uppercase letter (subtraction) | 
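
As a rough illustration of the Unicode classes above, the following Java sketch collects all hits for a few of them in a short mixed-script string. The class name `UnicodeClassDemo`, the helper `findAll`, and the sample text are assumptions made for this example; the non-ASCII characters are written as Unicode escapes so that the source file stays plain ASCII.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeClassDemo {
    // Collects every match of the regex in the text, in order.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    // Sample text: capital omega + "mega", alpha beta, a euro sign and a dollar sign.
    static final String SAMPLE = "\u03A9mega \u03B1\u03B2 costs \u20AC5 or $6";

    public static void main(String[] args) {
        // Runs of characters from the Greek block: two hits.
        System.out.println(findAll("\\p{InGreek}+", SAMPLE).size());          // 2
        // Uppercase letters: only the capital omega.
        System.out.println(findAll("\\p{Lu}", SAMPLE).size());                // 1
        // Currency symbols: the euro sign and the dollar sign.
        System.out.println(findAll("\\p{Sc}", SAMPLE).size());                // 2
        // Runs of letters that are not uppercase: mega, alpha-beta, costs, or.
        System.out.println(findAll("[\\p{L}&&[^\\p{Lu}]]+", SAMPLE).size());  // 4
    }
}
```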
Table 16.5. Regex - Character classes
| The construct | ...matches the following | 
|---|---|
| [abc] | a, b, or c (simple class) | 
| [^abc] | Any character except a, b, or c (negation) | 
| [a-zA-Z] | a through z or A through Z, inclusive (range) | 
Table 16.6. Regex - Predefined character classes
| The construct | ...matches the following | 
|---|---|
| . | Any character (except for line terminators) | 
| \d | A digit: [0-9] | 
| \D | A non-digit: [^0-9] | 
| \s | A whitespace character: [ \t\n\x0B\f\r] | 
| \S | A non-whitespace character: [^\s] | 
| \w | A word character: [a-zA-Z_0-9] | 
| \W | A non-word character: [^\w] | 
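
The predefined classes are easiest to understand by looking at what they pick out of a sample string. The sketch below uses plain `java.util.regex`; the class name `PredefinedClassDemo` and the helper `findAll` are illustrative only. As the table indicates, `\d` and `\w` cover ASCII digits and word characters by default.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PredefinedClassDemo {
    // Collects every match of the regex in the text, in order.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // \d+ finds runs of digits.
        System.out.println(findAll("\\d+", "Segment 12 of 345")); // [12, 345]
        // \w+ finds runs of letters, digits and underscores; the hyphen splits a run.
        System.out.println(findAll("\\w+", "to_do list-2"));      // [to_do, list, 2]
        // \s+ covers spaces and tabs alike, so mixed whitespace collapses to one space.
        System.out.println("a \t b".replaceAll("\\s+", " "));     // a b
    }
}
```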
Table 16.7. Regex - Boundary matchers
| The construct | ...matches the following | 
|---|---|
| ^ | The beginning of a line | 
| $ | The end of a line | 
| \b | A word boundary | 
| \B | A non-word boundary | 
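
The following sketch shows the boundary matchers in plain Java. One point worth knowing: in `java.util.regex`, `^` and `$` match only at the start and end of the whole input unless multiline mode is switched on, for instance with the embedded flag `(?m)`, which is a standard Java flag not listed in the flags table above. How a given OmegaT field compiles its patterns is not covered here, so treat the multiline part as a property of Java rather than of OmegaT. The names `BoundaryDemo` and `findAll` are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Collects every match of the regex in the text, in order.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Without boundaries, "cat" is also found inside "concat" and "cats".
        System.out.println(findAll("cat", "cat concat cats").size());       // 3
        // \b on both sides restricts the hit to the whole word.
        System.out.println(findAll("\\bcat\\b", "cat concat cats").size()); // 1

        String twoLines = "one\ntwo";
        // Default mode: ^ matches only at the start of the input.
        System.out.println(findAll("^\\w+", twoLines));     // [one]
        // Multiline mode: ^ and $ match at every line start and end.
        System.out.println(findAll("(?m)^\\w+", twoLines)); // [one, two]
        System.out.println(findAll("(?m)\\w+$", twoLines)); // [one, two]
    }
}
```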
Table 16.8. Regex - Greedy quantifiers
| The construct | ...matches the following | 
|---|---|
| X? | X, once or not at all | 
| X* | X, zero or more times | 
| X+ | X, one or more times | 
Greedy quantifiers will match as much as they can. For example, a+ will match the aaa in aaabbb.
Table 16.9. Regex - Reluctant (non-greedy) quantifiers
| The construct | ...matches the following | 
|---|---|
| X?? | X, once or not at all | 
| X*? | X, zero or more times | 
| X+? | X, one or more times | 
Non-greedy quantifiers will match as little as they can. For example, a+? will match only the first a in aaabbb.
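
The difference between greedy and reluctant quantifiers can be verified with a short Java sketch. The second pair of patterns uses tag-like text, where the difference tends to matter in practice. The class name `QuantifierDemo`, the helper `firstMatch`, and the sample strings are assumptions made for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the first match of the regex in the text, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: takes as many a's as possible.
        System.out.println(firstMatch("a+", "aaabbb"));        // aaa
        // Reluctant: stops after the first a.
        System.out.println(firstMatch("a+?", "aaabbb"));       // a

        // Greedy: runs to the last '>' in the text.
        System.out.println(firstMatch("<.+>", "<b>bold</b>"));  // <b>bold</b>
        // Reluctant: stops at the first '>'.
        System.out.println(firstMatch("<.+?>", "<b>bold</b>")); // <b>
    }
}
```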
Table 16.10. Regex - Logical operators
| The construct | ...matches the following | 
|---|---|
| XY | X followed by Y | 
| X|Y | Either X or Y | 
| (XY) | XY as a single group | 
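
Alternation and grouping combine with the quantifiers described earlier, which the following Java sketch tries to show. The class name `LogicalOperatorDemo`, the helper `findAll`, and the sample strings are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogicalOperatorDemo {
    // Collects every match of the regex in the text, in order.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // X|Y inside a group: either spelling is accepted, "groy" is not.
        System.out.println(findAll("gr(a|e)y", "gray grey groy")); // [gray, grey]
        // X|Y at the top level: either alternative as a whole.
        System.out.println(findAll("ab|cd", "abcd"));              // [ab, cd]
        // (XY)+ repeats the whole group...
        System.out.println(findAll("(ab)+", "ababab abb"));        // [ababab, ab]
        // ...whereas XY+ repeats only the last character.
        System.out.println(findAll("ab+", "ababab abb"));          // [ab, ab, ab, abb]
    }
}
```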
A number of interactive tools are available to develop and test regular expressions. They generally follow much the same pattern, as in the Regular Expression Tester: the regular expression (top entry) is applied to the search text (text box in the middle), and the hits are shown in the result text box.
See The Regex Coach for a stand-alone tool available for Windows, Linux, and FreeBSD. It works in much the same way as the example above.
A nice collection of useful regex cases can be found in OmegaT itself (see Options > Segmentation). The following list includes expressions you may find useful when searching through the translation memory:
Table 16.11. Regex - Examples of regular expressions in translations
| Regular expression | Finds the following: | 
|---|---|
| (\b\w+\b)\s\1\b | double words | 
| [\.,]\s*[\.,]+ | a comma or a period, followed by optional spaces and yet another comma or period | 
| \. \s+$ | extra spaces after the period at the end of the line | 
| \s+a\s+[aeiou] | English: "a" before a word beginning with a vowel, where "an" is generally expected | 
| \s+an\s+[^aeiou] | English: the same check as above, but for consonants ("an" where "a" is expected) | 
| \s\s+ | more than one space | 
| \.[A-Z] | a period followed by an upper-case letter; possibly a space is missing between the period and the start of a new sentence |
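
Several of the expressions in the table can be tried against sample sentences before using them in a search. The sketch below does this in plain Java; the class name `TranslationCheckDemo`, the helper `findAll`, and the sample sentences are assumptions made for this illustration, and backslashes are doubled only because Java string literals require it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TranslationCheckDemo {
    // Collects every match of the regex in the text, in order.
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String doubleWords = "(\\b\\w+\\b)\\s\\1\\b";
        // The back-reference \1 must repeat the captured word exactly.
        System.out.println(findAll(doubleWords, "This is is a test"));     // [is is]
        // The word boundaries prevent a false hit on "this is".
        System.out.println(findAll(doubleWords, "this is fine"));          // []

        // More than one space: two separate runs are reported.
        System.out.println(findAll("\\s\\s+", "too  many   spaces").size()); // 2

        // Period directly followed by an upper-case letter.
        System.out.println(findAll("\\.[A-Z]", "First sentence.Second sentence.")); // [.S]

        // "a" before a vowel and "an" before a non-vowel.
        System.out.println(findAll("\\s+a\\s+[aeiou]", "She ate a apple"));    // [ a a]
        System.out.println(findAll("\\s+an\\s+[^aeiou]", "He ate an banana")); // [ an b]
    }
}
```

Each hit includes the surrounding whitespace and the first letter of the following word, because those characters are part of the pattern itself.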