Regular Expressions

Single Character Classes

[abc] a, b, or c (simple class)
[bcr]at matches cat
[^abc] Any character except a, b, or c (negation). [^ab] reads "neither a nor b": it matches c but not a, b, or ab.
[^bcr]at matches hat
[a-zA-Z] a through z, or A through Z, inclusive (range)
[a-c] matches c
foo[1-5] matches foo4
foo[^1-5] matches foo6
[a-d[m-p]] a through d, or m through p: [a-dm-p] (union)
[0-4[6-8]] matches 6 or 8 but not 9 or 5
[a-z&&[def]] d, e, or f (intersection)
[0-9&&[345]] matches 5
[a-z&&[^bc]] a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]] a through z, and not m through p: [a-lq-z] (subtraction)
[0-9&&[^345]] matches 6
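
The classes above can be checked with a short Java sketch. It assumes the java.util.regex flavor, since union and intersection inside brackets are not supported by every regex engine. The class name CharClassDemo and the helper matches are illustrative names, not part of any API.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only java.util.regex.Pattern is a real API here.
class CharClassDemo {
    // True when the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("[bcr]at", "cat"));     // simple class: true
        System.out.println(matches("[^bcr]at", "hat"));    // negation: true
        System.out.println(matches("foo[^1-5]", "foo6"));  // negated range: true
        System.out.println(matches("[0-4[6-8]]", "6"));    // union: true
        System.out.println(matches("[0-4[6-8]]", "5"));    // 5 is in neither range: false
        System.out.println(matches("[0-9&&[345]]", "5"));  // intersection: true
        System.out.println(matches("[a-z&&[^bc]]", "b"));  // subtraction excludes b: false
    }
}
```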

An ordinary character in a regular expression matches itself. A metacharacter is a special character that affects the way a pattern is matched. The letter A is an ordinary character. The punctuation mark . is a metacharacter that matches any single character.

Predefined Character Classes

. Any character except a line terminator such as \n (by default)
\d A digit: [0-9]
\D A non-digit: [^0-9]
\s A whitespace character: [ \t\n\x0B\f\r]
\S A non-whitespace character: [^\s]
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w]

| Alternation (the pipe, or OR operator): abc|def matches abc or def

\d matches 1
\w matches a
\w does not match !
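
As a rough illustration of the predefined classes and the pipe, the following Java sketch tries each one against a single input. Note that every backslash has to be doubled inside a Java string literal. PredefinedClassDemo and matches are made-up names used only for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the predefined character classes.
class PredefinedClassDemo {
    // True when the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("\\d", "1"));        // digit: true
        System.out.println(matches("\\w", "a"));        // word character: true
        System.out.println(matches("\\w", "!"));        // not a word character: false
        System.out.println(matches("\\s", "\t"));       // tab is whitespace: true
        System.out.println(matches(".", "\n"));         // dot skips line terminators: false
        // With the DOTALL flag the dot also matches a newline: true
        System.out.println(Pattern.compile(".", Pattern.DOTALL).matcher("\n").matches());
        System.out.println(matches("abc|def", "def"));  // alternation: true
    }
}
```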

Quantifiers

Quantifiers allow you to specify the number of occurrences to match against.

Greedy Reluctant Possessive Meaning and examples
X? X?? X?+ X, once or not at all. a? matches the empty string and a; -?[0-9]+ matches -23 and 123
X* X*? X*+ X, zero or more times. a* matches the empty string, a, aa, and so on
X+ X+? X++ X, one or more times. a+ matches a, aa, aaa, etc.; [0-1]+ matches any binary number
X{n} X{n}? X{n}+ X, exactly n times. [0-9]{3}-[0-9]{4} matches any number of the form 123-4567
X{n,} X{n,}? X{n,}+ X, at least n times
X{n,m} X{n,m}? X{n,m}+ X, at least n but not more than m times. ba{2,3}b matches baab and baaab but not bab or baaaab

.* matches any sequence of characters (except line terminators)
\. matches a literal dot, because \ escapes the metacharacter
\..* matches a literal dot followed by anything

Zero-length matches: when the regular expression a? is applied to an input such as "ab", it is not specifically looking for the letter "b"; it is merely looking for the presence (or lack thereof) of the letter "a". Because the quantifier allows "a" to match zero times, every position in the input where there is no "a", including the end of the input, shows up as a zero-length match.
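
A small Java sketch makes the zero-length matches visible by listing every match with its start and end index. The class ZeroLengthDemo, the helper findAll, and the "start-end:text" output format are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class that lists all matches of a regex.
class ZeroLengthDemo {
    // Returns each match as "start-end:text"; a zero-length match has start == end.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + "-" + m.end() + ":" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // "a" at 0-1, then empty matches at the "b" and at the end of the input.
        System.out.println(findAll("a?", "ab")); // [0-1:a, 1-1:, 2-2:]
    }
}
```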

Number of Occurrences

To match a pattern exactly n times, put the number inside a set of {braces}:

a{3} matches aaa but not aa

To require a pattern to appear at least n times, add a comma after the number:

a{3,} matches aaaaa

To specify an upper limit on the number of occurrences, add a second number inside the braces:

a{3,6} matches aaaa
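
The counted quantifiers can be tried out with a sketch like the one below, which reuses examples from this page. CountDemo and matches are illustrative names only.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for {n}, {n,} and {n,m}.
class CountDemo {
    // True when the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("a{3}", "aaa"));        // exactly 3: true
        System.out.println(matches("a{3}", "aa"));         // too few: false
        System.out.println(matches("a{3,}", "aaaaa"));     // at least 3: true
        System.out.println(matches("a{3,6}", "aaaa"));     // between 3 and 6: true
        System.out.println(matches("a{3,6}", "aaaaaaa"));  // 7 is too many: false
        System.out.println(matches("ba{2,3}b", "baab"));   // true
        System.out.println(matches("ba{2,3}b", "baaaab")); // false
        System.out.println(matches("[0-9]{3}-[0-9]{4}", "123-4567")); // true
    }
}
```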

Capturing Groups

By default a quantifier attaches only to the single character immediately before it, so the regular expression "abc+" means "a, followed by b, followed by c one or more times". It does not mean "abc" one or more times. However, quantifiers can also attach to character classes, as in [abc]+ (a or b or c, one or more times), and to capturing groups, as in (abc)+ (the group "abc", one or more times).

(dog){3} matches dogdogdog; in dogdogdogdogdogdog it is found twice

Capturing groups are a way to treat multiple characters as a single unit. They are created by placing the characters to be grouped inside a set of (parentheses). For example, the regular expression (dog) creates a single group containing the letters "d" "o" and "g". The portion of the input string that matches the capturing group will be saved in memory for later recall via backreferences.

The groups are numbered by counting their opening parentheses from left to right. There are 4 groups in ((A)(B(D))): ((A)(B(D))), (A), (B(D)), and (D).

Group 0, which represents the entire expression, is not included in the total reported by groupCount.
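
To see the numbering in practice, the sketch below prints every group of ((A)(B(D))) and contrasts abc+ with (abc)+. GroupDemo and its helper groups are hypothetical names; Matcher.groupCount and Matcher.group are the standard java.util.regex methods.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for capturing groups.
class GroupDemo {
    // Returns the text of each group (index 0 is the whole match),
    // or an empty array when the input does not match as a whole.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // groupCount() is 4 here, so indexes 0..4 are printed:
        // 0: ABD, 1: ABD, 2: A, 3: BD, 4: D
        String[] g = groups("((A)(B(D)))", "ABD");
        for (int i = 0; i < g.length; i++) {
            System.out.println(i + ": " + g[i]);
        }
        System.out.println(Pattern.matches("abc+", "abccc"));         // true
        System.out.println(Pattern.matches("abc+", "abcabc"));        // false
        System.out.println(Pattern.matches("(abc)+", "abcabc"));      // true
        System.out.println(Pattern.matches("(dog){3}", "dogdogdog")); // true
    }
}
```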

Backreferences

The section of the input string matching the capturing group(s) is saved in memory for later recall via backreference. A backreference is specified in the regular expression as a backslash (\) followed by a digit indicating the number of the group to be recalled. For example, the expression (\d\d) defines one capturing group matching two digits in a row, which can be recalled later in the expression via the backreference \1.

To match any 2 digits, followed by the exact same two digits, you would use (\d\d)\1 as the regular expression:

Enter your regex: (\d\d)\1
Enter input string to search: 1212
I found the text "1212" starting at index 0 and ending at index 4.

If you change the last two digits, the match will fail:

Enter your regex: (\d\d)\1
Enter input string to search: 1234
No match found.

For nested capturing groups, backreferencing works in exactly the same way: Specify a backslash followed by the number of the group to be recalled.
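
The same two searches can be reproduced outside the interactive harness with a minimal Java sketch. BackrefDemo and firstMatch are illustrative names, and the nested pattern (a(b))\2 is an example invented here to show that \2 recalls the inner group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for backreferences.
class BackrefDemo {
    // Returns "start-end" of the first match, or "no match".
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() + "-" + m.end() : "no match";
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("(\\d\\d)\\1", "1212")); // 0-4
        System.out.println(firstMatch("(\\d\\d)\\1", "1234")); // no match
        // Nested groups: group 1 is (a(b)), group 2 is (b), so \2 repeats "b".
        System.out.println(firstMatch("(a(b))\\2", "abb"));    // 0-3
    }
}
```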

Boundary Matchers

Boundary matchers specify where in the input string a match must occur.

Boundary Construct Description
^ The beginning of a line. (Inside a character class, ^ means negation; see above.)
$ The end of a line
\b A word boundary
\B A non-word boundary
\A The beginning of the input
\G The end of the previous match
\Z The end of the input but for the final terminator, if any
\z The end of the input
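
The effect of each boundary matcher is easiest to see by counting matches. The following is a sketch with made-up inputs; BoundaryDemo and count are names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for boundary matchers.
class BoundaryDemo {
    // Counts how many times the regex is found in the input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count("^dog$", "dog"));                  // 1
        System.out.println(count("\\bdog\\b", "The dog plays"));    // 1
        System.out.println(count("\\bdog\\b", "The doggie plays")); // 0: "dog" runs into "gie"
        System.out.println(count("\\bdog\\B", "The doggie plays")); // 1
        System.out.println(count("dog", "dog dog"));                // 2
        System.out.println(count("\\Gdog", "dog dog"));             // 1: second dog is not at the end of the previous match
        System.out.println(count("dog\\Z", "dog\n"));               // 1: \Z ignores the final terminator
        System.out.println(count("dog\\z", "dog\n"));               // 0: \z requires the very end
    }
}
```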

Greedy, Reluctant, and Possessive Quantifiers

There are subtle differences among greedy, reluctant, and possessive quantifiers.

Greedy quantifiers are considered "greedy" because they force the matcher to read in, or eat, the entire input string prior to attempting the first match. If the first match attempt (the entire input string) fails, the matcher backs off the input string by one character and tries again, repeating the process until a match is found or there are no more characters left to back off from. Depending on the quantifier used in the expression, the last thing it will try matching against is 1 or 0 characters.

The reluctant quantifiers, however, take the opposite approach: They start at the beginning of the input string, then reluctantly eat one character at a time looking for a match. The last thing they try is the entire input string.

Finally, the possessive quantifiers always eat the entire input string, trying once (and only once) for a match. Unlike the greedy quantifiers, possessive quantifiers never back off, even if doing so would allow the overall match to succeed.
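
The three behaviors can be compared side by side by running .*foo, .*?foo, and .*+foo against the same input. This is a sketch only; the input string xfooxxxxxxfoo and the names QuantifierModeDemo and findAll are chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing greedy, reluctant and possessive quantifiers.
class QuantifierModeDemo {
    // Returns the text of every match found in the input.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: eats everything, then backs off to the last "foo".
        System.out.println(findAll(".*foo", input));  // [xfooxxxxxxfoo]
        // Reluctant: stops at the first "foo", then continues from there.
        System.out.println(findAll(".*?foo", input)); // [xfoo, xxxxxxfoo]
        // Possessive: eats everything and never backs off, so "foo" cannot match.
        System.out.println(findAll(".*+foo", input)); // []
    }
}
```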

Examples

Definition Regex Description
A string containing one word but not another: ^(?!.*reza).*ali.*$ The first part, (?!.*reza), is a negative lookahead: at the start of the string it checks that "reza" does not appear anywhere ahead (after any number of characters). Only if that check passes is the rest of the pattern applied, which requires the string to contain "ali".
Anything except a particular pattern (negative lookahead): (?!regex) The lookahead succeeds only where regex does not match at the current position, and it consumes no characters. For example, (?!7777)\d{4} matches any 4 digits other than 7777.
A sequence of digits as a group: ([0-9]*) Matches any run of digits (possibly empty) and captures it as a single group.
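
The three rows above might be exercised in Java roughly as follows. LookaheadDemo and its helper methods are hypothetical names, and the sample inputs are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the lookahead and grouping examples.
class LookaheadDemo {
    // True when the string contains "ali" but not "reza".
    static boolean hasAliButNotReza(String s) {
        return Pattern.matches("^(?!.*reza).*ali.*$", s);
    }

    // True when the string is exactly four digits other than 7777.
    static boolean fourDigitsNot7777(String s) {
        return Pattern.matches("(?!7777)\\d{4}", s);
    }

    // Returns the digits captured by group 1, or null if the input is not all digits.
    static String digitGroup(String s) {
        Matcher m = Pattern.compile("([0-9]*)").matcher(s);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(hasAliButNotReza("ali and bob"));  // true
        System.out.println(hasAliButNotReza("ali and reza")); // false
        System.out.println(hasAliButNotReza("bob only"));     // false
        System.out.println(fourDigitsNot7777("1234"));        // true
        System.out.println(fourDigitsNot7777("7777"));        // false
        System.out.println(digitGroup("2024"));               // 2024
    }
}
```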

Glob

Glob is a pattern-matching syntax used for files and directories. It is simpler than regex. See more info here: http://docs.oracle.com/javase/tutorial/essential/io/fileOps.html#glob
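
As a rough sketch of glob matching in Java, the standard java.nio.file API exposes a PathMatcher that accepts the "glob:" syntax. GlobDemo and globMatches are illustrative names, the file names are invented, and the results noted in the comments assume a Unix-like default file system where / separates directories.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical demo class for glob patterns.
class GlobDemo {
    // True when the path matches the glob on the default file system.
    static boolean globMatches(String glob, String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        System.out.println(globMatches("*.java", "Foo.java"));          // true
        System.out.println(globMatches("*.java", "src/Foo.java"));      // false: * does not cross directories
        System.out.println(globMatches("**.java", "src/Foo.java"));     // true: ** crosses directories
        System.out.println(globMatches("*.{java,class}", "Foo.class")); // true
    }
}
```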

In Python

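In Python, notice the difference between a capturing group (…) and a non-capturing group (?:…). Both "[0-9]+(?:-[0-9]+)*" and "[0-9]+(-[0-9]+)*" match 12, 12-13, and 12-13-14, but functions that report groups behave differently: re.findall returns the full match for the first pattern, whereas for the second it returns only the text captured by the group (for example -14 rather than 12-13-14).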

In Java

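        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        class JavaRegexExample { // class name is illustrative
            public static void main(String[] args) {
                // Compile a pattern; backslashes are doubled inside Java string literals.
                Pattern p = Pattern.compile("com\\.x\\..*\\.dto\\..*");
                // Check an input against the pattern.
                Matcher m1 = p.matcher("com.x.services.products.data.online.dto.PeriodTypeEnum.class");
                System.out.println(m1.find());  // true
                System.out.println(m1.group()); // the text that matched (group 0); call only after a successful find()

                // Pattern.compile also accepts flags, combined with |.
                Pattern p2 = Pattern.compile("...", Pattern.CASE_INSENSITIVE | Pattern.UNIX_LINES);

                // A quick way to check a whole string against a regex:
                // static boolean Pattern.matches(String regex, CharSequence input)
                System.out.println(Pattern.matches("a{3}", "aaa")); // true
            }
        }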
