Regular Expressions
A regular expression is the specification of the syntax of a simple language
Used with regexp.exec
, regexp.test
, string.match
, string.replace
, string.search
and string.split
to interact with string
Quite convoluted and difficult to read as do not allow comments or whitespace so a JavaScript regular expression must be on a single line
An Example
/ˆ(?:([A-Za-z]+):)?(\/{0,3})([0-9.\-A-Za-z]+)(?::(\d+))?(?:\/([ˆ?#]*))?(?:\?([ˆ#]*))?(?:#(.*))?$/
Breaking it down one portion factor at a time:
- Note that the string starts and ends with a slash
/
ˆ
indicates the beginning of a string(?:([A-Za-z]+):)?
(?:...)
indicates a noncapturing group), where the '...' is replaced by the group that you wish to match, but not save to anywhere- Suffix
?
indicates the group is optional, so it could or could not exist in the string - it could even exist more than once ()
around the ([A-Za-z]+) indicates a capturing group which is therefore captured and placed in theresult
array- They groups are placed in the array in order, so the first will appear in
result[1]
- Noncapturing groups are preferred to capturing groups because capturing groups have a performance penalty (on account of saving to the result array)
- You can also have capturing groups within noncapturing groups such as
(?:Bob says: (\w+))
- They groups are placed in the array in order, so the first will appear in
[...]
indicates a character classA-Za-z
is a character class containing all 26 letters of the alphabet in both upper and lower case- Suffix
+
means character class will be matched one or more times - Suffix
:
is matched literally (so the letters will be followed by a colon in this case)
(\/{0,3})
\/
The backslash\
escapes the forward slash/
(which traditionally symbolises the end of the regular expression literal) and together they indicate that the forward slash/
should be matched- Suffix
{0,3}
means the slash/
will be matched between 0 and 3 times
([0-9.\-A-Za-z]+)
- String made up of one or more (note the
+
at the end denoting possible multiple ocurrences) digits, letters (upper or lower case), full stops (.) or hyphens (-)- Note that the hyphen was escaped with a backslash
\-
as hyphens usually denote a range but in this case is a hyphen within the expression
- Note that the hyphen was escaped with a backslash
- String made up of one or more (note the
(?::(\d+))?
\d
represents a digit character so this will be a sequence of one or more digit characters (as per the+
)- The digit characters will be immediately preceded by a colon
:
(\d+) will be the fourth capturing group in this expression, it is also _optional_ (
?) and inside a non-capturing group (
(?:...)`
(?:\/([ˆ?#]*))?
- Another optional grou (
?
), beginning with a literal slash/
(escaped by the backslash) - The
ˆ
at the beginning of character class[ˆ?#]
means it includes all characters except ? and #- This acutally leave the regexp open to attack because too many characters are included in the character class
- The
*
indicates the character class will appear zero or more times
- Another optional grou (
(?:\?([ˆ#]*))?
- We've seen everything here before: An optional capturing group starting with a literal
?
(escaped by the backslash) with zero or more characters that are not #
- We've seen everything here before: An optional capturing group starting with a literal
(?:#(.*))?
- Final optional group beginning with a
#
.
matches any character except a line ending character
- Final optional group beginning with a
$
represents the end of a stringNote:
ˆ
and$
are important because they anchor the regexp and checks whether the string matched against it contains only what is in the regexp- If
ˆ
and$
weren't present, it would check that the string contained the regexp but wouldn't necessarily be only made up of this - Using only
ˆ
checks the string starts with the regexp - Using only
$
checks the string ends with the regexp
- If
Another example
/ˆ-?\d+(?:\.\d*)?(?:e[+\-]?\d+)?$/i;
Most of this we have seen before but here are the new bits:
- The
i
at the end means ignore case when matching letters -?
means the minus sign is optional(?:\.\d*)
matches a decimal point followed by zero or more digits (123.6834.4442284 does not match)- Note this expression only uses noncapturing groups
Construction
3 flags exist in regular expressions: i
means insensitive - ignore the character case, 'gmeans global - to match multiple items and
m` means multiline - where ˆ and $ can match line-ending characters
Two ways to build a regular expression:
- Regular Expression literals as per the examples above start and end with a slash
/
- Here the flags are appended after the final slash, for example
/i
- Be careful:
RegExp
objects made by regular expression literals share a single instance
- Here the flags are appended after the final slash, for example
- Use
RegExp
constructor- The first parameter is the string to be made into a
RegExp
object, the second is the flag - Useful when all information for creating the regular expression is not available at time of programming
- Backslashes mean something in the constructor, so these must be doubled and quotes must be escaped
- The first parameter is the string to be made into a
//example creating a regular expression object that matches a JavaScript string
var my_regexp = new RegExp("'(?:\\\\.|[ˆ\\\\\\'])*'", 'g');
Elements
Regexp Choice
|
provides a match if any of the sequences provided match.
In "into".match(/in|int/);
, the in will be a match so it doesn't even look at the int.
Regexp Sequence
A regexp sequence is made up of one or more regexp factors. If there are no quantifiers after the factor (like ?
, *
or +
), the factor will be matched one time.
Regexp Factor
A regexp factor can be a character, a parenthesized group, a character class, or an escape sequence.
It's essentially a portion of the full RegExp
, like what we broke down the regexp above into.
- The following special characters must all be escaped with a backslash
\
to be taken literally, or they will take on an alternative meaning: \ / [ ] ( ) { } ? + * | . ˆ$ - The
\
prefix does not make letters or digits literal - When unescaped:
.
matches any character except line-endingˆ
matches the beginning of the text whenlastIndex
property is zero, or matches line-ending character when them
flag is present- Having
ˆ
inside a character class means NOT, so [ˆ0-9] means does not match a digit $
matches the beginning of the text or a line-ending character when them
flag is present
Regexp Escape
As well as escaping special characters in regexp factors, the backslash has additional uses:
- As in strings,
\f
is the formfeed character,\n
is new line,\r
is carriage return,\t
is tab and\u
specifies Unicode as a 16-bit hex. But\b
is not a backspace character \d
=== [0-9] and\D
is the opposite, NOT (ˆ) a digit, [ˆ0-9]\s
matches is a partial set of Unicode whitespace characters and\S
is the opposite\w
=== [0-9A-Za-z] and\W
=== [ˆ0-9A-Za-z] but useless for any real world language (because of accents on letters, etc)\1
refers to the text captured in group 1 so it is matched again later on in the regexp\2
refers to group 2,\3
to group 3 and so on
*\b
is a bad part. It was supposed to be a word-boundary anchor but is useless for multilingual applications
Regexp Group
Four kinds of groups:
Capturing: (...)
where each group is captured into the result
array - the first capturing group in the regexp goes into result[1]
, the second into result[2]
and so on
Noncapturing (?:...)
where the text is matched, but not captured and saved anywhere, making is slightly faster than a capturing group (has no bearing on numbering of capturing groups)
- Positive lookahead, a bad part:
(?=...)
acts like a noncapturing group except after the match is made, it goes back to where text started - Negative lookahead, a bad part:
(?!...)
is like a positive lookahead but only matches if there is no match with what is in it
Regexp Class
- Conveniently and easily specifies one of a set of characters using square brackets
[]
, for example vowels:[aeiou]
- Can shorten specification of all 32 ASCII special characters to [!-\/:-@[-'{-˜] (note that the ' in this piece of code should be a back-tick which I can't use as part of these notes)
- Also allows
ˆ
as the first character after the opening[
to mean NOT the characters in the character set
Regexp Class Escape
There are specific characters that must be escaped in a character class: - / [ \ ] ˆ
Regexp Quantifier
A quantifier at the en of a factor indicates how many times the factor should be matched
- A number in curly braces means the factor should match that many times, so
/o{3}
matches ooo - Two comma-seperated numbers in curly braces provide the range of times a factor should match, so
{3,5}
indicates it will match 3, 4 or 5 times - Zero or one times (same thing as saying something is optional) can be
?
or{0,1}
- Zero or more times can be
*
or{0,}
- One or more times can be
+
or{1,}
Prefer to use 'zero or more' or 'one or more' matching over the 'zero or one' matching - i.e. prefer greedy matching over lazy matching