Regular Expressions

A regular expression is the specification of the syntax of a simple language

Used with regexp.exec, regexp.test, string.match, string.replace, string.search and string.split to interact with strings
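A minimal sketch of the six methods listed above, run against one sample string. The text and the pattern are made up purely for illustration.

```javascript
// Hypothetical sample text and pattern, chosen only for illustration.
var text = 'cat bat rat';
var pattern = /[bc]at/;

var tested = pattern.test(text);             // true: the pattern occurs in the text
var executed = pattern.exec(text);           // result array; executed[0] is 'cat'
var matched = text.match(pattern);           // like exec when there is no g flag
var replaced = text.replace(pattern, 'hat'); // first match only: 'hat bat rat'
var found = text.search(pattern);            // index of the first match: 0
var pieces = text.split(/\s/);               // ['cat', 'bat', 'rat']
```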

Regular expressions can be quite convoluted and difficult to read because they do not allow comments or whitespace, so a JavaScript regular expression must be written on a single line

An Example

/^(?:([A-Za-z]+):)?(\/{0,3})([0-9.\-A-Za-z]+)(?::(\d+))?(?:\/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?$/

Breaking it down one portion (factor) at a time:

  • Note that the string starts and ends with a slash /
  • ^ indicates the beginning of a string
  • (?:([A-Za-z]+):)?
    • (?:...) indicates a noncapturing group, where the '...' is replaced by the group that you wish to match, but not save to anywhere
    • Suffix ? indicates the group is optional, so it appears in the string either once or not at all
    • () around the ([A-Za-z]+) indicates a capturing group which is therefore captured and placed in the result array
      • The groups are placed in the array in order, so the first will appear in result[1]
      • Noncapturing groups are preferred to capturing groups because capturing groups have a performance penalty (on account of saving to the result array)
      • You can also have capturing groups within noncapturing groups such as (?:Bob says: (\w+))
    • [...] indicates a character class
    • A-Za-z is a character class containing all 26 letters of the alphabet in both upper and lower case
    • Suffix + means character class will be matched one or more times
    • Suffix : is matched literally (so the letters will be followed by a colon in this case)
  • (\/{0,3})
    • \/ The backslash \ escapes the forward slash / (which traditionally symbolises the end of the regular expression literal) and together they indicate that the forward slash / should be matched
    • Suffix {0,3} means the slash / will be matched between 0 and 3 times
  • ([0-9.\-A-Za-z]+)
    • String made up of one or more (note the + at the end denoting possible multiple occurrences) digits, letters (upper or lower case), full stops (.) or hyphens (-)
      • Note that the hyphen was escaped with a backslash \- because hyphens usually denote a range inside a character class, but here a literal hyphen is meant
  • (?::(\d+))?
    • \d represents a digit character so this will be a sequence of one or more digit characters (as per the +)
    • The digit characters will be immediately preceded by a colon :
    • (\d+) will be the fourth capturing group in this expression; it is also optional (?) and inside a noncapturing group (?:...)
  • (?:\/([^?#]*))?
    • Another optional group (?), beginning with a literal slash / (escaped by the backslash)
    • The ^ at the beginning of character class [^?#] means it includes all characters except ? and #
      • This actually leaves the regexp open to attack because too many characters are included in the character class
    • The * indicates the character class will appear zero or more times
  • (?:\?([^#]*))?
    • We've seen everything here before: an optional noncapturing group starting with a literal ? (escaped by the backslash), followed by a capturing group of zero or more characters that are not #
  • (?:#(.*))?
    • Final optional group beginning with a #
    • . matches any character except a line ending character
  • $ represents the end of a string

  • Note: ^ and $ are important because they anchor the regexp, so it checks whether the string matched against it contains only what is in the regexp

    • If ^ and $ weren't present, it would check that the string contained the regexp but wouldn't necessarily be only made up of this
    • Using only ^ checks the string starts with the regexp
    • Using only $ checks the string ends with the regexp
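As a rough illustration of the breakdown above, the regexp can be run with exec to see which capturing group ends up in which slot of the result array. The sample URL and the names array are assumptions added for this sketch; they are not part of the notes.

```javascript
var parse_url = /^(?:([A-Za-z]+):)?(\/{0,3})([0-9.\-A-Za-z]+)(?::(\d+))?(?:\/([^?#]*))?(?:\?([^#]*))?(?:#(.*))?$/;

// Hypothetical sample URL, used only for illustration.
var url = 'http://www.ora.com:80/goodparts?q#fragment';
var result = parse_url.exec(url);

// Labels for result[0] (the whole match) through result[7]; illustrative names.
var names = ['url', 'scheme', 'slash', 'host', 'port', 'path', 'query', 'hash'];
var parts = {};
for (var i = 0; i < names.length; i += 1) {
    parts[names[i]] = result[i];
}
// parts.scheme is 'http', parts.slash is '//', parts.host is 'www.ora.com',
// parts.port is '80', parts.path is 'goodparts', parts.query is 'q',
// parts.hash is 'fragment'

// Effect of the anchors: without ^ and $ a partial match is enough.
var unanchored = /\d+/.test('abc123');   // true: digits appear somewhere
var anchored = /^\d+$/.test('abc123');   // false: the string is not only digits
```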

Another example: /^-?\d+(?:\.\d*)?(?:e[+\-]?\d+)?$/i

Most of this we have seen before but here are the new bits:

  • The i at the end means ignore case when matching letters
  • -? means the minus sign is optional
  • (?:\.\d*) matches a decimal point followed by zero or more digits; the group is optional and can occur at most once, so 123.6834.4442284 does not match
  • Note this expression only uses noncapturing groups
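A quick sketch of how this number regexp behaves on a few inputs. The sample strings are assumptions chosen to exercise each optional part.

```javascript
var parse_number = /^-?\d+(?:\.\d*)?(?:e[+\-]?\d+)?$/i;

// Hypothetical sample inputs, used only for illustration.
var samples = ['1', '-98.6', '132.21.86.100', '123.45E-67', '12.', '.5'];
var outcomes = samples.map(function (sample) {
    return parse_number.test(sample);
});
// outcomes is [true, true, false, true, true, false]
// '123.45E-67' passes because of the i flag; '.5' fails because \d+ needs
// at least one digit before the decimal point.

// No capturing groups, so the result array holds only the whole match.
var result_length = parse_number.exec('-98.6').length;   // 1
```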

Construction

Three flags are covered here: i means insensitive (ignore the character case), g means global (match multiple times) and m means multiline (^ and $ can also match at line-ending characters)
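The effect of each flag can be sketched as follows. The sample strings are made up for illustration.

```javascript
// i: ignore case
var case_blind = /hello/i.test('HeLLo');        // true

// g: find every match instead of stopping at the first
var first_number = 'a1b22c333'.match(/\d+/);    // first_number[0] is '1'
var all_numbers = 'a1b22c333'.match(/\d+/g);    // ['1', '22', '333']

// m: ^ and $ also match at line boundaries inside the text
var lines = 'one\ntwo';
var whole_text = /^\w+$/.test(lines);           // false: the newline gets in the way
var per_line = lines.match(/^\w+$/gm);          // ['one', 'two']
```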

Two ways to build a regular expression:

  1. Regular Expression literals as per the examples above start and end with a slash /
    • Here the flags are appended after the final slash, for example /i
    • Be careful: in ES3 engines, RegExp objects made by the same regular expression literal shared a single instance (since ES5, each evaluation of a literal creates a new object)
  2. Use RegExp constructor
    • The first parameter is the string to be made into a RegExp object, the second is the flags
    • Useful when all information for creating the regular expression is not available at time of programming
    • Backslashes already mean something in string literals, so in the string passed to the constructor they must be doubled and quotes must be escaped
//example creating a regular expression object that matches a JavaScript string

var my_regexp = new RegExp("'(?:\\\\.|[^\\\\\\'])*'", 'g');
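To make the doubling rule concrete, here is a small comparison of a literal and the equivalent constructor call, plus a pattern assembled at run time. The variable names and sample inputs are assumptions for this sketch.

```javascript
// The same pattern written both ways.
var from_literal = /\d+\.\d+/g;
var from_constructor = new RegExp('\\d+\\.\\d+', 'g');
var same_source = from_literal.source === from_constructor.source;   // true

// Hypothetical input, used only for illustration.
var input = 'pi 3.14, e 2.71';
var found = input.match(from_constructor);   // ['3.14', '2.71']

// Forgetting to double the backslash: the string '\d+' is just 'd+'.
var undoubled = new RegExp('\d+');
var undoubled_source = undoubled.source;     // 'd+'

// Building a pattern from a value only known at run time.
var unit = 'px';                             // hypothetical run-time value
var size = new RegExp('^\\d+' + unit + '$');
var size_ok = size.test('12px');             // true
var size_bad = size.test('12em');            // false
```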

Elements

Regexp Choice

| provides a match if any of the sequences provided match.

In "into".match(/in|int/);, the in will be a match so it doesn't even look at the int.
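The order of the alternatives matters, which this short sketch tries to show by swapping them around.

```javascript
// The first alternative that matches wins, even if a later one is longer.
var short_first = 'into'.match(/in|int/)[0];   // 'in'
var long_first = 'into'.match(/int|in/)[0];    // 'int'
```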

Regexp Sequence

A regexp sequence is made up of one or more regexp factors. If there are no quantifiers after the factor (like ?, * or +), the factor will be matched one time.

Regexp Factor

A regexp factor can be a character, a parenthesized group, a character class, or an escape sequence.

It's essentially one portion of the full regexp, like the pieces the first example above was broken down into.

  • The following special characters must all be escaped with a backslash \ to be taken literally, or they will take on an alternative meaning: \ / [ ] ( ) { } ? + * | . ^ $
  • The \ prefix does not make letters or digits literal
  • When unescaped:
    • . matches any character except line-ending
    • ^ matches the beginning of the text when the lastIndex property is zero, and can also match just after a line-ending character when the m flag is present
    • Having ^ inside a character class means NOT, so [^0-9] means does not match a digit
    • $ matches the end of the text, and can also match just before a line-ending character when the m flag is present
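A few one-line checks that sketch the difference between escaped and unescaped special characters. The sample strings are made up for illustration.

```javascript
// Unescaped . matches any character except a line ending.
var dot_any = /a.c/.test('abc');            // true
var dot_newline = /a.c/.test('a\nc');       // false

// Escaped \. matches only a literal full stop.
var dot_literal_miss = /a\.c/.test('abc');  // false
var dot_literal_hit = /a\.c/.test('a.c');   // true

// ^ inside a character class negates it.
var has_non_digit = /[^0-9]/.test('12a');   // true
var only_digits = /[^0-9]/.test('123');     // false

// Escaped \$ is a literal dollar sign rather than the end-of-text anchor.
var price = /\$\d+/.exec('cost: $42')[0];   // '$42'
```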

Regexp Escape

As well as escaping special characters in regexp factors, the backslash has additional uses:

  • As in strings, \f is the formfeed character, \n is new line, \r is carriage return, \t is tab and \u specifies Unicode as a 16-bit hex. But \b is not a backspace character
  • \d === [0-9] and \D is the opposite, NOT (^) a digit, [^0-9]
  • \s matches a partial set of Unicode whitespace characters and \S is the opposite
  • \w === [0-9A-Z_a-z] and \W === [^0-9A-Z_a-z] but useless for any real world language (because of accents on letters, etc)
  • \1 refers to the text captured in group 1 so it is matched again later on in the regexp
    • \2 refers to group 2, \3 to group 3 and so on

\b is a bad part. It was supposed to be a word-boundary anchor, but it uses \w to find word boundaries, so it is useless for multilingual applications
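The escapes above can be tried out with a short sketch like this one. The sample strings and the doubled_words name are assumptions for illustration, and the letter class is deliberately limited to ASCII.

```javascript
// \d and \D
var all_digits = /^\d+$/.test('2024');           // true
var any_non_digit = /\D/.test('2024');           // false

// \w covers ASCII letters, digits and the underscore, but not accented letters.
var snake = /^\w+$/.test('snake_case_1');        // true
var accented = /^\w+$/.test('caf\u00e9');        // false

// \1 must match the same text that group 1 captured.
var doubled_words = /([A-Za-z]+)\s+\1/gi;
var repeats = 'it was was a very very nice trip'.match(doubled_words);
// repeats is ['was was', 'very very']
```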

Regexp Group

Four kinds of groups:

  • Capturing: (...) where each group is captured into the result array - the first capturing group in the regexp goes into result[1], the second into result[2] and so on
  • Noncapturing: (?:...) where the text is matched, but not captured and saved anywhere, making it slightly faster than a capturing group (has no bearing on numbering of capturing groups)
  • Positive lookahead, a bad part: (?=...) acts like a noncapturing group except that after the match is made, it goes back to where the group's text started
  • Negative lookahead, a bad part: (?!...) is like a positive lookahead but only matches if there is no match with what is in it
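A minimal sketch of the four kinds of group, using made-up sample strings. It only illustrates the behaviour; it is not a recommendation to use lookahead.

```javascript
// Capturing inside noncapturing: only the inner group is numbered.
var said = /(?:Bob says: (\w+))/.exec('Bob says: hello');
var said_word = said[1];        // 'hello'
var said_length = said.length;  // 2: the whole match plus one capture

// Positive lookahead: 'px' must follow, but is not part of the match.
var pixels = /\d+(?=px)/.exec('12px')[0];              // '12'

// Negative lookahead: reject text that starts with 'admin'.
var guest_ok = /^(?!admin)\w+$/.test('guest');         // true
var admin_ok = /^(?!admin)\w+$/.test('admin_user');    // false
```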

Regexp Class

  • Conveniently and easily specifies one of a set of characters using square brackets [], for example vowels: [aeiou]
  • Can shorten specification of all 32 ASCII special characters to [!-\/:-@[-`{-~]
  • Also allows ^ as the first character after the opening [ to mean NOT the characters in the character set
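The character classes above can be checked with a short sketch. The loop bounds assume the printable ASCII range (codes 33 to 126), and the [ inside the class is escaped here as a precaution.

```javascript
// Range-based class for the ASCII special characters.
var special = /[!-\/:-@\[-`{-~]/;

// Count how many printable ASCII characters the class accepts.
var count = 0;
for (var code = 33; code <= 126; code += 1) {
    if (special.test(String.fromCharCode(code))) {
        count += 1;
    }
}
// count is 32

// A simple class and its negation.
var vowel_count = 'regular expression'.match(/[aeiou]/g).length;   // 7
var vowels_only = 'regular'.replace(/[^aeiou]/g, '');              // 'eua'
```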

Regexp Class Escape

There are specific characters that must be escaped in a character class: - / [ \ ] ^

Regexp Quantifier

A quantifier at the end of a factor indicates how many times the factor should be matched

  • A number in curly braces means the factor should match that many times, so /o{3}/ matches ooo
  • Two comma-separated numbers in curly braces provide the range of times a factor should match, so {3,5} indicates it will match 3, 4 or 5 times
  • Zero or one times (same thing as saying something is optional) can be ? or {0,1}
  • Zero or more times can be * or {0,}
  • One or more times can be + or {1,}

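Matching is greedy by default: a quantifier matches as many repetitions as possible, up to its limit. An extra ? suffix on a quantifier (for example *? or +?) makes it lazy, so it matches as few repetitions as possible. Prefer greedy matching over lazy matching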
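The quantifiers can be compared side by side in a short sketch. The sample strings are assumptions, and lazy matching is written here with an extra ? after the quantifier.

```javascript
// Exact count and range
var exactly_three = /^o{3}$/.test('ooo');       // true
var in_range = /^o{3,5}$/.test('oooo');         // true
var too_many = /^o{3,5}$/.test('oooooo');       // false

// ?, * and +
var optional = /^colou?r$/.test('color');       // true
var zero_or_more = /^ab*$/.test('a');           // true
var one_or_more = /^ab+$/.test('a');            // false

// Greedy takes as much as it can; lazy takes as little as it can.
var greedy = '<b>bold</b>'.match(/<.+>/)[0];    // '<b>bold</b>'
var lazy = '<b>bold</b>'.match(/<.+?>/)[0];     // '<b>'
```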
