Regular Expressions

References

Engines

Powerful engines:

Open-source regex engine, built into PHP for instance

Less powerful engines:

Use extended regular expressions (switch -r) so that the meta-characters (){} have their special meaning when unescaped.

Character Classes

Class    Meaning                                    Comment
[ae]     Matches a or e
[a-z]    Matches any char in range a...z
[^a-z]   Matches any char not in range a...z
\d       Digit                                      Equivalent to [0-9]
\w       Word character                             Equivalent to [A-Za-z0-9_]
\s       Whitespace character                       Equivalent to [ \t\r\n\f\v], same as [[:space:]]
\D       Negated \d, i.e. [^\d]
\W       Negated \w, i.e. [^\w]
\S       Negated \s, i.e. [^\s]

About negated classes: note that [\D\S] is not the same as [^\d\s]. The latter matches only characters that are neither a digit nor whitespace. The former matches any character that is either not a digit or not whitespace, i.e. it matches any character at all, since no character is both a digit and whitespace.
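The difference can be checked quickly. The sketch below uses Python's re module, which is an assumption on my side (the page mostly uses sed and PCRE); the variable names are made up for the illustration.

```python
import re

sample = "a1 \t"  # a letter, a digit, a space and a tab

# [\D\S]: "not a digit" OR "not whitespace" -> every character qualifies
union_of_negations = re.findall(r"[\D\S]", sample)

# [^\d\s]: neither a digit nor whitespace -> only the letter qualifies
negated_union = re.findall(r"[^\d\s]", sample)

print(union_of_negations)
print(negated_union)
```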

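Bug: in sed 4.2.2, the malformed range [a-Z] matches any lowercase or uppercase letter, whereas [A-z], although valid in ASCII order, is rejected. This most likely comes from the locale's collation order rather than from sed itself: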

echo 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ@?>=<;:0123456789/;-,+*()[]^_{|}~' | sed 's/[a-Z]/%/g'
# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%@?>=<;:0123456789/;-,+*()[]^_{|}~

echo 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ@?>=<;:0123456789/;-,+*()[]^_{|}~' | sed 's/[A-z]/%/g'
# sed: -e expression #1, char 11: Invalid range end

Zero-length matches

The regexes in this section are zero-length, meaning they match a zero-length string, either because they match a position in the string (such as the beginning-of-line and end-of-line anchors), or because the matched text is discarded after evaluation (like assertions, which only yield a boolean result: matched or not matched).

Anchors:

Anchor   Meaning                        Comment
^        Beginning-of-line anchor
$        End-of-line anchor
\b       Word boundary anchor
\B       Negated word boundary anchor
\<       Start-of-a-word anchor         GNU extension
\>       End-of-a-word anchor           GNU extension
\K       Start match here               Perl extension
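To see that anchors really match positions and not characters, one can list the spans they match. This is a small sketch using Python's re module (an assumption, since \<, \> and \K are not available there); only \b and \B are shown.

```python
import re

text = "ab cd"

# Each span is (start, end); start == end means the match is zero-length.
boundaries = [m.span() for m in re.finditer(r"\b", text)]
non_boundaries = [m.span() for m in re.finditer(r"\B", text)]

print(boundaries)      # positions around each word
print(non_boundaries)  # positions inside a word
```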

Assertions:

Assertion    Meaning                         Comment
(?=regex)    Lookahead positive assertion    e.g. \b(?=\w{0,3}cat)\w{6}\b, matches locate
(?!regex)    Lookahead negative assertion    e.g. \b(?!\w{0,3}cat)\w{6}\b, matches relica but not locate
(?<=regex)   Lookbehind positive assertion
(?<!regex)   Lookbehind negative assertion
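The two lookahead examples from the table can be tried as-is in Python's re module (chosen here as an assumption; PCRE behaves the same for these patterns). The lookbehind part uses a made-up price string, purely for illustration.

```python
import re

words = "locate relica"

# Lookahead: 6-letter words that do / do not contain "cat" within the first 6 letters
with_cat = re.findall(r"\b(?=\w{0,3}cat)\w{6}\b", words)
without_cat = re.findall(r"\b(?!\w{0,3}cat)\w{6}\b", words)

prices = "cost $42 and 17"  # hypothetical sample text

# Lookbehind: numbers that are / are not preceded by a dollar sign
after_dollar = re.findall(r"(?<=\$)\d+", prices)
not_after_dollar = re.findall(r"(?<!\$)\b\d+", prices)

print(with_cat, without_cat, after_dollar, not_after_dollar)
```

In each case the asserted text ("cat", "$") is tested but is not part of the returned match.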

A negative assertion can be used to invert a regex match: ^(?!.*<REGEX_HERE>) matches (with a zero-length match) every line that does not contain a match for <REGEX_HERE>.
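As a sketch of this inversion, the snippet below keeps only the lines that do not contain the word error. It assumes Python's re module, and the sample lines are invented for the example.

```python
import re

lines = ["all good", "fatal error", "warning only"]  # hypothetical input

# ^(?!.*error) succeeds only where "error" does not appear later on the line
inverted = re.compile(r"^(?!.*error)")
lines_without_error = [line for line in lines if inverted.match(line)]

print(lines_without_error)
```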

Examples

Sed - The list below is for extended regular expressions (switch -r).

Regexp             Description
.                  Match any character
gray|grey          Match gray or grey
gr(a|e)y           Match gray or grey
gr[ae]y            Match gray or grey
file[^0-2]         Match file3 or file4, but not file0, file1, file2.
colou?r            (zero or one) - Match color or colour.
ab*c               (zero or more) - Match ac, abc, abbc, ....
ab+c               (one or more) - Match abc, abbc, abbbc, ....
a{3,5}             (at least 3 and not more than 5 times) - Match aaa, aaaa, aaaaa.
^on single line$   (start and end of line) - Match the text on single line when it is alone on a line.
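The entries above can be verified mechanically. The following is a minimal sketch in Python rather than sed (an assumption; these particular patterns behave the same in both). The CASES table and the check helper are hypothetical names introduced for the example.

```python
import re

# pattern -> (strings that should match in full, strings that should not)
CASES = {
    r"gray|grey":  (["gray", "grey"], ["groy"]),
    r"gr(a|e)y":   (["gray", "grey"], ["groy"]),
    r"gr[ae]y":    (["gray", "grey"], ["groy"]),
    r"file[^0-2]": (["file3", "file4"], ["file0", "file1", "file2"]),
    r"colou?r":    (["color", "colour"], ["colouur"]),
    r"ab*c":       (["ac", "abc", "abbc"], ["adc"]),
    r"ab+c":       (["abc", "abbc", "abbbc"], ["ac"]),
    r"a{3,5}":     (["aaa", "aaaa", "aaaaa"], ["aa", "aaaaaa"]),
}

def check(cases):
    """Return the (pattern, string) pairs that behave unexpectedly."""
    failures = []
    for pattern, (good, bad) in cases.items():
        failures += [(pattern, s) for s in good if not re.fullmatch(pattern, s)]
        failures += [(pattern, s) for s in bad if re.fullmatch(pattern, s)]
    return failures

print(check(CASES))  # an empty list means every entry behaves as described
```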

Tips

The best regex tricks: match a but not "a"

Reference: http://www.rexegg.com/regex-best-trick.html

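To match Tarzan but not "Tarzan", use the regex below and keep only the matches where capture group 1 is set (the quoted form is matched too, but it leaves group 1 empty, so it can be discarded):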

"Tarzan"|(Tarzan)

In the same way, we can add more variations to ignore, like Tarzania and --Tarzan--:

Tarzania|--Tarzan--|"Tarzan"|(Tarzan)

In pseudo-regex, this gives:

NotThis|NotThat|GoAway|(WeWantThis)
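Here is a small sketch of the trick in Python (an assumption; the referenced article covers several languages). The sample sentence is made up. Only matches where group 1 took part are kept.

```python
import re

pattern = re.compile(r'"Tarzan"|(Tarzan)')
text = 'Tarzan met "Tarzan" and Tarzan'  # hypothetical input

# The quoted form is consumed by the left alternative and leaves group 1 unset,
# so filtering on group 1 keeps only the unquoted occurrences.
wanted_positions = [
    m.start(1) for m in pattern.finditer(text) if m.group(1) is not None
]

print(wanted_positions)
```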

Alternatively, to delete the wrong matches, we change it to

(KeepThis|KeepThat|KeepTheOther)|DeleteThis
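The deletion variant can be sketched with a replacement callback: whatever was captured in group 1 is put back, anything else is dropped. Python's re.sub is assumed here, and the sample sentence is invented.

```python
import re

text = 'Tarzan met "Tarzan" and Tarzan'  # hypothetical input

# Group 1 holds what we keep; the bare alternative is what gets deleted.
cleaned = re.sub(r'("Tarzan")|Tarzan', lambda m: m.group(1) or "", text)

print(repr(cleaned))
```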

Search whole words

Use \b [1]:

echo "bar embarassment" | sed "s/\bbar\b/no bar/g"
# no bar embarassment
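The same substitution can be written in Python, where \b has the same meaning (a sketch, assuming the re module):

```python
import re

# \b on both sides: "bar" is replaced only when it is a whole word
result = re.sub(r"\bbar\b", "no bar", "bar embarassment")

print(result)
```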

Match braced expressions

sed -r 's/f\([^()]*\)/OK/' file.txt                           # Match f() and f(...)
sed -r 's/f\(([^()]|\([^()]*\))*\)/OK/' file.txt              # 1-level nesting - also matches f(...(...)...)
sed -r 's/f\(([^()]|\(([^()]|\([^()]*\))*\))*\)/OK/' file.txt # 2-level nesting - also matches f(...(...(...)...)...)
# To add more levels,
# replace:          =======================
# with:    ====================================
# 
# ... or add '\(([^()]|' as prefix and ')*\)' as suffix.
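The prefix/suffix rule lends itself to generating the pattern for any nesting depth. The sketch below does this in Python (an assumption); nested_call_regex is a hypothetical helper, and it uses non-capturing groups (?:...) where the sed version uses plain groups.

```python
import re

def nested_call_regex(depth):
    """Build a regex matching f(...) with up to `depth` levels of nested parentheses."""
    inner = r"\([^()]*\)"  # level 0: no nested parentheses
    for _ in range(depth):
        # wrap the previous level: prefix '\((?:[^()]|' and suffix ')*\)'
        inner = r"\((?:[^()]|" + inner + r")*\)"
    return "f" + inner

print(nested_call_regex(1))
print(bool(re.fullmatch(nested_call_regex(2), "f(a(b(c)))")))
```

A regex built this way still has a fixed maximum depth; arbitrary nesting needs a recursive pattern or a real parser.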

Empty regular expression

In sed, the empty regular expression (// or s//.../) reuses the last regular expression that was used, without repeating it [2].

The capture groups of the reused regexp are remembered as well:

sed -ri '0,/^#include ([<"])/{s//#include  \1/;}'  # Here \1 holds either < or "

Document regular expression

  • In PCRE, one can use (?x) to ignore whitespace and allow comments in the pattern. This makes it possible to write:
    (?x)
    ^\w+            # mandatory leading letters
    ( [-+.'] \w+ )* # optional suffix
    @
    \w+             # domain
    ( [-.] \w+ )*   # domain suffix
    ( \.\w+ ( [-.] \w+ )* )* #tld
    $
  • Alternatively, one can split the regex into named sub-expressions. For instance in Python (raw strings keep the backslashes intact, and the f-string assembles the parts):
mandatory_leading_letters = r"^\w+"
optional_suffix = r"([-+.']\w+)*"
domain = r"\w+"
domain_suffix = r"([-.]\w+)*"
tld = r"\.\w+([-.]\w+)*$"
regex = f"{mandatory_leading_letters}{optional_suffix}@{domain}{domain_suffix}{tld}"
The resulting regex string can then be passed to the re module, e.g. re.match(regex, address).
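Python also offers an equivalent of (?x) through the re.VERBOSE flag, so the commented layout of the first bullet can be kept as-is. This is a sketch under that assumption; EMAIL is a made-up name, and the pattern is the one from this page, not a complete e-mail validator.

```python
import re

EMAIL = re.compile(r"""
    ^\w+                           # mandatory leading letters
    (?: [-+.'] \w+ )*              # optional suffix
    @
    \w+                            # domain
    (?: [-.] \w+ )*                # domain suffix
    (?: \.\w+ (?: [-.] \w+ )* )*   # tld
    $
""", re.VERBOSE)

print(bool(EMAIL.match("john.doe@example.com")))
print(bool(EMAIL.match("no-at-sign")))
```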

Match an expression in the middle

Say we want to match the time in string tests passed in 1.78s.

tests passed in 1.78s
                ====

We can use this regular expression:

echo "tests passed in 1.78s" | grep -oP 'in \K([0-9.]+)(?=s)'

This uses the match-starts-here anchor (\K) and a positive lookahead assertion.
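Python's built-in re module has no \K, so a port of this one-liner has to use something else. The sketch below shows two possible substitutes (both are my assumption of a reasonable equivalent, not the page's method): a lookbehind, and a plain capture group.

```python
import re

line = "tests passed in 1.78s"

# Variant 1: lookbehind replaces \K, lookahead is kept
elapsed_lookaround = re.search(r"(?<=in )[0-9.]+(?=s)", line).group()

# Variant 2: match the context too, but read only the capture group
elapsed_group = re.search(r"in ([0-9.]+)s", line).group(1)

print(elapsed_lookaround, elapsed_group)
```

The lookbehind variant only works because "in " has a fixed length; re rejects variable-length lookbehinds.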

Regex Golf

My solutions so far (see here for other scores [3]):

Plain strings (207)   foo
Anchors (208)         k$
Ranges (202)          ^[a-f]+$
Backrefs (201)        (...).*\1
Abba (190)            ^(?!.*(.)(.)\2\1).*$
Abba (193)            ^(?!.*(.)(.)\2\1)
A man, a plan (176)   ^(.)(.).*\2\1$
A man, a plan (177)   ^(.)[^p].*\1$
Prime (232)           ^(xx|xxx|x{5}|x{7}|x{11}|x{13}|x{17}|x{19}|x{23}|x{29}|x{31})$|x{33}
  From Reddit:
    Prime (286)       ^(?!(xx+)\1+$)                  (ie. cannot be twice or more times a number >=2)
Four (198)            (.).\1.\1.\1
Four (199)            (.)(.\1){3}
Order (156)           ^a?b?c?c?d?e?e?f?g?h?i?l?l?m?n?o?o?p?r?s?s?t?t?y?w?z?$
  From Reddit:
    Order (196)       ^[a-f]*[g-z]*$
    Order (199)       ^.{5}[^e]?$                     (obvious cheat actually)
Triples (570)         (6|[56]0|31|12|24|[48]7|58|0[0249]|7[258]|003|015|303|9005)$
Glob (362)            ^(([^*]+) .+ \2|([^*]+)\* .+ \3.*)$|err|fal|log|tud|aio|sy
  From Reddit:
    Glob (389)        ([rlwpc][dplroy]|[lpg]e[an]).*t (pure cheat, not obvious)
Balance (283)         ^(<(<(<(<(<(<(<>)*>)*>)*>)*>)*>)*>)*$
Balance (286)         ^<<>><|^(<(<(<(<(<>)*>)*>)*>)*>)*$
Powers (60)           ^(((((((((xx?)\9?)\8?)\7?)\6?)\5?)\4?)\3?)\2?)\1?$
Powers (72)           ^(?!((xx)+x|(x{24})+|x{28}|x{160})$)x*
Powers (76)           ^(?!((xx)+x|x{28}|x{48}|(x{5})+)$)
  From Reddit:
    Powers (93)       ^(?!(x(xx)+)\1*$)               (ie. 2^n cannot be a multiple of an odd number >=3)
Long count (216)      ^0+ 0+1 0010 0011 0100 0101 0110 01+ 10+ 1001 1010 101
  From Reddit:
    Long count (253)  ((.+)0 \2+1 ?){8}               (ie. something|'0' same thing|'1', this 8 times)
Long count v2 (216)   ^0+ 0+1 0010 0011 0100 0101 0110 01+ 10+ 1001 1010 101
  From Reddit:
    Long count (253)  ((.+)0 \2+1 ?){8}               (ie. something|'0' same thing|'1', this 8 times)
Alphabetical (289)    [rs]er$|^([er]|ass).*s$|a t|e e|n r|rt r|ne t|ar ta

Total 4005
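The two lookahead-based solutions from Reddit (Prime and Powers) can be checked on unary strings. The sketch below assumes Python's re module; unary_matches is a hypothetical helper, and the range starts at 2 because both patterns also accept the empty string and a single x.

```python
import re

PRIME = re.compile(r"^(?!(xx+)\1+$)")            # length is not k * m with k, m >= 2
POWER_OF_TWO = re.compile(r"^(?!(x(xx)+)\1*$)")  # length has no odd divisor >= 3

def unary_matches(pattern, limit):
    """Return every n in [2, limit] such that 'x' * n matches the pattern."""
    return [n for n in range(2, limit + 1) if pattern.match("x" * n)]

print(unary_matches(PRIME, 30))
print(unary_matches(POWER_OF_TWO, 40))
```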