Regular Expressions: Difference between revisions

Latest revision as of 13:01, 27 June 2024

References

Regular-Expressions.info, The Premier website about Regular Expressions
Regular expression on Wikipedia

Engines

Powerful engines:

Perl
PCRE

Open-source regex engine, used by PHP for instance

Less powerful engines:

sed

Use extended regular expressions (switch -r, or -E) so that the meta-characters ( ) { } have their special meaning when unescaped.

grep

Character Classes

Class	Meaning	Comment
`[ae]`	Matches a or e
`[a-z]`	Matches any char in range a...z
`[^a-z]`	Matches any char not in range a...z
`\d`	Digit - Equivalent to `[0-9]`
`\w`	Word character - Equivalent to `[A-Za-z0-9_]`
`\s`	Whitespace character - Equivalent to `[ \t\r\n\f]` (some engines also include `\v`)	similar to `[[:space:]]`; `[[:blank:]]` only matches space and tab
`\D`	Negated \d, i.e. `[^\d]`
`\W`	Negated \w, i.e. `[^\w]`
`\S`	Negated \s, i.e. `[^\s]`

About negated classes, note that [\D\S] is not the same as [^\d\s]. The latter matches a character that is neither a digit nor a whitespace. The former matches any character that is either not a digit or not a whitespace; since no character is both a digit and a whitespace, it matches any character.
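The difference can be checked with a short sketch. Python's re module is used here only as an assumption of convenience (it accepts \D and \S inside brackets), and the sample string is made up for the illustration:

```python
import re

sample = "a1 _"  # a letter, a digit, a space and an underscore

# [\D\S]: "not a digit" OR "not a whitespace", so every character qualifies
either = re.findall(r"[\D\S]", sample)

# [^\d\s]: neither a digit nor a whitespace
neither = re.findall(r"[^\d\s]", sample)

print(either)   # all four characters
print(neither)  # only the letter and the underscore
```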

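Bug: in sed 4.2.2, the malformed regex [a-Z] matches any lowercase or uppercase letter, whereas [A-z], although a valid range in ASCII order, is rejected. Range expressions depend on the collation order of the current locale, so this is probably a locale effect; [A-Za-z] or [[:alpha:]] avoids the ambiguity: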

echo 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ@?>=<;:0123456789/;-,+*()[]^_{|}~' | sed 's/[a-Z]/%/g'
# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%@?>=<;:0123456789/;-,+*()[]^_{|}~

echo 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ@?>=<;:0123456789/;-,+*()[]^_{|}~' | sed 's/[A-z]/%/g'
# sed: -e expression #1, char 11: Invalid range end
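For comparison, here is how the same two ranges behave in an engine that orders characters by code point. This is a sketch in Python, not a reproduction of the sed behaviour above, which depends on the locale; the test strings are made up:

```python
import re

# In code-point order, A-z also spans the six characters between Z and a
extras = re.findall(r"[A-z]", "[\\]^_`")

# a-Z runs backwards in code-point order, so the pattern is rejected
try:
    re.compile(r"[a-Z]")
    backwards_ok = True
except re.error:
    backwards_ok = False

# Unambiguous way to match ASCII letters only
letters = re.findall(r"[A-Za-z]", "aZ_9")

print(extras, backwards_ok, letters)
```

So even where [A-z] is accepted, it matches more than letters, which is one more reason to write [A-Za-z].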

Zero-length matches

The regexes in this section are zero-length: they match an empty string, either because they match a position in the input (such as the beginning-of-line or end-of-line anchors), or because the text they examine is not kept in the match (like assertions, which only yield a boolean: matched or not matched).

Anchors:

Anchor	Meaning	Comment
`^`	Beginning-of-line anchor
`$`	End-of-line anchor
`\b`	Word boundary anchor
`\B`	Negated word boundary anchor
`\<`	Start-of-a-word anchor	GNU extensions
`\>`	End-of-a-word anchor	GNU extensions
`\K`	Start match here (what was matched before it is dropped from the result)	Perl/PCRE extension

Assertions:

Assertion	Meaning	Comment
`(?=regex)`	Lookahead positive assertion	e.g. `\b(?=\w{0,3}cat)\w{6}\b`, matches locate
`(?!regex)`	Lookahead negative assertion	e.g. `\b(?!\w{0,3}cat)\w{6}\b`, matches relica but not locate
`(?<=regex)`	Lookbehind positive assertion
`(?<!regex)`	Lookbehind negative assertion
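The two lookahead examples from the table can be tried as follows. This is a minimal sketch in Python; the word list is an assumption (catnip is added to the words used in the table):

```python
import re

words = "locate relica catnip"

# Six-letter words with "cat" starting within the first four letters
positive = re.findall(r"\b(?=\w{0,3}cat)\w{6}\b", words)

# Six-letter words without "cat" in that position
negative = re.findall(r"\b(?!\w{0,3}cat)\w{6}\b", words)

print(positive)
print(negative)
```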

A negative lookahead can be used to invert a regex match: ^(?!.*<REGEX_HERE>) matches (at its start) every line that does not contain a match for <REGEX_HERE>.
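A possible use of this inversion, sketched in Python with "error" standing in for <REGEX_HERE>; the log lines are invented for the example:

```python
import re

lines = ["build ok", "test error: timeout", "deploy ok"]

# Succeeds only at the start of a line that contains no "error" anywhere
no_error = re.compile(r"^(?!.*error)")

kept = [line for line in lines if no_error.search(line)]
print(kept)
```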

Examples

Sed - The list below uses extended regular expressions (switch -r).

Regexp	Description
`.`	Match any character
`gray|grey`	Match gray or grey
`gr(a|e)y`	Match gray or grey
`gr[ae]y`	Match gray or grey
`file[^0-2]`	Match file3 or file4, but not file0, file1, file2.
`colou?r`	(zero or one) - Match color or colour.
`ab*c`	(zero or more) - Match ac, abc, abbc, ....
`ab+c`	(one or more) - Match abc, abbc, abbbc, ....
`a{3,5}`	(at least 3 and not more than 5 times) - Match aaa, aaaa, aaaaa.
`^on single line$`	(start and end of line) - Match a line that consists exactly of on single line.
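Some rows of the table can be verified mechanically. The sketch below uses Python, on the assumption that these particular patterns mean the same there as in sed -r; the counter-examples are made up:

```python
import re

# Pattern -> strings the table says it matches
cases = {
    r"gr[ae]y": ["gray", "grey"],
    r"colou?r": ["color", "colour"],
    r"ab*c": ["ac", "abc", "abbc"],
    r"ab+c": ["abc", "abbc", "abbbc"],
    r"a{3,5}": ["aaa", "aaaa", "aaaaa"],
}
all_match = all(
    re.fullmatch(pattern, s) for pattern, strings in cases.items() for s in strings
)

# A few strings that must NOT match as a whole
counter_examples = [(r"ab+c", "ac"), (r"a{3,5}", "aa"), (r"file[^0-2]", "file1")]
none_match = not any(re.fullmatch(p, s) for p, s in counter_examples)

print(all_match, none_match)
```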

Tips

The best regex tricks: match a but not "a"

Reference: http://www.rexegg.com/regex-best-trick.html

To match Tarzan but not "Tarzan", use the regex below and keep only the matches for which capture group 1 is set:

"Tarzan"|(Tarzan)

In the same way, to ignore more variations, such as Tarzania and --Tarzan--, add them as further alternatives in front of the group:

"Tarzan"|Tarzania|--Tarzan--|(Tarzan)

In pseudo-regex, this gives:

NotThis|NotThat|GoAway|(WeWantThis)

Alternatively, to delete the unwanted matches instead, capture what must be kept and replace every match with the content of group 1:

(KeepThis|KeepThat|KeepTheOther)|DeleteThis
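Both variants of the trick can be sketched in Python. The sample sentences are invented, and the excluded alternatives are the variations named above; this is one way to apply the trick, not the only one:

```python
import re

text = 'Tarzan met "Tarzan" in Tarzania, then --Tarzan-- left Tarzan.'

# Unwanted contexts come first; the wanted case is the only capture group
pattern = re.compile(r'"Tarzan"|Tarzania|--Tarzan--|(Tarzan)')

# Keep only the matches where group 1 took part
wanted = [m.start(1) for m in pattern.finditer(text) if m.group(1)]

# Deletion variant: keep what group 1 captured, drop everything else matched
cleaned = re.sub(r'("Tarzan")|Tarzan',
                 lambda m: m.group(1) or "",
                 'Tarzan says "Tarzan"')

print(wanted)
print(cleaned)
```

The list holds the start offsets of the two bare occurrences of Tarzan; in the second call only the quoted occurrence survives.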

Search whole words

Use \b (see https://stackoverflow.com/questions/1032023/sed-whole-word-search-and-replace):

echo "bar embarassment" | sed "s/\bbar\b/no bar/g"
# no bar embarassment
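The same substitution in Python, with the version without \b shown for contrast (a sketch; Python's \b is assumed to behave like the one in GNU sed for this ASCII input):

```python
import re

line = "bar embarassment"

# With word boundaries only the standalone word is replaced
whole_word = re.sub(r"\bbar\b", "no bar", line)

# Without them the "bar" inside the second word is replaced too
naive = re.sub(r"bar", "no bar", line)

print(whole_word)
print(naive)
```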

Match braced expressions

sed -r 's/f\([^()]*\)/OK/' file.txt                           # Match f() and f(...)
sed -r 's/f\(([^()]|\([^()]*\))*\)/OK/' file.txt              # 1-level nesting - also matches f(...(...)...)
sed -r 's/f\(([^()]|\(([^()]|\([^()]*\))*\))*\)/OK/' file.txt # 2-level nesting - also matches f(...(...(...)...)...)
# To add more levels,
# replace:          =======================
# with:    ====================================
# 
# ... or add '\(([^()]|' as prefix and ')*\)' as suffix.
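The prefix/suffix rule lends itself to generating the regex for any depth. A minimal sketch in Python follows; the function name nested_call_regex is hypothetical, and the generated patterns use syntax that is assumed to be common to Python and sed -r:

```python
import re

def nested_call_regex(depth):
    """Regex for f(...) allowing `depth` levels of nested parentheses."""
    inner = r"\([^()]*\)"  # a parenthesised group without nesting
    for _ in range(depth):
        # Wrap one more level: prefix '\(([^()]|' and suffix ')*\)'
        inner = r"\(([^()]|" + inner + r")*\)"
    return "f" + inner

print(nested_call_regex(1))
print(re.sub(nested_call_regex(1), "OK", "x = f(a, (b), c);"))
```

With depth 0, 1 and 2 the function returns the three expressions used in the sed commands above.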

Empty regular expression

In sed, the empty regular expression (// or s//.../) matches the last regular expression used, without repeating it (see https://www.gnu.org/software/sed/manual/html_node/Addresses.html).

The capture groups of the reused expression are remembered as well:

sed -ri '0,/^#include ([<"])/{s//#include  \1/;}' file.txt  # Here \1 holds either < or "
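Python has no shortcut for reusing the last regular expression, but the effect of the command above (rewrite only the first matching line) can be approximated with a count-limited substitution. A sketch on an in-memory string, with made-up header names:

```python
import re

source = '#include <a.h>\n#include "b.h"\n'

# count=1 plays the role of the 0,/regex/ address: only the first match changes
first_only = re.sub(r'^#include ([<"])', r'#include  \1', source,
                    count=1, flags=re.MULTILINE)

print(first_only)
```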

Document regular expression

In PCRE, the (?x) flag makes the engine ignore whitespace and #-comments in the pattern. This allows writing:

    (?x)
    ^\w+            # mandatory leading letters
    ( [-+.'] \w+ )* # optional suffix
    @
    \w+             # domain
    ( [-.] \w+ )*   # domain suffix
    ( \.\w+ ( [-.] \w+ )* )* #tld
    $
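Python's re module accepts the same (?x) flag (also available as re.VERBOSE), so the commented pattern can be used almost verbatim. A sketch, with test addresses invented for the example:

```python
import re

email = re.compile(r"""(?x)
    ^\w+                       # mandatory leading letters
    ( [-+.'] \w+ )*            # optional suffix
    @
    \w+                        # domain
    ( [-.] \w+ )*              # domain suffix
    ( \.\w+ ( [-.] \w+ )* )*   # tld
    $
""")

print(bool(email.match("john.doe@example.com")))
print(bool(email.match("not an email")))
```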

Alternatively, one can split the regex into named sub-expressions. For instance, in Python:

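# Raw strings keep the backslashes intact for the regex engine
mandatory_leading_letters = r"^\w+"
optional_suffix = r"([-+.']\w+)*"
domain = r"\w+"
domain_suffix = r"([-.]\w+)*"
tld = r"\.\w+([-.]\w+)*$"
# f-string: each {name} is replaced by the sub-expression defined above
regex = f"{mandatory_leading_letters}{optional_suffix}@{domain}{domain_suffix}{tld}"

The f-string assembles the sub-expressions, and the resulting regex can then be passed to the functions of the re module.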

Match an expression in the middle

Say we want to match the time in the string tests passed in 1.78s.

tests passed in 1.78s
                ====

We can use this regular expression:

echo "tests passed in 1.78s" | grep -oP 'in \K([0-9.]+)(?=s)'

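This uses the match-starts-here anchor (\K) to drop the leading "in " from the result, and a positive lookahead assertion ((?=s)) to require the trailing s without including it.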
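Python's standard re module does not support \K. Two substitutes are sketched below, a lookbehind and a plain capture group; both are alternatives to the grep command, not a translation of it:

```python
import re

line = "tests passed in 1.78s"

# Lookbehind instead of \K: "in " must precede the match but is not part of it
with_lookaround = re.search(r"(?<=in )[0-9.]+(?=s)", line).group()

# Or match the context as well and read the capture group
with_group = re.search(r"in ([0-9.]+)s", line).group(1)

print(with_lookaround, with_group)
```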

Regex Golf

http://regex.alf.nu/

My solutions so far (see here for other scores [3]):

Plain strings (207)   foo
Anchors (208)         k$
Ranges (202)          ^[a-f]+$
Backrefs (201)        (...).*\1
Abba (190)            ^(?!.*(.)(.)\2\1).*$
Abba (193)            ^(?!.*(.)(.)\2\1)
A man, a plan (176)   ^(.)(.).*\2\1$
A man, a plan (177)   ^(.)[^p].*\1$
Prime (232)           ^(xx|xxx|x{5}|x{7}|x{11}|x{13}|x{17}|x{19}|x{23}|x{29}|x{31})$|x{33}
 From Reddit:
   Prime (286)        ^(?!(xx+)\1+$)                  (ie. cannot be twice or more times a number >=2)
Four (198)            (.).\1.\1.\1
Four (199)            (.)(.\1){3}
Order (156)           ^a?b?c?c?d?e?e?f?g?h?i?l?l?m?n?o?o?p?r?s?s?t?t?y?w?z?$
  From Reddit:
    Order (196)       ^[a-f]*[g-z]*$
    Order (199)       ^.{5}[^e]?$                     (obvious cheat actually)
Triples (570)         (6|[56]0|31|12|24|[48]7|58|0[0249]|7[258]|003|015|303|9005)$
Glob (362)            ^(([^*]+) .+ \2|([^*]+)\* .+ \3.*)$|err|fal|log|tud|aio|sy
  From Reddit:
    Glob (389)        ([rlwpc][dplroy]|[lpg]e[an]).*t (pure cheat, not obvious)
Balance (283)         ^(<(<(<(<(<(<(<>)*>)*>)*>)*>)*>)*>)*$
Balance (286)         ^<<>><|^(<(<(<(<(<>)*>)*>)*>)*>)*$
Powers (60)           ^(((((((((xx?)\9?)\8?)\7?)\6?)\5?)\4?)\3?)\2?)\1?$
Powers (72)           ^(?!((xx)+x|(x{24})+|x{28}|x{160})$)x*
Powers (76)           ^(?!((xx)+x|x{28}|x{48}|(x{5})+)$)
  From Reddit:
    Powers (93)       ^(?!(x(xx)+)\1*$)               (ie. 2^n cannot be a multiple of an odd number >=3)
Long count (216)      ^0+ 0+1 0010 0011 0100 0101 0110 01+ 10+ 1001 1010 101
  From Reddit:
    Long count (253)  ((.+)0 \2+1 ?){8}               (ie. something|'0' same thing|'1', this 8 times)
Long count v2 (216)   ^0+ 0+1 0010 0011 0100 0101 0110 01+ 10+ 1001 1010 101
  From Reddit:
    Long count (253)  ((.+)0 \2+1 ?){8}               (ie. something|'0' same thing|'1', this 8 times)
Alphabetical (289)    [rs]er$|^([er]|ass).*s$|a t|e e|n r|rt r|ne t|ar ta

Total 4005
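The two lookahead-based solutions taken from Reddit (Prime and Powers) can be checked outside the game. This is a sketch in Python, assuming the inputs are strings made only of x; the tested ranges are arbitrary:

```python
import re

# Rejects lengths that are 2 or more copies of a block of length >= 2
prime = re.compile(r"^(?!(xx+)\1+$)")

# Rejects lengths that are a multiple of an odd number >= 3
power_of_two = re.compile(r"^(?!(x(xx)+)\1*$)")

primes = [n for n in range(2, 30) if prime.match("x" * n)]
powers = [n for n in range(1, 40) if power_of_two.match("x" * n)]

print(primes)
print(powers)
```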
