Regular Expressions: Difference between revisions

(Created page with '== References == * [http://www.regular-expressions.info/ Regular-Expressions.info], The Premier website about Regular Expressions * [http://en.wikipedia.org/wiki/Regular_expressi…')
 
 
(29 intermediate revisions by the same user not shown)
Line 29: Line 29:
|'''<code>\w</code>'''||'''Word character''' - Equivalent to '''<code>[A-Za-z0-9_]</code>'''||
|-
|'''<code>\s</code>'''||'''Whitespace character''' - Equivalent to '''<code>[ \t\r\n]</code>'''||close to POSIX '''<code><nowiki>[[:space:]]</nowiki></code>''', which also includes <code>\f</code> and <code>\v</code> (<code><nowiki>[[:blank:]]</nowiki></code> is only space and tab)
|-
|'''<code>\D</code>'''||'''Negated \d''', i.e. '''<code>[^\d]</code>'''||
Line 38: Line 38:
|-
|}
About '''negated classes''', note that '''<code>[\D\S]</code>''' is not the same as '''<code>[^\d\s]</code>'''. The latter does not match a character that is either a digit or a whitespace. The former matches any character that is either not a digit or not a whitespace, i.e. it matches any character, since no character is both a digit and a whitespace.

'''Bug''' &mdash; in ''sed 4.2.2'', the malformed regex <code>[a-Z]</code> matches any lowercase or uppercase character. <code>[A-z]</code>, although correct, is rejected:

<source lang=bash>
echo 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ@?>=<;:0123456789/;-,+*()[]^_{|}~' | sed 's/[a-Z]/%/g'
# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%@?>=<;:0123456789/;-,+*()[]^_{|}~

echo 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ@?>=<;:0123456789/;-,+*()[]^_{|}~' | sed 's/[A-z]/%/g'
# sed: -e expression #1, char 11: Invalid range end
</source>


== Zero-length matches ==
Line 59: Line 69:
|-
|'''<code>\&gt;</code>'''||'''End-of-a-word''' anchor||GNU extensions
|-
|'''<code>\K</code>'''||'''Start match here'''||PERL extensions
|}


Line 73: Line 85:
|'''<code>(?<!''regex'')</code>'''||'''Lookbehind negative''' assertion||
|}

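A negative lookahead can be used to '''invert a regex match''': <code>^(?!.*<REGEX_HERE>)</code> matches (with a zero-length match at the start of the line) every line that does not contain <code><REGEX_HERE></code>. Append <code>.*</code> to consume the whole line.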


== Examples ==
Line 101: Line 115:
|<tt>^on single line$</tt>||''(start and end of line)'' - Match ''on single line'' on a single line.
|}

== Tips ==
=== The best regex tricks: match a but not "a" ===
Reference: http://www.rexegg.com/regex-best-trick.html

To match <code>Tarzan</code> but not <code>"Tarzan"</code>, match both but capture only the wanted one; the wanted occurrences are the matches where group 1 is set:
<pre>
"Tarzan"|(Tarzan)
</pre>
In the same way, to ignore more variations, like <code>Tarzania</code> and <code>--Tarzan--</code>, add each of them as an alternative before the capture group:
<pre>
"Tarzan"|Tarzania|--Tarzan--|(Tarzan)
</pre>

In pseudo-regex, this gives:
<pre>
NotThis|NotThat|GoAway|(WeWantThis)
</pre>

Alternatively, to delete the unwanted matches, capture what must be kept and replace every match with group 1:
<pre>
(KeepThis|KeepThat|KeepTheOther)|DeleteThis
</pre>

=== Search whole words ===
Use <code>\b</code> [https://stackoverflow.com/questions/1032023/sed-whole-word-search-and-replace]:
<source lang="bash">
echo "bar embarrassment" | sed "s/\bbar\b/no bar/g"
# no bar embarrassment
</source>

=== Match braced expressions ===
<source lang=bash>
sed -r 's/f\([^()]*\)/OK/' file.txt # Match f() and f(...)
sed -r 's/f\(([^()]|\([^()]*\))*\)/OK/' file.txt # 1-level nesting - also matches f(...(...)...)
sed -r 's/f\(([^()]|\(([^()]|\([^()]*\))*\))*\)/OK/' file.txt # 2-level nesting - also matches f(...(...(...)...)...)
# To add one more level,
# replace:          =======================
# with:    ====================================
#
# ... or add '\(([^()]|' as prefix and ')*\)' as suffix.
</source>

=== Empty regular expression ===
In sed, the empty regular expression, as in <code>//</code> or <code>s//.../</code>, reuses the last regexp that was matched, without repeating it [https://www.gnu.org/software/sed/manual/html_node/Addresses.html].

Capture groups of the reused regexp are available too:
<source lang="bash">
sed -ri '0,/^#include ([<"])/{s//#include \1/;}' # Here \1 will match either < or "
</source>

=== Document regular expression ===
* In PCRE, the <code>(?x)</code> flag makes the engine ignore whitespace and <code>#</code> comments in the pattern. This allows writing:
<source lang="text">
(?x)
^\w+ # mandatory leading letters
( [-+.'] \w+ )* # optional suffix
@
\w+ # domain
( [-.] \w+ )* # domain suffix
( \.\w+ ( [-.] \w+ )* )* #tld
$
</source>
* Alternatively, one can split the regex into sub-expressions. For instance in Python:
<source lang="python">
mandatory_leading_letters = r"^\w+"
optional_suffix = r"([-+.']\w+)*"
domain = r"\w+"
domain_suffix = r"([-.]\w+)*"
tld = r"\.\w+([-.]\w+)*$"
# f-string: each {name} is replaced by the sub-expression defined above
regex = f"{mandatory_leading_letters}{optional_suffix}@{domain}{domain_suffix}{tld}"
</source>
:The combined <code>regex</code> can then be passed to the <code>re</code> module, e.g. <code>re.match(regex, text)</code>.

=== Match an expression in the middle ===
Say we want to match the time in string <code>tests passed in 1.78s</code>.

<source lang="text">
tests passed in 1.78s
====
</source>

We can use this regular expression:
<source lang="bash">
echo "tests passed in 1.78s" | grep -oP 'in \K([0-9.]+)(?=s)'
</source>

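This uses the '''match-starts-here''' anchor (<code>\K</code>), which drops <code>in </code> from the reported match, and a lookahead assertion (<code>(?=s)</code>), which requires the trailing <code>s</code> without including it.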

== Regex Golf ==
* http://regex.alf.nu/

My solutions so far (see here for other scores [http://www.reddit.com/r/programming/comments/1tb0go/regex_golf/]):
Plain strings (207) foo
Anchors (208) k$
Ranges (202) ^[a-f]+$
Backrefs (201) (...).*\1
Abba (190) ^(?!.*(.)(.)\2\1).*$
Abba (193) ^(?!.*(.)(.)\2\1)
A man, a plan (176) ^(.)(.).*\2\1$
A man, a plan (177) ^(.)[^p].*\1$
Prime (232) ^(xx|xxx|x{5}|x{7}|x{11}|x{13}|x{17}|x{19}|x{23}|x{29}|x{31})$|x{33}
From Reddit:
Prime (286) ^(?!(xx+)\1+$) (ie. cannot be twice or more times a number >=2)
Four (198) (.).\1.\1.\1
Four (199) (.)(.\1){3}
Order (156) ^a?b?c?c?d?e?e?f?g?h?i?l?l?m?n?o?o?p?r?s?s?t?t?y?w?z?$
From Reddit:
Order (196) ^[a-f]*[g-z]*$
Order (199) ^.{5}[^e]?$ (obvious cheat actually)
Triples (570) (6|[56]0|31|12|24|[48]7|58|0[0249]|7[258]|003|015|303|9005)$
Glob (362) ^(([^*]+) .+ \2|([^*]+)\* .+ \3.*)$|err|fal|log|tud|aio|sy
From Reddit:
Glob (389) ([rlwpc][dplroy]|[lpg]e[an]).*t (pure cheat, not obvious)
Balance (283) ^(<(<(<(<(<(<(<>)*>)*>)*>)*>)*>)*>)*$
Balance (286) ^<<>><|^(<(<(<(<(<>)*>)*>)*>)*>)*$
Powers (60) ^(((((((((xx?)\9?)\8?)\7?)\6?)\5?)\4?)\3?)\2?)\1?$
Powers (72) ^(?!((xx)+x|(x{24})+|x{28}|x{160})$)x*
Powers (76) ^(?!((xx)+x|x{28}|x{48}|(x{5})+)$)
From Reddit:
Powers (93) ^(?!(x(xx)+)\1*$) (ie. 2^n cannot be a multiple of an odd number >=3)
Long count (216) ^0+ 0+1 0010 0011 0100 0101 0110 01+ 10+ 1001 1010 101
From Reddit:
Long count (253) ((.+)0 \2+1 ?){8} (ie. something|'0' same thing|'1', this 8 times)
Long count v2 (216) ^0+ 0+1 0010 0011 0100 0101 0110 01+ 10+ 1001 1010 101
From Reddit:
Long count (253) ((.+)0 \2+1 ?){8} (ie. something|'0' same thing|'1', this 8 times)
Alphabetical (289) [rs]er$|^([er]|ass).*s$|a t|e e|n r|rt r|ne t|ar ta
Total 4005

Latest revision as of 13:01, 27 June 2024

References

Engines

Powerful engines:

Open source regex engine implemented into PHP for instance

Less powerful engines:

Use Extended regular expressions (switch -r) so that meta-characters (){} have their special meaning when unquoted.

Character Classes

Class    Meaning                                            Comment
[ae]     Matches a or e
[a-z]    Matches any char in range a...z
[^a-z]   Matches any char not in range a...z
\d       Digit - Equivalent to [0-9]
\w       Word character - Equivalent to [A-Za-z0-9_]
\s       Whitespace character - Equivalent to [ \t\r\n]     Close to POSIX [[:space:]], which also includes \f and \v ([[:blank:]] is only space and tab)
\D       Negated \d, i.e. [^\d]
\W       Negated \w, i.e. [^\w]
\S       Negated \s, i.e. [^\s]

About negated classes, note that [\D\S] is not the same as [^\d\s]. The latter does not match a character that is either a digit or a whitespace. The former matches any character that is either not a digit or not a whitespace, i.e. it matches any character, since no character is both a digit and a whitespace.
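The difference between the two classes can be checked with a small sketch. It uses Python's re module rather than sed, and the sample string is made up for illustration.

```python
import re

sample = "a1 _\t9"  # hypothetical sample: letter, digit, space, underscore, tab, digit

# [^\d\s]: a single negated class, i.e. neither a digit nor a whitespace
neither = re.findall(r"[^\d\s]", sample)

# [\D\S]: union of "not a digit" and "not a whitespace", so every character qualifies
either = re.findall(r"[\D\S]", sample)

print(neither)                 # ['a', '_']
print(either == list(sample))  # True
```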

Bug — in sed 4.2.2, the malformed regex [a-Z] matches any lowercase or uppercase character. [A-z], although correct, is rejected:

echo 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ@?>=<;:0123456789/;-,+*()[]^_{|}~' | sed 's/[a-Z]/%/g'
# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%@?>=<;:0123456789/;-,+*()[]^_{|}~

echo 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ@?>=<;:0123456789/;-,+*()[]^_{|}~' | sed 's/[A-z]/%/g'
# sed: -e expression #1, char 11: Invalid range end
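For comparison, here is how an engine that orders ranges by code point behaves. This is a sketch with Python's re module, not with sed, and the result is the opposite of the sed session above. One possible explanation, which is an assumption and was not verified against sed 4.2.2, is that sed orders range endpoints using the collation rules of the current locale.

```python
import re

# In Python, ranges follow code-point order: 'A' is 65 and 'z' is 122.
between = "[\\]^_`"  # the six ASCII characters located between 'Z' and 'a'
matched = re.findall(r"[A-z]", between)

# The reverse range is rejected because 'a' (97) comes after 'Z' (90).
try:
    re.compile(r"[a-Z]")
    reversed_ok = True
except re.error:
    reversed_ok = False

print(matched == list(between))  # True: [A-z] also matches the punctuation in between
print(reversed_ok)               # False
```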

Zero-length matches

The regexes here are zero-length, meaning they match an empty string, either because they match a particular position in the string (such as the beginning-of-line or end-of-line anchors), or because the matched text is discarded after evaluation (like assertions, which only yield a boolean: matched or not matched).
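A minimal sketch of what "zero-length" means in practice, using Python's re module and a made-up input string:

```python
import re

text = "ab cd"

# \b matches positions, not characters: each match has identical start and end.
spans = [m.span() for m in re.finditer(r"\b", text)]
print(spans)  # [(0, 0), (2, 2), (3, 3), (5, 5)]

# A lookahead checks what follows, but the checked text is not part of the match.
lookahead = re.search(r"a(?=b)", text)
print(lookahead.group())  # a
```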

Anchors:

Anchor   Meaning                        Comment
^        Beginning-of-line anchor
$        End-of-line anchor
\b       Word boundary anchor
\B       Negated word boundary anchor
\<       Start-of-a-word anchor         GNU extensions
\>       End-of-a-word anchor           GNU extensions
\K       Start match here               Perl extensions

Assertions:

Assertion    Meaning                         Comment
(?=regex)    Lookahead positive assertion    e.g. \b(?=\w{0,3}cat)\w{6}\b, matches locate
(?!regex)    Lookahead negative assertion    e.g. \b(?!\w{0,3}cat)\w{6}\b, matches relica but not locate
(?<=regex)   Lookbehind positive assertion
(?<!regex)   Lookbehind negative assertion
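The two lookahead examples from the table can be tried with Python's re module. The word list below is made up; only locate and relica come from the table.

```python
import re

words = "locate relica catnip bobcat"

# Six-letter words in which "cat" starts within the first four letters
positive = re.findall(r"\b(?=\w{0,3}cat)\w{6}\b", words)

# Six-letter words in which it does not
negative = re.findall(r"\b(?!\w{0,3}cat)\w{6}\b", words)

print(positive)  # ['locate', 'catnip', 'bobcat']
print(negative)  # ['relica']
```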

A negative lookahead can be used to invert a regex match: ^(?!.*<REGEX_HERE>) matches (with a zero-length match at the start of the line) every line that does not contain <REGEX_HERE>. Append .* to consume the whole line.
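As an illustration of the inversion, here is a sketch that keeps only the lines not containing a given word. The log lines and the word "error" are hypothetical.

```python
import re

lines = ["error: disk full", "all good", "warning only", "fatal error"]

# ^(?!.*error) succeeds only at the start of a line that has no "error" anywhere
no_error = re.compile(r"^(?!.*error)")
kept = [line for line in lines if no_error.search(line)]

print(kept)  # ['all good', 'warning only']
```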

Examples

Sed - The list below is actually for Extended regular expression (switch -r).

Regexp             Description
.                  Match any character
gray|grey          Match gray or grey
gr(a|e)y           Match gray or grey
gr[ae]y            Match gray or grey
file[^0-2]         Match file3 or file4, but not file0, file1, file2.
colou?r            (zero or one) - Match color or colour.
ab*c               (zero or more) - Match ac, abc, abbc, ....
ab+c               (one or more) - Match abc, abbc, abbbc, ....
a{3,5}             (at least 3 and not more than 5 times) - Match aaa, aaaa, aaaaa.
^on single line$   (start and end of line) - Match a line consisting exactly of "on single line".
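Several rows of the table can be checked mechanically. The sketch below uses Python's re.fullmatch (whole-string matching, unlike sed's search) and a hypothetical helper named check; the non-matching strings are made up.

```python
import re

# pattern -> (strings that should match entirely, strings that should not)
examples = {
    r"gr[ae]y":    (["gray", "grey"], ["groy"]),
    r"file[^0-2]": (["file3", "file4"], ["file0", "file2"]),
    r"colou?r":    (["color", "colour"], ["colouur"]),
    r"ab*c":       (["ac", "abc", "abbc"], ["adc"]),
    r"ab+c":       (["abc", "abbc"], ["ac"]),
    r"a{3,5}":     (["aaa", "aaaaa"], ["aa", "aaaaaa"]),
}

def check(pattern, good, bad):
    """Return True when every `good` string fully matches and no `bad` one does."""
    return (all(re.fullmatch(pattern, s) for s in good)
            and not any(re.fullmatch(pattern, s) for s in bad))

results = {p: check(p, good, bad) for p, (good, bad) in examples.items()}
print(all(results.values()))  # True
```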

Tips

The best regex tricks: match a but not "a"

Reference: http://www.rexegg.com/regex-best-trick.html

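To match Tarzan but not "Tarzan", match both but capture only the wanted one; the wanted occurrences are the matches where group 1 is set:

"Tarzan"|(Tarzan)

In the same way, to ignore more variations, like Tarzania and --Tarzan--, add each of them as an alternative before the capture group:

"Tarzan"|Tarzania|--Tarzan--|(Tarzan)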

In pseudo-regex, this gives:

NotThis|NotThat|GoAway|(WeWantThis)

Alternatively, to delete the unwanted matches, capture what must be kept and replace every match with group 1:

(KeepThis|KeepThat|KeepTheOther)|DeleteThis
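A short sketch of the trick with Python's re module. The sample sentence is made up, and the alternation lists the unwanted variants named in this section before the capture group.

```python
import re

text = 'Tarzan said "Tarzan" in Tarzania, --Tarzan-- and Tarzan'

# Unwanted variants are matched without a group; only the last alternative captures.
pattern = re.compile(r'"Tarzan"|Tarzania|--Tarzan--|(Tarzan)')

# Keep only the matches where group 1 took part, and record where they start.
wanted = [m.start(1) for m in pattern.finditer(text) if m.group(1)]

print(wanted)  # [0, 49]
```

Only the first and last Tarzan are reported; the quoted, suffixed and dashed variants are consumed by the earlier alternatives and discarded.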

Search whole words

Use \b [1]:

echo "bar embarrassment" | sed "s/\bbar\b/no bar/g"
# no bar embarrassment
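The same whole-word replacement can be sketched in Python, with the unanchored version shown for contrast:

```python
import re

sentence = "bar embarrassment"

# With \b on both sides, only the standalone word is replaced.
whole = re.sub(r"\bbar\b", "no bar", sentence)

# Without the anchors, the "bar" inside the longer word is replaced too.
naive = re.sub(r"bar", "no bar", sentence)

print(whole)  # no bar embarrassment
print(naive)  # no bar emno barrassment
```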

Match braced expressions

sed -r 's/f\([^()]*\)/OK/' file.txt                           # Match f() and f(...)
sed -r 's/f\(([^()]|\([^()]*\))*\)/OK/' file.txt              # 1-level nesting - also matches f(...(...)...)
sed -r 's/f\(([^()]|\(([^()]|\([^()]*\))*\))*\)/OK/' file.txt # 2-level nesting - also matches f(...(...(...)...)...)
# To add one more level,
# replace:          =======================
# with:    ====================================
# 
# ... or add '\(([^()]|' as prefix and ')*\)' as suffix.
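The "add a prefix and a suffix per level" rule can be automated. The following is a sketch in Python, not in sed; the function name nested_call_regex is hypothetical, and it uses non-capturing groups where the sed commands use plain groups.

```python
import re

def nested_call_regex(depth: int) -> str:
    """Build a regex matching f(...) with at most `depth` levels of nested parentheses."""
    inner = r"[^()]*"  # depth 0: no parenthesis allowed inside f(...)
    for _ in range(depth):
        # one more level: either a non-parenthesis char, or a parenthesized group
        inner = rf"(?:[^()]|\({inner}\))*"
    return rf"f\({inner}\)"

print(bool(re.fullmatch(nested_call_regex(1), "f(a(b)c)")))    # True
print(bool(re.fullmatch(nested_call_regex(1), "f(a(b(c)))")))  # False
print(bool(re.fullmatch(nested_call_regex(2), "f(a(b(c)))")))  # True
```

Each loop iteration wraps the previous pattern exactly as described in the comments above, so a depth of 0, 1 or 2 corresponds to the three sed commands.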

Empty regular expression

In sed, the empty regular expression, as in // or s//.../, reuses the last regexp that was matched, without repeating it [2].

Capture groups of the reused regexp are available too:

sed -ri '0,/^#include ([<"])/{s//#include  \1/;}'  # Here \1 will match either < or "

Document regular expression

  • In PCRE, the (?x) flag makes the engine ignore whitespace and # comments in the pattern. This allows writing:
    (?x)
    ^\w+            # mandatory leading letters
    ( [-+.'] \w+ )* # optional suffix
    @
    \w+             # domain
    ( [-.] \w+ )*   # domain suffix
    ( \.\w+ ( [-.] \w+ )* )* #tld
    $
  • Alternatively, one can split the regex into sub-expressions. For instance in Python:
mandatory_leading_letters = r"^\w+"
optional_suffix = r"([-+.']\w+)*"
domain = r"\w+"
domain_suffix = r"([-.]\w+)*"
tld = r"\.\w+([-.]\w+)*$"
# f-string: each {name} is replaced by the sub-expression defined above
regex = f"{mandatory_leading_letters}{optional_suffix}@{domain}{domain_suffix}{tld}"
The combined regex can then be passed to the re module, e.g. re.match(regex, text).
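Python's re module offers an equivalent of (?x) through the re.VERBOSE flag, so the commented pattern from the first bullet can be kept almost as is. This is a sketch; the variable name email and the two test strings are made up.

```python
import re

email = re.compile(r"""
    ^\w+                      # mandatory leading letters
    ( [-+.'] \w+ )*           # optional suffix
    @
    \w+                       # domain
    ( [-.] \w+ )*             # domain suffix
    ( \.\w+ ( [-.] \w+ )* )*  # tld
    $
""", re.VERBOSE)

print(bool(email.match("john.doe@example.com")))  # True
print(bool(email.match("no-at-sign")))            # False
```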

Match an expression in the middle

Say we want to match the time in string tests passed in 1.78s.

tests passed in 1.78s
                ====

We can use this regular expression:

echo "tests passed in 1.78s" | grep -oP 'in \K([0-9.]+)(?=s)'

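This uses the match-starts-here anchor (\K), which drops "in " from the reported match, and a lookahead assertion ((?=s)), which requires the trailing s without including it.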
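Python's built-in re module does not support \K. A comparable result can be sketched with a fixed-width lookbehind, or with a plain capture group:

```python
import re

line = "tests passed in 1.78s"

# Lookbehind plays the role of \K here: "in " must precede, but is not reported.
with_lookbehind = re.search(r"(?<=in )[0-9.]+(?=s)", line).group()

# Alternative: match the context normally and extract the capture group.
with_group = re.search(r"in ([0-9.]+)s", line).group(1)

print(with_lookbehind)  # 1.78
print(with_group)       # 1.78
```

Unlike \K, a lookbehind in Python must have a fixed width, which is the case for the literal "in ".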

Regex Golf

My solutions so far (see here for other scores [3]):

Plain strings (207)   foo
Anchors (208)         k$
Ranges (202)          ^[a-f]+$
Backrefs (201)        (...).*\1
Abba (190)            ^(?!.*(.)(.)\2\1).*$
Abba (193)            ^(?!.*(.)(.)\2\1)
A man, a plan (176)   ^(.)(.).*\2\1$
A man, a plan (177)   ^(.)[^p].*\1$
Prime (232)           ^(xx|xxx|x{5}|x{7}|x{11}|x{13}|x{17}|x{19}|x{23}|x{29}|x{31})$|x{33}
 From Reddit:
   Prime (286)        ^(?!(xx+)\1+$)                  (ie. cannot be twice or more times a number >=2)
Four (198)            (.).\1.\1.\1
Four (199)            (.)(.\1){3}
Order (156)           ^a?b?c?c?d?e?e?f?g?h?i?l?l?m?n?o?o?p?r?s?s?t?t?y?w?z?$
  From Reddit:
    Order (196)       ^[a-f]*[g-z]*$
    Order (199)       ^.{5}[^e]?$                     (obvious cheat actually)
Triples (570)         (6|[56]0|31|12|24|[48]7|58|0[0249]|7[258]|003|015|303|9005)$
Glob (362)            ^(([^*]+) .+ \2|([^*]+)\* .+ \3.*)$|err|fal|log|tud|aio|sy
  From Reddit:
    Glob (389)        ([rlwpc][dplroy]|[lpg]e[an]).*t (pure cheat, not obvious)
Balance (283)         ^(<(<(<(<(<(<(<>)*>)*>)*>)*>)*>)*>)*$
Balance (286)         ^<<>><|^(<(<(<(<(<>)*>)*>)*>)*>)*$
Powers (60)           ^(((((((((xx?)\9?)\8?)\7?)\6?)\5?)\4?)\3?)\2?)\1?$
Powers (72)           ^(?!((xx)+x|(x{24})+|x{28}|x{160})$)x*
Powers (76)           ^(?!((xx)+x|x{28}|x{48}|(x{5})+)$)
  From Reddit:
    Powers (93)       ^(?!(x(xx)+)\1*$)               (ie. 2^n cannot be a multiple of an odd number >=3)
Long count (216)      ^0+ 0+1 0010 0011 0100 0101 0110 01+ 10+ 1001 1010 101
  From Reddit:
    Long count (253)  ((.+)0 \2+1 ?){8}               (ie. something|'0' same thing|'1', this 8 times)
Long count v2 (216)   ^0+ 0+1 0010 0011 0100 0101 0110 01+ 10+ 1001 1010 101
  From Reddit:
    Long count (253)  ((.+)0 \2+1 ?){8}               (ie. something|'0' same thing|'1', this 8 times)
Alphabetical (289)    [rs]er$|^([er]|ass).*s$|a t|e e|n r|rt r|ne t|ar ta

Total 4005
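The arithmetic behind the two Reddit solutions for Prime and Powers can be checked on strings of x characters. This sketch only tests the number-theoretic property; it does not reproduce the word lists or the scoring of the golf site.

```python
import re

# Rejects lengths that are a multiple (2 times or more) of some number >= 2
prime = re.compile(r"^(?!(xx+)\1+$)")

# Rejects lengths that are a multiple of an odd number >= 3
power = re.compile(r"^(?!(x(xx)+)\1*$)")

primes = [n for n in range(2, 30) if prime.search("x" * n)]
powers = [n for n in range(1, 40) if power.search("x" * n)]

print(primes)  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
print(powers)  # [1, 2, 4, 8, 16, 32]
```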