Regular Expressions: Difference between revisions
Latest revision as of 13:01, 27 June 2024
References
- Regular-Expressions.info, The Premier website about Regular Expressions
- Regular expression on Wikipedia
Engines
Powerful engines:
- Perl
- PCRE: open source regex engine, built into PHP for instance
Less powerful engines:
- Use Extended regular expressions (switch -r) so that the meta-characters (){} have their special meaning when unquoted.
Character Classes
Class | Meaning | Comment |
---|---|---|
[ae] | Matches a or e | |
[a-z] | Matches any char in range a...z | |
[^a-z] | Matches any char not in range a...z | |
\d | Digit - Equivalent to [0-9] | |
\w | Word character - Equivalent to [A-Za-z0-9_] | |
\s | Whitespace character - Equivalent to [ \t\r\n] | Close to [[:space:]] (which also includes \f and \v); [[:blank:]] matches only space and tab |
\D | Negated \d, i.e. [^\d] | |
\W | Negated \w, i.e. [^\w] | |
\S | Negated \s, i.e. [^\s] | |
About negated classes, note that [\D\S] is not the same as [^\d\s]. The latter matches any character that is neither a digit nor a whitespace. The former matches any character that is either not a digit or not a whitespace, i.e. it matches any character, since no character is both a digit and a whitespace.
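The difference can be checked with a minimal Python sketch. It assumes Python's re module, whose \d, \s, \D and \S behave like the classes above; the sample string is made up for illustration.

```python
import re

sample = "a1 _\t"  # a letter, a digit, a space, an underscore, a tab

# [^\d\s]: neither a digit nor a whitespace character
neither = re.findall(r"[^\d\s]", sample)

# [\D\S]: a non-digit OR a non-whitespace, i.e. any character at all
either = re.findall(r"[\D\S]", sample)

print(neither)
print(either)
```

Only the letter and the underscore survive the first class, while the second one returns every character of the sample.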
Bug: in sed 4.2.2, the regex [a-Z], which is malformed in ASCII order (a comes after Z), matches any lowercase or uppercase letter, while [A-z], which is a valid range in ASCII order, is rejected. This probably comes from ranges being evaluated in the locale's collation order rather than in ASCII order:
echo 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ@?>=<;:0123456789/;-,+*()[]^_{|}~' | sed 's/[a-Z]/%/g'
# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%@?>=<;:0123456789/;-,+*()[]^_{|}~
echo 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ@?>=<;:0123456789/;-,+*()[]^_{|}~' | sed 's/[A-z]/%/g'
# sed: -e expression #1, char 11: Invalid range end
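For comparison, here is a small Python sketch of the same two ranges under plain code-point order. Python's re module is used only as a stand-in (it does not use locale collation for ranges), and the helper name try_compile is made up for this example.

```python
import re

def try_compile(pattern):
    """Return the compiled pattern, or None if the pattern is rejected."""
    try:
        return re.compile(pattern)
    except re.error:
        return None

# ord('a') = 97 > ord('Z') = 90, so the range a-Z is reversed and rejected.
a_to_Z = try_compile("[a-Z]")

# ord('A') = 65 < ord('z') = 122, so A-z is valid, but it also covers
# the six characters between 'Z' and 'a': [ \ ] ^ _ `
A_to_z = try_compile("[A-z]")
leftover = A_to_z.sub("", "AZaz[\\]^_`09@")

print(a_to_Z)
print(leftover)
```

In this ordering [A-z] is accepted but matches more than letters, which is why [A-Za-z] is usually the safer way to write it.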
Zero-length matches
The regexes here are zero-length: they match a zero-length string, either because they match a particular position in the string (such as the beginning-of-line or end-of-line anchors), or because the matched string is dropped after evaluation (like assertions, which only yield a boolean value: matched or not matched).
Anchors:
Anchor | Meaning | Comment |
---|---|---|
^ | Beginning-of-line anchor | |
$ | End-of-line anchor | |
\b | Word boundary anchor | |
\B | Negated word boundary anchor | |
\< | Start-of-a-word anchor | GNU extension |
\> | End-of-a-word anchor | GNU extension |
\K | Start match here | Perl extension |
Assertions:
Assertion | Meaning | Comment |
---|---|---|
(?=regex) | Lookahead positive assertion | e.g. \b(?=\w{0,3}cat)\w{6}\b matches locate |
(?!regex) | Lookahead negative assertion | e.g. \b(?!\w{0,3}cat)\w{6}\b matches relica but not locate |
(?<=regex) | Lookbehind positive assertion | |
(?<!regex) | Lookbehind negative assertion | |
A negative assertion can be used to invert a regex match: ^(?!.*<REGEX_HERE>) succeeds only on lines that contain no match for <REGEX_HERE>.
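The assertions above can be tried out with a short Python sketch. It assumes Python's re module, which supports the same lookahead syntax; the word list and the log lines are invented for the example.

```python
import re

words = "locate relica"

# Positive lookahead: 6-letter words with "cat" starting within the first 4 letters
with_cat = re.findall(r"\b(?=\w{0,3}cat)\w{6}\b", words)

# Negative lookahead: 6-letter words without it
without_cat = re.findall(r"\b(?!\w{0,3}cat)\w{6}\b", words)

# Inverting a match: keep the lines that do NOT contain "error"
lines = ["all good", "fatal error", "done"]
no_error = [line for line in lines if re.search(r"^(?!.*error)", line)]

print(with_cat, without_cat, no_error)
```

The lookahead itself consumes nothing, so \w{6} still has to match the whole word afterwards.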
Examples
Sed - The list below is actually for Extended regular expressions (switch -r).
Regexp | Description |
---|---|
. | Match any character |
gray|grey | Match gray or grey |
gr(a|e)y | Match gray or grey |
gr[ae]y | Match gray or grey |
file[^0-2] | Match file3 or file4, but not file0, file1, file2. |
colou?r | (zero or one) - Match color or colour. |
ab*c | (zero or more) - Match ac, abc, abbc, .... |
ab+c | (one or more) - Match abc, abbc, abbbc, .... |
a{3,5} | (at least 3 and at most 5 times) - Match aaa, aaaa, aaaaa. |
^on single line$ | (start and end of line) - Match a line that consists exactly of on single line. |
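A few rows of the table can be checked mechanically. The sketch below assumes Python's re module as a stand-in for sed -r (the syntax of these particular patterns is the same in both), and uses re.fullmatch so that each pattern must cover the whole test string; the non-matching strings are made up.

```python
import re

# (pattern, strings that should match, strings that should not)
examples = [
    (r"gr(a|e)y", ["gray", "grey"], ["groy"]),
    (r"file[^0-2]", ["file3", "file4"], ["file0", "file2"]),
    (r"colou?r", ["color", "colour"], ["colouur"]),
    (r"ab*c", ["ac", "abc", "abbc"], ["adc"]),
    (r"ab+c", ["abc", "abbc"], ["ac"]),
    (r"a{3,5}", ["aaa", "aaaaa"], ["aa", "aaaaaa"]),
]

def check(pattern, should_match, should_not_match):
    """True if the pattern accepts and rejects exactly the expected strings."""
    ok = all(re.fullmatch(pattern, s) for s in should_match)
    return ok and not any(re.fullmatch(pattern, s) for s in should_not_match)

results = {pattern: check(pattern, yes, no) for pattern, yes, no in examples}
print(results)
```

Note that sed searches for the pattern anywhere in the line, so without anchors a{3,5} would also find a match inside aaaaaa.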
Tips
The best regex tricks: match a but not "a"
Reference: http://www.rexegg.com/regex-best-trick.html
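To match Tarzan but not "Tarzan", use the regex
"Tarzan"|(Tarzan)
and keep only the matches where group 1 is set: the unwanted "Tarzan" is consumed by the first alternative, which captures nothing.
In the same way, to ignore more variations, like Tarzania and --Tarzan--, add them as further alternatives:
"Tarzan"|Tarzania|--Tarzan--|(Tarzan)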
In pseudo-regex, this gives:
NotThis|NotThat|GoAway|(WeWantThis)
Alternatively, to delete the unwanted matches, swap the roles and replace every match with the content of group 1:
(KeepThis|KeepThat|KeepTheOther)|DeleteThis
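A minimal Python sketch of both variants of the trick, assuming Python's re module; the sample text is invented for the example.

```python
import re

text = 'Tarzan "Tarzan" Tarzania --Tarzan-- Tarzan'

# NotThis|NotThat|GoAway|(WeWantThis): only group 1 is of interest
find = re.compile(r'"Tarzan"|Tarzania|--Tarzan--|(Tarzan)')
wanted_positions = [m.start(1) for m in find.finditer(text) if m.group(1) is not None]

# (KeepThis|KeepThat)|DeleteThis: put back group 1, drop everything else
delete = re.compile(r'("Tarzan"|Tarzania)|Tarzan')
cleaned = delete.sub(lambda m: m.group(1) or "", 'Tarzan "Tarzan" Tarzania')

print(wanted_positions)
print(repr(cleaned))
```

The unwanted variants are still matched, which is the point: once they are consumed by an earlier alternative, the bare Tarzan inside them can no longer be matched by the last one.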
Search whole words
Use \b (see https://stackoverflow.com/questions/1032023/sed-whole-word-search-and-replace):
echo "bar embarrassment" | sed "s/\bbar\b/no bar/g"
# no bar embarrassment
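The same substitution as a Python sketch, assuming Python's re module (where \b has the same meaning), with the unanchored version shown for contrast.

```python
import re

text = "bar embarrassment"

# With \b, only the standalone word is replaced
whole_word = re.sub(r"\bbar\b", "no bar", text)

# Without \b, the "bar" inside "embarrassment" is replaced as well
anywhere = re.sub(r"bar", "no bar", text)

print(whole_word)
print(anywhere)
```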
Match braced expressions
sed -r 's/f\([^()]*\)/OK/' file.txt # Match f() and f(...)
sed -r 's/f\(([^()]|\([^()]*\))*\)/OK/' file.txt # 1-level nesting - also matches f(...(...)...)
sed -r 's/f\(([^()]|\(([^()]|\([^()]*\))*\))*\)/OK/' file.txt # 2-level nesting - also matches f(...(...(...)...)...)
# To add one more level, in the 1-level regex above,
# replace: \(([^()]|\([^()]*\))*\)
# with:    \(([^()]|\(([^()]|\([^()]*\))*\))*\)
#
# ... or add '\(([^()]|' as prefix and ')*\)' as suffix.
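The prefix/suffix rule lends itself to generating the regex for any nesting depth. The following Python sketch assumes Python's re module; the function names paren_group and call_pattern are made up for this example.

```python
import re

def paren_group(depth):
    """Regex for a parenthesised group with at most `depth` levels of nesting inside."""
    if depth == 0:
        return r"\([^()]*\)"
    # Same step as above: wrap the previous level with '\(([^()]|' and ')*\)'
    return r"\(([^()]|" + paren_group(depth - 1) + r")*\)"

def call_pattern(depth):
    """Compiled regex matching f(...) with at most `depth` nested levels."""
    return re.compile("f" + paren_group(depth))

print(paren_group(2))
print(bool(call_pattern(1).fullmatch("f(a(b)c)")))
```

Each extra level only adds a fixed wrapper, so the pattern grows linearly with the depth, but it still has to be chosen in advance: these regexes cannot match arbitrarily deep nesting.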
Empty regular expression
In sed, the empty regular expression // (or s//.../) matches the last regular expression used, without repeating it (see https://www.gnu.org/software/sed/manual/html_node/Addresses.html).
The groups of that previous regexp are remembered as well:
sed -ri '0,/^#include ([<"])/{s//#include \1/;}' # Here \1 holds whichever of < or " was matched
Document regular expression
- In PCRE, one can use (?x) to ignore whitespace and allow # comments inside the pattern. This allows writing:
(?x)
^\w+ # mandatory leading letters
( [-+.'] \w+ )* # optional suffix
@
\w+ # domain
( [-.] \w+ )* # domain suffix
( \.\w+ ( [-.] \w+ )* )* # tld
$
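- Alternatively, one can cut the regex into sub-expressions. For instance in Python:
import re
# Raw strings, so that backslashes reach the regex engine unchanged
mandatory_leading_letters = r"^\w+"
optional_suffix = r"([-+.']\w+)*"
domain = r"\w+"
domain_suffix = r"([-.]\w+)*"
tld = r"\.\w+([-.]\w+)*$"
# f-string, so that the names above are substituted
regex = rf"{mandatory_leading_letters}{optional_suffix}@{domain}{domain_suffix}{tld}"
- The resulting regex string can then be passed to the re module, e.g. re.match(regex, text).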
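Python's re module offers the same free-spacing mode as (?x) through the re.VERBOSE flag. A minimal sketch of the commented pattern above; the names EMAIL and looks_like_email are made up, and the pattern is only as strict as the original.

```python
import re

# Same pattern as the (?x) example, with whitespace and comments ignored
EMAIL = re.compile(r"""
    ^\w+                       # mandatory leading letters
    ( [-+.'] \w+ )*            # optional suffix
    @
    \w+                        # domain
    ( [-.] \w+ )*              # domain suffix
    ( \.\w+ ( [-.] \w+ )* )*   # tld
    $
""", re.VERBOSE)

def looks_like_email(text):
    """True if text matches the documented pattern."""
    return EMAIL.match(text) is not None

print(looks_like_email("john.doe@example.com"))
print(looks_like_email("no-at-sign.example.com"))
```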
Match an expression in the middle
Say we want to match the time in the string tests passed in 1.78s.
tests passed in 1.78s
                ====
We can use this regular expression:
echo "tests passed in 1.78s" | grep -oP 'in \K([0-9.]+)(?=s)'
This uses the match-starts-here anchor (\K) and a lookahead assertion.
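As far as I know, Python's standard re module does not support \K, so the sketch below swaps it for a lookbehind assertion, which gives the same result here because the prefix has a fixed length.

```python
import re

line = "tests passed in 1.78s"

# The lookbehind stands in for "in \K": "in " must precede the match
# but is not part of it; the lookahead does the same for the trailing "s".
match = re.search(r"(?<=in )[0-9.]+(?=s)", line)
elapsed = match.group(0) if match else None

print(elapsed)
```

Another option is a plain capturing group, in (...)s, reading group 1 instead of the whole match.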
Regex Golf
My solutions so far (see here for other scores [3]):
Plain strings (207) foo
Anchors (208) k$
Ranges (202) ^[a-f]+$
Backrefs (201) (...).*\1
Abba (190) ^(?!.*(.)(.)\2\1).*$
Abba (193) ^(?!.*(.)(.)\2\1)
A man, a plan (176) ^(.)(.).*\2\1$
A man, a plan (177) ^(.)[^p].*\1$
Prime (232) ^(xx|xxx|x{5}|x{7}|x{11}|x{13}|x{17}|x{19}|x{23}|x{29}|x{31})$|x{33}
From Reddit:
Prime (286) ^(?!(xx+)\1+$) (ie. cannot be twice or more times a number >=2)
Four (198) (.).\1.\1.\1
Four (199) (.)(.\1){3}
Order (156) ^a?b?c?c?d?e?e?f?g?h?i?l?l?m?n?o?o?p?r?s?s?t?t?y?w?z?$
From Reddit:
Order (196) ^[a-f]*[g-z]*$
Order (199) ^.{5}[^e]?$ (obvious cheat actually)
Triples (570) (6|[56]0|31|12|24|[48]7|58|0[0249]|7[258]|003|015|303|9005)$
Glob (362) ^(([^*]+) .+ \2|([^*]+)\* .+ \3.*)$|err|fal|log|tud|aio|sy
From Reddit:
Glob (389) ([rlwpc][dplroy]|[lpg]e[an]).*t (pure cheat, not obvious)
Balance (283) ^(<(<(<(<(<(<(<>)*>)*>)*>)*>)*>)*>)*$
Balance (286) ^<<>><|^(<(<(<(<(<>)*>)*>)*>)*>)*$
Powers (60) ^(((((((((xx?)\9?)\8?)\7?)\6?)\5?)\4?)\3?)\2?)\1?$
Powers (72) ^(?!((xx)+x|(x{24})+|x{28}|x{160})$)x*
Powers (76) ^(?!((xx)+x|x{28}|x{48}|(x{5})+)$)
From Reddit:
Powers (93) ^(?!(x(xx)+)\1*$) (ie. 2^n cannot be a multiple of an odd number >=3)
Long count (216) ^0+ 0+1 0010 0011 0100 0101 0110 01+ 10+ 1001 1010 101
From Reddit:
Long count (253) ((.+)0 \2+1 ?){8} (ie. something|'0' same thing|'1', this 8 times)
Long count v2 (216) ^0+ 0+1 0010 0011 0100 0101 0110 01+ 10+ 1001 1010 101
From Reddit:
Long count (253) ((.+)0 \2+1 ?){8} (ie. something|'0' same thing|'1', this 8 times)
Alphabetical (289) [rs]er$|^([er]|ass).*s$|a t|e e|n r|rt r|ne t|ar ta
Total 4005
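The two number-theoretic solutions from Reddit can be checked outside the game. The sketch below assumes Python's re module and strings made only of x characters, as in the Prime and Powers levels; the helper name matches is made up.

```python
import re

# A string of n x's is rejected if it splits into 2 or more copies of a block of 2+ x's
PRIME = re.compile(r"^(?!(xx+)\1+$)")

# A string of n x's is rejected if it splits into copies of an odd block of 3+ x's
POWER_OF_TWO = re.compile(r"^(?!(x(xx)+)\1*$)")

def matches(pattern, n):
    """True if a string of n x's is accepted by the pattern."""
    return pattern.match("x" * n) is not None

primes = [n for n in range(2, 30) if matches(PRIME, n)]
powers = [n for n in range(1, 70) if matches(POWER_OF_TWO, n)]

print(primes)
print(powers)
```

Both patterns rely on backtracking over every possible block length, so they are only practical for short strings like the ones in the game.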