Sed: Difference between revisions
Latest revision as of 16:28, 3 July 2021
References
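- The SED Homepage on SourceForge: http://sed.sourceforge.net/
- The SED FAQ: http://sed.sourceforge.net/sedfaq.html
- The SED man page
- Sed, a stream editor: http://www.gnu.org/software/sed/manual/sed.html
- The sed one-liners: http://sed.sourceforge.net/sed1line.txt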
Installation
It is recommended to add the following alias in your ~/.bashrc:
alias sed="sed -r"
Of course, this alias has no effect in shell scripts. There you have to specify the option explicitly at each invocation.
Usage
Some basic usage:
sed [OPTION]... {script-only-if-no-other-script} [input-file]...
sed -n # Silent - suppress automatic printing of pattern space
sed -r # Use extended regular expression
sed -i "s/foo/bar/" *.txt # In-place file modification
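As a minimal sketch of the options listed above (assuming GNU sed, and using made-up sample input), the following shows -n combined with the p flag, and -r with unescaped groups:

```shell
# Made-up sample input: three lines.
input=$(printf 'foo 1\nbar 2\nfoo 3\n')

# -n suppresses automatic printing; the p flag prints only the lines
# on which the substitution actually happened.
only_changed=$(printf '%s\n' "$input" | sed -n 's/foo/baz/p')

# -r enables extended regular expressions, so ( ) and + need no backslash.
swapped=$(printf '%s\n' "$input" | sed -r 's/([a-z]+) ([0-9]+)/\2 \1/')

printf '%s\n' "$only_changed" "$swapped"
```

The first command prints only the two modified lines; the second prints all three lines with the word and the number swapped.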
Portable scripts / deal with locale
It is recommended to set the environment variables LC_COLLATE and LC_CTYPE to C (see https://www.gnu.org/software/sed/manual/sed.html#Limitations) to avoid bugs in shell scripts:
export LC_COLLATE=C LC_CTYPE=C
# Now the following line works as expected
echo $'Copyright \xa9 1999' | sed -r 's/./x/g'
Another solution is to set the environment variable LANG to an 8-bit character set such as iso-8859-1.
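When exporting the variables for the whole script is not desirable, the locale can also be set for a single sed invocation. The sketch below uses LC_ALL=C, which is an assumption on top of the text above: LC_ALL takes precedence over LC_COLLATE and LC_CTYPE, so it also works when the calling environment already sets LC_ALL to something else.

```shell
# The byte \251 (octal for 0xa9) is not valid UTF-8 on its own.
# With the C locale every byte counts as one character, so '.' matches it.
masked=$(printf 'Copyright \251 1999\n' | LC_ALL=C sed -r 's/./x/g')

printf '%s\n' "$masked"
```

The variable assignment in front of the command only affects that one sed process; the rest of the script keeps its locale.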
Commands a, i and c
The commands a\text, i\text and c\text take an address. Each command is terminated by a newline. To insert a newline character, use \n:
cat mytext
# First line
# Second line
cat mysedscript
# 1 {i\inserted text
# s/$/ (not anymore)/g}
sed -f mysedscript mytext
# inserted text
# First line (not anymore)
# Second line
All on one line: use echo -e to generate the newline that terminates the command i:
echo -e "1 {i\\inserted text\ns/$/ (not anymore)/g}"| sed -f - mytext
# inserted text
# First line (not anymore)
# Second line
Same result without command i:
sed "1 {s/^/inserted text\n/; s/$/ (not anymore)/}" mytext
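The commands a and c work the same way as i. A minimal sketch, assuming GNU sed and its one-line a\text form: each command is given in its own -e expression, so the end of the expression plays the role of the terminating newline.

```shell
# Append a line after line 1 and replace line 2.
# Each -e fragment ends the text of the a / c command it contains.
result=$(printf 'First line\nSecond line\n' \
  | sed -e '1a\appended text' -e '2c\replacement text')

printf '%s\n' "$result"
```

This prints "First line", then "appended text", then "replacement text".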
Empty regular expression
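An empty regular expression // matches the last regular expression used, so the regex does not have to be repeated (see https://www.gnu.org/software/sed/manual/html_node/Addresses.html).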
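A short sketch of the empty regular expression, on made-up input: the address selects the lines containing foo, and the s command reuses that regex through //.

```shell
# /foo/ selects the lines; the empty regex in s//bar/ reuses /foo/.
result=$(printf 'alpha foo\nbeta\ngamma foo foo\n' | sed -n '/foo/s//bar/gp')

printf '%s\n' "$result"
```

Only the two lines containing foo are printed, with every foo replaced by bar.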
Regular expressions
See Regular Expressions.
Script Examples
Remove <script>...</script> HTML tag
s!<script[>\x20\t].*</script>!!g
/<script[>\x20\t]/{
s!<script[>\x20\t].*!!g
:NEXTCYCLE
n
/<\/script>/!{
s!.*!!g
b NEXTCYCLE
}
s!.*</script>!!g
}
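To see what the script does, it can be run on a small made-up HTML fragment (a sketch assuming GNU sed, which understands \x20 and \t). Note that the lines of a multi-line script element are emptied rather than deleted, so blank lines remain in the output.

```shell
# Made-up sample HTML with a script element spanning several lines.
html=$(printf '<p>before</p>\n<script type="text/javascript">\nvar x = 1;\n</script>\n<p>after</p>\n')

# Same sed script as above, passed as one multi-line argument.
cleaned=$(printf '%s\n' "$html" | sed -r '
s!<script[>\x20\t].*</script>!!g
/<script[>\x20\t]/{
s!<script[>\x20\t].*!!g
:NEXTCYCLE
n
/<\/script>/!{
s!.*!!g
b NEXTCYCLE
}
s!.*</script>!!g
}')

printf '%s\n' "$cleaned"
```

The output keeps the two paragraph lines and has three empty lines where the script element used to be.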
Remove newlines
Newline characters are added to the pattern space when using the append command N. The script below joins all lines of standard input, replacing each newline with a space:
:a N
s/\n/ /g
b a
One-liner in bash (this variant removes the newlines without inserting a space):
sed -r ':a N; s/\n//; b a' FILE
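A common variant, sketched here on made-up input and assuming GNU sed, first collects the whole input and substitutes only once, on the last line. The $!ba branch loops back for every line except the last one, so N is never called at end of input.

```shell
# Collect all lines into the pattern space, then replace every newline once.
joined=$(printf 'one\ntwo\nthree\n' | sed ':a;N;$!ba;s/\n/ /g')

printf '%s\n' "$joined"
```

This prints the three words on one line, separated by single spaces.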
Remove trailing whitespace
find -name '*.[chs]' -print0 | xargs -r0 sed -e 's/[[:blank:]]\+$//' -i
ack-grep --text --type-set=pdf=.pdf --nopdf -f --print0 | xargs -r0 sed -r -i 's/\s+$//';
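Since these commands modify files in place, it may be worth trying them on a scratch copy first. A minimal sketch, assuming GNU find, xargs and sed; the file name demo.c and the temporary directory are made up for the example:

```shell
# Create a scratch directory with one file containing trailing blanks.
tmpdir=$(mktemp -d)
printf 'int x;  \nint y;\t\n' > "$tmpdir/demo.c"

# Strip trailing spaces and tabs in place from all .c, .h and .s files.
find "$tmpdir" -name '*.[chs]' -type f -print0 \
  | xargs -r0 sed -i -e 's/[[:blank:]]\+$//'

cleaned=$(cat "$tmpdir/demo.c")
rm -r "$tmpdir"

printf '%s\n' "$cleaned"
```

After the run, both lines of the scratch file end right after the semicolon.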
Recursive patterns
For instance, to transform a path like /usr/local/share/bin/../../../bin/foo into /usr/bin/foo:
s!^([^./])!\./\1! # Prefix with './' unless starts with '.' or '/'
s!/\./!/!g # Remove any './' in middle (the dot must be escaped, otherwise it matches any character)
:a s!/[^/]*[^/.]/\.\.!!g # Remove /foo/.. (1st letter must not be '/', last letter must not be '.')
t a # ... and repeat until no more substitutions
echo "/usr/local/share/bin/../../../bin/foo" | sed -r 's!^([^./])!\./\1!; s!/\./!/!g; :a s!/[^/]*[^/.]/\.\.!!g; t a'
Test paths:
/usr/local/share/../../../bin/foo # /bin/foo
/usr/local/./share/../../../bin/foo # /bin/foo
./usr/../bin/foo # ./bin/foo
usr/../bin/foo # ./bin/foo
usr/../bin # ./bin
usr/../bin/.. # .
usr/../bin/../.. # ./..
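The test paths can be checked with a small wrapper. This is only a sketch: normalize_path is a hypothetical helper name, GNU sed is assumed, the literal dot in '/./' is escaped, and the label is followed by a semicolon so that it is delimited unambiguously.

```shell
# Hypothetical helper: collapse './' and 'dir/..' components of a path.
normalize_path() {
  printf '%s\n' "$1" \
    | sed -r 's!^([^./])!\./\1!; s!/\./!/!g; :a; s!/[^/]*[^/.]/\.\.!!g; t a'
}

# Try a few of the test paths, plus one path that must stay unchanged.
for p in /usr/local/share/bin/../../../bin/foo \
         /usr/local/./share/../../../bin/foo \
         usr/../bin/foo \
         usr/../bin/../.. \
         /a/b/c; do
  normalize_path "$p"
done
```

The loop prints one normalized path per line; a path without '.' or '..' components, such as /a/b/c, comes out unchanged.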
Hex conversion in .reg file
eval "$(sed -r ':a N; s/\\\n *//g; b a' mapi-utf8.reg | sed -r "s/(.*)/echo \'\1\'/; /hex:/s/echo/echo -e/" | sed -r '/hex:/{s/,00//g; s/([:,])([0-9a-f][0-9a-f])/\1\\x\2/g}; s/,//g')"
Find whole word matches only
Use \b, as in:
sed -rn '/\bWORD\b/p' myfile.txt
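A quick sketch on made-up input shows which lines \b accepts (assuming GNU sed, where \b is available as an extension):

```shell
# Only lines where WORD stands alone as a word are printed;
# PASSWORD and WORDS contain it but not at word boundaries.
matches=$(printf 'a WORD here\nPASSWORD\nWORDS\nWORD\n' | sed -rn '/\bWORD\b/p')

printf '%s\n' "$matches"
```

Two of the four input lines are printed: the first and the last.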
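Concatenate C statements spanning multiple lines
Say we have a C file in which some statements span multiple lines, and we want each of them back on a single line (for instance, to process them further). The script below skips #define lines and, for each line matching my_function, keeps appending the next line until the pattern space contains a semicolon: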
find -name "*.[ch]" -type f -print0|xargs -0 sed -r '/#define/b a; /my_function/{:b /;/b a;N;s/\n//; b b};:a'|grep my_function # To review result
find -name "*.[ch]" -type f -print0|xargs -0 sed -ri '/#define/b a; /my_function/{:b /;/b a;N;s/\n//; b b};:a' # To apply result in-place
Match non-ascii characters / invalid collation character
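In a multibyte locale such as UTF-8, sed's regular expressions do not match bytes that are not part of a valid character, so by default only 7-bit ASCII input is handled reliably (see https://unix.stackexchange.com/questions/256806/replace-non-ascii-characters-with-space-in-a-file and https://stackoverflow.com/questions/9670916/will-sed-and-others-corrupt-non-ascii-files).
Here, in LANG=en_US.UTF-8, we see that the non-ASCII byte is left untouched: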
echo $'Copyright \xa9 1999' | sed -r 's/./x/g'
# xxxxxxxxxx�xxxxx
Trying to give a non-ASCII range produces the error Invalid collation character:
echo $'Copyright \xa9 1999' | sed -r 's/[\d128-\d255]/x/g'
# sed: -e expression #1, char 19: Invalid collation character
We can bypass this issue by using an 8-bit character set, for instance iso-8859-1:
echo $'Copyright \xa9 1999' | LANG=iso-8859-1 sed -r 's/./x/g'
# xxxxxxxxxxxxxxxx
echo $'Copyright \xa9 1999' | LANG=iso-8859-1 sed -r 's/[\d128-\d255]/x/g'
# Copyright x 1999
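Another solution is to set LC_COLLATE=C LC_CTYPE=C, which the GNU sed manual recommends to avoid this kind of bug in shell scripts (see https://www.gnu.org/software/sed/manual/sed.html#Limitations):
echo $'Copyright \xa9 1999' | LC_COLLATE=C LC_CTYPE=C sed -r 's/./x/g'
# xxxxxxxxxxxxxxxx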
Delete the first matching line
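From Stack Overflow (https://stackoverflow.com/questions/23696871/how-to-remove-only-the-first-occurrence-of-a-line-in-a-file-using-sed):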
# Delete first line matching 'foo'
sed '0,/foo/{//d}' inputfile # Use 0,ADDR2, so that ADDR2 can match the 1st line
Note the special construction //d, which uses the empty regular expression (see https://www.gnu.org/software/sed/manual/html_node/Addresses.html) to match the last regular expression used, here /foo/.
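The behaviour can be checked on made-up input, both when the first match is on line 1 and when it comes later (a sketch assuming GNU sed, since the 0,/regex/ address form is a GNU extension):

```shell
# First match on line 1: only that line is deleted.
first=$(printf 'foo\nbar\nfoo\n' | sed '0,/foo/{//d}')

# First match on line 2: the later 'foo' on line 4 is kept.
later=$(printf 'bar\nfoo\nbaz\nfoo\n' | sed '0,/foo/{//d}')

printf '%s\n' "$first" "$later"
```

In both cases exactly one foo line disappears, and every later occurrence is left in place.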