Sed: Difference between revisions

From miki
(18 intermediate revisions by the same user not shown)
Line 8: Line 8:
== Installation ==
It is recommended to add the following alias in your <tt>~/.bashrc</tt>:
<source lang="bash" enclose="prevalid">
alias sed="sed -r"
</source>
Of course, this alias has no effect in shell scripts. There you'll have to specify the option explicitly at each invocation.


Line 22: Line 22:
</source>
</source>


=== Portable scripts / deal with locale ===
It is recommended to set environment variables <code>LC_COLLATE</code> and <code>LC_CTYPE</code> to <code>C</code> [https://www.gnu.org/software/sed/manual/sed.html#Limitations] to avoid bugs in shell scripts:

<source lang="bash">
export LC_COLLATE=C LC_CTYPE=C

# Now the following line works as expected
echo $'Copyright \xa9 1999' | sed -r 's/./x/g'
</source>

Another solution is to set the environment variable <code>LANG</code> to an 8-bit character set such as <code>iso-8859-1</code>.

=== Commands <code>a</code>, <code>i</code> and <code>c</code> ===
The commands <tt>a\text</tt>, <tt>i\text</tt> and <tt>c\text</tt> append, insert and change text. Each command is terminated by a '''newline'''. To insert a newline character in the text itself, use <tt>\n</tt>:
<source lang="bash">
cat mytext
# First line
# Second line
cat mysedscript
# 1 {i\inserted text
# s/$/ (not anymore)/g}
sed -f mysedscript mytext
# inserted text
# First line (not anymore)
# Second line
</source>

All on one line: use <code>echo -e</code> to generate the newline that terminates the command <code>i</code>:
<source lang="bash">
echo -e "1 {i\\inserted text\ns/$/ (not anymore)/g}"| sed -f - mytext
# inserted text
# First line (not anymore)
# Second line
</source>

Same result without command <code>i</code>:
<source lang="bash">
sed "1 {s/^/inserted text\n/; s/$/ (not anymore)/}" mytext
</source>

=== Empty regular expression ===
The empty regular expression <code>//</code> matches the last regular expression used, without repeating it (see [https://www.gnu.org/software/sed/manual/html_node/Addresses.html]).


== Regular expressions==
Line 71: Line 91:
b a
b a
</source>
</source>

One liner in bash:
<source lang="bash">
sed -r ':a N; s/\n//; b a' FILE
</source>

=== Remove trailing whitespaces ===
<source lang=bash>
find -name '*.[chs]' -print0 | xargs -r0 sed -e 's/[[:blank:]]\+$//' -i
ack-grep --text --type-set=pdf=.pdf --nopdf -f --print0 | xargs -r0 sed -r -i 's/\s+$//';
</source>

=== Recursive patterns ===
For instance, to transform a path like <tt>/usr/local/share/bin/../../../bin/foo</tt> into <tt>/usr/bin/foo</tt>:
<source lang=text>
s!^([^./])!\./\1! # Prefix with './' unless starts with '.' or '/'
s!/\./!/!g # Remove any './' in middle
:a s!/[^/]*[^/.]/\.\.!!g # Remove /foo/.. (1st letter must not be '/', last letter must not be '.')
t a # ... and repeat until no more substitutions
</source>

<source lang=bash>
echo "/usr/local/share/bin/../../../bin/foo" | sed -r 's!^([^./])!\./\1!; s!/\./!/!g; :a s!/[^/]*[^/.]/\.\.!!g; t a'
</source>

Test paths:
<source lang=bash>
/usr/local/share/../../../bin/foo # /bin/foo
/usr/local/./share/../../../bin/foo # /bin/foo
./usr/../bin/foo # ./bin/foo
usr/../bin/foo # ./bin/foo
usr/../bin # ./bin
usr/../bin/.. # .
usr/../bin/../.. # ./..
</source>

=== hex conversion in .reg file ===
<source lang=bash>
eval "$(sed -r ':a N; s/\\\n *//g; b a' mapi-utf8.reg | sed -r "s/(.*)/echo \'\1\'/; /hex:/s/echo/echo -e/" | sed -r '/hex:/{s/,00//g; s/([:,])([0-9a-f][0-9a-f])/\1\\x\2/g}; s/,//g')"
</source>

=== Find whole word matches only ===
Use <code>\b</code>, as in
<source lang=bash>
sed -rn '/\bWORD\b/p' myfile.txt
</source>

=== Concatenate C statements spanning multiple lines ===
Say we have some C files where some statements span multiple lines, and we want them back on a single line (for instance, to process them further). Use the following script:
<source lang=bash>
find -name "*.[ch]" -type f -print0|xargs -0 sed -r '/#define/b a; /my_function/{:b /;/b a;N;s/\n//; b b};:a'|grep my_function # To review result
find -name "*.[ch]" -type f -print0|xargs -0 sed -ri '/#define/b a; /my_function/{:b /;/b a;N;s/\n//; b b};:a' # To apply result in-place
</source>

=== Match non-ascii characters / invalid collation character ===
By default, in a multibyte locale such as UTF-8, sed only handles 7-bit ASCII input reliably: bytes that do not form a valid character are not matched [https://unix.stackexchange.com/questions/256806/replace-non-ascii-characters-with-space-in-a-file], [https://stackoverflow.com/questions/9670916/will-sed-and-others-corrupt-non-ascii-files].

Here, in <code>LANG=en_US.UTF-8</code>, we see that the non-ASCII byte <code>\xa9</code> is left untouched:
<source lang="bash">
echo $'Copyright \xa9 1999' | sed -r 's/./x/g'
# xxxxxxxxxx�xxxxx
</source>

Trying to match a non-ASCII range gives the error <code>Invalid collation character</code>:
<source lang="bash">
echo $'Copyright \xa9 1999' | sed -r 's/[\d128-\d255]/x/g'
# sed: -e expression #1, char 19: Invalid collation character
</source>

We can bypass this issue by using an 8-bit character set, for instance <code>iso-8859-1</code>:
<source lang="bash">
echo $'Copyright \xa9 1999' | LANG=iso-8859-1 sed -r 's/./x/g'
# xxxxxxxxxxxxxxxx
echo $'Copyright \xa9 1999' | LANG=iso-8859-1 sed -r 's/[\d128-\d255]/x/g'
# Copyright x 1999
</source>

Another solution is to set <code>LC_COLLATE=C LC_CTYPE=C</code>, which the sed manual recommends to avoid bugs in shell scripts [https://www.gnu.org/software/sed/manual/sed.html#Limitations]:
<source lang="bash">
echo $'Copyright \xa9 1999' | LC_COLLATE=C LC_CTYPE=C sed -r 's/./x/g'
# xxxxxxxxxxxxxxxx
</source>

=== Delete the first matching line ===
From [https://stackoverflow.com/questions/23696871/how-to-remove-only-the-first-occurrence-of-a-line-in-a-file-using-sed SO]:
<source lang="bash">
# Delete first line matching 'foo'
sed '0,/foo/{//d}' inputfile # Use 0,ADDR2, so that ADDR2 can match the 1st line
</source>

Note the special construction <code>//d</code>: the '''empty regular expression''' [https://www.gnu.org/software/sed/manual/html_node/Addresses.html] matches the last regular expression used.

Latest revision as of 16:28, 3 July 2021

References

Installation

It is recommended to add the following alias in your ~/.bashrc:

alias sed="sed -r"

Of course, this alias has no effect in shell scripts. There you'll have to specify the option explicitly at each invocation.

Usage

Some basic usage:

sed [OPTION]... {script-only-if-no-other-script} [input-file]...
sed -n                              # Silent - suppress automatic printing of pattern space
sed -r                              # Use extended regular expression
sed -i "s/foo/bar/" *.txt           # In-place file modification
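
As a small illustration of the options listed above, the sketch below runs sed on standard input only, so nothing is modified on disk. It assumes GNU sed; the helper names `second_line` and `swap_words` are hypothetical.

```shell
# -n suppresses automatic printing; the p command prints line 2 explicitly.
second_line() { sed -n '2p'; }

# -r enables extended regular expressions: + and () need no backslashes.
swap_words() { sed -r 's/(a+) (b+)/\2 \1/'; }

printf 'one\ntwo\nthree\n' | second_line
# two
printf 'aaa bbb\n' | swap_words
# bbb aaa
```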

Portable scripts / deal with locale

It is recommended to set environment variables LC_COLLATE and LC_CTYPE to C [1] to avoid bugs in shell scripts:

export LC_COLLATE=C LC_CTYPE=C

# Now the following line works as expected
echo $'Copyright \xa9 1999' | sed -r 's/./x/g'

Another solution is to set the environment variable LANG to an 8-bit character set such as iso-8859-1.

Commands a, i and c

The commands a\text, i\text and c\text append, insert and change text. Each command is terminated by a newline. To insert a newline character in the text itself, use \n:

cat mytext
# First line
# Second line
cat mysedscript
# 1 {i\inserted text
# s/$/ (not anymore)/g}
sed -f mysedscript mytext
# inserted text
# First line (not anymore)
# Second line

All on one line: use echo -e to generate the newline that terminates the command i:

echo -e "1 {i\\inserted text\ns/$/ (not anymore)/g}"| sed -f - mytext
# inserted text
# First line (not anymore)
# Second line

Same result without command i:

sed "1 {s/^/inserted text\n/; s/$/ (not anymore)/}" mytext
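
Another way to get the terminating newline, sketched here under the assumption that GNU sed is used: each -e expression is treated as if it ended with a newline, so the i command can sit in its own -e. The one-line form `i text` is a GNU extension, and the helper name `insert_first` is hypothetical.

```shell
# Insert a line before line 1, then edit line 1, without a script file.
insert_first() {
    sed -e '1i inserted text' -e '1s/$/ (not anymore)/'
}

printf 'First line\nSecond line\n' | insert_first
# inserted text
# First line (not anymore)
# Second line
```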

Empty regular expression

The empty regular expression // matches the last regular expression used, without repeating it (see [2]).
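
A minimal sketch of the empty regular expression: the address selects lines matching fo+, and the empty pattern in the s command reuses that same regex. The helper name `replace_match` and the sample input are made up for illustration.

```shell
# The empty pattern in s//qux/ stands for the address regex /fo+/.
replace_match() { sed -r '/fo+/s//qux/'; }

printf 'foo bar\nbaz\n' | replace_match
# qux bar
# baz
```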

Regular expressions

See Regular Expressions.

Script Examples

Remove <script>...</script> HTML tag

s!<script[>\x20\t].*</script>!!g
/<script[>\x20\t]/{
    s!<script[>\x20\t].*!!g
    :NEXTCYCLE
    n
    /<\/script>/!{
        s!.*!!g
        b NEXTCYCLE
    }
    s!.*</script>!!g
}

Remove newlines

Newline characters are added to the pattern space when using the append command N. The script below removes all newlines from standard input:

:a N
s/\n/ /g
b a

One liner in bash:

sed -r ':a N; s/\n//; b a' FILE
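
The one-liner above relies on GNU sed printing the pattern space when N runs out of input. A commonly used variant, sketched below, makes the end condition explicit: it loops only while the current line is not the last one ($!), then removes all newlines in a single substitution. The helper name `join_lines` is hypothetical.

```shell
# Collect all lines in the pattern space, then delete every newline at once.
join_lines() { sed -r ':a;N;$!ba;s/\n//g'; }

printf 'a\nb\nc\n' | join_lines
# abc
```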

Remove trailing whitespaces

find -name '*.[chs]' -print0 | xargs -r0 sed -e 's/[[:blank:]]\+$//' -i
ack-grep --text --type-set=pdf=.pdf --nopdf -f --print0 | xargs -r0 sed -r -i 's/\s+$//';
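
Before running either command with -i on real files, the substitution can be tried on standard input. The sketch below uses the same [[:blank:]] expression with -r; the helper name `strip_trailing` and the sample lines are illustrative only.

```shell
# Remove spaces and tabs at the end of each line; other content is untouched.
strip_trailing() { sed -r 's/[[:blank:]]+$//'; }

printf 'foo  \nbar\t\nbaz\n' | strip_trailing
# foo
# bar
# baz
```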

Recursive patterns

For instance, to transform a path like /usr/local/share/bin/../../../bin/foo into /usr/bin/foo:

s!^([^./])!\./\1!                  # Prefix with './' unless starts with '.' or '/'
s!/\./!/!g                         # Remove any './' in middle
:a s!/[^/]*[^/.]/\.\.!!g           # Remove /foo/.. (1st letter must not be '/', last letter must not be '.')
t a                                # ... and repeat until no more substitutions

echo "/usr/local/share/bin/../../../bin/foo" | sed -r 's!^([^./])!\./\1!; s!/\./!/!g; :a s!/[^/]*[^/.]/\.\.!!g; t a'

Test paths:

/usr/local/share/../../../bin/foo     # /bin/foo
/usr/local/./share/../../../bin/foo   # /bin/foo
./usr/../bin/foo                      # ./bin/foo 
usr/../bin/foo                        # ./bin/foo
usr/../bin                            # ./bin
usr/../bin/..                         # .
usr/../bin/../..                      # ./..
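
To check the test paths in one go, the script can be wrapped in a shell function. This is a sketch assuming GNU sed; the name `normalize_path` is hypothetical, and the dot in the second substitution is escaped so that only a literal './' component is removed.

```shell
# Normalize a path given as the first argument and print the result.
normalize_path() {
    printf '%s\n' "$1" | sed -r '
        s!^([^./])!./\1!
        s!/\./!/!g
        :a
        s!/[^/]*[^/.]/\.\.!!g
        t a
    '
}

for p in /usr/local/share/bin/../../../bin/foo usr/../bin/..; do
    printf '%s -> %s\n' "$p" "$(normalize_path "$p")"
done
# /usr/local/share/bin/../../../bin/foo -> /usr/bin/foo
# usr/../bin/.. -> .
```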

hex conversion in .reg file

eval "$(sed -r ':a N; s/\\\n *//g; b a' mapi-utf8.reg | sed -r "s/(.*)/echo \'\1\'/; /hex:/s/echo/echo -e/" | sed -r '/hex:/{s/,00//g; s/([:,])([0-9a-f][0-9a-f])/\1\\x\2/g}; s/,//g')"

Find whole word matches only

Use \b, as in

sed -rn '/\bWORD\b/p' myfile.txt
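
For example, with made-up input, only the line where WORD stands alone is printed; PASSWORD and WORDS contain no word boundary at the required position. The helper name `whole_word` is hypothetical.

```shell
# Print only lines containing WORD as a whole word.
whole_word() { sed -rn '/\bWORD\b/p'; }

printf 'a WORD here\nPASSWORD\nWORDS\n' | whole_word
# a WORD here
```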

Concatenate C statements spanning multiple lines

Say we have some C files where some statements span multiple lines, and we want them back on a single line (for instance, to process them further). Use the following script:

find -name "*.[ch]" -type f -print0|xargs -0 sed -r '/#define/b a; /my_function/{:b /;/b a;N;s/\n//; b b};:a'|grep my_function     # To review result
find -name "*.[ch]" -type f -print0|xargs -0 sed -ri '/#define/b a; /my_function/{:b /;/b a;N;s/\n//; b b};:a'                     # To apply result in-place
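
The same logic can be tried on standard input before touching any file. The sketch below spells the script out over several lines with descriptive labels; the helper name `join_my_function` and the labels `join` and `end` are hypothetical, and GNU sed is assumed.

```shell
# Join lines of a statement containing my_function until a ';' is seen.
# Lines containing #define are skipped.
join_my_function() {
    sed -r '/#define/b end
/my_function/{
    :join
    /;/b end
    N
    s/\n//
    b join
}
:end'
}

printf 'x = my_function(a,\nb);\ny = 1;\n' | join_my_function
# x = my_function(a,b);
# y = 1;
```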

Match non-ascii characters / invalid collation character

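By default, in a multibyte locale such as UTF-8, sed only handles 7-bit ASCII input reliably: bytes that do not form a valid character are not matched [3], [4].

Here, in LANG=en_US.UTF-8, we see that the non-ASCII byte \xa9 is left untouched: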

echo $'Copyright \xa9 1999' | sed -r 's/./x/g'
# xxxxxxxxxx�xxxxx

Trying to match a non-ASCII range gives the error Invalid collation character:

echo $'Copyright \xa9 1999' | sed -r 's/[\d128-\d255]/x/g'
# sed: -e expression #1, char 19: Invalid collation character

We can bypass this issue by using an 8-bit character set, for instance iso-8859-1:

echo $'Copyright \xa9 1999' | LANG=iso-8859-1 sed -r 's/./x/g'
# xxxxxxxxxxxxxxxx
echo $'Copyright \xa9 1999' | LANG=iso-8859-1 sed -r 's/[\d128-\d255]/x/g'
# Copyright x 1999

Another solution is to set LC_COLLATE=C LC_CTYPE=C, which the sed manual recommends to avoid bugs in shell scripts [5]:

echo $'Copyright \xa9 1999' | LC_COLLATE=C LC_CTYPE=C sed -r 's/./x/g'
# xxxxxxxxxxxxxxxx
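
In the C locale every byte is a character, so a byte range can also be matched directly. The sketch below assumes GNU sed (for the \xHH escapes) and uses LC_ALL=C, which overrides both LC_COLLATE and LC_CTYPE for that single invocation; the helper name `mask_non_ascii` is hypothetical.

```shell
# Replace every byte in the range 0x80-0xff with '?'.
mask_non_ascii() { LC_ALL=C sed 's/[\x80-\xff]/?/g'; }

# \251 is the octal form of byte 0xa9.
printf 'Copyright \251 1999\n' | mask_non_ascii
# Copyright ? 1999
```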

Delete the first matching line

From SO:

# Delete first line matching 'foo'
sed '0,/foo/{//d}' inputfile  # Use 0,ADDR2, so that ADDR2 can match the 1st line

Note the special construction //d: the empty regular expression [6] matches the last regular expression used.
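
A quick check on made-up input, assuming GNU sed (the 0,ADDR2 address form is a GNU extension): only the first 'foo' line disappears, including when it is the very first line. The helper name `delete_first_foo` is hypothetical.

```shell
# Delete only the first line matching 'foo'.
delete_first_foo() { sed '0,/foo/{//d}'; }

printf 'a\nfoo\nb\nfoo\n' | delete_first_foo
# a
# b
# foo
printf 'foo\nfoo\n' | delete_first_foo
# foo
```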