Sed

Latest revision as of 16:28, 3 July 2021

References

Installation

It is recommended to add the following alias in your ~/.bashrc:

alias sed="sed -r"

Of course, this alias has no effect in shell scripts, because aliases are not expanded in non-interactive shells by default. There you'll have to specify the option explicitly at each invocation.

Usage

Some basic usage:

sed [OPTION]... {script-only-if-no-other-script} [input-file]...
sed -n                              # Silent - suppress automatic printing of pattern space
sed -r                              # Use extended regular expression
sed -i "s/foo/bar/" *.txt           # In-place file modification
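
To make the options above concrete, here is a minimal sketch, assuming GNU sed; the sample input and the variable names are invented for illustration:

```shell
# -n suppresses automatic printing; the p command then prints only selected lines.
second=$(printf 'a\nb\nc\n' | sed -n '2p')
echo "$second"        # b

# -r enables extended regular expressions: groups and + need no backslash.
swapped=$(printf 'foo123\n' | sed -r 's/([a-z]+)([0-9]+)/\2\1/')
echo "$swapped"       # 123foo
```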

Portable scripts / deal with locale

It is recommended to set the environment variables LC_COLLATE and LC_CTYPE to C to avoid bugs in shell scripts (see https://www.gnu.org/software/sed/manual/sed.html#Limitations):

export LC_COLLATE=C LC_CTYPE=C

# Now the following line works as expected
echo $'Copyright \xa9 1999' | sed -r 's/./x/g'

Another solution is to set the environment variable LANG to an 8-bit character set such as iso-8859-1.
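
One caveat: LC_ALL, when set, takes precedence over LC_COLLATE and LC_CTYPE. The sketch below is one possible script preamble; it assumes that the script needs no other locale setting and that dropping LC_ALL is acceptable:

```shell
# LC_ALL overrides the individual LC_* categories, so clear it first
# (assumption: nothing else in the script relies on it).
unset LC_ALL
export LC_COLLATE=C LC_CTYPE=C

# \251 is the octal spelling of byte 0xa9, understood by any POSIX printf.
out=$(printf 'Copyright \251 1999\n' | sed -r 's/./x/g')
echo "$out"
```

In the C locale every byte counts as one character, so each of the 16 input bytes is replaced.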

Commands a, i and c

The commands a\text, i\text and c\text take the rest of the line as their text argument: the command is terminated by a *newline*. To insert a newline character in the text, use \n:

cat mytext
# First line
# Second line
cat mysedscript
# 1 {i\inserted text
# s/$/ (not anymore)/g}
sed -f mysedscript mytext
# inserted text
# First line (not anymore)
# Second line

All on one line: use echo -e to generate the newline that terminates the command i:

echo -e "1 {i\\inserted text\ns/$/ (not anymore)/g}"| sed -f - mytext
# inserted text
# First line (not anymore)
# Second line

Same result without command i:

sed "1 {s/^/inserted text\n/; s/$/ (not anymore)/}" mytext
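
Another way to terminate the command i without a script file is to pass each command in its own -e option, since sed joins -e fragments with newlines. A minimal sketch, assuming GNU sed (for the one-line i\text form); the temporary file stands in for mytext:

```shell
# Recreate the sample file from this section in a temporary location.
tmp=$(mktemp)
printf 'First line\nSecond line\n' > "$tmp"

# Each -e fragment ends with an implicit newline, which terminates the i command.
result=$(sed -e '1i\inserted text' -e '1s/$/ (not anymore)/' "$tmp")
printf '%s\n' "$result"
# inserted text
# First line (not anymore)
# Second line

rm -f "$tmp"
```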

Empty regular expression

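The empty regular expression // matches the last regular expression used, so there is no need to repeat it (see https://www.gnu.org/software/sed/manual/html_node/Addresses.html).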
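
A short illustration, with invented sample input: the address selects lines containing foo, and the empty pattern in the s command reuses that same regex.

```shell
# /foo/ selects the line; s//qux/ reuses the last regex (foo) as its pattern.
out=$(printf 'foo bar\nbaz\n' | sed '/foo/s//qux/')
printf '%s\n' "$out"
# qux bar
# baz
```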

Regular expressions

See Regular Expressions.

Script Examples

Remove <script>...</script> HTML tag

s!<script[>\x20\t].*</script>!!g
/<script[>\x20\t]/{
    s!<script[>\x20\t].*!!g
    :NEXTCYCLE
    n
    /<\/script>/!{
        s!.*!!g
        b NEXTCYCLE
    }
    s!.*</script>!!g
}

Remove newlines

Newline characters are added to the pattern space when using the append command N. The script below joins all lines of standard input, replacing each newline with a space:

:a N
s/\n/ /g
b a

One-liner in bash (this variant deletes the newlines instead of replacing them with spaces):

sed -r ':a N; s/\n//; b a' FILE
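
A commonly seen variant of the same idea first accumulates the whole input and then substitutes once. This is a sketch assuming GNU sed, not the form used above: $!ba branches back to the label on every line except the last.

```shell
# Accumulate all lines in the pattern space, then replace every newline at once.
joined=$(printf 'a\nb\nc\n' | sed ':a;N;$!ba;s/\n/ /g')
echo "$joined"    # a b c
```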

Remove trailing whitespaces

find -name '*.[chs]' -print0 | xargs -r0 sed -e 's/[[:blank:]]\+$//' -i
ack-grep --text --type-set=pdf=.pdf --nopdf -f --print0 | xargs -r0 sed -r -i 's/\s+$//';
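
To check the substitution on its own before running it in-place on a whole tree, it can be tried on a few sample lines. A minimal sketch with invented input, assuming GNU sed (\+ in a basic regular expression is a GNU extension):

```shell
# The first line ends with two spaces, the second with a tab, the third is clean.
cleaned=$(printf 'int a;  \nint b;\t\nint c;\n' | sed -e 's/[[:blank:]]\+$//')
printf '%s\n' "$cleaned"
```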

Recursive patterns

For instance, to transform a path like /usr/local/share/bin/../../../bin/foo into /usr/bin/foo:

s!^([^./])!\./\1!                  # Prefix with './' unless starts with '.' or '/'
s!/\./!/!g                         # Remove any './' in middle (the dot must be escaped)
:a s!/[^/]*[^/.]/\.\.!!g           # Remove /foo/.. (1st letter must not be '/', last letter must not be '.')
t a                                # ... and repeat until no more substitutions
echo "/usr/local/share/bin/../../../bin/foo" | sed -r 's!^([^./])!\./\1!; s!/\./!/!g; :a s!/[^/]*[^/.]/\.\.!!g; t a'

Test paths:

/usr/local/share/../../../bin/foo     # /bin/foo
/usr/local/./share/../../../bin/foo   # /bin/foo
./usr/../bin/foo                      # ./bin/foo 
usr/../bin/foo                        # ./bin/foo
usr/../bin                            # ./bin
usr/../bin/..                         # .
usr/../bin/../..                      # ./..
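
The test paths above can be checked with a small wrapper. This is a sketch only: normalize_path is a hypothetical helper name, GNU sed is assumed, and the sed program is the one from this section with the dot in '/./' escaped and the label terminated by a semicolon.

```shell
# Hypothetical helper: feeds one path through the sed program of this section.
normalize_path() {
    printf '%s\n' "$1" |
        sed -r 's!^([^./])!./\1!; s!/\./!/!g; :a; s!/[^/]*[^/.]/\.\.!!g; ta'
}

normalize_path "/usr/local/share/bin/../../../bin/foo"   # /usr/bin/foo
normalize_path "/usr/local/./share/../../../bin/foo"     # /bin/foo
normalize_path "usr/../bin/.."                           # .
```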

Hex conversion in .reg file

eval "$(sed -r ':a N; s/\\\n *//g; b a' mapi-utf8.reg | sed -r "s/(.*)/echo \'\1\'/; /hex:/s/echo/echo -e/" | sed -r '/hex:/{s/,00//g; s/([:,])([0-9a-f][0-9a-f])/\1\\x\2/g}; s/,//g')"

Find whole word matches only

Use \b, as in

sed -rn '/\bWORD\b/p' myfile.txt
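
For instance, with a few invented input lines (assuming GNU sed, where \b is available), only the lines containing WORD as a whole word are printed:

```shell
# SWORD and WORDS contain WORD but not as a whole word, so they are skipped.
matches=$(printf 'WORD\nSWORD\nWORDS\na WORD here\n' | sed -rn '/\bWORD\b/p')
printf '%s\n' "$matches"
# WORD
# a WORD here
```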

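Concatenate C statements spanning multiple lines

Say we have some C files where some statements span multiple lines, and we want each of them back on a single line (for instance, to process them further). The script below skips #define lines and, starting at each line that mentions my_function, keeps appending the next line until the pattern space contains a ';':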

find -name "*.[ch]" -type f -print0|xargs -0 sed -r '/#define/b a; /my_function/{:b /;/b a;N;s/\n//; b b};:a'|grep my_function     # To review result
find -name "*.[ch]" -type f -print0|xargs -0 sed -ri '/#define/b a; /my_function/{:b /;/b a;N;s/\n//; b b};:a'                     # To apply result in-place
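
The same logic can be tried on a small sample without touching any file. The sketch below spells the script over several lines with longer label names (join and end are illustrative names, not the ones used above) and uses invented C input:

```shell
# Same program as above, one command per line.
script='
/#define/b end
/my_function/{
  :join
  /;/b end
  N
  s/\n//
  b join
}
:end
'

# A call to my_function split over two lines, followed by an unrelated line.
joined=$(printf '%s\n' 'int x = my_function(a,' '  b);' 'int y;' | sed -r "$script")
printf '%s\n' "$joined"
# int x = my_function(a,  b);
# int y;
```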

Match non-ascii characters / invalid collation character

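In a multibyte locale such as UTF-8, sed does not match bytes that do not form a valid character, so in practice only 7-bit ASCII is handled reliably when the input is not encoded as the locale expects (see https://unix.stackexchange.com/questions/256806/replace-non-ascii-characters-with-space-in-a-file and https://stackoverflow.com/questions/9670916/will-sed-and-others-corrupt-non-ascii-files).

Here, with LANG=en_US.UTF-8, we see that the non-ASCII byte \xa9 is not matched by the regex . and is left untouched: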

echo $'Copyright \xa9 1999' | sed -r 's/./x/g'
# xxxxxxxxxx�xxxxx

Trying to match a non-ASCII range gives the error Invalid collation character:

echo $'Copyright \xa9 1999' | sed -r 's/[\d128-\d255]/x/g'
# sed: -e expression #1, char 19: Invalid collation character

We can bypass this issue by using an 8-bit character set, for instance iso-8859-1:

echo $'Copyright \xa9 1999' | LANG=iso-8859-1 sed -r 's/./x/g'
# xxxxxxxxxxxxxxxx
echo $'Copyright \xa9 1999' | LANG=iso-8859-1 sed -r 's/[\d128-\d255]/x/g'
# Copyright x 1999

Another solution is to set LC_COLLATE=C and LC_CTYPE=C, which is the setting recommended to avoid such bugs in shell scripts (see https://www.gnu.org/software/sed/manual/sed.html#Limitations):

echo $'Copyright \xa9 1999' | LC_COLLATE=C LC_CTYPE=C sed -r 's/./x/g'
# xxxxxxxxxxxxxxxx
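
In the C locale an explicit byte range can be matched as well. A minimal sketch, assuming GNU sed (the \xHH escapes are a GNU extension) and using LC_ALL=C, which overrides both LC_COLLATE and LC_CTYPE:

```shell
# \251 is the octal spelling of byte 0xa9, understood by any POSIX printf.
replaced=$(printf 'Copyright \251 1999\n' | LC_ALL=C sed 's/[\x80-\xff]/?/g')
echo "$replaced"    # Copyright ? 1999
```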

Delete the first matching line

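From Stack Overflow (https://stackoverflow.com/questions/23696871/how-to-remove-only-the-first-occurrence-of-a-line-in-a-file-using-sed):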

# Delete first line matching 'foo'
sed '0,/foo/{//d}' inputfile  # Use 0,ADDR2, so that ADDR2 can match the 1st line

Note the special construction //d, which uses the empty regular expression (see https://www.gnu.org/software/sed/manual/html_node/Addresses.html): it matches the last regular expression used, here foo.
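
A quick check on invented input, assuming GNU sed (the 0,/regex/ address form is a GNU extension): only the first foo line is removed, even though it is the very first line of the input.

```shell
# The range 0,/foo/ ends at the first line matching foo; //d deletes that line only.
remaining=$(printf 'foo\nbar\nfoo\n' | sed '0,/foo/{//d}')
printf '%s\n' "$remaining"
# bar
# foo
```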