Awk

Latest revision as of 16:00, 28 August 2023

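References

An Awk Primer (good tutorial on Awk): http://vc.airvectors.net/tsawk.html
gawk User guide: https://www.gnu.org/software/gawk/manual/
GAWK: Effective AWK Programming (gawk.pdf from package gawk-doc)
CLI text processing with GNU Awk (book, many examples): https://learnbyexample.github.io/learn_gnuawk/

On Awk:

AWK As A Major Systems Programming Language - Revisited: http://www.skeeve.com/awk-sys-prog.html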

Awk Examples

ps al | awk '{print $2}'                                         # Print second field of ps output
arp -n 10.137.3.129|awk '/ether/{print $3}'                      # Print third field of arp output, if line contains 'ether' somewhere
getent hosts unix.stackexchange.com | awk '{ print $1 ; exit }'  # Print only first line, then exit
find /proc -type l | awk -F"/" '{print $3}'                      # Print second folder name (i.e. process pid)

Example of parsing an XML file (and comparing with perl):

cat FILE
#        <configuration buildProperties="" description="" id="some_id.1525790178" name="some_name" parent="some_parent">
awk -F "[= <>\"]+" '/<configuration / { if ($8 == "some_name") print $6 }' FILE
# some_id.1525790178
perl -lne 'print $1 if /<configuration .* id="([^"]*)" name="some_name"/' FILE
# some_id.1525790178

Language reference

Awk program structure

@include "script1"    # gawk extension
pattern {action}
pattern {action}
# ...
function name (args) { ... }

A rule is a pattern followed by an action. Either one can be omitted: a missing pattern matches every input record, and a missing action defaults to { print }.
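As a small illustration of omitting one half of a rule, the sketch below runs a pattern-only program and an action-only program over the same input. The sample data and the variable names are invented for this example.

```shell
# Sample input invented for this illustration.
input=$(printf 'apple 1\nbanana 2\ncherry 3\n')

# Pattern only: the default action { print } prints each matching record.
pattern_only=$(printf '%s\n' "$input" | awk '/an/')

# Action only: with no pattern, the action runs for every record.
action_only=$(printf '%s\n' "$input" | awk '{ print $2 }')

echo "$pattern_only"
echo "$action_only"
```

The first program prints only the record containing "an"; the second prints the second field of every record.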

Patterns

/regular expression/ {  }   # match when the input record matches the regular expression
expression           {  }   # match when the expression is nonzero (or a non-empty string)
begpat, endpat       {  }   # range: match from a record matching begpat through the next one matching endpat
BEGIN                {  }   # run before any input is read. All BEGIN rules are merged.
END                  {  }   # run after all input is read. All END rules are merged.
BEGINFILE            {  }   # run before each input file (merged; gawk extension)
ENDFILE              {  }   # run after each input file (merged; gawk extension)
                     {  }   # empty pattern: match every input record
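The following sketch combines three of the pattern kinds listed above (BEGIN, a range pattern, and END) in one program. The input lines and the words "start" and "stop" are placeholders chosen for the example.

```shell
# Print a header, every record from the first one matching /start/ through
# the next one matching /stop/, and finally the number of records read.
range_out=$(printf 'a\nstart\nb\nstop\nc\n' | awk '
BEGIN          { print "begin" }
/start/,/stop/ { print "in range: " $0 }
END            { print "records: " NR }
')
echo "$range_out"
```

Records outside the range ("a" and "c") are read and counted in NR but never printed.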

Search patterns using regex can be constrained to a given field:

$1 ~ /^France$/  {  }   # searches for lines whose first field is the word France
$1 !~ /^Norway$/ {  }   # searches for lines whose first field is NOT the word Norway
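To see why the anchors ^ and $ matter when matching a field, here is a minimal sketch on made-up data. "Franceville" is an invented value that contains "France" without being equal to it.

```shell
# Made-up sample data: country name and a number.
countries=$(printf 'France 67\nNorway 5\nFranceville 1\n')

# Anchored: the first field must be exactly "France".
exact=$(printf '%s\n' "$countries" | awk '$1 ~ /^France$/')

# Unanchored: the first field only needs to contain "France".
loose=$(printf '%s\n' "$countries" | awk '$1 ~ /France/')

# Negated match: every record whose first field is not exactly "Norway".
not_norway=$(printf '%s\n' "$countries" | awk '$1 !~ /^Norway$/')

echo "$exact"; echo "$loose"; echo "$not_norway"
```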

Examples of expressions:

NR == 10            {  }   # Match line number 10
NR == 10, NR == 20  {  }   # Match lines 10 through 20
NF == 0             {  }   # Match empty lines (i.e. with zero fields, including whitespace-only lines)
$1 == "France"      {  }   # Match lines whose first word is "France"

Take care with numeric comparisons:

(( $1 + 0 ) == $1 )                   {  }   # Match if first field is numeric
(( $1 + 0 ) != $1 )                   {  }   # Match if first field is not numeric (a string)
$1 == 100                             {  }   # Equality test -- safe: a non-numeric field simply does not match
$1 < 100                              {  }   # DANGEROUS - silently becomes a string comparison if $1 is not numeric
((( $1 + 0 ) == $1 ) && ( $1 > 100 )) {  }   # BETTER - first check that the field is numeric
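The sketch below shows the pitfall on invented data. The value "09x" is not a number, so an unguarded ordering test compares it as a string, and it sorts before "100".

```shell
# Invented sample: two numbers and two non-numeric values.
data=$(printf '50\nabc\n250\n09x\n')

# Unguarded test: "09x" is compared as a string and slips through.
naive=$(printf '%s\n' "$data" | awk '$1 < 100')

# Guarded test: only fields that survive the numeric round trip are compared.
guarded=$(printf '%s\n' "$data" | awk '($1 + 0 == $1) && ($1 < 100)')

echo "naive:";   echo "$naive"
echo "guarded:"; echo "$guarded"
```

The unguarded program prints both 50 and 09x, while the guarded one prints only 50.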

Control statements

Block and sequences

Instructions are grouped with braces { ... } and separated by newlines or semicolons ;

{ if (NR) { print NR; print "hello" } }

If statement

# multiline
if (x % 2)
    print "x is odd"
else
    print "x is even"

# single line
if (x % 2) print "x is odd"; else print "x is even"

While statement

i = 1; while (i <= 3) { print $i; i++ }

For statement

for (i = 1; i <= 3; i++) print $i
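These statements are normally combined inside an action. As a minimal sketch on made-up input, the program below rebuilds each record with its fields in reverse order, using a for loop and an if/else.

```shell
reversed=$(printf 'a b c\nd e\n' | awk '{
    line = ""
    for (i = NF; i >= 1; i--) {            # walk the fields from last to first
        if (line == "") line = $i          # first piece: no separator needed
        else            line = line " " $i
    }
    print line
}')
echo "$reversed"
```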

Functions

t=mktime("2020 12 26 23 43 11")        # Convert to time integer (mktime is a gawk extension)
gsub(/[:-]/," ",$1); t=mktime($1)      # if input in 1st field, formatted as 2020-12-26 23:43:11

How-To

Execute a system command and capture its output

To run a system command, use system("cmd"). To capture its output, use cmd | getline value instead (see https://stackoverflow.com/questions/1960895/assigning-system-commands-output-to-variable). The command must then be closed with close(cmd); otherwise awk may complain, may not re-execute the command the next time, or may produce strange results:

Example of program:

/\/\/ test password/ {
    cmd = "openssl rand -hex 16"; 
    cmd | getline r; 
    gsub(/[0-9a-f][0-9a-f]/,"0x&, ",r); 
    print "    { ", r, "}, // test password - DO NOT EDIT THIS COMMENT"; 
    close(cmd); 
    next;
}
{ print }
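A smaller, self-contained sketch of the same getline/close pattern follows. It uses "echo fresh" as a stand-in command (an assumption for this example, replacing openssl) so it runs anywhere.

```shell
captured=$(printf 'one\ntwo\n' | awk '{
    cmd = "echo fresh"                      # stand-in command for this sketch
    if ((cmd | getline result) > 0)         # read the first line of its output
        print $0 ": " result
    else
        print $0 ": no output"
    close(cmd)                              # lets the same command run again
}')
echo "$captured"
```

Because cmd is the same string for every record, close(cmd) is what allows the second record to start the command afresh. Without it, the second getline would read from the already exhausted pipe and return 0.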

Tips

Defining environment variable

Have an Awk script print the assignment, then apply it with the Bash builtin eval:

eval $(awk 'BEGIN{printf "MY_VAR=value";}')
echo $MY_VAR
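The same idea extends to several variables at once. In this sketch the names DEMO_HOST and DEMO_PORT and their values are invented. Quoting the command substitution keeps the newlines between the assignments intact.

```shell
# awk prints one assignment per line; eval executes them in the current shell.
eval "$(awk 'BEGIN { printf "DEMO_HOST=%s\nDEMO_PORT=%d\n", "localhost", 8080 }')"
echo "$DEMO_HOST:$DEMO_PORT"
```

Since eval executes whatever awk prints, this is only reasonable when the awk output is fully under your control.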

Hexadecimal conversion

Use strtonum (a gawk extension) to convert a parameter:

{
    print strtonum($1);       # decimal, octal or hexa (guessed from prefix)
    print strtonum("0"$2);    # To force octal
    print strtonum("0x"$3);   # To force hexadecimal
}

Alternatively, use gawk --non-decimal-data to have gawk interpret hexadecimal and octal values in the input data directly.

Using environment variables

Use ENVIRON["NAME"]:

{ print strtonum("0x"ENVIRON["STARTADDR"]); }
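A minimal sketch that avoids gawk-only functions is shown below. DEMO_NAME is an invented variable, set only for the duration of the awk command.

```shell
# The variable is placed in the environment of this one awk invocation.
greeting=$(DEMO_NAME=world awk 'BEGIN { print "hello " ENVIRON["DEMO_NAME"] }')
echo "$greeting"
```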

Pass command-line parameters

Awk variables can be defined directly on the invocation line:

awk -v myvar=123 'BEGIN { printf "myvar is %d\n",myvar }'     # Use -v (before program text) for var used in BEGIN section
echo foo | awk '{ printf "myvar is %d\n",myvar }' myvar=123   # Otherwise specify var after program text
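The difference between the two forms can be checked directly. In this sketch, which reuses the variable name myvar from above, the assignment placed after the program text has not yet been processed when BEGIN runs.

```shell
# -v assigns before the program starts, so BEGIN sees the value.
with_v=$(awk -v myvar=123 'BEGIN { print "BEGIN sees [" myvar "]" }')

# An assignment after the program text is processed like a file operand,
# i.e. only once BEGIN has already finished.
late=$(echo foo | awk 'BEGIN { print "BEGIN sees [" myvar "]" }
                             { print "main sees [" myvar "]" }' myvar=123)

echo "$with_v"
echo "$late"
```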

Read command-line arguments

Awk defines the variables ARGC and ARGV:

BEGIN {
    for (i = 0; i < ARGC; i++)
        print ARGV[i]
}
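A runnable sketch of the same loop follows, with two placeholder arguments ("first" and "second"). It starts at index 1 because ARGV[0] holds the interpreter name, which varies between awk implementations.

```shell
# A BEGIN-only program never opens its arguments as files,
# so arbitrary words can be passed for demonstration.
args=$(awk 'BEGIN { for (i = 1; i < ARGC; i++) print i ": " ARGV[i] }' first second)
echo "$args"
```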

$0 is the whole line

# Concatenate DNS
/^A\?/{print record; record=$0} 
/^A /{record=record " " $0;} 
END {print record}

String concatenation

Simply place the strings next to each other, with no operator between them.

print "The result is " result;

Next line on pattern match

Use next to stop processing the current record after the first matching rule, so that at most one rule in the list fires:

/PATTERN1/ {print $1; next}
/PATTERN2/ {print $2; next}
{print $3}
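A small sketch of this behaviour on invented log-like lines is shown below. The first line matches both patterns, but only the first rule runs for it.

```shell
routed=$(printf 'error then warn\nwarn only\nplain\n' | awk '
/error/ { print "E: " $0; next }     # first match wins, skip the rules below
/warn/  { print "W: " $0; next }
        { print "other: " $0 }       # reached only if nothing matched above
')
echo "$routed"
```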

Force int conversion with x+0

Say we have a file foo in which numbers are glued to non-digit characters:

( 1 2)
( 1 3)

We can force numeric conversion by applying an arithmetic operation; awk keeps the leading numeric prefix of the field and ignores the rest:

awk '{print $3}' foo
# 2)
# 3)
awk '{print $3+0}' foo
# 2
# 3

Pattern conversion

2014-01     2,277.40
2014-02     2,282.20
2014-03     3,047.90
2014-04     4,127.60
2014-05     5,117.60

Use gsub for regex replacement (here, removing the thousands-separator commas , from the second column before summing it):

awk '{gsub(/,/,"",$2);sum+=$2}END{printf("%f",sum)}'
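For a self-contained check, the sketch below feeds the five sample rows shown above to the same program through a pipe. The only changes are the input source and a two-decimal output format.

```shell
total=$(printf '2014-01 2,277.40\n2014-02 2,282.20\n2014-03 3,047.90\n2014-04 4,127.60\n2014-05 5,117.60\n' |
    awk '{ gsub(/,/, "", $2); sum += $2 } END { printf "%.2f\n", sum }')
echo "$total"
```

Without the gsub call, awk would convert "2,277.40" to 2 (the numeric prefix before the comma), and the sum would be wrong.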

Remove duplicates, keeping line order

A simple awk script to remove duplicate lines from a file, keeping the original order (from https://iridakos.com/how-to/2019/05/16/remove-duplicate-lines-preserving-order-linux.html):

awk '!visited[$0]++' your_file > deduplicated_file
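The one-liner works because visited[$0]++ returns the count before incrementing: 0 (false) the first time a line is seen, so the negation is true and the line is printed by the default action. A quick sketch on made-up input:

```shell
deduped=$(printf 'b\na\nb\nc\na\n' | awk '!visited[$0]++')
echo "$deduped"
```

Each distinct line is stored as an array key, so memory use grows with the number of distinct lines.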

Remove the first, second... line matching a pattern

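From Stack Overflow (https://stackoverflow.com/questions/23696871/how-to-remove-only-the-first-occurrence-of-a-line-in-a-file-using-sed):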

awk '/foo/{ if (++f == 1) next} 1' file         # Delete 1st matching line
awk '/foo/{ if (++f == 2) next} 1' file         # Delete 2nd matching line
awk '/foo/{ if (++f ~ /^(1|2)$/) next} 1' file  # Delete 1st and 2nd matching line
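The occurrence to delete can also be passed in as a variable. This is a sketch under the assumption that a variable n, set with -v, is acceptable here; the counter name seen and the input are invented.

```shell
# Drop only the n-th record matching /foo/; the trailing 1 prints the rest.
kept=$(printf 'foo 1\nbar\nfoo 2\nfoo 3\n' |
    awk -v n=2 '/foo/ && ++seen == n { next } 1')
echo "$kept"
```

The counter is incremented only on matching records, because && does not evaluate its right-hand side when /foo/ fails.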

Process CSV files

See csvquote in Linux Commands.

There is also a rewrite of AWK in Go, GoAWK (https://github.com/benhoyt/goawk), with CSV support.