Difference between pages "Awk by Example, Part 1" and "Awk by Example, Part 2"

From Funtoo
(Difference between pages)
Jump to navigation Jump to search
m (pre -> console)
 
 
Line 1: Line 1:
{{WikiArticle}}
{{WikiArticle}}


== An intro to the great language with the strange name ==
== Records, loops, and arrays ==


=== In defense of awk ===
=== Multi-line records ===
In this series of articles, I'm going to turn you into a proficient awk coder. I'll admit, awk doesn't have a very pretty or particularly "hip" name, and the GNU version of awk, called gawk, sounds downright weird. Those unfamiliar with the language may hear "awk" and think of a mess of code so backwards and antiquated that it's capable of driving even the most knowledgeable UNIX guru to the brink of insanity (causing him to repeatedly yelp "kill -9!" as he runs for coffee machine).
Awk is an excellent tool for reading in and processing structured data, such as the system's /etc/passwd file. /etc/passwd is the UNIX user database, and is a colon-delimited text file, containing a lot of important information, including all existing user accounts and user IDs, among other things. In my previous article, I showed you how awk could easily parse this file. All we had to do was to set the FS (field separator) variable to ":".


Sure, awk doesn't have a great name. But it is a great language. Awk is geared toward text processing and report generation, yet features many well-designed features that allow for serious programming. And, unlike some languages, awk's syntax is familiar, and borrows some of the best parts of languages like C, python, and bash (although, technically, awk was created before both python and bash). Awk is one of those languages that, once learned, will become a key part of your strategic coding arsenal.
By setting the FS variable correctly, awk can be configured to parse almost any kind of structured data, as long as there is one record per line. However, just setting FS won't do us any good if we want to parse a record that exists over multiple lines. In these situations, we also need to modify the RS record separator variable. The RS variable tells awk when the current record ends and a new record begins.


=== The first awk ===
As an example, let's look at how we'd handle the task of processing an address list of Federal Witness Protection Program participants:
Let's go ahead and start playing around with awk to see how it works. At the command line, enter the following command:
<pre>
Jimmy the Weasel
100 Pleasant Drive
San Francisco, CA 12345


<console>$ awk '{ print }' /etc/passwd</console>
Big Tony
200 Incognito Ave.
Suburbia, WA 67890
</pre>
Ideally, we'd like awk to recognize each 3-line address as an individual record, rather than as three separate records. It would make our code a lot simpler if awk would recognize the first line of the address as the first field ($1), the street address as the second field ($2), and the city, state, and zip code as field $3. The following code will do just what we want:
<pre>
BEGIN {
    FS="\n"
    RS=""
}
</pre>
Above, setting FS to "\n" tells awk that each field appears on its own line. By setting RS to "", we also tell awk that each address record is separated by a blank line. Once awk knows how the input is formatted, it can do all the parsing work for us, and the rest of the script is simple. Let's look at a complete script that will parse this address list and print out each address record on a single line, separating each field with a comma.
<pre>
BEGIN {
    FS="\n"
    RS=""
}
{ print $1 ", " $2 ", " $3 }
</pre>
If this script is saved as address.awk, and the address data is stored in a file called address.txt, you can execute this script by typing awk -f address.awk address.txt. This code produces the following output:
<pre>
Jimmy the Weasel, 100 Pleasant Drive, San Francisco, CA 12345
Big Tony, 200 Incognito Ave., Suburbia, WA 67890
</pre>


You should see the contents of your /etc/passwd file appear before your eyes. Now, for an explanation of what awk did. When we called awk, we specified /etc/passwd as our input file. When we executed awk, it evaluated the print command for each line in /etc/passwd, in order. All output is sent to stdout, and we get a result identical to catting /etc/passwd.
=== OFS and ORS ===
In address.awk's print statement, you can see that awk concatenates (joins) strings that are placed next to each other on a line. We used this feature to insert a comma and a space (", ") between the three address fields that appeared on the line. While this method works, it's a bit ugly looking. Rather than inserting literal ", " strings between our fields, we can have awk do it for us by setting a special awk variable called OFS. Take a look at this code snippet.
<pre>
print "Hello", "there", "Jim!"
</pre>


Now, for an explanation of the { print } code block. In awk, curly braces are used to group blocks of code together, similar to C. Inside our block of code, we have a single print command. In awk, when a print command appears by itself, the full contents of the current line are printed.
The commas on this line are not part of the actual literal strings. Instead, they tell awk that "Hello", "there", and "Jim!" are separate fields, and that the OFS variable should be printed between each string. By default, awk produces the following output:
 
<pre>
Here is another awk example that does exactly the same thing:
Hello there Jim!
 
</pre>
<console>$ awk '{ print $0 }' /etc/passwd</console>
This shows us that by default, OFS is set to " ", a single space. However, we can easily redefine OFS so that awk will insert our favorite field separator. Here's a revised version of our original address.awk program that uses OFS to output those intermediate ", " strings:
 
In awk, the $0 variable represents the entire current line, so print and print $0 do exactly the same thing. If you'd like, you can create an awk program that will output data totally unrelated to the input data. Here's an example:
 
<console>$ awk '{ print "" }' /etc/passwd</console>
 
Whenever you pass the "" string to the print command, it prints a blank line. If you test this script, you'll find that awk outputs one blank line for every line in your /etc/passwd file. Again, this is because awk executes your script for every line in the input file. Here's another example:
 
<console>$ awk '{ print "hiya" }' /etc/passwd</console>
 
Running this script will fill your screen with hiya's. :)
 
=== Multiple fields ===
Awk is really good at handling text that has been broken into multiple logical fields, and allows you to effortlessly reference each individual field from inside your awk script. The following script will print out a list of all user accounts on your system:
 
<console>$ awk -F":" '{ print $1 }' /etc/passwd</console>
 
Above, when we called awk, we use the -F option to specify ":" as the field separator. When awk processes the print $1 command, it will print out the first field that appears on each line in the input file. Here's another example:
 
<console>$ awk -F":" '{ print $1 $3 }' /etc/passwd</console>
 
Here's an excerpt of the output from this script:
<pre>
<pre>
halt7
BEGIN {
operator11
    FS="\n"
root0
    RS=""
shutdown6
    OFS=", "
sync5
}
bin1
{ print $1, $2, $3 }
....etc.
</pre>
</pre>
As you can see, awk prints out the first and third fields of the /etc/passwd file, which happen to be the username and uid fields respectively. Now, while the script did work, it's not perfect -- there aren't any spaces between the two output fields! If you're used to programming in bash or python, you may have expected the print $1 $3 command to insert a space between the two fields. However, when two strings appear next to each other in an awk program, awk concatenates them without adding an intermediate space. The following command will insert a space between both fields:
Awk also has a special variable called ORS, called the "output record separator". By setting ORS, which defaults to a newline ("\n"), we can control the character that's automatically printed at the end of a print statement. The default ORS value causes awk to output each new print statement on a new line. If we wanted to make the output double-spaced, we would set ORS to "\n\n". Or, if we wanted records to be separated by a single space (and no newline), we would set ORS to " ".
 
<console>$ awk -F":" '{ print $1 " " $3 }' /etc/passwd</console>
 
When you call print this way, it'll concatenate $1, " ", and $3, creating readable output. Of course, we can also insert some text labels if needed:
 
<console>$ awk -F":" '{ print "username: " $1 "\t\tuid:" $3 }' /etc/passwd</console>


This will cause the output to be:
=== Multi-line to tabbed ===
Let's say that we wrote a script that converted our address list to a single-line per record, tab-delimited format for import into a spreadsheet. After using a slightly modified version of address.awk, it would become clear that our program only works for three-line addresses. If awk encountered the following address, the fourth line would be thrown away and not printed:
<pre>
<pre>
username: halt    uid:7
Cousin Vinnie
username: operator uid:11
Vinnie's Auto Shop
username: root    uid:0
300 City Alley
username: shutdown uid:6
Sosueme, OR 76543
username: sync    uid:5
username: bin      uid:1
....etc.
</pre>
</pre>
 
To handle situations like this, it would be good if our code took the number of records per field into account, printing each one in order. Right now, the code only prints the first three fields of the address. Here's some code that does what we want:
=== External Scripts ===
Passing your scripts to awk as a command line argument can be very handy for small one-liners, but when it comes to complex, multi-line programs, you'll definitely want to compose your script in an external file. Awk can then be told to source this script file by passing it the -f option:
 
<console>$ awk -f myscript.awk myfile.in </console>
 
Putting your scripts in their own text files also allows you to take advantage of additional awk features. For example, this multi-line script does the same thing as one of our earlier one-liners, printing out the first field of each line in /etc/passwd:
<pre>
<pre>
BEGIN {  
BEGIN {  
        FS=":"  
    FS="\n"
    RS=""
    ORS=""
}
    x=1
    while ( x<NF ) {
        print $x "\t"
        x++
    }
    print $NF "\n"  
}  
}  
{ print $1 }
</pre>
</pre>
The difference between these two methods has to do with how we set the field separator. In this script, the field separator is specified within the code itself (by setting the FS variable), while our previous example set FS by passing the -F":" option to awk on the command line. It's generally best to set the field separator inside the script itself, simply because it means you have one less command line argument to remember to type. We'll cover the FS variable in more detail later in this article.
First, we set the field separator FS to "\n" and the record separator RS to "" so that awk parses the multi-line addresses correctly, as before. Then, we set the output record separator ORS to "", which will cause the print statement to not output a newline at the end of each call. This means that if we want any text to start on a new line, we need to explicitly write print "\n".


It is also possible to make the script directly executable, by placing a "#!/usr/bin/awk -f" at the top of the file, as follows:
In the main code block, we create a variable called x that holds the number of current field that we're processing. Initially, it's set to 1. Then, we use a while loop (an awk looping construct identical to that found in the C language) to iterate through all but the last record, printing the record and a tab character. Finally, we print the last record and a literal newline; again, since ORS is set to "", print won't output newlines for us. Program output looks like this, which is exactly what we wanted:
<pre>
<pre>
#!/usr/bin/awk -f
Jimmy the Weasel        100 Pleasant Drive      San Francisco, CA 12345
BEGIN {
Big Tony        200 Incognito Ave.      Suburbia, WA 67890
FS=":"
Cousin Vinnie  Vinnie's Auto Shop      300 City Alley  Sosueme, OR 76543
</pre>
 
=== Looping constructs ===
We've already seen awk's while loop construct, which is identical to its C counterpart. Awk also has a "do...while" loop that evaluates the condition at the end of the code block, rather than at the beginning like a standard while loop. It's similar to "repeat...until" loops that can be found in other languages. Here's an example:
<pre>
{
    count=1
    do {
        print "I get printed at least once no matter what"  
    } while ( count != 1 )
}
}
{ print $1 }
</pre>
</pre>
Next, the script must be made executable by setting the script file's execute bit:
Because the condition is evaluated after the code block, a "do...while" loop, unlike a normal while loop, will always execute at least once. On the other hand, a normal while loop will never execute if its condition is false when the loop is first encountered.


<console>$ chmod +x myscript.awk</console>
=== for loops ===
 
Awk allows you to create for loops, which like while loops are identical to their C counterpart:
Now, you should be able to execute the script as follows:
<pre>
 
for ( initial assignment; comparison; increment ) {
<console>$ ./myscript.awk myfile.in</console>
    code block
 
}
=== The BEGIN and END blocks ===
</pre>
Normally, awk executes each block of your script's code once for each input line. However, there are many programming situations where you may need to execute initialization code before awk begins processing the text from the input file. For such situations, awk allows you to define a BEGIN block. We used a BEGIN block in the previous example. Because the BEGIN block is evaluated before awk starts processing the input file, it's an excellent place to initialize the FS (field separator) variable, print a heading, or initialize other global variables that you'll reference later in the program.
Here's a quick example:
 
Awk also provides another special block, called the END block. Awk executes this block after all lines in the input file have been processed. Typically, the END block is used to perform final calculations or print summaries that should appear at the end of the output stream.
 
=== Regular expressions and blocks ===
Awk allows the use of regular expressions to selectively execute an individual block of code, depending on whether or not the regular expression matches the current line. Here's an example script that outputs only those lines that contain the character sequence foo:
 
<pre>/foo/ { print }</pre>
 
Of course, you can use more complicated regular expressions. Here's a script that will print only lines that contain a floating point number:
 
<pre>/[0-9]+\.[0-9]*/ { print }</pre>
 
=== Expressions and blocks ===
There are many other ways to selectively execute a block of code. We can place any kind of boolean expression before a code block to control when a particular block is executed. Awk will execute a code block only if the preceding boolean expression evaluates to true. The following example script will output the third field of all lines that have a first field equal to fred. If the first field of the current line is not equal to fred, awk will continue processing the file and will not execute the print statement for the current line:
 
<pre>$1 == "fred" { print $3 }</pre>
 
Awk offers a full selection of comparison operators, including the usual "==", "<", ">", "<=", ">=", and "!=". In addition, awk provides the "~" and "!~" operators, which mean "matches" and "does not match". They're used by specifying a variable on the left side of the operator, and a regular expression on the right side. Here's an example that will print only the third field on the line if the fifth field on the same line contains the character sequence root:
 
<pre>$5 ~ /root/ { print $3 }</pre>
 
=== Conditional statements ===
Awk also offers very nice C-like if statements. If you'd like, you could rewrite the previous script using an if statement:
<pre>
<pre>
{
for ( x = 1; x <= 4; x++ ) {
    if ( $5 ~ /root/ ) {  
    print "iteration",x
        print $3
    }
}
}
</pre>
</pre>
Both scripts function identically. In the first example, the boolean expression is placed outside the block, while in the second example, the block is executed for every input line, and we selectively perform the print command by using an if statement. Both methods are available, and you can choose the one that best meshes with the other parts of your script.
This snippet will print:
<pre>
iteration 1
iteration 2
iteration 3
iteration 4
</pre>


Here's a more complicated example of an awk if statement. As you can see, even with complex, nested conditionals, if statements look identical to their C counterparts:
=== Break and continue ===
Again, just like C, awk provides break and continue statements. These statements provide better control over awk's various looping constructs. Here's a code snippet that desperately needs a break statement:
<pre>
<pre>
{
while (1) {
    if ( $1 == "foo" ) {
     print "forever and ever..."
        if ( $2 == "foo" ) {
            print "uno"
        } else {
            print "one"
        }
    } else if ($1 == "bar" ) {
        print "two"
     } else {
        print "three"
    }
}
}
</pre>
</pre>
Using if statements, we can also transform this code:
Because 1 is always true, this while loop runs forever. Here's a loop that only executes ten times:
<pre>
<pre>
! /matchme/ { print $1 $3 $4 }
x=1
while(1) {
    print "iteration",x
    if ( x == 10 ) {
        break
    }
    x++
}
</pre>
</pre>
to this:
Here, the break statement is used to "break out" of the innermost loop. "break" causes the loop to immediately terminate and execution to continue at the line after the loop's code block.
 
The continue statement complements break, and works like this:
<pre>
<pre>
{
x=1
     if ( $0 !~ /matchme/ ) {
while (1) {
         print $1 $3 $4
     if ( x == 4 ) {
         x++
        continue
    }
    print "iteration",x
    if ( x > 20 ) {
        break
     }
     }
    x++
}
}
</pre>
</pre>
Both scripts will output only those lines that don't contain a matchme character sequence. Again, you can choose the method that works best for your code. They both do the same thing.
This code will print "iteration 1" through "iteration 21", except for "iteration 4". If iteration equals 4, x is incremented and the continue statement is called, which immediately causes awk to start to the next loop iteration without executing the rest of the code block. The continue statement works for every kind of awk iterative loop, just as break does. When used in the body of a for loop, continue will cause the loop control variable to be automatically incremented. Here's an equivalent for loop:
 
Awk also allows the use of boolean operators "||" (for "logical or") and "&&"(for "logical and") to allow the creation of more complex boolean expressions:
<pre>
<pre>
( $1 == "foo" ) && ( $2 == "bar" ) { print }  
for ( x=1; x<=21; x++ ) {
    if ( x == 4 ) {
        continue
    }
    print "iteration",x
}
</pre>
</pre>
This example will print only those lines where field one equals foo and field two equals bar.
It wasn't necessary to increment x just before calling continue as it was in our while loop, since the for loop increments x automatically.


=== Numeric variables! ===
=== Arrays ===
So far, we've either printed strings, the entire line, or specific fields. However, awk also allows us to perform both integer and floating point math. Using mathematical expressions, it's very easy to write a script that counts the number of blank lines in a file. Here's one that does just that:
You'll be pleased to know that awk has arrays. However, under awk, it's customary to start array indices at 1, rather than 0:
<pre>
<pre>
BEGIN { x=0 }
myarray[1]="jim"
/^$/  { x=x+1 }
myarray[2]=456
END  { print "I found " x " blank lines. :)" }
</pre>
</pre>
In the BEGIN block, we initialize our integer variable x to zero. Then, each time awk encounters a blank line, awk will execute the x=x+1 statement, incrementing x. After all the lines have been processed, the END block will execute, and awk will print out a final summary, specifying the number of blank lines it found.
When awk encounters the first assignment, myarray is created and the element myarray[1] is set to "jim". After the second assignment is evaluated, the array has two elements.


=== Stringy variables ===
Once defined, awk has a handy mechanism to iterate over the elements of an array, as follows:
One of the neat things about awk variables is that they are "simple and stringy." I consider awk variables "stringy" because all awk variables are stored internally as strings. At the same time, awk variables are "simple" because you can perform mathematical operations on a variable, and as long as it contains a valid numeric string, awk automatically takes care of the string-to-number conversion steps. To see what I mean, check out this example:
<pre>
<pre>
x="1.01"
for ( x in myarray ) {
# We just set x to contain the *string* "1.01"
    print myarray[x]
x=x+1
}
# We just added one to a *string*
print x  
# Incidentally, these are comments :)
</pre>
</pre>
Awk will output:
This code will print out every element in the array myarray. When you use this special "in" form of a for loop, awk will assign every existing index of myarray to x (the loop control variable) in turn, executing the loop's code block once after each assignment. While this is a very handy awk feature, it does have one drawback -- when awk cycles through the array indices, it doesn't follow any particular order. That means that there's no way for us to know whether the output of above code will be:
<pre>
<pre>
2.01
jim
456
</pre>
</pre>
Interesting! Although we assigned the string value 1.01 to the variable x, we were still able to add one to it. We wouldn't be able to do this in bash or python. First of all, bash doesn't support floating point arithmetic. And, while bash has "stringy" variables, they aren't "simple"; to perform any mathematical operations, bash requires that we enclose our math in an ugly $( ) construct. If we were using python, we would have to explicitly convert our 1.01 string to a floating point value before performing any arithmetic on it. While this isn't difficult, it's still an additional step. With awk, it's all automatic, and that makes our code nice and clean. If we wanted to square and add one to the first field in each input line, we would use this script:
or
<pre>
<pre>
{ print ($1^2)+1 }
456
jim
</pre>
</pre>
If you do a little experimenting, you'll find that if a particular variable doesn't contain a valid number, awk will treat that variable as a numerical zero when it evaluates your mathematical expression.
To loosely paraphrase Forrest Gump, iterating over the contents of an array is like a box of chocolates -- you never know what you're going to get. This has something to do with the "stringiness" of awk arrays, which we'll now take a look at.


=== Lots of operators ===
=== Array index stringiness ===
Another nice thing about awk is its full complement of mathematical operators. In addition to standard addition, subtraction, multiplication, and division, awk allows us to use the previously demonstrated exponent operator "^", the modulo (remainder) operator "%", and a bunch of other handy assignment operators borrowed from C.
[[Awk by example, Part1 |In my previous article]], I showed you that awk actually stores numeric values in a string format. While awk performs the necessary conversions to make this work, it does open the door for some odd-looking code:
 
These include pre- and post-increment/decrement ( i++, --foo ), add/sub/mult/div assign operators ( a+=3, b*=2, c/=2.2, d-=6.2 ). But that's not all -- we also get handy modulo/exponent assign ops as well ( a^=2, b%=4 ).
 
=== Field separators ===
Awk has its own complement of special variables. Some of them allow you to fine-tune how awk functions, while others can be read to glean valuable information about the input. We've already touched on one of these special variables, FS. As mentioned earlier, this variable allows you to set the character sequence that awk expects to find between fields. When we were using /etc/passwd as input, FS was set to ":". While this did the trick, FS allows us even more flexibility.
 
The FS value is not limited to a single character; it can also be set to a regular expression, specifying a character pattern of any length. If you're processing fields separated by one or more tabs, you'll want to set FS like so:
<pre>
<pre>
FS="\t+"
a="1"
b="2"
c=a+b+3
</pre>
</pre>
Above, we use the special "+" regular expression character, which means "one or more of the previous character".
After this code executes, c is equal to 6. Since awk is "stringy", adding strings "1" and "2" is functionally no different than adding the numbers 1 and 2. In both cases, awk will successfully perform the math. Awk's "stringy" nature is pretty intriguing -- you may wonder what happens if we use string indexes for arrays. For instance, take the following code:
 
If your fields are separated by whitespace (one or more spaces or tabs), you may be tempted to set FS to the following regular expression:
<pre>
<pre>
FS="[[:space:]]+"
myarr["1"]="Mr. Whipple"
print myarr["1"]
</pre>
</pre>
While this assignment will do the trick, it's not necessary. Why? Because by default, FS is set to a single space character, which awk interprets to mean "one or more spaces or tabs." In this particular example, the default FS setting was exactly what you wanted in the first place!
As you might expect, this code will print "Mr. Whipple". But how about if we drop the quotes around the second "1" index?
 
Complex regular expressions are no problem. Even if your records are separated by the word "foo," followed by three digits, the following regular expression will allow your data to be parsed properly:
<pre>
<pre>
FS="foo[0-9][0-9][0-9]"
myarr["1"]="Mr. Whipple"
print myarr[1]
</pre>
</pre>
Guessing the result of this code snippet is a bit more difficult. Does awk consider myarr["1"] and myarr[1] to be two separate elements of the array, or do they refer to the same element? The answer is that they refer to the same element, and awk will print "Mr. Whipple", just as in the first code snippet. Although it may seem strange, behind the scenes awk has been using string indexes for its arrays all this time!


=== Number of fields ===
After learning this strange fact, some of us may be tempted to execute some wacky code that looks like this:
The next two variables we're going to cover are not normally intended to be written to, but are normally read and used to gain useful information about the input. The first is the NF variable, also called the "number of fields" variable. Awk will automatically set this variable to the number of fields in the current record. You can use the NF variable to display only certain input lines:
<pre>
<pre>
NF == 3 { print "this particular record has three fields: " $0 }
myarr["name"]="Mr. Whipple"
</pre>
print myarr["name"]
Of course, you can also use the NF variable in conditional statements, as follows:
<pre>
{
    if ( NF > 2 ) {
        print $1 " " $2 ":" $3
    }
}
</pre>
</pre>
Not only does this code not raise an error, but it's functionally identical to our previous examples, and will print "Mr. Whipple" just as before! As you can see, awk doesn't limit us to using pure integer indexes; we can use string indexes if we want to, without creating any problems. Whenever we use non-integer array indices like myarr["name"], we're using associative arrays. Technically, awk isn't doing anything different behind the scenes than when we use a string index (since even if you use an "integer" index, awk still treats it as a string). However, you should still call 'em associative arrays -- it sounds cool and will impress your boss. The stringy index thing will be our little secret. ;)
=== Array tools ===
When it comes to arrays, awk gives us a lot of flexibility. We can use string indexes, and we aren't required to have a continuous numeric sequence of indices (for example, we can define myarr[1] and myarr[1000], but leave all other elements undefined). While all this can be very helpful, in some circumstances it can create confusion. Fortunately, awk offers a couple of handy features to help make arrays more manageable.


=== Record number ===
First, we can delete array elements. If you want to delete element 1 of your array fooarray, type:
The record number (NR) is another handy variable. It will always contain the number of the current record (awk counts the first record as record number 1). Up until now, we've been dealing with input files that contain one record per line. For these situations, NR will also tell you the current line number. However, when we start to process multi-line records later in the series, this will no longer be the case, so be careful! NR can be used like the NF variable to print only certain lines of the input:
<pre>
<pre>
(NR < 10 ) || (NR > 100) { print "We are on record number 1-9 or 101+" }
delete fooarray[1]
</pre>
</pre>
And, if you want to see if a particular array element exists, you can use the special "in" boolean operator as follows:
<pre>
<pre>
{
if ( 1 in fooarray ) {
    #skip header
    print "Ayep! It's there."
    if ( NR > 10 ) {
} else {
        print "ok, now for the real information!"
     print "Nope!  Can't find it."
     }
}
}
</pre>
</pre>
Awk provides additional variables that can be used for a variety of purposes. We'll cover more of these variables in later articles.


We've come to the end of our initial exploration of awk. As the series continues, I'll demonstrate more advanced awk functionality, and we'll end the series with a real-world awk application.
=== Next time ===
We've covered a lot of ground in this article. Next time, I'll round out your awk knowledge by showing you how to use awk's math and string functions and how to create your own functions. I'll also walk you through the creation of a checkbook balancing program. Until then, I encourage you to write some of your own awk programs, and to check out the following resources.


== Resources ==
== Resources ==
 
* Read Daniel's other awk articles on Funtoo: Awk By Example, [[Awk by example, Part1|Part 1]] and [[Awk by example, Part3|Part 3]].
* Read Daniel's other awk articles on Funtoo: Awk By Example, [[Awk by example, Part2 |Part 2]] and [[Awk by example, Part3 |Part 3]].
* If you'd like a good old-fashioned book, [http://www.oreilly.com/catalog/sed2/ O'Reilly's sed & awk, 2nd Edition] is a wonderful choice.
* If you'd like a good old-fashioned book, [http://www.oreilly.com/catalog/sed2/ O'Reilly's sed & awk, 2nd Edition] is a wonderful choice.
* Be sure to check out the [http://www.faqs.org/faqs/computer-lang/awk/faq/ comp.lang.awk FAQ]. It also contains lots of additional awk links.
* Be sure to check out the [http://www.faqs.org/faqs/computer-lang/awk/faq/ comp.lang.awk FAQ]. It also contains lots of additional awk links.
Line 262: Line 251:
* [http://www.tasoft.com/tawk.html Thompson's TAWK Compiler] compiles awk scripts into fast binary executables. Versions are available for Windows, OS/2, DOS, and UNIX.
* [http://www.tasoft.com/tawk.html Thompson's TAWK Compiler] compiles awk scripts into fast binary executables. Versions are available for Windows, OS/2, DOS, and UNIX.
* [http://www.gnu.org/software/gawk/manual/gawk.html The GNU Awk User's Guide] is available for online reference.
* [http://www.gnu.org/software/gawk/manual/gawk.html The GNU Awk User's Guide] is available for online reference.
* [http://www.folkstalk.com/2011/12/good-examples-of-awk-command-in-unix.html Awk Command] daily useful examples.
 
[[Category:Linux Core Concepts]]
[[Category:Linux Core Concepts]]
[[Category:Articles]]
[[Category:Articles]]

Revision as of 22:11, September 19, 2012

Template:WikiArticle

Records, loops, and arrays

Multi-line records

Awk is an excellent tool for reading in and processing structured data, such as the system's /etc/passwd file. /etc/passwd is the UNIX user database, and is a colon-delimited text file, containing a lot of important information, including all existing user accounts and user IDs, among other things. In my previous article, I showed you how awk could easily parse this file. All we had to do was to set the FS (field separator) variable to ":".

By setting the FS variable correctly, awk can be configured to parse almost any kind of structured data, as long as there is one record per line. However, just setting FS won't do us any good if we want to parse a record that exists over multiple lines. In these situations, we also need to modify the RS record separator variable. The RS variable tells awk when the current record ends and a new record begins.

As an example, let's look at how we'd handle the task of processing an address list of Federal Witness Protection Program participants:

Jimmy the Weasel
100 Pleasant Drive
San Francisco, CA 12345

Big Tony
200 Incognito Ave.
Suburbia, WA 67890

Ideally, we'd like awk to recognize each 3-line address as an individual record, rather than as three separate records. It would make our code a lot simpler if awk would recognize the first line of the address as the first field ($1), the street address as the second field ($2), and the city, state, and zip code as field $3. The following code will do just what we want:

BEGIN {
    FS="\n"
    RS=""
}

Above, setting FS to "\n" tells awk that each field appears on its own line. By setting RS to "", we also tell awk that each address record is separated by a blank line. Once awk knows how the input is formatted, it can do all the parsing work for us, and the rest of the script is simple. Let's look at a complete script that will parse this address list and print out each address record on a single line, separating each field with a comma.

BEGIN {
    FS="\n"
    RS=""
}
{ print $1 ", " $2 ", " $3 }

If this script is saved as address.awk, and the address data is stored in a file called address.txt, you can execute this script by typing awk -f address.awk address.txt. This code produces the following output:

Jimmy the Weasel, 100 Pleasant Drive, San Francisco, CA 12345
Big Tony, 200 Incognito Ave., Suburbia, WA 67890

OFS and ORS

In address.awk's print statement, you can see that awk concatenates (joins) strings that are placed next to each other on a line. We used this feature to insert a comma and a space (", ") between the three address fields that appeared on the line. While this method works, it's a bit ugly looking. Rather than inserting literal ", " strings between our fields, we can have awk do it for us by setting a special awk variable called OFS. Take a look at this code snippet.

print "Hello", "there", "Jim!"

The commas on this line are not part of the actual literal strings. Instead, they tell awk that "Hello", "there", and "Jim!" are separate fields, and that the OFS variable should be printed between each string. By default, awk produces the following output:

Hello there Jim!

This shows us that by default, OFS is set to " ", a single space. However, we can easily redefine OFS so that awk will insert our favorite field separator. Here's a revised version of our original address.awk program that uses OFS to output those intermediate ", " strings:

BEGIN {
    FS="\n"
    RS=""
    OFS=", "
}
{ print $1, $2, $3 }

Awk also has a special variable called ORS, called the "output record separator". By setting ORS, which defaults to a newline ("\n"), we can control the character that's automatically printed at the end of a print statement. The default ORS value causes awk to output each new print statement on a new line. If we wanted to make the output double-spaced, we would set ORS to "\n\n". Or, if we wanted records to be separated by a single space (and no newline), we would set ORS to " ".

Multi-line to tabbed

Let's say that we wrote a script that converted our address list to a single-line per record, tab-delimited format for import into a spreadsheet. After using a slightly modified version of address.awk, it would become clear that our program only works for three-line addresses. If awk encountered the following address, the fourth line would be thrown away and not printed:

Cousin Vinnie
Vinnie's Auto Shop
300 City Alley
Sosueme, OR 76543

To handle situations like this, it would be good if our code took the number of records per field into account, printing each one in order. Right now, the code only prints the first three fields of the address. Here's some code that does what we want:

BEGIN { 
    FS="\n" 
    RS="" 
    ORS="" 
} 
 
{  
    x=1 
    while ( x<NF ) { 
        print $x "\t" 
        x++ 
    } 
    print $NF "\n" 
} 

First, we set the field separator FS to "\n" and the record separator RS to "" so that awk parses the multi-line addresses correctly, as before. Then, we set the output record separator ORS to "", which will cause the print statement to not output a newline at the end of each call. This means that if we want any text to start on a new line, we need to explicitly write print "\n".

In the main code block, we create a variable called x that holds the number of current field that we're processing. Initially, it's set to 1. Then, we use a while loop (an awk looping construct identical to that found in the C language) to iterate through all but the last record, printing the record and a tab character. Finally, we print the last record and a literal newline; again, since ORS is set to "", print won't output newlines for us. Program output looks like this, which is exactly what we wanted:

Jimmy the Weasel        100 Pleasant Drive      San Francisco, CA 12345 
Big Tony        200 Incognito Ave.      Suburbia, WA 67890
Cousin Vinnie   Vinnie's Auto Shop      300 City Alley  Sosueme, OR 76543

Looping constructs

We've already seen awk's while loop construct, which is identical to its C counterpart. Awk also has a "do...while" loop that evaluates the condition at the end of the code block, rather than at the beginning like a standard while loop. It's similar to "repeat...until" loops that can be found in other languages. Here's an example:

{
    count=1
    do {
        print "I get printed at least once no matter what" 
    } while ( count != 1 )
}

Because the condition is evaluated after the code block, a "do...while" loop, unlike a normal while loop, will always execute at least once. On the other hand, a normal while loop will never execute if its condition is false when the loop is first encountered.

for loops

Awk allows you to create for loops, which like while loops are identical to their C counterpart:

for ( initial assignment; comparison; increment ) {
    code block
}

Here's a quick example:

for ( x = 1; x <= 4; x++ ) {
    print "iteration",x
}

This snippet will print:

iteration 1
iteration 2
iteration 3
iteration 4

Break and continue

Again, just like C, awk provides break and continue statements. These statements provide better control over awk's various looping constructs. Here's a code snippet that desperately needs a break statement:

while (1) {
    print "forever and ever..."
}

Because 1 is always true, this while loop runs forever. Here's a loop that only executes ten times:

x=1
while(1) {
    print "iteration",x
    if ( x == 10 ) {
        break
    }
    x++
}

Here, the break statement is used to "break out" of the innermost loop. "break" causes the loop to immediately terminate and execution to continue at the line after the loop's code block.

The continue statement complements break, and works like this:

x=1
while (1) {
    if ( x == 4 ) {
        x++
        continue
    }
    print "iteration",x
    if ( x > 20 ) {
        break
    }
    x++
}

This code will print "iteration 1" through "iteration 21", except for "iteration 4". If iteration equals 4, x is incremented and the continue statement is called, which immediately causes awk to start to the next loop iteration without executing the rest of the code block. The continue statement works for every kind of awk iterative loop, just as break does. When used in the body of a for loop, continue will cause the loop control variable to be automatically incremented. Here's an equivalent for loop:

for ( x=1; x<=21; x++ ) {
    if ( x == 4 ) {
        continue
    }
    print "iteration",x
}

It wasn't necessary to increment x just before calling continue as it was in our while loop, since the for loop increments x automatically.

Arrays

You'll be pleased to know that awk has arrays. However, under awk, it's customary to start array indices at 1, rather than 0:

myarray[1]="jim"
myarray[2]=456

When awk encounters the first assignment, myarray is created and the element myarray[1] is set to "jim". After the second assignment is evaluated, the array has two elements.

Once defined, awk has a handy mechanism to iterate over the elements of an array, as follows:

for ( x in myarray ) {
    print myarray[x]
}

This code will print out every element in the array myarray. When you use this special "in" form of a for loop, awk will assign every existing index of myarray to x (the loop control variable) in turn, executing the loop's code block once after each assignment. While this is a very handy awk feature, it does have one drawback -- when awk cycles through the array indices, it doesn't follow any particular order. That means that there's no way for us to know whether the output of above code will be:

jim
456

or

456
jim

To loosely paraphrase Forrest Gump, iterating over the contents of an array is like a box of chocolates -- you never know what you're going to get. This has something to do with the "stringiness" of awk arrays, which we'll now take a look at.

Array index stringiness

In my previous article, I showed you that awk actually stores numeric values in a string format. While awk performs the necessary conversions to make this work, it does open the door for some odd-looking code:

a="1"
b="2"
c=a+b+3

After this code executes, c is equal to 6. Since awk is "stringy", adding strings "1" and "2" is functionally no different than adding the numbers 1 and 2. In both cases, awk will successfully perform the math. Awk's "stringy" nature is pretty intriguing -- you may wonder what happens if we use string indexes for arrays. For instance, take the following code:

myarr["1"]="Mr. Whipple"
print myarr["1"]

As you might expect, this code will print "Mr. Whipple". But how about if we drop the quotes around the second "1" index?

myarr["1"]="Mr. Whipple"
print myarr[1]

Guessing the result of this code snippet is a bit more difficult. Does awk consider myarr["1"] and myarr[1] to be two separate elements of the array, or do they refer to the same element? The answer is that they refer to the same element, and awk will print "Mr. Whipple", just as in the first code snippet. Although it may seem strange, behind the scenes awk has been using string indexes for its arrays all this time!

After learning this strange fact, some of us may be tempted to execute some wacky code that looks like this:

myarr["name"]="Mr. Whipple"
print myarr["name"]

Not only does this code not raise an error, but it's functionally identical to our previous examples, and will print "Mr. Whipple" just as before! As you can see, awk doesn't limit us to using pure integer indexes; we can use string indexes if we want to, without creating any problems. Whenever we use non-integer array indices like myarr["name"], we're using associative arrays. Technically, awk isn't doing anything different behind the scenes than when we use a string index (since even if you use an "integer" index, awk still treats it as a string). However, you should still call 'em associative arrays -- it sounds cool and will impress your boss. The stringy index thing will be our little secret. ;)

Array tools

When it comes to arrays, awk gives us a lot of flexibility. We can use string indexes, and we aren't required to have a continuous numeric sequence of indices (for example, we can define myarr[1] and myarr[1000], but leave all other elements undefined). While all this can be very helpful, in some circumstances it can create confusion. Fortunately, awk offers a couple of handy features to help make arrays more manageable.

First, we can delete array elements. If you want to delete element 1 of your array fooarray, type:

delete fooarray[1]

And, if you want to see if a particular array element exists, you can use the special "in" boolean operator as follows:

if ( 1 in fooarray ) {
    print "Ayep!  It's there."
} else {
    print "Nope!  Can't find it."
}

Next time

We've covered a lot of ground in this article. Next time, I'll round out your awk knowledge by showing you how to use awk's math and string functions and how to create your own functions. I'll also walk you through the creation of a checkbook balancing program. Until then, I encourage you to write some of your own awk programs, and to check out the following resources.

Resources