Difference between pages "Awk by Example, Part 3" and "Awk by Example, Part 2"

 
 
Line 1: Line 1:
 
{{Article
 
{{Article
 
|Author=Drobbins
 
|Author=Drobbins
|Previous in Series=Awk by Example, Part 2
+
|Previous in Series=Awk by Example, Part 1
 +
|Next in Series=Awk by Example, Part 3
 
}}
 
}}
== String functions and ... checkbooks? ==
+
== Records, loops, and arrays ==
  
=== Formatting output ===
+
=== Multi-line records ===
While awk's print statement does do the job most of the time, sometimes more is needed. For those times, awk offers two good old friends called printf() and sprintf(). Yes, these functions, like so many other awk parts, are identical to their C counterparts. printf() will print a formatted string to stdout, while sprintf() returns a formatted string that can be assigned to a variable. If you're not familiar with printf() and sprintf(), an introductory C text will quickly get you up to speed on these two essential printing functions. You can view the printf() man page by typing "man 3 printf" on your Linux system.
+
Awk is an excellent tool for reading in and processing structured data, such as the system's /etc/passwd file. /etc/passwd is the UNIX user database, and is a colon-delimited text file, containing a lot of important information, including all existing user accounts and user IDs, among other things. In my previous article, I showed you how awk could easily parse this file. All we had to do was to set the FS (field separator) variable to ":".
  
Here's some sample awk sprintf() and printf() code. As you can see in the following script, everything looks almost identical to C.
+
By setting the FS variable correctly, awk can be configured to parse almost any kind of structured data, as long as there is one record per line. However, just setting FS won't do us any good if we want to parse a record that exists over multiple lines. In these situations, we also need to modify the RS record separator variable. The RS variable tells awk when the current record ends and a new record begins.
 +
 
 +
As an example, let's look at how we'd handle the task of processing an address list of Federal Witness Protection Program participants:
 +
<pre>
 +
Jimmy the Weasel
 +
100 Pleasant Drive
 +
San Francisco, CA 12345
 +
 
 +
Big Tony
 +
200 Incognito Ave.
 +
Suburbia, WA 67890
 +
</pre>
 +
Ideally, we'd like awk to recognize each 3-line address as an individual record, rather than as three separate records. It would make our code a lot simpler if awk would recognize the first line of the address as the first field ($1), the street address as the second field ($2), and the city, state, and zip code as field $3. The following code will do just what we want:
 
<pre>
 
<pre>
#!/usr/bin/awk -f
 
 
BEGIN {
 
BEGIN {
x=1
+
    FS="\n"
b="foo"
+
    RS=""
printf("%s got a %d on the last test\n","Jim",83)
+
myout=sprintf("%s-%d",b,x)
+
print myout
+
 
}
 
}
 
</pre>
 
</pre>
This code will print:
+
Above, setting FS to "\n" tells awk that each field appears on its own line. By setting RS to "", we also tell awk that each address record is separated by a blank line. Once awk knows how the input is formatted, it can do all the parsing work for us, and the rest of the script is simple. Let's look at a complete script that will parse this address list and print out each address record on a single line, separating each field with a comma.
 
<pre>
 
<pre>
Jim got a 83 on the last test
+
BEGIN {
foo-1
+
    FS="\n"
 +
    RS=""
 +
}
 +
{ print $1 ", " $2 ", " $3 }
 
</pre>
 
</pre>
 
+
If this script is saved as address.awk, and the address data is stored in a file called address.txt, you can execute this script by typing awk -f address.awk address.txt. This code produces the following output:
=== String functions ===
+
Awk has a plethora of string functions, and that's a good thing. In awk, you really need string functions, since you can't treat a string as an array of characters as you can in other languages like C, C++, and Python. For example, if you execute the following code:
+
 
<pre>
 
<pre>
mystring="How are you doing today?"
+
Jimmy the Weasel, 100 Pleasant Drive, San Francisco, CA 12345
print mystring[3]
+
Big Tony, 200 Incognito Ave., Suburbia, WA 67890
 
</pre>
 
</pre>
You'll receive an error that looks something like this:
+
 
 +
=== OFS and ORS ===
 +
In address.awk's print statement, you can see that awk concatenates (joins) strings that are placed next to each other on a line. We used this feature to insert a comma and a space (", ") between the three address fields that appeared on the line. While this method works, it's a bit ugly looking. Rather than inserting literal ", " strings between our fields, we can have awk do it for us by setting a special awk variable called OFS. Take a look at this code snippet.
 
<pre>
 
<pre>
awk: string.gawk:59: fatal: attempt to use scalar as array
+
print "Hello", "there", "Jim!"
 
</pre>
 
</pre>
Oh, well. While not as convenient as Python's sequence types, awk's string functions get the job done. Let's take a look at them.
 
  
First, we have the basic length() function, which returns the length of a string. Here's how to use it:
+
The commas on this line are not part of the actual literal strings. Instead, they tell awk that "Hello", "there", and "Jim!" are separate fields, and that the OFS variable should be printed between each string. By default, awk produces the following output:
 
<pre>
 
<pre>
print length(mystring)
+
Hello there Jim!
 
</pre>
 
</pre>
This code will print the value:
+
This shows us that by default, OFS is set to " ", a single space. However, we can easily redefine OFS so that awk will insert our favorite field separator. Here's a revised version of our original address.awk program that uses OFS to output those intermediate ", " strings:
 
<pre>
 
<pre>
24
+
BEGIN {
 +
    FS="\n"
 +
    RS=""
 +
    OFS=", "
 +
}
 +
{ print $1, $2, $3 }
 
</pre>
 
</pre>
OK, let's keep going. The next string function is called index(), and will return the position of the first occurrence of a substring in another string, or 0 if the substring isn't found. Using mystring, we can call it this way:
+
Awk also has a special variable called ORS, the "output record separator". By setting ORS, which defaults to a newline ("\n"), we can control the string that's automatically printed at the end of a print statement. The default ORS value causes awk to output each new print statement on a new line. If we wanted to make the output double-spaced, we would set ORS to "\n\n". Or, if we wanted records to be separated by a single space (and no newline), we would set ORS to " ".
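Here's a minimal sketch of the single-space case. The three input lines and the shell variable name ors_out are made up for illustration; only ORS itself comes from the text above.

```shell
# Feed three made-up records to awk and join them with a space instead of a newline.
ors_out=$(printf 'a\nb\nc\n' | awk 'BEGIN { ORS=" " } { print $0 }')

# Every print now ends with ORS, so the result is "a b c " (note the trailing space).
printf '[%s]\n' "$ors_out"
```

Because ORS follows every record, including the last one, the output ends with a space rather than a newline.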
 +
 
 +
=== Multi-line to tabbed ===
 +
Let's say that we wrote a script that converted our address list to a single-line per record, tab-delimited format for import into a spreadsheet. After using a slightly modified version of address.awk, it would become clear that our program only works for three-line addresses. If awk encountered the following address, the fourth line would be thrown away and not printed:
 
<pre>
 
<pre>
print index(mystring,"you")
+
Cousin Vinnie
 +
Vinnie's Auto Shop
 +
300 City Alley
 +
Sosueme, OR 76543
 
</pre>
 
</pre>
Awk prints:
+
To handle situations like this, it would be good if our code took the number of fields per record into account, printing each one in order. Right now, the code only prints the first three fields of the address. Here's some code that does what we want:
 
<pre>
 
<pre>
9
+
BEGIN {
 +
    FS="\n"
 +
    RS=""
 +
    ORS=""
 +
}
 +
 +
{
 +
    x=1
 +
    while ( x<NF ) {
 +
        print $x "\t"
 +
        x++
 +
    }
 +
    print $NF "\n"
 +
}
 
</pre>
 
</pre>
We move on to two more easy functions, tolower() and toupper(). As you might guess, these functions will return the string with all characters converted to lowercase or uppercase respectively. Notice that tolower() and toupper() return the new string, and don't modify the original. This code:
+
First, we set the field separator FS to "\n" and the record separator RS to "" so that awk parses the multi-line addresses correctly, as before. Then, we set the output record separator ORS to "", which will cause the print statement to not output a newline at the end of each call. This means that if we want any text to start on a new line, we need to explicitly write print "\n".
 +
 
 +
In the main code block, we create a variable called x that holds the number of the current field that we're processing. Initially, it's set to 1. Then, we use a while loop (an awk looping construct identical to that found in the C language) to iterate through all but the last field, printing the field and a tab character. Finally, we print the last field and a literal newline; again, since ORS is set to "", print won't output newlines for us. Program output looks like this, which is exactly what we wanted:
 
<pre>
 
<pre>
print tolower(mystring)
+
Jimmy the Weasel        100 Pleasant Drive      San Francisco, CA 12345
print toupper(mystring)
+
Big Tony        200 Incognito Ave.      Suburbia, WA 67890
print mystring
+
Cousin Vinnie  Vinnie's Auto Shop      300 City Alley  Sosueme, OR 76543
 
</pre>
 
</pre>
....will produce this output:
+
 
 +
=== Looping constructs ===
 +
We've already seen awk's while loop construct, which is identical to its C counterpart. Awk also has a "do...while" loop that evaluates the condition at the end of the code block, rather than at the beginning like a standard while loop. It's similar to "repeat...until" loops that can be found in other languages. Here's an example:
 
<pre>
 
<pre>
how are you doing today?
+
{
HOW ARE YOU DOING TODAY?
+
    count=1
How are you doing today?
+
    do {
 +
        print "I get printed at least once no matter what"
 +
    } while ( count != 1 )
 +
}
 
</pre>
 
</pre>
So far so good, but how exactly do we select a substring or even a single character from a string? That's where substr() comes in. Here's how to call substr():
+
Because the condition is evaluated after the code block, a "do...while" loop, unlike a normal while loop, will always execute at least once. On the other hand, a normal while loop will never execute if its condition is false when the loop is first encountered.
 +
 
 +
=== for loops ===
 +
Awk allows you to create for loops, which like while loops are identical to their C counterpart:
 
<pre>
 
<pre>
mysub=substr(mystring,startpos,maxlen)
+
for ( initial assignment; comparison; increment ) {
 +
    code block
 +
}
 
</pre>
 
</pre>
mystring should be either a string variable or a literal string from which you'd like to extract a substring. startpos should be set to the starting character position, and maxlen should contain the maximum length of the string you'd like to extract. Notice that I said maximum length; if length(mystring) is shorter than startpos+maxlen, your result will be truncated. substr() won't modify the original string, but returns the substring instead. Here's an example:
+
Here's a quick example:
 
<pre>
 
<pre>
print substr(mystring,9,3)
+
for ( x = 1; x <= 4; x++ ) {
 +
    print "iteration",x
 +
}
 
</pre>
 
</pre>
Awk will print:
+
This snippet will print:
 
<pre>
 
<pre>
you
+
iteration 1
 +
iteration 2
 +
iteration 3
 +
iteration 4
 
</pre>
 
</pre>
If you regularly program in a language that uses array indices to access parts of a string (and who doesn't), make a mental note that substr() is your awk substitute. You'll need to use it to extract single characters and substrings; because awk is a string-based language, you'll be using it often.
 
  
Now, we move on to some meatier functions, the first of which is called match(). match() is a lot like index(), except instead of searching for a substring like index() does, it searches for a regular expression. The match() function will return the starting position of the match, or zero if no match is found. In addition, match() will set two variables called RSTART and RLENGTH. RSTART contains the return value (the location of the first match), and RLENGTH specifies its span in characters (or -1 if no match was found). Using RSTART, RLENGTH, substr(), and a small loop, you can easily iterate through every match in your string. Here's an example match() call:
+
=== Break and continue ===
 +
Again, just like C, awk provides break and continue statements. These statements provide better control over awk's various looping constructs. Here's a code snippet that desperately needs a break statement:
 
<pre>
 
<pre>
print match(mystring,/you/), RSTART, RLENGTH
+
while (1) {
 +
    print "forever and ever..."
 +
}
 
</pre>
 
</pre>
Awk will print:
+
Because 1 is always true, this while loop runs forever. Here's a loop that only executes ten times:
 
<pre>
 
<pre>
9 9 3
+
x=1
 +
while(1) {
 +
    print "iteration",x
 +
    if ( x == 10 ) {
 +
        break
 +
    }
 +
    x++
 +
}
 
</pre>
 
</pre>
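The "small loop" over every match mentioned in the match() discussion could be sketched roughly like this. The helper variables rest and offset, and the choice of /o/ as the regular expression, are my own assumptions for illustration; the loop also assumes the regexp never matches an empty string, since that would keep it spinning forever.

```shell
matches=$(awk 'BEGIN {
    mystring = "How are you doing today?"
    rest = mystring      # the part of the string not yet searched
    offset = 0           # how many characters have been chopped off so far
    while (match(rest, /o/)) {
        # RSTART is relative to rest, so add offset to get the real position
        print offset + RSTART, substr(rest, RSTART, RLENGTH)
        offset += RSTART + RLENGTH - 1
        rest = substr(rest, RSTART + RLENGTH)
    }
}')
printf '%s\n' "$matches"
```

Each pass prints the position and text of one match, then continues searching just past it, so this should report an "o" at positions 2, 10, 14, and 20.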
 +
Here, the break statement is used to "break out" of the innermost loop. "break" causes the loop to immediately terminate and execution to continue at the line after the loop's code block.
  
=== String substitution ===
+
The continue statement complements break, and works like this:
Now, we're going to look at a couple of string substitution functions, sub() and gsub(). These guys differ slightly from the functions we've looked at so far in that they actually modify the original string. Here's a template that shows how to call sub():
+
 
<pre>
 
<pre>
sub(regexp,replstring,mystring)
+
x=1
 +
while (1) {
 +
    if ( x == 4 ) {
 +
        x++
 +
        continue
 +
    }
 +
    print "iteration",x
 +
    if ( x > 20 ) {
 +
        break
 +
    }
 +
    x++
 +
}
 
</pre>
 
</pre>
When you call sub(), it'll find the first sequence of characters in mystring that matches regexp, and it'll replace that sequence with replstring. sub() and gsub() have identical arguments; the only way they differ is that sub() will replace the first regexp match (if any), and gsub() will perform a global replace, swapping out all matches in the string. Here's an example sub() and gsub() call:
+
This code will print "iteration 1" through "iteration 21", except for "iteration 4". If x equals 4, x is incremented and the continue statement is called, which immediately causes awk to start the next loop iteration without executing the rest of the code block. The continue statement works for every kind of awk iterative loop, just as break does. When used in the body of a for loop, continue will cause the loop control variable to be automatically incremented. Here's an equivalent for loop:
 
<pre>
 
<pre>
sub(/o/,"O",mystring)
+
for ( x=1; x<=21; x++ ) {
print mystring
+
    if ( x == 4 ) {
mystring="How are you doing today?"
+
        continue
gsub(/o/,"O",mystring)
+
    }
print mystring
+
    print "iteration",x
 +
}
 
</pre>
 
</pre>
We had to reset mystring to its original value because the first sub() call modified mystring directly. When executed, this code will cause awk to output:
+
It wasn't necessary to increment x just before calling continue as it was in our while loop, since the for loop increments x automatically.
<pre>
+
HOw are you doing today?
+
HOw are yOu dOing tOday?
+
</pre>
+
Of course, more complex regular expressions are possible. I'll leave it up to you to test out some complicated regexps.
+
  
We wrap up our string function coverage by introducing you to a function called split(). split()'s job is to "chop up" a string and place the various parts into an integer-indexed array. Here's an example split() call:
+
=== Arrays ===
<pre>
+
You'll be pleased to know that awk has arrays. However, under awk, it's customary to start array indices at 1, rather than 0:
numelements=split("Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec",mymonths,",")
+
</pre>
+
When calling split(), the first argument contains the literal string or string variable to be chopped. In the second argument, you should specify the name of the array that split() will stuff the chopped parts into. In the third argument, specify the separator that will be used to chop the strings up. When split() returns, it'll return the number of string elements that were split. split() assigns each one to an array index starting with one, so the following code:
+
<pre>
+
print mymonths[1],mymonths[numelements]
+
</pre>
+
....will print:
+
 
<pre>
 
<pre>
Jan Dec
+
myarray[1]="jim"
 +
myarray[2]=456
 
</pre>
 
</pre>
 +
When awk encounters the first assignment, myarray is created and the element myarray[1] is set to "jim". After the second assignment is evaluated, the array has two elements.
  
=== Special string forms ===
+
Once defined, awk has a handy mechanism to iterate over the elements of an array, as follows:
A quick note -- when calling length(), sub(), or gsub(), you can drop the last argument and awk will apply the function call to $0 (the entire current line). To print the length of each line in a file, use this awk script:
+
 
<pre>
 
<pre>
{
+
for ( x in myarray ) {
     print length()
+
     print myarray[x]
 
}
 
}
 
</pre>
 
</pre>
 
+
This code will print out every element in the array myarray. When you use this special "in" form of a for loop, awk will assign every existing index of myarray to x (the loop control variable) in turn, executing the loop's code block once after each assignment. While this is a very handy awk feature, it does have one drawback -- when awk cycles through the array indices, it doesn't follow any particular order. That means that there's no way for us to know whether the output of the above code will be:
=== Financial fun ===
+
A few weeks ago, I decided to write my own checkbook balancing program in awk. I decided that I'd like to have a simple tab-delimited text file into which I can enter my most recent deposits and withdrawals. The idea was to hand this data to an awk script that would automatically add up all the amounts and tell me my balance. Here's how I decided to record all my transactions into my "ASCII checkbook":
+
 
<pre>
 
<pre>
23 Aug 2000    food    -    -    Y    Jimmy's Buffet    30.25
+
jim
 +
456
 
</pre>
 
</pre>
Every field in this file is separated by one or more tabs. After the date (field 1, $1), there are two fields called "expense category" and "income category". When I'm entering an expense like on the above line, I put a four-letter nickname in the exp field, and a "-" (blank entry) in the inc field. This signifies that this particular item is a "food expense" :) Here's what a deposit looks like:
+
or
 
<pre>
 
<pre>
23 Aug 2000    -    inco    -    Y    Boss Man        2001.00
+
456
 +
jim
 
</pre>
 
</pre>
In this case, I put a "-" (blank) in the exp category, and put "inco" in the inc category. "inco" is my nickname for generic (paycheck-style) income. Using category nicknames allows me to generate a breakdown of my income and expenditures by category. As for the rest of the record, all the other fields are fairly self-explanatory. The cleared? field ("Y" or "N") records whether the transaction has been posted to my account; beyond that, there's a transaction description, and a positive dollar amount.
+
To loosely paraphrase Forrest Gump, iterating over the contents of an array is like a box of chocolates -- you never know what you're going to get. This has something to do with the "stringiness" of awk arrays, which we'll now take a look at.
  
The algorithm used to compute the current balance isn't too hard. Awk simply needs to read in each line, one by one. If an expense category is listed but there is no income category (denoted by "-"), then this item is a debit. If an income category is listed, but no expense category (denoted by "-") is present, then the dollar amount is a credit. And, if there is both an expense and income category listed, then this amount is a "category transfer"; that is, the dollar amount will be subtracted from the expense category and added to the income category. Again, all these categories are virtual, but are very useful for tracking income and expenditures, as well as for budgeting.
+
=== Array index stringiness ===
 
+
[[Awk by example, Part1 |In my previous article]], I showed you that awk actually stores numeric values in a string format. While awk performs the necessary conversions to make this work, it does open the door for some odd-looking code:
=== The code ===
+
Time to look at the code. We'll start off with the first line, the BEGIN block and a function definition:
+
 
<pre>
 
<pre>
#!/usr/bin/awk -f
+
a="1"
BEGIN {
+
b="2"
    FS="\t+"
+
c=a+b+3
    months="Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec"
+
}
+
 
+
function monthdigit(mymonth) {
+
    return (index(months,mymonth)+3)/4
+
}
+
 
</pre>
 
</pre>
Adding the first "#!..." line to any awk script will allow it to be directly executed from the shell, provided that you "chmod +x myscript" first. The remaining lines define our BEGIN block, which gets executed before awk starts processing our checkbook file. We set FS (the field separator) to "\t+", which tells awk that the fields will be separated by one or more tabs. In addition, we define a string called months that's used by our monthdigit() function, which appears next.
+
After this code executes, c is equal to 6. Since awk is "stringy", adding strings "1" and "2" is functionally no different than adding the numbers 1 and 2. In both cases, awk will successfully perform the math. Awk's "stringy" nature is pretty intriguing -- you may wonder what happens if we use string indexes for arrays. For instance, take the following code:
 
+
The last three lines show you how to define your own awk function. The format is simple -- type "function", then the function name, and then the parameters separated by commas, inside parentheses. After this, a "{ }" code block contains the code that you'd like this function to execute. All functions can access global variables (like our months variable). In addition, awk provides a "return" statement that allows the function to return a value, and operates similarly to the "return" found in C, Python, and other languages. This particular function converts a month name in a 3-letter string format into its numeric equivalent. For example, this:
+
 
<pre>
 
<pre>
print monthdigit("Mar")
+
myarr["1"]="Mr. Whipple"
 +
print myarr["1"]
 
</pre>
 
</pre>
....will print this:
+
As you might expect, this code will print "Mr. Whipple". But how about if we drop the quotes around the second "1" index?
 
<pre>
 
<pre>
3
+
myarr["1"]="Mr. Whipple"
 +
print myarr[1]
 
</pre>
 
</pre>
Now, let's move on to some more functions.
+
Guessing the result of this code snippet is a bit more difficult. Does awk consider myarr["1"] and myarr[1] to be two separate elements of the array, or do they refer to the same element? The answer is that they refer to the same element, and awk will print "Mr. Whipple", just as in the first code snippet. Although it may seem strange, behind the scenes awk has been using string indexes for its arrays all this time!
  
=== Financial functions ===
+
After learning this strange fact, some of us may be tempted to execute some wacky code that looks like this:
Here are three more functions that perform the bookkeeping for us. Our main code block, which we'll see soon, will process each line of the checkbook file sequentially, calling one of these functions so that the appropriate transactions are recorded in an awk array. There are three basic kinds of transactions, credit (doincome), debit (doexpense) and transfer (dotransfer). You'll notice that all three functions accept one argument, called mybalance. mybalance is a placeholder for a two-dimensional array, which we'll pass in as an argument. Up until now, we haven't dealt with two-dimensional arrays; however, as you can see below, the syntax is quite simple. Just separate each dimension with a comma, and you're in business.
+
<pre>
 
+
myarr["name"]="Mr. Whipple"
We'll record information into "mybalance" as follows. The first dimension of the array ranges from 0 to 12, and specifies the month, or zero for the entire year. Our second dimension is a four-letter category, like "food" or "inco"; this is the actual category we're dealing with. So, to find the entire year's balance for the food category, you'd look in mybalance[0,"food"]. To find June's income, you'd look in mybalance[6,"inco"].
+
print myarr["name"]
<pre>      
+
function doincome(mybalance) {
+
    mybalance[curmonth,$3] += amount
+
    mybalance[0,$3] += amount       
+
}
+
 
+
function doexpense(mybalance) {
+
    mybalance[curmonth,$2] -= amount
+
    mybalance[0,$2] -= amount       
+
}
+
 
+
function dotransfer(mybalance) {
+
    mybalance[0,$2] -= amount
+
    mybalance[curmonth,$2] -= amount
+
    mybalance[0,$3] += amount
+
    mybalance[curmonth,$3] += amount
+
}
+
 
</pre>
 
</pre>
When doincome() or any of the other functions are called, we record the transaction in two places -- mybalance[0,category] and mybalance[curmonth, category], the entire year's category balance and the current month's category balance, respectively. This allows us to easily generate either an annual or monthly breakdown of income/expenditures later on.
+
Not only does this code not raise an error, but it's functionally identical to our previous examples, and will print "Mr. Whipple" just as before! As you can see, awk doesn't limit us to using pure integer indexes; we can use string indexes if we want to, without creating any problems. Whenever we use non-integer array indices like myarr["name"], we're using associative arrays. Technically, awk isn't doing anything different behind the scenes than when we use an integer index (since even if you use an "integer" index, awk still treats it as a string). However, you should still call 'em associative arrays -- it sounds cool and will impress your boss. The stringy index thing will be our little secret. ;)
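To see the stringy indexes in action, here's a small sketch that mixes both styles in one array. The second entry ("Jim" under the index "name") and the shell variable index_out are made up for illustration.

```shell
index_out=$(awk 'BEGIN {
    myarr["1"] = "Mr. Whipple"      # stored under the string index "1"
    myarr["name"] = "Jim"           # hypothetical second entry with a word as its index
    if (1 in myarr)                 # the number 1 is converted to the string "1"
        print "myarr[1] is", myarr[1]
    n = 0
    for (k in myarr)                # count the elements; visiting order is unspecified
        n++
    print "elements:", n
}')
printf '%s\n' "$index_out"
```

The numeric lookup finds the element stored under "1", and the count comes out as 2, so both kinds of index really do live in the same array.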
  
If you look at these functions, you'll notice that the array referenced by mybalance is passed in by reference. In addition, we also refer to several global variables: curmonth, which holds the numeric value of the month of the current record, $2 (the expense category), $3 (the income category), and amount ($7, the dollar amount). When doincome() and friends are called, all these variables have already been set correctly for the current record (line) being processed.
+
=== Array tools ===
 +
When it comes to arrays, awk gives us a lot of flexibility. We can use string indexes, and we aren't required to have a continuous numeric sequence of indices (for example, we can define myarr[1] and myarr[1000], but leave all other elements undefined). While all this can be very helpful, in some circumstances it can create confusion. Fortunately, awk offers a couple of handy features to help make arrays more manageable.
  
=== The main block ===
+
First, we can delete array elements. If you want to delete element 1 of your array fooarray, type:
Here's the main code block that contains the code that parses each line of input data. Remember, because we have set FS correctly, we can refer to the first field as $1, the second field as $2, etc. When doincome() and friends are called, the functions can access the current values of curmonth, $2, $3 and amount from inside the function. Take a look at the code and meet me on the other side for an explanation.
+
 
<pre>
 
<pre>
{
+
delete fooarray[1]
    curmonth=monthdigit(substr($1,4,3))
+
    amount=$7
+
     
+
    #record all the categories encountered
+
    if ( $2 != "-" )
+
        globcat[$2]="yes"
+
    if ( $3 != "-" )
+
        globcat[$3]="yes"
+
 
+
    #tally up the transaction properly
+
    if ( $2 == "-" ) {
+
        if ( $3 == "-" ) {
+
            print "Error: inc and exp fields are both blank!"
+
            exit 1
+
        } else {
+
            #this is income
+
            doincome(balance)
+
            if ( $5 == "Y" )
+
                doincome(balance2)
+
        }
+
    } else if ( $3 == "-" ) {
+
        #this is an expense
+
        doexpense(balance)
+
        if ( $5 == "Y" )
+
            doexpense(balance2)
+
    } else {
+
        #this is a transfer
+
        dotransfer(balance)
+
        if ( $5 == "Y" )
+
            dotransfer(balance2)
+
    }                       
+
}
+
 
</pre>
 
</pre>
In the main block, the first two lines set curmonth to an integer between 1 and 12, and set amount to field 7 (to make the code easier to understand). Then, we have four interesting lines, where we write values into an array called globcat. globcat, or the global categories array, is used to record all those categories encountered in the file -- "inco", "misc", "food", "util", etc. For example, if $2 == "inco", we set globcat["inco"] to "yes". Later on, we can iterate through our list of categories with a simple "for (x in globcat)" loop.
+
And, if you want to see if a particular array element exists, you can use the special "in" boolean operator as follows:
 
+
On the next twenty or so lines, we analyze fields $2 and $3, and record the transaction appropriately. If $2=="-" and $3!="-", we have some income, so we call doincome(). If the situation is reversed, we call doexpense(); and if both $2 and $3 contain categories, we call dotransfer(). Each time, we pass the "balance" array to these functions so that the appropriate data is recorded there.
+
 
+
You'll also notice several lines that say "if ( $5 == "Y" ), record that same transaction in balance2". What exactly are we doing here? You'll recall that $5 contains either a "Y" or a "N", and records whether the transaction has been posted to the account. Because we record the transaction to balance2 only if the transaction has been posted, balance2 will contain the actual account balance, while "balance" will contain all transactions, whether they have been posted or not. You can use balance2 to verify your data entry (since it should match with your current account balance according to your bank), and use "balance" to make sure that you don't overdraw your account (since it will take into account any checks you have written that have not yet been cashed).
+
 
+
=== Generating the report ===
+
After the main block repeatedly processes each input record, we now have a fairly comprehensive record of debits and credits broken down by category and by month. Now, all we need to do is define an END block that will generate a report, in this case a modest one:
+
 
<pre>
 
<pre>
END {
+
if ( 1 in fooarray ) {
    bal=0
+
     print "Ayep!  It's there."
    bal2=0       
+
} else {
    for (x in globcat) {
+
     print "Nope!  Can't find it."
        bal=bal+balance[0,x]
+
        bal2=bal2+balance2[0,x]   
+
     }
+
    printf("Your available funds: %10.2f\n", bal)
+
     printf("Your account balance: %10.2f\n", bal2)       
+
 
}
 
}
 
</pre>
 
</pre>
This report prints out a summary that looks something like this:
 
<pre>
 
Your available funds:    1174.22
 
Your account balance:    2399.33
 
</pre>
 
In our END block, we used the "for (x in globcat)" construct to iterate through every category, tallying up a master balance based on all the transactions recorded. We actually tally up two balances, one for available funds, and another for the account balance. To execute the program and process your own financial goodies that you've entered into a file called '''mycheckbook.txt''', put all the above code into a text file called '''balance''' and do <span style="color:green;">"chmod +x balance"</span>, and then type <span style="color:green;">"./balance mycheckbook.txt"</span>. The balance script will then add up all your transactions and print out a two-line balance summary for you.
 
 
=== Upgrades ===
 
I use a more advanced version of this program to manage my personal and business finances. My version (which I couldn't include here due to space limitations) prints out a monthly breakdown of income and expenses, including annual totals, net income and a bunch of other stuff. Even better, it outputs the data in HTML format, so that I can view it in a Web browser :) If you find this program useful, I encourage you to add these features to this script. You won't need to configure it to record any additional information; all the information you need is already in balance and balance2. Just upgrade the END block, and you're in business!
 
  
I hope you've enjoyed this series. For more information on awk, check out the resources listed below.
  
 
== Resources ==
* Read Daniel's other awk articles on Funtoo: Awk By Example, [[Awk by example, Part1|Part 1]], [[Awk by example, Part2|Part 2]] and [[Awk by example, Part3|Part 3]].
* If you'd like a good old-fashioned book, [http://www.oreilly.com/catalog/sed2/ O'Reilly's sed & awk, 2nd Edition] is a wonderful choice.
* Be sure to check out the [http://www.faqs.org/faqs/computer-lang/awk/faq/ comp.lang.awk FAQ]. It also contains lots of additional awk links.
* [http://www.gnu.org/software/gawk/manual/gawk.html The GNU Awk User's Guide] is available for online reference.
[[Category:Linux Core Concepts]]
[[Category:Articles]]
{{ArticleFooter}}

Revision as of 01:57, January 2, 2015

Support Funtoo and help us grow! Donate $15 per month and get a free SSD-based Funtoo Virtual Container.

Records, loops, and arrays

Multi-line records

Awk is an excellent tool for reading in and processing structured data, such as the system's /etc/passwd file. /etc/passwd is the UNIX user database, and is a colon-delimited text file, containing a lot of important information, including all existing user accounts and user IDs, among other things. In my previous article, I showed you how awk could easily parse this file. All we had to do was to set the FS (field separator) variable to ":".

By setting the FS variable correctly, awk can be configured to parse almost any kind of structured data, as long as there is one record per line. However, just setting FS won't do us any good if we want to parse a record that exists over multiple lines. In these situations, we also need to modify the RS record separator variable. The RS variable tells awk when the current record ends and a new record begins.

As an example, let's look at how we'd handle the task of processing an address list of Federal Witness Protection Program participants:

Jimmy the Weasel
100 Pleasant Drive
San Francisco, CA 12345

Big Tony
200 Incognito Ave.
Suburbia, WA 67890

Ideally, we'd like awk to recognize each 3-line address as an individual record, rather than as three separate records. It would make our code a lot simpler if awk would recognize the first line of the address as the first field ($1), the street address as the second field ($2), and the city, state, and zip code as field $3. The following code will do just what we want:

BEGIN {
    FS="\n"
    RS=""
}

Above, setting FS to "\n" tells awk that each field appears on its own line. By setting RS to "", we also tell awk that each address record is separated by a blank line. Once awk knows how the input is formatted, it can do all the parsing work for us, and the rest of the script is simple. Let's look at a complete script that will parse this address list and print out each address record on a single line, separating each field with a comma.

BEGIN {
    FS="\n"
    RS=""
}
{ print $1 ", " $2 ", " $3 }

If this script is saved as address.awk, and the address data is stored in a file called address.txt, you can execute this script by typing awk -f address.awk address.txt. This code produces the following output:

Jimmy the Weasel, 100 Pleasant Drive, San Francisco, CA 12345
Big Tony, 200 Incognito Ave., Suburbia, WA 67890

OFS and ORS

In address.awk's print statement, you can see that awk concatenates (joins) strings that are placed next to each other on a line. We used this feature to insert a comma and a space (", ") between the three address fields that appeared on the line. While this method works, it's a bit ugly looking. Rather than inserting literal ", " strings between our fields, we can have awk do it for us by setting a special awk variable called OFS. Take a look at this code snippet.

print "Hello", "there", "Jim!"

The commas on this line are not part of the actual literal strings. Instead, they tell awk that "Hello", "there", and "Jim!" are separate fields, and that the OFS variable should be printed between each string. By default, awk produces the following output:

Hello there Jim!

This shows us that by default, OFS is set to " ", a single space. However, we can easily redefine OFS so that awk will insert our favorite field separator. Here's a revised version of our original address.awk program that uses OFS to output those intermediate ", " strings:

BEGIN {
    FS="\n"
    RS=""
    OFS=", "
}
{ print $1, $2, $3 }

Awk also has a special variable called ORS, the "output record separator". By setting ORS, which defaults to a newline ("\n"), we can control the string that's automatically printed at the end of a print statement. The default ORS value causes awk to output each new print statement on a new line. If we wanted to make the output double-spaced, we would set ORS to "\n\n". Or, if we wanted records to be separated by a single space (and no newline), we would set ORS to " ".
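To see ORS in action, here's a minimal sketch that double-spaces the single-line address output. It assumes the same two sample addresses used earlier, fed in through a shell variable so the example is self-contained; the names input and out are just illustrative.

```shell
# Sample input: two three-line address records separated by a blank line.
input='Jimmy the Weasel
100 Pleasant Drive
San Francisco, CA 12345

Big Tony
200 Incognito Ave.
Suburbia, WA 67890'

# OFS is printed between fields; ORS="\n\n" ends every record with a blank line.
out=$(printf '%s\n' "$input" | awk '
BEGIN {
    FS="\n"
    RS=""
    OFS=", "
    ORS="\n\n"
}
{ print $1, $2, $3 }')

printf '%s\n' "$out"
```

Changing ORS to " " in the BEGIN block would instead run both records together on one line, separated by a single space.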

Multi-line to tabbed

Let's say that we wrote a script that converted our address list to a single-line per record, tab-delimited format for import into a spreadsheet. After using a slightly modified version of address.awk, it would become clear that our program only works for three-line addresses. If awk encountered the following address, the fourth line would be thrown away and not printed:

Cousin Vinnie
Vinnie's Auto Shop
300 City Alley
Sosueme, OR 76543

To handle situations like this, it would be good if our code took the number of fields per record into account, printing each one in order. Right now, the code only prints the first three fields of the address. Here's some code that does what we want:

BEGIN { 
    FS="\n" 
    RS="" 
    ORS="" 
} 
 
{  
    x=1 
    while ( x<NF ) { 
        print $x "\t" 
        x++ 
    } 
    print $NF "\n" 
} 

First, we set the field separator FS to "\n" and the record separator RS to "" so that awk parses the multi-line addresses correctly, as before. Then, we set the output record separator ORS to "", which will cause the print statement to not output a newline at the end of each call. This means that if we want any text to start on a new line, we need to explicitly write print "\n".

In the main code block, we create a variable called x that holds the number of the current field that we're processing. Initially, it's set to 1. Then, we use a while loop (an awk looping construct identical to that found in the C language) to iterate through all but the last field, printing each field followed by a tab character. Finally, we print the last field and a literal newline; again, since ORS is set to "", print won't output newlines for us. Program output looks like this, which is exactly what we wanted:

Jimmy the Weasel        100 Pleasant Drive      San Francisco, CA 12345 
Big Tony        200 Incognito Ave.      Suburbia, WA 67890
Cousin Vinnie   Vinnie's Auto Shop      300 City Alley  Sosueme, OR 76543

Looping constructs

We've already seen awk's while loop construct, which is identical to its C counterpart. Awk also has a "do...while" loop that evaluates the condition at the end of the code block, rather than at the beginning like a standard while loop. It's similar to "repeat...until" loops that can be found in other languages. Here's an example:

{
    count=1
    do {
        print "I get printed at least once no matter what" 
    } while ( count != 1 )
}

Because the condition is evaluated after the code block, a "do...while" loop, unlike a normal while loop, will always execute at least once. On the other hand, a normal while loop will never execute if its condition is false when the loop is first encountered.
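The difference is easiest to see when both loops are given the same condition. The following is a small sketch rather than anything from the original scripts: the variable count and the messages are made up for illustration, and everything runs in a BEGIN block so no input file is needed.

```shell
out=$(awk '
BEGIN {
    count = 5

    # The condition is already false on entry, so this body never runs.
    while ( count < 5 ) {
        print "while ran"
    }

    # The body runs once before the same condition is checked.
    do {
        print "do...while ran"
    } while ( count < 5 )
}')

printf '%s\n' "$out"
```

Only the do...while message is printed, since the plain while loop tests its condition before the first pass.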

for loops

Awk allows you to create for loops, which, like while loops, are identical to their C counterparts:

for ( initial assignment; comparison; increment ) {
    code block
}

Here's a quick example:

for ( x = 1; x <= 4; x++ ) {
    print "iteration",x
}

This snippet will print:

iteration 1
iteration 2
iteration 3
iteration 4

Break and continue

Again, just like C, awk provides break and continue statements. These statements provide better control over awk's various looping constructs. Here's a code snippet that desperately needs a break statement:

while (1) {
    print "forever and ever..."
}

Because 1 is always true, this while loop runs forever. Here's a loop that only executes ten times:

x=1
while(1) {
    print "iteration",x
    if ( x == 10 ) {
        break
    }
    x++
}

Here, the break statement is used to "break out" of the innermost loop. "break" causes the loop to immediately terminate and execution to continue at the line after the loop's code block.

The continue statement complements break, and works like this:

x=1
while (1) {
    if ( x == 4 ) {
        x++
        continue
    }
    print "iteration",x
    if ( x > 20 ) {
        break
    }
    x++
}

This code will print "iteration 1" through "iteration 21", except for "iteration 4". If x equals 4, x is incremented and the continue statement is called, which immediately causes awk to start the next loop iteration without executing the rest of the code block. The continue statement works for every kind of awk iterative loop, just as break does. When used in the body of a for loop, continue will cause the loop control variable to be automatically incremented. Here's an equivalent for loop:

for ( x=1; x<=21; x++ ) {
    if ( x == 4 ) {
        continue
    }
    print "iteration",x
}

It wasn't necessary to increment x just before calling continue as it was in our while loop, since the for loop increments x automatically.
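The same for construct can also stand in for the while loop in the earlier multi-line-to-tabbed script. This is one possible rewrite, not the original author's version; the sample input is a shortened variant of the address list (the shop name is simplified so it can sit inside a single-quoted shell string), and input and out are illustrative names.

```shell
input='Big Tony
200 Incognito Ave.
Suburbia, WA 67890

Cousin Vinnie
Auto Shop
300 City Alley
Sosueme, OR 76543'

out=$(printf '%s\n' "$input" | awk '
BEGIN { FS="\n"; RS=""; ORS="" }
{
    # Print every field except the last, each followed by a tab.
    for ( x = 1; x < NF; x++ ) {
        print $x "\t"
    }
    # ORS is empty, so the newline has to be printed explicitly.
    print $NF "\n"
}')

printf '%s\n' "$out"
```

Because the loop runs up to NF rather than a fixed count, three-line and four-line addresses are both converted in full.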

Arrays

You'll be pleased to know that awk has arrays. However, under awk, it's customary to start array indices at 1, rather than 0:

myarray[1]="jim"
myarray[2]=456

When awk encounters the first assignment, myarray is created and the element myarray[1] is set to "jim". After the second assignment is evaluated, the array has two elements.

Once an array is defined, awk has a handy mechanism to iterate over its elements, as follows:

for ( x in myarray ) {
    print myarray[x]
}

This code will print out every element in the array myarray. When you use this special "in" form of a for loop, awk will assign every existing index of myarray to x (the loop control variable) in turn, executing the loop's code block once after each assignment. While this is a very handy awk feature, it does have one drawback -- when awk cycles through the array indices, it doesn't follow any particular order. That means that there's no way for us to know whether the output of the above code will be:

jim
456

or

456
jim

To loosely paraphrase Forrest Gump, iterating over the contents of an array is like a box of chocolates -- you never know what you're going to get. This has something to do with the "stringiness" of awk arrays, which we'll now take a look at.
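When the order matters and the indices are consecutive integers, one common workaround is to skip the "in" form and count through the indices with an ordinary for loop. A minimal sketch, assuming we keep track of the element count ourselves in a variable called n (a name chosen just for this example):

```shell
out=$(awk '
BEGIN {
    myarray[1] = "jim"
    myarray[2] = 456
    n = 2    # element count, tracked by hand (an assumption of this sketch)

    # Walk the indices 1..n explicitly, so the order is predictable.
    for ( x = 1; x <= n; x++ ) {
        print myarray[x]
    }
}')

printf '%s\n' "$out"
```

This only works for arrays indexed 1 through n with no gaps; for arbitrary indices, the "in" form is still the way to visit every element.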

Array index stringiness

In my previous article, I showed you that awk actually stores numeric values in a string format. While awk performs the necessary conversions to make this work, it does open the door for some odd-looking code:

a="1"
b="2"
c=a+b+3

After this code executes, c is equal to 6. Since awk is "stringy", adding strings "1" and "2" is functionally no different than adding the numbers 1 and 2. In both cases, awk will successfully perform the math. Awk's "stringy" nature is pretty intriguing -- you may wonder what happens if we use string indexes for arrays. For instance, take the following code:

myarr["1"]="Mr. Whipple"
print myarr["1"]

As you might expect, this code will print "Mr. Whipple". But how about if we drop the quotes around the second "1" index?

myarr["1"]="Mr. Whipple"
print myarr[1]

Guessing the result of this code snippet is a bit more difficult. Does awk consider myarr["1"] and myarr[1] to be two separate elements of the array, or do they refer to the same element? The answer is that they refer to the same element, and awk will print "Mr. Whipple", just as in the first code snippet. Although it may seem strange, behind the scenes awk has been using string indexes for its arrays all this time!

After learning this strange fact, some of us may be tempted to execute some wacky code that looks like this:

myarr["name"]="Mr. Whipple"
print myarr["name"]

Not only does this code not raise an error, but it's functionally identical to our previous examples, and will print "Mr. Whipple" just as before! As you can see, awk doesn't limit us to using pure integer indexes; we can use string indexes if we want to, without creating any problems. Whenever we use non-integer array indices like myarr["name"], we're using associative arrays. Technically, awk isn't doing anything different behind the scenes than when we use an integer index (since even if you use an "integer" index, awk still treats it as a string). However, you should still call 'em associative arrays -- it sounds cool and will impress your boss. The stringy index thing will be our little secret. ;)
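String indexes are what make tallying so easy in awk. Here's a small illustrative sketch (the fruit names and the count array are invented for this example) that uses each input line as an array index and counts how often it appears:

```shell
out=$(printf 'apple\npear\napple\n' | awk '
{ count[$1]++ }     # the text of the first field is the array index
END {
    # Look up known keys directly, since "for (x in count)" order is unspecified.
    print "apple", count["apple"]
    print "pear", count["pear"]
}')

printf '%s\n' "$out"
```

An element that has never been assigned starts out empty, which awk treats as 0 in arithmetic, so count[$1]++ works without any setup.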

Array tools

When it comes to arrays, awk gives us a lot of flexibility. We can use string indexes, and we aren't required to have a continuous numeric sequence of indices (for example, we can define myarr[1] and myarr[1000], but leave all other elements undefined). While all this can be very helpful, in some circumstances it can create confusion. Fortunately, awk offers a couple of handy features to help make arrays more manageable.

First, we can delete array elements. If you want to delete element 1 of your array fooarray, type:

delete fooarray[1]

And, if you want to see if a particular array element exists, you can use the special "in" boolean operator as follows:

if ( 1 in fooarray ) {
    print "Ayep!  It's there."
} else {
    print "Nope!  Can't find it."
}
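Putting delete and "in" together, here's a short sketch under the same assumptions as above (fooarray and its values are placeholders). It also shows why "in" is the safer existence test: simply referencing an element such as fooarray[5] would create it.

```shell
out=$(awk '
BEGIN {
    fooarray[1] = "first"
    fooarray[1000] = "last"     # indices need not be contiguous

    delete fooarray[1]

    if ( 1 in fooarray ) {
        print "1 is there"
    } else {
        print "1 is gone"
    }
    if ( 1000 in fooarray ) {
        print "1000 is there"
    }
    # The "in" test does not create the element it asks about.
    if ( !(5 in fooarray) ) {
        print "5 was never set"
    }
}')

printf '%s\n' "$out"
```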

Next time

We've covered a lot of ground in this article. Next time, I'll round out your awk knowledge by showing you how to use awk's math and string functions and how to create your own functions. I'll also walk you through the creation of a checkbook balancing program. Until then, I encourage you to write some of your own awk programs, and to check out the following resources.

Resources

Next >>>

Read the next article in this series: Awk by Example, Part 3


About the Author

Daniel Robbins is best known as the creator of Gentoo Linux and author of many IBM developerWorks articles about Linux. Daniel currently serves as Benevolent Dictator for Life (BDFL) of Funtoo Linux. Funtoo Linux is a Gentoo-based distribution and continuation of Daniel's original Gentoo vision.
