Difference between pages "Awk by Example, Part 1" and "Bash by Example, Part 2"

From Funtoo
(Difference between pages)
{{Article
{{Article
|Author=Drobbins
|Author=Drobbins
|Next in Series=Awk by Example, Part 2
|Previous in Series=Bash by Example, Part 1
|Next in Series=Bash by Example, Part 3
}}
}}
{{WikiArticle}}
== More bash programming fundamentals ==


== An intro to the great language with the strange name ==
=== Accepting arguments ===
Let's start with a brief tip on handling command-line arguments, and then look at bash's basic programming constructs.


=== In defense of awk ===
In the sample program in the [[Bash by example, Part1|introductory article]], we used the environment variable "$1", which referred to the first command-line argument. Similarly, you can use "$2", "$3", etc. to refer to the second and third arguments passed to your script. Here's an example:
In this series of articles, I'm going to turn you into a proficient awk coder. I'll admit, awk doesn't have a very pretty or particularly "hip" name, and the GNU version of awk, called gawk, sounds downright weird. Those unfamiliar with the language may hear "awk" and think of a mess of code so backwards and antiquated that it's capable of driving even the most knowledgeable UNIX guru to the brink of insanity (causing him to repeatedly yelp "kill -9!" as he runs for the coffee machine).
<source lang="bash">
#!/usr/bin/env bash


Sure, awk doesn't have a great name. But it is a great language. Awk is geared toward text processing and report generation, yet offers many well-designed features that allow for serious programming. And, unlike some languages, awk's syntax is familiar, and borrows some of the best parts of languages like C, Python, and bash (although, technically, awk was created before both Python and bash). Awk is one of those languages that, once learned, will become a key part of your strategic coding arsenal.
echo name of script is $0
echo first argument is $1
echo second argument is ${2}
echo seventeenth argument is ${17}
echo number of arguments is $#
</source>
The example is self-explanatory except for three small details. First, "$0" will expand to the name of the script, as called from the command line. Second, "$#" will expand to the number of arguments passed to the script. Third, the use of curly braces is optional for single-digit numbers but required for arguments above "$9". Play around with the above script, passing different kinds of command-line arguments to get the hang of how it works.
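To see why the braces matter past the ninth argument, here is a minimal sketch. It uses "set --" to simulate a script that was called with ten arguments; the argument values and the variable names count, tenth, and unbraced are made up for illustration.

```shell
#!/usr/bin/env bash
# Simulate a script called with ten arguments (illustrative values only).
set -- a b c d e f g h i j

count="$#"        # number of arguments: 10
tenth="${10}"     # the tenth argument: j
unbraced="$10"    # parsed as "$1" followed by a literal 0, so: a0

echo "number of arguments is $count"
echo "tenth argument is $tenth"
echo "without braces we get $unbraced"
```

The last line prints "a0" rather than "j", because bash reads "$10" as "$1" followed by the character 0.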


=== The first awk ===
Sometimes, it's helpful to refer to all command-line arguments at once. For this purpose, bash features the "$@" variable, which expands to all command-line parameters separated by spaces. We'll see an example of its use when we take a look at "for" loops, a bit later in this article.
Let's go ahead and start playing around with awk to see how it works. At the command line, enter the following command:


<console>$ awk '{ print }' /etc/passwd</console>
=== Bash programming constructs ===
If you've programmed in a procedural language like C, Pascal, Python, or Perl, then you're familiar with standard programming constructs like "if" statements, "for" loops, and the like. Bash has its own versions of most of these standard constructs. In the next several sections, I will introduce several bash constructs and demonstrate the differences between these constructs and others you are already familiar with from other programming languages. If you haven't programmed much before, don't worry. I include enough information and examples so that you can follow the text.


You should see the contents of your /etc/passwd file appear before your eyes. Now, for an explanation of what awk did. When we called awk, we specified /etc/passwd as our input file. When we executed awk, it evaluated the print command for each line in /etc/passwd, in order. All output is sent to stdout, and we get a result identical to catting /etc/passwd.
=== Conditional love ===
If you've ever programmed any file-related code in C, you know that it requires a significant amount of effort to see if a particular file is newer than another. That's because C doesn't have any built-in syntax for performing such a comparison; instead, two stat() calls and two stat structures must be used to perform the comparison by hand. In contrast, bash has standard file comparison operators built in, so determining if "'''/tmp/myfile''' is readable" is as easy as checking to see if "<span style="color:green">$myvar</span> is greater than 4".  


Now, for an explanation of the { print } code block. In awk, curly braces are used to group blocks of code together, similar to C. Inside our block of code, we have a single print command. In awk, when a print command appears by itself, the full contents of the current line are printed.
The following table lists the most frequently used bash comparison operators. You'll also find an example of how to use every option correctly. The example is meant to be placed immediately after the "if". For example:
<source lang="bash">
if [ -z "$myvar" ]
then
    echo "myvar is not defined"
fi
</source>
Sometimes, there are several different ways that a particular comparison can be made. For example, the following two snippets of code function identically:
<source lang="bash">
if [ "$myvar" -eq 3 ]
then
    echo "myvar equals 3"
fi


Here is another awk example that does exactly the same thing:
if [ "$myvar" = "3" ]
then
    echo "myvar equals 3"
fi
</source>
The above two comparisons produce the same result when myvar contains 3, but the first uses arithmetic comparison operators, while the second uses string comparison operators.
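The distinction between the two operator families becomes visible as soon as the value is not written exactly as "3". The following sketch assumes a hypothetical helper named compare_both (functions are covered later in this article) that runs both comparisons on its first argument.

```shell
#!/usr/bin/env bash
# compare_both is a hypothetical helper used only for this sketch.
compare_both() {
    # Arithmetic comparison: the argument is read as an integer.
    if [ "$1" -eq 3 ]
    then
        echo "arithmetic: equal"
    else
        echo "arithmetic: not equal"
    fi
    # String comparison: the argument is compared character by character.
    if [ "$1" = "3" ]
    then
        echo "string: equal"
    else
        echo "string: not equal"
    fi
}

compare_both 3     # both comparisons report equal
compare_both 03    # only the arithmetic comparison reports equal
```

With "03", the arithmetic test still sees the integer 3, while the string test sees two different strings.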


<console>$ awk '{ print $0 }' /etc/passwd</console>
=== String comparison caveats ===
Most of the time, while you can omit the use of double quotes surrounding strings and string variables, it's not a good idea. Why? Because your code will work perfectly, unless an environment variable happens to have a space or a tab in it, in which case bash will get confused. Here's an example of a fouled-up comparison:
<source lang="bash">
if [ $myvar = "foo bar oni" ]
then
    echo "yes"
fi
</source>
In the above example, if myvar equals "foo", the code will work as expected and not print anything. However, if myvar equals "foo bar oni", the code will fail with the following error:
<source lang="bash">
[: too many arguments
</source>
In this case, the spaces in "$myvar" (which equals "foo bar oni") end up confusing bash. After bash expands "$myvar", it ends up with the following comparison:
<source lang="bash">
[ foo bar oni = "foo bar oni" ]
</source>
Because the environment variable wasn't placed inside double quotes, bash thinks that you stuffed too many arguments in-between the square brackets. You can easily eliminate this problem by surrounding the string arguments with double-quotes. Remember, if you get into the habit of surrounding all string arguments and environment variables with double-quotes, you'll eliminate many similar programming errors. Here's how the "foo bar oni" comparison should have been written:
<source lang="bash">
if [ "$myvar" = "foo bar oni" ]
then
    echo "yes"
fi
</source>
The above code will work as expected and will not create any unpleasant surprises.


In awk, the $0 variable represents the entire current line, so print and print $0 do exactly the same thing. If you'd like, you can create an awk program that will output data totally unrelated to the input data. Here's an example:
{{fancynote|If you want your environment variables to be expanded, you must enclose them in double quotes, rather than single quotes. Single quotes disable variable (as well as history) expansion.}}
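A quick way to convince yourself of the difference between the two kinds of quotes is to assign the same text both ways. This is only a sketch; the variable names double and single are arbitrary.

```shell
#!/usr/bin/env bash
myvar="foo bar oni"

# Double quotes: $myvar is expanded, and the spaces stay inside one string.
double="value is $myvar"

# Single quotes: the text is taken literally, dollar sign and all.
single='value is $myvar'

echo "$double"
echo "$single"
```

The first echo prints "value is foo bar oni", while the second prints the literal text "value is $myvar".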
=== Looping constructs: "for" ===
OK, we've covered conditionals, now it's time to explore bash looping constructs. We'll start with the standard "for" loop. Here's a basic example:  
<source lang="bash">
#!/usr/bin/env bash


<console>$ awk '{ print "" }' /etc/passwd</console>
for x in one two three four
do
    echo number $x
done


Whenever you pass the "" string to the print command, it prints a blank line. If you test this script, you'll find that awk outputs one blank line for every line in your /etc/passwd file. Again, this is because awk executes your script for every line in the input file. Here's another example:
Output:
number one
number two
number three
number four
</source>
What exactly happened? The "for x" part of our "for" loop defined a new environment variable (also called a loop control variable) called "$x", which was successively set to the values "one", "two", "three", and "four". After each assignment, the body of the loop (the code between the "do" ... "done") was executed once. In the body, we referred to the loop control variable "$x" using standard variable expansion syntax, like any other environment variable. Also notice that "for" loops always accept some kind of word list after the "in" statement. In this case we specified four English words, but the word list can also refer to file(s) on disk or even file wildcards. Look at the following example, which demonstrates how to use standard shell wildcards:  
<source lang="bash">
#!/usr/bin/env bash


<console>$ awk '{ print "hiya" }' /etc/passwd</console>
for myfile in /etc/r*
do
    if [ -d "$myfile" ]
    then
      echo "$myfile (dir)"
    else
      echo "$myfile"
    fi
done


Running this script will fill your screen with hiya's. :)
output:


=== Multiple fields ===
/etc/rc.d (dir)
Awk is really good at handling text that has been broken into multiple logical fields, and allows you to effortlessly reference each individual field from inside your awk script. The following script will print out a list of all user accounts on your system:
/etc/resolv.conf
/etc/resolv.conf~
/etc/rpc
</source>
The above code looped over each file in '''/etc''' that began with an "r". To do this, bash first took our wildcard /etc/r* and expanded it, replacing it with the string '''/etc/rc.d /etc/resolv.conf /etc/resolv.conf~ /etc/rpc''' before executing the loop. Once inside the loop, the "-d" conditional operator was used to perform two different actions, depending on whether myfile was a directory or not. If it was, a " (dir)" was appended to the output line.


<console>$ awk -F":" '{ print $1 }' /etc/passwd</console>  
We can also use multiple wildcards and even environment variables in the word list:
<source lang="bash">
for x in /etc/r??? /var/lo* /home/drobbins/mystuff/* /tmp/${MYPATH}/*
do
    cp "$x" /mnt/mydira
done
</source>
Bash will perform wildcard and variable expansion in all the right places, and potentially create a very long word list.


Above, when we called awk, we used the -F option to specify ":" as the field separator. When awk processes the print $1 command, it will print out the first field that appears on each line in the input file. Here's another example:
While all of our wildcard expansion examples have used absolute paths, you can also use relative paths, as follows:
<source lang="bash">
for x in ../* mystuff/*
do
    echo $x is a silly file
done
</source>
In the above example, bash performs wildcard expansion relative to the current working directory, just like when you use relative paths on the command line. Play around with wildcard expansion a bit. You'll notice that if you use absolute paths in your wildcard, bash will expand the wildcard to a list of absolute paths. Otherwise, bash will use relative paths in the subsequent word list. If you simply refer to files in the current working directory (for example, if you type <span style="color:green">for x in *</span>), the resultant list of files will not be prefixed with any path information. Remember that preceding path information can be stripped using the <span style="color:green">basename</span> executable, as follows:
<source lang="bash">
for x in /var/log/*
do
    echo `basename "$x"` is a file living in /var/log
done
</source>
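As an alternative to calling the basename executable, bash's own parameter expansion can strip the leading path: "${x##*/}" removes everything up to and including the last slash. The sketch below loops over two made-up path strings (no files are touched) and keeps the result in a variable called name, chosen just for this example.

```shell
#!/usr/bin/env bash
# The paths are illustrative strings; they do not need to exist on disk.
for x in /var/log/messages /var/log/apache/access.log
do
    # Remove the longest leading match of "*/", leaving only the file name.
    name="${x##*/}"
    echo "$name came from $x"
done
```

Because the work is done by the shell itself, no external process is started for each file.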
Of course, it's often handy to perform loops that operate on a script's command-line arguments. Here's an example of how to use the "$@" variable, introduced at the beginning of this article:  
<source lang="bash">
#!/usr/bin/env bash


<console>$ awk -F":" '{ print $1 $3 }' /etc/passwd</console>
for thing in "$@"
do
    echo you typed ${thing}.
done


Here's an excerpt of the output from this script:
output:
<pre>
halt7
operator11
root0
shutdown6
sync5
bin1
....etc.
</pre>
As you can see, awk prints out the first and third fields of the /etc/passwd file, which happen to be the username and uid fields respectively. Now, while the script did work, it's not perfect -- there aren't any spaces between the two output fields! If you're used to programming in bash or python, you may have expected the print $1 $3 command to insert a space between the two fields. However, when two strings appear next to each other in an awk program, awk concatenates them without adding an intermediate space. The following command will insert a space between both fields:
 
<console>$ awk -F":" '{ print $1 " " $3 }' /etc/passwd</console>
 
When you call print this way, it'll concatenate $1, " ", and $3, creating readable output. Of course, we can also insert some text labels if needed:


<console>$ awk -F":" '{ print "username: " $1 "\t\tuid:" $3 }' /etc/passwd</console>
$ allargs hello there you silly
you typed hello.
you typed there.
you typed you.
you typed silly.
</source>
=== Shell arithmetic ===
Before looking at a second type of looping construct, it's a good idea to become familiar with performing shell arithmetic. Yes, it's true: You can perform simple integer math using shell constructs. Simply enclose the particular arithmetic expression between a "$((" and a "))", and bash will evaluate the expression. Here are some examples:  
<source lang="bash">
$ echo $(( 100 / 3 ))
33
$ myvar="56"
$ echo $(( $myvar + 12 ))
68
$ echo $(( $myvar - $myvar ))
0
$ myvar=$(( $myvar + 1 ))
$ echo $myvar
57
</source>
Now that you're familiar with performing mathematical operations, it's time to introduce two other bash looping constructs, "while" and "until".
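Since shell arithmetic works on integers only, division discards any remainder. Here is a small sketch of that behavior along with the remainder operator and parentheses for grouping; the variable names are arbitrary.

```shell
#!/usr/bin/env bash
total=17

quotient=$(( $total / 5 ))       # integer division: 3
remainder=$(( $total % 5 ))      # remainder: 2
scaled=$(( ($total + 3) * 2 ))   # parentheses group as usual: 40

echo "$quotient $remainder $scaled"
```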


This will cause the output to be:
=== More looping constructs: "while" and "until" ===
<pre>
A "while" statement will execute as long as a particular condition is true, and has the following format:  
username: halt     uid:7
<source lang="bash">
username: operator uid:11
while [ condition ]
username: root    uid:0  
do
username: shutdown uid:6
    statements
username: sync     uid:5
done
username: bin      uid:1  
</source>
....etc.
"While" statements are typically used to loop a certain number of times, as in the following example, which will loop exactly 10 times:  
</pre>
<source lang="bash">
myvar=0
while [ $myvar -ne 10 ]
do
    echo $myvar
    myvar=$(( $myvar + 1 ))
done
</source>
You can see the use of arithmetic expansion to eventually cause the condition to be false, and the loop to terminate.


=== External Scripts ===
"Until" statements provide the inverse functionality of "while" statements: They repeat as long as a particular condition is false. Here's an "until" loop that functions identically to the previous "while" loop:
Passing your scripts to awk as a command line argument can be very handy for small one-liners, but when it comes to complex, multi-line programs, you'll definitely want to compose your script in an external file. Awk can then be told to source this script file by passing it the -f option:
<source lang="bash">
myvar=0
until [ $myvar -eq 10 ]
do
    echo $myvar
    myvar=$(( $myvar + 1 ))
done
</source>
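A loop body can update more than just its counter. As a sketch, the following "while" loop adds up the integers from 1 to 10 using arithmetic expansion; the names i and sum are arbitrary.

```shell
#!/usr/bin/env bash
i=1
sum=0
# Keep looping while i is less than or equal to 10.
while [ $i -le 10 ]
do
    sum=$(( $sum + $i ))
    i=$(( $i + 1 ))
done
echo "sum is $sum"
```

The same loop could be written with "until [ $i -gt 10 ]", since "until" simply inverts the condition.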


<console>$ awk -f myscript.awk myfile.in </console>
=== Case statements ===
"Case" statements are another conditional construct that comes in handy. Here's an example snippet:
<source lang="bash">
case "${x##*.}" in
    gz)
          gzunpack ${SROOT}/${x}
          ;;
    bz2)
          bz2unpack ${SROOT}/${x}
          ;;
    *)
          echo "Archive format not recognized."
          exit
          ;;
esac
</source>
Above, bash first expands "${x##*.}". In the code, "$x" is the name of a file, and "${x##*.}" has the effect of stripping all text except that following the last period in the filename. Then, bash compares the resultant string against the values listed to the left of the ")"s. In this case, "${x##*.}" gets compared against "gz", then "bz2" and finally "*". If "${x##*.}" matches any of these strings or patterns, the lines immediately following the ")" are executed, up until the ";;", at which point bash continues executing lines after the terminating "esac". If no patterns or strings are matched, no lines of code are executed; however, in this particular code snippet, at least one block of code will execute, because the "*" pattern will catch everything that didn't match "gz" or "bz2".
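To watch the matching happen without needing any unpacking tools, here is a sketch that classifies a few made-up file names. The names and the kind variable are assumptions for illustration only.

```shell
#!/usr/bin/env bash
# The file names are illustrative strings; nothing is read from disk.
for x in backup.tar.gz notes.bz2 README
do
    case "${x##*.}" in
        gz)
            kind="gzip"
            ;;
        bz2)
            kind="bzip2"
            ;;
        *)
            kind="unknown"
            ;;
    esac
    echo "$x: $kind"
done
```

Note that "README" contains no period, so "${x##*.}" leaves it unchanged and it falls through to the "*" pattern.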


Putting your scripts in their own text files also allows you to take advantage of additional awk features. For example, this multi-line script does the same thing as one of our earlier one-liners, printing out the first field of each line in /etc/passwd:
=== Functions and namespaces ===
<pre>
In bash, you can even define functions, similar to those in other procedural languages like Pascal and C. In bash, functions can even accept arguments, using a system very similar to the way scripts accept command-line arguments. Let's take a look at a sample function definition and then proceed from there:  
BEGIN {  
<source lang="bash">
         FS=":"  
tarview() {
}  
    echo -n "Displaying contents of $1 "
{ print $1 }  
    if [ "${1##*.}" = tar ]
</pre>
    then
The difference between these two methods has to do with how we set the field separator. In this script, the field separator is specified within the code itself (by setting the FS variable), while our previous example set FS by passing the -F":" option to awk on the command line. It's generally best to set the field separator inside the script itself, simply because it means you have one less command line argument to remember to type. We'll cover the FS variable in more detail later in this article.
         echo "(uncompressed tar)"
 
        tar tvf $1
It is also possible to make the script directly executable, by placing a "#!/usr/bin/awk -f" at the top of the file, as follows:
    elif [ "${1##*.}" = gz ]
    then
        echo "(gzip-compressed tar)"
        tar tzvf $1
    elif [ "${1##*.}" = bz2 ]
    then
        echo "(bzip2-compressed tar)"
        cat $1 | bzip2 -d | tar tvf -
    fi
}
</source>
{{fancynote|Another case: The above code could have been written using a "case" statement. Can you figure out how?}}
Above, we define a function called "tarview" that accepts one argument, a tarball of some kind. When the function is executed, it identifies what type of tarball the argument is (either uncompressed, gzip-compressed, or bzip2-compressed), prints out a one-line informative message, and then displays the contents of the tarball. This is how the above function should be called (whether from a script or from the command line, after it has been typed in, pasted in, or sourced):  
<pre>
<pre>
#!/usr/bin/awk -f
$ tarview shorten.tar.gz
BEGIN {
Displaying contents of shorten.tar.gz (gzip-compressed tar)
FS=":"
drwxr-xr-x ajr/abbot        0 1999-02-27 16:17 shorten-2.3a/
}
-rw-r--r-- ajr/abbot      1143 1997-09-04 04:06 shorten-2.3a/Makefile
{ print $1 }
-rw-r--r-- ajr/abbot      1199 1996-02-04 12:24 shorten-2.3a/INSTALL
-rw-r--r-- ajr/abbot      839 1996-05-29 00:19 shorten-2.3a/LICENSE
....
</pre>
</pre>
Next, the script must be made executable by setting the script file's execute bit:
As you can see, arguments can be referenced inside the function definition by using the same mechanism used to reference command-line arguments. In addition, the "$#" macro will be expanded to contain the number of arguments. The only thing that may not work completely as expected is the variable "$0", which will either expand to the string "bash" (if you run the function from the shell, interactively) or to the name of the script the function is called from.
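Here is a minimal sketch of "$#" and "$1" at work inside a function. The function name count_args is hypothetical.

```shell
#!/usr/bin/env bash
# count_args is a hypothetical function used only for this sketch.
count_args() {
    echo "got $# arguments, first is $1"
}

count_args one "two three" four
count_args
```

The first call reports three arguments, because the quoted "two three" is passed as a single argument. The second call reports zero arguments and an empty first argument.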


<console>$ chmod +x myscript.awk</console>
{{fancynote|Use 'em interactively: Don't forget that functions, like the one above, can be placed in your ~/.bashrc or ~/.bash_profile so that they are available for use whenever you are in bash.}}


Now, you should be able to execute the script as follows:
=== Namespace ===
Often, you'll need to create environment variables inside a function. While possible, there's a technicality you should know about. In most compiled languages (such as C), when you create a variable inside a function, it's placed in a separate local namespace. So, if you define a function in C called myfunction, and in it define a variable called "x", any global (outside the function) variable called "x" will not be affected by it, eliminating side effects.


<console>$ ./myscript.awk myfile.in</console>
While true in C, this isn't true in bash. In bash, whenever you create an environment variable inside a function, it's added to the global namespace. This means that it will overwrite any global variable outside the function, and will continue to exist even after the function exits:
<source lang="bash">
#!/usr/bin/env bash


=== The BEGIN and END blocks ===
myvar="hello"
Normally, awk executes each block of your script's code once for each input line. However, there are many programming situations where you may need to execute initialization code before awk begins processing the text from the input file. For such situations, awk allows you to define a BEGIN block. We used a BEGIN block in the previous example. Because the BEGIN block is evaluated before awk starts processing the input file, it's an excellent place to initialize the FS (field separator) variable, print a heading, or initialize other global variables that you'll reference later in the program.


Awk also provides another special block, called the END block. Awk executes this block after all lines in the input file have been processed. Typically, the END block is used to perform final calculations or print summaries that should appear at the end of the output stream.
myfunc() {


=== Regular expressions and blocks ===
    myvar="one two three"
Awk allows the use of regular expressions to selectively execute an individual block of code, depending on whether or not the regular expression matches the current line. Here's an example script that outputs only those lines that contain the character sequence foo:
    for x in $myvar
 
    do
<pre>/foo/ { print }</pre>
        echo $x
 
    done
Of course, you can use more complicated regular expressions. Here's a script that will print only lines that contain a floating point number:
}
 
<pre>/[0-9]+\.[0-9]*/ { print }</pre>


=== Expressions and blocks ===
myfunc
There are many other ways to selectively execute a block of code. We can place any kind of boolean expression before a code block to control when a particular block is executed. Awk will execute a code block only if the preceding boolean expression evaluates to true. The following example script will output the third field of all lines that have a first field equal to fred. If the first field of the current line is not equal to fred, awk will continue processing the file and will not execute the print statement for the current line:


<pre>$1 == "fred" { print $3 }</pre>
echo $myvar $x
</source>
When this script is run, the loop inside the function prints "one", "two", and "three" on separate lines, and the final echo prints "one two three three", showing how "$myvar" defined in the function clobbered the global variable "$myvar", and how the loop control variable "$x" continued to exist even after the function exited (and also would have clobbered any global "$x", if one were defined).


Awk offers a full selection of comparison operators, including the usual "==", "<", ">", "<=", ">=", and "!=". In addition, awk provides the "~" and "!~" operators, which mean "matches" and "does not match". They're used by specifying a variable on the left side of the operator, and a regular expression on the right side. Here's an example that will print only the third field on the line if the fifth field on the same line contains the character sequence root:
In this simple example, the bug is easy to spot and to compensate for by using alternate variable names. However, this isn't the right approach; the best way to solve this problem is to prevent the possibility of clobbering global variables in the first place, by using the "local" command. When we use "local" to create variables inside a function, they will be kept in the local namespace and not clobber any global variables. Here's how to implement the above code so that no global variables are overwritten:  
<source lang="bash">
#!/usr/bin/env bash


<pre>$5 ~ /root/ { print $3 }</pre>
myvar="hello"


=== Conditional statements ===
myfunc() {
Awk also offers very nice C-like if statements. If you'd like, you could rewrite the previous script using an if statement:
    local x
<pre>
    local myvar="one two three"
{
    for x in $myvar
     if ( $5 ~ /root/ ) {
     do
         print $3
         echo $x
     }
     done
}
}
</pre>
Both scripts function identically. In the first example, the boolean expression is placed outside the block, while in the second example, the block is executed for every input line, and we selectively perform the print command by using an if statement. Both methods are available, and you can choose the one that best meshes with the other parts of your script.


Here's a more complicated example of an awk if statement. As you can see, even with complex, nested conditionals, if statements look identical to their C counterparts:
myfunc
<pre>
{
    if ( $1 == "foo" ) {
        if ( $2 == "foo" ) {
            print "uno"
        } else {
            print "one"
        }
    } else if ($1 == "bar" ) {
        print "two"
    } else {
        print "three"
    }
}
</pre>
Using if statements, we can also transform this code:
<pre>
! /matchme/ { print $1 $3 $4 }
</pre>
to this:
<pre>
{
    if ( $0 !~ /matchme/ ) {
        print $1 $3 $4
    }
}
</pre>
Both scripts will output only those lines that don't contain a matchme character sequence. Again, you can choose the method that works best for your code. They both do the same thing.


Awk also allows the use of boolean operators "||" (for "logical or") and "&&" (for "logical and") to allow the creation of more complex boolean expressions:
echo $myvar $x
<pre>
</source>
( $1 == "foo" ) && ( $2 == "bar" ) { print }
After the loop inside the function prints "one", "two", and "three", the final echo will print "hello" -- the global "$myvar" doesn't get overwritten, and "$x" doesn't continue to exist outside of myfunc. In the first line of the function, we create x, a local variable that is used later, while in the second line (local myvar="one two three") we create a local myvar and assign it a value. The first form is handy for keeping loop control variables local, since we're not allowed to say "for local x in $myvar". This function doesn't clobber any global variables, and you are encouraged to design all your functions this way. The only time you should not use "local" is when you explicitly want to modify a global variable.
</pre>
This example will print only those lines where field one equals foo and field two equals bar.


=== Numeric variables! ===
=== Wrapping it up ===
So far, we've either printed strings, the entire line, or specific fields. However, awk also allows us to perform both integer and floating point math. Using mathematical expressions, it's very easy to write a script that counts the number of blank lines in a file. Here's one that does just that:
Now that we've covered the most essential bash functionality, it's time to look at how to develop an entire application based in bash. In my next installment, we'll do just that. See you then!
<pre>
BEGIN { x=0 }
/^$/  { x=x+1 }
END  { print "I found " x " blank lines. :)" }
</pre>
In the BEGIN block, we initialize our integer variable x to zero. Then, each time awk encounters a blank line, awk will execute the x=x+1 statement, incrementing x. After all the lines have been processed, the END block will execute, and awk will print out a final summary, specifying the number of blank lines it found.
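To try the blank-line counter without creating a file, you can feed it a few lines with printf. This is just a sketch; the sample text and the shell variable blank are made up for illustration.

```shell
#!/usr/bin/env bash
# Six input lines, three of which are empty.
blank=$(printf 'first\n\nsecond\n\n\nthird\n' |
    awk 'BEGIN { x=0 } /^$/ { x=x+1 } END { print "I found " x " blank lines. :)" }')
echo "$blank"
```

For this input, awk reports three blank lines.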


=== Stringy variables ===
== Resources ==
One of the neat things about awk variables is that they are "simple and stringy." I consider awk variables "stringy" because all awk variables are stored internally as strings. At the same time, awk variables are "simple" because you can perform mathematical operations on a variable, and as long as it contains a valid numeric string, awk automatically takes care of the string-to-number conversion steps. To see what I mean, check out this example:
<pre>
x="1.01"
# We just set x to contain the *string* "1.01"
x=x+1
# We just added one to a *string*
print x
# Incidentally, these are comments :)
</pre>
Awk will output:
<pre>
2.01
</pre>
Interesting! Although we assigned the string value 1.01 to the variable x, we were still able to add one to it. We wouldn't be able to do this in bash or Python. First of all, bash doesn't support floating point arithmetic. And, while bash has "stringy" variables, they aren't "simple"; to perform any mathematical operations, bash requires that we enclose our math in an ugly $(( )) construct. If we were using Python, we would have to explicitly convert our 1.01 string to a floating point value before performing any arithmetic on it. While this isn't difficult, it's still an additional step. With awk, it's all automatic, and that makes our code nice and clean. If we wanted to square and add one to the first field in each input line, we would use this script:
<pre>
{ print ($1^2)+1 }
</pre>
If you do a little experimenting, you'll find that if a particular variable doesn't contain a valid number, awk will treat that variable as a numerical zero when it evaluates your mathematical expression.
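Here is one such experiment as a sketch. It feeds the squaring script an integer, a non-numeric word, and a floating point number; the sample input and the shell variable result are assumptions for illustration.

```shell
#!/usr/bin/env bash
# "abc" is not a valid number, so awk treats it as zero: 0^2 + 1 = 1.
result=$(printf '3\nabc\n2.5\n' | awk '{ print ($1^2)+1 }')
echo "$result"
```

The three output lines are 10, 1, and 7.25.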
 
=== Lots of operators ===
Another nice thing about awk is its full complement of mathematical operators. In addition to standard addition, subtraction, multiplication, and division, awk allows us to use the previously demonstrated exponent operator "^", the modulo (remainder) operator "%", and a bunch of other handy assignment operators borrowed from C.
 
These include pre- and post-increment/decrement ( i++, --foo ), add/sub/mult/div assign operators ( a+=3, b*=2, c/=2.2, d-=6.2 ). But that's not all -- we also get handy modulo/exponent assign ops as well ( a^=2, b%=4 ).
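A few of these operators can be exercised from a BEGIN block, which needs no input file. This is only a sketch; the variable names a and i and the shell variable result are arbitrary.

```shell
#!/usr/bin/env bash
result=$(awk 'BEGIN {
    a = 7
    a %= 4          # remainder: a is now 3
    a ^= 2          # exponent: a is now 9
    a += 3          # addition: a is now 12
    print a
    i = 5
    print i++       # prints 5, then i becomes 6
    print ++i       # i becomes 7, then prints 7
}')
echo "$result"
```

The output is 12, 5, and 7, each on its own line.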
 
=== Field separators ===
Awk has its own complement of special variables. Some of them allow you to fine-tune how awk functions, while others can be read to glean valuable information about the input. We've already touched on one of these special variables, FS. As mentioned earlier, this variable allows you to set the character sequence that awk expects to find between fields. When we were using /etc/passwd as input, FS was set to ":". While this did the trick, FS allows us even more flexibility.
 
The FS value is not limited to a single character; it can also be set to a regular expression, specifying a character pattern of any length. If you're processing fields separated by one or more tabs, you'll want to set FS like so:
<pre>
FS="\t+"
</pre>
Above, we use the special "+" regular expression character, which means "one or more of the previous character".
 
If your fields are separated by whitespace (one or more spaces or tabs), you may be tempted to set FS to the following regular expression:
<pre>
FS="[[:space:]]+"
</pre>
While this assignment will do the trick, it's not necessary. Why? Because by default, FS is set to a single space character, which awk interprets to mean "one or more spaces or tabs." In this particular example, the default FS setting was exactly what you wanted in the first place!
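The following sketch contrasts the default separator with the tab-only setting. The sample input and the shell variables by_default and by_tabs are made up for illustration.

```shell
#!/usr/bin/env bash
# Default FS: runs of spaces and tabs separate fields, so the third field is "gamma".
by_default=$(printf '  alpha   beta\tgamma\n' | awk '{ print $3 }')

# FS="\t+": only runs of tabs separate fields, so "beta gamma" stays together.
by_tabs=$(printf 'alpha\t\tbeta gamma\n' | awk 'BEGIN { FS="\t+" } { print $2 }')

echo "$by_default"
echo "$by_tabs"
```

The first command prints "gamma" and the second prints "beta gamma".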


Complex regular expressions are no problem. Even if your fields are separated by the word "foo," followed by three digits, the following regular expression will allow your data to be parsed properly:
*Read [[Bash by Example, Part 1]].
<pre>
*Read [[Bash by Example, Part 3]].
FS="foo[0-9][0-9][0-9]"
*Visit [http://www.gnu.org/software/bash/bash.html GNU's bash home page].
</pre>
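
Here is a short sketch that tries out both regular-expression separators from this section. The input strings are invented for the example, and printf is used only to generate them.

```shell
# Fields separated by one or more tabs.
tab_field=$(printf 'alpha\t\tbeta\tgamma\n' |
    awk 'BEGIN { FS = "\t+" } { print $2 }')
echo "$tab_field"     # prints: beta

# Fields separated by "foo" followed by three digits.
foo_field=$(printf 'onefoo123twofoo456three\n' |
    awk 'BEGIN { FS = "foo[0-9][0-9][0-9]" } { print $3 }')
echo "$foo_field"     # prints: three
```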
 
=== Number of fields ===
The next two variables we're going to cover are not normally written to; instead, they are read to gain useful information about the input. The first is the NF variable, also called the "number of fields" variable. Awk will automatically set this variable to the number of fields in the current record. You can use the NF variable to display only certain input lines:
<pre>
NF == 3 { print "this particular record has three fields: " $0 }
</pre>
Of course, you can also use the NF variable in conditional statements, as follows:
<pre>
{
    if ( NF > 2 ) {
        print $1 " " $2 ":" $3
    }
}
</pre>
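
To try NF without a data file, you can feed awk a few made-up lines. This sketch assumes three sample records with two, three, and four fields. It also uses "$NF", which refers to the last field of the current record.

```shell
# Sample input is invented for this sketch: lines with 2, 3, and 4 fields.
nf_out=$(printf 'a b\nc d e\nf g h i\n' | awk '
    NF == 3 { print "three fields: " $0 }
    NF > 3  { print "last field is " $NF }
')
echo "$nf_out"
# prints:
# three fields: c d e
# last field is i
```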
 
=== Record number ===
The record number (NR) is another handy variable. It will always contain the number of the current record (awk counts the first record as record number 1). Up until now, we've been dealing with input files that contain one record per line. For these situations, NR will also tell you the current line number. However, when we start to process multi-line records later in the series, this will no longer be the case, so be careful! NR can be used like the NF variable to print only certain lines of the input:
<pre>
(NR < 10 ) || (NR > 100) { print "We are on record number 1-9 or 101+" }
</pre>
<pre>
{
    #skip header
    if ( NR > 10 ) {
        print "ok, now for the real information!"
    }
}
</pre>
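
As a small sketch of the same idea, the following pipeline skips a one-line header and numbers the remaining records. The three input lines are made up for illustration.

```shell
# Skip the first record (a made-up header) and show NR for the rest.
nr_out=$(printf 'header\nfirst\nsecond\n' | awk 'NR > 1 { print NR ": " $0 }')
echo "$nr_out"
# prints:
# 2: first
# 3: second
```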
Awk provides additional variables that can be used for a variety of purposes. We'll cover more of these variables in later articles.
 
We've come to the end of our initial exploration of awk. As the series continues, I'll demonstrate more advanced awk functionality, and we'll end the series with a real-world awk application.
 
== Resources ==


* Read Daniel's other awk articles on Funtoo: Awk By Example, [[Awk by example, Part2 |Part 2]] and [[Awk by example, Part3 |Part 3]].
__NOTOC__
* If you'd like a good old-fashioned book, [http://www.oreilly.com/catalog/sed2/ O'Reilly's sed & awk, 2nd Edition] is a wonderful choice.
* Be sure to check out the [http://www.faqs.org/faqs/computer-lang/awk/faq/ comp.lang.awk FAQ]. It also contains lots of additional awk links.
* Patrick Hartigan's [http://sparky.rice.edu/~hartigan/awk.html awk tutorial] is packed with handy awk scripts.
* [http://www.tasoft.com/tawk.html Thompson's TAWK Compiler] compiles awk scripts into fast binary executables. Versions are available for Windows, OS/2, DOS, and UNIX.
* [http://www.gnu.org/software/gawk/manual/gawk.html The GNU Awk User's Guide] is available for online reference.
* [http://www.folkstalk.com/2011/12/good-examples-of-awk-command-in-unix.html Awk Command] daily useful examples.
[[Category:Linux Core Concepts]]
[[Category:Articles]]
{{ArticleFooter}}

Revision as of 08:49, December 28, 2014


More bash programming fundamentals

Accepting arguments

Let's start with a brief tip on handling command-line arguments, and then look at bash's basic programming constructs.

In the sample program in the introductory article, we used the positional parameter "$1", which referred to the first command-line argument. Similarly, you can use "$2", "$3", etc. to refer to the second and third arguments passed to your script. Here's an example:

#!/usr/bin/env bash

echo name of script is $0
echo first argument is $1
echo second argument is ${2}
echo seventeenth argument is ${17}
echo number of arguments is $#

The example is self-explanatory except for three small details. First, "$0" will expand to the name of the script, as called from the command line. Second, "$#" will expand to the number of arguments passed to the script. Third, the use of curly braces is optional for single-digit numbers but required for arguments above "$9". Play around with the above script, passing different kinds of command-line arguments to get the hang of how it works.
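
Here is a minimal sketch of why the braces matter above "$9". It uses "set --" to stand in for real command-line arguments, and the variable names unbraced and braced are made up for this example.

```shell
# "set --" replaces the positional parameters, standing in for real
# command-line arguments in this sketch.
set -- a b c d e f g h i j k

unbraced="$10"      # parsed as "$1" followed by a literal 0
braced="${10}"      # the tenth argument

echo "unbraced: $unbraced, braced: $braced, count: $#"
# prints: unbraced: a0, braced: j, count: 11
```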

Sometimes, it's helpful to refer to all command-line arguments at once. For this purpose, bash features the "$@" variable, which expands to all command-line parameters separated by spaces. We'll see an example of its use when we take a look at "for" loops, a bit later in this article.

Bash programming constructs

If you've programmed in a procedural language like C, Pascal, Python, or Perl, then you're familiar with standard programming constructs like "if" statements, "for" loops, and the like. Bash has its own versions of most of these standard constructs. In the next several sections, I will introduce several bash constructs and demonstrate the differences between these constructs and others you are already familiar with from other programming languages. If you haven't programmed much before, don't worry. I include enough information and examples so that you can follow the text.

Conditional love

If you've ever programmed any file-related code in C, you know that it requires a significant amount of effort to see if a particular file is newer than another. That's because C doesn't have any built-in syntax for performing such a comparison; instead, two stat() calls and two stat structures must be used to perform the comparison by hand. In contrast, bash has standard file comparison operators built in, so determining if "/tmp/myfile is readable" is as easy as checking to see if "$myvar is greater than 4".

The following table lists the most frequently used bash comparison operators. You'll also find an example of how to use every option correctly. The example is meant to be placed immediately after the "if". For example:

if [ -z "$myvar" ]
then
     echo "myvar is not defined"
fi
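
The file comparisons mentioned above (readable, newer than) can be tried with a couple of scratch files. This is only a sketch: it assumes the mktemp and touch utilities are available, and the variable names are invented for the example.

```shell
# Create two scratch files; mktemp picks the names.
older=$(mktemp)
newer=$(mktemp)
touch -t 202001010000 "$older"   # force an old modification time

if [ -r "$older" ]; then readable=yes; else readable=no; fi
if [ "$newer" -nt "$older" ]; then fresher=yes; else fresher=no; fi
if [ -d "$older" ]; then isdir=yes; else isdir=no; fi

echo "readable=$readable newer=$fresher dir=$isdir"
# prints: readable=yes newer=yes dir=no

rm -f "$older" "$newer"
```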

Sometimes, there are several different ways that a particular comparison can be made. For example, the following two snippets of code function identically:

if [ "$myvar" -eq 3 ]
then 
     echo "myvar equals 3"
fi

if [ "$myvar" = "3" ]
then
     echo "myvar equals 3"
fi

The above two comparisons do exactly the same thing here, but the first uses arithmetic comparison operators, while the second uses string comparison operators.
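
The two kinds of operators agree for a value like "3", but they can part ways when the text of the number differs. The following sketch uses a made-up value with a leading zero to show the difference.

```shell
# A value that is numerically 3 but is not the string "3".
myvar="03"

if [ "$myvar" -eq 3 ]; then arith=match; else arith=differ; fi
if [ "$myvar" = "3" ]; then str=match; else str=differ; fi

echo "arithmetic: $arith, string: $str"
# prints: arithmetic: match, string: differ
```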

String comparison caveats

Most of the time, while you can omit the use of double quotes surrounding strings and string variables, it's not a good idea. Why? Because your code will work perfectly, unless an environment variable happens to have a space or a tab in it, in which case bash will get confused. Here's an example of a fouled-up comparison:

if [ $myvar = "foo bar oni" ]
then
     echo "yes"
fi

In the above example, if myvar equals "foo", the code will work as expected and not print anything. However, if myvar equals "foo bar oni", the code will fail with the following error:

[: too many arguments

In this case, the spaces in "$myvar" (which equals "foo bar oni") end up confusing bash. After bash expands "$myvar", it ends up with the following comparison:

[ foo bar oni = "foo bar oni" ]

Because the environment variable wasn't placed inside double quotes, bash thinks that you stuffed too many arguments in-between the square brackets. You can easily eliminate this problem by surrounding the string arguments with double-quotes. Remember, if you get into the habit of surrounding all string arguments and environment variables with double-quotes, you'll eliminate many similar programming errors. Here's how the "foo bar oni" comparison should have been written:

if [ "$myvar" = "foo bar oni" ]
then
    echo "yes"
fi

The above code will work as expected and will not create any unpleasant surprises.
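
If you'd like to watch the word splitting happen, here is a small sketch. The count_args function is a hypothetical helper written for this example; it simply reports how many arguments it received.

```shell
# count_args is a hypothetical helper that reports how many arguments it got.
count_args() {
    echo $#
}

myvar="foo bar oni"
split_count=$(count_args $myvar)      # unquoted: bash splits on the spaces
whole_count=$(count_args "$myvar")    # quoted: one single argument

echo "unquoted: $split_count, quoted: $whole_count"
# prints: unquoted: 3, quoted: 1
```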

   Note

If you want your environment variables to be expanded, you must enclose them in double quotes, rather than single quotes. Single quotes disable variable (as well as history) expansion.

Looping constructs: "for"

OK, we've covered conditionals, now it's time to explore bash looping constructs. We'll start with the standard "for" loop. Here's a basic example:

#!/usr/bin/env bash

for x in one two three four
do
    echo number $x
done

Output:
number one
number two
number three
number four

What exactly happened? The "for x" part of our "for" loop defined a new environment variable (also called a loop control variable) called "$x", which was successively set to the values "one", "two", "three", and "four". After each assignment, the body of the loop (the code between the "do" ... "done") was executed once. In the body, we referred to the loop control variable "$x" using standard variable expansion syntax, like any other environment variable. Also notice that "for" loops always accept some kind of word list after the "in" statement. In this case we specified four English words, but the word list can also refer to file(s) on disk or even file wildcards. Look at the following example, which demonstrates how to use standard shell wildcards:

#!/usr/bin/env bash

for myfile in /etc/r*
do
    if [ -d "$myfile" ] 
    then
      echo "$myfile (dir)"
    else
      echo "$myfile"
    fi
done

output:

/etc/rc.d (dir)
/etc/resolv.conf
/etc/resolv.conf~
/etc/rpc

The above code looped over each file in /etc that began with an "r". To do this, bash first took our wildcard /etc/r* and expanded it, replacing it with the string /etc/rc.d /etc/resolv.conf /etc/resolv.conf~ /etc/rpc before executing the loop. Once inside the loop, the "-d" conditional operator was used to perform two different actions, depending on whether myfile was a directory or not. If it was, a " (dir)" was appended to the output line.

We can also use multiple wildcards and even environment variables in the word list:

for x in /etc/r??? /var/lo* /home/drobbins/mystuff/* /tmp/${MYPATH}/*
do
    cp "$x" /mnt/mydira
done

Bash will perform wildcard and variable expansion in all the right places, and potentially create a very long word list.

While all of our wildcard expansion examples have used absolute paths, you can also use relative paths, as follows:

for x in ../* mystuff/*
do
     echo $x is a silly file
done

In the above example, bash performs wildcard expansion relative to the current working directory, just like when you use relative paths on the command line. Play around with wildcard expansion a bit. You'll notice that if you use absolute paths in your wildcard, bash will expand the wildcard to a list of absolute paths. Otherwise, bash will use relative paths in the subsequent word list. If you simply refer to files in the current working directory (for example, if you type for x in *), the resultant list of files will not be prefixed with any path information. Remember that preceding path information can be stripped using the basename executable, as follows:

for x in /var/log/*
do
    echo "$(basename "$x") is a file living in /var/log"
done
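
As an aside, the same stripping can be done without running an external program, by using the "##" parameter expansion that also appears in the "case" example in this article. A minimal sketch, with a made-up path:

```shell
# Two ways to strip the leading directory from a path (the path is made up).
x=/var/log/messages

with_basename=$(basename "$x")   # external basename executable
with_expansion=${x##*/}          # remove everything up to the last "/"

echo "$with_basename $with_expansion"
# prints: messages messages
```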

Of course, it's often handy to perform loops that operate on a script's command-line arguments. Here's an example of how to use the "$@" variable, introduced at the beginning of this article:

#!/usr/bin/env bash

for thing in "$@"
do
    echo you typed ${thing}.
done

output:

$ allargs hello there you silly
you typed hello.
you typed there.
you typed you.
you typed silly.
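
The double quotes around "$@" matter when an argument contains a space. Here is a sketch with two hypothetical helper functions that loop over their arguments like the script above and report how many iterations ran.

```shell
# Both functions are hypothetical helpers written for this sketch.
count_quoted() {
    n=0
    for thing in "$@"       # each original argument stays intact
    do
        n=$(( n + 1 ))
    done
    echo "$n"
}

count_unquoted() {
    n=0
    for thing in $@         # arguments are split again on whitespace
    do
        n=$(( n + 1 ))
    done
    echo "$n"
}

q_count=$(count_quoted "hello there" you)
u_count=$(count_unquoted "hello there" you)
echo "quoted: $q_count, unquoted: $u_count"
# prints: quoted: 2, unquoted: 3
```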

Shell arithmetic

Before looking at a second type of looping construct, it's a good idea to become familiar with performing shell arithmetic. Yes, it's true: You can perform simple integer math using shell constructs. Simply enclose the particular arithmetic expression between a "$((" and a "))", and bash will evaluate the expression. Here are some examples:

$ echo $(( 100 / 3 ))
33
$ myvar="56"
$ echo $(( $myvar + 12 ))
68
$ echo $(( $myvar - $myvar ))
0
$ myvar=$(( $myvar + 1 ))
$ echo $myvar
57
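
Two more things are worth trying here, shown in the sketch below: integer division throws away the remainder, and the "%" operator gives you that remainder back. Inside "$(( ))" a variable name can also be written without the leading "$". The variable names quotient and remainder are made up for this example.

```shell
myvar=56

quotient=$(( myvar / 10 ))     # integer division truncates
remainder=$(( myvar % 10 ))    # modulo gives what is left over

echo "$quotient remainder $remainder"
# prints: 5 remainder 6
```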

Now that you're familiar with performing mathematical operations, it's time to introduce two other bash looping constructs, "while" and "until".

More looping constructs: "while" and "until"

A "while" statement will execute as long as a particular condition is true, and has the following format:

while [ condition ]
do
    statements
done

"While" statements are typically used to loop a certain number of times, as in the following example, which will loop exactly 10 times:

myvar=0
while [ $myvar -ne 10 ]
do
    echo $myvar
    myvar=$(( $myvar + 1 ))
done

You can see the use of arithmetic expansion to eventually cause the condition to be false, and the loop to terminate.

"Until" statements provide the inverse functionality of "while" statements: They repeat as long as a particular condition is false. Here's an "until" loop that functions identically to the previous "while" loop:

myvar=0
until [ $myvar -eq 10 ]
do
    echo $myvar
    myvar=$(( $myvar + 1 ))
done
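
Here is one more sketch that combines a "while" loop with shell arithmetic to add up the numbers 1 through 10. The variable names total and n are made up, and "-le" is used rather than "-ne" so the loop still ends if the counter ever skips past the limit.

```shell
total=0
n=1
while [ "$n" -le 10 ]
do
    total=$(( total + n ))
    n=$(( n + 1 ))
done
echo "$total"
# prints: 55
```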

Case statements

"Case" statements are another conditional construct that comes in handy. Here's an example snippet:

case "${x##*.}" in
     gz)
           gzunpack ${SROOT}/${x}
           ;;
     bz2)
           bz2unpack ${SROOT}/${x}
           ;;
     *)
           echo "Archive format not recognized."
           exit
           ;;
esac

Above, bash first expands "${x##*.}". In the code, "$x" is the name of a file, and "${x##*.}" has the effect of stripping all text except that following the last period in the filename. Then, bash compares the resultant string against the values listed to the left of the ")"s. In this case, "${x##*.}" gets compared against "gz", then "bz2" and finally "*". If "${x##*.}" matches any of these strings or patterns, the lines immediately following the ")" are executed, up until the ";;", at which point bash continues executing lines after the terminating "esac". If no patterns or strings are matched, no lines of code are executed; however, in this particular code snippet, at least one block of code will execute, because the "*" pattern will catch everything that didn't match "gz" or "bz2".
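
To experiment with the matching without any unpack commands, you can wrap the same pattern list in a small function. The classify function below is a hypothetical helper that only reports which branch would run. Notice that a name with no period at all is left unchanged by "${1##*.}", so it falls through to the "*" pattern.

```shell
# classify is a hypothetical helper that mirrors the snippet above but only
# reports which branch would run.
classify() {
    case "${1##*.}" in
        gz)
            echo "gzip"
            ;;
        bz2)
            echo "bzip2"
            ;;
        *)
            echo "not recognized"
            ;;
    esac
}

classify shorten.tar.gz    # prints: gzip
classify data.bz2          # prints: bzip2
classify README            # prints: not recognized
```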

Functions and namespaces

In bash, you can even define functions, similar to those in other procedural languages like Pascal and C. In bash, functions can even accept arguments, using a system very similar to the way scripts accept command-line arguments. Let's take a look at a sample function definition and then proceed from there:

tarview() {
    echo -n "Displaying contents of $1 "
    if [ "${1##*.}" = tar ]
    then
        echo "(uncompressed tar)"
        tar tvf "$1"
    elif [ "${1##*.}" = gz ]
    then
        echo "(gzip-compressed tar)"
        tar tzvf "$1"
    elif [ "${1##*.}" = bz2 ]
    then
        echo "(bzip2-compressed tar)"
        bzip2 -dc "$1" | tar tvf -
    fi
}
   Note

Another case: The above code could have been written using a "case" statement. Can you figure out how?

Above, we define a function called "tarview" that accepts one argument, a tarball of some kind. When the function is executed, it identifies what type of tarball the argument is (either uncompressed, gzip-compressed, or bzip2-compressed), prints out a one-line informative message, and then displays the contents of the tarball. This is how the above function should be called (whether from a script or from the command line, after it has been typed in, pasted in, or sourced):

$ tarview shorten.tar.gz
Displaying contents of shorten.tar.gz (gzip-compressed tar)
drwxr-xr-x ajr/abbot         0 1999-02-27 16:17 shorten-2.3a/
-rw-r--r-- ajr/abbot      1143 1997-09-04 04:06 shorten-2.3a/Makefile
-rw-r--r-- ajr/abbot      1199 1996-02-04 12:24 shorten-2.3a/INSTALL
-rw-r--r-- ajr/abbot       839 1996-05-29 00:19 shorten-2.3a/LICENSE
....

As you can see, arguments can be referenced inside the function definition by using the same mechanism used to reference command-line arguments. In addition, the "$#" macro will be expanded to contain the number of arguments. The only thing that may not work completely as expected is the variable "$0", which will either expand to the string "bash" (if you run the function from the shell, interactively) or to the name of the script the function is called from.
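
The following sketch shows that a function sees its own arguments, not the script's. The report function and the argument strings are made up for this example, and "set --" stands in for arguments passed on the command line.

```shell
# report is a hypothetical function used only for this sketch.
report() {
    echo "got $# argument(s), first is $1"
}

set -- script-arg-1 script-arg-2     # pretend the script received two arguments

inside=$(report only-one)            # the function gets its own "$1" and "$#"
outside=$#                           # the script's own count is untouched

echo "$inside"
echo "the script itself still has $outside arguments"
# prints:
# got 1 argument(s), first is only-one
# the script itself still has 2 arguments
```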

   Note

Use 'em interactively: Don't forget that functions, like the one above, can be placed in your ~/.bashrc or ~/.bash_profile so that they are available for use whenever you are in bash.

Namespace

Often, you'll need to create environment variables inside a function. While possible, there's a technicality you should know about. In most compiled languages (such as C), when you create a variable inside a function, it's placed in a separate local namespace. So, if you define a function in C called myfunction, and in it define a variable called "x", any global (outside the function) variable called "x" will not be affected by it, eliminating side effects.

While true in C, this isn't true in bash. In bash, whenever you create an environment variable inside a function, it's added to the global namespace. This means that it will overwrite any global variable outside the function, and will continue to exist even after the function exits:

#!/usr/bin/env bash

myvar="hello"

myfunc() {

    myvar="one two three"
    for x in $myvar
    do
        echo $x
    done
}

myfunc

echo $myvar $x

When this script is run, the function prints "one", "two", and "three" on separate lines, and the final echo then prints "one two three three". That last line shows how "$myvar" defined in the function clobbered the global variable "$myvar", and how the loop control variable "$x" continued to exist even after the function exited (and also would have clobbered any global "$x", if one were defined).

In this simple example, the bug is easy to spot and to compensate for by using alternate variable names. However, this isn't the right approach; the best way to solve this problem is to prevent the possibility of clobbering global variables in the first place, by using the "local" command. When we use "local" to create variables inside a function, they will be kept in the local namespace and not clobber any global variables. Here's how to implement the above code so that no global variables are overwritten:

#!/usr/bin/env bash

myvar="hello"

myfunc() {
    local x
    local myvar="one two three"
    for x in $myvar
    do
        echo $x
    done
}

myfunc

echo $myvar $x

When run, this script still prints "one", "two", and "three" from inside the function, but the final echo now prints just "hello": the global "$myvar" doesn't get overwritten, and "$x" doesn't continue to exist outside of myfunc. In the first line of the function, we create x, a local variable that is used later, while in the second line (local myvar="one two three") we create a local myvar and assign it a value. The first form is handy for keeping loop control variables local, since we're not allowed to say "for local x in $myvar". This function doesn't clobber any global variables, and you are encouraged to design all your functions this way. The only time you should not use "local" is when you explicitly want to modify a global variable.
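
For a compact side-by-side comparison, here is a sketch with two hypothetical functions, one that uses "local" and one that doesn't. The function names and the after_* variables are made up for this example.

```shell
myvar="hello"

# careful and clobber are hypothetical functions for this sketch.
careful() {
    local myvar="changed"    # stays inside the function
}

clobber() {
    myvar="changed"          # overwrites the global variable
}

careful
after_careful=$myvar
clobber
after_clobber=$myvar

echo "after careful: $after_careful, after clobber: $after_clobber"
# prints: after careful: hello, after clobber: changed
```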

Wrapping it up

Now that we've covered the most essential bash functionality, it's time to look at how to develop an entire application based in bash. In my next installment, we'll do just that. See you then!

Resources

   Tip

Read the next article in this series: Bash by Example, Part 3
