Difference between pages "Bash by Example, Part 3" and "Awk by Example, Part 1"

 
m (add highlighting ##i##)
 
 
{{Article
 
{{Article
 
|Author=Drobbins
 
|Author=Drobbins
|Previous in Series=Bash by Example, Part 2
+
|Next in Series=Awk by Example, Part 2
 
}}
 
}}
== Exploring the ebuild system ==
+
{{WikiArticle}}
  
=== Enter the ebuild system ===
+
== An intro to the great language with the strange name ==
I've really been looking forward to this third and final ''Bash by example'' article, because now that we've already covered bash programming fundamentals in [[Bash by example, Part1|Part 1]] and [[Bash by example, Part 2|Part 2]], we can focus on more advanced topics, like bash application development and program design. For this article, I will give you a good dose of practical, real-world bash development experience by presenting a project that I've spent many hours coding and refining: the Gentoo Linux ebuild system.
+
  
As the creator of Gentoo Linux and the guy behind Funtoo Linux, one of my primary responsibilities is to make sure that all of the operating system packages (similar to RPM packages) are created properly and work together. As you probably know, a standard Linux system is not composed of a single unified source tree (like BSD), but is actually made up of 25 or more core packages that work together. Some of the packages include:
+
=== In defense of awk ===
 +
In this series of articles, I'm going to turn you into a proficient awk coder. I'll admit, awk doesn't have a very pretty or particularly "hip" name, and the GNU version of awk, called gawk, sounds downright weird. Those unfamiliar with the language may hear "awk" and think of a mess of code so backwards and antiquated that it's capable of driving even the most knowledgeable UNIX guru to the brink of insanity (causing him to repeatedly yelp "kill -9!" as he runs for the coffee machine).
  
 +
Sure, awk doesn't have a great name. But it is a great language. Awk is geared toward text processing and report generation, yet offers many well-designed features that allow for serious programming. And, unlike some languages, awk's syntax is familiar, and borrows some of the best parts of languages like C, Python, and bash (although, technically, awk was created before both Python and bash). Awk is one of those languages that, once learned, will become a key part of your strategic coding arsenal.
  
{{TableStart}}
+
=== The first awk ===
<tr><td class="info">Package</td><td class="info">Description</td></tr>
+
Let's go ahead and start playing around with awk to see how it works. At the command line, enter the following command:
<tr><td>linux</td><td>The actual kernel</td></tr>
+
<tr><td>util-linux</td><td>A collection of miscellaneous Linux-related programs</td></tr>
+
<tr><td>e2fsprogs</td><td>A collection of ext2 filesystem-related utilities</td></tr>
+
<tr><td>glibc</td><td>The GNU C library</td></tr>
+
{{TableEnd}}
+
  
{{Note|Gentoo fans: the original text above used to say "I'm the chief architect of Gentoo Linux, a next-generation Linux OS currently in beta. One of my primary responsibilities is to make sure that all of the binary packages (similar to RPM packages) are created properly and work together." This is noteworthy due to the fact that the initial focus of Gentoo was to provide working binary packages.}}
+
<console>$##i## awk '{ print }' /etc/passwd</console>
  
Each package is in its own tarball and is maintained by separate independent developers, or teams of developers. To create a distribution, each package has to be separately downloaded, compiled, and packaged. Every time a package must be fixed, upgraded, or improved, the compilation and packaging steps must be repeated (and this gets old really fast). To help eliminate the repetitive steps involved in creating and updating packages, I created the ebuild system, written almost entirely in bash. To enhance your bash knowledge, I'll show you how I implemented the unpack and compile portions of the ebuild system, step by step. As I explain each step, I'll also discuss why certain design decisions were made. By the end of this article, not only will you have an excellent grasp of larger-scale bash programming projects, but you'll also have implemented a good portion of a complete auto-build system.
+
You should see the contents of your /etc/passwd file appear before your eyes. Now, for an explanation of what awk did. When we called awk, we specified /etc/passwd as our input file. When we executed awk, it evaluated the print command for each line in /etc/passwd, in order. All output is sent to stdout, and we get a result identical to catting /etc/passwd.
  
=== Why bash? ===
+
Now, for an explanation of the { print } code block. In awk, curly braces are used to group blocks of code together, similar to C. Inside our block of code, we have a single print command. In awk, when a print command appears by itself, the full contents of the current line are printed.
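
To try this without depending on what happens to be in your /etc/passwd, here is a minimal sketch that feeds awk two made-up lines on standard input. The sample text and the variable names are assumptions chosen for illustration only:

```shell
# Sample input invented for illustration; any text works.
sample='first line
second line'

# { print } runs once per input line and prints that line unchanged.
result=$(printf '%s\n' "$sample" | awk '{ print }')

printf '%s\n' "$result"
```

The output should be identical to the input, which is exactly what happens with /etc/passwd in the command above.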
Bash is an essential component of the Gentoo Linux ebuild system. It was chosen as ebuild's primary language for a number of reasons. First, it has an uncomplicated and familiar syntax that is especially well suited for calling external programs. An auto-build system is "glue code" that automates the calling of external programs, and bash is very well suited to this type of application. Second, Bash's support for functions allowed the ebuild system to have modular, easy-to-understand code. Third, the ebuild system takes advantage of bash's support for environment variables, allowing package maintainers and developers to configure it easily, on-the-fly.
+
  
=== Build process review ===
+
Here is another awk example that does exactly the same thing:
Before we look at the ebuild system, let's review what's involved in getting a package compiled and installed. For our example, we will look at the "sed" package, a standard GNU text stream editing utility that is part of all Linux distributions. First, download the source tarball ('''sed-3.02.tar.gz''') (see [[#Resources|Resources]]). We will store this archive in '''/usr/src/distfiles''', a directory we will refer to using the environment variable <span style="color:green">$DISTDIR</span>. <span style="color:green">$DISTDIR</span> is the directory where all of our original source tarballs live; it's a big vault of source code.
+
  
Our next step is to create a temporary directory called '''work''', which houses the uncompressed sources. We'll refer to this directory later using the <span style="color:green">$WORKDIR</span> environment variable. To do this, change to a directory where we have write permission and type the following:
+
<console>$##i## awk '{ print $0 }' /etc/passwd</console>
<source lang="bash">
+
$ mkdir work
+
$ cd work
+
$ tar xzf /usr/src/distfiles/sed-3.02.tar.gz
+
</source>
+
The tarball is then decompressed, creating a directory called '''sed-3.02''' that contains all of the sources. We'll refer to the '''sed-3.02''' directory later using the environment variable <span style="color:green">$SRCDIR</span>. To compile the program, type the following:
+
<source lang="bash">
+
$ cd sed-3.02
+
$ ./configure --prefix=/usr
+
(autoconf generates appropriate makefiles, this can take a while)
+
  
$ make
+
In awk, the $0 variable represents the entire current line, so print and print $0 do exactly the same thing. If you'd like, you can create an awk program that will output data totally unrelated to the input data. Here's an example:
  
(the package is compiled from sources, also takes a bit of time)
+
<console>$##i## awk '{ print "" }' /etc/passwd</console>
</source>
+
We're going to skip the "make install" step, since we are just covering the unpack and compile steps in this article. If we wanted to write a bash script to perform all these steps for us, it could look something like this:
+
<source lang="bash">
+
#!/usr/bin/env bash
+
  
if [ -d work ]
+
Whenever you pass the "" string to the print command, it prints a blank line. If you test this script, you'll find that awk outputs one blank line for every line in your /etc/passwd file. Again, this is because awk executes your script for every line in the input file. Here's another example:
then
+
# remove old work directory if it exists
+
      rm -rf work
+
fi
+
mkdir work
+
cd work
+
tar xzf /usr/src/distfiles/sed-3.02.tar.gz
+
cd sed-3.02
+
./configure --prefix=/usr
+
make
+
</source>
+
  
=== Generalizing the code ===
+
<console>$##i## awk '{ print "hiya" }' /etc/passwd</console>
Although this autocompile script works, it's not very flexible. Basically, the bash script just contains the listing of all the commands that were typed at the command line. While this solution works, it would be nice to make a generic script that can be configured quickly to unpack and compile any package just by changing a few lines. That way, it's much less work for the package maintainer to add new packages to the distribution. Let's take a first stab at doing this by using lots of different environment variables, making our build script more generic:
+
<source lang="bash">
+
#!/usr/bin/env bash
+
  
# P is the package name
+
Running this script will fill your screen with hiya's. :)
  
P=sed-3.02
+
=== Multiple fields ===
 +
Awk is really good at handling text that has been broken into multiple logical fields, and allows you to effortlessly reference each individual field from inside your awk script. The following script will print out a list of all user accounts on your system:
  
# A is the archive name
+
<console>$##i## awk -F":" '{ print $1 }' /etc/passwd</console>
  
A=${P}.tar.gz
+
Above, when we called awk, we used the -F option to specify ":" as the field separator. When awk processes the print $1 command, it will print out the first field that appears on each line in the input file. Here's another example:
  
export ORIGDIR=`pwd`
+
<console>$##i## awk -F":" '{ print $1 $3 }' /etc/passwd</console>
export WORKDIR=${ORIGDIR}/work
+
export SRCDIR=${WORKDIR}/${P}
+
  
if [ -z "$DISTDIR" ]
+
Here's an excerpt of the output from this script:
then
+
<pre>
# set DISTDIR to /usr/src/distfiles if not already set
+
halt7
        DISTDIR=/usr/src/distfiles
+
operator11
fi
+
root0
export DISTDIR
+
shutdown6
 +
sync5
 +
bin1
 +
....etc.
 +
</pre>
 +
As you can see, awk prints out the first and third fields of the /etc/passwd file, which happen to be the username and uid fields respectively. Now, while the script did work, it's not perfect -- there aren't any spaces between the two output fields! If you're used to programming in bash or python, you may have expected the print $1 $3 command to insert a space between the two fields. However, when two strings appear next to each other in an awk program, awk concatenates them without adding an intermediate space. The following command will insert a space between both fields:
  
if [ -d ${WORKDIR} ]
+
<console>$##i## awk -F":" '{ print $1 " " $3 }' /etc/passwd</console>
then   
+
# remove old work directory if it exists
+
        rm -rf ${WORKDIR}
+
fi
+
  
mkdir ${WORKDIR}
+
When you call print this way, it'll concatenate $1, " ", and $3, creating readable output. Of course, we can also insert some text labels if needed:
cd ${WORKDIR}
+
tar xzf ${DISTDIR}/${A}
+
cd ${SRCDIR}
+
./configure --prefix=/usr
+
make
+
</source>
+
We've added a lot of environment variables to the code, but it still does basically the same thing. However, now, to compile any standard GNU autoconf-based source tarball, we can simply copy this file to a new file (with an appropriate name to reflect the name of the new package it compiles), and then change the values of <span style="color:green">$A</span> and <span style="color:green">$P</span> to new values. All other environment variables automatically adjust to the correct settings, and the script works as expected. While this is handy, there's a further improvement that can be made to the code. This particular code is much longer than the original "transcript" script that we created. Since one of the goals for any programming project should be the reduction of complexity for the user, it would be nice to dramatically shrink the code, or at least organize it better. We can do this by performing a neat trick -- we'll split the code into two separate files. Save this file as '''sed-3.02.ebuild''':
+
<source lang="bash">
+
#the sed ebuild file -- very simple!
+
P=sed-3.02
+
A=${P}.tar.gz
+
</source>
+
Our first file is trivial, and contains only those environment variables that must be configured on a per-package basis. Here's the second file, which contains the brains of the operation. Save this one as "ebuild" and make it executable:
+
<source lang="bash">
+
#!/usr/bin/env bash
+
  
 +
<console>$##i## awk -F":" '{ print "username: " $1 "\t\tuid:" $3 }' /etc/passwd</console>
  
if [ $# -ne 1 ]
+
This will cause the output to be:
then
+
<pre>
        echo "one argument expected."
+
username: halt    uid:7
        exit 1
+
username: operator uid:11
fi
+
username: root    uid:0
 +
username: shutdown uid:6
 +
username: sync    uid:5
 +
username: bin      uid:1  
 +
....etc.
 +
</pre>
  
if [ -e "$1" ]
+
=== External Scripts ===
then
+
Passing your scripts to awk as a command line argument can be very handy for small one-liners, but when it comes to complex, multi-line programs, you'll definitely want to compose your script in an external file. Awk can then be told to source this script file by passing it the -f option:
        source $1
+
else
+
        echo "ebuild file $1 not found."
+
        exit 1
+
fi
+
  
export ORIGDIR=`pwd`
+
<console>$##i## awk -f myscript.awk myfile.in </console>
export WORKDIR=${ORIGDIR}/work
+
export SRCDIR=${WORKDIR}/${P}
+
  
if [ -z "$DISTDIR" ]
+
Putting your scripts in their own text files also allows you to take advantage of additional awk features. For example, this multi-line script does the same thing as one of our earlier one-liners, printing out the first field of each line in /etc/passwd:
then
+
<pre>
        # set DISTDIR to /usr/src/distfiles if not already set
+
BEGIN {
         DISTDIR=/usr/src/distfiles
+
         FS=":"
fi
+
}
export DISTDIR
+
{ print $1 }
 +
</pre>
 +
The difference between these two methods has to do with how we set the field separator. In this script, the field separator is specified within the code itself (by setting the FS variable), while our previous example set FS by passing the -F":" option to awk on the command line. It's generally best to set the field separator inside the script itself, simply because it means you have one less command line argument to remember to type. We'll cover the FS variable in more detail later in this article.
  
if [ -d ${WORKDIR} ]
+
It is also possible to make the script directly executable, by placing a "#!/usr/bin/awk -f" at the top of the file, as follows:
then   
+
<pre>
        # remove old work directory if it exists
+
#!/usr/bin/awk -f
        rm -rf ${WORKDIR}
+
BEGIN {
fi
+
FS=":"
 +
}
 +
{ print $1 }
 +
</pre>
 +
Next, the script must be made executable by setting the script file's execute bit:
  
mkdir ${WORKDIR}
+
<console>$##i## chmod +x myscript.awk</console>
cd ${WORKDIR}
+
tar xzf ${DISTDIR}/${A}
+
cd ${SRCDIR}
+
./configure --prefix=/usr
+
make
+
</source>
+
Now that we've split our build system into two files, I bet you're wondering how it works. Basically, to compile sed, type:
+
<source lang="bash">
+
$ ./ebuild sed-3.02.ebuild
+
</source>
+
When "ebuild" executes, it first tries to "source" variable <span style="color:green">$1</span>. What does this mean? From my previous article, recall that <span style="color:green">$1</span> is the first command line argument -- in this case, '''sed-3.02.ebuild'''. In bash, the "source" command reads in bash statements from a file, and executes them as if they appeared immediately in the file the "source" command is in. So, "source ${1}" causes the "ebuild" script to execute the commands in '''sed-3.02.ebuild''', which cause <span style="color:green">$P</span> and <span style="color:green">$A</span> to be defined. This design change is really handy, because if we want to compile another program instead of sed, we can simply create a new '''.ebuild''' file and pass it as an argument to our "ebuild" script. That way, the '''.ebuild''' files end up being really simple, while the complicated brains of the ebuild system get stored in one place -- our "ebuild" script. This way, we can upgrade or enhance the ebuild system simply by editing the "ebuild" script, keeping the implementation details outside of the ebuild files. Here's a sample ebuild file for <span style="color:green">gzip</span>:
+
<source lang="bash">
+
#another really simple ebuild script!
+
P=gzip-1.2.4a
+
A=${P}.tar.gz
+
</source>
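
The effect of sourcing an '''.ebuild''' file can be seen in isolation with a small sketch. It writes a throwaway copy of the gzip ebuild file into a temporary directory (the directory and the use of mktemp are assumptions for the demo, not part of the ebuild system) and then sources it. The sketch uses ".", which bash treats as a synonym for "source":

```shell
# Hypothetical throwaway directory so the sketch does not touch real files.
tmpdir=$(mktemp -d)

# A minimal .ebuild file that only sets the per-package variables.
cat > "$tmpdir/gzip-1.2.4a.ebuild" <<'EOF'
P=gzip-1.2.4a
A=${P}.tar.gz
EOF

# "." is a synonym for bash's "source": the statements in the file run
# as if they were typed right here, so P and A become defined.
. "$tmpdir/gzip-1.2.4a.ebuild"

echo "package: $P"
echo "archive: $A"

rm -rf "$tmpdir"
```

After the file is sourced, P and A are ordinary shell variables in the calling script, which is what lets "ebuild" derive its other settings from them.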
+
  
=== Adding functionality ===
+
Now, you should be able to execute the script as follows:
OK, we're making some progress. But, there is some additional functionality I'd like to add. I'd like the ebuild script to accept a second command-line argument, which will be <span style="color:green">compile</span>, <span style="color:green">unpack</span>, or <span style="color:green">all</span>. This second command-line argument tells the ebuild script which particular step of the build process to perform. That way, I can tell ebuild to unpack the archive, but not compile it (just in case I need to inspect the source archive before compilation begins). To do this, I'll add a case statement that will test variable <span style="color:green">$2</span>, and do different things based on its value. Here's what the code looks like now:
+
<source lang="bash">
+
#!/usr/bin/env bash
+
  
if [ $# -ne 2 ]
+
<console>$##i## ./myscript.awk myfile.in</console>
then
+
        echo "Please specify two args - .ebuild file and unpack, compile or all"
+
        exit 1
+
fi
+
  
 +
=== The BEGIN and END blocks ===
 +
Normally, awk executes each block of your script's code once for each input line. However, there are many programming situations where you may need to execute initialization code before awk begins processing the text from the input file. For such situations, awk allows you to define a BEGIN block. We used a BEGIN block in the previous example. Because the BEGIN block is evaluated before awk starts processing the input file, it's an excellent place to initialize the FS (field separator) variable, print a heading, or initialize other global variables that you'll reference later in the program.
  
if [ -z "$DISTDIR" ]
+
Awk also provides another special block, called the END block. Awk executes this block after all lines in the input file have been processed. Typically, the END block is used to perform final calculations or print summaries that should appear at the end of the output stream.
then
+
# set DISTDIR to /usr/src/distfiles if not already set
+
        DISTDIR=/usr/src/distfiles
+
fi
+
export DISTDIR
+
  
ebuild_unpack() {
+
=== Regular expressions and blocks ===
        #make sure we're in the right directory
+
Awk allows the use of regular expressions to selectively execute an individual block of code, depending on whether or not the regular expression matches the current line. Here's an example script that outputs only those lines that contain the character sequence foo:
        cd ${ORIGDIR}
+
       
+
        if [ -d ${WORKDIR} ]
+
        then   
+
                rm -rf ${WORKDIR}
+
        fi
+
  
        mkdir ${WORKDIR}
+
<pre>/foo/ { print }</pre>
        cd ${WORKDIR}
+
        if [ ! -e ${DISTDIR}/${A} ]
+
        then
+
            echo "${DISTDIR}/${A} does not exist.  Please download first."
+
            exit 1
+
        fi   
+
        tar xzf ${DISTDIR}/${A}
+
        echo "Unpacked ${DISTDIR}/${A}."
+
        #source is now correctly unpacked
+
}
+
  
 +
Of course, you can use more complicated regular expressions. Here's a script that will print only lines that contain a floating point number:
  
ebuild_compile() {
+
<pre>/[0-9]+\.[0-9]*/ { print }</pre>
       
+
        #make sure we're in the right directory
+
        cd ${SRCDIR}
+
        if [ ! -d "${SRCDIR}" ]
+
        then
+
                echo "${SRCDIR} does not exist -- please unpack first."
+
                exit 1
+
        fi
+
        ./configure --prefix=/usr
+
        make   
+
}
+
  
export ORIGDIR=`pwd`
+
=== Expressions and blocks ===
export WORKDIR=${ORIGDIR}/work
+
There are many other ways to selectively execute a block of code. We can place any kind of boolean expression before a code block to control when a particular block is executed. Awk will execute a code block only if the preceding boolean expression evaluates to true. The following example script will output the third field of all lines that have a first field equal to fred. If the first field of the current line is not equal to fred, awk will continue processing the file and will not execute the print statement for the current line:
  
if [ -e "$1" ]
+
<pre>$1 == "fred" { print $3 }</pre>
then
+
        source $1
+
else
+
        echo "Ebuild file $1 not found."
+
        exit 1
+
fi
+
  
export SRCDIR=${WORKDIR}/${P}
+
Awk offers a full selection of comparison operators, including the usual "==", "<", ">", "<=", ">=", and "!=". In addition, awk provides the "~" and "!~" operators, which mean "matches" and "does not match". They're used by specifying a variable on the left side of the operator, and a regular expression on the right side. Here's an example that will print only the third field on the line if the fifth field on the same line contains the character sequence root:
  
case "${2}" in
+
<pre>$5 ~ /root/ { print $3 }</pre>
        unpack)
+
                ebuild_unpack
+
                ;;
+
        compile)
+
                ebuild_compile
+
                ;;
+
        all)
+
                ebuild_unpack
+
                ebuild_compile
+
                ;;
+
        *)
+
                echo "Please specify unpack, compile or all as the second arg"
+
                exit 1
+
                ;;
+
esac
+
</source>
+
We've made a lot of changes, so let's review them. First, we placed the compile and unpack steps in their own functions, called <span style="color:green">ebuild_compile()</span> and <span style="color:green">ebuild_unpack()</span>, respectively. This is a good move, since the code is getting more complicated, and the new functions provide some modularity, which helps to keep things organized. On the first line in each function, I explicitly <span style="color:green">cd</span> into the directory I want to be in because, as our code is becoming more modular rather than linear, it's more likely that we might slip up and execute a function in the wrong current working directory. The <span style="color:green">cd</span> commands explicitly put us in the right place, and prevent us from making a mistake later -- an important step -- especially if you will be deleting files inside the functions.
+
  
Also, I added a useful check to the beginning of the <span style="color:green">ebuild_compile()</span> function. Now, it checks to make sure the <span style="color:green">$SRCDIR</span> exists, and, if not, it prints an error message telling the user to unpack the archive first, and then exits. If you like, you can change this behavior so that if <span style="color:green">$SRCDIR</span> doesn't exist, our ebuild script will unpack the source archive automatically. You can do this by replacing <span style="color:green">ebuild_compile()</span> with the following code:
+
=== Conditional statements ===
<source lang="bash">
+
Awk also offers very nice C-like if statements. If you'd like, you could rewrite the previous script using an if statement:
ebuild_compile() {
+
<pre>
        #make sure we're in the right directory
+
{  
        if [ ! -d "${SRCDIR}" ]
+
    if ( $5 ~ /root/ ) {  
         then
+
         print $3
                ebuild_unpack
+
     }
        fi
+
        cd ${SRCDIR}
+
        ./configure --prefix=/usr
+
        make      
+
 
}
 
}
</source>
+
</pre>
One of the most obvious changes in our second version of the ebuild script is the new case statement at the end of the code. This case statement simply checks the second command-line argument, and performs the correct action, depending on its value. If we now type:
+
Both scripts function identically. In the first example, the boolean expression is placed outside the block, while in the second example, the block is executed for every input line, and we selectively perform the print command by using an if statement. Both methods are available, and you can choose the one that best meshes with the other parts of your script.
<source lang="bash">
+
$ ebuild sed-3.02.ebuild
+
</source>
+
We'll actually get an error message. ebuild now wants to be told what to do, as follows:
+
<source lang="bash">
+
$ ebuild sed-3.02.ebuild unpack
+
</source>
+
or:
+
<source lang="bash">
+
$ ebuild sed-3.02.ebuild compile
+
</source>
+
or:
+
<source lang="bash">
+
$ ebuild sed-3.02.ebuild all
+
</source>
+
  
{{fancyimportant|If you provide a second command-line argument, other than those listed above, you get an error message (the * clause), and the program exits.}}
+
Here's a more complicated example of an awk if statement. As you can see, even with complex, nested conditionals, if statements look identical to their C counterparts:
 
+
<pre>
=== Modularizing the code ===
+
{
Now that the code is quite advanced and functional, you may be tempted to create several more ebuild scripts to unpack and compile your favorite programs. If you do, sooner or later you'll come across some sources that do not use autoconf (<span style="color:green">./configure</span>) or possibly others that have non-standard compilation processes. We need to make some more changes to the ebuild system to accommodate these programs. But before we do, it is a good idea to think a bit about how to accomplish this.
+
    if ( $1 == "foo" ) {
 
+
        if ( $2 == "foo" ) {
One of the great things about hard-coding <span style="color:green">./configure --prefix=/usr; make</span> into our compile stage is that, most of the time, it works. But, we must also have the ebuild system accommodate sources that do not use autoconf or normal Makefiles. To solve this problem, I propose that our ebuild script should, by default, do the following:
+
            print "uno"
 
+
         } else {
# If there is a configure script in <span style="color:green">${SRCDIR}</span>, execute it as follows: <span style="color:green">./configure --prefix=/usr</span>. Otherwise, skip this step.
+
            print "one"
# Execute the following command: make
+
         }
 
+
    } else if ($1 == "bar" ) {
Since ebuild only runs configure if it actually exists, we can now automatically accommodate those programs that don't use autoconf and have standard makefiles. But what if a simple "make" doesn't do the trick for some sources? We need a way to override our reasonable defaults with some specific code to handle these situations. To do this, we'll transform our <span style="color:green">ebuild_compile()</span> function into two functions. The first function, which can be looked at as a "parent" function, will still be called <span style="color:green">ebuild_compile()</span>. However, we'll have a new function, called <span style="color:green">user_compile()</span>, which contains only our reasonable default actions:
+
         print "two"
<source lang="bash">
+
    } else {
user_compile() {
+
         print "three"
         #we're already in ${SRCDIR}
+
    }
        if [ -e configure ]
+
         then
+
                #run configure script if it exists
+
                ./configure --prefix=/usr
+
        fi
+
        #run make
+
        make
+
}             
+
 
+
ebuild_compile() {
+
         if [ ! -d "${SRCDIR}" ]
+
         then
+
                echo "${SRCDIR} does not exist -- please unpack first."
+
                exit 1
+
        fi
+
        #make sure we're in the right directory
+
        cd ${SRCDIR}
+
        user_compile
+
 
}
 
}
</source>
+
</pre>
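
The nested conditional can be tried on a few sample records. This is only a sketch: the four input lines are invented so that each branch is reached once:

```shell
# Sample records invented for illustration, one per branch.
result=$(printf 'foo foo\nfoo baz\nbar x\nqux y\n' | awk '
{
    if ( $1 == "foo" ) {
        if ( $2 == "foo" ) {
            print "uno"
        } else {
            print "one"
        }
    } else if ( $1 == "bar" ) {
        print "two"
    } else {
        print "three"
    }
}')

printf '%s\n' "$result"
```

The four input lines produce uno, one, two, and three, in that order.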
It may not seem obvious why I'm doing this right now, but bear with me. While the code works almost identically to our previous version of ebuild, we can now do something that we couldn't do before -- we can override <span style="color:green">user_compile()</span> in '''sed-3.02.ebuild'''. So, if the default <span style="color:green">user_compile()</span> function doesn't meet our needs, we can define a new one in our '''.ebuild''' file that contains the commands required to compile the package. For example, here's an ebuild file for <span style="color:green">e2fsprogs-1.18</span>, which requires a slightly different <span style="color:green">./configure</span> line:
+
Using if statements, we can also transform this code:
<source lang="bash">
+
<pre>
#this ebuild file overrides the default user_compile()
+
! /matchme/ { print $1 $3 $4 }
P=e2fsprogs-1.18
+
</pre>
A=${P}.tar.gz
+
to this:
+
<pre>
user_compile() {
+
{
      ./configure --enable-elf-shlibs
+
    if ( $0 !~ /matchme/ ) {
      make
+
        print $1 $3 $4
 +
    }
 
}
 
}
</source>
+
</pre>
Now, <span style="color:green">e2fsprogs</span> will be compiled exactly the way we want it to be. But, for most packages, we can omit any custom <span style="color:green">user_compile()</span> function in the '''.ebuild''' file, and the default user_compile() function is used instead.
+
Both scripts will output only those lines that don't contain a matchme character sequence. Again, you can choose the method that works best for your code. They both do the same thing.
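
A quick way to convince yourself that the two forms agree is to run both on the same input and compare the results. The three sample lines below are assumptions made up for this sketch:

```shell
# Sample input invented for illustration; the middle line contains "matchme".
input='alpha beta gamma delta
one matchme three four
w x y z'

# Pattern form: the negated regular expression selects the lines.
pattern_form=$(printf '%s\n' "$input" | awk '! /matchme/ { print $1 $3 $4 }')

# if form: the block runs for every line and tests $0 itself.
if_form=$(printf '%s\n' "$input" | awk '{ if ( $0 !~ /matchme/ ) { print $1 $3 $4 } }')

printf '%s\n' "$pattern_form"
```

Both variables end up holding the same two lines, with the fields run together because print concatenates adjacent strings.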
  
How exactly does the ebuild script know which user_compile() function to use? This is actually quite simple. In the ebuild script, the default <span style="color:green">user_compile()</span> function is defined before the '''e2fsprogs-1.18.ebuild''' file is sourced. If there is a <span style="color:green">user_compile()</span> in '''e2fsprogs-1.18.ebuild''', it overwrites the default version defined previously. If not, the default <span style="color:green">user_compile()</span> function is used.
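
The override behaviour can be reproduced in a few lines. In this sketch the two functions only echo the commands they would run, and the temporary directory is an assumption for the demo; ".", which bash treats as a synonym for "source", reads the file:

```shell
# Default implementation, defined first (as the ebuild script does).
user_compile() {
    echo "default: ./configure --prefix=/usr && make"
}

# Hypothetical throwaway .ebuild file that supplies its own user_compile().
tmpdir=$(mktemp -d)
cat > "$tmpdir/e2fsprogs-1.18.ebuild" <<'EOF'
P=e2fsprogs-1.18
user_compile() {
    echo "custom: ./configure --enable-elf-shlibs && make"
}
EOF

# Sourcing the file redefines user_compile(), replacing the default.
. "$tmpdir/e2fsprogs-1.18.ebuild"

result=$(user_compile)
echo "$result"

rm -rf "$tmpdir"
```

Because a later function definition replaces an earlier one with the same name, the call prints the custom line. If the sourced file defined no user_compile(), the default would still be in place.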
+
Awk also allows the use of boolean operators "||" (for "logical or") and "&&" (for "logical and") to allow the creation of more complex boolean expressions:
 +
<pre>
 +
( $1 == "foo" ) && ( $2 == "bar" ) { print }
 +
</pre>
 +
This example will print only those lines where field one equals foo and field two equals bar.
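
To see how "&&" and "||" differ, the same condition can be run both ways over a small sample. The input lines here are invented for illustration:

```shell
# Sample input invented for illustration.
input='foo bar
foo baz
qux bar'

# "&&": both comparisons must be true.
both=$(printf '%s\n' "$input" | awk '( $1 == "foo" ) && ( $2 == "bar" ) { print }')

# "||": at least one comparison must be true.
either=$(printf '%s\n' "$input" | awk '( $1 == "foo" ) || ( $2 == "bar" ) { print }')

echo "and: $both"
printf 'or:\n%s\n' "$either"
```

With "&&" only the first line qualifies. With "||" all three lines do, since each has either foo in field one or bar in field two.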
  
This is great stuff; we've added a lot of flexibility without requiring any complex code if it's not needed. We won't cover it here, but you could also make similar modifications to <span style="color:green">ebuild_unpack()</span> so that users can override the default unpacking process. This could come in handy if any patching has to be done, or if the files are contained in multiple archives. It is also a good idea to modify our unpacking code so that it recognizes bzip2-compressed tarballs by default.
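
As a rough sketch of the bzip2 idea, the unpack step could pick tar's flags from the archive name. The helper name tar_flags_for is hypothetical and is not part of the ebuild script shown in this article. The sketch relies on GNU tar's z (gzip) and j (bzip2) options:

```shell
# Hypothetical helper: choose tar's extraction flags from the archive name.
tar_flags_for() {
    case "$1" in
        *.tar.gz|*.tgz)   echo "xzf" ;;      # gzip-compressed tarball
        *.tar.bz2|*.tbz2) echo "xjf" ;;      # bzip2-compressed tarball
        *)                echo "unknown" ;;  # anything else: let the caller decide
    esac
}

tar_flags_for sed-3.02.tar.gz
tar_flags_for example-1.0.tar.bz2
```

A modified ebuild_unpack() could then call something like tar with the returned flags, and bail out with an error when the result is "unknown".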
+
=== Numeric variables! ===
 +
So far, we've either printed strings, the entire line, or specific fields. However, awk also allows us to perform both integer and floating point math. Using mathematical expressions, it's very easy to write a script that counts the number of blank lines in a file. Here's one that does just that:
 +
<pre>
 +
BEGIN { x=0 }
 +
/^$/  { x=x+1 }
 +
END  { print "I found " x " blank lines. :)" }
 +
</pre>
 +
In the BEGIN block, we initialize our integer variable x to zero. Then, each time awk encounters a blank line, awk will execute the x=x+1 statement, incrementing x. After all the lines have been processed, the END block will execute, and awk will print out a final summary, specifying the number of blank lines it found.
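
Here is the same blank-line counter run against a small sample, so the BEGIN, pattern, and END blocks can be seen working together. The six input lines (three of them empty) are made up for this sketch:

```shell
# Sample input invented for illustration: lines 2, 4 and 5 are empty.
result=$(printf 'one\n\ntwo\n\n\nthree\n' | awk '
BEGIN { x=0 }
/^$/  { x=x+1 }
END   { print "I found " x " blank lines. :)" }')

echo "$result"
```

The END block runs once, after the last line has been read, so the summary appears a single time with a count of 3.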
  
=== Configuration files ===
+
=== Stringy variables ===
We've covered a lot of sneaky bash techniques so far, and now it's time to cover one more. Often, it's handy for a program to have a global configuration file that resides in '''/etc'''. Fortunately, this is easy to do using bash. Simply create the following file and save it as '''/etc/ebuild.conf''':
+
One of the neat things about awk variables is that they are "simple and stringy." I consider awk variables "stringy" because all awk variables are stored internally as strings. At the same time, awk variables are "simple" because you can perform mathematical operations on a variable, and as long as it contains a valid numeric string, awk automatically takes care of the string-to-number conversion steps. To see what I mean, check out this example:
<source lang="bash">
+
<pre>
# /etc/ebuild.conf: set system-wide ebuild options in this file
+
x="1.01"
 +
# We just set x to contain the *string* "1.01"
 +
x=x+1
 +
# We just added one to a *string*
 +
print x
 +
# Incidentally, these are comments :)
 +
</pre>
 +
Awk will output:
 +
<pre>
 +
2.01
 +
</pre>
 +
Interesting! Although we assigned the string value 1.01 to the variable x, we were still able to add one to it. We wouldn't be able to do this in bash or python. First of all, bash doesn't support floating point arithmetic. And, while bash has "stringy" variables, they aren't "simple"; to perform any mathematical operations, bash requires that we enclose our math in an ugly $(( )) construct. If we were using python, we would have to explicitly convert our 1.01 string to a floating point value before performing any arithmetic on it. While this isn't difficult, it's still an additional step. With awk, it's all automatic, and that makes our code nice and clean. If we wanted to square the first field in each input line and add one to it, we would use this script:
 +
<pre>
 +
{ print ($1^2)+1 }
 +
</pre>
 +
If you do a little experimenting, you'll find that if a particular variable doesn't contain a valid number, awk will treat that variable as a numerical zero when it evaluates your mathematical expression.
  
# MAKEOPTS are options passed to make
+
=== Lots of operators ===
MAKEOPTS="-j2"
+
Another nice thing about awk is its full complement of mathematical operators. In addition to standard addition, subtraction, multiplication, and division, awk allows us to use the previously demonstrated exponent operator "^", the modulo (remainder) operator "%", and a bunch of other handy assignment operators borrowed from C.
</source>
+
In this example, I've included just one configuration option, but you could include many more. One of the beautiful things about bash is that this file can be parsed by simply sourcing it. This is a design trick that works with most interpreted languages. After '''/etc/ebuild.conf''' is sourced, <span style="color:green">$MAKEOPTS</span> is defined inside our ebuild script. We'll use it to allow the user to pass options to make. Normally, this option would be used to allow the user to tell ebuild to do a parallel make. This is explained below.
+
  
{{fancynote|'''What is a parallel make?''' <nowiki>To speed compilation on multiprocessor systems, make supports compiling a program in parallel. This means that instead of compiling just one source file at a time, make compiles a user-specified number of source files simultaneously (so those extra processors in a multiprocessor system are used). Parallel makes are enabled by passing the -j # option to make, as follows: make -j4 MAKE="make -j4". This code instructs make to compile four programs simultaneously. The MAKE="make -j4" argument tells make to pass the -j4 option to any child make processes it launches.</nowiki>}}
+
These include pre- and post-increment/decrement ( i++, --foo ), add/sub/mult/div assign operators ( a+=3, b*=2, c/=2.2, d-=6.2 ). But that's not all -- we also get handy modulo/exponent assign ops as well ( a^=2, b%=4 ).
  
Here's the final version of our ebuild program:
+
=== Field separators ===
<source lang="bash">
+
Awk has its own complement of special variables. Some of them allow you to fine-tune how awk functions, while others can be read to glean valuable information about the input. We've already touched on one of these special variables, FS. As mentioned earlier, this variable allows you to set the character sequence that awk expects to find between fields. When we were using /etc/passwd as input, FS was set to ":". While this did the trick, FS allows us even more flexibility.
#!/usr/bin/env bash
+
  
if [ $# -ne 2 ]
+
The FS value is not limited to a single character; it can also be set to a regular expression, specifying a character pattern of any length. If you're processing fields separated by one or more tabs, you'll want to set FS like so:
then
+
<pre>
        echo "Please specify ebuild file and unpack, compile or all"
+
FS="\t+"
        exit 1
+
</pre>
fi
+
Above, we use the special "+" regular expression character, which means "one or more of the previous character".
  
source /etc/ebuild.conf
+
If your fields are separated by whitespace (one or more spaces or tabs), you may be tempted to set FS to the following regular expression:
 +
<pre>
 +
FS="[[:space:]]+"
 +
</pre>
 +
While this assignment will do the trick, it's not necessary. Why? Because by default, FS is set to a single space character, which awk interprets to mean "one or more spaces or tabs." In this particular example, the default FS setting was exactly what you wanted in the first place!
  
if [ -z "$DISTDIR" ]
+
Complex regular expressions are no problem. Even if your fields are separated by the word "foo" followed by three digits, the following regular expression will allow your data to be parsed properly:
then
+
<pre>
        # set DISTDIR to /usr/src/distfiles if not already set
+
FS="foo[0-9][0-9][0-9]"
        DISTDIR=/usr/src/distfiles
+
</pre>
fi
+
export DISTDIR
+
  
ebuild_unpack() {
+
=== Number of fields ===
        #make sure we're in the right directory
+
The next two variables we're going to cover are not normally intended to be written to, but are normally read and used to gain useful information about the input. The first is the NF variable, also called the "number of fields" variable. Awk will automatically set this variable to the number of fields in the current record. You can use the NF variable to display only certain input lines:
        cd ${ORIGDIR}
+
<pre>
       
+
NF == 3 { print "this particular record has three fields: " $0 }
        if [ -d ${WORKDIR} ]
+
</pre>
        then   
+
Of course, you can also use the NF variable in conditional statements, as follows:
                rm -rf ${WORKDIR}
+
<pre>
        fi
+
{
 
+
    if ( NF > 2 ) {
        mkdir ${WORKDIR}
+
         print $1 " " $2 ":" $3
        cd ${WORKDIR}
+
    }
        if [ ! -e ${DISTDIR}/${A} ]
+
         then
+
                echo "${DISTDIR}/${A} does not exist.  Please download first."
+
                exit 1
+
        fi
+
        tar xzf ${DISTDIR}/${A}
+
        echo "Unpacked ${DISTDIR}/${A}."
+
        #source is now correctly unpacked
+
 
}
 
}
 +
</pre>
  
user_compile() {
+
=== Record number ===
        #we're already in ${SRCDIR}
+
The record number (NR) is another handy variable. It will always contain the number of the current record (awk counts the first record as record number 1). Up until now, we've been dealing with input files that contain one record per line. For these situations, NR will also tell you the current line number. However, when we start to process multi-line records later in the series, this will no longer be the case, so be careful! NR can be used like the NF variable to print only certain lines of the input:
        if [ -e configure ]
+
<pre>
        then
+
(NR < 10 ) || (NR > 100) { print "We are on record number 1-9 or 101+" }
                #run configure script if it exists
+
</pre>
                ./configure --prefix=/usr
+
<pre>
        fi
+
{
        #run make
+
    #skip header
        make $MAKEOPTS MAKE="make $MAKEOPTS" 
+
    if ( NR > 10 ) {
}
+
         print "ok, now for the real information!"
 
+
    }
ebuild_compile() {
+
        if [ ! -d "${SRCDIR}" ]
+
        then
+
                echo "${SRCDIR} does not exist -- please unpack first."
+
                exit 1
+
        fi
+
         #make sure we're in the right directory
+
        cd ${SRCDIR}
+
        user_compile
+
 
}
 
}
 +
</pre>
 +
Awk provides additional variables that can be used for a variety of purposes. We'll cover more of these variables in later articles.
  
export ORIGDIR=`pwd`
+
We've come to the end of our initial exploration of awk. As the series continues, I'll demonstrate more advanced awk functionality, and we'll end the series with a real-world awk application.
export WORKDIR=${ORIGDIR}/work
+
 
+
if [ -e "$1" ]
+
then
+
        source $1
+
else
+
        echo "Ebuild file $1 not found."
+
        exit 1
+
fi
+
 
+
export SRCDIR=${WORKDIR}/${P}
+
 
+
case "${2}" in
+
        unpack)
+
                ebuild_unpack
+
                ;;
+
        compile)
+
                ebuild_compile
+
                ;;
+
        all)
+
                ebuild_unpack
+
                ebuild_compile
+
                ;;
+
        *)
+
                echo "Please specify unpack, compile or all as the second arg"
+
                exit 1
+
                ;;
+
esac
+
</source>
+
Notice '''/etc/ebuild.conf''' is sourced near the beginning of the file. Also, notice that we use <span style="color:green">$MAKEOPTS</span> in our default <span style="color:green">user_compile()</span> function. You may be wondering how this will work -- after all, we refer to <span style="color:green">$MAKEOPTS</span> before we source '''/etc/ebuild.conf''', which actually defines <span style="color:green">$MAKEOPTS</span> in the first place. Fortunately for us, this is OK because variable expansion only happens when <span style="color:green">user_compile()</span> is executed. By the time <span style="color:green">user_compile()</span> is executed, '''/etc/ebuild.conf''' has already been sourced, and <span style="color:green">$MAKEOPTS</span> is set to the correct value.
+
 
+
=== Wrapping it up ===
+
We've covered a lot of bash programming techniques in this article, but we've only touched the surface of the power of bash. For example, the production Gentoo Linux ebuild system not only automatically unpacks and compiles each package, but it can also:
+
 
+
* Automatically download the sources if they are not found in $DISTDIR
+
* Verify that the sources are not corrupted by using MD5 message digests
+
* If requested, install the compiled application into the live filesystem, recording all installed files so that the package can be easily uninstalled at a later date.
+
* If requested, package the compiled application in a tarball (compressed the way you like it) so that it can be installed later, on another computer, or during the CD-based installation process (if you are building a distribution CD)
+
 
+
In addition, the production ebuild system has several other global configuration options, allowing the user to specify options such as what optimization flags to use during compilation, and whether optional support for packages like GNOME and slang should be enabled by default in those packages that support it.
+
 
+
It's clear that bash can accomplish much more than what I've touched on in this series of articles. I hope you've learned a lot about this incredible tool, and are excited about using bash to speed up and enhance your development projects.
+
  
 
== Resources ==
 
== Resources ==
  
* Download the source tarball ('''sed-3.02.tar.gz''') from ftp://ftp.gnu.org/pub/gnu/sed.
+
* Read Daniel's other awk articles on Funtoo: Awk By Example, [[Awk by example, Part2 |Part 2]] and [[Awk by example, Part3 |Part 3]].
* Read [[Bash by example, Part1]].
+
* If you'd like a good old-fashioned book, [http://www.oreilly.com/catalog/sed2/ O'Reilly's sed & awk, 2nd Edition] is a wonderful choice.
* Read [[Bash by example, Part 2]].
+
* Be sure to check out the [http://www.faqs.org/faqs/computer-lang/awk/faq/ comp.lang.awk FAQ]. It also contains lots of additional awk links.
* Check out the [http://www.gnu.org/software/bash/manual/bash.html bash online reference manual].
+
* Patrick Hartigan's [http://sparky.rice.edu/~hartigan/awk.html awk tutorial] is packed with handy awk scripts.
 
+
* [http://www.tasoft.com/tawk.html Thompson's TAWK Compiler] compiles awk scripts into fast binary executables. Versions are available for Windows, OS/2, DOS, and UNIX.
__NOTOC__
+
* [http://www.gnu.org/software/gawk/manual/gawk.html The GNU Awk User's Guide] is available for online reference.
 +
* [http://www.folkstalk.com/2011/12/good-examples-of-awk-command-in-unix.html Awk Command] daily useful examples.
 
[[Category:Linux Core Concepts]]
 
[[Category:Linux Core Concepts]]
 
[[Category:Articles]]
 
[[Category:Articles]]
 
{{ArticleFooter}}
 
{{ArticleFooter}}

Revision as of 09:10, December 28, 2014

Support Funtoo and help us grow! Donate $15 per month and get a free SSD-based Funtoo Virtual Container.

Tip

This article is now wiki-enabled!

An intro to the great language with the strange name

In defense of awk

In this series of articles, I'm going to turn you into a proficient awk coder. I'll admit, awk doesn't have a very pretty or particularly "hip" name, and the GNU version of awk, called gawk, sounds downright weird. Those unfamiliar with the language may hear "awk" and think of a mess of code so backwards and antiquated that it's capable of driving even the most knowledgeable UNIX guru to the brink of insanity (causing him to repeatedly yelp "kill -9!" as he runs for the coffee machine).

Sure, awk doesn't have a great name. But it is a great language. Awk is geared toward text processing and report generation, yet it offers many well-designed features that allow for serious programming. And, unlike some languages, awk's syntax is familiar, and borrows some of the best parts of languages like C, python, and bash (although, technically, awk was created before both python and bash). Awk is one of those languages that, once learned, will become a key part of your strategic coding arsenal.

The first awk

Let's go ahead and start playing around with awk to see how it works. At the command line, enter the following command:

$ awk '{ print }' /etc/passwd

You should see the contents of your /etc/passwd file appear before your eyes. Now, for an explanation of what awk did. When we called awk, we specified /etc/passwd as our input file. When we executed awk, it evaluated the print command for each line in /etc/passwd, in order. All output is sent to stdout, and we get a result identical to catting /etc/passwd.

Now, for an explanation of the { print } code block. In awk, curly braces are used to group blocks of code together, similar to C. Inside our block of code, we have a single print command. In awk, when a print command appears by itself, the full contents of the current line are printed.

Here is another awk example that does exactly the same thing:

$ awk '{ print $0 }' /etc/passwd

In awk, the $0 variable represents the entire current line, so print and print $0 do exactly the same thing. If you'd like, you can create an awk program that will output data totally unrelated to the input data. Here's an example:

$ awk '{ print "" }' /etc/passwd

Whenever you pass the "" string to the print command, it prints a blank line. If you test this script, you'll find that awk outputs one blank line for every line in your /etc/passwd file. Again, this is because awk executes your script for every line in the input file. Here's another example:

$ awk '{ print "hiya" }' /etc/passwd

Running this script will fill your screen with hiya's. :)

Multiple fields

Awk is really good at handling text that has been broken into multiple logical fields, and allows you to effortlessly reference each individual field from inside your awk script. The following script will print out a list of all user accounts on your system:

$ awk -F":" '{ print $1 }' /etc/passwd

Above, when we called awk, we used the -F option to specify ":" as the field separator. When awk processes the print $1 command, it will print out the first field that appears on each line in the input file. Here's another example:

$ awk -F":" '{ print $1 $3 }' /etc/passwd

Here's an excerpt of the output from this script:

halt7 
operator11 
root0 
shutdown6 
sync5 
bin1 
....etc. 

As you can see, awk prints out the first and third fields of the /etc/passwd file, which happen to be the username and uid fields respectively. Now, while the script did work, it's not perfect -- there aren't any spaces between the two output fields! If you're used to programming in bash or python, you may have expected the print $1 $3 command to insert a space between the two fields. However, when two strings appear next to each other in an awk program, awk concatenates them without adding an intermediate space. The following command will insert a space between both fields:

$ awk -F":" '{ print $1 " " $3 }' /etc/passwd

When you call print this way, it'll concatenate $1, " ", and $3, creating readable output. Of course, we can also insert some text labels if needed:

$ awk -F":" '{ print "username: " $1 "\t\tuid:" $3 }' /etc/passwd

This will cause the output to be:

username: halt     uid:7 
username: operator uid:11 
username: root     uid:0 
username: shutdown uid:6 
username: sync     uid:5 
username: bin      uid:1 
....etc. 

External Scripts

Passing your scripts to awk as a command line argument can be very handy for small one-liners, but when it comes to complex, multi-line programs, you'll definitely want to compose your script in an external file. Awk can then be told to source this script file by passing it the -f option:

$ awk -f myscript.awk myfile.in 

Putting your scripts in their own text files also allows you to take advantage of additional awk features. For example, this multi-line script does the same thing as one of our earlier one-liners, printing out the first field of each line in /etc/passwd:

BEGIN { 
        FS=":" 
} 
{ print $1 } 

The difference between these two methods has to do with how we set the field separator. In this script, the field separator is specified within the code itself (by setting the FS variable), while our previous example set FS by passing the -F":" option to awk on the command line. It's generally best to set the field separator inside the script itself, simply because it means you have one less command line argument to remember to type. We'll cover the FS variable in more detail later in this article.

It is also possible to make the script directly executable, by placing a "#!/usr/bin/awk -f" at the top of the file, as follows:

#!/usr/bin/awk -f
BEGIN {
	FS=":"
}
{ print $1 }

Next, the script must be made executable by setting the script file's execute bit:

$ chmod +x myscript.awk

Now, you should be able to execute the script as follows:

$ ./myscript.awk myfile.in

The BEGIN and END blocks

Normally, awk executes each block of your script's code once for each input line. However, there are many programming situations where you may need to execute initialization code before awk begins processing the text from the input file. For such situations, awk allows you to define a BEGIN block. We used a BEGIN block in the previous example. Because the BEGIN block is evaluated before awk starts processing the input file, it's an excellent place to initialize the FS (field separator) variable, print a heading, or initialize other global variables that you'll reference later in the program.

Awk also provides another special block, called the END block. Awk executes this block after all lines in the input file have been processed. Typically, the END block is used to perform final calculations or print summaries that should appear at the end of the output stream.
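To see all three kinds of blocks working together, here's a minimal sketch that prints a heading, lists the first field of every line, and finishes with a count. The shell function name list_users, the counter n, and the sample input are made up for illustration; they aren't part of awk itself.

```shell
# list_users is a hypothetical helper name; it wraps one small awk program.
list_users() {
    awk '
    BEGIN { FS = ":"; n = 0; print "USER" }   # runs once, before any input is read
          { print $1; n = n + 1 }             # runs once for every input line
    END   { print n " users" }                # runs once, after the last line
    ' "$@"
}

# Feed it three made-up passwd-style lines instead of the real /etc/passwd.
printf 'root:0\nbin:1\nsync:5\n' | list_users
```

Because the function passes "$@" through to awk, you could also call it as list_users /etc/passwd.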

Regular expressions and blocks

Awk allows the use of regular expressions to selectively execute an individual block of code, depending on whether or not the regular expression matches the current line. Here's an example script that outputs only those lines that contain the character sequence foo:

/foo/ { print }

Of course, you can use more complicated regular expressions. Here's a script that will print only lines that contain a floating point number:

/[0-9]+\.[0-9]*/ { print }

Expressions and blocks

There are many other ways to selectively execute a block of code. We can place any kind of boolean expression before a code block to control when a particular block is executed. Awk will execute a code block only if the preceding boolean expression evaluates to true. The following example script will output the third field of all lines that have a first field equal to fred. If the first field of the current line is not equal to fred, awk will continue processing the file and will not execute the print statement for the current line:

$1 == "fred" { print $3 }

Awk offers a full selection of comparison operators, including the usual "==", "<", ">", "<=", ">=", and "!=". In addition, awk provides the "~" and "!~" operators, which mean "matches" and "does not match". They're used by specifying a variable on the left side of the operator, and a regular expression on the right side. Here's an example that will print only the third field on the line if the fifth field on the same line contains the character sequence root:

$5 ~ /root/ { print $3 }
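If you'd like to watch "~" and "!~" side by side, here's a small sketch. The three input lines and the variable name people are invented for this example; only the two operators come from awk.

```shell
# Made-up input: a name, a colon, then a free-form description.
people='alice:staff admin
bob:guest
carol:admin'

# "~" selects the lines whose second field matches the regular expression...
printf '%s\n' "$people" | awk -F":" '$2 ~ /admin/ { print $1 }'

# ...and "!~" selects the lines whose second field does not match it.
printf '%s\n' "$people" | awk -F":" '$2 !~ /admin/ { print $1 }'
```

The first command should list alice and carol; the second should list only bob.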

Conditional statements

Awk also offers very nice C-like if statements. If you'd like, you could rewrite the previous script using an if statement:

{ 
    if ( $5 ~ /root/ ) { 
        print $3 
    }
}

Both scripts function identically. In the first example, the boolean expression is placed outside the block, while in the second example, the block is executed for every input line, and we selectively perform the print command by using an if statement. Both methods are available, and you can choose the one that best meshes with the other parts of your script.

Here's a more complicated example of an awk if statement. As you can see, even with complex, nested conditionals, if statements look identical to their C counterparts:

{
    if ( $1 == "foo" ) {
        if ( $2 == "foo" ) {
            print "uno"
        } else {
            print "one"
        }
    } else if ($1 == "bar" ) {
        print "two"
    } else {
        print "three"
    }
}

Using if statements, we can also transform this code:

! /matchme/ { print $1 $3 $4 }

to this:

{
    if ( $0 !~ /matchme/ ) {
        print $1 $3 $4
    }
}

Both scripts will output only those lines that don't contain a matchme character sequence. Again, you can choose the method that works best for your code. They both do the same thing.

Awk also allows the use of the boolean operators "||" (for "logical or") and "&&" (for "logical and") to create more complex boolean expressions:

( $1 == "foo" ) && ( $2 == "bar" ) { print } 

This example will print only those lines where field one equals foo and field two equals bar.
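The "||" operator works the same way. As a quick sketch, here are both operators run against four invented input lines (the variable name pairs is just a placeholder):

```shell
# Made-up input covering every combination of the two tests.
pairs='foo bar
foo baz
qux bar
qux baz'

# "&&": both comparisons must be true, so only "foo bar" is printed.
printf '%s\n' "$pairs" | awk '( $1 == "foo" ) && ( $2 == "bar" ) { print }'

# "||": one true comparison is enough, so only "qux baz" is left out.
printf '%s\n' "$pairs" | awk '( $1 == "foo" ) || ( $2 == "bar" ) { print }'
```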

Numeric variables!

So far, we've either printed strings, the entire line, or specific fields. However, awk also allows us to perform both integer and floating point math. Using mathematical expressions, it's very easy to write a script that counts the number of blank lines in a file. Here's one that does just that:

BEGIN { x=0 } 
/^$/  { x=x+1 } 
END   { print "I found " x " blank lines. :)" } 

In the BEGIN block, we initialize our integer variable x to zero. Then, each time awk encounters a blank line, awk will execute the x=x+1 statement, incrementing x. After all the lines have been processed, the END block will execute, and awk will print out a final summary, specifying the number of blank lines it found.
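One thing worth trying for yourself: /^$/ only matches lines that are completely empty, so a line holding nothing but spaces isn't counted. The sketch below wraps the script in a shell function and adds a variation that treats whitespace-only lines as blank too. The function names count_blank and count_blankish, and the second regular expression, are assumptions added for this illustration.

```shell
# The script from above, wrapped in a hypothetical helper function.
count_blank() {
    awk '
    BEGIN { x = 0 }
    /^$/  { x = x + 1 }
    END   { print "I found " x " blank lines. :)" }
    ' "$@"
}

# A variation (not from the original script) that also counts lines
# made up entirely of spaces and tabs.
count_blankish() {
    awk '
    BEGIN       { x = 0 }
    /^[ \t]*$/  { x = x + 1 }
    END         { print "I found " x " blank lines. :)" }
    ' "$@"
}

# Four made-up lines: "one", an empty line, a line of two spaces, "two".
printf 'one\n\n  \ntwo\n' | count_blank      # counts 1
printf 'one\n\n  \ntwo\n' | count_blankish   # counts 2
```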

Stringy variables

One of the neat things about awk variables is that they are "simple and stringy." I consider awk variables "stringy" because all awk variables are stored internally as strings. At the same time, awk variables are "simple" because you can perform mathematical operations on a variable, and as long as it contains a valid numeric string, awk automatically takes care of the string-to-number conversion steps. To see what I mean, check out this example:

x="1.01" 
# We just set x to contain the *string* "1.01" 
x=x+1 
# We just added one to a *string* 
print x 
# Incidentally, these are comments :) 

Awk will output:

2.01

Interesting! Although we assigned the string value 1.01 to the variable x, we were still able to add one to it. We wouldn't be able to do this in bash or python. First of all, bash doesn't support floating point arithmetic. And, while bash has "stringy" variables, they aren't "simple"; to perform any mathematical operations, bash requires that we enclose our math in an ugly $(( )) construct. If we were using python, we would have to explicitly convert our 1.01 string to a floating point value before performing any arithmetic on it. While this isn't difficult, it's still an additional step. With awk, it's all automatic, and that makes our code nice and clean. If we wanted to square the first field in each input line and add one to it, we would use this script:

{ print ($1^2)+1 }

If you do a little experimenting, you'll find that if a particular variable doesn't contain a valid number, awk will treat that variable as a numerical zero when it evaluates your mathematical expression.
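Here's one way to run that experiment. This sketch feeds four invented values through the same calculation (written without the outer parentheses, which doesn't change the result). The last value shows a related detail: when a string merely starts with a number, awk uses that leading number.

```shell
# square_plus_one is a hypothetical helper name for this experiment.
square_plus_one() {
    awk '{ print $1^2 + 1 }' "$@"
}

# "3" and "2.5" are valid numbers, "abc" is not, and "4x" only starts with one.
printf '3\nabc\n2.5\n4x\n' | square_plus_one
```

The four results should be 10, 1, 7.25, and 17: "abc" counts as zero, and "4x" counts as 4.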

Lots of operators

Another nice thing about awk is its full complement of mathematical operators. In addition to standard addition, subtraction, multiplication, and division, awk allows us to use the previously demonstrated exponent operator "^", the modulo (remainder) operator "%", and a bunch of other handy assignment operators borrowed from C.

These include pre- and post-increment/decrement ( i++, --foo ), add/sub/mult/div assign operators ( a+=3, b*=2, c/=2.2, d-=6.2 ). But that's not all -- we also get handy modulo/exponent assign ops as well ( a^=2, b%=4 ).
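A quick way to get a feel for these operators is to try them in a BEGIN block, which needs no input file at all. The sketch below is only an illustration; the variable names and starting values are arbitrary, and ops_demo is a made-up function name.

```shell
# ops_demo is a hypothetical helper; everything happens inside BEGIN.
ops_demo() {
    awk 'BEGIN {
        a = 5;  a += 3;  print a      # add-assign:      8
        b = 4;  b *= 2;  print b      # multiply-assign: 8
        c = 11; c /= 2;  print c      # divide-assign:   5.5
        d = 10; d -= 6;  print d      # subtract-assign: 4
        e = 3;  e ^= 2;  print e      # exponent-assign: 9
        f = 14; f %= 4;  print f      # modulo-assign:   2
        i = 1;  print i++; print i    # post-increment prints 1, then i is 2
        j = 1;  print --j             # pre-decrement prints 0
    }'
}

ops_demo
```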

Field separators

Awk has its own complement of special variables. Some of them allow you to fine-tune how awk functions, while others can be read to glean valuable information about the input. We've already touched on one of these special variables, FS. As mentioned earlier, this variable allows you to set the character sequence that awk expects to find between fields. When we were using /etc/passwd as input, FS was set to ":". While this did the trick, FS allows us even more flexibility.

The FS value is not limited to a single character; it can also be set to a regular expression, specifying a character pattern of any length. If you're processing fields separated by one or more tabs, you'll want to set FS like so:

FS="\t+"

Above, we use the special "+" regular expression character, which means "one or more of the previous character".

If your fields are separated by whitespace (one or more spaces or tabs), you may be tempted to set FS to the following regular expression:

FS="[[:space:]]+"

While this assignment will do the trick, it's not necessary. Why? Because by default, FS is set to a single space character, which awk interprets to mean "one or more spaces or tabs" (leading and trailing whitespace on the line is ignored as well). In this particular example, the default FS setting was exactly what you wanted in the first place!
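There's one small difference between the two settings that you can check for yourself: the default FS skips whitespace at the start of a line, while an explicit regular expression treats it as a separator in front of an empty first field. In this sketch the regular expression is written as "[ \t]+" (spaces and tabs only), which is an assumption made so that it also runs on awk versions without character-class support; the input line is invented.

```shell
# A made-up line with leading spaces and a mix of spaces and tabs.
# Default FS: leading whitespace is skipped, so the second field is "beta".
printf '  alpha \tbeta   gamma\n' | awk '{ print $2 }'

# Explicit regular expression: the leading spaces separate an empty first
# field from "alpha", so the second field is now "alpha".
printf '  alpha \tbeta   gamma\n' | awk 'BEGIN { FS = "[ \t]+" } { print $2 }'
```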

Complex regular expressions are no problem. Even if your fields are separated by the word "foo" followed by three digits, the following regular expression will allow your data to be parsed properly:

FS="foo[0-9][0-9][0-9]"
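To try this separator out, here's a minimal sketch. The function name split_foo and the input line are invented for the example; the FS value is the one shown above.

```shell
# split_foo is a hypothetical helper that prints the first three fields,
# one per line, using "foo" plus three digits as the field separator.
split_foo() {
    awk 'BEGIN { FS = "foo[0-9][0-9][0-9]" } { print $1; print $2; print $3 }' "$@"
}

# One made-up line containing two separators: foo123 and foo007.
printf 'applefoo123bananafoo007cherry\n' | split_foo
```

This should print apple, banana, and cherry on separate lines.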

Number of fields

The next two variables we're going to cover are not normally written to; instead, they are read to gain useful information about the input. The first is the NF variable, also called the "number of fields" variable. Awk will automatically set this variable to the number of fields in the current record. You can use the NF variable to display only certain input lines:

NF == 3 { print "this particular record has three fields: " $0 }

Of course, you can also use the NF variable in conditional statements, as follows:

{
    if ( NF > 2 ) {
        print $1 " " $2 ":" $3 
    }
}
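If you'd like to see what NF holds for different kinds of lines, this sketch prints it for three invented lines (one of them empty) and then runs the conditional script from above on two of them:

```shell
# NF for a three-field line, an empty line, and a two-field line: 3, 0, 2.
printf 'a b c\n\nx y\n' | awk '{ print NF }'

# Only the first line has more than two fields, so only it is printed.
printf 'a b c\nx y\n' | awk '{ if ( NF > 2 ) { print $1 " " $2 ":" $3 } }'
```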

Record number

The record number (NR) is another handy variable. It will always contain the number of the current record (awk counts the first record as record number 1). Up until now, we've been dealing with input files that contain one record per line. For these situations, NR will also tell you the current line number. However, when we start to process multi-line records later in the series, this will no longer be the case, so be careful! NR can be used like the NF variable to print only certain lines of the input:

(NR < 10 ) || (NR > 100) { print "We are on record number 1-9 or 101+" }

{
    #skip header
    if ( NR > 10 ) {
        print "ok, now for the real information!"
    }
}
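For a smaller experiment, the sketch below numbers each input line with NR and then uses NR to skip a one-line header. The input lines are invented, and the one-line header is an assumption for this example (the script above skips ten lines instead).

```shell
# Prefix every line with its record number.
printf 'alpha\nbeta\ngamma\n' | awk '{ print NR ": " $0 }'

# Skip the first record (a made-up header line) and print the rest.
printf 'NAME\nalpha\nbeta\n' | awk 'NR > 1 { print }'
```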

Awk provides additional variables that can be used for a variety of purposes. We'll cover more of these variables in later articles.

We've come to the end of our initial exploration of awk. As the series continues, I'll demonstrate more advanced awk functionality, and we'll end the series with a real-world awk application.

Resources

Next >>>

Read the next article in this series: Awk by Example, Part 2


About the Author

Daniel Robbins is best known as the creator of Gentoo Linux and author of many IBM developerWorks articles about Linux. Daniel currently serves as Benevolent Dictator for Life (BDFL) of Funtoo Linux. Funtoo Linux is a Gentoo-based distribution and continuation of Daniel's original Gentoo vision.
