Awk by Example, Part 3

{{WikiArticle}}


== String functions and ... checkbooks? ==


=== Formatting output ===
While awk's print statement does do the job most of the time, sometimes more is needed. For those times, awk offers two good old friends called printf() and sprintf(). Yes, these functions, like so many other awk parts, are identical to their C counterparts. printf() will print a formatted string to stdout, while sprintf() returns a formatted string that can be assigned to a variable. If you're not familiar with printf() and sprintf(), an introductory C text will quickly get you up to speed on these two essential printing functions. You can view the printf() man page by typing "man 3 printf" on your Linux system.


Here's some sample awk sprintf() and printf() code. As you can see in the following script, everything looks almost identical to C.
<pre>
#!/usr/bin/awk -f
BEGIN {
    x=1
    b="foo"
    printf("%s got a %d on the last test\n","Jim",83)
    myout=sprintf("%s-%d",b,x)
    print myout
}
</pre>
This code will print:
<pre>
Jim got a 83 on the last test
foo-1
</pre>


=== String functions ===
Awk has a plethora of string functions, and that's a good thing. In awk, you really need string functions, since you can't treat a string as an array of characters as you can in other languages like C, C++, and Python. For example, if you execute the following code:
<pre>
mystring="How are you doing today?"
print mystring[3]
</pre>
You'll receive an error that looks something like this:
<pre>
awk: string.gawk:59: fatal: attempt to use scalar as array
</pre>
Oh, well. While not as convenient as Python's sequence types, awk's string functions get the job done. Let's take a look at them.


First, we have the basic length() function, which returns the length of a string. Here's how to use it:
<pre>
print length(mystring)
</pre>
This code will print the value:
<pre>
24
</pre>
OK, let's keep going. The next string function is called index(), which returns the position of the first occurrence of a substring in another string, or 0 if the substring isn't found. Using mystring, we can call it this way:
<pre>
print index(mystring,"you")
</pre>
Awk prints:
<pre>
9
</pre>
We move on to two more easy functions, tolower() and toupper(). As you might guess, these functions will return the string with all characters converted to lowercase or uppercase respectively. Notice that tolower() and toupper() return the new string, and don't modify the original. This code:
<pre>
print tolower(mystring)
print toupper(mystring)
print mystring
</pre>
....will produce this output:
<pre>
how are you doing today?
HOW ARE YOU DOING TODAY?
How are you doing today?
</pre>
So far so good, but how exactly do we select a substring or even a single character from a string? That's where substr() comes in. Here's how to call substr():
<pre>
mysub=substr(mystring,startpos,maxlen)
</pre>
mystring should be either a string variable or a literal string from which you'd like to extract a substring. startpos should be set to the starting character position, and maxlen should contain the maximum length of the string you'd like to extract. Notice that I said maximum length; if length(mystring) is shorter than startpos+maxlen, your result will be truncated. substr() won't modify the original string, but returns the substring instead. Here's an example:
<pre>
print substr(mystring,9,3)
</pre>
Awk will print:
<pre>
you
</pre>
If you regularly program in a language that uses array indices to access parts of a string (and who doesn't), make a mental note that substr() is your awk substitute. You'll need to use it to extract single characters and substrings; because awk is a string-based language, you'll be using it often.


Now, we move on to some meatier functions, the first of which is called match(). match() is a lot like index(), except instead of searching for a substring like index() does, it searches for a regular expression. The match() function will return the starting position of the match, or zero if no match is found. In addition, match() will set two variables called RSTART and RLENGTH. RSTART contains the return value (the location of the first match), and RLENGTH specifies its span in characters (or -1 if no match was found). Using RSTART, RLENGTH, substr(), and a small loop, you can easily iterate through every match in your string. Here's an example match() call:
<pre>
print match(mystring,/you/), RSTART, RLENGTH
</pre>
Awk will print:
<pre>
9 9 3
</pre>


=== String substitution ===
Now, we're going to look at a couple of string substitution functions, sub() and gsub(). These guys differ slightly from the functions we've looked at so far in that they actually modify the original string. Here's a template that shows how to call sub():
<pre>
sub(regexp,replstring,mystring)
</pre>
When you call sub(), it'll find the first sequence of characters in mystring that matches regexp, and it'll replace that sequence with replstring. sub() and gsub() have identical arguments; the only way they differ is that sub() will replace the first regexp match (if any), and gsub() will perform a global replace, swapping out all matches in the string. Here's an example sub() and gsub() call:
<pre>
sub(/o/,"O",mystring)
print mystring
mystring="How are you doing today?"
gsub(/o/,"O",mystring)
print mystring
</pre>
We had to reset mystring to its original value because the first sub() call modified mystring directly. When executed, this code will cause awk to output:
<pre>
HOw are you doing today?
HOw are yOu dOing tOday?
</pre>
Of course, more complex regular expressions are possible. I'll leave it up to you to test out some complicated regexps.


We wrap up our string function coverage by introducing you to a function called split(). split()'s job is to "chop up" a string and place the various parts into an integer-indexed array. Here's an example split() call:
<pre>
numelements=split("Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec",mymonths,",")
</pre>
When calling split(), the first argument contains the literal string or string variable to be chopped. In the second argument, you should specify the name of the array that split() will stuff the chopped parts into. In the third argument, specify the separator that will be used to chop the strings up. When split() returns, it'll return the number of string elements that were split. split() assigns each one to an array index starting with one, so the following code:
<pre>
print mymonths[1],mymonths[numelements]
</pre>
....will print:
<pre>
Jan Dec
</pre>


=== Special string forms ===
A quick note -- when calling length(), sub(), or gsub(), you can drop the last argument and awk will apply the function call to $0 (the entire current line). To print the length of each line in a file, use this awk script:
<pre>
{
    print length()  
}
</pre>


=== Financial fun ===
A few weeks ago, I decided to write my own checkbook balancing program in awk. I decided that I'd like to have a simple tab-delimited text file into which I can enter my most recent deposits and withdrawals. The idea was to hand this data to an awk script that would automatically add up all the amounts and tell me my balance. Here's how I decided to record all my transactions into my "ASCII checkbook":
<pre>
23 Aug 2000    food    -    -    Y    Jimmy's Buffet    30.25
</pre>
Every field in this file is separated by one or more tabs. After the date (field 1, $1), there are two fields called "expense category" and "income category". When I'm entering an expense like on the above line, I put a four-letter nickname in the exp field, and a "-" (blank entry) in the inc field. This signifies that this particular item is a "food expense" :) Here's what a deposit looks like:
<pre>
23 Aug 2000    -    inco    -    Y    Boss Man       2001.00
</pre>
In this case, I put a "-" (blank) in the exp category, and put "inco" in the inc category. "inco" is my nickname for generic (paycheck-style) income. Using category nicknames allows me to generate a breakdown of my income and expenditures by category. As far as the rest of the records, all the other fields are fairly self-explanatory. The cleared? field ("Y" or "N") records whether the transaction has been posted to my account; beyond that, there's a transaction description, and a positive dollar amount.


The algorithm used to compute the current balance isn't too hard. Awk simply needs to read in each line, one by one. If an expense category is listed but there is no income category (denoted by "-"), then this item is a debit. If an income category is listed, but no expense category (denoted by "-") is present, then the dollar amount is a credit. And, if there is both an expense and income category listed, then this amount is a "category transfer"; that is, the dollar amount will be subtracted from the expense category and added to the income category. Again, all these categories are virtual, but are very useful for tracking income and expenditures, as well as for budgeting.


=== The code ===
Time to look at the code. We'll start off with the first line, the BEGIN block and a function definition:
<pre>
#!/usr/bin/awk -f
BEGIN {
    FS="\t+"
    months="Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec"
}

function monthdigit(mymonth) {
    return (index(months,mymonth)+3)/4
}
</pre>
Adding the first "#!..." line to any awk script will allow it to be directly executed from the shell, provided that you "chmod +x myscript" first. The remaining lines define our BEGIN block, which gets executed before awk starts processing our checkbook file. We set FS (the field separator) to "\t+", which tells awk that the fields will be separated by one or more tabs. In addition, we define a string called months that's used by our monthdigit() function, which appears next.
 
The last three lines show you how to define your own awk function. The format is simple -- type "function", then the function name, and then the parameters separated by commas, inside parentheses. After this, a "{ }" code block contains the code that you'd like this function to execute. All functions can access global variables (like our months variable). In addition, awk provides a "return" statement that allows the function to return a value, and operates similarly to the "return" found in C, Python, and other languages. This particular function converts a month name in a 3-letter string format into its numeric equivalent. For example, this:
<pre>
print monthdigit("Mar")
</pre>
....will print this:
<pre>
3
</pre>
Now, let's move on to some more functions.
 
=== Financial functions ===
Here are three more functions that perform the bookkeeping for us. Our main code block, which we'll see soon, will process each line of the checkbook file sequentially, calling one of these functions so that the appropriate transactions are recorded in an awk array. There are three basic kinds of transactions, credit (doincome), debit (doexpense) and transfer (dotransfer). You'll notice that all three functions accept one argument, called mybalance. mybalance is a placeholder for a two-dimensional array, which we'll pass in as an argument. Up until now, we haven't dealt with two-dimensional arrays; however, as you can see below, the syntax is quite simple. Just separate each dimension with a comma, and you're in business.
 
We'll record information into "mybalance" as follows. The first dimension of the array ranges from 0 to 12, and specifies the month, or zero for the entire year. Our second dimension is a four-letter category, like "food" or "inco"; this is the actual category we're dealing with. So, to find the entire year's balance for the food category, you'd look in mybalance[0,"food"]. To find June's income, you'd look in mybalance[6,"inco"].
<pre>       
function doincome(mybalance) {
    mybalance[curmonth,$3] += amount
    mybalance[0,$3] += amount       
}
 
function doexpense(mybalance) {
    mybalance[curmonth,$2] -= amount
    mybalance[0,$2] -= amount       
}
 
function dotransfer(mybalance) {
    mybalance[0,$2] -= amount
    mybalance[curmonth,$2] -= amount
    mybalance[0,$3] += amount
    mybalance[curmonth,$3] += amount
}
</pre>
When doincome() or any of the other functions are called, we record the transaction in two places -- mybalance[0,category] and mybalance[curmonth, category], the entire year's category balance and the current month's category balance, respectively. This allows us to easily generate either an annual or monthly breakdown of income/expenditures later on.


If you look at these functions, you'll notice that the array referenced by mybalance is passed in by reference. In addition, we also refer to several global variables: curmonth, which holds the numeric value of the month of the current record, $2 (the expense category), $3 (the income category), and amount ($7, the dollar amount). When doincome() and friends are called, all these variables have already been set correctly for the current record (line) being processed.


=== The main block ===
Here's the main code block that contains the code that parses each line of input data. Remember, because we have set FS correctly, we can refer to the first field as $1, the second field as $2, etc. When doincome() and friends are called, the functions can access the current values of curmonth, $2, $3 and amount from inside the function. Take a look at the code and meet me on the other side for an explanation.
<pre>
{
    curmonth=monthdigit(substr($1,4,3))
    amount=$7

    #record all the categories encountered
    if ( $2 != "-" )
        globcat[$2]="yes"
    if ( $3 != "-" )
        globcat[$3]="yes"

    #tally up the transaction properly
    if ( $2 == "-" ) {
        if ( $3 == "-" ) {
            print "Error: inc and exp fields are both blank!"
            exit 1
        } else {
            #this is income
            doincome(balance)
            if ( $5 == "Y" )
                doincome(balance2)
        }
    } else if ( $3 == "-" ) {
        #this is an expense
        doexpense(balance)
        if ( $5 == "Y" )
            doexpense(balance2)
    } else {
        #this is a transfer
        dotransfer(balance)
        if ( $5 == "Y" )
            dotransfer(balance2)
    }
}
</pre>
In the main block, the first two lines set curmonth to an integer between 1 and 12, and set amount to field 7 (to make the code easier to understand). Then, we have four interesting lines, where we write values into an array called globcat. globcat, or the global categories array, is used to record all those categories encountered in the file -- "inco", "misc", "food", "util", etc. For example, if $2 == "inco", we set globcat["inco"] to "yes". Later on, we can iterate through our list of categories with a simple "for (x in globcat)" loop.


On the next twenty or so lines, we analyze fields $2 and $3, and record the transaction appropriately. If $2=="-" and $3!="-", we have some income, so we call doincome(). If the situation is reversed, we call doexpense(); and if both $2 and $3 contain categories, we call dotransfer(). Each time, we pass the "balance" array to these functions so that the appropriate data is recorded there.


You'll also notice several lines that say "if ( $5 == "Y" ), record that same transaction in balance2". What exactly are we doing here? You'll recall that $5 contains either a "Y" or a "N", and records whether the transaction has been posted to the account. Because we record the transaction to balance2 only if the transaction has been posted, balance2 will contain the actual account balance, while "balance" will contain all transactions, whether they have been posted or not. You can use balance2 to verify your data entry (since it should match with your current account balance according to your bank), and use "balance" to make sure that you don't overdraw your account (since it will take into account any checks you have written that have not yet been cashed).
 
=== Generating the report ===
After the main block repeatedly processes each input record, we now have a fairly comprehensive record of debits and credits broken down by category and by month. Now, all we need to do is define an END block that will generate a report, in this case a modest one:
<pre>
END {
    bal=0
    bal2=0
    for (x in globcat) {
        bal=bal+balance[0,x]
        bal2=bal2+balance2[0,x]
    }
    printf("Your available funds: %10.2f\n", bal)
    printf("Your account balance: %10.2f\n", bal2)
}
</pre>
This report prints out a summary that looks something like this:
<pre>
Your available funds:    1174.22
Your account balance:    2399.33
</pre>
In our END block, we used the "for (x in globcat)" construct to iterate through every category, tallying up a master balance based on all the transactions recorded. We actually tally up two balances, one for available funds, and another for the account balance. To execute the program and process your own financial goodies that you've entered into a file called '''mycheckbook.txt''', put all the above code into a text file called '''balance''' and do <span style="color:green;">"chmod +x balance"</span>, and then type <span style="color:green;">"./balance mycheckbook.txt"</span>. The balance script will then add up all your transactions and print out a two-line balance summary for you.


=== Upgrades ===
I use a more advanced version of this program to manage my personal and business finances. My version (which I couldn't include here due to space limitations) prints out a monthly breakdown of income and expenses, including annual totals, net income and a bunch of other stuff. Even better, it outputs the data in HTML format, so that I can view it in a Web browser :) If you find this program useful, I encourage you to add these features to this script. You won't need to configure it to record any additional information; all the information you need is already in balance and balance2. Just upgrade the END block, and you're in business!


I hope you've enjoyed this series. For more information on awk, check out the resources listed below.


== Resources ==
* Read Daniel's other awk articles on Funtoo: Awk By Example, [[Awk by example, Part1|Part 1]] and [[Awk by example, Part2|Part 2]].
* If you'd like a good old-fashioned book, [http://www.oreilly.com/catalog/sed2/ O'Reilly's sed & awk, 2nd Edition] is a wonderful choice.
* Be sure to check out the [http://www.faqs.org/faqs/computer-lang/awk/faq/ comp.lang.awk FAQ]. It also contains lots of additional awk links.
* Patrick Hartigan's [http://sparky.rice.edu/~hartigan/awk.html awk tutorial] is packed with handy awk scripts.
* [http://www.tasoft.com/tawk.html Thompson's TAWK Compiler] compiles awk scripts into fast binary executables. Versions are available for Windows, OS/2, DOS, and UNIX.
* [http://www.gnu.org/software/gawk/manual/gawk.html The GNU Awk User's Guide] is available for online reference.


__NOTOC__
[[Category:Linux Core Concepts]]
[[Category:Articles]]

Bash by Example, Part 1

Fundamental programming in the Bourne again shell (bash)

Introduction

You might wonder why you ought to learn Bash programming. Well, here are a couple of compelling reasons:

You're already running it

If you check, you'll probably find that you are running bash right now. Even if you changed your default shell, bash is probably still running somewhere on your system, because it's the standard Linux shell and is used for a variety of purposes. Because bash is already running, any additional bash scripts that you run are inherently memory-efficient because they share memory with any already-running bash processes. Why load a 500K interpreter if you already are running something that will do the job, and do it well?

You're already using it

Not only are you already running bash, but you're actually interacting with bash on a daily basis. It's always there, so it makes sense to learn how to use it to its fullest potential. Doing so will make your bash experience more fun and productive. But why should you learn bash programming? Easy, because you already think in terms of running commands, CPing files, and piping and redirecting output. Shouldn't you learn a language that allows you to use and build upon these powerful time-saving constructs you already know how to use? Command shells unlock the potential of a UNIX system, and bash is the Linux shell. It's the high-level glue between you and the machine. Grow in your knowledge of bash, and you'll automatically increase your productivity under Linux and UNIX -- it's that simple.

Bash confusion

Learning bash the wrong way can be a very confusing process. Many newbies type man bash to view the bash man page, only to be confronted with a very terse and technical description of shell functionality. Others type info bash (to view the GNU info documentation), causing either the man page to be redisplayed, or (if they are lucky) only slightly more friendly info documentation to appear.

While this may be somewhat disappointing to novices, the standard bash documentation can't be all things to all people, and caters towards those already familiar with shell programming in general. There's definitely a lot of excellent technical information in the man page, but its helpfulness to beginners is limited.

That's where this series comes in. In it, I'll show you how to actually use bash programming constructs, so that you will be able to write your own scripts. Instead of technical descriptions, I'll provide you with explanations in plain English, so that you will know not only what something does, but when you should actually use it. By the end of this three-part series, you'll be able to write your own intricate bash scripts, and be at the level where you can comfortably use bash and supplement your knowledge by reading (and understanding!) the standard bash documentation. Let's begin.

Environment variables

Under bash and almost all other shells, the user can define environment variables, which are stored internally as ASCII strings. One of the handiest things about environment variables is that they are a standard part of the UNIX process model. This means that environment variables are not exclusive to shell scripts; they can be used by standard compiled programs as well. When we "export" an environment variable under bash, any subsequent program that we run can read our setting, whether it is a shell script or not. A good example is the vipw command, which normally allows root to edit the system password file. By setting the EDITOR environment variable to the name of your favorite text editor, you can configure vipw to use it instead of vi, a handy thing if you are used to xemacs and really dislike vi.

The standard way to define an environment variable under bash is:

$ myvar='This is my environment variable!'

The above command defined an environment variable called "myvar" that contains the string "This is my environment variable!". There are several things to notice above: first, there is no space on either side of the "=" sign; any space will result in an error (try it and see). The second thing to notice is that while we could have done away with the quotes if we were defining a single word, they are necessary when the value of the environment variable is more than a single word (contains spaces or tabs).
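Here is a quick sketch of what happens when spaces surround the "=" (the exact wording of the error may vary on your system):

$ myvar = 'This is my environment variable!'
bash: myvar: command not found

Because of the spaces, bash treats "myvar" as a command to run rather than a variable to define.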

   Note

For extremely detailed information on how quotes should be used in bash, you may want to look at the "QUOTING" section in the bash man page. The existence of special character sequences that get "expanded" (replaced) with other values does complicate how strings are handled in bash. We will just cover the most often-used quoting functionality in this series.

Thirdly, while we can normally use double quotes instead of single quotes, doing so in the above example would have caused an error. Why? Because using single quotes disables a bash feature called expansion, where special characters and sequences of characters are replaced with values. For example, the "!" character is the history expansion character, which bash normally replaces with a previously-typed command. (We won't be covering history expansion in this series of articles, because it is not frequently used in bash programming. For more information on it, see the "HISTORY EXPANSION" section in the bash man page.) While this macro-like functionality can come in handy, right now we want a literal exclamation point at the end of our environment variable, rather than a macro.
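As a quick illustration, double quotes work fine when the value contains nothing that bash wants to expand; the trailing "!" is intentionally left off in this small example (myvar2 is just a new example variable):

$ myvar2="This is my second environment variable"
$ echo $myvar2
This is my second environment variable

Inside double quotes, variable expansion still happens, so a $ inside the value would be expanded; inside single quotes, it would not.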

Now, let's take a look at how one actually uses environment variables. Here's an example:

$ echo $myvar
This is my environment variable!

By preceding the name of our environment variable with a $, we can cause bash to replace it with the value of myvar. In bash terminology, this is called "variable expansion". But, what if we try the following:

$ echo foo$myvarbar
foo

We wanted this to echo "fooThis is my environment variable!bar", but it didn't work. What went wrong? In a nutshell, bash's variable expansion facility got confused. It couldn't tell whether we wanted to expand the variable $m, $my, $myvar, $myvarbar, etc. How can we be more explicit and clearly tell bash what variable we are referring to? Try this:

$ echo foo${myvar}bar
fooThis is my environment variable!bar

As you can see, we can enclose the environment variable name in curly braces when it is not clearly separated from the surrounding text. While $myvar is faster to type and will work most of the time, ${myvar} can be parsed correctly in almost any situation. Other than that, they both do the same thing, and you will see both forms of variable expansion in the rest of this series. You'll want to remember to use the more explicit curly-brace form when your environment variable is not isolated from the surrounding text by whitespace (spaces or tabs).

Recall that we also mentioned that we can "export" variables. When we export an environment variable, it's automatically available in the environment of any subsequently-run script or executable. Shell scripts can "get to" the environment variable using that shell's built-in environment-variable support, while C programs can use the getenv() function call. Here's some example C code that you should type in and compile -- it'll allow us to understand environment variables from the perspective of C:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
  char *myenvvar=getenv("EDITOR");
  printf("The editor environment variable is set to %s\n",myenvvar);
}

Save the above source into a file called myenv.c, and then compile it by issuing the command:

$ gcc myenv.c -o myenv

Now, there will be an executable program in your directory that, when run, will print the value of the EDITOR environment variable, if any. This is what happens when I run it on my machine:

$ ./myenv
The editor environment variable is set to (null)

Hmmm... because the EDITOR environment variable was not set to anything, the C program gets a null string. Let's try setting it to a specific value:

$ EDITOR=xemacs
$ ./myenv
The editor environment variable is set to (null)

While you might have expected myenv to print the value "xemacs", it didn't quite work, because we didn't export the EDITOR environment variable. This time, we'll get it working:

$ export EDITOR
$ ./myenv
The editor environment variable is set to xemacs

So, you have seen with your very own eyes that another process (in this case our example C program) cannot see the environment variable until it is exported. Incidentally, if you want, you can define and export an environment variable using one line, as follows:

$ export EDITOR=xemacs

It works identically to the two-line version. This would be a good time to show how to erase an environment variable by using unset:

$ unset EDITOR
$ ./myenv
The editor environment variable is set to (null)

Chopping strings overview

Chopping strings -- that is, splitting an original string into smaller, separate chunk(s) -- is one of those tasks that is performed daily by your average shell script. Many times, shell scripts need to take a fully-qualified path, and find the terminating file or directory. While it's possible (and fun!) to code this in bash, the standard basename UNIX executable performs this extremely well:

$ basename /usr/local/share/doc/foo/foo.txt
foo.txt
$ basename /usr/home/drobbins
drobbins

basename is quite a handy tool for chopping up strings. Its companion, dirname, returns the "other" part of the path that basename throws away:

$ dirname /usr/local/share/doc/foo/foo.txt
/usr/local/share/doc/foo
$ dirname /usr/home/drobbins/
/usr/home
   Note

Both dirname and basename do not look at any files or directories on disk; they are purely string manipulation commands.
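For example, both commands will happily chop up a path that doesn't exist anywhere on disk (the path below is made up):

$ basename /no/such/path/report.txt
report.txt
$ dirname /no/such/path/report.txt
/no/such/path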

Command substitution

One very handy thing to know is how to create an environment variable that contains the result of an executable command. This is very easy to do:

$ MYDIR=$(dirname /usr/local/share/doc/foo/foo.txt)
$ echo $MYDIR
/usr/local/share/doc/foo

What we did above is called command substitution. Several things are worth noticing in this example. On the first line, we simply enclosed the command we wanted to execute with $( ).

Note that it is also possible to do the same thing using backquotes, the keyboard key that normally sits above the Tab key:

$ MYDIR=`dirname /usr/local/share/doc/foo/foo.txt`
$ echo $MYDIR
/usr/local/share/doc/foo

As you can see, bash provides multiple ways to perform exactly the same thing. Using command substitution, we can place any command or pipeline of commands in between ` ` or $( ) and assign it to an environment variable. Handy stuff! Here's an example of how to use a pipeline with command substitution:

$ MYFILES=$(ls /etc | grep pa)
$ echo $MYFILES
pam.d passwd

It's also worth pointing out that $( ) is generally preferred over ` ` in shell scripts because it is more universally supported across different shells, is easier to type and read, and is less complicated to use in a nested form, as follows:

$ MYFILES=$(ls $(dirname foo/bar/oni))

Chopping strings like a pro

While basename and dirname are great tools, there are times where we may need to perform more advanced string "chopping" operations than just standard pathname manipulations. When we need more punch, we can take advantage of bash's advanced built-in variable expansion functionality. We've already used the standard kind of variable expansion, which looks like this: ${MYVAR}. But bash can also perform some handy string chopping on its own. Take a look at these examples:

$ MYVAR=foodforthought.jpg
$ echo ${MYVAR##*fo}
rthought.jpg
$ echo ${MYVAR#*fo}
odforthought.jpg

In the first example, we typed ${MYVAR##*fo}. What exactly does this mean? Basically, inside the ${ }, we typed the name of the environment variable, two ##s, and a wildcard ("*fo"). Then, bash took MYVAR, found the longest substring from the beginning of the string "foodforthought.jpg" that matched the wildcard "*fo", and chopped it off the beginning of the string. That's a bit hard to grasp at first, so to get a feel for how this special "##" option works, let's step through how bash completed this expansion. First, it began searching for substrings at the beginning of "foodforthought.jpg" that matched the "*fo" wildcard. Here are the substrings that it checked:

f       
fo              MATCHES *fo
foo     
food
foodf           
foodfo          MATCHES *fo
foodfor
foodfort        
foodforth
foodfortho      
foodforthou
foodforthoug
foodforthought
foodforthought.j
foodforthought.jp
foodforthought.jpg

After searching the string for matches, you can see that bash found two. It selects the longest match, removes it from the beginning of the original string, and returns the result.

The second form of variable expansion shown above appears identical to the first, except it uses only one "#" -- and bash performs an almost identical process. It checks the same set of substrings as our first example did, except that bash removes the shortest match from our original string, and returns the result. So, as soon as it checks the "fo" substring, it removes "fo" from our string and returns "odforthought.jpg".
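The difference between "#" and "##" is easier to see with a value that contains more than one "." character; MYARCHIVE below is just a made-up example variable:

$ MYARCHIVE=backup.tar.gz
$ echo ${MYARCHIVE##*.}
gz
$ echo ${MYARCHIVE#*.}
tar.gz

Here "##*." removes the longest match ("backup.tar."), while "#*." removes the shortest match ("backup.").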

This may seem extremely cryptic, so I'll show you an easy way to remember this functionality. When searching for the longest match, use ## (because ## is longer than #). When searching for the shortest match, use #. See, not that hard to remember at all! Wait, how do you remember that we are supposed to use the '#' character to remove from the *beginning* of a string? Simple! You will notice that on a US keyboard, shift-4 is "$", which is the bash variable expansion character. On the keyboard, immediately to the left of "$" is "#". So, you can see that "#" is "at the beginning" of "$", and thus (according to our mnemonic), "#" removes characters from the beginning of the string. You may wonder how we remove characters from the end of the string. If you guessed that we use the character immediately to the right of "$" on the US keyboard ("%"), you're right! Here are some quick examples of how to chop off trailing portions of strings:

$ MYFOO="chickensoup.tar.gz"
$ echo ${MYFOO%%.*}
chickensoup
$ echo ${MYFOO%.*}
chickensoup.tar

As you can see, the % and %% variable expansion options work identically to # and ##, except they remove the matching wildcard from the end of the string. Note that you don't have to use the "*" character if you wish to remove a specific substring from the end:

MYFOOD="chickensoup"
$ echo ${MYFOOD%%soup}
chicken

In this example, it doesn't matter whether we use "%%" or "%", since only one match is possible. And remember, if you forget whether to use "#" or "%", look at the 3, 4, and 5 keys on your keyboard and figure it out.

We can use another form of variable expansion to select a specific substring, based on a specific character offset and length. Try typing in the following lines under bash:

$ EXCLAIM=cowabunga
$ echo ${EXCLAIM:0:3}
cow
$ echo ${EXCLAIM:3:7}
abunga

This form of string chopping can come in quite handy; simply specify the character to start from and the length of the substring, all separated by colons.
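If you leave off the length entirely, bash returns everything from the offset to the end of the string (continuing the cowabunga example above):

$ echo ${EXCLAIM:4}
bunga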

Applying string chopping

Now that we've learned all about chopping strings, let's write a simple little shell script. Our script will accept a single file as an argument, and will print out whether it appears to be a tarball. To determine if it is a tarball, it will look for the pattern ".tar" at the end of the file. Here it is:

#!/bin/bash

if [ "${1##*.}" = "tar" ]
then
       echo This appears to be a tarball.
else
       echo At first glance, this does not appear to be a tarball.
fi

To run this script, enter it into a file called mytar.sh, and type chmod 755 mytar.sh to make it executable. Then, give it a try on a tarball, as follows:

$ ./mytar.sh thisfile.tar
This appears to be a tarball.
$ ./mytar.sh thatfile.gz
At first glance, this does not appear to be a tarball.

OK, it works, but it's not very functional. Before we make it more useful, let's take a look at the "if" statement used above. In it, we have a boolean expression. In bash, the "=" comparison operator checks for string equality. In bash, all boolean expressions are enclosed in square brackets. But what does the boolean expression actually test for? Let's take a look at the left side. According to what we've learned about string chopping, "${1##*.}" will remove the longest match of "*." from the beginning of the string contained in the environment variable "1", returning the result. This will cause everything after the last "." in the filename to be returned. Obviously, if the filename ends in ".tar", we will get "tar" as a result, and the condition will be true.
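One consequence worth noting: if the filename contains no "." at all, the "*." pattern matches nothing, so nothing is chopped off and the whole filename is returned. The test then evaluates to false, which is what we want (the filename below is arbitrary):

$ ./mytar.sh somefile
At first glance, this does not appear to be a tarball.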

You may be wondering what the "1" environment variable is in the first place. Very simple -- $1 is the first command-line argument to the script, $2 is the second, etc. OK, now that we've reviewed the function, we can take our first look at "if" statements.
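To see the positional parameters in action, you could drop the following two-line script into a file -- call it showargs.sh (an arbitrary name), make it executable with chmod 755 showargs.sh, and pass it a couple of arguments:

#!/bin/bash
echo The first argument is: $1
echo The second argument is: $2

$ ./showargs.sh hello world
The first argument is: hello
The second argument is: world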

If statements

Like most languages, bash has its own form of conditional. When using them, stick to the format above; that is, keep the "if" and the "then" on separate lines, and keep the "else" and the terminating and required "fi" in horizontal alignment with them. This makes the code easier to read and debug. In addition to the "if,else" form, there are several other forms of "if" statements:

if      [ condition ]
then
        action
fi

This one performs an action only if condition is true, otherwise it performs no action and continues executing any lines following the "fi".

if [ condition ]
then 
        action
elif [ condition2 ]
then
        action2
.
.
.
elif [ condition3 ]
then
        action3
else
        actionx
fi

The above "elif" form will consecutively test each condition and execute the action corresponding to the first true condition. If none of the conditions are true, it will execute the "else" action, if one is present, and then continue executing lines following the entire "if,elif,else" statement.
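As a small illustration of the "elif" form, here is one way you might extend the earlier mytar.sh script so that it also recognizes ".gz" files (this extension is just an example, not part of the original script):

#!/bin/bash

if [ "${1##*.}" = "tar" ]
then
        echo This appears to be a tarball.
elif [ "${1##*.}" = "gz" ]
then
        echo This appears to be gzip-compressed.
else
        echo At first glance, this does not appear to be a tarball.
fi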

Next time

Now that we've covered the most basic bash functionality, it's time to pick up the pace and get ready to write some real scripts. In the next article, I'll cover looping constructs, functions, namespace, and other essential topics. Then, we'll be ready to write some more complicated scripts. In the third article, we'll focus almost exclusively on very complex scripts and functions, as well as several bash script design options. See you then!

Resources

* Read Bash by Example, Part 2
* Read Bash by Example, Part 3
* Visit GNU's bash home page: http://www.gnu.org/software/bash/bash.html