Difference between pages "Awk by Example, Part 1" and "The Gentoo.org Redesign, Part 1"

 
(Created page with "{{Article |Subtitle=A site reborn |Summary=Have you ever woken up one morning and suddenly realized that your cute little personal development Web site isn't really that great...")
 
Line 1: Line 1:
 
{{Article
 
{{Article
 +
|Subtitle=A site reborn
 +
|Summary=Have you ever woken up one morning and suddenly realized that your cute little personal development Web site isn't really that great? If so, you're in good company. In this series, Daniel Robbins shares his experiences as he redesigns the www.gentoo.org Web site using technologies like XML, XSLT, and Python. Along the way, you may find some excellent approaches to use for your next Web site redesign. In this article, Daniel creates a user-centric action plan and introduces pytext, an embedded Python interpreter.
 
|Author=Drobbins
 
|Author=Drobbins
|Next in Series=Awk by Example, Part 2
 
 
}}
 
}}
 +
==  An unruly horde ==
  
== An intro to the great language with the strange name ==
+
Fellow software developer, may I ask you a question? Why is it that although many of us are intimately familiar with Web technologies such as HTML, CGI, Perl, Python, Java technology, and XML, our very own Web sites -- the ones devoted to our precious development projects -- look like they were thrown together by an unruly horde of hyperactive 12-year-olds? Why, oh why, is this so?
  
=== In defense of awk ===
+
Could it be because most of the time, we've left our Web site out to rot while we squander our precious time hacking away on our free software projects? The answer, at least in my case, is a most definite "Yes."
In this series of articles, I'm going to turn you into a proficient awk coder. I'll admit, awk doesn't have a very pretty or particularly "hip" name, and the GNU version of awk, called gawk, sounds downright weird. Those unfamiliar with the language may hear "awk" and think of a mess of code so backwards and antiquated that it's capable of driving even the most knowledgeable UNIX guru to the brink of insanity (causing him to repeatedly yelp "kill -9!" as he runs for the coffee machine).
+
  
Sure, awk doesn't have a great name. But it is a great language. Awk is geared toward text processing and report generation, yet offers many well-designed features that allow for serious programming. And, unlike some languages, awk's syntax is familiar, and borrows some of the best parts of languages like C, python, and bash (although, technically, awk was created before both python and bash). Awk is one of those languages that, once learned, will become a key part of your strategic coding arsenal.
+
When I'm not writing articles for IBM developerWorks or being a new dad, I'm feverishly working on the next release of Gentoo Linux, along with my skilled team of volunteers. And, yes, Gentoo Linux has its own Web site (see Resources). As of right now (March 2001), our Web site isn't that special; we don't spend much time working on it because we're generally engrossed in improving Gentoo Linux itself. Sure, our site does have several admittedly cute logos that I whipped up using Xara X (see Resources), but when you look past the eye candy, our site leaves a lot to be desired. Maybe yours does too. If so, I have one thing to say to you -- welcome to the club.
  
=== The first awk ===
+
== www.gentoo.org ==
Let's go ahead and start playing around with awk to see how it works. At the command line, enter the following command:
+
  
<console>$##i## awk '{ print }' /etc/passwd</console>
+
In our case, our Web site dilemma exists because our project has been growing, and our Web site hasn't. Now that Gentoo Linux is approaching the 1.0 release (when it'll be officially ready for non-developers) and is growing in popularity, we need to start seriously looking at how our Web site can better serve its users. Here's a snapshot of www.gentoo.org:
  
You should see the contents of your /etc/passwd file appear before your eyes. Now, for an explanation of what awk did. When we called awk, we specified /etc/passwd as our input file. When we executed awk, it evaluated the print command for each line in /etc/passwd, in order. All output is sent to stdout, and we get a result identical to catting /etc/passwd.
+
<div style="margin: 10px;">[[File:L-redesign-01.gif|frame|class=img-responsive|The current (March 2001) state of affairs at www.gentoo.org]]</div>
  
Now, for an explanation of the { print } code block. In awk, curly braces are used to group blocks of code together, similar to C. Inside our block of code, we have a single print command. In awk, when a print command appears by itself, the full contents of the current line are printed.
+
As you can see, we have all the bare essentials -- a description of Gentoo Linux, a features list, a daily Changelog (automatically updated thanks to Python), and a bunch of important links (to the download sites, to our mailing list sign-up pages, and to cvsWeb). We also have links to three documentation resources -- the Gentoo Linux Install Guide and Development Guides, and Christian Zander's NVIDIA Troubleshooting Guide.
  
Here is another awk example that does exactly the same thing:
+
However, while the site seems O.K., we're missing a lot of things. The most obvious is documentation -- our installation and development guides need a lot of work. And then we need to add an FAQ, new links, new user information...the list is endless.
  
<console>$##i## awk '{ print $0 }' /etc/passwd</console>
+
== Content vs. display ==
  
In awk, the $0 variable represents the entire current line, so print and print $0 do exactly the same thing. If you'd like, you can create an awk program that will output data totally unrelated to the input data. Here's an example:
+
And now we come to our second problem. Right now, all of our work is done in raw HTML; I hack away at the index.html file until it looks O.K. Even worse, our Web documentation is written in raw HTML. This isn't a good thing from a development perspective because our raw content (consisting of paragraphs, sections, chapters) is garbled together with a bunch of display-related HTML tags. This, of course, makes it difficult to change both the content and the look of our site. While this approach has worked so far, it is bound to cause problems as our site continues to grow.
  
<console>$##i## awk '{ print "" }' /etc/passwd</console>
+
Clearly, we need to be using better technologies behind the scenes. Instead of using HTML directly, we need to start using things like XML, XSLT, and Python. The goal is to automate as much as possible so that we can add and expand our site with ease. If we do our job well, even major future changes to our site should be relatively painless.
  
Whenever you pass the "" string to the print command, it prints a blank line. If you test this script, you'll find that awk outputs one blank line for every line in your /etc/passwd file. Again, this is because awk executes your script for every line in the input file. Here's another example:
+
== A strategy! ==
  
<console>$##i## awk '{ print "hiya" }' /etc/passwd</console>
+
It was clear that we had a lot of work ahead of us. In fact, there was so much to be done that I didn't know where to begin. Just as I was trying to sort out everything in my head, I came across Laura Wonnacott's "Site Savvy" InfoWorld column (see Resources). In it, she explained the concept of "user-centric" design -- how to improve a Web site while keeping the needs of your target audience (in this case, Gentoo Linux users and developers) in focus. Reading the article and taking a look at the "Handbook of User-Centered Design" link from the article helped me to formulate a strategy -- an action plan -- for the redesign:
  
Running this script will fill your screen with hiya's. :)
+
# First, clearly define the official goal of the Web site -- in writing. What's it there for, and what's it supposed to do?
 +
#  Identify the different categories of users who will be using your site -- your target audience. Rank them in order of priority: Which ones are most important to you?
 +
# Set up a system for getting feedback from your target audience, so they can let you know what you're doing right and wrong.
 +
# Evaluate the feedback, and use it to determine what parts of the site need to be improved or redesigned. Tackle high-priority sections first.
 +
# Once you've selected the part of the site to improve, get to work! During your implementation, make sure that the content and design of the new section caters specifically to the needs of your target audience and fixes all known deficiencies.
 +
# When the section redesign is complete, add it to your live site, even if it has a look that's markedly different from your current site. This way, your users can begin benefitting from the newly redesigned section immediately. If there's a problem with the redesign, you'll get user feedback more quickly. Finally, making incremental improvements to your site (rather than revamping the whole site and then rolling it out all at once -- surprise!) will help prevent your users from feeling alienated by your (possibly dramatic) site changes.
 +
#  After completing step 6, jump to step 4 and repeat.
  
=== Multiple fields ===
+
== The mission statement ==
Awk is really good at handling text that has been broken into multiple logical fields, and allows you to effortlessly reference each individual field from inside your awk script. The following script will print out a list of all user accounts on your system:
+
  
<console>$##i## awk -F":" '{ print $1 }' /etc/passwd</console>
+
I was happy to discover that we already had step 3 in place. We had received several e-mail suggestions from visitors to the site, and our developer mailing list also served as a way of exchanging suggestions and comments. However, I had never really completed steps 1 or 2. While the answers may seem obvious, I did find it helpful to actually sit down and write out our mission statement:
  
Above, when we called awk, we used the -F option to specify ":" as the field separator. When awk processes the print $1 command, it will print out the first field that appears on each line in the input file. Here's another example:
+
www.gentoo.org exists to assist those who use and develop for Gentoo Linux by providing relevant, up-to-date information about Gentoo Linux and Linux in general, focusing on topics related to Gentoo Linux installation, use, administration, and development. As the central hub for all things Gentoo, the site should also feature important news relevant to Gentoo Linux users and developers. In addition to catering to Gentoo Linux users and developers, www.gentoo.org has the secondary purpose of meeting the needs of potential Gentoo Linux users, providing the information they need to decide whether Gentoo Linux is right for them.
  
<console>$##i## awk -F":" '{ print $1 $3 }' /etc/passwd</console>
+
== The target audience ==
  
Here's an excerpt of the output from this script:
+
So far, so good. Now for step 2 -- defining our target audience:
<pre>
+
halt7
+
operator11
+
root0
+
shutdown6
+
sync5
+
bin1
+
....etc.
+
</pre>
+
As you can see, awk prints out the first and third fields of the /etc/passwd file, which happen to be the username and uid fields respectively. Now, while the script did work, it's not perfect -- there aren't any spaces between the two output fields! If you're used to programming in bash or python, you may have expected the print $1 $3 command to insert a space between the two fields. However, when two strings appear next to each other in an awk program, awk concatenates them without adding an intermediate space. The following command will insert a space between both fields:
+
  
<console>$##i## awk -F":" '{ print $1 " " $3 }' /etc/passwd</console>
+
www.gentoo.org has three target audiences -- Gentoo Linux developers, users, and potential users. While no one group is absolutely a higher priority than another, right now the needs of Gentoo Linux developers are our highest priority, followed by Gentoo Linux users, and then potential users. This is because Gentoo Linux is currently in a prerelease state. When Gentoo Linux reaches version 1.0, Gentoo Linux users and potential users will also become a priority.
  
When you call print this way, it'll concatenate $1, " ", and $3, creating readable output. Of course, we can also insert some text labels if needed:
+
== Comments and suggestions ==
  
<console>$##i## awk -F":" '{ print "username: " $1 "\t\tuid:" $3 }' /etc/passwd</console>
+
O.K., now it's time to evaluate the suggestions and comments we've collected:
  
This will cause the output to be:
+
Over the past few months, we've received a number of suggestions from Web site visitors. Overwhelmingly, people are requesting better documentation -- for both developers and users. Several developers have asked if we could create a mailing list that would be devoted exclusively to describing CVS commits.
<pre>
+
username: halt    uid:7
+
username: operator uid:11
+
username: root    uid:0
+
username: shutdown uid:6
+
username: sync    uid:5
+
username: bin      uid:1
+
....etc.
+
</pre>
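For readers who think in Python, the field handling shown above can be sketched roughly as follows. This is only an analogue under stated assumptions: the function name <code>label_fields</code> and the sample passwd entry are made up for illustration, and nothing here is part of awk itself.

```python
def label_fields(line, sep=":"):
    # Rough analogue of: awk -F":" '{ print "username: " $1 "\t\tuid:" $3 }'
    fields = line.rstrip("\n").split(sep)
    # awk numbers fields from 1 ($1, $3); Python lists start at 0.
    # As in awk, nothing is inserted between the pieces unless we add it.
    return "username: " + fields[0] + "\t\tuid:" + fields[2]


if __name__ == "__main__":
    sample = "root:x:0:0:root:/root:/bin/bash"  # hypothetical passwd entry
    print(label_fields(sample))
```

Note how much of the Python version is plumbing (splitting, indexing) that awk does for you on every input line.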
+
  
=== External Scripts ===
+
Interestingly, we've also received a couple of e-mails asking whether Gentoo Linux is a commercial or free product. I'm guessing that because our main logo is inscribed with the name "Gentoo Technologies, Inc." (our legal corporation name), people assume that we have a commercial focus. Modifying our logo so that it reads "Gentoo Linux" and adding a small opening paragraph to the main page explaining that we are a free software project should help.
Passing your scripts to awk as a command line argument can be very handy for small one-liners, but when it comes to complex, multi-line programs, you'll definitely want to compose your script in an external file. Awk can then be told to source this script file by passing it the -f option:
+
  
<console>$##i## awk -f myscript.awk myfile.in </console>
+
== The improvement list ==
  
Putting your scripts in their own text files also allows you to take advantage of additional awk features. For example, this multi-line script does the same thing as one of our earlier one-liners, printing out the first field of each line in /etc/passwd:
+
O.K., now let's turn these suggestions into a list of possible improvements:
<pre>
+
BEGIN {
+
        FS=":"
+
}
+
{ print $1 }
+
</pre>
+
The difference between these two methods has to do with how we set the field separator. In this script, the field separator is specified within the code itself (by setting the FS variable), while our previous example set FS by passing the -F":" option to awk on the command line. It's generally best to set the field separator inside the script itself, simply because it means you have one less command line argument to remember to type. We'll cover the FS variable in more detail later in this article.
+
  
It is also possible to make the script directly executable, by placing a "#!/usr/bin/awk -f" at the top of the file, as follows:
+
* Revamp main page
<pre>
+
** Implementation: update logo and add free software blurb
#!/usr/bin/awk -f
+
** Goal: to clearly state that we are a free software project
BEGIN {
+
** Target group: potential users
FS=":"
+
**  Difficulty: medium
}
+
* Improve basic user documentation
{ print $1 }
+
**  Implementation: new XML/XSLT system, verbose documentation
</pre>
+
** Goal: to make it easier for users to install Gentoo Linux
Next, the script must be made executable by setting the script file's execute bit:
+
** Target group: new users
 +
** Difficulty: medium
 +
*Improve/create developer documentation
 +
** Implementation: new XML/XSLT system, CVS guide, dev guide, Portage guide
 +
**  Goal: to help our developers to do a great job
 +
** Target group: developers
 +
** Difficulty: hard
 +
*Add a CVS mailing list
 +
** Implementation: use our existing mailman mailing list manager
 +
** Goal: to better inform our developers
 +
** Target group: developers
 +
** Difficulty: easy
  
<console>$##i## chmod +x myscript.awk</console>
+
== A selection! ==
  
Now, you should be able to execute the script as follows:
+
Two things leap out from the list, for different reasons. The first is the CVS mailing list -- this one is a no-brainer because it's so easy to implement. Often, it makes sense to implement the easiest changes first so that users can benefit from them right away.
  
<console>$##i## ./myscript.awk myfile.in</console>
+
The second big thing that leaps out from the list is the need for developer documentation. This is a longer-term project that will require much more work. From my conversations with the other developers, we all appear to be in agreement that some kind of XML/XSL approach is the right solution.
  
=== The BEGIN and END blocks ===
+
== The XML/XSL prototype ==
Normally, awk executes each block of your script's code once for each input line. However, there are many programming situations where you may need to execute initialization code before awk begins processing the text from the input file. For such situations, awk allows you to define a BEGIN block. We used a BEGIN block in the previous example. Because the BEGIN block is evaluated before awk starts processing the input file, it's an excellent place to initialize the FS (field separator) variable, print a heading, or initialize other global variables that you'll reference later in the program.
+
  
Awk also provides another special block, called the END block. Awk executes this block after all lines in the input file have been processed. Typically, the END block is used to perform final calculations or print summaries that should appear at the end of the output stream.
+
To help start the process, I developed a prototype XML syntax to be used for all our online documentation. By using this XML syntax (called "guide"), our documentation will be clearly organized into paragraphs, sections, and chapters (using XML tags like <section>, <chapter>, etc.) while remaining free of any display-related tags. To create the HTML for display on our site, I created a prototype set of XSL transforms. By using an XSLT processor such as Sablotron, our guide XML files can be converted into HTML as follows:
  
=== Regular expressions and blocks ===
+
devguide.xml + guide.xsl ---XSLT processor---> devguide.html
Awk allows the use of regular expressions to selectively execute an individual block of code, depending on whether or not the regular expression matches the current line. Here's an example script that outputs only those lines that contain the character sequence foo:
+
  
<pre>/foo/ { print }</pre>
+
The great thing about this XML/XSLT approach is that it separates our raw content (XML) from the display-related information contained in the guide.xsl (XSLT) file. If we ever need to update the look of our Web pages, we simply modify the guide.xsl file and run all our XML through the XSLT processor (Sablotron), creating updated HTML pages. Or, if we need to add a few chapters to the development guide, we can modify devguide.xml. Once we're done, we then run the XML through Sablotron, which then spits out a fully-formatted devguide.html file with several added chapters. Think of XML as the content and XSLT as the display-related formatting macros.
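The content/display split described above can be illustrated with a small Python sketch. To be clear, this is not XSLT and not the guide.xsl prototype: it swaps in Python's standard xml.etree.ElementTree module as a stand-in for the transform step, and the <code>title</code> attributes, the <code>p</code> tag, and the sample text are assumptions invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical "guide"-style content: structure only, no display tags.
GUIDE_XML = """<guide>
  <chapter title="Getting started">
    <section title="Installing">
      <p>Boot the CD and follow the prompts.</p>
    </section>
  </chapter>
</guide>"""


def render_html(xml_text):
    # All look-and-feel decisions live in this one function, so the
    # content above never needs to change when the site design does.
    root = ET.fromstring(xml_text)
    out = ["<html><body>"]
    for chapter in root.findall("chapter"):
        out.append("<h1>" + chapter.get("title", "") + "</h1>")
        for section in chapter.findall("section"):
            out.append("<h2>" + section.get("title", "") + "</h2>")
            for para in section.findall("p"):
                out.append("<p>" + (para.text or "") + "</p>")
    out.append("</body></html>")
    return "\n".join(out)


if __name__ == "__main__":
    print(render_html(GUIDE_XML))
```

Changing the headings from h1/h2 to something else touches only the rendering function, which is the same property the XML/XSLT approach is meant to provide. A real transform would also need to escape special characters, which this sketch skips.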
  
Of course, you can use more complicated regular expressions. Here's a script that will print only lines that contain a floating point number:
+
While our entire team is convinced that XML/XSLT is the way to go, we haven't yet agreed upon an official XML syntax. Achim, our development lead, suggested that we use docbook instead of rolling our own XML syntax. However, the prototype guide XML format has helped to start the decision-making process. Because we developers are going to be the ones using the XML/XSL on a daily basis, it's important to choose a solution that we're comfortable with and meets all of our needs. By my next article, I should have a working XML/XSL doc system to show off to you.
  
<pre>/[0-9]+\.[0-9]*/ { print }</pre>
+
== Technology demo: pytext ==
  
=== Expressions and blocks ===
+
For the most part, our current Web site isn't using any new or super-cool technologies that are worth mentioning. However, there's one notable exception -- our tiny pytext embedded Python interpreter.
There are many other ways to selectively execute a block of code. We can place any kind of boolean expression before a code block to control when a particular block is executed. Awk will execute a code block only if the preceding boolean expression evaluates to true. The following example script will output the third field of all lines that have a first field equal to fred. If the first field of the current line is not equal to fred, awk will continue processing the file and will not execute the print statement for the current line:
+
  
<pre>$1 == "fred" { print $3 }</pre>
+
Like many of you, I'm a huge Python fan and much prefer it over other scripting languages, so when it came time to add some dynamic content to our Web site, I naturally wanted to use Python. And, as you probably know, when coding dynamic HTML content, it's usually much more convenient to embed the language commands inside the HTML, rather than the other way around. Thus, the need for an embedded Python interpreter that can take a document like this:
  
Awk offers a full selection of comparison operators, including the usual "==", "<", ">", "<=", ">=", and "!=". In addition, awk provides the "~" and "!~" operators, which mean "matches" and "does not match". They're used by specifying a variable on the left side of the operator, and a regular expression on the right side. Here's an example that will print only the third field on the line if the fifth field on the same line contains the character sequence root:
 
 
<pre>$5 ~ /root/ { print $3 }</pre>
 
 
=== Conditional statements ===
 
Awk also offers very nice C-like if statements. If you'd like, you could rewrite the previous script using an if statement:
 
 
<pre>
 
<pre>
{
+
<p>
    if ( $5 ~ /root/ ) {
+
Yeah, sure; I got some questions:<br>
        print $3
+
<!--code
    }
+
names=["bob","jimmy","ralph"]
}
+
items=["socks","lunch","accordion"]
 +
for x in items:
 +
    for y in names:
 +
        print "Anyone seen",y+"'s",x+"?<br>"
 +
-->
 +
See, told you so.
 
</pre>
 
</pre>
Both scripts function identically. In the first example, the boolean expression is placed outside the block, while in the second example, the block is executed for every input line, and we selectively perform the print command by using an if statement. Both methods are available, and you can choose the one that best meshes with the other parts of your script.
 
  
Here's a more complicated example of an awk if statement. As you can see, even with complex, nested conditionals, if statements look identical to their C counterparts:
+
....and transform it into this:
<pre>
+
{
+
    if ( $1 == "foo" ) {
+
        if ( $2 == "foo" ) {
+
            print "uno"
+
        } else {
+
            print "one"
+
        }
+
    } else if ($1 == "bar" ) {
+
        print "two"
+
    } else {
+
        print "three"
+
    }
+
}
+
</pre>
+
Using if statements, we can also transform this code:
+
<pre>
+
! /matchme/ { print $1 $3 $4 }
+
</pre>
+
to this:
+
<pre>
+
{
+
    if ( $0 !~ /matchme/ ) {
+
        print $1 $3 $4
+
    }
+
}
+
</pre>
+
Both scripts will output only those lines that don't contain a matchme character sequence. Again, you can choose the method that works best for your code. They both do the same thing.
+
  
Awk also allows the use of boolean operators "||" (for "logical or") and "&&" (for "logical and") to allow the creation of more complex boolean expressions:
 
 
<pre>
 
<pre>
( $1 == "foo" ) && ( $2 == "bar" ) { print }
+
<p>
 +
Yeah, sure; I got some questions:<br>
 +
Anyone seen bob's socks?<br>
 +
Anyone seen jimmy's socks?<br>
 +
Anyone seen ralph's socks?<br>
 +
Anyone seen bob's lunch?<br>
 +
Anyone seen jimmy's lunch?<br>
 +
Anyone seen ralph's lunch?<br>
 +
Anyone seen bob's accordion?<br>
 +
Anyone seen jimmy's accordion?<br>
 +
Anyone seen ralph's accordion?<br>
 +
See, told you so.
 
</pre>
 
</pre>
This example will print only those lines where field one equals foo and field two equals bar.
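The "condition in front of a block" model can be sketched in Python as a list of (condition, action) pairs that is applied to every record. This is a loose analogue rather than a description of how awk is implemented; <code>run_rules</code> and <code>rules</code> are hypothetical names chosen for the example.

```python
def run_rules(lines, rules, sep=None):
    # For each record, run every action whose condition is true,
    # in the order the rules were written (as awk does with its blocks).
    output = []
    for line in lines:
        fields = line.split(sep)  # sep=None splits on runs of whitespace
        for condition, action in rules:
            if condition(fields):
                output.append(action(line, fields))
    return output


# Analogue of: ( $1 == "foo" ) && ( $2 == "bar" ) { print }
rules = [
    (lambda f: len(f) >= 2 and f[0] == "foo" and f[1] == "bar",
     lambda line, f: line),
]

if __name__ == "__main__":
    for result in run_rules(["foo bar baz", "foo qux", "bar foo"], rules):
        print(result)
```

The explicit length check is needed only in the Python version; in awk, a field that doesn't exist simply compares as an empty string.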
 
  
=== Numeric variables! ===
+
Here's the source code for pytext:
So far, we've either printed strings, the entire line, or specific fields. However, awk also allows us to perform both integer and floating point math. Using mathematical expressions, it's very easy to write a script that counts the number of blank lines in a file. Here's one that does just that:
+
<pre>
+
BEGIN { x=0 }
+
/^$/  { x=x+1 }
+
END  { print "I found " x " blank lines. :)" }
+
</pre>
+
In the BEGIN block, we initialize our integer variable x to zero. Then, each time awk encounters a blank line, awk will execute the x=x+1 statement, incrementing x. After all the lines have been processed, the END block will execute, and awk will print out a final summary, specifying the number of blank lines it found.
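For comparison, here is a rough Python rendering of the same three-part structure, with comments marking which awk block each part corresponds to. The function name <code>count_blank_lines</code> is made up for this sketch.

```python
def count_blank_lines(lines):
    x = 0                              # BEGIN { x=0 }
    for line in lines:
        if line.rstrip("\n") == "":    # /^$/  { x=x+1 }
            x = x + 1
    # END { print "I found " x " blank lines. :)" }
    return "I found " + str(x) + " blank lines. :)"


if __name__ == "__main__":
    print(count_blank_lines(["first\n", "\n", "second\n", "\n"]))
```

In awk the loop over input lines is implicit, which is why the original script needs only three short lines.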
+
  
=== Stringy variables ===
+
Code Listing 2.4:  
One of the neat things about awk variables is that they are "simple and stringy." I consider awk variables "stringy" because all awk variables are stored internally as strings. At the same time, awk variables are "simple" because you can perform mathematical operations on a variable, and as long as it contains a valid numeric string, awk automatically takes care of the string-to-number conversion steps. To see what I mean, check out this example:
+
{{file|name=pytext|lang=python|desc=The pytext embedded Python interpreter|body=
<pre>
+
#!/usr/bin/env python2
x="1.01"
+
# We just set x to contain the *string* "1.01"
+
x=x+1
+
# We just added one to a *string*
+
print x
+
# Incidentally, these are comments :)
+
</pre>
+
Awk will output:
+
<pre>
+
2.01
+
</pre>
+
Interesting! Although we assigned the string value 1.01 to the variable x, we were still able to add one to it. We wouldn't be able to do this in bash or python. First of all, bash doesn't support floating point arithmetic. And, while bash has "stringy" variables, they aren't "simple"; to perform any mathematical operations, bash requires that we enclose our math in an ugly $(( )) construct. If we were using python, we would have to explicitly convert our 1.01 string to a floating point value before performing any arithmetic on it. While this isn't difficult, it's still an additional step. With awk, it's all automatic, and that makes our code nice and clean. If we wanted to square and add one to the first field in each input line, we would use this script:
+
<pre>
+
{ print ($1^2)+1 }
+
</pre>
+
If you do a little experimenting, you'll find that if a particular variable doesn't contain a valid number, awk will treat that variable as a numerical zero when it evaluates your mathematical expression.
+
  
=== Lots of operators ===
+
# pytext 2.1
Another nice thing about awk is its full complement of mathematical operators. In addition to standard addition, subtraction, multiplication, and division, awk allows us to use the previously demonstrated exponent operator "^", the modulo (remainder) operator "%", and a bunch of other handy assignment operators borrowed from C.
+
# Copyright 1999-2001 Daniel Robbins
 +
# Distributed under the GPL
  
These include pre- and post-increment/decrement ( i++, --foo ), add/sub/mult/div assign operators ( a+=3, b*=2, c/=2.2, d-=6.2 ). But that's not all -- we also get handy modulo/exponent assign ops as well ( a^=2, b%=4 ).
+
import sys
  
=== Field separators ===
+
def runfile(myarg):
Awk has its own complement of special variables. Some of them allow you to fine-tune how awk functions, while others can be read to glean valuable information about the input. We've already touched on one of these special variables, FS. As mentioned earlier, this variable allows you to set the character sequence that awk expects to find between fields. When we were using /etc/passwd as input, FS was set to ":". While this did the trick, FS allows us even more flexibility.
+
  "interprets a text file with embedded elements"
 +
  mylocals={}
 +
  try:
 +
      a=open(myarg,'r')
 +
  except IOError:
 +
      sys.stderr.write("!!! Error opening "+myarg+"!\n")
 +
      return
 +
  mylines=a.readlines()
 +
  a.close()
 +
  pos=0
 +
  while pos<len(mylines):
 +
      if mylines[pos][0:8]=="<!--code":
 +
          mycode=""
 +
          pos=pos+1
 +
          while (pos<len(mylines)) and (mylines[pos][0:3]!="-->"):
 +
              mycode=mycode+mylines[pos]
 +
              pos=pos+1
 +
          exec(mycode,globals(),mylocals)
 +
      else:
 +
          sys.stdout.write(mylines[pos])
 +
      pos=pos+1
  
The FS value is not limited to a single character; it can also be set to a regular expression, specifying a character pattern of any length. If you're processing fields separated by one or more tabs, you'll want to set FS like so:
+
if len(sys.argv)>1:
<pre>
+
  for x in sys.argv[1:]:
FS="\t+"
+
      runfile(x)
</pre>
+
  sys.exit(0)
Above, we use the special "+" regular expression character, which means "one or more of the previous character".
+
else:
 +
  sys.stderr.write("pytext 2.1 -- Copyright 1999-2001 Daniel Robbins. ")
 +
  sys.stderr.write("Distributed under the\nGNU Public License\n\n")
 +
  sys.stderr.write("Usage: "+sys.argv[0]+" file0 [file1]...\n")
 +
  sys.exit(1)
 +
}}
  
If your fields are separated by whitespace (one or more spaces or tabs), you may be tempted to set FS to the following regular expression:
+
== How pytext works ==
<pre>
+
FS="[[:space:]]+"
+
</pre>
+
While this assignment will do the trick, it's not necessary. Why? Because by default, FS is set to a single space character, which awk interprets to mean "one or more spaces or tabs." In this particular example, the default FS setting was exactly what you wanted in the first place!
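If it helps to see the two splitting behaviors side by side, here is a small Python sketch. It is an approximation for illustration only: the function names are hypothetical, and Python's splitting rules are similar to awk's in these two cases but are not guaranteed to match in every corner case.

```python
import re


def split_default(record):
    # Approximates awk's default FS: fields are separated by runs of
    # spaces or tabs, and leading/trailing whitespace is ignored.
    return record.split()


def split_on_tabs(record):
    # Approximates setting FS to a regular expression meaning
    # "one or more tabs".
    return re.split(r"\t+", record)


if __name__ == "__main__":
    print(split_default("  alpha   beta\tgamma "))
    print(split_on_tabs("alpha\t\tbeta\tgamma"))
```

Both calls yield the same three fields, which is the point made above: when the data is separated by ordinary whitespace, the default already does the right thing.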
+
  
Complex regular expressions are no problem. Even if your fields are separated by the word "foo," followed by three digits, the following regular expression will allow your data to be parsed properly:
+
Here's how it works. It scans each input line, and most of the time, each input line is simply echoed to stdout. However, if pytext encounters a line beginning with <!--code, then the contents of every line up to the first line beginning with --> are appended to a string called mycode. Pytext then executes the mycode string using the built-in exec() function, effectively creating an embedded Python interpreter.
<pre>
+
 
FS="foo[0-9][0-9][0-9]"
+
There's something really beautiful about this particular implementation -- we call exec() in such a way that all modifications to the global and local namespaces are saved. This makes it possible to import a module or define a variable in one embedded block, and then access this previously-created object in a later block, as this example clearly demonstrates:
</pre>
+
  
=== Number of fields ===
 
The next two variables we're going to cover are not normally intended to be written to, but are normally read and used to gain useful information about the input. The first is the NF variable, also called the "number of fields" variable. Awk will automatically set this variable to the number of fields in the current record. You can use the NF variable to display only certain input lines:
 
 
<pre>
 
<pre>
NF == 3 { print "this particular record has three fields: " $0 }
+
<!--code
</pre>
+
import os
Of course, you can also use the NF variable in conditional statements, as follows:
+
foo=23
<pre>
+
-->
{
+
    if ( NF > 2 ) {
+
        print $1 " " $2 ":" $3
+
    }
+
}
+
</pre>
+
  
=== Record number ===
+
Hello
The record number (NR) is another handy variable. It will always contain the number of the current record (awk counts the first record as record number 1). Up until now, we've been dealing with input files that contain one record per line. For these situations, NR will also tell you the current line number. However, when we start to process multi-line records later in the series, this will no longer be the case, so be careful! NR can be used like the NF variable to print only certain lines of the input:
+
 
<pre>
+
<!--code
(NR < 10 ) || (NR > 100) { print "We are on record number 1-9 or 101+" }
+
print foo
</pre>
+
if os.path.exists("/tmp/mytmpfile"):
<pre>
+
print "it exists"
{
+
else:
    #skip header
+
print "I don't see it"
    if ( NR > 10 ) {
+
-->
        print "ok, now for the real information!"
+
    }
+
}
+
 
</pre>
 
</pre>
Awk provides additional variables that can be used for a variety of purposes. We'll cover more of these variables in later articles.
 
  
We've come to the end of our initial exploration of awk. As the series continues, I'll demonstrate more advanced awk functionality, and we'll end the series with a real-world awk application.
+
Handy, eh? pytext serves is an excellent demonstration of the power of Python, and is an extremely useful tool for Python fans. For our current site, we call pytext from a cron job, using it to periodically generate the HTML code for our main page Changelog:
  
== Resources ==
+
<console>
 +
$ ##i##pytext index.ehtml > index.html
 +
</console>
  
* Read Daniel's other awk articles on Funtoo: Awk By Example, [[Awk by example, Part2 |Part 2]] and [[Awk by example, Part3 |Part 3]].
+
That's it for now; I'll see you next time when we'll take a look at the first stage of the www.gentoo.org redesign!
* If you'd like a good old-fashioned book, [http://www.oreilly.com/catalog/sed2/ O'Reilly's sed & awk, 2nd Edition] is a wonderful choice.
+
* Be sure to check out the [http://www.faqs.org/faqs/computer-lang/awk/faq/ comp.lang.awk FAQ]. It also contains lots of additional awk links.
+
* Patrick Hartigan's [http://sparky.rice.edu/~hartigan/awk.html awk tutorial] is packed with handy awk scripts.
+
* [http://www.tasoft.com/tawk.html Thompson's TAWK Compiler] compiles awk scripts into fast binary executables. Versions are available for Windows, OS/2, DOS, and UNIX.
+
* [http://www.gnu.org/software/gawk/manual/gawk.html The GNU Awk User's Guide] is available for online reference.
+
* [http://www.folkstalk.com/2011/12/good-examples-of-awk-command-in-unix.html Awk Command] daily useful examples.
+
[[Category:Linux Core Concepts]]
+
[[Category:Articles]]
+
 
{{ArticleFooter}}
 
{{ArticleFooter}}

Revision as of 08:21, December 31, 2014

A site reborn

Have you ever woken up one morning and suddenly realized that your cute little personal development Web site isn't really that great? If so, you're in good company. In this series, Daniel Robbins shares his experiences as he redesigns the www.gentoo.org Web site using technologies like XML, XSLT, and Python. Along the way, you may find some excellent approaches to use for your next Web site redesign. In this article, Daniel creates a user-centric action plan and introduces pytext, an embedded Python interpreter.

Support Funtoo and help us grow! Donate $15 per month and get a free SSD-based Funtoo Virtual Container.
Looking for people interested in testing and documenting Docker support! Contact Daniel Robbins for more info.

An unruly horde

Fellow software developer, may I ask you a question? Why is it that although many of us are intimately familiar with Web technologies such as HTML, CGI, Perl, Python, Java technology, and XML, our very own Web sites -- the ones devoted to our precious development projects -- look like they were thrown together by an unruly horde of hyperactive 12-year-olds? Why, oh why, is this so?

Could it be because most of the time, we've left our Web site out to rot while we squander our precious time hacking away on our free software projects? The answer, at least in my case, is a most definite "Yes."

When I'm not writing articles for IBM developerWorks or being a new dad, I'm feverishly working on the next release of Gentoo Linux, along with my skilled team of volunteers. And, yes, Gentoo Linux has its own Web site (see Resources). As of right now (March 2001), our Web site isn't that special; that's because we don't spend much time working on it because we're generally engrossed in improving Gentoo Linux itself. Sure, our site does have several admittedly cute logos that I whipped up using Xara X (see Resources), but when you look past the eye candy, our site leaves a lot to be desired. Maybe yours does too. If so, I have one thing to say to you -- welcome to the club.

www.gentoo.org

In our case, our Web site dilemma exists because our project has been growing, and our Web site hasn't. Now that Gentoo Linux is approaching the 1.0 release (when it'll be officially ready for non-developers) and is growing in popularity, we need to start seriously looking at how our Web site can better serve its users. Here's a snapshot of www.gentoo.org:

The current (March 2001) state of affairs at www.gentoo.org

As you can see, we have all the bare essentials -- a description of Gentoo Linux, a features list, a daily Changelog (automatically updated thanks to Python), and a bunch of important links (to the download sites, to our mailing list sign-up pages, and to cvsWeb). We also have links to three documentation resources -- the Gentoo Linux Install Guide and Development Guides, and Christian Zander's NVIDIA Troubleshooting Guide.

However, while the site seems O.K., we're missing a lot of things. The most obvious is documentation -- our installation and development guides need a lot of work. And then we need to add an FAQ, new links, new user information...the list is endless.

Content vs. display

And now we come to our second problem. Right now, all of our work is done in raw HTML; I hack away at the index.html file until it looks O.K. Even worse, our Web documentation is written in raw HTML. This isn't a good thing from a development perspective because our raw content (consisting of paragraphs, sections, chapters) is garbled together with a bunch of display-related HTML tags. This, of course, makes it difficult to change both the content and the look of our site. While this approach has worked so far, it is bound to cause problems as our site continues to grow.

Clearly, we need to be using better technologies behind the scenes. Instead of using HTML directly, we need to start using things like XML, XSLT, and Python. The goal is to automate as much as possible so that we can add and expand our site with ease. If we do our job well, even major future changes to our site should be relatively painless.

A strategy!

It was clear that we had a lot of work ahead of us. In fact, there was so much to be done that I didn't know where to begin. Just as I was trying to sort out everything in my head, I came across Laura Wonnacott's "Site Savvy" InfoWorld column (see Resources). In it, she explained the concept of "user-centric" design -- how to improve a Web site while keeping the needs of your target audience (in this case, Gentoo Linux users and developers) in focus. Reading the article and taking a look at the "Handbook of User-Centered Design" link from the article helped me to formulate a strategy -- an action plan -- for the redesign:

  1. First, clearly define the official goal of the Web site -- in writing. What's it there for, and what's it supposed to do?
  2. Identify the different categories of users who will be using your site -- your target audience. Rank them in order of priority: Which ones are most important to you?
  3. Set up a system for getting feedback from your target audience, so they can let you know what you're doing right and wrong.
  4. Evaluate the feedback, and use it to determine what parts of the site need to be improved or redesigned. Tackle high-priority sections first.
  5. Once you've selected the part of the site to improve, get to work! During your implementation, make sure that the content and design of the new section caters specifically to the needs of your target audience and fixes all known deficiencies.
  6. When the section redesign is complete, add it to your live site, even if it has a look that's markedly different from your current site. This way, your users can begin benefitting from the newly redesigned section immediately. If there's a problem with the redesign, you'll get user feedback more quickly. Finally, making incremental improvements to your site (rather than revamping the whole site and then rolling it out all at once -- surprise!) will help prevent your users from feeling alienated by your (possibly dramatic) site changes.
  7. After completing step 6, jump to step 4 and repeat.

The mission statement

I was happy to discover that we already had step 3 in place. We had received several e-mail suggestions from visitors to the site, and our developer mailing list also served as a way of exchanging suggestions and comments. However, I had never really completed steps 1 or 2. While the answers may seem obvious, I did find it helpful to actually sit down and write out our mission statement:

www.gentoo.org exists to assist those who use and develop for Gentoo Linux by providing relevant, up-to-date information about Gentoo Linux and Linux in general, focusing on topics related to Gentoo Linux installation, use, administration, and development. As the central hub for all things Gentoo, the site should also feature important news relevant to Gentoo Linux users and developers. In addition to catering to Gentoo Linux users and developers, www.gentoo.org has the secondary purpose of meeting the needs of potential Gentoo Linux users, providing the information they need to decide whether Gentoo Linux is right for them.

The target audience

So far, so good. Now for step 2 -- defining our target audience:

www.gentoo.org has three target audiences -- Gentoo Linux developers, users, and potential users. While no one group is absolutely a higher priority than another, right now the needs of Gentoo Linux developers are our highest priority, followed by Gentoo Linux users, and then potential users. This is because Gentoo Linux is currently in a prerelease state. When Gentoo Linux reaches version 1.0, Gentoo Linux users and potential users will also become a priority.

Comments and suggestions

O.K., now it's time to evaluate the suggestions and comments we've collected:

Over the past few months, we've received a number of suggestions from Web site visitors. Overwhelmingly, people are requesting better documentation -- for both developers and users. Several developers have asked if we could create a mailing list that would be devoted exclusively to describing CVS commits.

Interestingly, we've also received a couple of e-mails asking whether Gentoo Linux is a commercial or free product. I'm guessing that because our main logo is inscribed with the name "Gentoo Technologies, Inc." (our legal corporation name), people assume that we have a commercial focus. Modifying our logo so that it reads "Gentoo Linux" and adding a small opening paragraph to the main page explaining that we are a free software project should help.

The improvement list

O.K., now let's turn these suggestions into a list of possible improvements:

  • Revamp main page
    • Implementation: update logo and add free software blurb
    • Goal: to clearly state that we are a free software project
    • Target group: potential users
    • Difficulty: medium
  • Improve basic user documentation
    • Implementation: new XML/XSLT system, verbose documentation
    • Goal: to make it easier for users to install Gentoo Linux
    • Target group: new users
    • Difficulty: medium
  • Improve/create developer documentation
    • Implementation: new XML/XSLT system, CVS guide, dev guide, Portage guide
    • Goal: to help our developers to do a great job
    • Target group: developers
    • Difficulty: hard
  • Add a CVS mailing list
    • Implementation: use our existing mailman mailing list manager
    • Goal: to better inform our developers
    • Target group: developers
    • Difficulty: easy

A selection!

Two things leap out from the list, for different reasons. The first is the CVS mailing list -- this one is a no-brainer because it's so easy to implement. Often, it makes sense to implement the easiest changes first so that users can benefit from them right away.

The second big thing that leaps out from the list is the need for developer documentation. This is a longer-term project that will require much more work. From my conversations with the other developers, we all appear to be in agreement that some kind of XML/XSL approach is the right solution.

The XML/XSL prototype

To help start the process, I developed a prototype XML syntax to be used for all our online documentation. By using this XML syntax (called "guide"), our documentation will be clearly organized into paragraphs, sections, and chapters (using XML tags like <section>, <chapter>, etc.) while remaining free of any display-related tags. To create the HTML for display on our site, I created a prototype set of XSL transforms. By using an XSLT processor such as Sablotron, our guide XML files can be converted into HTML as follows:

devguide.xml + guide.xsl ---XSLT processor---> devguide.html

The great thing about this XML/XSLT approach is that it separates our raw content (XML) from the display-related information contained in the guide.xsl (XSLT) file. If we ever need to update the look of our Web pages, we simply modify the guide.xsl file and run all our XML through the XSLT processor (Sablotron), creating updated HTML pages. Or, if we need to add a few chapters to the development guide, we can modify devguide.xml. Once we're done, we then run the XML through Sablotron, which then spits out a fully-formatted devguide.html file with several added chapters. Think of XML as the content and XSLT as the display-related formatting macros.
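To make the content/display split concrete, here is a minimal Python 3 sketch. It is not XSLT and it is not our guide.xsl; it only mimics the idea, with a small rule table standing in for the stylesheet. The sample document, the DISPLAY_RULES table, and the to_html() helper are all hypothetical names made up for illustration, and apart from <section> and <chapter> the tag names are assumptions too.

```python
import xml.etree.ElementTree as ET

# Content only: no display-related tags. The <title> and <p> tags and the
# text are made up for this sketch.
GUIDE_XML = """<guide>
  <chapter>
    <title>Getting started</title>
    <section>
      <p>Install the base system.</p>
    </section>
  </chapter>
</guide>"""

# Display only: maps each content tag to the HTML tag used to show it.
# Changing the look of the site means editing this table, not the XML.
DISPLAY_RULES = {
    "guide": "div",
    "chapter": "div",
    "section": "div",
    "title": "h2",
    "p": "p",
}

def to_html(node):
    """Recursively convert a content element into an HTML string.

    Text that follows a child element and HTML escaping are ignored
    to keep the sketch short.
    """
    tag = DISPLAY_RULES[node.tag]
    inner = (node.text or "").strip()
    inner += "".join(to_html(child) for child in node)
    return "<%s>%s</%s>" % (tag, inner, tag)

html = to_html(ET.fromstring(GUIDE_XML))
print(html)
```

Swapping "h2" for "h1" in the rule table changes every chapter title in the output without touching the content, which is the same property we want from the real XML/XSLT setup.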

While our entire team is convinced that XML/XSLT is the way to go, we haven't yet agreed upon an official XML syntax. Achim, our development lead, suggested that we use docbook instead of rolling our own XML syntax. However, the prototype guide XML format has helped to start the decision-making process. Because we developers are going to be the ones using the XML/XSL on a daily basis, it's important to choose a solution that we're comfortable with and meets all of our needs. By my next article, I should have a working XML/XSL doc system to show off to you.

Technology demo: pytext

For the most part, our current Web site isn't using any new or super-cool technologies that are worth mentioning. However, there's one notable exception -- our tiny pytext embedded Python interpreter.

Like many of you, I'm a huge Python fan and much prefer it over other scripting languages, so when it came time to add some dynamic content to our Web site, I naturally wanted to use Python. And, as you probably know, when coding dynamic HTML content, it's usually much more convenient to embed the language commands inside the HTML, rather than the other way around. Thus, the need for an embedded Python interpreter that can take a document like this:

<p>
Yeah, sure; I got some questions:<br>
<!--code
names=["bob","jimmy","ralph"]
items=["socks","lunch","accordion"]
for x in items:
    for y in names:
        print "Anyone seen",y+"'s",x+"?<br>"
-->
See, told you so.

....and transform it into this:

<p>
Yeah, sure; I got some questions:<br>
Anyone seen bob's socks?<br>
Anyone seen jimmy's socks?<br>
Anyone seen ralph's socks?<br>
Anyone seen bob's lunch?<br>
Anyone seen jimmy's lunch?<br>
Anyone seen ralph's lunch?<br>
Anyone seen bob's accordion?<br>
Anyone seen jimmy's accordion?<br>
Anyone seen ralph's accordion?<br>
See, told you so.

Here's the source code for pytext:

Code Listing 2.4:

pytext (python source code) - The pytext embedded Python interpreter
#!/usr/bin/env python2
 
# pytext 2.1
# Copyright 1999-2001 Daniel Robbins
# Distributed under the GPL
 
import sys
 
def runfile(myarg):
   "interprets a text file with embedded elements"
   mylocals={}
   try:
      a=open(myarg,'r')
   except IOError:
      sys.stderr.write("!!! Error opening "+myarg+"!\n")
      return
   mylines=a.readlines()
   a.close()
   pos=0
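   while pos<len(mylines):
      if mylines[pos][0:8]=="<!--code":
         # start of an embedded block: gather lines up to the closing "-->"
         mycode=""
         pos=pos+1
         while (pos<len(mylines)) and (mylines[pos][0:3]!="-->"):
            mycode=mycode+mylines[pos]
            pos=pos+1
         # run the block; mylocals persists across blocks in the same file
         exec(mycode,globals(),mylocals)
      else:
         # ordinary line: echo it unchanged
         sys.stdout.write(mylines[pos])
      pos=pos+1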
 
if len(sys.argv)>1:
   for x in sys.argv[1:]:
      runfile(x)
   sys.exit(0)
else:
   sys.stderr.write("pytext 2.1 -- Copyright 1999-2001 Daniel Robbins. ")
   sys.stderr.write("Distributed under the\nGNU Public License\n\n")
   sys.stderr.write("Usage: "+sys.argv[0]+" file0 [file1]...\n")
   sys.exit(1)

How pytext works

Here's how it works. It scans each input line, and most of the time, each input line is simply echoed to stdout. However, if pytext encounters a line beginning with <!--code, then the contents of every line up to the first line beginning with --> are appended to a string called mycode. Pytext then executes the mycode string using the built-in exec() function, effectively creating an embedded Python interpreter.
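If you'd like to experiment with the same scanning idea on a current interpreter, here is a rough Python 3 sketch. It is not pytext itself: the render() function, the single shared namespace dictionary, and the sample page are assumptions made for illustration. It works on a string rather than a file and returns the result instead of writing to stdout.

```python
import contextlib
import io

def render(text):
    """Return text with each <!--code ... --> block replaced by its printed output."""
    namespace = {}          # shared by every embedded block in this text
    out = io.StringIO()
    lines = text.splitlines(keepends=True)
    pos = 0
    # Anything the embedded code prints lands in the same buffer as the
    # echoed lines, so the output keeps the original ordering.
    with contextlib.redirect_stdout(out):
        while pos < len(lines):
            if lines[pos].startswith("<!--code"):
                pos += 1
                code = ""
                # Collect lines up to (not including) the closing "-->".
                while pos < len(lines) and not lines[pos].startswith("-->"):
                    code += lines[pos]
                    pos += 1
                exec(code, namespace)
            else:
                out.write(lines[pos])   # ordinary line: echo unchanged
            pos += 1
    return out.getvalue()

page = "<p>\n<!--code\nfor n in (1, 2):\n    print('item', n)\n-->\ndone\n"
html = render(page)
print(html, end="")
```

As with pytext, the embedded code runs with the full privileges of the calling process, so this kind of tool is only appropriate for pages you wrote yourself.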

There's something really beautiful about this particular implementation -- we call exec() in such a way that all modifications to the global and local namespaces are saved. This makes it possible to import a module or define a variable in one embedded block, and then access this previously-created object in a later block, as this example clearly demonstrates:

<!--code
import os
foo=23
-->

Hello

<!--code
print foo
if os.path.exists("/tmp/mytmpfile"):
    print "it exists"
else:
    print "I don't see it"
-->
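The namespace trick can be tried in isolation. The following Python 3 sketch assumes nothing about pytext beyond what is described above: two strings stand in for two embedded blocks, and the dictionary names shared_globals and shared_locals are made up for the example. Because the same dictionaries are passed to both exec() calls, names bound by the first block are still there for the second.

```python
import contextlib
import io

# The same two dictionaries are handed to every exec() call, so whatever
# one block imports or assigns remains visible to the next block.
shared_globals = {}
shared_locals = {}

block_one = "import os\nfoo = 23\n"
block_two = "print(foo)\nprint(os.path.basename('/tmp/mytmpfile'))\n"

buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    exec(block_one, shared_globals, shared_locals)
    exec(block_two, shared_globals, shared_locals)

output = buf.getvalue()
print(output, end="")
```

Passing a fresh dictionary to each exec() call instead would make the second block fail with a NameError, since foo and os would no longer be defined.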

Handy, eh? pytext serves as an excellent demonstration of the power of Python, and is an extremely useful tool for Python fans. For our current site, we call pytext from a cron job, using it to periodically generate the HTML code for our main page Changelog:

$ pytext index.ehtml > index.html

That's it for now; I'll see you next time when we'll take a look at the first stage of the www.gentoo.org redesign!

Next >>>

Read the next article in this series: The Gentoo.org Redesign, Part 2


About the Author

Daniel Robbins is best known as the creator of Gentoo Linux and author of many IBM developerWorks articles about Linux. Daniel currently serves as Benevolent Dictator for Life (BDFL) of Funtoo Linux. Funtoo Linux is a Gentoo-based distribution and continuation of Daniel's original Gentoo vision.
