Difference between pages "Sed by Example, Part 1" and "Sed by Example, Part 2"

From Funtoo
Line 1: Line 1:
{{Article
{{Article
|Author=Drobbins
|Author=Drobbins
|Next in Series=Sed by Example, Part 2
|Previous in Series=Sed by Example, Part 1
|Next in Series=Sed by Example, Part 3
}}
}}
== Get to know the powerful UNIX editor ==
== How to further take advantage of the UNIX text editor ==


=== Pick an editor ===
=== Substitution! ===
In the UNIX world, we have a lot of options when it comes to editing files. Think of it -- vi, emacs, and jed come to mind, as well as many others. We all have our favorite editor (along with our favorite keybindings) that we have come to know and love. With our trusty editor, we are ready to tackle any number of UNIX-related administration or programming tasks with ease.
Let's look at one of sed's most useful commands, the substitution command. Using it, we can replace a particular string or matched regular expression with another string. Here's an example of the most basic use of this command:


While interactive editors are great, they do have limitations. Though their interactive nature can be a strength, it can also be a weakness. Consider a situation where you need to perform similar types of changes on a group of files. You could instinctively fire up your favorite editor and perform a bunch of mundane, repetitive, and time-consuming edits by hand. But there's a better way.
<console>$##i## sed -e 's/foo/bar/' myfile.txt</console>


=== Enter sed ===
The above command will output the contents of myfile.txt to stdout, with the first occurrence of 'foo' (if any) on each line replaced with the string 'bar'. Please note that I said first occurrence on each line, though this is normally not what you want. Normally, when I do a string replacement, I want to perform it globally. That is, I want to replace all occurrences on every line, as follows:
It would be nice if we could automate the process of making edits to files, so that we could "batch" edit files, or even write scripts with the ability to perform sophisticated changes to existing files. Fortunately for us, for these types of situations, there is a better way -- and the better way is called sed.


sed is a lightweight stream editor that's included with nearly all UNIX flavors, including Linux. sed has a lot of nice features. First of all, it's very lightweight, typically many times smaller than your favorite scripting language. Secondly, because sed is a stream editor, it can perform edits to data it receives from stdin, such as from a pipeline. So, you don't need to have the data to be edited stored in a file on disk. Because data can just as easily be piped to sed, it's very easy to use sed as part of a long, complex pipeline in a powerful shell script. Try doing that with your favorite editor.
<console>$##i## sed -e 's/foo/bar/g' myfile.txt</console>


=== GNU sed ===
The additional 'g' option after the last slash tells sed to perform a global replace.
Fortunately for us Linux users, one of the nicest versions of sed out there happens to be GNU sed. Every Linux distribution has GNU sed, or at least should. GNU sed is popular not only because its sources are freely distributable, but because it happens to have a lot of handy, time-saving extensions to the POSIX sed standard. GNU sed also doesn't suffer from many of the limitations that earlier and proprietary versions of sed had, such as a limited line length -- GNU sed handles lines of any length with ease.


=== The right sed ===
Here are a few other things you should know about the <span style="color:green">s///</span> substitution command. First, it is a command, and a command only; there are no addresses specified in any of the above examples. This means that the <span style="color:green">s///</span> command can also be used with addresses to control what lines it will be applied to, as follows:
In this series, we will be using GNU sed. Some (but very few) of the most advanced examples you'll find in my upcoming, follow-on articles in this series will not work with GNU sed 3.02 or 3.02a and will require a modern version. If you're using a non-GNU sed, your results may vary. Why not take some time to install GNU sed now (see [[#Resources|Resources]] for source code)? Then, not only will you be ready for the rest of the series, but you'll also be able to use arguably the best sed in existence!


=== Sed examples ===
<console>$##i## sed -e '1,10s/enchantment/entrapment/g' myfile2.txt</console>
Sed works by performing any number of user-specified editing operations ("commands") on the input data. Sed is line-based, so the commands are performed on each line in order. And, sed writes its results to standard output (stdout); it doesn't modify any input files.


Let's look at some examples. The first several are going to be a bit weird because I'm using them to illustrate how sed works rather than to perform any useful task. However, if you're new to sed, it's very important that you understand them. Here's our first example:
The above example will cause all occurrences of the phrase 'enchantment' to be replaced with the phrase 'entrapment', but only on lines one through ten, inclusive.


<console>$##i## sed -e 'd' /etc/services</console>
<console>$##i## sed -e '/^$/,/^END/s/hills/mountains/g' myfile3.txt</console>


If you type this command, you'll get absolutely no output. Now, what happened? In this example, we called sed with one editing command, <span style="color:green">d</span>. Sed opened the '''/etc/services''' file, read a line into its pattern buffer, and performed our editing command ("delete line"), which emptied the pattern buffer and left nothing to print. It then repeated these steps for each successive line. This produced no output, because the <span style="color:green">d</span> command zapped every single line in the pattern buffer!
This example will swap 'hills' for 'mountains', but only on blocks of text beginning with a blank line, and ending with a line beginning with the three characters 'END', inclusive.


There are a couple of things to notice in this example. First, '''/etc/services''' was not modified at all. This is because, again, sed only reads from the file you specify on the command line, using it as input -- it doesn't try to modify the file. The second thing to notice is that sed is line-oriented. The <span style="color:green">d</span> command didn't simply tell sed to delete all incoming data in one fell swoop. Instead, sed read each line of /etc/services one by one into its internal buffer, called the pattern buffer. Once a line was read into the pattern buffer, it performed the <span style="color:green">d</span> command, which discarded the line before anything could be printed. Later, I'll show you how to use address ranges to control which lines a command is applied to -- but in the absence of addresses, a command is applied to all lines.
Another nice thing about the <span style="color:green">s///</span> command is that we have a lot of options when it comes to those <span style="color:green">/</span> separators. If we're performing string substitution and the regular expression or replacement string has a lot of slashes in it, we can change the separator by specifying a different character after the 's'. For example, this will replace all occurrences of '''/usr/local''' with '''/usr''':


The third thing to notice is the use of single quotes to surround the d command. It's a good idea to get into the habit of using single quotes to surround your sed commands, so that shell expansion is disabled.
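If you would rather not experiment on '''/etc/services''', here is a minimal sketch of the same behavior on a throwaway file. The file contents and the variable names (scratch, output, remaining) are invented for illustration:

```shell
# Create a scratch file with three invented lines.
scratch=$(mktemp)
printf 'one\ntwo\nthree\n' > "$scratch"

# 'd' with no address deletes every line, so sed prints nothing.
output=$(sed -e 'd' "$scratch")

# The input file itself is untouched: sed only read from it.
remaining=$(cat "$scratch")
rm -f "$scratch"

printf 'sed printed: [%s]\n' "$output"
printf 'file still holds:\n%s\n' "$remaining"
```

The single quotes around the d command keep the shell from interpreting anything inside them, which matters more once the commands contain characters such as '$' or '*'.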
<console>$##i## sed -e 's:/usr/local:/usr:g' mylist.txt</console>


=== Another sed example ===
{{note|In this example, we're using the colon as a separator. If you ever need to specify the separator character in the regular expression, put a backslash before it.}}
Here's an example of how to use sed to remove the first line of the '''/etc/services''' file from our output stream:


<console>$##i## sed -e '1d' /etc/services | more</console>
=== Regexp snafus ===
Up until now, we've only performed simple string substitution. While this is handy, we can also match a regular expression. For example, the following sed command will match a phrase beginning with '<' and ending with '>', and containing any number of characters in between. This phrase will be deleted (replaced with an empty string):


As you can see, this command is very similar to our first <span style="color:green">d</span> command, except that it is preceded by a 1. If you guessed that the 1 refers to line number one, you're right. While in our first example, we used d by itself, this time we use the <span style="color:green">d</span> command preceded by an optional numerical address. By using addresses, you can tell sed to perform edits only on a particular line or lines.
<console>$##i## sed -e 's/<.*>//g' myfile.html</console>


=== Address ranges ===
This is a good first attempt at a sed script that will remove HTML tags from a file, but it won't work well, due to a regular expression quirk. The reason? When sed tries to match the regular expression on a line, it finds the longest match on the line. This wasn't an issue in my previous sed article, because we were using the d and p commands, which would delete or print the entire line anyway. But when we use the s/// command, it definitely makes a big difference, because the entire portion that the regular expression matches will be replaced with the target string, or in this case, deleted. This means that the above example will turn the following line:
Now, let's look at how to specify an address range. In this example, sed will delete lines 1-10 of the output:
<pre>
<b>This</b> is what <b>I</b> meant.
</pre>
Into this:
<pre>
meant.
</pre>
Rather than this, which is what we wanted to do:
<pre>
This is what I meant.
</pre>
Fortunately, there is an easy way to fix this. Instead of typing in a regular expression that says "a '<' character followed by any number of characters, and ending with a '>' character", we just need to type in a regexp that says "a '<' character followed by any number of non-'>' characters, and ending with a '>' character". Because the match can no longer run past the first '>', each tag is matched on its own, rather than everything from the first '<' to the last '>' on the line. The new command looks like this:


<console>$##i## sed -e '1,10d' /etc/services | more</console>
<console>$##i## sed -e 's/<[^>]*>//g' myfile.html</console>


When we separate two addresses by a comma, sed will apply the following command to the range that starts with the first address, and ends with the second address. In this example, the <span style="color:green">d</span> command was applied to lines 1-10, inclusive. All other lines were ignored.
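A quick way to convince yourself how numeric addresses behave is to run them against a few numbered lines. This is only a sketch; the five-line sample and the variable names are made up for the demonstration:

```shell
# Invented five-line sample.
sample=$(printf 'line 1\nline 2\nline 3\nline 4\nline 5')

# A single address: delete only line 1.
without_first=$(printf '%s\n' "$sample" | sed -e '1d')

# An address range: delete lines 2 through 4, inclusive.
without_middle=$(printf '%s\n' "$sample" | sed -e '2,4d')

printf '%s\n---\n%s\n' "$without_first" "$without_middle"
```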
In the above example, the '[^>]' specifies a "non-'>'" character, and the '*' after it completes this expression to mean "zero or more non-'>' characters". Test this command on a few sample html files, pipe them to more, and review their results.


=== Addresses with regular expressions ===
=== More character matching ===
Now, it's time for a more useful example. Let's say you wanted to view the contents of your '''/etc/services''' file, but you aren't interested in viewing any of the included comments. As you know, you can place comments in your '''/etc/services''' file by starting the line with the '#' character. To avoid comments, we'd like sed to delete lines that start with a '#'. Here's how to do it:
The '[ ]' regular expression syntax has some additional options. To specify a range of characters, you can use a '-' as long as it isn't in the first or last position, as follows:
 
<pre>
<console>$##i## sed -e '/^#/d' /etc/services | more</console>
'[a-x]*'
 
</pre>
Try this example and see what happens. You'll notice that sed performs its desired task with flying colors. Now, let's figure out what happened.
This will match zero or more characters, as long as all of them are 'a','b','c'...'v','w','x'. In addition, the '[:space:]' character class is available for matching whitespace. Here's a fairly complete list of available character classes:
 
To understand the '/^#/d' command, we first need to dissect it. First, let's remove the 'd' -- we're using the same delete line command that we've used previously. The new addition is the '/^#/' part, which is a new kind of regular expression address. Regular expression addresses are always surrounded by slashes. They specify a pattern, and the command that immediately follows a regular expression address will only be applied to a line if it happens to match this particular pattern.
 
So, '/^#/' is a regular expression. But what does it do? Obviously, this would be a good time for a regular expression refresher.
 
=== Regular expression refresher ===
We can use regular expressions to express patterns that we may find in the text. If you've ever used the '*' character on the shell command line, you've used something that's similar, but not identical to, regular expressions. Here are the special characters that you can use in regular expressions:
{| border=1
{| border=1
!'''Character class'''
!'''Description'''
|-
|-
|'''Character'''
|<nowiki>[:alnum:]</nowiki>
|'''Description'''
|<nowiki>Alphanumeric [a-z A-Z 0-9]</nowiki>
|-
|-
|^
|<nowiki>[:alpha:]</nowiki>
|Matches the beginning of the line
|<nowiki>Alphabetic [a-z A-Z]</nowiki>
|-
|-
|$
|<nowiki>[:blank:]</nowiki>
|Matches the end of the line
|Spaces or tabs
|-
|-
|.
|<nowiki>[:cntrl:]</nowiki>
|Matches any single character
|Any control characters
|-
|-
|*
|<nowiki>[:digit:]</nowiki>
|Will match zero or more occurrences of the previous character
|<nowiki>Numeric digits [0-9]</nowiki>
|-
|-
|[ ]
|<nowiki>[:graph:]</nowiki>
|Matches any single one of the characters inside the [ ]
|Any visible characters (no whitespace)
|}
 
Probably the best way to get your feet wet with regular expressions is to see a few examples. All of these examples will be accepted by sed as valid addresses to appear on the left side of a command. Here are a few:
{| border=1
|-
|'''Regular expression'''
|'''Description'''
|-
|/./
|Will match any line that contains at least one character
|-
|/../
|Will match any line that contains at least two characters
|-
|-
|/^#/
|<nowiki>[:lower:]</nowiki>
|Will match any line that begins with a '#'
|<nowiki>Lower-case [a-z]</nowiki>
|-
|-
|/^$/
|<nowiki>[:print:]</nowiki>
|Will match all blank lines
|Non-control characters
|-
|-
|/}$/
|<nowiki>[:punct:]</nowiki>
|Will match any line that ends with '}' (no spaces)
|Punctuation characters
|-
|-
|/} *$/
|<nowiki>[:space:]</nowiki>
|Will match any line ending with '}' followed by zero or more spaces
|Whitespace
|-
|-
|/[abc]/
|<nowiki>[:upper:]</nowiki>
|Will match any line that contains a lowercase 'a', 'b', or 'c'
|<nowiki>Upper-case [A-Z]</nowiki>
|-
|-
|/^[abc]/
|<nowiki>[:xdigit:]</nowiki>
|Will match any line that begins with an 'a', 'b', or 'c'
|<nowiki>hex digits [0-9 a-f A-F]</nowiki>
|}
|}
I encourage you to try several of these examples. Take some time to get familiar with regular expressions, and try a few regular expressions of your own creation. You can use a regexp this way:
It's advantageous to use character classes whenever possible, because they adapt better to non-English-speaking locales (including accented characters when necessary, etc.).
 
=== Advanced substitution stuff ===
We've looked at how to perform simple and even reasonably complex straight substitutions, but sed can do even more. We can actually refer to either parts of or the entire matched regular expression, and use these parts to construct the replacement string. As an example, let's say you were replying to a message. The following example would prefix each line with the phrase "ralph said: ":
 
<console>$##i## sed -e 's/.*/ralph said: &/' origmsg.txt</console>


<console>$##i## sed -e '/regexp/d' /path/to/my/test/file | more</console>
The output will look like this:
<pre>
ralph said: Hiya Jim,
ralph said:
ralph said: I sure like this sed stuff!
ralph said:
</pre>
In this example, we use the '&' character in the replacement string, which tells sed to insert the entire matched regular expression. So, whatever was matched by '.*' (the largest group of zero or more characters on the line, or the entire line) can be inserted anywhere in the replacement string, even multiple times. This is great, but sed is even more powerful.


This will cause sed to delete any matching lines. However, it may be easier to get familiar with regular expressions by telling sed to print regexp matches, and delete non-matches, rather than the other way around. This can be done with the following command:
=== Those wonderful backslashed parentheses ===
Even better than '&', the <span style="color:green">s///</span> command allows us to define regions in our regular expression, and we can refer to these specific regions in our replacement string. As an example, let's say we have a file that contains the following text:
<pre>
foo bar oni
eeny meeny miny
larry curly moe
jimmy the weasel
</pre>
Now, let's say we wanted to write a sed script that would replace "eeny meeny miny" with "Victor eeny-meeny Von miny", etc. To do this, first we would write a regular expression that would match the three strings, separated by spaces:
<pre>
'.* .* .*'
</pre>
There. Now, we will define regions by inserting backslashed parentheses around each region of interest:
<pre>
'\(.*\) \(.*\) \(.*\)'
</pre>
This regular expression will work the same as our first one, except that it will define three logical regions that we can refer to in our replacement string. Here's the final script:


<console>$##i## sed -n -e '/regexp/p' /path/to/my/test/file | more</console>
<console>$##i## sed -e 's/\(.*\) \(.*\) \(.*\)/Victor \1-\2 Von \3/' myfile.txt</console>


Note the new '-n' option, which tells sed to not print the pattern space unless explicitly commanded to do so. You'll also notice that we've replaced the <span style="color:green">d</span> command with the <span style="color:green">p</span> command, which as you might guess, explicitly commands sed to print the pattern space. Voila, now only matches will be printed.
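To compare the "delete matches" and "print only matches" approaches side by side, here is a small sketch using a regexp address. The four sample lines merely imitate the style of '''/etc/services''' and are invented, as are the variable names:

```shell
# Invented sample in the style of /etc/services.
sample=$(printf '# services file\nftp 21/tcp\n# remote login\nssh 22/tcp')

# Delete lines that start with '#'; everything else is printed as usual.
without_comments=$(printf '%s\n' "$sample" | sed -e '/^#/d')

# With -n nothing is printed by default, so 'p' shows only the matches.
only_comments=$(printf '%s\n' "$sample" | sed -n -e '/^#/p')

printf '%s\n---\n%s\n' "$without_comments" "$only_comments"
```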
As you can see, we refer to each parentheses-delimited region by typing '\x', where x is the number of the region, starting at one. Output is as follows:
<pre>
Victor foo-bar Von oni
Victor eeny-meeny Von miny
Victor larry-curly Von moe
Victor jimmy-the Von weasel
</pre>
As you become more familiar with sed, you will be able to perform fairly powerful text processing with a minimum of effort. You may want to think about how you'd have approached this problem using your favorite scripting language -- could you have easily fit the solution in one line?
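One thing worth trying is what happens when a line has more than three words. The sketch below reuses the Victor script and adds two variations of my own: a four-word input (invented) and a word-swapping command that uses '[^ ]*' instead of '.*'. The variable names are made up as well:

```shell
# The script from the article, applied to a three-word line.
three=$(printf 'eeny meeny miny\n' | sed -e 's/\(.*\) \(.*\) \(.*\)/Victor \1-\2 Von \3/')

# With four words, the first '.*' is greedy and swallows "a b".
four=$(printf 'a b c d\n' | sed -e 's/\(.*\) \(.*\) \(.*\)/Victor \1-\2 Von \3/')

# '[^ ]*' cannot cross a space, so each region holds exactly one word.
swapped=$(printf 'hello world\n' | sed -e 's/\([^ ]*\) \([^ ]*\)/\2 \1/')

printf '%s\n' "$three" "$four" "$swapped"
```

The same longest-match behavior that caused trouble with HTML tags applies inside each backslashed region, so a negated character class is often the safer choice when the regions must stay within word boundaries.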


=== More on addresses ===
=== Mixing things up ===
Up till now, we've taken a look at line addresses, line range addresses, and regexp addresses. But there are even more possibilities. We can specify two regular expressions separated by a comma, and sed will match all lines starting from the first line that matches the first regular expression, up to and including the line that matches the second regular expression. For example, the following command will print out a block of text that begins with a line containing "BEGIN", and ending with a line that contains "END":
As we begin creating more complex sed scripts, we need the ability to enter more than one command. There are several ways to do this. First, we can use semicolons between the commands. For example, this series of commands uses the '=' command, which tells sed to print the line number, as well as the p command, which explicitly tells sed to print the line (since we're in '-n' mode):


<console>$##i## sed -n -e '/BEGIN/,/END/p' /my/test/file | more</console>
<console>$##i## sed -n -e '=;p' myfile.txt</console>


If "BEGIN" isn't found, no data will be printed. And, if "BEGIN" is found, but no "END" is found on any line below it, all subsequent lines will be printed. This happens because of sed's stream-oriented nature -- it doesn't know whether or not an "END" will appear.
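The three cases for a regexp range (both markers present, no closing marker, no opening marker) can be sketched with a few invented lines. The sample text and variable names are assumptions made for this demonstration:

```shell
# Both BEGIN and END are present: the block is printed, markers included.
closed=$(printf 'intro\nBEGIN\nbody\nEND\noutro\n' | sed -n -e '/BEGIN/,/END/p')

# No END after BEGIN: everything from BEGIN to the end of input is printed.
unclosed=$(printf 'intro\nBEGIN\nbody\noutro\n' | sed -n -e '/BEGIN/,/END/p')

# No BEGIN at all: the range never opens, so nothing is printed.
missing=$(printf 'intro\nbody\n' | sed -n -e '/BEGIN/,/END/p')

printf '%s\n---\n%s\n---\n[%s]\n' "$closed" "$unclosed" "$missing"
```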
Whenever two or more commands are specified, each command is applied (in order) to every line in the file. In the above example, first the '=' command is applied to line 1, and then the p command is applied. Then, sed proceeds to line 2, and repeats the process. While the semicolon is handy, there are instances where it won't work. Another alternative is to use two -e options to specify two separate commands:


=== C source example ===
<console>$##i## sed -n -e '=' -e 'p' myfile.txt</console>
If you want to print out only the main() function in a C source file, you could type:


<console>$##i## sed -n -e '/main[[:space:]]*(/,/^}/p' sourcefile.c | more</console>
However, when we get to the more complex append and insert commands, even multiple '-e' options won't help us. For complex multiline scripts, the best way is to put your commands in a separate file. Then, reference this script file with the -f option:


This command has two regular expressions, <nowiki>'/main[[:space:]]*(/' and '/^}/'</nowiki>, and one command, <span style="color:green">p</span>. The first regular expression will match the string "main" followed by any number of spaces or tabs, followed by an open parenthesis. This should match the start of your average ANSI C main() declaration.
<console>$##i## sed -n -f mycommands.sed myfile.txt</console>


<nowiki>In this particular regular expression, we encounter the '[[:space:]]' character class. This is simply a special keyword that tells sed to match a whitespace character such as a TAB or a space. If you wanted, instead of typing '[[:space:]]', you could have typed '[', then a literal space, then Control-V, then a literal tab and a ']'. The Control-V tells bash that you want to insert a "real" tab rather than trigger tab completion. It's clearer, especially in scripts, to use the '[[:space:]]' character class.</nowiki>
This method, although arguably less convenient, will always work.
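The three ways of passing more than one command can be compared directly. In this sketch the two-line input, the temporary script file, and the variable names are all invented; the script file simply holds one command per line:

```shell
# 1. Semicolon between the commands.
with_semicolon=$(printf 'alpha\nbeta\n' | sed -n -e '=;p')

# 2. One -e option per command.
with_two_e=$(printf 'alpha\nbeta\n' | sed -n -e '=' -e 'p')

# 3. Commands stored in a script file, one per line, read with -f.
script=$(mktemp)
printf '=\np\n' > "$script"
with_file=$(printf 'alpha\nbeta\n' | sed -n -f "$script")
rm -f "$script"

printf '%s\n' "$with_file"
```

All three variables should end up holding the same text: each line number followed by the line itself.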


OK, now on to the second regexp. '/^}/' will match a '}' character that appears at the beginning of a new line. If your code is formatted nicely, this will match the closing brace of your main() function. If it's not, it won't -- one of the tricky things about performing pattern matching.
=== Multiple commands for one address ===
Sometimes, you may want to specify multiple commands that will apply to a single address. This comes in especially handy when you are performing lots of s/// to transform words or syntax in the source file. To perform multiple commands per address, enter your sed commands in a file, and use the '{ }' characters to group commands, as follows:
<pre>
1,20{
        s/[Ll]inux/GNU\/Linux/g
        s/samba/Samba/g
        s/posix/POSIX/g
}
</pre>
The above example will apply three substitution commands to lines 1 through 20, inclusive. You can also use regular expression addresses, or a combination of the two:
<pre>
1,/^END/{
        s/[Ll]inux/GNU\/Linux/g
        s/samba/Samba/g
        s/posix/POSIX/g
        p
}
</pre>
This example will apply all the commands between '{ }' to the lines starting at 1 and up to a line beginning with the letters "END", or the end of file if "END" is not found in the source file.
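Here is a small, self-contained sketch of grouping in action. It uses a shortened version of the script (two substitutions, lines 1 through 2 only) so that the effect of the address is visible on a three-line sample. The sample text, the temporary file, and the variable names are assumptions made for the demonstration:

```shell
# Write a grouped script to a temporary file.
script=$(mktemp)
cat > "$script" <<'EOF'
1,2{
        s/[Ll]inux/GNU\/Linux/g
        s/samba/Samba/g
}
EOF

# Lines 1 and 2 fall inside the address; line 3 is left alone.
grouped=$(printf 'linux and samba\nLinux\nlinux and samba\n' | sed -f "$script")
rm -f "$script"

printf '%s\n' "$grouped"
```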


The <span style="color:green">p</span> command does what it always does, explicitly telling sed to print out the line, since we are in '-n' quiet mode. Try running the command on a C source file -- it should output the entire main() { } block, including the initial "main()" and the closing '}'.
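If you don't have a C source file handy, the following sketch builds a tiny one in a variable and runs the same command on it. The C fragment and the variable names are invented for illustration, and the closing brace is deliberately placed at the start of a line so that '/^}/' can find it:

```shell
# Invented C fragment with a declaration before and after main().
source_text=$(printf 'int helper;\nint main (void)\n{\n    return 0;\n}\nint trailer;')

# Print from the line containing "main (" through the first line starting with '}'.
main_only=$(printf '%s\n' "$source_text" | sed -n -e '/main[[:space:]]*(/,/^}/p')

printf '%s\n' "$main_only"
```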
=== Append, insert, and change line ===
Now that we're writing sed scripts in separate files, we can take advantage of the append, insert, and change line commands. These commands will insert a line after the current line, insert a line before the current line, or replace the current line in the pattern space. They can also be used to insert multiple lines into the output. The insert line command is used as follows:
<pre>
i\
This line will be inserted before each line
</pre>
If you don't specify an address for this command, it will be applied to each line and produce output that looks like this:
<pre>
This line will be inserted before each line
line 1 here
This line will be inserted before each line
line 2 here
This line will be inserted before each line
line 3 here
This line will be inserted before each line
line 4 here
</pre>
If you'd like to insert multiple lines before the current line, you can add additional lines by appending a backslash to the previous line, like so:
<pre>
i\
insert this line\
and this one\
and this one\
and, uh, this one too.
</pre>
The append command works similarly, but will insert a line or lines after the current line in the pattern space. It's used as follows:
<pre>
a\
insert this line after each line.  Thanks! :)
</pre>
On the other hand, the "change line" command will actually replace the current line in the pattern space, and is used as follows:
<pre>
c\
You're history, original line! Muhahaha!
</pre>
Because the append, insert, and change line commands need to be entered on multiple lines, you'll want to type them into sed script files and tell sed to read them by using the '-f' option. Using the other methods to pass commands to sed will result in problems.
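To see all three commands working together, here is a sketch that puts them in a temporary script file and gives each one a line-number address so the effects don't pile up on every line. The addresses, the sample input, and the variable names are my own choices for the demonstration:

```shell
# Script file: insert before line 2, append after line 2, replace line 3.
script=$(mktemp)
cat > "$script" <<'EOF'
2i\
inserted before line 2
2a\
appended after line 2
3c\
line 3 was replaced
EOF

edited=$(printf 'line 1\nline 2\nline 3\n' | sed -f "$script")
rm -f "$script"

printf '%s\n' "$edited"
```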


=== Next time ===
=== Next time ===
Now that we've touched on the basics, we'll be picking up the pace for the next two articles. If you're in the mood for some meatier sed material, be patient -- it's coming! In the meantime, you might want to check out the following sed and regular expression resources.
Next time, in the final article of this series on sed, I'll show you lots of excellent real-world examples of using sed for many different kinds of tasks. Not only will I show you what the scripts do, but why they do what they do. After you're done, you'll have additional excellent ideas of how to use sed in your various projects. I'll see you then!


== Resources ==
== Resources ==
* Read Daniel's other sed articles: Sed by Example, [[Sed by Example, Part 2|Part 2]] and [[Sed by Example, Part 3|Part 3]].
* Read Daniel's other sed articles: Sed by Example, [[Sed by Example, Part 1|Part 1]] and [[Sed by Example, Part 3|Part 3]].
* Check out Eric Pement's excellent [http://sed.sourceforge.net/sedfaq.html sed FAQ].
* Check out Eric Pement's excellent [http://sed.sourceforge.net/sedfaq.html sed FAQ].
* You can find the sources to sed at ftp://ftp.gnu.org/pub/gnu/sed.
* You can find the sources to sed at ftp://ftp.gnu.org/pub/gnu/sed.
Line 157: Line 234:
[[Category:Linux Core Concepts]]
[[Category:Linux Core Concepts]]
[[Category:Articles]]
[[Category:Articles]]
[[Category:Featured]]
{{ArticleFooter}}
{{ArticleFooter}}

Latest revision as of 08:48, December 28, 2014

How to further take advantage of the UNIX text editor

Substitution!

Let's look at one of sed's most useful commands, the substitution command. Using it, we can replace a particular string or matched regular expression with another string. Here's an example of the most basic use of this command:

user $ sed -e 's/foo/bar/' myfile.txt

The above command will output the contents of myfile.txt to stdout, with the first occurrence of 'foo' (if any) on each line replaced with the string 'bar'. Please note that I said first occurrence on each line, though this is normally not what you want. Normally, when I do a string replacement, I want to perform it globally. That is, I want to replace all occurrences on every line, as follows:

user $ sed -e 's/foo/bar/g' myfile.txt

The additional 'g' option after the last slash tells sed to perform a global replace.
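To see the difference for yourself, here is a minimal sketch that feeds a one-line sample through both forms. The sample text and the variable names (first_only, every_match) are made up for illustration and are not taken from myfile.txt:

```shell
# Sample input is invented for this demonstration.
sample='foo foo foo'

# Without 'g', only the first match on each line is replaced.
first_only=$(printf '%s\n' "$sample" | sed -e 's/foo/bar/')

# With 'g', every match on each line is replaced.
every_match=$(printf '%s\n' "$sample" | sed -e 's/foo/bar/g')

printf '%s\n' "$first_only" "$every_match"
```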

Here are a few other things you should know about the s/// substitution command. First, it is a command, and a command only; there are no addresses specified in any of the above examples. This means that the s/// command can also be used with addresses to control what lines it will be applied to, as follows:

user $ sed -e '1,10s/enchantment/entrapment/g' myfile2.txt

The above example will cause all occurrences of the phrase 'enchantment' to be replaced with the phrase 'entrapment', but only on lines one through ten, inclusive.

user $ sed -e '/^$/,/^END/s/hills/mountains/g' myfile3.txt

This example will swap 'hills' for 'mountains', but only on blocks of text beginning with a blank line, and ending with a line beginning with the three characters 'END', inclusive.
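Both kinds of address can be tried on a short sample. In this sketch the five-line input and the variable names are invented; the numeric range is shortened to lines 1 through 3 so that its boundary is visible:

```shell
# Invented sample: a blank line opens the block, a line starting with END closes it.
sample=$(printf 'hills\n\nhills\nEND of the hills\nhills')

# Numeric range: only lines 1 through 3 are touched.
by_number=$(printf '%s\n' "$sample" | sed -e '1,3s/hills/mountains/g')

# Regexp range: from the first blank line through the line starting with END.
by_regexp=$(printf '%s\n' "$sample" | sed -e '/^$/,/^END/s/hills/mountains/g')

printf '%s\n---\n%s\n' "$by_number" "$by_regexp"
```

Lines outside the addressed range pass through unchanged, which is why the last 'hills' survives in both results.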

Another nice thing about the s/// command is that we have a lot of options when it comes to those / separators. If we're performing string substitution and the regular expression or replacement string has a lot of slashes in it, we can change the separator by specifying a different character after the 's'. For example, this will replace all occurrences of /usr/local with /usr:

user $ sed -e 's:/usr/local:/usr:g' mylist.txt

Note: In this example, we're using the colon as a separator. If you ever need to specify the separator character in the regular expression, put a backslash before it.
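As a rough comparison, the sketch below runs the same replacement with slashes (every '/' in the pattern escaped) and with colons, then shows a colon escaped inside a colon-delimited command. The sample strings and variable names are invented for illustration:

```shell
sample='PATH=/usr/local/bin:/usr/local/sbin'

# Default separator: every '/' in the pattern and replacement needs a backslash.
with_slashes=$(printf '%s\n' "$sample" | sed -e 's/\/usr\/local/\/usr/g')

# Colon separator: the slashes can be written as they are.
with_colons=$(printf '%s\n' "$sample" | sed -e 's:/usr/local:/usr:g')

# A literal colon in the regexp must be escaped when ':' is the separator.
escaped=$(printf 'key:value\n' | sed -e 's:key\:value:pair:')

printf '%s\n' "$with_slashes" "$with_colons" "$escaped"
```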

Regexp snafus

Up until now, we've only performed simple string substitution. While this is handy, we can also match a regular expression. For example, the following sed command will match a phrase beginning with '<' and ending with '>', and containing any number of characters in between. This phrase will be deleted (replaced with an empty string):

user $ sed -e 's/<.*>//g' myfile.html

This is a good first attempt at a sed script that will remove HTML tags from a file, but it won't work well, due to a regular expression quirk. The reason? When sed tries to match the regular expression on a line, it finds the longest match on the line. This wasn't an issue in my previous sed article, because we were using the d and p commands, which would delete or print the entire line anyway. But when we use the s/// command, it definitely makes a big difference, because the entire portion that the regular expression matches will be replaced with the target string, or in this case, deleted. This means that the above example will turn the following line:

<b>This</b> is what <b>I</b> meant.

Into this:

meant.

Rather than this, which is what we wanted to do:

This is what I meant.

Fortunately, there is an easy way to fix this. Instead of typing in a regular expression that says "a '<' character followed by any number of characters, and ending with a '>' character", we just need to type in a regexp that says "a '<' character followed by any number of non-'>' characters, and ending with a '>' character". Because the match can no longer run past the first '>', each tag is matched on its own, rather than everything from the first '<' to the last '>' on the line. The new command looks like this:

user $ sed -e 's/<[^>]*>//g' myfile.html

In the above example, the '[^>]' specifies a "non-'>'" character, and the '*' after it completes this expression to mean "zero or more non-'>' characters". Test this command on a few sample html files, pipe them to more, and review their results.
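If you don't have an HTML file at hand, the two commands can be compared on the sample line from this section. The variable names are made up; the brackets in the printed output are only there to make leading whitespace visible, since the greedy version keeps the space that preceded "meant.":

```shell
line='<b>This</b> is what <b>I</b> meant.'

# Greedy: '.*' runs from the first '<' to the last '>' on the line.
greedy=$(printf '%s\n' "$line" | sed -e 's/<.*>//g')

# Negated class: each match stops at the first '>' it meets.
fixed=$(printf '%s\n' "$line" | sed -e 's/<[^>]*>//g')

printf '[%s]\n[%s]\n' "$greedy" "$fixed"
```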

More character matching

The '[ ]' regular expression syntax has some additional options. To specify a range of characters, you can use a '-' as long as it isn't in the first or last position, as follows:

'[a-x]*'

This will match zero or more characters, as long as all of them are 'a','b','c'...'v','w','x'. In addition, the '[:space:]' character class is available for matching whitespace. Here's a fairly complete list of available character classes:

Character class   Description
[:alnum:]         Alphanumeric [a-z A-Z 0-9]
[:alpha:]         Alphabetic [a-z A-Z]
[:blank:]         Spaces or tabs
[:cntrl:]         Any control characters
[:digit:]         Numeric digits [0-9]
[:graph:]         Any visible characters (no whitespace)
[:lower:]         Lower-case [a-z]
[:print:]         Non-control characters
[:punct:]         Punctuation characters
[:space:]         Whitespace
[:upper:]         Upper-case [A-Z]
[:xdigit:]        Hex digits [0-9 a-f A-F]

It's advantageous to use character classes whenever possible, because they adapt better to non-English-speaking locales (including accented characters when necessary, etc.).
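Here is a short sketch of two character classes at work. Note that the class name goes inside a second pair of brackets, as in '[[:space:]]'. The sample strings and variable names are invented for illustration:

```shell
# printf turns \t into a real tab, so the input mixes spaces and tabs.
# Each run of one or more whitespace characters becomes a single space.
squeezed=$(printf 'one \t  two\tthree\n' | sed -e 's/[[:space:]][[:space:]]*/ /g')

# Every digit is replaced with '#'.
masked=$(printf 'room 101, floor 3\n' | sed -e 's/[[:digit:]]/#/g')

printf '%s\n' "$squeezed" "$masked"
```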

Advanced substitution stuff

We've looked at how to perform simple and even reasonably complex straight substitutions, but sed can do even more. We can actually refer to the entire matched regular expression, or to parts of it, and use these pieces to construct the replacement string. As an example, let's say you were replying to a message. The following example would prefix each line with the phrase "ralph said: ":

user $ sed -e 's/.*/ralph said: &/' origmsg.txt

The output will look like this:

ralph said: Hiya Jim,
ralph said:
ralph said: I sure like this sed stuff!
ralph said:

In this example, we use the '&' character in the replacement string, which tells sed to insert the entire matched regular expression. So, whatever was matched by '.*' (the largest group of zero or more characters on the line, or the entire line) can be inserted anywhere in the replacement string, even multiple times. This is great, but sed is even more powerful.
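As a small sketch of the "even multiple times" part, the following uses '&' once and then three times in the replacement. The input strings and variable names are only illustrative:

```shell
# '&' stands for whatever the regular expression matched (here, the whole line).
quoted=$(printf 'I sure like this sed stuff!\n' | sed -e 's/.*/ralph said: &/')

# '&' may appear more than once in the replacement string.
tripled=$(printf 'echo\n' | sed -e 's/.*/& & &/')

printf '%s\n' "$quoted" "$tripled"
```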

Those wonderful backslashed parentheses

Even better than '&', the s/// command allows us to define regions in our regular expression, and we can refer to these specific regions in our replacement string. As an example, let's say we have a file that contains the following text:

foo bar oni
eeny meeny miny
larry curly moe
jimmy the weasel

Now, let's say we wanted to write a sed script that would replace "eeny meeny miny" with "Victor eeny-meeny Von miny", etc. To do this, first we would write a regular expression that would match the three strings, separated by spaces:

'.* .* .*'

There. Now, we will define regions by inserting backslashed parentheses around each region of interest:

'\(.*\) \(.*\) \(.*\)'

This regular expression will work the same as our first one, except that it will define three logical regions that we can refer to in our replacement string. Here's the final script:

user $ sed -e 's/\(.*\) \(.*\) \(.*\)/Victor \1-\2 Von \3/' myfile.txt

As you can see, we refer to each parentheses-delimited region by typing '\x', where x is the number of the region, starting at one. Output is as follows:

Victor foo-bar Von oni
Victor eeny-meeny Von miny
Victor larry-curly Von moe
Victor jimmy-the Von weasel

As you become more familiar with sed, you will be able to perform fairly powerful text processing with a minimum of effort. You may want to think about how you'd have approached this problem using your favorite scripting language -- could you have easily fit the solution in one line?
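If you'd like to try the script without creating myfile.txt first, here is a self-contained sketch that feeds the same four lines through a pipe. The second half is an extra illustration of my own (not from the example file): because '.*' is greedy, a line with four words puts the first two words into '\1'.

```shell
# The four sample lines from the text above.
input='foo bar oni
eeny meeny miny
larry curly moe
jimmy the weasel'

result=$(printf '%s\n' "$input" |
    sed -e 's/\(.*\) \(.*\) \(.*\)/Victor \1-\2 Von \3/')
printf '%s\n' "$result"

# Caveat: with four words, the first region grabs as much as it can,
# so \1 becomes "a b" rather than just "a".
four=$(printf 'a b c d\n' |
    sed -e 's/\(.*\) \(.*\) \(.*\)/Victor \1-\2 Von \3/')
printf '%s\n' "$four"
```

If each region should hold exactly one word, a tighter pattern such as '[^ ]*' in place of '.*' is one way to get there.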

Mixing things up

As we begin creating more complex sed scripts, we need the ability to enter more than one command. There are several ways to do this. First, we can use semicolons between the commands. For example, this series of commands uses the '=' command, which tells sed to print the line number, as well as the p command, which explicitly tells sed to print the line (since we're in '-n' mode):

user $ sed -n -e '=;p' myfile.txt

Whenever two or more commands are specified, each command is applied (in order) to every line in the file. In the above example, first the '=' command is applied to line 1, and then the p command is applied. Then, sed proceeds to line 2, and repeats the process. While the semicolon is handy, there are instances where it won't work. Another alternative is to use two -e options to specify two separate commands:

user $ sed -n -e '=' -e 'p' myfile.txt

However, when we get to the more complex append and insert commands, even multiple '-e' options won't help us. For complex multiline scripts, the best way is to put your commands in a separate file. Then, reference this script file with the -f option:

user $ sed -n -f mycommands.sed myfile.txt

This method, although arguably less convenient, will always work.
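The three approaches can be compared side by side with a sketch like the one below. It builds a throwaway two-line myfile.txt and a mycommands.sed in a temporary directory (the directory name pattern and the file contents are assumptions for the demo) and captures the output of each form:

```shell
# Work in a scratch directory so nothing in the current directory is touched.
tmpdir=$(mktemp -d "${TMPDIR:-/tmp}/sedex.XXXXXX")
printf 'alpha\nbeta\n' > "$tmpdir/myfile.txt"

# The script file holds one command per line.
printf '=\np\n' > "$tmpdir/mycommands.sed"

# 1. Semicolon-separated commands.
semi=$(sed -n -e '=;p' "$tmpdir/myfile.txt")
# 2. One -e option per command.
multi=$(sed -n -e '=' -e 'p' "$tmpdir/myfile.txt")
# 3. Commands read from a script file.
fromfile=$(sed -n -f "$tmpdir/mycommands.sed" "$tmpdir/myfile.txt")

printf '%s\n' "$semi"
rm -rf "$tmpdir"
```

For these two simple commands, all three forms should print each line number followed by the line itself.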

Multiple commands for one address

Sometimes, you may want to specify multiple commands that will apply to a single address. This comes in especially handy when you are performing lots of s/// to transform words or syntax in the source file. To perform multiple commands per address, enter your sed commands in a file, and use the '{ }' characters to group commands, as follows:

1,20{
        s/[Ll]inux/GNU\/Linux/g
        s/samba/Samba/g
        s/posix/POSIX/g
}

The above example will apply three substitution commands to lines 1 through 20, inclusive. You can also use regular expression addresses, or a combination of the two:

1,/^END/{
        s/[Ll]inux/GNU\/Linux/g
        s/samba/Samba/g
        s/posix/POSIX/g
        p
}

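This example will apply all the commands between '{ }' to the lines starting at 1 and up to a line beginning with the letters "END", or the end of file if "END" is not found in the source file. Because the group ends with a p command, the lines in that range are printed explicitly; run the script with '-n' if you want only those lines in the output.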
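Here is a runnable sketch of a grouped script with a mixed line-number and regular expression address. The file names fixup.sed and notes.txt and the sample text are made up for the demo, and I've left out the p command so that each line appears only once:

```shell
tmpdir=$(mktemp -d "${TMPDIR:-/tmp}/sedex.XXXXXX")

# The quoted 'EOF' keeps the shell from touching the backslash in the script.
cat > "$tmpdir/fixup.sed" <<'EOF'
1,/^END/{
        s/[Ll]inux/GNU\/Linux/g
        s/samba/Samba/g
}
EOF

# Only the lines up to and including the one starting with "END" are in range.
printf 'linux and samba\nEND of section\nlinux and samba again\n' > "$tmpdir/notes.txt"

grouped=$(sed -f "$tmpdir/fixup.sed" "$tmpdir/notes.txt")
printf '%s\n' "$grouped"
rm -rf "$tmpdir"
```

The third input line falls outside the address range, so it passes through unchanged.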

Append, insert, and change line

Now that we're writing sed scripts in separate files, we can take advantage of the append, insert, and change line commands. These commands will insert a line after the current line, insert a line before the current line, or replace the current line in the pattern space. They can also be used to insert multiple lines into the output. The insert line command is used as follows:

i\
This line will be inserted before each line

If you don't specify an address for this command, it will be applied to each line and produce output that looks like this:

This line will be inserted before each line
line 1 here
This line will be inserted before each line
line 2 here
This line will be inserted before each line
line 3 here
This line will be inserted before each line
line 4 here

If you'd like to insert multiple lines before the current line, you can add additional lines by appending a backslash to the previous line, like so:

i\
insert this line\
and this one\
and this one\
and, uh, this one too.

The append command works similarly, but will insert a line or lines after the current line in the pattern space. It's used as follows:

a\
insert this line after each line.  Thanks! :)

On the other hand, the "change line" command will actually replace the current line in the pattern space, and is used as follows:

c\
You're history, original line! Muhahaha!

Because the append, insert, and change line commands need to be entered on multiple lines, you'll want to type them into sed script files and tell sed to read them by using the '-f' option. Passing these commands to sed by the other methods can result in problems.
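To tie the three commands together, here is a sketch that gives each one a line-number address so the effect is easy to follow. The file names edit.sed and lines.txt and the text being inserted are assumptions for the demo:

```shell
tmpdir=$(mktemp -d "${TMPDIR:-/tmp}/sedex.XXXXXX")

# One command per address: insert before line 1, append after line 2,
# and replace line 3 outright. The quoted 'EOF' preserves the backslashes.
cat > "$tmpdir/edit.sed" <<'EOF'
1i\
inserted before line 1
2a\
appended after line 2
3c\
line 3 was replaced
EOF

printf 'line 1 here\nline 2 here\nline 3 here\n' > "$tmpdir/lines.txt"

edited=$(sed -f "$tmpdir/edit.sed" "$tmpdir/lines.txt")
printf '%s\n' "$edited"
rm -rf "$tmpdir"
```

Without the addresses, each command would apply to every line, as in the insert example shown earlier.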

Next time

Next time, in the final article of this series on sed, I'll show you lots of excellent real-world examples of using sed for many different kinds of tasks. Not only will I show you what the scripts do, but why they do what they do. After you're done, you'll have additional excellent ideas of how to use sed in your various projects. I'll see you then!

Resources

   Tip

Read the next article in this series: Sed by Example, Part 3
