What are regular expressions?

A regular expression (RE) is a pattern-matching tool. A pattern can be made up of any characters: letters, digits and non-alphanumeric characters such as punctuation.

Patterns can be used to match certain sequences of characters, rather like a child's shape sorter. OK, but what does this actually mean?

When doing file manipulations in an earlier chapter, we used patterns. For example, the splat ( * ) is a pattern that matches 0 or more characters. Thus:

ls *
            

matched filenames of 0 or more characters.

The splat is a pattern matching either none (0) characters, or 1 character, or 1 character followed by another character (i.e. 2 characters) or 1 character followed by another and another (i.e. 3 characters) etc., irrespective of what those characters are.

Thus, we've had a glimpse of patterns previously. However, RE patterns are much more versatile (and complex), and we want to look at them in detail.

The fullstop

The first of our pattern sequences: a fullstop (or a period as the Americans call it) matches any character.

We might say:

sed 's/Linu./LINUX/g' bazaar.txt
                

This sed expression means:

s (search for) / Linu.  / (replace with) LINUX / g (globally) <filename to search>
 ----------^-------^-----------------^---------
                

Looking at the command in detail: the pattern 'Linu.' says match any set of characters that begins with an uppercase 'L', followed by an 'i', an 'n' and a 'u', followed by any other single character. The fullstop matches the "any other character". In the file bazaar.txt the following strings appear that would match this pattern:

Linus
Linux
                

The pattern we used in the sed above, will match occurrences of Linux and Linus.

Using the fullstop in place of 's' or 'x' ensures that these two strings are matched. However, the pattern will also match:

Linup
                

Why? Because it matches the pattern 'Linu' followed by any single character.

[Important] Important

The fullstop in regular expression terms matches any single character.
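A quick way to convince yourself of this is to feed sed a few lines by hand. The following is only a minimal sketch: the sample lines are made up for illustration and stand in for bazaar.txt.

```shell
# Made-up sample lines standing in for bazaar.txt.
dot_demo=$(printf '%s\n' 'Linus wrote Linux' 'Linup' 'Linu' |
    sed 's/Linu./LINUX/g')

# Show the result of the substitution.
printf '%s\n' "$dot_demo"
```

The last sample line, 'Linu', is left alone because there is no character after the 'u' for the fullstop to match.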

Pipe the sed expression through nl, and look at line 9 ... "Linus Torvalds" has been changed to "LINUX Torvalds".

sed 's/Linu./LINUX/g' bazaar.txt | nl
                

Let's explore "sed" syntax

Sed is an acronym for "Stream Editor". The "stream", in our example above, comes from the file bazaar.txt.

Besides the input stream, sed must also be given a command, usually combined with a pattern.

SYNTAX:
sed [command] / pattern / [replace sequence]  / [modifier] [command]
                

In this case our command is 's' for substitute (search and replace), while the pattern we're searching for is enclosed in forward slashes (forward slashes are not strictly required, but we're not going to complicate matters right now). After the second forward slash, we have a replace sequence.

sed will search for a pattern (Linu.) and on finding it, will replace it with the replace sequence (LINUX).

Finally we have a modifier in this case a 'g' meaning "globally" i.e. search for a pattern and replace it as many times as you find it on a line. If there were 10 instances of 'Linu<any character>' on a line, it would replace all occurrences.

Since sed is a stream editor, it considers each line in the file bazaar.txt independently (in essence, "finish processing this line, then get the next line from the input file"). The stream ends when the end of the file is reached. Thus the "globally" modifier only operates on the current line under consideration.

If we just wanted to replace only the second instance and not the first or the third, etc. we could replace the g with a 2. sed would then only replace the second instance of the desired pattern. As you go through this chapter, you will become friendly with sed, and work with many patterns.
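The difference between no modifier, 'g' and a number is easiest to see side by side. This is a small sketch using a made-up line rather than bazaar.txt:

```shell
# A made-up line containing three matches for the pattern Linu.
line='Linus Linux Linup'

# No modifier: only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/Linu./LINUX/')

# g: every match on the line is replaced.
every_one=$(printf '%s\n' "$line" | sed 's/Linu./LINUX/g')

# 2: only the second match on the line is replaced.
second_only=$(printf '%s\n' "$line" | sed 's/Linu./LINUX/2')

printf '%s\n' "$first_only" "$every_one" "$second_only"
```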

To Summarise: a fullstop (.) as a regular expression matches any single character.

Exercises

Using sed and the bazaar.txt file, write regular expressions to match the following:

  1. Any word containing the letters "inu" in order. Thus, your RE should match Linux , Linus, linux and linus.

  2. Match only 5 letter words.

  3. Write a RE to match only words with an even number of letters up to a maximum of 10 letters.

  4. Replace all the words 'the' with the word "ETH" in the file bazaar.txt

Challenge sequence

Without redirecting the output to a file, or moving the resulting file, get sed to automatically modify the file bazaar.txt - i.e. edit the original file. (Hint: RMMG)

Square brackets ( [ ] ), the caret ( ^ ) and the dollar ( $ )

Square brackets mean a choice or range of characters. If we tried [abc] (you should remember this from earlier), it means match a single character which is either an 'a' or a 'b' or a 'c'.

A caret ( ^ ) matches a start of line and the dollar ( $ ) the end of the line.

Now I'm going to use these together to create more complex RE's. We're going to write a sed expression that's going to match lines (not search and replace as before) that begin with 'a', 'e' or 'I' and print ( p ) them.

sed '/^[aeI]/p' bazaar.txt
                

You will notice that, whereas before we were doing a search-replace, this time around we're doing a pattern match and merely printing the matched lines.

This regular expression would match lines that begin with either 'a' or 'e' or 'I'. Now, you'll notice when we ran this command, the lines that begin with 'a', 'e' or 'I' are printed twice while every non-matching line is printed only once. sed parsed the entire file, line by line, and each time it matched a line that began with 'a', 'e' or 'I', the line was printed an extra time (which is why those lines were duplicated). In our example we can see that line 6 begins with an 'I', hence a match:

I believe that most important software....
                

Similarly, line 8 is also printed twice:

editor) needed to be built....
                

How would we match both 'e' and 'E'? Simply include 'E' in the pattern so that the RE becomes:

sed '/^[aeEI]/p' bazaar.txt
                

This time if you run it, you will notice that line 16 is also matched:

Extract taken from....
                

We've seen two things:

  1. that [ ] match a range or choice of characters, and

  2. that the caret matches the start of a line.

Now what makes this slightly more complex is that if this caret appears inside the square brackets, its meaning is altered.

Examine the following Regular Expression:

sed '/^[^aeEI]/p' bazaar.txt
                

This causes every line that does NOT begin with 'a', 'e', 'E' or 'I' to be printed. What's happening here? Well, the caret inside the square bracket means "do not match".

The caret outside the square bracket says match the start of the line, the caret inside the square bracket says do not match a, e, E or I. Reading this RE left to right:

"any line that starts with NOT an 'a' or an 'e' or an 'E' or an 'I' - print it".

What happens if we replace the 'p' with a 'd'?

sed '/^[^aeEI]/d'
---------------^
                

means:

"any line that starts with NOT an 'a' or an 'e' or an 'E' or an 'I' - delete it".[4]

Here are the new concepts you've learned:

  1. We've learnt that we can simply match a pattern without doing a search and replace. Previously we used search-and-replace patterns; now we're using matching-only patterns. We do this by starting the expression with a slash, without an 's' preceding it. In this case, we operated on the matching lines first by printing them and then by deleting them. In essence: "find the lines that match my pattern and print or delete them".

  2. Secondly, a caret outside a square bracket means "start of line", while a caret as the first character inside a square bracket means "invert the pattern" or more commonly "do NOT match these characters".
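To see printing and deleting next to each other, here is a rough sketch on a few made-up lines. It also uses sed's -n option, which is not covered in the text above: -n is a standard option that switches off sed's automatic printing of every line, so that 'p' shows each match only once.

```shell
# Made-up sample lines; only some begin with a, e, E or I.
sample=$(printf '%s\n' 'apple' 'banana' 'Ice' 'egg')

# p without -n: matching lines appear twice, the rest once.
doubled=$(printf '%s\n' "$sample" | sed '/^[aeEI]/p')

# p with -n: only the matching lines appear, once each.
printed=$(printf '%s\n' "$sample" | sed -n '/^[aeEI]/p')

# d with the inverted set: delete lines that begin with anything else.
kept=$(printf '%s\n' "$sample" | sed '/^[^aeEI]/d')

printf '%s\n' "$doubled" '---' "$printed" '---' "$kept"
```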

Just the same way that the caret means the beginning of the line, the $ means the end of the line. An expression such as:

sed '/[aeEI]$/!d' bazaar.txt
                

means

"don't ( ! ) delete any line that ENDS in either an 'a', an 'e', an 'E' or an 'I'"; in other words, delete every line that does not end in one of those characters.
                

We've used the following expressions:

. any single character
[ ] a range of characters
^ start of line (when outside [ ])
^ do not (when inside [ ])
$ end of line
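The dollar and the "!d" form can be tried out on a handful of made-up lines. This is just a sketch; the words are invented for the example.

```shell
# Made-up sample lines, including one blank line.
sample=$(printf '%s\n' 'tea' 'cup' '' 'kettle' 'HI')

# Keep only lines that END in a, e, E or I; everything else is deleted.
ends_ok=$(printf '%s\n' "$sample" | sed '/[aeEI]$/!d')

printf '%s\n' "$ends_ok"
```

Note that the blank line is deleted too: it does not end in any of the four characters, so the "!d" applies to it.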

Perhaps we want to print only the lines beginning with 'a', 'e', 'E' or 'I'.

How can sed achieve this? We could try,

"delete all lines NOT beginning with an 'a,e,E or I'"

sed '/^[^aeEI]/d' bazaar.txt
                

Bingo. However, it also left a series of blank lines: a blank line has no first character for [^aeEI] to match, so it is never deleted. How do we remove blank lines, leaving only the lines that we are interested in? We could pipe the command into yet another sed, where we could look for blank lines. The pattern for a blank line is:

^$
                

Using sed and pipes

So the following command would delete the unwanted blank lines:

sed '/^[^aeEI]/d' bazaar.txt | sed '/^$/d' 
                

Bingo (again), we end up with the lines we wanted. You might want to pipe this command through nl just to see the line numbers:

sed '/^[^aeEI]/d' bazaar.txt | sed '/^$/d' | nl
                

Notice that the first sed is acting on a file, while the second sed is acting on a stream of lines as output by the initial sed. Well, we could have actually simplified this slightly, because sed can accommodate multiple pattern-command sequences if they are separated by a ';'. Hence, a modified command:

sed '/^[^aeEI]/d;/^$/d' bazaar.txt | nl
 ----------------^----------------------
[ notice the ^ indicating the ; ]
                

These examples illustrate two concepts:

  1. How to put multiple sed commands on the same line,

  2. It is important to optimise your shell scripts.[5] In the first example we were invoking sed twice, which takes time. In the second instance we're invoking sed once, while doing two sets of commands (albeit sequentially), thereby optimising our code and generally making it run quicker.
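As a sanity check that the piped and the combined forms really do the same job, we can run both on a few made-up lines and compare. The third form, with one -e option per command, is another standard way of giving sed several commands; it is not used elsewhere in this text.

```shell
# Made-up sample lines, including two blank ones.
sample=$(printf '%s\n' 'alpha' '' 'beta' 'Echo' '' 'India' 'zulu')

# Two sed processes joined by a pipe.
two_seds=$(printf '%s\n' "$sample" | sed '/^[^aeEI]/d' | sed '/^$/d')

# One sed process, two commands separated by a semicolon.
one_sed=$(printf '%s\n' "$sample" | sed '/^[^aeEI]/d;/^$/d')

# One sed process, each command given with its own -e option.
multi_e=$(printf '%s\n' "$sample" | sed -e '/^[^aeEI]/d' -e '/^$/d')

printf '%s\n' "$one_sed"
```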

By way of reinforcing this, run the following commands:

time sed '/^[^aeEI]/d' bazaar.txt |sed '/^$/d' |nl
time sed '/^[^aeEI]/d;/^$/d' bazaar.txt |nl

[ the 'time' command will time the commands ]
                

This will show the elapsed time in addition to a host of other information about how long this command took to run. In this case, since our RE is so simple and the file we're operating on is so small, the time difference is marginal. If however this were a 100 MB file, invoking sed twice could noticeably slow your script down.

sed is a stream editor, but what's a stream? A stream is just like a river, in which information is flowing. sed is able to edit the stream as it 'flows' past. We've been invoking sed using a file as an argument, however we could alternatively have used sed as part of a pipe :

cat bazaar.txt | sed '/^[^aeEI]/d;/^$/d'
                

This would produce the same results as invoking sed earlier. sed is one of the commands that you should be comfortable using since it can be used in many and varied ways. Now, as part of the pipe, sed reads each line from stdin, applies its commands to the lines that match the pattern, and sends the result on to stdout.

Exercises

  1. Only print lines that DO NOT have the word Linux in them

  2. Remove all blank lines, as well as those that DO NOT have the word Linux in them

  3. Remove any line that begins or ends with a vowel (a,e,i,o,u).

  4. Search for the word "bazaar", only printing lines containing the word. Ensure that you search for both "Bazaar" and "bazaar".

  5. Remove all non-blank lines from the file.

Challenge sequence

Using our bazaar file, print only those lines that end with either a full stop ( . ) or a '?'.

The splat (asterisk) ( * )

The splat (*) matches 0, one or more occurrences OF THE PREVIOUS PATTERN.

Supposing we wanted to match Line or Linux or Linus the pattern:

sed '/Lin.*/p' bazaar.txt	
                

would match lines containing these words.

The splat says "match 0, one or more of the previous pattern (which was the Full-stop, and the full-stop means one occurrence of any single character)".

Let's look at another example:

/a*bc[e-g]*[0-9]*/
                

matches:

aaaaabcfgh19919234
bc
abcefg123456789
abc45
aabcggg87310
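To see exactly which part of each line the pattern grabs, we can wrap the match in square brackets. This sketch relies on the & character, which in a sed replacement stands for "whatever the pattern matched"; & is a standard sed feature that is not covered in this text, and 'xyz' is an extra made-up line that should not match.

```shell
# Some of the strings listed above, plus one that should not match.
strings=$(printf '%s\n' 'aaaaabcfgh19919234' 'bc' 'abc45' 'xyz')

# Put [ ] around the part of each line that the pattern matched.
marked=$(printf '%s\n' "$strings" | sed 's/a*bc[e-g]*[0-9]*/[&]/')

printf '%s\n' "$marked"
```

In the first string the match stops at the 'h': the pattern is not tied to the end of the line, so the line still matches even though its tail is left over.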
                

Let's look at another example:

/.*it.$/
                

matches any number of characters of any kind, followed by an 'i', followed by a 't', followed by any single character and then the end-of-line.

Exercises

Using the file index.html, available.

Match the following RE's

  1. Look for every line beginning with a '<'. Did it give you what you were expecting? Why?

  2. Modify the above RE to give you EVERY line beginning with a '<'. Now is it giving you what you were expecting? If not, have another look at question 1. Linux people may be lazy, but they think a great deal.

  3. I am only interested in the divider HTML code (the code with "<div" in it). Note that although I have asked you for <div, there may be anomalies in its case. It could be <Div or <DiV, etc. Ensure your solution gets all of them.

  4. Look for all references to QED. Number each line you find.

  5. Show all lines that are headings (H1, H2, H3, etc.). Again case may be an issue.

Let's update our list of patterns:

character pattern
. any single character
[ ] a range of characters
^ start of line (when outside [ ])
^ do not (when inside [ ])
$ end of line
* 0 or more of the previous pattern
+ 1 or more of the previous pattern
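\{n\} exactly n of the previous pattern
\{n,\} n or more of the previous pattern
\{n,m\} at least n and at most m of the previous pattern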

The plus operator ( + )

The plus operator will match the preceding pattern 1 or more times. To match the character 'a' or 'b' or 'c', one or more times, we could use:

[abc]+
                

The plus goes after the closing square bracket. Inside the brackets, as in [abc+], it would simply be one more character to choose from.

Perhaps we want to match 19?? in the bazaar.txt file (here we would want to find any year, such as 1986 or 1999):

19[0-9]+
                

To match the character a, one or more times, we would use

a+
                
[Note] Note

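Note that in the previous examples, the plus character itself is not matched, since this ( + ) has special meaning in a RE. This is how the plus behaves in extended regular expressions, as used by egrep (grep -E) and sed -E. In sed's default (basic) syntax the roles are reversed: a bare + is an ordinary character, and GNU sed writes the operator as \+. If we wanted to search for a plus sign (or any of the RE pattern matching tools) in a pattern, we would need to escape the plus sign.

How do we escape characters that are special in RE's? We need to escape them with a backslash ( \ ). Thus, in the extended syntax, to search for the literal text a+ we would use: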

a\+
                

Similarly if we wanted to match a splat ( * ), we would have to match it with:

a\*
                

So, the plus is a special character, which matches one or more of THE PREVIOUS PATTERN.
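The plus can be tried out with sed as well. The sketch below assumes a sed that accepts the -E option (GNU sed and the BSD seds do); -E switches sed to the extended syntax, in which a bare + is the operator and \+ is a literal plus sign. The sample strings are made up.

```shell
# 19 followed by one or more digits, using the extended syntax.
years=$(printf '%s\n' 'In 1986 and 1999' | sed -E 's/19[0-9]+/YEAR/g')

# One or more a's collapse into a single A.
run_of_a=$(printf '%s\n' 'baaad' | sed -E 's/a+/A/')

# Escaped plus: match the literal text "a+".
literal_plus=$(printf '%s\n' 'a+b aab' | sed -E 's/a\+/PLUS/')

# A plus INSIDE square brackets is just one more character in the set.
bracket_set=$(printf '%s\n' 'a+c' | sed 's/[abc+]/X/g')

printf '%s\n' "$years" "$run_of_a" "$literal_plus" "$bracket_set"
```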

Matching a specified number of the pattern using the curly brackets {}

Using {n}, we match exactly that number of the previous expression. If we want to match 'aaaa' then we could use:

a{4}
                    

This would match exactly four a's. If we want to match the pattern 1999 in our file bazaar.txt, then we would do:

sed '/19{3}/p' bazaar.txt
                    

This should print all lines containing the pattern 1999 in the bazaar.txt file.

You will notice that if we try to do this, it doesn't seem to work. This is because we need to escape the curly braces by preceding each one with a backslash.

What if we wanted to match any three-letter word (e.g. fox, bat, cat, car)?

sed '\%\<[a-z][a-z][a-z]\>%p' /usr/share/dict/words
                    

A detour - Using a different field separator in sed pattern matching

I've alluded to this previously, but now here it is in use. While sed will normally use the / as the pattern delimiter, any character can be used instead of /. This is particularly useful when using sed to modify a PATH. For example: supposing we were wanting to search for the pattern:

/home/hamish/some_where
                    

sed could achieve this, but consider how "complex" the RE would be:

'/\/home\/hamish\/some_where/!d'
                    

Confusing? Now rather than using the / as the pattern delimiter, we could use a % sign, simplifying the RE to:

%/home/hamish/some_where%!d
                    

This will only work however, if we escape the initial %, making our sed statement look like this:

\%/home/hamish/some_where%!d
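Here is a small sketch of an alternative delimiter in action, on two made-up paths. The second command shows the same idea in a search-replace, where the character straight after the 's' becomes the delimiter and needs no backslash; /home/other is a hypothetical directory used only for the example.

```shell
# Made-up paths to experiment on.
paths=$(printf '%s\n' '/home/hamish/some_where/a.txt' '/tmp/b.txt')

# Pattern match with % as the delimiter: keep only the matching line.
kept=$(printf '%s\n' "$paths" | sed '\%/home/hamish/some_where%!d')

# Search-replace with | as the delimiter: no escaped slashes needed.
renamed=$(printf '%s\n' "$paths" | sed 's|/home/hamish|/home/other|')

printf '%s\n' "$kept" '---' "$renamed"
```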
                    

Using Word Encapsulating Characters

I have used the word encapsulation characters here (\< and \>) to trap ONLY whole words that are ONLY 3 letters in length. Try

sed 's/.../321/g' bazaar.txt
                    

versus

sed 's/\<...\>/321/g' bazaar.txt
                    

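The word encapsulation characters are written \< and \>. On their own, < and > are ordinary characters to sed; it is the preceding backslash that gives them their special meaning (start of word and end of word). They are a GNU extension, so other versions of sed may not support them. We keep the whole expression in single quotes so that the shell does not treat < and > as redirections.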

The second sed should produce closer to what you may have been expecting and would match fox, the, bar, bat, its, joe, etc....
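The difference between the two commands shows up clearly on a single made-up line. This sketch assumes GNU sed, since \< and \> are GNU extensions.

```shell
# A made-up line with two 3-letter words and one longer word.
line='the fox jumped'

# Without word anchors: every run of 3 characters is replaced.
any_three=$(printf '%s\n' "$line" | sed 's/.../321/g')

# With word anchors: only whole words of exactly 3 characters are replaced.
whole_words=$(printf '%s\n' "$line" | sed 's/\<...\>/321/g')

printf '%s\n' "$any_three" "$whole_words"
```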

Returning from detour to our discussion on curly braces …

The above RE ( sed '\%\<[a-z][a-z][a-z]\>%p' /usr/share/dict/words ) is a little long, so we could shorten it using the curly brackets to:

sed '/\<[a-z]\{3\}\>/p' /usr/share/dict/words
                

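It may be hard to see that you are in fact getting the results you are after, because 'p' prints each matching line a second time in amongst all the other lines. You could, instead, keep only the words that are 3 characters in length by replacing the "p" with a "!d" (don't delete) in the sed expression above:

sed '/\<[a-z]\{3\}\>/!d' /usr/share/dict/words

Returning to our earlier attempt to match 1999, this time with the curly braces escaped:

sed '/19\{3\}/p' bazaar.txt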
                

The command now executes as expected and only one duplicate line is output from the file, that which contains the text 1999. So {n} matches exactly n occurrences of the expression.
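The effect of the backslashes can be checked on a few made-up lines. In this sketch -n (a standard sed option that suppresses the automatic printing) is used so that only matching lines are shown.

```shell
# Made-up sample lines.
sample=$(printf '%s\n' '1999 was a good year' '19{3} is not a year' '199 is too short')

# Unescaped braces are ordinary characters: this looks for the text 19{3}.
unescaped=$(printf '%s\n' "$sample" | sed -n '/19{3}/p')

# Escaped braces count repetitions: a 1 followed by exactly three 9s.
escaped=$(printf '%s\n' "$sample" | sed -n '/19\{3\}/p')

printf '%s\n' "$unescaped" "$escaped"
```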

If we wanted to match a string with a minimum of 4 a's, up to .... well infinity a's we could use the pattern:

a\{4,\} 
                

This regular expression sets no upper limit, but the string must contain at least four a's in a row. Thus it would match four a's, forty a's or even four hundred a's following one another, but it would not match three a's.

Let's now match the letter m at least once and with no upper limit. We would do this by:

sed '/m\{1,\}/p' bazaar.txt
                

If we change the 1 to a 2, our pattern becomes:

sed '/m\{2,\}/p' bazaar.txt
                

This would match only those lines with the words: community, programming etcetera (i.e. any words containing at least two consecutive m's).

The following expression would match a minimum of four a's but a maximum of 10 a's in a particular pattern:

a\{4,10\}
                

Let's say we wanted to match any character a minimum of 3 times, but a maximum of 7 times, then we could use a regular expression like:

.\{3,7\}
                

This allows us a huge degree of flexibility when we start combining these operators.
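The curly bracket forms above can be tried on a few made-up words. This is a sketch only; it uses -n (suppress automatic printing) and, in the last command, & in the replacement, which stands for whatever the pattern matched. Neither is covered in the text above, but both are standard sed features.

```shell
# Made-up sample words.
words=$(printf '%s\n' 'community' 'home' 'programming' 'aaa' 'aaaaa')

# At least two m's in a row.
double_m=$(printf '%s\n' "$words" | sed -n '/m\{2,\}/p')

# At least four a's in a row.
four_a=$(printf '%s\n' "$words" | sed -n '/a\{4,\}/p')

# Between 3 and 7 of any character: the match takes as many as it is allowed.
bounded=$(printf '%s\n' 'abcdefghij' | sed 's/.\{3,7\}/[&]/')

printf '%s\n' "$double_m" '---' "$four_a" '---' "$bounded"
```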

What does the following RE match?

^[aeEI]\{1,3\}
                

This RE means: "look for any line that starts with any of the characters a,e,E,I a minimum of one time but a maximum of 3 times". Thus it would match any of the following:

aaa
a
aE
e
E
I
                

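Would it match abd or adb or azz for that matter, or only lines that start with any of the characters in the RE, followed by up to 2 other characters from the RE? As written it matches all of them: the RE is anchored only at the start of the line, so one leading 'a' is enough and whatever follows is ignored. To insist that the whole line consists of one to three of these characters, anchor the end as well, giving ^[aeEI]\{1,3\}$.

The anchored RE would not match the following: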

aaEI
EIea
bEaa
IIEEaae
iEE
                

(why?-- you should answer this.)

RE's are greedy for matching patterns

If you think this is bizarre, hang in there, it gets more bizarre. Let me finish off RE's with two concepts. The first is 'greediness'. RE's are greedy, which means that they will match as much as they possibly can.

Assuming you have an expression:

ham.*
                

This will match as much as it possibly can within that expression. So it would match

	
ham
                

but if we had:

hammock
                

it will match the entire word hammock, because it tries to grab as much as it possibly can - it's greedy. RE's are greedy and sometimes they'll be matching a lot more than you expect them to match. The closer you can get your RE to the actual thing that you're looking for, the less the greediness will affect your results. Let's look at some examples of that.
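A classic place where greediness bites is matching something between two markers. The sketch below uses a made-up scrap of HTML: the first pattern swallows everything up to the LAST '>' on the line, while the second, which says "any characters except '>'", stops at the first one.

```shell
# A made-up scrap of HTML.
html='<b>bold</b> text'

# Greedy: .* runs on to the last > on the line.
greedy=$(printf '%s\n' "$html" | sed 's/<.*>/X/')

# Closer to what we mean: [^>]* cannot run past the first >.
careful=$(printf '%s\n' "$html" | sed 's/<[^>]*>/X/')

printf '%s\n' "$greedy" "$careful"
```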

Exercises

The following exercises will show you how sed's greediness affects the output, and how to create RE's that will only give you the results you want.

I have included 3 files, emails{1,2,3}.txt, in the examples directory. You should have downloaded these previously.

In order to REALLY come to terms with RE's, work through these exercises using these 3 files:

  1. Match the subject lines in these files. Subject lines look as follows:

    Subject:
                                    

  2. List only the 'id' of each message. This can be found with the string 'id', but there is a catch!

  3. What mail clients have people used?

  4. Obtain a listing of all za domains, all com domains, etc in these emails.

  5. Given the following RE's what would they match?

ht\{2\}p:\/\/



ht\{2\}p:\/\{2\}



ht\{2\}p:\/\/w\{3\}.*za$



ht\{2\}p:\/\{2\}.*\/.\{9\}\/

                    
[Note] Note

You will have noticed that in order to understand these, you have to work through them systematically, left to right, understanding each part as you go!

Placeholders and word boundaries

Placeholders are a way of keeping the pattern that you've matched.

In your example files, there's a second file called columns.txt. This file has two columns:

name		age
                

I want to swap the two columns around so that the file contains the age column on the left, and the name column on the right.

Now, if you start thinking about how to do that, it might become quite a complex thing to achieve (without using tools like awk or perl etc.).

With RE's and sed, it's very simple using placeholders. So let's first try and develop a pattern that matches name and a pattern that matches age. Notice that the two columns in the file are separated by a single space. The expression for the name column would be:

[a-z]* 
                

Assuming that no one in our file is 100 years or older we can use the following expression to match the values of the age column:

[0-9]\{1,2\}
                

That should match any age (in the file) because it means match any digit in the range 0-9 a minimum of once but a maximum of twice. So it should match a person whose age is: 1, 9 or 99.

Now the sed expression would then be:

sed '/^[a-z]* [0-9]\{1,2\}$/p' columns.txt
                

This only searches for lines matching and prints them.

How do I swap the name and the age around? I'm going to enclose the name in round brackets (remember you have to escape round brackets). Similarly I'm going to enclose the age expression in round brackets.

Our sed expression now looks like:

sed 's/^\([a-z]*\) \([0-9]\{1,2\}\)$/\2,\1/' columns.txt

Reading the expression from left to right, its parts are:

	1  = ^              Caret (start of line)
	2  = \(             Start of placeholder for the name RE
	3  = [a-z]*         Name RE
	4  = \)             End placeholder for the name RE
	5  = (a space)      Space between the name and the age in the file
	6  = \(             Start placeholder for the age RE
	7  = [0-9]\{1,2\}   The Age RE
	8  = \)             End placeholder for the age RE
	9  = $              Dollar (end of line)
	10 = \2             Placeholder 2 (the age)
	11 = \1             Placeholder 1 (the name)
                

The first set of round brackets contains the 'name' RE, while the second set of round brackets encloses the 'age' RE. By enclosing them in round brackets, I've marked the expressions as placeholders. We can then use \2 to represent the 'age' placeholder, and \1 to represent the 'name' placeholder. Essentially this expression says "search for the name and age, and replace it with the age, a comma and then the name". Thus we've switched the two columns.

The above final expression looks very complex but I tackled this regular expression in byte-size chunks.

I said let's write a regular expression to match the name. Now let's write a regular expression to match the age. Once I had these two individual expressions, I combined them. When I combined them into a single regular expression I then just included round brackets to create placeholders. Later in sed, we were able to use these placeholders in our search-replace expression. Now try and do that in other operating systems!
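If you do not have columns.txt to hand, the same swap can be tried on a few made-up lines. The names and ages below are invented for the example, and this sketch puts a space rather than a comma between the swapped columns.

```shell
# Made-up name/age lines, plus one line that does not fit the pattern.
columns=$(printf '%s\n' 'hamish 35' 'riaan 9' 'No Match 123')

# \1 holds the name, \2 holds the age; write them back in reverse order.
swapped=$(printf '%s\n' "$columns" |
    sed 's/^\([a-z]*\) \([0-9]\{1,2\}\)$/\2 \1/')

printf '%s\n' "$swapped"
```

The last line comes through unchanged, because the whole pattern (caret to dollar) has to match before anything is replaced.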

Try these:

free | sed '/^Mem/!d'
free | sed '/^Mem/!d; s/  */,/g'
VAR=`free | sed '/^Mem/!d; s/  */,/g'`
echo "$VAR"
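Since the output of free differs from machine to machine, here is the same idea as a sketch on canned input. The three lines and their numbers are made up and only loosely resemble what free prints.

```shell
# Made-up lines loosely resembling the output of free.
sample=$(printf '%s\n' '        total   used' 'Mem:     1024    512' 'Swap:       0      0')

# Keep only the line starting with Mem, then turn each run of spaces into a comma.
VAR=$(printf '%s\n' "$sample" | sed '/^Mem/!d; s/  */,/g')

echo "$VAR"
```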
                

Word boundaries ( \< and \> ) - a formal explanation

A final useful trick is that of word boundaries. We've seen them a little earlier, but here is a formal explanation. Suppose we are wanting to search for all words 'the':

sed 's/the/THE/g' bazaar.txt
                

would probably be our first try. Problem is, this will also match (and change) 'there', 'them', 'then', 'therefore', etc. Problem, yes?

Solution? Well, the solution is to bound our word with word boundary markers (the official term is word anchors).

Let's rewrite our pattern with this in mind:

sed 's/\<the\>/THE/g' bazaar.txt
                

This time, we only match the whole word 'the' and not any of the others. So the word anchors will restrict the pattern to complete words and not segments of words.
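The two commands can be compared on a single made-up sentence. As before, this sketch assumes GNU sed, because \< and \> are GNU extensions.

```shell
# A made-up sentence containing 'the' both as a word and inside other words.
sentence='the theory of the others'

# No anchors: every occurrence of the letters t-h-e is changed.
unanchored=$(printf '%s\n' "$sentence" | sed 's/the/THE/g')

# Word anchors: only the complete word 'the' is changed.
anchored=$(printf '%s\n' "$sentence" | sed 's/\<the\>/THE/g')

printf '%s\n' "$unanchored" "$anchored"
```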

Exercises:

The following exercises can be used on any of the text files in your directory. See if you can work out what will be matched before using sed to do it for you.

  1. s/the/THE/g

  2. s/\<the\>/THE/g

  3. s/\(.*\)@\(.*\)/\2 user \1/g

  4. s/\([-a-zA-Z0-9\.]*\)@\([-a-zA-Z0-9\.]*\)/\2 .. \1/g

  5. s/\([-a-zA-Z0-9\.]*\)@\([-a-zA-Z0-9\.]*\)/<<<\2>>> .. [[[\1]]]/g

It may be a good place to pause and tell you about the best editor ever written - vi. If you aren't familiar with it, get hold of VIM (the Improved version of vi.)



[4] the d command in sed means delete the line if there is a match

[5] Since the shell is a command interpreter it does not compile the script. Since it is not pre-compiled the shell interprets every command it encounters.