Regular Expressions

Regular expressions are a tool that pervades the Unix world. They describe text in a way that allows for variation. Different Unix programs offer regular expressions of differing power, but even the weakest forms are generally quite useful.

As a technical note, regular expressions will generally be written with quotation marks ("") around them. The quotation marks are not part of the regular expression; they are simply there to indicate where the regular expression begins and ends, since nearly anything can be part of a regular expression. The sequence \" inside a regular expression will be used to indicate an actual quote character. Which characters are used as delimiters varies from program to program, and you will have to modify your regular expressions as needed.

The simplest regular expression is a string. "Bob", "Mary", and "Dragon" are all extremely simple regular expressions. If we wanted to make them a bit more complex, we could specify that we want Bob to be the first three characters on the line. In that case our regular expression would be "^Bob". If we wanted Bob to be the last three characters of the line, we'd use the regular expression "Bob$".

This brings us to the meat of regular expressions: meta-characters. These are characters which stand for something other than themselves. The two that we've seen so far are ^ and $. The caret (^) represents the beginning of a line. Thus wherever the caret appears, it will only match if the beginning of the line is there. The dollar sign ($) represents the end of the line, again matching only where the end of the line is. As might be obvious, if you put a ^ anywhere but at the beginning of the regular expression or a $ anywhere but at the end, the regular expression cannot match any text. It is impossible for the beginning of the line to match after characters: the definition of the beginning of the line is the place before any characters. (This isn't strictly true. The ^ may have other meanings in special contexts, and certain options to the regular expression parser may also change its behavior. However, the general idea holds for simple regular expressions.)
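The two anchors can be tried out in a few lines. The following is a minimal sketch using Python's re module, which is only one of many regular expression implementations; the sample lines are made up for illustration.

```python
import re

# Hypothetical sample lines, invented for this illustration.
lines = ["Bob went home", "I saw Bob", "Bobby and Bob"]

# "^Bob" matches only when Bob is the first three characters of the line.
starts_with_bob = [line for line in lines if re.search(r"^Bob", line)]

# "Bob$" matches only when Bob is the last three characters of the line.
ends_with_bob = [line for line in lines if re.search(r"Bob$", line)]

# A ^ placed after other characters cannot match in this simple setting,
# because the beginning of the line never comes after "Bob".
misplaced = [line for line in lines if re.search(r"Bob^", line)]

print(starts_with_bob)  # ['Bob went home', 'Bobby and Bob']
print(ends_with_bob)    # ['I saw Bob', 'Bobby and Bob']
print(misplaced)        # []
```

Note that "Bobby and Bob" satisfies both patterns, since each anchor only constrains one end of the line.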

As a note going forward, the standard way of representing meta-characters when we want to match the characters themselves is to escape them, that is, to put a \ in front of them. Thus if we want to match the string $allary, we would use the regular expression "\$allary".
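As a small illustration of escaping, here is a sketch in Python (an assumption; any program with regular expressions behaves similarly, though the details vary). The sample sentence is invented.

```python
import re

# Hypothetical sample text containing the literal string $allary.
text = "My $allary is low"

# Escaped: \$ stands for an actual dollar sign, so the pattern matches.
escaped = re.search(r"\$allary", text)

# Unescaped: $ means end of line, and nothing can follow the end of the
# line, so the pattern finds nothing.
unescaped = re.search(r"$allary", text)

# Python can also add the backslashes for us.
auto_escaped = re.escape("$allary")

print(escaped.group())  # $allary
print(unescaped)        # None
print(auto_escaped)     # \$allary
```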

There are other meta-characters, though the more powerful ones tend to vary from program to program. One universal meta-character is the dot ".". The dot will match any single character (in most programs, any character except a newline). The sequence \w usually matches a word character (a-z, A-Z, 0-9, and usually the underscore) and the sequence \W usually matches non-word characters (everything else, including spaces, tabs, and punctuation).
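The behavior of the dot, \w, and \W can be checked with a short sketch in Python. This shows how Python's re module treats them; other programs may differ, as noted above. The sample strings are invented.

```python
import re

# The dot matches any single character, including a space, but there must
# be exactly one character between the b and the t.
dot_matches = re.findall(r"b.t", "bat bit b t but bt")

# \w picks out the word characters; in Python this includes the underscore.
word_chars = re.findall(r"\w", "a_1 !")

# \W picks out everything else, here a space and an exclamation mark.
non_word_chars = re.findall(r"\W", "a_1 !")

print(dot_matches)     # ['bat', 'bit', 'b t', 'but']
print(word_chars)      # ['a', '_', '1']
print(non_word_chars)  # [' ', '!']
```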

Another useful feature in regular expressions is the ability to specify character classes. This is done by listing the acceptable characters inside square brackets. If we wanted to accept any vowel followed by a q, we would use the regular expression "[aeiou]q". There are two other methods of specifying character classes. One of them is ranges: to specify all lower case letters we would use "[a-z]", upper case would be "[A-Z]", and digits would be "[0-9]". The other method is to specify a character class by negation. In this case we use square brackets where the first character inside them is a ^. The rest of the characters are a list of unallowable characters. For example, if we want to match any character other than sentence-ending punctuation, we would use the regular expression "[^.?!]". As you may have guessed from this example, most meta-characters lose their special meaning inside character classes. If you want the ^ to be one of the allowable characters in a character class, simply put it somewhere other than first.
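Each kind of character class described above can be sketched in Python as follows. The sample strings are made up, and other programs may handle edge cases differently.

```python
import re

# A listed class: any vowel followed by a q.
vowel_q = re.findall(r"[aeiou]q", "iraq, equal, sq")

# Ranges: lower case letters, upper case letters, and digits.
lower = re.findall(r"[a-z]", "aB3c")
upper = re.findall(r"[A-Z]", "aB3c")
digits = re.findall(r"[0-9]", "aB3c")

# Negation: any single character except . ? and !
not_punct = re.findall(r"[^.?!]", "Hi! Ok?")

# A ^ that is not first in the class is an ordinary character, and the dot
# loses its special meaning inside the brackets.
literal = re.findall(r"[.^]", "a^b.c")

print(vowel_q)    # ['aq', 'eq']
print(lower)      # ['a', 'c']
print(upper)      # ['B']
print(digits)     # ['3']
print(not_punct)  # ['H', 'i', ' ', 'O', 'k']
print(literal)    # ['^', '.']
```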

One example of the usefulness of character classes is the ability to disregard case in specific places. For example, if I want to find lines containing my last name whether capitalized or not, I could specify "[Ll]ansdown".
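A quick sketch of this in Python, with invented sample lines, shows that only the first letter is allowed to vary:

```python
import re

# Hypothetical sample lines, invented for this illustration.
lines = [
    "Dear Mr. Lansdown,",
    "see lansdown for details",
    "LANSDOWN IN CAPITALS",
    "no name on this line",
]

# [Ll] accepts either case for the first letter only; the rest of the name
# must still be lower case.
found = [line for line in lines if re.search(r"[Ll]ansdown", line)]

print(found)  # ['Dear Mr. Lansdown,', 'see lansdown for details']
```

The all-capitals line is not found, because the class disregards case in one specific place rather than across the whole name.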

The final basic building block of regular expressions is the ability to indicate number.