Corpus Linguistics 1/4
© 2003 Anatol Stefanowitsch
REGULAR EXPRESSIONS
1. WHAT ARE REGULAR EXPRESSIONS?
Regular expressions (or ‘regex patterns’ or ‘grep patterns’) are expressions that stand
for symbols or strings of symbols, or more often a class of symbols or strings of
symbols. You may be familiar with this from the SEARCH AND REPLACE function of
word processors like MS Word, OpenOffice Writer, etc., where this mechanism is
known as pattern matching, or wildcard operators. For example, a period (.) may stand
for ‘any character’, such that m.le would find male, mile, mole, mule, etc. Obviously,
such a mechanism is very useful in corpus linguistics, e.g. in order to search for
different tense forms of the same verb.
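The wildcard behaviour described above can also be tried outside a text editor. The following is a minimal sketch in Python, which is not one of the tools covered in this handout; the sample sentence is invented for illustration.

```python
import re

# Invented sample sentence; the pattern m.le is the one discussed above.
text = "The male mole walked a mile beside the mule."

# The period stands for any single character.
matches = re.findall(r"m.le", text)
print(matches)
```

Each of the four words is found because exactly one character, whatever it is, stands between m and le.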
This handout explains the basic use of regular expressions using two regex
packages that can be downloaded at no charge from the Internet: TextPad (for
Windows) and BBEdit Lite (for Mac OS and OS X). Both are powerful text editors that
can actually be used as rudimentary concordancers. Note that this handout will only
deal with searching, not with replacing. Read the documentation for the software
packages to find out more about their replace functions; however, you should never use
the replace function with your original corpus files.
2. TWO REGEX TOOLS
2.1 TEXTPAD (WINDOWS)
This shareware software package can be downloaded at www.textpad.com. Before you
work with it, go to the CONFIGURE menu, choose the PREFERENCES command and then
the EDITOR subcommand and activate the USE POSIX control box. TextPad has two
types of search commands, both in the SEARCH menu: FIND and FIND IN FILES. The first
is shown in Figure 1a.
Figure 1a: TextPad FIND dialogue box
When working with regular expressions, make sure that the REGULAR EXPRESSION
control box is activated. The regex pattern is typed into the FIND WHAT box. If you
click on FIND, the next occurrence of your pattern in the currently open document will
be found. If you click on MARK ALL, TextPad will mark all lines containing an
occurrence of your search pattern. You can then use the BOOKMARKED LINES
subcommand from the COPY OTHER command in the EDIT menu to copy all
occurrences and paste them into a new document. You can also search all open
documents by activating the IN ALL DOCUMENTS control box. Note that you can perform
case-sensitive and case-insensitive searches by activating or deactivating the
appropriate control box.
TextPad can also search multiple files in a single pass if they are not currently
open. To do this, you use the FIND IN FILES command, whose dialogue box is shown in
Figure 1b.
Figure 1b: TextPad FIND IN FILES dialogue box
Here you have the same basic options as before, but in addition you can specify a file
type (e.g. .txt) in the IN FILES box and a folder in the IN FOLDER box. TextPad will then
search all files of the specified type in the specified folder, and create a new document
listing all lines containing an occurrence of your search pattern (to do this, make sure
the radio button ALL MATCHING LINES is activated).
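What a multi-file search of this kind does can be sketched in a few lines of Python. This is only an illustration of the idea, not a description of how TextPad works internally; the function name find_in_files and its parameters are hypothetical, and the files are assumed to be UTF-8 text.

```python
import re
from pathlib import Path

def find_in_files(pattern, folder, suffix=".txt", ignore_case=False):
    """Return (file name, line number, line) for every line matching pattern."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    hits = []
    # Look only at files of the given type in the given folder.
    for path in sorted(Path(folder).glob("*" + suffix)):
        with open(path, encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                # search() succeeds if the pattern occurs anywhere in the line.
                if regex.search(line):
                    hits.append((path.name, number, line.rstrip("\n")))
    return hits
```

A call such as find_in_files(r"m.le", "corpus"), where "corpus" is a hypothetical folder name, would return one entry for each matching line, much like the document listing all matching lines.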
2.2 BBEDIT LITE (MAC)
This freeware software package can be downloaded at www.barebones.com. BBEdit’s
FIND & REPLACE dialogue box is shown in Figure 2. When working with regular
expressions, make sure that the USE GREP control box is activated.
Figure 2: BBEdit FIND & REPLACE dialogue box
The regex pattern is typed into the SEARCH FOR box. If you click on FIND, the next
occurrence of your pattern in the currently open document will be found. If you click on
FIND ALL, a new document is created, which lists all lines containing an occurrence of
your search pattern. Note that you can perform case-sensitive and case-insensitive
searches by activating or deactivating the appropriate control box.
Like TextPad, BBEdit can search multiple files in a single pass. To do this, you
simply activate the MULTI-FILE SEARCH control box and then choose the folder
containing the files you want to search using the right OTHER switch. The name of the
folder which you have selected will appear in the lowest of the three text boxes. Again,
by using the FIND ALL command, you can generate a document listing all lines that
contain your search pattern (along with the name and path of the file in which it was
found).
3. TWO DIALECTS OF REGEX
Table 1 lists the most important regex characters in TextPad and BBEdit:
Table 1: Regex characters in TextPad and BBEdit
TEXTPAD BBEDIT LITE EXPLANATION
. . Any character (including whitespace characters) except a line break
[xyz] [xyz] Any of the characters x, y, z
Example: b[aeiou]t finds bat, bet, bit, bot, and but
[a-z] [a-z] Any character from a to z in the ASCII table
[^xyz] [^xyz] Any character except x, y, z
Example: b[^u]t finds e.g. bat, bit and bet but not but
^ ^ Beginning of a line (unless used in square brackets, cf. preceding entry)
$ $ End of a line (unless used in square brackets)
\< Left word boundary (beginning of a word)
Example: \<un finds un at the beginning of a word, as in undo,
unnatural, until
\> Right word boundary (end of a word)
Example: ing\> finds ing at the end of a word, as in running,
thinking, and ring
\t \t Tab
\f \f Page break (Form Feed).
\n \n (Unix), \r (Mac) Line break (Newline)
* * Zero or more occurrences of the preceding character
Example: but*s finds bus, buts, and butts; f[aeiou]*l finds e.g.
fail, foil, feel, fool, foul, foal, etc.
? ? Zero or one occurrence of the preceding character
Example: but?s finds bus and buts; honou?r finds honor and
honour
+ + One or more occurrences of the preceding character
Example: but+s finds buts and butts, but not bus
{x} Exactly x occurrences of the preceding character
{x,} At least x occurrences of the preceding character
{x,y} At least x, but no more than y occurrences of the preceding character
(x|y) (x|y) Either x or y
Example: f(a|i)t finds fat or fit; (a|the) finds a and the;
(a|the|this) finds a, the, and this.
\ \ Cancels the status of a character as a wildcard; e.g. ? finds zero or one
occurrence of the preceding character, but \? finds question marks
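The examples in Table 1 can be checked with a short Python sketch. Python is not one of the two tools described here and its regex dialect differs in places: it has no \< and \>, so the sketch uses \b, which marks a word boundary on either side. The sample text is invented.

```python
import re

# Invented sample text containing the example words from Table 1.
text = "bus buts butts honor honour running ring undo sun fat fit fot"

def find(pattern):
    """Return every match of pattern in the sample text."""
    return re.findall(pattern, text)

star = find(r"\bbut*s\b")      # zero or more t
optional = find(r"\bbut?s\b")  # zero or one t
plus = find(r"\bbut+s\b")      # one or more t
spelling = find(r"honou?r")    # optional u
prefix = find(r"\bun\w*")      # un at the beginning of a word
suffix = find(r"\w*ing\b")     # ing at the end of a word
# (?:...) groups without capturing, so findall returns whole matches.
either = find(r"f(?:a|i)t")
# The backslash cancels the wildcard status of the question mark.
literal = re.findall(r"\?", "Who? What?")

for name, result in [("but*s", star), ("but?s", optional), ("but+s", plus),
                     ("honou?r", spelling), ("un...", prefix),
                     ("...ing", suffix), ("f(a|i)t", either), ("\\?", literal)]:
    print(name, result)
```

Note that the prefix pattern finds undo but not sun or running, because in those words un does not stand at the beginning of the word.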
In addition, there are some predefined expressions for whole classes of characters, as
shown in Table 2:
Table 2: Regex character classes in TextPad and BBEdit
[:alpha:] Any alphabetical character
[:lower:] Any lowercase alphabetical character
[:upper:] Any uppercase alphabetical character
[:alnum:] \w Any alphanumeric character
[:word:] Any alphanumeric character, hyphen, and apostrophe
\W Any character (including whitespace) except alphanumeric characters
[:digit:] \d or # Any numerical character
\D Any character except numerical characters
[:blank:] Space or tab
[:space:] \s Any whitespace character
[:graph:] \S Any character except whitespace characters
[:punct:] Any character except alphanumeric and whitespace characters
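The backslash classes in Table 2 can be tried out in Python as well; again, this is a sketch in a third dialect, not in TextPad or BBEdit. Two differences should be kept in mind: Python's re module does not implement the bracketed class names such as [:alpha:], and its \w also matches the underscore. The sample line is invented.

```python
import re

# Invented sample line.
text = "Line 12: the CAT sat, twice!"

digits = re.findall(r"\d+", text)          # runs of numerical characters
capitalised = re.findall(r"\b[A-Z]\w*", text)  # words starting with an uppercase letter
chunks = re.findall(r"\S+", text)          # runs of non-whitespace characters
punctuation = re.findall(r"[^\w\s]", text) # neither alphanumeric nor whitespace
# Rough stand-in for an alphabetical class: word characters
# that are neither digits nor underscores.
letters = re.findall(r"[^\W\d_]+", text)

print(digits)
print(capitalised)
print(chunks)
print(punctuation)
print(letters)
```

The third result shows why \S alone makes a poor word pattern for corpus work: punctuation stays attached to the neighbouring word.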
4. EXERCISES
1. For each of the following adjectives, design a regex pattern that will retrieve
all of its forms.
TALL (tall, taller, tallest) FIT (fit, fitter, fittest)
NICE (nice, nicer, nicest) SCARY (scary, scarier, scariest)
2. For each of the following nouns, design a regex pattern that will retrieve all of
its forms:
BOOK (book, books) CHILD (child, children)
BUS (bus, buses) LEAF (leaf, leaves)
WOMAN (woman, women) MOUSE (mouse, mice)
3. For each of the following verbs, design at least one regex pattern that will
retrieve all of its forms:
WALK (walk, walks, walking, walked)
HIT (hit, hits, hitting)
FLIP (flip, flips, flipping, flipped)
SIT (sit, sits, sitting, sat)
STEAL (steal, steals, stealing, stole, stolen)
FIND (find, finds, finding, found)
SING (sing, sings, singing, sang, sung)
TAKE (take, takes, taking, took, taken)
FLY (fly, flies, flying, flew, flown)
WREAK (wreaks, wreaked, wrought, wreaking)
Ger. SPRINGEN (spring, springe, springst, springt, springen, sprang, sprangst,
sprangt, sprangen, gesprungen)
4. Use TextPad or BBEdit to search a 1-million word corpus (like BROWN,
FROWN, LOB, FROB, etc.) for some of the patterns you have designed.
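Before running a pattern on a whole corpus, it can be useful to test it against the forms it should and should not retrieve. The following Python sketch shows one way of doing this; the function name check_pattern is hypothetical, and the example deliberately uses a pattern from Table 1 rather than a solution to the exercises. Keep in mind that Python's dialect differs in some details from those of TextPad and BBEdit.

```python
import re

def check_pattern(pattern, wanted, unwanted=()):
    """Return (missed, false_hits) for a pattern tested against word lists.

    missed: wanted forms that the pattern fails to match as a whole word.
    false_hits: unwanted forms that the pattern matches anyway.
    """
    regex = re.compile(pattern)
    missed = [word for word in wanted if not regex.fullmatch(word)]
    false_hits = [word for word in unwanted if regex.fullmatch(word)]
    return missed, false_hits

# honou?r should retrieve both spellings and nothing else from the lists.
print(check_pattern(r"honou?r", ["honor", "honour"], ["hour", "honer"]))

# but+s requires at least one t, so bus is reported as missed.
print(check_pattern(r"but+s", ["bus", "buts", "butts"]))
```

Two empty lists mean that the pattern retrieves all wanted forms and none of the unwanted ones.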