Corpus Linguistics Corpus Linguistics
(L615) Regular
Regular expressions expressions
Regular
Markus Dickinson expressions
Department of Linguistics, Indiana University
Syntax of regular expressions
Spring 2013 Grep
help/help to
Online web
searching
1 / 18
Motivating regular expressions Corpus Linguistics
Regular expressions help describe complex patterns of Regular
words and text expressions
Find help to V constructions in POS-tagged text, with Regular
words & tags mixed up together expressions
Retrieve the first verb used in a relative clause
Find all Indiana email addresses occurring in a long text Syntax of regular expressions
Grep
help/help to
Online web
searching
2 / 18
Regular expressions: What they are Corpus Linguistics
A regular expression is a compact description of a set Regular
of strings, i.e., a language (in formal language theory) expressions
They are used to search for occurrences of these Regular
strings expressions
Regular expressions can only describe so-called
regular languages Syntax of regular expressions
Some patterns cannot be specified using regular Grep
expressions, e.g., finding a string containing an arbitrary
number of matching parentheses help/help to
Online web
searching
3 / 18
Regular expressions: Tools that use them Corpus Linguistics
A variety of unix tools (grep, sed, . . . ), editors (emacs, Regular
jEdit, . . . ), and programming languages (perl, python, expressions
Java, . . . ) incorporate regular expressions.
Regular
We’ll start with grep & then move to perl expressions
Some of the concordancing tools you’ve seen (e.g.,
MLCT, AntConc) allow for regular expression searching. Syntax of regular expressions
Implementations are very efficient so that large text files Grep
can be searched quickly
The various tools differ w.r.t. the exact syntax of the regular help/help to
expressions they allow, but knowledge of one transfers
Online web
searching
4 / 18
The syntax of regular expressions (I) Corpus Linguistics
Regular expressions consist of Regular
strings of literal characters: c, A100, natural expressions
language, 30 years!
disjunction: Regular
ordinary disjunction: devoured|ate, famil(y|ies) expressions
character classes: [Tt]he, bec[oa]me
ranges: [A-Z] (any capital letter) Syntax of regular expressions
negation: Grep
[ˆa] (any symbol but a)
[ˆA-Z0-9] (not an uppercase letter or number) help/help to
Online web
searching
5 / 18
Specific character classes Corpus Linguistics
Use aliases to designate particular recurrent sets of Regular
characters expressions
\d = [0-9]: digit Regular
\D = [ˆ\d]: non-digit expressions
\w = [a-zA-Z0-9 ]: alphanumeric
\W = [ˆ\w]: non-alphanumeric Syntax of regular expressions
\s = [\r\t\n\f]: whitespace character Grep
\r: space, \t: tab, \n: newline, \f: formfeed help/help to
\S [ˆ\s]: non-whitespace
Online web
searching
6 / 18
The syntax of regular expressions (II) Corpus Linguistics
counters Regular
optionality: ? expressions
colou?r
any number of occurrences: * (Kleene star) Regular
[0-9]* years expressions
at least one occurrence: +
[0-9]+ dollars Syntax of regular expressions
Grep
wildcard for any character: .
beg.n for any character in between beg and n help/help to
Parentheses to group items together
ant(farm)? Online web
Escaped characters to specify characters with special searching
meanings:
\*, \+, \?, \(, \), \|, \[, \]
7 / 18
The syntax of regular expressions (III) Corpus Linguistics
Operator precedence, from highest to lowest: Regular
parentheses () expressions
counters * + ?
character sequences Regular
disjunction | expressions
fire|ing = fire or ing Syntax of regular expressions
fir(e|ing) = fir followed by either e or ing Grep
help/help to
Online web
searching
8 / 18
The syntax of regular expressions (IV) Corpus Linguistics
Anchors: anchor expressions to various parts of the string Regular
ˆ = start of line expressions
do not confuse with [ˆ...] used to express negation
$ = end of line Regular
\b = non-word character (i.e., word boundary) expressions
word characters are digits, underscores, or letters, i.e.,
[0-9A-Za-z ] Syntax of regular expressions
Grep
Instead of writing out specific numbers of occurrences,
repetition can be represented between { } help/help to
a{4} = 4 a’s Online web
a{1,4} = 1-4 a’s searching
9 / 18
Some RE practice Corpus Linguistics
What does \$[0-9]+(\.[0-9][0-9]) signify? Regular
Write a RE to capture the times on a digital watch expressions
(hours and minutes). Think about:
Regular
the (im)possible values for the hours expressions
the (im)possible values for the minutes
Syntax of regular expressions
Grep
help/help to
Online web
searching
10 / 18
Grep Corpus Linguistics
grep is a powerful and efficient program for searching in Regular
text files using regular expressions. expressions
It is standard on Unix, Linux, and Mac OSX, and there
also are various ports to Windows (e.g., Regular
expressions
http://gnuwin32.sourceforge.net/packages/grep.htm,
Syntax of regular expressions
http://www.interlog.com/∼tcharron/grep.html or http://www.wingrep.com/). Grep
The version of grep that supports the full set of
operators mentioned above is generally called egrep help/help to
(for extended grep).
Online web
searching
11 / 18
Grep: Examples for using regular expressions Corpus Linguistics
(I)
Regular
In the following, we assume a text file f.txt containing, expressions
among others, the strings that we mention as matching.
Regular
Strings of literal characters: expressions
egrep ’and’ f.txt matches and, Ayn Rand, Candy
and so on Syntax of regular expressions
Character classes: Grep
egrep ’the year [0-9][0-9][0-9][0-9]’ f.txt
matches the year 1776, the year 1812, the year help/help to
2001, and so on
Online web
searching
12 / 18
Grep: Examples for using regular expressions Corpus Linguistics
(II)
Regular
disjunction (|): egrep ’couch|sofa’ f.txt matches expressions
couch or sofa
grouping with parentheses: Regular
egrep ’un(interest|excit)ing’ f.txt matches expressions
uninteresting or unexciting.
Any character (.): Syntax of regular expressions
egrep ’o.e’ f.txt matches ore, one, ole Grep
help/help to
Online web
searching
13 / 18
Grep: Examples for using regular expressions Corpus Linguistics
(III)
Regular
Kleene star (*): expressions
egrep ’a*rgh’ f.txt matches argh, aargh, aaargh
One or more (+): Regular
egrep ’john+y’ f.txt matches johny, johnny, . . . , expressions
but not johy
Optionality (?): Syntax of regular expressions
egrep ’joh?n’ f.txt matches jon and john Grep
help/help to
Online web
searching
14 / 18
Revisiting help/help to Corpus Linguistics
Regular
expressions
We compared help V & help to V Regular
and help NP V & help NP to V expressions
For these latter cases, we simplified the NP to be a
single noun (tag starts with n) or pronoun (p) Syntax of regular expressions
Grep
help/help to
Online web
searching
The patterns we used:
help V: \b(help\w*?/v\w*?\s+\w+/v\w*?)\b
help to V:
\b(help\w*?/v\w*?\s+to/to\s+\w+/v\w*?)\b
help NP V:
\b(help\w*?/v\w*?\s+\w+/[np]\w*?\s+\w+/v\w*?)\b
help NP to V:
\b(help\w*?/v\w*?\s+\w+/[np]\w*?\s+to/to\s+\w+/v\w*?)\b
15 / 18
Breaking down the regular expression Corpus Linguistics
\b(help\w*?/v\w*?\s+\w+/v\w*?)\b Regular
expressions
So, what do we see here?
Regular
Word boundaries before help & at the end expressions
help followed by a sequence of 0 or more (*) word
characters (\w) Syntax of regular expressions
Grep
This matches help, helps, helpful, etc.
We’ll talk about *? momentarily help/help to
/v\w*?: this matches a string starting with /v & followed Online web
by any word characters searching
Taking these 2 together matches, e.g., helping/vbg
\s+: 1 or more whitespace characters
\w+/v\w*?: matches any verb
With \w+, we match any word, not just help
16 / 18
Greediness & Capturing parentheses Corpus Linguistics
Greediness Regular
In Perl, * is greedy: it tries to match as much text as expressions
possible
Regular
Consider a text John goes to the store and an RE t.*s expressions
With the normal, greedy *, this matches to the s
With the non-greedy *? (i.e., t.*?s), this matches the s Syntax of regular expressions
Grep
Capturing parentheses: parentheses do more than just
distinguish subparts of an RE help/help to
They “capture” the part(s) of the RE you may want Online web
further access to searching
\b(help\w*?/v\w*?\s+\w+/v\w*?)\b
We can use $1 to refer to the captured part of the RE
(and $2 if there were a second capture, etc.)
e.g., <word>(\w+)</word> will match the whole string,
but only capture the part in-between the XML tags
17 / 18
Online web searching with REs Corpus Linguistics
Various online web interfaces allow RE queries Regular
expressions
To provide efficient searching in large corpora, in these
search engines regular expressions over characters are Regular
often limited to single tokens (i.e. generally words) expressions
BNC:
Syntax of regular expressions
web form: Grep
http://www.natcorp.ox.ac.uk/using/index.xml?ID=simple
regular expressions are enclosed in { } help/help to
Internet corpora:
http://corpus.leeds.ac.uk/internet.html Online web
See notes on query language: searching
http://corpus.leeds.ac.uk/help.html
18 / 18