Published by , 2017-07-18 07:20:03

Corpus Linguistics (L615) - Regular expressions


Markus Dickinson
Department of Linguistics, Indiana University
Spring 2013

Outline: Regular expressions; Syntax of regular expressions; Grep; help/help to; Online web searching

Motivating regular expressions

Regular expressions help describe complex patterns of words and text
  Find help to V constructions in POS-tagged text, with words & tags mixed up together
  Retrieve the first verb used in a relative clause
  Find all Indiana email addresses occurring in a long text

Regular expressions: What they are

A regular expression is a compact description of a set of strings, i.e., a language (in formal language theory)
  They are used to search for occurrences of these strings
Regular expressions can only describe so-called regular languages
  Some patterns cannot be specified using regular expressions, e.g., recognizing strings whose parentheses are properly matched when they may be nested to an arbitrary depth
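
As a rough illustration of the last point, the Python sketch below contrasts a regular expression that copes only with unnested parentheses with a counter-based check that handles any nesting depth. The names FLAT and balanced are hypothetical, chosen for this example only.

```python
import re

# Hypothetical pattern: accepts text whose parentheses are never nested.
FLAT = re.compile(r'(?:[^()]|\([^()]*\))*')

def balanced(text):
    """Return True if every '(' has a matching ')' at any nesting depth."""
    depth = 0
    for ch in text:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:      # a ')' appeared before its '('
                return False
    return depth == 0

print(FLAT.fullmatch('(a)(b)') is not None)   # flat pairs: the regex copes
print(FLAT.fullmatch('((a))') is not None)    # nested pair: this regex fails
print(balanced('((a))'))                      # the counter handles nesting
```

A pattern could be extended by hand to two or three levels of nesting, but each fixed pattern stops at some depth, whereas the counter does not.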

Regular expressions: Tools that use them

A variety of Unix tools (grep, sed, ...), editors (Emacs, jEdit, ...), and programming languages (Perl, Python, Java, ...) incorporate regular expressions.
  We'll start with grep & then move to Perl
  Some of the concordancing tools you've seen (e.g., MLCT, AntConc) allow for regular expression searching.
Implementations are very efficient so that large text files can be searched quickly
The various tools differ w.r.t. the exact syntax of the regular expressions they allow, but knowledge of one transfers to the others

The syntax of regular expressions (I)

Regular expressions consist of
  strings of literal characters: c, A100, natural language, 30 years!
  disjunction:
    ordinary disjunction: devoured|ate, famil(y|ies)
    character classes: [Tt]he, bec[oa]me
    ranges: [A-Z] (any capital letter)
  negation:
    [^a] (any symbol but a)
    [^A-Z0-9] (not an uppercase letter or digit)
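
The building blocks above can be tried out with a few lines of Python. This is a minimal sketch: the helper name matches is made up here, and re.fullmatch is used so that a pattern has to describe the whole test string.

```python
import re

def matches(pattern, text):
    """True if pattern describes the whole of text."""
    return re.fullmatch(pattern, text) is not None

# ordinary disjunction, alone and inside a group
print(matches(r'devoured|ate', 'ate'))         # True
print(matches(r'famil(y|ies)', 'families'))    # True
# character classes and ranges
print(matches(r'[Tt]he', 'The'))               # True
print(matches(r'bec[oa]me', 'became'))         # True
print(matches(r'[A-Z]', 'q'))                  # False: q is not a capital letter
# negation
print(matches(r'[^a]', 'b'))                   # True
print(matches(r'[^A-Z0-9]', '7'))              # False: 7 is a digit
```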

Specific character classes

Use aliases to designate particular recurrent sets of characters
  \d = [0-9]: digit
  \D = [^\d]: non-digit
  \w = [a-zA-Z0-9_]: alphanumeric (or underscore)
  \W = [^\w]: non-alphanumeric
  \s = [ \r\t\n\f]: whitespace character
    \r: carriage return, \t: tab, \n: newline, \f: formfeed
  \S = [^\s]: non-whitespace
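
A short Python sketch of the aliases in use follows; the sample string is invented for illustration. One caveat about Python specifically: on ordinary strings, \d, \w and \s also accept non-ASCII digits, letters and spaces unless the re.ASCII flag is given, so the bracket equivalents hold exactly only in ASCII mode.

```python
import re

text = 'Room 12B\tcosts $30'   # hypothetical sample text

print(re.findall(r'\d', text))     # every single digit
print(re.findall(r'\w+', text))    # runs of word characters
print(re.findall(r'\s', text))     # each whitespace character
print(re.findall(r'\S+', text))    # runs of non-whitespace

# \w on a non-ASCII letter (e with acute accent), with and without re.ASCII
print(re.fullmatch(r'\w', '\u00e9') is not None)
print(re.fullmatch(r'\w', '\u00e9', re.ASCII) is not None)
```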

The syntax of regular expressions (II)

counters
  optionality: ?
    colou?r
  any number of occurrences: * (Kleene star)
    [0-9]* years
  at least one occurrence: +
    [0-9]+ dollars
wildcard for any character: .
  beg.n for any character in between beg and n
Parentheses to group items together
  ant(farm)?
A backslash escapes characters with special meanings so that they match literally:
  \*, \+, \?, \(, \), \|, \[, \]
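
The following Python sketch checks each of these operators against small test strings. The helper name matches is hypothetical, and whole-string matching (re.fullmatch) is assumed so that the effect of each counter is easy to see.

```python
import re

def matches(pattern, text):
    """True if pattern describes the whole of text."""
    return re.fullmatch(pattern, text) is not None

# optionality: the u may be present or absent
print(matches(r'colou?r', 'color'), matches(r'colou?r', 'colour'))
# Kleene star allows zero digits; plus requires at least one
print(matches(r'[0-9]* years', ' years'))
print(matches(r'[0-9]+ dollars', ' dollars'))
# the wildcard stands for exactly one character
print(matches(r'beg.n', 'begin'), matches(r'beg.n', 'begn'))
# the counter applies to the whole parenthesized group
print(matches(r'ant(farm)?', 'antfarm'), matches(r'ant(farm)?', 'ant'))
# an escaped + is a literal plus sign; unescaped it is a counter
print(matches(r'2\+2', '2+2'), matches(r'2+2', '2+2'))
```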

The syntax of regular expressions (III)

Operator precedence, from highest to lowest:
  parentheses ()
  counters * + ?
  character sequences
  disjunction |

fire|ing = fire or ing
fir(e|ing) = fir followed by either e or ing
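
Precedence can be checked directly. In this Python sketch (the helper accepted and the word lists are made up for illustration), each pattern is tested against whole words to show which ones it accepts.

```python
import re

def accepted(pattern, words):
    """Return the words that pattern describes in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

words = ['fire', 'ing', 'firing', 'fir']
# disjunction binds loosest: the alternatives are fire and ing
print(accepted(r'fire|ing', words))
# parentheses bind tightest: fir, then either e or ing
print(accepted(r'fir(e|ing)', words))

# counters bind tighter than sequences: + applies only to the b
print(accepted(r'ab+', ['abbb', 'abab']))
# with parentheses, + applies to the whole sequence ab
print(accepted(r'(ab)+', ['abbb', 'abab']))
```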

The syntax of regular expressions (IV)

Anchors: anchor expressions to various parts of the string
  ^ = start of line
    do not confuse with [^...] used to express negation
  $ = end of line
  \b = word boundary, i.e., the zero-width position between a word character and a non-word character (or the start or end of the string)
    word characters are digits, underscores, or letters, i.e., [0-9A-Za-z_]

Instead of writing out specific numbers of occurrences, repetition can be represented between { }
  a{4} = 4 a's
  a{1,4} = 1-4 a's
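
A small Python sketch of anchors and bounded repetition follows; the two-line sample text is invented. One assumption to note: grep works line by line, so ^ and $ are per-line by nature, whereas Python needs the re.MULTILINE flag for the same behaviour on a multi-line string.

```python
import re

text = 'the cat sat\nthe other cat'   # hypothetical two-line text

# ^ and $ match at line starts and ends under re.MULTILINE
print(re.findall(r'^the', text, re.MULTILINE))
print(re.findall(r'cat$', text, re.MULTILINE))

# without \b, the inside of "other" also counts as a hit
print(len(re.findall(r'the', text)))
print(len(re.findall(r'\bthe\b', text)))

# \b is zero-width: the match contains no surrounding characters
print(re.search(r'\bcat\b', 'a cat.').group(0))

# bounded repetition
print(re.fullmatch(r'a{4}', 'aaaa') is not None)
print(re.fullmatch(r'a{1,4}', 'aaaaa') is not None)
```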

Some RE practice

What does \$[0-9]+(\.[0-9][0-9]) signify?
Write a RE to capture the times on a digital watch (hours and minutes). Think about:
  the (im)possible values for the hours
  the (im)possible values for the minutes
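
One way to explore both questions is to run candidate patterns on test strings. The Python sketch below is only one possible answer to the watch exercise: it assumes a 24-hour display of the form HH:MM, and the names PRICE and WATCH as well as the test strings are hypothetical.

```python
import re

# The pattern from the first question, tried on two sample sentences
PRICE = re.compile(r'\$[0-9]+(\.[0-9][0-9])')
print(PRICE.search('It costs $12.50 today').group(0))
print(PRICE.search('It costs $12 today'))   # no match: the cents are required

# Assumed 24-hour format: hours 00-23, minutes 00-59
WATCH = re.compile(r'([01][0-9]|2[0-3]):[0-5][0-9]')
for t in ('00:00', '09:45', '23:59', '24:00', '12:60'):
    print(t, WATCH.fullmatch(t) is not None)
```

A 12-hour watch, or one that drops the leading zero, would need a different hours part.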

Grep

grep is a powerful and efficient program for searching in text files using regular expressions.
It is standard on Unix, Linux, and Mac OS X, and there also are various ports to Windows (e.g., http://gnuwin32.sourceforge.net/packages/grep.htm, http://www.interlog.com/~tcharron/grep.html or http://www.wingrep.com/).
The version of grep that supports the full set of operators mentioned above is generally called egrep (for extended grep); it is equivalent to grep -E.

Grep: Examples for using regular expressions (I)

In the following, we assume a text file f.txt containing, among others, the strings that we mention as matching.

Strings of literal characters:
  egrep 'and' f.txt matches and, Ayn Rand, Candy and so on
Character classes:
  egrep 'the year [0-9][0-9][0-9][0-9]' f.txt matches the year 1776, the year 1812, the year 2001, and so on
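
egrep prints every line that contains a match somewhere in it. That behaviour can be imitated in Python as sketched below; the function name egrep_lines and the sample lines standing in for f.txt are assumptions made for illustration.

```python
import re

def egrep_lines(pattern, lines):
    """Return the lines that contain a match, roughly as egrep prints them."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Hypothetical contents standing in for f.txt
lines = [
    'Ayn Rand wrote novels',
    'Candy is sweet',
    'in the year 1776',
    'the year was long',
]

print(egrep_lines(r'and', lines))
print(egrep_lines(r'the year [0-9][0-9][0-9][0-9]', lines))
```

Because the search is for a match anywhere in the line, and is found inside Rand and Candy as well.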

Grep: Examples for using regular expressions (II)

disjunction (|):
  egrep 'couch|sofa' f.txt matches couch or sofa
grouping with parentheses:
  egrep 'un(interest|excit)ing' f.txt matches uninteresting or unexciting.
Any character (.):
  egrep 'o.e' f.txt matches ore, one, ole

Grep: Examples for using regular expressions (III)

Kleene star (*):
  egrep 'a*rgh' f.txt matches argh, aargh, aaargh
One or more (+):
  egrep 'john+y' f.txt matches johny, johnny, ..., but not johy
Optionality (?):
  egrep 'joh?n' f.txt matches jon and john
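
To see exactly which stretch of text each counter picks out, the patterns can be run with Python's re.findall, which returns the matched substrings rather than whole lines. The sample strings below are invented for this sketch.

```python
import re

# * allows zero occurrences, so plain rgh is found as well
print(re.findall(r'a*rgh', 'argh aaargh rgh'))

# + needs at least one n before the y, so johy is skipped
print(re.findall(r'john+y', 'johny johnny johy'))

# ? allows at most one h, so johhn is skipped
print(re.findall(r'joh?n', 'jon john johhn'))
```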

Revisiting help/help to

We compared help V & help to V and help NP V & help NP to V
  For these latter cases, we simplified the NP to be a single noun (tag starts with n) or pronoun (p)

The patterns we used:

help V:
\b(help\w*?/v\w*?\s+\w+/v\w*?)\b

help to V:
\b(help\w*?/v\w*?\s+to/to\s+\w+/v\w*?)\b

help NP V:
\b(help\w*?/v\w*?\s+\w+/[np]\w*?\s+\w+/v\w*?)\b

help NP to V:
\b(help\w*?/v\w*?\s+\w+/[np]\w*?\s+to/to\s+\w+/v\w*?)\b
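
The four patterns can be applied in Python largely unchanged, since Python's re module accepts this Perl-style syntax. The sketch below is illustrative only: the tagged text, its lowercase word/tag format, the specific tag names, and the function name find_constructions are all assumptions, not the corpus data used in the course.

```python
import re

PATTERNS = {
    'help V':       r'\b(help\w*?/v\w*?\s+\w+/v\w*?)\b',
    'help to V':    r'\b(help\w*?/v\w*?\s+to/to\s+\w+/v\w*?)\b',
    'help NP V':    r'\b(help\w*?/v\w*?\s+\w+/[np]\w*?\s+\w+/v\w*?)\b',
    'help NP to V': r'\b(help\w*?/v\w*?\s+\w+/[np]\w*?\s+to/to\s+\w+/v\w*?)\b',
}

# Hypothetical POS-tagged text in word/tag format (tag names are assumed)
text = ('we/ppss help/vb clean/vb ./. '
        'she/pps helped/vbd to/to cook/vb ./. '
        'they/ppss helps/vbz him/ppo move/vb ./. '
        'i/ppss helped/vbd friends/nns to/to pack/vb ./.')

def find_constructions(tagged):
    """Map each construction name to the list of captured matches."""
    return {name: re.findall(pat, tagged) for name, pat in PATTERNS.items()}

for name, hits in find_constructions(text).items():
    print(name, hits)
```

With a single capturing group in each pattern, re.findall returns the captured text, so each list holds the matched constructions themselves.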

Breaking down the regular expression

\b(help\w*?/v\w*?\s+\w+/v\w*?)\b

So, what do we see here?
  Word boundaries before help & at the end
  help followed by a sequence of 0 or more (*) word characters (\w)
    This matches help, helps, helpful, etc.
    We'll talk about *? momentarily
  /v\w*?: this matches a string starting with /v & followed by any word characters
    Taking these 2 together matches, e.g., helping/vbg
  \s+: 1 or more whitespace characters
  \w+/v\w*?: matches any verb
    With \w+, we match any word, not just help
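
The same breakdown can be written into the pattern itself with Python's re.VERBOSE flag, which ignores whitespace inside the pattern and allows comments. This is a sketch for illustration: the name HELP_V and the tagged sample sentence are hypothetical.

```python
import re

HELP_V = re.compile(r"""
    \b            # word boundary before help
    (             # start of the capturing group
      help\w*?    # help plus zero or more word characters (non-greedy)
      /v\w*?      # a tag that starts with /v
      \s+         # one or more whitespace characters
      \w+         # any word, not just help
      /v\w*?      # a second tag that starts with /v, i.e., a verb
    )             # end of the capturing group
    \b            # word boundary at the end
""", re.VERBOSE)

# Hypothetical tagged sentence
sample = 'they/ppss helping/vbg clean/vb up/rp'
print(HELP_V.search(sample).group(1))
```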

Greediness & Capturing parentheses

Greediness
  In Perl, * is greedy: it tries to match as much text as possible
  Consider a text John goes to the stores and an RE t.*s
    With the normal, greedy *, this matches to the stores
    With the non-greedy *? (i.e., t.*?s), this matches to the s
    Either way, the match begins at the leftmost possible t; greediness only affects where it ends
Capturing parentheses: parentheses do more than just distinguish subparts of an RE
  They "capture" the part(s) of the match you may want further access to
    \b(help\w*?/v\w*?\s+\w+/v\w*?)\b
  We can use $1 to refer to the captured part of the RE (and $2 if there were a second capture, etc.)
  e.g., <word>(\w+)</word> will match the whole string, but only capture the part in-between the XML tags
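
Both ideas behave the same way in Python, which makes them easy to try out. In this sketch the sentence John goes to the stores and the XML snippet are illustrative inputs; Python writes group(1) where Perl writes $1.

```python
import re

text = 'John goes to the stores'   # illustrative sentence

# greedy: runs from the first t to the last s
print(re.search(r't.*s', text).group(0))
# non-greedy: stops at the first s after the first t
print(re.search(r't.*?s', text).group(0))

# capturing parentheses: group(0) is the whole match, group(1) the capture
m = re.search(r'<word>(\w+)</word>', '<word>corpus</word>')
print(m.group(0))
print(m.group(1))
```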

Online web searching with REs

Various online web interfaces allow RE queries
  To provide efficient searching in large corpora, in these search engines regular expressions over characters are often limited to single tokens (i.e., generally words)
BNC:
  web form: http://www.natcorp.ox.ac.uk/using/index.xml?ID=simple
  regular expressions are enclosed in { }
Internet corpora: http://corpus.leeds.ac.uk/internet.html
  See notes on query language: http://corpus.leeds.ac.uk/help.html

