Writing Search Patterns

Most Characters Match Themselves

Most characters that you type into the Find dialog box match themselves. For instance, if you are looking for the letter "t", Grep stops and reports a match when it encounters a "t" in the text. This idea is so obvious that it seems not worth mentioning, but the important thing to remember is that these characters are search patterns. Very simple patterns, to be sure, but patterns nonetheless.

Escaping Special Characters

In addition to the simple character matching discussed above, there are various special characters that have different meanings when used in a grep pattern than in a normal search. (The use of these characters is covered in the following sections.)

However, sometimes you will need to include an exact, or literal, instance of these characters in your grep pattern. In this case, you must use the backslash character \ before that special character to have it be treated literally; this is known as "escaping" the special character. To search for a backslash character itself, double it \\ so that its first appearance will escape the second.

For example, perhaps the most common "special character" in grep is the dot: ".". In grep, a dot character will match any character except a return. But what if you only want to match a literal dot? If you escape the dot: "\.", it will only match another literal dot character in your text.

So, most characters match themselves, and even the special characters will match themselves if they are preceded by a backslash. BBEdit's grep syntax coloring helps make this clear.
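
If you want to try the escaping rules outside BBEdit, the following minimal sketch uses Python's re module as a stand-in. That choice is an assumption made for illustration: Python's engine is similar to, but not the same as, BBEdit's, and the sample strings are invented.

```python
import re

text = "Version 1.5 is not version 1x5."

# Unescaped: the dot is a wildcard, so it also matches the "x" in "1x5".
loose = re.findall(r"1.5", text)

# Escaped: "\." matches only a literal period.
strict = re.findall(r"1\.5", text)

# A doubled backslash in the pattern matches one literal backslash.
backslash_count = len(re.findall(r"\\", r"C:\temp\notes"))

print(loose)            # ['1.5', '1x5']
print(strict)           # ['1.5']
print(backslash_count)  # 2
```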

Note: When passing grep patterns to BBEdit via AppleScript, be aware that both the backslash and double-quote characters have special meaning to AppleScript. In order to pass these through correctly, you must escape them in your script. Thus, to pass \r for a carriage return to BBEdit, you must write \\r in your AppleScript string.

Wildcards Match Types of Characters

These special characters, or metacharacters, are used to match certain types of other characters:

Wildcard		Matches...

.		any character except a line break (i.e. a carriage return)

^		beginning of a line (unless used in a character class)

$		end of line (unless used in a character class)

Being able to specifically match text starting at the beginning or end of a line is an especially handy feature of grep. For example, if you wanted to find every instance of a message sent by Patrick, from a log file which contains various other information like so:

    From: Rich, server: barebones.com

    To: BBEdit-Talk, server: lists.barebones.com

    From: Patrick, server: example.barebones.com

you could search for the pattern:

    ^From: Patrick

and you will find every occurrence of these lines in your file (or set of files if you do a multi-file search instead).
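
The same search can be sketched in Python's re module, used here only as an approximation of BBEdit's engine. Two assumptions apply: the sample log uses \n as its line break, and the re.MULTILINE flag is needed to make ^ match at the start of every line, which BBEdit does by default.

```python
import re

log = (
    "From: Rich, server: barebones.com\n"
    "To: BBEdit-Talk, server: lists.barebones.com\n"
    "From: Patrick, server: example.barebones.com\n"
)

# re.MULTILINE lets ^ match at the start of each line, not only at the
# start of the whole text.
hits = [m.start() for m in re.finditer(r"^From: Patrick", log, re.MULTILINE)]

# Convert each match offset to a 1-based line number.
hit_lines = [log.count("\n", 0, pos) + 1 for pos in hits]

print(hit_lines)  # [3]
```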

It is important to note that ^ and $ do not actually match return characters. They match zero-width positions after and before returns, respectively. So, if you are looking for "foo" at the end of a line, the pattern "foo$" will match the three characters 'f', 'o', and 'o'. If you search for "foo\r", you will match the same text, but the match will contain four characters: 'f', 'o', 'o', and a return.
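
The difference in match length can be checked with a short Python sketch. Python's re module is an assumed stand-in here, and because it treats \n (not \r) as the line break for $, the sample text and the second pattern use \n.

```python
import re

text = "bar foo\nnext line"

# $ is zero-width: it asserts "end of line" without consuming the break.
anchored = re.search(r"foo$", text, re.MULTILINE)

# \n is a real character, so it becomes part of the match.
with_break = re.search(r"foo\n", text)

print(len(anchored.group()))    # 3
print(len(with_break.group()))  # 4
```

Both searches start at the same position; only the second one includes the line break in the matched text.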

In softwrapped files, ^ and $ will also match after and before soft line breaks, respectively.

You can combine ^ and $ within a pattern to force a match to constitute an entire line. For example:

    ^foo$

will only match "foo" on a line by itself, with no other characters. Try it against these three lines to see for yourself:

    foobar

    foo

    fighting foo

The pattern will only match the second line.
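
A minimal Python sketch of the same test follows. It assumes Python's re module with the re.MULTILINE flag behaves like BBEdit's line anchors, and it uses \n as the line break.

```python
import re

sample = "foobar\nfoo\nfighting foo"

# Report the 1-based line number of every whole-line match.
whole_line_hits = [
    sample.count("\n", 0, m.start()) + 1
    for m in re.finditer(r"^foo$", sample, re.MULTILINE)
]

print(whole_line_hits)  # [2]
```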

Other Positional Assertions

Escape		Matches...

\A		only at the beginning of the document (as opposed to ^, which matches at the beginning of the document and also at the beginning of each line)

\b		any word boundary, defined as any position between a \w character and a \W character, in either order

\B		any position that is NOT a word boundary

\z		at the end of the document (as opposed to $, which matches at the end of the document, but also at the end of each line)

\Z		at the end of the document, or before a trailing return at the end of the document, if there is one

Examples (in each "Will match" line, only the text foo or Jane is matched, not the rest of the line)

Search for:	\bfoo\b

Will match:	bar foo bar

Will match:	foo bar

Won't match:	foobar

Search for:	\bJane\b

Will match:	Jane's

Will match:	Tell Jane about the monkey.

Won't match:	Commander Janeway

Search for:	\Afoo

Will match:	foobar

Won't match:	This is good foo.
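
These examples can be reproduced with a short Python sketch. Python's re module is assumed as a stand-in for BBEdit's engine, and first_match is a hypothetical helper written for this illustration. Only \b and \A are shown, because the end-of-document escapes are spelled differently in Python.

```python
import re

def first_match(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group() if m else None

print(first_match(r"\bfoo\b", "bar foo bar"))         # foo
print(first_match(r"\bfoo\b", "foobar"))              # None
print(first_match(r"\bJane\b", "Jane's"))             # Jane
print(first_match(r"\bJane\b", "Commander Janeway"))  # None
print(first_match(r"\Afoo", "foobar"))                # foo
print(first_match(r"\Afoo", "This is good foo."))     # None
```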

Character Classes Match Sets or Ranges of Characters

The character class construct lets you specify a set or a range of characters to match, or to ignore. A character class is constructed by placing a pair of square brackets [...] around the group or range of characters you wish to include. To exclude, or ignore, all characters specified by a character class, add a caret character ^ just after the opening bracket [^...]. For example:

Character class		Matches...

[xyz]		any one of the characters x, y, z

[^xyz]		any character except x, y, z

[a-z]		any character in the range a to z

You can use any number of characters or ranges between the brackets. Here are some examples:

Character class		matches

[aeiou]		any vowel

[^aeiou]		any character that is not a vowel

[a-zA-Z0-9]		any alphanumeric character

[^aeiou0-9]		any character that is neither a vowel nor a digit

Character classes respect the setting of the Case Sensitive checkbox in the Find dialog. For example, if Case Sensitive is on, [a] will only match "a"; if Case Sensitive is off, [a] will match both "a" and "A".

A character class matches when the search encounters any one of the characters in the pattern. However, the contents of a set are only treated as separate characters, not as words. For example, if your search pattern is [beans] and the text in the window is "lima beans", BBEdit will report a match at the "a" of the word "lima".
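
The following Python sketch illustrates these points. It assumes Python's re module as a stand-in for BBEdit's engine, with the re.IGNORECASE flag playing the role of turning the Case Sensitive checkbox off.

```python
import re

text = "lima beans"

# [beans] is a set of single characters, not the word "beans".
first = re.search(r"[beans]", text)
print(first.start(), first.group())  # 3 a

vowels = re.findall(r"[aeiou]", text)
non_vowels = re.findall(r"[^aeiou]", "lima")
print(vowels)      # ['i', 'a', 'e', 'a']
print(non_vowels)  # ['l', 'm']

# Case sensitivity applies inside character classes too.
case_sensitive = re.findall(r"[a]", "Aa")
case_insensitive = re.findall(r"[a]", "Aa", re.IGNORECASE)
print(case_sensitive)    # ['a']
print(case_insensitive)  # ['A', 'a']
```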

To include the character ] in a set or a range, place it immediately after the opening bracket. To use the ^ character, place it anywhere except immediately after the opening bracket. To match a dash character (hyphen) in a range, place it at the beginning of the range; to match it as part of a set, place it at the beginning or end of the set. Or, you can include any of these characters at any point in the class by escaping them with a backslash.

Character class		matches

[]0-9]		any digit or ]

[aeiou^]		a vowel or ^

[-A-Z]		a dash or A - Z

[--A]		any character in the range from - to A

[aeiou-]		a vowel or -
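
Most of these placement rules can be tried in Python's re module, which is assumed here as a stand-in for BBEdit's engine. The sample strings are invented, and [--A] is left out because recent Python versions emit a warning for a doubled dash inside a class.

```python
import re

# ] placed first in the class is literal.
digit_or_bracket = re.findall(r"[]0-9]", "a[1]b2")

# ^ anywhere but first is literal.
vowel_or_caret = re.findall(r"[aeiou^]", "x^2 + e")

# A dash at the start or end of the class is literal.
dash_or_upper = re.findall(r"[-A-Z]", "A-b Z")
vowel_or_dash = re.findall(r"[aeiou-]", "co-op")

# Escaping works at any position in the class.
escaped = re.findall(r"[\]\^\-]", "]^-x")

print(digit_or_bracket)  # ['1', ']', '2']
print(vowel_or_caret)    # ['^', 'e']
print(dash_or_upper)     # ['A', '-', 'Z']
print(vowel_or_dash)     # ['o', '-', 'o']
print(escaped)           # [']', '^', '-']
```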

Matching Non-Printing Characters

As described in Chapter 7 on searching, BBEdit provides several special character pairs that you can use to match certain non-printing characters. You can use these special characters in grep patterns as well as for normal searching.

For example, to look for a tab or a space, you would use the character class [\t ] (consisting of a tab special character and a space character).

Character		Matches...

\r		line break (carriage return)

\n		Unix line break (line feed)

\t		tab

\f		page break (form feed)

\xNN		hexadecimal character code NN (e.g. \x0D for CR)

\\		backslash

Use \r to match a line break in the middle of a pattern and the special characters ^ and $ (described above) to "anchor" a pattern to the beginning of a line or to the end of a line. In the case of ^ and $, the line break character is not included in the match.
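
As a rough illustration, the escapes in this table behave the same way in Python's re module, which is assumed here as a stand-in for BBEdit's engine. The sample strings are invented.

```python
import re

# [\t ] matches a tab or a space, so it can split on either one.
fields = re.split(r"[\t ]", "one\ttwo three")

# \x0D and \r are two spellings of the same carriage return character.
by_hex = re.findall(r"\x0D", "a\rb\r")
by_escape = re.findall(r"\r", "a\rb\r")

print(fields)               # ['one', 'two', 'three']
print(by_hex == by_escape)  # True
```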

Other Special Character Classes

BBEdit provides several other sequences for matching different types or categories of characters.

Special Character		Matches...

\s		any whitespace character (space, tab, carriage return, line feed, form feed)

\S		any non-whitespace character (any character not included by \s)

\w		any word character (a-z, A-Z, 0-9, _, and some 8-bit characters)

\W		any non-word character (all characters not included by \w, including carriage returns)

\d		any digit (0-9)

\D		any non-digit character (incl. carriage return)

A "word" is defined in BBEdit as any run of non-word-break characters bounded by word breaks. Word characters are generally alphanumeric, and some characters whose value is greater than 127 are also considered word characters.

Note that any character matched by \s is by definition not a word character; thus, anything matched by \s will also be matched by \W (but not the reverse!).
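
The relationship between \s and \W can be checked with a small Python sketch. Python's re module is an assumed stand-in, and its exact definition of a word character may differ from BBEdit's for characters outside the ASCII range.

```python
import re

whitespace = " \t\r\n\f"

# Every whitespace character is matched by both \s and \W ...
s_implies_W = all(
    re.fullmatch(r"\s", ch) is not None and re.fullmatch(r"\W", ch) is not None
    for ch in whitespace
)

# ... but not the reverse: "!" is a non-word character that is not whitespace.
W_without_s = (
    re.fullmatch(r"\W", "!") is not None and re.fullmatch(r"\s", "!") is None
)

words = re.findall(r"\w+", "snake_case x2!")
non_digits = re.findall(r"\D+", "12 years")

print(s_implies_W, W_without_s)  # True True
print(words)                     # ['snake_case', 'x2']
print(non_digits)                # [' years']
```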

Quantifiers Repeat Subpatterns

The special characters *, +, and ? specify how many times the pattern preceding them may repeat. {}-style quantifiers allow you to specify exactly how many times a subpattern can repeat. The preceding pattern can be a literal character, a wildcard character, a character class, or a special character.

Pattern		Matches...

p+		one or more p's

p*		zero or more p's

p?		zero or one p's

p{COUNT}		match exactly COUNT p's, where COUNT is an integer

p{MIN,}		match at least MIN p's

p{MIN,MAX}		match at least MIN p's, but no more than MAX

Note that the repetition characters * and ? can match zero occurrences of the preceding pattern (* matches zero or more, ? matches zero or one). That means that they will always succeed, because there will always be at least zero occurrences of any pattern, but they will not necessarily select any text (if no occurrences of the preceding pattern are present).

For this reason, when you are trying to match more than one occurrence, it is usually better to use a + than a *, because + requires a match, whereas * can match the empty string. Only use * when you are sure that you really mean "zero or more times", not just "one or more times".

Try the following examples to see how their behavior matches what you expect:

Pattern	Text is...	Matches...

.*	Fourscore and seven years	Fourscore and seven years

[0-9]+	I've been a loyal member since 1983 or so.	1983

\d+	I've got 12 years on him.	12

A*	BAAAAAAAB	advances the insertion point past the first and last "B"s, and matches "AAAAAAA"

A+	BAAAAAAAB	AAAAAAA

A?	Andy joined AAA	the "A" from Andy

A+	Ted joined AAA yesterday	"AAA" and the "a" from yesterday (with Case Sensitive off)
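
The rows above can be approximated in Python's re module, which is assumed here as a stand-in for BBEdit's engine; first_match is a hypothetical helper. One difference is worth noticing: a bare Python search for A* reports the empty match at the very first position instead of moving on to the run of A's, which shows how * can succeed without selecting any text.

```python
import re

def first_match(pattern, text, flags=0):
    """Return the text of the first match, or None if nothing matches."""
    m = re.search(pattern, text, flags)
    return m.group() if m else None

run_text = "BAAAAAAAB"

everything = first_match(r".*", "Fourscore and seven years")
year = first_match(r"[0-9]+", "I've been a loyal member since 1983 or so.")
age_gap = first_match(r"\d+", "I've got 12 years on him.")
plus_run = first_match(r"A+", run_text)   # the whole run of A's
star_run = first_match(r"A*", run_text)   # empty: zero A's before the first B
optional = first_match(r"A?", "Andy joined AAA")

# re.IGNORECASE plays the role of turning Case Sensitive off.
all_runs = re.findall(r"A+", "Ted joined AAA yesterday", re.IGNORECASE)

# {}-style quantifiers: exactly four digits, and two to three p's.
exact = re.findall(r"\d{4}", "1983 and 16")
bounded = re.findall(r"p{2,3}", "p pp pppp")

print(repr(star_run), plus_run)  # '' AAAAAAA
print(all_runs)                  # ['AAA', 'a']
print(exact, bounded)            # ['1983'] ['pp', 'ppp']
```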

Combining Patterns to Make Complex Patterns

So far, the patterns you have seen match a single character or the repetition of a single character or class of characters. This is very useful when you are looking for runs of digits or single letters, but often that's not enough.

However, by combining these patterns, you can search for more complex items. In fact, you are already familiar with combining patterns. Remember the section at the beginning of this discussion that said that each individual character is a pattern that matches itself? When you search for a word, you are already combining basic patterns.

You can combine any of the preceding grep patterns in the same way. Here are some examples.

Pattern	Matches	Examples

\d+\+\d+	a string of digits, followed by a literal plus sign, followed by more digits	4+2 1234+5829

\d{4}[\t ]B\.C\.	four digits, followed by a tab or a space, followed by the string B.C.	2152 B.C.

\$?[0-9,]+\.\d*	an optional dollar sign, followed by one or more digits and commas, followed by a period, then zero or more digits	1,234.56 $4,296,459.19 $3,5,6,4.0000 0. (oops!)

Note again in these examples how the characters that have special meaning to grep are preceded by a backslash (\+, \., and \$) when we want them to match themselves.
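
To confirm that each combined pattern accepts its example strings, here is a minimal Python sketch. Python's re module is assumed as a stand-in for BBEdit's engine, and fullmatch is used so that the whole example string must be matched.

```python
import re

examples = {
    r"\d+\+\d+": ["4+2", "1234+5829"],
    r"\d{4}[\t ]B\.C\.": ["2152 B.C."],
    r"\$?[0-9,]+\.\d*": ["1,234.56", "$4,296,459.19", "$3,5,6,4.0000", "0."],
}

# Every example string should be matched in full by its pattern.
all_match = all(
    re.fullmatch(pattern, sample) is not None
    for pattern, samples in examples.items()
    for sample in samples
)

# Without the escaped plus sign in the text, the first pattern fails.
no_plus = re.fullmatch(r"\d+\+\d+", "42") is None

print(all_match, no_plus)  # True True
```

The last pattern also accepts "$3,5,6,4.0000" and "0.", which is why a pattern this loose is not a reliable currency validator.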

Creating Subpatterns

Subpatterns provide a means of organizing or grouping complex grep patterns. This is primarily important for two reasons: for limiting the scope of the alternation operator (which otherwise creates an alternation of everything to its left and right), and for changing the matched text when performing replacements. A subpattern consists of any simple or complex pattern, enclosed in a pair of parentheses:

Pattern		Matches...

(p)		the pattern p and remembers it

You can combine more than one subpattern into a grep pattern, or mix subpatterns and other pattern elements as you need.

Taking the last set of examples, you could modify these to use subpatterns wherever actual data appears:

Pattern	Matches	Examples

(\d+)\+(\d+)	a string of digits, followed by a plus sign, followed by more digits	4+2 1234+5829

(\d{4})[\t ]B\.C\.	four digits, followed by a tab or a space, followed by the string B.C.	2152 B.C.

\$?([0-9,]+)\.(\d*)	an optional dollar sign, followed by one or more digits and commas, followed by a period, then zero or more digits	1,234.56 $4,296,459.19 $3,5,6,4.0000 0.

What if we wanted to match a series of digits, followed by a plus sign, followed by the exact same series of digits as on the left side of the plus? In other words, we want to match "1234+1234" or "7+7", but not "5432+1984".

Using grouping parentheses, you can do this by referring to a backreference, also known as a captured subpattern. Each set of parentheses in the pattern is numbered from left to right, starting with the opening parenthesis. Later in the pattern, you can refer to the text matched within these backreferences by using a backslash followed by the number of the backreference.

For example, the pattern "(\d+)\+\1" will match a string of digits, followed by a plus sign, followed by the same digits. So, it will match "7+7" and "1234+1234", but will not match "6133+4839".
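
The backreference can be tried in Python's re module, which is assumed here as a stand-in for BBEdit's engine and uses the same \1 syntax inside a pattern. fullmatch is used so that the whole string must fit the pattern.

```python
import re

same_on_both_sides = re.compile(r"(\d+)\+\1")

samples = ["7+7", "1234+1234", "6133+4839", "5432+1984"]
results = {s: same_on_both_sides.fullmatch(s) is not None for s in samples}

# Group 1 holds the text captured by the first set of parentheses.
captured = same_on_both_sides.fullmatch("1234+1234").group(1)

print(results)
print(captured)  # 1234
```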

We will revisit subpatterns in the section on replacement, where you will see how the choice of subpatterns affects the changes you can make.

Using Alternation

The alternation operator | allows you to match any of several patterns at a given point. To use this operator, place it between two or more patterns; for example, x|y matches either x or y.

As with all of the preceding options, you can combine alternation with other pattern elements to handle more complex searches.

Pattern	Text is...	Matches...

a|t	A cat	each "a" and "t"

a|c|t	A cat	each "a", "c", and "t"

a (cat|dog) is	A cat is here. A dog is here. A giraffe is here.	"A cat is", "A dog is"

A|b+	Abba	"A", "bb", and "a"

Andy|Ted	Andy and Ted joined AAA yesterday	"Andy" and "Ted"

\d{4}|years	I've been a loyal member since 1983, almost 16 years ago.	"1983", "years"

[a-z]+|\d+	That's almost 16 years.	"That", "s", "almost", "16", "years"
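
Several rows of this table can be reproduced in Python's re module, assumed here as a stand-in for BBEdit's engine. The results in the table imply that Case Sensitive is off, so the sketch passes re.IGNORECASE; all_matches is a hypothetical helper.

```python
import re

def all_matches(pattern, text):
    """Return the full text of every match, ignoring case."""
    return [m.group() for m in re.finditer(pattern, text, re.IGNORECASE)]

print(all_matches(r"a|t", "A cat"))
# ['A', 'a', 't']

print(all_matches(
    r"a (cat|dog) is",
    "A cat is here. A dog is here. A giraffe is here.",
))
# ['A cat is', 'A dog is']

print(all_matches(r"A|b+", "Abba"))
# ['A', 'bb', 'a']

print(all_matches(
    r"\d{4}|years",
    "I've been a loyal member since 1983, almost 16 years ago.",
))
# ['1983', 'years']

print(all_matches(r"[a-z]+|\d+", "That's almost 16 years."))
# ['That', 's', 'almost', '16', 'years']
```

The parentheses in the second pattern limit the alternation to "cat" or "dog"; without them, the pattern would mean "a cat" or "dog is".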

The "Longest Match" Issue

When creating complex patterns, you should bear in mind that the quantifiers +, *, ?, and {} are "greedy". That is, they will always make the longest possible match for a given pattern, so if your pattern is E+ (one or more E's) and your text contains "EEEE", the pattern matches all the E's at once, not just the first one. This is usually what you want, but not always.

Suppose, for instance, that you want to match an HTML tag. At first, you may think that a good way to do this would be to search for the pattern:

    <.+>

consisting of a less-than sign, followed by one or more occurrences of a single character, followed by a greater-than sign. To understand why this may not work the way you think it should, consider the following sample text to be searched:

    <B>This text is in boldface.</B>

The intent was to write a pattern that would match both of the HTML tags separately. Let's see what actually happens. The < character at the beginning of this line matches the beginning of the pattern. The next character in the pattern is . which matches any character (except a line break), modified with the + quantifier. Taken together, this combination means one or more repetitions of any character. That, of course, takes care of the B. The problem is that the next > is also "any character" and that it also qualifies as "one or more repetitions." In fact, all of the text up to the end of the line qualifies as "one or more repetitions of any character" (the line break doesn't qualify, so grep stops there). After grep has reached the line break, it has exhausted the + operator, so it backs up and sees if it can find a match for >. Lo and behold, it can: the last character is a greater-than symbol. Success!

In other words, the pattern matches our entire sample line at once, not the two separate HTML tags in it as we intended. More generally, the pattern matches all the text in a given line or paragraph from the first < to the last >. The pattern only does what we intended when there is only one HTML tag in a line or paragraph. This is what we mean when we say that the regular quantifiers try to make the longest possible match.
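
The greedy behavior is easy to observe in Python's re module, assumed here as a stand-in for BBEdit's engine. The second sample is invented and uses \n as the line break.

```python
import re

line = "<B>This text is in boldface.</B>"

# Greedy: .+ runs to the end of the line, then backs up to the last ">".
greedy = re.findall(r"<.+>", line)
print(greedy)  # ['<B>This text is in boldface.</B>']

# The dot does not cross a line break, so each line is matched separately.
two_lines = "<B>one</B>\n<I>two</I>"
per_line = re.findall(r"<.+>", two_lines)
print(per_line)  # ['<B>one</B>', '<I>two</I>']
```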

Non-Greedy Quantifiers

Pattern		Matches...

p+?		one or more p's, as few as possible

p*?		zero or more p's, as few as possible

p??		zero or one p's, preferring zero

p{COUNT}?		match exactly COUNT p's, where COUNT is an integer

p{MIN,}?		match at least MIN p's, as few as possible

p{MIN,MAX}?		match at least MIN p's, but no more than MAX, as few as possible

Astute readers will note that these non-greedy quantifiers correspond exactly to their normal (greedy) counterparts, appended with a question mark.

Revisiting our problem of matching HTML tags, for example, we can search for:

    <.+?>

This matches an opening bracket, then one or more occurrences of any character other than a return, followed by a closing bracket. The non-greedy quantifier achieves the results we want, preventing BBEdit from "overrunning" the closing angle bracket and matching across several tags.
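
A minimal Python sketch shows the effect on a sample line with two tags. Python's re module is assumed as a stand-in for BBEdit's engine and uses the same +? syntax.

```python
import re

line = "<B>This text is in boldface.</B>"

# Non-greedy: .+? stops at the first ">" that completes a match.
lazy = re.findall(r"<.+?>", line)
print(lazy)  # ['<B>', '</B>']
```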

A slightly more complicated example: how could you write a pattern that matches all text between <B> and </B> HTML tags? Consider the sample text below:

    <B>Welcome</B> to the home of <B>BBEdit!</B>

As before, you might be tempted to write:

    <B>.*</B>

but for the same reasons as before, this will match the entire line of text. The solution is similar. We'll use the non-greedy *? quantifier:

    <B>.*?</B>
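
The two patterns can be compared side by side in Python's re module, which is assumed here as a stand-in for BBEdit's engine.

```python
import re

line = "<B>Welcome</B> to the home of <B>BBEdit!</B>"

# Greedy: one match running from the first <B> to the last </B>.
greedy = re.findall(r"<B>.*</B>", line)

# Non-greedy: one match per <B>...</B> pair.
lazy = re.findall(r"<B>.*?</B>", line)

print(greedy)  # ['<B>Welcome</B> to the home of <B>BBEdit!</B>']
print(lazy)    # ['<B>Welcome</B>', '<B>BBEdit!</B>']
```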