Most recent update (All By Hand(TM)): 18-Nov-2011 21:28

 

 

... drain you of your sanity,
face The Thing That Should Not Be ...

Metallica, The Thing That Should Not Be

 

 

InDesign CS4/CS5/CS5.5 GREP

GREP is one of the most powerful features in InDesign. It can save lots of time: anything you can do with the regular text search and replace, you can do with GREP, and there are lots more things you can only do with GREP.

But using GREP is hard ... The basics are easy enough to grasp: the concept of wildcards for any character, upper- and lowercase, and digits; start and end of paragraphs; zero-, once- and more-repeats; and even basic inclusion and exclusion groups. Add nested grouping and OR bars, and things tend to get muddy (even though this allows for some fairly advanced tricks). Mention Negative and Positive Lookbehind and Lookahead, and your average InDesigner's eyes will start to glaze over. Throw in Expression Modifiers, Unicode and POSIX classes, and Atomic groups. Then, for good measure, take a look at If/Then/Else constructions ... and even an experienced GREP user will soon lose his grip on his own creations.

Adobe doesn't make it easy to figure out why a certain GREP expression fails to find anything, as this may ultimately be for several reasons:

  1. Perhaps the text just isn't in your document. That's what Search is for, after all.
  2. Perhaps you made a logical error, and your GREP expression is syntactically correct, but not what you meant.
  3. And perhaps your GREP expression is simply wrong; an invalid expression always fails -- but InDesign doesn't tell you it is, only that it "Cannot find match".

While I cannot help you on the first point, I can sure make the second easier, and the third point -- validating a GREP expression -- is "only" an exercise in programming acumen. That's why I wrote a script that lets you enter a GREP expression and test it for "correctness".
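The difference between "finds nothing" and "is not a valid expression" is easier to see outside InDesign. Below is a minimal sketch in Python, whose `re` module only stands in for InDesign's GREP engine (the two flavors differ in many details). The helper name `check_grep` is made up for this illustration and has nothing to do with how the script itself works.

```python
import re

def check_grep(expression, text):
    """Classify a search as 'invalid: ...', 'no match', or 'match'.

    Hypothetical helper: Python's re module is only a stand-in for
    InDesign's GREP engine, which accepts a different flavor.
    """
    try:
        compiled = re.compile(expression)
    except re.error as err:
        # The expression itself is wrong; say why instead of
        # silently finding nothing.
        return "invalid: " + str(err)
    if compiled.search(text) is None:
        return "no match"
    return "match"

print(check_grep(r"\d+", "page 12"))    # a valid expression that matches
print(check_grep(r"\d+", "no digits"))  # valid, but nothing to find
print(check_grep(r"(\d+", "page 12"))   # unbalanced parenthesis
```

The third call is the case InDesign reports as a plain "Cannot find match"; here the error is reported separately.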

In doing so, I had the pleasure of vastly expanding my GREP knowledge, up to and beyond InDesign's capabilities, as there are things even InDesign's version cannot do. It even revealed a few tidbits of knowledge about InDesign itself. For example, it's perfectly valid to enter a NULL character into the GREP search box, but at that point, the interface to JavaScript breaks (apparently, somewhere along the way the string is handled as what's known as "zero-terminated" -- the first NULL character signifies the end of the string). Another factoid is that the "Case Insensitive" switch in GREP translates your string into lowercase, rather than uppercase. (I'll keep it a secret how I found that out, as it's a nice exercise for the reader, and quite unlikely to cause problems in daily use.) Another one is that the "Non-Marking Group" (?:text) is not really a group with a special command ?:, but in reality a "Modifier" list without any modifier at all (FYI, the allowed 'modifiers' here are i, s, m, and x -- you can enter more than one, but evidently also none at all). I was delighted when I deduced that, and a simple experiment, searching for (?-:text), proved me right (that's a Modifier Off Group, with a hyphen but containing no modifiers).

But enough about me. What's this all about?

 

What the GREP!?

That's what people are shouting when their GREP doesn't seem to be working, and that's the name of the script. Before continuing with explaining what it does and what it doesn't do, download it here:

 

whatthegrep-0.1.zip (22KB; v0.1(b), dated 18-Nov-2011)

 

Unpack the ZIP to a temporary location. Copy the file "WhatTheGrep.jsx" into your Users Script folder; if put into the correct place, it should magically appear inside your Scripts Panel in InDesign. Double-click the script to run.

 

So ... what the GREP does it do?

This is what it looks like:

 

[Screenshot: the What the GREP!? dialog]

 

On running the script, it opens a dialog where you can enter or paste a GREP expression in. Initially, the last GREP expression you used is automatically shown, but you can edit it at will, or simply delete it and replace with another one.

The Close button does just that -- it closes the dialog. May be handy if you changed your mind about running the script.

The Show Me button creates a new document in your InDesign and writes out a full explanation of each of the codes in the expression, with all of its special characters explained, and all groupings enumerated and indented. This is the part where some expression may suddenly be more comprehensible! If the expression contains an error somewhere, it'll be mentioned in this explanation.
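To give an idea of what "enumerated and indented" means, here is a much simplified sketch in Python. It is not the algorithm used in WhatTheGrep.jsx; the function name `explain`, the labels, and the list of group prefixes are all assumptions made for this example. It only understands backslash escapes, [...] classes and parentheses, and it does not handle an escaped ] inside a class.

```python
# Group openers that do not create a numbered group (illustrative list).
SPECIAL_PREFIXES = ("(?<=", "(?<!", "(?:", "(?=", "(?!", "(?>")

def explain(expression):
    """Return (lines, errors): an indented token list plus any problems found."""
    lines, errors = [], []
    depth = 0          # current nesting level of groups
    group_number = 0   # capturing groups are numbered in opening order
    i = 0
    while i < len(expression):
        ch = expression[i]
        indent = "  " * depth
        if ch == "\\":
            # A backslash and the character after it form one token.
            token = expression[i:i + 2]
            if len(token) < 2:
                errors.append("dangling backslash at the end")
            lines.append(indent + token + "  escaped character")
            i += 2
        elif ch == "[":
            # Skip to the closing bracket so ( and ) inside stay literal.
            end = expression.find("]", i + 2)
            if end == -1:
                errors.append("unclosed [ at position %d" % i)
                end = len(expression) - 1
            lines.append(indent + expression[i:end + 1] + "  character class")
            i = end + 1
        elif ch == "(":
            if expression.startswith("(?", i):
                token = next((p for p in SPECIAL_PREFIXES
                              if expression.startswith(p, i)), "(?")
                label = "special group (not numbered)"
            else:
                group_number += 1
                token = "("
                label = "capturing group %d" % group_number
            lines.append(indent + token + "  " + label)
            depth += 1
            i += len(token)
        elif ch == ")":
            if depth == 0:
                errors.append("unmatched ) at position %d" % i)
            else:
                depth -= 1
                lines.append("  " * depth + ")  end of group")
            i += 1
        else:
            lines.append(indent + ch + "  literal or operator")
            i += 1
    if depth > 0:
        errors.append("%d group(s) never closed" % depth)
    return lines, errors

lines, errors = explain(r"(a(?:b|c))\)")
print("\n".join(lines))
print("errors:", errors)
```

A real explainer needs far more cases than this, which is exactly where the rarely used command sequences start to hurt.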

As with all scripted dialogs, the Return key closes the dialog and runs the default action ("Show Me"), and the Escape key closes it without doing anything at all. Fortunately, that's not a problem with GREP -- no need to insert hard returns. You can insert Soft Returns (Ctrl+Return) but beware these will also pop up in the explanation.

 

At the top of the new document, the original GREP expression is shown. At the bottom, a cleaned-up version is written out, with color codings and super- and subscripts denoting various properties. Any erroneous code will be shown in bright red bold italics, so you can immediately see why and where the expression fails.

 

"What the GREP" understands a fair part of the regular daily commands. There are only a few cases where it reports an error where there is none -- for example, the expression [a-z[] is perfectly legal in InDesign, yet my script will notice the second opening [ and assume it's an error. However, in general, you are supposed to trust its judgement ...

There are also a few rarely used command sequences that are baffling me, and that may cause the script to either accept a faulty expression (hopefully, very rare), or signal an error where there is none (not quite as rare as I'd hope...). Then again, if you are confident enough to actually use one of those functions, you'll be able to judge for yourself what your error was, as well as appreciate the amount of work I put into this.

 

What the GREP only explains Find What expressions. Beginners, beware! "Replace With" expressions are different from this, and you cannot enter a "Replace With" expression into the dialog and expect any meaningful results.

(Some day I just might look into that as well, but not on any short notice!)

 


InDesign's GREP flavor

Following www.regular-expressions.info, here is a list of GREP features and InDesign's support for these, as I understand them.

There are a few question marks, because I'm not really sure how these functions are supposed to work ... Any comments on this list are highly appreciated!

Note: not all commands listed as "YES" are recognized by What the GREP itself.

 

Characters
Feature    CS4/CS5/CS5.5
Backslash escapes one metacharacter    YES
\Q...\E escapes a string of metacharacters    YES
\x00 through \xFF (ASCII character)    YES
\n (LF), \r (CR) and \t (tab)    YES
\f (form feed) and \v (vtab)    no
\a (bell) and \e (escape)    no
\b (backspace) and \B (backslash)    no
\cA through \cZ (control character)    YES
\ca through \cz (control character)    YES

Character Classes or Character Sets [abc]
Feature    CS4/CS5/CS5.5
[abc] character class    YES
[^abc] negated character class    YES
[a-z] character class range    YES
Hyphen in [\d-z] is a literal    ?
Hyphen in [a-\d] is a literal    ?
Backslash escapes one character class metacharacter    YES
\Q...\E escapes a string of character class metacharacters    no
\d shorthand for digits    YES
\w shorthand for word characters    YES
\s shorthand for whitespace    YES
\D, \W and \S shorthand negated character classes    YES
[\b] backspace    no

Dot
Feature    CS4/CS5/CS5.5
. (dot; any character except line break)    YES

Anchors
Feature    CS4/CS5/CS5.5
^ (start of string/line)    Paragraph Break, Soft Line Break
$ (end of string/line)    Paragraph Break, Soft Line Break
\A (start of string)    Start of Story, Footnote, Cell instead
\Z (end of string, before final line break)    End of Story, Footnote, Cell instead [a]
\z (end of string)    End of Story, Footnote, Cell instead [a]
\` (start of string)    no [b]
\' (end of string)    no [c]
[a] Behavior is not exactly the same for \Z and \z but the difference is hard to prove.
[b] Appears to do the same as \A -- Start of Story.
[c] \' is a control character, rather than two separate ASCII characters. Unknown what it does search for.

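The difference between the "every line" anchors and the "whole text" anchors can be tried out in Python, used here only as a stand-in: its `re` module needs the MULTILINE flag to behave the way the table describes InDesign's default, and its \Z matches only at the very end of the text. The sample text is made up, with a line break playing the part of a paragraph break.

```python
import re

story = "First paragraph\nSecond paragraph\nThird paragraph"

# With MULTILINE, ^ matches after every line break.
starts = re.findall(r"^\w+", story, flags=re.MULTILINE)

# \A only ever matches at the very start of the text.
first_only = re.findall(r"\A\w+", story, flags=re.MULTILINE)

# $ matches before every line break; Python's \Z only at the very end.
ends = re.findall(r"\w+$", story, flags=re.MULTILINE)
last_only = re.findall(r"\w+\Z", story, flags=re.MULTILINE)

print(starts)      # ['First', 'Second', 'Third']
print(first_only)  # ['First']
print(ends)        # ['paragraph', 'paragraph', 'paragraph']
print(last_only)   # ['paragraph']
```

In InDesign the "whole text" is a story, footnote or cell rather than a string, so treat this as an analogy only.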
Word Boundaries
Feature    CS4/CS5/CS5.5
\b (at the beginning or end of a word)    YES
\B (NOT at the beginning or end of a word)    YES
\y (at the beginning or end of a word)    no (equal to y)
\Y (NOT at the beginning or end of a word)    no (equal to Y)
\m (at the beginning of a word)    no (equal to m)
\M (at the end of a word)    no (equal to M)
\< (at the beginning of a word)    YES
\> (at the end of a word)    YES

Alternation
Feature    CS4/CS5/CS5.5
| (alternation)    YES

Quantifiers
Feature    CS4/CS5/CS5.5
? (0 or 1)    YES
* (0 or more)    YES
+ (1 or more)    YES
{n} (exactly n)    YES
{n,m} (between n and m)    YES
{n,} (n or more)    YES
? after any of the above quantifiers to make it "lazy"    YES

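What "lazy" means is easiest to see side by side with the default "greedy" behavior. A small sketch in Python (a stand-in for InDesign's engine; the quantifiers shown are written the same way in both, but the sample text is made up):

```python
import re

text = 'He said "yes" and then "no".'

# Greedy: .+ grabs as much as possible, so the match runs from the
# first quote to the last one.
greedy = re.findall(r'".+"', text)

# Lazy: .+? stops at the first closing quote it can reach.
lazy = re.findall(r'".+?"', text)

# {n,m}: between two and three digits in a row.
counted = re.findall(r"\d{2,3}", "1 22 333 4444")

print(greedy)   # ['"yes" and then "no"']
print(lazy)     # ['"yes"', '"no"']
print(counted)  # ['22', '333', '444']
```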
Grouping and Backreferences
Feature    CS4/CS5/CS5.5
(regex) (numbered capturing group)    YES
(?:regex) (non-capturing group)    YES
\1 through \9 (backreferences)    YES
\10 through \99 (backreferences)    no
Forward references \1 through \9    no
Nested references \1 through \9    no
Backreferences to non-existent groups are an error    YES
Backreferences to failed groups also fail    ?

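A classic use of a capturing group with a backreference is hunting for doubled words. The sketch below uses Python as a stand-in for InDesign's engine; the sample sentences are made up.

```python
import re

text = "This is is a test of the the GREP."

# (\w+) captures a word as group 1; \1 must then repeat that same word.
doubled = re.findall(r"\b(\w+) \1\b", text)

# (?:...) groups without capturing, so the name is still group 1.
name = re.match(r"(?:Mr|Ms)\. (\w+)", "Ms. Smith").group(1)

print(doubled)  # ['is', 'the']
print(name)     # Smith
```

The word boundaries matter: without the leading \b, the "is" at the end of "This" would pair up with the "is" that follows it.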
Modifiers
Feature    CS4/CS5/CS5.5
(?i) (case insensitive)    YES [d]
(?s) (dot matches newlines)    YES [d]
(?m) (^ and $ match at line breaks)    YES [d] (default mode)
(?x) (free-spacing mode)    YES [d]
(?n) (explicit capture)    no
(?-ismxn) (turn off mode modifiers)    YES [d]
(?ismxn:group) (mode modifiers local to group)    YES [d]
[d] Any of i, s, m, x, or none at start, none or only one after the hyphen

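Modifiers can apply to the whole expression or to one group only. A short sketch in Python, again only as a stand-in: Python (3.6 and later) accepts the group-local forms used here, but its rules for what may follow the hyphen differ from the ones in footnote d, so only the common ground is shown.

```python
import re

text = "GREP grep Grep"

# (?i) at the start: the whole expression ignores case.
everywhere = re.findall(r"(?i)grep", text)

# (?i:...) switches the modifier on for one group only:
# the first letter may be G or g, the rest must be lowercase.
scoped = re.findall(r"(?i:g)rep", text)

# (?-i:...) switches it off again inside a case-insensitive expression.
switched_off = re.findall(r"(?i)g(?-i:rep)", text)

# (?s) lets the dot match a line break as well.
dot_default = re.search(r"a.b", "a\nb")
dot_single_line = re.search(r"(?s)a.b", "a\nb")

print(everywhere)    # ['GREP', 'grep', 'Grep']
print(scoped)        # ['grep', 'Grep']
print(switched_off)  # ['grep', 'Grep']
print(dot_default is None, dot_single_line is not None)  # True True
```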
Atomic Grouping and Possessive Quantifiers
Feature    CS4/CS5/CS5.5
(?>regex) (atomic group)    YES
?+, *+, ++ and {m,n}+ (possessive quantifiers)    no

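An atomic group refuses to give back what it has matched. The sketch below shows the effect in Python, as a stand-in for InDesign's engine. Python only accepts (?>regex) from version 3.11 on, so this example imitates it with a lookahead followed by a backreference, a common workaround that relies on lookaheads not being re-entered once they have matched. Whether InDesign behaves identically on these inputs is an assumption.

```python
import re

# Ordinary group: "bc" is tried first, the final c then fails, so the
# engine backs up, takes "b" instead, and the match succeeds.
plain = re.search(r"a(?:bc|b)c", "abc")

# Atomic-like: the lookahead settles on "bc", \1 consumes exactly that,
# and there is no way back to try "b" when the final c fails.
ATOMIC_LIKE = r"a(?=(bc|b))\1c"   # imitation of a(?>bc|b)c
atomic_like = re.search(ATOMIC_LIKE, "abc")
atomic_like_ok = re.search(ATOMIC_LIKE, "abcc")

print(plain.group(0))           # abc
print(atomic_like)              # None
print(atomic_like_ok.group(0))  # abcc
```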
Lookaround
Feature    CS4/CS5/CS5.5
(?=regex) (positive lookahead)    YES
(?!regex) (negative lookahead)    YES
(?<=text) (positive lookbehind)    fixed length
(?<!text) (negative lookbehind)    fixed length

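Lookarounds check what surrounds a match without including it. The "fixed length" restriction on lookbehind happens to exist in Python as well, which makes it a convenient stand-in here; the sample text is made up, and the exact error message is Python's, not InDesign's.

```python
import re

text = "EUR 100, USD 200, 300"

# Positive lookbehind: digits directly after "USD ".
after_usd = re.findall(r"(?<=USD )\d+", text)

# Negative lookbehind: numbers NOT preceded by a capital letter and a space.
bare_numbers = re.findall(r"(?<![A-Z] )\b\d+", text)

# Positive lookahead: words directly followed by a comma.
before_comma = re.findall(r"\w+(?=,)", "one, two, three")

# A lookbehind of variable length is rejected outright.
try:
    re.compile(r"(?<=USD\s+)\d+")
    variable_length_ok = True
except re.error as err:
    variable_length_ok = False
    print("rejected:", err)

print(after_usd)     # ['200']
print(bare_numbers)  # ['300']
print(before_comma)  # ['one', 'two']
```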
Continuing from The Previous Match
Feature    CS4/CS5/CS5.5
\G (start of match attempt)    no

Conditionals
Feature    CS4/CS5/CS5.5
(?(?=regex)then|else) (using any lookaround)    YES
(?(regex)then|else)    no
(?(1)then|else)    YES
(?(group)then|else)    no

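The (?(1)then|else) form tests whether group 1 took part in the match. A sketch in Python, which supports this numbered form (but not the lookaround form from the first row), so it serves as a stand-in for that one row only. The example accepts a number with or without parentheses, but only if they are balanced.

```python
import re

# Group 1 is an optional opening parenthesis. The conditional demands
# a closing parenthesis only if group 1 actually matched; the "else"
# part is left out, which means "match nothing".
number = re.compile(r"(\()?\d+(?(1)\))")

for candidate in ["123", "(123)", "(123", "123)"]:
    print(candidate, bool(number.fullmatch(candidate)))
```

Only the first two candidates are accepted: "(123" misses its closing parenthesis, and "123)" has one that group 1 never opened.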
Comments
Feature    CS4/CS5/CS5.5
(?#comment)    YES

Free-Spacing Syntax
Feature    CS4/CS5/CS5.5
Free-spacing syntax supported    YES
Character class is a single token    ?
# starts a comment    YES

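Free-spacing mode lets you spread an expression over several lines and comment each part. A sketch in Python as a stand-in: the (?x) modifier and the # comments work the same way there, and in Python a space inside a character class still counts as a character (the row marked "?" above is about exactly that, so do not assume InDesign agrees).

```python
import re

date = re.compile(r"""(?x)   # switch on free-spacing mode
    (\d{1,2})    # day
    [-/]         # separator
    (\d{1,2})    # month
    [-/]         # separator
    (\d{4})      # year
    (?#this kind of comment also works without free-spacing)
""")

found = date.search("Most recent update: 18-11-2011")
print(found.groups())  # ('18', '11', '2011')

# In Python, whitespace inside [...] is kept even in free-spacing mode.
class_with_space = re.compile(r"(?x) [a b]+ ")
print(bool(class_with_space.fullmatch("a b")))  # True
```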
Unicode Characters
Feature    CS4/CS5/CS5.5
\X (Unicode grapheme)    YES
\u0000 through \uFFFF (Unicode character)    no
\x{0} through \x{FFFF} (Unicode character)    YES

Unicode Properties, Scripts and Blocks
Feature    CS4/CS5/CS5.5
\pL through \pC (Unicode properties)    YES (?)
\p{L} through \p{C} (Unicode properties)    YES (?)
\p{Lu} through \p{Cn} (Unicode property)    YES (?)
\p{L&} and \p{Letter&} (equivalent of [\p{Lu}\p{Ll}\p{Lt}] Unicode properties)    no
\p{IsL} through \p{IsC} (Unicode properties)    no
\p{IsLu} through \p{IsCn} (Unicode property)    no
\p{Letter} through \p{Other} (Unicode properties)    YES
\p{Lowercase_Letter} through \p{Not_Assigned} (Unicode property)    YES
\p{IsLetter} through \p{IsOther} (Unicode properties)    no
\p{IsLowercase_Letter} through \p{IsNot_Assigned} (Unicode property)    no
\p{Arabic} through \p{Yi} (Unicode script)    no
\p{IsArabic} through \p{IsYi} (Unicode script)    no
\p{BasicLatin} through \p{Specials} (Unicode block)    no
\p{InBasicLatin} through \p{InSpecials} (Unicode block)    no
\p{IsBasicLatin} through \p{IsSpecials} (Unicode block)    no
Part between {} in all of the above is case insensitive    YES
Spaces, hyphens and underscores allowed in all long names listed above (e.g. BasicLatin can be written as Basic-Latin or Basic_Latin or Basic Latin)    YES
\P (negated variants of all \p as listed above)    YES
\p{^...} (negated variants of all \p{...} as listed above)    no

Named Capture and Backreferences
Feature    CS4/CS5/CS5.5
(?<name>regex) (.NET-style named capturing group)    no
(?'name'regex) (.NET-style named capturing group)    no
\k<name> (.NET-style named backreference)    no
\k'name' (.NET-style named backreference)    no
(?P<name>regex) (Python-style named capturing group)    no
(?P=name) (Python-style named backreference)    no
multiple capturing groups can have the same name    no

XML Character Classes
Feature    CS4/CS5/CS5.5
\i, \I, \c and \C shorthand XML name character classes    ?
[abc-[abc]] character class subtraction    no

POSIX Bracket Expressions
Feature    CS4/CS5/CS5.5
[:alpha:] POSIX character class    YES
\p{Alpha} POSIX character class    YES
\p{IsAlpha} POSIX character class    no
[.span-ll.] POSIX collation sequence    YES [e]
[=x=] POSIX character equivalence    YES
[e] Apparently it does. [[.ch.]] finds the combination "ch", but it seems to be handled as if it were a single character (as can be shown using its inverse: [^[.ch.]]). It supports a small set of digraphs: ae, ch, dz, lj, ll, nj, ss. Now if someone can explain how to use this .. :-)

 

 


[Jongware]

Many thanks to Adobe for writing both InDesign and its GREP abilities, and to Jan Goyvaerts of http://www.regular-expressions.info for writing a thorough overview of All Things GREP. A definitive must-read page by page for an in-depth look at GREP!

Thanks to TroyCole for pointing out a Random Difference between CS4 and later versions (suddenly, text mysteriously appeared in [Paper] color in modern versions -- thanks to some programmer at Adobe who switched [Paper] and [Black] swatches around!).

This data is by no means exhaustive and/or authoritative. Any observation of InDesign's GREP is mine. The official reference for InDesign's GREP function can be found at the Adobe Systems Incorporated web site. Errors may have crept into my story, and in that case are most likely exclusively mine.

InDesign CS3, InDesign CS4, CS5, and CS5.5 and ExtendScript are trademarks of Adobe Systems Incorporated.