Expressions 1 2 1 – Play With Regular Expressions

This quick start gets you up to speed quickly with regular expressions. Obviously, this brief introduction cannot explain everything there is to know about regular expressions. For detailed information, consult the regular expressions tutorial. Each topic in the quick start corresponds with a topic in the tutorial, so you can easily go back and forth between the two.

Many applications and programming languages have their own implementation of regular expressions, often with slight and sometimes with significant differences from other implementations. When two applications use a different implementation of regular expressions, we say that they use different 'regular expression flavors'. This quick start explains the syntax supported by the most popular regular expression flavors.

Text Patterns and Matches

A regular expression, or regex for short, is a pattern describing a certain amount of text. On this website, regular expressions are highlighted in red as regex. This is actually a perfectly valid regex. It is the most basic pattern, simply matching the literal text regex. Matches are highlighted in blue on this site. We use the term 'string' to indicate the text that the regular expression is applied to. Strings are highlighted in green.

Characters with special meanings in regular expressions are highlighted in various colors. The regex ([Rr]egexp?)\? shows meta tokens in purple, grouping in green, character classes in orange, quantifiers and other special tokens in blue, and escaped characters in gray.

Literal Characters

The most basic regular expression consists of a single literal character, such as a. It matches the first occurrence of that character in the string. If the string is Jack is a boy, it matches the a after the J.

This regex can match the second a too. It only does so when you tell the regex engine to start searching through the string after the first match. In a text editor, you can do so by using its 'Find Next' or 'Search Forward' function. In a programming language, there is usually a separate function that you can call to continue searching through the string after the previous match.
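As a rough illustration of "first match" versus "continue after the previous match", here is a minimal sketch using Python's standard `re` module. The choice of Python is an assumption; other languages expose similar functions under different names.

```python
import re

text = "Jack is a boy"

# re.search stops at the first match: the "a" right after the "J".
first = re.search(r"a", text)
print(first.start())

# re.finditer resumes after each previous match,
# much like pressing "Find Next" in a text editor.
positions = [m.start() for m in re.finditer(r"a", text)]
print(positions)
```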

Twelve characters have special meanings in regular expressions: the backslash \, the caret ^, the dollar sign $, the period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *, the plus sign +, the opening parenthesis (, the closing parenthesis ), the opening square bracket [, and the opening curly brace {. These special characters are often called 'metacharacters'. Most of them are errors when used alone.

If you want to use any of these characters as a literal in a regex, you need to escape them with a backslash. If you want to match 1+1=2, the correct regex is 1\+1=2. Otherwise, the plus sign has a special meaning.
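The effect of escaping can be checked with a short sketch, here in Python's `re` module (an assumed environment, not something the quick start prescribes). `re.escape` is a standard-library helper that adds the backslashes for you.

```python
import re

text = "1+1=2"

# Unescaped, the plus is a quantifier: "one or more 1, then 1=2".
unescaped_on_text = re.search(r"1+1=2", text)
unescaped_on_ones = re.search(r"1+1=2", "111=2")

# Escaped, the plus is a literal character.
escaped = re.search(r"1\+1=2", text)

# re.escape builds the escaped pattern from a literal string.
auto_escaped = re.search(re.escape("1+1=2"), text)

print(unescaped_on_text, unescaped_on_ones.group(), escaped.group())
```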

Non-Printable Characters

You can use special character sequences to put non-printable characters in your regular expression. Use \t to match a tab character (ASCII 0x09), \r for carriage return (0x0D) and \n for line feed (0x0A). More exotic non-printables are \a (bell, 0x07), \e (escape, 0x1B), \f (form feed, 0x0C) and \v (vertical tab, 0x0B). Remember that Windows text files use \r\n to terminate lines, while UNIX text files use \n.

If your application supports Unicode, use \uFFFF or \x{FFFF} to insert a Unicode character. \u20AC or \x{20AC} matches the euro currency sign.

If your application does not support Unicode, use \xFF to match a specific character by its hexadecimal index in the character set. \xA9 matches the copyright symbol in the Latin-1 character set.

All non-printable characters can be used directly in the regular expression, or as part of a character class.
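A small sketch of these escapes in Python's `re` module follows. Python is an assumed flavor here; to my knowledge it accepts the \uFFFF and \xFF forms in patterns but not the \x{FFFF} form, so only the former are shown. The sample strings are made up for illustration.

```python
import re

# Split on either Windows (\r\n) or UNIX (\n) line endings.
lines = re.split(r"\r\n|\n", "first\r\nsecond\nthird")
print(lines)

# \u20AC is the euro sign; \xA9 is the copyright sign.
euro = re.search(r"\u20AC", "Price: \u20ac5")
copyright_sign = re.search(r"\xA9", "\u00a9 2003")
print(euro.start(), copyright_sign.start())
```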

Character Classes or Character Sets

A 'character class' matches only one out of several characters. To match an a or an e, use [ae]. You could use this in gr[ae]y to match either gray or grey. A character class matches only a single character. gr[ae]y does not match graay, graey or any such thing. The order of the characters inside a character class does not matter.

You can use a hyphen inside a character class to specify a range of characters. [0-9] matches a single digit between 0 and 9. You can use more than one range. [0-9a-fA-F] matches a single hexadecimal digit, case insensitively. You can combine ranges and single characters. [0-9a-fxA-FX] matches a hexadecimal digit or the letter X.

Typing a caret after the opening square bracket negates the character class. The result is that the character class matches any character that is not in the character class. q[^x] matches qu in question. It does not match Iraq since there is no character after the q for the negated character class to match.
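The character class examples above can be tried out with a minimal Python sketch (Python's `re` module is an assumption; the patterns themselves are the ones from the text).

```python
import re

# A character class matches exactly one character.
gray_words = re.findall(r"gr[ae]y", "gray grey graay graey")
print(gray_words)

# Ranges and single characters can be combined.
is_hex = re.fullmatch(r"[0-9a-fA-F]+", "1aF") is not None

# A negated class still has to match a character.
in_question = re.search(r"q[^x]", "question")
in_iraq = re.search(r"q[^x]", "Iraq")
print(is_hex, in_question.group(), in_iraq)
```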

Shorthand Character Classes

\d matches a single character that is a digit, \w matches a 'word character' (alphanumeric characters plus underscore), and \s matches a whitespace character (includes tabs and line breaks). The actual characters matched by the shorthands depend on the software you're using. In modern applications, they include non-English letters and numbers.
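How the shorthands behave in one particular flavor can be sketched in Python, assuming the standard `re` module with `str` patterns. There, \d is Unicode-aware by default and the `re.ASCII` flag narrows it to 0-9. The sample strings are illustrative only.

```python
import re

digits = re.findall(r"\d+", "Room 101, floor 3")
words = re.findall(r"\w+", "snake_case and CamelCase")
parts = re.split(r"\s+", "tabs\tand\nline breaks")
print(digits, words, parts)

# A non-English digit: ARABIC-INDIC DIGIT THREE.
arabic_three = "\u0663"
unicode_digit = re.fullmatch(r"\d", arabic_three)
ascii_digit = re.fullmatch(r"\d", arabic_three, re.ASCII)
print(unicode_digit is not None, ascii_digit is not None)
```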

The Dot Matches (Almost) Any Character

The dot matches a single character, except line break characters. Most applications have a 'dot matches all' or 'single line' mode that makes the dot match any single character, including line breaks.

gr.y matches gray, grey, gr%y, etc. Use the dot sparingly. Often, a character class or negated character class is faster and more precise.
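A brief sketch of the dot and the 'dot matches all' mode, assuming Python's `re` module, where that mode is the `re.DOTALL` flag:

```python
import re

# The dot matches any single character except a line break.
dot_matches = re.findall(r"gr.y", "gray grey gr%y")
print(dot_matches)

# Without DOTALL the dot refuses to cross the line break.
default_mode = re.search(r"a.b", "a\nb")
dotall_mode = re.search(r"a.b", "a\nb", re.DOTALL)
print(default_mode, dotall_mode is not None)
```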

Anchors

Anchors do not match any characters. They match a position. ^ matches at the start of the string, and $ matches at the end of the string. Most regex engines have a 'multi-line' mode that makes ^ match after any line break, and $ before any line break. E.g. ^b matches only the first b in bob.

\b matches at a word boundary. A word boundary is a position between a character that can be matched by \w and a character that cannot be matched by \w. \b also matches at the start and/or end of the string if the first and/or last characters in the string are word characters. \B matches at every position where \b cannot match.
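Anchors and word boundaries can be illustrated with a short Python sketch. Python's `re` module is assumed, with `re.MULTILINE` as its 'multi-line' mode; the sentence used for the word boundary test is invented.

```python
import re

# ^ matches only at the very start of the string by default.
start_only = re.findall(r"^b", "bob")

# In multi-line mode, ^ also matches after each line break.
each_line = re.findall(r"^b", "bob\nbob", re.MULTILINE)

# \b requires a word boundary on both sides of "is".
whole_word = [m.start() for m in re.finditer(r"\bis\b", "This island is beautiful")]

print(start_only, each_line, whole_word)
```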

Alternation

Alternation is the regular expression equivalent of 'or'. cat|dog matches cat in About cats and dogs. If the regex is applied again, it matches dog. You can add as many alternatives as you want: cat|dog|mouse|fish.

Alternation has the lowest precedence of all regex operators. cat|dog food matches cat or dog food. To create a regex that matches cat food or dog food, you need to group the alternatives: (cat|dog) food.
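The precedence rule is easy to verify with a minimal Python sketch (the `re` module is an assumed environment). `re.fullmatch` is used so that the whole string has to match.

```python
import re

found = re.findall(r"cat|dog", "About cats and dogs")
print(found)

# Without grouping, the alternatives are "cat" and "dog food".
ungrouped_cat = re.fullmatch(r"cat|dog food", "cat")
ungrouped_cat_food = re.fullmatch(r"cat|dog food", "cat food")

# With grouping, " food" applies to both alternatives.
grouped_cat_food = re.fullmatch(r"(cat|dog) food", "cat food")

print(ungrouped_cat is not None, ungrouped_cat_food, grouped_cat_food.group())
```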

Repetition

The question mark makes the preceding token in the regular expression optional. colou?r matches colour or color.

The asterisk or star tells the engine to attempt to match the preceding token zero or more times. The plus tells the engine to attempt to match the preceding token once or more. <[A-Za-z][A-Za-z0-9]*> matches an HTML tag without any attributes. <[A-Za-z0-9]+> is easier to write but matches invalid tags such as <1>.

Use curly braces to specify a specific amount of repetition. Use \b[1-9][0-9]{3}\b to match a number between 1000 and 9999. \b[1-9][0-9]{2,4}\b matches a number between 100 and 99999.
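The quantifiers from this section can be exercised with a small Python sketch. Python's `re` module is assumed, and the test strings are made up to probe the edges of each range.

```python
import re

optional_u = re.findall(r"colou?r", "color colour")

# Star versus plus on the tag-name pattern.
valid_tag = re.fullmatch(r"<[A-Za-z][A-Za-z0-9]*>", "<h1>")
strict_rejects = re.fullmatch(r"<[A-Za-z][A-Za-z0-9]*>", "<1>")
loose_accepts = re.fullmatch(r"<[A-Za-z0-9]+>", "<1>")

# Curly braces with word boundaries to pin down the number of digits.
four_digits = re.findall(r"\b[1-9][0-9]{3}\b", "999 1000 9999 10000")
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", "99 100 99999 100000")

print(optional_u, four_digits, three_to_five)
```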

Greedy and Lazy Repetition

The repetition operators or quantifiers are greedy. They expand the match as far as they can, and only give back if they must to satisfy the remainder of the regex. The regex <.+> matches <EM>first</EM> in This is a <EM>first</EM> test.

Place a question mark after the quantifier to make it lazy. <.+?> matches <EM> in the above string.

A better solution is to follow my advice to use the dot sparingly. Use <[^<>]+> to quickly match an HTML tag without regard to attributes. The negated character class is more specific than the dot, which helps the regex engine find matches quickly.
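To compare the three approaches side by side, here is a minimal Python sketch. The sample string with its EM tags is an assumption chosen for illustration, as is the use of Python's `re` module.

```python
import re

text = "This is a <EM>first</EM> test."

# Greedy: .+ runs to the end and backtracks to the last ">".
greedy = re.search(r"<.+>", text).group()

# Lazy: .+? stops at the first ">" it can.
lazy = re.search(r"<.+?>", text).group()

# Negated class: cannot run past a ">" in the first place.
negated = re.search(r"<[^<>]+>", text).group()

print(greedy, lazy, negated)
```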

Grouping and Capturing

Place parentheses around multiple tokens to group them together. You can then apply a quantifier to the group. E.g. Set(Value)? matches Set or SetValue.

Parentheses create a capturing group. The above example has one group. After the match, group number one contains nothing if Set was matched. It contains Value if SetValue was matched. How to access the group's contents depends on the software or programming language you're using. Group zero always contains the entire regex match.

Use the special syntax Set(?:Value)? to group tokens without creating a capturing group. This is more efficient if you don't plan to use the group's contents. Do not confuse the question mark in the non-capturing group syntax with the quantifier.
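How group contents are accessed differs per language. As one example, this sketch assumes Python's `re` module, where a group that did not take part in the match is reported as `None`.

```python
import re

short = re.fullmatch(r"Set(Value)?", "Set")
full = re.fullmatch(r"Set(Value)?", "SetValue")

# Group 0 is the whole match; group 1 is the optional part.
print(short.group(0), short.group(1))
print(full.group(0), full.group(1))

# The non-capturing version creates no numbered groups at all.
capturing_count = re.compile(r"Set(Value)?").groups
non_capturing_count = re.compile(r"Set(?:Value)?").groups
print(capturing_count, non_capturing_count)
```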

Backreferences

Within the regular expression, you can use the backreference \1 to match the same text that was matched by the capturing group. ([abc])=\1 matches a=a, b=b, and c=c. It does not match anything else. If your regex has multiple capturing groups, they are numbered counting their opening parentheses from left to right.
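A quick check of the backreference example, sketched in Python (an assumed flavor) with `re.fullmatch` so that each candidate must match in full:

```python
import re

same_on_both_sides = re.compile(r"([abc])=\1")

candidates = ["a=a", "b=b", "c=c", "a=b", "d=d"]
results = [same_on_both_sides.fullmatch(s) is not None for s in candidates]
print(dict(zip(candidates, results)))
```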

Named Groups and Backreferences

If your regex has many groups, keeping track of their numbers can get cumbersome. Make your regexes easier to read by naming your groups. (?<mygroup>[abc])=\k<mygroup> is identical to ([abc])=\1, except that you can refer to the group by its name.
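The syntax for named groups varies between flavors. As a sketch, Python's `re` module (assumed here) spells a named group (?P<name>...) and a named backreference (?P=name). The group name `letter` is made up for this example.

```python
import re

named = re.compile(r"(?P<letter>[abc])=(?P=letter)")

hit = named.fullmatch("b=b")
miss = named.fullmatch("a=c")

# The group can be read by name or by number.
print(hit.group("letter"), hit.group(1), miss)
```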

Unicode Properties

\p{L} matches a single character that is in the given Unicode category. L stands for letter. \P{L} matches a single character that is not in the given Unicode category. You can find a complete list of Unicode categories in the tutorial.

Lookaround

Lookaround is a special kind of group. The tokens inside the group are matched normally, but then the regex engine makes the group give up its match and keeps only the result. Lookaround matches a position, just like anchors. It does not expand the regex match.

q(?=u) matches the q in question, but not in Iraq. This is positive lookahead. The u is not part of the overall regex match. The lookahead matches at each position in the string before a u.

q(?!u) matches q in Iraq but not in question. This is negative lookahead. The tokens inside the lookahead are attempted, their match is discarded, and the result is inverted.

To look backwards, use lookbehind. The positive lookbehind (?<=a)b matches the b in abc. The negative lookbehind (?<!a)b fails to match abc.

You can use a full-fledged regular expression inside lookahead. Most applications only allow fixed-length expressions in lookbehind.
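The four lookaround forms can be tried with a minimal Python sketch. Python's `re` module is assumed; it is one of the flavors that only accepts fixed-length lookbehind.

```python
import re

# Lookahead: the "u" is checked but never becomes part of the match.
positive_ahead = re.search(r"q(?=u)", "question")
positive_ahead_iraq = re.search(r"q(?=u)", "Iraq")
negative_ahead = re.search(r"q(?!u)", "Iraq")
negative_ahead_question = re.search(r"q(?!u)", "question")

# Lookbehind: the "a" is checked but never becomes part of the match.
positive_behind = re.search(r"(?<=a)b", "abc")
negative_behind = re.search(r"(?<!a)b", "abc")

print(positive_ahead.group(), negative_ahead.group(), positive_behind.span())
```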

Free-Spacing Syntax

Many applications have an option, which may be labeled 'free-spacing', 'ignore whitespace', or 'comments', that makes the regular expression engine ignore unescaped spaces and line breaks and makes the # character start a comment that runs until the end of the line. This allows you to use whitespace to format your regular expression so that it is easier for humans to read and thus easier to maintain.
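In Python's `re` module, assumed here as an example flavor, free-spacing mode is the `re.VERBOSE` flag. The sketch below rewrites the four-digit number pattern with one token per line.

```python
import re

four_digit_number = re.compile(
    r"""
    \b          # start at a word boundary
    [1-9]       # first digit cannot be zero
    [0-9]{3}    # exactly three more digits
    \b          # end at a word boundary
    """,
    re.VERBOSE,
)

numbers = four_digit_number.findall("999 1000 9999 10000")
print(numbers)
```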

Page URL: https://regular-expressions.mobi/quickstart.html
Page last updated: 05 October 2020
Site last updated: 05 October 2020
Copyright © 2003-2021 Jan Goyvaerts. All rights reserved.

The information below describes the construction and syntax of regular expressions that can be used within certain Araxis products.

The text below is an edited version of the Regex++ Library's regular expression syntax documentation. The original text can be found on the Boost website.

Literals

All characters match themselves except for the following special characters:

These characters will match themselves when preceded by a \.

Wildcard

The dot character ‘.' matches any single character.

Line anchors

A ‘^' character matches the null string at the start of a line.

A ‘$' character matches the null string at the end of a line.

Repeats

A repeat is an expression that is repeated an arbitrary number of times. An expression followed by ‘*' can be repeated any number of times, including zero. An expression followed by ‘+' can be repeated any number of times, but at least once.

An expression followed by ‘?' may be repeated zero or one times only. When it is necessary to specify the minimum and maximum number of repeats explicitly, the bounds operator {} may be used. Thus a{2} is the letter ‘a' repeated exactly twice, a{2,4} represents the letter ‘a' repeated between 2 and 4 times, and a{2,} represents the letter ‘a' repeated at least twice with no upper limit. Note that there must be no whitespace inside the {}, and there is no upper limit on the values of the lower and upper bounds. All repeat expressions refer to the shortest possible previous sub-expression: a single character, a character set, or a sub-expression grouped with (), for example.

Examples

ba* will match all of b, ba, baaa, etc.

ba+ will match ba or baaaa, for example, but not b.

ba? will match b or ba.

ba{2,4} will match baa, baaa and baaaa.
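The four examples above can be checked with a short sketch. It uses Python's `re` module as a stand-in, which is an assumption: the engine in Araxis products is based on Regex++ and may differ in other respects. The helper name `full_matches` is hypothetical.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

star = full_matches(r"ba*", ["b", "ba", "baaa", "a"])
plus = full_matches(r"ba+", ["b", "ba", "baaaa"])
optional = full_matches(r"ba?", ["b", "ba", "baa"])
bounded = full_matches(r"ba{2,4}", ["ba", "baa", "baaa", "baaaa", "baaaaa"])

print(star, plus, optional, bounded)
```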

Non-greedy repeats

Non-greedy repeats are possible by appending a ‘?' after the repeat; a non-greedy repeat is one which will match the shortest possible string.

For example, to match html element pairs, one could use something like:

In this case $1 (tagged-match 1, from (.*?)) will contain the text between the tag pairs, and will be the shortest possible matching string.
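As a hedged illustration of the idea, the sketch below matches a simple element pair in Python. The tag name `b`, the sample text, and the simplified pattern are all assumptions made for this example, and Python reads the tagged match with `group(1)` rather than $1.

```python
import re

text = "<b>one</b> and <b>two</b>"

# Non-greedy: stops at the first closing tag.
shortest = re.search(r"<b>(.*?)</b>", text).group(1)

# Greedy: runs on to the last closing tag.
longest = re.search(r"<b>(.*)</b>", text).group(1)

print(shortest)
print(longest)
```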

Parenthesis

Parentheses serve two purposes, to group items together into a sub-expression, and to mark what generated the match. For example the expression (ab)* would match all of the string ababab.

When using regular expressions in Araxis Merge with a line-pairing rule, you can select which sub-expressions should be used for pairing. Likewise, when using regular expressions to ignore specific matching parts of a line rather than the entire line, you can select which sub-expressions should be ignored.

Non-marking parenthesis

Sometimes you need to group sub-expressions with parentheses, but don't want the parentheses to produce another marked sub-expression. In this case a non-marking parenthesis (?:expression) can be used. For example the following expression creates no sub-expressions:

Forward lookahead asserts

There are two forms of these; one for positive forward lookahead asserts, and one for negative lookahead asserts:

(?=abc) matches zero characters only if they are followed by the expression abc.

(?!abc) matches zero characters only if they are not followed by the expression abc.

Alternatives

Alternatives occur when the expression can match either one sub-expression or another; each alternative is separated by a ‘|'. Each alternative is the largest possible previous sub-expression; this is the opposite behaviour from repetition operators.

Examples:

a(b|c) could match ab or ac.

abc|def could match abc or def.

Character sets

A set is a set of characters that can match any single character that is a member of the set. Sets are delimited by ‘[' and ‘]' and can contain literals, character ranges, character classes, collating elements and equivalence classes. Set declarations that start with ‘^' contain the complement of (that is, everything but) the elements that follow.

Character literals

[abc] will match either of ‘a', ‘b', or ‘c'.

[^abc] will match any character other than ‘a', ‘b', or ‘c'.

Character ranges

[a-z] will match any character in the range ‘a' to ‘z'.

[^A-Z] will match any character other than those in the range ‘A' to ‘Z'.

Note that character ranges are highly locale dependent: they match any character that collates between the endpoints of the range. When using ranges with languages that may be affected by collation rules, we recommend trying out matching and mis-matching sample expressions to confirm that the ranges operate as intended.

Character classes

Character classes are denoted using the syntax [:classname:] within a set declaration, for example [[:space:]] is the set of all whitespace characters. The available character classes are:

alnum     Any alphanumeric character.
alpha     Any alphabetical character a-z and A-Z. Other characters may also be included depending upon the locale.
blank     Any blank character, either a space or a tab.
cntrl     Any control character.
digit     Any digit 0-9.
graph     Any graphical character.
lower     Any lower case character a-z. Other characters may also be included depending upon the locale.
print     Any printable character.
punct     Any punctuation character.
space     Any whitespace character.
upper     Any upper case character A-Z. Other characters may also be included depending upon the locale.
xdigit    Any hexadecimal digit character, 0-9, a-f and A-F.
word      Any word character: all alphanumeric characters plus the underscore.
unicode   Any character whose code is greater than 255. This applies to the wide character traits classes only.

There are some shortcuts that can be used in place of the character classes:

\w in place of [:word:]

\s in place of [:space:]

\d in place of [:digit:]

\l in place of [:lower:]

\u in place of [:upper:]

Collating elements

Collating elements take the general form [.tagname.] inside a set declaration, where tagname is either a single character, or a name of a collating element. For example [[.a.]] is equivalent to [a], and [[.comma.]] is equivalent to [,]. All the standard POSIX collating element names are supported, and in addition the following digraphs: ‘ae', ‘ch', ‘ll', ‘ss', ‘nj', ‘dz', ‘lj', each in lower, upper and title case variations. Multi-character collating elements can result in the set matching more than one character. For example, [[.ae.]] would match two characters, but note that [^[.ae.]] would only match one character.

Equivalence classes

Equivalence classes take the general form [=tagname=] inside a set declaration, where tagname is either a single character or a name of a collating element, and match any character that is a member of the same primary equivalence class as the collating element [.tagname.]. An equivalence class is a set of characters that collate the same; a primary equivalence class is a set of characters whose primary sort keys are all the same. For example, strings are typically collated by character, then by accent, and then by case; the primary sort key then relates to the character, the secondary to the accentuation, and the tertiary to the case. If there is no equivalence class corresponding to tagname, then [=tagname=] is exactly the same as [.tagname.].

To include a literal ‘-' in a set declaration: make it the first character after the opening [ or [^, the endpoint of a range, a collating element, or precede it with an escape character as in [\-]. To include a literal ‘[' or ‘]' or ‘^' in a set, make them the endpoint of a range, a collating element, or precede them with an escape character.

Back references

A back reference is a reference to a previous sub-expression that has already been matched. The reference is to what the sub-expression matched, not to the expression itself. A back reference consists of the escape character ‘\' followed by a digit ‘1' to ‘9'. \1 refers to the first sub-expression, \2 to the second, etc. For example, the expression (.*)\1 matches any string that is repeated about its mid-point, such as abcabc or xyzxyz. A back reference to a sub-expression that did not participate in any match matches the null string. Note that this is different to some other regular expression matchers.
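The "repeated about its mid-point" example can be sketched in Python, used here as an assumed stand-in for the Regex++ engine. Only the basic case is shown; how a back reference to a non-participating sub-expression behaves is engine-specific and not covered by this sketch.

```python
import re

repeated_halves = re.compile(r"(.*)\1")

candidates = ["abcabc", "xyzxyz", "abcabd"]
results = [repeated_halves.fullmatch(s) is not None for s in candidates]
print(dict(zip(candidates, results)))

# The sub-expression holds the first half of the string.
first_half = repeated_halves.fullmatch("abcabc").group(1)
print(first_half)
```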

Characters by code

This extension consists of the escape character followed by the digit ‘0' followed by the octal character code. For example, \023 represents the character whose octal code is 23. Where ambiguity could occur, use parentheses to break the expression up: \0103 represents the character whose code is 103; (\010)3 represents the character 10 followed by ‘3'. To match characters by their hexadecimal code, use \x followed by a string of hexadecimal digits, optionally enclosed inside {}. For example \xf0 or \x{aff}. Notice that the latter example is a Unicode character.

Word operators

The following operators are provided for compatibility with the GNU regular expression library.

\w matches any single character that is a member of the ‘word' character class. This is identical to the expression [[:word:]].

\W matches any single character that is not a member of the ‘word' character class. This is identical to the expression [^[:word:]].

\< matches the null string at the start of a word.

\> matches the null string at the end of the word.

\b matches the null string at either the start or the end of a word.

\B matches a null string within a word.

The start of the sequence passed to the matching algorithms is considered to be a potential start of a word. The end of the sequence passed to the matching algorithms is considered to be a potential end of a word.

Buffer operators

The following operators are provided for compatibility with the GNU regular expression library, and Perl regular expressions:

\` matches the start of a buffer.

\A matches the start of the buffer.

\' matches the end of a buffer.

\z matches the end of a buffer.

\Z matches the end of a buffer, or possibly one or more new line characters followed by the end of the buffer.

A buffer is considered to consist of the entire line under consideration.

Escape operator

The escape character ‘\' has several meanings.

Inside a set declaration, whatever follows the escape is a literal character regardless of its normal meaning.

The escape operator may introduce an operator, such as back references or a word operator.

The escape operator may make the following character normal. For example ‘\*' represents a literal ‘*' rather than the repeat operator.

Expressions 1 2 1 – play with regular expressions

The asterisk or star tells the engine to attempt to match the preceding token zero or more times. The plus tells the engine to attempt to match the preceding token once or more. <[A-Za-z][A-Za-z0-9]*> matches an HTML tag without any attributes. <[A-Za-z0-9]+> is easier to write but matches invalid tags such as <1>.

Use curly braces to specify a specific amount of repetition. Use b[1-9][0-9]{3}b to match a number between 1000 and 9999. b[1-9][0-9]{2,4}b matches a number between 100 and 99999.

Greedy and Lazy Repetition

The repetition operators or quantifiers are greedy. They expand the match as far as they can, and only give back if they must to satisfy the remainder of the regex. The regex <.+> matches first in This is a first test.

Place a question mark after the quantifier to make it lazy. <.+?> matches in the above string.

A better solution is to follow my advice to use the dot sparingly. Use <[^<>]+> to quickly match an HTML tag without regard to attributes. The negated character class is more specific than the dot, which helps the regex engine find matches quickly.

Grouping and Capturing

Place parentheses around multiple tokens to group them together. You can then apply a quantifier to the group. E.g. Set(Value)? matches Set or SetValue.

Parentheses create a capturing group. The above example has one group. After the match, group number one contains nothing if Set was matched. It contains Value if SetValue was matched. How to access the group's contents depends on the software or programming language you're using. Group zero always contains the entire regex match.

Use the special syntax Set(?:Value)? to group tokens without creating a capturing group. This is more efficient if you don't plan to use the group's contents. Do not confuse the question mark in the non-capturing group syntax with the quantifier.

Backreferences

Within the regular expression, you can use the backreference 1 to match the same text that was matched by the capturing group. ([abc])=1 matches a=a, b=b, and c=c. It does not match anything else. If your regex has multiple capturing groups, they are numbered counting their opening parentheses from left to right.

Named Groups and Backreferences

If your regex has many groups, keeping track of their numbers can get cumbersome. Make your regexes easier to read by naming your groups. (?[abc])=k is identical to ([abc])=1, except that you can refer to the group by its name.

Unicode Properties

p{L} matches a single character that is in the given Unicode category. L stands for letter. P{L} matches a single character that is not in the given Unicode category. You can find a complete list of Unicode categories in the tutorial.

Lookaround

Lookaround is a special kind of group. The tokens inside the group are matched normally, but then the regex engine makes the group give up its match and keeps only the result. Lookaround matches a position, just like anchors. It does not expand the regex match.

q(?=u) matches the q in question, but not in Iraq. This is positive lookahead. The u is not part of the overall regex match. The lookahead matches at each position in the string before a u.

q(?!u) matches q in Iraq but not in question. This is negative lookahead. The tokens inside the lookahead are attempted, their match is discarded, and the result is inverted.

To look backwards, use lookbehind. The positive lookbehind (?<=a)b matches the b in abc. The negative lookbehind (?a)b fails to match abc.

You can use a full-fledged regular expression inside lookahead. Most applications only allow fixed-length expressions in lookbehind.

Free-Spacing Syntax

Many application have an option that may be labeled 'free-spacing' or 'ignore whitespace' or 'comments' that makes the regular expression engine ignore unescaped spaces and line breaks and that makes the # character start a comment that runs until the end of the line. This allows you to use whitespace to format your regular expression in a way that makes it easier for humans to read and thus makes it easier to maintain.

Make a Donation

Did this website just save you a trip to the bookstore? Please make a donation to support this site, and you'll get a lifetime of advertisement-free access to this site!

| Quick Start | Tutorial | Tools & Languages | Examples | Reference | Book Reviews |

| Introduction | Regular Expressions Quick Start | Regular Expressions Tutorial | Replacement Strings Tutorial | Applications and Languages | Regular Expressions Examples | Regular Expressions Reference | Replacement Strings Reference | Book Reviews | Printable PDF | About This Site | RSS Feed & Blog |

Page URL: https://regular-expressions.mobi/quickstart.html
Page last updated: 05 October 2020
Site last updated: 05 October 2020
Copyright © 2003-2021 Jan Goyvaerts. All rights reserved.

This information below describes the construction and syntax of regular expressions that can be used within certain Araxis products.

The text below is an edited version of the Regex++ Library's regular expression syntax documentation. The original text can be found on the Boost website.

Literals

All characters match themselves except for the following special characters:

These characters will match themselves when preceded by a .

Wildcard

The dot character ‘.' matches any single character.

Line anchors

A ‘^' character matches the null string at the start of a line.

A ‘$' character matches the null string at the end of a line.

Repeats

A repeat is an expression that is repeated an arbitrary number of times. An expression followed by ‘*' can be repeated any number of times, including zero. An expression followed by ‘+' can be repeated any number of times, but at least once.

An expression followed by ‘?' may be repeated zero or one times only. When it is necessary to specify the minimum and maximum number of repeats explicitly, the bounds operator {} may be used. Thus. a{2} is the letter ‘a' repeated exactly twice, a{2,4} represents the letter ‘a' repeated between 2 and 4 times, and a{2,} represents the letter ‘a' repeated at least twice with no upper limit. Note that there must be no whitespace inside the {}, and there is no upper limit on the values of the lower and upper bounds. All repeat expressions refer to the shortest possible previous sub-expression: a single character; a character set, or a sub-expression grouped with () for example.

Examples

ba* will match all of b, ba, baaa, etc.

ba+ will match ba or baaaa, for example, but not b.

ba? will match b or ba.

ba{2,4} will match baa, baaa and baaaa.

Non-greedy repeats

Non-greedy repeats are possible by appending a ‘?' after the repeat; a non-greedy repeat is one which will match the shortest possible string.

For example, to match html element pairs, one could use something like:

In this case $1 (tagged-match 1, from (.*?)) will contain the text between the tag pairs, and will be the shortest possible matching string.

Parenthesis

Parentheses serve two purposes, to group items together into a sub-expression, and to mark what generated the match. For example the expression (ab)* would match all of the string ababab.

When using regular expressions in Araxis Merge with a line-pairing rule, you can select which sub-expressions should be used for pairing. Likewise, when using regular expresions to ignore specific matching parts of a line rather than the entire line using an expression, you can select which sub-expressions should be ignored.

Non-marking parenthesis

Sometimes you need to group sub-expressions with parenthesis, but don't want the parenthesis to spit out another marked sub-expression, in this case a non-marking parenthesis (?:expression) can be used. For example the following expression creates no sub-expressions:

Forward lookahead asserts

There are two forms of these; one for positive forward lookahead asserts, and one for negative lookahead asserts:

(?=abc) matches zero characters only if they are followed by the expression abc.

(?!abc) matches zero characters only if they are not followed by the expression abc.

Alternatives

Alternatives occur when the expression can match either one sub-expression or another, each alternative is separated by a ‘|'. Each alternative is the largest possible previous sub-expression; this is the opposite behaviour from repetition operators.

Examples:

a(b|c) could match ab or ac.

abc|def could match abc or def.

Character sets

A set is a set of characters that can match any single character that is a member of the set. Sets are delimited by ‘[' and ‘]' and can contain literals, character ranges, character classes, collating elements and equivalence classes. Set declarations that start with ‘^' contain the compliment of (that is, everything but) the elements that follow.

Character literals

[abc] will match either of ‘a', ‘b', or ‘c'.

[^abc] will match any character other than ‘a', ‘b', or ‘c'.

Character ranges

[a-z] will match any character in the range ‘a' to ‘z'.

[^A-Z] will match any character other than those in the range ‘A' to ‘Z'.

Note that character ranges are highly locale dependent: they match any character that collates between the endpoints of the range. When using ranges with languages that may be affected by collation rules, we recommend trying out matching and mis-matching sample expressions to confirm that the ranges operate as intended.
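One way to try out sets and ranges on sample text is sketched below using Python's re module. This is an assumption for illustration only: Python compares range endpoints by code point rather than by locale collation, so results for non-ASCII text may differ from the matcher described here.

```python
import re

# Literal set: any single "a", "b" or "c".
in_set = re.findall(r"[abc]", "aback")

# Range: any single lower-case ASCII letter.
lower = re.findall(r"[a-z]", "Regex 101")

# Negated range: anything except "A" to "Z".
not_upper = re.findall(r"[^A-Z]", "AbC1")
```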

Character classes

Character classes are denoted using the syntax [:classname:] within a set declaration, for example [[:space:]] is the set of all whitespace characters. The available character classes are:

alnum     Any alphanumeric character.
alpha     Any alphabetical character a-z and A-Z. Other characters may also be included depending upon the locale.
blank     Any blank character, either a space or a tab.
cntrl     Any control character.
digit     Any digit 0-9.
graph     Any graphical character.
lower     Any lower case character a-z. Other characters may also be included depending upon the locale.
print     Any printable character.
punct     Any punctuation character.
space     Any whitespace character.
upper     Any upper case character A-Z. Other characters may also be included depending upon the locale.
xdigit    Any hexadecimal digit character, 0-9, a-f and A-F.
word      Any word character - all alphanumeric characters plus the underscore.
unicode   Any character whose code is greater than 255, this applies to the wide character traits classes only.

There are some shortcuts that can be used in place of the character classes:

\w in place of [:word:]

\s in place of [:space:]

\d in place of [:digit:]

\l in place of [:lower:]

\u in place of [:upper:]
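The first three shortcuts can be tried in Python's re module, used here as an assumed stand-in. Python does not accept the [[:classname:]] syntax, nor \l and \u as character classes, so only \w, \s and \d are shown.

```python
import re

sample = "id_42 ok"

words = re.findall(r"\w+", sample)   # runs of word characters
digits = re.findall(r"\d", sample)   # individual digits
spaces = re.findall(r"\s", sample)   # individual whitespace characters
```

Note that the underscore counts as a word character, so "id_42" is returned as a single run.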

Collating elements

Collating elements take the general form [.tagname.] inside a set declaration, where tagname is either a single character, or a name of a collating element. For example [[.a.]] is equivalent to [a], and [[.comma.]] is equivalent to [,]. All the standard POSIX collating element names are supported, and in addition the following digraphs: ‘ae', ‘ch', ‘ll', ‘ss', ‘nj', ‘dz', ‘lj', each in lower, upper and title case variations. Multi-character collating elements can result in the set matching more than one character. For example, [[.ae.]] would match two characters, but note that [^[.ae.]] would only match one character.

Equivalence classes

Equivalence classes take the general form [=tagname=] inside a set declaration, where tagname is either a single character or the name of a collating element. An equivalence class matches any character that is a member of the same primary equivalence class as the collating element [.tagname.]. An equivalence class is a set of characters that collate the same; a primary equivalence class is a set of characters whose primary sort keys are all the same. (For example, strings are typically collated by character, then by accent, and then by case; the primary sort key then relates to the character, the secondary to the accentuation, and the tertiary to the case.) If there is no equivalence class corresponding to tagname, then [=tagname=] is exactly the same as [.tagname.].

To include a literal ‘-' in a set declaration, make it the first character after the opening [ or [^, the endpoint of a range, or a collating element, or precede it with an escape character, as in [\-]. To include a literal ‘[', ‘]' or ‘^' in a set, make it the endpoint of a range or a collating element, or precede it with an escape character.

Back references

A back reference is a reference to a previous sub-expression that has already been matched. The reference is to what the sub-expression matched, not to the expression itself. A back reference consists of the escape character ‘\' followed by a digit ‘1' to ‘9'. \1 refers to the first sub-expression, \2 to the second, etc. For example, the expression (.*)\1 matches any string that is repeated about its mid-point, such as abcabc or xyzxyz. A back reference to a sub-expression that did not participate in any match matches the null string. Note that this is different to some other regular expression matchers.
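The repeated-string example can be sketched with Python's re module, assumed here as a stand-in that shares the \1 syntax. The third test string is an invented counter-example.

```python
import re

# (.*) captures the first half; \1 requires the same text again.
pattern = re.compile(r"(.*)\1")

results = {s: pattern.fullmatch(s) is not None
           for s in ("abcabc", "xyzxyz", "abcabd")}

# The captured sub-expression is the half that gets repeated.
half = pattern.fullmatch("abcabc").group(1)
```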

Characters by code

This extension consists of the escape character followed by the digit ‘0' followed by the octal character code. For example, \023 represents the character whose octal code is 23. Where ambiguity could occur, use parentheses to break the expression up: \0103 represents the character whose code is 103; (\010)3 represents the character 10 followed by ‘3'. To match characters by their hexadecimal code, use \x followed by a string of hexadecimal digits, optionally enclosed inside {}. For example \xf0 or \x{aff}. Notice that the latter example is a Unicode character.
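Matching by character code can be sketched in Python's re module, with some caveats that are assumptions of this example rather than statements about the matcher described here: Python reads at most two octal digits after \0, requires exactly two hexadecimal digits after \x, and has no \x{...} form, so \uXXXX is used for the Unicode case.

```python
import re

# Octal 11 and hexadecimal 09 both denote the tab character (code 9).
by_octal = re.fullmatch(r"\011", "\t")
by_hex = re.fullmatch(r"\x09", "\t")

# Python's nearest equivalent of \x{aff}: a four-digit \u escape.
by_unicode = re.fullmatch(r"\u0aff", chr(0x0AFF))
```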

Word operators

The following operators are provided for compatibility with the GNU regular expression library.

\w matches any single character that is a member of the ‘word' character class. This is identical to the expression [[:word:]].

\W matches any single character that is not a member of the ‘word' character class. This is identical to the expression [^[:word:]].

\< matches the null string at the start of a word.

\> matches the null string at the end of the word.

\b matches the null string at either the start or the end of a word.

\B matches a null string within a word.

The start of the sequence passed to the matching algorithms is considered to be a potential start of a word. The end of the sequence passed to the matching algorithms is considered to be a potential end of a word.
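The word operators can be explored with the sketch below, which uses Python's re module as an assumed stand-in. Python supports \b and \B but not \< and \>, so those two are emulated here with \b plus a lookaround; that emulation is an assumption of this example, not a description of how the matcher implements them.

```python
import re

text = "cat concat cat."

# "cat" only as a complete word.
whole_words = re.findall(r"\bcat\b", text)

# "cat" only where it starts inside a word (the tail of "concat").
inside_word = [m.start() for m in re.finditer(r"\Bcat", text)]

# Emulated \< and \>: a boundary followed by, or preceded by, a word character.
word_starts = [m.start() for m in re.finditer(r"\b(?=\w)", "hi there")]
word_ends = [m.start() for m in re.finditer(r"\b(?<=\w)", "hi there")]
```

The start and end of the string count as potential word boundaries, which is why offsets 0 and 8 appear in the results.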

Buffer operators

The following operators are provided for compatibility with the GNU regular expression library and with Perl regular expressions:

\` matches the start of a buffer.

\A matches the start of the buffer.

\' matches the end of a buffer.

\z matches the end of a buffer.

\Z matches the end of a buffer, or possibly one or more new line characters followed by the end of the buffer.

A buffer is considered to consist of the entire line under consideration.
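The difference between buffer anchors and line anchors can be sketched in Python's re module, with the whole subject string playing the role of the buffer. This is an assumed analogy: Python has no \` or \' operators, and its \Z matches only at the very end of the string, which corresponds to \z in the list above rather than to \Z.

```python
import re

text = "one\ntwo"

# With MULTILINE, ^ and $ also match at line boundaries inside the string.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
line_ends = re.findall(r"\w+$", text, re.MULTILINE)

# \A and \Z stay pinned to the two ends of the whole string.
buffer_start = re.findall(r"\A\w+", text, re.MULTILINE)
buffer_end = re.findall(r"\w+\Z", text, re.MULTILINE)
```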

Escape operator

The escape character ‘\' has several meanings.

Inside a set declaration, whatever follows the escape is a literal character regardless of its normal meaning.

The escape operator may introduce an operator, such as back references or a word operator.

The escape operator may make the following character normal. For example ‘\*' represents a literal ‘*' rather than the repeat operator.
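A brief sketch of the escape character making an operator literal, and of escapes inside a set, using Python's re module as an assumed stand-in. The sample strings are invented.

```python
import re

# \* is a literal asterisk; a bare * repeats the preceding "a".
literal_star = re.fullmatch(r"a\*b", "a*b")
repeat_star = re.fullmatch(r"a*b", "aaab")
not_literal = re.fullmatch(r"a*b", "a*b")

# Inside a set, an escaped ], ^ or - is just that character.
in_set = re.findall(r"[\]\^\-]", "a]b^c-d")
```

Here not_literal is None, because the unescaped pattern has no way to match the asterisk in the text.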

Single character escape sequences

The following escape sequences are aliases for single characters:

Escape sequence   Character code   Meaning
\a                0x07             Bell character.
\f                0x0C             Form feed.
\n                0x0A             Newline character.
\r                0x0D             Carriage return.
\t                0x09             Tab character.
\v                0x0B             Vertical tab.
\e                0x1B             ASCII Escape character.
\0dd              0dd              An octal character code, where dd is one or more octal digits.
\xXX              0xXX             A hexadecimal character code, where XX is one or more hexadecimal digits.
\x{XX}            0xXX             A hexadecimal character code, where XX is one or more hexadecimal digits, optionally a Unicode character.
\cZ               z-@              An ASCII escape sequence control-Z, where Z is any ASCII character greater than or equal to the character code for ‘@'.
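Several rows of the table above can be checked mechanically. The sketch below does so with Python's re module, assumed as a stand-in; \e and \cZ are left out because Python's re does not accept them. The ESCAPES name is introduced only for this example.

```python
import re

# Escape sequence -> character code, copied from the table above.
ESCAPES = {
    r"\a": 0x07,  # bell
    r"\f": 0x0C,  # form feed
    r"\n": 0x0A,  # newline
    r"\r": 0x0D,  # carriage return
    r"\t": 0x09,  # tab
    r"\v": 0x0B,  # vertical tab
}

# Each escape, used as a pattern, should match exactly the character with that code.
results = {seq: re.fullmatch(seq, chr(code)) is not None
           for seq, code in ESCAPES.items()}
```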

Miscellaneous escape sequences

The following are provided mostly for Perl compatibility, but note that there are some differences in the meanings of \l, \L, \u and \U:

\w   Equivalent to [[:word:]].
\W   Equivalent to [^[:word:]].
\s   Equivalent to [[:space:]].
\S   Equivalent to [^[:space:]].
\d   Equivalent to [[:digit:]].
\D   Equivalent to [^[:digit:]].
\l   Equivalent to [[:lower:]].
\L   Equivalent to [^[:lower:]].
\u   Equivalent to [[:upper:]].
\U   Equivalent to [^[:upper:]].
\C   Any single character, equivalent to ‘.'.
\X   Match any Unicode combining character sequence, for example a\x 0301 (a letter a with an acute).
\Q   The begin quote operator. Everything that follows is treated as a literal character until a \E end quote operator is found.
\E   The end quote operator. Terminates a sequence begun with \Q.
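The negated classes, and the idea behind the quote operators, can be sketched in Python's re module under some stated assumptions: Python supports \W, \S and \D, but has no \Q...\E, \l, \u, \C or \X. Its re.escape function is used below as the nearest equivalent of quoting a run of literal text.

```python
import re

non_digits = re.findall(r"\D+", "a1bc22d")   # runs of non-digit characters
non_space = re.findall(r"\S+", "two words")  # runs of non-whitespace characters
non_word = re.findall(r"\W", "a-b c")        # individual non-word characters

# Stand-in for \Q1+1=2?\E: escape every operator in the literal text.
quoted = re.escape("1+1=2?")
literal_hit = re.search(quoted, "is 1+1=2? yes")
```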

What gets matched?

The regular expression library will match the first possible matching string. If more than one string starting at a given location can match, it matches the longest possible string. In cases where there are multiple possible matches all starting at the same location and all of them are the same length, the match chosen is the one with the longest first sub-expression. If that is the same for two or more matches, the second sub-expression will be examined and so on.
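The first two rules (earliest start, then longest match) can be illustrated with a small sketch in Python. Python's own re module does not follow the longest-match rule for alternatives: it takes the first alternative that succeeds. The hypothetical helper leftmost_longest below therefore tries each alternative separately and keeps the longest result. It is an approximation for simple alternatives only and does not model the sub-expression tie-break.

```python
import re

def leftmost_longest(alternatives, text):
    """Return (start, matched_text) using an earliest-start, then longest, rule.

    Hypothetical helper for illustration; it is not part of any library.
    """
    compiled = [re.compile(a) for a in alternatives]
    for start in range(len(text) + 1):
        # Try every alternative anchored at this start position.
        matches = [p.match(text, start) for p in compiled]
        candidates = [m.group(0) for m in matches if m is not None]
        if candidates:
            return start, max(candidates, key=len)
    return None

# Python picks the first alternative that works at the earliest position.
python_choice = re.search(r"a|ab", "xab").group(0)

# The rule described above would prefer the longer "ab" at the same position.
longest_choice = leftmost_longest(["a", "ab"], "xab")
```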




