Tuesday, October 13, 2009

My regex Notes

These are my notes on regex. They are not formatted yet, but I will update this post soon.

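[] = square brackets act as an OR operator over single characters: [abc] matches a or b or c. This is also called a character class.
[^xyz] = ^ at the beginning of [] negates the class: it matches any one character that is not x, y, or z.
^abc means the string must start with abc
abc$ means the string must end with abc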
example:
^\s+ = one or more whitespace characters at the beginning
\s+$ = one or more whitespace characters at the end (trailing whitespace)
in Perl a trim function can be $input =~ s/^\s+|\s+$//g
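
Here is a small sketch of the character class and anchor notes in Python's re module. The sample strings and the trim helper name are my own, chosen only for illustration. The trim pattern is the same one as in the Perl one-liner.

```python
import re

# Character class: gr[ae]y matches "gray" or "grey", but not "groy".
print(re.findall(r"gr[ae]y", "gray grey groy"))

# Negated class: [^xyz] matches any single character except x, y, or z.
print(re.findall(r"[^xyz]", "xaybz"))

# Anchors: ^abc needs "abc" at the start, abc$ needs "abc" at the end.
print(bool(re.search(r"^abc", "abcdef")))
print(bool(re.search(r"abc$", "xyzabc")))
print(bool(re.search(r"^abc", "xabc")))


def trim(text):
    """Remove leading and trailing whitespace (hypothetical helper)."""
    return re.sub(r"^\s+|\s+$", "", text)


print(repr(trim("  hello world \t\n")))
```

re.sub replaces every match by default, so there is no need for an equivalent of Perl's /g flag here.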

Word boundaries:
[a-zA-Z0-9_] are word characters. [a-zA-Z0-9_] = \w
\b matches a word boundary: a position between a word character and a non-word character, or the start or end of the string.
\bis => "my name is priyank" and "I like Hawaii islands" will match. But "this was a book" won't.
\bis\b => Only "my name is priyank" will match.
is\b => "my name is priyank" and "this was a book" will match. "I like Hawaii islands" won't.
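
The three boundary cases can be checked with a short Python sketch. The matching helper is a made-up name for this example, and the sentences are the ones from the notes.

```python
import re

sentences = [
    "my name is priyank",
    "I like Hawaii islands",
    "this was a book",
]


def matching(pattern):
    """Return the sentences in which the pattern is found somewhere."""
    return [s for s in sentences if re.search(pattern, s)]


# "is" at the start of a word: the first two sentences.
print(matching(r"\bis"))

# "is" as a whole word: only the first sentence.
print(matching(r"\bis\b"))

# "is" at the end of a word: the first and the third sentence.
print(matching(r"is\b"))
```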

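Regex is Eager
The engine returns the leftmost match it can find, and in an alternation it takes the first alternative that works, even if a later alternative would match more text.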
ref: http://www.regular-expressions.info/alternation.html
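
To see what eager means in practice, here is a minimal Python sketch. The alternation over Get, GetValue, Set and SetValue is just an illustrative pattern, and I am assuming Python's re engine here, which tries alternatives from left to right.

```python
import re

text = "SetValue"

# The engine tries the alternatives left to right and stops at the first
# one that succeeds, so it reports "Set" and never tries "SetValue".
eager = re.search(r"Get|GetValue|Set|SetValue", text)
print(eager.group())

# With word boundaries around the group, "Set" fails at the closing \b
# (there is no boundary between "t" and "V"), so the engine backtracks
# and ends up matching "SetValue".
bounded = re.search(r"\b(Get|GetValue|Set|SetValue)\b", text)
print(bounded.group())
```

So the order of the alternatives matters: put the longer option first, or use boundaries, when you want the longer match.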

You use "?" for optional items. ? means 0 or 1 times.
B4?U matches both BU and B4U.
Jan(uary)? matches Jan and January. (Note that (Jan)?uary would match uary and January instead.)
* means 0 or more times.
+ means 1 or more times.
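
A quick Python check of the three quantifiers. I use re.fullmatch so the whole string has to match. The ab*c and ab+c patterns and the full_match helper are extra examples I made up for * and +.

```python
import re


def full_match(pattern, text):
    """True if the pattern matches the entire text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None


# ? : the preceding item is optional (0 or 1 times).
print(full_match(r"B4?U", "BU"), full_match(r"B4?U", "B4U"))
print(full_match(r"B4?U", "B44U"))  # two 4s is one too many

# ? after a group makes the whole group optional.
print(full_match(r"Jan(uary)?", "Jan"), full_match(r"Jan(uary)?", "January"))

# * : 0 or more times, so "ac" is fine.
print(full_match(r"ab*c", "ac"), full_match(r"ab*c", "abbbc"))

# + : 1 or more times, so "ac" is rejected.
print(full_match(r"ab+c", "ac"), full_match(r"ab+c", "abc"))
```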

You also use "?" after a quantifier to make it non-greedy.
So in "tennis is a nice sport. ", the regex (.*is) will match the string "tennis is", but with the regex (.*?is).* the group will capture only "tennis".
So "?" makes the quantifier lazy, or ungreedy.
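
The greedy and lazy cases on the tennis sentence, sketched in Python. The variable names are mine.

```python
import re

text = "tennis is a nice sport. "

# Greedy: .* grabs as much as it can and then backtracks to the LAST "is".
greedy = re.search(r"(.*is)", text)
print(repr(greedy.group(1)))

# Lazy: .*? grabs as little as it can and stops at the FIRST "is",
# which is the one at the end of "tennis".
lazy = re.search(r"(.*?is).*", text)
print(repr(lazy.group(1)))

# The trailing .* still consumes the rest, so the overall match
# (group 0) is the whole sentence.
print(repr(lazy.group(0)))
```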

Round brackets are used for grouping.
Round brackets also create a 'backreference'.
What is a "back reference" in regex?
Ans: A backreference stores the part of the string matched by the part of the regex inside the parentheses.
Creating backreferences slows down the regex engine a little, as it has to do some more work.
You use "?:" right after the opening bracket, as in (?:abc), to tell the regex engine not to create a backreference.
How to use a backreference?
A backreference allows you to re-use the matched part of a regex. You can re-use it inside the regex itself, for example as \1 for the first group.
A very good example of a backreference is finding an opening and closing HTML tag and the content inside that tag.
<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>
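
A rough Python sketch of the HTML tag example, assuming the pattern is meant to end with </\1> so that the closing tag has to repeat whatever the first group captured. I added a second group around .*? to pull out the content, and the sample HTML strings are made up. The last part shows (?:...) grouping without creating a backreference.

```python
import re

# \1 refers back to the tag name captured by the first group.
tag_pattern = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>")

html = "<B>bold text</B> and <I>italic</I>"
pairs = [(m.group(1), m.group(2)) for m in tag_pattern.finditer(html)]
print(pairs)

# A closing tag that does not match the opening tag is rejected.
mismatch = tag_pattern.search("<B>oops</I>")
print(mismatch)

# (?:...) groups without capturing, so the only numbered group
# here is the digits.
day = re.search(r"(?:Jan|Feb)-(\d+)", "Feb-14")
print(day.groups())
```

This is only good enough for simple, non-nested tags in upper case. For real HTML a proper parser is the safer choice.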


