RegEx Cheat Sheet for SES URLs


Recently I have been going through the fantastic book, Mastering Regular Expressions, by Jeffrey Friedl. Ever since a buddy I used to work with taught me the basics of creating a URL re-write in ColdFusion, I have had a crazy fascination with regular expressions. For the CF developers reading this, it may seem like really old news, but many of the “SEO” folks will find it foreign (a programmer is laughing right now …).

So, for the non-code-oriented readers: a regular expression is a compact pattern language for finding, matching, and manipulating strings of text and data, and it is both powerful and efficient. You might be wondering how that can be of any benefit to the SEO community. I shall attempt to explain.

The first major benefit that regular expressions offer is SES (Search Engine Safe) URL re-writing. For instance, let’s take a URL that would be considered “unsafe” for search engine optimization.

~ section508.gov/index.cfm?FuseAction=Content&ID=3

Now, I am probably going to catch hell for using this URL as an example, but you have got to love the fact that the web site for Section 508, which covers accessibility standards, does not use search engine safe URLs. (Is there anyone screaming ‘REMatch’ out there?) Several characters in dynamic URLs can cause search engine spiders to stop crawling: question marks, equal signs, ampersands, and colons, to mention but a few. So, in the example above, we could simply run the URL through a regular expression that replaces all of the unwanted characters with ones that are search engine safe.

Since this is not a tutorial on ColdFusion’s regular expression functions, I’m only going to show one example of how this URL could be manipulated with a regular expression.


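<!--- Join the script path and the query string with a single space --->
<cfset dirtyURL = "#CGI.PATH_INFO# #CGI.QUERY_STRING#">
<!--- Swap the query-string delimiters for slashes --->
<cfset fixit = ReplaceList(dirtyURL, "?,=,&", "/,/,/")>
<!--- Turn the joining space (and any other whitespace) into a slash --->
<cfset cleanURL = REReplace(fixit, "[[:space:]]", "/", "ALL")>
<cfoutput>#cleanURL#</cfoutput>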

So, here we simply store the URL as a string variable, and replace the unsafe characters with desirable ones.

The end result would take an unsafe URL like this …


~ section508.gov/index.cfm?FuseAction=Content&ID=3

And return a SES URL like this …


~ section508.gov/index.cfm/FuseAction/Content/ID/3

That is much better for the search engine spiders indexing the content on your site … and … “it looks prettier”.
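For readers who don't work in ColdFusion, here is a rough sketch of the same idea in Python. The function name `to_ses_url` and its two parameters are made up for illustration, and Python's standard `re` module stands in for ColdFusion's ReplaceList and REReplace, so treat it as an approximation rather than a drop-in equivalent.

```python
import re


def to_ses_url(path_info: str, query_string: str) -> str:
    """Hypothetical helper: flatten a path and query string into slash-separated segments."""
    # Join the two parts with a space, as the ColdFusion snippet does.
    dirty_url = f"{path_info} {query_string}"
    # Replace the query-string delimiters and the joining whitespace in one pass.
    return re.sub(r"[?=&\s]", "/", dirty_url)


print(to_ses_url("/index.cfm", "FuseAction=Content&ID=3"))
# /index.cfm/FuseAction/Content/ID/3
```

A single character class does the work of both ColdFusion calls here, because every unwanted character maps to the same replacement.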

So, that is just one of the many powerful things that can be done with regular expressions, and as I learn more about them, I’ll be sure to post my discoveries, delights, and not-so-friendly encounters for all to see … (Oh joy … )

In the meantime, I have concocted a cool little cheat sheet, based on the one from Dave’s IloveJackDaniel’s old site …

I’m putting it here so that I can remember what the hell all the different metacharacters in the various flavors of RegEx syntax do.

Feel free to do with it as you will. 😉

» RegEx Cheat Sheet «

Anchors
^ ~ Start of string
\A ~ Start of string
$ ~ End of string
\Z ~ End of string
\b ~ Word boundary
\B ~ Not word boundary
\< ~ Start of word
\> ~ End of word

Quantifiers
* ~ 0 or more
+ ~ 1 or more
? ~ 0 or 1
{3} ~ Exactly 3
{3,} ~ 3 or more
{3,5} ~ 3, 4 or 5

Groups and Ranges
. ~ Any char except new line (\n)
(a|b) ~ a or b
(…) ~ Group
(?:…) ~ Passive Group
[abc] ~ Range (a or b or c)
[^abc] ~ Not a or b or c
[a-q] ~ Letter between a and q
[A-Q] ~ Upper case letter between A and Q
[0-7] ~ Digit between 0 and 7
\n ~ nth group/subpattern
Note: Ranges are inclusive.
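To make a few of these entries concrete, here is a small sketch using Python's `re` module (my choice for illustration; other flavors differ in places, and Python does not support the \< and \> word anchors). The sample strings are invented.

```python
import re

# ^ and $ anchor the pattern; {3} and {4} are exact-count quantifiers.
phone = re.compile(r"^\d{3}-\d{4}$")
print(bool(phone.match("555-1234")))   # True
print(bool(phone.match("x555-1234")))  # False

# \b restricts the match to whole words, so "concat" is skipped.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")
print(whole_words)  # ['cat', 'cat']

# (…) captures, and \1 refers back to whatever group 1 matched.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")
print(doubled.group(1))  # is
```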

Quantifier Modifiers
In the entry below, “x” stands for any quantifier.
x? ~ Ungreedy version of “x”
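The difference between a greedy and an ungreedy quantifier is easiest to see side by side. This is a minimal sketch in Python's `re` module with an invented sample string.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as it can, so the match runs to the last ">".
greedy = re.findall(r"<.+>", html)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Ungreedy: .+? stops at the first ">" it can, giving one match per tag.
lazy = re.findall(r"<.+?>", html)
print(lazy)  # ['<b>', '</b>', '<i>', '</i>']
```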

Character Classes
\c ~ Control character
\s ~ White space
\S ~ Not white space
\d ~ Digit
\D ~ Not digit
\w ~ Word
\W ~ Not word
\x ~ Hexadecimal digit
\O ~ Octal digit

Escape Character
\ ~ Escape Character

Pattern Modifiers
g ~ Global match
i ~ Case-insensitive
m ~ Multiple lines
s ~ Treat string as single line
x ~ Allow comments and white space in pattern
e ~ Evaluate replacement
U ~ Ungreedy pattern
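How the modifiers are spelled depends on the flavor. As a hedged illustration, Python's `re` module exposes i, m, s and x as the flags `re.I`, `re.M`, `re.S` and `re.X`. It has no g flag (`findall` and `sub` are already global) and no direct equivalent of e or U. The sample text is invented.

```python
import re

text = "First line\nsecond LINE"

# i: case-insensitive matching.
print(re.findall(r"line", text, re.I))  # ['line', 'LINE']

# m: ^ and $ also match at the start and end of each line.
print(re.findall(r"^\w+", text, re.M))  # ['First', 'second']

# s: the dot also matches a newline, so the pattern can span both lines.
print(re.search(r"First.*LINE", text, re.S) is not None)  # True
print(re.search(r"First.*LINE", text) is not None)        # False

# x: whitespace and comments inside the pattern are ignored.
date = re.compile(r"""
    (\d{1,2}) /   # first number
    (\d{1,2}) /   # second number
    (\d{4})       # four-digit year
""", re.X)
print(date.search("Posted 5/3/2008").groups())  # ('5', '3', '2008')
```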

Metacharacters (must be escaped)
^ $ ( ) < > [ { \ | . * + ?
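Forgetting to escape a metacharacter usually produces a pattern that still compiles but matches the wrong thing. The sketch below uses Python's `re.escape` to do the escaping automatically; the sample strings are invented.

```python
import re

literal = "index.cfm?ID=3"

# Unescaped, "." matches any character and "?" makes the "m" optional,
# so the pattern no longer matches its own text.
unescaped = re.compile(literal)
print(unescaped.search(literal))                      # None
print(unescaped.search("indexXcfID=3") is not None)   # True

# re.escape backslash-escapes the metacharacters so the text is matched literally.
escaped = re.compile(re.escape(literal))
print(escaped.search(literal).group())                # index.cfm?ID=3
```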

POSIX
[:upper:] ~ Upper case letters
[:lower:] ~ Lower case letters
[:alpha:] ~ All letters
[:alnum:] ~ Digits and letters
[:digit:] ~ Digits
[:xdigit:] ~ Hexadecimal digits
[:punct:] ~ Punctuation
[:blank:] ~ Space and tab
[:space:] ~ Whitespace characters
[:cntrl:] ~ Control characters
[:graph:] ~ Printed characters
[:print:] ~ Printed characters and spaces
[:word:] ~ Digits, letters and underscore

Special Characters
\n ~ New line
\r ~ Carriage return
\t ~ Tab
\v ~ Vertical tab
\f ~ Form feed
\xxx ~ Octal character xxx
\xhh ~ Hex character hh

String Replacement (Backreferences)
$n ~ nth non-passive group
$2 ~ “xyz” in /^(abc(xyz))$/
$1 ~ “xyz” in /^(?:abc)(xyz)$/
$` ~ Before matched string
$' ~ After matched string
$+ ~ Last matched string
$& ~ Entire matched string
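Replacement syntax varies by flavor. In Python's `re` module, used here purely as an illustration, the replacement string uses \1 and \2 where the table shows $1 and $2, and \g<0> for the entire match. The sample strings are invented.

```python
import re

# Swap the first two numbers of a date: \1, \2 and \3 play the role of $1, $2 and $3.
swapped = re.sub(r"(\d{1,2})/(\d{1,2})/(\d{4})", r"\2/\1/\3", "5/3/2008")
print(swapped)  # 3/5/2008

# \g<0> is the entire matched string, comparable to $&.
wrapped = re.sub(r"\d+", r"[\g<0>]", "ID 3 of 42")
print(wrapped)  # ID [3] of [42]

# A passive group (?:…) is not counted, so "xyz" is group 1 here...
print(re.match(r"^(?:abc)(xyz)$", "abcxyz").group(1))  # xyz
# ...and group 2 when the outer group also captures.
print(re.match(r"^(abc(xyz))$", "abcxyz").group(2))    # xyz
```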

Assertions
?= ~ Lookahead assertion
?! ~ Negative lookahead
?<= ~ Lookbehind assertion
?<! ~ Negative lookbehind
?> ~ Once-only Subexpression
?() ~ Condition [if then]
?()| ~ Condition [if then else]
?# ~ Comment

Sample Patterns
([A-Za-z0-9-]+) ~ Letters, numbers and hyphens
(\d{1,2}\/\d{1,2}\/\d{4}) ~ Date (e.g. 5/3/2008)
([^\s]+(?=\.(jpg|gif|png))\.\2) ~ jpg, gif or png image
(^[1-9]{1}$|^[1-4]{1}[0-9]{1}$|^50$) ~ Any number from 1 to 50 inclusive
(#?([A-Fa-f0-9]){3}(([A-Fa-f0-9]){3})?) ~ Valid hexadecimal colour code
((?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15}) ~ String with at least one upper case letter, one lower case letter, and one digit (useful for passwords).
(\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,6}) ~ Email addresses
(\<(/?[^\>]+)\>) ~ HTML Tags
Note: These patterns are intended for reference purposes and have not been extensively tested. Please use with caution and test thoroughly before use.
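In the spirit of that note, here is one way to try a few of the sample patterns before relying on them. It is only a sketch in Python's `re` module: the dictionary keys, the `matches` helper and the test strings are all invented, and using `re.fullmatch` (the whole string must match) is my assumption about how the patterns would be applied.

```python
import re

# Patterns copied from the cheat sheet; the dictionary keys are made-up labels.
samples = {
    "date": r"(\d{1,2}\/\d{1,2}\/\d{4})",
    "image": r"([^\s]+(?=\.(jpg|gif|png))\.\2)",
    "one_to_fifty": r"(^[1-9]{1}$|^[1-4]{1}[0-9]{1}$|^50$)",
    "hex_colour": r"(#?([A-Fa-f0-9]){3}(([A-Fa-f0-9]){3})?)",
    "password": r"((?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15})",
}


def matches(name: str, text: str) -> bool:
    """Hypothetical helper: True when the whole of text matches the named pattern."""
    return re.fullmatch(samples[name], text) is not None


print(matches("date", "5/3/2008"))        # True
print(matches("image", "photo.jpg"))      # True
print(matches("image", "notes.txt"))      # False
print(matches("one_to_fifty", "50"))      # True
print(matches("one_to_fifty", "51"))      # False
print(matches("hex_colour", "#FFF"))      # True
print(matches("password", "Passw0rdX"))   # True
print(matches("password", "password"))    # False
```

The image and password patterns both lean on the lookahead assertion (?=…) from the Assertions list: it checks what follows without consuming any characters.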
Edward J. Beckett is a passionate software engineer, web developer, server administrator and polyglot programmer with nearly a decade of experience building desktop and web applications ranging from simple personal web sites to enterprise level applications on many technology stacks including Java, Java EE, Spring, Spring MVC, Spring Data, Hibernate, SQL, JPA, JMS, HTML, CSS, JavaScript, ColdFusion, PHP, Node.js and more...
  • http://oakhazelnut.com Amber Case

    Thanks for this great cheatsheet. I just began using regular expressions in Google Analytics, and this will be a nice reference. :)