Note: This post has moved from Leapthree.com to Ayima as part of the 2018 acquisition.
Regular expressions, commonly known as regex, form the basis of much of my work with Google Analytics. Quite simply, I could not do half of what I do without them. My knowledge is based on a lot of trial and error but it all started with an ebook from Luna Metrics for which I am incredibly grateful – http://www.lunametrics.com/regex-book/Regular-Expressions-Google-Analytics.pdf.
This blog post is not meant to replace that ebook but to be a quick reference guide followed by some practical applications within Google Analytics of regular expressions.
Where you use regular expressions
Regular expressions end up being used in what feels like most Google Analytics functionality. I am going to forget at least one use but here are the key areas:
- Applying report filters (particularly with the following reports)
  - All Pages
  - Keywords (where provided)
  - All Referrals
- Creating View (Profile) filters in the configuration – examples include:
  - Rename pages
  - Exclude IP Addresses or robot traffic
  - Rename traffic sources
- Creating Segments
- Setting up Goals
- Including filters while creating Custom Reports or Dashboard widgets
With that list of uses, regex becomes something you MUST know if you are planning on using Google Analytics on a regular basis to get real value out of it.
The key characters
Regular expressions are built around a set of wildcard characters. But whereas you may think of * as a match anything character, there is a lot more flexibility here. The below is not a complete list but what you need to know to get started.
Hat (technically called a caret) ^
- This is a regular expression for “begins with”
- Matches any string that begins with whatever follows the ^
- E.g. ^aa will match “aaa”, “aab” but not “baa”
- Use for identifying category of pages e.g. ^/services
Dollar sign $
- This is the regular expression for “ends with”
- It is the simple opposite of ^
- E.g. aa$ will match “aaa”, “baa” but not “aab”
- Use for identifying category of pages e.g. html$ (no URL query parameters)
Dot .
- The simplest regex character, the dot can replace any single character
- E.g. a.b will match “aab”, “abb”, “a5b” or “a!b” but not “ab”
- Use to correct for spelling mistakes e.g. ach..ve (is it ie or ei?)
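Pipe |
- The pipe means OR and is one of the simplest characters to use when learning regex
- E.g. aa|bb will match “aa” or “bb” but not “ab” (“aabb” also matches, because it contains “aa” and Google Analytics matches anywhere in the string unless you anchor with ^ or $)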
- Use when identifying social media traffic e.g. facebook|twitter
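The four characters covered so far are easy to try outside Google Analytics. Below is a minimal sketch using Python's re module as a stand-in. That is an assumption, as Google Analytics runs its own regex engine, but re.search behaves the same way for these basics in that it matches anywhere in the string unless you anchor the pattern. The helper name matches is made up for illustration.

```python
import re

def matches(pattern, text):
    """Return True if pattern is found anywhere in text (a partial match)."""
    return re.search(pattern, text) is not None

# ^ anchors the match to the start of the string
print(matches(r"^aa", "aab"))   # True
print(matches(r"^aa", "baa"))   # False

# $ anchors the match to the end of the string
print(matches(r"aa$", "baa"))   # True
print(matches(r"aa$", "aab"))   # False

# . stands in for exactly one character
print(matches(r"a.b", "a5b"))          # True
print(matches(r"a.b", "ab"))           # False
print(matches(r"ach..ve", "acheive"))  # True, the misspelling is still caught

# | means OR
print(matches(r"facebook|twitter", "m.facebook.com"))  # True
print(matches(r"aa|bb", "ab"))                         # False
```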
Question Mark ?
- The previous character is optional and is not required for that string to match
- So a? means match a or blank
- e.g. ba?b will match “bb”, “bab” but not “baaab”
- Use with multi word brand names where the space may not be included e.g. l3 ?analytics
Plus symbol +
- The plus symbol means match one or more of the previous character
- So a+ means match one or more a’s
- e.g. ba+b will match “bab” and “baaab”
Asterisk symbol *
- Similar to the plus symbol except it matches zero or more of the previous character
- So a* means match zero or more a’s
- e.g. ba*b will match “bab”, “baaab” and “bb”
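The difference between ?, + and * is easiest to see by running all three against the same strings. This is a rough sketch in Python (an assumed stand-in for the Google Analytics regex engine) and the helper name matching is hypothetical.

```python
import re

def matching(pattern, candidates):
    """Return the candidates in which pattern is found (a partial match)."""
    return [c for c in candidates if re.search(pattern, c)]

candidates = ["bb", "bab", "baaab"]

# ? : zero or one "a" between the two b's
print(matching(r"ba?b", candidates))   # ['bb', 'bab']

# + : one or more "a"
print(matching(r"ba+b", candidates))   # ['bab', 'baaab']

# * : zero or more "a"
print(matching(r"ba*b", candidates))   # ['bb', 'bab', 'baaab']

# Optional space in a multi word brand name
brand_terms = ["l3 analytics", "l3analytics", "l3-analytics"]
print(matching(r"l3 ?analytics", brand_terms))  # ['l3 analytics', 'l3analytics']
```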
Square brackets []
- Square brackets are where it starts to get complicated. They mean match one of the characters listed inside them
- So [abc] could match any of a, b or c
- Square brackets can also include numbers and symbols so [ab12_/] means match one of a, b, 1, 2, _ or /
- And you can also use a range where [a-z0-9] means match any one letter or number
Round brackets ()
- These are like a mathematical formula in that their contents are processed independently
- E.g. b(a|c)b will match “bab” or “bcb” but not “bacb”, “ba” or “bc”
- In plain English, this example means match b, then either a OR c, then another b
Backslash \
- Sometimes one of these wildcard characters is the actual character you need to identify in a string
- In that situation, put a backslash before the character to indicate it is not a wildcard
- E.g. ba\?b will match “ba?b” but not “bb” or “bab”
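Square brackets, round brackets and the backslash can be checked the same way. Again this is only a sketch in Python, assumed here to behave like Google Analytics for these simple cases, and matches is an illustrative helper rather than anything Google Analytics provides.

```python
import re

def matches(pattern, text):
    """Return True if pattern is found anywhere in text (a partial match)."""
    return re.search(pattern, text) is not None

# [] matches exactly one character from the set (anchored here so the
# whole string has to be that single character)
print(matches(r"^[abc]$", "b"))      # True
print(matches(r"^[abc]$", "d"))      # False
print(matches(r"^[a-z0-9]$", "7"))   # True

# () groups its contents, so the | only applies inside the brackets
print(matches(r"b(a|c)b", "bcb"))    # True
print(matches(r"b(a|c)b", "bacb"))   # False

# \ turns the following wildcard into a literal character
print(matches(r"ba\?b", "ba?b"))     # True
print(matches(r"ba\?b", "bab"))      # False
```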
More complicated examples
I gave pretty simple examples when describing these wildcards but the real power comes when you combine them. Some examples of these combinations are:
- .* will match any string
- .+ will match any non-blank string
- [a-z]+ will match one or more letters
- [a-z0-9-_]+ will match one or more letters, numbers, dashes or underscores – as is found in most page names
- a(cat|dog)?b will match “acatb”, “adogb” or (as the entire contents of the () are optional), “ab”
- ^/[0-9]+/[0-9]+/ will match page names for blog posts that use the format /yyyy/mm/<blog post title>
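Two of the combinations above, tried against a handful of strings. This is a sketch in Python, used as an assumed stand-in for Google Analytics, and the page paths are invented for illustration rather than taken from a real site.

```python
import re

# Blog posts in the /yyyy/mm/<blog post title> format
blog_post = re.compile(r"^/[0-9]+/[0-9]+/")

# Hypothetical page paths
pages = ["/2013/05/regex-guide/", "/services/", "/blog/2013/05/regex-guide/"]
blog_pages = [p for p in pages if blog_post.search(p)]
print(blog_pages)  # ['/2013/05/regex-guide/']

# An optional group: the whole (cat|dog) part may be missing
optional_group = re.compile(r"a(cat|dog)?b")
strings = ["acatb", "adogb", "ab", "acowb"]
group_hits = [s for s in strings if optional_group.search(s)]
print(group_hits)  # ['acatb', 'adogb', 'ab']
```

Note that the third path is not matched because the ^ requires the year to come straight after the first slash.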
Use cases in Google Analytics
Valid Traffic Only
Your live Google Analytics profiles (I can’t get used to calling them Views) should only include data from your live website. If L3 Analytics had three subdomains (www, blog and support), the regular expression to use in the profile filter is:
This translates as “starts with www. OR blog. OR support. OR is blank, followed by l3analytics.com, ending there”.
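The expression itself does not appear in this copy of the post. Going only by the plain English translation above, it would look something like the pattern in the sketch below. Treat this as a reconstruction and an assumption, not the original expression. The check is done in Python (a stand-in for the Google Analytics regex engine) and the test hostnames are made up.

```python
import re

# Reconstructed from the description: optional www./blog./support. prefix,
# then l3analytics.com, then the end of the string. The dots are escaped
# so they only match a literal dot.
valid_hostname = re.compile(r"^(www\.|blog\.|support\.)?l3analytics\.com$")

# Hypothetical hostnames a profile filter might see
hostnames = [
    "www.l3analytics.com",
    "blog.l3analytics.com",
    "support.l3analytics.com",
    "l3analytics.com",
    "test.l3analytics.com",
    "www.l3analytics.com.example.org",
]

valid = [h for h in hostnames if valid_hostname.search(h)]
print(valid)
# ['www.l3analytics.com', 'blog.l3analytics.com',
#  'support.l3analytics.com', 'l3analytics.com']
```

The ^ and $ matter here. Without them, a hostname that merely contains l3analytics.com (like the last one in the list) would also get through.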
Identify Social Media traffic
Traffic from Social Media networks that is not tagged with campaign parameters will appear with a medium of “referral” and a source of the social media network domain name. You can rename the medium to social media for all this traffic using the following regular expression for the source:
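As a rough illustration of the logic, the sketch below reuses the facebook|twitter pattern from the pipe example earlier in this post. That pattern is an assumption and a real list of networks would be longer. The code is Python standing in for a Google Analytics filter, and classify_medium is a hypothetical helper.

```python
import re

# Assumed pattern; extend with | for every network you care about
SOCIAL_SOURCES = re.compile(r"facebook|twitter")

def classify_medium(source, medium):
    """Rename the medium to 'social media' for untagged social referrals."""
    if medium == "referral" and SOCIAL_SOURCES.search(source):
        return "social media"
    return medium

print(classify_medium("m.facebook.com", "referral"))  # social media
print(classify_medium("example.com", "referral"))     # referral
print(classify_medium("twitter.com", "cpc"))          # cpc
```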
Identify Subset of Pages
It can be easy to identify a subset of pages (e.g. product pages, article pages) if there is a good URL structure in place. If not, it can be incredibly difficult but using regular expressions makes it possible/easier. Let’s use Paperchase as an example:
To identify department pages (within the All Pages report), they will need to filter on
To identify product list pages (within the All Pages report), they will need to filter on
To identify product pages (within the All Pages report), they will need to filter on
These examples are not exact as there appears to be some variation in URLs but they will cover most cases. The idea is to have one string prior to icat for department level pages and two strings for product list level pages. There are many more details that could be identified with filters applied to these pages.
Similar logic is used if you want to set Goals for viewing a Product page or for using a filter on a Product List page.
Taking the previous logic to the next level, these pages can be renamed using Profile Filters and regular expressions. A key change here is that anything within a () is remembered by Google Analytics and can be used within a string that is output.
So, to rename blog posts that use the format of /yyyy/mm/<blog post title>, use a Custom Advanced filter and select Request URI (page name) for both Field A and Output. The regular expression and output are then:
- Field A – ^/[0-9]+/[0-9]+/(.+)
- Output – /blog/post/$A1
Where $A1 is the blog post title
You can then identify all blog posts by simply filtering on “/blog/post”.
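To see what the filter does to individual page names, here is a minimal sketch of the same extract and rebuild step in Python. It is an approximation of the Advanced filter rather than how Google Analytics implements it: m.group(1) plays the role of $A1, and the rename function and sample paths are hypothetical.

```python
import re

# Field A: the () captures everything after /yyyy/mm/
FIELD_A = re.compile(r"^/[0-9]+/[0-9]+/(.+)")

def rename(request_uri):
    """Return /blog/post/<title> for blog posts, otherwise the URI unchanged."""
    m = FIELD_A.search(request_uri)
    if m is None:
        return request_uri
    # m.group(1) is the first captured group, the equivalent of $A1
    return "/blog/post/" + m.group(1)

print(rename("/2013/05/regex-guide/"))  # /blog/post/regex-guide/
print(rename("/services/"))             # /services/
```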
Ok, regular expressions are very powerful which means they can also get very complicated. I tend to choose those advanced complicated examples myself but it would be more useful to provide examples that everyone can follow.