Read This If You Don’t Know Enough About Regex
Regular expressions are a way to do pattern matching on text. They’re often overlooked, and most programmers think of them as something that’s hard to learn. With Ephesoft you need to use regex to extract information from a document.
Even though regular expressions are very powerful, a lot of programmers aren’t very knowledgeable about them.
Diving into regular expressions can be hard because they have a real learning curve. But once you get to know them a little better, you’ll see that they can accomplish quite a lot.
One of the most obvious use cases is to search large codebases for certain pieces of text. If you’re a web developer you’ve probably used regular expressions at some point in your career to validate user data. And regular expressions can even be used to prohibit committing certain strings.
As you can see, there are a lot of use cases for regular expressions, and they’re versatile enough that you can’t simply ignore them.
That’s exactly why it’s a good thing to get more familiar with regular expressions. To become a better programmer, it helps to get a feel for them and to know at least the basics.
In this article, we’ll cover those basics. Along with the theory, each concept is demonstrated with examples that clarify the way regular expressions work.
Starting at the Very Beginning
The most basic example of a regular expression is when you’re trying to find a certain string in a text. We’ll use the alphabet, abcdefghijklmnopqrstuvwxyz, for this example.
In order to search through the text for the string hij, we use the following regex: /hij/. Note that the string is surrounded by forward slashes. These indicate the start and end of the regular expression.
Run against the alphabet, this regex gives us a match. It’s good to know that this search is case-sensitive and only returns the first match.
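The literal search described above can be sketched in JavaScript, whose regex literals use the same slash notation. This is only an illustration: the article doesn’t name an engine, and the variable names are assumptions.

```javascript
// The alphabet used as sample text in this section.
const alphabet = "abcdefghijklmnopqrstuvwxyz";

// Without flags, match() returns only the first match (or null).
const firstMatch = alphabet.match(/hij/);
console.log(firstMatch[0]);    // "hij"
console.log(firstMatch.index); // 7

// The search is case-sensitive, so an uppercase pattern finds nothing.
const upperMatch = alphabet.match(/HIJ/);
console.log(upperMatch); // null
```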
But most of the time you want to do something that’s a little more complex than what we did in the previous example. One of the things that you might want to do is look for a string that starts with certain characters.
So let’s do that. Let’s make a regex that matches abc at the start of a string. In order to do that you’ll have to prefix your regex with a ^, which results in the regex /^abc/.
We could do the same if we want to look for a string that has certain characters at the end, but we’ll have to use $ instead of ^ and place it after the characters as a suffix. This results in the regex /xyz$/.
It’s also possible to combine both of these anchors, which results in an exact string match: /^abc$/ only matches the string abc.
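A minimal JavaScript sketch of the two anchors, again using the alphabet as sample text. The variable names are illustrative assumptions, not something the article defines.

```javascript
const sampleText = "abcdefghijklmnopqrstuvwxyz";

// ^ anchors the pattern to the start of the string.
const startsWithAbc = /^abc/.test(sampleText); // true
const startsWithXyz = /^xyz/.test(sampleText); // false

// $ anchors the pattern to the end of the string.
const endsWithXyz = /xyz$/.test(sampleText); // true

// Both anchors together only accept the exact string.
const exactAbc = /^abc$/;
const matchesAbc = exactAbc.test("abc");   // true
const matchesAbcd = exactAbc.test("abcd"); // false

console.log(startsWithAbc, startsWithXyz, endsWithXyz, matchesAbc, matchesAbcd);
```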
Summary
bar matches a string that contains bar
^bar matches a string that starts with bar
bar$ matches a string that ends with bar
^foo bar$ matches a string that is exactly foo bar
Escaping
A few characters have a special meaning in regular expressions. Those characters need to be escaped in order to do a literal search on them.
Here’s a list of the special characters:
[ \ ^ $ . | ? * + ( ) {
Let’s say that we want to search for a question mark. In regular expressions, the question mark has a special meaning. So in order to do a literal search for the question mark, we need to escape it.
You can escape a character by prefixing it with a backslash: \?. This lets you use a special character as a regular one.
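To make the effect of escaping concrete, here is a small JavaScript sketch. The sample strings and variable names are assumptions chosen for illustration.

```javascript
// \? matches a literal question mark; the index shows where it was found.
const questionMark = "Really?".match(/\?/);
console.log(questionMark.index); // 6

// An unescaped dot matches any character, so it also matches the "2" here.
const unescapedDot = /1.5/.test("125"); // true

// An escaped dot only matches a literal ".".
const escapedDotOnDigits = /1\.5/.test("125");  // false
const escapedDotOnDecimal = /1\.5/.test("1.5"); // true

// When the pattern is built from a string, the backslash itself must be doubled.
const fromString = new RegExp("\\?").test("Really?"); // true
```

The last line matters in practice: a pattern stored in a string or a configuration field often needs the backslash written twice, depending on how that string is parsed.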
Summary
In order to do a literal search on special characters, [ \ ^ $ . | ? * + ( ) {, you need to escape them with a backslash.
Flags
Now that you have a better understanding of how to construct regular expressions, it’s time to go over another fundamental part of regular expressions, called flags.
As you know by now, a regular expression usually comes in a form where the search pattern is delimited by two forward slashes. To specify a flag, you add it right after the closing slash. It’s also possible to combine flags.
Although there are a lot of different flags, we’ll go over the three most-used ones: the global, case insensitive, and multi-line flags.
Global
The global flag will take all matches instead of returning after the first match.
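A short JavaScript sketch of the difference the g flag makes. The sample text is an assumption for illustration.

```javascript
const repeated = "abc abc abc";

// Without g, only the first match is returned.
const onlyFirst = repeated.match(/abc/);
console.log(onlyFirst.index); // 0

// With g, every match is returned.
const allMatches = repeated.match(/abc/g);
console.log(allMatches); // ["abc", "abc", "abc"]
```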
Case insensitive
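The case-insensitive flag makes the regex ignore the difference between uppercase and lowercase letters.
As mentioned before, we can also combine these flags: /abc/gi returns every match regardless of case.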
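A minimal JavaScript sketch of the i flag on its own and combined with g. The greeting text is a made-up example.

```javascript
const greeting = "Hello hello HELLO";

// Case-sensitive and global: only the all-lowercase word matches.
const caseSensitive = greeting.match(/hello/g);
console.log(caseSensitive); // ["hello"]

// Case-insensitive but not global: the first match in any case.
const insensitiveFirst = greeting.match(/hello/i);
console.log(insensitiveFirst[0]); // "Hello"

// Both flags combined: every match in any case.
const insensitiveAll = greeting.match(/hello/gi);
console.log(insensitiveAll); // ["Hello", "hello", "HELLO"]
```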
Multi-line
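When you want to search a text that spans multiple lines, you’ve got to use the multi-line flag. With it, ^ and $ match the start and end of each line instead of only the start and end of the whole text.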
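The effect of the m flag on the anchors can be sketched like this in JavaScript. The three-line sample text is an assumption for illustration.

```javascript
const multiLineText = "abc one\nabc two\nabc three";

// Without m, ^ only matches at the very start of the text.
const withoutM = multiLineText.match(/^abc/g);
console.log(withoutM.length); // 1

// With m, ^ matches at the start of every line.
const withM = multiLineText.match(/^abc/gm);
console.log(withM.length); // 3

// The same applies to $: "one" ends a line, but not the whole text.
const endWithoutM = /one$/.test(multiLineText); // false
const endWithM = /one$/m.test(multiLineText);   // true
```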
Summary
g global, don’t return after first match
i case insensitive matching
m multi-line, ^ and $ match start and end of line
Character classes
A character class is a special notation that matches any symbol from a certain set. Let’s say that we have a phone number, +(903)123-4567, that we want to turn into numbers only.
In order to do that we have to find and remove anything that’s not a digit. And that’s something that character classes can help with. The \d character class, for example, matches any digit, which is a character from 0 to 9.
That’s perfect for our use case: with the global flag, /\d/g matches every digit in the phone number and skips everything else.
But \d isn’t the only character class.
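As a sketch of the phone number example in JavaScript: collecting every \d match and joining the results is one way to end up with digits only. The variable names are illustrative assumptions.

```javascript
const phone = "+(903)123-4567";

// /\d/g matches each digit separately.
const digits = phone.match(/\d/g);
console.log(digits.length); // 10

// Joining the matches gives the number without any punctuation.
const digitsOnly = digits.join("");
console.log(digitsOnly); // "9031234567"
```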
The \w character class matches a “wordly” character, which is either a letter of the Latin alphabet, a digit, or an underscore. Non-Latin letters don’t belong to the \w character class.
The \s character class matches a whitespace character. This includes spaces, tabs, and newlines.
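A small JavaScript sketch of \w and \s. The sample string is an assumption, and the accented-letter check reflects how JavaScript treats \w; other engines may offer Unicode-aware modes.

```javascript
const sample = "user_42 signed\tin\n";

// \w matches letters, digits, and the underscore.
const wordChars = sample.match(/\w/g).join("");
console.log(wordChars); // "user_42signedin"

// \s matches the space, the tab, and the newline.
const whitespace = sample.match(/\s/g);
console.log(whitespace.length); // 3

// An accented letter (here U+00E9, "e" with an acute accent) is not matched by \w.
const accentedIsWordly = /\w/.test("\u00e9"); // false
```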
Although there are more character classes, the ones that we’ve gone over are the most-used.
Inverse
It’s possible to get the inverse of a character class. Let’s say that we want a non-digit, which basically comes down to any character except those matched by \d. In order to do so, we could use the inverse.
Every character class has an inverse which is denoted with the same letter, but uppercased. The inverse for \d is \D.
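The inverse classes make the earlier "remove anything that’s not a digit" idea direct. A minimal JavaScript sketch follows, with sample strings that are assumptions for illustration.

```javascript
const phoneNumber = "+(903)123-4567";

// \D matches every non-digit, so replacing those with "" leaves only digits.
const stripped = phoneNumber.replace(/\D/g, "");
console.log(stripped); // "9031234567"

// \W matches everything that is not a wordly character.
const lettersOnly = "hello, world!".replace(/\W/g, "");
console.log(lettersOnly); // "helloworld"

// \S matches everything that is not whitespace.
const nonWhitespace = "a b\tc".match(/\S/g);
console.log(nonWhitespace); // ["a", "b", "c"]
```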
Summary
\w wordly character
\d digit
\s whitespace
Inverse
\W non-wordly character
\D non-digit
\S non-whitespace
Quantifiers
The last topic that we’re going to touch on in this article is quantifiers. A regex quantifier specifies how often the preceding character or character class should match.
There are a few different quantifiers, and we’ll go over all of them.
The first quantifier that we’ll go over is the ?, which means zero or one. In the following example, we use the /ba?/g regex. This regex will check for any b character followed by zero or one a character.
The second quantifier is the *, which means zero or more.
And there is also a quantifier for one or more, which is the + quantifier.
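The three quantifiers can be compared side by side in a short JavaScript sketch. The sample string is an assumption chosen so that each quantifier gives a different result.

```javascript
const words = "b ba baa baaa";

// ? means zero or one: at most one "a" is taken after each "b".
const zeroOrOne = words.match(/ba?/g);
console.log(zeroOrOne); // ["b", "ba", "ba", "ba"]

// * means zero or more: a lone "b" still matches.
const zeroOrMore = words.match(/ba*/g);
console.log(zeroOrMore); // ["b", "ba", "baa", "baaa"]

// + means one or more: the lone "b" no longer matches.
const oneOrMore = words.match(/ba+/g);
console.log(oneOrMore); // ["ba", "baa", "baaa"]
```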
You can also specify an exact number of occurrences in a quantifier. The /ba{2}/g regex matches any b character that is followed by two consecutive a characters.
If you want to allow a range of consecutive a characters, you can add a second number, separated by a comma. The /ba{2,4}/ regex matches any b character that is followed by two to four consecutive a characters.
We can also do this for any b character that is followed by at least two a characters. In order to do that, you’ll have to place a comma after the number of occurrences, which results in /ba{2,}/g.
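A JavaScript sketch of the three brace forms on one sample string. The string is an assumption, and the g flag is added to all three patterns so that every match is listed.

```javascript
const runs = "b ba baa baaa baaaa baaaaa";

// {2} takes exactly two "a" characters, even when more follow.
const exactlyTwo = runs.match(/ba{2}/g);
console.log(exactlyTwo); // ["baa", "baa", "baa", "baa"]

// {2,4} takes between two and four, as many as it can.
const twoToFour = runs.match(/ba{2,4}/g);
console.log(twoToFour); // ["baa", "baaa", "baaaa", "baaaa"]

// {2,} takes two or more, with no upper limit.
const atLeastTwo = runs.match(/ba{2,}/g);
console.log(atLeastTwo); // ["baa", "baaa", "baaaa", "baaaaa"]
```

Note that b and ba never match, because none of the three patterns accepts fewer than two a characters.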
Summary
a? matches an `a` character or nothing
a* matches zero or more consecutive `a` characters
a+ matches one or more consecutive `a` characters
a{2} matches exactly 2 consecutive `a` characters
a{2,4} matches between 2 and 4 consecutive `a` characters
a{2,} matches at least 2 consecutive `a` characters.
Daniel Jordan
Cloud Evangelist