Fun with Javascript Regex

October 07, 2018

Regex, short for regular expressions, can feel like some kind of scary dark witchcraft if you are not familiar with it. You know those magic spells are powerful for pattern matching and string parsing, but those weird-looking question marks, slashes and asterisks are just plain gibberish to the untrained mind.

Not all regex are equal. The regex we use in programming today comes in all kinds of syntax flavors. However, the most popular ones nowadays are mostly derivatives of Perl’s regex syntax. If you have mastered one regex dialect (like the Javascript one we are going to play with today, which is 99% identical to Dart’s regex syntax), picking up other dialects like Python’s or Java’s would be trivial. So now, let’s have some regex fun!

Getting Started!

In JavaScript, a regex pattern is a RegExp object, which we can create either with the constructor or with a simpler regex literal (note the lack of quotation marks).

const regex0 = new RegExp(',') // regex constructor
const regex1 = /,/ // regex literal

The two RegExp objects above are equivalent - they both represent the “pattern” of a single comma.
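
One practical difference between the two forms is worth a quick sketch. The constructor takes a plain string, so backslashes have to be doubled, but in exchange the pattern can be built at runtime. The variable names below are made up for illustration.

```javascript
// The same "one or more digits" pattern, written both ways.
const literal = /\d+/
const constructed = new RegExp('\\d+') // note the doubled backslash inside the string

console.log(literal.source === constructed.source) // => true

// The constructor is handy when the pattern is only known at runtime.
const separator = ',' // assumed input, e.g. read from some configuration
const dynamic = new RegExp(separator)
console.log('a,b,c'.split(dynamic)) // => [ 'a', 'b', 'c' ]
```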

So now we’ve defined a pattern, how do we use it? If what concerns us is only whether a pattern exists in a string or not, we can simply run the test method on a RegExp object.

const str0 = `1,000,000,000 is like, tres comas.`
console.log(regex1.test(str0)) // => true

If we want to find the location of the pattern’s occurrence, we can run the exec method, like, executing the regex on this string.

console.log(regex1.exec(str0))
// => [ ',', index: 1, input: '1,000,000,000 is like, tres comas.' ]
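
The value returned by exec is an array with a few extra properties, so the interesting bits can be picked out directly. A small sketch (the variable names are only for illustration):

```javascript
const comma = /,/
const found = comma.exec('1,000')

console.log(found[0])    // => ','  (the matched text)
console.log(found.index) // => 1    (where the match starts)

// When nothing matches, exec returns null instead of an array.
console.log(comma.exec('1000')) // => null
```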

That’s some cool info, but it only returns the index of the first match. Hmm, maybe running exec() multiple times will do the trick, like pulling out data from an iterator?

console.log(regex1.exec(str0))
// => [ ',', index: 1, input: '1,000,000,000 is like, tres comas.' ]
console.log(regex1.exec(str0))
// => [ ',', index: 1, input: '1,000,000,000 is like, tres comas.' ]

Oops, nope! Well, we are partially right though - the exec() method is indeed stateful, and this is the correct way to iterate through matches. The problem actually lies within the regex pattern we defined.

Regex Flags

Flags let us toggle options of how the searching or matching should be carried out, and are part of the regex pattern.

What we need in the last example is the global flag g, which tells the regex engine to do a “global” search instead of stopping at the first match (like in the examples above). With this flag, each exec() call continues from where the previous match ended. regex2 will return null when the iteration is complete, then restart from index 0.

const regex2 = /,/g
console.log(regex2.exec(str0))
// => [ ',', index: 1, input: '1,000,000,000 is like, tres comas.' ]
console.log(regex2.exec(str0))
// => [ ',', index: 5, input: '1,000,000,000 is like, tres comas.' ]
console.log(regex2.exec(str0))
// => [ ',', index: 9, input: '1,000,000,000 is like, tres comas.' ]
// let's only run 3 times for now

There’s an interesting thing to observe - each RegExp object has a property called lastIndex, which is what makes it stateful. However, the object itself doesn’t remember which string was passed into the exec method. Right now, our regex2 object has its lastIndex set to 10 - if we swap str0 for another string, the matching will start from index 10 instead of 0.

console.log(regex2.lastIndex)
// => 10
const str1 = `This, is, cool.`
console.log(regex2.exec(str1))
// => null, because the searching starts at index 10.
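
Putting the pieces together, here is a minimal sketch of looping over every match with exec. The helper name findAllIndexes is made up for this post. It resets lastIndex first, so leftover state from an earlier string cannot bite us.

```javascript
// Hypothetical helper: collect the index of every match of a global regex.
function findAllIndexes(pattern, text) {
  if (!pattern.global) {
    // Without the g flag exec() never advances, so the loop would never end.
    throw new TypeError('pattern needs the g flag')
  }
  pattern.lastIndex = 0 // always start from the beginning of the new string
  const indexes = []
  let match
  while ((match = pattern.exec(text)) !== null) {
    indexes.push(match.index)
    // An empty match does not move lastIndex, so nudge it forward ourselves.
    if (match[0] === '') pattern.lastIndex++
  }
  return indexes
}

const commas = /,/g
console.log(findAllIndexes(commas, '1,000,000,000 is like, tres comas.'))
// => [ 1, 5, 9, 21 ]
console.log(findAllIndexes(commas, 'This, is, cool.'))
// => [ 4, 8 ]
```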

Other useful flags are i, which makes the search case-insensitive, and m, which turns on multi-line mode so that ^ and $ match at the start and end of every line instead of only at the start and end of the whole string. There are other, less used ones as well. A new dotAll flag, s, was added in ECMAScript 2018 this year - a very helpful addition, since with it the dot character (.) finally matches every character in the string, including \n and the other line terminators. Chrome has supported this flag since version 62.
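
A quick sketch of what the i, m and s flags change, using a made-up two-line string. The s flag needs an engine that supports ES2018.

```javascript
const text = 'Hello\nworld'

console.log(/hello/.test(text))   // => false, case matters by default
console.log(/hello/i.test(text))  // => true, i ignores case

console.log(/^world/.test(text))  // => false, ^ only matches at the start of the string
console.log(/^world/m.test(text)) // => true, with m, ^ also matches right after a newline

console.log(/Hello.world/.test(text))  // => false, the dot refuses to match \n
console.log(/Hello.world/s.test(text)) // => true, with s, the dot matches \n too
```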

Now let’s see what all those question marks, slashes and asterisks are actually about!

Dealing with Wildcards

If you are familiar with terminal emulators in either UNIX or Windows style, you have probably dealt with wildcards before. You know, you use rm -f *.gif on Mac or Linux to delete all GIFs in the current directory with no questions asked, or del *.gif /q on your Windows box to do the same. Well, it is important to know that wildcards in Perl-like regular expressions work differently.

We have only one wildcard character in Regex - the period . (aka the dot). This character pattern represents one single unknown character, but doesn’t match a newline character (\n), so /c.t/ matches string cat and doesn’t match c\nt. It basically works like the ? wildcard you are familiar with inside command lines.
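
Here is a small sketch of the dot in action. The test strings are made up for illustration.

```javascript
const dot = /c.t/

console.log(dot.test('cat'))  // => true
console.log(dot.test('c t'))  // => true, a space is a character too
console.log(dot.test('ct'))   // => false, the dot needs exactly one character
console.log(dot.test('coat')) // => false, two characters sit between c and t
console.log(dot.test('c\nt')) // => false, the dot does not match a newline
```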

Repetition Qualifiers (aka. Quantifiers)

So how do we match many unknown characters? This is where repetition qualifiers come in.

A qualifier applies to whatever comes right before it: an asterisk * means 0 or more repetitions of the preceding character, ? means 0 or 1, and + means 1 or more.

So for example, essential can be matched with /es.*sential/ (0 extra characters in this case), /es.+ential/ (1 extra here), /es.?ential/ (1 extra character), or obviously /.*/. Repetition qualifiers work with all characters, not just the dot, so /ess?ential/ matches essential too - the second s character appears once.

What’s more, you can DIY the range of the repetition - at least n to at most m - with {n,m}, or specify the exact amount of occurrences with {n}. We can also match n to infinity (greater than or equal to n) occurrences with {n,}.

For example, essential can be matched with /es{2}ential/, and 1000101 and 1000000101 can both be matched with /10{3,6}101/, but 10101 cannot.
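
The patterns from this section can be checked in a few lines. The lol example for {n,} is made up for illustration.

```javascript
const word = 'essential'

console.log(/es.*sential/.test(word)) // => true, .* matches zero characters here
console.log(/es.+ential/.test(word))  // => true, .+ matches the second s
console.log(/es.?ential/.test(word))  // => true, .? matches the second s
console.log(/ess?ential/.test(word))  // => true, s? matches the second s
console.log(/es{2}ential/.test(word)) // => true, exactly two s characters

const zeros = /10{3,6}101/ // a 1, then three to six 0s, then 101
console.log(zeros.test('1000101'))                     // => true, three 0s
console.log(zeros.test('1000000101'))                  // => true, six 0s
console.log(zeros.test('10101'))                       // => false, only one 0
console.log(zeros.test('1' + '0'.repeat(7) + '101'))   // => false, seven 0s is too many

console.log(/lo{2,}l/.test('lol'))    // => false, {2,} wants at least two o's
console.log(/lo{2,}l/.test('looool')) // => true
```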

Sometimes We Need to Escape

Sometimes, we need to match characters like { or * in our strings too - we can use a backslash \ to escape them. In JavaScript, the special characters to escape are \ / [ { ( ) ? + * | . ^ $. Interestingly, ] and } are not special on their own (outside of Unicode mode), but escaping them does no harm. You can also escape normal characters, but be careful: regex has character classes (like \d for digits) that are written like escapes but mean something else - /\o/ matches dog, but /\d/ does not!
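
A short sketch of what escaping changes, using made-up strings. It assumes regexes without the u flag, where escaping an ordinary letter like o is allowed.

```javascript
// Unescaped, + is a qualifier: "one or more 1s, followed by a 1".
console.log(/1+1/.test('1+1=2'))  // => false, there is no "11" in the string
// Escaped, + is just a plus sign.
console.log(/1\+1/.test('1+1=2')) // => true

console.log(/\{x\}/.test('f{x}')) // => true, escaped braces match literally

// \o is simply the letter o, but \d is the digit class.
console.log(/\o/.test('dog')) // => true
console.log(/\d/.test('dog')) // => false, no digit in "dog"
```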

Sets and Classes

Character classes make our life easier when we want to match characters from a specific set. For example, if we want to match the digits in an ID string, we can simply use \d to represent a single digit - essentially like a dot wildcard, but only for digits.

const regex = /\d+/ // matches one or more digits (no g flag, so test() keeps no state between strings)
const str0 = '1234'
const str1 = 'D1234'
const str2 = 'D'

console.log(regex.test(str0)) // => true
console.log(regex.test(str1)) // => true
console.log(regex.test(str2)) // => false

We can also use a more flexible set notation [0-9] to replace \d - range 0 to 9. Following this “range” logic, for basic Latin letters we can also do [a-z] or [A-Z], or simply [a-zA-Z]. These ranges are actually just shorthands for [0123456789] or [abcdef...]. If you are matching something from the extended Latin alphabet, you’ll need to add the extra letters in manually. For example, [a-zA-ZüöäÜÖÄß] for German. You get the idea 😉.

You can also use ^ right after the opening bracket as a negation operator - it negates everything inside the brackets, so [^0-9] will match any character except a digit.

It is important to notice that special characters like $ or . lose their special meaning inside the brackets - the brackets strip away all their magic, and they are simply plain literal characters. The only characters that still need care inside brackets are \, ], ^ (at the very start) and - (between two characters).
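
A few quick checks of how sets behave, as a rough sketch with made-up strings:

```javascript
const notDigit = /[^0-9]/
console.log(notDigit.test('1234')) // => false, every character is a digit
console.log(notDigit.test('12a4')) // => true, the a is not a digit

// Inside brackets, $ and . are plain characters.
const dollarOrDot = /[$.]/
console.log(dollarOrDot.test('$5'))  // => true
console.log(dollarOrDot.test('abc')) // => false
console.log(/./.test('abc'))         // => true, outside brackets the dot is a wildcard

// ^ only negates at the very start, and - only makes a range between two characters.
console.log(/[a^]/.test('^'))   // => true, here ^ is a literal caret
console.log(/[a-c]/.test('b'))  // => true, b is inside the range a to c
console.log(/[a\-c]/.test('b')) // => false, this set is just a, - and c
```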

Predefined Character Class Shorthands

As we have seen above, JavaScript regex (like most other regex dialects) has some predefined shorthands for common situations. Let’s have a look at the code snippet below.

const regex1 = /\w/ // matches "word" characters - equivalent to [a-zA-Z0-9_]
const regex2 = /\s/ // matches whitespace characters - spaces, tabs, newlines and the various Unicode space characters
const nregex1 = /\W/ // negated \w - matches everything other than "word" characters
const nregex2 = /\S/ // negated \s - matches everything other than "space" characters
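
To get a feel for these shorthands, here is a small sketch with made-up inputs. The letter ü is written as \u00fc so the example does not depend on file encoding.

```javascript
console.log(/\w/.test('_'))      // => true, the underscore counts as a "word" character
console.log(/\w/.test('-'))      // => false
console.log(/\w/.test('\u00fc')) // => false, ü is outside [a-zA-Z0-9_]

console.log(/\s/.test('a\tb'))   // => true, a tab is whitespace
console.log(/\W/.test('hello'))  // => false, everything here is a word character
console.log(/\S/.test('   '))    // => false, nothing but spaces

// Splitting on any single whitespace character.
console.log('a1 b2\tc3'.split(/\s/)) // => [ 'a1', 'b2', 'c3' ]
```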

OR operator

Like in normal programming languages, | is the OR operator. [0123456789] can also be written like [01234]|[56789] if you feel like experimenting!
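
One thing worth knowing, sketched below with made-up words: | splits the whole pattern into alternatives, not just the characters next to it. Parentheses (the same ones used for groups) limit its reach.

```javascript
console.log(/cat|dog/.test('hotdog')) // => true, the "dog" alternative matches

// /gra|ey/ means "gra" OR "ey", so it happily matches "hey".
console.log(/gra|ey/.test('hey'))     // => true

// Parentheses restrict the alternatives to a single position.
const grey = /gr(a|e)y/
console.log(grey.test('gray')) // => true
console.log(grey.test('grey')) // => true
console.log(grey.test('hey'))  // => false
```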

Replace by groups

Other than matching patterns, regex is also super useful for replacing characters in a match. We can use Javascript string’s replace() method to do this.

Let’s first construct a phone number matcher.

const str0 = '+49 123-123-1234' // a phone number...
const regex0 = /^(\+\d+)\s(\d+)-(\d+)-(\d+)/ // matches the number and puts the digits into 4 groups (no g flag needed, ^ allows only one match)
regex0.test(str0); // => true, of course!

Now, if we use the replace() method, we can write $ plus a number inside the second (replacement) parameter to refer to the corresponding group that we have defined in the regex pattern.

For example, we would like to extract the country code.

str0.replace(regex0, '$1')
// replace the match (the whole string in this case) with the first matched group, which is (\+\d+)
// => '+49'

Or replace the last 4 digits with 4321:

str0.replace(regex0, '$1 $2-$3-4321')
// => '+49 123-123-4321'
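
The groups are not only available to replace() - the array returned by exec carries them too, starting at position 1. A small sketch, with made-up variable names and sample strings:

```javascript
const phone = /^(\+\d+)\s(\d+)-(\d+)-(\d+)/
const number = '+49 123-123-1234'

const parts = phone.exec(number)
console.log(parts[0]) // => '+49 123-123-1234' (the whole match)
console.log(parts[1]) // => '+49'  (first group)
console.log(parts[4]) // => '1234' (fourth group)

// Groups can be mixed freely with literal text in the replacement string.
console.log(number.replace(phone, '$1 ***-***-$4'))
// => '+49 ***-***-1234'

// They can also be reordered.
console.log('John Smith'.replace(/(\w+)\s(\w+)/, '$2, $1'))
// => 'Smith, John'
```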

Cool, isn’t it? 😉