The case against Regular Expressions

July 2020

Illustration by Barbara

A regular expression is a sequence of symbols that defines a pattern. This pattern is then used to search for characters or combinations of characters within text.

In their simplest form, regular expressions can provide a viable method for text matching and extraction, using a concise API (in the JavaScript world, that would be the RegExp API). Below is an example of how regular expressions can help when working with text:

// constructor form
new RegExp('https?://(www\\.)?youtube\\.com/(watch|playlist).*');

// literal form
/https?:\/\/(www\.)?youtube\.com\/(watch|playlist).*/;

Looking at the above code, you can clearly identify the intent without looking much at symbols and flags: it is trying to match or extract YouTube video or playlist URLs.
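
As a quick check of that reading, the pattern can be run against a few URLs with RegExp.prototype.test. This is only a sketch: the sample URLs are made up for illustration, and because the pattern has no ^ anchor it would also match a YouTube URL embedded in a longer string.

```javascript
// Literal-form equivalent of the constructor call above.
const youtubePattern = /https?:\/\/(www\.)?youtube\.com\/(watch|playlist).*/;

// Hypothetical sample URLs, used only for illustration.
const samples = [
  'https://www.youtube.com/watch?v=abc123',
  'http://youtube.com/playlist?list=xyz',
  'https://www.youtube.com/feed/trending',
  'https://example.com/watch',
];

// test() returns true when the pattern is found in the string.
const results = samples.map((url) => youtubePattern.test(url));
console.log(results); // [true, true, false, false]
```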

Let’s have a look at a different and apparently simple example: a library book feed. Book records might look like this:

The Hunger Games, Suzanne Collins - ISBN: 0439023483 - pages 374
Gone with the Wind, Margaret Mitchell - ISBN: 0446675539  - pages 1037

If I were asked to extract the ISBN number from each record in the feed, a regular expression would definitely be one of the solutions I would consider:

const bookPattern = /ISBN:\s(\d+)/gm;

A developer looking at this code may realise that I am trying to match the ISBN number, or perhaps some text around it. After decoding the pattern, they might find that I was in fact trying to match:

  • an “ISBN” label
  • followed by a space
  • followed by some numbers
  • in all lines of text

Indeed, that is the intent. The results would be acceptable, but not exactly what we wanted: with the g flag, String.prototype.match returns the whole matched text and ignores the capture group:

// ["ISBN: 0439023483", "ISBN: 0446675539"]
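
The result above is what String.prototype.match produces when the pattern carries the g flag: whole matches only, with capture groups dropped. As a sketch, assuming the two records live in a hypothetical `feed` string, String.prototype.matchAll is one way to reach the captured digits without touching the pattern:

```javascript
// Hypothetical feed string holding the two sample records.
const feed = [
  'The Hunger Games, Suzanne Collins - ISBN: 0439023483 - pages 374',
  'Gone with the Wind, Margaret Mitchell - ISBN: 0446675539 - pages 1037',
].join('\n');

const bookPattern = /ISBN:\s(\d+)/gm;

// With the g flag, match() returns whole matches and ignores the group.
const wholeMatches = feed.match(bookPattern);
console.log(wholeMatches); // ["ISBN: 0439023483", "ISBN: 0446675539"]

// matchAll() yields one match object per hit; index 1 is the first group.
const isbns = Array.from(feed.matchAll(bookPattern), (m) => m[1]);
console.log(isbns); // ["0439023483", "0446675539"]
```

Whether this reads better than a lookbehind is a matter of taste; it moves the complexity from the pattern into the calling code.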

In order to match the ISBN number only, we would not be searching for the “ISBN label, followed by a space, followed by a number” but rather the “ISBN number, preceded by the ISBN label”. It does seem like a small tweak to our pattern: in practice, we are looking to introduce a positive lookbehind assertion.

Our pattern would then look like this:

const bookPattern = /(?<=ISBN:\s)(\d+)/gm;
// ["0439023483", "0446675539"]

And no, it is not a small tweak to me.

It is somewhat reasonable to expect a developer to decode the pattern; however, there are a number of symbols that create noise, and in more convoluted scenarios they could divert the focus from the business problem itself.

Let’s say we have a new business requirement:

We just want 13-digit ISBNs and only if the record specifies the number of pages. - The Business

It seems like a realistic request; let’s see how this translates into work:

const bookPattern = /(?<=ISBN:\s)(\d{13})(?=\s-\spages)/gm;
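
To see the requirement in action, here is a small sketch run against made-up records. The titles and ISBNs below are placeholders invented for illustration, not real feed data. Only the record with a 13-digit ISBN followed by a page count should survive:

```javascript
const bookPattern = /(?<=ISBN:\s)(\d{13})(?=\s-\spages)/gm;

// Placeholder records, invented for illustration only.
const records = [
  'Book A, Author A - ISBN: 9780000000001 - pages 374', // 13 digits, has pages
  'Book B, Author B - ISBN: 0439023483 - pages 374', // only 10 digits
  'Book C, Author C - ISBN: 9780000000002', // no page count
].join('\n');

// match() returns null when nothing matches, so fall back to an empty array.
const matches = records.match(bookPattern) ?? [];
console.log(matches); // ["9780000000001"]
```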

Frankly speaking, our pattern now looks more like a merge conflict than a high-level piece of code.

Unfortunately, the data that we happen to work with is rarely well structured, and abusing regular expressions is also quite common. Here is a solution proposed in a Stack Overflow thread for somebody looking to validate a URL:

/((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=& %@.\w_]*)#?(?:[\w]*))?)/

Trying to understand the above is time-consuming, and even though the intent could be manifested using variables or test coverage, I would not want to find myself looking at an incomprehensible, unmaintainable piece of code.
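
As one sketch of what manifesting the intent with variables might look like, the lookaround pattern from the ISBN example can be assembled from named pieces through the standard `source` property. The variable names are illustrative assumptions, not part of any library:

```javascript
// Each fragment gets a name that states its purpose.
const isbnLabel = /ISBN:\s/;
const thirteenDigits = /\d{13}/;
const pagesSuffix = /\s-\spages/;

// RegExp.prototype.source returns the pattern text without delimiters or flags.
const bookPattern = new RegExp(
  `(?<=${isbnLabel.source})(${thirteenDigits.source})(?=${pagesSuffix.source})`,
  'gm'
);

console.log(bookPattern.source);
// (?<=ISBN:\s)(\d{13})(?=\s-\spages)
```

The names help, but the lookaround syntax is still there: the symbols are labelled rather than removed.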

Alternatives to using Regular Expressions

Abstracting regular expressions is quite unpopular, and there seems to be an unconditional love for regular expressions in the developer world which I struggle to understand: to me, this is a problem worth abstracting, as explained in this article.

Fortunately, I’m not alone in thinking this way, and libraries such as Verbal Expressions offer a great API for constructing regular expression patterns. I would watch out for the upcoming version 2.0, a complete rewrite with increased support for patterns.
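
To give a flavour of what such an abstraction buys, here is a deliberately tiny fluent builder. It is a hypothetical sketch, not the Verbal Expressions API: the `pattern` function and its method names (`precededBy`, `digits`, `followedBy`, `build`) are assumptions made purely for illustration.

```javascript
// Escape characters that carry special meaning in a regular expression.
const escapeRegExp = (text) => text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Hypothetical builder: each method appends a fragment and returns the builder.
function pattern() {
  const parts = [];
  const builder = {
    precededBy(text) {
      parts.push(`(?<=${escapeRegExp(text)})`);
      return builder;
    },
    digits(count) {
      parts.push(`\\d{${count}}`);
      return builder;
    },
    followedBy(text) {
      parts.push(`(?=${escapeRegExp(text)})`);
      return builder;
    },
    build(flags = 'g') {
      return new RegExp(parts.join(''), flags);
    },
  };
  return builder;
}

const isbn13 = pattern()
  .precededBy('ISBN: ')
  .digits(13)
  .followedBy(' - pages')
  .build();

console.log('ISBN: 9780000000001 - pages 10'.match(isbn13)); // ["9780000000001"]
```

The call chain reads close to the business requirement itself, which is the point of the abstraction; a real library would of course cover far more than these three fragments.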

Sometimes breaking the problem into smaller parts, perhaps resorting to array/string manipulation (mind the performance!), can also be a viable, but often overlooked, alternative.
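
As a rough illustration of the string-manipulation route, the ISBN extraction could be written with `split` and `startsWith` alone. This sketch assumes fields are separated by ` - ` and that the ISBN field always starts with `ISBN:`; both are assumptions about the feed format, and the `extractIsbn` helper is hypothetical.

```javascript
// Hypothetical helper: pull the ISBN out of one record without a regex.
// Assumes fields are separated by ' - ' and the ISBN field starts with 'ISBN:'.
function extractIsbn(record) {
  const label = 'ISBN:';
  const field = record
    .split(' - ')
    .map((part) => part.trim())
    .find((part) => part.startsWith(label));
  // Return null when the record carries no ISBN field.
  return field === undefined ? null : field.slice(label.length).trim();
}

const feed = [
  'The Hunger Games, Suzanne Collins - ISBN: 0439023483 - pages 374',
  'Gone with the Wind, Margaret Mitchell - ISBN: 0446675539 - pages 1037',
];

console.log(feed.map(extractIsbn)); // ["0439023483", "0446675539"]
```

It is longer than the one-line pattern, but each step can be read as a plain statement about the record.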

In Conclusion

Regular expressions are powerful tools, and with power comes responsibility.

I find most regular expression patterns easy to construct but not so easy to decode, and I tend to use them only if the pattern is immediately readable. I am not surprised that a native abstraction is not available (there is no JavaScript standard library, let alone a regex package); however, the language and its developers would benefit from a higher-level API.