Implementing a library for readable regular expressions in JavaScript

In the last two articles, I've touched upon using TDD to kickstart a small project and designing a user-centric API. Now, it's time to implement the API we've designed!

API design recap

Let's quickly recap the decisions we've made in terms of the project and the API so far. The project itself is a simplified, readable and reusable API on top of the regular expression DSL, developed using the TDD approach.

The API is designed with users somewhat familiar with RegEx in mind, moving towards a more natural language approach. The interface will use simple naming, variadic functions and named arguments for special cases, such as named groups, lookaheads or quantifier counts. Finally, we'll export a function that will accept an array of patterns and a flags object, and return a regular expression object.

API implementation

For the API implementation, I'll once again focus on the decision-making part of the process. The implementation is either trivial or too complex, depending on your skill level and experience. I'll try to explain the reasoning behind the decisions I make, rather than the code itself, which can be browsed through on the GitHub repository.

💬 Note

Some of the code in this article is simplified (e.g. no error handling) for brevity. Filling in the gaps isn't particularly hard, if you're trying to recreate the project yourself.

Regular expression patterns

My initial approach hinged on creating a pattern class that could be used to build regular expressions. This approach assumed that any pattern should be possible to test on its own (thus making TDD more feasible). Therefore, I started with a Segment class (maybe the name wasn't perfect) that would represent a single pattern segment.

export class Segment {
  constructor(expression) {
    if (expression instanceof Segment) this.value = expression.value;
    if (expression instanceof RegExp) this.value = expression.source;
    if (typeof expression === 'string') this.value = expression;
  }
}

This class had a simple job: processing any valid value (Segment, RegExp or string) into a Segment object with a value property. This would allow me to build a regular expression by combining multiple Segment objects.
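To make the idea concrete, here's a minimal sketch of how such value-holding segments might be combined into a single regular expression. The `combine` helper is hypothetical and exists only for illustration; it isn't part of the library.

```javascript
// The first iteration of Segment, as shown above (without the export).
class Segment {
  constructor(expression) {
    if (expression instanceof Segment) this.value = expression.value;
    if (expression instanceof RegExp) this.value = expression.source;
    if (typeof expression === 'string') this.value = expression;
  }
}

// Hypothetical helper: concatenate segment values into one RegExp.
const combine = (...expressions) =>
  new RegExp(expressions.map(e => new Segment(e).value).join(''));

const year = combine(/^/, '20', /\d{2}/, /$/);

console.log(year.source); // ^20\d{2}$
console.log(year.test('2024')); // true
console.log(year.test('1999')); // false
```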

Using native structures

After a little fiddling around, I came to realize that my Segment class should, in fact, subclass RegExp itself. This meant that its value was roughly equivalent to RegExp.prototype.source. This change made the class simpler and more intuitive.

export class Segment extends RegExp {
  constructor(expression) {
    super(expression instanceof RegExp ? expression.source : expression);
  }
}

This would now allow me to tackle the problem of testing individual patterns, as every Segment could take the place of a regular expression. Therefore, the tests I had in place at the time would pass with flying colors:

import { describe, it, expect } from 'vitest';
import './matchers.js';

import { Segment } from '../src/segment.js';

describe('Segment', () => {
  it('should work with a regular expression', () => {
    expect(new Segment(/abc/)).toMatch('abc');
  });

  it('should work with a string', () => {
    expect(new Segment('abc')).toMatch('abc');
  });

  it('should work with another segment', () => {
    expect(new Segment(new Segment('abc'))).toMatch('abc');
  });
});

Sanitization & conversion

One core thing I had to deal with early on was string sanitization. Luckily, it's a simple matter of finding and replacing special characters. However, I'd also need to create segments without some escapes (e.g. + or * when building quantifiers later on).

import { describe, it, expect } from 'vitest';
import './matchers.js';

import { Segment, toSegment } from '../src/segment.js';

describe('Segment', () => {
  // ...
  it('should not escape special characters', () => {
    expect(new Segment('[a-z]+')).toMatch('d');
  });
});

describe('toSegment', () => {
  it('should work with a regular expression', () => {
    expect(toSegment(/[a-z]+/)).toMatch('d');
  });

  it('should work with a string', () => {
    expect(toSegment('abc')).toMatch('abc');
  });

  it('should work with another segment', () => {
    expect(toSegment(new Segment('abc'))).toMatch('abc');
  });

  it('should escape special characters in strings', () => {
    expect(toSegment('[a-z]+')).toMatch('[a-z]+');
  });
});

After settling on the use cases and writing my tests, I decided to split sanitization and conversion to Segment objects into their own functions.

export const sanitize = val => {
  if (typeof val !== 'string' && typeof val !== 'number')
    throw new TypeError('Value must be a string or a number');
  return `${val}`.replace(/[|\\{}()[\]^$+*?.-]/g, '\\$&');
};
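As a quick sanity check, here's roughly what sanitize produces for a few inputs. The function body is copied from above (minus the export); the sample values are my own.

```javascript
const sanitize = val => {
  if (typeof val !== 'string' && typeof val !== 'number')
    throw new TypeError('Value must be a string or a number');
  return `${val}`.replace(/[|\\{}()[\]^$+*?.-]/g, '\\$&');
};

const escaped = sanitize('[a-z]+');
console.log(escaped); // \[a\-z\]\+

// The escaped string now matches the original text literally.
console.log(new RegExp(`^${escaped}$`).test('[a-z]+')); // true
console.log(new RegExp(`^${escaped}$`).test('abc')); // false

// Numbers are converted to strings before escaping.
console.log(sanitize(1.5)); // 1\.5
```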

💡 Tip

I've covered escaping special characters in regular expressions in the past.

import { sanitize } from './sanitize.js';

//...

export const toSegment = expression => {
  if (expression instanceof Segment) return expression;
  if (expression instanceof RegExp) return new Segment(expression);
  return new Segment(sanitize(expression));
};
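The practical difference between the two input types is easiest to see side by side. The sketch below inlines simplified versions of the pieces above (sanitize skips its type check here) so that it can run on its own.

```javascript
class Segment extends RegExp {
  constructor(expression) {
    super(expression instanceof RegExp ? expression.source : expression);
  }
}

// Simplified: no type check, unlike the version in sanitize.js.
const sanitize = val => `${val}`.replace(/[|\\{}()[\]^$+*?.-]/g, '\\$&');

const toSegment = expression => {
  if (expression instanceof Segment) return expression;
  if (expression instanceof RegExp) return new Segment(expression);
  return new Segment(sanitize(expression));
};

const literal = toSegment('a.c'); // strings are treated as literal text
const pattern = toSegment(/a.c/); // regular expressions are kept as-is

console.log(literal.source); // a\.c
console.log(pattern.source); // a.c
console.log(literal.test('abc')); // false
console.log(pattern.test('abc')); // true
console.log(toSegment(pattern) === pattern); // true
```

In other words, strings are the safe default, while RegExp literals act as the escape hatch for raw pattern syntax.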

Reusable utilities

After building a toSegment function, I realized I needed a way to join segments, too. I also seemed to need a way to map an array of patterns to segments.

import { describe, it, expect } from 'vitest';
import './matchers.js';

import { toSegments, joinSegments } from '../src/segment.js';

describe('joinSegments', () => {
  describe.each([
    [
      'a string, a RegExp, and a number',
      ['a', /b/, 123],
      ['a', 'b', '123']
    ],
    [
      'regular expressions with special patterns',
      [/[a-z]/, /[0-9]/],
      ['[a-z]', '[0-9]']
    ],
    [
      'strings with special patterns',
      ['[a-z]', '[0-9]'],
      [String.raw`\[a\-z\]`, String.raw`\[0\-9\]`]
    ],
  ])('when called with %s', (_, expressions, expected) => {
    it.each(['', '-', '|'])('and a "%s" separator', separator => {
      expect(
        joinSegments(toSegments(...expressions), separator).source
      ).toEqual(expected.join(separator));
    });
  });
});

💡 Tip

Notice the use of String.raw() in the last test case. This is a handy way to avoid escaping special characters in strings.

Having written some test cases that I could use to verify my implementation, I built the toSegments and joinSegments functions.

export const toSegments = (...expressions) => expressions.map(toSegment);

export const joinSegments = (segments, separator = '') =>
  new Segment(segments.join(separator));

I spent a good amount of time trying to figure out why my tests wouldn't pass no matter what I did. Then, it hit me: RegExp.prototype.toString() didn't behave the way I wanted! Naturally, I had to override it.

export class Segment extends RegExp {
  //...
  toString() {
    return this.source;
  }
}

This simple, albeit slightly inelegant, solution fixed my tests and allowed me to move on to the next part of the implementation. With a joinSegments function in place, I could start building reusable building blocks for patterns.
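For anyone wondering why the override matters: Array.prototype.join() converts each element with toString(), and a plain RegExp stringifies with its surrounding slashes. A small standalone sketch of the before and after:

```javascript
// Plain regular expressions keep their delimiters when joined.
console.log([/a/, /b/].join('')); // /a//b/

class Segment extends RegExp {
  constructor(expression) {
    super(expression instanceof RegExp ? expression.source : expression);
  }

  // Return only the pattern, so joining segments yields a valid source.
  toString() {
    return this.source;
  }
}

console.log([new Segment(/a/), new Segment('b')].join('|')); // a|b
```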

Reusable abstractions

As mentioned in the previous article, I want to provide the basic building blocks for people to create their own patterns. This means I need to provide a way to create groups, lookaheads, quantifiers, and more. Let's give each one of these a try.

Anchors, wildcards & character classes

The simplest building blocks are anchors, wildcards, and character classes. They're all just a matter of creating the right Segment objects.

import { Segment } from './segment.js';

export const startOfLine = new Segment('^');
export const endOfLine = new Segment('$');
export const wordBoundary = new Segment(String.raw`\b`);
export const nonWordBoundary = new Segment(String.raw`\B`);

import { Segment } from './segment.js';

export const anyCharacter = new Segment('.');
export const anything = new Segment('.*');
export const something = new Segment('.+');

import { Segment } from './segment.js';

export const digit = new Segment(String.raw`\d`);
export const nonDigit = new Segment(String.raw`\D`);
export const wordCharacter = new Segment(String.raw`\w`);
export const nonWordCharacter = new Segment(String.raw`\W`);
export const whitespaceCharacter = new Segment(String.raw`\s`);
export const nonWhitespaceCharacter = new Segment(String.raw`\S`);

💬 Note

The tests for these constants are pretty much trivial, so they're omitted here.
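One detail worth spelling out is why the constants above use String.raw. In a regular string literal, JavaScript interprets the backslash first, so the regular expression engine never sees it. A short sketch with plain RegExp objects (no library code involved):

```javascript
// In a regular string literal, "\b" is the backspace character (U+0008).
const escapedByJs = '\b';
// String.raw keeps the backslash, producing the two characters `\` and `b`.
const raw = String.raw`\b`;

console.log(escapedByJs.length); // 1
console.log(raw.length); // 2
console.log(raw === '\\b'); // true

// Only the raw version yields a word boundary assertion.
console.log(new RegExp(raw).test('a b')); // true
console.log(new RegExp(escapedByJs).test('a b')); // false
```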

Backreferences

Let's also get backreferences out of the way, as they're another trivial piece of code to write, even if using them feels like a more advanced technique.

import { describe, it, expect } from 'vitest';

import { toSegments, joinSegments } from '../src/segment.js';
import { backReference } from '../src/backReference.js';

// Temporary test setup, until `readEx` is implemented
const toRegExp = expressions => joinSegments(toSegments(...expressions));

describe('backReference', () => {
  it('should work with an index reference', () => {
    const regexp = toRegExp([/^/, /(a)/, backReference(1), /$/]);
    const [fullMatch, match] = regexp.exec('aa');
    expect(fullMatch).toBe('aa');
    expect(match).toBe('a');
  });

  it('should work with a named reference', () => {
    const regexp = toRegExp([
      /^/, /(?<myName>a)/, backReference('myName'), /$/,
    ]);
    const [fullMatch, match] = regexp.exec('aa');
    expect(fullMatch).toBe('aa');
    expect(match).toBe('a');
  });
});

import { Segment } from './segment.js';

export const backReference = reference =>
  new Segment(
    typeof reference === 'number' ?
    `\\${reference}` :
    `\\k<${reference}>`
  );
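To show what the function actually emits, here's a self-contained sketch. The backReference body is the same as above; the quoted-string pattern is a made-up usage example, not something from the library's test suite.

```javascript
class Segment extends RegExp {
  constructor(expression) {
    super(expression instanceof RegExp ? expression.source : expression);
  }
}

const backReference = reference =>
  new Segment(
    typeof reference === 'number' ? `\\${reference}` : `\\k<${reference}>`
  );

console.log(backReference(1).source); // \1
console.log(backReference('quote').source); // \k<quote>

// Hypothetical usage: the closing quote must be the same as the opening one.
const quoted = new RegExp(
  `(?<quote>['"]).*?${backReference('quote').source}`
);
const [match] = quoted.exec(`say "it's fine" now`);
console.log(match); // "it's fine"
```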

Groups

Groups are the next logical step, as they'll allow us to write more complex patterns (hint: we need the non-capturing group in several places).

import { describe, it, expect } from 'vitest';
import './matchers.js';

import { toSegments, joinSegments } from '../src/segment.js';
import {
  captureGroup, nonCaptureGroup, namedGroup, concat, or,
} from '../src/group.js';

// Temporary test setup, until `readEx` is implemented
const toRegExp = expressions => joinSegments(toSegments(...expressions));

describe('captureGroup', () => {
  it('should create a capturing group', () => {
    const regexp = toRegExp([/^/, 'a', captureGroup('bc'), /$/]);
    const [fullMatch, match] = regexp.exec('abc');
    expect(fullMatch).toBe('abc');
    expect(match).toBe('bc');
  });
});

describe('nonCaptureGroup', () => {
  it('should create a non-capturing group', () => {
    const regexp = toRegExp([/^/, nonCaptureGroup('a', /b/, 'c'), /$/]);
    const [fullMatch, match] = regexp.exec('abc');
    expect(fullMatch).toBe('abc');
    expect(match).not.toBe('abc');
  });
});

describe('namedGroup', () => {
  it('should create a named capturing group', () => {
    const regexp = toRegExp([
      /^/, 'a', namedGroup({ name: 'myName' }, 'bc'), /$/,
    ]);
    const matches = regexp.exec('abc');
    expect(matches[0]).toBe('abc');
    expect(matches.groups.myName).toBe('bc');
  });
});

describe('concat', () => {
  it('should combine expressions', () => {
    const expressions = ['a', /b/, 'c'];
    const regexp = toRegExp([/^/, concat(...expressions), /$/]);
    expect(regexp).toMatchString('abc');
    expect(regexp).not.toMatchString('a');
    expect(regexp).not.toMatchString('ab');
  });
});

describe('or', () => {
  it('should match either of the patterns', () => {
    const regexp = toRegExp([/^/, or('a', 'b'), /$/]);
    expect(regexp).toMatchString('a');
    expect(regexp).toMatchString('b');
  });
});

Apart from passing all tests, I wanted a shared group creation utility, which I called toGroup, so that I didn't repeat myself too much.

import { Segment, toSegments, joinSegments } from './segment.js';

const toGroup = (expressions, { capture, name } = {}) => {
  const expression = joinSegments(toSegments(...expressions)).source;

  if (name) return new Segment(`(?<${name}>${expression})`);
  if (capture) return new Segment(`(${expression})`);
  return new Segment(`(?:${expression})`);
};

export const captureGroup = (...expressions) =>
  toGroup(expressions, { capture: true });

export const nonCaptureGroup = (...expressions) =>
  toGroup(expressions, { capture: false });

export const namedGroup = ({ name }, ...expressions) =>
  toGroup(expressions, { name });

export const concat = (...expressions) => nonCaptureGroup(...expressions);

export const or = (...expressions) =>
  nonCaptureGroup(joinSegments(toSegments(...expressions), '|'));
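The reason or and concat wrap their output in a non-capturing group is operator precedence: without the group, neighboring segments bind to individual alternatives. The sketch below uses plain RegExp objects with the kind of source these helpers produce, so it runs without any library code.

```javascript
// Without a group, the anchors bind to the individual alternatives.
const ungrouped = new RegExp('^a|b$');
// With a non-capturing group, the alternation behaves as a single unit.
const grouped = new RegExp('^(?:a|b)$');

console.log(ungrouped.test('ab')); // true (the string starts with "a")
console.log(grouped.test('ab')); // false
console.log(grouped.test('b')); // true

// Non-capturing groups don't add entries to the match array.
console.log('ab'.match(/(?:a)(b)/).length); // 2
console.log('ab'.match(/(a)(b)/).length); // 3
```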

Lookarounds

Lookaheads and lookbehinds followed a very similar pattern. I wrote some tests, implemented a reusable creation utility, and created the necessary building blocks. The only point of note here is the use of the concat utility from the group implementation.

import { Segment, toSegments } from './segment.js';
import { concat } from './group.js';

const toLookAround = (expressions, { lookbehind, negative } = {}) => {
  const expression = concat(...toSegments(...expressions)).source;
  const prefix = `?${lookbehind ? '<' : ''}${negative ? '!' : '='}`;
  return new Segment(`(${prefix}${expression})`);
};

export const lookahead = (...expressions) =>
  toLookAround(expressions, { lookbehind: false, negative: false });

export const negativeLookahead = (...expressions) =>
  toLookAround(expressions, { lookbehind: false, negative: true });

export const lookbehind = (...expressions) =>
  toLookAround(expressions, { lookbehind: true, negative: false });

export const negativeLookbehind = (...expressions) =>
  toLookAround(expressions, { lookbehind: true, negative: true });

💬 Note

Tests are omitted here, for brevity, as they're not too different from the ones for groups.
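In lieu of the tests, here's a small sketch of the prefix logic on its own. The lookAroundPrefix helper is hypothetical (it's just the template literal from toLookAround pulled out), and the price pattern is an invented example using native syntax.

```javascript
// Same prefix logic as in toLookAround, extracted for illustration.
const lookAroundPrefix = ({ lookbehind, negative } = {}) =>
  `?${lookbehind ? '<' : ''}${negative ? '!' : '='}`;

console.log(lookAroundPrefix()); // ?=
console.log(lookAroundPrefix({ negative: true })); // ?!
console.log(lookAroundPrefix({ lookbehind: true })); // ?<=
console.log(lookAroundPrefix({ lookbehind: true, negative: true })); // ?<!

// Lookarounds assert their contents without consuming any characters.
const price = new RegExp(String.raw`(?<=\$)\d+(?= USD)`);
console.log('Total: $42 USD'.match(price)[0]); // 42
```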

Quantifiers

The meat of the implementation seemed to be quantifiers. After all, they're the most common building blocks. Tests were a blessing here, especially the ones for repeat, with its trickier API. I'll focus on just those, as the others are pretty straightforward.

import { describe, it, expect } from 'vitest';
import './matchers.js';

import { toSegments, joinSegments } from '../src/segment.js';
import { repeat } from '../src/quantifiers.js';

// Temporary test setup, until `readEx` is implemented
const toRegExp = expressions => joinSegments(toSegments(...expressions));

describe('repeat', () => {
  describe('should match the pattern the specified number of times', () => {
    it('with an exact number of times', () => {
      const regexp = toRegExp([/^/, repeat({ times: 3 }, 'a'), /$/]);
      expect(regexp).toMatchString('aaa');
    });

    it('when only a min is given', () => {
      const regexp = toRegExp([/^/, repeat({ times: [2] }, 'a'), /$/]);
      expect(regexp).toMatchString('aa');
      expect(regexp).toMatchString('aaa');
    });

    it('when only a max is given', () => {
      const regexp = toRegExp([/^/, repeat({ times: [, 2] }, 'a'), /$/]);
      expect(regexp).toMatchString('a');
      expect(regexp).toMatchString('aa');
    });

    it('when given both a min and a max', () => {
      const regexp = toRegExp([/^/, repeat({ times: [2, 3] }, 'a'), /$/]);
      expect(regexp).toMatchString('aa');
      expect(regexp).toMatchString('aaa');
    });
  });
});

Notice also that I decided to make the lazy variant of each quantifier its own function, so as to minimize friction for the end user. Instead of having to specify options for, say, zeroOrOne, one can just use zeroOrOneLazy.

import { Segment, toSegments } from './segment.js';
import { nonCaptureGroup } from './group.js';

const toQuantifier = (expressions, { suffix, lazy } = {}) => {
  const expression = nonCaptureGroup(...toSegments(...expressions)).source;
  return new Segment(`${expression}${suffix}${lazy ? '?' : ''}`);
};

const toRepeat = (expressions, options) => {
  const { times, lazy } = options;
  const [min = null, max = null] = Array.isArray(times)
    ? times
    : [times, times];

  const [start, end] = [min ?? 0, max ?? ''];
  const suffix = start === end ? `{${start}}` : `{${start},${end}}`;

  return toQuantifier(expressions, { suffix, lazy });
};

export const zeroOrOne = (...expressions) =>
  toQuantifier(expressions, { suffix: '?' });

export const zeroOrOneLazy = (...expressions) =>
  toQuantifier(expressions, { suffix: '?', lazy: true });

export const oneOrMore = (...expressions) =>
  toQuantifier(expressions, { suffix: '+' });

export const oneOrMoreLazy = (...expressions) =>
  toQuantifier(expressions, { suffix: '+', lazy: true });

export const zeroOrMore = (...expressions) =>
  toQuantifier(expressions, { suffix: '*' });

export const zeroOrMoreLazy = (...expressions) =>
  toQuantifier(expressions, { suffix: '*', lazy: true });

export const repeat = (options, ...expressions) =>
  toRepeat(expressions, options);

export const repeatLazy = (options, ...expressions) =>
  toRepeat(expressions, { ...options, lazy: true });
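The trickiest part is how the times option turns into a quantifier suffix, so here's that logic in isolation. The toRepeatSuffix name is hypothetical; the body mirrors the normalization inside toRepeat.

```javascript
// The same normalization as toRepeat, extracted into a standalone helper.
const toRepeatSuffix = times => {
  const [min = null, max = null] = Array.isArray(times)
    ? times
    : [times, times];
  const [start, end] = [min ?? 0, max ?? ''];
  return start === end ? `{${start}}` : `{${start},${end}}`;
};

console.log(toRepeatSuffix(3)); // {3}
console.log(toRepeatSuffix([2])); // {2,}
console.log(toRepeatSuffix([, 2])); // {0,2}
console.log(toRepeatSuffix([2, 3])); // {2,3}

// The non-capturing group makes the quantifier apply to the whole pattern,
// not just to its last character.
const suffix = toRepeatSuffix([2, 3]);
console.log(new RegExp(`^(?:ab)${suffix}$`).test('abab')); // true
console.log(new RegExp(`^ab${suffix}$`).test('abab')); // false
```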

Character sets

Ranges and character sets are the last building blocks I wanted to implement. As these functions can accept range patterns, we need to make sure these work as expected, hence a few more test cases.

import { describe, it, expect } from 'vitest';
import './matchers.js';

import { toSegments, joinSegments } from '../src/segment.js';
import { anythingBut, anythingFrom } from '../src/characterSet.js';

// Temporary test setup, until `readEx` is implemented
const toRegExp = expressions => joinSegments(toSegments(...expressions));

describe('anythingFrom', () => {
  describe.each([
    ['abc', ['a', 'b', 'c'], ['d']],
    ['a-z', ['a', '-', 'z'], ['b']],
    ['[ab]', ['a', '[', 'b', ']'], ['d']],
    ['[^a]', ['^', '[', 'a', ']'], ['d']],
  ])('matches exactly', (expression, matches, nonMatches) => {
    matches.forEach(str => {
      it(`${expression} matches ${str}`, () => {
        expect(toRegExp([/^/, anythingFrom(expression), /$/]))
          .toMatchString(str);
      });
    });

    nonMatches.forEach(str => {
      it(`${expression} does not match ${str}`, () => {
        expect(toRegExp([/^/, anythingFrom(expression), /$/]))
          .not.toMatchString(str);
      });
    });
  });
});

describe('anythingBut', () => {
  describe.each([
    ['abc', ['d'], ['a', 'b', 'c']],
    ['a-z', ['d'], ['a', '-', 'z']],
    ['[ab]', ['d'], ['a', '[', 'b', ']']],
    ['[^a]', ['d'], ['^', '[', 'a', ']']],
  ])('matches exactly', (expression, matches, nonMatches) => {
    matches.forEach(str => {
      it(`${expression} matches ${str}`, () => {
        expect(toRegExp([/^/, anythingBut(expression), /$/]))
          .toMatchString(str);
      });
    });

    nonMatches.forEach(str => {
      it(`${expression} does not match ${str}`, () => {
        expect(toRegExp([/^/, anythingBut(expression), /$/]))
          .not.toMatchString(str);
      });
    });
  });
});

To make this happen, I settled on a toCharacterSet utility function, which would handle the conversion of range patterns to Segment objects. From that point onward, it was just a matter of building the anythingFrom and anythingBut functions.

import { Segment, toSegment, toSegments, joinSegments } from './segment.js';

export const toCharacterSet = expression => {
  if (Array.isArray(expression))
    return joinSegments(toSegments(...expression), '-');
  return toSegment(expression);
};

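// Inside a character class `|` is a literal character, so the sets are
// concatenated directly instead of being joined with an alternation.
const toAnything = (prefix, ...expressions) =>
  new Segment(
    `[${prefix}${joinSegments(expressions.map(toCharacterSet)).source}]`
  );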

export const anythingFrom = (...expressions) =>
  toAnything('', ...expressions);
export const anythingBut = (...expressions) =>
  toAnything('^', ...expressions);
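The distinction between a string and an array argument is subtle, so here's a simplified sketch of it. The toCharacterSetSource helper is a hypothetical stand-in for toCharacterSet that only handles strings and returns the source directly, which keeps the example self-contained.

```javascript
const sanitize = val => `${val}`.replace(/[|\\{}()[\]^$+*?.-]/g, '\\$&');

// Hypothetical, simplified stand-in for toCharacterSet (strings only).
const toCharacterSetSource = expression =>
  Array.isArray(expression)
    ? expression.map(sanitize).join('-') // [from, to] becomes a range
    : sanitize(expression); // a string is a list of literal characters

const literal = new RegExp(`^[${toCharacterSetSource('a-z')}]$`);
const range = new RegExp(`^[${toCharacterSetSource(['a', 'z'])}]$`);

console.log(literal.source); // ^[a\-z]$
console.log(range.source); // ^[a-z]$

console.log(literal.test('m')); // false
console.log(literal.test('-')); // true
console.log(range.test('m')); // true
console.log(range.test('-')); // false
```

A string such as 'a-z' therefore always means "one of these three characters", while a two-element array opts into an actual range.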

The main function

Finally, we can build the main function, readEx, that will accept an array of patterns and a flags object, and return a RegExp. We've already written tests for a lot of the code, so our test scaffolding can pretty much be converted to the final implementation.

import { joinSegments, toSegments } from './segment.js';

const DEFAULT_FLAGS = {
  dotAll: false, global: true, ignoreCase: false,
  multiline: true, sticky: false, unicode: false,
};
const FLAG_MAP = {
  dotAll: 's', global: 'g', ignoreCase: 'i',
  multiline: 'm', sticky: 'y', unicode: 'u',
};

const asFlags = flags =>
  Object.entries(Object.entries(flags).reduce(
    (acc, [flag, value]) => {
      if (value !== undefined) acc[flag] = value;
      return acc;
    }, { ...DEFAULT_FLAGS }
  )).reduce((acc, [flag, value]) =>
    (value ? acc + FLAG_MAP[flag] : acc), ''
  );

export const readEx = (expressions, flags = {}) =>
  new RegExp(joinSegments(toSegments(...expressions)).source, asFlags(flags));
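To illustrate how the defaults and overrides interact, here's asFlags on its own with a few sample inputs (the constants and function are copied from above; the inputs are mine). The last part is a reminder of standard RegExp behavior rather than anything specific to the library: with the global flag on by default, test() and exec() keep track of lastIndex between calls.

```javascript
const DEFAULT_FLAGS = {
  dotAll: false, global: true, ignoreCase: false,
  multiline: true, sticky: false, unicode: false,
};
const FLAG_MAP = {
  dotAll: 's', global: 'g', ignoreCase: 'i',
  multiline: 'm', sticky: 'y', unicode: 'u',
};

const asFlags = flags =>
  Object.entries(Object.entries(flags).reduce(
    (acc, [flag, value]) => {
      if (value !== undefined) acc[flag] = value;
      return acc;
    }, { ...DEFAULT_FLAGS }
  )).reduce((acc, [flag, value]) =>
    (value ? acc + FLAG_MAP[flag] : acc), ''
  );

console.log(asFlags({})); // gm
console.log(asFlags({ ignoreCase: true })); // gim
console.log(asFlags({ global: false, multiline: false }) === ''); // true
console.log(asFlags({ global: undefined })); // gm

// A global regular expression remembers where the last match ended.
const stateful = new RegExp('a', asFlags({}));
console.log(stateful.test('a')); // true
console.log(stateful.lastIndex); // 1
console.log(stateful.test('a')); // false (the search resumed at index 1)
```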

After implementing the main function, I went ahead and updated all the test cases to replace the toRegExp function with readEx. This way, I could make sure that the final implementation worked as expected. I also wrote a bunch of integration tests, which are too complex to cover here, but feel free to check them out on the GitHub repository.
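As a taste of how the pieces compose, below is a condensed, self-contained sketch. It inlines trimmed-down versions of a handful of building blocks (no flags handling, a repeat that only accepts an exact count), so treat it as an approximation of the library rather than its actual source. The ISO date pattern is my own example.

```javascript
class Segment extends RegExp {
  constructor(expression) {
    super(expression instanceof RegExp ? expression.source : expression);
  }
  toString() {
    return this.source;
  }
}

const sanitize = val => `${val}`.replace(/[|\\{}()[\]^$+*?.-]/g, '\\$&');
const toSegment = e =>
  e instanceof Segment ? e : new Segment(e instanceof RegExp ? e : sanitize(e));
const toSegments = (...expressions) => expressions.map(toSegment);
const joinSegments = (segments, separator = '') =>
  new Segment(segments.join(separator));

// Trimmed-down building blocks.
const startOfLine = new Segment('^');
const endOfLine = new Segment('$');
const digit = new Segment(String.raw`\d`);
const namedGroup = ({ name }, ...expressions) =>
  new Segment(`(?<${name}>${joinSegments(toSegments(...expressions)).source})`);
const repeat = ({ times }, ...expressions) =>
  new Segment(`(?:${joinSegments(toSegments(...expressions)).source}){${times}}`);

// Simplified main function: flags are omitted in this sketch.
const readEx = expressions =>
  new RegExp(joinSegments(toSegments(...expressions)).source);

const isoDate = readEx([
  startOfLine,
  namedGroup({ name: 'year' }, repeat({ times: 4 }, digit)),
  '-',
  namedGroup({ name: 'month' }, repeat({ times: 2 }, digit)),
  '-',
  namedGroup({ name: 'day' }, repeat({ times: 2 }, digit)),
  endOfLine,
]);

const { year, month, day } = isoDate.exec('2024-05-17').groups;
console.log(year, month, day); // 2024 05 17
console.log(isoDate.test('2024-5-17')); // false
```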

Exports

Finally, I exported all the building blocks and the main function from the index.js file. This way, I could bundle everything up in a nice neat package.

export * from './readEx.js';
export * from './backReference.js';
export * from './group.js';
export * from './lookAround.js';
export * from './quantifiers.js';
export * from './sequences.js';
export * from './characterSet.js';

Having used named exports throughout the project and separated the utilities, I could simply export everything from the individual files in the index.js file. This makes it even easier to add new functions to the library in the future, but has the potential drawback of internal code being exported, if I'm not careful.

Packaging the code

We've already set up Vitest for testing, but we'll need a bundler to package our code. I've chosen Vite for its speed and simplicity.

Vite configuration

Vite's configuration is minimal, and it's very easy to set up. This is in stark contrast with tools like Webpack, which have always haunted me with their complexity (although, admittedly, this could be due to my own incompetence).

import { dirname, resolve } from 'node:path';
import { fileURLToPath } from 'node:url';
import { defineConfig } from 'vite';

const __dirname = dirname(fileURLToPath(import.meta.url));

export default defineConfig({
  build: {
    lib: {
      entry: resolve(__dirname, 'src/index.js'),
      name: 'readex',
      fileName: 'readex',
      formats: ['es'],
    },
    minify: false,
  },
});

I've based this configuration off of Vite's library mode documentation. The lib object specifies the entry point, the library name, the output file name, and the format. We're using ES modules for compatibility with modern browsers and Node.js.

Notice that I've also set minify to false for the time being. I'll revisit this in the next article, but for now, I want to be able to read the final output easily.

Building the code

We'll also add a build script to our package.json before we can bundle our code:

{
  "scripts": {
    "build": "vite build"
  }
}

Then, we can finally run the build script:

npm run build

This results in the following output:

> readex@1.0.0 build
> vite build

vite v6.2.0 building for production...
✓ 9 modules transformed.
dist/readex.js  6.04 kB │ gzip: 1.63 kB
✓ built in 51ms

In it, we can see the output path (dist/readex.js), the size of the bundle (6.04 kB), and the gzipped size (1.63 kB). This is a very small bundle size, which is great for a utility library like ours. Yet, I feel like we can do better, which I'm going to explore in the next article, where I refactor much of the codebase.

Publishing to npm

Finally, we can publish our package to npm. We'll need to specify a few things in our package.json first to make sure the right files are included and the package is importable as an ES module:

{
  "name": "readex",
  "type": "module",
  "version": "1.0.0",
  "files": ["dist"],
  "module": "./dist/readex.js",
  "exports": {
    ".": {
      "import": "./dist/readex.js"
    }
  }
}

After that, if you haven't already, you'll need to create an account on the npm website. Then, you can log in from the command line:

npm login

After logging in, you can publish your package:

npm publish

We just made our package available to the world! You can check it out on the npm website or install it in another project to test it out.

Conclusion

Phew! That was quite a long journey, but I think the library is now ready for use. With TDD and a carefully designed API in place from the start, the library took shape easily and quickly. The resulting bundle is small and the code is easy to read and understand. Next time, we'll take a stab at refactoring the code to make it even more efficient and readable. Stay tuned!
