Regular expressions

A regular expression (or regex) is a formal pattern that describes a set of text strings. It allows you to search, validate, extract and transform text without writing manual character-by-character comparison loops.

Faced with problems like "verify this text is a valid IP address", "extract all email addresses from a 50 MB log file" or "replace all variable names starting with _old_ across 3,000 code files", a regex can often do the job in a single line. Regexes are available in virtually every modern programming language, text editor, shell and database engine.

Anatomy

A regular expression is composed of elements with well-defined roles:

/pattern/flags — The standard notation. Slashes / act as delimiters; flags control global behaviour: g (all matches), i (case-insensitive), m (^/$ per line), s (. matches newlines).

Metacharacters — Symbols with special meaning: . (any character except \n), \d (a digit), \w (letter, digit or _), \s (whitespace). Their negations: \D, \W, \S.

Quantifiers — Repetitions: * (0 or more), + (1 or more), ? (0 or 1), {n} (exactly n), {n,m} (between n and m). By default they are greedy; adding ? makes them lazy: +?, *?.

Anchors — Positions without consuming characters: ^ (start), $ (end), \b (word boundary).

Character classes — [abc] matches a, b or c. [a-z] is a range. [^abc] is the negation.

Groups — (...) groups and captures. (?:...) groups without capturing. (?<name>...) creates named captures. (?=...) and (?!...) are lookaheads.

Alternation — cat|dog matches "cat" or "dog".
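
The elements above can be tried together in a short Python sketch. The sample text is invented for illustration. Note two differences from the /pattern/flags notation: Python passes flags as arguments (re.IGNORECASE, re.MULTILINE), and its re module writes named groups as (?P<name>...) rather than (?<name>...).

```python
import re

# Invented sample text, two lines.
text = "Order #A-1042 shipped to Cat Street\norder #b-77 shipped to Dog Lane"

# Character class + literal + quantifier: one letter, a hyphen, one or more digits.
ids = re.findall(r"[A-Za-z]-\d+", text)

# Flags: IGNORECASE plays the role of i, MULTILINE the role of m
# (^ then matches at the start of every line).
line_starts = re.findall(r"^order", text, flags=re.IGNORECASE | re.MULTILINE)

# Named groups give each captured part a label.
first = re.search(r"#(?P<series>[A-Za-z])-(?P<number>\d+)", text)
parts = first.groupdict()

# Alternation inside a non-capturing group, delimited by word boundaries.
animals = re.findall(r"\b(?:cat|dog)\b", text, flags=re.IGNORECASE)

# Lookahead: digits only when followed by " shipped"; the lookahead
# itself is not part of the match.
before_shipped = re.findall(r"\d+(?= shipped)", text)

print(ids, line_starts, parts, animals, before_shipped)
```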

History & evolution

The mathematical theory underlying regular expressions was formalised by mathematician Stephen Kleene in 1956. The step from theory to practice was taken by Ken Thompson, who in 1968 published a regex search algorithm and built it into the QED text editor; regex support then carried over to ed, the standard UNIX editor. In 1973 he created grep, one of the most iconic UNIX tools, whose name comes from the ed command g/re/p (global regular expression print).

POSIX later standardised Basic (BRE) and Extended (ERE) Regular Expressions, still used in grep, sed and awk. The turning point came in 1987, when Larry Wall released Perl with regex as a core language feature. Perl's richer syntax became the de facto standard, and Perl-style regex was adopted, with variations, by PHP, Python, Ruby, Java and JavaScript.

In 1997, Philip Hazel released PCRE (Perl Compatible Regular Expressions), an independent library implementing Perl-style regex. Its successor, PCRE2, is used by software such as PHP, Apache and Nginx.

Best practices

Always test with real data and edge cases. A regex that works with 3 test examples may fail with unexpected real-world data.

Anchor when possible. Using ^ and $ prevents unintended partial matches and can let the engine reject non-matching input sooner.
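
A minimal Python sketch of the difference, assuming a hypothetical rule that the input must be exactly a four-digit PIN. It also shows a detail of Python's re module: $ still matches just before a trailing newline, whereas re.fullmatch requires the pattern to cover the whole string.

```python
import re

PIN = r"\d{4}"  # hypothetical rule: exactly four digits

# Unanchored: succeeds because four digits appear somewhere in the input.
unanchored = re.search(PIN, "abc12345xyz") is not None

# Anchored: the whole input must be four digits, so this is rejected.
anchored = re.search(r"^\d{4}$", "abc12345xyz") is not None

# In Python, $ also matches just before a newline at the end of the string.
trailing_newline = re.search(r"^\d{4}$", "1234\n") is not None

# fullmatch requires the pattern to span the entire string.
strict = re.fullmatch(PIN, "1234\n") is not None

print(unanchored, anchored, trailing_newline, strict)
```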

Avoid catastrophic backtracking. Patterns like (a+)+ can take exponential time on non-matching strings and crash a server (ReDoS). Avoid nested quantifiers over the same characters.
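
The effect can be observed on a small scale with the sketch below, which assumes a backtracking engine such as Python's re module. The helper seconds_to_reject is a hypothetical name introduced here. Input sizes are kept small on purpose; timings depend on the machine, so none are shown.

```python
import re
import time

nested = re.compile(r"(a+)+b")  # nested quantifiers over the same character
flat = re.compile(r"a+b")       # accepts the same strings without the nesting

def seconds_to_reject(compiled, n):
    """Time a failed match against n 'a' characters with no trailing 'b'."""
    subject = "a" * n
    start = time.perf_counter()
    result = compiled.match(subject)
    elapsed = time.perf_counter() - start
    assert result is None  # the subject never contains 'b'
    return elapsed

# Each extra character roughly doubles the work for the nested pattern,
# so keep n small when experimenting.
for n in (10, 14, 18):
    print(n, seconds_to_reject(nested, n), seconds_to_reject(flat, n))
```

Rewriting the pattern so that only one quantifier can consume a given character, as in the flat version, removes the ambiguity the engine would otherwise explore.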

Use non-capturing groups (?:...) by default. If you do not need the captured content, (?:...) is more efficient and makes the group clearly structural.
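
In Python the choice also changes results, not just clarity: re.findall returns the captured group instead of the whole match when the pattern contains one. A small sketch with invented input:

```python
import re

text = "hosts 10.0.0.1 and 192.168.1.20"

# Capturing group: findall returns only the group's last repetition.
with_capture = re.findall(r"(\d{1,3}\.){3}\d{1,3}", text)

# Non-capturing group: findall returns each full match.
without_capture = re.findall(r"(?:\d{1,3}\.){3}\d{1,3}", text)

print(with_capture)
print(without_capture)
```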

Comment complex regex. In Python and PHP the (?x) flag (verbose mode) allows spaces and comments inside the pattern.
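
As an illustration, here is a hypothetical date pattern written in Python's verbose mode, enabled with the re.VERBOSE flag (equivalent to an inline (?x)). Whitespace in the pattern is ignored and # starts a comment, so a literal space would have to be escaped or placed in a character class.

```python
import re

# Hypothetical pattern for dates written as YYYY-MM-DD.
DATE = re.compile(r"""
    (?P<year>\d{4})    # four-digit year
    -                  # literal separator
    (?P<month>\d{2})   # two-digit month
    -                  # literal separator
    (?P<day>\d{2})     # two-digit day
""", re.VERBOSE)

m = DATE.search("released on 2024-03-09")
print(m.group("year"), m.group("month"), m.group("day"))
```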

Common errors

Forgetting to escape the dot. . means "any character", not a literal dot. Write \. for a literal dot. The pattern /helpi.top/ matches "helpiXtop" as well as "helpi.top".
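
The same example can be checked in Python. When the literal text comes from a variable, re.escape can add the backslashes for you.

```python
import re

# Unescaped dot: any character is accepted in that position.
unescaped = re.fullmatch(r"helpi.top", "helpiXtop") is not None

# Escaped dot: only a literal "." is accepted.
escaped = re.fullmatch(r"helpi\.top", "helpiXtop") is not None

# re.escape builds a pattern that matches the given text literally.
literal = re.escape("helpi.top")

print(unescaped, escaped, literal)
```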

Greedy vs lazy misapplied. <.*> on <b>text</b> captures the entire fragment. The lazy version <.*?> captures each tag individually.
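
A quick Python check of both behaviours, plus a third option that is often preferable: a negated character class, which cannot run past the closing bracket in the first place.

```python
import re

html = "<b>text</b>"

greedy = re.findall(r"<.*>", html)       # runs to the last ">"
lazy = re.findall(r"<.*?>", html)        # stops at the first ">"
negated = re.findall(r"<[^>]*>", html)   # can never cross a ">"

print(greedy)
print(lazy)
print(negated)
```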

Forgetting the g flag. Without g, string.match(/pattern/) in JavaScript returns only the first match.
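
Python has no g flag, but the same pitfall appears as a choice of function: re.search stops at the first match, while re.findall and re.finditer return all of them. A sketch with invented input:

```python
import re

text = "a1 b22 c333"

# search: first match only.
first = re.search(r"\d+", text).group()

# findall: every non-overlapping match, as strings.
all_matches = re.findall(r"\d+", text)

# finditer: match objects, useful when positions are needed.
positions = [(m.group(), m.start()) for m in re.finditer(r"\d+", text)]

print(first, all_matches, positions)
```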

Catastrophic backtracking. The pattern (a+)+b can take exponential time on non-matching strings. Exploitable to crash services (ReDoS).

Assuming . matches newlines. By default it does not. Use the s flag or [\s\S].
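
In Python the s flag is spelled re.DOTALL. A minimal sketch with an invented multi-line block:

```python
import re

text = "BEGIN\nline one\nline two\nEND"

# Default: "." stops at newlines, so the block is not found.
default = re.search(r"BEGIN.*END", text)

# DOTALL (the s flag): "." matches newlines too.
dotall = re.search(r"BEGIN.*END", text, re.DOTALL)

# [\s\S] matches any character regardless of flags.
flagless = re.search(r"BEGIN[\s\S]*END", text)

print(default is None, dotall is not None, flagless is not None)
```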

Using regex for HTML or JSON. A classic anti-pattern: brittle in the face of nesting. Use specialised parsers instead.
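
A small illustration with JSON, using an invented document in which the same key appears at two nesting levels. The regex has no notion of depth and returns whichever occurrence comes first; the standard json module returns the value at the requested level.

```python
import json
import re

# Invented document: "name" appears nested and at the top level.
doc = '{"user": {"name": "Ada", "tags": ["x", "y"]}, "name": "outer"}'

# Fragile: grabs the first "name" in the text, which is the nested one.
guess = re.search(r'"name":\s*"([^"]*)"', doc).group(1)

# Robust: parse, then navigate the structure explicitly.
data = json.loads(doc)
top_level = data["name"]
nested = data["user"]["name"]

print(guess, top_level, nested)
```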

Use cases

Form validation. Emails, phone numbers, postal codes, tax IDs: the same regex can validate the format on the client and again on the server.
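
A minimal sketch of format validation in Python. Both patterns are deliberately simplified assumptions (a loose email shape and a US-style ZIP code), not complete validators, and is_valid is a hypothetical helper name.

```python
import re

# Simplified, assumed patterns for illustration only.
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")   # something@something.something
US_ZIP = re.compile(r"\d{5}(?:-\d{4})?")          # 12345 or 12345-6789

def is_valid(pattern, value):
    """Return True when the whole value matches the pattern."""
    return pattern.fullmatch(value) is not None

print(is_valid(EMAIL, "ada@example.com"))
print(is_valid(EMAIL, "ada@example"))
print(is_valid(US_ZIP, "12345-6789"))
print(is_valid(US_ZIP, "1234"))
```

Using fullmatch rather than search keeps the check anchored, so extra text before or after a valid-looking value is rejected.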

Log analysis. With regex you can extract IPs, HTTP codes, timestamps or specific errors from millions of log lines in seconds.
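
As a sketch, the following assumes an access-log layout similar to the common log format; real log formats vary, so the pattern and the sample lines are illustrative assumptions. Lines that do not match are skipped.

```python
import re
from collections import Counter

# Assumed layout: ip - - [timestamp] "METHOD path protocol" status size
LINE = re.compile(
    r'(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) - - '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3})'
)

# Invented sample lines.
sample = [
    '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '198.51.100.23 - - [10/Oct/2024:13:55:40 +0000] "POST /login HTTP/1.1" 500 512',
    'malformed line',
]

def parse(lines):
    """Yield one dict of named fields per line that matches the layout."""
    for line in lines:
        m = LINE.match(line)
        if m:
            yield m.groupdict()

records = list(parse(sample))
status_counts = Counter(r["status"] for r in records)
print(records)
print(status_counts)
```

Compiling the pattern once and iterating line by line keeps memory use flat, which matters when the input is a large file rather than a short list.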

Advanced find & replace. VS Code, Vim and Sublime Text support regex in search/replace. Rename 500 variables or reformat dates in a 100,000-row CSV in a single operation.
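
The same kind of replacement can be scripted. This sketch assumes dates written as DD/MM/YYYY and variables prefixed with _old_; both inputs are invented. Backreferences such as \1 reuse captured text in the replacement.

```python
import re

# Reformat DD/MM/YYYY (assumed input format) to YYYY-MM-DD.
rows = "id,created\n1,31/12/2024\n2,01/02/2025"
iso = re.sub(r"\b(\d{2})/(\d{2})/(\d{4})\b", r"\3-\2-\1", rows)

# Strip an _old_ prefix from identifiers, keeping the rest of the name.
code = "_old_total = _old_count + total"
renamed = re.sub(r"\b_old_(\w+)", r"\1", code)

print(iso)
print(renamed)
```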

Framework routing. Rails, Laravel, Django and Express define their routes with patterns that capture URL parameters directly.
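
The idea can be sketched without any framework. The route table, handler names and resolve function below are hypothetical and do not reproduce the API of any of the frameworks named above; they only show how named groups turn URL segments into parameters.

```python
import re

# Hypothetical route table: (compiled pattern, handler name).
ROUTES = [
    (re.compile(r"/users/(?P<user_id>\d+)"), "show_user"),
    (re.compile(r"/posts/(?P<year>\d{4})/(?P<slug>[\w-]+)"), "show_post"),
]

def resolve(path):
    """Return (handler name, captured parameters) for the first matching route."""
    for pattern, handler in ROUTES:
        m = pattern.fullmatch(path)
        if m:
            return handler, m.groupdict()
    return None

print(resolve("/users/42"))
print(resolve("/posts/2024/hello-world"))
print(resolve("/users/abc"))
```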

Data processing and ETL. Regex extract fields from unstructured formats to normalise and load them into databases.

Curiosities

  • The name grep comes from the :g/re/p command of the UNIX ed editor (1973): "global regular expression print". Ken Thompson's creation brought regex within reach of every UNIX user.
  • The ReDoS phenomenon (Regular Expression Denial of Service) has caused real outages: in 2016, a regex with catastrophic backtracking took Stack Overflow down for 34 minutes in production.
  • RFC 5322, the formal email address standard, is complex enough that an often-cited regex attempting to validate addresses in full runs to more than 6,000 characters. For practical use, a simplified 50–100 character pattern handles the vast majority of real-world cases.
  • XKCD published comic #208 "Regular Expressions" in 2007. To this day it is the central cultural reference in the programming community for regex.