53 changes: 53 additions & 0 deletions .github/workflows/mysql-parser-tests.yml
@@ -0,0 +1,53 @@
name: MySQL Parser Tests

on:
  push:
    branches:
      - trunk
    paths:
      - '.github/workflows/mysql-parser-tests.yml'
      - 'packages/mysql-parser/**'
  pull_request:
    paths:
      - '.github/workflows/mysql-parser-tests.yml'
      - 'packages/mysql-parser/**'
  workflow_dispatch:

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

# Disable permissions for all available scopes by default.
# Any needed permissions should be configured at the job level.
permissions: {}

jobs:
  test:
    # The runtime supports PHP 7.2+; test the oldest and the latest.
    name: PHP ${{ matrix.php }}
    runs-on: ubuntu-latest
    timeout-minutes: 10
    permissions:
      contents: read # Required to clone the repo.
    strategy:
      fail-fast: false
      matrix:
        php: [ '7.2', '8.5' ]

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Set up PHP
        uses: shivammathur/setup-php@v2
        with:
          php-version: ${{ matrix.php }}
          coverage: none

      - name: Install dependencies
        working-directory: packages/mysql-parser
        run: composer install --no-interaction --no-progress

      - name: Run tests
        working-directory: packages/mysql-parser
        run: composer run test
1 change: 1 addition & 0 deletions packages/mysql-parser/.gitignore
@@ -0,0 +1 @@
/build/
159 changes: 159 additions & 0 deletions packages/mysql-parser/README.md
@@ -0,0 +1,159 @@
# MySQL Parser

A fast and complete parser for MySQL with zero dependencies, generated
directly from the **official MySQL grammar** (`sql/sql_yacc.yy`) and keyword
table (`sql/lex.h`) shipped in the
[mysql-server](https://github.com/mysql/mysql-server) sources.

The package treats MySQL's own Bison grammar as the source of truth and
compiles it, unmodified, into a compact LALR(1) parse table that a small
pure-PHP runtime executes. The lexer emits the grammar's own token numbers and
its keyword table is generated from `lex.h`, so the accepted language tracks a
real MySQL release exactly, with no hand-maintained grammar to drift.

## How it works

The grammar is compiled ahead of time by the tooling in `tools/`. The runtime
ships only the compiled artifacts and a thin driver:

```
mysql-server sources (pinned tag)
  sql/sql_yacc.yy ─┐
                   ├─▶ Bison 3.8.2 ──▶ automaton.xml ──▶ generate-parse-table.php ──▶ src/grammar/parse-table.php
  sql/lex.h ───────┘                        │
                                            └──────────▶ generate-tokens.php ──────▶ src/grammar/tokens.php

MySQL query ──▶ WP_MySQL_Lexer ──▶ grammar tokens ──▶ WP_MySQL_Parser ──▶ WP_Parser_Node AST
                 (tokens.php)                        (parse-table.php)
```

- **`tools/`** compiles the grammar. It fetches the pinned MySQL sources
  (checksum-verified), runs the exact Bison version MySQL uses (3.8.2, in
  Docker, version-asserted), and compacts Bison's `--xml` automaton into plain
  PHP ACTION/GOTO tables: per-state default reductions with sparse exception
  rows, identical rows shared between states, near-identical keyword rows
  patch-encoded against a base row, modal per-terminal shift targets, and
  per-nonterminal GOTO defaults. The result has about 7% of the cells of the
  dense table. A second generator derives the token-level data (the keyword
  table, the paren-gated function keywords, and the token constants the
  scanner refers to) from `lex.h` and the automaton, resolving every terminal
  by name.
- **`src/`** is the runtime: the MySQL lexer, the parser, and the two generated
  grammar artifacts.
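
To make the first compaction step concrete, here is a minimal sketch of a per-state default action combined with a sparse exception row. The array layout, the function name `lookup_action`, the string encoding of actions, and the toy state and token numbers are all assumptions made for illustration; the generated `parse-table.php` may encode its rows quite differently.

```php
<?php
// Hypothetical compact ACTION table: each state stores one default action
// plus only the cells that differ from it. Actions are encoded as strings
// here purely for readability ('s5' = shift to state 5, 'r3' = reduce by rule 3).
$action_table = array(
	// State 0: reduce by rule 3 on almost every lookahead.
	0 => array(
		'default'    => 'r3',
		'exceptions' => array(
			10 => 's5',  // Token 10 shifts to state 5.
			11 => 'err', // Token 11 is a syntax error.
		),
	),
	// State 1: no default reduction, so unlisted tokens are errors.
	1 => array(
		'default'    => 'err',
		'exceptions' => array(
			12 => 's2',
		),
	),
);

// Look up one cell: the exception row wins, otherwise the state default applies.
function lookup_action( array $table, int $state, int $token ): string {
	$row = $table[ $state ];
	if ( isset( $row['exceptions'][ $token ] ) ) {
		return $row['exceptions'][ $token ];
	}
	return $row['default'];
}

echo lookup_action( $action_table, 0, 10 ), "\n"; // s5
echo lookup_action( $action_table, 0, 99 ), "\n"; // r3
echo lookup_action( $action_table, 1, 99 ), "\n"; // err
```

In this toy layout a state with hundreds of terminals costs only as many cells as it has exceptions, which is the intuition behind the reported size reduction.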

The MySQL grammar is unambiguous for an LALR(1) parser (Bison resolves all of
its shift/reduce conflicts by precedence and reports zero reduce/reduce
conflicts), so the runtime is a plain deterministic shift-reduce loop — no GLR,
backtracking, or conflict tables.
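
As an illustration of what a plain deterministic shift-reduce loop looks like, the sketch below drives a hand-written table for a toy grammar (`E -> E '+' n | n`). It is not the package's runtime: the function `toy_parse`, the table layout, and the array-based tree are assumptions chosen to keep the example self-contained.

```php
<?php
// Toy grammar (not MySQL's):  1: E -> E '+' n    2: E -> n
$rules = array(
	1 => array( 'lhs' => 'E', 'length' => 3 ),
	2 => array( 'lhs' => 'E', 'length' => 1 ),
);
// ACTION[state][terminal] => array( kind, argument ).
$action = array(
	0 => array( 'n' => array( 'shift', 1 ) ),
	1 => array( '+' => array( 'reduce', 2 ), '$' => array( 'reduce', 2 ) ),
	2 => array( '+' => array( 'shift', 3 ), '$' => array( 'accept', 0 ) ),
	3 => array( 'n' => array( 'shift', 4 ) ),
	4 => array( '+' => array( 'reduce', 1 ), '$' => array( 'reduce', 1 ) ),
);
// GOTO[state][nonterminal] => next state.
$goto = array( 0 => array( 'E' => 2 ) );

// Returns a nested array tree, or null on a syntax error.
function toy_parse( array $tokens, array $rules, array $action, array $goto ) {
	$tokens[] = '$';        // End-of-input marker.
	$states   = array( 0 ); // State stack.
	$nodes    = array();    // Value stack (one entry per state above state 0).
	$pos      = 0;
	while ( true ) {
		$state = end( $states );
		$token = $tokens[ $pos ];
		if ( ! isset( $action[ $state ][ $token ] ) ) {
			return null; // No table entry: syntax error, no backtracking.
		}
		list( $kind, $arg ) = $action[ $state ][ $token ];
		if ( 'shift' === $kind ) {
			$states[] = $arg;
			$nodes[]  = $token;
			++$pos;
		} elseif ( 'reduce' === $kind ) {
			$rule     = $rules[ $arg ];
			$children = array_splice( $nodes, -$rule['length'] );
			array_splice( $states, -$rule['length'] );
			$nodes[]  = array( $rule['lhs'], $children );
			$states[] = $goto[ end( $states ) ][ $rule['lhs'] ];
		} else {
			return $nodes[0]; // Accept.
		}
	}
}

$tree = toy_parse( array( 'n', '+', 'n' ), $rules, $action, $goto );
echo json_encode( $tree ), "\n"; // ["E",[["E",["n"]],"+","n"]]
```

Every step consults exactly one table cell and either shifts, reduces, accepts, or fails, which is why no conflict handling is needed once the grammar is known to be conflict-free.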

### The AST

`parse()` returns a `WP_Parser_Node` tree in which each node carries the
grammar rule name it was reduced by. By default every rule materialises a node,
including MySQL's deep single-child wrapper chains
(`expr → bool_pri → predicate → bit_expr → ...`).

Passing `true` as the parser's second constructor argument enables
**unit-production inlining**: unit productions whose only child is itself a
node are collapsed (the child replaces the would-be wrapper). Those chains
account for over half of all reductions, so inlining roughly halves node
allocations and raises throughput by a further 20-30%. The cost is that
wrapper rule names are absent from the tree, so consumers must match only the
rule names of meaningful (multi-child or token-bearing) nodes.
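
The effect of inlining can be sketched with a toy reduction helper. `Toy_Node`, `toy_reduce`, and `toy_chain` are hypothetical names, not part of the package, and tokens are plain strings here. The rule names come from the wrapper chain quoted above, plus `simple_expr` as an assumed token-bearing leaf rule.

```php
<?php
// Hypothetical stand-in for a parse-tree node; not the package's WP_Parser_Node.
class Toy_Node {
	public $rule_name;
	public $children;

	public function __construct( string $rule_name, array $children ) {
		$this->rule_name = $rule_name;
		$this->children  = $children;
	}
}

// Build the node for one reduction.
function toy_reduce( string $rule_name, array $children, bool $inline ) {
	if ( $inline && 1 === count( $children ) && $children[0] instanceof Toy_Node ) {
		return $children[0]; // Unit production: reuse the child, skip the wrapper.
	}
	return new Toy_Node( $rule_name, $children );
}

// Simulate the reductions for a single literal climbing the wrapper chain.
function toy_chain( bool $inline ) {
	// The only child is a token, so this node is kept in both modes.
	$node = toy_reduce( 'simple_expr', array( '1' ), $inline );
	foreach ( array( 'bit_expr', 'predicate', 'bool_pri', 'expr' ) as $rule_name ) {
		$node = toy_reduce( $rule_name, array( $node ), $inline );
	}
	return $node;
}

echo toy_chain( false )->rule_name, "\n"; // expr
echo toy_chain( true )->rule_name, "\n";  // simple_expr
```

With inlining off the literal sits under four wrapper nodes; with it on, the same reductions allocate a single node, and a consumer looking for `expr` at the root would no longer find it.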

## Package layout

```
mysql-parser/
├── bin/build-grammar               Orchestrates the full grammar build.
├── bin/build-corpus                Orchestrates the full query corpus build.
├── tools/
│   ├── fetch-mysql-grammar.sh      Fetch sql_yacc.yy + lex.h at the pinned tag.
│   ├── run-bison.sh                Run Bison 3.8.2 (Docker) -> automaton.xml.
│   ├── generate-parse-table.php    automaton.xml -> src/grammar/parse-table.php.
│   ├── generate-tokens.php         lex.h + automaton.xml -> src/grammar/tokens.php.
│   ├── fetch-mysql-tests.sh        Fetch the MySQL server test suite at the pinned tag.
│   └── generate-query-corpus.php   MySQL server tests -> data/ query corpus.
├── data/
│   └── mysql-server-query-corpus/  SQL queries extracted from the MySQL server tests.
├── src/
│   ├── parser/                     Generic parse-tree primitives (token, node).
│   ├── class-wp-mysql-token.php    MySQL token leaf.
│   ├── class-wp-mysql-lexer.php    MySQL lexer.
│   ├── class-wp-mysql-parser.php   The LALR(1) runtime.
│   └── grammar/                    Generated parse table + token data.
└── tests/                          PHPUnit suite + corpus benchmark.
```

The runtime requires **PHP 7.2+** and no PHP extensions.

## Using the parser

```php
require_once __DIR__ . '/vendor/autoload.php';

// Pass true as the second argument to enable unit-production inlining (see "The AST").
$parser = new WP_MySQL_Parser( require __DIR__ . '/src/grammar/parse-table.php' );
$tokens = ( new WP_MySQL_Lexer( 'SELECT 1 + 2' ) )->remaining_tokens();
$ast = $parser->parse( $tokens ); // WP_Parser_Node, or null on a syntax error.
```

## Building the grammar

The compiled artifacts under `src/grammar/` are committed, so the parser works
out of the box. To regenerate them from the MySQL sources, run from this
package's directory:

```bash
composer run build-grammar
```

This requires `bash`, `curl`, `docker`, and `php`. It fetches the grammar,
runs Bison in Docker, and rewrites `src/grammar/parse-table.php` and
`src/grammar/tokens.php`; re-running it reproduces the committed artifacts
byte for byte. Both artifacts are plain PHP arrays. The fetched sources and
the (large) automaton dump land in `build/`, which is gitignored.

## Building the query corpus

`data/mysql-server-query-corpus/` holds a corpus of SQL queries extracted from
the [MySQL server test suite](https://github.com/mysql/mysql-server/tree/trunk/mysql-test),
used to validate and benchmark the parser against real-world MySQL syntax. The
committed corpus is generated from a mysql-server tag pinned in
`tools/fetch-mysql-tests.sh` (overridable with `MYSQL_TAG`). To regenerate it:

```bash
composer run build-corpus
```

This requires `bash`, `git`, and `php`. It shallow-clones the `mysql-test`
directory of the mysql-server repository into the gitignored `build/`
workspace, extracts the SQL queries from the test files, and rewrites
`data/mysql-server-query-corpus/mysql-latest.csv`.

## Tests and benchmark

```bash
composer run test # PHPUnit suite (includes a corpus regression test)
composer run benchmark # corpus throughput, without and with the tracing JIT
```

The corpus regression test runs the parser over the package's corpus of
roughly 69.5k queries taken from the MySQL server tests. The parser accepts
**99.88%** of them; the 0.12% it rejects consists of syntax removed in recent
MySQL versions (e.g. `RESET MASTER`), multi-statement input, statements
needing non-default session SQL modes, and a few lexer edge cases.
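
For a sense of scale, the rejected share can be turned into an approximate query count. Both inputs are the rounded figures quoted above, so the result is an estimate rather than an exact count.

```php
<?php
// Rounded figures from the paragraph above; the result is only an estimate.
$corpus_size = 69500;  // "~69.5k" queries.
$accept_rate = 0.9988; // 99.88% accepted.

$rejected = (int) round( $corpus_size * ( 1 - $accept_rate ) );
echo $rejected, "\n"; // 83
```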

## Pinned MySQL version

The grammar sources are fetched at a mysql-server tag pinned in
`tools/fetch-mysql-grammar.sh` (the version the committed artifacts were
generated from), with the source checksums pinned alongside it. The tag is
overridable for the build:

```bash
MYSQL_TAG=mysql-X.Y.Z composer run build-grammar
```

Bumping the tag regenerates `src/grammar/*`; the script prints the new source
checksums to re-pin. Follow up by re-running the test suite and benchmark to
confirm the corpus parse rate.
22 changes: 22 additions & 0 deletions packages/mysql-parser/bin/build-corpus
@@ -0,0 +1,22 @@
#!/usr/bin/env bash
#
# Build the MySQL server query corpus from the MySQL server test suite.
#
# Runs the full pipeline end to end:
# 1. Fetch the mysql-test directory from the pinned mysql-server tag.
# 2. Extract the SQL queries into data/mysql-server-query-corpus/.
#
# Requirements: bash, git, php. Override the MySQL version with MYSQL_TAG
# (default: mysql-8.0.38).
#
set -euo pipefail

package_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
tools_dir="$package_dir/tools"

bash "$tools_dir/fetch-mysql-tests.sh"

echo "Generating the query corpus ..."
php "$tools_dir/generate-query-corpus.php"

echo "Done."
32 changes: 32 additions & 0 deletions packages/mysql-parser/bin/build-grammar
@@ -0,0 +1,32 @@
#!/usr/bin/env bash
#
# Build the MySQL parse table and token map from the official MySQL grammar.
#
# Runs the full pipeline end to end:
# 1. Fetch sql_yacc.yy + lex.h from the pinned mysql-server tag.
# 2. Run Bison 3.8.2 to produce the LALR automaton (automaton.xml).
# 3. Generate src/grammar/parse-table.php from the automaton.
# 4. Generate src/grammar/tokens.php from lex.h and the automaton.
#
# Requirements: bash, curl, docker, php. Override the MySQL version with
# MYSQL_TAG (default: mysql-8.4.3).
#
set -euo pipefail

package_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
tools_dir="$package_dir/tools"
build_dir="$package_dir/build"
grammar_dir="$package_dir/src/grammar"

mkdir -p "$grammar_dir"

bash "$tools_dir/fetch-mysql-grammar.sh"
bash "$tools_dir/run-bison.sh"

echo "Generating parse table ..."
php "$tools_dir/generate-parse-table.php" "$build_dir/automaton.xml" "$grammar_dir/parse-table.php"

echo "Generating grammar tokens ..."
php "$tools_dir/generate-tokens.php" "$build_dir/automaton.xml" "$build_dir/lex.h" "$grammar_dir/tokens.php"

echo "Done."
30 changes: 30 additions & 0 deletions packages/mysql-parser/composer.json
@@ -0,0 +1,30 @@
{
    "name": "wordpress/mysql-parser",
    "type": "library",
    "description": "A fast and complete MySQL parser with zero dependencies.",
    "license": "GPL-2.0-or-later",
    "require": {
        "php": ">=7.2"
    },
    "autoload": {
        "classmap": [
            "src/"
        ]
    },
    "require-dev": {
        "phpunit/phpunit": "^8.5"
    },
    "scripts": {
        "test": "phpunit",
        "build-grammar": [
            "./bin/build-grammar"
        ],
        "build-corpus": [
            "./bin/build-corpus"
        ],
        "benchmark": [
            "@php tests/benchmark.php",
            "@php -d opcache.enable_cli=1 -d opcache.jit_buffer_size=64M -d opcache.jit=tracing tests/benchmark.php"
        ]
    }
}