Use JS-compatible Unicode casing for intrinsic string mappings #3495
Andarist wants to merge 11 commits into
jakebailey left a comment
I am very uneasy about this. Why can't we use Go's unicode ranges to do this? How can we make sure this is up to date long term?
func ToLowerJS(str string) string {
	runes := []rune(str)
The perf of this is going to be abysmal; surely we can bail out for ASCII?
I implemented the bailout for ASCII
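For illustration, an ASCII bailout of this kind could look roughly like the sketch below. This is not the code from the PR; the helper name toLowerASCIIFastPath and its (string, bool) return shape are assumptions made for the example. The idea is to scan bytes once and only fall back to the rune-based path when a non-ASCII byte is seen.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// toLowerASCIIFastPath is a hypothetical helper, not the PR's code.
// It returns (result, true) when s is pure ASCII, and ("", false)
// when the caller has to fall back to the full Unicode-aware path.
func toLowerASCIIFastPath(s string) (string, bool) {
	hasUpper := false
	for i := 0; i < len(s); i++ {
		c := s[i]
		if c >= utf8.RuneSelf {
			return "", false // non-ASCII byte: needs the slow path
		}
		if 'A' <= c && c <= 'Z' {
			hasUpper = true
		}
	}
	if !hasUpper {
		return s, true // nothing to change, so no allocation
	}
	var b strings.Builder
	b.Grow(len(s))
	for i := 0; i < len(s); i++ {
		c := s[i]
		if 'A' <= c && c <= 'Z' {
			c += 'a' - 'A'
		}
		b.WriteByte(c)
	}
	return b.String(), true
}

func main() {
	lowered, ok := toLowerASCIIFastPath("Hello")
	fmt.Println(lowered, ok)
	_, ok = toLowerASCIIFastPath("İstanbul")
	fmt.Println(ok)
}
```

Type names in most programs are ASCII, so a check like this would keep the common case away from the []rune conversion entirely.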
Pull request overview
This PR fixes intrinsic string mapping types (Lowercase<>, Uppercase<>, Capitalize<>, Uncapitalize<>) to use ECMAScript-compatible Unicode casing semantics (not Go’s strings.ToLower/ToUpper), addressing special cases like dotted İ, ß expansions, ligatures, and Greek final sigma.
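To make the difference concrete, here is a minimal sketch of how unconditional special-casing expansions can be layered on top of Go's per-rune mapping. The three table entries are taken from Unicode's SpecialCasing.txt; the names upperSpecial and toUpperSketch are hypothetical, and the PR's generated table is much larger and also covers lowercasing and context-sensitive rules.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// upperSpecial holds a few unconditional uppercase expansions from
// SpecialCasing.txt. It is a hand-picked illustration, not the PR's table.
var upperSpecial = map[rune]string{
	'\u00DF': "SS",      // LATIN SMALL LETTER SHARP S
	'\uFB01': "FI",      // LATIN SMALL LIGATURE FI
	'\u0149': "\u02BCN", // LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
}

// toUpperSketch consults the special-casing table first and falls back
// to Go's one-rune-to-one-rune mapping for everything else.
func toUpperSketch(s string) string {
	var b strings.Builder
	for _, r := range s {
		if mapped, ok := upperSpecial[r]; ok {
			b.WriteString(mapped)
			continue
		}
		b.WriteRune(unicode.ToUpper(r))
	}
	return b.String()
}

func main() {
	fmt.Println(toUpperSketch("straße"))
	fmt.Println(toUpperSketch("\uFB01sh"))
}
```

The point of the sketch is only that one input rune may map to several output runes, which a per-rune function such as unicode.ToUpper cannot express.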
Changes:
- Implemented JS-compatible ToLowerJS/ToUpperJS in internal/stringutil, backed by generated Unicode SpecialCasing + DerivedCoreProperties data (for Final_Sigma context handling).
- Switched the checker’s intrinsic string mapping implementation to use the new JS-compatible casing functions.
- Added compiler baseline coverage for special casing behavior + a Go unit test for the casing helpers; refactored a shared rune-range lookup helper into stringutil.
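The Final_Sigma context mentioned above can be sketched as follows. This is a simplified approximation, not the PR's implementation: the real rule relies on the Cased and Case_Ignorable properties from DerivedCoreProperties.txt, while this example approximates them with Go's unicode package (isCased and isCaseIgnorable are hypothetical helpers), so edge cases can differ.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isCased approximates the Unicode Cased property (assumption).
func isCased(r rune) bool {
	return unicode.IsUpper(r) || unicode.IsLower(r) || unicode.IsTitle(r)
}

// isCaseIgnorable approximates Case_Ignorable with a few general
// categories plus apostrophe and full stop (assumption).
func isCaseIgnorable(r rune) bool {
	return unicode.In(r, unicode.Mn, unicode.Me, unicode.Cf, unicode.Lm, unicode.Sk) ||
		r == '\'' || r == '.'
}

// isFinalSigma reports whether the sigma at index i is preceded by a cased
// letter and not followed by one, skipping case-ignorable runes both ways.
func isFinalSigma(runes []rune, i int) bool {
	preceded := false
	for j := i - 1; j >= 0; j-- {
		if isCaseIgnorable(runes[j]) {
			continue
		}
		preceded = isCased(runes[j])
		break
	}
	if !preceded {
		return false
	}
	for j := i + 1; j < len(runes); j++ {
		if isCaseIgnorable(runes[j]) {
			continue
		}
		return !isCased(runes[j])
	}
	return true
}

// toLowerSigmaSketch lowercases s, choosing between the final and
// non-final forms of sigma based on context.
func toLowerSigmaSketch(s string) string {
	runes := []rune(s)
	var b strings.Builder
	for i, r := range runes {
		if r == '\u03A3' { // GREEK CAPITAL LETTER SIGMA
			if isFinalSigma(runes, i) {
				b.WriteRune('\u03C2') // final sigma
			} else {
				b.WriteRune('\u03C3') // regular small sigma
			}
			continue
		}
		b.WriteRune(unicode.ToLower(r))
	}
	return b.String()
}

func main() {
	fmt.Println(toLowerSigmaSketch("ΟΔΟΣ"))
	fmt.Println(toLowerSigmaSketch("ΣΑΣ"))
}
```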
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| testdata/tests/cases/compiler/stringMappingSpecialCasing.ts | New compiler test covering special-casing string mappings (İ, Σ final sigma, ß, fi). |
| testdata/baselines/reference/compiler/stringMappingSpecialCasing.types | Reference baseline for types output of the new test. |
| testdata/baselines/reference/compiler/stringMappingSpecialCasing.symbols | Reference baseline for symbols output of the new test. |
| internal/stringutil/ranges.go | Introduces reusable rune-range binary search helper. |
| internal/stringutil/js_case.go | Adds JS-compatible case conversion with SpecialCasing + Final_Sigma context handling. |
| internal/stringutil/js_case_test.go | Unit tests for JS casing behavior (dotted i, final sigma, ß/ligatures). |
| internal/stringutil/js_case_generated.go | Generated Unicode special-casing mappings and property ranges used by JS casing. |
| internal/stringutil/generate.go | Adds go:generate directives to regenerate and format the Unicode casing tables. |
| internal/stringutil/_scripts/generate-special-casing.mts | Script to fetch/parse Unicode data and generate Go tables. |
| internal/scanner/scanner.go | Reuses stringutil.IsInRuneRanges instead of a scanner-local range helper. |
| internal/checker/checker.go | Uses JS-compatible casing for intrinsic string mapping types. |
Files not reviewed (1)
- internal/stringutil/js_case_generated.go: Language not supported
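As a rough idea of what a rune-range binary search helper like the one described for internal/stringutil/ranges.go might do, consider the sketch below. The type runeRange and the function inRanges are hypothetical; the actual signature of stringutil.IsInRuneRanges is not shown in this conversation and may well differ.

```go
package main

import (
	"fmt"
	"sort"
)

// runeRange is an inclusive [lo, hi] interval (hypothetical type).
type runeRange struct{ lo, hi rune }

// inRanges reports whether r falls in any range. The ranges must be
// sorted by lo and must not overlap.
func inRanges(r rune, ranges []runeRange) bool {
	// Find the first range that ends at or after r, then check its start.
	i := sort.Search(len(ranges), func(i int) bool { return ranges[i].hi >= r })
	return i < len(ranges) && ranges[i].lo <= r
}

func main() {
	alnum := []runeRange{{'0', '9'}, {'A', 'Z'}, {'a', 'z'}}
	fmt.Println(inRanges('5', alnum), inRanges('@', alnum))
}
```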
const SPECIAL_CASING_URL = "https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt";
const DERIVED_CORE_PROPERTIES_URL = "https://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt";
From what I understand, that's not possible. I added a code comment explaining this in my last commit.
For that reason, I implemented a committable code generator here instead of hand-rolling everything. It could be used to verify that the output file is up to date.
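For readers unfamiliar with the input data, the sketch below shows how a single SpecialCasing.txt record could be parsed. The generator in this PR is a TypeScript script; this Go version is only an illustration, and the names specialCasingEntry, parseSpecialCasingLine and parseCodePoints are hypothetical. The assumed record layout is: code; lower; title; upper; optional condition list; then a # comment.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// specialCasingEntry is one parsed record (hypothetical type).
type specialCasingEntry struct {
	Code                rune
	Lower, Title, Upper string
	Condition           string // e.g. "Final_Sigma"; empty if unconditional
}

// parseCodePoints turns "0053 0053" into "SS".
func parseCodePoints(field string) (string, bool) {
	var b strings.Builder
	for _, hex := range strings.Fields(field) {
		v, err := strconv.ParseUint(hex, 16, 32)
		if err != nil {
			return "", false
		}
		b.WriteRune(rune(v))
	}
	return b.String(), true
}

// parseSpecialCasingLine parses one line; it returns false for comment
// lines, blank lines, and malformed records.
func parseSpecialCasingLine(line string) (specialCasingEntry, bool) {
	if i := strings.IndexByte(line, '#'); i >= 0 {
		line = line[:i] // drop the trailing comment
	}
	fields := strings.Split(line, ";")
	if len(fields) < 4 {
		return specialCasingEntry{}, false
	}
	code, ok := parseCodePoints(fields[0])
	codeRunes := []rune(code)
	if !ok || len(codeRunes) != 1 {
		return specialCasingEntry{}, false
	}
	lower, okLower := parseCodePoints(fields[1])
	title, okTitle := parseCodePoints(fields[2])
	upper, okUpper := parseCodePoints(fields[3])
	if !okLower || !okTitle || !okUpper {
		return specialCasingEntry{}, false
	}
	entry := specialCasingEntry{Code: codeRunes[0], Lower: lower, Title: title, Upper: upper}
	if len(fields) > 4 {
		entry.Condition = strings.TrimSpace(fields[4])
	}
	return entry, true
}

func main() {
	e, ok := parseSpecialCasingLine("00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S")
	fmt.Printf("%v %U upper=%q title=%q\n", ok, e.Code, e.Upper, e.Title)
}
```

A freshness check could then regenerate the table from such records and compare the result byte for byte with the committed file.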
See also #3930 which uses x/text for this, though I'm wary of that too...
8f3cf1c to 9f9ed49
I won't lie, I used some heavy help from the agents now... I decided to write an audit script (I have not committed it yet):
internal/stringutil/_scripts/audit-js-casing.mts
#!/usr/bin/env -S node --experimental-strip-types --no-warnings
import {
    execFileSync,
    spawnSync,
} from "node:child_process";
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

const repoRoot = path.resolve(import.meta.dirname, "../../..");
const helperSource = path.join(import.meta.dirname, "compare-js-casing.go");
const helperDir = fs.mkdtempSync(path.join(os.tmpdir(), "tsgo-js-casing-"));
const helperBin = path.join(helperDir, "compare-js-casing");

// Remove the temporary helper binary when the script exits.
process.on("exit", () => {
    fs.rmSync(helperDir, { recursive: true, force: true });
});

// Build the Go helper that exposes the stringutil casing functions.
execFileSync("go", ["build", "-o", helperBin, helperSource], {
    cwd: repoRoot,
    stdio: "inherit",
});

// Split a string into its first code point and the remainder.
function firstCodePointSlice(s: string): [string, string] {
    if (!s) return ["", ""];
    const cp = s.codePointAt(0)!;
    const first = String.fromCodePoint(cp);
    const size = cp > 0xffff ? 2 : 1;
    return [first, s.slice(size)];
}

// Expected results, computed by the JS engine running this script.
function jsCases(s: string) {
    const [first, rest] = firstCodePointSlice(s);
    return {
        upper: s.toUpperCase(),
        lower: s.toLowerCase(),
        cap: first ? first.toUpperCase() + rest : s,
        uncap: first ? first.toLowerCase() + rest : s,
    };
}

type GoResult = {
    upper: string;
    lower: string;
    cap: string;
    uncap: string;
};

// Send one batch of inputs to the Go helper and parse its JSON output.
function runBatch(inputs: string[]): GoResult[] {
    const proc = spawnSync(helperBin, {
        input: JSON.stringify(inputs.map(value => ({ value }))),
        encoding: "utf8",
        maxBuffer: 64 * 1024 * 1024,
    });
    if (proc.status !== 0) {
        throw new Error(proc.stderr || `helper exited with ${proc.status}`);
    }
    return JSON.parse(proc.stdout);
}

function escapeVisible(s: string): string {
    return JSON.stringify(s).slice(1, -1);
}

type Bucket = {
    count: number;
    samples: { input: string; expected: string; actual: string; meta?: string; }[];
};

function makeBucket(): Bucket {
    return { count: 0, samples: [] };
}

// Count every mismatch, but keep at most 20 samples per bucket.
function noteMismatch(bucket: Bucket, input: string, expected: string, actual: string, meta?: string) {
    bucket.count++;
    if (bucket.samples.length < 20) {
        bucket.samples.push({
            input: escapeVisible(input),
            expected: escapeVisible(expected),
            actual: escapeVisible(actual),
            ...(meta ? { meta } : {}),
        });
    }
}

function makeResultStore() {
    return {
        upper: makeBucket(),
        lower: makeBucket(),
        cap: makeBucket(),
        uncap: makeBucket(),
    };
}

function compare(inputs: string[], metas: string[], store: ReturnType<typeof makeResultStore>) {
    const got = runBatch(inputs);
    for (let i = 0; i < inputs.length; i++) {
        const input = inputs[i];
        const meta = metas[i];
        const expected = jsCases(input);
        const actual = got[i];
        if (expected.upper !== actual.upper) noteMismatch(store.upper, input, expected.upper, actual.upper, meta);
        if (expected.lower !== actual.lower) noteMismatch(store.lower, input, expected.lower, actual.lower, meta);
        if (expected.cap !== actual.cap) noteMismatch(store.cap, input, expected.cap, actual.cap, meta);
        if (expected.uncap !== actual.uncap) noteMismatch(store.uncap, input, expected.uncap, actual.uncap, meta);
    }
}

// Compare every Unicode scalar value (surrogates excluded) as a one-character string.
function runSingleScalarSweep() {
    const store = makeResultStore();
    const batchSize = 50000;
    let batch: string[] = [];
    let metas: string[] = [];
    let testedScalars = 0;
    for (let cp = 0; cp <= 0x10ffff; cp++) {
        if (cp >= 0xd800 && cp <= 0xdfff) continue;
        batch.push(String.fromCodePoint(cp));
        metas.push(`U+${cp.toString(16).toUpperCase().padStart(4, "0")}`);
        if (batch.length === batchSize) {
            compare(batch, metas, store);
            testedScalars += batch.length;
            batch = [];
            metas = [];
        }
    }
    if (batch.length) {
        compare(batch, metas, store);
        testedScalars += batch.length;
    }
    return { testedScalars, mismatches: store };
}

// Compare lowercasing of Σ with every scalar value placed before or after it.
function runSigmaSweep(position: "before" | "after") {
    const lower = makeBucket();
    const batchSize = 50000;
    let batch: string[] = [];
    let metas: string[] = [];
    for (let cp = 0; cp <= 0x10ffff; cp++) {
        if (cp >= 0xd800 && cp <= 0xdfff) continue;
        const ch = String.fromCodePoint(cp);
        batch.push(position === "before" ? `${ch}Σ` : `Σ${ch}`);
        metas.push(`U+${cp.toString(16).toUpperCase().padStart(4, "0")}`);
        if (batch.length === batchSize) {
            compareSigmaBatch(batch, metas, lower);
            batch = [];
            metas = [];
        }
    }
    if (batch.length) {
        compareSigmaBatch(batch, metas, lower);
    }
    return lower;
}

function compareSigmaBatch(inputs: string[], metas: string[], lower: Bucket) {
    const got = runBatch(inputs);
    for (let i = 0; i < inputs.length; i++) {
        const expected = inputs[i].toLowerCase();
        const actual = got[i].lower;
        if (expected !== actual) {
            noteMismatch(lower, inputs[i], expected, actual, metas[i]);
        }
    }
}

const result = {
    singleScalarSweep: runSingleScalarSweep(),
    sigmaSweep: {
        before: runSigmaSweep("before"),
        after: runSigmaSweep("after"),
    },
};
console.log(JSON.stringify(result, null, 2));

internal/stringutil/_scripts/compare-js-casing.go
package main

import (
	"encoding/json"
	"io"
	"log"
	"os"
	"unicode/utf8"

	"github.com/microsoft/typescript-go/internal/stringutil"
)

type Input struct {
	Value string `json:"value"`
}

type Result struct {
	Upper string `json:"upper"`
	Lower string `json:"lower"`
	Cap   string `json:"cap"`
	Uncap string `json:"uncap"`
}

// main reads a JSON array of inputs from stdin and writes the casing
// results for each input to stdout as a JSON array.
func main() {
	var inputs []Input
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		log.Fatal(err)
	}
	if err := json.Unmarshal(data, &inputs); err != nil {
		log.Fatal(err)
	}
	results := make([]Result, len(inputs))
	for i, in := range inputs {
		results[i] = Result{
			Upper: stringutil.ToUpperJS(in.Value),
			Lower: stringutil.ToLowerJS(in.Value),
			Cap:   transformFirstRune(in.Value, stringutil.ToUpperJS),
			Uncap: transformFirstRune(in.Value, stringutil.ToLowerJS),
		}
	}
	if err := json.NewEncoder(os.Stdout).Encode(results); err != nil {
		log.Fatal(err)
	}
}

// transformFirstRune applies transform to the first rune only and
// leaves the rest of the string untouched.
func transformFirstRune(s string, transform func(string) string) string {
	if s == "" {
		return s
	}
	_, size := utf8.DecodeRuneInString(s)
	return transform(s[:size]) + s[size:]
}

This helped me to find divergences between JS and the Go implementation. I also compared the v8 and SpiderMonkey implementations and decided to just match v8 (that was the easiest with the oracle scripts above ;p and it probably matches the most common expectations anyway).
What were the results of the analysis? What is the difference?
I couldn't find any divergences between the current implementation and v8. When it comes to comparing against
'ʰΣ'.toLowerCase() // actual: "ʰς", expected: "ʰσ"
"ͅΣ".toLowerCase() // actual: "ͅς", expected: "ͅσ"
The only divergence class I found in testing was
fixes #3489