Use JS-compatible Unicode casing for intrinsic string mappings by Andarist · Pull Request #3495 · microsoft/typescript-go

Andarist · 2026-04-22T09:48:07Z

jakebailey

I am very uneasy about this. Why can't we use Go's unicode ranges to do this? How can we make sure this is up to date long term?

jakebailey · 2026-04-22T15:53:06Z

+func ToLowerJS(str string) string {
+	runes := []rune(str)


The perf of this is going to be abysmal; surely we can bail out for ASCII?

I implemented the bailout for ASCII
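For readers following along, here is a minimal sketch of what an ASCII bailout could look like. It is not the code from the PR, which is not shown in this thread; `isASCII`, `toLowerWithBailout`, and the `slow` callback are hypothetical names used only for illustration. The idea is that pure-ASCII input never needs the `[]rune` conversion or any special-casing lookups.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// isASCII reports whether s contains only bytes below utf8.RuneSelf (0x80).
func isASCII(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] >= utf8.RuneSelf {
			return false
		}
	}
	return true
}

// toLowerWithBailout is a hypothetical wrapper: ASCII input takes the cheap
// path, anything else is handed to slow, the full Unicode-aware routine.
func toLowerWithBailout(s string, slow func(string) string) string {
	if isASCII(s) {
		// For pure ASCII, only A-Z change, so the standard library suffices.
		return strings.ToLower(s)
	}
	return slow(s)
}

func main() {
	// Stand-in for the expensive path, so the example shows which branch ran.
	slow := func(s string) string { return "<slow:" + s + ">" }
	fmt.Println(toLowerWithBailout("Hello", slow))
	fmt.Println(toLowerWithBailout("Straße", slow))
}
```

The byte loop is a single pass with no allocation, so the check is cheap compared with the rune-slice conversion it avoids.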

RyanCavanaugh

Per Jake's comments

Copilot

Pull request overview

This PR fixes intrinsic string mapping types (Lowercase<>, Uppercase<>, Capitalize<>, Uncapitalize<>) to use ECMAScript-compatible Unicode casing semantics (not Go’s strings.ToLower/ToUpper), addressing special-casing cases like dotted İ, ß expansions, ligatures, and Greek final sigma.

Changes:

Implemented JS-compatible ToLowerJS/ToUpperJS in internal/stringutil, backed by generated Unicode SpecialCasing + DerivedCoreProperties data (for Final_Sigma context handling).
Switched the checker’s intrinsic string mapping implementation to use the new JS-compatible casing functions.
Added compiler baseline coverage for special casing behavior + a Go unit test for the casing helpers; refactored a shared rune-range lookup helper into stringutil.
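To make the divergence concrete, the sketch below records what Go's standard library returns for the four cases named in the overview, next to the result JavaScript gives for the same input. Go's `strings.ToUpper` and `strings.ToLower` map one rune to one rune without looking at context, which is why expansions and the final-sigma rule are lost. The `sample` type and `samples` function are illustrative names, not part of the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// sample pairs what Go's standard library returns with the JavaScript result
// for the same input.
type sample struct {
	Call string
	Go   string
	JS   string
}

func samples() []sample {
	return []sample{
		// ß has no single-rune uppercase; JS expands it to "SS".
		{`toUpper("ß")`, strings.ToUpper("\u00df"), "SS"},
		// The fi ligature expands to two letters in JS.
		{`toUpper("ﬁ")`, strings.ToUpper("\ufb01"), "FI"},
		// JS lowercases dotted capital I to "i" plus combining dot above.
		{`toLower("İ")`, strings.ToLower("\u0130"), "i\u0307"},
		// JS applies the Final_Sigma rule at the end of a word.
		{`toLower("ΑΣ")`, strings.ToLower("\u0391\u03a3"), "\u03b1\u03c2"},
	}
}

func main() {
	for _, s := range samples() {
		fmt.Printf("%-14s Go=%+q JS=%+q\n", s.Call, s.Go, s.JS)
	}
}
```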

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 1 comment.

File	Description
testdata/tests/cases/compiler/stringMappingSpecialCasing.ts	New compiler test covering special-casing string mappings (İ, Σ final sigma, ß, ﬁ).
testdata/baselines/reference/compiler/stringMappingSpecialCasing.types	Reference baseline for types output of the new test.
testdata/baselines/reference/compiler/stringMappingSpecialCasing.symbols	Reference baseline for symbols output of the new test.
internal/stringutil/ranges.go	Introduces reusable rune-range binary search helper.
internal/stringutil/js_case.go	Adds JS-compatible case conversion with SpecialCasing + Final_Sigma context handling.
internal/stringutil/js_case_test.go	Unit tests for JS casing behavior (dotted i, final sigma, ß/ligatures).
internal/stringutil/js_case_generated.go	Generated Unicode special-casing mappings and property ranges used by JS casing.
internal/stringutil/generate.go	Adds `go:generate` directives to regenerate and format the Unicode casing tables.
internal/stringutil/_scripts/generate-special-casing.mts	Script to fetch/parse Unicode data and generate Go tables.
internal/scanner/scanner.go	Reuses `stringutil.IsInRuneRanges` instead of a scanner-local range helper.
internal/checker/checker.go	Uses JS-compatible casing for intrinsic string mapping types.
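The table mentions a reusable rune-range binary search helper. The actual signature of `stringutil.IsInRuneRanges` is not shown in this thread, so the following is only a sketch of the general technique under assumed names (`runeRange`, `inRanges`) and an assumed table layout of sorted, non-overlapping inclusive ranges.

```go
package main

import (
	"fmt"
	"sort"
)

// runeRange is an inclusive range [Lo, Hi]. This layout is an assumption made
// for the example, not the one used in the PR.
type runeRange struct{ Lo, Hi rune }

// inRanges reports whether r falls inside any range. The slice must be sorted
// by Lo and must not contain overlapping ranges.
func inRanges(r rune, ranges []runeRange) bool {
	// Find the first range whose upper bound is not below r.
	i := sort.Search(len(ranges), func(i int) bool { return ranges[i].Hi >= r })
	return i < len(ranges) && ranges[i].Lo <= r
}

func main() {
	// Greek capital letters; U+03A2 is an unassigned gap between the ranges.
	greekUpper := []runeRange{{0x0391, 0x03A1}, {0x03A3, 0x03A9}}
	fmt.Println(inRanges(0x03A3, greekUpper)) // Σ lies in the second range
	fmt.Println(inRanges(0x03A2, greekUpper)) // the gap is in neither range
}
```

A lookup costs O(log n) comparisons, which matters when a property such as Cased or Case_Ignorable spans hundreds of ranges.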

Files not reviewed (1)

internal/stringutil/js_case_generated.go: Language not supported

+const SPECIAL_CASING_URL = "https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt";
+const DERIVED_CORE_PROPERTIES_URL = "https://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt";


Andarist · 2026-05-18T09:56:07Z

Why can't we use Go's unicode ranges to do this?

From what I understand, that's not possible - I added the code comment explaining this in my last commit

How can we make sure this is up to date long term?

For that reason, I implemented a committable code generator here instead of hand-rolling everything. It could also be used to verify that the output file is up to date.
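The generator in the PR is a TypeScript script whose source is not included in this thread. As a rough illustration of the input it has to handle, the sketch below parses individual lines of SpecialCasing.txt, whose fields are `code; lower; title; upper; optional condition list; # comment`. The `specialCase` type and both function names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// specialCase holds the fields of one SpecialCasing.txt entry that matter for
// lowercasing and uppercasing (the title-case field is ignored here).
type specialCase struct {
	Code      rune
	Lower     []rune
	Upper     []rune
	Condition string // e.g. "Final_Sigma"; empty when unconditional
}

// parseHexRunes converts a space-separated list of hex code points to runes.
func parseHexRunes(field string) ([]rune, error) {
	var out []rune
	for _, tok := range strings.Fields(field) {
		v, err := strconv.ParseUint(tok, 16, 32)
		if err != nil {
			return nil, err
		}
		out = append(out, rune(v))
	}
	return out, nil
}

// parseLine parses one line. ok is false for blank and comment-only lines.
func parseLine(line string) (entry specialCase, ok bool, err error) {
	if i := strings.IndexByte(line, '#'); i >= 0 {
		line = line[:i] // drop the trailing comment
	}
	line = strings.TrimSpace(line)
	if line == "" {
		return entry, false, nil
	}
	fields := strings.Split(line, ";")
	if len(fields) < 4 {
		return entry, false, fmt.Errorf("malformed line: %q", line)
	}
	code, err := parseHexRunes(fields[0])
	if err != nil || len(code) != 1 {
		return entry, false, fmt.Errorf("bad code point in %q", line)
	}
	if entry.Lower, err = parseHexRunes(fields[1]); err != nil {
		return entry, false, err
	}
	if entry.Upper, err = parseHexRunes(fields[3]); err != nil {
		return entry, false, err
	}
	entry.Code = code[0]
	if len(fields) > 4 {
		entry.Condition = strings.TrimSpace(fields[4])
	}
	return entry, true, nil
}

func main() {
	lines := []string{
		"00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S",
		"03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA",
	}
	for _, l := range lines {
		e, ok, err := parseLine(l)
		fmt.Printf("%U lower=%q upper=%q cond=%q ok=%v err=%v\n",
			e.Code, string(e.Lower), string(e.Upper), e.Condition, ok, err)
	}
}
```

Checking freshness then amounts to regenerating the table and comparing it with the committed file, which is the kind of verification the comment above alludes to.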

jakebailey · 2026-05-18T14:37:14Z

See also #3930 which uses x/text for this, though I'm wary of that too...

Andarist · 2026-05-19T10:22:07Z

I won't lie, I used some heavy help from the agents now... I decided to write an audit script (I have not committed it yet):

internal/stringutil/_scripts/audit-js-casing.mts

#!/usr/bin/env -S node --experimental-strip-types --no-warnings

import {
    execFileSync,
    spawnSync,
} from "node:child_process";
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

const repoRoot = path.resolve(import.meta.dirname, "../../..");
const helperSource = path.join(import.meta.dirname, "compare-js-casing.go");
const helperDir = fs.mkdtempSync(path.join(os.tmpdir(), "tsgo-js-casing-"));
const helperBin = path.join(helperDir, "compare-js-casing");

process.on("exit", () => {
    fs.rmSync(helperDir, { recursive: true, force: true });
});

execFileSync("go", ["build", "-o", helperBin, helperSource], {
    cwd: repoRoot,
    stdio: "inherit",
});

function firstCodePointSlice(s: string): [string, string] {
    if (!s) return ["", ""];
    const cp = s.codePointAt(0)!;
    const first = String.fromCodePoint(cp);
    const size = cp > 0xffff ? 2 : 1;
    return [first, s.slice(size)];
}

function jsCases(s: string) {
    const [first, rest] = firstCodePointSlice(s);
    return {
        upper: s.toUpperCase(),
        lower: s.toLowerCase(),
        cap: first ? first.toUpperCase() + rest : s,
        uncap: first ? first.toLowerCase() + rest : s,
    };
}

type GoResult = {
    upper: string;
    lower: string;
    cap: string;
    uncap: string;
};

function runBatch(inputs: string[]): GoResult[] {
    const proc = spawnSync(helperBin, {
        input: JSON.stringify(inputs.map(value => ({ value }))),
        encoding: "utf8",
        maxBuffer: 64 * 1024 * 1024,
    });
    if (proc.status !== 0) {
        throw new Error(proc.stderr || `helper exited with ${proc.status}`);
    }
    return JSON.parse(proc.stdout);
}

function escapeVisible(s: string): string {
    return JSON.stringify(s).slice(1, -1);
}

type Bucket = {
    count: number;
    samples: { input: string; expected: string; actual: string; meta?: string; }[];
};

function makeBucket(): Bucket {
    return { count: 0, samples: [] };
}

function noteMismatch(bucket: Bucket, input: string, expected: string, actual: string, meta?: string) {
    bucket.count++;
    if (bucket.samples.length < 20) {
        bucket.samples.push({
            input: escapeVisible(input),
            expected: escapeVisible(expected),
            actual: escapeVisible(actual),
            ...(meta ? { meta } : {}),
        });
    }
}

function makeResultStore() {
    return {
        upper: makeBucket(),
        lower: makeBucket(),
        cap: makeBucket(),
        uncap: makeBucket(),
    };
}

function compare(inputs: string[], metas: string[], store: ReturnType<typeof makeResultStore>) {
    const got = runBatch(inputs);
    for (let i = 0; i < inputs.length; i++) {
        const input = inputs[i];
        const meta = metas[i];
        const expected = jsCases(input);
        const actual = got[i];
        if (expected.upper !== actual.upper) noteMismatch(store.upper, input, expected.upper, actual.upper, meta);
        if (expected.lower !== actual.lower) noteMismatch(store.lower, input, expected.lower, actual.lower, meta);
        if (expected.cap !== actual.cap) noteMismatch(store.cap, input, expected.cap, actual.cap, meta);
        if (expected.uncap !== actual.uncap) noteMismatch(store.uncap, input, expected.uncap, actual.uncap, meta);
    }
}

function runSingleScalarSweep() {
    const store = makeResultStore();
    const batchSize = 50000;
    let batch: string[] = [];
    let metas: string[] = [];
    let testedScalars = 0;
    for (let cp = 0; cp <= 0x10ffff; cp++) {
        if (cp >= 0xd800 && cp <= 0xdfff) continue;
        batch.push(String.fromCodePoint(cp));
        metas.push(`U+${cp.toString(16).toUpperCase().padStart(4, "0")}`);
        if (batch.length === batchSize) {
            compare(batch, metas, store);
            testedScalars += batch.length;
            batch = [];
            metas = [];
        }
    }
    if (batch.length) {
        compare(batch, metas, store);
        testedScalars += batch.length;
    }
    return { testedScalars, mismatches: store };
}

function runSigmaSweep(position: "before" | "after") {
    const lower = makeBucket();
    const batchSize = 50000;
    let batch: string[] = [];
    let metas: string[] = [];
    for (let cp = 0; cp <= 0x10ffff; cp++) {
        if (cp >= 0xd800 && cp <= 0xdfff) continue;
        const ch = String.fromCodePoint(cp);
        batch.push(position === "before" ? `${ch}Σ` : `Σ${ch}`);
        metas.push(`U+${cp.toString(16).toUpperCase().padStart(4, "0")}`);
        if (batch.length === batchSize) {
            compareSigmaBatch(batch, metas, lower);
            batch = [];
            metas = [];
        }
    }
    if (batch.length) {
        compareSigmaBatch(batch, metas, lower);
    }
    return lower;
}

function compareSigmaBatch(inputs: string[], metas: string[], lower: Bucket) {
    const got = runBatch(inputs);
    for (let i = 0; i < inputs.length; i++) {
        const expected = inputs[i].toLowerCase();
        const actual = got[i].lower;
        if (expected !== actual) {
            noteMismatch(lower, inputs[i], expected, actual, metas[i]);
        }
    }
}

const result = {
    singleScalarSweep: runSingleScalarSweep(),
    sigmaSweep: {
        before: runSigmaSweep("before"),
        after: runSigmaSweep("after"),
    },
};

console.log(JSON.stringify(result, null, 2));

internal/stringutil/_scripts/compare-js-casing.go

package main

import (
	"encoding/json"
	"io"
	"log"
	"os"
	"unicode/utf8"

	"github.com/microsoft/typescript-go/internal/stringutil"
)

type Input struct {
	Value string `json:"value"`
}

type Result struct {
	Upper string `json:"upper"`
	Lower string `json:"lower"`
	Cap   string `json:"cap"`
	Uncap string `json:"uncap"`
}

func main() {
	var inputs []Input
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		log.Fatal(err)
	}
	if err := json.Unmarshal(data, &inputs); err != nil {
		log.Fatal(err)
	}

	results := make([]Result, len(inputs))
	for i, in := range inputs {
		results[i] = Result{
			Upper: stringutil.ToUpperJS(in.Value),
			Lower: stringutil.ToLowerJS(in.Value),
			Cap:   transformFirstRune(in.Value, stringutil.ToUpperJS),
			Uncap: transformFirstRune(in.Value, stringutil.ToLowerJS),
		}
	}

	if err := json.NewEncoder(os.Stdout).Encode(results); err != nil {
		log.Fatal(err)
	}
}

func transformFirstRune(s string, transform func(string) string) string {
	if s == "" {
		return s
	}
	_, size := utf8.DecodeRuneInString(s)
	return transform(s[:size]) + s[size:]
}

This helped me find divergences between JS and x/text. So, based on this investigation, x/text isn't a suitable replacement for all of this if the goal is to match the JS semantics.

I also compared the V8 and SpiderMonkey implementations and decided to just match V8 (that was the easiest with the oracle scripts above ;p and it probably matches the most common expectations anyway).

jakebailey · 2026-05-19T14:42:23Z

What were the results of the analysis? What is the difference?

Andarist · 2026-05-19T15:33:05Z

I couldn't find any divergences between the current implementation and v8.

When it comes to comparing against x/text (where actual is x/text and expected is what v8 returns here):

'ʰΣ'.toLowerCase() // actual: "ʰς", expected: "ʰσ"
"ͅΣ".toLowerCase() // actual: "ͅς", expected: "ͅσ"

The only divergence class I found in testing was toLowerCase final-sigma context for strings of the form rΣ (the oracle found 267 such divergences but they were all of this shape).
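The sketch below illustrates the Final_Sigma context check that this divergence class hinges on. It is not the PR's implementation, which is not shown here. The property predicates are toy stand-ins for generated Cased and Case_Ignorable tables, and all names are hypothetical. One assumption is made deliberately to reproduce the V8 result quoted above: Case_Ignorable is tested before Cased, so a character carrying both properties, such as U+02B0 (ʰ), is skipped rather than treated as a preceding cased letter.

```go
package main

import "fmt"

const (
	capitalSigma = '\u03a3' // Σ
	smallSigma   = '\u03c3' // σ
	finalSigma   = '\u03c2' // ς
)

// props supplies the two Unicode properties the rule depends on.
type props struct {
	cased         func(rune) bool
	caseIgnorable func(rune) bool
}

// casedNeighbor walks from start in direction step, skipping case-ignorable
// runes, and reports whether the first remaining rune is cased.
// Assumption: the ignorable test runs first, so runes with both properties
// are skipped.
func casedNeighbor(rs []rune, start, step int, p props) bool {
	for j := start; j >= 0 && j < len(rs); j += step {
		if p.caseIgnorable(rs[j]) {
			continue
		}
		return p.cased(rs[j])
	}
	return false
}

// lowerSigmas lowercases only capital sigma, choosing ς when a cased letter
// precedes it and none follows; every other rune is copied unchanged.
func lowerSigmas(s string, p props) string {
	rs := []rune(s)
	out := make([]rune, len(rs))
	for i, r := range rs {
		switch {
		case r != capitalSigma:
			out[i] = r
		case casedNeighbor(rs, i-1, -1, p) && !casedNeighbor(rs, i+1, +1, p):
			out[i] = finalSigma
		default:
			out[i] = smallSigma
		}
	}
	return string(out)
}

// toyProps covers only the runes used in this example.
func toyProps() props {
	return props{
		cased: func(r rune) bool {
			return r == '\u0391' || r == capitalSigma || r == '\u02b0'
		},
		caseIgnorable: func(r rune) bool {
			return r == '\u02b0' || r == '.'
		},
	}
}

func main() {
	p := toyProps()
	for _, s := range []string{"\u0391\u03a3", "\u0391\u03a3\u0391", "\u03a3", "\u02b0\u03a3"} {
		fmt.Printf("%q -> %q\n", s, lowerSigmas(s, p))
	}
}
```

Swapping the order of the two property tests in `casedNeighbor` would turn the last case into ς, which is the x/text result reported above.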

Andarist added 3 commits April 22, 2026 11:34

Use JS-compatible Unicode casing for intrinsic string mappings

695a3b4

share helper

5deda00

fmt

9b2663c

jakebailey reviewed Apr 22, 2026

RyanCavanaugh added this to the TypeScript 7.0 RC milestone May 7, 2026

RyanCavanaugh requested changes May 7, 2026

Andarist added 2 commits May 18, 2026 11:40

Merge remote-tracking branch 'origin/main' into fix/intrinsic-lowerca…

ef27351

…se-uppercase

address PR feedback

c17e23c

Copilot AI review requested due to automatic review settings May 18, 2026 09:46

Copilot started reviewing on behalf of Andarist May 18, 2026 09:48

Copilot AI reviewed May 18, 2026

Andarist requested a review from jakebailey May 18, 2026 09:56

Andarist added 6 commits May 18, 2026 18:56

fix things

fddb27a

Merge remote-tracking branch 'origin/main' into fix/intrinsic-lowerca…

03f57a1

…se-uppercase

fix things

4018330

fix more

62f4f9e

fmt

1145e2d

add comment

9f9ed49

Andarist force-pushed the fix/intrinsic-lowercase-uppercase branch from 8f3cf1c to 9f9ed49 Compare May 19, 2026 10:17

Andarist wants to merge 11 commits into microsoft:main from Andarist:fix/intrinsic-lowercase-uppercase

4 participants