
Use JS-compatible Unicode casing for intrinsic string mappings #3495

Open
Andarist wants to merge 11 commits into microsoft:main from Andarist:fix/intrinsic-lowercase-uppercase

Conversation

@Andarist
Contributor

fixes #3489

Member

@jakebailey jakebailey left a comment

I am very uneasy about this. Why can't we use Go's unicode ranges to do this? How can we make sure this is up to date long term?

Comment thread internal/stringutil/generate.go Outdated
Comment on lines +8 to +9
func ToLowerJS(str string) string {
runes := []rune(str)
Member

The perf of this is going to be abysmal; surely we can bail out for ASCII?

Contributor Author

I implemented the bailout for ASCII
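
For readers following along, an ASCII bailout of the kind discussed here might look roughly like the sketch below. This is not the code from the PR: `lowerASCIIFastPath` is a hypothetical name, and the sketch only assumes that a pure-ASCII string never needs the Unicode special-casing tables.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// lowerASCIIFastPath is a hypothetical helper, not the PR's implementation.
// It lowercases s without allocating a []rune when s is pure ASCII.
// The boolean result is false when s contains a non-ASCII byte, in which
// case the caller would fall back to the full Unicode-aware path.
func lowerASCIIFastPath(s string) (string, bool) {
	hasUpper := false
	for i := 0; i < len(s); i++ {
		c := s[i]
		if c >= utf8.RuneSelf {
			return "", false // non-ASCII: the slow path has to handle it
		}
		if 'A' <= c && c <= 'Z' {
			hasUpper = true
		}
	}
	if !hasUpper {
		return s, true // already lowercase, no allocation needed
	}
	b := []byte(s)
	for i, c := range b {
		if 'A' <= c && c <= 'Z' {
			b[i] = c + ('a' - 'A')
		}
	}
	return string(b), true
}

func main() {
	for _, s := range []string{"Hello", "hello", "\u0130stanbul"} {
		out, ok := lowerASCIIFastPath(s)
		fmt.Printf("%+q -> %+q (ascii=%v)\n", s, out, ok)
	}
}
```

Scanning bytes first keeps the common case (ASCII identifiers and string literals) free of the `[]rune` conversion that the review comment flags.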

@RyanCavanaugh RyanCavanaugh added this to the TypeScript 7.0 RC milestone May 7, 2026
Member

@RyanCavanaugh RyanCavanaugh left a comment

Per Jake's comments

Copilot AI review requested due to automatic review settings May 18, 2026 09:46
Contributor

Copilot AI left a comment

Pull request overview

This PR fixes intrinsic string mapping types (Lowercase<>, Uppercase<>, Capitalize<>, Uncapitalize<>) to use ECMAScript-compatible Unicode casing semantics (not Go’s strings.ToLower/ToUpper), addressing special cases such as dotted İ, ß expansions, ligatures, and Greek final sigma.

Changes:

  • Implemented JS-compatible ToLowerJS/ToUpperJS in internal/stringutil, backed by generated Unicode SpecialCasing + DerivedCoreProperties data (for Final_Sigma context handling).
  • Switched the checker’s intrinsic string mapping implementation to use the new JS-compatible casing functions.
  • Added compiler baseline coverage for special casing behavior + a Go unit test for the casing helpers; refactored a shared rune-range lookup helper into stringutil.
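
To make the motivation concrete, the sketch below compares Go's standard library with the results a JS engine gives for the four special cases named in the overview. The JS results are written in as constants (they are what `String.prototype.toLowerCase`/`toUpperCase` return); the `specialCase` type and `cases` function are illustrative names, not part of the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// specialCase pairs the Go standard-library result with the JS result
// for one input. The type is illustrative only.
type specialCase struct {
	name   string
	input  string
	goOut  string // result of strings.ToLower / strings.ToUpper
	jsWant string // result of toLowerCase / toUpperCase in a JS engine
}

func cases() []specialCase {
	return []specialCase{
		// U+0130 lowercases to "i" + U+0307 (combining dot above) in JS.
		{"dotted capital I, lowercased", "\u0130", strings.ToLower("\u0130"), "i\u0307"},
		// U+00DF uppercases to the two-letter string "SS" in JS.
		{"sharp s, uppercased", "\u00df", strings.ToUpper("\u00df"), "SS"},
		// U+FB01 (fi ligature) uppercases to "FI" in JS.
		{"fi ligature, uppercased", "\ufb01", strings.ToUpper("\ufb01"), "FI"},
		// A word-final capital sigma lowercases to U+03C2 (final sigma) in JS.
		{"final sigma, lowercased", "\u0391\u03a3", strings.ToLower("\u0391\u03a3"), "\u03b1\u03c2"},
	}
}

func main() {
	for _, c := range cases() {
		fmt.Printf("%-30s input=%+q go=%+q js=%+q match=%v\n",
			c.name, c.input, c.goOut, c.jsWant, c.goOut == c.jsWant)
	}
}
```

Go's `strings.ToLower` and `strings.ToUpper` map each rune to exactly one rune with no surrounding context, so they cannot produce the one-to-many expansions or the context-dependent final sigma; every row reports a mismatch.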

Reviewed changes

Copilot reviewed 10 out of 11 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| testdata/tests/cases/compiler/stringMappingSpecialCasing.ts | New compiler test covering special-casing string mappings (İ, Σ final sigma, ß, fi). |
| testdata/baselines/reference/compiler/stringMappingSpecialCasing.types | Reference baseline for types output of the new test. |
| testdata/baselines/reference/compiler/stringMappingSpecialCasing.symbols | Reference baseline for symbols output of the new test. |
| internal/stringutil/ranges.go | Introduces reusable rune-range binary search helper. |
| internal/stringutil/js_case.go | Adds JS-compatible case conversion with SpecialCasing + Final_Sigma context handling. |
| internal/stringutil/js_case_test.go | Unit tests for JS casing behavior (dotted i, final sigma, ß/ligatures). |
| internal/stringutil/js_case_generated.go | Generated Unicode special-casing mappings and property ranges used by JS casing. |
| internal/stringutil/generate.go | Adds go:generate directives to regenerate and format the Unicode casing tables. |
| internal/stringutil/_scripts/generate-special-casing.mts | Script to fetch/parse Unicode data and generate Go tables. |
| internal/scanner/scanner.go | Reuses stringutil.IsInRuneRanges instead of a scanner-local range helper. |
| internal/checker/checker.go | Uses JS-compatible casing for intrinsic string mapping types. |
Files not reviewed (1)
  • internal/stringutil/js_case_generated.go: Language not supported

Comment on lines +7 to +8
const SPECIAL_CASING_URL = "https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt";
const DERIVED_CORE_PROPERTIES_URL = "https://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt";
@Andarist
Contributor Author

> Why can't we use Go's unicode ranges to do this?

From what I understand, that's not possible. I added a code comment explaining this in my last commit.

> How can we make sure this is up to date long term?

For that reason, I implemented a committable code generator here rather than hand-rolling everything. It could be used to verify that the output file is up to date.

@Andarist Andarist requested a review from jakebailey May 18, 2026 09:56
@jakebailey
Member

See also #3930 which uses x/text for this, though I'm wary of that too...

@Andarist Andarist force-pushed the fix/intrinsic-lowercase-uppercase branch from 8f3cf1c to 9f9ed49 Compare May 19, 2026 10:17
@Andarist
Contributor Author

I won't lie, I leaned heavily on agents for this part... I decided to write an audit script (I have not committed it yet):

internal/stringutil/_scripts/audit-js-casing.mts
#!/usr/bin/env -S node --experimental-strip-types --no-warnings

import {
    execFileSync,
    spawnSync,
} from "node:child_process";
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

const repoRoot = path.resolve(import.meta.dirname, "../../..");
const helperSource = path.join(import.meta.dirname, "compare-js-casing.go");
const helperDir = fs.mkdtempSync(path.join(os.tmpdir(), "tsgo-js-casing-"));
const helperBin = path.join(helperDir, "compare-js-casing");

process.on("exit", () => {
    fs.rmSync(helperDir, { recursive: true, force: true });
});

execFileSync("go", ["build", "-o", helperBin, helperSource], {
    cwd: repoRoot,
    stdio: "inherit",
});

function firstCodePointSlice(s: string): [string, string] {
    if (!s) return ["", ""];
    const cp = s.codePointAt(0)!;
    const first = String.fromCodePoint(cp);
    const size = cp > 0xffff ? 2 : 1;
    return [first, s.slice(size)];
}

function jsCases(s: string) {
    const [first, rest] = firstCodePointSlice(s);
    return {
        upper: s.toUpperCase(),
        lower: s.toLowerCase(),
        cap: first ? first.toUpperCase() + rest : s,
        uncap: first ? first.toLowerCase() + rest : s,
    };
}

type GoResult = {
    upper: string;
    lower: string;
    cap: string;
    uncap: string;
};

function runBatch(inputs: string[]): GoResult[] {
    const proc = spawnSync(helperBin, {
        input: JSON.stringify(inputs.map(value => ({ value }))),
        encoding: "utf8",
        maxBuffer: 64 * 1024 * 1024,
    });
    if (proc.status !== 0) {
        throw new Error(proc.stderr || `helper exited with ${proc.status}`);
    }
    return JSON.parse(proc.stdout);
}

function escapeVisible(s: string): string {
    return JSON.stringify(s).slice(1, -1);
}

type Bucket = {
    count: number;
    samples: { input: string; expected: string; actual: string; meta?: string; }[];
};

function makeBucket(): Bucket {
    return { count: 0, samples: [] };
}

function noteMismatch(bucket: Bucket, input: string, expected: string, actual: string, meta?: string) {
    bucket.count++;
    if (bucket.samples.length < 20) {
        bucket.samples.push({
            input: escapeVisible(input),
            expected: escapeVisible(expected),
            actual: escapeVisible(actual),
            ...(meta ? { meta } : {}),
        });
    }
}

function makeResultStore() {
    return {
        upper: makeBucket(),
        lower: makeBucket(),
        cap: makeBucket(),
        uncap: makeBucket(),
    };
}

function compare(inputs: string[], metas: string[], store: ReturnType<typeof makeResultStore>) {
    const got = runBatch(inputs);
    for (let i = 0; i < inputs.length; i++) {
        const input = inputs[i];
        const meta = metas[i];
        const expected = jsCases(input);
        const actual = got[i];
        if (expected.upper !== actual.upper) noteMismatch(store.upper, input, expected.upper, actual.upper, meta);
        if (expected.lower !== actual.lower) noteMismatch(store.lower, input, expected.lower, actual.lower, meta);
        if (expected.cap !== actual.cap) noteMismatch(store.cap, input, expected.cap, actual.cap, meta);
        if (expected.uncap !== actual.uncap) noteMismatch(store.uncap, input, expected.uncap, actual.uncap, meta);
    }
}

function runSingleScalarSweep() {
    const store = makeResultStore();
    const batchSize = 50000;
    let batch: string[] = [];
    let metas: string[] = [];
    let testedScalars = 0;
    for (let cp = 0; cp <= 0x10ffff; cp++) {
        if (cp >= 0xd800 && cp <= 0xdfff) continue;
        batch.push(String.fromCodePoint(cp));
        metas.push(`U+${cp.toString(16).toUpperCase().padStart(4, "0")}`);
        if (batch.length === batchSize) {
            compare(batch, metas, store);
            testedScalars += batch.length;
            batch = [];
            metas = [];
        }
    }
    if (batch.length) {
        compare(batch, metas, store);
        testedScalars += batch.length;
    }
    return { testedScalars, mismatches: store };
}

function runSigmaSweep(position: "before" | "after") {
    const lower = makeBucket();
    const batchSize = 50000;
    let batch: string[] = [];
    let metas: string[] = [];
    for (let cp = 0; cp <= 0x10ffff; cp++) {
        if (cp >= 0xd800 && cp <= 0xdfff) continue;
        const ch = String.fromCodePoint(cp);
        batch.push(position === "before" ? `${ch}Σ` : `Σ${ch}`);
        metas.push(`U+${cp.toString(16).toUpperCase().padStart(4, "0")}`);
        if (batch.length === batchSize) {
            compareSigmaBatch(batch, metas, lower);
            batch = [];
            metas = [];
        }
    }
    if (batch.length) {
        compareSigmaBatch(batch, metas, lower);
    }
    return lower;
}

function compareSigmaBatch(inputs: string[], metas: string[], lower: Bucket) {
    const got = runBatch(inputs);
    for (let i = 0; i < inputs.length; i++) {
        const expected = inputs[i].toLowerCase();
        const actual = got[i].lower;
        if (expected !== actual) {
            noteMismatch(lower, inputs[i], expected, actual, metas[i]);
        }
    }
}

const result = {
    singleScalarSweep: runSingleScalarSweep(),
    sigmaSweep: {
        before: runSigmaSweep("before"),
        after: runSigmaSweep("after"),
    },
};

console.log(JSON.stringify(result, null, 2));
internal/stringutil/_scripts/compare-js-casing.go
package main

import (
	"encoding/json"
	"io"
	"log"
	"os"
	"unicode/utf8"

	"github.com/microsoft/typescript-go/internal/stringutil"
)

type Input struct {
	Value string `json:"value"`
}

type Result struct {
	Upper string `json:"upper"`
	Lower string `json:"lower"`
	Cap   string `json:"cap"`
	Uncap string `json:"uncap"`
}

func main() {
	var inputs []Input
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		log.Fatal(err)
	}
	if err := json.Unmarshal(data, &inputs); err != nil {
		log.Fatal(err)
	}

	results := make([]Result, len(inputs))
	for i, in := range inputs {
		results[i] = Result{
			Upper: stringutil.ToUpperJS(in.Value),
			Lower: stringutil.ToLowerJS(in.Value),
			Cap:   transformFirstRune(in.Value, stringutil.ToUpperJS),
			Uncap: transformFirstRune(in.Value, stringutil.ToLowerJS),
		}
	}

	if err := json.NewEncoder(os.Stdout).Encode(results); err != nil {
		log.Fatal(err)
	}
}

func transformFirstRune(s string, transform func(string) string) string {
	if s == "" {
		return s
	}
	_, size := utf8.DecodeRuneInString(s)
	return transform(s[:size]) + s[size:]
}

This helped me find divergences between JS and x/text. So, based on this investigation, x/text isn't a suitable replacement for all of this if the goal is to match the JS semantics.

I also compared the V8 and SpiderMonkey implementations and decided to just match V8 (that was the easiest with the oracle scripts above ;p and it probably matches the most common expectations anyway).

@jakebailey
Member

What were the results of the analysis? What is the difference?

@Andarist
Contributor Author

I couldn't find any divergences between the current implementation and v8.

When it comes to comparing against x/text (where actual is x/text and expected is what v8 returns here):

'ʰΣ'.toLowerCase() // actual: "ʰς", expected: "ʰσ"
"ͅΣ".toLowerCase() // actual: "ͅς", expected: "ͅσ"

The only divergence class I found in testing was the toLowerCase final-sigma context, for strings shaped like the two examples above (the oracle found 267 such divergences, but they were all of this shape).
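
Since σ and ς look alike and the second example starts with a combining mark, a small helper that prints code points can make the two samples easier to read. The sketch below is only a reading aid: `codePoints` is a hypothetical name, and the escapes assume the sample characters are U+02B0 and U+0345 followed by U+03A3, as they appear in the comparison.

```go
package main

import (
	"fmt"
	"strings"
)

// codePoints renders each rune of s as U+XXXX so that visually similar
// characters, such as U+03C3 and U+03C2, can be told apart.
func codePoints(s string) string {
	parts := make([]string, 0, len(s))
	for _, r := range s {
		parts = append(parts, fmt.Sprintf("U+%04X", r))
	}
	return strings.Join(parts, " ")
}

func main() {
	// Values copied from the comparison; escapes are an assumption about
	// which code points the rendered characters correspond to.
	samples := []struct{ input, xtext, v8 string }{
		{"\u02b0\u03a3", "\u02b0\u03c2", "\u02b0\u03c3"},
		{"\u0345\u03a3", "\u0345\u03c2", "\u0345\u03c3"},
	}
	for _, s := range samples {
		fmt.Printf("input %s: x/text %s, V8 %s\n",
			codePoints(s.input), codePoints(s.xtext), codePoints(s.v8))
	}
}
```

Printed this way, the two results differ only in the last code point: U+03C2 (final sigma) from x/text versus U+03C3 from V8.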

Development

Successfully merging this pull request may close these issues.

`Lowercase<"İ">` differs between tsgo and tsc (Go strings.ToLower vs JS String.prototype.toLowerCase)

4 participants