rustpython-unicode #7560

@youknowone

rustpython-unicode should be a shared crate that provides CPython-compatible Unicode
semantics and Unicode data for Rust-based Python implementations and tools.

Its purpose is not to implement Python string objects or high-level string methods. Instead,
it should provide the Unicode foundation that Python runtimes need: character classification,
case mapping, normalization, identifier rules, regex character-class predicates, and
unicodedata-style access to Unicode database information.

This makes it a natural shared dependency between RustPython and Pyre, while keeping actual
string operations in crate::str or the host runtime.

Design Goals

  • Match CPython behavior at the Unicode data and semantics level, not just in a few visible
    edge cases.
  • Use a single version-pinned Unicode data source per targeted CPython release.
  • Expose low-level APIs that can be reused by str, re, parser, compiler, and unicodedata.
  • Support non-scalar code points where Python behavior requires it, so the core API should be
    u32-based rather than char-based.
  • Be usable outside RustPython, especially by other Python-related Rust projects.
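
The u32 requirement can be illustrated with a lone surrogate, which a Python str may hold but a Rust char cannot represent. A minimal sketch; `MAX_CODE_POINT`, `is_surrogate`, and `to_scalar` are hypothetical names used only for illustration:

```rust
/// Largest code point a Python `str` can hold (the value of `sys.maxunicode`).
pub const MAX_CODE_POINT: u32 = 0x10FFFF;

/// Hypothetical helper: true for surrogate code points (U+D800..=U+DFFF).
/// These are valid inside Python strings but are not Unicode scalar values.
pub fn is_surrogate(cp: u32) -> bool {
    (0xD800..=0xDFFF).contains(&cp)
}

/// Hypothetical helper: convert to `char` only when `cp` is a scalar value.
/// Returns `None` for surrogates and for anything above U+10FFFF.
pub fn to_scalar(cp: u32) -> Option<char> {
    char::from_u32(cp)
}
```

A char-based API would have to reject or replace U+D800 before any predicate could run, whereas a u32-based API can still classify it.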

Scope

rustpython-unicode should include:

  • Unicode character classification used by Python:
    • isalpha
    • isalnum
    • isdecimal
    • isdigit
    • isnumeric
    • isspace
    • isprintable
    • casing predicates if needed
  • Identifier-related predicates:
    • is_xid_start
    • is_xid_continue
    • Python identifier helpers
  • Regex-oriented Unicode predicates:
    • Unicode \w
    • Unicode \d
    • Unicode \s
    • any other CPython regex character classes that depend on Unicode tables
  • Case conversion and casing-related data:
    • lowercase
    • uppercase
    • titlecase
    • casefold
    • full mappings where Python requires them
  • Unicode normalization support needed by Python:
    • NFC
    • NFD
    • NFKC
    • NFKD
    • is_normalized
  • unicodedata-style database access:
    • general category
    • bidirectional class
    • combining class
    • east asian width
    • mirrored
    • decomposition
    • decimal/digit/numeric values
    • name lookup
    • character lookup by name
    • Unicode age/version checks if needed for CPython compatibility layers
  • Versioned Unicode tables aligned with CPython.
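
As one illustration of the identifier items above, the Python-specific helpers could be thin layers over the XID predicates. This is a sketch under stated assumptions: the two `is_xid_*` functions here are ASCII-only stand-ins for the real table lookups, and `is_identifier` is a hypothetical convenience function, not a settled API:

```rust
/// ASCII-only stand-in for the table-driven XID_Start lookup (assumption).
fn is_xid_start(cp: u32) -> bool {
    matches!(cp, 0x41..=0x5A | 0x61..=0x7A)
}

/// ASCII-only stand-in for the table-driven XID_Continue lookup (assumption).
fn is_xid_continue(cp: u32) -> bool {
    is_xid_start(cp) || matches!(cp, 0x30..=0x39 | 0x5F)
}

/// Python additionally accepts '_' (U+005F) as the first character.
pub fn is_python_identifier_start(cp: u32) -> bool {
    cp == 0x5F || is_xid_start(cp)
}

pub fn is_python_identifier_continue(cp: u32) -> bool {
    is_xid_continue(cp)
}

/// Hypothetical helper: checks a whole code point sequence.
/// An empty sequence is not an identifier.
pub fn is_identifier(cps: &[u32]) -> bool {
    match cps.split_first() {
        Some((&first, rest)) => {
            is_python_identifier_start(first)
                && rest.iter().all(|&cp| is_python_identifier_continue(cp))
        }
        None => false,
    }
}
```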

Non-Goals

rustpython-unicode should not include:

  • Python str object behavior
  • slicing, searching, splitting, joining, formatting, padding, or other string algorithms
  • Python object model concerns
  • interpreter-specific wrappers

Those belong in crate::str or the embedding runtime.

Architecture

A good split would be:

  • rustpython-unicode
    • owns Unicode tables and Unicode semantics
    • exposes u32-based predicates and mappings
    • exposes unicodedata-style query APIs
  • crate::str
    • owns Python string methods and higher-level string algorithms
    • calls into rustpython-unicode for all Unicode-sensitive behavior
  • regex engine
    • calls into rustpython-unicode for Unicode character classes
  • unicodedata module
    • becomes a thin wrapper over rustpython-unicode

This keeps one authoritative Unicode path across the runtime.
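
The split above might look like the following in practice. This is only a sketch: the `unicode` module stands in for rustpython-unicode with an ASCII-only table, and `pystr` is a hypothetical name for the string layer:

```rust
/// Stand-in for the proposed rustpython-unicode crate.
/// The real predicate would be table-driven; this one covers ASCII only.
mod unicode {
    pub fn is_decimal(cp: u32) -> bool {
        (0x30..=0x39).contains(&cp)
    }
}

/// Hypothetical string layer: owns the Python-level rule
/// ("non-empty and every code point is decimal") and delegates
/// the per-code-point question to the Unicode crate.
mod pystr {
    use super::unicode;

    pub fn isdecimal(s: &[u32]) -> bool {
        !s.is_empty() && s.iter().all(|&cp| unicode::is_decimal(cp))
    }
}
```

The string layer never touches a table directly, so a regex engine or a unicodedata wrapper calling the same predicate gets the same answer.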

Suggested Public API Direction

pub mod classify {
    pub fn is_alpha(cp: u32) -> bool;
    pub fn is_alnum(cp: u32) -> bool;
    pub fn is_decimal(cp: u32) -> bool;
    pub fn is_digit(cp: u32) -> bool;
    pub fn is_numeric(cp: u32) -> bool;
    pub fn is_space(cp: u32) -> bool;
    pub fn is_printable(cp: u32) -> bool;
}

pub mod identifier {
    pub fn is_xid_start(cp: u32) -> bool;
    pub fn is_xid_continue(cp: u32) -> bool;
    pub fn is_python_identifier_start(cp: u32) -> bool;
    pub fn is_python_identifier_continue(cp: u32) -> bool;
}

pub mod regex {
    pub fn is_word(cp: u32) -> bool;
    pub fn is_digit(cp: u32) -> bool;
    pub fn is_space(cp: u32) -> bool;
}

pub mod case {
    pub fn to_lowercase(cp: u32) -> CaseMapping;
    pub fn to_uppercase(cp: u32) -> CaseMapping;
    pub fn to_titlecase(cp: u32) -> CaseMapping;
    pub fn casefold(cp: u32) -> CaseMapping;
}

pub mod normalize {
    pub fn nfc<I: IntoIterator<Item = u32>>(input: I) -> Normalized;
    pub fn nfd<I: IntoIterator<Item = u32>>(input: I) -> Normalized;
    pub fn nfkc<I: IntoIterator<Item = u32>>(input: I) -> Normalized;
    pub fn nfkd<I: IntoIterator<Item = u32>>(input: I) -> Normalized;
    pub fn is_normalized_nfc<I: IntoIterator<Item = u32>>(input: I) -> bool;
}

pub mod data {
    pub fn category(cp: u32) -> GeneralCategory;
    pub fn bidirectional(cp: u32) -> BidiClass;
    pub fn combining(cp: u32) -> u8;
    pub fn east_asian_width(cp: u32) -> EastAsianWidth;
    pub fn mirrored(cp: u32) -> bool;
    // Option payload types below are placeholders and can change.
    pub fn decomposition(cp: u32) -> Option<Decomposition>;
    pub fn decimal(cp: u32) -> Option<u8>;
    pub fn digit(cp: u32) -> Option<u8>;
    pub fn numeric(cp: u32) -> Option<Numeric>;
    pub fn name(cp: u32) -> Option<&'static str>;
    pub fn lookup(name: &str) -> Option<u32>;
}

The exact types can change, but the important part is the boundary: low-level Unicode
semantics here, string algorithms elsewhere.
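
One possible shape for the `CaseMapping` return type used in the `case` module, given that full mappings can expand one code point into several (for example, Python upper-cases "ß" to "SS"). This is a sketch, not a proposed final design: the fixed three-slot buffer, the constructors, and `as_slice` are assumptions, and the match arms are a tiny hand-written stand-in for generated tables:

```rust
/// Result of mapping one code point; holds up to three code points inline.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CaseMapping {
    buf: [u32; 3],
    len: u8,
}

impl CaseMapping {
    fn one(cp: u32) -> Self {
        Self { buf: [cp, 0, 0], len: 1 }
    }

    fn two(a: u32, b: u32) -> Self {
        Self { buf: [a, b, 0], len: 2 }
    }

    /// The mapped code points, in order.
    pub fn as_slice(&self) -> &[u32] {
        &self.buf[..self.len as usize]
    }
}

/// Stand-in uppercase mapping: ASCII letters plus U+00DF only.
/// Unmapped input, including surrogates, maps to itself.
pub fn to_uppercase(cp: u32) -> CaseMapping {
    match cp {
        0x61..=0x7A => CaseMapping::one(cp - 0x20),
        0xDF => CaseMapping::two(0x53, 0x53), // ß -> SS
        _ => CaseMapping::one(cp),
    }
}
```

Returning a small value type instead of a `String` keeps the API allocation-free and lets callers handle non-scalar code points unchanged.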

Compatibility Model

The crate should define compatibility against a specific CPython release line. For example,
targeting CPython 3.14 would mean:

  • Unicode semantics matching CPython 3.14
  • generated tables pinned to the Unicode version that release uses
  • an explicit regeneration workflow when upgrading to a newer CPython

That matters because “Unicode-correct” is not enough here. The target is
“CPython-compatible.”
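
A concrete case of the gap: CPython's str.isspace() returns True for U+001C through U+001F, while the Unicode White_Space property (which Rust's `char::is_whitespace` follows) excludes them. The sketch below contrasts the two. The code point list in `is_space` is transcribed by hand as an assumption about CPython's behavior and would be generated from pinned data in the real crate; `is_unicode_white_space` is a hypothetical name:

```rust
/// CPython-style whitespace test on a raw code point.
/// Hand-written list (assumption); the real crate would use generated tables.
pub fn is_space(cp: u32) -> bool {
    matches!(
        cp,
        0x09..=0x0D
            | 0x1C..=0x1F
            | 0x20
            | 0x85
            | 0xA0
            | 0x1680
            | 0x2000..=0x200A
            | 0x2028
            | 0x2029
            | 0x202F
            | 0x205F
            | 0x3000
    )
}

/// "Unicode-correct" whitespace: the White_Space property via the standard library.
/// Non-scalar code points are treated as not whitespace.
pub fn is_unicode_white_space(cp: u32) -> bool {
    char::from_u32(cp).map_or(false, char::is_whitespace)
}
```

Both functions agree on U+0020 but disagree on U+001C, which is exactly the kind of difference a CPython-targeted table has to preserve.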

Why This Is Better Than Ad Hoc Fixes

  • str, re, parser, and unicodedata stop drifting apart.
  • There is one authoritative source for Unicode behavior.
  • Compatibility work becomes table-driven instead of patch-driven.
  • Future CPython upgrades become more mechanical and auditable.
