Add unicode & bytes c-api support #7904

Merged
youknowone merged 2 commits into RustPython:main from bschoenmaeckers:c-api-strings
May 19, 2026

@bschoenmaeckers
Contributor

@bschoenmaeckers bschoenmaeckers commented May 17, 2026

Summary by CodeRabbit

  • New Features
    • Added bytes C-API support: create bytes, get size, and access raw byte data from extensions.
    • Added Unicode C-API support: create/inspect UTF‑8 strings, encode, compare, and intern strings.
    • Expanded public C-API surface to expose the new bytes and unicode capabilities.

@coderabbitai
Contributor

coderabbitai Bot commented May 17, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 9b69a26b-4ae8-47b4-b059-6b9ad16a5fad

📥 Commits

Reviewing files that changed from the base of the PR and between 4447635 and 78ba5e0.

📒 Files selected for processing (2)
  • crates/capi/src/bytesobject.rs
  • crates/capi/src/unicodeobject.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • crates/capi/src/bytesobject.rs
  • crates/capi/src/unicodeobject.rs

📝 Walkthrough

Adds C-API bindings for Python bytes and unicode objects: type-check helpers, bytes constructors/accessors, unicode constructors/accessors/encoding/interning/comparison, public module exports, and crate-visible macro re-export.

Changes

C-API Bytes and Unicode Object Bindings

| Layer | File(s) | Summary |
|---|---|---|
| C-API Type Check Macro Re-export and Module Declaration | `crates/capi/src/object.rs`, `crates/capi/src/lib.rs` | `define_py_check` is re-exported with crate visibility, and `bytesobject` and `unicodeobject` are declared public. |
| Bytes Object C-API Functions | `crates/capi/src/bytesobject.rs` | Adds `PyBytes_Check`/`PyBytes_CheckExact`, `PyBytes_FromStringAndSize` (handles null pointer/uninitialized buffer and negative lengths), `PyBytes_Size`, `PyBytes_AsString`, and disabled pyo3 tests. |
| Unicode Object C-API Functions | `crates/capi/src/unicodeobject.rs` | Adds `PyUnicode_Check`/`PyUnicode_CheckExact`, `PyUnicode_FromStringAndSize`, `PyUnicode_AsUTF8AndSize`, `PyUnicode_AsEncodedString`, `PyUnicode_InternInPlace`, `PyUnicode_EqualToUTF8AndSize`, and disabled pyo3 tests. |

Sequence Diagram(s)

sequenceDiagram
  participant CCaller as C caller
  participant PyBytes_FromStringAndSize
  participant VM as RustPython VM
  participant Ctx as VM Context
  CCaller->>PyBytes_FromStringAndSize: (bytes: *mut c_char, len: isize)
  PyBytes_FromStringAndSize->>VM: with_vm_context
  alt bytes is NULL
    VM->>Ctx: allocate uninitialized Vec<u8>
  else bytes not NULL
    VM->>Ctx: copy from pointer slice into Vec<u8>
  end
  Ctx-->>VM: new PyBytes PyObject*
  VM-->>PyBytes_FromStringAndSize: PyObject*
  PyBytes_FromStringAndSize-->>CCaller: PyObject*
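
The two branches in the diagram above can be sketched in plain Rust. This is a minimal illustration, not the code from this PR: `buffer_from_c` is a hypothetical helper, it assumes the length has already been validated as non-negative, and it zero-fills the buffer in the NULL case instead of leaving it uninitialized so that the sketch stays free of undefined behaviour.

```rust
use std::os::raw::c_char;
use std::slice;

/// Hypothetical helper (not from the PR): builds the byte buffer that would
/// back a new bytes object.
///
/// # Safety
/// If `bytes` is non-null, it must point to at least `len` readable bytes.
pub unsafe fn buffer_from_c(bytes: *const c_char, len: usize) -> Vec<u8> {
    if bytes.is_null() {
        // NULL input: the caller wants `len` bytes of storage to fill in later.
        // Assumption: zero-filled here rather than uninitialized.
        vec![0u8; len]
    } else {
        // Non-NULL input: copy `len` bytes out of the C buffer.
        let src = unsafe { slice::from_raw_parts(bytes.cast::<u8>(), len) };
        src.to_vec()
    }
}
```

Copying into an owned `Vec<u8>` means the resulting object does not keep a borrow of the caller's memory, so the C side may free or reuse its buffer after the call returns.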
sequenceDiagram
  participant CCaller as C caller
  participant PyUnicode_InternInPlace
  participant VM as RustPython VM
  participant Ctx as VM Context
  CCaller->>PyUnicode_InternInPlace: string: *mut *mut PyObject
  PyUnicode_InternInPlace->>VM: downcast *string to PyStr
  VM->>Ctx: intern string
  Ctx-->>VM: interned PyObject*
  VM-->>PyUnicode_InternInPlace: interned PyObject*
  PyUnicode_InternInPlace-->>CCaller: write interned pointer back to *string
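
The write-back step in the diagram above is the part that makes interning "in place": the caller's slot ends up pointing at the canonical copy. A minimal sketch of that idea follows, using `Rc<str>` and a `HashSet` as stand-ins. `Interner` and `intern_in_place` are hypothetical names and do not reflect how the RustPython VM context stores interned strings.

```rust
use std::collections::HashSet;
use std::rc::Rc;

/// Hypothetical stand-in for an intern table (not RustPython's type).
#[derive(Default)]
pub struct Interner {
    table: HashSet<Rc<str>>,
}

impl Interner {
    /// Replaces `*slot` with the canonical copy of an equal string.
    /// If no equal string is interned yet, `*slot` itself becomes canonical.
    pub fn intern_in_place(&mut self, slot: &mut Rc<str>) {
        if let Some(existing) = self.table.get(&**slot) {
            // An equal string is already interned: repoint the caller's slot.
            *slot = Rc::clone(existing);
        } else {
            // First occurrence: register the caller's string as canonical.
            self.table.insert(Rc::clone(slot));
        }
    }
}
```

After two equal strings pass through `intern_in_place`, both slots share one allocation, so equality can be decided by pointer comparison.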

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Possibly related PRs

  • RustPython/RustPython#7871: Introduces define_py_check macro-generated C-API type-check functions; related to this PR's usage and re-export of that macro.

Suggested reviewers

  • youknowone
  • ShaharNaveh

Poem

🐰 I hopped through bytes and strings today,
From raw C pointers to UTF-8 play.
I copied, interned, and checked with care,
Rust bridges C so Python can share.
Hooray for bindings — nibble, hop, hooray! 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately captures the main objective of the changeset: adding C-API support for unicode and bytes types through new FFI functions in bytesobject.rs and unicodeobject.rs. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/capi/src/bytesobject.rs`:
- Around line 10-26: Validate that the incoming len is non-negative at the start
of PyBytes_FromStringAndSize and bail out immediately if it's negative: check if
len < 0, set an appropriate Python exception (e.g., raise ValueError or call the
existing C-API error setter) on the VM, and return NULL instead of converting
len to usize; only after this check convert len to usize and proceed with the
current branches that allocate or slice using that usize value (refer to
function PyBytes_FromStringAndSize and the branches that call
Vec::with_capacity/set_len and slice::from_raw_parts).
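
To illustrate why the check has to come before the conversion, here is a minimal sketch. `checked_len` is a hypothetical helper and the `String` error is a placeholder for whatever exception the C-API layer would set. For reference, CPython's own `PyBytes_FromStringAndSize` raises `SystemError` on a negative size.

```rust
/// Hypothetical helper (not from the PR): converts a signed C length to
/// `usize`, rejecting negative values instead of letting them wrap.
pub fn checked_len(len: isize) -> Result<usize, String> {
    if len < 0 {
        // A bare `len as usize` would wrap here: -1 becomes usize::MAX,
        // which would then be used as an allocation or slice length.
        return Err(format!("negative size passed: {len}"));
    }
    Ok(len as usize)
}
```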

In `@crates/capi/src/unicodeobject.rs`:
- Around line 108-126: The function PyUnicode_EqualToUTF8AndSize uses
slice::from_raw_parts with size cast unsafely, which overflows when size is
negative; add a guard at the start of PyUnicode_EqualToUTF8AndSize that checks
if size < 0 and immediately returns false (0) via the with_vm/Ok(false) path (or
direct c_int 0) to avoid creating an oversized slice, then proceed with the
existing logic (locate the unicode downcast to PyStr and the slice/from_utf8
steps) only when size is non-negative.
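
A sketch of the guarded comparison, under stated assumptions: `equal_to_utf8_and_size` is a hypothetical free function that takes the already-downcast string as `&str`, and it also treats a NULL pointer as "not equal", which the review comment does not ask for.

```rust
use std::os::raw::{c_char, c_int};
use std::slice;

/// Hypothetical sketch (not the PR's code): returns 1 if `s` equals the
/// UTF-8 buffer `data[..size]`, and 0 otherwise.
///
/// # Safety
/// If `data` is non-null and `size` is non-negative, `data` must point to
/// at least `size` readable bytes.
pub unsafe fn equal_to_utf8_and_size(s: &str, data: *const c_char, size: isize) -> c_int {
    // Guard first: a negative size must never reach `from_raw_parts`.
    // Assumption: NULL is also rejected here.
    if size < 0 || data.is_null() {
        return 0;
    }
    let raw = unsafe { slice::from_raw_parts(data.cast::<u8>(), size as usize) };
    match std::str::from_utf8(raw) {
        Ok(other) => (s == other) as c_int,
        // Invalid UTF-8 cannot equal a valid string.
        Err(_) => 0,
    }
}
```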
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 9999d9a7-88c7-4f09-a7bf-21b46eaf41bf

📥 Commits

Reviewing files that changed from the base of the PR and between a1a87dc and 4447635.

📒 Files selected for processing (4)
  • crates/capi/src/bytesobject.rs
  • crates/capi/src/lib.rs
  • crates/capi/src/object.rs
  • crates/capi/src/unicodeobject.rs

@youknowone
Member

@bschoenmaeckers it may be worth checking the CodeRabbit comments

@bschoenmaeckers
Contributor Author

> @bschoenmaeckers it may be worth checking the CodeRabbit comments

Will do 👍

@bschoenmaeckers
Contributor Author

Addressed review comments

@youknowone youknowone merged commit 20cb884 into RustPython:main May 19, 2026
26 checks passed
@bschoenmaeckers bschoenmaeckers deleted the c-api-strings branch May 19, 2026 07:45