Unified: Add schema checking and corpus-style tests #21848

Open
asgerf wants to merge 21 commits into github:main from asgerf:asgerf/swift-yeast

@asgerf asgerf commented May 13, 2026

Corpus tests with schema checking

This PR adds tests modelled after tree-sitter's "corpus tests". A test file contains a set of triples (source code, raw tree, final tree), where the trees are indentation-printed ASTs.

The "final tree" is also rendered with schema violations printed inline. A schema violation would otherwise surface as a QL extractor crash; checking for it in the corpus tests gives a tighter feedback loop (the tests run faster and report more informative errors).

For example, one test looks like:

===
Additive expression is desugared
===

1 + 2

---

source_file
  additive_expression
    lhs: integer_literal "1"
    op: +
    rhs: integer_literal "2"

---

top_level
  body:
    binary_expr
      operator: operator "+"
      left: int_literal "1"
      right: int_literal "2"
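
As a rough illustration of the file layout above, the sketch below splits a corpus file into (source, raw tree, final tree) triples. It is not the harness from this PR: `CorpusCase`, `parse_corpus` and `is_rule` are hypothetical names, and it assumes that a header is a run of `=` characters, a section separator is a run of `-` characters, and neither appears as a full line inside the source code.

```rust
/// One case from a corpus file: a name plus the three sections.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
pub struct CorpusCase {
    pub name: String,
    pub source: String,
    pub raw_tree: String,
    pub final_tree: String,
}

/// True for delimiter lines such as `===` or `---`.
fn is_rule(line: &str, ch: char) -> bool {
    line.len() >= 3 && line.chars().all(|c| c == ch)
}

pub fn parse_corpus(text: &str) -> Result<Vec<CorpusCase>, String> {
    let lines: Vec<&str> = text.lines().collect();
    let mut cases = Vec::new();
    let mut i = 0;
    while i < lines.len() {
        if !is_rule(lines[i], '=') {
            i += 1; // skip anything before the next header
            continue;
        }
        // Header: `===`, the test name, `===`.
        let mut j = i + 1;
        while j < lines.len() && !is_rule(lines[j], '=') {
            j += 1;
        }
        if j == lines.len() {
            return Err(format!("unterminated header at line {}", i + 1));
        }
        let name = lines[i + 1..j].join("\n").trim().to_string();
        // The body runs until the next header or the end of the file.
        let mut k = j + 1;
        while k < lines.len() && !is_rule(lines[k], '=') {
            k += 1;
        }
        // Split the body on `---` lines; strip only surrounding newlines so
        // that the indentation of the printed trees is preserved.
        let sections: Vec<String> = lines[j + 1..k]
            .split(|l| is_rule(l, '-'))
            .map(|s| s.join("\n").trim_matches('\n').to_string())
            .collect();
        match sections.as_slice() {
            [source, raw_tree, final_tree] => cases.push(CorpusCase {
                name,
                source: source.clone(),
                raw_tree: raw_tree.clone(),
                final_tree: final_tree.clone(),
            }),
            other => {
                return Err(format!(
                    "case {:?}: expected 3 sections, found {}",
                    name,
                    other.len()
                ))
            }
        }
        i = k;
    }
    Ok(cases)
}
```

A test runner built on this would translate `source`, print both trees, and compare them with `raw_tree` and `final_tree`.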

However, if I were to introduce a type error in the mapping:

        rule!(
            (integer_literal) @lit
            =>
            (block_stmt) // Deliberate error
        ),

the last section of the test output would look like:

top_level
  body:
    binary_expr
      operator: operator "+"
      left: block_stmt "1" <-- ERROR: The field binary_expr.left should contain expr, but got block_stmt
      right: block_stmt "2" <-- ERROR: The field binary_expr.right should contain expr, but got block_stmt

Since YEAST is not strongly typed, we currently have to rely on testing to catch these errors.
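
To make the check concrete, here is a minimal sketch of validating one field against a schema with supertypes. The `Schema` type and its methods are hypothetical and much simpler than the real YEAST schema; only the wording of the error message is taken from the output above. Nested supertypes and unknown fields are deliberately not handled.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified schema (not the actual YEAST schema types).
#[derive(Default)]
pub struct Schema {
    /// Supertype name -> node kinds that belong to it.
    supertypes: HashMap<String, HashSet<String>>,
    /// (parent kind, field name) -> type names allowed in that field.
    fields: HashMap<(String, String), Vec<String>>,
}

impl Schema {
    pub fn add_supertype(&mut self, name: &str, members: &[&str]) {
        let members = members.iter().map(|m| m.to_string()).collect();
        self.supertypes.insert(name.to_string(), members);
    }

    pub fn add_field(&mut self, parent: &str, field: &str, allowed: &[&str]) {
        let allowed = allowed.iter().map(|a| a.to_string()).collect();
        self.fields
            .insert((parent.to_string(), field.to_string()), allowed);
    }

    /// True if `kind` is `ty` itself or a direct member of supertype `ty`.
    fn conforms(&self, kind: &str, ty: &str) -> bool {
        kind == ty || self.supertypes.get(ty).map_or(false, |m| m.contains(kind))
    }

    /// Returns an error message if `child_kind` is not allowed in
    /// `parent.field`. Unknown fields are ignored in this sketch.
    pub fn check_field(&self, parent: &str, field: &str, child_kind: &str) -> Option<String> {
        let allowed = self.fields.get(&(parent.to_string(), field.to_string()))?;
        if allowed.iter().any(|ty| self.conforms(child_kind, ty)) {
            None
        } else {
            Some(format!(
                "The field {}.{} should contain {}, but got {}",
                parent,
                field,
                allowed.join(" | "),
                child_kind
            ))
        }
    }
}
```

A dump routine could call `check_field` for every child it prints and append the returned message after a `<-- ERROR:` marker on the same line.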

Why is this PR so big?

This PR ends up doing a lot of stuff, which is unfortunately hard to disentangle.

  • The main objective is to get corpus tests with schema checking in there ASAP.
  • Unfortunately this can only be exercised by changing the output schema, which causes the QL extractor to panic until the mapping successfully replaces the tree with an output that fits the target schema.
  • In the process of building up such a mapping, a bunch of problems with YEAST were discovered and fixed along the way.
  • These fixes are not that easy to understand without corresponding corpus tests to visualise their effect. Cherry-picking and merging in isolation was technically possible but not necessarily easier to review.

asgerf and others added 20 commits May 13, 2026 10:35
This adds tests consisting of source code and a printout of its rewritten AST.
One-shot desugaring rules now skip unnamed nodes (punctuation, keywords,
etc.) since rules are intended to target named nodes only.

Also prevent infinite recursion when a capture refers to the root node of
the matched tree (e.g. an @_ capture on the pattern root).

Additionally fix the swift.rs add_phase call to match the updated 3-arg
signature introduced by the one-shot phase kind commit.

Co-authored-by: Copilot <[email protected]>
…s framework

Add ast_types.yml defining the unified output AST schema with supertypes
(expr, stmt, condition, pattern) and named nodes (top_level, binary_expr,
name_expr, etc.).

Rewrite swift translation rules to map from tree-sitter Swift grammar to
the unified AST, using one-shot phase rules.

Update the generator to use the output AST schema for dbscheme/QL
generation, and normalize the extraction table prefix to 'unified'.

Improve the corpus test framework to include raw tree-sitter parse output,
type-error checking against the output schema, and better failure
reporting.

Regenerate Ast.qll, unified.dbscheme, and update BasicTest accordingly.

Co-authored-by: Copilot <[email protected]>
Add corpus test cases for Swift covering closures, collections, control
flow, functions, literals, loops, operators, optionals/errors, types,
and variables. Update existing desugar.txt with raw parse sections.

Note: operator nodes currently render their node ID instead of the actual
operator text (e.g. operator "3" instead of operator "+"). This will be
fixed in the next commit.

Co-authored-by: Copilot <[email protected]>
Introduce NodeRef as a typed wrapper around node arena IDs. Captures in
desugaring rules are now bound as NodeRef instead of raw usize, which
prevents accidental misuse and enables source-text-aware rendering.

Add the YeastDisplay trait as an alternative to Display: its
yeast_to_string method receives the Ast, allowing NodeRef to resolve to
the captured node's source text instead of printing a numeric ID.

Store the original source bytes in the Ast so that NodeContent::Range
values (from synthesized literal nodes) can be resolved back to text.

Update yeast-macros to emit NodeRef-typed capture bindings and use
Into::<usize>::into where raw IDs are needed. The #{expr} template
syntax now uses YeastDisplay instead of Display.

The effect is visible in the corpus tests: operator nodes now correctly
render as e.g. operator "+" instead of operator "3".

Co-authored-by: Copilot <[email protected]>
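
The idea behind `NodeRef` and `YeastDisplay` in the commit above can be sketched as follows. Only the names `NodeRef`, `YeastDisplay`, `yeast_to_string` and `Ast` come from the commit message; the struct layout, the constructors and the exact method signature are assumptions made for illustration, and the real `Ast` is considerably richer.

```rust
use std::ops::Range;

/// Simplified stand-in for yeast's `Ast` (hypothetical layout): the original
/// source bytes plus one byte range per node, indexed by arena ID.
pub struct Ast {
    source: Vec<u8>,
    ranges: Vec<Range<usize>>,
}

impl Ast {
    pub fn new(source: &str, ranges: Vec<Range<usize>>) -> Self {
        Ast { source: source.as_bytes().to_vec(), ranges }
    }
}

/// Typed wrapper around an arena ID, so a capture cannot be mistaken for an
/// ordinary integer.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeRef(usize);

impl NodeRef {
    pub fn new(id: usize) -> Self {
        NodeRef(id)
    }
}

/// Explicit conversion for the places where the raw ID is needed.
impl From<NodeRef> for usize {
    fn from(r: NodeRef) -> usize {
        r.0
    }
}

/// Like `Display`, but with access to the `Ast` (assumed signature).
pub trait YeastDisplay {
    fn yeast_to_string(&self, ast: &Ast) -> String;
}

impl YeastDisplay for NodeRef {
    /// Resolves to the node's source text rather than its numeric ID.
    /// Panics if the ID is out of range; a real implementation would not.
    fn yeast_to_string(&self, ast: &Ast) -> String {
        let range = ast.ranges[self.0].clone();
        String::from_utf8_lossy(&ast.source[range]).into_owned()
    }
}

impl YeastDisplay for str {
    /// Plain strings ignore the `Ast` and render as themselves.
    fn yeast_to_string(&self, _ast: &Ast) -> String {
        self.to_string()
    }
}
```

With source `1 + 2` and node 1 covering the byte range `2..3`, `NodeRef::new(1).yeast_to_string(&ast)` yields `+`, whereas formatting the raw ID would yield `1`. That is the kind of difference the corpus output shows for operator nodes.
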
…n in field patterns

Two changes to parse_query_fields:

- Allow `field: (kind)* @cap` (repetition + optional capture) in field
  position, mirroring how it works for bare children.
- When the same field name is declared multiple times in a query (e.g.
  `condition: (foo) condition: (bar)`), merge them into a single
  ordered list of children rather than emitting duplicate field
  entries (which at runtime restart the iterator for the field and
  cause the second declaration to re-match from the first child).
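
The merging behaviour for repeated field names can be illustrated with a small standalone function. This is a sketch of the idea, not the code in `parse_query_fields`: `merge_repeated_fields` is a hypothetical helper, and patterns are represented by a generic type `P` rather than the macro's real pattern type.

```rust
/// Groups `(field, pattern)` declarations by field name. Fields keep the
/// order in which they were first seen, and the patterns of one field keep
/// their declaration order. (Hypothetical helper, for illustration only.)
pub fn merge_repeated_fields<P>(decls: Vec<(String, P)>) -> Vec<(String, Vec<P>)> {
    let mut merged: Vec<(String, Vec<P>)> = Vec::new();
    for (field, pattern) in decls {
        match merged.iter_mut().find(|(name, _)| *name == field) {
            // Field already declared: append to its ordered child list.
            Some((_, patterns)) => patterns.push(pattern),
            // First declaration of this field: start a new entry.
            None => merged.push((field, vec![pattern])),
        }
    }
    merged
}
```

For `condition: (foo) condition: (bar)` this produces a single `condition` entry whose children are `(foo)` followed by `(bar)`, so the matcher walks that field's children once instead of restarting for the second declaration.
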
…d mapping

ast_types.yml additions:
- tuple_pattern { element*: pattern } in the pattern supertype.
- sequence_condition { stmt*: stmt, condition: condition } in the
  condition supertype.

swift.rs:
- Map Swift tuple destructuring (e.g. `let (a, b) = pair`) to the new
  tuple_pattern instead of synthesizing an apply_pattern.
- if-let / guard-let: explicitly match the value_binding_pattern
  (the `let` keyword) and bind the source expression as the next
  condition child, so `let` no longer leaks into the output.
The branch was rebased on the grammar changes, but rewriting the history was too difficult, so I'm just updating the test output here.
The output is not so interesting as the mapping removes most nodes from the current test file.

I added a name_expr.swift test so at least one NameExpr makes it through.
@asgerf asgerf added the no-change-note-required This PR does not need a change note label May 13, 2026

asgerf commented May 13, 2026

Rerun has been triggered: 2 restarted 🚀

@asgerf asgerf marked this pull request as ready for review May 19, 2026 04:05
@asgerf asgerf requested a review from a team as a code owner May 19, 2026 04:05
Copilot AI review requested due to automatic review settings May 19, 2026 04:05
@asgerf asgerf requested a review from a team as a code owner May 19, 2026 04:05

Copilot AI left a comment

Pull request overview

This PR introduces tree-sitter “corpus-style” tests for the Unified extractor, augmented with inline schema/type checking of the mapped (output) AST to catch YEAST mapping/schema mismatches early and with clearer diagnostics.

Changes:

  • Added corpus test harness (unified/extractor/tests/corpus_tests.rs) that parses tests/corpus/**.txt cases, dumps both raw tree-sitter output and translated output, and (optionally) regenerates expected outputs via UNIFIED_UPDATE_CORPUS.
  • Introduced a shared Unified output schema (unified/extractor/ast_types.yml) and updated generator/extractor wiring to use the unified schema namespace.
  • Extended YEAST (schema + dump + macros + phase kinds) to support OneShot translation phases and inline schema error annotations in dumps.
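
The compare-or-regenerate behaviour mentioned in the first bullet can be sketched as below. Only the variable name `UNIFIED_UPDATE_CORPUS` comes from the PR; `Outcome`, `update_requested` and `reconcile` are hypothetical, and treating any value of the variable as "update" is an assumption about how the harness reads it.

```rust
use std::env;

/// Result of comparing one expected section with the freshly printed one.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// Expected and actual output agree.
    Pass,
    /// Update mode: the test file should be rewritten with this text.
    Rewrite(String),
    /// Normal mode: the test fails with this message.
    Fail(String),
}

/// Assumption: the variable only needs to be set, whatever its value.
pub fn update_requested() -> bool {
    env::var_os("UNIFIED_UPDATE_CORPUS").is_some()
}

pub fn reconcile(expected: &str, actual: &str, update: bool) -> Outcome {
    if expected == actual {
        Outcome::Pass
    } else if update {
        Outcome::Rewrite(actual.to_string())
    } else {
        Outcome::Fail(format!("expected:\n{}\nactual:\n{}", expected, actual))
    }
}
```

Passing the flag in as a parameter, rather than reading the environment inside `reconcile`, keeps the comparison logic testable without touching process state.
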

| File | Description |
| --- | --- |
| unified/scripts/update-corpus.sh | Helper script to regenerate corpus expected outputs via UNIFIED_UPDATE_CORPUS=1 cargo test. |
| unified/ql/test/library-tests/BasicTest/test.ql | Adjusted basic Unified AST query predicates/imports to the new Unified AST surface. |
| unified/ql/test/library-tests/BasicTest/test.expected | Updated expected query results to match the new Unified AST model and unsupported-node fallback behavior. |
| unified/ql/test/library-tests/BasicTest/name_expr.swift | Added a minimal Swift input to exercise name_expr extraction. |
| unified/ql/lib/unified.dbscheme | Regenerated dbscheme to reflect the new Unified output AST schema and relation namespace. |
| unified/extractor/tests/corpus/swift/variables.txt | Added Swift corpus cases for variable declarations/bindings and assignments. |
| unified/extractor/tests/corpus/swift/types.txt | Added Swift corpus cases for type/decl constructs (classes/structs/enums/etc.). |
| unified/extractor/tests/corpus/swift/optionals-and-errors.txt | Added Swift corpus cases for optionals, try/throws, and related constructs. |
| unified/extractor/tests/corpus/swift/operators.txt | Added Swift corpus cases for operator parsing and precedence. |
| unified/extractor/tests/corpus/swift/loops.txt | Added Swift corpus cases for loops and related control constructs. |
| unified/extractor/tests/corpus/swift/literals.txt | Added Swift corpus cases for literal forms (int/string/etc.). |
| unified/extractor/tests/corpus/swift/functions.txt | Added Swift corpus cases for function decls/calls and argument forms. |
| unified/extractor/tests/corpus/swift/desugar.txt | Added Swift corpus cases validating key desugaring/translation outcomes. |
| unified/extractor/tests/corpus/swift/control-flow.txt | Added Swift corpus cases for if/else/guard/switch control flow. |
| unified/extractor/tests/corpus/swift/collections.txt | Added Swift corpus cases for arrays/dicts/tuples/subscript parsing shapes. |
| unified/extractor/tests/corpus/swift/closures.txt | Added Swift corpus cases for closures/lambdas including trailing closures/captures. |
| unified/extractor/tests/corpus_tests.rs | New Rust test harness that runs corpus cases, compares dumps, and can update expected output. |
| unified/extractor/src/main.rs | Registered languages module for shared language-spec plumbing. |
| unified/extractor/src/languages/swift/swift.rs | Implemented Swift→Unified translation rules using a OneShot YEAST phase and output schema. |
| unified/extractor/src/languages/mod.rs | Centralized language specs and shared OUTPUT_AST_SCHEMA include. |
| unified/extractor/src/generator.rs | Updated generator to build dbscheme/QL library from the unified output schema via output_node_types_yaml. |
| unified/extractor/src/extractor.rs | Normalized per-language specs to emit unified_* TRAP relations matching the unified dbscheme. |
| unified/extractor/ast_types.yml | Added the Unified output AST schema (supertypes + named/unnamed node definitions). |
| unified/AGENTS.md | Updated contributor instructions for extractor/corpus testing and regeneration workflow. |
| shared/yeast/tests/test.rs | Added OneShot + typed dump coverage; updated phase construction to include PhaseKind. |
| shared/yeast/src/visitor.rs | Ensured Ast contains a source buffer field (initialized when building). |
| shared/yeast/src/schema.rs | Extended schema with supertype membership + per-field allowed-type info for type checking. |
| shared/yeast/src/node_types_yaml.rs | Populated schema supertype/field-type metadata from YAML; refactored schema building helpers. |
| shared/yeast/src/lib.rs | Added NodeRef + YeastDisplay, source-text resolution, reachable-node traversal, and OneShot phase execution. |
| shared/yeast/src/dump.rs | Added dump mode that annotates inline schema/type errors for faster debugging. |
| shared/yeast/src/captures.rs | Added try_map_all_captures to support OneShot recursive capture rewriting. |
| shared/yeast/doc/yeast.md | Documented PhaseKind semantics and updated examples for new phase API. |
| shared/yeast-macros/src/parse.rs | Enhanced rule/query parsing (repeated fields, capture typing as NodeRef, #{} formatting via YeastDisplay). |
| shared/tree-sitter-extractor/src/generator/mod.rs | Avoided generating ReservedWord class when the schema has no unnamed reserved-word token type. |
| shared/tree-sitter-extractor/src/extractor/mod.rs | Plumbed source bytes into Runner::run_from_tree for correct source-text rendering. |

Copilot's findings

Comments suppressed due to low confidence (2)

unified/extractor/src/languages/swift/swift.rs:229

  • The guard_statement rule pattern expects bound_identifier and value_binding_pattern directly under guard_statement, but the corpus raw parse shows these are nested under condition: if_condition -> if_let_binding (see tests/corpus/swift/control-flow.txt). As written, this rule won't match and guard let remains unsupported_node. Update the pattern to match the actual tree-sitter shape (or remove the rule until supported).
        // ---- Guard statement ----
        // `guard let x = e else { ... }` — currently only handles the
        // let-binding form. The Swift parser models the `let` keyword as a
        // `value_binding_pattern` child of `condition`, followed by an
        // unnamed `=` and the source expression.
        rule!(
            (guard_statement
                bound_identifier: (simple_identifier) @id
                condition: (value_binding_pattern)
                condition: (_) @value
                (else)
                (statements) @else_branch)
            =>

unified/extractor/src/languages/swift/swift.rs:249

  • The if-let translation rules assume bound_identifier is a direct field of if_statement, but the corpus raw parse shows it is nested under condition: if_condition -> if_let_binding (see tests/corpus/swift/control-flow.txt). These rules therefore never fire, and the code falls back to treating the whole let value = optional as an expr_condition unsupported_node. Adjust the match to the real parse shape so let_pattern_condition is actually produced.
        // ---- If statement ----
        // if-let binding (with optional else branch). The Swift parser puts
        // the bound name in `bound_identifier`, the `let` keyword as a
        // `value_binding_pattern` child of `condition`, and the source
        // expression as a separate child of `condition`.
        rule!(
            (if_statement
                bound_identifier: (simple_identifier) @id
                condition: (value_binding_pattern)
                condition: (_) @value
                (statements) @then
                (else)
                (_) @else_branch)
            =>
  • Files reviewed: 35/36 changed files
  • Comments generated: 2

Comment thread unified/AGENTS.md
- To run extractor tests, run `cargo test` in the `extractor` directory.

- To run all tests, run `codeql test run --search-path extractor-pack ql/test`
- Do not edit the printed ASTs in `extractor/test/corpus` directly. To regenerate the ASTs, run `scripts/update-corpus.sh`.
Comment on lines +162 to +173
        // Map a `lambda_literal` whose body is a single statement to
        // `lambda_expr`. Multi-statement bodies fall through to
        // `unsupported_node` because `lambda_expr.body` is single-valued
        // in the current `ast_types.yml`. Parameters from explicit-typed
        // closures (`{ (x: Int) -> Int in ... }`) are not yet captured.
        rule!(
            (lambda_literal
                (statements (_) @body))
            =>
            (lambda_expr
                body: {body})
        ),

Labels

documentation no-change-note-required This PR does not need a change note
