Unified: Add schema checking and corpus-style tests #21848
Open
asgerf wants to merge 21 commits into
This adds tests consisting of source code and a printout of its rewritten AST.
One-shot desugaring rules now skip unnamed nodes (punctuation, keywords, etc.) since rules are intended to target named nodes only. Also prevent infinite recursion when a capture refers to the root node of the matched tree (e.g. an @_ capture on the pattern root). Additionally fix the swift.rs add_phase call to match the updated 3-arg signature introduced by the one-shot phase kind commit. Co-authored-by: Copilot <[email protected]>
…s framework Add ast_types.yml defining the unified output AST schema with supertypes (expr, stmt, condition, pattern) and named nodes (top_level, binary_expr, name_expr, etc.). Rewrite swift translation rules to map from tree-sitter Swift grammar to the unified AST, using one-shot phase rules. Update the generator to use the output AST schema for dbscheme/QL generation, and normalize the extraction table prefix to 'unified'. Improve the corpus test framework to include raw tree-sitter parse output, type-error checking against the output schema, and better failure reporting. Regenerate Ast.qll, unified.dbscheme, and update BasicTest accordingly. Co-authored-by: Copilot <[email protected]>
Add corpus test cases for Swift covering closures, collections, control flow, functions, literals, loops, operators, optionals/errors, types, and variables. Update existing desugar.txt with raw parse sections. Note: operator nodes currently render their node ID instead of the actual operator text (e.g. operator "3" instead of operator "+"). This will be fixed in the next commit. Co-authored-by: Copilot <[email protected]>
Introduce NodeRef as a typed wrapper around node arena IDs. Captures in
desugaring rules are now bound as NodeRef instead of raw usize, which
prevents accidental misuse and enables source-text-aware rendering.
Add the YeastDisplay trait as an alternative to Display: its
yeast_to_string method receives the Ast, allowing NodeRef to resolve to
the captured node's source text instead of printing a numeric ID.
Store the original source bytes in the Ast so that NodeContent::Range
values (from synthesized literal nodes) can be resolved back to text.
Update yeast-macros to emit NodeRef-typed capture bindings and use
Into::<usize>::into where raw IDs are needed. The #{expr} template
syntax now uses YeastDisplay instead of Display.
The effect is visible in the corpus tests: operator nodes now correctly
render as e.g. operator "+" instead of operator "3".
Co-authored-by: Copilot <[email protected]>
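The idea behind `NodeRef` and `YeastDisplay` in the commit above can be sketched roughly as follows. This is a simplified illustration, not the actual YEAST code: the `Ast` and `Node` layouts and the `render_leaf` helper are assumptions made for the example; only the names `NodeRef`, `YeastDisplay`, and `yeast_to_string` come from the commit message.

```rust
/// Simplified stand-in for the YEAST AST (assumed layout): a node arena
/// plus the original source bytes.
struct Ast {
    source: Vec<u8>,
    nodes: Vec<Node>,
}

/// Hypothetical node: a kind name and a byte range into `Ast::source`.
struct Node {
    kind: &'static str,
    start: usize,
    end: usize,
}

/// Typed wrapper around an arena index, so a capture cannot be mistaken
/// for an arbitrary number.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct NodeRef(usize);

/// Lets callers recover the raw ID via `Into::<usize>::into` where needed.
impl From<NodeRef> for usize {
    fn from(r: NodeRef) -> usize {
        r.0
    }
}

/// Alternative to `Display` whose formatting method receives the `Ast`.
trait YeastDisplay {
    fn yeast_to_string(&self, ast: &Ast) -> String;
}

/// A `NodeRef` resolves to the source text of the node it points at.
impl YeastDisplay for NodeRef {
    fn yeast_to_string(&self, ast: &Ast) -> String {
        let node = &ast.nodes[self.0];
        String::from_utf8_lossy(&ast.source[node.start..node.end]).into_owned()
    }
}

/// A raw number still prints as a number; this is what produced
/// `operator "3"` before captures were bound as `NodeRef`.
impl YeastDisplay for usize {
    fn yeast_to_string(&self, _ast: &Ast) -> String {
        self.to_string()
    }
}

/// Hypothetical helper that renders a leaf as `kind "text"`.
fn render_leaf(ast: &Ast, r: NodeRef) -> String {
    format!("{} \"{}\"", ast.nodes[r.0].kind, r.yeast_to_string(ast))
}
```

With the source `a + b` and an `operator` node spanning byte 2, `render_leaf` yields `operator "+"`, whereas formatting the raw index would print the number.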
…n in field patterns
Two changes to parse_query_fields:
- Allow `field: (kind)* @cap` (repetition + optional capture) in field position, mirroring how it works for bare children.
- When the same field name is declared multiple times in a query (e.g. `condition: (foo) condition: (bar)`), merge them into a single ordered list of children rather than emitting duplicate field entries (which at runtime restart the iterator for the field and cause the second declaration to re-match from the first child).
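The merging behaviour described in the second change might look like the sketch below. It is an illustration only: `merge_fields` and its string-based representation are hypothetical, and the real `parse_query_fields` in yeast-macros operates on parsed tokens rather than string pairs.

```rust
/// Merge repeated declarations of the same field into one ordered list of
/// child patterns. Each input pair is (field name, child pattern); fields
/// keep the position of their first declaration.
fn merge_fields<'a>(decls: &[(&'a str, &'a str)]) -> Vec<(&'a str, Vec<&'a str>)> {
    let mut merged: Vec<(&'a str, Vec<&'a str>)> = Vec::new();
    for &(name, pattern) in decls {
        // Look for an earlier declaration of the same field.
        let existing = merged.iter().position(|(n, _)| *n == name);
        match existing {
            // Append to the existing entry instead of emitting a duplicate.
            Some(idx) => merged[idx].1.push(pattern),
            None => merged.push((name, vec![pattern])),
        }
    }
    merged
}
```

For `condition: (foo) condition: (bar)` this produces a single `condition` entry holding `(foo)` followed by `(bar)`, so a matcher walks the field's children once, in order.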
…d mapping
ast_types.yml additions:
- tuple_pattern { element*: pattern } in the pattern supertype.
- sequence_condition { stmt*: stmt, condition: condition } in the
condition supertype.
swift.rs:
- Map Swift tuple destructuring (e.g. `let (a, b) = pair`) to the new
tuple_pattern instead of synthesizing an apply_pattern.
- if-let / guard-let: explicitly match the value_binding_pattern
(the `let` keyword) and bind the source expression as the next
condition child, so `let` no longer leaks into the output.
The branch was rebased on the grammar changes, but rewriting the history was too difficult, so I'm just updating the test output here.
The output is not so interesting as the mapping removes most nodes from the current test file. I added a name_expr.swift test so at least one NameExpr makes it through.
Pull request overview
This PR introduces tree-sitter “corpus-style” tests for the Unified extractor, augmented with inline schema/type checking of the mapped (output) AST to catch YEAST mapping/schema mismatches early and with clearer diagnostics.
Changes:
- Added corpus test harness (`unified/extractor/tests/corpus_tests.rs`) that parses `tests/corpus/**.txt` cases, dumps both raw tree-sitter output and translated output, and (optionally) regenerates expected outputs via `UNIFIED_UPDATE_CORPUS`.
- Introduced a shared Unified output schema (`unified/extractor/ast_types.yml`) and updated generator/extractor wiring to use the unified schema namespace.
- Extended YEAST (schema + dump + macros + phase kinds) to support OneShot translation phases and inline schema error annotations in dumps.
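The compare-or-regenerate behaviour of such a harness can be sketched as follows. This is not the code in `corpus_tests.rs`: the `Outcome` type and both functions are hypothetical, and treating any set value of `UNIFIED_UPDATE_CORPUS` as "update" is an assumption.

```rust
use std::env;

/// Result of checking one corpus case (hypothetical type).
#[derive(Debug, PartialEq)]
enum Outcome {
    /// Expected and actual dumps agree.
    Pass,
    /// Update mode: carries the text to write back as the new expectation.
    Updated(String),
    /// Mismatch outside update mode: carries a failure report.
    Fail(String),
}

/// Assumption: the variable being set at all enables update mode.
fn update_requested() -> bool {
    env::var_os("UNIFIED_UPDATE_CORPUS").is_some()
}

/// Compare the expected dump with the actual one, ignoring trailing
/// whitespace; on mismatch either regenerate or report.
fn check_case(name: &str, expected: &str, actual: &str, update: bool) -> Outcome {
    if expected.trim_end() == actual.trim_end() {
        Outcome::Pass
    } else if update {
        Outcome::Updated(actual.to_string())
    } else {
        Outcome::Fail(format!(
            "corpus case `{name}` differs\n--- expected\n{expected}\n+++ actual\n{actual}"
        ))
    }
}
```

A test would call `check_case(name, expected, actual, update_requested())` per case and write the `Updated` text back to the corpus file.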
Show a summary per file
| File | Description |
|---|---|
| unified/scripts/update-corpus.sh | Helper script to regenerate corpus expected outputs via UNIFIED_UPDATE_CORPUS=1 cargo test. |
| unified/ql/test/library-tests/BasicTest/test.ql | Adjusted basic Unified AST query predicates/imports to the new Unified AST surface. |
| unified/ql/test/library-tests/BasicTest/test.expected | Updated expected query results to match the new Unified AST model and unsupported-node fallback behavior. |
| unified/ql/test/library-tests/BasicTest/name_expr.swift | Added a minimal Swift input to exercise name_expr extraction. |
| unified/ql/lib/unified.dbscheme | Regenerated dbscheme to reflect the new Unified output AST schema and relation namespace. |
| unified/extractor/tests/corpus/swift/variables.txt | Added Swift corpus cases for variable declarations/bindings and assignments. |
| unified/extractor/tests/corpus/swift/types.txt | Added Swift corpus cases for type/decl constructs (classes/structs/enums/etc.). |
| unified/extractor/tests/corpus/swift/optionals-and-errors.txt | Added Swift corpus cases for optionals, try/throws, and related constructs. |
| unified/extractor/tests/corpus/swift/operators.txt | Added Swift corpus cases for operator parsing and precedence. |
| unified/extractor/tests/corpus/swift/loops.txt | Added Swift corpus cases for loops and related control constructs. |
| unified/extractor/tests/corpus/swift/literals.txt | Added Swift corpus cases for literal forms (int/string/etc.). |
| unified/extractor/tests/corpus/swift/functions.txt | Added Swift corpus cases for function decls/calls and argument forms. |
| unified/extractor/tests/corpus/swift/desugar.txt | Added Swift corpus cases validating key desugaring/translation outcomes. |
| unified/extractor/tests/corpus/swift/control-flow.txt | Added Swift corpus cases for if/else/guard/switch control flow. |
| unified/extractor/tests/corpus/swift/collections.txt | Added Swift corpus cases for arrays/dicts/tuples/subscript parsing shapes. |
| unified/extractor/tests/corpus/swift/closures.txt | Added Swift corpus cases for closures/lambdas including trailing closures/captures. |
| unified/extractor/tests/corpus_tests.rs | New Rust test harness that runs corpus cases, compares dumps, and can update expected output. |
| unified/extractor/src/main.rs | Registered languages module for shared language-spec plumbing. |
| unified/extractor/src/languages/swift/swift.rs | Implemented Swift→Unified translation rules using a OneShot YEAST phase and output schema. |
| unified/extractor/src/languages/mod.rs | Centralized language specs and shared OUTPUT_AST_SCHEMA include. |
| unified/extractor/src/generator.rs | Updated generator to build dbscheme/QL library from the unified output schema via output_node_types_yaml. |
| unified/extractor/src/extractor.rs | Normalized per-language specs to emit unified_* TRAP relations matching the unified dbscheme. |
| unified/extractor/ast_types.yml | Added the Unified output AST schema (supertypes + named/unnamed node definitions). |
| unified/AGENTS.md | Updated contributor instructions for extractor/corpus testing and regeneration workflow. |
| shared/yeast/tests/test.rs | Added OneShot + typed dump coverage; updated phase construction to include PhaseKind. |
| shared/yeast/src/visitor.rs | Ensured Ast contains a source buffer field (initialized when building). |
| shared/yeast/src/schema.rs | Extended schema with supertype membership + per-field allowed-type info for type checking. |
| shared/yeast/src/node_types_yaml.rs | Populated schema supertype/field-type metadata from YAML; refactored schema building helpers. |
| shared/yeast/src/lib.rs | Added NodeRef + YeastDisplay, source-text resolution, reachable-node traversal, and OneShot phase execution. |
| shared/yeast/src/dump.rs | Added dump mode that annotates inline schema/type errors for faster debugging. |
| shared/yeast/src/captures.rs | Added try_map_all_captures to support OneShot recursive capture rewriting. |
| shared/yeast/doc/yeast.md | Documented PhaseKind semantics and updated examples for new phase API. |
| shared/yeast-macros/src/parse.rs | Enhanced rule/query parsing (repeated fields, capture typing as NodeRef, #{} formatting via YeastDisplay). |
| shared/tree-sitter-extractor/src/generator/mod.rs | Avoided generating ReservedWord class when the schema has no unnamed reserved-word token type. |
| shared/tree-sitter-extractor/src/extractor/mod.rs | Plumbed source bytes into Runner::run_from_tree for correct source-text rendering. |
Copilot's findings
Comments suppressed due to low confidence (2)
unified/extractor/src/languages/swift/swift.rs:229
- The `guard_statement` rule pattern expects `bound_identifier` and `value_binding_pattern` directly under `guard_statement`, but the corpus raw parse shows these are nested under `condition: if_condition -> if_let_binding` (see `tests/corpus/swift/control-flow.txt`). As written, this rule won't match and `guard let` remains `unsupported_node`. Update the pattern to match the actual tree-sitter shape (or remove the rule until supported).
// ---- Guard statement ----
// `guard let x = e else { ... }` — currently only handles the
// let-binding form. The Swift parser models the `let` keyword as a
// `value_binding_pattern` child of `condition`, followed by an
// unnamed `=` and the source expression.
rule!(
(guard_statement
bound_identifier: (simple_identifier) @id
condition: (value_binding_pattern)
condition: (_) @value
(else)
(statements) @else_branch)
=>
unified/extractor/src/languages/swift/swift.rs:249
- The if-let translation rules assume `bound_identifier` is a direct field of `if_statement`, but the corpus raw parse shows it is nested under `condition: if_condition -> if_let_binding` (see `tests/corpus/swift/control-flow.txt`). These rules therefore never fire, and the code falls back to treating the whole `let value = optional` as an `expr_condition` `unsupported_node`. Adjust the match to the real parse shape so `let_pattern_condition` is actually produced.
// ---- If statement ----
// if-let binding (with optional else branch). The Swift parser puts
// the bound name in `bound_identifier`, the `let` keyword as a
// `value_binding_pattern` child of `condition`, and the source
// expression as a separate child of `condition`.
rule!(
(if_statement
bound_identifier: (simple_identifier) @id
condition: (value_binding_pattern)
condition: (_) @value
(statements) @then
(else)
(_) @else_branch)
=>
- Files reviewed: 35/36 changed files
- Comments generated: 2
- To run extractor tests, run `cargo test` in the `extractor` directory.
- To run all tests, run `codeql test run --search-path extractor-pack ql/test`
- Do not edit the printed ASTs in `extractor/test/corpus` directly. To regenerate the ASTs, run `scripts/update-corpus.sh`.
Comment on lines +162 to +173
// Map a `lambda_literal` whose body is a single statement to
// `lambda_expr`. Multi-statement bodies fall through to
// `unsupported_node` because `lambda_expr.body` is single-valued
// in the current `ast_types.yml`. Parameters from explicit-typed
// closures (`{ (x: Int) -> Int in ... }`) are not yet captured.
rule!(
    (lambda_literal
        (statements (_) @body))
    =>
    (lambda_expr
        body: {body})
),
Corpus tests with schema checking
This PR adds tests modelled after tree-sitter's "corpus tests". A test file contains a set of triples (source code, raw tree, final tree), where the trees are indentation-printed ASTs.
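As a rough illustration of the triple structure, here is a minimal parser sketch. It assumes the tree-sitter convention of `===` rules around the case name and `---` rules between sections, with a second `---` separating the raw tree from the final tree; that second separator, the `CorpusCase` type, and the node names in the usage note are assumptions, not taken from this PR's harness.

```rust
/// One corpus case: source code plus the two printed trees (hypothetical type).
#[derive(Debug, PartialEq)]
struct CorpusCase {
    name: String,
    source: String,
    raw_tree: String,
    final_tree: String,
}

/// True for a separator line made only of `ch`, at least three long.
fn is_rule(line: &str, ch: char) -> bool {
    line.len() >= 3 && line.chars().all(|c| c == ch)
}

/// Parse every case in a corpus file. Limitation of this sketch: a source
/// line consisting only of `---` or `===` would be read as a separator.
fn parse_corpus(text: &str) -> Vec<CorpusCase> {
    let mut cases = Vec::new();
    let mut lines = text.lines().peekable();
    while let Some(line) = lines.next() {
        if !is_rule(line, '=') {
            continue; // skip anything before the first header
        }
        // Header: name lines up to the closing `===` rule.
        let mut name_parts = Vec::new();
        for l in lines.by_ref() {
            if is_rule(l, '=') {
                break;
            }
            name_parts.push(l.trim());
        }
        // Body: sections split on `---`, ending at the next header or EOF.
        let mut sections = vec![String::new()];
        while let Some(&l) = lines.peek() {
            if is_rule(l, '=') {
                break;
            }
            lines.next();
            if is_rule(l, '-') {
                sections.push(String::new());
            } else {
                let current = sections.last_mut().unwrap();
                current.push_str(l);
                current.push('\n');
            }
        }
        let mut parts = sections.into_iter().map(|s| s.trim().to_string());
        cases.push(CorpusCase {
            name: name_parts.join(" "),
            source: parts.next().unwrap_or_default(),
            raw_tree: parts.next().unwrap_or_default(),
            final_tree: parts.next().unwrap_or_default(),
        });
    }
    cases
}
```

Given a file with a `name expr` header, the source `x`, a raw tree such as `(source_file (simple_identifier))`, and a final tree such as `(top_level (name_expr))`, the parser returns one `CorpusCase` with those three sections.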
The "final tree" is also rendered with schema violations printed inline. A schema violation would cause a QL extractor crash, but catching it in the corpus tests gives a tighter feedback loop: the tests run faster and produce more informative errors.
For example, one test looks like:
However, if I were to introduce a type error in the mapping:
the last section of the test output would look like:
Since YEAST is not strongly typed we currently have to rely on testing to catch these errors.
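The kind of check involved can be sketched as below: a schema records which node types each field accepts, supertypes expand to their members, and the dump appends a marker to any child that does not fit. Everything here is illustrative. The `Schema` and `Node` types, the field names `left` and `right`, the `block_stmt` kind, and the `<-- ERROR` marker format are assumptions, not the PR's actual types or output.

```rust
use std::collections::HashMap;

/// Hypothetical schema: supertype membership plus allowed types per field.
struct Schema {
    /// supertype name -> member types (assumed acyclic)
    supertypes: HashMap<&'static str, Vec<&'static str>>,
    /// (parent kind, field name) -> allowed types
    fields: HashMap<(&'static str, &'static str), Vec<&'static str>>,
}

impl Schema {
    /// True if `kind` is `allowed` itself or a (transitive) member of it.
    fn accepts(&self, allowed: &str, kind: &str) -> bool {
        allowed == kind
            || self
                .supertypes
                .get(allowed)
                .map_or(false, |members| members.iter().any(|m| self.accepts(m, kind)))
    }
}

/// Hypothetical output node: a kind plus named child fields.
struct Node {
    kind: &'static str,
    fields: Vec<(&'static str, Node)>,
}

/// Print the children of `node`, marking any child whose kind the schema
/// does not allow in that field.
fn dump_children(schema: &Schema, node: &Node, depth: usize, out: &mut String) {
    let indent = "  ".repeat(depth + 1);
    for (name, child) in &node.fields {
        out.push_str(&format!("{indent}{name}: ({})", child.kind));
        if let Some(allowed) = schema.fields.get(&(node.kind, *name)) {
            if !allowed.iter().any(|t| schema.accepts(t, child.kind)) {
                out.push_str(&format!("  <-- ERROR: expected {}", allowed.join(" | ")));
            }
        }
        out.push('\n');
        dump_children(schema, child, depth + 1, out);
    }
}

/// Dump a whole tree with inline schema annotations.
fn dump(schema: &Schema, root: &Node) -> String {
    let mut out = format!("({})\n", root.kind);
    dump_children(schema, root, 0, &mut out);
    out
}
```

Because the annotation sits next to the offending node in the printed tree, a mismatch between a mapping rule and the schema points directly at the child that needs fixing.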
Why is this PR so big?
This PR ends up doing many things at once, and they are unfortunately hard to disentangle.