feat: add registers extraction task for PAC generation #1

Open
ozongzi wants to merge 1 commit into akiselev:master from ozongzi:feat/registers-extraction

ozongzi commented Mar 27, 2026

Summary

Adds a new registers extraction task that uses Gemini to extract a complete peripheral register map from a microcontroller datasheet, outputting structured JSON suitable for generating Rust PAC (Peripheral Access Crate) code via svd2rust or chiptool.

datasheet extract registers STM32F407.pdf -f

Output Format

The output JSON models peripherals, registers, fields, and enumerated values — matching the data model used by svd2rust and embassy's chiptool:

{
  "part_details": { "part_number": "STM32F407VG", ... },
  "peripherals": [
    {
      "name": "SPI1",
      "base_address": "0x40013000",
      "registers": [
        {
          "name": "CR1",
          "offset": "0x00",
          "access": "read-write",
          "fields": [
            { "name": "SPE", "bit_offset": 6, "bit_width": 1, ... }
          ]
        }
      ]
    }
  ]
}
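
To illustrate how a downstream generator might consume these values, here is a minimal std-only sketch that turns the hex strings into an absolute register address and a field's `bit_offset` / `bit_width` into a mask. The helper names (`parse_hex`, `register_address`, `field_mask`) are hypothetical and not part of this PR; 32-bit registers are assumed.

```rust
/// Parse a hex string such as "0x40013000" into a u32.
/// Returns None if the "0x" prefix is missing or the digits are invalid.
fn parse_hex(s: &str) -> Option<u32> {
    let digits = s.strip_prefix("0x").or_else(|| s.strip_prefix("0X"))?;
    u32::from_str_radix(digits, 16).ok()
}

/// Absolute address of a register: peripheral base address plus byte offset.
fn register_address(base_address: &str, offset: &str) -> Option<u32> {
    parse_hex(base_address)?.checked_add(parse_hex(offset)?)
}

/// Bit mask covering a field, or None if the field is empty or
/// does not fit in a 32-bit register (an assumption of this sketch).
fn field_mask(bit_offset: u32, bit_width: u32) -> Option<u32> {
    if bit_width == 0 || bit_offset.checked_add(bit_width)? > 32 {
        return None;
    }
    // Build the mask in u64 so a full 32-bit field does not overflow the shift.
    let mask = ((1u64 << bit_width) - 1) << bit_offset;
    Some(mask as u32)
}
```

With the example above, `register_address("0x40013000", "0x00")` gives the address of SPI1 CR1, and `field_mask(6, 1)` gives the mask for the SPE bit.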

Motivation

Existing PAC generation relies on vendor-provided SVD files, which are often inaccurate, incomplete, or nonexistent (especially for Chinese MCUs such as GD32 and WCH parts). This task enables generating PAC data directly from datasheets, with the output validated against existing stm32-rs / stm32-metapac data.

Build Note

mupdf-sys currently fails to compile on macOS 26.x (Xcode 17, clang 17) due to an fdopen macro conflict in the bundled zlib. This is an existing issue unrelated to this PR. Consider pdfium-render as a cross-platform alternative.

Copilot AI review requested due to automatic review settings March 27, 2026 08:41

Copilot AI left a comment

Pull request overview

Adds a new registers extraction task intended to produce a structured peripheral/register/field JSON output that can be used for Rust PAC generation workflows (e.g., svd2rust / chiptool).

Changes:

  • Added a new ExtractTask::Registers CLI task wired to a new prompt spec.
  • Introduced a new registers extraction prompt (prompts/extract-registers.md) with detailed anti-hallucination and output requirements.
  • Added a JSON schema for the registers output shape in src/prompts.rs.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
|------|-------------|
| src/prompts.rs | Adds PROMPT_REGISTERS and a new registers() prompt spec + JSON schema. |
| src/extract.rs | Adds the Registers task variant and routes it to prompts::registers(). |
| prompts/extract-registers.md | New LLM prompt describing how to extract register maps and the expected JSON output. |

Comment on lines +33 to +41
| Field | Requirement |
|-------|-------------|
| `name` | EXACT register name as shown (e.g., `CR1`, `SR`, `DR`) |
| `description` | Brief description from datasheet |
| `offset` | Byte offset from peripheral base address (hex string, e.g., `"0x00"`) |
| `size` | Register width in bits (typically 32) |
| `reset_value` | Reset/default value as hex string (e.g., `"0x00000000"`), or null if not specified |
| `access` | `"read-write"`, `"read-only"`, `"write-only"`, or `"read-writeOnce"` |

Copilot AI Mar 27, 2026

The markdown tables in this prompt use || at the start of the header/separator rows (e.g. || Field | Requirement |), which renders as an extra empty column and can reduce clarity for the model. Update these to standard | Field | Requirement | / | --- | --- | formatting.

Comment on lines +46 to +64
| Field | Requirement |
|-------|-------------|
| `name` | EXACT field name (e.g., `SPE`, `RXNE`, `BR`) |
| `description` | EXACT description from datasheet |
| `bit_offset` | LSB position (0-indexed, e.g., `6` for bit 6) |
| `bit_width` | Number of bits (e.g., `1` for single bit, `3` for 3-bit field) |
| `access` | `"read-write"`, `"read-only"`, `"write-only"` — inherit from register if not specified |
| `enumerated_values` | Array of named values if datasheet defines them, else `[]` |

### Step 4: Handle Special Cases

**Reserved bits:**
- DO NOT include reserved bits as fields
- They will be inferred from gaps in bit coverage

**Write-clear / Read-clear flags:**
- Set `access` to `"read-writeOnce"` for write-1-to-clear flags
- Set `access` to `"read-only"` for hardware-set status flags

Copilot AI Mar 27, 2026

In the field extraction section, the allowed access values list doesn't include read-writeOnce, but later the prompt instructs using read-writeOnce for write-1-to-clear flags. This is internally inconsistent; include read-writeOnce in the field access options (or clarify that it is only allowed at the register level).
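
The quoted prompt also says reserved bits "will be inferred from gaps in bit coverage". A minimal sketch of what that inference could look like on the consumer side follows; `reserved_ranges` is a hypothetical helper, not code from this PR, and it assumes every field fits inside the register width.

```rust
/// Return the (bit_offset, bit_width) ranges not covered by any field,
/// i.e. the bits a consumer would treat as reserved.
/// `fields` holds (bit_offset, bit_width) pairs; `size` is the register width in bits.
fn reserved_ranges(fields: &[(u32, u32)], size: u32) -> Vec<(u32, u32)> {
    // Sort by bit offset so the register can be scanned from the LSB upward.
    let mut sorted = fields.to_vec();
    sorted.sort();

    let mut gaps = Vec::new();
    let mut next_free = 0u32; // lowest bit not yet accounted for
    for (offset, width) in sorted {
        if offset > next_free {
            gaps.push((next_free, offset - next_free));
        }
        // `max` keeps the scan correct even if two fields overlap.
        next_free = next_free.max(offset.saturating_add(width));
    }
    // Anything between the last field and the top of the register is reserved too.
    if next_free < size {
        gaps.push((next_free, size - next_free));
    }
    gaps
}
```

A check like this could also serve as a sanity test on model output: a register whose fields leave unexpectedly large gaps may indicate that the extraction missed fields rather than that the bits are truly reserved.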

Comment on lines +339 to +343
spec.schema = serde_json::from_str(r#"{
  "type": "object",
  "required": ["part_details", "peripherals"],
  "properties": {
    "part_details": {

Copilot AI Mar 27, 2026

The prompt specifies an explicit error JSON shape when no register map is found, but the responseJsonSchema here only allows {part_details, peripherals}. With schema enforcement, the model can't return the documented error object and may be forced to fabricate data to satisfy the schema. Consider updating the schema to oneOf the success shape vs an {error, part_number, pages_searched} shape (or remove the error-response instruction from the prompt).
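
To make the two shapes concrete, here is a hedged sketch of how calling code might model them once both are permitted. The enum, its reduced set of fields, and the `u32` type for `pages_searched` are assumptions for illustration only; the PR does not define these types.

```rust
/// Hypothetical in-memory model of the two response shapes discussed above.
#[derive(Debug)]
enum RegistersResponse {
    /// Success shape `{part_details, peripherals}`, reduced here to a part
    /// number and peripheral names to keep the sketch short.
    Success { part_number: String, peripheral_names: Vec<String> },
    /// Error shape `{error, part_number, pages_searched}`.
    /// The integer type of `pages_searched` is an assumption.
    NotFound { error: String, part_number: String, pages_searched: u32 },
}

/// One-line summary; the match forces callers to handle both shapes.
fn summarize(response: &RegistersResponse) -> String {
    match response {
        RegistersResponse::Success { part_number, peripheral_names } => {
            format!("{part_number}: {} peripheral(s) extracted", peripheral_names.len())
        }
        RegistersResponse::NotFound { error, part_number, pages_searched } => {
            format!("{part_number}: no register map ({error}); searched {pages_searched} page(s)")
        }
    }
}
```

Modeling the error case explicitly gives the model a legitimate way to report "nothing found" instead of inventing peripherals to satisfy the success-only schema.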

Comment on lines +344 to +401
      "type": "object",
      "required": ["part_number"],
      "properties": {
        "part_number": {"type": "string"},
        "datasheet_revision": {"type": ["string", "null"]},
        "description": {"type": "string"}
      }
    },
    "peripherals": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["name", "base_address", "registers"],
        "properties": {
          "name": {"type": "string"},
          "description": {"type": "string"},
          "base_address": {"type": "string"},
          "source_page": {"type": "integer"},
          "incomplete": {"type": "boolean"},
          "registers": {
            "type": "array",
            "items": {
              "type": "object",
              "required": ["name", "offset", "fields"],
              "properties": {
                "name": {"type": "string"},
                "description": {"type": "string"},
                "offset": {"type": "string"},
                "size": {"type": "integer"},
                "reset_value": {"type": ["string", "null"]},
                "access": {"type": "string", "enum": ["read-write", "read-only", "write-only", "read-writeOnce"]},
                "fields": {
                  "type": "array",
                  "items": {
                    "type": "object",
                    "required": ["name", "bit_offset", "bit_width"],
                    "properties": {
                      "name": {"type": "string"},
                      "description": {"type": "string"},
                      "bit_offset": {"type": "integer"},
                      "bit_width": {"type": "integer"},
                      "access": {"type": "string"},
                      "enumerated_values": {
                        "type": "array",
                        "items": {
                          "type": "object",
                          "required": ["name", "value"],
                          "properties": {
                            "name": {"type": "string"},
                            "value": {"type": "integer"},
                            "description": {"type": "string"}
                          }
                        }
                      }
                    }
                  }
                }
              }

Copilot AI Mar 27, 2026

The schema currently does not require several fields that the prompt marks as mandatory (e.g., part_details.datasheet_revision/description, register access/size/reset_value, field enumerated_values). If the JSON schema is meant to enforce output completeness, these should be added to the relevant required lists (and enumerated_values should be required with type: array, even if empty).

      }
    }
  }
}"#).expect("registers schema is valid JSON");

Copilot AI Mar 27, 2026

Unlike the other prompt specs, this schema is parsed from a raw JSON string and uses expect(...), which would panic at runtime if edited incorrectly. For consistency and safer refactors, consider expressing this with json!({...}) (or at least .context(...) + ?) to avoid panics and keep the style consistent across prompt specs.

Suggested change
- }"#).expect("registers schema is valid JSON");
+ }"#).unwrap_or_else(|e| panic!("registers schema JSON is invalid: {e}"));

akiselev (Owner) commented

Have you tested this with microcontroller reference manuals? Any idea how it performs?

I'm hesitant to add this as an explicit extraction because LLMs have a hard time with exhaustiveness in these cases and this would be even less reliable than extractions can be.

Let me test this out with a project I'm working on.

ozongzi (Author) commented Mar 28, 2026

Can I switch the PDF dependency from mupdf-sys to pdfium-render? Otherwise the project won't compile on my Mac.

I plan to integrate svd2rust / chiptool to implement a complete datasheet -> PAC toolchain, but I'm not sure whether I should implement it in this PR (the change may be too big, so perhaps I should create a new crate instead).
