[Repo Assist] Add schema.org microdata support to HtmlProvider #1676
github-actions[bot] wants to merge 8 commits into main
- Parse itemscope/itemtype/itemprop HTML microdata attributes at design time
- Generate a typed 'Schemas' container on HtmlProvider documents
- Each schema type (e.g. http://schema.org/Person) becomes a property returning an array of typed items with one property per itemprop name
- Items are erased to HtmlSchemaItem at runtime
- Property values follow the HTML microdata spec: content attr, href, src, datetime, or inner text depending on element type
- Nested itemscope elements are not traversed (correct per spec)
- 6 unit tests (HtmlRuntime.getSchemas) + 3 integration tests (HtmlProvider)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

/repo-assist Please investigate CI failures and fix.

Commit pushed:

The HtmlProvider now generates schema types for HTML elements with itemscope/itemtype/itemprop attributes. Update the expected signature snapshots for the 4 HTML test files that contain microdata (zoopla.html, zoopla2.html, ebay_cars.htm, imdb_chart.htm) so the DesignTime tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

/repo-assist Please try this out on some real websites: develop some scripts that use Wikipedia, IMDB, and maybe eBay (I've added these to the network firewall). Post the samples back here and put a couple in the docs. Also update the docs, which you haven't done yet, and review all the HtmlProvider docs and improve them.

…ema.org microdata section

- Update introduction to clearly explain table naming, column type inference, and when to use the provider
- Add Wikipedia S&P 500 companies example demonstrating groupBy analysis
- Add schema.org microdata section showing ProductCatalog and mixed-page samples
- Update NuGet stats example with improved regex and formatting
- Rename Doctor Who groupBy variable for clarity
- Add note about JSON-LD on JS-rendered sites (IMDB, eBay)
- Remove outdated 'Introducing the provider' framing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Commit pushed:
/repo-assist OK, it looks like we should support JSON-LD as well. Could you add that to this PR and redo more samples? Also find out what structured-data schemas Wikipedia actually supports (not just tables), develop samples around those in the same way, and get back to me.
Adds native JSON-LD support to HtmlProvider: when an HTML document contains <script type="application/ld+json"> blocks, the provider now generates a typed .JsonLd container (e.g. doc.JsonLd.Article) with one strongly-typed string property per top-level scalar field discovered in the sample.

This mirrors the .Schemas container added for HTML microdata and is especially useful for Wikipedia pages (and many modern websites) which embed schema.org Article/WebPage/Person JSON-LD in the <head> element for SEO purposes.

Changes:
- Add HtmlJsonLdItem and HtmlJsonLdGroup types to HtmlRuntime.fs
- Add JsonLdGroup case to HtmlObjectDescription
- Add getJsonLd parser (finds script[type=application/ld+json] elements, parses JSON, groups by @type, flattens scalar properties into Map<string,string>)
- Add GetJsonLd(id) method to HtmlDocument
- Update HtmlGenerator to generate a typed JsonLd container
- Add FSharp.Data.Json.Core project reference to Html.Core
- Add 5 runtime tests + 6 integration tests
- Rewrite HtmlProvider.fsx with Wikipedia JSON-LD examples and summary table of all three formats (tables/microdata/JSON-LD)
- Update RELEASE_NOTES.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
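
The parser steps named in this commit (parse each ld+json block, group by @type, keep top-level scalar properties) can be sketched roughly as follows. This is an illustration only, not the code in HtmlRuntime.fs: it swaps in `System.Text.Json` for the `JsonValue.Parse` call the PR uses so that it runs without FSharp.Data, it takes the script bodies as plain strings rather than locating them in an HTML document, and it assumes each block is a single JSON object with a string `@type`. The names `scalarProperties` and `groupByType` and the sample blocks are hypothetical.

```fsharp
open System.Text.Json

/// Flattens the top-level scalar fields of one JSON object into a string map.
/// Nested objects, arrays and nulls are skipped (an assumption of this sketch).
let scalarProperties (item: JsonElement) : Map<string, string> =
    item.EnumerateObject()
    |> Seq.choose (fun (prop: JsonProperty) ->
        match prop.Value.ValueKind with
        | JsonValueKind.String -> Some(prop.Name, prop.Value.GetString())
        | JsonValueKind.Number
        | JsonValueKind.True
        | JsonValueKind.False -> Some(prop.Name, prop.Value.GetRawText())
        | _ -> None)
    |> Map.ofSeq

/// Groups the bodies of ld+json blocks by the value of their "@type" field.
/// Blocks that are not objects, or that have no string "@type", are ignored.
let groupByType (blocks: string list) : Map<string, Map<string, string> list> =
    blocks
    |> List.choose (fun (json: string) ->
        use doc = JsonDocument.Parse(json)
        let root = doc.RootElement

        if root.ValueKind = JsonValueKind.Object then
            let props = scalarProperties root
            props |> Map.tryFind "@type" |> Option.map (fun typeName -> (typeName, props))
        else
            None)
    |> List.groupBy fst
    |> List.map (fun (typeName, items) -> typeName, List.map snd items)
    |> Map.ofList

// Made-up sample blocks, loosely shaped like schema.org metadata.
let blocks =
    [ """{"@context":"https://schema.org","@type":"Article","name":"F Sharp","inLanguage":"en"}"""
      """{"@type":"Article","name":"OCaml","image":{"@type":"ImageObject"}}"""
      """{"@type":"Person","name":"Ada Lovelace"}""" ]

let groups = groupByType blocks

for KeyValue(typeName, items) in groups do
    printfn "%s: %d item(s)" typeName items.Length
```

In this sketch the `@context` and `@type` entries stay in each property map; whether the real parser keeps or drops them is not stated in the commit message.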
Commit pushed:
🤖 This is an automated response from Repo Assist. Done! I've added JSON-LD support to this PR and revised the documentation with Wikipedia-focused examples. Here's a summary of what was added: JSON-LD support (

| Format | Access | What it contains |
|---|---|---|
| HTML tables | `.Tables.*` | Statistics, event lists, comparisons |
| HTML microdata | `.Schemas.*` | Inline entity markup (products, events) |
| JSON-LD | `.JsonLd.*` | Article metadata: name, headline, description, url, datePublished, dateModified, license, inLanguage |

Wikipedia embeds a schema.org/Article JSON-LD block in `<head>` on every article page. Person and place pages additionally carry a `mainEntity` link to the Wikidata item, plus `sameAs` cross-references.
Changes in this push
- `HtmlRuntime.fs`: `HtmlJsonLdItem`, `HtmlJsonLdGroup` types; `getJsonLd` parser; `GetJsonLd(id)` on `HtmlDocument`
- `HtmlGenerator.fs`: `createJsonLdItemType`; `JsonLdGroup` case generating a `JsonLd` container
- `FSharp.Data.Html.Core.fsproj`: added `FSharp.Data.Json.Core` project reference (for `JsonValue.Parse`)
- `HtmlProvider.fsx`: full rewrite of the structured-data sections with four Wikipedia-based examples and a format summary table
- Tests: 5 new runtime unit tests + 6 new `HtmlProvider` integration tests (all green)
Test status: DesignTime 489 ✅ · Integration 285 ✅ · Core 2867 ✅ · Fantomas format check ✅
Generated by Repo Assist for issue #1676
To install this agentic workflow, run
gh aw add githubnext/agentics/workflows/repo-assist.md@afb00b92a9514fee9a14c583f059a03d05738f70
🤖 This PR was created by Repo Assist, an automated AI assistant.
Implements the feature requested in #611: adds schema.org microdata inference to `HtmlProvider`, generating a typed `Schemas` container from HTML elements with `itemscope`/`itemtype`/`itemprop` attributes.

Closes #611
What's changed
New types (in `FSharp.Data.Runtime`)
- `HtmlSchemaItem`: a single microdata item, `{ Properties: Map<string, string>; Html: HtmlNode }`
- `HtmlSchemaGroup`: all items of a given schema type, `{ Name; TypeUrl; Items: HtmlSchemaItem[]; Properties: string[] }`
- `HtmlObjectDescription` gains a new `SchemaGroup` case

Runtime parsing (`HtmlRuntime.fs`)
- `HtmlRuntime.getSchemas : HtmlDocument -> HtmlSchemaGroup list` discovers all `itemscope`/`itemtype` elements, groups them by type URL, and extracts `itemprop` values per the HTML microdata spec:
  - `<meta>` → `content` attribute
  - `<a>`, `<link>` → `href` attribute (falling back to inner text)
  - `<img>`, `<audio>`, `<video>`, `<source>` → `src` attribute
  - `<time>` → `datetime` attribute (falling back to inner text)
  - other elements → `content` attribute (falling back to inner text)
- Nested `itemscope` elements are not traversed
- `HtmlDocument.GetSchema(id)` added for runtime schema lookup

Type generation (`HtmlGenerator.fs`)
- `createSchemaItemType` generates one provided type per schema group (erased to `HtmlSchemaItem`), with one `string` property per discovered `itemprop` name (Pascal-cased)
- `generateTypes` handles the `SchemaGroup` case: it adds a `Schemas` container with one property per schema type, returning `SchemaItemType[]`

Usage example
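
As a rough illustration of the value rules listed under Runtime parsing, the sketch below picks an `itemprop` value by tag name and collects the results into the kind of `Map<string, string>` that `HtmlSchemaItem.Properties` holds. It is not the code from `HtmlRuntime.fs`: the `Element` record, the `itempropValue` and `element` helpers, the sample data, and the empty-string result for a missing `content` or `src` attribute on `<meta>` and media elements are all assumptions made for the example.

```fsharp
// Hypothetical stand-in for an HTML element; the real code works on HtmlNode.
type Element =
    { Tag: string
      Attributes: Map<string, string>
      InnerText: string }

/// Chooses the value of an itemprop element by tag name.
let itempropValue (el: Element) : string =
    let attr name = Map.tryFind name el.Attributes
    let orInnerText (value: string option) = defaultArg value el.InnerText

    match el.Tag.ToLowerInvariant() with
    | "meta" -> defaultArg (attr "content") "" // assumption: empty when missing
    | "a"
    | "link" -> attr "href" |> orInnerText
    | "img"
    | "audio"
    | "video"
    | "source" -> defaultArg (attr "src") "" // assumption: empty when missing
    | "time" -> attr "datetime" |> orInnerText
    | _ -> attr "content" |> orInnerText

/// Small helper for building sample elements.
let element tag attrs text =
    { Tag = tag
      Attributes = Map.ofList attrs
      InnerText = text }

// itemprop name -> element, as they might appear inside one itemscope (made-up data).
let person =
    [ "name", element "span" [] "Jane Doe"
      "url", element "a" [ "href", "https://example.com/jane" ] "Home page"
      "sameAs", element "a" [] "https://example.org/jane"
      "birthDate", element "time" [ "datetime", "1980-01-01" ] "1 January 1980"
      "image", element "img" [ "src", "jane.png" ] "" ]
    |> List.map (fun (prop, el) -> prop, itempropValue el)
    |> Map.ofList

for KeyValue(prop, value) in person do
    printfn "%s = %s" prop value
```

With the provider itself, each key of such a map would surface as a Pascal-cased `string` property on the generated item type, reached through the `Schemas` container.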
Test Status
- `FSharp.Data.Core.Tests` HTML/schema unit tests: 2862 passed (6 new schema tests added)
- `FSharp.Data.Tests` HtmlProvider integration tests: 38 passed (3 new schema integration tests added)