feat(tiptap): convert Story Block content to Markdown (#35727) by wezell · Pull Request #35728 · dotCMS/core

wezell · 2026-05-15T21:06:33Z

Summary

Adds com.dotcms.tiptap.TiptapMarkdown — bidirectional converter between Tiptap JSON (Story Block / ProseMirror) and Markdown.
Exposes it to Velocity templates:
- $contentlet.storyBlock.toMarkdown() (StoryBlockMap)
- $markdownTool.blockToMarkdown(json) (MarkdownTool)
Adds org.commonmark:commonmark plus commonmark-ext-gfm-tables and commonmark-ext-gfm-strikethrough (all 0.22.0). Zero transitive runtime deps (~250KB total).

What it handles

Nodes: paragraph, heading 1-6, blockquote, bulletList, orderedList, listItem, codeBlock (with language), horizontalRule, hardBreak, image, table/tableRow/tableHeader/tableCell, plus dotCMS-specific dotImage and youtube.

Marks: bold, italic, strike, code, link.

Graceful degradation: marks with no markdown equivalent (underline, highlight, subscript, superscript, textStyle, color) are dropped silently. Any other unknown node/mark logs once at INFO via Logger.info and is skipped — Tiptap is extensible, so the converter never throws on user-extended schemas.
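
A minimal sketch of the log-once behavior described above, assuming "once" means once per distinct node/mark type for the lifetime of the reporter. The class name UnknownTypeReporter and the Consumer sink are hypothetical stand-ins; the PR itself logs through Logger.info.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Hypothetical helper: reports each unknown node/mark type at most once. */
final class UnknownTypeReporter {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    private final Consumer<String> sink; // stand-in for Logger.info

    UnknownTypeReporter(Consumer<String> sink) {
        this.sink = sink;
    }

    /** Returns true if this call emitted a message (first sighting of the type). */
    boolean report(String kind, String type) {
        // Set.add returns false when the key is already present, so repeats stay silent.
        if (!seen.add(kind + ":" + type)) {
            return false;
        }
        sink.accept("Skipping unsupported Tiptap " + kind + " '" + type + "'");
        return true;
    }
}
```

The caller skips the node or mark either way; the return value only says whether a message was written.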

Notable correctness details

Whitespace lifting. Markdown emphasis cannot close after a space (*x * is invalid). The serializer extracts trailing whitespace out of mark spans before emitting closers, and leading whitespace before openers, so output is always well-formed and parses back to the same structure.
Code-context escaping. Text inside inline code marks or codeBlock nodes is emitted literally — special chars are NOT backslash-escaped.
Dynamic fence width. A codeBlock whose body contains triple backticks gets a longer fence (4+ ticks) so the fence can't collide.
Pipe escaping in table cells; mark precedence (link > bold > italic > strike > code, outer→inner) deterministic.
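
The dynamic fence width could look roughly like the sketch below. FenceWidth and its methods are illustrative names, not the PR's code. The rule used here (one tick longer than the longest backtick run anywhere in the body) is deliberately conservative; the actual serializer may use a tighter rule.

```java
/** Hypothetical helper illustrating a collision-free code fence. */
final class FenceWidth {
    /** Returns a backtick fence one tick longer than the longest backtick run in body (minimum 3). */
    static String fenceFor(String body) {
        int longest = 0;
        int run = 0;
        for (int i = 0; i < body.length(); i++) {
            run = body.charAt(i) == '`' ? run + 1 : 0;
            longest = Math.max(longest, run);
        }
        return "`".repeat(Math.max(3, longest + 1));
    }

    /** Emits a fenced block; the body is written literally, with no backslash escaping. */
    static String codeBlock(String language, String body) {
        String fence = fenceFor(body);
        String text = body.endsWith("\n") ? body : body + "\n";
        return fence + language + "\n" + text + fence + "\n";
    }
}
```

A body containing a triple-backtick line gets a four-tick fence, so the inner ticks cannot close the block early.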

Test plan

TiptapMarkdownTest — 49 synthetic unit tests covering every supported node, every mark, escaping, fence-width, JSON-string overload, round-trip stability per node type.
TiptapMarkdownBlogContentTest — 7 tests against blog-test.json (trimmed to 2 real Story Block bodies, 122KB), verifying:
- every node/mark in real content is supported
- non-empty markdown output for every blog
- re-parses to a non-empty Tiptap doc
- reaches a stable fixed point after one normalization pass
- distinctive text survives the round-trip
- inline-code content emitted literally
Full module compile (./mvnw compile -pl :dotcms-core) clean.
Reviewer: manually hit a Story Block field via $contentlet.storyBlock.toMarkdown() in a Velocity template to sanity-check end-to-end wiring.

Out of scope (documented)

HTML blocks inside markdown are preserved as a paragraph of raw HTML text (no DOM parsing — Tiptap.js relies on the browser DOM here; the Java port does not).
youtube renders as a plain link to the video src (markdown has no native embed). Reviewer call: switch to an <iframe> HTML block if richer rendering is wanted.
Marks like underline etc. are intentionally lossy on the JSON→MD direction since markdown lacks the syntax.

🤖 Generated with Claude Code

…35727)

Adds TiptapMarkdown (com.dotcms.tiptap), which converts Story Block / ProseMirror documents to markdown and back. Wires it into the existing renderable surface so Velocity can call:
- $contentlet.storyBlock.toMd
- $markdownTool.blockToMarkdown(json)

Supports paragraph, heading 1-6, blockquote, bullet/ordered lists, codeBlock with language, horizontalRule, hardBreak, image, GFM tables, plus the dotCMS-specific dotImage and youtube extensions. Marks: bold, italic, strike, code, link.

Marks with no markdown equivalent (underline, highlight, sub/superscript, textStyle, color) are dropped silently; truly unknown nodes/marks log once at INFO and are skipped so the converter never throws on user-extended Tiptap schemas.

Markdown -> Tiptap uses commonmark-java with GFM tables and strikethrough extensions (zero transitive runtime deps).

Tests: 56 passing -- 49 synthetic + 7 against a trimmed real-blog fixture (blog-test.json) covering every node and mark plus fixed-point round-trip stability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude · 2026-05-15T21:07:29Z

Claude finished @wezell's task in 3m 51s

PR Review: feat(tiptap): convert Story Block content to Markdown

The overall design is sound and the test coverage is solid. A few real bugs and several things worth fixing before merge.

Bugs

1. Dead code: listStack and ListCtx are never used

TiptapMarkdown.java lines 461–462 and 1066–1069:

private final Deque<ListCtx> listStack = new ArrayDeque<>();  // never used
...
private static final class ListCtx { ... }  // never instantiated

The actual list rendering uses listDepth (a plain int counter) and direct recursion. The listStack/ListCtx are leftover from an earlier iteration. Remove them.

2. Test SUPPORTED_MARK_TYPES is incomplete — creates fragile fixture check

TiptapMarkdownBlogContentTest.java line 62 adds only "underline" to the set of silently-dropped marks, but TiptapMarkdown.isSilentlyDroppedMark() also silently drops highlight, subscript, superscript, textStyle, and color. If blog-test.json is ever updated with content that carries those marks, every_node_type_in_fixture_is_supported fails spuriously even though the converter handles them correctly.

// Missing from SUPPORTED_MARK_TYPES:
SUPPORTED_MARK_TYPES.add("highlight");
SUPPORTED_MARK_TYPES.add("subscript");
SUPPORTED_MARK_TYPES.add("superscript");
SUPPORTED_MARK_TYPES.add("textStyle");
SUPPORTED_MARK_TYPES.add("color");

3. Image src and link href values are embedded verbatim — spaces produce invalid markdown

renderImage (line 717) and closeMark for "link" (line 1006) write the URL directly into the output:

sb.append("![](").append(src);  // src unescaped
...
final StringBuilder sb = new StringBuilder("](").append(href);  // href unescaped

A src or href containing a space (e.g. https://example.com/my image.png) produces invalid CommonMark syntax. The URL should be percent-encoded (https://example.com/my%20image.png) or wrapped in angle brackets (<https://example.com/my image.png>). A CommonMark link destination that is not wrapped in angle brackets cannot contain spaces, so the parser does not recognize the construct as a link or image at all and emits it as literal text.

Additionally, title.replace("\"", "\\\"") in renderImage (line 718) and closeMark (line 1007) does not escape backslashes first, so a title of foo \"bar becomes foo \\"bar which is still malformed.
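
One way to address both points in this finding, sketched under assumptions: LinkEscaper and its method names are hypothetical, the angle-bracket form is chosen over percent-encoding so the URL text stays unchanged, URLs are assumed to contain no line breaks, and alt text is assumed to be escaped elsewhere.

```java
/** Hypothetical helper for link/image destinations and titles. */
final class LinkEscaper {
    /** Wraps the destination in angle brackets when a bare destination would be invalid. */
    static String destination(String url) {
        boolean needsBrackets = url.isEmpty();
        for (int i = 0; i < url.length() && !needsBrackets; i++) {
            char c = url.charAt(i);
            needsBrackets = Character.isWhitespace(c) || c == '(' || c == ')' || c == '<' || c == '>';
        }
        if (!needsBrackets) {
            return url;
        }
        // Inside <...>, a backslash and the bracket characters must themselves be escaped.
        String escaped = url.replace("\\", "\\\\").replace("<", "\\<").replace(">", "\\>");
        return "<" + escaped + ">";
    }

    /** Escapes a title for use inside double quotes. */
    static String title(String title) {
        // Backslashes first, so the backslash added for a quote is not doubled afterwards.
        return title.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    static String image(String alt, String src, String title) {
        StringBuilder sb = new StringBuilder("![").append(alt).append("](").append(destination(src));
        if (title != null && !title.isEmpty()) {
            sb.append(" \"").append(title(title)).append('"');
        }
        return sb.append(')').toString();
    }
}
```

With this ordering, a title of foo \"bar is emitted with three backslashes before the quote (an escaped backslash followed by an escaped quote), which parses back to the original text.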

4. escapeText escapes - everywhere, producing ugly output

escapeText (line 1029) includes '-' in the list of characters to backslash-escape. Every hyphen in prose text gets escaped: "state-of-the-art" becomes "state\-of\-the\-art". While this technically round-trips through CommonMark, markdown is commonly consumed by tools other than CommonMark (Pandoc, GitHub, Jekyll, etc.), and visual inspection of the output is unpleasant. A hyphen is only significant as a list marker when it appears as - or --- at the start of a line — it never needs escaping mid-sentence. The same applies to . (see next point).

5. Paragraphs starting with \d+\. are not escaped — round-trip breaks

escapeText does not escape . at all, including after leading digits. A paragraph whose first text node is "1. Something" therefore serializes unchanged to:

1. Something

As the first line of a paragraph block, CommonMark parses this back as an ordered list item, not a paragraph, so a round-trip test for this case would fail.

The fix is to detect \d+\. at position 0 of a block's first text node and emit a backslash before the .. This is the standard CommonMark approach to prevent accidental list parsing.
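
A sketch of position-aware escaping for findings 4 and 5, assuming it runs only on the first text of a block. BlockStartEscaper is a hypothetical name, and the sketch covers list markers only; thematic breaks such as --- and markers that follow a hard break are left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: escapes list markers only where they are significant. */
final class BlockStartEscaper {
    // Ordered-list marker: 1-9 digits, then '.' or ')', then whitespace or end of text.
    private static final Pattern ORDERED = Pattern.compile("^(\\d{1,9})([.)])(?=\\s|$)");
    // Bullet-list marker: '-', '+' or '*', then whitespace or end of text.
    private static final Pattern BULLET = Pattern.compile("^([-+*])(?=\\s|$)");

    static String escapeBlockStart(String text) {
        Matcher ordered = ORDERED.matcher(text);
        if (ordered.find()) {
            // Insert the backslash between the digits and the delimiter: "1. x" -> "1\. x".
            return ordered.group(1) + "\\" + ordered.group(2) + text.substring(ordered.end());
        }
        if (BULLET.matcher(text).find()) {
            return "\\" + text;
        }
        return text; // mid-sentence hyphens and dots are left alone
    }
}
```

Text such as state-of-the-art or 3.14 passes through untouched, while 1. Something and - item gain a single backslash.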

Design concerns (non-blocking but worth noting)

6. captureChild is O(n²) in document size

captureChild (line 655) snapshots and restores the entire StringBuilder output for each block child inside a list item:

final StringBuilder saved = new StringBuilder(out);  // copies full output so far
out.setLength(0);
// ... render child ...
return out.toString();
// finally: out.append(saved)

For a document with 200 list items and an accumulated output of 50 KB, this copies ~50 KB per item. A simple fix is to use an index-based approach: record out.length() before, render, extract out.substring(mark), then reset out.setLength(mark).
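
The index-based approach could be sketched as follows. CaptureDemo, append and output are hypothetical scaffolding around a shared StringBuilder; only the captureChild idea comes from the review.

```java
/** Hypothetical writer showing index-based capture on a shared buffer. */
final class CaptureDemo {
    private final StringBuilder out = new StringBuilder();

    /** Renders the child into the shared buffer, then cuts the new tail back out. */
    String captureChild(Runnable renderChild) {
        int mark = out.length();        // where the child's output begins
        try {
            renderChild.run();
            return out.substring(mark); // copies only the child's own output
        } finally {
            out.setLength(mark);        // restore the buffer even if rendering throws
        }
    }

    void append(String s) {
        out.append(s);
    }

    String output() {
        return out.toString();
    }
}
```

Each call now copies only the child's own output instead of everything rendered so far, and nested captures still work because each level restores its own mark.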

7. Double-parse in toMarkdown(JSONObject)

TiptapMarkdown.toMarkdown(JSONObject) calls tiptap.toString() and then MAPPER.readTree(string). The JSONObject was itself parsed from JSON originally, so this is a JSON→string→JSON roundtrip. Minor but wasteful.

Clean

The whitespace-lifting logic (trailing/leading whitespace extracted before/after mark open/close delimiters), the dynamic fence width for triple-backtick bodies, and the mark precedence/reopening logic in renderInline are all correct and the test coverage for those cases is thorough.
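
For readers unfamiliar with whitespace lifting, a minimal single-mark sketch is shown here. WhitespaceLifter.wrap is a hypothetical name and handles one delimiter only; the PR's renderInline also deals with nested marks and reopening, which this does not attempt, and the text is assumed to be escaped already.

```java
/** Hypothetical helper: keeps emphasis delimiters tight against non-whitespace text. */
final class WhitespaceLifter {
    static String wrap(String text, String delimiter) {
        int start = 0;
        int end = text.length();
        while (start < end && Character.isWhitespace(text.charAt(start))) {
            start++;
        }
        while (end > start && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        if (start == end) {
            return text; // whitespace only: emit no delimiters at all
        }
        // Leading whitespace goes before the opener, trailing whitespace after the closer.
        return text.substring(0, start) + delimiter + text.substring(start, end)
                + delimiter + text.substring(end);
    }
}
```

An italic span holding "x " is emitted as *x* followed by the space, rather than the invalid *x *.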

Mirrors the naming of the existing toHtml() Renderable surface so Velocity callers use $contentlet.storyBlock.toMarkdown alongside $contentlet.storyBlock.toHtml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

hassandotcms · 2026-06-11T14:05:30Z

QA Findings

Tested the usable methods end-to-end:

$dotContentMap.body.toMarkdown() on a page: works, correct markdown
Both $markdown.blockToMarkdown() overloads in the Playground: identical output to toMarkdown()
REST _search after conversions: block field unchanged, still the normal JSON tree
toHtml() and $markdown.parse(): no regression, same output as before
Called conversion repeatedly: read-only, deterministic, storage byte-identical

JSON → Markdown

Contentlet with every block type: headings/lists/quotes/code/tables/images/youtube all convert correctly
Edge cases (empty docs, missing attrs, heading level 99, unicode, non-JSON input): handled
Unknown custom blocks and grids: wrapper/layout dropped, inner text kept
dotContent → nothing. Embedded contentlet titled "embed-target": toHtml() gives <h2>embed-target</h2>, toMarkdown() gives empty (need to fix: add case "dotContent" in MarkdownWriter.renderNode() switch, emit attrs.data.title)
dotVideo and legacy aiContent → nothing (aiContent keeps its text in attrs.content; that text is gone) (need to fix: same switch; render dotVideo as a link like the youtube case; for aiContent emit the attrs.content text)
List inside a table cell vanishes silently: | cell with list | comes out as | |, no log (need to fix: MarkdownWriter.renderCellInline(); handle non-paragraph blocks or at least log like the unknown-node path does)
WYSIWYG-converted field → raw HTML returned verbatim, not converted
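
The three missing node types reported above could be handled along these lines. This is a sketch over plain Map values rather than the PR's JSON types: DotNodeFallbacks is a hypothetical class, attrs.data.title and attrs.content are the paths named in the findings, and the src attribute and link shape for dotVideo are assumptions.

```java
import java.util.Map;

/** Hypothetical stand-in for the extra cases suggested for MarkdownWriter.renderNode(). */
final class DotNodeFallbacks {
    @SuppressWarnings("unchecked")
    private static Map<String, Object> map(Object value) {
        return value instanceof Map ? (Map<String, Object>) value : Map.<String, Object>of();
    }

    private static String text(Object value) {
        return value == null ? "" : value.toString();
    }

    static String render(Map<String, Object> node) {
        Map<String, Object> attrs = map(node.get("attrs"));
        switch (text(node.get("type"))) {
            case "dotContent":
                // Embedded contentlet: emit its title, as toHtml() does.
                return text(map(attrs.get("data")).get("title"));
            case "aiContent":
                // Legacy AI block keeps its text in attrs.content.
                return text(attrs.get("content"));
            case "dotVideo": {
                // Assumption: the video URL lives in attrs.src, like the youtube node.
                String src = text(attrs.get("src"));
                return src.isEmpty() ? "" : "[" + src + "](" + src + ")";
            }
            default:
                return "";
        }
    }
}
```

Missing attrs fall through to an empty string instead of throwing, in line with the converter's graceful-degradation behavior.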

Markdown → JSON (toTiptap)

Generic image node (what toTiptap() emits) fed through the system: toHtml() renders an empty <h2>, the editor shows the body blank, and saving wiped the document to one empty paragraph. Same flow with known node types: everything survives (need to fix pre-merge: TiptapBuilder.visit(Image) should emit dotImage instead of image; the serializer already accepts both)
HTML input: <h1>Hello</h1> becomes a paragraph with the literal text <h1>Hello</h1>
Inline HTML: before <br> after loses the <br> silently, including the <br> the converter itself writes for multi-paragraph cells, so its own tables don't survive re-parse (need to fix: add visit(HtmlInline) to TiptapBuilder; map <br> to hardBreak, keep the rest as text like HtmlBlock does)
Already-valid Tiptap JSON in: re-parsed as markdown, stored as one literal JSON string; nothing detects "already JSON" (need to fix: add a static isTiptapDoc(String) to TiptapMarkdown; parse and check that type=="doc" and content is an array)
GFM: - [ ] task one → bullet with literal text [ ] task one; |:---:| alignment dropped; bare URLs stay plain text. Reference links and setext headings convert fine (need to fix: alignment only)

Tested in Editor Mode

Attribute check: link target="_blank", textAlign: center, image width/height, youtube start: all in JSON, all gone after JSON→MD→JSON. Text "1. not a list" comes back as an ordered list (need to fix: Javadoc only; document the attr losses and the "1." edge case; markdown can't express these)
Same markdown via the editor's "Paste from Markdown" vs toTiptap(): editor produces dotImage, fills cell align attrs, autolinks bare URLs, maps →underline and <br>→hardBreak. Same input, two different stored results depending on the path (covered by the dotImage, alignment, and HtmlInline fixes above)
Opened/re-saved the full test contentlet in the editor: all known nodes survive, only attrs normalized

github-project-automation Bot added this to dotCMS - Product Planning May 15, 2026

github-actions Bot added the Area : Backend PR changes Java/Maven backend code label May 15, 2026

claude Bot added the AI: Safe To Rollback label May 15, 2026

github-actions Bot removed the AI: Safe To Rollback label May 15, 2026

claude Bot added the AI: Safe To Rollback label May 15, 2026

wezell added the Team : Scout label May 15, 2026

This was referenced May 19, 2026

fix(ci): harden PR→issue linking gate against shell injection and missed refs #35761

Merged

harden PR→issue linking gate against shell injection and missed refs #35794

Closed

Merge branch 'main' into issue-35727-tiptap-markdown-converter

d2918d9

github-actions Bot removed the AI: Safe To Rollback label Jun 5, 2026

claude Bot added the AI: Safe To Rollback label Jun 5, 2026

dotCMS deleted a comment from github-actions Bot Jun 5, 2026

Merge branch 'main' into issue-35727-tiptap-markdown-converter

129d295

github-actions Bot mentioned this pull request Jun 5, 2026

Convert Tiptap Story Block content to Markdown #35727

Open

github-actions Bot removed the AI: Safe To Rollback label Jun 5, 2026

fmontes mentioned this pull request Jun 5, 2026

Workflow fire: convert Block Editor HTML/Markdown to ProseMirror JSON on save (server-side) #36002

Open

6 tasks

claude Bot added the AI: Safe To Rollback label Jun 5, 2026

fmontes mentioned this pull request Jun 5, 2026

dotAI: embed Block Editor (Story Block) fields as Markdown instead of HTML-stripped text #36003

Open

5 tasks

wezell enabled auto-merge June 5, 2026 16:37

wezell requested a review from fmontes June 9, 2026 17:16

fmontes added Team : Modernization and removed Team : Scout labels Jun 10, 2026

wezell wants to merge 4 commits into main from issue-35727-tiptap-markdown-converter
