feat(tiptap): convert Story Block content to Markdown (#35727) #35728

Open
wezell wants to merge 4 commits into main from issue-35727-tiptap-markdown-converter

Conversation

@wezell wezell commented May 15, 2026

Member

Closes #35727

Summary

  • Adds com.dotcms.tiptap.TiptapMarkdown — bidirectional converter between Tiptap JSON (Story Block / ProseMirror) and Markdown.
  • Exposes it to Velocity templates:
    • $contentlet.storyBlock.toMarkdown() (StoryBlockMap)
    • $markdownTool.blockToMarkdown(json) (MarkdownTool)
  • Adds org.commonmark:commonmark plus commonmark-ext-gfm-tables and commonmark-ext-gfm-strikethrough (all 0.22.0). Zero transitive runtime deps (~250KB total).

What it handles

Nodes: paragraph, heading 1-6, blockquote, bulletList, orderedList, listItem, codeBlock (with language), horizontalRule, hardBreak, image, table/tableRow/tableHeader/tableCell, plus dotCMS-specific dotImage and youtube.

Marks: bold, italic, strike, code, link.

Graceful degradation: marks with no markdown equivalent (underline, highlight, subscript, superscript, textStyle, color) are dropped silently. Any other unknown node/mark logs once at INFO via Logger.info and is skipped — Tiptap is extensible, so the converter never throws on user-extended schemas.

Notable correctness details

  • Whitespace lifting. Markdown emphasis cannot close after a space (*x * is invalid). The serializer extracts trailing whitespace out of mark spans before emitting closers, and leading whitespace before openers, so output is always well-formed and parses back to the same structure.
  • Code-context escaping. Text inside inline code marks or codeBlock nodes is emitted literally — special chars are NOT backslash-escaped.
  • Dynamic fence width. A codeBlock whose body contains triple backticks gets a longer fence (4+ ticks) so the fence can't collide.
  • Pipe escaping in table cells; mark precedence (link > bold > italic > strike > code, outer→inner) deterministic.
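
A minimal sketch of two of the rules above, dynamic fence width and whitespace lifting. The class and method names (MarkdownEmitSketch, fenceFor, wrapMark) are hypothetical and are not taken from TiptapMarkdown; the sketch only illustrates the idea under the assumption that a mark wraps a single text run.

```java
final class MarkdownEmitSketch {

    private MarkdownEmitSketch() { }

    /**
     * Returns a backtick fence that is one tick longer than the longest
     * backtick run found anywhere in the body, with a minimum of three ticks.
     * Counting runs anywhere (not only at line starts) is deliberately conservative.
     */
    static String fenceFor(String body) {
        int longest = 0;
        int run = 0;
        for (int i = 0; i < body.length(); i++) {
            if (body.charAt(i) == '`') {
                run++;
                longest = Math.max(longest, run);
            } else {
                run = 0;
            }
        }
        return "`".repeat(Math.max(3, longest + 1));
    }

    /**
     * Wraps text in the given delimiter, moving leading and trailing
     * whitespace outside the delimiters so the emphasis can open and close.
     */
    static String wrapMark(String text, String delim) {
        int start = 0;
        int end = text.length();
        while (start < end && Character.isWhitespace(text.charAt(start))) {
            start++;
        }
        while (end > start && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        if (start == end) {
            return text; // whitespace-only run: emit no delimiters at all
        }
        return text.substring(0, start)
                + delim + text.substring(start, end) + delim
                + text.substring(end);
    }
}
```

With these helpers, wrapMark("x ", "*") yields "*x* " rather than the invalid "*x *", and a code body containing a triple-backtick run is fenced with four ticks.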

Test plan

  • TiptapMarkdownTest — 49 synthetic unit tests covering every supported node, every mark, escaping, fence-width, JSON-string overload, round-trip stability per node type.
  • TiptapMarkdownBlogContentTest — 7 tests against blog-test.json (trimmed to 2 real Story Block bodies, 122KB), verifying:
    • every node/mark in real content is supported
    • non-empty markdown output for every blog
    • re-parses to a non-empty Tiptap doc
    • reaches a stable fixed point after one normalization pass
    • distinctive text survives the round-trip
    • inline-code content emitted literally
  • Full module compile (./mvnw compile -pl :dotcms-core) clean.
  • Reviewer: manually hit a Story Block field via $contentlet.storyBlock.toMarkdown() in a Velocity template to sanity-check end-to-end wiring.

Out of scope (documented)

  • HTML blocks inside markdown are preserved as a paragraph of raw HTML text (no DOM parsing — Tiptap.js relies on the browser DOM here; the Java port does not).
  • youtube renders as a plain link to the video src (markdown has no native embed). Reviewer call: switch to an <iframe> HTML block if richer rendering is wanted.
  • Marks like underline etc. are intentionally lossy on the JSON→MD direction since markdown lacks the syntax.

🤖 Generated with Claude Code

…35727)

Adds TiptapMarkdown (com.dotcms.tiptap) which converts Story Block /
ProseMirror documents to markdown and back. Wires it into the existing
renderable surface so Velocity can call:

  $contentlet.storyBlock.toMd
  $markdownTool.blockToMarkdown(json)

Supports paragraph, heading 1-6, blockquote, bullet/ordered lists,
codeBlock with language, horizontalRule, hardBreak, image, GFM tables,
plus the dotCMS-specific dotImage and youtube extensions. Marks:
bold, italic, strike, code, link. Marks with no markdown equivalent
(underline, highlight, sub/superscript, textStyle, color) are dropped
silently; truly unknown nodes/marks log once at INFO and are skipped
so the converter never throws on user-extended Tiptap schemas.

Markdown -> Tiptap uses commonmark-java with GFM tables and
strikethrough extensions (zero transitive runtime deps).

Tests: 56 passing -- 49 synthetic + 7 against a trimmed real-blog
fixture (blog-test.json) covering every node and mark plus
fixed-point round-trip stability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the Area : Backend PR changes Java/Maven backend code label May 15, 2026
claude Bot commented May 15, 2026

Contributor

Claude finished @wezell's task in 3m 51s


PR Review: feat(tiptap): convert Story Block content to Markdown

  • Read changed files
  • Review TiptapMarkdown.java core logic
  • Review tests
  • Review API surface / integrations
  • Post findings

The overall design is sound and the test coverage is solid. A few real bugs and several things worth fixing before merge.


Bugs

1. Dead code: listStack and ListCtx are never used

TiptapMarkdown.java lines 461–462 and 1066–1069:

private final Deque<ListCtx> listStack = new ArrayDeque<>();  // never used
...
private static final class ListCtx { ... }  // never instantiated

The actual list rendering uses listDepth (a plain int counter) and direct recursion. The listStack/ListCtx are leftover from an earlier iteration. Remove them.


2. Test SUPPORTED_MARK_TYPES is incomplete — creates fragile fixture check

TiptapMarkdownBlogContentTest.java line 62 adds only "underline" to the set of silently-dropped marks, but TiptapMarkdown.isSilentlyDroppedMark() also silently drops highlight, subscript, superscript, textStyle, and color. If blog-test.json is ever updated with content that carries those marks, every_node_type_in_fixture_is_supported fails spuriously even though the converter handles them correctly.

// Missing from SUPPORTED_MARK_TYPES:
SUPPORTED_MARK_TYPES.add("highlight");
SUPPORTED_MARK_TYPES.add("subscript");
SUPPORTED_MARK_TYPES.add("superscript");
SUPPORTED_MARK_TYPES.add("textStyle");
SUPPORTED_MARK_TYPES.add("color");


3. Image src and link href values are embedded verbatim — spaces produce invalid markdown

renderImage (line 717) and closeMark for "link" (line 1006) write the URL directly into the output:

sb.append("![](").append(src);  // src unescaped
...
final StringBuilder sb = new StringBuilder("](").append(href);         // href unescaped

A src or href containing a space (e.g. https://example.com/my image.png) produces invalid CommonMark syntax. The URL should be percent-encoded (https://example.com/my%20image.png) or wrapped in angle brackets (<https://example.com/my image.png>). In CommonMark a bare link destination cannot contain spaces, so the destination ends at the first space and, unless the remainder happens to be a valid quoted title, the whole construct is no longer parsed as a link or image.

Additionally, title.replace("\"", "\\\"") in renderImage (line 718) and closeMark (line 1007) does not escape backslashes first, so a title of foo \"bar becomes foo \\"bar which is still malformed.
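
One possible shape for the fix, as a hedged sketch: escape backslashes before quotes in titles, and wrap destinations in angle brackets when they contain characters a bare destination cannot carry. LinkEscapeSketch and its methods are hypothetical names, not the PR's renderImage/closeMark code, and the sketch assumes URLs contain no line endings (those are invalid even inside angle brackets).

```java
final class LinkEscapeSketch {

    private LinkEscapeSketch() { }

    /** Escapes a title for use inside double quotes: backslashes first, then quotes. */
    static String escapeTitle(String title) {
        return title.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /**
     * Returns the URL unchanged when it is safe as a bare destination;
     * otherwise wraps it in angle brackets, escaping backslash, '<' and '>'.
     * Wrapping on any parenthesis is conservative: balanced parentheses
     * would also be accepted bare.
     */
    static String formatDestination(String url) {
        boolean plain = !url.isEmpty();
        for (int i = 0; i < url.length() && plain; i++) {
            char c = url.charAt(i);
            plain = c > ' ' && c != '(' && c != ')' && c != '<' && c != '>' && c != '\\';
        }
        if (plain) {
            return url;
        }
        String escaped = url.replace("\\", "\\\\").replace("<", "\\<").replace(">", "\\>");
        return "<" + escaped + ">";
    }

    /** Builds an inline link; the link text is assumed to be escaped already. */
    static String inlineLink(String text, String href, String title) {
        StringBuilder sb = new StringBuilder("[").append(text).append("](")
                .append(formatDestination(href));
        if (title != null && !title.isEmpty()) {
            sb.append(" \"").append(escapeTitle(title)).append('"');
        }
        return sb.append(')').toString();
    }
}
```

Percent-encoding the space instead of wrapping would work as well; angle brackets are used here only because they leave the original URL string untouched.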


4. escapeText escapes - everywhere, producing ugly output

escapeText (line 1029) includes '-' in the list of characters to backslash-escape. Every hyphen in prose text gets escaped: "state-of-the-art" becomes "state\-of\-the\-art". While this technically round-trips through CommonMark, markdown is commonly consumed by tools other than CommonMark (Pandoc, GitHub, Jekyll, etc.), and visual inspection of the output is unpleasant. A hyphen is only significant as a list marker when it appears as - or --- at the start of a line — it never needs escaping mid-sentence. The same applies to . (see next point).


5. Paragraphs starting with \d+\. are not escaped — round-trip breaks

escapeText does not escape . at all. A paragraph whose first text node is "1. Something" therefore serializes to:

1. Something

As the first line of a paragraph block, CommonMark parses this back as an ordered list item, not a paragraph, so a round-trip test for this case would fail.

The fix is to detect \d+\. at position 0 of a block's first text node and emit a backslash before the dot (1\. Something). This is the standard CommonMark approach to prevent accidental list parsing.
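
A rough sketch of position-aware escaping that would cover both this point and the hyphen issue above: only a list marker at the very start of a block's first line is escaped, and hyphens mid-sentence are left alone. BlockStartEscapeSketch and escapeBlockStart are hypothetical names; the sketch assumes it runs on the first line of a block after ordinary inline escaping, and it does not handle thematic breaks or setext underlines.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BlockStartEscapeSketch {

    // Ordered-list marker: 1 to 9 digits, then '.' or ')', then whitespace or end of line.
    private static final Pattern ORDERED =
            Pattern.compile("^(\\d{1,9})([.)])(?=\\s|$)");

    // Bullet marker: '-' or '+', then whitespace or end of line.
    private static final Pattern BULLET =
            Pattern.compile("^([-+])(?=\\s|$)");

    private BlockStartEscapeSketch() { }

    /** Escapes a leading list marker so the line stays a paragraph when re-parsed. */
    static String escapeBlockStart(String line) {
        Matcher ordered = ORDERED.matcher(line);
        if (ordered.find()) {
            // Insert the backslash between the digits and the delimiter: "1\. Something"
            return ordered.group(1) + "\\" + ordered.group(2) + line.substring(ordered.end());
        }
        if (BULLET.matcher(line).find()) {
            return "\\" + line;
        }
        return line;
    }
}
```

Under these assumptions, "1. Something" becomes "1\. Something", while "state-of-the-art" and "3.14 is pi" pass through unchanged.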


Design concerns (non-blocking but worth noting)

6. captureChild is O(n²) in document size

captureChild (line 655) snapshots and restores the entire StringBuilder output for each block child inside a list item:

final StringBuilder saved = new StringBuilder(out);  // copies full output so far
out.setLength(0);
// ... render child ...
return out.toString();
// finally: out.append(saved)

For a document with 200 list items and an accumulated output of 50 KB, this copies ~50 KB per item. A simple fix is to use an index-based approach: record out.length() before, render, extract out.substring(mark), then reset out.setLength(mark).
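
The index-based approach described above could look roughly like this. CaptureSketch, emit, and output are hypothetical stand-ins for the writer's internals; only the mark/substring/setLength pattern is the point.

```java
final class CaptureSketch {

    private final StringBuilder out = new StringBuilder();

    /** Appends rendered text to the shared output buffer. */
    void emit(String s) {
        out.append(s);
    }

    /** Returns everything written so far. */
    String output() {
        return out.toString();
    }

    /**
     * Renders a child into the shared buffer, returns only what the child
     * wrote, and truncates the buffer back to where it was. Only the child's
     * own output is copied, never the accumulated document.
     */
    String captureChild(Runnable renderChild) {
        final int mark = out.length();
        try {
            renderChild.run();
            return out.substring(mark);
        } finally {
            out.setLength(mark); // also restores the buffer if rendering throws
        }
    }
}
```

Each capture then costs time proportional to the child's output rather than to the whole buffer, which removes the quadratic term.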

7. Double-parse in toMarkdown(JSONObject)

TiptapMarkdown.toMarkdown(JSONObject) calls tiptap.toString() and then MAPPER.readTree(string). The JSONObject was itself parsed from JSON originally, so this is a JSON→string→JSON roundtrip. Minor but wasteful.


Clean

The whitespace-lifting logic (trailing/leading whitespace extracted before/after mark open/close delimiters), the dynamic fence width for triple-backtick bodies, and the mark precedence/reopening logic in renderInline are all correct and the test coverage for those cases is thorough.

Mirrors the naming of the existing toHtml() Renderable surface so
Velocity callers use `$contentlet.storyBlock.toMarkdown` alongside
`$contentlet.storyBlock.toHtml`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hassandotcms

Member

QA Findings

Tested the usable methods end-to-end:

  • $dotContentMap.body.toMarkdown() on a page: works, correct markdown
  • Both $markdown.blockToMarkdown() overloads in the Playground: identical output to toMarkdown()
  • REST _search after conversions: block field unchanged, still the normal JSON tree
  • toHtml() and $markdown.parse(): no regression, same output as before
  • Called conversion repeatedly: read-only, deterministic, storage byte-identical

JSON → Markdown

  • Contentlet with every block type: headings/lists/quotes/code/tables/images/youtube all convert correctly
  • Edge cases (empty docs, missing attrs, heading level 99, unicode, non-JSON input): handled
  • Unknown custom blocks and grids: wrapper/layout dropped, inner text kept
  • dotContent → nothing. Embedded contentlet titled "embed-target": toHtml() gives <h2>embed-target</h2>, toMarkdown() gives empty (need to fix: add case "dotContent" in the MarkdownWriter.renderNode() switch and emit attrs.data.title)
  • dotVideo and legacy aiContent → nothing (aiContent keeps its text in attrs.content; that text is gone) (need to fix: same switch; render dotVideo as a link like the youtube case, and emit the attrs.content text for aiContent)
  • List inside a table cell vanishes silently: | cell with list | comes out as | |, with no log (need to fix: MarkdownWriter.renderCellInline() should handle non-paragraph blocks, or at least log like the unknown-node path does)
  • WYSIWYG-converted field → raw HTML returned verbatim, not converted

Markdown → JSON (toTiptap)

  • Generic image node (what toTiptap() emits) fed through the system: toHtml() renders an empty <h2>, the editor shows the body blank, and saving wiped the document to one empty paragraph. With known node types in the same flow, everything survives (need to fix pre-merge: TiptapBuilder.visit(Image) should emit dotImage instead of image; the serializer already accepts both)
  • HTML input: <h1>Hello</h1> becomes a paragraph with the literal text <h1>Hello</h1>
  • Inline HTML: "before <br> after" loses the <br> silently, including the <br> the converter itself writes for multi-paragraph cells, so its own tables don't survive re-parse (need to fix: add visit(HtmlInline) to TiptapBuilder; map <br> to hardBreak and keep the rest as text, as HtmlBlock does)
  • Already-valid Tiptap JSON in: re-parsed as markdown and stored as one literal JSON string; nothing detects "already JSON" (need to fix: add a static isTiptapDoc(String) to TiptapMarkdown that parses the input and checks that type=="doc" and content is an array)
  • GFM: "- [ ] task one" → bullet with literal text "[ ] task one"; |:---:| alignment dropped; bare URLs stay plain text. Reference links and setext headings convert fine (need to fix: alignment only)

Tested in Editor Mode

  • Attribute check: link target="_blank", textAlign: center, image width/height, and youtube start are all in the JSON and all gone after JSON→MD→JSON. The text "1. not a list" comes back as an ordered list (need to fix: Javadoc only; document the attr losses and the "1." edge case, since markdown can't express these)
    • Same markdown via the editor's "Paste from Markdown" vs toTiptap(): the editor produces dotImage, fills cell align attrs, autolinks bare URLs, and maps →underline and <br>→hardBreak. Same input, two different stored results depending on the path (covered by the dotImage, alignment, and HtmlInline fixes above)
  • Opened/re-saved the full test contentlet in the editor: all known nodes survive, only attrs normalized

Development

Successfully merging this pull request may close these issues.

Convert Tiptap Story Block content to Markdown

3 participants