Ensure proper cleanup for <pre> tags containing complex HTML #22

Closed
0x1b2c wants to merge 1 commit into kepano:main from 0x1b2c:handle-complex-pre

0x1b2c (Contributor) commented Apr 1, 2025

This PR improves compatibility with pages where <pre> tags contain block-level elements or line breaks (e.g., <p>, <ul>, <br>), which Defuddle did not clean up correctly.

Problem: The previous logic treated all <pre>-like elements as potential code blocks. Complex HTML inside them was formatted incorrectly because the standard cleanup steps ignore the internals of <pre>.

Solution: The rule handling preformatted elements (codeBlockRules in code.ts) now checks the children of <pre> (and similar containers):

  1. If block-level elements or <br> are found, the <pre> is converted to a <div>. This ensures its content is treated as standard HTML by later cleanup steps.
  2. Otherwise, it's standardized into a <pre><code> block (maintaining the original behavior for simple preformatted text and code).
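
The two branches above can be sketched roughly as follows. This is a minimal illustration and not the actual code in code.ts: the `SketchElement` type, the `BLOCK_OR_BREAK_TAGS` set, and the `classifyPre` helper are hypothetical names, and both the tag list and the choice to look at nested descendants (rather than direct children only) are assumptions.

```typescript
// Hypothetical stand-in for a DOM element, so the sketch runs without a browser.
interface SketchElement {
  tagName: string;
  children: SketchElement[];
}

// Assumed list of tags that mark a <pre> as "complex"; the real rule may use a different set.
const BLOCK_OR_BREAK_TAGS = new Set<string>([
  "P", "DIV", "UL", "OL", "LI", "BLOCKQUOTE", "TABLE", "BR",
]);

// True if any descendant is a block-level element or a <br>.
function containsBlockOrBreak(el: SketchElement): boolean {
  return el.children.some(
    (child) =>
      BLOCK_OR_BREAK_TAGS.has(child.tagName.toUpperCase()) ||
      containsBlockOrBreak(child),
  );
}

type PreAction = "convert-to-div" | "standardize-to-pre-code";

// Decide which of the two branches applies to a <pre>-like element.
function classifyPre(pre: SketchElement): PreAction {
  return containsBlockOrBreak(pre)
    ? "convert-to-div"
    : "standardize-to-pre-code";
}

// Usage: a <pre> wrapping a <p> is treated as regular HTML,
// while a <pre> holding only inline markup stays a code block.
const complexPre: SketchElement = {
  tagName: "PRE",
  children: [{ tagName: "P", children: [] }],
};
const simplePre: SketchElement = {
  tagName: "PRE",
  children: [{ tagName: "SPAN", children: [] }],
};
console.log(classifyPre(complexPre), classifyPre(simplePre));
```

Keeping the decision in a pure function like this would also make it easy to unit test separately from the DOM rewriting.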

Potential Refactoring:

The transform function now does two different things depending on the content. I'm not sure whether it's better to keep it this way for simplicity, or to split the logic for "convert complex <pre> to <div>" and "standardize to <pre><code>" into separate helper functions within code.ts. The filename and rule name could also be renamed to something like preformatedXxx.

Happy to discuss or explore this in a follow-up if you think it makes sense!

This PR was co-authored with Gemini 2.5 Pro.

kepano (Owner) commented Apr 2, 2025

Thanks. Can you provide an example of a page that didn't work before?

0x1b2c (Contributor, Author) commented Apr 3, 2025

I'm sorry, I didn't provide the URL directly because it's a site for pornographic novels. https://hlib.cc/n/15263018

Also, I understand that complex HTML tags should not exist inside <pre> tags. This PR is not "fixing" a bug; it tries to maximize compatibility with sites that don't respect web standards.

kepano added the stale label Mar 5, 2026
kepano closed this Mar 5, 2026