The module is simple enough to use. This tutorial will get you started.
These are the methods you can get the module installed:-
For those who have pip, we got your back.
$ pip install html2text
Clone the repository from https://github.com/Alir3z4/html2text
$ git clone --depth 1 https://github.com/Alir3z4/html2text.git
$ python setup.py build
$ python setup.py install
Once installed the module can be used as follows.
import html2text
html = function_to_get_some_html()
text = html2text.html2text(html)
print(text)
This converts the provided html to text( Markdown text) with all the options set to default.
To customize the options provided by the module the usage is as follows:
import html2text
text_maker = html2text.HTML2Text()
text_maker.ignore_links = True
text_maker.bypass_tables = False
html = function_to_get_some_html()
text = text_maker.handle(html)
print(text)
All options exist in the config.py file. A list is provided here with simple indications of their function.
- UNICODE_SNOB for using unicode
- ESCAPE_SNOB for escaping every special character
- LINKS_EACH_PARAGRAPH for putting links after every paragraph
- BODY_WIDTH for wrapping long lines
- SKIP_INTERNAL_LINKS to skip #local-anchor things
- INLINE_LINKS for formatting images and links
- PROTECT_LINKS protect from line breaks
- GOOGLE_LIST_INDENT no of pixels to indent nested lists
- IGNORE_ANCHORS
- IGNORE_IMAGES
- IMAGES_TO_ALT
- IMAGES_WITH_SIZE
- IGNORE_EMPHASIS
- BYPASS_TABLES format tables in HTML rather than Markdown
- IGNORE_TABLES ignore table-related tags (table, th, td, tr) while keeping rows
- SINGLE_LINE_BREAK to use a single line break rather than two
- UNIFIABLE is a dictionary which maps unicode abbreviations to ASCII
values
- RE_SPACE for finding space-only lines
- RE_UNESCAPE for finding html entities like
- RE_ORDERED_LIST_MATCHER for matching ordered lists in MD
- RE_UNORDERED_LIST_MATCHER for matching unordered list matcher in MD
- RE_MD_CHARS_MATCHER for matching Md \,[,],( and )
- RE_MD_CHARS_MATCHER_ALL for matching `,*,_,{,},[,],(,),#,!
- RE_MD_DOT_MATCHER for matching lines starting with <space>1.<space>
- RE_MD_PLUS_MATCHER for matching lines starting with <space>+<space>
- RE_MD_DASH_MATCHER for matching lines starting with <space>(-)<space>
- RE_SLASH_CHARS a string of slash escapeable characters
- RE_MD_BACKSLASH_MATCHER to match \char
- USE_AUTOMATIC_LINKS to convert <a href='http://xyz'>http://xyz</a> to <http://xyz>
- MARK_CODE to wrap 'pre' blocks with [code]...[/code] tags
- WRAP_LINKS to decide if links have to be wrapped during text wrapping (implies INLINE_LINKS = False)
- DECODE_ERRORS to handle decoding errors. 'strict', 'ignore', 'replace' are the acceptable values.
To alter any option the procedure is to create a parser with
parser = html2text.HTML2Text() and to set the option on the parser.
example: parser.unicode_snob = True to set the UNICODE_SNOB option.
| Option | Description |
|---|---|
--version |
Show program version number and exit |
-h, --help |
Show this help message and exit |
--ignore-links |
Do not include any formatting for links |
--protect-links |
Protect links from line breaks surrounding them "+" with angle brackets |
--ignore-images |
Do not include any formatting for images |
--images-to-alt |
Discard image data, only keep alt text |
--images-with-size |
Write image tags with height and width attrs as raw html to retain dimensions |
-g, --google-doc |
Convert an html-exported Google Document |
-d, --dash-unordered-list |
Use a dash rather than a star for unordered list items |
-b BODY_WIDTH, --body-width=BODY_WIDTH |
Number of characters per output line, 0 for no wrap |
-i LIST_INDENT, --google-list-indent=LIST_INDENT |
Number of pixels Google indents nested lists |
-s, --hide-strikethrough |
Hide strike-through text. only relevant when -g is specified as well |
--escape-all |
Escape all special characters. Output is less readable, but avoids corner case formatting issues. |
--bypass-tables |
Format tables in HTML rather than Markdown syntax. |
--ignore-tables |
Ignore table-related tags (table, th, td, tr) while keeping rows. |
--single-line-break |
Use a single line break after a block element rather than two. |
--reference-links |
Use reference links instead of inline links to create markdown |
--ignore-emphasis |
Ignore all emphasis formatting in the html. |
-e, --asterisk-emphasis |
Use asterisk rather than underscore to emphasize text |
--unicode-snob |
Use unicode throughout instead of ASCII |
--no-automatic-links |
Do not use automatic links like http://google.com |
--no-skip-internal-links |
Turn off skipping of internal links |
--links-after-para |
Put the links after the paragraph and not at end of document |
--mark-code |
Mark code with [code]...[/code] blocks |
--no-wrap-links |
Do not wrap links during text wrapping. Implies --reference-links |
--decode-errors=HANDLER |
What to do in case an error is encountered. ignore, strict, replace etc. |
--pad-tables |
Use padding to make tables look good. |