This is a fork of html2text with minor edits to better handle HTML documents filed with the SEC through SEC EDGAR.
Blockquote handling has been simplified, because the original handling makes text wrapping difficult. Handling of HTML character and entity references has been made more robust, so that references to invalid or non-printing characters no longer cause problems. This handling also strips out a number of badly formed characters. In addition, numeric references to the upper 128 byte values (128 to 255) are assumed to be Windows-1252, an encoding that was extremely common in filings until recently.
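The Windows-1252 fallback described above can be sketched as follows. This is only an illustration and not the fork's actual code: the names replace_numeric_refs and _decode_ref are hypothetical, and the choice to drop undefined or non-printing characters entirely is an assumption about what "invalid" references should become.

```python
import re

# Matches decimal (&#150;) and hexadecimal (&#x96;) numeric character references.
_NUMERIC_REF = re.compile(r"&#([xX][0-9a-fA-F]+|[0-9]+);")

# Control characters that are still useful in plain text.
_ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def _decode_ref(match):
    """Turn one numeric reference into text (hypothetical helper)."""
    body = match.group(1)
    code = int(body[1:], 16) if body[0] in "xX" else int(body)

    # Assumption: values 128-255 are Windows-1252 bytes, not Unicode code points.
    if 128 <= code <= 255:
        try:
            return bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            # A few bytes (e.g. 0x81) are undefined in Windows-1252; drop them.
            return ""

    # Drop non-printing control characters other than tab, LF, and CR.
    if (code < 0x20 and code not in _ALLOWED_CONTROLS) or code == 0x7F:
        return ""

    # Drop surrogates and values outside the Unicode range.
    if 0xD800 <= code <= 0xDFFF or code > 0x10FFFF:
        return ""

    return chr(code)


def replace_numeric_refs(text):
    """Replace every numeric character reference in text."""
    return _NUMERIC_REF.sub(_decode_ref, text)


if __name__ == "__main__":
    # &#147; and &#148; are curly quotes in Windows-1252; &#0; is dropped.
    print(replace_numeric_refs("&#147;Item 1A&#148; Risk&#0; Factors"))
```

Decoding through Python's built-in cp1252 codec keeps the mapping table out of the example; values 160 to 255 come out the same as they would under Latin-1, so only the 128 to 159 range actually changes.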
Several defaults are changed to better suit SEC filings:
IGNORE_ANCHORS = True
IGNORE_EMPHASIS = True
IGNORE_IMAGES = True
SINGLE_LINE_BREAK = True
IGNORE_TABLES = True
A new option, ESCAPE_NONE, has been added and is set to True by default, which prevents any escaping of Markdown characters after parsing.