arthrod/html2text_edgar

html2text_edgar

This is a fork of html2text with minor edits to better handle HTML documents filed with the SEC through SEC EDGAR.

Blockquote handling has been simplified, because blockquotes make text wrapping difficult. Handling of HTML character and entity references has been made more robust, so that references to invalid or non-printing characters no longer cause problems; this handling also cleans up a number of badly formatted characters. The fork also contains handlers that assume Windows-1252 for any reference to the upper 128 byte values (128 to 255), an encoding that was extremely common in filings until recently.
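The Windows-1252 assumption can be illustrated with a minimal standard-library sketch. This is not the fork's actual code: the names CHARREF_RE, decode_charref and decode_charrefs are hypothetical, and the choice to drop undefined bytes and control characters is an assumption about what reasonable cleanup looks like.

```python
import re

# Matches decimal (&#147;) and hexadecimal (&#x93;) numeric character references.
CHARREF_RE = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def decode_charref(match: re.Match) -> str:
    """Decode one numeric reference, treating 128-255 as Windows-1252 bytes."""
    hex_digits, dec_digits = match.groups()
    code = int(hex_digits, 16) if hex_digits else int(dec_digits)
    if 128 <= code <= 255:
        # Assumption: the filer meant a Windows-1252 byte, not a Unicode
        # code point. Bytes undefined in cp1252 (e.g. 0x81) are dropped.
        return bytes([code]).decode("cp1252", errors="ignore")
    if code < 32 and code not in (9, 10, 13):
        # Assumption: non-printing control characters are simply removed.
        return ""
    try:
        return chr(code)
    except (ValueError, OverflowError):
        # References beyond the Unicode range are invalid; drop them.
        return ""


def decode_charrefs(text: str) -> str:
    """Replace every numeric character reference in text."""
    return CHARREF_RE.sub(decode_charref, text)


print(decode_charrefs("&#147;Registrant&#148;"))  # “Registrant”
```

Here &#147; and &#148; come out as curly quotation marks, as they would in Windows-1252, and not as the C1 control characters that code points 147 and 148 denote in Unicode.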

Several defaults are changed to better suit SEC filings:

IGNORE_ANCHORS = True
IGNORE_EMPHASIS = True
IGNORE_IMAGES = True
SINGLE_LINE_BREAK = True
IGNORE_TABLES = True

A new option, ESCAPE_NONE, is added and set to True so that markdown characters are not escaped after parsing.
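To show what skipping the escaping step means in practice, here is a simplified sketch. It is not the library's implementation: finalize_text and MD_SPECIAL_RE are hypothetical names, and the character set is an illustrative assumption (the real escaping rules in html2text are more selective).

```python
import re

# Simplified, illustrative set of markdown-significant characters.
# Assumption: the real escaping rules are more selective than this.
MD_SPECIAL_RE = re.compile(r"([\\`*_\[\]#])")


def finalize_text(text: str, escape_none: bool = True) -> str:
    """Hypothetical post-parse step: leave text untouched when escape_none is set."""
    if escape_none:
        return text
    # Otherwise prefix each markdown-significant character with a backslash.
    return MD_SPECIAL_RE.sub(r"\\\1", text)
```

With escape_none left at True, a footnote marker such as the asterisk in "Net income*" passes through unchanged; with it set to False, the same text would gain a backslash before the asterisk.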
