Tokenizer: use Python objects to represent tokens #521

jayaddison · 2020-12-30T17:42:46Z

This change refactors the tokenizer module to use Python object instances where previously plain dictionaries were used to hold token state.

This builds upon #519, #520 and attempts to resolve #24.
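
As a rough illustration of the kind of change described above (not the code from this changeset), the sketch below contrasts a dictionary-based token with an object-based one. The class names `Token`, `Tag`, `StartTag` and `EndTag`, the helper `make_start_tag_dict`, and the exact dictionary layout are assumptions made for the example.

```python
# Hypothetical names throughout; these are not the classes from this PR.


# Before: a token is a plain dictionary keyed by strings
# (the layout here is an approximation for illustration).
def make_start_tag_dict(name):
    return {"type": "StartTag", "name": name, "data": [], "selfClosing": False}


# After: a small class hierarchy for tag-related tokens.
class Token(object):
    __slots__ = ()


class Tag(Token):
    # __slots__ avoids a per-instance __dict__ and fixes the attribute set.
    __slots__ = ("name", "attributes", "selfClosing")

    def __init__(self, name="", attributes=None, selfClosing=False):
        self.name = name
        self.attributes = attributes if attributes is not None else {}
        self.selfClosing = selfClosing


class StartTag(Tag):
    __slots__ = ()


class EndTag(Tag):
    __slots__ = ()


old_token = make_start_tag_dict("div")
new_token = StartTag(name="div")
new_token.attributes["class"] = "note"

# Dictionary tokens are dispatched on a field value...
assert old_token["type"] == "StartTag"
# ...whereas object tokens can be dispatched on their class.
assert isinstance(new_token, Tag) and not isinstance(new_token, EndTag)
```

With objects, a misspelled field name raises `AttributeError` at the point of use instead of silently creating a new dictionary key, at the cost of constructing a class instance for every token.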

jayaddison · 2020-12-30T17:45:21Z

NB: This isn't suitable for merge currently; it seems to introduce a noticeable performance penalty:

Before (2c19b98)

.........................................
html_parse_etree: Mean +- std dev: 202 ms +- 10 ms

After (8772408)

.........................................
html_parse_etree: Mean +- std dev: 226 ms +- 11 ms
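
For scale, the two figures above work out to a slowdown of roughly 12%. A minimal sketch of that arithmetic follows; the variable names are made up for the example, and comparing the difference against the combined standard deviation is only a crude sanity check, not a significance test.

```python
import math

# Figures copied from the benchmark output above (milliseconds).
before_mean, before_std = 202.0, 10.0  # commit 2c19b98
after_mean, after_std = 226.0, 11.0    # commit 8772408

delta = after_mean - before_mean
relative = delta / before_mean

# Crude noise estimate: standard deviations combined in quadrature.
combined_std = math.sqrt(before_std ** 2 + after_std ** 2)

print("slowdown: %.0f ms (%.1f%%)" % (delta, relative * 100))
print("combined std dev: %.1f ms" % combined_std)
```

The 24 ms difference is larger than the combined spread of about 15 ms, which fits the description of the penalty as noticeable.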

gsnedders · 2021-01-04T16:37:44Z

> This builds upon #519, #520 and attempts to resolve #24.

FYI, I'm leaving review of this until after those two.

jayaddison · 2022-12-24T01:03:21Z

Cleaning up some old / stale pull requests; please let me know if this changeset is considered worthwhile and I'll reopen if so.

jayaddison added 16 commits December 29, 2020 14:44

Consistency: consume a single character at a time during attribute name state

183d8a0

Refactor: pretranslate lowercase element and attribute names

2e86373

Restore self.currentToken safety check

8f96b17

Alternate approach: do not pretranslate temporary buffered data

a912842

Consistency: character consumption within double-escaped state

f9f370e

Refactor: use Python objects for tokens within tokenizer

bcee8bd

Introduce type hierarchy for tag-related tokens

67262f8

Simplify tag token construction

900bdaf

Refactor token attribution name/value accumulation

1f6cae9

Cleanup: remove leavingThisState / emitToken logic

695ac1c

Remove EmptyTag tokenizer token class

b1a444b

Refactor: pre-translate strings that are only used in lowercase context

bb7fabc

Cleanup: remove getattr anti-pattern

5f4ace9

Consistency: use camel-casing to correspond with existing codebase style

d744c86

Consistency: consume a single character at a time during attribute name state

1d62e69

Merge branch 'tokenizer/pretranslate-lowercase-names' into tokenizer/object-tokens

8772408

Linting cleanup

192cce0

Clarify method name: clearAttribute -> flushAttribute

e76e0dd

gsnedders mentioned this pull request Jan 5, 2021

Compile html5lib with Cython #524

Draft

Merge branch 'master' into tokenizer/object-tokens

da37332

jayaddison closed this Dec 24, 2022
