Added very (blazingly) fast implementation (🦀) #3
Open
AntoniosBarotsis wants to merge 1 commit into Arraying:main
Also, to the surprise of no one, it was quite a bit faster.
Following the extensive discussion outlined in #2, I decided to implement the issue.
Benchmarks
Bit more seriously
There weren't many libraries for parsing mbox files, so I did that myself. I tested it on my Gmail archive, which was 2 GB, and the only place it falls short compared to the Python version is some cases that look like this:
where I was too lazy to make the parsing work, so I just skipped them. In the entire file there were 2 instances of this (both from Microsoft), so this doesn't seem like a big deal.
On the other hand, somewhat surprisingly, the Rust version detected around 20 more emails than the Python version. Some make sense, and I'm not sure why they aren't in the Python version, but some straight up just look weird:
This one, for example, should not be split over multiple lines, but here we are. The Python version completely ignores it (as it is technically broken) and the Rust one prints
`d2lsupport@tudelft.brightspace.co`. I did not read through the RFC long enough to figure out if newlines are permitted in the `addr-spec` element 😎

But anyway, the inconsistencies were in 20 out of the 1200 total entries, so for the most part it's fine. You might get some extra junk here and there and, except for whatever that long thing Microsoft used, I at least didn't get anything less in my data file. I've also added a test that essentially diffs the Python and Rust outputs, in case you're interested 👍
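
For anyone curious what "diffing the two outputs" can look like, here is a minimal sketch. It is not the test from this PR. It assumes both implementations print one address per line, and the names `to_set` and `diff_outputs`, as well as the sample addresses, are made up for illustration.

```rust
use std::collections::BTreeSet;

/// Collect the non-empty, trimmed lines of an output into a sorted set.
/// Assumption: each implementation prints one address per line.
fn to_set(output: &str) -> BTreeSet<String> {
    output
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect()
}

/// Return the entries found only in the Python output and only in the Rust output.
fn diff_outputs(python_out: &str, rust_out: &str) -> (Vec<String>, Vec<String>) {
    let py = to_set(python_out);
    let rs = to_set(rust_out);
    let only_py = py.difference(&rs).cloned().collect();
    let only_rs = rs.difference(&py).cloned().collect();
    (only_py, only_rs)
}

fn main() {
    // Hypothetical sample outputs, not data from the real archive.
    let python_out = "alice@example.com\nbob@example.com\n";
    let rust_out = "alice@example.com\nbob@example.com\ncarol@example.org\n";

    let (only_py, only_rs) = diff_outputs(python_out, rust_out);
    println!("only in Python output: {:?}", only_py);
    println!("only in Rust output: {:?}", only_rs);
}
```

Using sets rather than a line-by-line diff means ordering and duplicate entries don't count as mismatches, so only the addresses one side actually misses show up.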