
Added very (blazingly) fast implementation (🦀) #3

Open
AntoniosBarotsis wants to merge 1 commit into Arraying:main from AntoniosBarotsis:drastically-improve-entire-project


@AntoniosBarotsis

Following the extensive discussion outlined in #2, I decided to implement the issue.

Benchmarks

  • Python
    • probably very slow 👎
    • p*thon 🤮
    • p*thon for loops 🤮🤮
  • Rust
    • rust
    • 🦀
    • (probably) blazingly fast 🚀
    • written in rust (by the way) (btw)

Bit more seriously

There weren't many libraries for parsing mbox files, so I wrote the parser myself. I tested it on my Gmail archive, which was 2GB, and the only place it falls short compared to the Python version is headers that look like this:

From: =?utf-8?Q?=CE=9D=CE=99=CE=9A=CE=9F=CE=9B=CE=91=CE=9F=CE=A3_=CE=9C?=
 =?utf-8?Q?=CE=A0=CE=91=CE=A1=CE=9F=CE=A4=CE=A3=CE=97=CE=A3_=CE=BC=CE?=
 =?utf-8?Q?=AD=CF=83=CF=89_=CF=84=CE=B7=CF=82_=CF=85=CF=80=CE=B7=CF=81?=
 =?utf-8?Q?=CE=B5=CF=83=CE=AF=CE=B1=CF=82_=CF=86=CF=89=CE=BD=CE=B7=CF?=
 =?utf-8?Q?=84=CE=B9=CE=BA=CE=BF=CF=8D_=CF=84=CE=B1=CF=87=CF=85=CE=B4?=
 =?utf-8?Q?=CF=81=CE=BF=CE=BC=CE=B5=CE=AF=CE=BF=CF=85?=
 <sbvmsvc9@microsoft.com>

I was too lazy to make the parsing work for these, so I just skipped them. In the entire file there were 2 instances of this (both from Microsoft), so it doesn't seem like a big deal.
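For reference, the `=?utf-8?Q?...?=` pieces in that header are RFC 2047 encoded-words, and Python's standard library can decode them, which may be why the Python version copes with them. Below is a minimal sketch of that decoding. The helper name `decode_from_header` and the sample header are made up for illustration, and this is not necessarily what the existing Python script does.

```python
from email.header import decode_header, make_header
from email.utils import parseaddr


def decode_from_header(raw: str) -> tuple[str, str]:
    """Decode RFC 2047 encoded-words and return (display_name, address).

    Hypothetical helper for illustration; not taken from the PR.
    """
    # decode_header splits the value into (bytes, charset) chunks and
    # make_header joins them back into one readable string.
    decoded = str(make_header(decode_header(raw)))
    return parseaddr(decoded)


# Made-up folded header: the two UTF-8 bytes of the second letter are split
# across two encoded-words, like in the Microsoft example.
sample = "=?utf-8?Q?=CE=B1=CE?=\n =?utf-8?Q?=B2?= <someone@example.com>"
name, address = decode_from_header(sample)
print(address)
```

A Rust port would need to do the same three things: unfold the continuation lines, decode each Q (or B) encoded-word to bytes, and only then interpret the concatenated bytes in the declared charset.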

On the other hand, somewhat surprisingly, the Rust version detected around 20 more emails than the Python version. Some make sense and I'm not sure why they aren't in the Python version's output, but some straight up just look weird:

From: d2lsupport@tudelft.brightspace.com <d2lsupport@tudelft.brightspace.co=
m>

This, for example, should not be split over multiple lines, but here we are. The Python version completely ignores it (as it is technically broken) and the Rust one prints d2lsupport@tudelft.brightspace.co. I did not read through the RFC long enough to figure out if newlines are permitted in the addr-spec element 😎
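One possible reading, and it is only a guess: the trailing `=` looks like a quoted-printable soft line break (RFC 2045), which would mean this `From:` line sits inside an encoded message body rather than in a real header. If that is the case, undoing the soft break recovers the full address. The sketch below uses Python's `quopri` module; the function name `undo_soft_breaks` is made up for illustration.

```python
import quopri


def undo_soft_breaks(raw: bytes) -> bytes:
    """Decode quoted-printable text, which also removes '=' soft line breaks.

    Assumes the input really is quoted-printable; that is a guess about
    the line quoted in this PR, not something the PR confirms.
    """
    return quopri.decodestring(raw)


sample = (
    b"From: d2lsupport@tudelft.brightspace.com "
    b"<d2lsupport@tudelft.brightspace.co=\nm>\n"
)
print(undo_soft_breaks(sample).decode("ascii"))
```

If the guess holds, skipping `From:` lines found in message bodies would remove this class of extra results without touching the address parsing itself.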

But anyway, the inconsistencies were in 20 out of the 1200 total entries, so for the most part it's fine. You might get some extra junk here and there but, except for whatever that long thing Microsoft used is, at least nothing was missing from my data file. I've also added a test that essentially diffs the Python and Rust outputs in case you're interested 👍
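The idea behind such a diff can be sketched as a set comparison. This is not the test from this PR, just an illustration that assumes both programs emit one address per line; `diff_outputs` and the sample data are hypothetical.

```python
def diff_outputs(py_lines, rs_lines):
    """Return (only_in_python, only_in_rust) as sorted lists.

    Assumes one address per line; blank lines are ignored and
    addresses are compared case-insensitively.
    """
    py = {line.strip().lower() for line in py_lines if line.strip()}
    rs = {line.strip().lower() for line in rs_lines if line.strip()}
    return sorted(py - rs), sorted(rs - py)


# Made-up outputs: the second list has one extra, truncated address.
py_out = ["alice@example.com", "bob@example.com"]
rs_out = ["alice@example.com", "bob@example.com", "noreply@example.co"]

only_py, only_rs = diff_outputs(py_out, rs_out)
print("only in Python output:", only_py)
print("only in Rust output:", only_rs)
```

Reporting the two directions separately keeps "extra junk" apart from "missing entries", which is the distinction that matters here.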

@AntoniosBarotsis

Also, to the surprise of no one, it was quite a bit faster: 3sec 768ms 800µs 700ns (rs) vs 3min 28sec 250ms 390µs 500ns (py), though I doubt anyone's student email has gigabytes of data. I used my Gmail because you can't export the mbox archive from Outlook web.
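To put those two timings side by side, here is the arithmetic as a small script. It only converts the figures quoted above to seconds and divides them; the helper name `to_seconds` is made up, and a single run on one archive is not a rigorous benchmark.

```python
def to_seconds(minutes=0, seconds=0, ms=0, us=0, ns=0):
    """Convert a split duration into seconds as a float."""
    return minutes * 60 + seconds + ms / 1e3 + us / 1e6 + ns / 1e9


rust_time = to_seconds(seconds=3, ms=768, us=800, ns=700)
python_time = to_seconds(minutes=3, seconds=28, ms=250, us=390, ns=500)

speedup = python_time / rust_time
print(f"rust:   {rust_time:.4f} s")
print(f"python: {python_time:.4f} s")
print(f"speedup: {speedup:.1f}x")
```

That works out to roughly a 55x difference on this 2GB archive.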
