Blog Post at https://ben.enterprises/hackathon-judging
This directory contains the experiments that led us to our conclusions (well, mostly). The Git history has been erased. Many of the earlier experiments contain flaws or errors. The code in the production directory was written by another organizer. The two judgments without a seventh team list "NA" as the last 'team'. Experiments 00-05 were performed before MadHacks 2025, and there were additional runs with tweaked numbers that are lost to time.
Notable Experiments:
- ./experiments/06-Final compares our final algorithm against Gavel across 1000 simulations, assuming a q-factor StdDev of 1.5 (a hypothetical simulation harness is sketched after this list)
- ./experiments/07-StdDev uses a less informative hyperprior, a half-normal distribution with StdDev of 4; we find a 94% credible interval of [0.626, 1.547] for the StdDev of the MadHacks project q-factors and a mean estimated StdDev of 1.078 (a model sketch also follows this list)
- ./experiments/08-Rerun reruns the expected-probability calculations with our pre-event estimates
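
Since 06-Final's code isn't reproduced here, the following is a minimal, hypothetical sketch of that kind of Monte Carlo comparison: draw true project q-factors with StdDev 1.5, simulate Bradley-Terry-style pairwise judgments, and score each ranker against the ground truth with Kendall's tau. The function names, project/judgment counts, and the logistic judge model are all assumptions for illustration, not the repo's actual setup.

```python
# A hypothetical harness (not the repo's code) in the spirit of 06-Final:
# compare two rankers on simulated events whose true project q-factors
# have StdDev 1.5. `rank_teams_final` and `rank_teams_gavel` are stand-ins.
import numpy as np
from scipy.stats import kendalltau

def simulate_judgments(n_projects=30, n_judgments=200, q_stddev=1.5, seed=0):
    rng = np.random.default_rng(seed)
    q = rng.normal(0.0, q_stddev, n_projects)      # true latent qualities
    pairs = rng.integers(0, n_projects, (n_judgments, 2))
    pairs = pairs[pairs[:, 0] != pairs[:, 1]]      # drop self-comparisons
    # Bradley-Terry judge: A beats B with probability sigmoid(q_A - q_B)
    p_a_wins = 1.0 / (1.0 + np.exp(q[pairs[:, 1]] - q[pairs[:, 0]]))
    a_wins = rng.random(len(pairs)) < p_a_wins
    winners = np.where(a_wins, pairs[:, 0], pairs[:, 1])
    losers = np.where(a_wins, pairs[:, 1], pairs[:, 0])
    return q, winners, losers

def ranking_accuracy(true_q, scores):
    """Kendall's tau between the true qualities and an algorithm's scores."""
    return kendalltau(true_q, scores).statistic

# Outline of the 1000-run comparison:
# for seed in range(1000):
#     q, winners, losers = simulate_judgments(seed=seed)
#     compare ranking_accuracy(q, rank_teams_final(winners, losers))
#     against ranking_accuracy(q, rank_teams_gavel(winners, losers))
```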
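
The tooling behind 07-StdDev isn't stated above, so this is only a sketch of the model as described, assuming PyMC and a Bradley-Terry-style likelihood over pairwise judgments; 94% matches the default HDI width that ArviZ reports.

```python
# A minimal sketch (not the repo's code) of the 07-StdDev model, assuming
# PyMC and a Bradley-Terry likelihood over recorded pairwise judgments.
import numpy as np
import pymc as pm
import arviz as az

def estimate_q_stddev(winners, losers, n_projects):
    """winners/losers: integer arrays of project indices, one entry per
    recorded pairwise judgment (winner beat loser)."""
    with pm.Model():
        # Weakly informative hyperprior on the spread of project q-factors
        sigma = pm.HalfNormal("sigma", sigma=4)
        # Latent per-project quality, centered at 0 with unknown spread
        q = pm.Normal("q", mu=0.0, sigma=sigma, shape=n_projects)
        # Bradley-Terry: P(winner beats loser) = sigmoid(q_w - q_l);
        # every recorded judgment is an observed "win", hence observed=1
        p = pm.math.sigmoid(q[winners] - q[losers])
        pm.Bernoulli("obs", p=p, observed=np.ones(len(winners)))
        idata = pm.sample(random_seed=0)
    # ArviZ reports a 94% highest-density interval by default, matching
    # the credible interval quoted above
    return az.summary(idata, var_names=["sigma"], hdi_prob=0.94)
```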
A decent chunk of the code here was written by AI (as is obvious from the comments). Notably, the adaptation of Gavel for testing was not done by AI. AI was also not used to come up with any ideas or to write any text for my blog post (I did use AI to identify spelling and grammar mistakes). I remain confident in our methods.
Files named judge.py or crowd_bt.py are derivative of Gavel and are licensed under AGPL-3.0-only, as provided in ./LICENSE-GAVEL.txt.