tycronk20/Optimal-TestTime
Inference Time Scaling: How to make it more efficient, and how to use it to make models hallucinate less.

Inference-time scaling has been shown to be extremely effective at improving the performance of LLMs in domains with a generator-verifier gap: generating candidate solutions to a problem is much harder than verifying their correctness. Several popular methodologies for scaling inference-time compute have been explored. Many of the most widely used rely on reinforcement learning to elicit long chains of thought that let a model self-correct and arrive at a solution after several thousand tokens of reasoning, or on generating many candidate solutions and choosing the one judged most likely to be correct, commonly known as best-of-n. Combining the strengths of both methodologies to scale inference-time compute has proven highly effective, boosting key benchmark results in competitive coding (IOI for o3) as well as mathematics (Frontier Math, AIME). We explore a much more inference-efficient approach to scaling best-of-n for reasoning models (parallel reasoning): pruning, early on, reasoning chains that don't contribute to candidate-solution diversity. We've found promising results applying this technique to the AIME competition math benchmark, where our method matches pass@50 performance while pruning 40 reasoning chains after only 300 tokens and decoding just 10 reasoning chains to completion.

Check it out here!
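The control flow is roughly: sample short prefixes for many chains, rank them by how much each differs from the rest, and only finish decoding the most diverse ones. Below is a minimal sketch of that idea. The functions `sample_prefix` and `continue_to_completion` are hypothetical stand-ins for whatever inference API is actually used, and the diversity score is a simple token-overlap heuristic, not the scoring used in this repo.

```python
# Sketch: diversity-based early pruning of parallel reasoning chains.
from typing import Callable, List


def jaccard_distance(a: str, b: str) -> float:
    """1 - Jaccard similarity over whitespace tokens (higher = more diverse)."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def prune_and_decode(
    prompt: str,
    sample_prefix: Callable[[str, int], str],      # hypothetical: decode a prefix of n tokens
    continue_to_completion: Callable[[str], str],  # hypothetical: finish a chain from a prefix
    n_chains: int = 50,
    prefix_tokens: int = 300,
    keep: int = 10,
) -> List[str]:
    # 1) Sample a short prefix for every chain.
    prefixes = [sample_prefix(prompt, prefix_tokens) for _ in range(n_chains)]

    # 2) Score each prefix by its average distance to the other prefixes,
    #    so chains that duplicate existing reasoning score low.
    def diversity(i: int) -> float:
        others = [jaccard_distance(prefixes[i], prefixes[j])
                  for j in range(n_chains) if j != i]
        return sum(others) / len(others)

    ranked = sorted(range(n_chains), key=diversity, reverse=True)

    # 3) Decode only the `keep` most diverse chains to completion.
    return [continue_to_completion(prefixes[i]) for i in ranked[:keep]]
```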

We propose a method for increasing LLM refusal rates on questions the model would typically get wrong, by using inference-time compute to scale reasoning and then measuring how divergent the resulting reasoning chains are before they converge on a coherent answer. Given a query, we run inference on DeepSeek R1, interrupt it at N tokens, and inject an interruption token: "No, but". We then run P parallel inferences from that Nth token, sampling different reasoning traces from the same interruption point. Once R1 has resolved to an answer, we use an SLM to score the coherence and diversity of the P reasoning traces. If this score exceeds a tuned threshold, we reject the original LLM answer and decline to answer the question. We find that this method maintains accuracy while increasing refusal rates on incorrect answers, with further work needed to determine the optimal token position N at which to inject the interruption. We believe this method is highly applicable to deployments where false negatives are highly consequential, such as high-trust environments like the medical or legal fields, or settings where LLM outputs are too large to be human-verified, where this method serves as a form of scalable oversight.

Check it out here!
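For concreteness, here is a minimal sketch of the interrupt-and-branch refusal check described above. The callables `decode_n_tokens`, `continue_from`, and `judge_divergence` are hypothetical placeholders for the R1 inference calls and the SLM judge; only the control flow mirrors the description, and the default values of N, P, and the threshold are illustrative, not the tuned values.

```python
# Sketch: refuse to answer when branched reasoning traces diverge too much.
from typing import Callable, List


def answer_or_refuse(
    query: str,
    decode_n_tokens: Callable[[str, int], str],      # hypothetical: first N tokens of the R1 trace
    continue_from: Callable[[str], str],             # hypothetical: decode a trace to its final answer
    judge_divergence: Callable[[List[str]], float],  # hypothetical: SLM coherence/diversity score
    n_interrupt: int = 512,
    p_branches: int = 8,
    threshold: float = 0.5,
) -> str:
    # 1) Decode the original trace up to the interruption point, then finish it
    #    normally to get the answer we might return.
    prefix = decode_n_tokens(query, n_interrupt)
    original_answer = continue_from(prefix)

    # 2) Inject the interruption token and sample P parallel continuations
    #    from the same Nth-token prefix.
    traces = [continue_from(prefix + " No, but") for _ in range(p_branches)]

    # 3) Score the branches with the small judge model; if they diverge beyond
    #    the tuned threshold, treat the original answer as unreliable and refuse.
    if judge_divergence(traces) > threshold:
        return "REFUSE"
    return original_answer
```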
