[FEA] Take fast path on segmented reductions

Currently, if a user writes something like `(a = sum(b, {1}))` to sum across the rows of a 2D tensor, an operator is implicitly created that builds an input iterator from `b`. That extra step is unnecessary for a row summation with strictly unit stride; in that case we can dispatch directly to the segmented functions. This PR addresses that performance issue.
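To illustrate the dispatch idea, here is a minimal host-side sketch in plain C++. It is not the library's implementation: `View2D`, `is_unit_stride_rows`, `segmented_row_sum`, `strided_row_sum`, and `row_sum` are all hypothetical names made up for this example. On the GPU, the fast path would presumably hand the raw pointer and segment offsets to a segmented routine (for example something like CUB's `cub::DeviceSegmentedReduce::Sum`, which is an assumption about which functions are meant).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 2D view descriptor (not a library type). Strides are in elements.
struct View2D {
    const float* data;
    std::size_t rows, cols;
    std::size_t row_stride, col_stride;
};

// Fast-path condition: elements within a row are adjacent in memory,
// so each row can be described by a [begin, end) offset pair alone.
bool is_unit_stride_rows(const View2D& v) {
    return v.col_stride == 1;
}

// Fast path: reduce each row as a segment of the raw buffer.
// No per-element index computation through a generic iterator is needed.
std::vector<float> segmented_row_sum(const View2D& v) {
    assert(is_unit_stride_rows(v));
    std::vector<float> out(v.rows, 0.0f);
    for (std::size_t r = 0; r < v.rows; ++r) {
        const std::size_t begin = r * v.row_stride;  // segment start offset
        const std::size_t end = begin + v.cols;      // segment end offset
        for (std::size_t i = begin; i < end; ++i) {
            out[r] += v.data[i];
        }
    }
    return out;
}

// Generic fallback: every access goes through the stride arithmetic,
// standing in for the implicit input-iterator path.
std::vector<float> strided_row_sum(const View2D& v) {
    std::vector<float> out(v.rows, 0.0f);
    for (std::size_t r = 0; r < v.rows; ++r) {
        for (std::size_t c = 0; c < v.cols; ++c) {
            out[r] += v.data[r * v.row_stride + c * v.col_stride];
        }
    }
    return out;
}

// Dispatch: take the segmented fast path only when the layout allows it.
std::vector<float> row_sum(const View2D& v) {
    return is_unit_stride_rows(v) ? segmented_row_sum(v) : strided_row_sum(v);
}
```

Both paths produce the same sums; the only difference is that the fast path works on contiguous segments of the underlying buffer, which is what makes a direct call into a segmented reduction possible.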