Currently, if a user writes something like (a = sum(b, {1})) to sum across the rows of a 2D tensor, an operator is created implicitly, which in turn builds an input iterator from b. This is unnecessary for a strict unit-stride row summation, where we can dispatch directly to the segmented functions instead. This PR addresses that performance issue.
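
To illustrate why the unit-stride case needs no iterator, here is a minimal host-side sketch of a segmented row sum. It is not the code in this PR: the function name `segmented_row_sum`, the flat row-major `std::vector<float>` layout, and the explicit offsets array are all assumptions made for illustration. The idea it shows is that when each row is contiguous, row i is simply the segment [i * cols, (i + 1) * cols) of the underlying buffer, so a segmented reduction can read the data pointer directly.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of any library): sums each row of a
// contiguous, row-major rows x cols matrix by treating every row as one
// segment of the flat buffer.
std::vector<float> segmented_row_sum(const std::vector<float>& data,
                                     std::size_t rows, std::size_t cols) {
  // The fast path is only valid when the buffer is densely packed.
  assert(data.size() == rows * cols);

  // Segment i spans [offsets[i], offsets[i + 1]) in the flat buffer.
  // With unit stride these offsets are just multiples of cols.
  std::vector<std::size_t> offsets(rows + 1);
  for (std::size_t i = 0; i <= rows; ++i) {
    offsets[i] = i * cols;
  }

  std::vector<float> out(rows);
  for (std::size_t i = 0; i < rows; ++i) {
    const auto first = data.begin() + static_cast<std::ptrdiff_t>(offsets[i]);
    const auto last = data.begin() + static_cast<std::ptrdiff_t>(offsets[i + 1]);
    // Reduce one segment directly from the raw storage, with no
    // per-element index translation.
    out[i] = std::accumulate(first, last, 0.0f);
  }
  return out;
}
```

Calling this helper on the six values 1 through 6 laid out as a 2 x 3 matrix yields one sum per row. A strided or transposed view would not satisfy the contiguity assumption and would still need the general iterator-based path.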