Optimizations with enumerations

Are there optimized versions of lists like [1..10]?
For example, this code is extremely slow:

import Data.List (sort)

main :: IO ()
main =
    print a
    where
        k = [2300000, 2299999 .. 1]  -- descending enumeration (defaults to Integer)
        b = map (+100) k             -- shift every element up by 100
        c = take 535 $ sort b        -- keep the 535 smallest elements
        a = sum c

Another question: is it good style to write so many bindings in the where clause?


That code is not that slow for me. It takes about 200 ms on my computer (even in GHCi it only takes about 500 ms). You can make it slightly faster by compiling with optimisations (ghc -O ...) and using Int instead of the default Integer:

        k = [2300000, 2299999 .. 1 :: Int]

(You only need the annotation there; the rest will be inferred automatically.)

But those two optimisations only speed it up from 220 ms to 170 ms.

That’s fine, but the variable names are not very descriptive. If you want to avoid names I’d just write it all in one expression:

main = print $ sum $ take 535 $ sort $ map (+ 100) [2300000, 2299999 .. 1]

I thought using vectors might be faster, so I tried this code:

import qualified Data.Vector as Vector
import qualified Data.Vector.Algorithms.Intro as Vector

main :: IO ()
main =
    print a
    where
        k = Vector.enumFromStepN (2300000 :: Int) (-1) 2300000  -- 2300000, 2299999 .. 1
        b = Vector.map (+100) k
        c = Vector.take 535 $ Vector.modify Vector.sort b  -- introsort the whole vector
        a = Vector.sum c

But that takes 2.5 seconds. So a big win for laziness I'd say. (Data.List.sort is a lazy merge sort, so take 535 only forces enough of the sorting work to produce the first 535 elements, rather than sorting the entire list.)

(But note that these are still immutable vectors, so this code probably creates a few copies of the whole array. You could optimise that even further, but that takes some skill and effort.)

Edit: changing from Intro sort to Tim sort makes it a lot faster, only taking 410 ms.
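
For reference, that change is just swapping in the Tim sort module (Data.Vector.Algorithms.Tim, from the same vector-algorithms package):

import qualified Data.Vector.Algorithms.Tim as Tim

        -- sort with Tim sort instead of introsort:
        c = Vector.take 535 $ Vector.modify Tim.sort b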


I have read that the laziness of the language allows the compiler to rearrange the order of operations to produce fast code.
However, I cannot get performance anywhere near that of Julia. This Julia code is 5 orders of magnitude faster:

# I am only using lazy functions and data structures, 
# so you can compare it with the haskell version
function f()
    k = 2300000:-1:1
    b = sort(k.+100)            # '.+' is vectorized '+'
    c = Iterators.take(b, 535)  # lazy version of take
    sum(c)
end

@time f() # 196880 in 0.000002 seconds

The range 2300000:-1:1 is converted to 2300100:-1:101.
In Haskell, is it possible to take [2300000, 2299999 .. 1] and turn it into [2300100, 2300099 .. 101]?
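
That is, could the (+ 100) be pushed into the bounds of the enumeration? Something like this hypothetical manual rewrite:

shifted :: [Int]
shifted = [2300000 + 100, 2299999 + 100 .. 1 + 100]  -- i.e. [2300100, 2300099 .. 101]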

Why do you generate a descending sequence only to reverse it (via sort) in the next line?

Ah, I just realized I forgot to use unboxed vectors, which are quite a bit faster for Int. Just changing the import to:

import qualified Data.Vector.Unboxed as Vector

makes Tim sort only take 170 ms. And I notice Intro sort has a special partial sorting version which is even faster:

        c = Vector.take 535 $ Vector.modify (\v -> Vector.partialSort v 535) b

That takes only 50 ms.

You can improve it a bit more by using unsafeThaw and unsafeFreeze instead of modify:

        -- needs: import Control.Monad.ST (runST)
        c = Vector.take 535 $ runST $ do
              v <- Vector.unsafeThaw b    -- reuse b's buffer without copying
              Vector.sort v               -- sort it in place
              Vector.unsafeFreeze v

Then I get to about 20 ms. Compiling with -O2 takes it down to about 10 ms. Interestingly, at this point Tim sort is faster again at 10 ms vs 20 ms for a partial Intro sort.

That's what fusion does, and I believe it happens in both the list version and the vector version, as long as you compile with optimisations.
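
Roughly, the idea is this (a hand-written sketch, not actual GHC output): fusion combines the map (+ 100) with the enumeration, so the shifted elements are produced directly, without ever building the unshifted list:

shiftedDown :: [Int]
shiftedDown = go 2300000
  where
    go n
      | n < 1     = []
      | otherwise = (n + 100) : go (n - 1)  -- emit the shifted element directly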


It's an invented example; I wrote it because I noticed that working with this type of list is not very efficient.

It looks like Julia realises that the result isn't used and doesn't execute the code at all.

If I print the result, then Julia has comparable performance to the Haskell version on my machine.


Can you post the code please? Both the Haskell and the Julia versions.

Indeed it isn't, which is sad, because List is the first recursive data structure you're taught in intro courses. OTOH the results @jaror is getting are pretty nippy.

Most intros don't explain up front that the default Integer is probably not what you want for performant code; nor do they tell you that running directly at the GHCi prompt is interpreting, not compiling.
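
For example, with no annotation the literals in the original program default to Integer; a single Int annotation anywhere in the chain (as shown above) switches the whole pipeline:

k :: [Integer]               -- what GHC infers by default
k = [2300000, 2299999 .. 1]

k' = [2300000, 2299999 .. 1 :: Int]  -- one annotation, the rest is inferred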

So how were you running this when you first posted and said it was “extremely slow”?


Ok, thank you all for the answers.
I like @jaror's solution.

Julia really does seem fast. It only takes a relatively long time (6 ms) on the first run, where it spends more than 99% of its time compiling. After that it is fast, even if I add some extra data dependencies to the code to rule out that it is caching the arrays. But the allocation statistics show that it is not allocating the full array anyway.

I strongly suspect Julia is recognizing that the enumeration iterator is already sorted in descending order, so it simply has to reverse the enumeration.

With a random array it takes a more sensible 40-60 ms and 70 MiB of allocations. Part of that is of course the overhead of generating the random array, but that part should be faster than the sorting.

Another way to defeat it is to call map explicitly:

function f()
  k = 2300000:-1:1
  b = sort(map(x -> x + 100,k))
  c = Iterators.take(b, 535)
  print(sum(c))
end

That takes around 7 ms and 35 MiB, so still very fast.


How do you benchmark Haskell code? I am getting inconsistent results.
If I compile and measure the whole program with “time ./Main” I get one result; if I use Criterion instead (as shown in the official tutorial) I get a much better result.
Can someone show me how to benchmark this example?

Edit:
For example, if I have this file Main.hs:

module Main where

import Criterion.Main
import Data.List (sort)

funToTest2 :: Int -> Int
funToTest2 toTake = 
    sum $ take toTake (sort $ map (+100) [2300000, 2299999 .. 1])

main :: IO ()
main = defaultMain [
  bench "funToTest"  $ whnf funToTest2 535
  ]

Then I compile it with: ghc -O2 Main.hs
And run with: ./Main --output bench.html
I get:

time                 2.349 μs   (2.286 μs .. 2.445 μs)
                     0.993 R²   (0.982 R² .. 0.999 R²)
mean                 2.303 μs   (2.277 μs .. 2.391 μs)
std dev              143.0 ns   (61.01 ns .. 279.8 ns)
variance introduced by outliers: 74% (severely inflated)

Here’s how I would benchmark your original code:

import Test.Tasty.Bench
import Data.List (sort)

foo :: Int -> Int
foo n = a where
  k = [n, n - 1 .. 1]
  b = map (+100) k
  c = take 535 $ sort b
  a = sum c

main :: IO ()
main = defaultMain
  [ bench "foo" $ whnf foo 2300000 ]

(This uses @Bodigrim’s tasty-bench as a modern and lightweight drop-in replacement for criterion. I just noticed the latest version has cool live updating measurements.)

On my machine, the result is:

All
  foo: OK
    222  ms ±  11 ms

The issue here is that the compiler will optimize it to something like this:

myBigList = sort $ map (+100) [2300000, 2299999 .. 1]

funToTest2 toTake = 
    sum $ take toTake myBigList

This means the big list will only be computed (and sorted) in the very first benchmark run.

To solve this you can make that 2300000 a parameter of the function.
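
A sketch of that fix, mirroring the foo benchmark above:

import Test.Tasty.Bench
import Data.List (sort)

funToTest2 :: Int -> Int -> Int
funToTest2 toTake n =
    sum $ take toTake (sort $ map (+100) [n, n - 1 .. 1])

main :: IO ()
main = defaultMain
  [ bench "funToTest" $ whnf (funToTest2 535) 2300000 ]

Now the list depends on the argument supplied by whnf, so it cannot be floated out and shared between benchmark runs.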


Ok, I tried to write a version that does the operations directly on ranges, so I created my own range type.
My problem is that it even seems to perform better than Julia, and I don't understand whether that is a benchmark-related problem.
Here is the code:

{-# LANGUAGE InstanceSigs #-}
module Main where

import Test.Tasty.Bench
import Data.Maybe (fromJust)

data RangeFromThenTo a = RangeFromThenTo !a !a !a 

-- show $ RangeFromThenTo 1 2 10 --> {1, 2 .. 10}
instance Show a => Show (RangeFromThenTo a) where
    show :: Show a => RangeFromThenTo a -> String
    show (RangeFromThenTo start firstStep end) = 
        "{" ++ show start ++ ", " ++ show firstStep ++ " .. " ++ show end ++ "}"

-- allow only ranges where 
-- start < firstStep < end == True or 
-- start > firstStep > end == True
rangeFromThenTo :: (Num a, Ord a) => a -> a -> a -> Maybe (RangeFromThenTo a)
rangeFromThenTo start firstStep end = 
    if (start < firstStep && firstStep < end) || (start > firstStep && firstStep > end)
    then Just $ RangeFromThenTo start firstStep end
    else Nothing

-- get the step of a range {1, 2 .. 10} --> 2 - 1 --> 1
rangeStep :: Num a => RangeFromThenTo a -> a
rangeStep (RangeFromThenTo start firstStep end) = firstStep - start

rangeToList :: Enum a => RangeFromThenTo a -> [a]
rangeToList (RangeFromThenTo start firstStep end) = [start, firstStep .. end]

isRangeSorted :: (Ord a) => RangeFromThenTo a -> Bool
isRangeSorted (RangeFromThenTo start firstStep end) = 
    start < firstStep && firstStep < end 

-- note: assumes 'end' is actually hit by the enumeration (true for the
-- step of -1 used below; for other steps the reversal can start at the
-- wrong element)
reverseRange :: Num a => RangeFromThenTo a -> RangeFromThenTo a
reverseRange r@(RangeFromThenTo start firstStep end) =
    RangeFromThenTo end (end - rangeStep r) start
            
sortRange :: (Ord a, Num a) => RangeFromThenTo a -> RangeFromThenTo a
sortRange r 
    | isRangeSorted r = r
    | otherwise = reverseRange r

-- {1, 2 .. 10} + 1 --> {2, 3 .. 11}
addRangeNum :: Num a => RangeFromThenTo a -> a -> RangeFromThenTo a
addRangeNum (RangeFromThenTo start firstStep end) n =
    RangeFromThenTo (start + n) (firstStep + n) (end + n)

funToTest :: Int -> Int -> Int -> Int -> Int -> Int
funToTest toTake st end toAdd begin = 
    a
    where
        k = fromJust $ rangeFromThenTo begin st end -- generate and unwrap a range
        b = sortRange $ addRangeNum k toAdd -- add and sort the range
        c = take toTake $ rangeToList b -- convert the range to list and apply take
        a = sum c

main :: IO ()
main = defaultMain [
  bench "range version"  $ whnf (funToTest 535  2299999 1 100) 2300000
  ]

Julia code on my PC: min 401 ns, median 427 ns, mean 498.489 ns ± 644.923 ns
my version:

All
  range version: OK
    336  ns ±  28 ns

Both criterion and tasty-bench agree on the result, but I'm not sure whether there's a problem in the benchmark and maybe it's only calculating the function once.

One thing that I notice is that you've only made that 2300000 a “parameter” of the benchmark. It would perhaps be slightly better to make all the arguments of funToTest parameters of the benchmark, like this:

whnf (\(a, b, c, d, e) -> funToTest a b c d e) (535, 2299999, 1, 100, 2300000)

But I don’t think that will make a big difference in this case.

I'm not that surprised GHC produces faster code in this case, because GHC spends quite a lot of time optimising, whereas Julia compiles a lot faster, as far as I know.

1 Like