Haskell Foundation DevOps Weekly Log, 2022-06-24

Hi again!

This week, I finished extending CI to record GHC memory usage and pulled a corpus of CI data that I can use to characterize failures and set a baseline for improvements.

Finishing the memory profiling infrastructure

Recording memory usage was my first official ticket. It is good to finally(?) be done with it.

Like any good software project, this one took just as long to deploy as it took to code up in the first place. By last Friday I thought it was half-done, so the story went something like this:

So now I think I’m really done. I even have pretty graphs (with links to see it live):

  1. A “heartbeat” for the importer process itself


    Not much data yet since it only runs once a day.

  2. Fluff: the graph of memory usage while compiling Cabal


    I’m importing all the heap memory usage data, which means I can just graph the data points directly with Grafana.

There’s probably a lot more that can be done with all that data, but I’m going to leave that up to others and move on with the next project!

Categorizing spurious CI failures

Knowing if a failure is spurious or not requires looking at the job log. There are thousands of failures per month. To set baselines, I need to trawl through all those logs. I need to get that data on my local system if I want to have any chance of staying sane.

So, I took some time to fetch a month’s worth of data. (See bonus section below for some gory details.) I now have 9MB of metadata for a month’s worth of failed jobs (~3500 of them). Furthermore, I have raw logs for all those jobs, which is 5GB of uncompressed text. I look forward to digging into that data next week. Solving spurious failures is my one and only task now!

Bonus: Pulling job data

I used Haskell to fetch all that data from GitLab! (Can you believe it?)

But, see below for an example of “hacking at web apis in Haskell”, and ask yourself if it seems like a win for Haskell. :slight_smile: It took me 3 hours to just figure out the easiest whack-it-together web client library, learn lens(?), and summon the correct string gods to write that code. Also, imports take over 25% of non-blank lines. :upside_down_face:

Protips:

  • Don’t use the GitLab API to fetch raw logs—it’s extremely slow. Use the “raw” link instead, e.g. https://gitlab.haskell.org/ghc/ghc/-/jobs/1088588/raw.
  • Compiling with -O is mandatory for anything that does lots of string conversions. (I’ve learned that before… oops)
    • Compile the code, don’t run it in ghci
Hacky Haskell
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TypeApplications #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE NumericUnderscores #-}

import Network.Wreq

import Control.Lens

import Data.Aeson as A

import Data.Aeson.Lens as AL

import Control.Concurrent
import Data.Char
import Data.Maybe
import Data.Text (Text, unpack, pack)
import Data.Text.Encoding (decodeUtf8, encodeUtf8)
import Data.ByteString.Builder
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TL
import qualified Data.Text.Lazy.IO as TL
import Data.Time
import Data.Time.Format.ISO8601
import GHC.Generics
import List.Transformer (runListT, liftIO, ListT)
import Network.HTTP.Client hiding (responseBody)
import Network.HTTP.Client.TLS hiding (responseBody)
import Network.Wreq.Session
import System.Environment
import System.IO
import qualified Data.ByteString.Char8 as BS
import qualified Data.ByteString.Lazy as BSL
import qualified Data.ByteString.Lazy.Char8 as BSLC
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Vector as Vec
import qualified List.Transformer as ListT
import qualified Network.Wreq.Session as Session

select :: [a] -> ListT IO a
select = ListT.select

requ = "https://gitlab.haskell.org/api/v4/projects/1/jobs?scope%5B%5D=failed&per_page=1&pagination=keyset&per_page=100"

main = do
    sess <- newSessionControl
        Nothing
        tlsManagerSettings
            { managerResponseTimeout
            -- this didn't seem to work?
            -- I tried to set it in the 'getWith' call, and it didn't work
            -- there, either.
            = responseTimeoutMicro (60_000_000)
            }

    -- res <- getGit sess (traceRequ 1088438)
    -- BSLC.putStr (res ^. responseBody)
    runListT (streamLogs sess)
    -- getFailures requ 0
    -- putStrLn "hi"
    -- putStrLn ""

getGit sess r = do
    key <- BS.pack <$> getEnv "GIT_PRIVATE_TOKEN"
    let opts =
            defaults
            & header "PRIVATE-TOKEN" .~ [key]
    Network.Wreq.getWith opts r

getFailures :: Session -> String -> Int -> IO ()
getFailures sess r i = do
    if i > 10000000000000000
        then pure ()
        else do
            resp <- getGit sess r
            let nextLink = resp ^. responseLink "rel" "next" . linkURL
            let b :: Maybe Text
                b = resp ^? responseBody . (_JSON @BSL.ByteString @Value) . _Array . _last . AL.key "created_at" . _String
                t :: Maybe UTCTime
                t = iso8601ParseM . unpack =<< b
                Just t' = t
            now <- getCurrentTime
            let lastMonth = addUTCTime (-30 * nominalDay) now
            T.putStrLn $ decodeUtf8 $ BSL.toStrict $ fromJust $ resp ^? responseBody
            -- Loop to get next page of results
            if t' > lastMonth
                then do
                    threadDelay 1000000
                    getFailures sess (BS.unpack nextLink) (i+1)
                else pure ()



traceRequ i =
    "https://gitlab.haskell.org/ghc/ghc/-/jobs/"
    <> show i
    <> "/raw"

streamLogs sess = do
    (i, cnt) <-  ListT.zip
        (select . fmap jid =<< liftIO failureIds)
        (select $ [0..])
    liftIO $ hPrint stderr (cnt, i)
    resp <- liftIO $ getGit sess $ traceRequ i
    let trace =
            Trace i
                ( TL.decodeUtf8
                $ resp ^. responseBody
                )
    liftIO $ TL.putStrLn (TL.decodeUtf8 $ encode trace)
    liftIO $ threadDelay 2_000_000


failureIds :: IO [ExtractId]
failureIds = concat . mapMaybe decode <$> failureArrays

-- 9M, easy peasy
failureArrays =
    -- Just converting to Text so I can use T.lines. :D
    fmap (BSL.fromStrict . encodeUtf8) . T.lines <$> T.readFile "month-of-fails.json"

data ExtractId = ExtractId { jid :: Int }
    deriving (Eq, Show)

instance FromJSON ExtractId where
    parseJSON = withObject "GitlabJob" $ \v ->
        ExtractId <$> v .: "id"


data Trace = Trace
    { tid :: Int
    , trace :: TL.Text
    } deriving (Eq, Show, Generic)

instance ToJSON Trace where
    toJSON (Trace i t) = object [ "id" A..= i, "trace" A..= t ]

Non-work

Finally, Friday is/was Midsummer’s Eve, a holiday in Finland, and I spent it volunteering on the visiting tallship Götheborg, doing a different kind of maintenance on system infrastructure.

Gratuitous ship photos

Ship at harbor in Helsinki today

Spaghetti code? (Photo from 2013)

7 Likes

“Götheborg” - is there a connection to the city of Göteborg, as in:

Chalmers University of Technology and University of Göteborg, S-412 96 Göteborg, Sweden

…?

Both in absolute and personal terms, yes!

Göteborg is the modern Swedish name, and Götheborg the name from the 1780s when the original ship was built. (The modern ship is a replica.) The ship was named after the city, which is also its home port.

Furthermore, I visited the ship for the first time (in Göteborg/Gothenburg) in February 2013. I was there because I was applying to a Haskell-focused PhD program at Chalmers. I didn’t get the position, but I did get a new hobby. :slight_smile:

I never thought about it, but Haskell literally led me to my association with this ship.

3 Likes

If it hasn’t already been mentioned elsewhere: is there enough memory for all the programs?

In 2019, I had a minimal (console-only) build installation on an older machine (8 Gbyte, 8 hyperthreads) - a full multi-thread rebuild of GHC would take the better part of an hour (that’s without running the testsuite!), and it also had spurious failures.

As I dimly recall, running a process monitor in another shell helped to reveal the cause - if the available memory on the computer got too low, the OS would respond by terminating at least one of the build processes. The remaining processes, upon encountering the resulting absence of built modules, would then start failing until the the whole build stopped in a mass of “not yet compiled” errors.

I devised a few kludges to work around the problem until I was able to double the memory on the system (back in 2019, before all the current shortages). What was interesting was the behaviour I was then able to observe - for the most part, the extra memory wasn’t being used heavily: slightly more than 8Gbyte, with spurious jumps to 12Gbyte (for some larger modules, methinks).

Apart from throwing terabytes of memory at GHC, maybe you can modify the build system so that it refrains from starting extra processes if the available memory is almost exhausted.

1 Like

Indeed, I do wonder if some of the servers are just getting overloaded and OOM’d.

I think it’s a good idea to keep the requirements for building GHC reasonable (for some definition of “reasonable”—something that might be good to raise with the HF technical track / working group?). For now, we’ll probably cop out by throwing more compute at the problem. :slight_smile:

(Heh) If only that was an affordable option for the rest of us…

I agree that working to reduce the amount of resources needed to compile GHC (or Haskell programs in general) is a worthwhile social project.

In fact, I’ll put it on my (long) list…

3 Likes