Hi again!
This week, I finished extending CI to record GHC memory usage and pulled a corpus of CI data that I can use to characterize failures and set a baseline for improvements.
Finishing the memory profiling infrastructure
Recording memory usage was my first official ticket. It is good to finally(?) be done with it.
Like any good software project, this one took just as long to deploy as it took to code up in the first place. By last Friday I thought it was half-done, so the story went something like this:
- On Monday, I thought it shouldn’t be working yet, and indeed it wasn’t. Success?
- No! It should have been working! No success.
- Problem: missing database permissions
- Amazingly, the logs helped me figure this out right away.
- (I also got lucky and discovered a typo in the database column types while fixing the perms)
- Manual production hacks
- Finally got it all done and even made a Grafana dashboard! Mission accomplished! Or was it?
- Matthew (domain expert) noticed the data looked weird: memory usage was way too low.
- One fix later (head.hackage MR !237, “Really only run eventlogging on --make mode”), I seem to have it working for real now.
So now I think I’m really done. I even have pretty graphs (with links to see them live):
- A “heartbeat” for the importer process itself. Not much data yet, since it only runs once a day.
- Fluff: the graph of memory usage while compiling Cabal.
I’m importing all the heap memory usage data, which means I can just graph the data points directly with Grafana.
There’s probably a lot more that could be done with all that data, but I’m going to leave that to others and move on to the next project!
Categorizing spurious CI failures
Knowing if a failure is spurious or not requires looking at the job log. There are thousands of failures per month. To set baselines, I need to trawl through all those logs. I need to get that data on my local system if I want to have any chance of staying sane.
So, I took some time to fetch a month’s worth of data. (See bonus section below for some gory details.) I now have 9MB of metadata for a month’s worth of failed jobs (~3500 of them). Furthermore, I have raw logs for all those jobs, which is 5GB of uncompressed text. I look forward to digging into that data next week. Solving spurious failures is my one and only task now!
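To give a flavor of what that digging might look like: the fetcher in the bonus section emits each log as a one-line JSON object ({"id": …, "trace": …}), so a first categorization pass over the corpus can be very short. Below is a minimal sketch; the file name and the failure signatures are made-up placeholders, since finding the real signatures is exactly the work ahead.

{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson (FromJSON (..), decode, withObject, (.:))
import Data.Maybe (mapMaybe)
import qualified Data.ByteString.Lazy.Char8 as BSL
import qualified Data.Map.Strict as Map
import qualified Data.Text.Lazy as TL

-- Mirrors the one-object-per-line JSON that the fetcher below emits.
data Job = Job Int TL.Text

instance FromJSON Job where
  parseJSON = withObject "Job" $ \v ->
    Job <$> v .: "id" <*> v .: "trace"

-- Made-up placeholder signatures; identifying the real ones is the task.
signatures :: [(String, TL.Text)]
signatures =
  [ ("network", "Connection reset by peer")
  , ("timeout", "timed out")
  ]

classify :: Job -> [String]
classify (Job _ t) = [ name | (name, pat) <- signatures, pat `TL.isInfixOf` t ]

main :: IO ()
main = do
  -- Hypothetical file: the streamed logs, one JSON object per line.
  jobs <- mapMaybe decode . BSL.lines <$> BSL.readFile "month-of-traces.json"
  let tally =
        Map.fromListWith (+) [ (cat, 1 :: Int) | j <- jobs, cat <- classify j ]
  mapM_ print (Map.toList tally)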
Bonus: Pulling job data
I used Haskell to fetch all that data from GitLab! (Can you believe it?)
But see below for an example of “hacking at web APIs in Haskell”, and ask yourself whether it seems like a win for Haskell. It took me three hours just to figure out the easiest whack-it-together web client library, learn enough lens(?), and summon the correct string-conversion gods to write that code. Also, imports make up over 25% of the non-blank lines.
Protips:
- Don’t use the GitLab API to fetch raw logs—it’s extremely slow. Use the “raw” link instead, e.g. https://gitlab.haskell.org/ghc/ghc/-/jobs/1088588/raw.
- Compiling with -O is mandatory for anything that does lots of string conversions. (I’ve learned that before… oops)
- Compile the code; don’t run it in GHCi.
Hacky Haskell
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TypeApplications #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE NumericUnderscores #-}
import Network.Wreq
import Control.Lens
import Data.Aeson as A
import Data.Aeson.Lens as AL
import Control.Concurrent
import Data.Char
import Data.Maybe
import Data.Text (Text, unpack, pack)
import Data.Text.Encoding (decodeUtf8, encodeUtf8)
import Data.ByteString.Builder
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TL
import qualified Data.Text.Lazy.IO as TL
import Data.Time
import Data.Time.Format.ISO8601
import GHC.Generics
import List.Transformer (runListT, liftIO, ListT)
import Network.HTTP.Client hiding (responseBody)
import Network.HTTP.Client.TLS hiding (responseBody)
import Network.Wreq.Session
import System.Environment
import System.IO
import qualified Data.ByteString.Char8 as BS
import qualified Data.ByteString.Lazy as BSL
import qualified Data.ByteString.Lazy.Char8 as BSLC
import qualified Data.Text as T
import qualified Data.Text.IO as T
import qualified Data.Vector as Vec
import qualified List.Transformer as ListT
import qualified Network.Wreq.Session as Session
select :: [a] -> ListT IO a
select = ListT.select
requ = "https://gitlab.haskell.org/api/v4/projects/1/jobs?scope%5B%5D=failed&per_page=1&pagination=keyset&per_page=100"
main :: IO ()
main = do
  sess <- newSessionControl
    Nothing
    tlsManagerSettings
      { managerResponseTimeout
          -- Setting the timeout in the 'getWith' call didn't seem to
          -- work; it only takes effect when requests actually go
          -- through this session's manager (see getGit below).
          = responseTimeoutMicro 60_000_000
      }
  -- One-off debugging: fetch a single raw log.
  -- res <- getGit sess (traceRequ 1088438)
  -- BSLC.putStr (res ^. responseBody)
  -- Phase 2: stream the raw log of every failed job to stdout.
  runListT (streamLogs sess)
  -- Phase 1 (run first, with stdout redirected to month-of-fails.json):
  -- getFailures sess requ 0
-- Authenticated GET. Routing the request through the session is what
-- makes the manager settings above (including the timeout) apply.
getGit sess r = do
  key <- BS.pack <$> getEnv "GIT_PRIVATE_TOKEN"
  let opts = defaults & header "PRIVATE-TOKEN" .~ [key]
  Session.getWith opts sess r
-- Page through all failed jobs, printing each page of job metadata
-- (one JSON array per line) to stdout, until the jobs are more than
-- 30 days old.
getFailures :: Session -> String -> Int -> IO ()
getFailures sess r i = do
  if i > 10_000_000_000_000_000  -- i.e. no real limit
    then pure ()
    else do
      resp <- getGit sess r
      let nextLink = resp ^. responseLink "rel" "next" . linkURL
      let b :: Maybe Text
          b = resp ^? responseBody . (_JSON @BSL.ByteString @Value)
                . _Array . _last . AL.key "created_at" . _String
          t :: Maybe UTCTime
          t = iso8601ParseM . unpack =<< b
          Just t' = t
      now <- getCurrentTime
      let lastMonth = addUTCTime (-30 * nominalDay) now
      T.putStrLn $ decodeUtf8 $ BSL.toStrict $ fromJust $ resp ^? responseBody
      -- Loop to get the next page of results.
      if t' > lastMonth
        then do
          threadDelay 1_000_000  -- be nice to the API
          getFailures sess (BS.unpack nextLink) (i + 1)
        else pure ()
-- The fast way to get a job log: the raw link, not the jobs API.
traceRequ :: Int -> String
traceRequ i =
  "https://gitlab.haskell.org/ghc/ghc/-/jobs/"
    <> show i
    <> "/raw"
-- For every failed job id, fetch its raw log and print it to stdout
-- as one JSON object per line, with a progress count on stderr.
streamLogs :: Session -> ListT IO ()
streamLogs sess = do
  (i, cnt) <- ListT.zip
    (select . fmap jid =<< liftIO failureIds)
    (select [0 ..])
  liftIO $ hPrint stderr (cnt, i)
  resp <- liftIO $ getGit sess $ traceRequ i
  let trace = Trace i (TL.decodeUtf8 $ resp ^. responseBody)
  liftIO $ TL.putStrLn (TL.decodeUtf8 $ encode trace)
  liftIO $ threadDelay 2_000_000  -- again, be nice to the API
-- Parse the job ids back out of the metadata fetched by getFailures.
failureIds :: IO [ExtractId]
failureIds = concat . mapMaybe decode <$> failureArrays

-- 9M of metadata, easy peasy: one JSON array of jobs per line.
failureArrays :: IO [BSL.ByteString]
failureArrays =
  -- Just converting to Text so I can use T.lines. :D
  fmap (BSL.fromStrict . encodeUtf8) . T.lines <$> T.readFile "month-of-fails.json"
data ExtractId = ExtractId { jid :: Int }
  deriving (Eq, Show)

instance FromJSON ExtractId where
  parseJSON = withObject "GitlabJob" $ \v ->
    ExtractId <$> v .: "id"

data Trace = Trace
  { tid :: Int
  , trace :: TL.Text
  } deriving (Eq, Show, Generic)

instance ToJSON Trace where
  toJSON (Trace i t) = object [ "id" A..= i, "trace" A..= t ]
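For completeness, here’s how I drive it, in two manual phases: first, uncomment the getFailures call in main, compile with -O (per the protips above), and run it with stdout redirected to month-of-fails.json; then run it again with streamLogs, which reads the job ids back out of that file and streams each raw log to stdout as one JSON object per line. Both phases expect a GitLab token with API access in the GIT_PRIVATE_TOKEN environment variable.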
Non-work
Finally, Friday was Midsummer’s Eve, a holiday in Finland, and I spent it volunteering aboard the visiting tall ship Götheborg, doing a different kind of maintenance on systems infrastructure.