How do you trim stack compilation artifacts

Hi,

I’m creating a Docker image to use as a builder in my CI/CD setup. In the Dockerfile of that image I run stack --package foo --package bar etc. to precompile a set of packages, so that compiling my code doesn’t have to download and compile them again.

All good so far, except that the Docker image is 11 GB. I’m sure most of that size is source code and intermediate objects produced during compilation. How can I purge the intermediate stuff that stack creates and leave only what is required to compile my code base?

Thanks

I am not super familiar with stack, so consider this a guide more than an answer.

First, you can run stack build --only-snapshot (or a similar flag) to build all the snapshot packages your project depends on, instead of listing them one by one.

After that, I think they should be registered and precompiled in some folder like .stack/snapshots/.... I think that folder contains all the artifacts needed to build your project, so you can delete the intermediate .stack-work that was created in the build directory.

I don’t know if the above is true, or if it will actually reduce the size of the image.
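
A minimal shell sketch of the two steps above, under the same caveat that it may not save much. The function name and the STACK_BIN override are hypothetical; the override only exists so the stack call can be stubbed out when trying the script somewhere stack is not installed.

```shell
# Sketch: build only the snapshot dependencies, then drop the project-local build dir.
# warm_snapshot_cache and STACK_BIN are made-up names for illustration.
warm_snapshot_cache() {
    project_dir=$1
    (
        cd "$project_dir" || exit 1
        # --only-snapshot builds just the dependencies that come from the snapshot
        "${STACK_BIN:-stack}" build --only-snapshot || exit 1
        # .stack-work holds build output for the local packages only
        rm -rf .stack-work
    )
}
```

In a Dockerfile this would be called once from a RUN step, e.g. warm_snapshot_cache /path/to/project, where the path is whatever your checkout uses.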

The .stack-work is the local build and my app isn’t that big. The main problem is dependencies. See the size of the ~/.stack folder:

root@702963a22c01:~/.stack# du -sh * | sort -hr
2.3G    programs
1.8G    pantry
677M    snapshots
46M     setup-exe-cache
236K    stack.sqlite3
120K    global-project
52K     setup-exe-src
4.0K    config.yaml
0       stack.sqlite3.pantry-write-lock

I don’t know whether I can safely delete the programs and pantry folders, or whether doing so will make stack start redownloading stuff when my docker image is actually compiling my code. I also don’t know whether snapshots can be trimmed a bit more.

You could try with the new stack “dev containers” that bring GHC built against musl libc. Statically link your binaries, then copy them into a deployment Alpine docker image with a multi-stage docker build and voila.

@ocramz I don’t know if your approach will solve the big image issue anyway. To be more specific:

  1. I’m working with Google Cloud Build and I want to create a custom builder (similar to the official ones GitHub - GoogleCloudPlatform/cloud-builders: Builder images and examples commonly used for Google Cloud Build)
  2. Instead of having a vanilla haskell or stack image, I want a docker image that has stack AND all dependencies of my project precompiled because Cloud Build will instantiate it on every push to the repo and I don’t want to download the whole internet on every commit to a PR.
  3. I can accomplish the above, but my image is several GB and Cloud Build is slow to retrieve it. The question is how to remove all the unneeded crap from ~/.stack while still being able to compile my code “without” access to Hackage.

This is my current approach:

FROM haskell:9.4.7

ARG LTS_VERSION=lts-21.14

COPY known_hosts.github /root/.ssh/known_hosts

RUN apt-get update -qqy \
  && apt-get install -qqy curl bc git openssh-client \
  && mkdir -p /usr/share \
  && apt-get remove -qqy --purge curl \
  && rm /var/lib/apt/lists/*_*

# Needed by Shakefile. Run with a dummy eval command
RUN stack --resolver "$LTS_VERSION" eval 'foo=1' \
       --package shake \
       --package hashable \
       --package binary \
       --package bytestring \
       --package text \
       --package deepseq

RUN cd /usr/share \
  && git clone --depth 1 git@github.com:tonicebrian/wargames-arena.git \
  # Add all folders containing Haskell code
  && cd wargames-arena/downfallofempires/backend \
  && stack test --only-dependencies \
  && cd - \
  # End stackage initialization
  && rm -rf wargames-arena

ENTRYPOINT ["/usr/local/bin/stack"]

but you can see the size of the stack folder in How do you trim stack compilation artifacts - #3 by tonicebrian

I’m not sure if this fits your requirements but have you looked at using Docker multi-stage builds? Here’s an example for Stack.

@tonicebrian ah, now I got what you mean.

I think you could start with @Lsmor’s approach above (How do you trim stack compilation artifacts - #2 by Lsmor), then stash the resulting image in a docker registry and have CI pull that when your repository changes.

@eahlberg No, it doesn’t fit my requirements, because I want to reduce the size of the first image that I would use in a multi-stage build.

The problem could be reframed as: "If I’m constrained in disk space, how can I reduce the local disk usage of stack’s cache after stack build --only-dependencies so that the next time I do a vanilla stack build on my repo, it keeps working?"

@ocramz I’ll try that by doing rm -rf ~/.stack/programs ~/.stack/pantry and see if that works, but @Lsmor isn’t sure either and I thought this would be something already solved.

@tonicebrian removing .stack/programs will delete GHC and stack will download it at next stack build so not sure it’s a great idea. pantry has a big sqlite DB inside as well as a large 00-index.tar which I think contains the whole of Hackage sources.

pantry looks about the right size at 1.8 GB. The database will be rebuilt by Stack if it is deleted and Stack then needs to build.

programs also looks about the right size for a single version of GHC at 2.3 GB. If you are not using the Stack-supplied GHC (--system-ghc, --no-install-ghc) then you don’t need it.

Your snapshots are about 677 MB - that seems a little on the high side. To make sure that only contains the snapshots that you actually need, you can delete it and then build only what you want to cause Stack to recreate it for your ‘immutable’ dependencies.

As you say, .stack-work for a project contains Cabal (the library) build artefacts for only your local/mutable packages in your project. If you delete it, Stack will recreate it when it re-builds your local packages.

It is weird that your programs folder takes 2.3Gb. Since you are using a docker image with GHC already installed, you should configure stack not to download GHC.

# Make stack not install GHC on its own.
# Note that the system's GHC must match the LTS compiler version.
# Also, this configuration is overridden by the project's `stack.yaml`,
# so be sure that your `stack.yaml` doesn't have system-ghc: false
RUN stack config set install-ghc --global false \
 && stack config set system-ghc --global true

That should save you 2.3Gb of disk space. For the rest, try running a disk usage command:

# This folder might be different in your docker image.
# Use `stack path` to find the paths.
du -sh ~/.stack/*

Paste the result so we can check. Note that you may leak some information by doing this; I don’t know how private your project is.

@lsmor good catch, yeah stack was bootstrapping GHC, so by reusing the one from the source image I have this:

root@55fd0c396c4e:~/.stack# du -sh ~/.stack/* | sort -hr
1.8G    /root/.stack/pantry
676M    /root/.stack/snapshots
47M     /root/.stack/setup-exe-cache
232K    /root/.stack/stack.sqlite3
120K    /root/.stack/global-project
52K     /root/.stack/setup-exe-src
4.0K    /root/.stack/config.yaml
0       /root/.stack/stack.sqlite3.pantry-write-lock

So pantry is the big driver and it is a pity that it cannot be erased. This is the list of dependencies that total those 676M in snapshots:

  - aeson
  - ansi-terminal
  - base
  - bytestring
  - containers
  - either
  - generic-deriving
  - githash
  - katip
  - lens
  - monad-loops
  - mtl
  - optparse-simple
  - pretty-simple
  - pqueue
  - random
  - safe
  - servant
  - servant-server
  - servant-websockets
  - stm
  - stm-actor
  - transformers
  - ulid
  - text
  - wai
  - wai-extra
  - wai-cors
  - warp
      - QuickCheck
      - genvalidity-hspec
      - sydtest
      - sydtest-discover

So it is “big” but pretty normal for a web server with websockets, some logging, and testing. It doesn’t even have the DB libraries that I plan to add.

My statically compiled executable is a mere 40Mb, so I think there should be room for more trimming, but I don’t know where.

So TL;DR would be:

  • The current docker image weighs 6.46Gb of which:
    • 3.11 Gb come from the haskell:9.4.7 image
    • 1.8 Gb come from the pantry
    • 676 Mb are compiled dependencies

So I am going to try the haskell:9.4.7-slim and see how it goes, but I have no idea about how to reduce pantry or the compiled dependencies, so any ideas would be helpful.
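
A quick check of how much of the image those three items explain. This is only arithmetic on the figures quoted above, treating 676 Mb as 0.676 Gb; the function name is made up.

```shell
# Sketch: sum the three known contributors and show what is left over.
# All numbers are the ones quoted in this post, in Gb.
breakdown() {
    awk 'BEGIN {
        total = 6.46; base = 3.11; pantry = 1.8; deps = 0.676
        accounted = base + pantry + deps
        printf "accounted: %.2f GB\n", accounted
        printf "unaccounted: %.2f GB\n", total - accounted
    }'
}
breakdown
```

So a bit under 1 Gb is not covered by the list, presumably the apt packages and the smaller entries in ~/.stack.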

I wonder what happens if you remove the folders under pantry? Or at least the sqlite file, which on my computer seems to be the largest. According to @mpilgrem, stack recreates the database if it is missing, but I don’t know whether it will re-download the dependencies or recreate the index from the already built deps.

By way of context, Stack is built on top of Pantry (pantry), which is built on top of the Hackage security library (HSL) (hackage-security). Stack asks itself (via HSL): do I have the current Hackage package index? That gives rise to the contents of the pantry\hackage directory in the Stack root. The local Pantry database used directly by Stack is populated, in part, with information from that package index. That is the pantry.sqlite3 file in the pantry directory. EDIT: I think Stack only asks itself if it has the current Hackage package index if it cannot find something in the Pantry database.

EDIT2: So, if you are confident you will not be using something not in the Pantry database once it has been initialised, you can delete pantry\hackage. The risk is, if something is not found in the Pantry database, pantry\hackage will be downloaded again, which is slow.
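
As a rough sketch, the deletion could be wrapped like this so it reports what it frees and leaves pantry.sqlite3 alone. The function name is made up, and the default root of ~/.stack is an assumption; check stack path for the real location.

```shell
# Sketch: remove the downloaded Hackage index from a Stack root.
# Assumes the layout described above: <stack root>/pantry/hackage.
prune_hackage_index() {
    stack_root=${1:-$HOME/.stack}
    index_dir=$stack_root/pantry/hackage
    if [ -d "$index_dir" ]; then
        # Report what is about to be freed, then delete it
        du -sh "$index_dir"
        rm -rf "$index_dir"
    else
        echo "nothing to prune at $index_dir" >&2
    fi
}
```

Run it only after the dependencies have been built, since that is when the Pantry database has been populated.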

@mpilgrem I’ve removed .stack/pantry/hackage and I was able to compile without problems, so that was a nice 800Mb gone. Do you see any other way to shrink the sqlite3 DB? Is there a way to keep info and metadata in that sqlite3 only for packages already downloaded? Any field that would allow us to filter by that criterion?

My use case is that the builder knows everything about the dependencies of the project it is intended to build by tracking project’s package.yaml at all times. With this in mind, could we perform some SQL wizardry to just keep rows for the required packages?

PS. I think this thread is going to end up as a blog post :smiley:. For the record, I started with a 6.46Gb builder image that took 2 minutes 10 seconds to pull in Google Cloud Build. I’m now at 3.58Gb and 50 seconds of pulling. Still too slow for a CI/CD pipeline, but going in the right direction.

My last take was to try to find stuff in ~/.stack/pantry/pantry.sqlite3 that is unimportant for a CI/CD task. At first I thought that the textual representation in blob.contents was intended for humans, but after doing update blob set contents = ""; vacuum the DB shrinks to an impressive 31Mb and I get this error when compiling:

Error: [S-775]
       Exception while reading snapshot from https://raw.githubusercontent.com/commercialhaskell/stackage-snapshots/master/lts/21/14.yaml
       (60e54c1ba3c1e7163acf6dafa9d56b2d3b23f88a31ad53a1c9d888f32561f8da,639819):

       Error: [S-645]
       Couldn't parse snapshot from https://raw.githubusercontent.com/commercialhaskell/stackage-snapshots/master/lts/21/14.yaml (60e54c1ba3c1e7163acf6dafa9d56b2d3b23f88a31ad53a1c9d888f32561f8da,639819): Error in $: parsing Snapshot failed, expected Object, but encountered Null

So I guess trimming the sqlite3 DB is a dead end :frowning_face:

My analysis is this: for your particular need, you are not interested in most of the Hackage package index. That is, you are interested in a package index that is a known subset of the Hackage package index.

Stack can be configured to use package indices other than the Hackage package index. See the documentation for the package-index configuration option: Configuration (project and global) - The Haskell Tool Stack.

So, you could create your bespoke package index and configure Stack to use it. That would result in Stack creating the Pantry database you want.

I’ve set up the following for the CI/CD at our company:

Also note the image we use to keep the resulting image as small as can be.
The image in our registry is 1.83GB, and that’s with 200+ dependencies built

FROM fpco/stack-build-small:{RESOLVER}
...
RUN DEBIAN_FRONTEND=noninteractive \
    ln -fs "<our timezone>" /etc/localtime && \
    apt update -y && \
    apt install -y curl ssh apt-utils "<any other packages you may need, like libpq-dev>" && \
    # THE FOLLOWING IS IF YOU NEED DOCKER IN THE IMAGE
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add - && \
    add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" && \
    apt update -y && \
    apt install -y docker-ce-cli && \
    # WE NEED TO DO THIS TO GET TO OUR ATLASSIAN REPOS
    touch ~/.ssh/known_hosts && \
    chmod 600 ~/.ssh/known_hosts && \
    ssh-keyscan -H bitbucket.org >> ~/.ssh/known_hosts && \
    ssh-keygen -R bitbucket.org -f ~/.ssh/known_hosts && \
    echo "~/.ssh contents" && ls -la ~/.ssh && \
    echo "known_hosts contents after ssh-keygen" && cat ~/.ssh/known_hosts && \
    curl -L https://bitbucket.org/site/ssh >> ~/.ssh/known_hosts && \
    echo "~/.ssh contents" && ls -la ~/.ssh && \
    echo "known_hosts contents after curl" && cat ~/.ssh/known_hosts && \
    git clone "<our repository to build the image for>" -b $BRANCH && \
    cd "<the repo directory>" && \
    git checkout $COMMIT && \
    stack upgrade && \
    stack build --stack-root "$STACKDIR" --dependencies-only --ghc-options="-j" && \
    rm -fr $STACKDIR/build-plan* \
           $STACKDIR/loaded-snapshot-cache \
           $STACKDIR/precompiled \
           $STACKDIR/setup-exe-src \
           $STACKDIR/indices/Hackage/00-index.tar* \
           /opt/ghc/*/share/doc

These last directories are the ones I’ve found you can delete while still letting CI build everything without having to rebuild all the dependencies or GHC.
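
Before deleting, it can help to see how much each of those paths actually takes in your own image. A small dry-run sketch, using the same relative paths as the rm above; the function name is made up and nothing is removed.

```shell
# Sketch: list what the cleanup above would delete, without deleting anything.
# The relative paths are copied from the Dockerfile; list_prunable is a made-up name.
list_prunable() {
    stack_root=$1
    for path in \
        "$stack_root"/build-plan* \
        "$stack_root"/loaded-snapshot-cache \
        "$stack_root"/precompiled \
        "$stack_root"/setup-exe-src \
        "$stack_root"/indices/Hackage/00-index.tar*
    do
        # An unmatched glob stays literal, so skip anything that does not exist
        [ -e "$path" ] && du -sh "$path"
    done
    return 0
}
```

If it prints nothing for your Stack root, the directory layout differs from the one this Dockerfile was written for and the rm is not freeing anything either.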

@Vlix I think you’ve left out some intermediate steps, because this super simple Dockerfile following your template comes out at 4.8Gb.

FROM fpco/stack-build-small:lts-21.14
ARG LTS_VERSION=lts-21.14
ARG STACK_DIR=/root/.stack
COPY id_rsa /root/.ssh/id_rsa
COPY known_hosts.github /root/.ssh/known_hosts
RUN chmod 400 /root/.ssh/id_rsa \
  && apt-get update -qqy \
  && apt-get install -qqy curl bc git openssh-client \
  && mkdir -p /usr/share \
  && apt-get remove -qqy --purge curl \
  && rm /var/lib/apt/lists/*_*
RUN stack --resolver "$LTS_VERSION" eval 'foo=1' \
       --package deepseq \
  &&  rm -fr $STACKDIR/build-plan* \
           $STACKDIR/loaded-snapshot-cache \
           $STACKDIR/precompiled \
           $STACKDIR/setup-exe-src \
           $STACKDIR/indices/Hackage/00-index.tar* \
           /opt/ghc/*/share/doc
ENTRYPOINT ["/usr/local/bin/stack"]