RFC: Using hie files to list external declarations for cabal-audit

Hello,

I am investigating if and how hie files could be used for a cabal-audit command to check what are the declarations being used from the build dependencies. The use-case would be to alert the user when a vulnerable function is being used. This would be particularly important to avoid false alarm when a given vulnerability only appears in a rarely used declaration of a popular package.

So far, I’ve implemented the following function (code) which seems to work well for simple terms:

-- | Returns all the edges between a top level declaration and its dependencies.
getDependencies :: HieFile -> [(TopLevelDeclaration, Declaration)]

For example, given the following modules:

module CabalAudit.Test.Class where

class TestClass a where
  tasty :: a -> Bool
module CabalAudit.Test.Instance where

import CabalAudit.Test.Class

data Tea = Tea

instance TestClass Tea where
  tasty = not . alwaysTrue

instance Show Tea where
  show Tea = "thè" ++ "!"

alwaysTrue :: Tea -> Bool
alwaysTrue = const True
module CabalAudit.Test.User where

import CabalAudit.Test.Class
import CabalAudit.Test.Instance

useAlwaysTrue :: Bool
useAlwaysTrue = tasty Tea

Here is the current output of the tool:

# CabalAudit.Test.Instance
GHC.Show.show: GHC.Base.++, CabalAudit.Test.Instance.Tea
CabalAudit.Test.Class.tasty: GHC.Base.., GHC.Classes.not, CabalAudit.Test.Instance.alwaysTrue, GHC.Base.const, GHC.Types.True
CabalAudit.Test.Instance.alwaysTrue: GHC.Base.const, GHC.Types.True

# CabalAudit.Test.User
CabalAudit.Test.User.useAlwaysTrue: CabalAudit.Test.Class.tasty, GHC.Base.., GHC.Classes.not, CabalAudit.Test.Instance.alwaysTrue, GHC.Base.const, GHC.Types.True, CabalAudit.Test.Instance.Tea

This looks promising, we can see that alwaysTrue is reachable from the useAlwaysTrue declaration. However, I don’t know why the class instances are not attached to the local module (e.g. CabalAudit.Test.Class.Show of Tea instead of GHC.Show.show).

Before continuing down that path, I would appreciate some feedbacks on this strategy, in particular:

  • Is this a good usage of hie files? It seems like we would need to rebuild the dependencies with -fwrite-ide-info, so perhaps we could also use a GHC plugin, would that be better?
  • How to handle typeclass correctly, do we need to use generateReferencesMap?
  • Beside typeclass instances, what other pitfalls would need to be addressed, e.g. are TypeFamilies going to be an issue?

Thank you for your time,
-Tristan

6 Likes

I don’t have direct answers to your questions, but I note that what you want to do seems extremely similar to what weeder does, and it uses HIE files.

Indeed, maybe you could even use weeder. AIUI, you want to assert “vulnerable function X is dead (i.e. unused)”, which is exactly what weeder can tell you!

3 Likes

Thanks, that’s a great suggestion, I’ll give it a try. My main concern with weeder is that it loads all the hie files at once, and for cabal-audit we would need to provide the modules (including transitive ones) from all the build dependencies. So my plan was to do the search incrementally, avoiding external modules that are not used. Perhaps that can be done efficiently with Weeder.analyseHieFiles, but I’m not sure what to do with regards to the RefMap. @ocharles Do you think that would work ?

While it may be possible to use .hie files for this purpose, there are a few reasons why it might not be advisable to rely on them for enforcing security guarantees:

  1. .hie files are constructed with an ad-hoc traversal of the GHC frontend AST, before RULES and optimisations have a chance to fire, potentially invalidating any results
  2. The ad-hoc traversal nature means there might be certain parts of the AST which are omitted - .hie files were initially designed as an aid for IDE/user-visible tooling, and so they only capture parts of the AST that come from the users code. As such, code generated from TH splices, deriving statements and such are currently not captured by the AST. There might be additional constructs like RebindableSyntax where the .hie AST is not accurate.
  3. .hie files are not generated by default and require additional options to be passed at compile time for all dependencies. Generating .hie files might add additional cost to compile times and disk usage.
  4. Indeed we would have to address #16901: GHC needs to distribute .hie files for base libraries · Issues · Glasgow Haskell Compiler / GHC · GitLab so that .hie files for boot libraries are available.

The broad strokes of this proposal remind me of the SafeHaskell mechanism, perhaps that would be a better target for building off off?

3 Likes

This is important feedback - thanks.

The goal here is not to enforce a security guarantee, however - it’s to inform developers about known issues with libraries that they depend on. For example, a contrib library for xmonad had a remote execution vulnerability in 2013 that allowed web page titles to contain code that was executed. An xmonad user who hasn’t updated since then should be able to get a warning at build time that the version of the contrib library that they depend on has a known vulnerability.

The warning feature described above doesn’t need anything more than access to a freeze file that lists the complete transitive dependency set. However, not all xmonad-contrib users will be using the hook that contained the vulnerability. Some libraries (base being one example) are large and difficult to update, and the vulnerability may just be in one small part of the library. Having an overwhelming number of false positives makes the warnings less actionable and less useful.

@tristanC is looking for ways for users to avoid false positives when checking for vulnerabilities. Using the HIE files is intended to check “Does the code contain a reference to the particular declaration or datatype to which the advisory applies” - do you know of a better way to implement this? The drawbacks you state are very real, and an alternative approach would certainly be nice.

1 Like

I don’t think .hie files are currently suitable (without a lot of work auditing, testing and improving) for implementing a sound version of the tool you describe. By soundness I mean the property that if the tool says your code doesn’t call function foo, then your code really never calls function foo at runtime. Of course this would be difficult for any tool in the presence of things like dynamically loading shared objects at runtime, but .hie files can fail in much more mundane scenarios.

In practice, I would say that .hie files would get you 90% there, in terms of the accuracy of the results for “real world” inputs.

To achieve this with a higher degree of correctness, I would think of GHC features that can be implemented to achieve the goal, like extensions to SafeHaskell, some kind of contagious deprecation/warning pragmas or some combination of the two. Note that SafeHaskell already has the kind of “contagious” behaviour desired, but perhaps it is not granular enough, operating at the module level, not at the level of declarations.

Another approach may be to use the new -fwrite-if-simplified-core option to embed the simplified Core for all symbols in interface files, and then run an analysis on that. In my opinion this would be much more feasible than trying to patch up the holes in .hie files since GHC Core is a relatively simple language compared to surface Haskell, and since you get the actual post-optimisation Core which was then compiled into your program you avoid a lot of problems dealing with Haskell constructs such as RULES, type classes, deriving, TemplateHaskell etc.

2 Likes

Thank you for the careful feedback!

I still don’t see the analogy to Safe Haskell.

I suspect that analyzing interface files will be problematic - my understanding is that their format is not stable across GHC versions, so a tool would need to link to GHC to do this, which makes it much more difficult to maintain. We’ll think a bit more about ways to proceed - thanks again!

1 Like

.hie files are also not stable across GHC versions, and you need to link against ghc to read them!

I suspect Core (the bits that go into interface files) is about as stable as .hie files.

.hie files have an advantage in that they can be fully read and analysed in isolation, independent of any compile flags, package databases etc. whereas you need to set up a proper GHC session to fully make sense of interface files, and you don’t get very far just analysing an interface file without also reading in the interface files for its dependencies.

I still don’t see the analogy to Safe Haskell.

If all the “bad” functions you care about were defined in modules declared Unsafe, then checking if you can compile your program with -XSafe would result in a sound implementation of the tool you describe?

Thank you @wz1000 for the feedback, I am also coming to the same conclusion after reading the weeder and calligraphy usage of .hie file, e.g. this approach presently misses: RULES, deriving and RebindableSyntax. For TH splice, calligraphy’s readme does mention that .hie files are generated after TH expansion though. For optimization, would it be possible that an unreachable declaration gets used because of an optimization setting?

That looks promising, does the simplified Core contains the original names, e.g. the GenModule name/version and OccName we could accurately match from the advisory database? What about inlined declarations, are they still visible?

That also looks promising, but it sounds like we would need to build a modified version of the affected dependencies. With .hie files, we can get them by adding a global ghc-options (assuming that the ghc libraries already come with their .hie files thanks to MR !16901), which would be a similar user experience to re-building a project in profiling mode. In other words, we are looking for a solution that is also not too complicated to deploy for end-users.

Thanks again for getting back to me!

Ah, I’d misunderstood that aspect of .hie files - thanks.

I’m starting to think that the best way to achieve soundness here is to instead build a GHC feature where it can emit a list of package-and-module-qualified references to imported names as it elaborates a file, and again when it emits the final version of the Core. If this were in a fairly standard format (e.g. one name per line), then it would be fairly straightforward to build this tool and perhaps many others.

Then the security advisory auditing tool could get these files together with the build plan, and see if any affected names are used.

It seems important here to not link to GHC. Tools that link to GHC require lots of maintenance, and this tool will be much more useful if it doesn’t need to be updated regularly and a single build of it can be compatible with lots of GHC builds.

I suppose, modulo important concerns about error messages - but this seems like massive overkill! All we need to do is resolve fully-qualified free variables in the packages in a build plan, and then compare them to a known list.

On further reflection, simplified core seems like a poor fit. That’s because we really do want to warn on all references to definitions with advisories attached that occur in the source program, even if they’re eliminated by optimization passes. Getting warnings for advisories should be a property of a Haskell program, not a property of a set of GHC build flags applied to a Haskell program, I’d think.

Getting warnings for advisories should be a property of a Haskell program, not a property of a set of GHC build flags applied to a Haskell program, I’d think.

What are your thoughts about functions introduced into the final program through RULES pragmas?

1 Like

I suppose concerned user would like to know if a vulnerable code is included in their application build, so perhaps we do need to take into account GHC build flags in order to get the full coverage induced by RULES and other optimizations.

@wz1000 would you have references to some doc or code using the simplified core from interface files?

edit: found this in ghc-9.6.1 documentation:

I’m not familiar with GHC Core, aren’t inlined declaration name omitted from the IfaceExpr?

Would it be sufficient to consider free occurrences of a name in a RULES pragma as a use of said name, just as with expressions that don’t occur in such a pragma?

We do have the infrastructure and ability to create arbitrary metadata during GHC runs (via plugins, or at some point hard coded into GHC), and carry those along with packages (though support in cabal for plugin meta-data). This was implemented by @csabahruska for use with the external interpreter and is highly flexible. I would recommend any form of metadata experiments to first go through this mechanism (using a plugin), and after some validation time to be potentially incorporated into GHC (if a plugin ends up not being sufficient; e.g. we do not yet have plugins for cross compilers solved properly).

But please let’s not try to tag orthogonal stuff onto existing items, unless it’s absolutely necessary. HIE files should serve Haskell IDE Engine first and foremost.

3 Likes

It’s worth noting that it is rather non-trivial to run a plugin on all your dependencies, compared to building them with an extra GHC flag. See RFC: design for better GHC plugin support in cabal · Issue #7901 · haskell/cabal · GitHub

1 Like

Alright, so for good measure I am investigating using the Core representation to analyze dependencies using a GHC plugins (code). That seems to work equally well, here is the new output for my test scenario (e.g. does useAlwaysTrue reach alwaysTrue through a typeclass instance):

# CabalAudit.Test.Instance
CabalAudit.Test.Instance.alwaysTrue: GHC.Base.const, GHC.Types.True
CabalAudit.Test.Instance.$ctasty: GHC.Base.., GHC.Classes.not, CabalAudit.Test.Instance.alwaysTrue
CabalAudit.Test.Instance.$fTestClassTea: CabalAudit.Test.Instance.$ctasty

# CabalAudit.Test.User
CabalAudit.Test.User.useAlwaysTrue: CabalAudit.Test.Instance.$fTestClassTea, CabalAudit.Test.Instance.Tea

This approach seems simpler and more reliable. I’ll look into dumping this metadata to perform the whole analysis.

2 Likes

Here is a quick update. I’ve re-organized the cabal-audit repository to investigate the different leads (GHC plugins, simplified core in .hi files and .hie files). My current challenge is to collect the declarations metadatas from the build dependencies:

  • For the plugin artifacts and the .hie files, it seems like we would need to lookup the file location per module and I don’t know how to get that reliably. For now I patched the packages with nix to includes these files next to the .hi files, and perhaps using for pkg in $(ghc-pkg list --simple-output); do ghc-pkg field $pkg library-dirs | awk '{ print $2 }'; done will be enough to find all the relevant files?

  • For the .hi files, it seems like we need the HscEnv to call tcTopIfaceBindings. So I managed to make a PoC in the CabalAuditIface module with runGhc as a library, and that is quite nice because we can use the lookupModule facility of GHC to find the interface file. Though I don’t know yet how to register the package db of cabal.

I don’t fully understand what I am doing here, that seems to work, but maybe that is very wrong, so please let me know if I am missing something :slight_smile:

1 Like

Continuing with the suggestions to use a GHC plugin, I’ve now implemented a minimal proof of concept. By following the new usage instructions, cabal-audit generates the following call graph for this module:

module CabalAudit.Test.External where

import Text.JSON qualified

decodeFloat :: String -> Maybe Float
decodeFloat str = case Text.JSON.decode str of
  Text.JSON.Ok v -> Just v
  Text.JSON.Error _ -> Nothing

Visualized with gephi (the root is at the top left):

graph

It seems like this working as expected, in this example I wanted to check if readFloat was used from base, but we can see it is using $fReadInteger and readDec instead. Though it does not seem like I’m correctly handling the names from core, for example multiple class instance share the same OccName, so I am the name’s Unique, but that doesn’t seems stable across module.

Anyway, thank you for the feedbacks, I think this RFC is now solved. Next I’ll propose a Architectural Decision Records (ADR) for the cabal-audit implementation.

4 Likes