Question on laziness of I/O actions in Haskell

Hey everyone! I’m currently studying some useful I/O functions. In the book Learn You a Haskell for Great Good!, the author mentions that getContents does lazy I/O, and I’m trying to understand what he means.

I’m using the example he gives in the book:

import Data.Char (toUpper)

main = do
    contents <- getContents
    putStr (map toUpper contents)

I compile this program and run it in the terminal. I want to know what Haskell does during each of the following steps:

  1. I type Hey in the terminal;
  2. I press the Enter key.

The author says: “It won’t try to read the whole content at once and store it into memory before printing out the capslocked version. Rather, it will print out the capslocked version as it reads it, because it will only read a line from the input when it really needs to.”

Here are my thoughts on what Haskell does at each step:

  1. Haskell does nothing;
  2. Haskell reads Hey, i.e. it reads 'H' : 'e' : 'y' : [] from left to right, and every time it reads a character it “prints” the corresponding uppercase letter.

But what the author said makes me feel like Haskell actually does something in step 1.

Any advice would be of great help!

Disclaimer: I’m far from being an expert here, this is the mental image that I’ve built, which might be completely wrong.

What is really in place isn’t some magic of getContents. It’s general magic of Haskell’s laziness. contents isn’t a string. It’s an unevaluated expression which (when evaluated) will result in a character and another unevaluated expression and so on. Finally, one of those expressions will evaluate to an empty list (remember, strings are just lists of characters). This has nothing to do with IO (as in operation, not the type IO), the IO is hidden in what it takes to evaluate that expression.

Now, you go to the second line. putStr needs a whole list, but it’ll take it one character after another. So it’ll ‘ask’ its argument (I’ll use ‘ask’ as a synonym for ‘evaluate’), which happens to be map toUpper contents. At this point map asks contents ‘give me the next element of the list’ (so that it can put it through toUpper and return it to putStr). At this point contents (which is the unevaluated return value of getContents) figures out it doesn’t have any character to return, so it reads some from the operating system. How many? Some. As many as the OS is willing to give at that specific moment, maybe one, maybe 1000 (if you’ve redirected stdin to read a book), and as many as fit in its internal buffers, which are specified but hidden somewhere deep. This is all it takes to print the first letter of the response. Assume it’s 10 for further discussion.

Now putStr asks for the second letter. The procedure starts the same, but contents happily serves the next letter from its internal buffers. A few more iterations pass, and this time the buffer is (again) empty. contents will reach out to the OS to get the next few characters. At some point it’ll learn from the OS that the file has ended, and it’ll return an empty list instead of the next character.

It doesn’t matter that much here, as the string is sufficiently small. But if you pipe several GB of text through it, it’ll still continue this ‘reading in chunks’ workflow instead of gulping all of that into RAM (possibly exhausting it) and processing everything in memory before spitting it out to stdout.

And if you wonder, putStr also works like this: it won’t wait for the whole string before giving it to the OS. It’ll happily feed the OS one fragment at a time, those fragments being either single characters or some small internal buffers it fills one character at a time.
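
To see the ‘asking one element at a time’ part in isolation, without any input involved, here is a minimal sketch using an infinite string. The name endless is made up for illustration. If map or putStrLn needed the whole list before producing anything, this program could never finish; since they only ask for what is demanded, it stops after six characters.

```haskell
import Data.Char (toUpper)

-- Hypothetical name: an infinite String that can never be fully evaluated.
endless :: String
endless = cycle "hey"

main :: IO ()
main =
  -- take asks map for six elements, and map asks endless for six
  -- characters; nothing beyond that is ever evaluated.
  putStrLn (take 6 (map toUpper endless))  -- prints HEYHEY
```

The getContents case has the same shape, only the list being asked happens to be backed by stdin instead of cycle.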

This is all correct, except that this isn’t how IO usually works; it only does so here because getContents uses unsafeInterleaveIO, which these days is seen as almost always a bad idea. If not for backwards compatibility issues, getContents would have been removed from base a long time ago.

On the contrary, it is some special magic of getContents. getContents is ultimately implemented in terms of lazyRead, which itself is implemented using unsafeInterleaveIO. That’s the magic. I’d say it’s black magic, and to be completely avoided.

The rest of your description is correct @Torinthiel. The thing that’s special in the case of getContents (and other forms of lazy I/O) is the black magic that permits the program to do I/O while “asking” (your word for “evaluating”) a thunk, which should be a pure operation. That’s not technically wrong, but it is very, very confusing. It comes from an era when “lazy all the things” was the hot topic in Haskell. It was worth exploring but ultimately failed.
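
For the curious, here is a rough sketch of how that trick can work. lazyChars is a hypothetical helper written for this post, not the actual lazyRead from base, which as far as I know works on buffered chunks and also takes care of closing the handle. The only point is to show where unsafeInterleaveIO defers the read until the list is evaluated.

```haskell
import Data.Char (toUpper)
import System.IO (Handle, hGetChar, hIsEOF, stdin)
import System.IO.Unsafe (unsafeInterleaveIO)

-- Hypothetical helper (NOT base's lazyRead): returns immediately with an
-- unevaluated list. Each cell performs its read only when it is forced.
lazyChars :: Handle -> IO String
lazyChars h = unsafeInterleaveIO $ do
  eof <- hIsEOF h
  if eof
    then return []
    else do
      c    <- hGetChar h
      rest <- lazyChars h   -- deferred again, so no read happens here
      return (c : rest)

main :: IO ()
main = do
  contents <- lazyChars stdin          -- no input has been read yet
  putStr (map toUpper contents)        -- reads happen as putStr demands
```

Swapping lazyChars stdin for getContents should give the same observable behaviour, which is exactly why the “pure” evaluation of contents ends up doing I/O.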

I recommend not trying to understand lazy I/O: ignore everything that uses it, completely forget it exists, and use proper streaming abstractions instead (pipes/conduit/streaming/streamly/Bluefin).

Then, should getContents be deprecated and given a warning? I used it without considering it harmful…

Does this leave us with any non-deprecated (in a less formal sense) way of doing IO in pure base, without any external packages?

There is getContents', a strict variant of getContents added recently-ish to avoid lazy IO.
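
A minimal sketch of the book’s example using it, assuming base 4.15 or later (GHC 9.0+), where System.IO exports getContents':

```haskell
import Data.Char (toUpper)
import System.IO (getContents')

main :: IO ()
main = do
  -- Strict: all of stdin is read (until end of input) before this returns.
  contents <- getContents'
  putStr (map toUpper contents)
```

The trade-off is the obvious one: nothing is printed until the input has ended, and the whole input sits in memory.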

I’d be in favour of that, but it’s not the biggest fish to fry.

Sure, most of the I/O in base is not lazy.
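
For instance, here is a sketch of the uppercasing program that uses only strict functions from base and still runs in constant memory. upperLines is a name I made up for illustration.

```haskell
import Data.Char (toUpper)
import System.IO (isEOF)

-- Hypothetical helper: read one line, print it in uppercase, and repeat
-- until end of input. Every read happens at a visible point in the IO code.
upperLines :: IO ()
upperLines = do
  eof <- isEOF
  if eof
    then return ()
    else do
      line <- getLine
      putStrLn (map toUpper line)
      upperLines

main :: IO ()
main = upperLines
```

One small difference from the lazy version: putStrLn always ends the output with a newline, even if the last input line had none.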

Note that fixIO uses unsafeInterleaveIO (or something like it) and an MVar. It’s pretty clever because it’s thread-safe that way.

I’d prefer it if base got a proper Stream datatype first so that I don’t have to use a “proper streaming abstraction” for something as basic as feeding a file into an incremental parser.

You don’t even really need one. (a -> IO b) -> IO r is an extraordinarily high power-to-weight-ratio equivalent of pipes’ Pipe a b IO r.
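
A sketch of what I mean, with a = String, b = () and r = Int. forEachLine is a hypothetical helper, not something from base; it hands each line to a callback and returns the number of lines it saw.

```haskell
import Data.Char (toUpper)
import System.IO (Handle, hGetLine, hIsEOF, stdin)

-- Hypothetical helper with the shape (a -> IO b) -> IO r once the handle
-- is supplied: the callback consumes each line, the result is a line count.
forEachLine :: Handle -> (String -> IO ()) -> IO Int
forEachLine h act = go 0
  where
    go n = do
      eof <- hIsEOF h
      if eof
        then return n
        else do
          line <- hGetLine h
          act line
          go $! (n + 1)   -- keep the counter evaluated as we go

main :: IO ()
main = do
  n <- forEachLine stdin (putStrLn . map toUpper)
  putStrLn ("lines: " ++ show n)
```

The callback could just as well feed an incremental parser; the producer decides when to read, and the consumer never sees a lazy list.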

Then, should getContents be deprecated and given a warning? I used it without considering it harmful…

At least it’s in this list already:

But for another reason, right?

Oh yes, good point! PR welcome I guess!

It means that the content of the resource acquired won’t be in memory until evaluated.

This is good, you use it if you need it. No need to load 5 GB of data if the content isn’t needed later.

It means that the resource acquired won’t be released until all the content is evaluated.

This is horrible.

main :: IO ()
main = do
  content <- readFile "foo.txt"
  writeFile "foo.txt" "bar"
  print content

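This crashes: readFile is lazy, so foo.txt is still open for reading when writeFile tries to open it, and GHC refuses with a “resource busy (file is locked)” error. Now imagine this but spread across a codebase.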

Unfortunately lazy IO can’t work if the world outside isn’t lazy.

Avoid lazy IO.
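
For comparison, a sketch of the same program with the strict readFile' from System.IO, assuming base 4.15 or later. The first writeFile is only there so the example is self-contained.

```haskell
import System.IO (readFile')

main :: IO ()
main = do
  writeFile "foo.txt" "foo"         -- set-up so the file exists
  content <- readFile' "foo.txt"    -- reads everything, then closes the file
  writeFile "foo.txt" "bar"         -- fine: nothing is holding the file open
  print content                     -- prints "foo"
```

The file is read and released at one well-defined point, so the later write cannot collide with a read that has not happened yet.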
