Skip invalid characters on reading

Hey!

I am starting my little project with simple copy stdin to stdout code like this:

module Main where

import System.IO
import Lib

main :: IO ()
main = do
    iseof <- hIsEOF stdin
    if iseof
        then return ()
        else do
            line <- getLine
            putStrLn line
            main

Reading input with invalid (probably UTF) character I am getting (unsurprisingly) this:

<stdin>: hGetLine: invalid argument (invalid byte sequence)

I do want to support malformed chars on input. I want to skip such chars and continue processing.

Can somebody please point me in the right direction how to do it? I am comfortable with FP but not very knowledgeable about Haskell libraries and apis…

thanks!

1 Like

EDIT: For people looking for the best answer in this thread, scroll down to Skip invalid characters on reading

hsyl20

This works, but I think it is ugly:

{-# LANGUAGE TypeApplications #-}
module Main where

import System.IO
import Lib
import Control.Exception (handle)
import Control.Monad (void)
import Foreign.Marshal.Alloc (malloc)
import Data.Word (Word8)

main :: IO ()
main = do
    hSetEncoding stdin utf16 -- I had to add this to make the error occur
    buf <- malloc @Word8
    iseof <- hIsEOF stdin
    if iseof
        then return ()
        else do
            handle @IOError (\_ -> void (hGetBuf stdin buf 1)) $ do
                line <- getLine
                putStrLn line
            main

Basically, the hGetBuf never returns an error and just removes one byte from stdin.

But if the encoding is not single-byte, then this might produce garbage.

Hello. Isn’t this the code you’re looking for?

module Main where

main :: IO ()
main = interact id

@jaror - I will give that a try

@cloudyhug
This is just my starting point. Final behaviour will modify input in a way which is not simple mapping of input lines. That’s why I have not used interact for this. Anyhow, I will try whether this handles those faulty utf characters. If it does it can help towards my need.

Thanks so far!

I just tried my code, it drops the start of each sentence containing an invalid character. So, it is propably not exactly what you want.

If you only care about 8 bit characters then you could use the following:

module Main where

import System.IO
import Lib

main :: IO ()
main = do
    hSetEncoding stdin char8
    iseof <- hIsEOF stdin
    if iseof
        then return ()
        else do
            line <- getLine
            putStrLn line
            main

See the documentation of char8:

An encoding in which Unicode code points are translated to bytes by taking the code point modulo 256. When decoding, bytes are translated directly into the equivalent code point.

This encoding never fails in either direction. However, encoding discards information, so encode followed by decode is not the identity.

You probably want to use an encoding such as UTF-8//IGNORE.

let enc = mkTextEncoding "UTF-8//IGNORE"
hSetEncoding stdin enc

see https://www.stackage.org/haddock/lts-15.14/base-4.13.0.0/GHC-IO-Encoding.html#v:mkTextEncoding

3 Likes

@hsyl20 - This is exactly what I was looking for. It works. Thanks a lot for doc link as well!

I would mark your post as a solution if I could find such a button. In fact I do not see a way how to close this thread…