I am starting my little project with simple copy stdin to stdout code like this:
module Main where
import System.IO
import Lib
main :: IO ()
main = do
iseof <- hIsEOF stdin
if iseof
then return ()
else do
line <- getLine
putStrLn line
main
Reading input with invalid (probably UTF) character I am getting (unsurprisingly) this:
I do want to support malformed chars on input. I want to skip such chars and continue processing.
Can somebody please point me in the right direction how to do it? I am comfortable with FP but not very knowledgeable about Haskell libraries and apis…
{-# LANGUAGE TypeApplications #-}
module Main where
import System.IO
import Lib
import Control.Exception (handle)
import Control.Monad (void)
import Foreign.Marshal.Alloc (malloc)
import Data.Word (Word8)
main :: IO ()
main = do
hSetEncoding stdin utf16 -- I had to add this to make the error occur
buf <- malloc @Word8
iseof <- hIsEOF stdin
if iseof
then return ()
else do
handle @IOError (\_ -> void (hGetBuf stdin buf 1)) $ do
line <- getLine
putStrLn line
main
Basically, the hGetBuf never returns an error and just removes one byte from stdin.
But if the encoding is not single-byte, then this might produce garbage.
@cloudyhug
This is just my starting point. Final behaviour will modify input in a way which is not simple mapping of input lines. That’s why I have not used interact for this. Anyhow, I will try whether this handles those faulty utf characters. If it does it can help towards my need.
I just tried my code, it drops the start of each sentence containing an invalid character. So, it is propably not exactly what you want.
If you only care about 8 bit characters then you could use the following:
module Main where
import System.IO
import Lib
main :: IO ()
main = do
hSetEncoding stdin char8
iseof <- hIsEOF stdin
if iseof
then return ()
else do
line <- getLine
putStrLn line
main
An encoding in which Unicode code points are translated to bytes by taking the code point modulo 256. When decoding, bytes are translated directly into the equivalent code point.
This encoding never fails in either direction. However, encoding discards information, so encode followed by decode is not the identity.