Skip invalid characters on reading

james64 · May 26, 2020, 9:54am

Hey!

I am starting my little project with simple copy stdin to stdout code like this:

module Main where

import System.IO
import Lib

main :: IO ()
main = do
    iseof <- hIsEOF stdin
    if iseof
        then return ()
        else do
            line <- getLine
            putStrLn line
            main

Reading input with invalid (probably UTF) character I am getting (unsurprisingly) this:

<stdin>: hGetLine: invalid argument (invalid byte sequence)

I do want to support malformed chars on input. I want to skip such chars and continue processing.

Can somebody please point me in the right direction how to do it? I am comfortable with FP but not very knowledgeable about Haskell libraries and apis…

thanks!

jaror · May 26, 2020, 12:27pm

EDIT: For people looking for the best answer in this thread, scroll down to Skip invalid characters on reading

hsyl20

This works, but I think it is ugly:

{-# LANGUAGE TypeApplications #-}
module Main where

import System.IO
import Lib
import Control.Exception (handle)
import Control.Monad (void)
import Foreign.Marshal.Alloc (malloc)
import Data.Word (Word8)

main :: IO ()
main = do
    hSetEncoding stdin utf16 -- I had to add this to make the error occur
    buf <- malloc @Word8
    iseof <- hIsEOF stdin
    if iseof
        then return ()
        else do
            handle @IOError (\_ -> void (hGetBuf stdin buf 1)) $ do
                line <- getLine
                putStrLn line
            main

Basically, the hGetBuf never returns an error and just removes one byte from stdin.

But if the encoding is not single-byte, then this might produce garbage.

cloudyhug · May 27, 2020, 10:13am

Hello. Isn’t this the code you’re looking for?

module Main where

main :: IO ()
main = interact id

james64 · May 27, 2020, 11:39am

@jaror - I will give that a try

@cloudyhug
This is just my starting point. Final behaviour will modify input in a way which is not simple mapping of input lines. That’s why I have not used interact for this. Anyhow, I will try whether this handles those faulty utf characters. If it does it can help towards my need.

Thanks so far!

jaror · May 27, 2020, 12:10pm

I just tried my code, it drops the start of each sentence containing an invalid character. So, it is propably not exactly what you want.

If you only care about 8 bit characters then you could use the following:

module Main where

import System.IO
import Lib

main :: IO ()
main = do
    hSetEncoding stdin char8
    iseof <- hIsEOF stdin
    if iseof
        then return ()
        else do
            line <- getLine
            putStrLn line
            main

See the documentation of char8:

An encoding in which Unicode code points are translated to bytes by taking the code point modulo 256. When decoding, bytes are translated directly into the equivalent code point.

This encoding never fails in either direction. However, encoding discards information, so encode followed by decode is not the identity.

hsyl20 · May 27, 2020, 1:34pm

You probably want to use an encoding such as UTF-8//IGNORE.

let enc = mkTextEncoding "UTF-8//IGNORE"
hSetEncoding stdin enc

see https://www.stackage.org/haddock/lts-15.14/base-4.13.0.0/GHC-IO-Encoding.html#v:mkTextEncoding

james64 · May 28, 2020, 1:46pm

@hsyl20 - This is exactly what I was looking for. It works. Thanks a lot for doc link as well!

I would mark your post as a solution if I could find such a button. In fact I do not see a way how to close this thread…

Topic		Replies	Views
[Solved]What the range of characters that `putStrLn` support? Learn	5	930	December 10, 2023
Get N Chars from Handle Learn	7	388	August 28, 2024
Bad file descriptor for stdin Learn	1	1262	November 13, 2021
Text-2.0 with UTF8 is finally released! Announcements	20	2821	January 27, 2022
HF Tech Proposal #1: UTF-8 Encoded Text Haskell Foundation	40	6338	June 2, 2021

Skip invalid characters on reading

Related topics