Read a file character by character/UTF8

{{draft task|File handling}}
 
;Task:
Read a file one character at a time, as opposed to [[Read entire file|reading the entire file at once]].
 
The procedure should support the reading of files containing UTF8 encoded wide characters, returning whole characters for each consecutive read.
 
;See also
*   [[Read a file line by line]]
<br><br>
 
=={{header|Action!}}==
<syntaxhighlight lang="action!">byte X

Proc Main()

 Open (1,"D:FILENAME.TXT",4,0)
 Do
  X=GetD(1)
  Put(X)
  Until EOF(1)
 Od
 Close(1)

Return</syntaxhighlight>

=={{header|Run BASIC}}==
<syntaxhighlight lang="runbasic">open "file.txt" for binary as #f
numChars = 1              ' specify number of characters to read
a$ = input$(#f,numChars)  ' read number of characters specified
b$ = input$(#f,1)         ' read one character
close #f</syntaxhighlight>
 
=={{header|AutoHotkey}}==
{{works with|AutoHotkey 1.1}}
<syntaxhighlight lang="autohotkey">File := FileOpen("input.txt", "r")
while !File.AtEOF
    MsgBox, % File.Read(1)</syntaxhighlight>

=={{header|BASIC256}}==
<syntaxhighlight lang="basic256">f = freefile
filename$ = "file.txt"

open f, filename$

while not eof(f)
    print chr(readbyte(f));
end while
close f
end</syntaxhighlight>

=={{header|Perl 6}}==
Perl 6 has a built-in method <code>.getc</code> to get a single character from an open file handle. File handles default to UTF-8, so they will handle multi-byte characters correctly.

To read a single character at a time from standard input (<code>$*IN</code> in Perl 6):
<syntaxhighlight lang="perl6">.say while defined $_ = $*IN.getc;</syntaxhighlight>

Or, from a file:
<syntaxhighlight lang="perl6">my $filename = 'whatever';

my $in = open( $filename, :r ) or die "$!\n";

print $_ while defined $_ = $in.getc;</syntaxhighlight>
 
=={{header|C}}==
<syntaxhighlight lang="c">#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <locale.h>

int main(void)
{
    /* If your native locale doesn't use UTF-8 encoding
     * you need to replace the empty string with a
     * locale like "en_US.utf8"
     */
    setlocale(LC_ALL, "");
    FILE *in = fopen("input.txt", "r");
    if (in == NULL) {
        /* report why the file could not be opened */
        perror("input.txt");
        return EXIT_FAILURE;
    }

    wint_t c;
    while ((c = fgetwc(in)) != WEOF)
        putwchar(c);
    fclose(in);

    return EXIT_SUCCESS;
}</syntaxhighlight>
 
=={{header|C++}}==
<syntaxhighlight lang="cpp">
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <locale>

int main()
{
    /* If your native locale doesn't use UTF-8 encoding
     * you need to replace the empty string with a
     * locale like "en_US.utf8"
     */
    std::locale::global(std::locale("")); // for C++
    std::wcout.imbue(std::locale());

    // A wide stream decodes the multibyte input into wchar_t values.
    std::wifstream in("input.txt");
    in.imbue(std::locale());

    wchar_t c;
    while (in.get(c)) // stops at end of file or on a read error
        std::wcout << c;
    in.close();
    return EXIT_SUCCESS;
}
</syntaxhighlight>
 
=={{header|C sharp|C#}}==
<syntaxhighlight lang="csharp">using System;
using System.IO;
using System.Text;
 
namespace RosettaFileByChar
{
class Program
{
static char GetNextCharacter(StreamReader streamReader) => (char)streamReader.Read();
 
static void Main(string[] args)
{
Console.OutputEncoding = Encoding.UTF8;
char c;
using (FileStream fs = File.OpenRead("input.txt"))
{
using (StreamReader streamReader = new StreamReader(fs, Encoding.UTF8))
{
while (!streamReader.EndOfStream)
{
c = GetNextCharacter(streamReader);
Console.Write(c);
}
}
}
}
}
}</syntaxhighlight>
 
=={{header|Common Lisp}}==
{{works with|CLISP}}{{works with|Clozure CL}}{{works with|CMUCL}}{{works with|ECL (Lisp)}}{{works with|SBCL}}{{works with|ABCL}}
 
<syntaxhighlight lang="lisp">;; CLISP puts the external formats into a separate package
#+clisp (import 'charset:utf-8 'keyword)
 
(with-open-file (s "input.txt" :external-format :utf-8)
(loop for c = (read-char s nil)
while c
do (format t "~a" c)))</syntaxhighlight>
 
=={{header|Crystal}}==
{{trans|Ruby}}
 
The encoding is UTF-8 by default, but it can be explicitly specified.
 
<syntaxhighlight lang="ruby">File.open("input.txt") do |file|
file.each_char { |c| p c }
end</syntaxhighlight>
 
or
 
<syntaxhighlight lang="ruby">File.open("input.txt") do |file|
while c = file.read_char
p c
end
end</syntaxhighlight>

=={{header|Delphi}}==
{{libheader| System.SysUtils}}
{{libheader| System.Classes}}
{{Trans|C#}}
<syntaxhighlight lang="delphi">
program Read_a_file_character_by_character_UTF8;
 
{$APPTYPE CONSOLE}
 
uses
System.SysUtils,
System.Classes;
 
function GetNextCharacter(StreamReader: TStreamReader): char;
begin
Result := chr(StreamReader.Read);
end;
 
const
FileName: TFileName = 'input.txt';
 
begin
if not FileExists(FileName) then
raise Exception.Create('Error: File does not exist.');
 
var F := TStreamReader.Create(FileName, TEncoding.UTF8);
 
while not F.EndOfStream do
begin
var c := GetNextCharacter(F);
write(c);
end;
readln;
end.</syntaxhighlight>
 
=={{header|Déjà Vu}}==
 
<syntaxhighlight lang="dejavu">#helper function that deals with non-ASCII code points
local (read-utf8-char) file tmp:
	!read-byte file
	if = :eof dup:
		drop
		raise :unicode-error
	resize-blob tmp ++ dup len tmp
	set-to tmp
	try:
		return !decode!utf-8 tmp
	catch unicode-error:
		if < 3 len tmp:
			raise :unicode-error
		(read-utf8-char) file tmp

#reader function
read-utf8-char file:
	!read-byte file
	if = :eof dup:
		return
	local :tmp make-blob 1
	set-to tmp 0
	try:
		return !decode!utf-8 tmp
	catch unicode-error:
		(read-utf8-char) file tmp

#if the module is used as a script, read from the file "input.txt",
#showing each code point separately
if = (name) :(main):
	local :file !open :read "input.txt"

	while true:
		read-utf8-char file
		if = :eof dup:
			drop
			!close file
			return
		!.</syntaxhighlight>
 
=={{header|Factor}}==
<syntaxhighlight lang="text">USING: kernel io io.encodings.utf8 io.files strings ;
IN: rosetta-code.read-one
 
"input.txt" utf8 [
[ read1 dup ] [ 1string write ] while drop
] with-file-reader</syntaxhighlight>
 
 
=={{header|FreeBASIC}}==
<syntaxhighlight lang="freebasic">Dim As Long f
f = Freefile
 
Dim As String filename = "file.txt"
Dim As String*1 txt
 
Open filename For Binary As #f
While Not Eof(f)
txt = String(Lof(f), 0)
Get #f, , txt
Print txt;
Wend
Close #f
Sleep</syntaxhighlight>
 
=={{header|FunL}}==
<syntaxhighlight lang="funl">import io.{InputStreamReader, FileInputStream}
 
r = InputStreamReader( FileInputStream('input.txt'), 'UTF-8' )
 
while (ch = r.read()) != -1
print( chr(ch) )
r.close()</syntaxhighlight>
 
=={{header|Go}}==
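<syntaxhighlight lang="go">package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

// Runer wraps a RuneReader in a closure that returns one rune per call.
// The parameter is named rr so the named result r does not shadow it.
func Runer(rr io.RuneReader) func() (rune, error) {
	return func() (r rune, err error) {
		r, _, err = rr.ReadRune()
		return
	}
}

func main() {
	runes := Runer(bufio.NewReader(os.Stdin))
	// Keep going while there is no error; io.EOF ends the loop.
	for r, err := runes(); err == nil; r, err = runes() {
		fmt.Printf("%c", r)
	}
}</syntaxhighlight>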
 
=={{header|Haskell}}==
 
{{Works with|GHC|7.8.3}}
 
<syntaxhighlight lang="haskell">#!/usr/bin/env runhaskell

{- The procedure to read a UTF-8 character is just:

   hGetChar :: Handle -> IO Char

   assuming that the encoding for the handle has been set to utf8.
-}

import System.Environment (getArgs)
import System.IO (
    Handle, IOMode (..),
    hGetChar, hIsEOF, hSetEncoding, stdin, utf8, withFile
  )
import Control.Monad (forM_, unless)
import Text.Printf (printf)
import Data.Char (ord)

processCharacters :: Handle -> IO ()
processCharacters h = do
  done <- hIsEOF h
  unless done $ do
    c <- hGetChar h
    putStrLn $ printf "U+%04X" (ord c)
    processCharacters h

processOneFile :: Handle -> IO ()
processOneFile h = do
  hSetEncoding h utf8
  processCharacters h

{- You can specify one or more files on the command line, or if no
   files are specified, it reads from standard input.
-}
main :: IO ()
main = do
  args <- getArgs
  case args of
    [] -> processOneFile stdin
    xs -> forM_ xs $ \name -> do
      putStrLn name
      withFile name ReadMode processOneFile</syntaxhighlight>
{{out}}
<pre>
bash$ echo €50 | ./read-char-utf8.hs
U+20AC
U+0035
U+0030
U+000A
</pre>
 
=={{header|J}}==
 
Reading a file a character at a time is antithetical not only to the architecture of J, but to the architecture and design of most computers and most file systems. Nevertheless, this can be a useful concept if you're building your own hardware. So let's model it...
 
First, we know that the first octet of a UTF-8 sequence tells us the length of the sequence needed to represent that character. Specifically, we can convert that value to binary and count the number of leading 1s to find the length of the character in octets (except that the length is always at least 1, since an ASCII octet has no leading 1s).
 
<syntaxhighlight lang="j">u8len=: 1 >. 0 i.~ (8#2)#:a.&i.</syntaxhighlight>
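
For readers who do not know J, the same lead-octet rule can be sketched in Python. This is only an illustration: the function name <code>u8len</code> is borrowed from the J definition above, and no validation is attempted, so a stray continuation octet such as 0x80 simply reports a length of 1.

<syntaxhighlight lang="python">def u8len(lead_byte):
    """Length in octets of the UTF-8 sequence introduced by lead_byte.

    Counts the leading 1 bits of the octet; an ASCII octet has none
    but still occupies one octet, hence the max().
    """
    ones = 0
    while ones < 8 and lead_byte & (0x80 >> ones):
        ones += 1
    return max(1, ones)


if __name__ == "__main__":
    for ch in ("y", "ä", "€", "𝄞"):
        lead = ch.encode("utf-8")[0]
        print(f"{ch!r}: lead octet 0x{lead:02X}, length {u8len(lead)}")</syntaxhighlight>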
 
So now we can use an indexed file read to read a UTF-8 character starting at a specific file index. We read the first octet and then read as many additional octets as that first octet calls for. If that's not possible, we return EOF:
 
<syntaxhighlight lang="j">indexedread1u8=:4 :0
try.
octet0=. 1!:11 y;x,1
octet0,1!:11 y;(x+1),<:u8len octet0
catch.
'EOF'
end.
)</syntaxhighlight>
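
The indexed read can be modelled the same way in Python. The sketch below makes some assumptions that are not part of the J code: the names <code>indexed_read_u8</code> and <code>read_chars</code> are hypothetical, the input is taken to be well-formed UTF-8, and an empty result (rather than the string 'EOF') marks the end of the file.

<syntaxhighlight lang="python">import os
import tempfile


def u8len(lead_byte):
    """Count the leading 1 bits of lead_byte; the result is at least 1."""
    ones = 0
    while ones < 8 and lead_byte & (0x80 >> ones):
        ones += 1
    return max(1, ones)


def indexed_read_u8(path, index):
    """Return the octets of the UTF-8 character at byte offset index.

    Returns b"" at end of file. Assumes well-formed UTF-8 and that
    index falls on the first octet of a character.
    """
    with open(path, "rb") as f:
        f.seek(index)
        first = f.read(1)
        if not first:
            return b""
        return first + f.read(u8len(first[0]) - 1)


def read_chars(path):
    """Yield each character, advancing the index by its length in octets."""
    index = 0
    while True:
        raw = indexed_read_u8(path, index)
        if not raw:
            return
        yield raw.decode("utf-8")
        index += len(raw)


if __name__ == "__main__":
    # Hypothetical sample data written to a temporary file.
    with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
        tmp.write("y䮀𝄞".encode("utf-8"))
    try:
        print(list(read_chars(tmp.name)))
    finally:
        os.remove(tmp.name)</syntaxhighlight>

Reopening the file for every character keeps the sketch close to the indexed-read model, and is just as inefficient.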
 
The length of the result tells us what to add to the file index to find the next available file index for reading.
 
Of course, this is massively inefficient. So if someone ever asks you to do this, make sure you ask them "Why?" Because the answer to that question is going to be important (and might suggest a completely different implementation).
 
Note also that it would make more sense to return an empty string, instead of the string 'EOF', when we reach the end of the file. But that is out of scope for this task.
 
=={{header|Java}}==
The ''FileReader'' class offers a ''read'' method that returns the integer value of the next character on each call.<br />
When the end of the stream is reached, it returns -1.<br />
One way to implement this task is to wrap a ''FileReader'' in a class and expose a method that returns the next character.
<syntaxhighlight lang="java">
import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class Program {
    private final FileReader reader;

    public Program(String path) throws IOException {
        // the task calls for UTF-8 encoded input
        reader = new FileReader(path, StandardCharsets.UTF_8);
    }

    /** @return integer value from 0 to 0xffff, or -1 for EOS */
    public int nextCharacter() throws IOException {
        return reader.read();
    }

    public void close() throws IOException {
        reader.close();
    }
}
</syntaxhighlight>
 
===Using Java 11===
<syntaxhighlight lang="java">
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
 
public final class ReadFileByCharacter {
public static void main(String[] aArgs) {
Path path = Path.of("input.txt");
try ( BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8) ) {
int value;
while ( ( value = reader.read() ) != END_OF_STREAM ) {
System.out.println((char) value);
}
} catch (IOException ioe) {
ioe.printStackTrace();
}
}
private static final int END_OF_STREAM = -1;
 
}
</syntaxhighlight>
{{ out }}
<pre>
R
o
s
e
t
t
a
</pre>
 
=={{header|jq}}==
jq being stream-oriented, it makes sense to define <code>readc</code> so that it emits a stream of the UTF-8 characters in the input:
<syntaxhighlight lang="jq">def readc:
inputs + "\n" | explode[] | [.] | implode;</syntaxhighlight>
 
Example:
<syntaxhighlight lang="sh">
echo '过活' | jq -Rn 'include "readc"; readc'
"过"
"活"
"\n"</syntaxhighlight>
 
=={{header|Julia}}==
 
The built-in <code>read(stream, Char)</code> function reads a single UTF8-encoded character from a given stream.
 
<syntaxhighlight lang="julia">open("myfilename") do f
while !eof(f)
c = read(f, Char)
println(c)
end
end</syntaxhighlight>
 
=={{header|Kotlin}}==
<syntaxhighlight lang="kotlin">// version 1.1.2

import java.io.File

const val EOF = -1

fun main(args: Array<String>) {
    val reader = File("input.txt").reader()  // uses UTF-8 by default
    reader.use {
        while (true) {
            val c = reader.read()
            if (c == EOF) break
            print(c.toChar())  // echo to console
        }
    }
}</syntaxhighlight>
 
=={{header|Lua}}==
{{works with|Lua|5.3}}
<syntaxhighlight lang="lua">
-- Return whether the given string is a single ASCII character.
function is_ascii (str)
return string.match(str, "[\0-\x7F]")
end
 
-- Return whether the given string is an initial byte in a multibyte sequence.
function is_init (str)
return string.match(str, "[\xC2-\xF4]")
end
 
-- Return whether the given string is a continuation byte in a multibyte sequence.
function is_cont (str)
return string.match(str, "[\x80-\xBF]")
end
 
-- Accept a filestream.
-- Return the next UTF8 character in the file.
function read_char (file)
local multibyte -- build a valid multibyte Unicode character
 
for c in file:lines(1) do
if is_ascii(c) then
if multibyte then
-- We've finished reading a Unicode character; unread the next byte,
-- and return the Unicode character.
file:seek("cur", -1)
return multibyte
else
return c
end
elseif is_init(c) then
if multibyte then
file:seek("cur", -1)
return multibyte
else
multibyte = c
end
elseif is_cont(c) then
multibyte = multibyte .. c
else
assert(false)
end
end
end
 
-- Test.
function read_all ()
testfile = io.open("tmp.txt", "w")
testfile:write("𝄞AöЖ€𝄞Ελληνικάy䮀成长汉\n")
testfile:close()
testfile = io.open("tmp.txt", "r")
 
while true do
local c = read_char(testfile)
if not c then return else io.write(" ", c) end
end
end
</syntaxhighlight>
{{out}}
𝄞 A ö Ж € 𝄞 Ε λ λ η ν ι κ ά y ä ® € 成 长 汉
 
=={{header|M2000 Interpreter}}==
From revision 27 of version 9.3 of the M2000 Environment, the Chinese letter 长 is displayed in the console as it is displayed in the editor.
 
<syntaxhighlight lang="m2000 interpreter">
Module checkit {
\\ prepare a file
\\ Save.Doc and Append.Doc to file, Load.Doc and Merge.Doc from file
document a$
a$={First Line
Second line
Third Line
Ελληνικά Greek Letters
y䮀
成长汉
}
Save.Doc a$, "checkthis.txt", 2 ' 2 for UTF-8
b$="*"
final$=""
buffer Clear bytes as byte*16
Buffer One as byte
Buffer Two as byte*2
Buffer Three as byte*3
Locale 1033
open "checkthis.txt" for input as #f
seek#f, 4 ' skip BOM
While b$<>"" {
GetOneUtf8Char(&b$)
final$+=b$
}
close #f
Report final$
Sub GetOneUtf8Char(&ch$)
ch$=""
if Eof(#f) then Exit Sub
Get #f, One
Return Bytes, 0:=Eval(one, 0)
local mrk=Eval(one, 0)
Try ok {
If Binary.And(mrk, 0xE0)=0xC0 then {
Get #f,one
Return Bytes, 1:=Eval$(one, 0,1)
ch$=Eval$(Bytes, 0, 2)
} Else.if Binary.And(mrk, 0xF0)=0xE0 then {
Get #f,two
Return Bytes, 1:=Eval$(two,0,2)
ch$=Eval$(Bytes, 0, 3)
} Else.if Binary.And(mrk, 0xF8)=0xF0 then {
Get #f,three
Return Bytes, 1:=Eval$(three, 0, 3)
ch$=Eval$(Bytes, 0, 4)
} Else ch$=Eval$(Bytes, 0, 1)
}
if Error or not ok then ch$="" : exit sub
ch$=left$(string$(ch$ as Utf8dec),1)
End Sub
}
checkit
</syntaxhighlight>
 
Using a document as final$:
 
<syntaxhighlight lang="m2000 interpreter">
Module checkit {
\\ prepare a file
\\ Save.Doc and Append.Doc to file, Load.Doc and Merge.Doc from file
document a$
a$={First Line
Second line
Third Line
Ελληνικά Greek Letters
y䮀
成长汉
}
Save.Doc a$, "checkthis.txt", 2 ' 2 for UTF-8
b$="*"
document final$
buffer Clear bytes as byte*16
Buffer One as byte
Buffer Two as byte*2
Buffer Three as byte*3
Locale 1033
open "checkthis.txt" for input as #f
seek#f, 4 ' skip BOM
oldb$=""
While b$<>"" {
GetOneUtf8Char(&b$)
\\ if final$ is document then 10 and 13 if comes alone are new line
\\ so we need to throw 10 after the 13, so we have to use oldb$
if b$=chr$(10) then if oldb$=chr$(13) then oldb$="": continue
oldb$=b$
final$=b$ ' we use = for append to document
}
close #f
Report final$
Sub GetOneUtf8Char(&ch$)
ch$=""
if Eof(#f) then Exit Sub
Get #f, One
Return Bytes, 0:=Eval(one, 0)
local mrk=Eval(one, 0)
Try ok {
If Binary.And(mrk, 0xE0)=0xC0 then {
Get #f,one
Return Bytes, 1:=Eval$(one, 0,1)
ch$=Eval$(Bytes, 0, 2)
} Else.if Binary.And(mrk, 0xF0)=0xE0 then {
Get #f,two
Return Bytes, 1:=Eval$(two,0,2)
ch$=Eval$(Bytes, 0, 3)
} Else.if Binary.And(mrk, 0xF8)=0xF0 then {
Get #f,three
Return Bytes, 1:=Eval$(three, 0, 3)
ch$=Eval$(Bytes, 0, 4)
} Else ch$=Eval$(Bytes, 0, 1)
}
if Error or not ok then ch$="" : exit sub
ch$=left$(string$(ch$ as Utf8dec),1)
End Sub
}
checkit
 
</syntaxhighlight>
 
=={{header|Mathematica}}/{{header|Wolfram Language}}==
<syntaxhighlight lang="mathematica">str = OpenRead["file.txt"];
ToString[Read[str, "Character"], CharacterEncoding -> "UTF-8"]</syntaxhighlight>
 
=={{header|NetRexx}}==
{{incorrect|Java|Maybe overengineered?}}
{{works with|Java|1.7}}
[[Java]], and by extension [[NetRexx]], provides I/O functions that read UTF-8 encoded character data directly from an attached input stream.
The <tt>Reader.read()</tt> method reads a single character as an integer value in the range 0 &ndash; 65535 [0x00 &ndash; 0xffff]. Reading from a file encoded in UTF-8 therefore delivers each character as an <tt>int</tt>; a code point above U+FFFF arrives as two successive surrogate values, as the sample output shows.
In the sample below the <tt>readCharacters</tt> method reads the file character by character into a <tt>String</tt> and returns the result to the caller. The rest of this sample examines the result and formats the details.

:The file <tt>data/utf8-001.txt</tt> is a UTF-8 encoded text file containing the following:&nbsp;&#x79;&#xE4;&#xAE;&#x20AC;&#x1D11E;&#x1D122;&#x31;&#x32;.
<syntaxhighlight lang="netrexx">/* NetRexx */
options replace format comments java crossref symbols nobinary
numeric digits 20
 
runSample(arg)
return
 
-- ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
method readCharacters(fName) public static binary returns String
slurped = String('')
slrp = StringBuilder()
fr = Reader null
fFile = File(fName)
EOF = int -1 -- End Of File indicator
do
fr = BufferedReader(FileReader(fFile))
ic = int
cc = char
-- read the contents of the file one character at a time
loop label rdr forever
-- Reader.read reads a single character as an integer value in the range 0 - 65535 [0x00 - 0xffff]
-- or -1 on end of stream i.e. End Of File
ic = fr.read()
if ic == EOF then leave rdr
cc = Rexx(ic).d2c
slrp.append(cc)
end rdr
-- load the results of the read into a variable
slurped = slrp.toString()
catch fex = FileNotFoundException
fex.printStackTrace()
catch iex = IOException
iex.printStackTrace()
finally
if fr \= null then do
fr.close()
catch iex = IOException
iex.printStackTrace()
end
end
return slurped
 
-- ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
method encodingDetails(str = String) public static
stlen = str.length()
cplen = Character.codePointCount(str, 0, stlen)
say 'Unicode: length="'stlen'" code_point_count="'cplen'" string="'str'"'
loop ix = 0 to stlen - 1
cp = Rexx(Character.codePointAt(str, ix))
cc = Rexx(Character.charCount(cp))
say ' 'formatCodePoint(ix, cc, cp)
if cc > 1 then do
surrogates = [Rexx(Character.highSurrogate(cp)).c2d(), Rexx(Character.lowSurrogate(cp)).c2d()]
loop sx = 0 to cc - 1
ix = ix + sx
cp = surrogates[sx]
say ' 'formatCodePoint(ix, 1, cp)
end sx
end
end ix
say
return
 
-- ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
-- @see http://docs.oracle.com/javase/6/docs/technotes/guides/intl/encoding.doc.html
-- @since Java 1.7
method formatCodePoint(ix, cc, cp) private static
scp = Rexx(Character.toChars(cp))
icp = cp.d2x(8).x2d(9) -- signed to unsigned conversion
ocp = Rexx(Integer.toOctalString(icp))
x_utf16 = ''
x_utf8 = ''
do
b_utf16 = String(scp).getBytes('UTF-16BE')
b_utf8 = String(scp).getBytes('UTF-8')
loop bv = 0 to b_utf16.length - 1 by 2
x_utf16 = x_utf16 Rexx(b_utf16[bv]).d2x(2) || Rexx(b_utf16[bv + 1]).d2x(2)
end bv
loop bv = 0 to b_utf8.length - 1
x_utf8 = x_utf8 Rexx(b_utf8[bv]).d2x(2)
end bv
x_utf16 = x_utf16.space(1, ',')
x_utf8 = x_utf8.space(1, ',')
catch ex = UnsupportedEncodingException
ex.printStackTrace()
end
cpName = Character.getName(cp)
fmt = -
'CodePoint:' -
'index="'ix.right(3, 0)'"' -
'character_count="'cc'"' -
'id="U+'cp.d2x(5)'"' -
'hex="0x'cp.d2x(6)'"' -
'dec="'icp.right(7, 0)'"' -
'oct="'ocp.right(7, 0)'"' -
'char="'scp'"' -
'utf-16="'x_utf16'"' -
'utf-8="'x_utf8'"' -
'name="'cpName'"'
return fmt
-- ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
method runSample(arg) public static
parse arg fileNames
if fileNames = '' then fileNames = 'data/utf8-001.txt'
loop while fileNames \= ''
parse fileNames fileName fileNames
slurped = readCharacters(fileName)
say "Input:" slurped
encodingDetails(slurped)
end
say
return
</syntaxhighlight>
{{out}}
<pre>
Input: y䮀𝄞𝄢12
Unicode: length="10" code_point_count="8" string="y䮀𝄞𝄢12"
CodePoint: index="000" character_count="1" id="U+00079" hex="0x000079" dec="0000121" oct="0000171" char="y" utf-16="0079" utf-8="79" name="LATIN SMALL LETTER Y"
CodePoint: index="001" character_count="1" id="U+000E4" hex="0x0000E4" dec="0000228" oct="0000344" char="ä" utf-16="00E4" utf-8="C3,A4" name="LATIN SMALL LETTER A WITH DIAERESIS"
CodePoint: index="002" character_count="1" id="U+000AE" hex="0x0000AE" dec="0000174" oct="0000256" char="®" utf-16="00AE" utf-8="C2,AE" name="REGISTERED SIGN"
CodePoint: index="003" character_count="1" id="U+020AC" hex="0x0020AC" dec="0008364" oct="0020254" char="€" utf-16="20AC" utf-8="E2,82,AC" name="EURO SIGN"
CodePoint: index="004" character_count="2" id="U+1D11E" hex="0x01D11E" dec="0119070" oct="0350436" char="𝄞" utf-16="D834,DD1E" utf-8="F0,9D,84,9E" name="MUSICAL SYMBOL G CLEF"
CodePoint: index="004" character_count="1" id="U+0D834" hex="0x00D834" dec="0055348" oct="0154064" char="?" utf-16="FFFD" utf-8="3F" name="HIGH SURROGATES D834"
CodePoint: index="005" character_count="1" id="U+0DD1E" hex="0x00DD1E" dec="0056606" oct="0156436" char="?" utf-16="FFFD" utf-8="3F" name="LOW SURROGATES DD1E"
CodePoint: index="006" character_count="2" id="U+1D122" hex="0x01D122" dec="0119074" oct="0350442" char="𝄢" utf-16="D834,DD22" utf-8="F0,9D,84,A2" name="MUSICAL SYMBOL F CLEF"
CodePoint: index="006" character_count="1" id="U+0D834" hex="0x00D834" dec="0055348" oct="0154064" char="?" utf-16="FFFD" utf-8="3F" name="HIGH SURROGATES D834"
CodePoint: index="007" character_count="1" id="U+0DD22" hex="0x00DD22" dec="0056610" oct="0156442" char="?" utf-16="FFFD" utf-8="3F" name="LOW SURROGATES DD22"
CodePoint: index="008" character_count="1" id="U+00031" hex="0x000031" dec="0000049" oct="0000061" char="1" utf-16="0031" utf-8="31" name="DIGIT ONE"
CodePoint: index="009" character_count="1" id="U+00032" hex="0x000032" dec="0000050" oct="0000062" char="2" utf-16="0032" utf-8="32" name="DIGIT TWO"
</pre>
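
The output above reports a length of 10 but only 8 code points, because the two musical symbols lie above U+FFFF and each occupies two UTF-16 units. The arithmetic behind the surrogate values can be sketched in Python; the helper names <code>to_surrogates</code> and <code>utf16_length</code> are hypothetical and are not part of the NetRexx sample.

<syntaxhighlight lang="python">def to_surrogates(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def utf16_length(text):
    """Number of 16-bit units needed to hold text in UTF-16."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


if __name__ == "__main__":
    sample = "y䮀𝄞𝄢12"
    print("code points:", len(sample), "UTF-16 units:", utf16_length(sample))
    high, low = to_surrogates(0x1D11E)
    print(f"U+1D11E -> {high:04X},{low:04X}")</syntaxhighlight>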
 
=={{header|Nim}}==
Like most systems languages, Nim reads bytes and provides functions to decode bytes into Unicode runes. The usual way to read a stream of UTF-8 characters is to read the file line by line and decode each line, either with the “utf8” iterator, which yields the characters one by one as strings, or with the “runes” iterator, which yields them one by one as Runes.

Since that approach actually reads the file line by line, even though the characters are yielded one by one, it could be considered cheating. So we provide a function and an iterator that read the bytes one at a time.
 
<syntaxhighlight lang="nim">import unicode

proc readUtf8(f: File): string =
  ## Return next UTF-8 character as a string.
  while true:
    result.add f.readChar()
    if result.validateUtf8() == -1: break

iterator readUtf8(f: File): string =
  ## Yield successive UTF-8 characters from file "f".
  var res: string
  while not f.endOfFile:
    res.setLen(0)
    while true:
      res.add f.readChar()
      if res.validateUtf8() == -1: break
    yield res</syntaxhighlight>
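
The same byte-at-a-time idea (keep adding one byte until the pending bytes form a complete character) can be sketched in Python. Note that this is a different mechanism from the Nim code: instead of re-validating a growing buffer, it relies on the standard library's incremental UTF-8 decoder, and the function name <code>read_utf8_chars</code> is hypothetical.

<syntaxhighlight lang="python">import codecs
import io


def read_utf8_chars(stream):
    """Yield characters from a binary stream, reading one byte at a time."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    while True:
        byte = stream.read(1)
        if not byte:
            # Flush the decoder; this raises UnicodeDecodeError if the
            # stream ended in the middle of a character.
            decoder.decode(b"", final=True)
            return
        char = decoder.decode(byte)  # "" until the character is complete
        if char:
            yield char


if __name__ == "__main__":
    data = io.BytesIO("aă€𝄞".encode("utf-8"))
    for ch in read_utf8_chars(data):
        print(f"U+{ord(ch):04X}")</syntaxhighlight>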
 
=={{header|Pascal}}==
<syntaxhighlight lang="pascal">(* Read a file char by char *)
program ReadFileByChar;
var
InputFile,OutputFile: file of char;
InputChar: char;
begin
Assign(InputFile, 'testin.txt');
Reset(InputFile);
Assign(OutputFile, 'testout.txt');
Rewrite(OutputFile);
while not Eof(InputFile) do
begin
Read(InputFile, InputChar);
Write(OutputFile, InputChar)
end;
Close(InputFile);
Close(OutputFile)
end.
</syntaxhighlight>

=={{header|Perl}}==
<syntaxhighlight lang="perl">binmode STDOUT, ':utf8'; # so we can print wide chars without warning
 
open my $fh, "<:encoding(UTF-8)", "input.txt" or die "$!\n";
 
while (read $fh, my $char, 1) {
printf "got character $char [U+%04x]\n", ord $char;
}
 
close $fh;</syntaxhighlight>
 
If the contents of the ''input.txt'' file are <code>aă€⼥</code> then the output would be:
<pre>
got character a [U+0061]
got character ă [U+0103]
got character € [U+20ac]
got character ⼥ [U+2f25]
</pre>
 
=={{header|Phix}}==
Generally I use utf8_to_utf32() on whole lines when I want to do character-counting.
 
You can find that routine in builtins/utfconv.e, and here is a modified copy that reads
precisely one unicode character from a file. If there is a genuine demand for it, I
could easily add this to that file permanently, and document/autoinclude it properly.
 
<!--<syntaxhighlight lang="phix">-->
<span style="color: #008080;">constant</span> <span style="color: #000000;">INVALID_UTF8</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">#FFFD</span>
<span style="color: #008080;">function</span> <span style="color: #000000;">get_one_utf8_char</span><span style="color: #0000FF;">(</span><span style="color: #004080;">integer</span> <span style="color: #000000;">fn</span><span style="color: #0000FF;">)</span>
<span style="color: #000080;font-style:italic;">-- returns INVALID_UTF8 on error, else a string of 1..4 bytes representing one character</span>
<span style="color: #004080;">object</span> <span style="color: #000000;">res</span>
<span style="color: #004080;">integer</span> <span style="color: #000000;">headb</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">bytes</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">c</span>
<span style="color: #000080;font-style:italic;">-- headb = first byte of utf-8 character:</span>
<span style="color: #000000;">headb</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">getc</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">headb</span><span style="color: #0000FF;">=-</span><span style="color: #000000;">1</span> <span style="color: #008080;">then</span> <span style="color: #008080;">return</span> <span style="color: #0000FF;">-</span><span style="color: #000000;">1</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #000000;">res</span> <span style="color: #0000FF;">=</span> <span style="color: #008000;">""</span><span style="color: #0000FF;">&</span><span style="color: #000000;">headb</span>
<span style="color: #000080;font-style:italic;">-- calculate length of utf-8 character in bytes (1..4):</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">headb</span><span style="color: #0000FF;"><</span><span style="color: #000000;">0</span> <span style="color: #008080;">then</span> <span style="color: #000000;">bytes</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">0</span> <span style="color: #000080;font-style:italic;">-- (utf-8 starts at #0)</span>
<span style="color: #008080;">elsif</span> <span style="color: #000000;">headb</span><span style="color: #0000FF;"><=</span><span style="color: #000000;">0b01111111</span> <span style="color: #008080;">then</span> <span style="color: #000000;">bytes</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">1</span> <span style="color: #000080;font-style:italic;">-- 0b_0xxx_xxxx</span>
<span style="color: #008080;">elsif</span> <span style="color: #000000;">headb</span><span style="color: #0000FF;"><=</span><span style="color: #000000;">0b10111111</span> <span style="color: #008080;">then</span> <span style="color: #000000;">bytes</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">0</span> <span style="color: #000080;font-style:italic;">-- (it's a tail byte)</span>
<span style="color: #008080;">elsif</span> <span style="color: #000000;">headb</span><span style="color: #0000FF;"><=</span><span style="color: #000000;">0b11011111</span> <span style="color: #008080;">then</span> <span style="color: #000000;">bytes</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">2</span> <span style="color: #000080;font-style:italic;">-- 0b_110x_xxxx</span>
<span style="color: #008080;">elsif</span> <span style="color: #000000;">headb</span><span style="color: #0000FF;"><=</span><span style="color: #000000;">0b11101111</span> <span style="color: #008080;">then</span> <span style="color: #000000;">bytes</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">3</span> <span style="color: #000080;font-style:italic;">-- 0b_1110_xxxx</span>
<span style="color: #008080;">elsif</span> <span style="color: #000000;">headb</span><span style="color: #0000FF;"><=</span><span style="color: #000000;">0b11110100</span> <span style="color: #008080;">then</span> <span style="color: #000000;">bytes</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">4</span> <span style="color: #000080;font-style:italic;">-- 0b_1111_0xzz</span>
<span style="color: #008080;">else</span> <span style="color: #000000;">bytes</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">0</span> <span style="color: #000080;font-style:italic;">-- (utf-8 ends at #10FFFF)</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #000080;font-style:italic;">-- 2..4 bytes encoding (tail range: 0b_1000_0000..0b_1011_1111);</span>
<span style="color: #008080;">for</span> <span style="color: #000000;">j</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #000000;">bytes</span><span style="color: #0000FF;">-</span><span style="color: #000000;">1</span> <span style="color: #008080;">do</span> <span style="color: #000080;font-style:italic;">-- tail bytes are valid?</span>
<span style="color: #000000;">c</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">getc</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">c</span><span style="color: #0000FF;"><</span><span style="color: #000000;">#80</span> <span style="color: #008080;">or</span> <span style="color: #000000;">c</span><span style="color: #0000FF;">></span><span style="color: #000000;">#BF</span> <span style="color: #008080;">then</span>
<span style="color: #000000;">bytes</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">0</span> <span style="color: #000080;font-style:italic;">-- invalid tail byte or eof</span>
<span style="color: #008080;">exit</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #000000;">res</span> <span style="color: #0000FF;">&=</span> <span style="color: #000000;">c</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
<span style="color: #000080;font-style:italic;">-- 1 byte encoding (head range: 0b_0000_0000..0b_0111_1111):</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">bytes</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">then</span>
<span style="color: #000000;">c</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">headb</span> <span style="color: #000080;font-style:italic;">-- UTF-8 = ASCII
-- 2 bytes encoding (head range: 0b_1100_0000..0b_1101_1111):</span>
<span style="color: #008080;">elsif</span> <span style="color: #000000;">bytes</span><span style="color: #0000FF;">=</span><span style="color: #000000;">2</span> <span style="color: #008080;">then</span>
<span style="color: #000000;">c</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">and_bits</span><span style="color: #0000FF;">(</span><span style="color: #000000;">headb</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">#1F</span><span style="color: #0000FF;">)*</span><span style="color: #000000;">#40</span> <span style="color: #0000FF;">+</span> <span style="color: #000080;font-style:italic;">-- 0b110[7..11] headb</span>
<span style="color: #7060A8;">and_bits</span><span style="color: #0000FF;">(</span><span style="color: #000000;">res</span><span style="color: #0000FF;">[</span><span style="color: #000000;">2</span><span style="color: #0000FF;">],</span> <span style="color: #000000;">#3F</span><span style="color: #0000FF;">)</span> <span style="color: #000080;font-style:italic;">-- 0b10[1..6] tail</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">c</span><span style="color: #0000FF;">></span><span style="color: #000000;">#7FF</span> <span style="color: #008080;">then</span> <span style="color: #0000FF;">?</span><span style="color: #000000;">9</span><span style="color: #0000FF;">/</span><span style="color: #000000;">0</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> <span style="color: #000080;font-style:italic;">-- sanity check</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">c</span><span style="color: #0000FF;"><</span><span style="color: #000000;">#80</span> <span style="color: #008080;">then</span> <span style="color: #000080;font-style:italic;">-- long form?</span>
<span style="color: #000000;">res</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">INVALID_UTF8</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #000080;font-style:italic;">-- 3 bytes encoding (head range: 0b_1110_0000..0b_1110_1111):</span>
<span style="color: #008080;">elsif</span> <span style="color: #000000;">bytes</span><span style="color: #0000FF;">=</span><span style="color: #000000;">3</span> <span style="color: #008080;">then</span>
<span style="color: #000000;">c</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">and_bits</span><span style="color: #0000FF;">(</span><span style="color: #000000;">headb</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">#0F</span><span style="color: #0000FF;">)*</span><span style="color: #000000;">#1000</span> <span style="color: #0000FF;">+</span> <span style="color: #000080;font-style:italic;">-- 0b1110[13..16] head</span>
<span style="color: #7060A8;">and_bits</span><span style="color: #0000FF;">(</span><span style="color: #000000;">res</span><span style="color: #0000FF;">[</span><span style="color: #000000;">2</span><span style="color: #0000FF;">],</span> <span style="color: #000000;">#3F</span><span style="color: #0000FF;">)*</span><span style="color: #000000;">#40</span> <span style="color: #0000FF;">+</span> <span style="color: #000080;font-style:italic;">-- 0b10[7..12] tail</span>
<span style="color: #7060A8;">and_bits</span><span style="color: #0000FF;">(</span><span style="color: #000000;">res</span><span style="color: #0000FF;">[</span><span style="color: #000000;">3</span><span style="color: #0000FF;">],</span> <span style="color: #000000;">#3F</span><span style="color: #0000FF;">)</span> <span style="color: #000080;font-style:italic;">-- 0b10[1..6] tail</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">c</span><span style="color: #0000FF;">></span><span style="color: #000000;">#FFFF</span> <span style="color: #008080;">then</span> <span style="color: #0000FF;">?</span><span style="color: #000000;">9</span><span style="color: #0000FF;">/</span><span style="color: #000000;">0</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> <span style="color: #000080;font-style:italic;">-- sanity check</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">c</span><span style="color: #0000FF;"><</span><span style="color: #000000;">#800</span> <span style="color: #000080;font-style:italic;">-- long form?</span>
<span style="color: #008080;">or</span> <span style="color: #0000FF;">(</span><span style="color: #000000;">c</span><span style="color: #0000FF;">>=</span><span style="color: #000000;">#D800</span> <span style="color: #008080;">and</span> <span style="color: #000000;">c</span><span style="color: #0000FF;"><=</span><span style="color: #000000;">#DFFF</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">then</span> <span style="color: #000080;font-style:italic;">-- utf-16 incompatible</span>
<span style="color: #000000;">res</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">INVALID_UTF8</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #000080;font-style:italic;">-- 4 bytes encoding (head range: 0b_1111_0000..0b_1111_0111):</span>
<span style="color: #008080;">elsif</span> <span style="color: #000000;">bytes</span><span style="color: #0000FF;">=</span><span style="color: #000000;">4</span> <span style="color: #008080;">then</span>
<span style="color: #000000;">c</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">and_bits</span><span style="color: #0000FF;">(</span><span style="color: #000000;">headb</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">#07</span><span style="color: #0000FF;">)*</span><span style="color: #000000;">#040000</span> <span style="color: #0000FF;">+</span> <span style="color: #000080;font-style:italic;">-- 0b11110[19..21] head</span>
<span style="color: #7060A8;">and_bits</span><span style="color: #0000FF;">(</span><span style="color: #000000;">res</span><span style="color: #0000FF;">[</span><span style="color: #000000;">2</span><span style="color: #0000FF;">],</span> <span style="color: #000000;">#3F</span><span style="color: #0000FF;">)*</span><span style="color: #000000;">#1000</span> <span style="color: #0000FF;">+</span> <span style="color: #000080;font-style:italic;">-- 0b10[13..18] tail</span>
<span style="color: #7060A8;">and_bits</span><span style="color: #0000FF;">(</span><span style="color: #000000;">res</span><span style="color: #0000FF;">[</span><span style="color: #000000;">3</span><span style="color: #0000FF;">],</span> <span style="color: #000000;">#3F</span><span style="color: #0000FF;">)*</span><span style="color: #000000;">#0040</span> <span style="color: #0000FF;">+</span> <span style="color: #000080;font-style:italic;">-- 0b10[7..12] tail</span>
<span style="color: #7060A8;">and_bits</span><span style="color: #0000FF;">(</span><span style="color: #000000;">res</span><span style="color: #0000FF;">[</span><span style="color: #000000;">4</span><span style="color: #0000FF;">],</span> <span style="color: #000000;">#3F</span><span style="color: #0000FF;">)</span> <span style="color: #000080;font-style:italic;">-- 0b10[1..6] tail</span>
<span style="color: #008080;">if</span> <span style="color: #000000;">c</span><span style="color: #0000FF;"><</span><span style="color: #000000;">#10000</span> <span style="color: #000080;font-style:italic;">-- long form?</span>
<span style="color: #008080;">or</span> <span style="color: #000000;">c</span><span style="color: #0000FF;">></span><span style="color: #000000;">#10FFFF</span> <span style="color: #008080;">then</span>
<span style="color: #000000;">res</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">INVALID_UTF8</span> <span style="color: #000080;font-style:italic;">-- utf-8 ends at #10FFFF</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #000080;font-style:italic;">-- bytes = 0; current byte is not encoded correctly:</span>
<span style="color: #008080;">else</span>
<span style="color: #000000;">res</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">INVALID_UTF8</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">return</span> <span style="color: #000000;">res</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">function</span>
<!--</syntaxhighlight>-->
 
Test code:
 
<!--<syntaxhighlight lang="phix">-->
<span style="color: #000080;font-style:italic;">--string utf8 = "aă€⼥" -- (same results as next)</span>
<span style="color: #004080;">string</span> <span style="color: #000000;">utf8</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">utf32_to_utf8</span><span style="color: #0000FF;">({</span><span style="color: #000000;">#0061</span><span style="color: #0000FF;">,</span><span style="color: #000000;">#0103</span><span style="color: #0000FF;">,</span><span style="color: #000000;">#20ac</span><span style="color: #0000FF;">,</span><span style="color: #000000;">#2f25</span><span style="color: #0000FF;">})</span>
<span style="color: #7060A8;">printf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">1</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"length of utf8 is %d bytes\n"</span><span style="color: #0000FF;">,</span><span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">utf8</span><span style="color: #0000FF;">))</span>
<span style="color: #004080;">integer</span> <span style="color: #000000;">fn</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">open</span><span style="color: #0000FF;">(</span><span style="color: #008000;">"test.txt"</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"wb"</span><span style="color: #0000FF;">)</span>
<span style="color: #7060A8;">puts</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">,</span><span style="color: #000000;">utf8</span><span style="color: #0000FF;">)</span>
<span style="color: #7060A8;">close</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">)</span>
<span style="color: #000000;">fn</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">open</span><span style="color: #0000FF;">(</span><span style="color: #008000;">"test.txt"</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"r"</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #000000;">5</span> <span style="color: #008080;">do</span>
<span style="color: #004080;">object</span> <span style="color: #000000;">res</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">get_one_utf8_char</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">if</span> <span style="color: #004080;">string</span><span style="color: #0000FF;">(</span><span style="color: #000000;">res</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">then</span>
<span style="color: #008080;">if</span> <span style="color: #7060A8;">platform</span><span style="color: #0000FF;">()=</span><span style="color: #000000;">LINUX</span> <span style="color: #008080;">then</span>
<span style="color: #7060A8;">printf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">1</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"char %d (%s) is %d bytes\n"</span><span style="color: #0000FF;">,{</span><span style="color: #000000;">i</span><span style="color: #0000FF;">,</span><span style="color: #000000;">res</span><span style="color: #0000FF;">,</span><span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">res</span><span style="color: #0000FF;">)})</span>
<span style="color: #008080;">else</span>
<span style="color: #000080;font-style:italic;">-- unicode and consoles are tricky on windows, so I'm
-- just avoiding that issue altogether (t)here.</span>
<span style="color: #7060A8;">printf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">1</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"char %d is %d bytes\n"</span><span style="color: #0000FF;">,{</span><span style="color: #000000;">i</span><span style="color: #0000FF;">,</span><span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">res</span><span style="color: #0000FF;">)})</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">elsif</span> <span style="color: #000000;">res</span><span style="color: #0000FF;">=-</span><span style="color: #000000;">1</span> <span style="color: #008080;">then</span>
<span style="color: #7060A8;">printf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">1</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"char %d - EOF\n"</span><span style="color: #0000FF;">,</span><span style="color: #000000;">i</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">exit</span>
<span style="color: #008080;">else</span>
<span style="color: #7060A8;">printf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">1</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"char %d - INVALID_UTF8\n"</span><span style="color: #0000FF;">,</span><span style="color: #000000;">i</span><span style="color: #0000FF;">)</span>
<span style="color: #008080;">exit</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">if</span>
<span style="color: #008080;">end</span> <span style="color: #008080;">for</span>
<span style="color: #7060A8;">close</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">)</span>
<!--</syntaxhighlight>-->
 
{{out}}
<pre>
length of utf8 is 9 bytes
char 1 is 1 bytes
char 2 is 2 bytes
char 3 is 3 bytes
char 4 is 3 bytes
char 5 - EOF
</pre>
 
=={{header|PicoLisp}}==
PicoLisp uses UTF-8 unless told otherwise.
<syntaxhighlight lang="picolisp">
(in "wordlist"
(while (char)
(process @) ) )
</syntaxhighlight>
 
=={{header|Python}}==
{{works with|Python|2.7}}
<syntaxhighlight lang="python">
def get_next_character(f):
  # note: assumes valid utf-8
  c = f.read(1)
  while c:
    while True:
      try:
        yield c.decode('utf-8')
      except UnicodeDecodeError:
        # we've encountered a multibyte character
        # read another byte and try again
        c += f.read(1)
      else:
        # c was a valid char, and was yielded, continue
        c = f.read(1)
        break

# Usage:
with open("input.txt","rb") as f:
  for c in get_next_character(f):
    print(c)
</syntaxhighlight>
 
{{works with|Python|3}}
Python 3 simplifies the handling of text files since you can specify an encoding.
<syntaxhighlight lang="python">def get_next_character(f):
    """Reads one character from the given textfile"""
    c = f.read(1)
    while c:
        yield c
        c = f.read(1)

# Usage:
with open("input.txt", encoding="utf-8") as f:
    for c in get_next_character(f):
        print(c, sep="", end="")</syntaxhighlight>
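
When the file has to stay open in binary mode, the same effect can probably be had with the standard library's incremental decoder. The following is only a sketch under that assumption: the helper name <code>iter_utf8_chars</code> is made up for illustration, and an in-memory <code>io.BytesIO</code> stands in for a real file.

```python
import codecs
import io

def iter_utf8_chars(stream, errors="strict"):
    """Yield decoded characters one at a time from a binary stream."""
    # The incremental decoder buffers incomplete multi-byte sequences.
    decoder = codecs.getincrementaldecoder("utf-8")(errors=errors)
    while True:
        byte = stream.read(1)
        if not byte:
            # End of input: final=True makes a truncated sequence an error
            # (or a replacement character, depending on `errors`).
            yield from decoder.decode(b"", final=True)
            return
        # Returns "" until a whole character has been collected.
        yield from decoder.decode(byte)

# Usage, with BytesIO standing in for open("input.txt", "rb"):
sample = io.BytesIO("aă€⼥".encode("utf-8"))
print(list(iter_utf8_chars(sample)))
```

Unlike the retry-on-exception approach, a file that ends in the middle of a multi-byte sequence raises <code>UnicodeDecodeError</code> here instead of looping.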
 
=={{header|QBasic}}==
<syntaxhighlight lang="qbasic">f = FREEFILE
filename$ = "file.txt"
 
OPEN filename$ FOR BINARY AS #f
WHILE NOT EOF(f)
char$ = " " ' one-character buffer: GET reads LEN(char$) bytes
GET #f, , char$
PRINT char$;
WEND
CLOSE #f</syntaxhighlight>
 
=={{header|Racket}}==
Don't we all love self reference?
<syntaxhighlight lang="racket">
#lang racket
; This file contains utf-8 characters: λ, α, γ ...
(for ([c (in-port read-char (open-input-file "read-file.rkt"))])
  (display c))
</syntaxhighlight>
Output:
<syntaxhighlight lang="racket">
#lang racket
; This file contains utf-8 characters: λ, α, γ ...
(for ([c (in-port read-char (open-input-file "read-file.rkt"))])
  (display c))
</syntaxhighlight>
 
=={{header|Raku}}==
(formerly Perl 6)
 
Raku has a built-in method, .getc, that reads a single character from an open file handle. File handles default to UTF-8, so multi-byte characters are handled correctly.

To read a single character at a time from standard input ($*IN in Raku):
<syntaxhighlight lang="raku" line>.say while defined $_ = $*IN.getc;</syntaxhighlight>
 
Or, from a file:
<syntaxhighlight lang="raku" line>my $filename = 'whatever';
 
my $in = open( $filename, :r ) orelse .die;
 
print $_ while defined $_ = $in.getc;</syntaxhighlight>
 
=={{header|REXX}}==
===version 1===
REXX doesn't support UTF8 encoded wide characters, just bytes.
<br><br>The task's requirement stated that '''EOF''' was to be returned upon reaching the end-of-file, so this programming example was written as a subroutine (procedure).
<br>Note that characters that may modify screen behavior (tabs, backspaces, line feeds, carriage returns, "bells" and others) are not displayed as characters; only their hexadecimal equivalents are shown.
<syntaxhighlight lang="rexx">/*REXX program  reads and displays  a file char by char, returning   'EOF'   when done. */
parse arg iFID .                                 /*iFID:     is the fileID to be read.  */
                                                 /* [↓]  show the file's contents.      */
if iFID\==''  then do j=1  until  x=='EOF'       /*J  counts the file's characters.     */
                   x=getchar(iFID);    y=        /*get a character  or  an 'EOF'.       */
                   if x>>' '   then y=x          /*display   X   if presentable.        */
                   say  right(j, 12)     'character,  (hex,char)'      c2x(x)      y
                   end   /*j*/                   /* [↑]  only display  X  if not low hex*/
exit                                             /*stick a fork in it,  we're all done. */
/*──────────────────────────────────────────────────────────────────────────────────────*/
getchar: procedure;  parse arg z;  if chars(z)==0  then return 'EOF';       return charin(z)</syntaxhighlight>
'''input''' &nbsp; file: &nbsp; '''ABC'''
<br>and was created by the DOS command (under Windows/XP): &nbsp; &nbsp; '''echo 123 [¬ a prime]> ABC'''
<pre>
123 [¬ a prime]
</pre>
'''output''' &nbsp; (for the above [ABC] input file):
<pre>
1 character, (hex,char) 31 1
...
17 character, (hex,char) 0A
18 character, (hex,char) 454F46 EOF
End-Of-File.
</pre>
 
===version 2===
<syntaxhighlight lang="rexx">/* REXX ---------------------------------------------------------------
* 29.12.2013 Walter Pachl
* read one utf8 character at a time
* see http://de.wikipedia.org/wiki/UTF-8#Kodierung
* sorry this is in German but the encoding table should be obvious
*--------------------------------------------------------------------*/
oid='utf8.txt';'erase' oid /* first create file containing utf8 chars*/
/* ... */
Call lineout oid
fid='utf8.txt' /* then read it and show the contents */
Do Until c8='EOF'
  c8=get_utf8char(fid)
  Say left(c8,4) c2x(c8)
  End
Exit
 
get_utf8char: Procedure
Parse Arg f
If chars(f)=0 Then
Return 'EOF'
c=charin(f)
b=c2b(c)
/* ... */
Return c
 
c2b: Return x2b(c2x(arg(1)))</syntaxhighlight>
output:
<pre>y 79
...
® C2AE
€ E282AC
𝄞 F09D849E
EOF 454F46</pre>
 
=={{header|Ring}}==
<syntaxhighlight lang="ring">
fp = fopen("C:\Ring\ReadMe.txt","r")
r = fgetc(fp)
while isstring(r)
see r
r = fgetc(fp)
end
fclose(fp)
</syntaxhighlight>
Output:
<pre>
==================================================
The Ring Programming Language
http://ring-lang.net/
Version 1.0
Release Date : January 25, 2016
Update Date : March 27, 2016
===================================================
Binary release for Microsoft Windows
===================================================
 
Run Start.bat to open Ring Notepad then
start learning from the documentation
 
Join Ring Group for questions
https://groups.google.com/forum/#!forum/ring-lang
 
Greetings,
Mahmoud Fayed
msfclipper@yahoo.com
http://www.facebook.com/mahmoudfayed1986
</pre>
 
=={{header|Ruby}}==
{{works with|Ruby|1.9}}
UTF-8 is the default encoding since Ruby 2.0. In Ruby 1.9 use the magic comment "#encoding: utf-8" on the first line.

<syntaxhighlight lang="ruby">File.open('input.txt', 'r:utf-8') do |f|
  f.each_char{|c| p c}
end</syntaxhighlight>
 
or
 
<syntaxhighlight lang="ruby">File.open('input.txt', 'r:utf-8') do |f|
while c = f.getc
p c
end
end</syntaxhighlight>
 
=={{header|Run BASIC}}==
<syntaxhighlight lang="runbasic">open "file.txt" for binary as #f
numChars = 1 ' specify number of characters to read
a$ = input$(#f,numChars) ' read number of characters specified
b$ = input$(#f,1) ' read one character
close #f</syntaxhighlight>
 
=={{header|Rust}}==
The Rust standard library provides hardly any straightforward way to read single UTF-8 characters
from a file. The following code implements an iterator that consumes a byte stream, taking only as
many bytes as necessary to decode the next UTF-8 character. It reports errors in some detail,
so that client code can use them to deal with corrupted input.

The decoding code was originally based on the [https://docs.rs/crate/utf8-decode/1.0.0/source/ utf8-decode] crate.
 
<syntaxhighlight lang="rust">use std::{
convert::TryFrom,
fmt::{Debug, Display, Formatter},
io::Read,
};
 
pub struct ReadUtf8<I: Iterator> {
source: std::iter::Peekable<I>,
}
 
impl<R: Read> From<R> for ReadUtf8<std::io::Bytes<R>> {
fn from(source: R) -> Self {
ReadUtf8 {
source: source.bytes().peekable(),
}
}
}
 
impl<I, E> Iterator for ReadUtf8<I>
where
I: Iterator<Item = Result<u8, E>>,
{
type Item = Result<char, Error<E>>;
 
fn next(&mut self) -> Option<Self::Item> {
self.source.next().map(|next| match next {
Ok(lead) => self.complete_char(lead),
Err(e) => Err(Error::SourceError(e)),
})
}
}
 
impl<I, E> ReadUtf8<I>
where
I: Iterator<Item = Result<u8, E>>,
{
fn continuation(&mut self) -> Result<u32, Error<E>> {
if let Some(Ok(byte)) = self.source.peek() {
let byte = *byte;
 
return if byte & 0b1100_0000 == 0b1000_0000 {
self.source.next();
Ok((byte & 0b0011_1111) as u32)
} else {
Err(Error::InvalidByte(byte))
};
}
 
match self.source.next() {
None => Err(Error::InputTruncated),
Some(Err(e)) => Err(Error::SourceError(e)),
Some(Ok(_)) => unreachable!(),
}
}
 
fn complete_char(&mut self, lead: u8) -> Result<char, Error<E>> {
let a = lead as u32; // Let's name the bytes in the sequence
 
let result = if a & 0b1000_0000 == 0 {
Ok(a)
} else if lead & 0b1110_0000 == 0b1100_0000 {
let b = self.continuation()?;
Ok((a & 0b0001_1111) << 6 | b)
} else if a & 0b1111_0000 == 0b1110_0000 {
let b = self.continuation()?;
let c = self.continuation()?;
Ok((a & 0b0000_1111) << 12 | b << 6 | c)
} else if a & 0b1111_1000 == 0b1111_0000 {
let b = self.continuation()?;
let c = self.continuation()?;
let d = self.continuation()?;
Ok((a & 0b0000_0111) << 18 | b << 12 | c << 6 | d)
} else {
Err(Error::InvalidByte(lead))
};
 
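// Values that are not Unicode scalar values (e.g. surrogates) are reported
// as an error on the lead byte instead of panicking.
char::try_from(result?).map_err(|_| Error::InvalidByte(lead))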
}
}
 
#[derive(Debug, Clone)]
pub enum Error<E> {
InvalidByte(u8),
InputTruncated,
SourceError(E),
}
 
impl<E: Display> Display for Error<E> {
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
match self {
Self::InvalidByte(b) => write!(f, "invalid byte 0x{:x}", b),
Self::InputTruncated => write!(f, "character truncated"),
Self::SourceError(e) => e.fmt(f),
}
}
}
 
fn main() -> std::io::Result<()> {
for (index, value) in ReadUtf8::from(std::fs::File::open("test.txt")?).enumerate() {
match value {
Ok(c) => print!("{}", c),
 
Err(e) => {
print!("\u{fffd}");
eprintln!("offset {}: {}", index, e);
}
}
}
 
Ok(())
}</syntaxhighlight>
 
 
=={{header|Seed7}}==
The library [http://seed7.sourceforge.net/libraries/utf8.htm utf8.s7i]
provides the functions [http://seed7.sourceforge.net/libraries/utf8.htm#openUtf8%28in_string,in_string%29 openUtf8]
and [http://seed7.sourceforge.net/libraries/utf8.htm#getc%28in_utf8_file%29 getc].
When a file has been opened with <code>openUtf8</code>, the function <code>getc</code> reads UTF-8 characters from the file.
To allow writing Unicode characters to standard output
the file [http://seed7.sourceforge.net/libraries/utf8.htm#STD_UTF8_OUT STD_UTF8_OUT] is used.
 
<syntaxhighlight lang="seed7">$ include "seed7_05.s7i";
include "utf8.s7i";
 
const proc: main is func
local
var file: inFile is STD_NULL;
var char: ch is ' ';
begin
OUT := STD_UTF8_OUT;
inFile := openUtf8("readAFileCharacterByCharacterUtf8.in", "r");
if inFile <> STD_NULL then
while hasNext(inFile) do
ch := getc(inFile);
writeln("got character " <& ch <& " [U+" <& ord(ch) radix 16 <& "]");
end while;
close(inFile);
end if;
end func;</syntaxhighlight>
 
{{out}}
When the input file <tt>readAFileCharacterByCharacterUtf8.in</tt> contains the characters <tt>aă€⼥</tt> the output is:
<pre>
got character a [U+61]
got character ă [U+103]
got character € [U+20ac]
got character ⼥ [U+2f25]
</pre>
 
=={{header|Sidef}}==
<syntaxhighlight lang="ruby">var file = File('input.txt') # the input file contains: "aă€⼥"
var fh = file.open_r # equivalent to: file.open('<:utf8')
fh.each_char { |char|
printf("got character #{char} [U+%04x]\n", char.ord)
}</syntaxhighlight>
{{out}}
<pre>
got character a [U+0061]
got character ă [U+0103]
got character € [U+20ac]
got character ⼥ [U+2f25]
</pre>
 
=={{header|Smalltalk}}==
{{works with|Smalltalk/X}}
<syntaxhighlight lang="smalltalk">|utfStream|
utfStream := 'input' asFilename readStream asUTF8EncodedStream.
[utfStream atEnd] whileFalse:[
Transcript showCR:'got char ',utfStream next.
].
utfStream close.</syntaxhighlight>
 
=={{header|Tcl}}==
To read a single character from a file, use:
<syntaxhighlight lang="tcl">set ch [read $channel 1]</syntaxhighlight>
This will read multiple bytes sufficient to obtain a Unicode character if a suitable encoding has been configured on the channel. For binary channels, this will always consume exactly one byte. However, the low-level channel buffering logic may consume more than one byte (which only really matters where the channel is being handed on to another process and the channel is over a file descriptor that doesn't support the <tt>lseek</tt> OS call); the extent of buffering can be controlled via:
<syntaxhighlight lang="tcl">fconfigure $channel -buffersize $byteCount</syntaxhighlight>
When the channel is only being accessed from Tcl (or via Tcl's C API) it is not normally necessary to adjust this option.
 
=={{header|V (Vlang)}}==
<syntaxhighlight lang="v (vlang)">
import os
 
fn main() {
file := './file.txt'
mut content_arr := []u8{}
if os.is_file(file) == true {
content_arr << os.read_bytes(file) or {
println('Error: can not read')
exit(1)
}
}
else {
println('Error: can not find file')
exit(1)
}
 
println(content_arr.bytestr())
}
</syntaxhighlight>
 
=={{header|Wren}}==
<syntaxhighlight lang="wren">import "io" for File
 
File.open("input.txt") { |file|
var offset = 0
var char = "" // stores each byte read till we have a complete UTF encoded character
while(true) {
var b = file.readBytes(1, offset)
if (b == "") return // end of stream
char = char + b
if (char.codePoints[0] >= 0) { // a UTF encoded character is complete
System.write(char) // print it
char = "" // reset store
}
offset = offset + 1
}
}</syntaxhighlight>
 
=={{header|zkl}}==
zkl doesn't know much about UTF-8 or Unicode but is able to test whether a string or number is valid UTF-8 or not. This code uses that to build a state machine to decode a byte stream into UTF-8 characters.
<syntaxhighlight lang="zkl">fcn readUTF8c(chr,s=""){ // transform UTF-8 character stream
s+=chr;
try{ s.len(8); return(s) }
catch{ if(s.len()>6) throw(__exception) } // 6 bytes max for UTF-8
return(Void.Again,s); // call me again with s & another character
}</syntaxhighlight>
Used to modify a zkl iterator, it can consume any stream-able (files, strings, lists, etc) and provides support for foreach, map, look ahead, push back, etc.
<syntaxhighlight lang="zkl">fcn utf8Walker(obj){
obj.walker(3) // read characters
.tweak(readUTF8c)
}</syntaxhighlight>
<syntaxhighlight lang="zkl">s:="-->\u20AC123"; // --> e2,82,ac,31,32,33 == -->€123
utf8Walker(s).walk().println();
 
w:=utf8Walker(Data(Void,s,"\n")); // Data is a byte bucket
foreach c in (utf8Walker(Data(Void,s,"\n"))){ print(c) }
 
utf8Walker(Data(Void,0xe2,0x82,"123456")).walk().println(); // € is short 1 byte</syntaxhighlight>
{{out}}
<pre>
L("-","-",">","€","1","2","3")
-->€123
VM#1 caught this unhandled exception:
ValueError : Invalid UTF-8 string
</pre>
If you wish to push a UTF-8 stream through one or more functions, you can use the same state machine:
<syntaxhighlight lang="zkl">stream:=Data(Void,s,"\n").howza(3); // character stream
stream.pump(List,readUTF8c,"print")</syntaxhighlight>
{{out}}<pre>-->€123</pre>
and returns a list of the eight UTF-8 characters (with newline).
Or, if file "foo.txt" contains the characters:
<syntaxhighlight lang="zkl">File("foo.txt","rb").howza(3).pump(List,readUTF8c,"print");</syntaxhighlight>
produces the same result.
 
{{omit from|AWK}}