Read a file character by character/UTF8: Difference between revisions
Revision as of 10:31, 2 February 2024
(17 intermediate revisions by 6 users not shown)
=={{header|Action!}}==
<
Proc Main()
Close(1)
Return</
=={{header|AutoHotkey}}==
{{works with|AutoHotkey 1.1}}
<
while !File.AtEOF
MsgBox, % File.Read(1)</
=={{header|BASIC256}}==
<syntaxhighlight lang="basic256">f = freefile
filename$ = "file.txt"
open f, filename$
while not eof(f)
print chr(readbyte(f));
end while
close f
end</syntaxhighlight>
=={{header|C}}==
<
#include <wchar.h>
#include <stdlib.h>
return EXIT_SUCCESS;
}</
=={{header|C++}}==
<
#include <fstream>
#include <iostream>
return EXIT_SUCCESS;
}
</syntaxhighlight>
=={{header|C sharp|C#}}==
<
using System.IO;
using System.Text;
}
}
}</
=={{header|Common Lisp}}==
{{works with|CLISP}}{{works with|Clozure CL}}{{works with|CMUCL}}{{works with|ECL (Lisp)}}{{works with|SBCL}}{{works with|ABCL}}
<
#+clisp (import 'charset:utf-8 'keyword)
(loop for c = (read-char s nil)
while c
do (format t "~a" c)))</
=={{header|Crystal}}==
The encoding is UTF-8 by default, but it can be explicitly specified.
<
file.each_char { |c| p c }
end</
or
<
while c = file.read_char
p c
end
end</
=={{header|Delphi}}==
{{libheader| System.SysUtils}}
{{libheader| System.Classes}}
{{Trans|C#}}
<syntaxhighlight lang="delphi">
program Read_a_file_character_by_character_UTF8;
end;
readln;
end.</
=={{header|Déjà Vu}}==
<
local (read-utf8-char) file tmp:
!read-byte file
!close file
return
!.</
=={{header|Factor}}==
<syntaxhighlight lang="text">USING: kernel io io.encodings.utf8 io.files strings ;
IN: rosetta-code.read-one
"input.txt" utf8 [
[ read1 dup ] [ 1string write ] while drop
] with-file-reader</
=={{header|FreeBASIC}}==
<
f = Freefile
Wend
Close #f
Sleep</
=={{header|FunL}}==
<
r = InputStreamReader( FileInputStream('input.txt'), 'UTF-8' )
print( chr(ch) )
r.close()</
=={{header|Go}}==
<
import (
fmt.Printf("%c", r)
}
}</
=={{header|Haskell}}==
{{Works with|GHC|7.8.3}}
<
{- The procedure to read a UTF-8 character is just:
xs -> forM_ xs $ \name -> do
putStrLn name
withFile name ReadMode processOneFile</
{{out}}
<pre>
First, the first octet of a UTF-8 sequence tells us the length of the sequence needed to represent that character. Specifically, we can convert that octet to binary and count the number of leading 1s to find the length of the character in octets (except that the length is always at least 1 octet, which covers ASCII octets that have no leading 1s).
<
Now we can use an indexed file read to read a UTF-8 character starting at a specific file index: we read the first octet and then read as many additional octets as that first octet calls for. If that is not possible, we return EOF:
<
try.
octet0=. 1!:11 y;x,1
'EOF'
end.
)</
The length of the result tells us what to add to the file index to find the next available file index for reading.
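For comparison, the same idea can be sketched in Python. This is only an illustration under stated assumptions, not a translation of the J code: the names <code>utf8_length</code> and <code>read_char_at</code> are made up for this sketch, and an octet sequence that cannot be decoded is reported as <code>'EOF'</code>, just like a read past the end of the file.

<syntaxhighlight lang="python">
def utf8_length(first_byte):
    """Length in bytes implied by the first byte of a UTF-8 sequence:
    the number of leading 1 bits, but never less than 1."""
    ones = 0
    mask = 0x80
    while first_byte & mask:
        ones += 1
        mask >>= 1
    return max(ones, 1)


def read_char_at(path, index):
    """Return the character that starts at byte offset `index` of the
    file `path`, or the string 'EOF' if no complete character is there."""
    with open(path, "rb") as f:
        f.seek(index)
        first = f.read(1)
        if not first:
            return "EOF"  # nothing left to read
        # Read the remaining bytes announced by the first byte.
        rest = f.read(utf8_length(first[0]) - 1)
        try:
            return (first + rest).decode("utf-8")
        except UnicodeDecodeError:
            return "EOF"  # truncated or otherwise invalid sequence
</syntaxhighlight>

Adding the encoded length of the result, <code>len(ch.encode("utf-8"))</code>, to the index gives the next index to read from.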
=={{header|Java}}==
The ''FileReader'' class offers a ''read'' method which will return the integer value of each character, upon each call.<br />
When the end of the stream is reached, -1 is returned.<br />
You can implement this task by enclosing a ''FileReader'' within a class, and generating a new character via a method return.
<syntaxhighlight lang="java">
import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
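public class Utf8CharacterReader { // placeholder name: the original class name is missing from this revision
private final FileReader reader;
// Assumed constructor: opens the file as UTF-8 (this FileReader overload needs Java 11 or later)
public Utf8CharacterReader(String path) throws IOException {
reader = new FileReader(path, StandardCharsets.UTF_8);
}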
/** @return integer value from 0 to 0xffff, or -1 for EOS */
public int nextCharacter() throws IOException {
return reader.read();
}
public void close() throws IOException {
reader.close();
}
}
</syntaxhighlight>
===Using Java 11===
<syntaxhighlight lang="java">
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
public final class ReadFileByCharacter {
public static void main(String[] aArgs) {
Path path = Path.of("input.txt");
try ( BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8) ) {
int value;
while ( ( value = reader.read() ) != END_OF_STREAM ) {
System.out.println((char) value);
}
} catch (IOException ioe) {
ioe.printStackTrace();
}
}
private static final int END_OF_STREAM = -1;
}
</syntaxhighlight>
{{ out }}
<pre>
R
o
s
e
t
t
a
</pre>
=={{header|jq}}==
jq being stream-oriented, it makes sense to define `readc` so that it emits a stream of the UTF-8 characters in the input:
<
inputs + "\n" | explode[] | [.] | implode;</
Example:
<syntaxhighlight lang="sh">
echo '过活' | jq -Rn 'include "readc"; readc'
"过"
"活"
"\n"</
=={{header|Julia}}==
The built-in <code>read(stream, Char)</code> function reads a single UTF8-encoded character from a given stream.
<
while !eof(f)
c = read(f, Char)
println(c)
end
end</
=={{header|Kotlin}}==
<
import java.io.File
}
}
}</
=={{header|Lua}}==
{{works with|Lua|5.3}}
<syntaxhighlight lang="lua">
-- Return whether the given string is a single ASCII character.
function is_ascii (str)
end
end
</syntaxhighlight>
{{out}}
𝄞 A ö Ж € 𝄞 Ε λ λ η ν ι κ ά y ä ® € 成 长 汉
From revision 27 of version 9.3 of the M2000 Environment, the Chinese letter 长 is displayed in the console as it is displayed in the editor.
<syntaxhighlight lang="m2000 interpreter">
Module checkit {
\\ prepare a file
}
checkit
</syntaxhighlight>
Using a document as final$:
<syntaxhighlight lang="m2000 interpreter">
Module checkit {
\\ prepare a file
checkit
</syntaxhighlight>
=={{header|Mathematica}}/{{header|Wolfram Language}}==
<
ToString[Read[str, "Character"], CharacterEncoding -> "UTF-8"]</
=={{header|NetRexx}}==
:The file <tt>data/utf8-001.txt</tt> is a UTF-8 encoded text file containing the following: y䮀𝄞𝄢12.
<
options replace format comments java crossref symbols nobinary
numeric digits 20
say
return
</syntaxhighlight>
{{out}}
<pre>
Because the file would in fact be read line by line, even though the characters are yielded one by one, this may be considered cheating. So we provide a function and an iterator which read bytes one by one.
<
proc readUtf8(f: File): string =
res.add f.readChar()
if res.validateUtf8() == -1: break
yield res</
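The byte-by-byte approach (append one byte at a time until the collected bytes form valid UTF-8) can also be sketched in Python. This is a minimal illustration, not a translation of the Nim code: the names <code>read_utf8_char</code> and <code>utf8_chars</code> are made up for this sketch, and it assumes that raising <code>ValueError</code> is an acceptable way to report bytes that never become a valid character.

<syntaxhighlight lang="python">
def read_utf8_char(stream):
    """Read one UTF-8 character from a binary stream, one byte at a time.
    Returns '' at end of file."""
    buf = b""
    while len(buf) < 4:  # a UTF-8 character is at most 4 bytes long
        byte = stream.read(1)
        if not byte:
            break  # end of file
        buf += byte
        try:
            return buf.decode("utf-8")  # succeeds once the sequence is complete
        except UnicodeDecodeError:
            continue  # not valid yet: fetch another byte
    if buf:
        raise ValueError("invalid UTF-8 sequence: %r" % buf)
    return ""


def utf8_chars(stream):
    """Yield the characters of a binary stream one by one."""
    while True:
        ch = read_utf8_char(stream)
        if ch == "":
            return
        yield ch
</syntaxhighlight>

The stream has to be opened in binary mode, for example <code>open("input.txt", "rb")</code>, so that no decoding happens before the bytes reach the function.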
=={{header|Pascal}}==
<
program ReadFileByChar;
var
Close(OutputFile)
end.
</syntaxhighlight>
=={{header|Perl}}==
<
open my $fh, "<:encoding(UTF-8)", "input.txt" or die "$!\n";
}
close $fh;</
If the contents of the ''input.txt'' file are <code>aă€⼥</code> then the output would be:
could easily add this to that file permanently, and document/autoinclude it properly.
<!--<
constant INVALID_UTF8 = #FFFD
return res
end function
<!--</
Test code:
<!--<
--string utf8 = "aă€⼥" -- (same results as next)
string utf8 = utf32_to_utf8({#0061,#0103,#20ac,#2f25})
end for
close(fn)
<!--</
{{out}}
=={{header|PicoLisp}}==
PicoLisp uses UTF-8 until told otherwise.
<syntaxhighlight lang="picolisp">
(in "wordlist"
(while (char)
(process @)))
</syntaxhighlight>
=={{header|Python}}==
{{works with|Python|2.7}}
<
def get_next_character(f):
# note: assumes valid utf-8
for c in get_next_character(f):
print(c)
</syntaxhighlight>
{{works with|Python|3}}
Python 3 simplifies the handling of text files since you can specify an encoding.
<
"""Reads one character from the given textfile"""
c = f.read(1)
with open("input.txt", encoding="utf-8") as f:
for c in get_next_character(f):
print(c, sep="", end="")</
=={{header|QBasic}}==
<syntaxhighlight lang="qbasic">f = FREEFILE
filename$ = "file.txt"
OPEN filename$ FOR BINARY AS #f
WHILE NOT EOF(f)
char$ = " " ' one-byte buffer: GET reads LEN(char$) bytes
GET #f, , char$
PRINT char$;
WEND
CLOSE #f</syntaxhighlight>
=={{header|Racket}}==
Don't we all love self reference?
<
#lang racket
; This file contains utf-8 characters: λ, α, γ ...
(for ([c (in-port read-char (open-input-file "read-file.rkt"))])
(display c))
</syntaxhighlight>
Output:
<
#lang racket
; This file contains utf-8 characters: λ, α, γ ...
(for ([c (in-port read-char (open-input-file "read-file.rkt"))])
(display c))
</syntaxhighlight>
=={{header|Raku}}==
To read a single character at a time from standard input ($*IN in Raku):
<syntaxhighlight lang="raku"
Or, from a file:
<syntaxhighlight lang="raku"
my $in = open( $filename, :r ) orelse .die;
print $_ while defined $_ = $in.getc;</
=={{header|REXX}}==
<br>The task's requirement stated that '''EOF''' was to be returned upon reaching the end-of-file, so this programming example was written as a subroutine (procedure).
<br>Note that the display of characters that may modify screen behavior (tabs, backspaces, line feeds, carriage returns, "bells" and others) is suppressed; their hexadecimal equivalents are displayed instead.
<
parse arg iFID . /*iFID: is the fileID to be read. */
/* [↓] show the file's contents. */
exit /*stick a fork in it, we're all done. */
/*──────────────────────────────────────────────────────────────────────────────────────*/
getchar: procedure; parse arg z; if chars(z)==0 then return 'EOF'; return charin(z)</
'''input''' file: '''ABC'''
<br>and was created by the DOS command (under Windows/XP): '''echo 123 [¬ a prime]> ABC'''
===version 2===
<
* 29.12.2013 Walter Pachl
* read one utf8 character at a time
Return c
c2b: Return x2b(c2x(arg(1)))</
output:
<pre>y 79
=={{header|Ring}}==
<
fp = fopen("C:\Ring\ReadMe.txt","r")
r = fgetc(fp)
end
fclose(fp)
</syntaxhighlight>
Output:
<pre>
{{works with|Ruby|1.9}}
<
f.each_char{|c| p c}
end</
or
<
while c = f.getc
p c
end
end</
=={{header|Run BASIC}}==
<
numChars = 1 ' specify number of characters to read
a$ = input$(#f,numChars) ' read number of characters specified
b$ = input$(#f,1) ' read one character
close #f</
=={{header|Rust}}==
originally.
<
convert::TryFrom,
fmt::{Debug, Display, Formatter},
Ok(())
}</
the file [http://seed7.sourceforge.net/libraries/utf8.htm#STD_UTF8_OUT STD_UTF8_OUT] is used.
<
include "utf8.s7i";
close(inFile);
end if;
end func;</
{{out}}
=={{header|Sidef}}==
<
var fh = file.open_r # equivalent to: file.open('<:utf8')
fh.each_char { |char|
printf("got character #{char} [U+%04x]\n", char.ord)
}</
{{out}}
<pre>
=={{header|Smalltalk}}==
{{works with|Smalltalk/X}}
<
utfStream := 'input' asFilename readStream asUTF8EncodedStream.
[utfStream atEnd] whileFalse:[
Transcript showCR:'got char ',utfStream next.
].
utfStream close.</
=={{header|Tcl}}==
To read a single character from a file, use:
<
This will read multiple bytes sufficient to obtain a Unicode character if a suitable encoding has been configured on the channel. For binary channels, this will always consume exactly one byte. However, the low-level channel buffering logic may consume more than one byte (which only really matters where the channel is being handed on to another process and the channel is over a file descriptor that doesn't support the <tt>lseek</tt> OS call); the extent of buffering can be controlled via:
<syntaxhighlight lang
When the channel is only being accessed from Tcl (or via Tcl's C API) it is not normally necessary to adjust this option.
=={{header|V (Vlang)}}==
<syntaxhighlight lang="v (vlang)">
import os
fn main() {
file := './file.txt'
mut content_arr := []u8{}
if os.is_file(file) == true {
content_arr << os.read_bytes(file) or {
println('Error: can not read')
exit(1)
}
}
else {
println('Error: can not find file')
exit(1)
}
println(content_arr.bytestr())
}
</syntaxhighlight>
=={{header|Wren}}==
<
File.open("input.txt") { |file|
offset = offset + 1
}
}</
=={{header|zkl}}==
zkl doesn't know much about UTF-8 or Unicode but is able to test whether a string or number is valid UTF-8 or not. This code uses that to build a state machine to decode a byte stream into UTF-8 characters.
<
s+=chr;
try{ s.len(8); return(s) }
catch{ if(s.len()>6) throw(__exception) } // 6 bytes max for UTF-8
return(Void.Again,s); // call me again with s & another character
}</
Used to modify a zkl iterator, it can consume any stream-able (files, strings, lists, etc) and provides support for foreach, map, look ahead, push back, etc.
<
obj.walker(3) // read characters
.tweak(readUTF8c)
}</
<
utf8Walker(s).walk().println();
foreach c in (utf8Walker(Data(Void,s,"\n"))){ print(c) }
utf8Walker(Data(Void,0xe2,0x82,"123456")).walk().println(); // € is short 1 byte</
{{out}}
<pre>
</pre>
If you wish to push a UTF-8 stream through one or more functions, you can use the same state machine:
<
stream.pump(List,readUTF8c,"print")</
{{out}}<pre>-->€123</pre>
and returns a list of the eight UTF-8 characters (with newline).
Or, if file "foo.txt" contains the characters:
<
produces the same result.
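As a point of comparison, the state-machine idea (feed bytes in one at a time and emit a character whenever one is complete) can be sketched in Python with the standard library's incremental UTF-8 decoder. This is an illustration only, not a translation of the zkl code: the name <code>utf8_walker</code> is borrowed from the example above, and the sketch assumes strict error handling, so an invalid or truncated sequence raises <code>UnicodeDecodeError</code>.

<syntaxhighlight lang="python">
import codecs


def utf8_walker(stream):
    """Yield the characters of a binary stream, feeding the decoder one
    byte at a time. The decoder keeps the state of a partial sequence."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    while True:
        byte = stream.read(1)
        if not byte:
            # End of input: raises UnicodeDecodeError if a sequence is
            # still incomplete at this point.
            decoder.decode(b"", final=True)
            return
        ch = decoder.decode(byte)  # '' until the sequence is complete
        if ch:
            yield ch
</syntaxhighlight>

Any object with a <code>read</code> method that returns bytes can be walked this way, for example a file opened with <code>open("foo.txt", "rb")</code> or an <code>io.BytesIO</code> buffer.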