Read a file character by character/UTF8
Read a file one character at a time, as opposed to reading the entire file at once.
The solution may be implemented as a procedure, which returns the next character in the file on each consecutive call (returning EOF when the end of the file is reached).
The procedure should support the reading of files containing UTF8 encoded wide characters, returning whole characters for each consecutive read.
Run BASIC
<lang runbasic>open "file.txt" for binary as #f
numChars = 1              ' specify number of characters to read
a$ = input$(#f,numChars)  ' read number of characters specified
b$ = input$(#f,1)         ' read one character
close #f</lang>
Perl 6
Perl 6 has a built-in method, .getc, that reads a single character from an open file handle. File handles default to UTF-8, so multi-byte characters are handled correctly.
To read a single character at a time from standard input ($*IN in Perl 6):
<lang perl6>.say while defined $_ = $*IN.getc;</lang>
Or, from a file: <lang perl6>my $filename = 'whatever';
my $in = open( $filename, :r ) or die "$!\n";
print $_ while defined $_ = $in.getc;</lang>
Python
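<lang python>filename = "file.txt"  # example name; replace with the file to read

# Text mode with an explicit encoding makes read(1) return one whole
# character (code point), however many bytes it occupies in the file.
with open(filename, encoding="utf-8") as f:
    while True:
        char = f.read(1)
        if not char:  # an empty string signals end of file
            break
        print(char)
</lang>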
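The task statement asks for a procedure that returns the next character on each call and signals EOF at the end of the file. The following is a minimal sketch of that shape, not a definitive implementation. It assumes the input is valid UTF-8; the names `make_char_reader` and `EOF` are illustrative choices for this sketch, and an in-memory stream stands in for a real file.

```python
import io

EOF = None  # hypothetical end-of-file sentinel chosen for this sketch


def make_char_reader(stream):
    """Return a function that gives the next character of `stream` per call.

    `stream` is assumed to be a text stream decoded as UTF-8, for example
    one obtained from open(name, encoding="utf-8").
    """
    def next_char():
        ch = stream.read(1)  # one code point, possibly several bytes on disk
        return ch if ch else EOF
    return next_char


if __name__ == "__main__":
    # Stand-in for a real file: an in-memory UTF-8 byte stream.
    raw = io.BytesIO("yä€".encode("utf-8"))
    next_char = make_char_reader(io.TextIOWrapper(raw, encoding="utf-8"))
    while (ch := next_char()) is not EOF:
        print(ch, ch.encode("utf-8").hex().upper())
```

Once the end of the stream is reached, every further call keeps returning the sentinel, so callers can test for it with `is EOF`.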
Racket
Don't we all love self reference?
<lang racket>
#lang racket
; This file contains utf-8 characters
; λ, α, γ ...
(for ([c (in-port read-char (open-input-file "read-file.rkt"))])
  (display c))
</lang>
Output:
<lang racket>
#lang racket
; This file contains utf-8 characters
; λ, α, γ ...
(for ([c (in-port read-char (open-input-file "read-file.rkt"))])
  (display c))
</lang>
REXX
version 1
REXX doesn't support UTF8 encoded wide characters, just bytes.
The task's requirement stated that EOF was to be returned upon reaching the end-of-file, so this programming example was written as a subroutine (procedure).
<lang rexx>/*REXX pgm reads/shows a file char by char, returning 'EOF' when done.  */
parse arg f .                          /* F is the fileID to be read.   */
                                       /* [↓] show the file's contents. */
if f\==''  then do j=1  until x=='EOF' /*J counts the file's characters.*/
                x=getchar(f);    y=    /*get a character or an 'EOF'.   */
                if x>>' '  then y=x    /*display X if presentable.      */
                say right(j,12) 'character, (hex,char)' c2x(x) y
                end   /*j*/            /* [↑] only show X if not low hex*/
exit                                   /*stick a fork in it, we're done.*/
/*───────────────────────────────GETCHAR subroutine─────────────────────*/
getchar: procedure;  parse arg z;  if chars(z)==0  then return 'EOF'
                                   return charin(z)</lang>
input file: ABC
123 [¬ a prime]
output (for the above input file):
           1 character, (hex,char) 31 1
           2 character, (hex,char) 32 2
           3 character, (hex,char) 33 3
           4 character, (hex,char) 20
           5 character, (hex,char) 5B [
           6 character, (hex,char) AA ¬
           7 character, (hex,char) 20
           8 character, (hex,char) 61 a
           9 character, (hex,char) 20
          10 character, (hex,char) 70 p
          11 character, (hex,char) 72 r
          12 character, (hex,char) 69 i
          13 character, (hex,char) 6D m
          14 character, (hex,char) 65 e
          15 character, (hex,char) 5D ]
          16 character, (hex,char) 0D
          17 character, (hex,char) 0A
          18 character, (hex,char) 454F46 EOF
version 2
<lang rexx>/* REXX ---------------------------------------------------------------
* 29.12.2013 Walter Pachl
* read one utf8 character at a time
* see http://de.wikipedia.org/wiki/UTF-8#Kodierung
*--------------------------------------------------------------------*/
oid='utf8.txt';'erase' oid /* first create file containing utf8 chars*/
Call charout oid,'79'x
Call charout oid,'C3A4'x
Call charout oid,'C2AE'x
Call charout oid,'E282AC'x
Call charout oid,'F09D849E'x
Call lineout oid
fid='utf8.txt'             /* then read it and show the contents     */
Do Until c8='EOF'
  c8=get_utf8char(fid)
  Say left(c8,4) c2x(c8)
  End
Exit

get_utf8char: Procedure
  Parse Arg f
  If chars(f)=0 Then
    Return 'EOF'
  c=charin(f)
  b=c2b(c)
  If left(b,1)=0 Then
    Nop
  Else Do
    p=pos('0',b)
    Do i=1 To p-2
      If chars(f)=0 Then Do
        Say 'illegal contents in file' f
        Leave
        End
      c=c||charin(f)
      End
    End
  Return c

c2b: Return x2b(c2x(arg(1)))</lang>
output:
y    79
ä   C3A4
®   C2AE
€  E282AC
𝄞 F09D849E
EOF  454F46
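The same lead-byte technique can be sketched in Python for comparison: the first byte of a UTF-8 sequence tells how many continuation bytes follow. This is an illustrative sketch under stated assumptions, not part of the REXX entry. The function name `read_utf8_char` and the use of `None` as the end-of-file marker are choices made here, and the stream is assumed to be a buffered binary stream holding well-formed UTF-8.

```python
import io


def read_utf8_char(f):
    """Read one UTF-8 encoded character from the binary stream `f`.

    Returns the character as a str, or None at end of file (a sentinel
    chosen for this sketch).
    """
    lead = f.read(1)
    if not lead:
        return None
    first = lead[0]
    if first < 0x80:
        extra = 0                      # 0xxxxxxx: single byte (ASCII)
    elif first >> 5 == 0b110:
        extra = 1                      # 110xxxxx: one continuation byte
    elif first >> 4 == 0b1110:
        extra = 2                      # 1110xxxx: two continuation bytes
    elif first >> 3 == 0b11110:
        extra = 3                      # 11110xxx: three continuation bytes
    else:
        raise ValueError(f"invalid UTF-8 lead byte: {first:#04x}")
    rest = f.read(extra)
    if len(rest) != extra:
        raise ValueError("file ends in the middle of a character")
    return (lead + rest).decode("utf-8")


if __name__ == "__main__":
    # The same byte sequences that the REXX program writes to utf8.txt.
    f = io.BytesIO(bytes.fromhex("79C3A4C2AEE282ACF09D849E"))
    while (ch := read_utf8_char(f)) is not None:
        print(ch, ch.encode("utf-8").hex().upper())
```

Unlike the REXX version, this sketch raises an error on a truncated sequence instead of printing a message and returning the partial bytes.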
Ruby
UTF-8 is the default encoding since Ruby 2.0. In Ruby 1.9 use the magic comment "#encoding: utf-8" on the first line.
<lang ruby>DATA.each_char{|c| p c}
__END__
characters: λ, α, γ</lang>
Tcl
To read a single character from a file, use:
<lang tcl>set ch [read $channel 1]</lang>
This will read multiple bytes sufficient to obtain a Unicode character if a suitable encoding has been configured on the channel. For binary channels, this will always consume exactly one byte. However, the low-level channel buffering logic may consume more than one byte (which only really matters where the channel is being handed on to another process and the channel is over a file descriptor that doesn't support the lseek OS call); the extent of buffering can be controlled via:
<lang tcl>fconfigure $channel -buffersize $byteCount</lang>
When the channel is only being accessed from Tcl (or via Tcl's C API) it is not normally necessary to adjust this option.