Read entire file: Difference between revisions

Revision as of 00:57, 16 August 2010

Load the entire contents of some text file as a single string variable.

If applicable, discuss: encoding selection, the possibility of memory-mapping.

Of course, one should avoid reading an entire file at once if the file is large and the task can be accomplished incrementally instead (in which case check File IO); this is for those cases where having the entire file is actually what is wanted.

ALGOL 68

In official ALGOL 68 a file is composed of pages, lines and characters; however, ALGOL 68 Genie and ELLA ALGOL 68RS do not support this concept, as they adopt the Unix concept of files being "flat" and hence containing only characters.

The book can contain new pages and new lines and is not tied to any particular character set, hence it is system independent. The character set is set by a call to make conv, e.g. make conv(tape, ebcdic conv); cf. Character_codes for more details.

In official/standard ALGOL 68 only: <lang algol68>MODE BOOK = FLEX[0]FLEX[0]FLEX[0]CHAR; ¢ pages of lines of characters ¢
BOOK book;

FILE book file; INT errno = open(book file, "book.txt", stand in channel);

get(book file, book)</lang>

Once a "book" has been read into a book array it can still be associated with a virtual file and accessed again with standard file routines (such as readf, printf, putf, getf, new line etc). This means data can be manipulated directly from an array cached in "core" using transput (stdio) routines.

In official/standard ALGOL 68 only: <lang algol68>FILE cached book file; associate(cached book file, book)</lang>

C

It is not possible to specify encodings: the file is read as binary data (on some systems, the b flag is ignored and there is no difference between "r" and "rb"; on others, it changes the way newlines are treated, but this should not affect fread). <lang c>#include <stdio.h>
#include <stdlib.h>

int main() {

 char *buffer;
 FILE *fh = fopen("readentirefile.c", "rb");
 if ( fh != NULL )
 {
   fseek(fh, 0L, SEEK_END);
   long s = ftell(fh);
   rewind(fh);
   buffer = malloc(s);
   if ( buffer != NULL )
   {
     fread(buffer, s, 1, fh);
     // we can now close the file
     fclose(fh); fh = NULL;
     
     // do something, e.g.
     fwrite(buffer, s, 1, stdout);

     free(buffer);
   }
   if (fh != NULL) fclose(fh);
 }
 return EXIT_SUCCESS;

}</lang>

Works with: POSIX

We can memory-map the file.

<lang c>#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>

int main() {

 const char *buffer;
 struct stat s;

 int fd = open("readentirefile_mm.c", O_RDONLY);
 if (fd < 0 ) return EXIT_FAILURE;
 fstat(fd, &s);
 buffer = mmap(NULL, s.st_size, PROT_READ, MAP_PRIVATE, fd, 0L);

 if ( buffer != MAP_FAILED )  // mmap reports failure with MAP_FAILED, not NULL
 {
   // do something
   fwrite(buffer, s.st_size, 1, stdout);

   // The const qualifier turns any attempt to write through buffer into
   // a compile-time error; without it, such a write would cause a
   // segmentation fault at run time, because the mapping is PROT_READ.
   // munmap takes a non-const pointer, hence the cast.
   munmap((void *)buffer, s.st_size);
 }

 close(fd);
 return EXIT_SUCCESS;

}</lang>

Clojure

The core function slurp does the trick; you can specify an encoding as an optional second argument: <lang clojure>(slurp "myfile.txt") (slurp "my-utf8-file.txt" "UTF-8")</lang>

D

D version 2. To read a whole file into a dynamic array of unsigned bytes: <lang d>import std.file: read;

void main() {

   auto data = cast(ubyte[])read("data.raw");

}</lang> To read a whole file into a validated UTF-8 string: <lang d>import std.file: readText;

void main() {

   string s = readText("text.txt");

}</lang>

E

<lang e><file:foo.txt>.getText()</lang>

The file is assumed to be in the default encoding.

Factor

<lang factor>USING: io.encodings.ascii io.encodings.binary io.files ;

! to read entire file as binary
"foo.txt" binary file-contents

! to read entire file as lines of text
"foo.txt" ascii file-lines</lang>

Forth

Works with: GNU Forth

<lang forth>s" foo.txt" slurp-file ( str len )</lang>

Haskell

In the IO monad:

<lang haskell>do text <- readFile filepath

  -- do stuff with text</lang>

This example is untested. Please check that it's correct, debug it as necessary, and remove this message.

Note that readFile is lazy. If you want to ensure the entire file is read in at once, before any other IO actions are run, try:

<lang haskell>eagerReadFile :: FilePath -> IO String
eagerReadFile filepath = do
    text <- readFile filepath
    -- length forces the whole string and, unlike last, is safe on an empty file
    length text `seq` return text</lang>

Icon and Unicon

Icon

The first code snippet below reads from stdin directly into the string fs. <lang Icon>every (fs := "") ||:= |read()</lang> The second code snippet below performs the same operation using an intermediate list fL and applying a function (e.g. FUNC) to each line. Use this form when you need to perform additional string functions such as 'trim' or 'map' on each line. This avoids unnecessary garbage collections which will occur with larger files. The list can be discarded when done. <lang Icon>every put(fL := [],|FUNC(read()))
every (fs := "") ||:= !fL
fL := &null</lang>

Unicon

This Icon solution works in Unicon.

J

To memory map the file:

<lang j> require'jmf'

  JCHAR map_jmf_ 'var';'foo.txt'</lang>

Caution: updating the value of the memory mapped variable will update the file, and this characteristic remains when the variable's value is passed, unmodified, to a verb which modifies its own local variables.

Java

There is no single method to do this in Java (probably because reading an entire file at once could fill up your memory quickly), so to do this you could simply append the contents as you read them line-by-line as in File IO. <lang java>import java.io.BufferedReader; import java.io.FileReader; import java.io.IOException;

public class ReadFile {

   public static void main(String[] args) throws IOException{
       String fileContents = readEntireFile("./foo.txt");
   }

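   private static String readEntireFile(String filename) throws IOException {
       FileReader in = new FileReader(filename);
       try {
           StringBuilder contents = new StringBuilder();
           char[] buffer = new char[4096];
           int read = 0;
           // Append the previous chunk, then read the next one;
           // the first pass appends nothing because read is 0.
           do {
               contents.append(buffer, 0, read);
               read = in.read(buffer);
           } while (read >= 0);
           return contents.toString();
       } finally {
           in.close(); // release the file handle even if reading fails
       }
   }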

}</lang> Reading in fixed-size chunks keeps the line terminators exactly as they appear in the file. Calling "in.read()" with no arguments would also work, but it reads one character at a time and would likely be less efficient for large files.

Liberty BASIC

<lang lb>filedialog "Open a Text File","*.txt",file$
if file$<>"" then

   open file$ for input as #1
   entire$ = input$(#1, lof(#1))
   close #1
   print entire$

end if</lang>

OCaml

For most uses we can use this function:

<lang ocaml>let load_file f =

 let ic = open_in f in
 let n = in_channel_length ic in
 let s = String.create n in
 really_input ic s 0 n;
 close_in ic;
 (s)</lang>

There is no problem reading an entire file with the function really_input because this function is implemented appropriately with an internal loop, but it can only load files whose size is at most the maximum length of an OCaml string. This maximum size is available in the variable Sys.max_string_length. On 32-bit machines this size is about 16 MB.

To load bigger files several solutions exist; for example, create a structure containing several strings across which the contents of the file can be split. Another solution that is often used is a bigarray of chars instead of a string:

<lang ocaml>type big_string =

 (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t</lang>

The function below returns the contents of a file with this type big_string, and it does so with "memory-mapping":

<lang ocaml>let load_big_file filename =

 let fd = Unix.openfile filename [Unix.O_RDONLY] 0o640 in
 let len = Unix.lseek fd 0 Unix.SEEK_END in
 let _ = Unix.lseek fd 0 Unix.SEEK_SET in
 let shared = false in  (* modifications are done in memory only *)
 let bstr = Bigarray.Array1.map_file fd
              Bigarray.char Bigarray.c_layout shared len in
 Unix.close fd;
 (bstr)</lang>

Then the length of the data can be obtained with Bigarray.Array1.dim instead of String.length, and a given char can be accessed with the syntactic sugar bstr.{i} (instead of str.[i]), as shown in the small piece of code below (similar to the cat command):

<lang ocaml>let () =

 let bstr = load_big_file Sys.argv.(1) in
 let len = Bigarray.Array1.dim bstr in
 for i = 0 to pred len do
   let c = bstr.{i} in
   print_char c
 done</lang>

Oz

The interface for file operations is object-oriented. <lang oz>declare

 FileHandle = {New Open.file init(name:"test.txt")}
 FileContents = {FileHandle read(size:all list:$)}

in

 {FileHandle close}
 {System.printInfo FileContents}</lang>

FileContents is a list of bytes. The operation does not assume any particular encoding.

Perl

<lang perl>open my $fh, '<', $filename or die "Cannot open $filename: $!";
my $text = do { local( $/ ); <$fh> };</lang> or <lang perl>use File::Slurp;
my $text = read_file($filename);</lang>

PicoLisp

Using 'till' is the shortest way: <lang PicoLisp>(in "file" (till NIL T))</lang> To read the file into a list of characters: <lang PicoLisp>(in "file" (till NIL))</lang> or, more explicit: <lang PicoLisp>(in "file" (make (while (char) (link @))))</lang> Encoding is always assumed to be UTF-8.

PL/I

<lang PL/I> get file (in) edit ((substr(s, i, 1) do i = 1 to 32767)) (a); </lang>

PowerShell

<lang powershell>Get-Content foo.txt</lang> This will only detect Unicode correctly with a BOM in place (even for UTF-8). With explicit selection of encoding: <lang powershell>Get-Content foo.txt -Encoding UTF8</lang> However, both return an array of strings which is fine for pipeline use but if a single string is desired the array needs to be joined: <lang powershell>(Get-Content foo.txt) -join "`n"</lang>

PHP

<lang php>file_get_contents($filename)</lang>

PureBasic

A file can be read with any of the built-in commands <lang PureBasic>Number.b = ReadByte(#File)
Length.i = ReadData(#File, *MemoryBuffer, LengthToRead)
Number.c = ReadCharacter(#File)
Number.d = ReadDouble(#File)
Number.f = ReadFloat(#File)
Number.i = ReadInteger(#File)
Number.l = ReadLong(#File)
Number.q = ReadQuad(#File)
Text$ = ReadString(#File [, Flags])
Number.w = ReadWord(#File)</lang> If the file is a pure text file (no CR/LF etc.), this will work and will read each line until EOL is found. <lang PureBasic>If ReadFile(0, "RC.txt")

 Variable$=ReadString(0)     
 CloseFile(0)

EndIf</lang> Since PureBasic terminates strings with #NULL and ReadString() stops at each newline character, any file containing these must be treated as a data stream. <lang PureBasic>Title$="Select a file"
Pattern$="Text (.txt)|*.txt|All files (*.*)|*.*"
fileName$ = OpenFileRequester(Title$,"",Pattern$,0)
If fileName$

 If ReadFile(0, fileName$)
   length = Lof(0)     
   *MemoryID = AllocateMemory(length)  
   If *MemoryID
     bytes = ReadData(0, *MemoryID, length)
     MessageRequester("Info",Str(bytes)+" was read")
   EndIf
   CloseFile(0)
 EndIf

EndIf</lang>

Python

<lang python>open(filename).read()</lang>

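In Python 2 this returns a byte string and does not assume any particular encoding. In Python 3 it returns a text string decoded with the platform's default encoding; open the file with mode "rb" to get raw bytes instead.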
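
The task also asks about encoding selection and memory-mapping. The following is a minimal sketch for Python 3, not a definitive implementation: the helper names read_text, read_bytes and read_mapped, the UTF-8 default, and the sample file are assumptions made for illustration only.

```python
import mmap
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Return the whole file decoded as text using the given encoding."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


def read_bytes(path):
    """Return the whole file as raw, undecoded bytes."""
    with open(path, "rb") as f:
        return f.read()


def read_mapped(path, encoding="utf-8"):
    """Memory-map the file read-only, then decode the mapped bytes."""
    with open(path, "rb") as f:
        if os.fstat(f.fileno()).st_size == 0:
            return ""  # mmap cannot map an empty file
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m[:].decode(encoding)


if __name__ == "__main__":
    # Hypothetical sample file, created only for this demonstration.
    with tempfile.TemporaryDirectory() as d:
        sample = os.path.join(d, "sample.txt")
        with open(sample, "wb") as f:
            f.write("naïve café\n".encode("utf-8"))
        print(read_text(sample))
        print(read_bytes(sample))
        print(read_mapped(sample))
```

Note that read_mapped copies the mapped bytes when it decodes them, so it illustrates the mapping mechanics rather than a memory saving; to benefit from the mapping, work on the mmap object directly.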

Ruby

<lang ruby>IO.read(filename)</lang>

Tcl

This reads the data in as text, applying the default encoding translations. <lang tcl>set f [open $filename]
set data [read $f]
close $f</lang> To read the data in as uninterpreted bytes, either use fconfigure to put the handle into binary mode before reading, or (from Tcl 8.5 onwards) do this: <lang tcl>set f [open $filename "rb"]
set data [read $f]
close $f</lang>

VBScript

Read text file with default encoding into variable and display <lang vb>dim s
s = createobject("scripting.filesystemobject").opentextfile("slurp.vbs",1).readall
wscript.echo s</lang>

Read text file with UTF-16 encoding into memory and display <lang vb>wscript.echo createobject("scripting.filesystemobject").opentextfile("utf16encoded.txt",1,-1).readall</lang>

Zsh

@@ Line 228: / Line 228: @@
 There is no problem reading an entire file with the function <code>really_input</code> because this function is implemented appropriately with an internal loop, but it can only load files which size is equal or inferior to the maximum length of an ocaml string. This maximum size is available with the variable <code>Sys.max_string_length</code>. On 32 bit machines this size is about 16Mo.
-To load bigger files several solutions exist, for example create a structure that contains several strings where the contents of the file can be split. Or an another solution that is often used is to use a bigarray of chars instead of a string:
+To load bigger files several solutions exist, for example create a structure that contains several strings where the contents of the file can be split. Or another solution that is often used is to use a bigarray of chars instead of a string:
 <lang ocaml>type big_string =