One of n lines in a file

From Rosetta Code
Revision as of 00:48, 19 February 2012 by 76.85.206.175 (talk)
Task
One of n lines in a file
You are encouraged to solve this task according to the task description, using any language you may know.

A method of choosing a line randomly from a file:

  • Without reading the file more than once
  • When substantial parts of the file cannot be held in memory
  • Without knowing how many lines are in the file

Is to:

  • keep the first line of the file as a possible choice, then
  • Read the second line of the file if possible and make it the possible choice if a uniform random value between zero and one is less than 1/2.
  • Read the third line of the file if possible and make it the possible choice if a uniform random value between zero and one is less than 1/3.
...
  • Read the Nth line of the file if possible and make it the possible choice if a uniform random value between zero and one is less than 1/N
  • Return the computed possible choice when no further lines exist in the file.
Task
  1. Create a function/method/routine called one_of_n that given n, the number of actual lines in a file, follows the algorithm above to return an integer - the line number of the line chosen from the file.
    The number returned can vary, randomly, in each run.
  2. Use one_of_n in a simulation to find what woud be the chosen line of a 10 line file simulated 1,000,000 times.
  3. Print and show how many times each of the 10 lines is chosen as a rough measure of how well the algorithm works.

Note: You may choose a smaller number of repetitions if necessary, but mention this up-front.

Ada

<lang Ada>with Ada.Text_IO, Ada.Numerics.Float_Random;

procedure One_Of_N is

  Num_Of_Lines: constant Positive := 10;
  package Rnd renames Ada.Numerics.Float_Random;
  Gen: Rnd.Generator; -- used globally
  function Choose_One_Of_N(Last_Line_Number: Positive) return Natural is
     Current_Choice: Natural := 0;
  begin
     for Line_Number in 1 .. Last_Line_Number loop
       if (Rnd.Random(Gen) * Float(Line_Number) <= 1.0) then
          Current_Choice := Line_Number;
       end if;
     end loop;
     return Current_Choice;
  end Choose_One_Of_N;
  Results: array(1 .. Num_Of_Lines) of Natural := (others => 0);
  Index: Integer range 1 .. Num_Of_Lines;

begin

  Rnd.Reset(Gen);
  for I in 1 .. 1_000_000 loop    -- compute results
     Index := Choose_One_Of_N(Num_Of_Lines);
     Results(Index) := Results(Index) + 1;
  end loop;
  for R in Results'Range loop    -- output results
     Ada.Text_IO.Put(Integer'Image(Results(R)));
  end loop;

end One_Of_N;</lang>

Example output:

 100104 100075 99761 99851 100457 100315 100101 99557 99678 100101

AutoHotkey

Translation of: Python

This simulation is for 100,000 repetitions. <lang AutoHotkey>one_of_n(n){

   ; One based line numbers
   choice = 1
   Loop % n-1
   {
       Random, rnd, 1, % A_Index+1
       If rnd = 1
           choice := A_Index+1
   }
   return choice

} one_of_n_test(n=10, trials=100000){

   bins := [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
   Loop % trials
       bins[one_of_n(n)] += 1
   return bins

}

b := one_of_n_test() Loop 10

  out .= A_Index ": " b[A_Index] "`n"

MsgBox % out</lang> Output:

---------------------------
One of n.ahk
---------------------------
1: 10046
2: 9958
3: 9953
4: 9973
5: 9915
6: 10212
7: 9941
8: 9993
9: 10063
10: 9946

---------------------------
OK   
---------------------------

C

<lang c>#include <stdio.h>

  1. include <stdlib.h>

inline int irand(int n) { int r, randmax = RAND_MAX/n * n; while ((r = rand()) >= randmax); return r / (randmax / n); }

inline int one_of_n(int n) { int i, r = 0; for (i = 1; i < n; i++) if (!irand(i + 1)) r = i; return r; }

int main(void) { int i, r[10] = {0};

for (i = 0; i < 1000000; i++, r[one_of_n(10)]++); for (i = 0; i < 10; i++) printf("%d%c", r[i], i == 9 ? '\n':' ');

return 0; }</lang>output

100561 99814 99816 99721 99244 99772 100790 100072 99997 100213

Euphoria

<lang Euphoria>-- One of n lines in a file include std/rand.e include std/math.e

function one_of_n(integer n) integer line_num = 1 for i = 2 to n do if rnd() < 1 / i then line_num = i end if end for return line_num end function

procedure main() integer num_reps = 1000000, num_lines_in_file = 10 sequence lines = repeat(0,num_lines_in_file) for i = 1 to num_reps do lines[one_of_n(num_lines_in_file)] += 1 end for for i = 1 to num_lines_in_file do printf(1,"Number of times line %d was selected: %g\n", {i,lines[i]}) end for printf(1,"Total number selected: %d\n", sum(lines) ) end procedure

main() </lang> Sample Output:

Number of times line 1 was selected: 100154
Number of times line 2 was selected: 99778
Number of times line 3 was selected: 99906
Number of times line 4 was selected: 99727
Number of times line 5 was selected: 100025
Number of times line 6 was selected: 100465
Number of times line 7 was selected: 99738
Number of times line 8 was selected: 100135
Number of times line 9 was selected: 99871
Number of times line 10 was selected: 100201
Total number selected: 1000000

Factor

random-line uses an input stream. "/etc/passwd" ascii [ random-line . ] with-file-reader would print a random line from /etc/passwd.

<lang factor>! rosettacode/random-line/random-line.factor USING: io kernel locals math random ; IN: rosettacode.random-line

random-line ( -- line )
   readln :> choice! 1 :> count!
   [ readln dup ]
   [ count 1 + dup count! random zero?
       [ choice! ] [ drop ] if
   ] while drop
   choice ;</lang>

one-of-n wants to use the same algorithm. Factor has duck typing, so one-of-n creates a mock object that quacks like an input stream. This mock object only responds to stream-readln, not the other methods of stream protocol. This works because random-line only needs stream-readln. The mock response is a line number instead of a real line.

<lang factor>! rosettacode/one-of-n/one-of-n.factor USING: accessors io kernel math rosettacode.random-line ; IN: rosettacode.one-of-n

<PRIVATE TUPLE: mock-stream count last ;

<mock-stream> ( n -- stream )
   mock-stream new 0 >>count swap >>last ;

M: mock-stream stream-readln ! stream -- line

   dup [ count>> ] [ last>> ] bi <
   [ [ 1 + ] change-count count>> ]
   [ drop f ] if ;

PRIVATE>

one-of-n ( n -- line )
   <mock-stream> [ random-line ] with-input-stream* ;

USING: assocs formatting locals sequences sorting ; <PRIVATE

f>0 ( object/f -- object/0 )
   dup [ drop 0 ] unless ;
test-one-of-n ( -- )
   H{ } clone :> chosen
   1000000 [
       10 one-of-n chosen [ f>0 1 + ] change-at
   ] times
   chosen keys natural-sort [
       dup chosen at "%d chosen %d times\n" printf
   ] each ;

PRIVATE> MAIN: test-one-of-n</lang>

$ ./factor -run=rosettacode.one-of-n 
Loading resource:work/rosettacode/one-of-n/one-of-n.factor
Loading resource:work/rosettacode/random-line/random-line.factor
Loading resource:basis/formatting/formatting.factor
Loading resource:basis/formatting/formatting-docs.factor
1 chosen 100497 times
2 chosen 100157 times
3 chosen 100207 times
4 chosen 99448 times
5 chosen 100533 times
6 chosen 99774 times
7 chosen 99535 times
8 chosen 99826 times
9 chosen 100058 times
10 chosen 99965 times

Go

<lang go>package main

import (

   "bufio"
   "fmt"
   "io"
   "math/rand"
   "time"

)

// choseLineRandomly implements the method described in the task. // input is a an io.Reader, which could be an os.File, for example. // Or, to implement a simulation, it could be anything else that implements // io.Reader. The method as described suggests saving and returning // lines, but the rest of the task requires line numbers. This function // thus returns both. func choseLineRandomly(r io.Reader) (s string, ln int, err error) {

   br := bufio.NewReader(r)
   s, err = br.ReadString('\n')
   if err != nil {
       return
   }
   ln = 1
   lnLast := 1.
   var sLast string
   for {
       // note bufio.ReadString used here.  This effectively defines a
       // line of the file as zero or more bytes followed by a newline.
       sLast, err = br.ReadString('\n')
       if err == io.EOF {
           return s, ln, nil // normal return
       }
       if err != nil {
           break
       }
       lnLast++
       if rand.Float64() < 1/lnLast {
           s = sLast
           ln = int(lnLast)
       }
   }
   return // error return

}

// oneOfN function required for task item 1. Specified to take a number // n, the number of lines in a file, but the method (above) specified to // to be used does not need n, but rather the file itself. This function // thus takes both, ignoring n and passing the file to choseLineRandomly. func oneOfN(n int, file io.Reader) int {

   _, ln, err := choseLineRandomly(file)
   if err != nil {
       panic(err)
   }
   return ln

}

// simulated file reader for task item 2 type simReader int

func (r *simReader) Read(b []byte) (int, error) {

   if *r <= 0 {
       return 0, io.EOF
   }
   b[0] = '\n'
   *r--
   return 1, nil

}

func main() {

   // task item 2 simulation consists of accumulating frequency statistic
   // on 1,000,000 calls of oneOfN on simulated file.
   n := 10
   freq := make([]int, n)
   rand.Seed(time.Now().UnixNano())
   for times := 0; times < 1e6; times++ {
       sr := simReader(n)
       freq[oneOfN(n, &sr)-1]++
   }
   // task item 3.  show frequencies.
   fmt.Println(freq)

}</lang> Output:

[99945 99770 99594 100532 99941 100223 99716 100217 99855 100207]

Icon and Unicon

Translation of: Python

<lang Icon>procedure main() # one of n

  one_of_n_test(10,1000000)

end

procedure one_of_n(n)

  every i := 1 to n do 
     choice := (?0  < 1. / i, i)
  return \choice | fail

end

procedure one_of_n_test(n,trials)

  bins := table(0)
  every i := 1 to trials do
        bins[one_of_n(n)] +:= 1
  every writes(bins[i := 1 to n]," ")
  return bins

end</lang>

Sample output:

99470 99806 99757 99921 100213 100001 99778 100385 100081 100588

J

This implementation also implements line buffering, since the built-in line handling does not work quite how I want it to work. That said, if a line is too large (many gigabytes, for example), the system will crawl to a halt when the line is read.

<lang j>randLineBig=:3 :0

 file=. boxopen y
 r=. 
 n=. 1
 size=. fsize file
 blocksize=. 1e7
 buffer=. 
 for_block. |: blocksize -~/\@(] <. [ * 0 1 +/i.@>.@%~) size do.
   buffer=. buffer, fread file,<block
   linends=. LF = buffer
   lines=. linends <;.2 buffer
   buffer=. buffer }.~ {: 1+I.linends
   pick=. (0 ?@$~ #lines) < % n+i.#lines
   if. 1 e. pick do.
     r=. ({:I.pick) {:: lines
   end.
   n=. n+#lines
 end.
 r

)</lang>

Usage: randLineBig 'filename'

Testing:

<lang j> (,LF,.~":,.i.10) fwrite <'seq.txt' 20

  (#;~.)/.~ /:~ <@randLineBig"0]1e6#<'seq.txt'

┌──────┬───┐ │99916 │0 │ ├──────┼───┤ │99944 │1 │ ├──────┼───┤ │100250│2 │ ├──────┼───┤ │100621│3 │ ├──────┼───┤ │99594 │4 │ ├──────┼───┤ │100106│5 │ ├──────┼───┤ │99957 │6 │ ├──────┼───┤ │99975 │7 │ ├──────┼───┤ │100054│8 │ ├──────┼───┤ │99583 │9 │ └──────┴───┘</lang>

Java

Translation of: Python

<lang Java>import java.util.Arrays; import java.util.Random;

public class OneOfNLines {

static Random rand;

public static int oneOfN(int n) { int choice = 0;

for(int i = 1; i < n; i++) { if(rand.nextInt(i+1) == 0) choice = i; }

return choice; }

public static void main(String[] args) { int n = 10; int trials = 1000000; int[] bins = new int[n]; rand = new Random();

for(int i = 0; i < trials; i++) bins[oneOfN(n)]++;


System.out.println(Arrays.toString(bins)); } } </lang>

Sample output:

[99832, 99958, 100281, 99601, 99568, 99689, 100118, 99753, 100659, 100541]

OCaml

<lang ocaml>let one_of_n n =

 let rec aux i r =
   if i >= n then r else
     if Random.int (i + 1) = 0
     then aux (succ i) i
     else aux (succ i) r
 in
 aux 1 0

let test ~n ~trials =

 let ar = Array.make n 0 in
 for i = 1 to trials do
   let d = one_of_n n in
   ar.(d) <- succ ar.(d)
 done;
 Array.iter (Printf.printf " %d") ar;
 print_newline ()

let () =

 Random.self_init ();
 test ~n:10 ~trials:1_000_000</lang>

Executing:

$ ocamlopt -o one.opt one.ml
$ ./one.opt 
 100620 99719 99928 99864 99760 100151 99553 100529 99800 100076

Perl 6

Translation of: Python

<lang perl6>sub one_of_n($n) {

   my $choice;
   $choice = $_ if .rand < 1 for 1 .. $n;
   $choice - 1;

}

sub one_of_n_test($n = 10, $trials = 1000000) {

   my @bins;
   @bins[one_of_n($n)]++ for ^$trials;
   @bins;

}

say one_of_n_test();</lang> Output:

100288 100047 99660 99773 100256 99633 100161 100483 99789 99910

PicoLisp

<lang PicoLisp>(de one-of-n (N)

  (let R 1
     (for I N
        (when (= 1 (rand 1 I))
           (setq R I) ) )
     R ) )

(let L (need 10 0)

  (do 1000000
     (inc (nth L (one-of-n 10))) )
  L )</lang>

Output:

-> (99893 100145 99532 100400 100263 100229 99732 100116 99709 99981)

PureBasic

<lang purebasic>Procedure.f randomFloat()

  ProcedureReturn Random(1000000) / 1000000

EndProcedure

Procedure one_of_n(n)

 Protected linesRead, lineChosen
 While linesRead < n 
   linesRead + 1
   If randomFloat() <= (1.0 / (linesRead))
      lineChosen = linesRead
   EndIf
 Wend
 ProcedureReturn lineChosen

EndProcedure

If OpenConsole()

 #testFileLineCount = 10
 #simulationCount = 1000000
 Define i
 Dim a(#testFileLineCount) ;index 0 is not used
 For i = 1 To #simulationCount
   x = one_of_n(#testFileLineCount)
   a(x) + 1
 Next
 
 For i = 1 To #testFileLineCount
   Print(Str(a(i)) + "  ")
 Next
 PrintN("")
 
 Print(#CRLF$ + #CRLF$ + "Press ENTER to exit"): Input()
 CloseConsole()

EndIf</lang> Sample output:

99959  100011  100682  100060  99834  99632  100083  99817  99824  100098

Python

<lang python>from random import randrange

def one_of_n(n):

   # Zero based line numbers
   choice = 0
   for i in range(1, n):
       if randrange(i+1) == 0:
           choice = i
   return choice
           

def one_of_n_test(n=10, trials=1000000):

   bins = [0] * n
   if n:
       for i in range(trials):
           bins[one_of_n(n)] += 1
   return bins

print(one_of_n_test())</lang>

Sample output
[99833, 100303, 99902, 100132, 99608, 100117, 99531, 100017, 99795, 100762]

Ruby

<lang ruby># Returns a random line from _io_, or nil if _io_ has no lines.

  1. # Get a random line from /etc/passwd
  2. line = open("/etc/passwd") {|f| random_line(f) }

def random_line(io)

 choice = io.gets; count = 1
 while line = io.gets
   rand(count += 1).zero? and choice = line
 end
 choice

end

def one_of_n(n)

 # Create a mock IO that provides line numbers instead of lines.
 # Assumes that #random_line calls #gets.
 (mock_io = Object.new).instance_eval do
   @count = 0
   @last = n
   def self.gets
     (@count < @last) ? (@count += 1) : nil
   end
 end
 random_line(mock_io)

end

chosen = Hash.new(0) 1_000_000.times { chosen[one_of_n(10)] += 1 } chosen.keys.sort.each do |key|

 puts "#{key} chosen #{chosen[key]} times"

end</lang>

$ ruby one-of-n.rb 
1 chosen 100470 times
2 chosen 100172 times
3 chosen 100473 times
4 chosen 99725 times
5 chosen 100600 times
6 chosen 99126 times
7 chosen 100297 times
8 chosen 99606 times
9 chosen 100039 times
10 chosen 99492 times

Seed7

<lang seed7>$ include "seed7_05.s7i";

const func integer: one_of_n (in integer: n) is func

 result
   var integer: r is 1;
 local
   var integer: i is 0;
 begin
   for i range 2 to n do
     if rand(1, i) = 1 then
       r := i;
     end if;
   end for;
 end func;

const proc: main is func

 local
   var array integer: r is 10 times 0;
   var integer: i is 0;
 begin
   for i range 1 to 1000000 do
     incr(r[one_of_n(10)]);
   end for;
   for i range 1 to 10 do
     write(r[i] <& " ");
   end for;
   writeln;
 end func;</lang>

Output:

100372 99661 100264 99644 100180 99748 99718 100205 99714 100494 

Tcl

<lang tcl>package require Tcl 8.5 proc 1ofN {n} {

   for {set line 1} {$line <= $n} {incr line} {

if {rand() < 1.0/[incr fraction]} { set result $line }

   }
   return $result

}

for {set i 0} {$i < 1000000} {incr i} {

   incr count([1ofN 10])

} parray count; # Alphabetic order, but convenient</lang> Sample output:

count(1)  = 99862
count(10) = 100517
count(2)  = 100545
count(3)  = 100339
count(4)  = 99636
count(5)  = 99920
count(6)  = 99263
count(7)  = 100283
count(8)  = 99871
count(9)  = 99764