External sort: Difference between revisions

← Older edit

External sort (view source)

Revision as of 12:17, 3 December 2023

63,166 bytes added , 5 months ago

m

→‎{{header|Wren}}: Minor tidy

PureFox

9,482

edits

Revision as of 17:19, 5 October 2016 (view source) Proton (talk \| contribs) (Created the external sort task)		Latest revision as of 12:17, 3 December 2023 (view source) PureFox (talk \| contribs) m (→‎{{header\|Wren}}: Minor tidy)
(27 intermediate revisions by 12 users not shown)
Line 1: {{Sorting Algorithm}} {{draft task}}Sort a huge file too large to fit into memory. The algorithm consists in reading a large file to be sorted in chunks of data small enough to fit in main memory, sort each of the chunks, write them out to a temporary file, and finally combined the smaller subfiles into a single larger file. For more info see: https://en.wikipedia.org/wiki/External_sorting [[Category:Sorting]] {{draft task\|Sorting Algorithms}} Sort a huge file too large to fit into memory. ~~The sorting algorithm can be any popular sort, like quicksort. For simplicity one can assume that the file consists of fixed lenght integers and that the sort function is less-than (<).~~ The algorithm consists in reading a large file to be sorted in chunks of data small enough to fit in main memory, sort each of the chunks, write them out to a temporary file, and finally combined the smaller subfiles into a single larger file. For more info see: https://en.wikipedia.org/wiki/External_sorting The sorting algorithm can be any popular sort, like quicksort. For simplicity one can assume that the file consists of fixed length integers and that the sort function is less-than (<). <br><br> =={{header\|AppleScript}}== This effort uses an in-place Quicksort, which tends to move things around less than a merge sort and so hopefully means fewer writes to the file and a less long wait for the task to finish. It suits the characteristics of AppleScript's File Read/Write commands and the fact that stable sorting isn't required with integers. A "half chunk" of integers at a time is read to each of two buffer lists covering different sections of the file range being partitioned. Only those integers needing to be swapped are written back to the file and each list is replaced as it's used up. When the converging sections eventually overlap, a single list is used instead which is updated in parallel with the file to ensure that the partitioning repeat stops in the right place. Partitions less than a "chunk" in length are sorted in memory with a heap sort. (The Foundation framework has a built-in NSMutableArray sort which is faster than a vanilla heap sort — even with the necessary derivation of NSMutableArrays from the lists and lists from the sorted arrays — but I don't know how well this fits the task's "low memory" conceit.) <syntaxhighlight lang="applescript">(* Quicksort algorithm: S.A.R. (Tony) Hoare, 1960. Optimisations by Robert Sedgewick and others at various times. Heap sort algorithm: J.W.J. Williams, 1964. ) use AppleScript version "2.3.1" -- MacOS 10.9 (Mavericks) or later — for these 'use' commands! use internalSorter : script "Heap sort" -- <https://www.rosettacode.org/wiki/Sorting_algorithms/Heapsort#AppleScript>. use scripting additions -- Configuration. property maxChunkSize : 256000 -- 256 KBytes (64000 AppleScript integers). The larger this figure can be, the less slow the sort. property integerSize : 4 property maxInternalSortSize : maxChunkSize on externalSort(theFile) -- Param: file, alias, or HFS path. ( Properties and handlers for the sort. ) script o -- Precalculated values. property twiceIntegerSize : integerSize + integerSize property maxHalfChunkSize : maxChunkSize div twiceIntegerSize integerSize -- Reference number of the system file handle for the open file. property fRef : missing value -- Start byte indices of integers in the file. property i : missing value property j : missing value -- Start byte indices of the most recent additions to pivot stores in the file. property pLeft : missing value property pRight : missing value -- Two buffer lists and assocatiated info. property leftList : missing value property leftEndByte : missing value -- End byte index of equivalent range in file. property a : missing value -- Left list index. property leftListLength : missing value property rightList : missing value property rightStartByte : missing value -- Start byte index of equivalent range in file. property b : missing value -- Right list index. -- Whether or not down to single-list working. property singleList : missing value (* Quicksort handler. Sorts a file of integers in place. ) on qsrt(l, r) -- Params: start-byte indices in the file of the first and last integers in a range to be partitioned. repeat -- Tail call elimination repeat. -- Prime the properties for this range. set {i, j, pLeft, pRight, leftEndByte, a, leftListLength, rightStartByte, b, singleList} to ¬ {l, r, l - integerSize, r + integerSize, l - 1, 0, 0, r + integerSize, 0, false} -- Get the first and last integers in the range, setting up the first two buffer lists in the process. set leftValue to nextLeftInteger(l, r) set rightValue to nextRightInteger(l, r) -- Read a third integer directly from the middle of the range in the file. set pivot to (read fRef from ((l + r - 2) div twiceIntegerSize integerSize + 1) for integerSize as integer) -- Choose one of the three as the pivot (median-of-3 pivot selection). set leftGreaterThanRight to (leftValue > rightValue) if (leftValue > pivot) then if (leftGreaterThanRight) then if (rightValue > pivot) then set pivot to rightValue else set pivot to leftValue end if else if (pivot > rightValue) then if (leftGreaterThanRight) then set pivot to leftValue else set pivot to rightValue end if end if -- Whichever's now the pivot, swap the outermost two if the left's greater than the right. -- If either of them is the pivot, advance the pivot store boundary on the relevant side. if (leftGreaterThanRight) then write leftValue to fRef as integer starting at r write rightValue to fRef as integer starting at l if (leftValue = pivot) then set pRight to r else if (rightValue = pivot) then set pLeft to l end if else if (leftValue = pivot) then set pLeft to l if (rightValue = pivot) then set pRight to r end if -- Continue partitioning the range. set i to l + integerSize set j to r - integerSize repeat until (i > j) -- Partitioning repeat. set leftValue to nextLeftInteger(l, r) repeat while (leftValue < pivot) set i to i + integerSize set leftValue to nextLeftInteger(l, r) end repeat set rightValue to nextRightInteger(l, r) repeat while (rightValue > pivot) set j to j - integerSize set rightValue to nextRightInteger(l, r) end repeat if (j > i) then -- Three-way partitioning: if either value to be swapped is a pivot instance, extend the pivot store -- on the destination side and, if the appropriated slot isn't already the pivot destination, swap its -- current content for (a copy of) the pivot and use the retrieved value instead in the main swap. if (leftValue = pivot) then set pRight to pRight - integerSize if (pRight > j) then set leftValue to (read fRef from pRight for integerSize as integer) write pivot as integer to fRef starting at pRight end if end if if (rightValue = pivot) then set pLeft to pLeft + integerSize if (pLeft < i) then set rightValue to (read fRef from pLeft for integerSize as integer) write pivot as integer to fRef starting at pLeft end if end if -- Write the values to be swapped to the appropriate places in the file. write rightValue to fRef as integer starting at i write leftValue to fRef as integer starting at j -- If down to a single buffer list, update this too so that the repeat will know when to stop. if (singleList) then set item a of my leftList to rightValue set item b of my leftList to leftValue end if else if (i > j) then exit repeat end if set i to i + integerSize set j to j - integerSize end repeat -- Partitioning. -- Swap any stored pivot instances into the slots next to the crossed indices -- and advance the indices to exclude the pivots from the rest of the sort. repeat with p from l to pLeft by integerSize if (j > pLeft) then write (read fRef from j for integerSize as integer) to fRef as integer starting at p write pivot to fRef as integer starting at j set j to j - integerSize else -- Don't bother swapping where store and target slots overlap. set j to p - integerSize exit repeat end if end repeat repeat with p from r to pRight by -integerSize if (i < pRight) then write (read fRef from i for integerSize as integer) to fRef as integer starting at p write pivot to fRef as integer starting at i set i to i + integerSize else set i to p + integerSize exit repeat end if end repeat -- Where the new partitions are short enough, sort them in memory with a non-recursive sort. -- Otherwise subpartition the shorter one recursively, then the longer iteratively. set leftDiff to j - l set rightDiff to r - i if (leftDiff < rightDiff) then set {shorterDiff, ls, rs, longerDiff, l} to {leftDiff, l, j, rightDiff, i} else set {shorterDiff, ls, rs, longerDiff, r} to {rightDiff, i, r, leftDiff, j} end if if (shorterDiff < maxInternalSortSize) then if (rs > ls) then sortInMemory(ls, rs) else qsrt(ls, rs) end if if (longerDiff < maxInternalSortSize) then if (r > l) then sortInMemory(l, r) exit repeat -- … and return from the handler. end if -- Otherwise go round again to handle the longer partition. end repeat -- Tail call elimination. end qsrt (* Return the next integer from the left buffer list, setting up or replacing the list as necessary. ) on nextLeftInteger(l, r) set a to a + 1 if (a > leftListLength) then -- The existing left list has been used up or doesn't yet exist. set leftEndByte to leftEndByte + maxHalfChunkSize -- Derive a new left list from the next half-chunk of data — unless any of this is already -- covered by the other list, in which case replace both lists with a single one. if (leftEndByte < rightStartByte) then set leftList to (read fRef from i for maxHalfChunkSize as integer) as list set a to 1 set leftListLength to (count leftList) else goToSingleList(l, r) set b to b + 1 end if end if return item a of my leftList end nextLeftInteger ( Return the next integer from the right buffer list, simile. ) on nextRightInteger(l, r) set b to b - 1 if (b < 1) then set rightStartByte to rightStartByte - maxHalfChunkSize if (rightStartByte > leftEndByte) then set rightList to (read fRef from rightStartByte for maxHalfChunkSize as integer) as list set b to (count rightList) else goToSingleList(l, r) end if end if return item b of my rightList end nextRightInteger ( Set up a single buffer list for use in the closing stage of a partitioning repeat. ) on goToSingleList(l, r) -- The range to read from the file is from bytes i to (j + integerSize - 1) -- PLUS an integer either side, if these are within the range being partitioned, -- to help ensure that the partitioning repeat stops in the right place. if (i > l) then set readStart to i - integerSize set a to 2 else set readStart to i set a to 1 end if if (j < r) then set readEnd to j + twiceIntegerSize - 1 else set readEnd to j + integerSize - 1 end if -- Ditch the existing right list. set rightList to missing value -- Read the integers from the calculated range and set both list properties to the same result instance. set leftList to (read fRef from readStart to readEnd as integer) as list set rightList to leftList -- Set the other relevant properties. set b to (count rightList) if (j < r) then set b to b - 1 set leftListLength to b set singleList to true end goToSingleList ( Read integers from a given range in the file, sort them in memory, and write them back to the same range. ) on sortInMemory(l, r) set rightList to missing value set leftList to (read fRef from l to (r + integerSize - 1) as integer) as list tell internalSorter to sort(leftList, 1, -1) read fRef from l for 0 -- Set the file handle's position pointer. repeat with x from 1 to (count leftList) write (item x of my leftList) as integer to fRef end repeat end sortInMemory end script ( Main handler code. Sets up and starts the sort. ) -- Check the input. try set theFile to theFile as alias on error display dialog "The specified file doesn't exist." buttons {"Stop"} default button 1 cancel button 1 with icon stop end try set fileSize to (get eof theFile) if (fileSize is 0) then display dialog "The file is empty." buttons {"Stop"} default button 1 cancel button 1 with icon stop else if (fileSize mod integerSize > 0) then display dialog ¬ "The file size isn't an exact number of integers." buttons {"Stop"} default button 1 cancel button 1 with icon stop end if -- Get the user to specify the destination file. Can be the original. set oldPath to theFile as text set astid to AppleScript's text item delimiters set AppleScript's text item delimiters to ":" tell oldPath to set {rootPath, oldName} to {text 1 thru text item -2, text item -1} set AppleScript's text item delimiters to "." tell oldName to set newName to text 1 thru text item -2 & " (sorted copy)." & text item -1 set AppleScript's text item delimiters to astid set newFile to ¬ (choose file name with prompt "Save the sorted result as…" default name newName default location (rootPath as alias)) -- If the original wasn't chosen, copy the data to the new location. -- There are simpler ways to copy a file, but this still practically instantaneous -- and definitely only involves maxChunkSize bytes at a time. if (newFile as text is not oldPath) then set readRef to (open for access theFile) set writeRef to (open for access newFile with write permission) try set eof writeRef to 0 repeat with i from 1 to fileSize by maxChunkSize set d to (read readRef for maxChunkSize as data) write d as data to writeRef end repeat close access writeRef close access readRef on error errMsg close access writeRef close access readRef display dialog errMsg buttons {"Stop"} default button 1 cancel button 1 with icon stop end try set theFile to newFile end if -- Open the target file with write permission and perform the sort. set o's fRef to (open for access theFile with write permission) try -- Handler parameters: first-byte indices of the first and last integers in the file. if (fileSize > maxChunkSize) then tell o to qsrt(1, fileSize + 1 - integerSize) else tell o to sortInMemory(1, fileSize + 1 - integerSize) end if close access o's fRef on error errMsg close access o's fRef display dialog errMsg buttons {"Stop"} default button 1 cancel button 1 with icon stop end try -- Return the specifier for the sorted file. return theFile end externalSort set theFile to (path to desktop as text) & "Test.dat" set sortedFile to externalSort(theFile)</syntaxhighlight> =={{header\|C++}}== Sort a file of integers using '''external merge sort'''. The input file is read into a ''single 32 byte buffer'' (8 ints) and the 8 ints are sorted and then written to a temp file on disk as strings. This continues until all the numbers from the input file are read and distributed into files of 8 integer strings. The last file may be less than 8 integers. All the temp files are opened as ifstreams in a dynamic pointer array. A minheap reads ints from each file stream in turn, and outputs its contents to the output file. ''(The heap generally only fills to 6 items before it writes its contents to the output file.)'' All sorted streams are merged in this way out to an external output file ''merged.txt''. <syntaxhighlight lang="cpp"> / ExternalSort.cpp / #include <iostream> #include <fstream> #include <queue> #include <string> #include <algorithm> #include <cstdio> / function signatures / int main(int argc, char argv[]); void write_vals(int* const, const size_t, const size_t); std::string mergeFiles(size_t); /* Comparison object compares first item of 2 std::pairs of ints true if first item is larger or the same. MinHeap sorts with this. It gets called a lot, so the simpler the better. STL api stipulates boolean predicate A "functor" object defines a function call operator () that returns a value. / struct Compare { // compare 2 pairs by first element bool operator() ( std::pair<int, int>& p1, std::pair<int, int>& p2 ) { return p1.first >= p2.first; // Ascending order } }; / aliases / using ipair = std::pair<int,int>; using pairvector = std::vector<ipair>; using MinHeap = std::priority_queue< ipair, pairvector, Compare >; / constants / const size_t memsize = 32; // 32 bytes const size_t chunksize = memsize / sizeof(int); // 8 int const std::string tmp_prefix{"tmp_out_"}; // tmpfile prefix const std::string tmp_suffix{".txt"}; // tmpfile suffix const std::string merged_file{"merged.txt"}; // output file / functions / // write int array to file void write_vals( int const values, const size_t size, const size_t chunk ) { // tmp_out_1.txt, tmp_out_2.txt ... std::string output_file = (tmp_prefix + std::to_string(chunk) + tmp_suffix); std::ofstream ofs(output_file.c_str()); //output file for (int i=0; i<size; i++) ofs << values[i] << '\t'; ofs << '\n'; ofs.close(); } /* merge all external sorted files into one output file (same size as original input file) / std::string mergeFiles(size_t chunks, const std::string& merge_file ) { std::ofstream ofs( merge_file.c_str() ); MinHeap minHeap; // array of ifstreams std::ifstream ifs_tempfiles = new std::ifstream[chunks]; for (size_t i = 1; i<=chunks; i++) { int topval = 0; // generate a unique name for temp file (temp_out_1.txt , temp_out_2.txt ..) std::string sorted_file = (tmp_prefix + std::to_string(i) + tmp_suffix); // open an input file stream object for each name ifs_tempfiles[i-1].open( sorted_file.c_str() ); // bind to tmp_out_{i}.txt // get val from temp file if (ifs_tempfiles[i-1].is_open()) { ifs_tempfiles[i-1] >> topval; // first value in the file (min) ipair top(topval, (i-1)); // 2nd value is tempfile number minHeap.push( top ); // minHeap autosorts } } while (minHeap.size() > 0) { int next_val = 0; ipair min_pair = minHeap.top(); // get min minHeap.pop(); ofs << min_pair.first << ' '; // write value to file std::flush(ofs); if ( ifs_tempfiles[min_pair.second] >> next_val) { ipair np( next_val, min_pair.second ); minHeap.push( np ); } } // close open files for (int i = 1; i <= chunks; i++) { ifs_tempfiles[i-1].close(); } ofs << '\n'; ofs.close(); delete[] ifs_tempfiles; // free memory return merged_file; // string } int main(int argc, char* argv[] ) { if (argc < 2) { std::cerr << "usage: ExternalSort <filename> \n"; return 1; } // open input file for reading std::ifstream ifs( argv[1] ); if ( ifs.fail() ) { std::cerr << "error opening " << argv[1] << "\n"; return 2; } // temp array for input (small) (32 bytes -- 8 ints) int* inputValues = new int[chunksize]; int chunk = 1; // counter (which chunk) int val = 0; // int for reading int count = 0; // count reads bool done = false; std::cout << "internal buffer is " << memsize << " bytes" << "\n"; // read chunksize values from input file while (ifs >> val) { done = false; inputValues[count] = val; count++; if (count == chunksize) { std::sort(inputValues, inputValues + count); write_vals(inputValues, count, chunk); // output vals to chunk ++; count = 0; done = true; } } // while if (! done) // one more file { std::sort(inputValues, inputValues + count); write_vals(inputValues, count, chunk); // output vals to } else { chunk --; // fix overshoot } ifs.close(); // done with original input file delete[] inputValues; // free dynamically allocated memory // perform external mergesort on sorted temp files, if any. if ( chunk == 0 ) std::cout << "no data found\n"; else std::cout << "Sorted output is in file: " << mergeFiles(chunk, merged_file ) << "\n"; return EXIT_SUCCESS; } /* compile: clang++ -std=c++11 -Wall -pedantic -o ExternalSort ExternalSort.cpp / / inputfile integers -- one per line for simplicity / </syntaxhighlight> {{out}} input: <pre> 2 65 76 88 55 22 35 11 76 99 7 111 4 55 34 63 22 13 45 9 0 112 123 345 456 8 654 678 999 888 10 3 555 534 236 143 860 648 1 627 711 223 332 5 443 445 </pre> tmp_out_1.txt <pre> 2 11 22 35 55 65 76 88 </pre> tmp_out_2.txt <pre> 4 7 34 55 63 76 99 111 </pre> tmp_out_3.txt <pre> 0 9 13 22 45 112 123 345 </pre> tmp_out_4.txt <pre> 3 8 10 456 654 678 888 999 </pre> tmp_out_5.txt <pre> 1 143 236 534 555 627 648 860 </pre> tmp_out_6.txt <pre> 5 223 332 443 445 711 </pre> merged.txt: <pre> 0 1 2 3 4 5 7 8 9 10 11 13 22 22 34 35 45 55 55 63 65 76 76 88 99 111 112 123 143 223 236 332 345 443 445 456 534 555 627 648 654 678 711 860 888 999 </pre> =={{header\|Go}}== This is a translation of the C++ code [https://www.geeksforgeeks.org/external-sorting/ here] which implements external sorting using a merge sort. In the interests of brevity, the extensive comments in the C++ version have been largely omitted. A small test file consisting of random integers has been generated and sorted to demonstrate that the approach works. <syntaxhighlight lang="go">package main import ( "fmt" "io" "log" "math" "math/rand" "os" "time" ) type MinHeapNode struct{ element, index int } type MinHeap struct{ nodes []MinHeapNode } func left(i int) int { return (2i + 1) } func right(i int) int { return (2i + 2) } func newMinHeap(nodes []MinHeapNode) MinHeap { mh := new(MinHeap) mh.nodes = nodes for i := (len(nodes) - 1) / 2; i >= 0; i-- { mh.minHeapify(i) } return mh } func (mh MinHeap) getMin() MinHeapNode { return mh.nodes[0] } func (mh MinHeap) replaceMin(x MinHeapNode) { mh.nodes[0] = x mh.minHeapify(0) } func (mh MinHeap) minHeapify(i int) { l, r := left(i), right(i) smallest := i heapSize := len(mh.nodes) if l < heapSize && mh.nodes[l].element < mh.nodes[i].element { smallest = l } if r < heapSize && mh.nodes[r].element < mh.nodes[smallest].element { smallest = r } if smallest != i { mh.nodes[i], mh.nodes[smallest] = mh.nodes[smallest], mh.nodes[i] mh.minHeapify(smallest) } } func merge(arr []int, l, m, r int) { n1, n2 := m-l+1, r-m tl := make([]int, n1) tr := make([]int, n2) copy(tl, arr[l:]) copy(tr, arr[m+1:]) i, j, k := 0, 0, l for i < n1 && j < n2 { if tl[i] <= tr[j] { arr[k] = tl[i] k++ i++ } else { arr[k] = tr[j] k++ j++ } } for i < n1 { arr[k] = tl[i] k++ i++ } for j < n2 { arr[k] = tr[j] k++ j++ } } func mergeSort(arr []int, l, r int) { if l < r { m := l + (r-l)/2 mergeSort(arr, l, m) mergeSort(arr, m+1, r) merge(arr, l, m, r) } } // Merge k sorted files: es0,es1 etc. func mergeFiles(outputFile string, n, k int) { in := make([]os.File, k) var err error for i := 0; i < k; i++ { fileName := fmt.Sprintf("es%d", i) in[i], err = os.Open(fileName) check(err) } out, err := os.Create(outputFile) check(err) nodes := make([]MinHeapNode, k) i := 0 for ; i < k; i++ { _, err = fmt.Fscanf(in[i], "%d", &nodes[i].element) if err == io.EOF { break } check(err) nodes[i].index = i } hp := newMinHeap(nodes[:i]) count := 0 for count != i { root := hp.getMin() fmt.Fprintf(out, "%d ", root.element) _, err = fmt.Fscanf(in[root.index], "%d", &root.element) if err == io.EOF { root.element = math.MaxInt32 count++ } else { check(err) } hp.replaceMin(root) } for j := 0; j < k; j++ { in[j].Close() } out.Close() } func check(err error) { if err != nil { log.Fatal(err) } } // Create initial runs, divide them evenly amongst the output files // and then merge-sort them. func createInitialRuns(inputFile string, runSize, numWays int) { in, err := os.Open(inputFile) out := make([]os.File, numWays) for i := 0; i < numWays; i++ { fileName := fmt.Sprintf("es%d", i) // es0, es1 etc. out[i], err = os.Create(fileName) check(err) } arr := make([]int, runSize) moreInput := true nextOutputFile := 0 var i int for moreInput { for i = 0; i < runSize; i++ { _, err := fmt.Fscanf(in, "%d", &arr[i]) if err == io.EOF { moreInput = false break } check(err) } mergeSort(arr, 0, i-1) for j := 0; j < i; j++ { fmt.Fprintf(out[nextOutputFile], "%d ", arr[j]) } nextOutputFile++ } for j := 0; j < numWays; j++ { out[j].Close() } in.Close() } func externalSort(inputFile, outputFile string, numWays, runSize int) { createInitialRuns(inputFile, runSize, numWays) mergeFiles(outputFile, runSize, numWays) } func main() { // Create a small test file of 40 random ints and split it into 4 files // of 10 integers each. numWays := 4 runSize := 10 inputFile := "input.txt" outputFile := "output.txt" in, err := os.Create(inputFile) check(err) rand.Seed(time.Now().UnixNano()) for i := 0; i < numWaysrunSize; i++ { fmt.Fprintf(in, "%d ", rand.Intn(math.MaxInt32)) } in.Close() externalSort(inputFile, outputFile, numWays, runSize) // remove temporary files for i := 0; i < numWays; i++ { fileName := fmt.Sprintf("es%d", i) err = os.Remove(fileName) check(err) } }</syntaxhighlight> {{out}} Contents of input.txt: <pre> 921996447 760852351 223421434 1245608832 745990119 1414811249 1947335121 762344474 588429291 993452626 2592794 491133923 1275871423 1152039534 649892156 278215570 595760601 1878223040 1267371451 2097209826 1409628494 1147072290 309824251 108477605 1705270413 1821354697 1703557665 473708588 110138202 1292465428 946557804 148800949 1471244316 1508853596 1306802817 1016358698 1661284048 527644251 546155704 337874167 </pre> Contents of output.txt: <pre> 2592794 108477605 110138202 148800949 223421434 278215570 309824251 337874167 473708588 491133923 527644251 546155704 588429291 595760601 649892156 745990119 760852351 762344474 921996447 946557804 993452626 1016358698 1147072290 1152039534 1245608832 1267371451 1275871423 1292465428 1306802817 1409628494 1414811249 1471244316 1508853596 1661284048 1703557665 1705270413 1821354697 1878223040 1947335121 2097209826 </pre> =={{header\|J}}== Untested on a memory mapped file. <syntaxhighlight lang="j"> NB. Apply an in-place sorting algorithm to a memory mapped file NB. in-place sort is translation of in-place python quicksort. require 'jmf' JCHAR map_jmf_ 'DATA'; 'file.huge' NB. The noun DATA now refers to the memory mapped file. NB. Use: quicksort DATA NB. use: quicksort DATA quicksort=: 3 :'qsinternal 0 , <:@:# ARRAY=: y' NB. ARRAY is global qsinternal=: 3 :0 'start stop'=. y if. 0 < stop - start do. 'left right pivot'=. start, stop, start{ARRAY NB. pivot, left, right = array[start], start, stop while. left <: right do. NB. while left <= right: while. pivot > left { ARRAY do. NB. while array[left] < pivot: left=. >: left end. while. pivot < right { ARRAY do. NB. while array[right] > pivot: right=. <: right NB. right -= 1 end. if. left <: right do. NB. if left <= right: NB. mapped files work by reference, assignment not required, but for testing. ARRAY=: (left, right) {`(\|.@:[)`]} ARRAY NB. array[left], array[right] = array[right], array[left] left=. >: left NB. left += 1 right=. <: right NB. right -= 1 end. end. qsinternal start , right NB. _quicksort(array, start, right) qsinternal left , stop NB. _quicksort(array, left, stop) end. i. 0 0 NB. verbs return the final noun ) </syntaxhighlight> Demonstration the sorting works: <pre> quicksort ?~10 ARRAY 0 1 2 3 4 5 6 7 8 9 </pre> =={{header\|Julia}}== <syntaxhighlight lang="julia">intfile = open("/tmp/mmap.bin", "r+") arr = Mmap.mmap(intfile, Vector{Int64}, (div(stat(intfile).size, 8))) # Int64 is 8 bytes sort!(arr) </syntaxhighlight> =={{header\|Nim}}== {{trans\|Phix}} <syntaxhighlight lang="nim">import algorithm, heapqueue, os, random, sequtils, strformat, strutils proc sortFiles(filenames: seq[string]) = for filename in filenames: var lines = filename.readFile() lines.stripLineEnd() # Ignore last line feed, if any. var sortedLines = sorted(lines.splitLines()) echo &"""{filename} => {sortedLines.join(", ")}""" filename.writeFile(sortedLines.join("\n")) proc mergeFiles(outfile: File; filenames: seq[string]) = var queue: HeapQueue[(string, File)] for filename in filenames: let f = open(fileName) queue.push (f.readLine(), f) while queue.len > 0: let (s, infile) = queue.pop() outfile.write s & '\n' if infile.endOfFile: infile.close() else: queue.push((infile.readLine(), infile)) when isMainModule: const WriteToFile = false # Compile time switch. randomize() let nf = rand(2..4) # Number of files. let lp = 3 # Lines per file. var filenames: seq[string] var lines = toSeq(1..nflp) lines.shuffle() for i in 1..nf: let filename = &"file{i}.txt" filenames.add filename let f = open(filename, fmWrite) for l in 1..lp: f.write &"Line {lines[^l]:2}\n" lines.setLen(lines.len - lp) f.close() echo &"sorting {nf lp} lines split over {nf} files" sortFiles(filenames) when WriteToFile: let f = open("results.txt", fmWrite) mergeFiles(f, filenames) f.close() else: mergeFiles(stdout, filenames) for filename in filenames: removeFile(filename)</syntaxhighlight> {{out}} <pre>sorting 12 lines split over 4 files file1.txt => Line 3, Line 4, Line 10 file2.txt => Line 1, Line 6, Line 12 file3.txt => Line 2, Line 5, Line 8 file4.txt => Line 7, Line 9, Line 11 Line 1 Line 2 Line 3 Line 4 Line 5 Line 6 Line 7 Line 8 Line 9 Line 10 Line 11 Line 12</pre> =={{header\|Perl}}== Simulate task by reading from 'DATA' handle and using tiny record limit. As written, works for any numeric input, but could define any kind of customized sorting. <syntaxhighlight lang="perl">use strict; use warnings; my $max = 4; # records per merge file my(@chunk,@tempf); sub mysort ($$) { return $_[0] <=> $_[1] } sub store { my($a) = @_; my $f = IO::File->new_tmpfile; # self-deleting after program exit print $f sort mysort @$a; seek $f, 0, 0 or warn "Oops: $!"; push(@tempf, { fh => $f, queued => scalar <$f> } ); } # read input and create sorted temporary files while (<DATA>) { push @chunk, $_; store(\@chunk), @chunk = () if @chunk == $max; } store(\@chunk) if @chunk; # merge everything while (1) { my($lowest) = (sort { mysort($a->{queued}, $b->{queued}); } grep(defined $_->{queued}, @tempf) )[0]; last unless $lowest->{queued}; print $lowest->{queued}; $lowest->{queued} = $lowest->{fh}->getline(); } __DATA__ 432 345 321 543 987 456 678 123 765 567 876 654 789 234</syntaxhighlight> {{out}} <pre>123 234 321 345 432 456 543 567 654 678 765 789 876 987</pre> =={{header\|Phix}}== Slight variation on [[Stream_Merge#Phix\|Stream_Merge]] <!--<syntaxhighlight lang="phix">(notonline)--> <span style="color: #008080;">without</span> <span style="color: #008080;">js</span> <span style="color: #000080;font-style:italic;">-- file i/o</span> <span style="color: #008080;">include</span> <span style="color: #000000;">builtins</span><span style="color: #0000FF;">/</span><span style="color: #000000;">pqueue</span><span style="color: #0000FF;">.</span><span style="color: #000000;">e</span> <span style="color: #008080;">include</span> <span style="color: #000000;">builtins</span><span style="color: #0000FF;">/</span><span style="color: #000000;">pfile</span><span style="color: #0000FF;">.</span><span style="color: #000000;">e</span> <span style="color: #000080;font-style:italic;">-- write_lines() - not [yet] documented</span> <span style="color: #008080;">procedure</span> <span style="color: #000000;">add</span><span style="color: #0000FF;">(</span><span style="color: #004080;">integer</span> <span style="color: #000000;">fn</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">pq</span><span style="color: #0000FF;">)</span> <span style="color: #004080;">object</span> <span style="color: #000000;">line</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">gets</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">if</span> <span style="color: #000000;">line</span><span style="color: #0000FF;">=-</span><span style="color: #000000;">1</span> <span style="color: #008080;">then</span> <span style="color: #7060A8;">close</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">else</span> <span style="color: #7060A8;">pq_add</span><span style="color: #0000FF;">({</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">,</span><span style="color: #000000;">line</span><span style="color: #0000FF;">},</span> <span style="color: #000000;">pq</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> <span style="color: #008080;">end</span> <span style="color: #008080;">procedure</span> <span style="color: #008080;">procedure</span> <span style="color: #000000;">sort_files</span><span style="color: #0000FF;">(</span><span style="color: #004080;">sequence</span> <span style="color: #000000;">filenames</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">filenames</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span> <span style="color: #004080;">sequence</span> <span style="color: #000000;">lines</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">get_text</span><span style="color: #0000FF;">(</span><span style="color: #000000;">filenames</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">],</span><span style="color: #004600;">GT_LF_STRIPPED</span><span style="color: #0000FF;">),</span> <span style="color: #000000;">sorted</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">sort</span><span style="color: #0000FF;">(</span><span style="color: #000000;">lines</span><span style="color: #0000FF;">)</span> <span style="color: #7060A8;">printf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">1</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"%s:%v => %v\n"</span><span style="color: #0000FF;">,{</span><span style="color: #000000;">filenames</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">],</span><span style="color: #000000;">lines</span><span style="color: #0000FF;">,</span><span style="color: #000000;">sorted</span><span style="color: #0000FF;">})</span> <span style="color: #008080;">if</span> <span style="color: #7060A8;">write_lines</span><span style="color: #0000FF;">(</span><span style="color: #000000;">filenames</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">],</span><span style="color: #000000;">sorted</span><span style="color: #0000FF;">)!=</span><span style="color: #000000;">1</span> <span style="color: #008080;">then</span> <span style="color: #0000FF;">?</span><span style="color: #000000;">9</span><span style="color: #0000FF;">/</span><span style="color: #000000;">0</span> <span style="color: #008080;">end</span> <span style="color: #008080;">if</span> <span style="color: #008080;">end</span> <span style="color: #008080;">for</span> <span style="color: #008080;">end</span> <span style="color: #008080;">procedure</span> <span style="color: #008080;">procedure</span> <span style="color: #000000;">merge_files</span><span style="color: #0000FF;">(</span><span style="color: #004080;">integer</span> <span style="color: #000000;">outfn</span><span style="color: #0000FF;">,</span> <span style="color: #004080;">sequence</span> <span style="color: #000000;">filenames</span><span style="color: #0000FF;">)</span> <span style="color: #004080;">integer</span> <span style="color: #000000;">pq</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">pq_new</span><span style="color: #0000FF;">()</span> <span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #7060A8;">length</span><span style="color: #0000FF;">(</span><span style="color: #000000;">filenames</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span> <span style="color: #000000;">add</span><span style="color: #0000FF;">(</span><span style="color: #7060A8;">open</span><span style="color: #0000FF;">(</span><span style="color: #000000;">filenames</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">],</span> <span style="color: #008000;">"r"</span><span style="color: #0000FF;">),</span><span style="color: #000000;">pq</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">end</span> <span style="color: #008080;">for</span> <span style="color: #008080;">while</span> <span style="color: #008080;">not</span> <span style="color: #7060A8;">pq_empty</span><span style="color: #0000FF;">(</span><span style="color: #000000;">pq</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">do</span> <span style="color: #0000FF;">{</span><span style="color: #004080;">integer</span> <span style="color: #000000;">fn</span><span style="color: #0000FF;">,</span> <span style="color: #004080;">string</span> <span style="color: #000000;">line</span><span style="color: #0000FF;">}</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">pq_pop</span><span style="color: #0000FF;">(</span><span style="color: #000000;">pq</span><span style="color: #0000FF;">)</span> <span style="color: #7060A8;">puts</span><span style="color: #0000FF;">(</span><span style="color: #000000;">outfn</span><span style="color: #0000FF;">,</span><span style="color: #000000;">line</span><span style="color: #0000FF;">)</span> <span style="color: #000000;">add</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">,</span> <span style="color: #000000;">pq</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">end</span> <span style="color: #008080;">while</span> <span style="color: #7060A8;">pq_destroy</span><span style="color: #0000FF;">(</span><span style="color: #000000;">pq</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">end</span> <span style="color: #008080;">procedure</span> <span style="color: #008080;">procedure</span> <span style="color: #000000;">test</span><span style="color: #0000FF;">()</span> <span style="color: #004080;">integer</span> <span style="color: #000000;">nf</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">rand</span><span style="color: #0000FF;">(</span><span style="color: #000000;">5</span><span style="color: #0000FF;">),</span> <span style="color: #000080;font-style:italic;">-- number of files</span> <span style="color: #000000;">lp</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">3</span> <span style="color: #000080;font-style:italic;">-- lines per file</span> <span style="color: #004080;">sequence</span> <span style="color: #000000;">filenames</span> <span style="color: #0000FF;">=</span> <span style="color: #0000FF;">{},</span> <span style="color: #000000;">lines</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">shuffle</span><span style="color: #0000FF;">(</span><span style="color: #7060A8;">tagset</span><span style="color: #0000FF;">(</span><span style="color: #000000;">nf</span><span style="color: #0000FF;"></span><span style="color: #000000;">lp</span><span style="color: #0000FF;">))</span> <span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #000000;">nf</span> <span style="color: #008080;">do</span> <span style="color: #004080;">string</span> <span style="color: #000000;">filename</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">sprintf</span><span style="color: #0000FF;">(</span><span style="color: #008000;">"file%d.txt"</span><span style="color: #0000FF;">,</span><span style="color: #000000;">i</span><span style="color: #0000FF;">)</span> <span style="color: #000000;">filenames</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">append</span><span style="color: #0000FF;">(</span><span style="color: #000000;">filenames</span><span style="color: #0000FF;">,</span><span style="color: #000000;">filename</span><span style="color: #0000FF;">)</span> <span style="color: #004080;">integer</span> <span style="color: #000000;">fn</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">open</span><span style="color: #0000FF;">(</span><span style="color: #000000;">filename</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"w"</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">for</span> <span style="color: #000000;">l</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #000000;">lp</span> <span style="color: #008080;">do</span> <span style="color: #7060A8;">printf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"Line %02d\n"</span><span style="color: #0000FF;">,</span><span style="color: #000000;">lines</span><span style="color: #0000FF;">[</span><span style="color: #000000;">l</span><span style="color: #0000FF;">])</span> <span style="color: #008080;">end</span> <span style="color: #008080;">for</span> <span style="color: #000000;">lines</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">lines</span><span style="color: #0000FF;">[</span><span style="color: #000000;">lp</span><span style="color: #0000FF;">+</span><span style="color: #000000;">1</span><span style="color: #0000FF;">..$]</span> <span style="color: #7060A8;">close</span><span style="color: #0000FF;">(</span><span style="color: #000000;">fn</span><span style="color: #0000FF;">)</span> <span style="color: #008080;">end</span> <span style="color: #008080;">for</span> <span style="color: #7060A8;">printf</span><span style="color: #0000FF;">(</span><span style="color: #000000;">1</span><span style="color: #0000FF;">,</span><span style="color: #008000;">"sorting %d lines split over %d files\n"</span><span style="color: #0000FF;">,{</span><span style="color: #000000;">nf</span><span style="color: #0000FF;"></span><span style="color: #000000;">lp</span><span style="color: #0000FF;">,</span><span style="color: #000000;">nf</span><span style="color: #0000FF;">})</span> <span style="color: #000000;">sort_files</span><span style="color: #0000FF;">(</span><span style="color: #000000;">filenames</span><span style="color: #0000FF;">)</span> <span style="color: #004080;">integer</span> <span style="color: #000000;">outfn</span> <span style="color: #0000FF;">=</span> <span style="color: #000000;">1</span> <span style="color: #000080;font-style:italic;">-- or open("results.txt","w")</span> <span style="color: #000000;">merge_files</span><span style="color: #0000FF;">(</span><span style="color: #000000;">outfn</span><span style="color: #0000FF;">,</span><span style="color: #000000;">filenames</span><span style="color: #0000FF;">)</span> <span style="color: #000080;font-style:italic;">-- close(outfn)</span> <span style="color: #008080;">for</span> <span style="color: #000000;">i</span><span style="color: #0000FF;">=</span><span style="color: #000000;">1</span> <span style="color: #008080;">to</span> <span style="color: #000000;">nf</span> <span style="color: #008080;">do</span> <span style="color: #0000FF;">{}</span> <span style="color: #0000FF;">=</span> <span style="color: #7060A8;">delete_file</span><span style="color: #0000FF;">(</span><span style="color: #000000;">filenames</span><span style="color: #0000FF;">[</span><span style="color: #000000;">i</span><span style="color: #0000FF;">])</span> <span style="color: #008080;">end</span> <span style="color: #008080;">for</span> <span style="color: #008080;">end</span> <span style="color: #008080;">procedure</span> <span style="color: #000000;">test</span><span style="color: #0000FF;">()</span> <!--</syntaxhighlight>--> {{out}} <pre> sorting 9 lines split over 3 files file1.txt:{"Line 04","Line 01","Line 09"} => {"Line 01","Line 04","Line 09"} file2.txt:{"Line 06","Line 07","Line 02"} => {"Line 02","Line 06","Line 07"} file3.txt:{"Line 08","Line 03","Line 05"} => {"Line 03","Line 05","Line 08"} Line 01 Line 02 Line 03 Line 04 Line 05 Line 06 Line 07 Line 08 Line 09 </pre> =={{header\|Python}}== A technique demonstrated with a short string character data. <syntaxhighlight lang="python"> #! /usr/bin/python3 ''' $ # example session in bash $ python3 external_sort.py expect 123456789 memory size 1 passed memory size 2 passed memory size 3 passed memory size 4 passed memory size 5 passed memory size 6 passed memory size 7 passed memory size 8 passed memory size 9 passed memory size 10 passed memory size 11 passed ''' import io def sort_large_file(n: int, source: open, sink: open, file_opener = open)->None: ''' approach: break the source into files of size n sort each of these files merge these onto the sink ''' # store sorted chunks into files of size n mergers = [] while True: text = list(source.read(n)) if not len(text): break; text.sort() merge_me = file_opener() merge_me.write(''.join(text)) mergers.append(merge_me) merge_me.seek(0) # merge onto sink stack_tops = [f.read(1) for f in mergers] while stack_tops: c = min(stack_tops) sink.write(c) i = stack_tops.index(c) t = mergers[i].read(1) if t: stack_tops[i] = t else: del stack_tops[i] mergers[i].close() del mergers[i] # __del__ method of file_opener should delete the file def main(): ''' test case sort 6,7,8,9,2,5,3,4,1 with several memory sizes ''' # load test case into a file like object input_file_too_large_for_memory = io.StringIO('678925341') # generate the expected output t = list(input_file_too_large_for_memory.read()) t.sort() expect = ''.join(t) print('expect', expect) # attempt to sort with several memory sizes for memory_size in range(1,12): input_file_too_large_for_memory.seek(0) output_file_too_large_for_memory = io.StringIO() sort_large_file(memory_size, input_file_too_large_for_memory, output_file_too_large_for_memory, io.StringIO) output_file_too_large_for_memory.seek(0) assert(output_file_too_large_for_memory.read() == expect) print('memory size {} passed'.format(memory_size)) if __name__ == '__main__': example = main example() </syntaxhighlight> =={{header\|Raku}}== (formerly Perl 6) Borrowing from [http://rosettacode.org/wiki/Stream_Merge Stream_Merge] here. Temporary files are automatically deleted when program is done, so no explicit clean-up required. <syntaxhighlight lang="raku" line>use File::Temp; sub merge_streams ( @streams ) { my @s = @streams.map({ hash( STREAM => $_, HEAD => .get ) }).grep({ .<HEAD>.defined }); return gather while @s { my $h = @s.min: +.<HEAD>; take $h<HEAD>; $h<HEAD> := $h<STREAM>.get orelse @s .= grep( { $_ !=== $h } ); } } sub store (@values) { my ($filename,$filehandle) = tempfile(:prefix('external-sort.')); $filehandle.say: join "\n", @values.sort: +; $filename } # we're going to pretend that this is a long stream of input from stdin... my (@chunk,@files); for (<43 2 45 32 15 4 3 -9 45 66 0 42 78 123 -11 76 55 87 -2 64 92 34>) { @chunk.push: $_; @files.push: store(@chunk) and @chunk = () if @chunk.elems == 4; # limit of records per merge file } @files.push: store(@chunk) if @chunk; say join ' ', merge_streams @files».&open;</syntaxhighlight> {{out}} <pre>-11 -9 -2 0 2 3 4 15 32 34 42 43 45 45 55 64 66 76 78 87 92 123</pre> =={{header\|REXX}}== Programming note:   the method used to generate the input file is to create the file with   '''N'''   records, <br>breaking up the records into sort work files of no more than   '''10'''   records   (limit). <br>The sort work files are then sorted with an external sort, and then merged into one big file. This particular example uses the DOS   '''SORT'''   and   '''ERASE'''   commands. <syntaxhighlight lang="rexx">/REXX pgm reads a file, splits into smaller files, sorts 'em, combines into sorted file/ parse arg FID n lim seed . /obtain optional arguments from the CL/ if FID=='' \| FID=="," then FID= 'SORT_EXT.OUT' /name of the output (sorted) file. / if n=='' \| n=="," then n= 500 /number of records (rand #s) to gen. / if lim=='' \| lim=="," then lim= 10 /number of records per SORTWORK file. / if datatype(seed, 'W') then call random ,,seed /Numeric? Then use it as a rand seed./ sWork = 'SORTWORK.' /the filename of the SORTWORK files./ call gen n,lim /generate SORTWORK.nnn files. / call srt # /sort records in all SORTWORK files./ call mrg /merge records in the SORTWORK files./ exit 0 /stick a fork in it, we're all done. / /──────────────────────────────────────────────────────────────────────────────────────/ mrg: procedure expose FID sWork; parse arg # /#: the number of SORTWORK files. / @.= copies('ff'x, 1e5) /no value should be larger than this. / call lineout FID, , 1 /position the output file at rec # 1. / do j=1 until @.j==@.; call rdr j /read any number of SORTWORK files, / end /j/ /but initially just 1 record per file./ j= j - 1 /adj. J; read from a non─existent file/ do forever; y= @. /find the lowest value for N values./ z= 0 do k=1 for j /traipse through the stemmed @ array./ if @.k==@. then call rdr k /Not defined? Then read a file record/ if @.k<<y then do; y= @.k /Lowest so far? Then mark this as min./ z= k end end /k/ /* [↑] note use of << exact comparison/ if z==0 then leave /Any more records? No, close file. / call lineout FID, @.z /write the value to the output file. / call rdr z /re-populate a value from the # file. / end /forever/ /keep reading/merging until exhausted./ call lineout FID /close the output file (just in case)./ 'ERASE' sWORK"" /delete all the SORTWORK files. / return /──────────────────────────────────────────────────────────────────────────────────────/ gen: procedure expose #; parse arg m,siz; d= digits() /used for justify/ # = 0 /number of SORTWORK.nnn files so far/ do j=1 for m; #= 1 + j % siz /create workfile#/ call lineout 'SORTWORK.'#, right(random(, 1e5), d) /write rand #. / end /j/ do k=1 for #; call lineout 'SORTWORK.'# /close a workfile/ end /k/ return /──────────────────────────────────────────────────────────────────────────────────────/ rdr: parse arg a; @.a=@.; f= sWork \|\| a; if lines(f)\==0 then @.a= linein(f); return /──────────────────────────────────────────────────────────────────────────────────────/ srt: procedure expose sWork; parse arg # do j=1 for #; fn= sWORK \|\| j; 'SORT' fn "/O" fn; end /j/; return</syntaxhighlight><br><br> =={{header\|Wren}}== {{trans\|Go}} {{libheader\|Wren-dynamic}} {{libheader\|Wren-sort}} {{libheader\|Wren-str}} A bit simpler than the Go version as we use fixed length integers which (together with a following space) can be sorted as strings. <syntaxhighlight lang="wren">import "io" for File import "random" for Random import "./dynamic" for Struct import "./sort" for Sort import "./str" for Str var MinHeapNode = Struct.create("MinHeapNode", ["element", "index"]) class MinHeap { construct new(nodes) { _nodes = nodes var start = ((_nodes.count-1)/2).floor for (i in start..0) minHeapify(i) } left(i) { 2i + 1 } right(i) { 2i + 2 } nodes { _nodes } min { _nodes[0] } replaceMin(x) { _nodes[0] = x minHeapify(0) } minHeapify(i) { var l = left(i) var r = right(i) var smallest = i var heapSize = _nodes.count if (l < heapSize && Str.lt(_nodes[l].element, _nodes[i].element)) smallest = l if (r < heapSize && Str.lt(_nodes[r].element, _nodes[smallest].element)) smallest = r if (smallest != i) { _nodes.swap(i, smallest) minHeapify(smallest) } } } // Merge k sorted files: es0,es1 etc. var mergeFiles = Fn.new { \|outputFile, k, e\| var inp = List.filled(k, null) var offset = List.filled(k, 0) // current offset for each input file for (i in 0...k) { var fileName = "es%(i)" inp[i] = File.open(fileName) } var out = File.create(outputFile) var nodes = List.filled(k, null) for (i in 0...k) nodes[i] = MinHeapNode.new(0, 0) var i = 0 while (i < k) { var bytes = inp[i].readBytes(e) if (bytes.count < e) break // end of file reached nodes[i].element = bytes nodes[i].index = i offset[i] = offset[i] + e i = i + 1 } var hp = MinHeap.new(nodes[0...i]) var count = 0 while (count != i) { var root = hp.min out.writeBytes(root.element) var bytes = inp[root.index].readBytes(e, offset[root.index]) if (bytes.count < e) { // end of file reached root.element = "999999 " count = count + 1 } else { root.element = bytes offset[root.index] = offset[root.index] + e } hp.replaceMin(root) } for (j in 0...k) inp[j].close() out.close() } // Create initial runs, divide them evenly amongst the output files // and then merge-sort them. var createInitialRuns = Fn.new { \|inputFile, numWays, runSize, elementSize\| var inp = File.open(inputFile) var offset = 0 for (i in 0...numWays) { var fileName = "es%(i)" // es0, es1 etc. var bytes = inp.readBytes(runSize * elementSize, offset) offset = offset + runSize * elementSize var numbers = Str.chunks(bytes, elementSize) numbers = Sort.merge(numbers) File.create(fileName) { \|f\| f.writeBytes(numbers.join("")) } } inp.close() } var externalSort = Fn.new { \|inputFile, outputFile, numWays, runSize, elementSize\| createInitialRuns.call(inputFile, numWays, runSize, elementSize) mergeFiles.call(outputFile, numWays, elementSize) } // Create a small test file of 40 random 6 digit integers and split it into 4 files // of 10 such integers each. var numWays = 4 var runSize = 10 var elementSize = 7 // 6 digits + a following space var inputFile = "external_sort_input.txt" var outputFile = "external_sort_output.txt" var inp = File.create(inputFile) var rand = Random.new() var min = 100000 var max = 999999 for (i in 0...numWays*runSize) inp.writeBytes("%(rand.int(min, max).toString) ") inp.close() externalSort.call(inputFile, outputFile, numWays, runSize, elementSize) // remove temporary files for (i in 0...numWays) { var fileName = "es%(i)" File.delete(fileName) }</syntaxhighlight> {{out}} Sample run: <pre> Contents of external_sort_input.txt: 195387 279593 270645 457221 187563 459521 984067 443317 890630 986820 357072 302605 354825 295908 541221 273855 318978 913819 961359 776939 337617 640070 100140 266938 597987 305187 731698 449166 388165 121283 516001 256453 197931 660491 785453 544828 346520 532447 688793 194774 Contents of external_sort_output.txt: 100140 121283 187563 194774 195387 197931 256453 266938 270645 273855 279593 295908 302605 305187 318978 337617 346520 354825 357072 388165 443317 449166 457221 459521 516001 532447 541221 544828 597987 640070 660491 688793 731698 776939 785453 890630 913819 961359 984067 986820 </pre>