Last Updated: 9/25/2001
A_FileSort
Purpose
License
Description
Dependencies
Limitations
Performance Considerations
To Do
Usage
Class Methods
Instance Methods
Testing
A_Debug Usage
Purpose:
This class allows you to sort large amounts of data. You put records in and when you iterate over the rows they are sorted. The records are stored in an array until the buffer size is exceeded. If all data fits in the specified buffer size, then the Array is sorted by calling Array.sort and you can iterate over the rows. If the buffer fills up, data is written to disk and sorted in chunks. The temporary files used for sorting will have an extension of .srt.
Description:
A_Index,
A_BTree, and
A_View use this class to allow you to iterate over your data sorted in other ways. When data has to be written to disk, this uses a divide and conquer method for sorted. This works best if the first buffer full of data that is process is a good representation of the data that is being sorted. If the first buffer full of data is mostly sorted, I assume all of the data is mostly sorted. Here are the basic steps:
- Check to see if the data is currently sorted, special things are done (not described here) when the data is mostly sorted.
- Sort the data currently in memory
- Write a portion of the buffer to each temporary sort file and remember the highest value for each file
- As more records are written to the sorter, asearch is called to determine which file the new data should be written to. When finished, you should have 20 (assuming you have not changed the default file_count) that contain 1/20 of the data each. All of the data in the first file should be less than all of the data in the second file, etc.
- When you call each() to start iterating, the sorting process begins. The first file is rewound and read until all of its data is processed. If all of the data in file fits in the buffer, then it is sorted and rows are returned. If all the data does not fit in memory, then the whole process is repeated starting with the first bullet above. The same 20 temporary files are used but the new data is written at the end of the files perserving the first set of data and now preserving second set of data (that is 1/20th the size of the first set of data). This process repeats until an entire files worth of data fits in the buffer and the buffer is sorted and data is returned.
- When you have finished iterating over all the data in file 1, file 2 is processed in the same fashion. This repeats until all files are processed and you have iterated over the last sorted data.
This nice thing about this approach is that is can sort a very large amount of data using a fairly small amount of memory. The temporary files are not opened until they are needed. This means that this class performs as well as an Array if the data all fits in the buffer size. The other nice thing about this approach is that you start getting data back after sorting only 1/20 of the data. For interactive users, this means they see data much quicker.
Dependencies:
a_debug.rb - this provides debugging messages for all of my Ruby code. See
a_debug.html for more details.
a_redefinitions.rb - this class uses the Array.asearch method. The documentation for a_refinitions.rb for more details.
Limitations:
None
Performance Considerations:
If you have a lot of memory to spare, set the buffer size to a much larger value like 32M. Please keep in mind that a multi-user system supporting ArunaDB could have several of these and other large memory consuming classes being used at the same time. The amount of memory to use for sorting is dependent to how much memory you have and how many users are connected at the same time.
Double or triple the number of temporary files used for sorted. The default is 20, setting this 40 or 60 should help performance for large amounts of data.
The goal is to keep the number of times the data has to written to disk to minimum.
To Do:
I would like to rewrite this to use the
A_Buffer class instead of an Array.
I would also like to rewrite this in C for performance reasons.
Add support for multiple paths, in case one disk drive is not enough to sort your data.
Usage:
srt = A_FileSort.new()
srt.write(['hello world', 'Test this 1', 2])
srt.write(['hello world3', 'Test this 1', 2])
srt.write(['hello world', 'Test this 3', 2])
srt.each {|data| print "#{data.join(',')}\n"}
srt.close()
rev_sort = proc {|k1, k2| k2 <=> k1}
srt = A_FileSort.new(nil, &rev_sort)
0.upto(99){|i| srt.write(i)}
srt.each {|data| print #{data}\n"}
srt.close
Class Methods:
A_FileSort.version()
Returns the version of this class.
A_FileSort.buffer_size=(new_size)
Allows you set the memory size in bytes of the buffer. This value is used to determine if the buffer is full and should written to disk. This value applies to all sort files. The default value is 500K.
A_FileSort.buffer_size()
Returns the currently defined size of the buffer used for sort files.
A_FileSort.file_count=(new_size)
Sets the number of temporary files to open and use for sorting. This applies to all FileSort objects. The default value is 20 temporary files are used for sorting.
A_FileSort.file_count()
Returns the number of temporary files that is used for sorting.
A_FileSort.new(path=nil, &sort_block)
Creates a new object. Temporary files are not opened until they are needed.
- path - is the path where the temp files used for sorting is created. '/tmp' is the default.
- sort_block allows you to create a method to use for sorting. The sort_block method should take to parameters and return data1 <=> data2. You check the Ruby documentation for Array.sort for more details on how to set this up. I modeled this after Array.sort.
Instance Methods:
close()
Close this sort file and all opened temporary files.
each()
Iterate over all rows in the sort file. Yields(data). This starts the sorting process. The data is sorted the first time you call this method, so it take a minute or two before the data starts returning. Approximately every 1/20th of the data you could see more delays in the iterator as more data is being sorted.
write(data)
Write this data to the sort file.
- data - can be any Ruby object that works with Marshal.dump and Marshal.load.
Testing:
tst_a_filesort.rb - this script performs basic testing for the A_FileSort class. To run this tests type:
ruby -I.. tst_a_filesort.rb
A_Debug Usage:
7 - prints high level stuff like creating and opening an A_TempFile object
70 - prints information about rewinding and sorting data.
71 - prints information about reading and writing blocks of data to and from disk.