Random line from a file
From
Amcleod to
All on Tuesday, August 21, 2001 05:39:34
The question arises periodically -- how to select a line randomly from a text file? If the file were filled with fixed-length records you could compute the offset of a random line and FSEEK/FREAD the apropriate line. Alas, text files are annoyingly variable in line-length. Or you could read all the lines into an array (simulated on disk if needed) and select randomly from the array. Ugh! So here is my method-of-choice, which reads through the entire file from top to bottom, and randomly picks one line out of the file:
#----------------------------------------------------#
# randomly select one random-length line from a file #
#----------------------------------------------------#
!INCLUDE FILE_IO.INC
STR randomfile inbuffer winner
INT handle line_count rval cval
#--------------------#
# set up some values #
#--------------------#
SET randomfile "c:\\sbbs\\exec\\oneliner.txt"
SET line_count 0
SET winner "Default Value -- in case routine fails"
#---------------#
# open the file #
#---------------#
FOPEN handle O_RDONLY randomfile
IF_FALSE
PRINTF "Error opening %s\n" randomfile
RETURN
END_IF
#----------------------------------------#
# loop through entire file, line by line #
#----------------------------------------#
:Loop
# check for End-of-File
FEOF handle
IF_TRUE
# we're done!
GOTO Were_Done
END_IF
# read line -- watch for read failure
FREAD_LINE handle inbuffer
IF_FALSE
# we're done!
GOTO Were_Done
END_IF
ADD line_count 1
# compute random value and comparison value
RANDOM rval 1000000
SET cval 1000000
DIV cval line_count
# is this line a winner, replacing previous winner?
COMPARE rval cval
IF_LESS
# we have a winner
COPY winner inbuffer
END_IF
# check remaining lines
GOTO Loop
#------------------------------------------------------------------------#
# we should have a winner by this point, and can do with it what we want #
#------------------------------------------------------------------------#
:Were_Done
PRINT winner
It's quite clever how it works (no, I didn't think of it first!). It reads each line and determines whether that line would have been selected _assuming_ there are no more lines in the file. If it _does_ find another line, it determines whether it would have been chosen over what ever line had _already_ been chosen. Like this:
It reads the first line. Possibly the ONLY line. So this line must be selected 100% of the time. Hence Line #1 is selected randomly 1/1 of the time. Then it reads the second (and possibly last) line. If there are only two lines in the file, then there is a 1/2 chance that line #2 will be selected. Should a random comparison show that the 1/2 chance is met, line #2 replaces what was previoulsy selected (IOW line #1). Now it reads line #3. This line is 1/3 likely to be selected. If it is, it replaces the previously selected line (either #1 or #2). Line #4 replaces the previously selected line 1/4 of the time. Line #5 replaces the previous winner 1/5 of the time... and so on.
The only thing tricky is that since we don't have FP values, we have to generate integer random numbers and scale the count accordingly. (You can't have values of 1/4 = 0.25 to compare with randiom numbers between 0 and 1, so you have to use a scaling factor -- 1,000,000 in this case -- and compare 1000000/4 with a random number between 0 and 1000000.)
Caution -- this routine will select blank lines as well as any other, so watch out for blank lines PARTICULARLY at the end of the file. Or check for blank lines immediately after reading and "GOTO Loop" immediately, without adjusting the counter, if you find one. Skip Comments the same way.
You could also select "cookies" which consist of more than one line using a method similar to this. You'd have to separate groups of lines somehow (say a blank line), accumulate all non-blank lines as a possible winner, and when you discover a blank, THEN you incriment your counter, make your random check, and if successful replace the previous winning lines with the new set of lines you have accumulated.
I've run this algorythm in a loop 250,000 times with a file containing 25 different lines, and each line is selected 10,000 times +/- 1% which is good enough for government work.