Algorithms to Search Sorted Data
Do this on pencilcode.net
How would you find a name in a phone book?
The method is an algorithm if you can
describe it precisely enough for a computer or
a person to follow the same steps.
In this activity, we explore
the problem of finding a word
in an alphabetized
book. We explore two questions:
- How can various search methods be written as algorithms?
- Why might we choose one algorithm instead of another?
Introduction to Sorted Search
Run the following short program in Pencil Code.
It loads a script from a webpage, and then it
runs the "find" function from that script.
await loadscript 'http://csp.pencilcode.net/lib/find.coffee', defer()
do find
Each box has a word, and the words are
in alphabetical order.
Click on a box to get at a word.
Exercise: An Algorithm for Searching by Hand
- Design an algorithm to find the goal word manually:
- Click on boxes to get under them
until you find the goal.
- Click on the box "-1 if not here"
if you have found that the word is not here.
- Think about what you did, and describe your
algorithm in English in a way that would let
somebody else follow exactly the same procedure.
- Working with a partner or alone, analyze your
manual algorithm:
- Use your algorithm at least three times with
different data.
- Each time, count how many gets you need to
find the word (or to determine that it is not there).
- The number "n" is the number of boxes.
Make a table relating the number of gets you
needed to the number "n".
- Write a hypothesis relating n with
the number of gets used by your algorithm.
Programming a Random Search
Here is a simple algorithm somebody might use
if they did not know the order of the alphabet:
- Loop forever:
- Look on a random page.
- We have an answer if it matches.
Here is that algorithm, written in CoffeeScript.
await loadscript 'http://csp.pencilcode.net/lib/find.coffee', defer()
# Careful: put a space after the word "find" below.
find (goal, get, n) -> # define "goal" and "get" and "n".
loop # loop forever:
i = random n # pick i randomly from 0 to n-1.
w = get i # let w be the word on the ith page.
if w is goal # check if w matches the goal:
return i # if so, return the location.
In CoffeeScript, indenting is used to define blocks
of code under the control of a command, so it is important
to match the indenting when writing the program above.
In this program, the space after
find
is
also important. (Why? Because the input to
find
is an algorithm to use. It
must see your whole algorithm,
from the parentheses past the arrow to the end of
your indented code. If you forget the space,
CoffeeScript will think that the inputs to
find
are just the contents of the
parentheses.)
Exercise: Analyze Random Search
- Enter the program above, and get it to run.
Save it as "randomsearch".
- How many steps did it take to run?
- Does it always get the right answer?
- Explain why you would not want to use this
algorithm in real life.
- When searching
n
pages, what
is the largest number of gets you should ever need?
- Discuss the reliability of this algorithm.
- Graph the performance by changing
find
to findperf
, as in the code below.
- How often does the program fail to give an answer?
- Understand the bar graph.
How does the number of gets compare to
n
?
await loadscript 'http://csp.pencilcode.net/lib/find.coffee', defer()
findperf (goal, get, n) -> # run a performance test.
loop # everything below is unchanged.
i = random n
w = get i
if w is goal
return i
find
and findperf
are functions defined by the script at
http://csp.pencilcode.net/lib/find.coffee.
Although the script is long, it is ordinary
CoffeeScript: take a look if you are curious.
Running a Linear Search
The algorithm below is simple yet systematic.
Is it better than random search?
- Look at every page from the 0th to the last:
- If the word on the page matches, we're done.
- If not found after looking at the last page,
say "it's not here."
In CoffeeScript (go back to using find
):
await loadscript 'http://csp.pencilcode.net/lib/find.coffee', defer()
find (goal, get, n) -> # animate this algorithm:
for i in [0...n] # try each i from 0 to (n-1):
w = get i # let w be the word under ith button.
if w is goal # check if w matches the goal:
return i # if so, return the location.
return -1 # if we get past the loop, return -1.
The loop for i in [0...n]
repeats
the indented block once for each value of i
,
starting at 0, and up to, but not including
n
. In CoffeeScript, a numeric range using
three dots will include the first number but
omit the last number. Using two dots will include both
ends.
Exercise: Analyze Linear Search
- Enter the program above, and get it to run.
Save it as "linearsearch".
- How many steps did it take to run?
- Analyze the performance of the algorithm.
- When searching in
n
pages,
what is the number of gets that you would expect
(on average) to do with Linear Search?
- Discuss the reliability of this algorithm.
- Use
findperf
to measure the
performance of the algorithm.
- Do the tests differ from your prediction, and why?
Divided Search
When looking in a phone book, you might
divide your search in half by geting first
at the middle of the book. Then,
taking advantage of alphabetical order, you
could look through only half of the book.
- get at page #m, halfway through the book.
- If the goal is alphabetized before the middle, use the search range [0...m).
- Otherwise (when the goal is not before the middle)
use the search range [m...n).
- For each page in the search range:
- If the word on the page matches, we're done.
- If not found after looking at the last page
in the range, say "it's not here."
In the program below, the variable m
is
the half-way-page number, and the variables
a
and z
represent the
start and end of the search range. This algorithm
chooses a
and z
after comparing
the goal to the word at m
.
await loadscript 'http://csp.pencilcode.net/lib/find.coffee', defer()
find (goal, get, n) ->
m = floor(n/2) # m is the halfway location, rounded down.
if goal < get(m) # compare goal to the word on #m alphabetically:
a = 0 # if before, then search from 0
z = m # .... up to (not including) m.
else # otherwise:
a = m # if after, then search from m
z = n # ... up to (not including) n.
for i in [a...z] # do linear search on [a...z].
w = get i
if w is goal
return i
return -1
floor x |
is x rounded down. |
ceil x |
is x rounded up. ("ceiling") |
round x |
is x rounded to the nearest integer. |
floor
is a math function for rounding down.
floor(n/2)
is used to
pick a halfway page when there
might be an odd number of pages.
Exercise: Analyze Divided Search
- Enter the program above, and get it to run.
Save it as "dividedsearch".
- How many steps did it take to run?
- Analyze the performance of the algorithm.
- When searching in
n
pages,
what is the number of gets that you would expect
(on average) to do with Divided Search?
- Measure the performance using
findperf
.
- Explain any difference between the test results
and your prediction.
Binary Search
After dividing the problem in half, we could
divide it in half again; then again; and again.
Eventually we will be left with just one page to
look at. This is the idea behind
Binary Search:
- Start with a search range [0...n)
- Loop as long as the search range has more than one page:
- get at page #m, halfway through the range.
- If the goal is alphabetized before the middle, adjust
the top the search range down to m.
- Otherwise, adjust
the bottom the search range down to m.
- Look at the one page in the range.
Our word is either here, or it is not in the book.
Divide and Conquer Algorithms
Binary search is one of the most important
algorithms in computer science. It is the basic
example of a divide-and-conquer strategy.
In divide-and-conquer algorithms, some way
is found to divide the problem in to smaller
problems that are similar. Then each smaller
problem can be further divided until the
problem is easy to solve.
The reason divide-and-conquer algorithms
are so important is that they can be
blazingly fast. Even when searching a very
large book, Binary Search can get an answer
with just a few steps.
Coding Binary Search
In the code below, we again use the variables
a
and z
to represent
the current search range, and m
to represent the page halfway between a
and z
.
The difference between this
algorithm and Divided Search is that we continue
to adjust the search range repeatedly instead of
doing it just once.
await loadscript 'http://csp.pencilcode.net/lib/find.coffee', defer()
find (goal, get, n) ->
a = 0 # the start of the search range starts as 0.
z = n # the end of the search range starts as n.
while z - a > 1 # repeat while the distance from a to z is more than 1:
m = floor((a+z)/2) # m is the halfway location, rounded down.
if goal < get(m) # compare goal to the word on #m alphabetically:
z = m # if before, adjust the end of the range.
else #
a = m # otherwise, adjust the start.
# when the loop is over, "a" is the only page in the range.
if get(a) is goal # if the goal word is on this page:
return a # this page is our answer.
else # otherwise
return -1 # the word is not in the book.
The while
keyword in CoffeeScript
repeats a block of code as long as a condition is
true. Here, while z - a > 1
repeats the indented code as long as the gap
between z
and a
is
more than one page.
The first line of code that is not
indented under the while
is
reached when z - a
is 1 (or less).
In other words, the loop is finished when
there is only one page to look at.
Exercise: Analyze Binary Search
- Enter the program above, and get it to run.
Save it as "binarysearch".
- How many steps did it take to run?
- Try running with 1000 items by changing
the first line to
find 1000, (goal, get, n) ->
. How many steps does it take to run?
- Analyze the performance of the algorithm.
- When searching in
n
pages,
what is the number of gets that you would expect
(on average) to do with Binary Search?
- Measure the performance using
findperf
.
- Explain any difference between the test results
and your prediction.
- Can you do better? Is there an algorithm that
will use fewer gets than binary search?