Algorithms to Search Sorted Data

Do this on pencilcode.net

How would you find a name in a phone book? The method is an algorithm if you can describe it precisely enough for a computer or a person to follow the same steps. In this activity, we explore the problem of finding a word in an alphabetized book. We explore two questions:

How can various search methods be written as algorithms?
Why might we choose one algorithm instead of another?

Introduction to Sorted Search

Run the following short program in Pencil Code. It loads a script from a webpage, and then it runs the "find" function from that script.

await loadscript 'http://csp.pencilcode.net/lib/find.coffee', defer() 
do find

Each box has a word, and the words are in alphabetical order. Click on a box to get at a word.

Exercise: An Algorithm for Searching by Hand

Design an algorithm to find the goal word manually:
- Click on boxes to get under them until you find the goal.
- Click on the box "-1 if not here" if you have found that the word is not here.
- Think about what you did, and describe your algorithm in English in a way that would let somebody else follow exactly the same procedure.
Working with a partner or alone, analyze your manual algorithm:
- Use your algorithm at least three times with different data.
- Each time, count how many gets you need to find the word (or to determine that it is not there).
- The number "n" is the number of boxes. Make a table relating the number of gets you needed to the number "n".
Write a hypothesis relating n with the number of gets used by your algorithm.

Programming a Random Search

Here is a simple algorithm somebody might use if they did not know the order of the alphabet:

Loop forever:
1. Look on a random page.
2. We have an answer if it matches.

Here is that algorithm, written in CoffeeScript.

await loadscript 'http://csp.pencilcode.net/lib/find.coffee', defer()

# Careful: put a space after the word "find" below.
find (goal, get, n) ->     # define "goal" and "get" and "n".
  loop                     # loop forever:
    i = random n           # pick i randomly from 0 to n-1.
    w = get i              # let w be the word on the ith page.
    if w is goal           # check if w matches the goal:
      return i             # if so, return the location.

In CoffeeScript, indenting is used to define blocks of code under the control of a command, so it is important to match the indenting when writing the program above.

In this program, the space after find is also important. (Why? Because the input to find is an algorithm to use. It must see your whole algorithm, from the parentheses past the arrow to the end of your indented code. If you forget the space, CoffeeScript will think that the inputs to find are just the contents of the parentheses.)

Exercise: Analyze Random Search

Enter the program above, and get it to run. Save it as "randomsearch".
- How many steps did it take to run?
- Does it always get the right answer?
Explain why you would not want to use this algorithm in real life.
- When searching n pages, what is the largest number of gets you should ever need?
- Discuss the reliability of this algorithm.
Graph the performance by changing find to findperf, as in the code below.
- How often does the program fail to give an answer?
- Understand the bar graph. How does the number of gets compare to n?

await loadscript 'http://csp.pencilcode.net/lib/find.coffee', defer()

findperf (goal, get, n) ->   # run a performance test.
  loop                       # everything below is unchanged.
    i = random n
    w = get i
    if w is goal
      return i

find and findperf are functions defined by the script at http://csp.pencilcode.net/lib/find.coffee. Although the script is long, it is ordinary CoffeeScript: take a look if you are curious.

Running a Linear Search

The algorithm below is simple yet systematic. Is it better than random search?

Look at every page from the 0th to the last:
1. If the word on the page matches, we're done.
If not found after looking at the last page, say "it's not here."

In CoffeeScript (go back to using find):

await loadscript 'http://csp.pencilcode.net/lib/find.coffee', defer()

find (goal, get, n) ->     # animate this algorithm:
  for i in [0...n]         # try each i from 0 to (n-1):
    w = get i              # let w be the word under ith button.
    if w is goal           # check if w matches the goal:
      return i             # if so, return the location.
  return -1                # if we get past the loop, return -1.

The loop for i in [0...n] repeats the indented block once for each value of i, starting at 0, and up to, but not including n. In CoffeeScript, a numeric range using three dots will include the first number but omit the last number. Using two dots will include both ends.

Exercise: Analyze Linear Search

Enter the program above, and get it to run. Save it as "linearsearch".
- How many steps did it take to run?
Analyze the performance of the algorithm.
- When searching in n pages, what is the number of gets that you would expect (on average) to do with Linear Search?
- Discuss the reliability of this algorithm.
Use findperf to measure the performance of the algorithm.
- Do the tests differ from your prediction, and why?

Divided Search

When looking in a phone book, you might divide your search in half by geting first at the middle of the book. Then, taking advantage of alphabetical order, you could look through only half of the book.

get at page #m, halfway through the book.
If the goal is alphabetized before the middle, use the search range [0...m).
Otherwise (when the goal is not before the middle) use the search range [m...n).
For each page in the search range:
1. If the word on the page matches, we're done.
If not found after looking at the last page in the range, say "it's not here."

In the program below, the variable m is the half-way-page number, and the variables a and z represent the start and end of the search range. This algorithm chooses a and z after comparing the goal to the word at m.

await loadscript 'http://csp.pencilcode.net/lib/find.coffee', defer()

find (goal, get, n) ->
  m = floor(n/2)           # m is the halfway location, rounded down.
  if goal < get(m)         # compare goal to the word on #m alphabetically:
    a = 0                  # if before, then search from 0
    z = m                  # .... up to (not including) m.
  else                     # otherwise:
    a = m                  # if after, then search from m
    z = n                  # ... up to (not including) n.
  for i in [a...z]         # do linear search on [a...z].
    w = get i
    if w is goal
      return i
  return -1

floor x is x rounded down.

ceil x is x rounded up. ("ceiling")

round x is x rounded to the nearest integer.

floor is a math function for rounding down. floor(n/2) is used to pick a halfway page when there might be an odd number of pages.
Exercise: Analyze Divided Search

Enter the program above, and get it to run. Save it as "dividedsearch".

How many steps did it take to run?

Analyze the performance of the algorithm.

When searching in n pages, what is the number of gets that you would expect (on average) to do with Divided Search?

Measure the performance using findperf.

Explain any difference between the test results and your prediction.

Binary Search
After dividing the problem in half, we could divide it in half again; then again; and again. Eventually we will be left with just one page to look at. This is the idea behind Binary Search:

Start with a search range [0...n)
Loop as long as the search range has more than one page:

get at page #m, halfway through the range.
If the goal is alphabetized before the middle, adjust the top the search range down to m.
Otherwise, adjust the bottom the search range down to m.

Look at the one page in the range. Our word is either here, or it is not in the book.
Divide and Conquer Algorithms
Binary search is one of the most important algorithms in computer science. It is the basic example of a divide-and-conquer strategy.
In divide-and-conquer algorithms, some way is found to divide the problem in to smaller problems that are similar. Then each smaller problem can be further divided until the problem is easy to solve.
The reason divide-and-conquer algorithms are so important is that they can be blazingly fast. Even when searching a very large book, Binary Search can get an answer with just a few steps.
Coding Binary Search
In the code below, we again use the variables a and z to represent the current search range, and m to represent the page halfway between a and z.
The difference between this algorithm and Divided Search is that we continue to adjust the search range repeatedly instead of doing it just once.
await loadscript 'http://csp.pencilcode.net/lib/find.coffee', defer() find (goal, get, n) -> a = 0 # the start of the search range starts as 0. z = n # the end of the search range starts as n. while z - a > 1 # repeat while the distance from a to z is more than 1: m = floor((a+z)/2) # m is the halfway location, rounded down. if goal < get(m) # compare goal to the word on #m alphabetically: z = m # if before, adjust the end of the range. else # a = m # otherwise, adjust the start. # when the loop is over, "a" is the only page in the range. if get(a) is goal # if the goal word is on this page: return a # this page is our answer. else # otherwise return -1 # the word is not in the book.
The while keyword in CoffeeScript repeats a block of code as long as a condition is true. Here, while z - a > 1 repeats the indented code as long as the gap between z and a is more than one page.
The first line of code that is not indented under the while is reached when z - a is 1 (or less). In other words, the loop is finished when there is only one page to look at.
Exercise: Analyze Binary Search

Enter the program above, and get it to run. Save it as "binarysearch".

How many steps did it take to run?
Try running with 1000 items by changing the first line to find 1000, (goal, get, n) ->. How many steps does it take to run?

Analyze the performance of the algorithm.

When searching in n pages, what is the number of gets that you would expect (on average) to do with Binary Search?

Measure the performance using findperf.

Explain any difference between the test results and your prediction.

Can you do better? Is there an algorithm that will use fewer gets than binary search?