Searching a list

In this topic, we'll talk about searching a list of items—numbers, strings, objects, etc.—for a target. So far, we have only seen how to use arrays to store lists of items, so that's what we'll stick to here.

As a running example, we will consider the following list of numbers:

Array: [5, 10, 3, 20, 8]

The sorted version of the list will appear as:

Sorted array: [3, 5, 8, 10, 20]

Linear search

The simplest way to search a list is to walk it from the first position to the last. We stop if we find the item we are looking for, or once every item in the list has been examined. We call this linear search, and it is what we would call a brute force algorithm, because it doesn't do anything smart. Here's the full sequence of comparisons if we search the unsorted array above for the number 20:

Comparison 1: 5 == 20? No, keep going.
Comparison 2: 10 == 20? No, keep going.
Comparison 3: 3 == 20? No, keep going.
Comparison 4: 20 == 20? Yes, we can stop!

So that took four comparisons. What if we searched for the number 4? Given the unsorted list, we would have to search every single element, which would take five comparisons:

Comparison 1: 5 == 4? No, keep going.
...
Comparison 5: 8 == 4? No. It's not in the list.

So, the worst case complexity of linear search over a list with n items is O(n).
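
To make the procedure concrete, here is a minimal sketch of linear search. Java is assumed purely for illustration (these notes don't prescribe a language), and the class and method names are made up for the example. The method returns the index of the target if it is found and -1 otherwise.

public class LinearSearch {
    // Walk the array from the first position to the last.
    // Return the index of target if found, or -1 if every item is examined without a match.
    static int linearSearch(int[] items, int target) {
        for (int i = 0; i < items.length; i++) {
            if (items[i] == target) {
                return i;   // found it, so we can stop
            }
        }
        return -1;          // examined everything; target is not in the list
    }

    public static void main(String[] args) {
        int[] unsorted = {5, 10, 3, 20, 8};
        System.out.println(linearSearch(unsorted, 20)); // prints 3 (found after four comparisons)
        System.out.println(linearSearch(unsorted, 4));  // prints -1 (after all five comparisons)
    }
}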

A slightly smarter version

One of the issues with linear search over an unsorted list is that we absolutely have to compare against every single element when our target isn't in the list. For example, when we search for 4, we have to make five comparisons before we can say that 4 is not there.

What if we consider the sorted array shown above? If we know the list is sorted and we search for 4, do we have to compare against every single element? We don't, in fact. As soon as we see 5, we know that 4 cannot be in the list, because if it were, we would have seen it before 5:

Comparison 1: 3 == 4? No. 3 > 4? No, keep going.
Comparison 2: 5 == 4? No. 5 > 4? Yes. It's not in the list.

Only two comparisons, which is much better than the five it took with the unsorted version. Still, even though this lets us stop early when a missing target is smaller than something in the list, it does not help when the target is larger than everything in the list; in that case we still need to make n comparisons. Since Big-O describes the worst case, linear search is still O(n), even over a sorted list.
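
Here is a sketch of this slightly smarter variant, under the same illustrative Java setup as before. The only change from plain linear search is the extra early-exit check once the current element is already larger than the target, which only pays off because the array is assumed to be sorted.

public class SortedLinearSearch {
    // Linear search over an array assumed to be sorted in ascending order.
    // Returns the index of target, or -1 as soon as we pass the spot where it would have been.
    static int sortedLinearSearch(int[] items, int target) {
        for (int i = 0; i < items.length; i++) {
            if (items[i] == target) {
                return i;    // found it
            }
            if (items[i] > target) {
                return -1;   // everything from here on is even bigger, so stop early
            }
        }
        return -1;           // target is larger than every element in the list
    }

    public static void main(String[] args) {
        int[] sorted = {3, 5, 8, 10, 20};
        System.out.println(sortedLinearSearch(sorted, 4));  // prints -1 after examining only two elements
        System.out.println(sortedLinearSearch(sorted, 25)); // prints -1, but only after all five elements
    }
}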

Binary search

Assuming that the given list is sorted, we can do much better than O(n). The binary search algorithm has complexity O(log2 n). It starts off by picking the midpoint of the list, halfway between the start and the end. For the sorted array above, that midpoint is index 2: the minimum index is 0, the maximum is 4, so the midpoint is floor((0 + 4) / 2) = 2. If the item at the midpoint is our target, we're done; if it is greater than our target, we take the midpoint of the unexplored items to the left of the current midpoint; if it is less than our target, we move to the right. If there are no more unexplored items in the direction we need to move, then we know the item is not in the list. Here's what it looks like when we search for the number 20:

Comparison 1: min=0, max=4, mid=2; 8 == 20? No. 8 > 20? No, look to the right.
Comparison 2: min=3, max=4, mid=3; 10 == 20? No. 10 > 20? No, look to the right.
Comparison 3: min=4, max=4, mid=4; 20 == 20? Yes, we found it!
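
Here is a minimal iterative sketch of binary search, again with Java as an assumed language. The min, max, and mid variables mirror the walkthrough above.

public class BinarySearch {
    // Binary search over an array sorted in ascending order.
    // Returns the index of target, or -1 once the unexplored region becomes empty.
    static int binarySearch(int[] items, int target) {
        int min = 0;
        int max = items.length - 1;
        while (min <= max) {               // there are still unexplored items
            int mid = (min + max) / 2;     // midpoint of the unexplored region
            if (items[mid] == target) {
                return mid;                // found it
            } else if (items[mid] > target) {
                max = mid - 1;             // look to the left of mid
            } else {
                min = mid + 1;             // look to the right of mid
            }
        }
        return -1;                         // nothing left to explore; not in the list
    }

    public static void main(String[] args) {
        int[] sorted = {3, 5, 8, 10, 20};
        System.out.println(binarySearch(sorted, 20)); // prints 4, found after three comparisons
        System.out.println(binarySearch(sorted, 4));  // prints -1
    }
}

(For very large arrays, mid is usually computed as min + (max - min) / 2 to avoid integer overflow; the simpler form is fine at this scale.)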

To make it clearer how quickly we are cutting down the search space with each comparison, consider a sorted list of 100 items: every even number from 2 to 200. I'm looking for 103, which is not in the list. Only a handful of elements, scattered across the list, get examined before we find out that 103 is not there:

[visualization of the 100-element list, with the examined elements highlighted]

There are a total of seven comparisons: log2 100 ≈ 6.6, and rounding up gives 7. Why is the complexity O(log2 n)? Because we cut the search space in half at each comparison (hence the base 2). Since we're operating on a list of length n, we want to know how many times we can cut a list of length n in half. Said differently, how many times can we divide n by 2? The base-2 logarithm gives us exactly that number.
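
As a quick sanity check on that count, here is a small sketch (same assumed Java setting as the earlier examples) that instruments the binary search loop with a comparison counter and runs it on the 100 even numbers from 2 to 200, searching for 103.

public class CountComparisons {
    public static void main(String[] args) {
        // Build the sorted list of every even number from 2 to 200 (100 items).
        int[] evens = new int[100];
        for (int i = 0; i < evens.length; i++) {
            evens[i] = 2 * (i + 1);
        }

        int target = 103;
        int comparisons = 0;
        int min = 0;
        int max = evens.length - 1;
        while (min <= max) {
            int mid = (min + max) / 2;
            comparisons++;                 // one comparison against the midpoint element
            if (evens[mid] == target) {
                break;                     // found it (won't happen for 103)
            } else if (evens[mid] > target) {
                max = mid - 1;
            } else {
                min = mid + 1;
            }
        }
        System.out.println(comparisons);   // prints 7, matching log2(100) rounded up
    }
}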

Other ways of searching

Here we've only talked about searching a simple list. There are other ways to search lists, but they generally require the data to be stored in a specialized data structure. For example, one structure we'll learn about this semester is a Binary Search Tree (BST). BSTs store sorted data and are structured in a way that closely resembles how we search a list using binary search.

Skip lists are a special structure that allows quick searching of data in a variety of formats, including data stored on disk. We won't talk about skip lists, but if you are curious, take a look at the Wikipedia page.

Searching over graphs is also an important problem; for example, you might want to know whether two Facebook users are connected via mutual friends. Two popular algorithms we'll learn about are Depth First Search (DFS) and Breadth First Search (BFS).

A structure that has worst case performance on par with linear search, but amazing average performance—O(1)—is the hash table. We'll learn about these later in the semester. They are quite awesome and are one of the most commonly used data structures in modern data applications.

Related to hash tables are inverted indexes. These are used in web and document search. Inverted indexes store words as keys, each of which points to a list of web pages or documents that the word occurs in. This structure is used by Google, Bing, Yahoo!, and virtually every other web search service. We will not cover inverted indexes this semester, but if you are curious, see the Wikipedia page.

You've probably used services on your computer and phone that auto-suggest while you type. For instance, Google Search, Spotlight (on Macs), and smartphone keyboards all provide text completion. This is usually done with a special structure called a trie (typically pronounced "try", though technically it should be pronounced "tree", since it's short for retrieval). Tries are a special type of tree and are not terribly dissimilar to Binary Search Trees. We'll learn more about these later in the semester.
