A "Feild" Guide to Programming with C++Introductory programming notes by Henry Feild
Chapter 10: Searching
- Big-oh notation, e.g., O(n), is a way to describe the complexity of an algorithm
- linear search:
- works on either sorted or unsorted lists of data
- finds a target by looking at every item in a list in the worst case
- worst case complexity: O(n)
- binary search:
- only works on sorted lists of data
- finds a target by continually dividing the list in half, picking the half where the target will be if it exists, until the target is found or the list cannot be further halved
- worst case complexity: O(log2 n)
- we can search over lists of any kind—numbers, string, instances of your own struct or class
- There are lots of other searches out there, most involve more specialized data structures than simple lists
Up until now, we have focused almost exclusively on syntax—basic programming constructs and how to use them in C++. You've learned a lot over the past many chapters and you are now prepared to enter into the core of computer science: algorithm design. We want to take the programming tools we've learned and apply them to solving a higher order problem. In the case of this chapter, the focus will be on searching through a list of items to find a given target item. In the next chapter, we'll take a look at how to sort lists so they are in some specified order (e.g., alphabetical, ascending, descending, etc.).(Back to top)
Searching a list
In this chapter, we'll talk about searching a list of items—numbers, strings, objects, etc.—for a particular item, which we will refer to as the target.
As a running example, we will use a list of numbers:
In this list, the numbers are out of order. When in oder, the list appears as follows:
We will consider searching through both ordered and unordered lists in this chapter. Note that the algorithms discussed below will work on lists of any kind of data. We are only using numbers here as an example.(Back to top)
The simplest way to search a list is to iterate over it from the first position to the last position. We will stop if we find the item we are looking for, or if every item in the list has been examined. We call this linear search. In the world of algorithm design, any time you have an algorithm that solves a problem but does so in a dumb or expensive way, we refer to it as a brute force approach. In the case of linear search, we may have to look at every single item in the list and as such it is expensive in the number of items it has to compare to the target. So, linear search is considered to be a brute force approach to finding a target within a list.
Here's the full sequence of comparisons if we're searching the unsorted array in \ref for the number 20 (the element in orange is what is being examined during a step, and the grayed out elements are ones that have been examined):
So that took four comparisons. What if we searched for the number 4? Given the unsorted list, we would have to search every single element, which would take five comparisons:
So, in the worst case, we have to look at all the items in the list. If we have a list with n items, that means we have to compare the target to n values. Formally, we say the worst case complexity of linear search over a list with n items is O(n). We say this last bit as: big-oh n. Big-oh notation is a way for us to talk about the complexity of an algorithm relative to the size of the input.
A slightly smarter version
One of the issues with linear search over an unsorted list is that we absolutely have to compare every single element when our target doesn't exist in the list. For example, when we search for 4, we have to make five comparisons before we can say that 4 is not in the list.
What if we consider the sorted array in \ref? If we assume that the list is sorted, and we search for 4, do we have to compare against every single element? We don't, in fact. As soon as we see 5, we know that 4 cannot be in the list, because if it was, we would have seen it before we reached 5:
This took only two comparison, which is much better than the five comparisons it took with the unsorted version. Now, even though this does allow us to short-cut things if the target is smaller than any of the things in the list, it does not help when the target is bigger. In those cases, we still need to make n comparisons. Therefore, we say that in the worst case, linear search is still O(n), even over a sorted list (there are other analyses we could consider, such as best case and average case, however as computer scientists, we usually care most about worst case analyses).(Back to top)
Assuming that the given list is sorted, we can do much better than a worst case of O(n). The binary search algorithm has complexity O(log2 n). At a high level, it works by dividing the list in half, then half again, then half again, and so on, always picking the half that will contain the target if the target is in fact in the list, until it reaches the point where it's either found the target, or the list cannot be halved again.
The specifics of the algorithm are as follows. It starts of by picking the mid point of the list—half way between the start and end. For the sorted array in \ref, that mid point is index 2: the minimum index is 0, the maximum is 4, so the mid point is floor((4-0)/2). If the item at the mid point is our target, we're done; if it is greater than our target, then we find the midpoint of the unexplored items to the left of our current mid point; if it is less than our target, we move to the right. If there are no more unexplored items in the direction we need to move, then we know the item is not in the list. Here's what it looks like when we search for the number 20:
To make it clearer how quickly we are decreasing our search space with each comparison, consider the following list of 100 items (don't worry about their values). Suppose it includes every even number from 2 to 200. I'm looking for 103, which is not in the list. Here are all the elements that are explored before we find out that 103 is not in the list:
There are a total of seven comparisons: log2 100 ≈ 6.6. Round up and we get 7. Why is the complexity O(log2 n)?. Because we cut the search space in half (hence the base 2) at each comparison. Since we're operating on a list of length n, we want to know how many times we can cut a list of length n in half. Said differently, how many times can we divide n by 2 until we can't divide by 2 anymore? The log function provides us with that number.(Back to top)
Other ways of searching
Here we've only talked about searching a simple list. There are other ways to search lists, but they generally require the data to be stored in a specialized data structure. For example, one structure we won't learn about this semester is a Binary Search Tree (BST). BSTs store sorted data and are structured in a way that closely resembles how we search a list using binary search.
Skip lists are a special structure that allow quick searching of data in a variety of formats, including data stored on disk. We won't talk about skip lists, but if you are curious, take a look at the Wikipedia page.
Searching over graphs is also an important problem. For example, suppose you want to know if two Facebook users are connected via mutual friends. Two popular algorithms for this are Depth First Search (DFS) and Breadth First Search (BFS).
A structure that has worst case performance on par with linear search, but amazing average performance—O(1)—is the hash table. They are quite awesome and are one of the most commonly used data structures.
Related to hash tables are inverted indexes. These are used in web and document search. Inverted indexes store words as keys, each of which points to a list of web pages or documents that the word occurs in. This structure is used by Google, Bing, Yahoo!, and virtually every other web search service. We will not cover inverted indexes this semester, but if you are curious, see the Wikipedia page.
You've probably used services on your computer and phone that auto-suggest while you type. For instance, Google Search, Spotlight (on Macs), and smart phone keyboards will all provide text completion. This is usually done through a special structure called a trie (typically pronounced as "try", technically it should be pronounced as "tree" as it's short for retrieval). Tries are a special type of tree and are not terribly dissimilar to Binary Search Trees.(Back to top)