by Dan Duda
- Python 3.x
- PyCharm Community (optional)
- A text file of English words
- Basic Python knowledge
In the last tutorial we created a program that can find all the words that can be made from a string of letters. We also noticed that our program wasn’t very fast or efficient. Let’s fix that.
Before we try and optimize our code we should deternmine if that’s even needed. Our program did run very slow when using more than a few letters but to be sure we should add some timers to our code. We’ll also restructure our code into classes so that it’s easier to test.
Let’s add a new file to our project folder called “word_finder_simple.py”. We’ll pull our algorithm out of our main words.py and into this new file.
And we’ll clean up our main program a bit.
from word_finder_simple import WordFinderSimple
Running it should give the same results as before (just presented a little better). We’ve just refactored the code a bit to make it easier to test.
Enter a string: abc
Now let’s time our algorithm. We’ll make use of Python’s time module so import time and make the following changes to our words.py code.
We import the time module in line 1. We save the time at line 17 before we caall find_words and then get the time at line 19 after the search is run. On line 20 we display the time it took to search for words. We are timing only the time our word search algorithm takes. We do not include the time it takes to load the word list, etc. which happens on line 11.
Enter a string: book
Running with the letters “book” wasn’t too bad at 0.16 seconds but “agerts” took almost 7 seconds. This would not work in a game where we might need to be checking for words every few seconds.
While this isn’t very accurate testing at this point since we’re only testing a single run at a time, it gives us an idea of how performant our algorithm is. We could randomize the search letters and run it multiple times and get the average run time if we wanted but for now we can see we need to improve our algorithm.
If we look at our class, “WordFinderSimple”, notice the code at the beginning of the “search_words” function that tests whether a word is a valid word. Specifically look at the code “and word_part in self.words”.
if len(word_part) >= 3 and word_part not in found_words and word_part in self.words:
Although it’s intuitive to check if the word is valid by checking if it’s “in” the list of words, this is very inefficient. Lists in Python are not “ordered” by default. If we look at the word list we read in, it contains 370,099 words. Every time the code “in self.words” is run it has the scan the entire list looking for “word_part. If it doesn’t find it then it scanned over 370,099 words. Although this list of words is in memory it is still a slow process to scan it repeatedly.
Our word list is ordered because the text file we read in has them in alphabetical order. If they were not in order it would be easy enough to call “words.sort()” to order it just once after reading the list.
So for the time being we’ll keep our words list stored as a list in memory. Is there a more efficient way to search the list if it’s alphabetical? Yes. There’s a popular search algorithm in computer science called “binary sort”. Instead of scanning the entire list from beginning to end it will break the list in half each time. If we start in the middle of the list we can then check whether the word comes before or after that middle word and then on the next search we throw away the half where we know the word couldn’t be and only have to search the remaining half, and so on.
Let’s create a new python file in our folder and name it “word_finder_bsearch.py”. Copy the code from our “word_finder_simple” and make the following changes.
from bisect import bisect_left
The first thing that’s different is that we are importing a module called “bisect_left”. This is a Python module that will allow us to “split” our ordered list in order to execute a binary search. Our constructor method stays the same as before and just reads in the words list. We’ve added a new function called “binary_search”. This comes from the Python documentation at https://docs.python.org/3.6/library/bisect.html#searching-sorted-lists. You can read more about it there but basically it takes a ordered list as the first parameter and the item we’re searching for as the second parameter. If it doesn’t find the item it returns -1. We don’t need to worry about the 2 other parameters as they’re not important in our implementation.
The only other change then comes in the “search_words” function where we test whether the “word_part” is in the list of words. Instead of using the “in” keyword we call the “binary_search” function. Otherwise this function remains the same as in our “simple” version.
Let’s change the code in our main program a bit so that we can see the difference in performance of these two algorithms.
We’ll add an import for our new version and then instantiate it on line 13. We’ll also comment out the call to display the words found on line 29 so we can concentrate on the performance differences. The user is still asked to input a string of characters but now we will call both methods, simple first and then binary search, timing each one and displaying the results.
When I enter the string “agerts” I get the following results. Note, entering more than 6 characters may take too long for the “simple” algorithm to finish quickly. If this happens you can press CTRL-C to stop the program.
Enter a string: agerts
The binary search algorithm blows our simple algorithm out of the water by a factor of over 1000.
Although this is a huge improvement there’s a way we can make it even faster by changing how we store our words data in memory. We’ll look at that in the next tutorial.