Be sure to read this specification in its entirety before starting so you know what to do and how to do it!
Implement a program that spell-checks a file, a la the below, using a hash table.
$ gcc dictionary.c speller.c -o speller
$ ./speller texts/lalaland.txt
MISSPELLED WORDS
[...]
AHHHHHHHHHHHHHHHHHHHHHHHHHHHT
[...]
Shangri
[...]
fianc
[...]
Sebastian's
[...]
WORDS MISSPELLED:
WORDS IN DICTIONARY:
WORDS IN TEXT:
TIME IN load:
TIME IN check:
TIME IN size:
TIME IN unload:
TIME IN TOTAL:
Theoretically, on input of size n, an algorithm with a running time of n is “asymptotically equivalent,” in terms of O, to an algorithm with a running time of 2n. Indeed, when describing the running time of an algorithm, we typically focus on the dominant (i.e., most impactful) term (i.e., n in this case, since n could be much larger than 2). In the real world, though, the fact of the matter is that 2n feels twice as slow as n.
The challenge ahead of you is to implement the fastest spell checker you can! By “fastest,” though, we’re talking actual “wall-clock,” not asymptotic, time.
In speller.c
, we’ve put together a program that’s designed to spell-check a file after loading a dictionary of words from disk into memory. That dictionary, meanwhile, is implemented in a file called dictionary.c
. (It could just be implemented in speller.c
, but as programs get more complex, it’s often convenient to break them into multiple files.) The prototypes for the functions therein, meanwhile, are defined not in dictionary.c
itself but in dictionary.h
instead. That way, both speller.c
and dictionary.c
can #include
the file. Unfortunately, we didn’t quite get around to implementing the loading part. Or the checking part. Both (and a bit more) we leave to you! But first, a tour.
Open up dictionary.h
, and you’ll see some new syntax, including a few lines that mention DICTIONARY_H
. No need to worry about those, but, if curious, those lines just ensure that, even though dictionary.c
and speller.c
(which you’ll see in a moment) #include
this file.
Next notice how we #include
a file called stdbool.h
. That’s the file in which bool
itself is defined. You’ve not needed it before, since the CS50 Library used to #include
that for you.
Also notice our use of #define
, a “preprocessor directive” that defines a “constant” called LENGTH
that has a value of 45
. It’s a constant in the sense that you can’t (accidentally) change it in your own code.
Finally, notice the prototypes for five functions: check
, hash
, load
, size
, and unload
. Notice how three of those take a pointer as an argument, per the *
:
bool check(const char *word);
unsigned int hash(const char *word);
bool load(const char *dictionary);
Recall that char *
is what we used to call string
. So those three prototypes are essentially just:
bool check(const string word);
unsigned int hash(const string word);
bool load(const string dictionary);
And const
, meanwhile, just says that those strings, when passed in as arguments, must remain constant; you won’t be able to change them, accidentally or otherwise!
Now open up dictionary.c
. Notice how, atop the file, we’ve defined a struct
called node
that represents a node in a hash table
. And we’ve declared a global pointer array, table, which will (soon) represent the hash table you will use to keep track of words in the dictionary. The array contains N
node pointers, and we’ve set N
equal to 1
for now, meaning this hash table has just 1 bucket right now. You’ll likely want to increase the number of buckets, as by changing N
, to something larger!
Next, notice that we’ve implemented load
, hash
, check
, size
, and unload
, but only barely, just enough for the code to compile. Your job, ultimately, is to re-implement those functions as cleverly as possible so that this spell checker works as advertised. And fast!
Okay, next open up speller.c
and spend some time looking over the code and comments therein. You won’t need to change anything in this file, and you don’t need to understand its entirety, but do try to get a sense of its functionality nonetheless. Notice how, by way of a function called getrusage
, we’ll be “benchmarking” (i.e., timing the execution of) your implementations of check
, load
, size
, and unload
. Also notice how we go about passing check
, word by word, the contents of some file to be spell-checked. Ultimately, we report each misspelling in that file along with a bunch of statistics.
Notice, incidentally, that we have defined the usage of speller
to be
Usage: speller [dictionary] text
where dictionary
is assumed to be a file containing a list of lowercase words, one per line, and text
is a file to be spell-checked. As the brackets suggest, provision of dictionary
is optional; if this argument is omitted, speller
will use dictionaries/large
by default. In other words, running
$ ./speller text
will be equivalent to running
$ ./speller dictionaries/large text
where text
is the file you wish to spell-check. Suffice it to say, the former is easier to type! (Of course, speller
will not be able to load any dictionaries until you implement load
in dictionary.c
! Until then, you’ll see Could not load
.)
Within the default dictionary, mind you, are 143,091 words, all of which must be loaded into memory! In fact, take a peek at that file to get a sense of its structure and size. Notice that every word in that file appears in lowercase (even, for simplicity, proper nouns and acronyms). From top to bottom, the file is sorted lexicographically, with only one word per line (each of which ends with \n
). No word is longer than 45 characters, and no word appears more than once. During development, you may find it helpful to provide speller
with a dictionary
of your own that contains far fewer words, lest you struggle to debug an otherwise enormous structure in memory. In dictionaries/small
is one such dictionary. To use it, execute
$ ./speller dictionaries/small text
where text
is the file you wish to spell-check. Don’t move on until you’re sure you understand how speller
itself works!
Odds are, you didn’t spend enough time looking over speller.c
. Go back one square and walk yourself through it again!
So that you can test your implementation of speller
, we’ve also provided you with a whole bunch of texts, among them the script from La La Land, the text of the Affordable Care Act, three million bytes from Tolstoy, some excerpts from The Federalist Papers and Shakespeare, the entirety of the King James V Bible and the Koran, and more. So that you know what to expect, open and skim each of those files, all of which are in a directory called texts
within your pset5
directory.
Now, as you should know from having read over speller.c
carefully, the output of speller
, if executed with, say,
$ ./speller texts/lalaland.txt
Below’s some of the output you’ll see. For information’s sake, we’ve excerpted some examples of “misspellings.” And lest we spoil the fun, we’ve omitted our own statistics for now.
MISSPELLED WORDS
[...]
AHHHHHHHHHHHHHHHHHHHHHHHHHHHT
[...]
Shangri
[...]
fianc
[...]
Sebastian's
[...]
WORDS MISSPELLED:
WORDS IN DICTIONARY:
WORDS IN TEXT:
TIME IN load:
TIME IN check:
TIME IN size:
TIME IN unload:
TIME IN TOTAL:
TIME IN load
represents the number of seconds that speller
spends executing your implementation of load
. TIME IN check
represents the number of seconds that speller
spends, in total, executing your implementation of check
. TIME IN size
represents the number of seconds that speller
spends executing your implementation of size
. TIME IN unload
represents the number of seconds that speller
spends executing your implementation of unload
. TIME IN TOTAL
is the sum of those four measurements.
Note that these times may vary somewhat across executions of speller
, depending on what else WebIDE is doing, even if you don’t change your code.
Incidentally, to be clear, by “misspelled” we simply mean that some word is not in the dictionary
provided.
Alright, the challenge now before you is to implement, in order, load
, hash
, size
, check
, and unload
as efficiently as possible using a hash table in such a way that TIME IN load
, TIME IN check
, TIME IN size
, and TIME IN unload
are all minimized. To be sure, it’s not obvious what it even means to be minimized, inasmuch as these benchmarks will certainly vary as you feed speller
different values for dictionary
and for text
. But therein lies the challenge, if not the fun, of this problem. This problem is your chance to design. Although we invite you to minimize space, your ultimate enemy is time. But before you dive in, some specifications from us.
- You may not alter
speller.c
orMakefile
. - You may alter
dictionary.c
(and, in fact, must in order to complete the implementations ofload
,hash
,size
,check
, andunload
), but you may not alter the declarations (i.e., prototypes) ofload
,hash
,size
,check
, orunload
. You may, though, add new functions and (local or global) variables todictionary.c
. - You may change the value of
N
indictionary.c
, so that your hash table can have more buckets. - You may alter
dictionary.h
, but you may not alter the declarations ofload
,hash
,size
,check
, orunload
. - Your implementation of
check
must be case-insensitive. In other words, iffoo
is in dictionary, then check should return true given any capitalization thereof; none offoo
,foO
,fOo
,fOO
,fOO
,Foo
,FoO
,FOo
, andFOO
should be considered misspelled. - Capitalization aside, your implementation of
check
should only returntrue
for words actually indictionary
. Beware hard-coding common words (e.g.,the
), lest we pass your implementation adictionary
without those same words. Moreover, the only possessives allowed are those actually indictionary
. In other words, even iffoo
is indictionary
,check
should returnfalse
givenfoo's
iffoo's
is not also indictionary
. - You may assume that any
dictionary
passed to your program will be structured exactly like ours, alphabetically sorted from top to bottom with one word per line, each of which ends with\n
. You may also assume thatdictionary
will contain at least one word, that no word will be longer thanLENGTH
(a constant defined indictionary.h
) characters, that no word will appear more than once, that each word will contain only lowercase alphabetical characters and possibly apostrophes, and that no word will start with an apostrophe. - You may assume that
check
will only be passed words that contain (uppercase or lowercase) alphabetical characters and possibly apostrophes. - Your spell checker may only take
text
and, optionally,dictionary
as input. Although you might be inclined (particularly if among those more comfortable) to “pre-process” our default dictionary in order to derive an “ideal hash function” for it, you may not save the output of any such pre-processing to disk in order to load it back into memory on subsequent runs of your spell checker in order to gain an advantage. - Your spell checker must not leak any memory. Be sure to check for leaks with
valgrind
. - You may search for (good) hash functions online, so long as you cite the origin of any hash function you integrate into your own code.
- Execute
wget https://cdn.cs50.net/2019/fall/psets/5/speller/speller.zip
to download a (compressed) ZIP file with this problem’s distribution. - Execute
unzip speller.zip
to uncompress that file. - Execute
cd speller
to change into that directory. - Execute
ls
. You should see this problem’s distribution:
dictionaries/ dictionary.c dictionary.h keys/ Makefile speller.c texts/
- Implement
load
. - Implement
hash
. - Implement
size
. - Implement
check
. - Implement
unload
.
You should add you code in dictionary.c.
reference
'''
dictionary.c
'''
// Implements a dictionary's functionality
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>
#include "dictionary.h"
#define HASHTABLE_SIZE 10000
// Defines struct for a node
typedef struct node
{
char word[LENGTH + 1];
struct node *next;
}
node;
node *hashtable[HASHTABLE_SIZE];
// Hashes the word (hash function posted on reddit by delipity)
// The word you want to hash is contained within new node, arrow, word.
// Hashing that will give you the index. Then you insert word into linked list.
int hash_index(char *hash_this)
{
unsigned int hash = 0;
for (int i = 0, n = strlen(hash_this); i < n; i++)
{
hash = (hash << 2) ^ hash_this[i];
}
return hash % HASHTABLE_SIZE;
}
// Initializes counter for words in dictionary
int word_count = 0;
// Loads dictionary into memory, returning true if successful else false
bool load(const char *dictionary)
{
// Opens dictionary
FILE *file = fopen(dictionary, "r");
if (file == NULL)
{
return false;
}
// Scans dictionary word by word (populating hash table with nodes containing words found in dictionary)
char word[LENGTH + 1];
while (fscanf(file, "%s", word) != EOF)
{
// Mallocs a node for each new word (i.e., creates node pointers)
node *new_node = malloc(sizeof(node));
// Checks if malloc succeeded, returns false if not
if (new_node == NULL)
{
unload();
return false;
}
// Copies word into node if malloc succeeds
strcpy(new_node->word, word);
// Initializes & calculates index of word for insertion into hashtable
int h = hash_index(new_node->word);
// Initializes head to point to hashtable index/bucket
node *head = hashtable[h];
// Inserts new nodes at beginning of lists
if (head == NULL)
{
hashtable[h] = new_node;
word_count++;
}
else
{
new_node->next = hashtable[h];
hashtable[h] = new_node;
word_count++;
}
}
fclose(file);
return true;
}
// Returns true if word is in dictionary else false
bool check(const char *word)
{
// Creates copy of word on which hash function can be performed
int n = strlen(word);
char word_copy[LENGTH + 1];
for (int i = 0; i < n; i++)
{
word_copy[i] = tolower(word[i]);
}
// Adds null terminator to end string
word_copy[n] = '\0';
// Initializes index for hashed word
int h = hash_index(word_copy);
// Sets cursor to point to same address as hashtable index/bucket
node *cursor = hashtable[h];
// Sets cursor to point to same location as head
// If the word exists, you should be able to find in dictionary data structure.
// Check for word by asking, which bucket would word be in? hashtable[hash(word)]
// While cursor does not point to NULL, search dictionary for word.
while (cursor != NULL)
{
// If strcasecmp returns true, then word has been found
if (strcasecmp(cursor->word, word_copy) == 0)
{
return true;
}
// Else word has not yet been found, advance cursor
else
{
cursor = cursor->next;
}
}
// Cursor has reached end of list and word has not been found in dictionary so it must be misspelled
return false;
}
// Returns number of words in dictionary if loaded else 0 if not yet loaded
unsigned int size(void)
{
return word_count;
}
// Unloads dictionary from memory, returning true if successful else false
bool unload(void)
{
node *head = NULL;
node *cursor = head;
// freeing linked lists
while (cursor != NULL)
{
node *temp = cursor;
cursor = cursor->next;
free(temp);
}
return true;
}