So, what is Hash?
Hash consists of reorganize the information of given data structure so that the location of the data is correlated with the data’s key (such as the name, id or value). Therefore, it’ll result in creating a function that transforms the key into a location.
Continuing our example, our transformation could be as simple as converting ‘Frank Zappa’ to decimal. As in:
F=102 R=114 A=97 N=110 N=107 ==> Frank = 10211497110107
Z=122 A=97 P=112 P=112 A=97 ==> Zappa = 1229711211297
Frank + Zappa = 10211497110107 + 1229711211297 = 11441208321404
key = 11441208321404
With the key in hands we need then to transform it to a reasonable location. For example, we could use the key to determinate Frank’s ZIP Code:
Possible Zip Codes: 99999
key = 11441208321404
Frank's Zip Code: 11441208321404 % 99999 = 34631
In our example, Frank would be realocated to 34631, Arizona. So basically, knowing only Frank Zappa by his name we would reach his address doing only a few math. In reality, 34631 would be the position in a new data structure and we would have to rearrange all information (every citizen) to a new location.
You are probably wondering: what if the information’s key points to the same location as another. In fact, that can happen and it’s called collision. There are multiple ways of handling collision. In our example, it could be as simple as having more than one person in a given address. You would still have to knock on each door, but on fewer doors. The more you avoid collisions, the fewer doors you have to knock on, but even with collisions, hashes are still faster. Also notice the size of a hash is arbitrary, it can be larger or smaller than the total number of items.
This is a simple hash. The sole purpose of this program is learn and practice the basics of Hash Function and Hash Tables. We used C++ only as a learning languague, we did not aim to any particular implementation. I recommend checking the full code. It will basically consist in 3 parts, assign a key and a position to every element, inserting this element in a new array/list/vector and handling collisions when it happens.
We used a hash function to determine the position of a given object in the new array. This hash function is not optimal and it’s supposed to be only an example. This function will return the position for a given id:
int hashfunction(string id)
{
// create "unique" key and position for each element
int sum = 0;
int distribution[12] = {1, 10, 100, 1000, 10000, 10000, 10000, 10000, 1000, 100, 10, 1};
for (int i = 0; i < id.size(); i++)
sum += int(id[i])*distribution[i];
return sum%(SIZE); //where SIZE is the size of your hash
}
Then we inserted into a list, which we’ll see later why it was chosen as our storage. First we need to create it and then store each element in their position given by our Hash Function:
list<Element> *main_list;
main_list = new list<Element>[SIZE]; //create list of class Elements
void insertElement(Element &e)
{
int position = hashfunction(e.id); // calculate position
main_list[position].push_back(e); // store element in position
}
Before diving in how to handle collision, it is important to punctuate the difference between a vector and a list in C++, because it’ll help understand why we’ve chosen the latter.
A vector is a container which store elements in contiguous memory, that means vector[54] is right next to vector[53] in the memory. If you used a function like vector[53].push_back(value), it’ll fill the next position, vector[54].
A list however is non contiguous, which means list[54] does NOT comes next to list[53] in the memory. If you used a function like list[53].push_back(value) it’ll not fill list[54], but rather an empty location next do list[53] which carries the same name.
As we are using lists and list::push_back, collisions will be automaticly handled. Each position can hold multiple variables and are determined by their id. However, for the sake of completion it’s important to point out how to retrieve those informations stored in the same position in a list:
void search_id(string id)
{
/* Search for the element using Hash*/
int position = hashfunction(id);
for (Element e : main_list[position])
if (e.id == id)
{
cout << "Element info: " << e << endl;
break;
}
return;
}
Using an example database a simple search of three random elements was made using a standard array and a hash:
Hashes are substantially faster than normal search.
There are multiples ways and different mathematical expressions that can be used as a hash function, each of them serves a distinct purpouse. We only covered the simplest and minimal implementation.
Next we’ll check how to implement a hash to perform a hash join in SQL.