C - Making a Separate Chaining Hash Table - Issue

Question

I've spent some time on this, taking care to use understandable variable names and to keep the code clean and tidy so that I can debug it easily. But I can't seem to find my issue... The terminal doesn't output anything. Please help me identify my mistake!

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct list_node *node_ptr;

struct list_node
{
    node_ptr next;
    char *key;
    char *value;
    
};

typedef node_ptr LIST;
typedef node_ptr position;

struct hash_table
{
    LIST *list_ptr_arr;
    unsigned int table_size;
};

typedef struct hash_table *HASHTABLE;

unsigned long long int
hash(const char *key, unsigned int hash_size)
{

    unsigned long long int hash;

    for(int i = 0; key[i]; i++)
    {
        hash = (hash<<32)+key[i];
    }

    return (hash%hash_size);

}

unsigned int 
next_prime(int number)
{

    int j;

    for(int i = number; ; i++)
    {
        for(j = 2; j<i; j++)
        {
            if(i%j == 0){break;}
        }

        if(i==j){return j;}
    }
}

HASHTABLE
initialize(unsigned int table_size)
{
    HASHTABLE H;

    H = (HASHTABLE) malloc(sizeof(struct hash_table));
    if(H==NULL){printf("Out of Space!"); return 0;}

    H->table_size = next_prime(table_size);

    H->list_ptr_arr = (position*) malloc(sizeof(LIST)*table_size);
    if(H->list_ptr_arr==NULL){printf("Out of Space!"); return 0;}

    H->list_ptr_arr = (LIST*) malloc(sizeof(struct list_node)*table_size);

    for(unsigned int i = 0; i<table_size; i++)
    {
        if(H->list_ptr_arr[i]==NULL){printf("Out of Space!"); return 0;}

        H->list_ptr_arr[i]=NULL;
    }


    return H;
    
}



void
insert(const char *key, const char *value, HASHTABLE H)
{
    unsigned int slot = hash(key, H->table_size);
    node_ptr entry = H->list_ptr_arr[slot];

    node_ptr prev;

    while(entry!=NULL)
    {
        if(strcmp(entry->key, key)==0)
        {
            free(entry->value);
            entry->value = malloc(strlen(value)+1);
            strncpy(entry->value,value,strlen(value));
            return;
        }

        prev = entry;
        entry = prev->next;

    }

    entry = (position) malloc(sizeof(struct list_node));
    entry->value = malloc(strlen(value)+1);
    entry->key = malloc(strlen(key)+1);
    strncpy(entry->key,key,strlen(key));
    strncpy(entry->value,value,strlen(value));
    entry->next = NULL;
    prev->next = entry;

}

void
dump(HASHTABLE H)
{

    for(unsigned int i = 0; i<H->table_size; i++)
    {
        position entry = H->list_ptr_arr[i];

        if(H->list_ptr_arr[i]==NULL){continue;}

        printf("slot[%d]: ", i);

        for(;;)
        {
            printf("%s|%s -> ", entry->key, entry->value);

            if(entry->next == NULL)
            {
                printf("NULL");
                break;
            }

            entry = entry->next;
        }

        printf("\n");

    }

}


int main()
{
  
    HASHTABLE H = initialize(10);
    insert("name1", "David", H);
    insert("name2", "Lara", H);
    insert("name3", "Slavka", H);
    insert("name4", "Ivo", H);
    insert("name5", "Radka", H);
    insert("name6", "Kvetka", H);
    dump(H);
  
    return 0;   
    
}

I tried modifying it and changing a few things, but nothing helped...

Thanks in advance, guys!

Answer 1:

There are a few cosmetic issues and at least two errors that break the code. I won't go into the minor things, which are mostly stylistic, but your initialize() and insert() functions don't work.

In initialize() you allocate memory for H->list_ptr_arr twice. That leaks the memory from the first allocation for no good reason, but of course, that won't crash your code, just leak. In the second allocation, you allocate the wrong size: you use sizeof(struct list_node) * table_size, but you want an array of pointers and not of structs (which, since each struct holds three pointers, are larger). That, again, only wastes memory and doesn't crash. Still, you would be better off allocating the right amount, which you can get using

H->list_ptr_arr = malloc(H->table_size * sizeof *H->list_ptr_arr);

Note that the element count here is H->table_size, the prime that next_prime() returned, and not the table_size argument: hash() reduces modulo H->table_size and dump() loops up to it, so the array needs that many bins. You don't need to cast the result of malloc(); it returns a void *, which converts implicitly to any object pointer type in C, but that is a stylistic issue. The important part of that line is that we get the size of the underlying data from the variable we assign to, which guarantees that we get the right size, even if we change the type at some point. I also tend to use sizeof(type) from time to time, but sizeof *ptr is the better pattern, and it is worth getting used to.

Anyway, although you allocate the wrong amount of memory, you allocate enough, so your program doesn't crash because of it. But when you then run through the allocated bins in the table, you return with an error if they are NULL. They are not initialised at all, so if they are NULL (and they might be), then it is by pure luck. Or, if you consider it a sign of error, bad luck. But if you consider NULL a signal of allocation error here, why do you then initialise each bin to NULL right after you conclude that they aren't?

As it is, your initialisation will abort if you happen to get a NULL pointer in the array, and since you don't check for allocation errors in main() (which is fine for a test), that might be the reason your program is crashing. It is not the main issue, and it only happens if, by chance, you get a NULL in one of your bins, but it can happen. Don't check for NULL when you run through the bins. The bins are not initialised. Just set each of the H->table_size bins to NULL (your loop stops at the table_size argument, which can leave the last bins unset).
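
A minimal sketch of how initialize() could look with these points applied. It assumes the struct layout from the question, drops the LIST and position typedefs for brevity, and guards next_prime() against inputs below 2; treat it as one possible shape rather than the definitive fix:

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct list_node *node_ptr;

struct list_node {
    node_ptr next;
    char *key;
    char *value;
};

struct hash_table {
    node_ptr *list_ptr_arr;
    unsigned int table_size;
};

typedef struct hash_table *HASHTABLE;

/* Same trial-division search as in the question, using unsigned types. */
static unsigned int next_prime(unsigned int number)
{
    for (unsigned int i = number < 2 ? 2 : number; ; i++) {
        unsigned int j;
        for (j = 2; j < i; j++) {
            if (i % j == 0) break;
        }
        if (j == i) return i;
    }
}

HASHTABLE initialize(unsigned int table_size)
{
    HASHTABLE H = malloc(sizeof *H);
    if (H == NULL) {
        fprintf(stderr, "Out of space!\n");
        return NULL;
    }

    H->table_size = next_prime(table_size);

    /* One allocation, sized from the pointer being assigned and from
       the prime size that hash() will reduce modulo. */
    H->list_ptr_arr = malloc(H->table_size * sizeof *H->list_ptr_arr);
    if (H->list_ptr_arr == NULL) {
        fprintf(stderr, "Out of space!\n");
        free(H);
        return NULL;
    }

    /* malloc() does not initialise memory, so mark every bin empty. */
    for (unsigned int i = 0; i < H->table_size; i++) {
        H->list_ptr_arr[i] = NULL;
    }

    return H;
}
```

With initialize(10), this gives a table of 11 bins, all of them NULL.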

The main problem lies in insert(). Your prev variable is not initialised before the while-loop, and if you do not enter the loop, it won't be after it either. Setting prev->next = entry when prev is uninitialised spells trouble, and is a likely candidate for a crashing error. Especially considering that the first time you insert something into a bin, entry will be NULL, so you trigger the error the very first time. What happens when you dereference an uninitialised pointer is undefined, but it rarely means something good. A crash is the best case scenario.

I understand the logic here. You want to move prev along the list so you can insert the new entry at the end, and you don't have a last element before you loop through the entries in the bin. But that doesn't mean you can't have an initialised pointer to where you want to insert a new entry. If you use a pointer to a pointer, you can start with the entry in the table's array. That is not a list_node, so a list_node * won't do for prev, but a list_node ** will work just fine. You can do something like this:

node_ptr new_entry(const char *key, const char *value)
{
  node_ptr entry = malloc(sizeof *entry);
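  if (!entry) abort(); // Add error checking
  entry->value = malloc(strlen(value) + 1);
  entry->key = malloc(strlen(key) + 1);
  if (!entry->value || !entry->key) abort();
  // strcpy also copies the terminating '\0'; strncpy bounded by
  // strlen() would leave the strings unterminated
  strcpy(entry->key, key);
  strcpy(entry->value, value);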
  entry->next = NULL;
  return entry;
}

void
insert(const char *key, const char *value, HASHTABLE H)
{
    unsigned int slot = hash(key, H->table_size);
    node_ptr entry = H->list_ptr_arr[slot];

    // Make sure that we always have a prev, by pointing it
    // to the location where we want to insert a new entry,
    // which we want at the bin if nothing else
    node_ptr *loc = &H->list_ptr_arr[slot];

    while(entry != NULL)
    {
        if(strcmp(entry->key, key)==0)
        {
            free(entry->value);
            entry->value = malloc(strlen(value)+1);
            strcpy(entry->value, value); // also copies the terminating '\0'
            return;
        }

        // make loc the entry's next
        loc = &entry->next;
        // and move entry forward (we don't need prev->next now)
        entry = entry->next;
    }

    // now loc will hold the address we should put
    // the entry in
    *loc = new_entry(key, value);
}

Of course, since the lists in the bins aren't sorted or kept in any particular order (unless there are constraints you haven't mentioned), you don't need to append new entries. You can prepend them as well. Then you don't need to drag loc along during the linear search. You could do something like:

node_ptr find_in_bin(const char *key, node_ptr bin)
{
  for (node_ptr entry = bin; entry; entry = entry->next) {
    if(strcmp(entry->key, key)==0)
      return entry;
  }
  return 0;
}

void
insert(const char *key, const char *value, HASHTABLE H)
{
    unsigned int slot = hash(key, H->table_size);
    node_ptr *bin = &H->list_ptr_arr[slot];
    node_ptr entry = find_in_bin(key, *bin);
    if (entry) {
      free(entry->value);
      entry->value = malloc(strlen(value)+1);
      strcpy(entry->value, value); // also copies the terminating '\0'
    } else {
      // prepend: the old head of the bin becomes the new entry's next
      entry = new_entry(key, value);
      entry->next = *bin;
      *bin = entry;
    }
}

If you fix the initialization and insertion this way, I think the code should work. It does for the few tests I put it through, but I may have missed something.
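
One more thing worth checking, which a few quick tests may not expose: in the question's hash(), the local variable hash is read before it is ever assigned, and shifting left by 32 on every round means that only the last two characters of the key survive in a 64-bit unsigned long long. A minimal sketch of an alternative, assuming the well-known djb2 scheme (start at 5381, multiply by 33, add each character); the name hash_string is hypothetical and not from the original code:

```c
#include <stddef.h>

/* Hypothetical replacement for the question's hash(): the accumulator
   starts at a known value and every character of the key contributes. */
unsigned int
hash_string(const char *key, unsigned int table_size)
{
    unsigned long long h = 5381;

    for (size_t i = 0; key[i] != '\0'; i++) {
        /* Cast through unsigned char so negative char values do not
           sign-extend into the hash. */
        h = h * 33 + (unsigned char) key[i];
    }

    return (unsigned int) (h % table_size);
}
```

The result is always smaller than table_size, so it can be used directly as an index into the array of bins.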

Not an error as such, but something I will still quickly comment on. The next_prime() function finds a prime by plain trial division, testing each candidate against every smaller number. That is fine, it computes a prime (unless I have missed something), but it is not something you need. If you search for it, you will find tables of the first K primes, for pretty large K. You can easily embed them in your code. That is, if you absolutely want your tables to have prime sizes. You don't need to, though. There is nothing wrong with having tables of other sizes.
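
As a rough illustration of the embedded-table idea, here is a sketch that stores the largest prime below each power of two from 2^3 to 2^16 and picks the first one that is big enough. The names table_primes and prime_at_least are made up for this example, and the choice of which primes to store is only an assumption; any table of primes would do:

```c
#include <stddef.h>

/* Largest prime below each power of two, from 2^3 up to 2^16. */
static const unsigned int table_primes[] = {
    7, 13, 31, 61, 127, 251, 509, 1021,
    2039, 4093, 8191, 16381, 32749, 65521
};

/* Returns the smallest stored prime that is >= wanted,
   or 0 if wanted is larger than every prime in the table. */
unsigned int
prime_at_least(unsigned int wanted)
{
    size_t count = sizeof table_primes / sizeof table_primes[0];

    for (size_t i = 0; i < count; i++) {
        if (table_primes[i] >= wanted) {
            return table_primes[i];
        }
    }

    return 0;
}
```

A call such as prime_at_least(10) then replaces next_prime(10) with a short scan over a constant array; the caller has to handle the 0 returned for sizes beyond the table.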

There are some benefits to reducing hash values modulo a prime, but the hash table doesn't have to have the size of the prime for this to work. If you have a large prime P, and a hash table of size M, you can do ((i % P) % M) and get the benefits of doing modulo P and the convenience of having table size M. When you resize tables and such, it is easier if M is a power of two, and then the last modulo operation can be a very fast bit-masking:

#define mask_k(n, k) ((n) & ((1u << (k)) - 1))

and then later...

   int index = mask_k(i % P, k); // where table size is 1 << k

The i % P might not be necessary either; it depends on how good your hash function is. If you have a hash function that gives you close to random numbers, then the bits in i are random, and then the k least-significant bits are as well, and % P does nothing to improve it. But if you want to do modulo a prime, you can do so for a large prime and mask it down to a smaller table size, so you don't have to use a table size that is a prime. And if you want to have a table size that is a prime anyway, use a table of primes. It is slow to have to compute new primes every time you resize the table.
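
Put together, the prime-then-mask idea could look roughly like the sketch below. The function name bin_index and the constant HASH_PRIME are hypothetical, and the choice of P = 2^31 - 1 (a Mersenne prime) is only an assumption for illustration; it requires k to be smaller than 64:

```c
/* Assumed large prime P: 2^31 - 1. */
#define HASH_PRIME 2147483647ULL

/* Reduce a raw hash value modulo the prime, then keep the k
   least-significant bits to index a table of size 1 << k. */
unsigned int
bin_index(unsigned long long h, unsigned int k)
{
    unsigned long long reduced = h % HASH_PRIME;

    return (unsigned int) (reduced & ((1ULL << k) - 1));
}
```

For a table of 16 bins you would call bin_index(h, 4); doubling the table only means passing k + 1, with no new prime to compute.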

Source: https://stackoverflow.com/questions/65031109/c-making-a-separate-chaining-hash-table-issue

Tags

data-structures

hashtable