
Shared Hash Table,
a Hash table or Transposition table which is accessed by various processes or threads simultaneously, running on multiple processors or processor cores. Shared hash tables are most often implemented as dynamically allocated memory treated as a global array. Due to memory protection between processes, they require an Application programming interface provided by the operating system to allocate shared memory. Threads may share the global memory of the process they belong to.
Dining philosophers problem [1]

Parallel Search

Almost all parallel search algorithms on SMP or NUMA systems profit from probing hash entries written by other instances of the search, in the simplest form by instances of a sequential search algorithm which simultaneously search the same root position. The gains come from the effect of nondeterminism: each processor finishes the various subtrees in varying amounts of time, and as the search continues these effects grow, making the search trees diverge. The speedup is then based on how many nodes the main processor is able to skip thanks to transposition table entries. Unfortunately, this approach gives little speedup on a mere two processors and scales quite badly beyond that. The NPS scaling, however, is nearly perfect.

Concurrent Access

Due to their size of 16 or more bytes, writing and reading hash entries is not atomic and requires multiple write and read cycles. It may and will happen that concurrent writes and reads at the same table address at almost the same time deliver corrupted data to the reader, which causes significant problems for the search. Interrupts may occur between accesses, and there are further nondeterministic issues involved [2] which may cause one thread to read two or more atomic data items that were written by different threads searching different positions with the same hash index due to type-2 errors.
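
As a minimal illustration, consider a generic 16-byte entry with hypothetical names, not taken from any particular engine. Its two quad words are written and read by separate instructions, so accesses from different threads may interleave between them:

#include <stdint.h>

typedef struct {
   uint64_t key;     /* 64-bit Zobrist key of the position */
   uint64_t data;    /* move, score, draft, ... packed into one quad word */
} ttentry;           /* 16 bytes, written and read as two separate quad words */

void tt_write(ttentry *e, uint64_t key, uint64_t data) {
   e->key  = key;    /* first 8-byte store */
   e->data = data;   /* second 8-byte store; another thread may access the entry in between */
}

void tt_read(const ttentry *e, uint64_t *key, uint64_t *data) {
   *key  = e->key;   /* the key of one write ... */
   *data = e->data;  /* ... may end up paired with the data of another write (a torn entry) */
}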

Locks

One common solution to avoid such errors is synchronization with atomic locks to implement a critical section or mutual exclusion.

Cilkchess

As an example, Cilkchess used Cilk's support for atomicity [3], with one lock per hash entry:
typedef struct
{
   Cilk_lockvar lock;
   U64 key;
   U64 data;
} ttentry;
 
ttentry hashtable[TABLESIZE];
 
void ttinit ( ) {
   for (int i = 0; i < TABLESIZE; ++i)
      Cilk_lock_init( hashtable[i].lock);
}
 
void update_entry ( ttentry *e, U64 key, U64 data ) {
   Cilk_lock (e->lock); /* begin critical section */
   e->key = key;
   e->data = data;
   ...
   Cilk_unlock (e->lock); /* end critical section */
}

Granularity

An important property of a lock is its granularity, which is a measure of the amount of data the lock is protecting. In general, choosing a coarse granularity (a small number of locks, each protecting a large segment of data) results in less lock overhead when a single process is accessing the protected data, but worse performance when multiple processes are running concurrently, because of increased lock contention. The coarser the lock, the higher the likelihood that it will stop an unrelated process from proceeding; in the extreme case there is one lock for the whole table. Conversely, using a fine granularity (a larger number of locks, each protecting a fairly small amount of data), like in the Cilkchess sample above, increases the overhead of the locks themselves but reduces lock contention. For a huge transposition table with millions of fairly small entries, locks incur a significant performance penalty on many architectures.
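
A common middle ground between those extremes is a fixed pool of locks, each protecting a stripe of entries. The following sketch assumes POSIX threads; table size, lock count and all names are illustrative. It maps a table index to one of a few thousand mutexes instead of using one lock per entry or a single lock for the whole table:

#include <stdint.h>
#include <pthread.h>

#define TABLESIZE (1 << 20)   /* assumed number of entries    */
#define NUMLOCKS  (1 << 12)   /* far fewer locks than entries */

typedef struct {
   uint64_t key;
   uint64_t data;
} ttentry;

static ttentry         hashtable[TABLESIZE];
static pthread_mutex_t locks[NUMLOCKS];   /* each lock protects TABLESIZE/NUMLOCKS entries */

void lock_init(void) {
   for (int i = 0; i < NUMLOCKS; ++i)
      pthread_mutex_init(&locks[i], NULL);
}

void update_entry(uint64_t key, uint64_t data) {
   uint64_t index = key % TABLESIZE;
   pthread_mutex_t *lock = &locks[index % NUMLOCKS];   /* map the index to its stripe lock */
   pthread_mutex_lock(lock);                           /* begin critical section */
   hashtable[index].key  = key;
   hashtable[index].data = data;
   pthread_mutex_unlock(lock);                         /* end critical section */
}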

Lock-less

Xor

Robert Hyatt and Tim Mann proposed a lock-less transposition table implementation [4] for 128-bit entries consisting of two atomic quad words, one qword storing the key or signature, the 64-bit Zobrist- or BCH-key of the position, and one qword storing the remaining information such as move, score and draft (the data). Rather than storing the two items disjointly, the key is stored xored with the data, while the data is additionally stored as usual. According to Robert Hyatt, the original idea came from Harry Nelson somewhere in 1990-1992 [5].

index = key % TABLESIZE;
hashtable[index].key  = key ^ data;
hashtable[index].data = data;

Since retrieving the position requires the same key for a probing hit, the stored key field xored with the stored data must match the probing key.

index = key % TABLESIZE;
if (( hashtable[index].key ^ hashtable[index].data) == key )
{
   /* entry matches key */
}

If key and data were written simultaneously by different search instances with different keys, the error will usually result in a mismatch of the comparison, except for the rare but inherent Key collisions or type-1 errors [6]. As pointed out by Harm Geert Muller [7], the XOR technique may be applied to entries of any size.
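
Putting store and probe together, a minimal self-contained sketch of the xor scheme might look as follows; the table size and all names are illustrative:

#include <stdint.h>

#define TABLESIZE (1 << 20)   /* assumed number of entries */

typedef struct {
   uint64_t key;              /* stored as key ^ data */
   uint64_t data;             /* move, score, draft, ... packed into 64 bits */
} ttentry;

static ttentry hashtable[TABLESIZE];

void tt_store(uint64_t key, uint64_t data) {
   uint64_t index = key % TABLESIZE;
   hashtable[index].key  = key ^ data;        /* two independent 64-bit stores */
   hashtable[index].data = data;
}

int tt_probe(uint64_t key, uint64_t *data) {
   uint64_t index = key % TABLESIZE;
   uint64_t d = hashtable[index].data;        /* read the data word once into a local */
   if ((hashtable[index].key ^ d) == key) {   /* torn or foreign entries fail this test, */
      *data = d;                              /* up to the rare type-1 collisions above  */
      return 1;
   }
   return 0;
}

In practice the two fields are often declared volatile or accessed through relaxed atomics, so that the compiler cannot reorder or re-read them between the verification and the use of the data.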

Checksum

For a lock-less shared Hash table with (much) larger entry sizes, such as the Pawn Hash Table, one may store an additional checksum of the data in each entry, to detect most errors after retrieving and to safeguard the consistency of an entry.
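
The following sketch assumes a pawn hash entry of several quad words and uses a simple xor fold as checksum; any sufficiently cheap hash of the payload would serve the same purpose, and all names and sizes are illustrative:

#include <stdint.h>

#define PAWNHASHSIZE (1 << 16)   /* assumed number of pawn hash entries */
#define PAYLOADWORDS 6           /* assumed payload size: 48 bytes of pawn data */

typedef struct {
   uint64_t key;                     /* pawn hash key */
   uint64_t payload[PAYLOADWORDS];   /* pawn structure scores, passed pawn sets, ... */
   uint64_t checksum;                /* checksum over key and payload */
} pawnentry;

static pawnentry pawntable[PAWNHASHSIZE];

static uint64_t fold(const pawnentry *e) {
   uint64_t sum = e->key;
   for (int i = 0; i < PAYLOADWORDS; ++i)
      sum ^= e->payload[i];          /* cheap xor fold; a stronger hash is possible */
   return sum;
}

void pawn_store(const pawnentry *e) {
   uint64_t index = e->key % PAWNHASHSIZE;
   pawntable[index] = *e;                   /* non-atomic copy of the whole entry */
   pawntable[index].checksum = fold(e);
}

int pawn_probe(uint64_t key, pawnentry *out) {
   uint64_t index = key % PAWNHASHSIZE;
   *out = pawntable[index];                 /* copy first, then verify the copy */
   return out->key == key && out->checksum == fold(out);
}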

SSE2

x86 and x86-64 SSE2 128-bit read/write instructions might be atomic in practice, but this is not guaranteed even if the operands are properly aligned [8] [9]. If the processor implements a 16-byte store instruction internally as two 8-byte stores in the store pipeline, it is perfectly possible for another processor to "steal" the cache line in between the two stores [10].
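
For illustration, a 16-byte entry can be copied with a single SSE2 128-bit load and store via compiler intrinsics; whether such an access is actually performed as one indivisible memory operation is, as noted above, implementation dependent:

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

typedef struct {
   _Alignas(16) uint64_t key;   /* C11 alignment; _mm_load/_mm_store_si128 require 16-byte alignment */
   uint64_t data;
} ttentry;

void entry_copy(ttentry *dst, const ttentry *src) {
   __m128i e = _mm_load_si128((const __m128i *)src);   /* one 128-bit load, atomicity not guaranteed  */
   _mm_store_si128((__m128i *)dst, e);                 /* one 128-bit store, atomicity not guaranteed */
}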

Allocation

Multiple threads inside one process can share its global variables or heap. Processes require special API calls to create shared memory and to pass a handle around to other processes for interprocess communication. POSIX provides a standardized API for using shared memory [11]. Linux kernel builds since 2.6 offer /dev/shm as shared memory in the form of a RAM disk.
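
A minimal sketch of placing the table in POSIX shared memory; the object name "/tt_shared", the table size and the error handling are illustrative. Each process maps the same object and treats the mapping as a global array:

#include <fcntl.h>      /* O_CREAT, O_RDWR  */
#include <sys/mman.h>   /* shm_open, mmap   */
#include <sys/types.h>  /* off_t            */
#include <unistd.h>     /* ftruncate, close */
#include <stdint.h>
#include <stddef.h>

#define TABLESIZE (1 << 20)

typedef struct {
   uint64_t key;
   uint64_t data;
} ttentry;

ttentry *create_shared_table(void) {
   size_t bytes = TABLESIZE * sizeof(ttentry);
   int fd = shm_open("/tt_shared", O_CREAT | O_RDWR, 0600);   /* named shared memory object      */
   if (fd < 0)                                                /* (on older glibc link with -lrt) */
      return NULL;
   if (ftruncate(fd, (off_t)bytes) < 0) {                     /* set the size of the object */
      close(fd);
      return NULL;
   }
   void *mem = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
   close(fd);                                                 /* the mapping stays valid after close */
   return mem == MAP_FAILED ? NULL : (ttentry *)mem;
}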

References

  1. ^ Dining philosophers problem from Wikipedia
  2. ^ Memory disambiguation from Wikipedia
  3. ^ Don Dailey, Charles E. Leiserson (2001). Using Cilk to Write Multiprocessor Chess Programs. Advances in Computer Games 9, pdf, 5 Other parallel programming issues, Cilk support for atomicity, pp. 17-18
  4. ^ Robert Hyatt, Tim Mann (2002). A lock-less transposition table implementation for parallel search chess engines. ICGA Journal, Vol. 25, No. 1
  5. ^ Re: Lockless hash: Thank you Bob Hyatt! by Robert Hyatt, CCC, August 26, 2013
  6. ^ Robert Hyatt, Anthony Cozzie (2005). The Effect of Hash Signature Collisions in a Chess Program. ICGA Journal, Vol. 28, No. 3
  7. ^ Re: lockless hashing by H.G.Muller, CCC, February 07, 2011
  8. ^ Re: Effectively atomic read of 16 bytes on x86_64 without cmpxchg16b? by Anthony Williams, groups.google.lock-free, February 08, 2012
  9. ^ Re: Speaking of the hash table by Ronald de Man, CCC, December 10, 2012
  10. ^ SSE instructions: single memory access by Atom, Stack Overflow, October 04, 2011
  11. ^ shm_open, The Single UNIX Specification version 2, Copyright © 1997 The Open Group
  12. ^ Maged Michael - Selected Publications
  13. ^ The Aurora Distributed Shared Data System
  14. ^ Transposition-driven scheduling - Wikipedia
  15. ^ Transposition driven scheduling by Daniel Shawul, CCC, April 04, 2013
  16. ^ Information on the C++11 Memory Model by Scott Meyers, April 24, 2012
  17. ^ John Romein, Henri Bal, Jonathan Schaeffer, Aske Plaat (2002). A Performance Analysis of Transposition-Table-Driven Scheduling in Distributed Search. IEEE Transactions on Parallel and Distributed Systems, Vol. 13, No. 5, pp. 447–459. pdf
