#include <waiting_threads.h>
#include <m_string.h>

Include dependency graph for waiting_threads.c:

Classes
struct	st_wt_resource
struct	deadlock_arg

Macros
#define	incr(VAR, LOCK) do { (VAR)++; } while(0)

Functions
void	wt_init ()
void	wt_end ()
void	wt_thd_lazy_init (WT_THD thd, const ulong ds, const ulong ts, const ulong dl, const ulong *tl)
void	wt_thd_destroy (WT_THD *thd)
my_bool	wt_resource_id_memcmp (const void a, const void b)
int	wt_thd_will_wait_for (WT_THD thd, WT_THD blocker, const WT_RESOURCE_ID *resid)
int	wt_thd_cond_timedwait (WT_THD thd, mysql_mutex_t mutex)
void	wt_thd_release (WT_THD thd, const WT_RESOURCE_ID resid)

Variables
ulonglong	wt_wait_table [WT_WAIT_STATS]
uint32	wt_wait_stats [WT_WAIT_STATS+1]
uint32	wt_cycle_stats [2][WT_CYCLE_STATS+1]
uint32	wt_success_stats

Detailed Description

"waiting threads" subsystem - a unified interface for threads to wait on each other, with built-in deadlock detection.

Main concepts ^^^^^^^^^^^^^ a thread - is represented by a WT_THD structure. One physical thread can have only one WT_THD descriptor at any given moment.

a resource - a thread does not wait for other threads directly, instead it waits for a "resource", which is "owned" by other threads. It waits, exactly, for all "owners" to "release" a resource. It does not have to correspond to a physical resource. For example, it may be convenient in certain cases to force resource == thread. A resource is represented by a WT_RESOURCE structure.

a resource identifier - a pair of {resource type, value}. A value is an ulonglong number. Represented by a WT_RESOURCE_ID structure.

a resource type - a pointer to a statically defined instance of WT_RESOURCE_TYPE structure. This structure contains a pointer to a function that knows how to compare values of this resource type. In the simple case it could be wt_resource_id_memcmp().

a wait-for graph - a graph, that represenst "wait-for" relationships. It has two types of nodes - threads and resources. There are directed edges from a thread to a resource it is waiting for (WT_THD::waiting_for), from a thread to resources that it "owns" (WT_THD::my_resources), and from a resource to threads that "own" it (WT_RESOURCE::owners)

Graph completeness ^^^^^^^^^^^^^^^^^^

For flawless deadlock detection wait-for graph must be complete. It means that when a thread starts waiting it needs to know all its blockers, and call wt_thd_will_wait_for() for every one of them. Otherwise two phenomena should be expected:

Fuzzy timeouts:

thread A needs to get a lock, and is blocked by a thread B. it waits. Just before the timeout thread B releases the lock. thread A is ready to grab the lock but discovers that it is also blocked by a thread C. It waits and times out.

As a result thread A has waited two timeout intervals, instead of one.

Unreliable cycle detection:

Thread A waits for threads B and C Thread C waits for D Thread D wants to start waiting for A

one can see immediately that thread D creates a cycle, and thus a deadlock is detected.

But if thread A would only wait for B, and start waiting for C when B would unlock, thread D would be allowed to wait, a deadlock would be only detected when B unlocks or somebody times out.

These two phenomena don't affect a correctness, and strictly speaking, the caller is not required to call wt_thd_will_wait_for() for all blockers - it may optimize wt_thd_will_wait_for() calls. But they may be perceived as bugs by users, it must be understood that such an optimization comes with its price.

Usage ^^^^^

First, the wt* subsystem must be initialized by calling wt_init(). In the server you don't need to do it, it's done in mysqld.cc.

Similarly, wt_end() frees wt* structures, should be called at the end, but in the server mysqld.cc takes care of that.

Every WT_THD should be initialized with wt_thd_lazy_init(). After that they can be used in other wt_thd_* calls. Before discarding, WT_THD should be free'd with wt_thd_destroy(). In the server both are handled in sql_class.cc, it's an error to try to do it manually.

To use the deadlock detection one needs to use this thread's WT_THD, call wt_thd_will_wait_for() for every thread it needs to wait on, then call wt_thd_cond_timedwait(). When thread releases a resource it should call wt_thd_release() (or wt_thd_release_all()) - it will notify (send a signal) threads waiting in wt_thd_cond_timedwait(), if appropriate.

Just like with pthread's cond_wait, there could be spurious wake-ups from wt_thd_cond_timedwait(). A caller is expected to handle that (that is, to re-check the blocking criteria).

wt_thd_will_wait_for() and wt_thd_cond_timedwait() return either WT_OK or WT_DEADLOCK. Additionally wt_thd_cond_timedwait() can return WT_TIMEOUT. Out of memory and other fatal errors are reported as WT_DEADLOCK - and a transaction must be aborted just the same.

Configuration ^^^^^^^^^^^^^ There are four config variables. Two deadlock search depths - short and long - and two timeouts. Deadlock search is performed with the short depth on every wt_thd_will_wait_for() call. wt_thd_cond_timedwait() waits with a short timeout, performs a deadlock search with the long depth, and waits with a long timeout. As most deadlock cycles are supposed to be short, most deadlocks will be detected at once, and waits will rarely be necessary.

These config variables are thread-local. Different threads may have different search depth and timeout values.

Also, deadlock detector supports different killing strategies, the victim in a deadlock cycle is selected based on the "weight". See "weight" description in waiting_threads.h for details. It's up to the caller to set weights accordingly.

Status ^^^^^^ We calculate the number of successfull waits (WT_OK returned from wt_thd_cond_timedwait()), a number of timeouts, a deadlock cycle length distribution - number of deadlocks with every length from 1 to WT_CYCLE_STATS, and a wait time distribution - number of waits with a time from 1 us to 1 min in WT_WAIT_STATS intervals on a log e scale.

Sample usage as was done in the Maria engine ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

in class THD, THD::transaction had a WT_THD object; there were session variables to set the short/long depth/timeout.
when mysqld started, wt_init() was called; when it ended, wt_end() was called.
in THD's constructor, wt_thd_lazy_init(&transaction.wt, &variables.wt_deadlock_search_depth_short, &variables.wt_timeout_short, &variables.wt_deadlock_search_depth_long, &variables.wt_timeout_long);
in THD::cleanup(): wt_thd_destroy(&transaction.wt);
this was sufficient to make the deadlock-detector available to the Maria engine (which can grab THD); the engine used it this way:

when it wrote a row, and hit a duplicate key, it would find who wrote this key, the "blocker" transaction. If "blocker" had committed, duplicate key error would be sent. Otherwise, we would wait for it, in the following code snippet (originally from storage/maria/ma_write.c). After the blocker is gone, we would retry the write:

  while (keyinfo->ck_insert(info,
           (*keyinfo->make_key)(info, &int_key, i, buff, record,
                                filepos, info->trn->trid)))
  {
    / * we got a write error * /
    if error is not "duplicate key" then return error;
    info->dup_key_trid has the culprit:
    if the culprit is ourselves then return error;
    otherwise:
    blocker= trnman_trid_to_trn(info->trn, info->dup_key_trid);
    / *
      if blocker TRN was not found, it means that the conflicting
      transaction was committed long time ago. It could not be
      aborted, as it would have to wait on the key tree lock
      to remove the conflicting key it has inserted.
      /
    if (!blocker || blocker->commit_trid != ~(TrID)0)
    { / * committed * /
      if (blocker)
        pthread_mutex_unlock(& blocker->state_lock);
      rw_unlock(&keyinfo->root_lock);
      goto err;
    }
    / * release root_lock to let blocker finish its work * /
    rw_unlock(&keyinfo->root_lock);
    {
      / * running. now we wait * /
      WT_RESOURCE_ID rc;
      int res;
      const char *old_proc_info;

      rc.type= &ma_rc_dup_unique;
      rc.value= (intptr)blocker;
      res= wt_thd_will_wait_for(info->trn->wt, blocker->wt, & rc);
      if (res != WT_OK)
      {
        pthread_mutex_unlock(& blocker->state_lock);
        my_errno= HA_ERR_LOCK_DEADLOCK;
        goto err;
      }
      old_proc_info= proc_info_hook(0,
                                    "waiting for a resource",
                                    __func__, __FILE__, __LINE__);
      res= wt_thd_cond_timedwait(info->trn->wt, & blocker->state_lock);
      proc_info_hook(0, old_proc_info, __func__, __FILE__, __LINE__);

      pthread_mutex_unlock(& blocker->state_lock);
      if (res != WT_OK)
      {
        my_errno= res == WT_TIMEOUT ? HA_ERR_LOCK_WAIT_TIMEOUT
                                    : HA_ERR_LOCK_DEADLOCK;
        goto err;
      }
      / * if we come here, blocker has rolled back or committed,
      so is gone, we can retry the ck_insert() * /
    }
    rw_wrlock(&keyinfo->root_lock);

#ifndef MARIA_CANNOT_ROLLBACK keyinfo->version++; #endif }

ma_rc_dup_unique was: / * a WT_RESOURCE_TYPE for transactions waiting on a unique key conflict * / WT_RESOURCE_TYPE ma_rc_dup_unique={ wt_resource_id_memcmp, 0};
When a Maria transaction would commit or rollback it would call: / * Wake up threads waiting for this transaction * / static void wt_thd_release_self(TRN *trn) { if (trn->wt) { WT_RESOURCE_ID rc; rc.type= rc.value= (intptr)trn; wt_thd_release(trn->wt, & rc); trn->wt= 0; } }

Tests ^^^^^ unittest/mysys/waiting_threads-t.c, currently disabled.

Definition in file waiting_threads.c.

Function Documentation

my_bool wt_resource_id_memcmp	(	const void *	a,
		const void *	b
	)

Trivial resource id comparison function - bytewise memcmp.

It can be used in WT_RESOURCE_TYPE structures where bytewise comparison of values is sufficient.

Definition at line 651 of file waiting_threads.c.

int wt_thd_cond_timedwait	(	WT_THD *	thd,
		mysql_mutex_t *	mutex
	)

called by a waiter (thd) to start waiting

It's supposed to be a drop-in replacement for pthread_cond_timedwait(), and it takes mutex as an argument.

Returns: one of WT_TIMEOUT, WT_DEADLOCK, WT_OK

Definition at line 1139 of file waiting_threads.c.

void wt_thd_lazy_init	(	WT_THD *	thd,
		const ulong *	ds,
		const ulong *	ts,
		const ulong *	dl,
		const ulong *	tl
	)

Lazy WT_THD initialization

Cheap initialization of WT_THD. Only initialize fields that don't require memory allocations - basically, it only does assignments. The rest of the WT_THD structure will be initialized on demand, on the first use. This allows one to initialize lazily all WT_THD structures, even if some (or even most) of them will never be used for deadlock detection.

Parameters

ds	a pointer to deadlock search depth short value
ts	a pointer to deadlock timeout short value
dl	a pointer to deadlock search depth long value
tl	a pointer to deadlock timeout long value

Note: these are pointers to values, and WT_THD stores them as pointers. It allows one later to change search depths and timeouts for existing threads. It also means that the pointers must stay valid for the lifetime of WT_THD.

Definition at line 594 of file waiting_threads.c.

void wt_thd_release	(	WT_THD *	thd,
		const WT_RESOURCE_ID *	resid
	)

called by a blocker when it releases a resource

it's conceptually similar to pthread_cond_broadcast, and must be done under the same mutex as wt_thd_cond_timedwait().

Parameters

resid a resource to release. 0 to release all resources

Definition at line 1213 of file waiting_threads.c.

int wt_thd_will_wait_for	(	WT_THD *	thd,
		WT_THD *	blocker,
		const WT_RESOURCE_ID *	resid
	)

notify the system that a thread needs to wait for another thread

called by a waiter to declare that it (thd) will wait for another thread (blocker) on a specific resource (resid). can be called many times, if many blockers own a blocking resource. but must always be called with the same resource id - a thread cannot wait for more than one resource at a time.

Returns: WT_OK or WT_DEADLOCK

As a new edge is added to the wait-for graph, a deadlock detection is performed for this new edge.

Definition at line 1011 of file waiting_threads.c.

Variable Documentation

uint32 wt_cycle_stats[2][WT_CYCLE_STATS+1]

distribution of cycle lengths first column tells whether this was during short or long detection

Definition at line 294 of file waiting_threads.c.

uint32 wt_wait_stats[WT_WAIT_STATS+1]

wait time distribution (log e scale)

Definition at line 289 of file waiting_threads.c.

ulonglong wt_wait_table[WT_WAIT_STATS]

preset table of wait intervals

Definition at line 285 of file waiting_threads.c.

Classes

Macros

Functions

Variables

Detailed Description

Function Documentation

Variable Documentation