A Practical Multi-Word Compare and Swap Operation

Table of Contents

A Practical Multi-Word Compare and Swap Operation

This paper introduces the RDCSS (Restricted double compare and single swap) concurrent instruction using CAS (compare and swap).

This RDCSS instruction is then used to write MWCAS (Multi-word compare and swap).

RDCSS

RDCSS checks to make sure that two locations are set to the correct value provided (i.e. no other thread has mutated them) and then updates the second memory location to a value provided.

We could try something like this, but this isn’t thread-safe (between the first line of the function and the second, something could’ve updated *a2, so this function is not linearizable.

uint64_t RDCSS(uint64_t *a1, uint64_t *a2, uint64_t o1, uint64_t o2, uint64_t n2) {
  uint64_t a2_local = *a2;
  if(*a1 == o1 && a2_local == o2) *a2 = n2;
  return a2_local;
}

We can reduce this algorithm to 2PL (2 Phase Locking) like so:

uint64_t RDCSS(uint64_t *a1, uint64_t *a2, uint64_t o1, uint64_t o2, uint64_t n2) {
  lock(a2);
  uint64_t a2_local = *a2;  // Access protected by lock on a2
  lock(a1);
  uint64_t a1_local = *a1;  // Access protected by lock on a1
  unlock(a1);
  if (a1_local == o1 && a2_local == o2) *a2 = n2; // Protected by lock on a2
  unlock(a2);
  return a2_local;
}

Unfortunately, this requires locking a2 and a1, so it’s not correct either.

The paper sidesteps this in two ways: having a cooperative approach, and using descriptors.

If other threads notice that an address has a descriptor on it, and the thread did not place that descriptor in that memory address, they wait until the other thread is done with its work before continuing on.

This looks like this:

struct RDCSSDescriptor_t {
    word_t* a1;        // control address
    const word_t  o1;  // expected control value
    word_t* a2;        // data address
    const word_t  o2;  // expected data value
    const word_t  n2;  // new data value
};

word_t RDCSS(RDCSSDescriptor_t* d) {
    do {
        r = CAS1(d->a2, d->o2, d);
        if (IsDescriptor(r)) Complete(r);   // H1
    } while (IsDescriptor(r));
    if (r == d->o2) Complete(d);
    return r;
}

word_t RDCSSRead(addr_t *addr) {
    do {
        r = __atomic_load(addr);
        if (IsDescriptor(r)) Complete(r);
    } while (IsDescriptor(r));
    return r;
}

void Complete(RDCSSDescriptor_t* d) {
    v = __atomic_load(d->a1);
    if (v == d->o1) CAS1(d->a2, d, d->n2);
    else CAS1(d->a2, d, d->o2);
}

Implementing Multi-word Compare and Swap

Now, we can implement multi-word compare and swap, which looks like 2PL (try to install descriptors into both memory addresses using RDCSS. If both operations succeed, we write the values to both, since we know this is race-free (since other threads will know to wait, since our descriptors are still there).

If either or both operations fail, the operation is marked as failed, and retried.

struct CASNDescriptor_t {
    const int n;        // Number of words we are updating
    Status status;      // Status of overall operation
    Entry entry[];      // Array of entries with `addr`, `old_val`, `new_val`
}

bool CASN(CASNDescriptor_t* cd) {
    if (__atomic_load(&(cd->status)) == UNDECIDED) {
        // phase1: Install the descriptors
        status = SUCCEEDED;
        for (i = 0; (i < cd->n) && (status == SUCCEEDED); i++) {
retry_entry:
            entry = cd->entry[i];
            val = RDCSS(new RDCSSDescriptor_t(  // X1
                                &(cd->status), UNDECIDED, entry.addr,
                                entry.old_val, cd));
            if (IsCASNDescriptor(val)) {
                if (val != cd) {
                    CASN(val);
                    goto retry_entry;
                }  // else we installed descriptor successfully.
            } else if (val != entry.old_val) {
                status = FAILED;
            }
        }
        CAS1(&(cd->status), UNDECIDED, status);  // C4: Update status
    }
    // phase2: Roll forward/back the descriptors to values.
    succeeded = (__atomic_load(&(cd->status)) == SUCCEEDED);
    for (i = 0; i < cd->n; i++)
        CAS1(cd->entry[i].addr, cd,
             succeeded ? (cd->entry[i].new_val) : (cd->entry[i].old_val));
    return succeeded;
}

word_t CASNRead(addr_t *addr) {
    do {
        r = RDCSSRead(addr);
        if (IsCASNDescriptor(r)) CASN(r);
    } while (IsCASNDescriptor(r));
    return r;
}

Performance

Shockingly, according to the paper, the algorithm performs roughly as well as the lock version, even though this algorithm is lock-free.