# plus/minus epsilon

### Cryptographic Data Structures

8 Apr 2015We know that a *data structure* is a particular way of organizing data so that
we can efficiently solve a problem with it. Perhaps a *cryptographic data
structure* would just be the same thing, but with a notion of *security*
determined by an adversary's ability to win a certain game.

In a sense, they're minimal cryptosystems that can be glued together into more complex cryptosystems similar to the way that fundamental data structures can be glued together to give solutions to increasingly complicated problems.

We study data structures and algorithms because:

- Computing systems are concerned with the storage and retrieval of information.
- For systems to be economical, data must be organized (into data structures) to support efficient manipulation (by algorithms).
- Choosing the wrong algorithms and data structures makes a program slow at best and unmaintainable and insecure at worst.

In cryptography, there are lots of analogs of traditional data structures, but altered to be useful in cryptographic work and they're incredibly interesting to me because of their simplicity and power. This post is just intended to be a rundown of the more interesting ones I've found.

#### 1. Linked Lists

Linked lists aren't that special in a cryptographic setting. They take one of two forms:

- Nodes consist of
`(value, ptr)`

pairs, where`ptr`

is the location of the next item in the list (or hash, for absolute location). - Nodes consist of
`(label, value, ptr)`

triples, where`ptr`

is the`label`

of the next item in the list, if an implicit/hash labeling system is undesirable.

Mildly more interesting, given a tweakable cipher and a key `K`

, we can
encrypt the `value`

and/or `ptr`

fields of each node, using the `label`

or
location as the tweak. The linked list is useless in this form unless the
holder knows the key `K`

(in which case, he can read the entire list trivially),
so naturally, `K`

is called the *trapdoor* for this list. With an
*interval-constrained* tweakable cipher, we can explicitly restrict where the
list can be traversed. [1, 2, 4]

**Security:** If we use an *authenticated* tweakable cipher, we can
ensure that our list hasn't been tampered with. In order to talk about the
confidentiality of our linked list, we need to introduce the idea of a *leakage
function* `L`

which replaces as much of a protocol/data structure as possible
with random data without being detected. The intuition here is similar to that
of the CPA game: a PRF is secure if an adversary who obtains an arbitrary
number of (input, output) pairs still can't distinguish the function from random
anywhere else in the input-space. Once we have a specific construction of a
linked list, we can define an `L`

and say that our construction is
*L-semantically-secure*. [1, 2]

In the naive construction (one list, shuffled, no padding), we immediately leak the length of our list. If we ever add/remove from it, we leak the type and size of the operation.

Instead, on initialization, we can allocate `n`

blocks and split them into two
lists: one with application data and one with "available" blocks. If we wish
to delete a list, we join it to the list of "available" blocks. If we wish to
add/modify a list, we take so many blocks off the front of the "available"
blocks list and tie them them into our application data. Done carefully, this
approach allows us to hide the total size of our data and the type of updates we
issue to it--it affords a "stronger" leakage function. Note that the number of
blocks we need to rewrite is a linear function of the size of the change, not
the size of the already-written data, so our scheme is *ideally incremental*. [12]

#### 2. Skip Lists

There is a trivial modification of #1 that converts linked lists into skip
lists, but it's not particularly interesting and that's not what this section is
about. Instead, I'm going to talk about skip lists on *hash chains*.

Hash chains are incredibly useful in cryptography, and one major application is
*proving progression through time*. In a significantly more efficient version
of OCSP, a
certificate authority might generate a random seed `s`

(called an *anchor*) and
sign `H`

<sup>`365`

</sup>`(s)`

(called the *head*) into Alice's new certificate.
If Alice's certificate is un-revoked on day `i`

, then the certificate authority
shares `H`

<sup>`365-i`

</sup>`(s)`

with Alice so she can prove to Bob that her
certificate is fresh and valid. (Bob verifies
`H`

<sup>`i`

</sup>`(H`

<sup>`365-i`

</sup>`(s)) = H`

<sup>`365`

</sup>`(s)`

). [13]

Or, in secure routing protocols, there's a value called a `metric`

which denotes
the number of hops from the source node to the current receiver. It's very
important that we have an efficient way to authenticate the metric so when we
search for shortest paths from A to B, an adversary can't claim to possess a
shorter path than they really do (in order to waste network resources or monitor
traffic). So, one solution is to sign the head of a new random hash chain into
each of our messages and move an element of the hash chain (starting with the
anchor) up one step every time the message makes a new hop. If we get a message
with hash chain head `H`

<sup>`max`

</sup>`(s)`

claiming to have metric `j`

,
authenticated by `H`

<sup>`j`

</sup>`(s)`

, we can verify that claim by checking
`H`

<sup>`max-j`

</sup>`(H`

<sup>`j`

</sup>`(s)) = H`

<sup>`max`

</sup>`(s)`

.

Moving one step up a hash chain every hop lets us verify a message originates
from a node no less than `k`

hops away. One problem in this scheme is that if a
node receives a message with metric `k`

, they can 'forget' to increase the
metric to `k+1`

, making themselves seem better connected than they really are!
This is called *same-distance fraud* and to prevent it, we could instead embed
an authentication scheme (or one-time signature scheme) between each step in our
hash chain where the private key is `h`

<sub>`i-1`

</sub> and the public key is
`h`

<sub>`i`

</sub> (numbering in order of generation, not usage). If we get a
message from a neighbor named `Alice`

, and she's trying to pass off a message
with metric `j`

authenticated by `Bob`

, then we know to ignore her claim. [5]

In designing a complex system, though, as we embed more and more into our hash chains and make them longer and longer to accommodate finer resolutions or longer spans of time, we can easily end up putting users in a situation where they may be asked to do tens of millions of non-parallelizable hashes! This is where skip lists (or skipchains) come in.

A skip list itself is just a hash chain with keypairs for a hash-based one-time
signature scheme (like a MW-Chain) embedded in it. For each private key, we
branch off and generate a sub-chain of length `m`

and publish the signature of
the head under the corresponding keypair.

If our skip list has length `n`

and each sub-chain has length `m`

, then our
total *virtual* hash chain has length `nm`

, which can be extended further by
letting the sub-chains also be skip lists, so we get cubic/quartic/quintic...
growth in the total length. Even better, we don't need to define a priori the
entire chain--just the top-level chain and the first sub-chain--so the cost for
bare-bones initialization is `n + m`

. (The chain is called "virtual" because
it's not continuous; in a standard skip list making longer jumps is optional,
but this isn't the case here.) [5]

#### 3. Binary Search Trees

Creating a tamper-evident binary search tree is easy enough: build the normal
BST and then hash it one level at a time until you get to the top and have one
hash that authenticates the entire tree. If we don't want to do the work or
hold all of the data ourselves, we can give this tree to a third party and let
them do it. Even better--when we give them searches to run, they can
efficiently convince us that they've performed the search correctly by giving us
the *complement cover* of the nodes it found (the smallest set of nodes that
we can hash together with the search output to get the root). [8]

That works for tree-structured data, but what about flat data like logs? How do we verifiably search those?

The *history trees* of [9] give us exactly what we want! Their general idea is
this: as we build the hash tree over the flat data (each block / log entry is
a leaf node), we aggregate attributes up the tree. To formalize it, we talk
about a triple ( τ, ⊕, Γ ) where τ
represents the type of the attribute, the function
⊕: τ×τ→τ specifies how to combine the attributes of
the two children into the attribute of the parent and the function
Γ: `I`

→τ extracts the attribute from a leaf node.

For example, if a bank is storing transaction history within an untrusted data
center, they might chose the triple
( τ: `int`

, ⊕: `max`

, Γ : `trans.value`

)
because this would allow them to search for transactions over a certain value
in a given time-slot. The data center can prove to them it doesn't need
to return data in some section of the history by demonstrating that the nodes
over it all have attribute < `x`

.

Another really interesting triple is ( τ: `bool`

, ⊕: ∨, Γ : `x.isFlagged`

)
because even without the crypto, it's a surprisingly intuitive and efficient way
to search flagged data (important/unimportant events, busy/idle servers,
red/black nodes).

This scheme is tamper-evident like a few of the above because an adversary would have to find a collision in our hash function to convince us we stored something we didn't. We can also talk about the L-semantic-security of the scheme if structure-preserving encryption is applied. (In the bank example, order-preserving symmetric encryption lets us keep the history tree structure, but hides the actual transaction amounts. Our leakage function would have to capture the notion that the data center learns a total ordering of transactions, but not their real currency values.)

#### 4. 2-3 Search Trees and B-Trees

*Unlike* binary search trees, B-trees almost exclusively keep application data
in all their nodes, rather than just on the leaves, and that's for two reasons:
self-balancing and incrementality.

For *authenticated searches* and *membership proofs* (proving `x`

∈ `S`

),
the fact that B-trees are self-balancing is generally desirable and it also
prevents an adversary with control over the order in which entries are added to
the tree from making certain nodes too difficult to lookup. If we're using a
2-3 tree to verifiably store revoked certificates, it's best that an adversary
doesn't have the ability to trick the directory into timing out every time a
user asks about the validity of a target's certificate.

In section #3, the process of doing an authenticated search or membership proof
was described, but another nice operation to have is *non-membership proofs*
(proving `x`

∉ `S`

). The tree's holder can prove that a value `x`

is
*not* in the tree by following its branches like normal and demonstrating the
existence of a nil link where `x`

*should* be.

All of the above is well and good, but it's not a huge step beyond history trees. History trees are clearly only optimized for prepends or appends--like a ledger--but modifying the middle of the tree would require downloading and re-calculating almost the entire thing. No good! A critical observation in the implementation of a B-tree is that when it's modified, only nodes on the path from the affected area to the root are updated--the exact same nodes sent to us in membership and non-membership proofs.

Just by obtaining a membership or non-membership proof for an element `x`

, we
have all the information we need to perform insert/update/delete operations at
that location, calculate and publish the tree's new head, and update any
third-parties that are holding the tree for us. Because these proofs are
sufficient and only logarithmic in the total number of nodes, our scheme
qualifies as *incremental*, meaning we've created a *dynamic hash tree* (even
though that sounds like an oxymoron).

We can apply similar privacy-preserving techniques to B-trees as we did to
linked lists. Among other things, they give us *very* efficient searchable
symmetric encryption constructions, capitalizing on the ability to parallelize
enumeration through a B-tree.

[10]

(*Randomized Hash Treaps* also allow all of the operations discussed above, but
are *history independent*--an adversary querying the data structure can't learn
anything new about the order in which the elements were added--at the cost of
only being probabilistically balanced. See [14].)

#### 5. Inverted Indexes

An inverted index is a map from a keyword to a set of documents that contain that keyword and it can be constructed as a bundle of enumerable data structures (like linked lists or B-trees)

To securely index a set of encrypted documents, we build the normal inverted
index `D`

for the unencrypted documents and for each keyword `w`

in the index
we represent the associated set of documents as a data structure with the
trapdoor `f(K, w)`

--where `f`

is a PRF and `K`

is a master key--and encode it
into a pre-allocated storage space (the kind discussed in section #1).

To allow an untrusted party to search a corpus of documents for a keyword `w`

,
we send them the location of the first node and `f(K, w)`

. Knowing these two
things, the untrusted party can decrypt and enumerate the data structure to
learn `D[w]`

without learning anything else about the corpus.

The above is similar to how searchable symmetric encryption is constructed in [1], and it's a simplified version of the T-Set sub-construction of [2] and [3]. With constrained tweakable ciphers [4], we can think about hierarchical keyword searches, if that even makes sense.

**Security:** It's important that the pre-allocated storage space is a fixed
size to hide important statistical information about the encrypted documents and
the types of updates we perform, but sometimes extra leakage is tolerated for
the sake of dynamic allocation. There are two main techniques for dynamically
allocating secure memory: *joining* and *splitting*.

In the joining technique, whenever we exhaust our current amount of space, we download the entire index and re-build a new index which is twice as big. If we delete a lot of information and detect that we're using less than a quarter of our allocated space, we download the entire index and re-build a new index which is half the size (halving instead of quartering prevents thrashing).

In the splitting technique, whenever we exhaust our current amount of space, we simply create an entirely new storage space with an entirely new inverted index. In the future, to search for a keyword, the user has to issue individual queries for each storage space and the server has to consolidate all of the individual search results into one response.

If we allocate space `N`

blocks at a time, the joining technique only leaks the
index's size exponentially relative to `N`

(the `x`

variable in the expression
`2`

<sup>`x-1`

</sup>`N`

< `size`

≤ `2`

<sup>`x`

</sup>`N`

) but is incredibly
expensive, both in terms of computation and bandwidth. The splitting technique
leaks the index's size linearly relative to `N`

(the `x`

variable in the
expression `(x-1)N`

< `size`

≤ `xN`

) and some partial historical
information, but is fairly cheap on bandwidth when new allocation is required.

In the long run, though, the user will always end up handling about the same amount of data regardless of allocation strategy. It makes sense then to opt for a hybrid strategy to balance leakage and cost: joining when bandwidth and time are plentiful and splitting when they're not.

#### 6. Prefix Trees, Tries, and Patricia Trees

Discussing inverted indexes brings up another related and really cool pair of data structures: prefix trees/tries and Patricia trees.

Per usual, building them is easy enough: Take a normal prefix/Patricia tree and add hashes. They support proofs of membership and non-membership of the exact same form as B-trees and they're incremental in the same way, but they're more complex to implement, they're not necessarily self-balancing, and they'll probably have worse performance than a well-tuned B-tree.

All I'm seeing is negatives, so why are we bothering with these? Explaining that takes a bit of background.

Pseudorandom functions (PRFs) are fairly well-known, but their close relatives,
verifiable pseudorandom functions (VRFs), are not. Similar to PRFs, on
initialization we choose a random function from a family of VRFs. Unlike PRFs,
we should also be able to produce a commitment (analogous to a public key) to
our choice that commits us to using this function exclusively. Our chosen VRF
takes one element of the input space and has two outputs: one is a
pseudorandomly chosen element of the output space and the other is a proof that
the first output is the correct choice under the function. Someone with the
function's commitment, the input, and both outputs, should be able to convince
themselves that we really have the function's output on the given input
(creating a temporary breach of pseudorandomness). However, the function should
suffer *absolutely no* loss of pseudorandomness at any input point we haven't
revealed a proof for. [16]

Going back to trees, how about instead of the silly, intuitive construction
mentioned above, we refer to elements by their output under a VRF? Assuming
that you already magically have the function's commitment (most likely through
the same magic that told you the tree's head), when you ask a server to search
for element `x`

, the server could instead come back with "Okay, I'll search for
*{VRF's output on X}* because *{VRF proof}*" and then send over an authenticated
search for that. You know that it's the correct value to search for because of
the proof, and you know the search is correct because of the mechanisms
discussed above.

Using a VRF not only helps balance the tree, it also prevents sub-exponential search for elements that are lexicographically close to another (for an adversary with oracle access to the tree). Particularly for a bitwise trie, we can cut off the search after a fixed, reasonable length and add a small number of dummy nodes to make it near impossible to determine set cardinality while still having proofs that are about the same size we would get from B-trees.

Needless to say, prefix trees afford *incredibly* strong leakage functions--way
stronger than in the footnote about history-independence earlier.

[15]

#### 7. Dictionaries and Sets

Basically every data structure gives rise to a set implementation and every set gives rise to a dictionary. This is already astonishingly general and coupling it with the controlling power of cryptography produces an amazing variety of constructions and applications.

There're two primary operations:

- Proofs of Membership
- Proofs of Non-Membership

There're two primary variables for parameterizing performance: Proof of membership and proof of non-membership size relative to the size of the set.

And there're four primary variables for parameterizing functionality/security:

**Updates**∈ {Static, Partially Dynamic, Fully Dynamic} -- Does the set have to be fully defined before we can do any work, or can it change over time? Do changes require incremental updates of public information (partial) or are they transparent (fully)?**Verification**∈ {Public, Private} -- Can anybody verify proofs of membership to our set or are we the only ones?**Duplicates**∈ {Yes, No} -- Can we add duplicate elements/keys to the set?**Leakage**∈ {Yes, No} -- Do proofs of membership for any given set of elements leak anything about any other elements of the set?

To build some intuition, lets start with the simplest examples: Signatures and MACs allow proofs of membership but not proofs of non-membership. They are both fully dynamic, allow duplicates, have no leakage, and keep proof of membership size constant. Signatures are publicly verifiable and MACs are privately verifiable. The obvious application is proving membership to the set of "Things Person X Has Sent".

Accumulators or zero-knowledge sets (two names for the same thing) have been
referred to as the *decentralized version of signatures*, and that's a very
natural connection to make: you publish a "public key" (accumulated value) and
for every message you want to authenticate you have to provide a "signature"
(proof or short witness)--the 'decentralized' bit refers to the fact that they
are *static*, so they're useful for a lot of the same things as hash-based
cryptosystems and in some ways complete a time-space tradeoff. Accumulators are
usually slower but their proofs are constant-size, while hash trees are fast but
have large proofs which are logarithmic in the size of the set. File-sharing
services use either hash trees or accumulators to allow receiving peers to
authenticate the blocks they're getting one at a time, rather than all at once
at the end of the download (like they would have to do if the file was trivially
signed). [6, 7]

Now, what about duplicate keys? None of the schemes we've built so far are adequate for something like a name to public key dictionary because they can't enforce any uniqueness constraint on the key and therefore can't give an authoritative binding of one piece of identifying information to another. But you may recall, we've already solved the problem of proving a search was done correctly! Sets and dictionaries based on B-trees and prefix trees are really nice because they allow proofs of membership and non-membership of only logarithmic size and their structure inherently prevents duplicates. They can be built with hashes for public verifiability or MACs for private verifiability, and we can steal the VRF-augmented prefix tree discussed above to prevent leakage. They are at most partially dynamic, though. [12, 15]

#### 8. Condition Variables

Condition variables are data structures that allow a set of processes to
synchronize their behavior. Non-interactive verifiable secret sharing and/or
distributed commitments, on the other hand, are cryptographic primitives that
*force* a set of processes to synchronize their behavior.

In normal secret sharing schemes, a dealer distributes shares of some secret to a set of participants such that the original secret can only be reconstructed from the shares by the cooperation of a quorum of participants. In a non-interactive verifiable secret sharing scheme (distributed commitment scheme), the dealer publishes encrypted shares so that any participant can decrypt its share and non-interactively (without contacting anyone else) convince itself of the validity of the rest of the shares (proving that the dealer isn't corrupted), so that when there's a quorum of participants, they will be able to successfully derive the correct secret.

This gives a trivial way to simulate simultaneous broadcasts: On round `n`

of a
protocol, each participant publishes shares of their output for that round.
Once all participants have done so, they can begin reconstructing everybody's
message for round `n`

by publishing their decrypted shares and move on to round
`n + 1`

. As long as there's no quorum of malicious participants, no participant
can change their output in round `n`

depending on what other participants are
publishing at the same time and no participant can retract or refuse to reveal
their output for round `n`

once the reveal stage has begun.

[11]

### References

- Searchable Symmetric Encryption: Improved Definitions and Efficient Constructions
- Highly-Scalable Searchable Symmetric Encryption with Support for Boolean Queries
- Dynamic Searchable Encryption in Very-Large Databases: Data Structures and Implementation
- Constrained Pseudorandom Functions and Their Applications
- Efficient Security Mechanisms for Routing Protocols
- One-Way Accumulators: A Decentralized Alternative to Digital Signatures
- Zero-Knowledge Sets
- Fast Digital Identity Revocation
- Efficient Data Structures for Tamper-Evident Logging
- Certificate Revocation and Certificate Update
- A Practical Scheme for Non-Interactive Verifiable Secret Sharing
- Incremental Cryptography and Application to Virus Protection
- Efficient Certificate Revocation
- Balloon: A Forward-Secure Append-Only Persistent Authenticated Data Structure
- Bringing Deployable Key Transparency to End Users
- Verifiable Random Functions