plus/minus epsilon

Cryptographic Data Structures

8 Apr 2015

We know that a data structure is a particular way of organizing data so that we can efficiently solve a problem with it. Perhaps a cryptographic data structure would just be the same thing, but with a notion of security determined by an adversary's ability to win a certain game.

In a sense, they're minimal cryptosystems that can be glued together into more complex cryptosystems similar to the way that fundamental data structures can be glued together to give solutions to increasingly complicated problems.

We study data structures and algorithms because:

Computing systems are concerned with the storage and retrieval of information.

For systems to be economical, data must be organized (into data structures) to support efficient manipulation (by algorithms).

Choosing the wrong algorithms and data structures makes a program slow at best and unmaintainable and insecure at worst.

In cryptography, there are lots of analogs of traditional data structures, but altered to be useful in cryptographic work and they're incredibly interesting to me because of their simplicity and power. This post is just intended to be a rundown of the more interesting ones I've found.

1. Linked Lists

Linked lists aren't that special in a cryptographic setting. They take one of two forms:

Nodes consist of (value, ptr) pairs, where ptr is the location of the next item in the list (or hash, for absolute location).
Nodes consist of (label, value, ptr) triples, where ptr is the label of the next item in the list, if an implicit/hash labeling system is undesirable.

Mildly more interesting, given a tweakable cipher and a key K, we can encrypt the value and/or ptr fields of each node, using the label or location as the tweak. The linked list is useless in this form unless the holder knows the key K (in which case, he can read the entire list trivially), so naturally, K is called the trapdoor for this list. With an interval-constrained tweakable cipher, we can explicitly restrict where the list can be traversed. [1, 2, 4]

Security: If we use an authenticated tweakable cipher, we can ensure that our list hasn't been tampered with. In order to talk about the confidentiality of our linked list, we need to introduce the idea of a leakage function L which replaces as much of a protocol/data structure as possible with random data without being detected. The intuition here is similar to that of the CPA game: a PRF is secure if an adversary who obtains an arbitrary number of (input, output) pairs still can't distinguish the function from random anywhere else in the input-space. Once we have a specific construction of a linked list, we can define an L and say that our construction is L-semantically-secure. [1, 2]

In the naive construction (one list, shuffled, no padding), we immediately leak the length of our list. If we ever add/remove from it, we leak the type and size of the operation.

Instead, on initialization, we can allocate n blocks and split them into two lists: one with application data and one with "available" blocks. If we wish to delete a list, we join it to the list of "available" blocks. If we wish to add/modify a list, we take so many blocks off the front of the "available" blocks list and tie them them into our application data. Done carefully, this approach allows us to hide the total size of our data and the type of updates we issue to it--it affords a "stronger" leakage function. Note that the number of blocks we need to rewrite is a linear function of the size of the change, not the size of the already-written data, so our scheme is ideally incremental. [12]

2. Skip Lists

There is a trivial modification of #1 that converts linked lists into skip lists, but it's not particularly interesting and that's not what this section is about. Instead, I'm going to talk about skip lists on hash chains.

Hash chains are incredibly useful in cryptography, and one major application is proving progression through time. In a significantly more efficient version of OCSP, a certificate authority might generate a random seed s (called an anchor) and sign H365(s) (called the head) into Alice's new certificate. If Alice's certificate is un-revoked on day i, then the certificate authority shares H365-i(s) with Alice so she can prove to Bob that her certificate is fresh and valid. (Bob verifies Hi(H365-i(s)) = H365(s)). [13]

Or, in secure routing protocols, there's a value called a metric which denotes the number of hops from the source node to the current receiver. It's very important that we have an efficient way to authenticate the metric so when we search for shortest paths from A to B, an adversary can't claim to possess a shorter path than they really do (in order to waste network resources or monitor traffic). So, one solution is to sign the head of a new random hash chain into each of our messages and move an element of the hash chain (starting with the anchor) up one step every time the message makes a new hop. If we get a message with hash chain head Hmax(s) claiming to have metric j, authenticated by Hj(s), we can verify that claim by checking Hmax-j(Hj(s)) = Hmax(s).

Moving one step up a hash chain every hop lets us verify a message originates from a node no less than k hops away. One problem in this scheme is that if a node receives a message with metric k, they can 'forget' to increase the metric to k+1, making themselves seem better connected than they really are! This is called same-distance fraud and to prevent it, we could instead embed an authentication scheme (or one-time signature scheme) between each step in our hash chain where the private key is hi-1 and the public key is hi (numbering in order of generation, not usage). If we get a message from a neighbor named Alice, and she's trying to pass off a message with metric j authenticated by Bob, then we know to ignore her claim. [5]

In designing a complex system, though, as we embed more and more into our hash chains and make them longer and longer to accommodate finer resolutions or longer spans of time, we can easily end up putting users in a situation where they may be asked to do tens of millions of non-parallelizable hashes! This is where skip lists (or skipchains) come in.

A skip list itself is just a hash chain with keypairs for a hash-based one-time signature scheme (like a MW-Chain) embedded in it. For each private key, we branch off and generate a sub-chain of length m and publish the signature of the head under the corresponding keypair.

If our skip list has length n and each sub-chain has length m, then our total virtual hash chain has length nm, which can be extended further by letting the sub-chains also be skip lists, so we get cubic/quartic/quintic... growth in the total length. Even better, we don't need to define a priori the entire chain--just the top-level chain and the first sub-chain--so the cost for bare-bones initialization is n + m. (The chain is called "virtual" because it's not continuous; in a standard skip list making longer jumps is optional, but this isn't the case here.) [5]

3. Binary Search Trees

Creating a tamper-evident binary search tree is easy enough: build the normal BST and then hash it one level at a time until you get to the top and have one hash that authenticates the entire tree. If we don't want to do the work or hold all of the data ourselves, we can give this tree to a third party and let them do it. Even better--when we give them searches to run, they can efficiently convince us that they've performed the search correctly by giving us the complement cover of the nodes it found (the smallest set of nodes that we can hash together with the search output to get the root). [8]

That works for tree-structured data, but what about flat data like logs? How do we verifiably search those?

The history trees of [9] give us exactly what we want! Their general idea is this: as we build the hash tree over the flat data (each block / log entry is a leaf node), we aggregate attributes up the tree. To formalize it, we talk about a triple ( τ, ⊕, Γ ) where τ represents the type of the attribute, the function ⊕: τ×τ→τ specifies how to combine the attributes of the two children into the attribute of the parent and the function Γ: I→τ extracts the attribute from a leaf node.

For example, if a bank is storing transaction history within an untrusted data center, they might chose the triple ( τ: int, ⊕: max, Γ : trans.value) because this would allow them to search for transactions over a certain value in a given time-slot. The data center can prove to them it doesn't need to return data in some section of the history by demonstrating that the nodes over it all have attribute < x.

Another really interesting triple is ( τ: bool, ⊕: ∨, Γ : x.isFlagged) because even without the crypto, it's a surprisingly intuitive and efficient way to search flagged data (important/unimportant events, busy/idle servers, red/black nodes).

This scheme is tamper-evident like a few of the above because an adversary would have to find a collision in our hash function to convince us we stored something we didn't. We can also talk about the L-semantic-security of the scheme if structure-preserving encryption is applied. (In the bank example, order-preserving symmetric encryption lets us keep the history tree structure, but hides the actual transaction amounts. Our leakage function would have to capture the notion that the data center learns a total ordering of transactions, but not their real currency values.)

4. 2-3 Search Trees and B-Trees

Unlike binary search trees, B-trees almost exclusively keep application data in all their nodes, rather than just on the leaves, and that's for two reasons: self-balancing and incrementality.

For authenticated searches and membership proofs (proving x ∈ S), the fact that B-trees are self-balancing is generally desirable and it also prevents an adversary with control over the order in which entries are added to the tree from making certain nodes too difficult to lookup. If we're using a 2-3 tree to verifiably store revoked certificates, it's best that an adversary doesn't have the ability to trick the directory into timing out every time a user asks about the validity of a target's certificate.

In section #3, the process of doing an authenticated search or membership proof was described, but another nice operation to have is non-membership proofs (proving x ∉ S). The tree's holder can prove that a value x is not in the tree by following its branches like normal and demonstrating the existence of a nil link where x should be.

All of the above is well and good, but it's not a huge step beyond history trees. History trees are clearly only optimized for prepends or appends--like a ledger--but modifying the middle of the tree would require downloading and re-calculating almost the entire thing. No good! A critical observation in the implementation of a B-tree is that when it's modified, only nodes on the path from the affected area to the root are updated--the exact same nodes sent to us in membership and non-membership proofs.

Just by obtaining a membership or non-membership proof for an element x, we have all the information we need to perform insert/update/delete operations at that location, calculate and publish the tree's new head, and update any third-parties that are holding the tree for us. Because these proofs are sufficient and only logarithmic in the total number of nodes, our scheme qualifies as incremental, meaning we've created a dynamic hash tree (even though that sounds like an oxymoron).

We can apply similar privacy-preserving techniques to B-trees as we did to linked lists. Among other things, they give us very efficient searchable symmetric encryption constructions, capitalizing on the ability to parallelize enumeration through a B-tree.

[10]

(Randomized Hash Treaps also allow all of the operations discussed above, but are history independent--an adversary querying the data structure can't learn anything new about the order in which the elements were added--at the cost of only being probabilistically balanced. See [14].)

5. Inverted Indexes

An inverted index is a map from a keyword to a set of documents that contain that keyword and it can be constructed as a bundle of enumerable data structures (like linked lists or B-trees)

To securely index a set of encrypted documents, we build the normal inverted index D for the unencrypted documents and for each keyword w in the index we represent the associated set of documents as a data structure with the trapdoor f(K, w)--where f is a PRF and K is a master key--and encode it into a pre-allocated storage space (the kind discussed in section #1).

To allow an untrusted party to search a corpus of documents for a keyword w, we send them the location of the first node and f(K, w). Knowing these two things, the untrusted party can decrypt and enumerate the data structure to learn D[w] without learning anything else about the corpus.

The above is similar to how searchable symmetric encryption is constructed in [1], and it's a simplified version of the T-Set sub-construction of [2] and [3]. With constrained tweakable ciphers [4], we can think about hierarchical keyword searches, if that even makes sense.

Security: It's important that the pre-allocated storage space is a fixed size to hide important statistical information about the encrypted documents and the types of updates we perform, but sometimes extra leakage is tolerated for the sake of dynamic allocation. There are two main techniques for dynamically allocating secure memory: joining and splitting.

In the joining technique, whenever we exhaust our current amount of space, we download the entire index and re-build a new index which is twice as big. If we delete a lot of information and detect that we're using less than a quarter of our allocated space, we download the entire index and re-build a new index which is half the size (halving instead of quartering prevents thrashing).

In the splitting technique, whenever we exhaust our current amount of space, we simply create an entirely new storage space with an entirely new inverted index. In the future, to search for a keyword, the user has to issue individual queries for each storage space and the server has to consolidate all of the individual search results into one response.

If we allocate space N blocks at a time, the joining technique only leaks the index's size exponentially relative to N (the x variable in the expression 2x-1N < size ≤ 2xN) but is incredibly expensive, both in terms of computation and bandwidth. The splitting technique leaks the index's size linearly relative to N (the x variable in the expression (x-1)N < size ≤ xN) and some partial historical information, but is fairly cheap on bandwidth when new allocation is required.

In the long run, though, the user will always end up handling about the same amount of data regardless of allocation strategy. It makes sense then to opt for a hybrid strategy to balance leakage and cost: joining when bandwidth and time are plentiful and splitting when they're not.

6. Prefix Trees, Tries, and Patricia Trees

Discussing inverted indexes brings up another related and really cool pair of data structures: prefix trees/tries and Patricia trees.

Per usual, building them is easy enough: Take a normal prefix/Patricia tree and add hashes. They support proofs of membership and non-membership of the exact same form as B-trees and they're incremental in the same way, but they're more complex to implement, they're not necessarily self-balancing, and they'll probably have worse performance than a well-tuned B-tree.

All I'm seeing is negatives, so why are we bothering with these? Explaining that takes a bit of background.

Pseudorandom functions (PRFs) are fairly well-known, but their close relatives, verifiable pseudorandom functions (VRFs), are not. Similar to PRFs, on initialization we choose a random function from a family of VRFs. Unlike PRFs, we should also be able to produce a commitment (analogous to a public key) to our choice that commits us to using this function exclusively. Our chosen VRF takes one element of the input space and has two outputs: one is a pseudorandomly chosen element of the output space and the other is a proof that the first output is the correct choice under the function. Someone with the function's commitment, the input, and both outputs, should be able to convince themselves that we really have the function's output on the given input (creating a temporary breach of pseudorandomness). However, the function should suffer absolutely no loss of pseudorandomness at any input point we haven't revealed a proof for. [16]

Going back to trees, how about instead of the silly, intuitive construction mentioned above, we refer to elements by their output under a VRF? Assuming that you already magically have the function's commitment (most likely through the same magic that told you the tree's head), when you ask a server to search for element x, the server could instead come back with "Okay, I'll search for {VRF's output on X} because {VRF proof}" and then send over an authenticated search for that. You know that it's the correct value to search for because of the proof, and you know the search is correct because of the mechanisms discussed above.

Using a VRF not only helps balance the tree, it also prevents sub-exponential search for elements that are lexicographically close to another (for an adversary with oracle access to the tree). Particularly for a bitwise trie, we can cut off the search after a fixed, reasonable length and add a small number of dummy nodes to make it near impossible to determine set cardinality while still having proofs that are about the same size we would get from B-trees.

Needless to say, prefix trees afford incredibly strong leakage functions--way stronger than in the footnote about history-independence earlier.

[15]

7. Dictionaries and Sets

Basically every data structure gives rise to a set implementation and every set gives rise to a dictionary. This is already astonishingly general and coupling it with the controlling power of cryptography produces an amazing variety of constructions and applications.

There're two primary operations:

Proofs of Membership
Proofs of Non-Membership

There're two primary variables for parameterizing performance: Proof of membership and proof of non-membership size relative to the size of the set.

And there're four primary variables for parameterizing functionality/security:

Updates ∈ {Static, Partially Dynamic, Fully Dynamic} -- Does the set have to be fully defined before we can do any work, or can it change over time? Do changes require incremental updates of public information (partial) or are they transparent (fully)?
Verification ∈ {Public, Private} -- Can anybody verify proofs of membership to our set or are we the only ones?
Duplicates ∈ {Yes, No} -- Can we add duplicate elements/keys to the set?
Leakage ∈ {Yes, No} -- Do proofs of membership for any given set of elements leak anything about any other elements of the set?

To build some intuition, lets start with the simplest examples: Signatures and MACs allow proofs of membership but not proofs of non-membership. They are both fully dynamic, allow duplicates, have no leakage, and keep proof of membership size constant. Signatures are publicly verifiable and MACs are privately verifiable. The obvious application is proving membership to the set of "Things Person X Has Sent".

Accumulators or zero-knowledge sets (two names for the same thing) have been referred to as the decentralized version of signatures, and that's a very natural connection to make: you publish a "public key" (accumulated value) and for every message you want to authenticate you have to provide a "signature" (proof or short witness)--the 'decentralized' bit refers to the fact that they are static, so they're useful for a lot of the same things as hash-based cryptosystems and in some ways complete a time-space tradeoff. Accumulators are usually slower but their proofs are constant-size, while hash trees are fast but have large proofs which are logarithmic in the size of the set. File-sharing services use either hash trees or accumulators to allow receiving peers to authenticate the blocks they're getting one at a time, rather than all at once at the end of the download (like they would have to do if the file was trivially signed). [6, 7]

Now, what about duplicate keys? None of the schemes we've built so far are adequate for something like a name to public key dictionary because they can't enforce any uniqueness constraint on the key and therefore can't give an authoritative binding of one piece of identifying information to another. But you may recall, we've already solved the problem of proving a search was done correctly! Sets and dictionaries based on B-trees and prefix trees are really nice because they allow proofs of membership and non-membership of only logarithmic size and their structure inherently prevents duplicates. They can be built with hashes for public verifiability or MACs for private verifiability, and we can steal the VRF-augmented prefix tree discussed above to prevent leakage. They are at most partially dynamic, though. [12, 15]

8. Condition Variables

Condition variables are data structures that allow a set of processes to synchronize their behavior. Non-interactive verifiable secret sharing and/or distributed commitments, on the other hand, are cryptographic primitives that force a set of processes to synchronize their behavior.

In normal secret sharing schemes, a dealer distributes shares of some secret to a set of participants such that the original secret can only be reconstructed from the shares by the cooperation of a quorum of participants. In a non-interactive verifiable secret sharing scheme (distributed commitment scheme), the dealer publishes encrypted shares so that any participant can decrypt its share and non-interactively (without contacting anyone else) convince itself of the validity of the rest of the shares (proving that the dealer isn't corrupted), so that when there's a quorum of participants, they will be able to successfully derive the correct secret.

This gives a trivial way to simulate simultaneous broadcasts: On round n of a protocol, each participant publishes shares of their output for that round. Once all participants have done so, they can begin reconstructing everybody's message for round n by publishing their decrypted shares and move on to round n + 1. As long as there's no quorum of malicious participants, no participant can change their output in round n depending on what other participants are publishing at the same time and no participant can retract or refuse to reveal their output for round n once the reveal stage has begun.

[11]

References

Searchable Symmetric Encryption: Improved Definitions and Efficient Constructions
Highly-Scalable Searchable Symmetric Encryption with Support for Boolean Queries
Dynamic Searchable Encryption in Very-Large Databases: Data Structures and Implementation
Constrained Pseudorandom Functions and Their Applications
Efficient Security Mechanisms for Routing Protocols
One-Way Accumulators: A Decentralized Alternative to Digital Signatures
Zero-Knowledge Sets
Fast Digital Identity Revocation
Efficient Data Structures for Tamper-Evident Logging
Certificate Revocation and Certificate Update
A Practical Scheme for Non-Interactive Verifiable Secret Sharing
Incremental Cryptography and Application to Virus Protection
Efficient Certificate Revocation
Balloon: A Forward-Secure Append-Only Persistent Authenticated Data Structure
Bringing Deployable Key Transparency to End Users
Verifiable Random Functions

Cryptography Data Structures