plus/minus epsilon

A Generalization of Secure Distributed Hash Tables

27 Mar 2014

Peer-to-peer distributed hash tables have been the subject of a large amount of research because they solve a significant problem in computer science: the secure and efficient distribution of large amounts of data amongst a loose collection of disparate nodes, lacking any centralized authorities or hierarchies.

Systems that accomplish this open up a rich and almost limitless list of possible peer-to-peer applications, such as distributed IPv6 networks, distributed DNS, server-less messaging, trackerless torrents, distributed file systems, and so on.

Definitions & Basics

Typically, DHTs only have one exposed algorithm:

Lookup(key): Given a key, it maps the key onto a node. [2]

The user can then send RPC requests to the node, requesting the value of a key, or requesting that some value be stored under that key.

While the lack of any correlation between the key and the associated data can be useful, it is usually desirable for a system to have self-verifying path names, meaning the key for a value is derived by applying a hash function to the value. This allows the system to detect and ignore invalid data. In the case of a hash function, the bonus that this prevents adversaries from overwriting valid data is a given.

DHTs also have a few basic components:

A key identifier space
A node identifier space
Rules for associating keys to particular nodes
Per-node routing tables that refer to other nodes
Rules for updating routing tables as nodes join and fail

Underlying all of these components is usually some invariant that is required in order for the lookup protocol to function correctly. "For example, the Chord system arranges nodes in a one-dimensional (but circular) identifier space; the required invariant is that every node knows the node that immediately follows it in the identifier space. If an attacker could break this invariant, Chord would not be able to look up keys correctly."

Attacks on DHTs can be generalized by the following:

Incorrect Forwarding: Malicious nodes can forward lookups to an incorrect or non-existent node. Breaks correctness of lookups.
Incorrect Placement: Malicious nodes can simply declare (incorrectly) that a random node is the node responsible for a key. Unbalances the load and breaks correctness of lookups.
Incorrect Routing Updates: Malicious nodes can corrupt routing tables of other nodes by sending them incorrect routing updates. Breaks correctness of lookups.
Incorrect Choice of Multiple Options: Malicious nodes can abuse flexibility in the protocol to target certain nodes or force the network to choose undesirable nodes.
Partitioning into Sybil Networks: While a node is bootstrapping into a network, a malicious node can trick it into partitioning into a network controlled by the attacker. Allows the attacker to deny service or learn about the behavior of a node that would normally be hidden.
Denial of Responsible Data: Malicious nodes can join and participate in the lookup protocol correctly, but deny the existence of data it was responsible for.
Doesn't Maintain Replication Invariant: A single malicious node can prevent all replication of keys it's responsible for from happening.
Inconsistent Malicious Behavior: Malicious nodes may be able to maximize their impact by only behaving correctly for certain peers.
Overload of Targeted Nodes: Malicious nodes can target certain nodes and overload them with garbage packets.
Rapid Cycling: Malicious nodes can force the system to rebalance unnecessarily, clogging the network with excessive data transfers.
Unsolicited Messages: Malicious nodes can forge unsolicited responses to queries.

[1]

The design of components 4 and 5 draws heavily on the study of flat ad-hoc networks. There has been extensive study on hierarchical and geographic ad-hoc networks, but they often have objectives contrary to those of most DHTs, so only flat routing approaches--in which each node plays an equal role--are given attention. They tend to be further classified into either proactive (periodic) protocols, or reactive (on-demand) protocols.

Proactive protocols exchange routing information in the background regardless of load on the network, so every node in the network always knows a path to every other node. This makes them desirable for applications that require real-time communication or need to uphold QoS guarantees.

Reactive protocols exchange routing information only if needed, in order to send an actively pending communication. This type of protocol has recently become popular because of the significantly reduced overhead, allowing larger scales than proactive protocols.

Lastly, ad-hoc networks are also classified by whether they implement distance-vector routing (DV) or link state routing (LS).

DV protocols maintain a table of all destinations, containing the neighbor with the shortest advertised route to each peer. (The magnitude of the advertised route is also stored alongside the direction of the shortest route--hence the name 'distance-vector.') "DV protocols are generally known to suffer from slow route convergence and tendency of creating loops."

LS protocols require each node to have a map of the entire network topology, by flooding the network with link related information. While this approach is more robust, "in large networks, the transmission of routing information will ultimately consume most of the bandwidth and consequently block applications, rendering it unfeasible for bandwidth limited wireless ad hoc networks." [3]

Contrary to these definitions, most DHTs take advantage of their increased freedom (by being connected to the internet, rather than only nodes in a certain broadcast range), by requiring that the querying node make all connections to distant nodes itself and allowing the nodes in each hop to only output the required routing information, rather than pass on data on behalf of another. Some even go so far as to implement zero-hop protocols, where every node always knows a publicly-accessible address of every other node. While zero-hop protocols reduce latency, they're always more rigid because they lose the ability to route through some types of network partitions.

Overlay Network

A DHT overlay network is a system designed to organize a large number of cooperating nodes and improve the efficiency of their communication. This is often the hardest part to design and the location of most of a construction's security and efficiency flaws.

Therefore, it's advisable to choose a system which has already been designed for the level of security and type of efficiency desired. SEAD (proactive) [4] and SEAR (reactive) [5] are some of the most notable examples of secure ad-hoc networks, and their efficiency tradeoffs are discussed above. As well, they are resistant to attacks 3, 4, 8, and 11. (Also attack 9, if they are kept intact, rather than turned into gossip protocols for a zero-hop design.)

It should be noted, however, that both SEAD and SEAR require a form of authentication between peers. [4, 5] The easiest method to do this is to let nodes generate a random asymmetric keypair (ElGamal, for example) at startup, and use their public key as their node id in the network. Therefore, all pairs of nodes naturally have a shared secret key, and every node has a way to sign its own messages and verify signatures on the messages of others.

Trackers vs Bootstrap Nodes

Contrary to the natural design of SEAD and SEAR, where nodes are distributed in a physical space bound by the physical abilities of their equipment to transmit and receive messages, internet applications using these constructions have to find some way to insert new nodes into a network. Often either a tracker is run to allow nodes to join, or the addresses of bootstrap nodes are advertised.

A tracker is a small, efficient, server that tracks the nodes in the network, and when a new node wishes to join, it outputs the public addresses of a small subset of the nodes in the network for the new node to try and connect to. Once connected, the node can begin to learn about the network topology and participate in the protocol. If a node doesn't wish to accept the new connection (perhaps because it is already burdened), it simply rejects it and forgets about the peer.

However, a bootstrap node is a typical node in the network, and when a new node wishes to join, it sends a request to connect to that node in particular. If the node wishes to accept the connection, it does so then--if it doesn't, it passes a notice about the new node into the network and the first node that can accept a new connection contacts the new peer with a follow-up request. The most important difference to note from trackers is that new nodes are connected to by other nodes, rather than connect to other nodes.

Trackers tend to be simpler approaches and are easier to scale, while bootstrap nodes tend to be more flexible and less centralized. Trackers are usually the preferred approach though, because they require that the centralized and trusted party wishes to partition the new node into a sybil network, while almost any node can do so with bootstrap nodes--either by seeing the request first and rushing to occupy all of the node's available bandwidth, or altering the contact information in the notice to its own and becoming a man-in-the-middle.

Placement & Replication

The last and most important part of a DHT is the placement and replication of key-value pairs. With an overlay network like SEAD, where every node knows of every other node, consistent hashing is the natural choice to determine where keys belong, because it defends against attack 10 as efficiently as possible. However, SEAR doesn't maintain an index of all active peers on its own, so either a different mechanism would need to be found or SEAR could borrow SEAD's concept of sequence numbers and routing tables, but omit the periodic updates and distance vectors. [5]

Also, replication needs to be implemented in a way that distributes keys among a randomly chosen subset of nodes in the network. As mentioned before about self-verifying path names, the key for a value is derived by computing H(v), where H is some cryptographic hash function and v is the value. From that, it's often extended to H<sup>n</sup>(v), where the output is the nth replica of v. Upon the insertion of a key, a fixed number of replicas would also be inserted into the DHT, to reduce the impacts of attacks 1 and 6. Attack 7 can be mitigated by pushing the responsibility for maintaining a key and it's replicas to either the nodes responsible for them, or the nodes which have an active interest in keeping a certain key in the table. [1]

Lastly, to defend against attack 2, it's important that a node agree that a key belongs to it when given a set request. With consistent hashing, this is as trivial as requiring that nodes always check that a given key hashes to their own id. [1]

Conclusion

Distributed hash tables that are as-secure-as-possible with efficiencies catered to particular use cases can be trivially designed and implemented by only making a few (if any) decisions about the goal of the end product.

References

"Security Considerations for Peer-to-Peer Distributed Hash Tables"
"Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications"
"Scalable Routing Protocols for Mobile Ad Hoc Networks"
"SEAD: secure efficient distance vector routing for mobile wireless ad hoc networks"
"SEAR: A Secure Efficient Ad Hoc On Demand Routing Protocol for Wireless Networks"

Cryptography Distributed Hash Tables