Anonymous Routing
Let me ask you this: if your communication is encrypted, is your privacy protected? It isn't really protected if someone watching the bytes fly by knows to whom you are speaking and when, is it? Remember that routing involves unencrypted headers that contain source and destination information.
Today, we are going to explore anonymous routing, techniques for routing that protect the identity of the sender and the receiver.
What Do We Need?
We need a way of sending a message from one machine to another machine without anyone in the middle knowing from where it came or to whom it is going. This is almost possible. The first node is going to know where it came from. And, the last node is going to know where it went. But, the nodes in the middle can be like a random, long path from Point-A to Point-B, as much to generate confusion as to get from one place to another.

If I were taking you for a drive and wanted to keep the location secret from you (think CIA safe house), other than blindfolding you, I'd pick the route in advance. It wouldn't be the shortest or simplest path. It would just be confusing. And, to make sure no one accidentally told you anything, I might use a bunch of different taxis, each one just knowing to get you from Point-A to Point-B, not where you were eventually going, not where you originally came from, and not necessarily even which was the last stop.
And, every time I took someone to the safe house, I'd use a different route and different drivers, so no one figured anything out, e.g. "there's usually another driver waiting over there" or "last stop".
And, of course, if you happened to be carrying any secrets, I'd want them encrypted and inconspicuously handcuffed to your wrist.
How Do We Achieve This In Software?
First, we need to establish the network of taxi drivers. We need a set of nodes that will do this type of hop-to-hop routing for us, accepting messages and forwarding them as needed.

Then we need a way of expressing our route to those among this network of nodes that we choose to use -- without letting them all see it. To do this, we first pick our nodes, and then negotiate with each of them a key to use to encrypt the messages for their hop. What they get when they decrypt each message is another message. The payload is encrypted, but the header is not, and it gives them the next hop. The last hop hands off the message, either encrypted such that the destination can decrypt it, or not encrypted at all.
So, first we pick some arbitrary route. Then, we negotiate this set of keys, one with each hop. Finally, we encrypt the message in layers, so each hop can decrypt its layer in turn. Now, we send our message to the first router, and it flies from there.
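To make the layering concrete, here is a minimal Python sketch. It assumes the per-hop keys have already been negotiated (in real Tor, circuits are built with telescoping key exchanges), and names like `build_onion` and `relay_step` are purely illustrative:

```python
import json
from cryptography.fernet import Fernet  # third-party: pip install cryptography

def build_onion(route, keys, destination, message):
    """Wrap `message` in one encryption layer per hop, innermost first.
    Each relay can decrypt only its own layer, which reveals just the
    next hop and an opaque inner blob."""
    blob = message
    for i in range(len(route) - 1, -1, -1):
        next_addr = route[i + 1] if i + 1 < len(route) else destination
        layer = json.dumps({"next": next_addr,
                            "payload": blob.decode()}).encode()
        blob = Fernet(keys[i]).encrypt(layer)
    return blob

def relay_step(key, blob):
    """What one relay does: peel one layer, learn only the next hop."""
    layer = json.loads(Fernet(key).decrypt(blob))
    return layer["next"], layer["payload"].encode()

route = ["relayA", "relayB", "relayC"]
keys = [Fernet.generate_key() for _ in route]  # stand-in for key negotiation
blob = build_onion(route, keys, "server", b"hello")
for key in keys:                               # each relay peels in turn
    next_hop, blob = relay_step(key, blob)
    print("forward to", next_hop)
print(blob)                                    # b'hello' reaches the destination
```

Note that each relay learns only the next hop: only the first relay ever sees the sender, and only the last ever sees the destination.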
What we have created here is essentially a virtual circuit. We've set up a path from Point-A to Point-B. But, since we had to negotiate all of these keys, we want to use it for at least a while. So, we send a bunch of messages through before creating a new one. This amortizes the setup cost over many messages.
But, of course, not for too much time. We want to switch routes again fairly soon, so that traffic analysis doesn't reveal our pattern of traffic through this set of routers.
Tor
The system as described above is essentially Tor. In addition, Tor offers anonymous rendezvous services. In other words, it allows clients and servers to connect anonymously to each other, using the anonymous Tor network to generate a virtual meeting place that is not at either system.

Tor hides the actual routes. And, it uses virtual circuits for only about 10 minutes, to balance the amortization of the cost with the desire to keep things changing.
Not as Strong As It Seems
With a relatively small number of compromised nodes, it is possible to trace Tor traffic. Traffic analysis might show that the source, the destination, and certain links are "hot" at certain times. Or, certain timing artifacts might be noticed at one node and observed at another.
One can try to mitigate these problems in several ways:
- Increasing the size of the system to reduce the impact of imposters
- Generating a bunch of bogus traffic to dilute the real traffic
- Bundling messages together somewhat arbitrarily at the various nodes, only unbundling them later to mix the traffic streams
- Introducing random timing artifacts, such as jitter, delay, or reordering
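As a toy Python sketch of the bundling-and-reordering ideas (the class name and parameters are made up; real mix designs are far more careful):

```python
import random
import time

class ToyMix:
    """Hold messages until a batch fills, then forward them shuffled
    after a random delay, so output order and timing no longer match
    the input stream."""
    def __init__(self, batch_size=4, max_jitter=0.5):
        self.batch_size = batch_size
        self.max_jitter = max_jitter
        self.pending = []

    def accept(self, msg, forward):
        self.pending.append(msg)
        if len(self.pending) >= self.batch_size:
            random.shuffle(self.pending)                    # reorder
            time.sleep(random.uniform(0, self.max_jitter))  # jitter
            for m in self.pending:
                forward(m)
            self.pending = []

mix = ToyMix()
for i in range(8):
    mix.accept(f"msg{i}", forward=print)  # prints two shuffled batches
```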
Breaking Tor's anonymity and figuring out how to make it stronger are still active research areas. Right now, I think those breaking it are winning.
Peer-To-Peer Systems
We began class with a discussion about peer-to-peer models vs client-server models. The upshot of the discussion is that the client-server model generates a service hot-spot at the server and increasingly hot network utilization the closer one gets to the server. Traffic that is diffuse toward the clients is concentrated near the server.

Distributing the service helps. But, as long as clients outnumber servers, there is a natural imbalance. The solution to this imbalance is a peer-to-peer architecture, where services are provided (hopefully, but not likely) in proportion to their use.
Peer-to-Peer Challenges
Peer-to-peer systems present several challenges:
- Describing and searching for content
- Naming
- Finding objects and directory services
- Stability of the peers
- Trust of the peers
It becomes difficult to search for objects in peer-to-peer systems, because high-level searches don't localize well. For this type of thing, we really do want a distributed map-reduce or other parallel search, or a large, in-memory monolithic database, or both. We don't want to have to ask a large number of distant peers and then need to coordinate the results.
Naming needs to be universal. How do we know that there aren't multiple "Greg Kesdens" in the world, or "Super Distributed Systems Textbooks"? There are a lot of ways we could address this. And, we discussed some in class, most of which could identify buckets of results. But, for the rest of the discussion, we're going to assume everything is named by a hash of its contents -- this way, the name (virtually) guarantees that it identifies what we expect.
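A quick Python sketch of naming by content hash; `content_name` is just an illustrative helper:

```python
import hashlib

def content_name(data: bytes) -> str:
    """Name an object by a hash of its contents."""
    return hashlib.sha256(data).hexdigest()

blob = b"...the full bytes of the object..."
name = content_name(blob)

# Later, after fetching the object from an untrusted peer, anyone
# can check that what they got matches the name they asked for:
assert content_name(blob) == name
```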
Once we can identify each object we want, we need to find it. One option is a fully distributed directory service (ouch!), another is a directory service distributed among select peers (super peers?), and a third is a distributed hash -- we're going to focus on that approach for the rest of class.
Stability of peers is obviously an issue. Some peers will be well resourced and stable; others will be thin and brittle. The stability of the system depends upon having enough stable resources to mitigate the impact of a smaller quantity of brittle resources. One common solution here is to appoint willing, richer, more stable, longer-serving peers as "super peers" and give them more responsibility, perhaps incentivizing them by providing more or better service (or just by the good feeling from being a good citizen).
Trust is the nearly impossible part. Insert the whole discussion about public key infrastructure here. It is really challenging to trust the identity of hosts. With luck, we can get a hash of what we want from enough sources trusted enough to trust the answer -- and then we can check that what we get matches the hash. Thus, the most brittle part of this is really trusting the search results and/or the human-name to hash-name mapping.
Distributed Hashing: Consistent Hashing and Chords
Another idea for a peer-to-peer system is to implement a huge distributed hash table. The problem with traditional hash tables, though, is that they don't handle growth well. The hash function produces a large number, which is then taken modulo the table size. As a result, the table size can't change without rehashing the entire table -- otherwise the keys can't be found.

A consistent hashing scheme is one that makes the hash value independent of the table size. The result is that keys can be found even if the table size changes. In the context of a distributed hash table, this means that keys can be found even if nodes enter (and possibly leave) the system.
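A tiny Python demonstration of the problem with the traditional scheme (the key names are arbitrary): growing the table from 8 to 9 slots moves most of the keys.

```python
import hashlib

def bucket(key: str, table_size: int) -> int:
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return h % table_size          # traditional hashing: mod table size

keys = ["alpha", "beta", "gamma", "delta", "epsilon"]
before = {k: bucket(k, 8) for k in keys}
after = {k: bucket(k, 9) for k in keys}            # grow the table by one slot
print([k for k in keys if before[k] != after[k]])  # most keys move
```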
One technique for doing this is the Chord protocol. This protocol views the world as a logical ring. Given m-bit keys, it has logical positions 0 ... 2^m - 1. Think of them as hours on a clock. Some of these positions have actual nodes assigned to them; others do not. Like a token ring, each node need only know its successor, but it actually knows the topology of the entire ring in order to handle failures.
![The Chord ring, with nodes occupying some of the 2^m positions]()
Since there are fewer nodes than actual addresses (hours on the clock), each node can be responsible for more than one key. Keys are mapped to actual nodes by assigning them to the "closest" node with an equal or greater number.
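A minimal sketch of this mapping in Python, using a small 8-bit ring and made-up node positions:

```python
from bisect import bisect_left

RING = 2 ** 8                 # an 8-bit ring: positions 0 .. 255

def successor(nodes, position):
    """The "closest" node with an equal or greater number, wrapping
    around the ring past the last node."""
    i = bisect_left(nodes, position % RING)
    return nodes[i % len(nodes)]

nodes = [10, 80, 160]         # made-up node positions, sorted
print(successor(nodes, 100))  # key 100 -> node 160
print(successor(nodes, 200))  # key 200 wraps around -> node 10
```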
In order to find a key, we could do a brute-force search around the circle, but instead each node keeps a "finger" pointing to the node responsible for the position 1 away, 2 away, 4 away, 8 away, and so on. In other words, each node keeps pointers to nodes exponentially farther and farther away.
![A node's finger table: pointers to nodes 1, 2, 4, 8, ... positions away]()
These pointers are stored in a table such that the ith entry of the table contains a pointer to the node 2^i away from it, i.e. at position node_number + 2^i. As with keys, if a node is not present at the exact location, the next greater node is used. This arrangement makes it possible to search for a bucket in O(log n) time, because, with each step, we either find the right node or cut the remaining search space in half.
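Below is a self-contained Python sketch of the finger table and a greedy lookup. It simulates the routing locally for clarity; in real Chord, the query is forwarded from node to node, but the hop pattern is the same. The ring size and names are illustrative:

```python
import hashlib
from bisect import bisect_left

M = 8                          # m-bit identifiers: positions 0 .. 2**M - 1
RING = 2 ** M

def ident(name: str) -> int:
    return int(hashlib.sha256(name.encode()).hexdigest(), 16) % RING

def successor(nodes, position):
    i = bisect_left(nodes, position % RING)
    return nodes[i % len(nodes)]

def clockwise(a, b):
    return (b - a) % RING      # clockwise distance from a to b

def finger_table(nodes, n):
    """Entry i points to the node responsible for position n + 2**i."""
    return [successor(nodes, n + 2 ** i) for i in range(M)]

def lookup(nodes, start, key):
    """Greedy finger routing: at each hop, jump to the farthest finger
    that does not pass the key, roughly halving the remaining distance."""
    owner = successor(nodes, key)  # ground truth, for the demo
    n, hops = start, 0
    while n != owner:
        fingers = [f for f in finger_table(nodes, n)
                   if 0 < clockwise(n, f) <= clockwise(n, key)]
        n = max(fingers, key=lambda f: clockwise(n, f)) if fingers else owner
        hops += 1
    return n, hops

nodes = sorted({ident(f"node{i}") for i in range(16)})
key = ident("my-file.txt")
print(lookup(nodes, nodes[0], key))  # (owning node, number of hops)
```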
In order for a node to join, it is simply added at an unrepresented position (hour on the clock) within the hash table. It gets its portion of the keys from its successor, and then goes live. Similarly, disappearing from the hash simply involves spilling one's keys to one's successor.
![A node joining the ring and taking over keys from its successor]()
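A minimal, self-contained sketch of the key hand-off, again with made-up positions (`store` maps each node to the keys it currently holds):

```python
from bisect import bisect_left, insort

RING = 2 ** 8

def successor(nodes, position):
    i = bisect_left(nodes, position % RING)
    return nodes[i % len(nodes)]

def join(nodes, store, new_id):
    """The new node takes over, from its successor, exactly the keys
    that now map to it; nothing else in the system moves."""
    succ = successor(nodes, new_id + 1)
    insort(nodes, new_id)
    store[new_id] = {}
    for k in list(store[succ]):
        if successor(nodes, k) == new_id:
            store[new_id][k] = store[succ].pop(k)

def leave(nodes, store, node_id):
    """A departing node spills all of its keys to its successor."""
    nodes.remove(node_id)
    succ = successor(nodes, node_id + 1)
    store[succ].update(store.pop(node_id))

nodes = [10, 80, 160]
store = {10: {200: "a", 5: "b"}, 80: {40: "c"}, 160: {100: "d"}}
join(nodes, store, 120)   # node 120 takes key 100 from node 160
leave(nodes, store, 80)   # node 80's key 40 spills to node 120
print(store)
```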
Credit: Thanks to Dave Anderson at CMU for the pictures above!
Distributed Hashing and Fault Tolerance
Fault tolerance is likely managed (a) at the node level, and/or (b) at the system level by replication. To solve this problem, you can more-or-less apply what we've already learned about checkpointing, logging, replication, etc.
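One common approach in DHTs (the details vary by system) is to replicate each key across its owner and the owner's next few successors, so the data survives individual node failures. A minimal sketch, with made-up positions:

```python
from bisect import bisect_left

RING = 2 ** 8

def replica_nodes(nodes, key, r=3):
    """Place each key on its owner plus the next r-1 successors, so it
    survives up to r-1 simultaneous node failures."""
    i = bisect_left(nodes, key % RING) % len(nodes)
    return [nodes[(i + j) % len(nodes)] for j in range(min(r, len(nodes)))]

nodes = [10, 80, 160, 220]        # made-up node positions
print(replica_nodes(nodes, 100))  # key 100 -> [160, 220, 10]
```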