Tag: distributed

Distributed Computing so far

I gave a half-hour presentation on my research at the admitted students weekend earlier. I think I was able to convey some sense of the research opportunities Mudd offers, and to give a reasonable picture of what that research can look like.

The talk explained the concepts behind distributed hash tables and introduced the work I’ve been doing with BitTorrent.

These are the slides I used: Admitted Students presentation


on p2p in the browser

The final project I want to finish this semester in my distributed computing independent study is a way to access P2P capabilities from within the browser. I was originally hoping to do this with a JavaScript API that relied behind the scenes on Flash’s new P2P protocol, RTMFP. The protocol has been around since last spring, but developing for it is far from convenient. You need to use Flash to build the movie, and you need a proprietary and expensive piece of server software to coordinate data transfers. The protocol itself is still proprietary, meaning that Adobe won’t actually tell you how to build a server that works with it, or how data is sent across the wire.

Adobe has at least started advertising it, so it will gain some prominence and the details will eventually come to light, but for now it is unrealistic to build a general, server-agnostic service on Flash. (http://www.ietf.org/proceedings/10mar/slides/tsvarea-1.pdf)

Instead, my plan is to modify the open-source Privoxy software and add P2P capabilities at that layer. My eventual goal is a communal Greasemonkey-style system, where users modify pages through pieces of JavaScript and can then share those scripts with friends. With the right level of abstraction, I think this can be a very powerful system. Starting next week, I’ll begin reading and hacking on Privoxy to find out how to integrate new code into the project.
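To make the Greasemonkey-style idea concrete, here is a minimal sketch of the rewriting step a page-modifying proxy would perform: inserting shared user scripts into an HTML response just before the closing `</body>` tag. It is written in Python 3 for illustration only; `inject_scripts` is a hypothetical helper, not part of Privoxy, which is written in C and rewrites pages through its own filter files. A real proxy would also have to deal with compressed and non-HTML responses.

[python]
import re


def inject_scripts(html, scripts):
    """Insert each shared user script as an inline <script> before </body>."""
    tags = "".join("<script>%s</script>" % source for source in scripts)
    # Use the last closing body tag, matched case-insensitively.
    matches = list(re.finditer(r"</body\s*>", html, flags=re.IGNORECASE))
    if not matches:
        return html + tags  # no body tag: append to the end of the page
    pos = matches[-1].start()
    return html[:pos] + tags + html[pos:]
[/python]

For example, `inject_scripts("<p>x</p></body>", ["a()"])` returns `<p>x</p><script>a()</script></body>`.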


Torrent Auditor

I have now migrated the Python torrent client I’ve been working on to a Google Code project.

It lives at torrentauditor and now has basic support for actually downloading torrent files.

I researched the BitTorrent extension protocols this week, but was somewhat frustrated by what I found. Most of the interesting ones are implemented on a per-client basis and aren’t well documented outside of that client. The Vuze client, it turns out, switches to an entirely different application-specific protocol when it meets another client of the same type. The libtorrent-based clients do much the same thing, although they send their additional messages over the existing connection.
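For comparison, the extension protocol used by libtorrent-based clients is documented as BEP 10, and it shows what "additional messages over the existing connection" means in practice: a peer sets a bit in the reserved bytes of its handshake, then sends extension messages as ordinary peer-wire messages with id 20. The Python 3 sketch below checks that bit and builds the extension handshake. The helper names are hypothetical and the layout follows my reading of BEP 10, so treat it as a starting point rather than a tested implementation.

[python]
import struct

EXTENDED_MSG_ID = 20  # BEP 10: peer-wire message id used for all extension messages
EXT_HANDSHAKE_ID = 0  # extended message id 0 is reserved for the extension handshake


def supports_extensions(reserved):
    """Check the 8 reserved bytes of a peer's handshake for the BEP 10 bit."""
    return bool(reserved[5] & 0x10)


def extension_handshake(extensions):
    """Build a length-prefixed extension handshake advertising name -> local id.

    The payload is the bencoded dict {'m': {name: id, ...}}, encoded by hand
    here; bencoded dictionary keys must appear in sorted order.
    """
    entries = b"".join(
        b"%d:%si%de" % (len(name), name, ext_id)
        for name, ext_id in sorted((k.encode("ascii"), v) for k, v in extensions.items())
    )
    payload = bytes([EXTENDED_MSG_ID, EXT_HANDSHAKE_ID]) + b"d1:md" + entries + b"ee"
    return struct.pack("!I", len(payload)) + payload
[/python]

Note that the client in the later posts sends eight zero reserved bytes, so it never advertises this bit and peers will not send it extension messages.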

The good news, however, is that the basic protocol is simple enough to implement without major trouble. For now I chose to read pieces in order, purely for simplicity’s sake, although it is highly inefficient.
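In-order selection is easy to express: ask each peer for the lowest-numbered piece we still lack that its bitfield says it has. A minimal Python 3 sketch, assuming the peer's bitfield is kept as the raw bytes it sent (function names are hypothetical). The inefficiency is visible in the loop: every peer is steered toward the same early pieces, whereas most clients request the rarest pieces in the swarm first so that scarce pieces spread.

[python]
def has_piece(bitfield, index):
    """True if a peer's bitfield (high bit of byte 0 = piece 0) marks the piece."""
    byte = index // 8
    return byte < len(bitfield) and bool(bitfield[byte] & (0x80 >> (index % 8)))


def next_piece_in_order(peer_bitfield, completed, num_pieces):
    """In-order selection: the lowest-numbered missing piece this peer has."""
    for index in range(num_pieces):
        if index not in completed and has_piece(peer_bitfield, index):
            return index
    return None  # this peer has nothing we still need
[/python]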

One goal I’m going to focus on over the next few weeks, as I have time, is extracting video frames from downloaded data. For my digital animation class I would like to write a program that automatically stitches together frames and short clips of videos: a visual representation of the swarm.


Downloading Torrents

I extended last week’s work so that the client can actually get data from a swarm. The main changes are that a new socket is now allocated for each connection, per-connection state is remembered, and the client gets far enough to download data from other peers.

What still needs to happen:
1. I need a better way of stopping. Right now the client stops after a lull in network activity, but a lull doesn’t mean the download is finished.
2. I need to actually save downloaded data somewhere, or alternatively track the rate at which it’s coming in (or the time delay between packets).
3. I need to factor out the different message types into separate modules to keep the code readable.
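The first two points can share a mechanism. The `info` dictionary of a .torrent stores a SHA-1 hash for every piece under the `pieces` key (20-byte digests, concatenated), so a piece can be checked before it is saved, and "every piece verified" is a meaningful stopping condition. A Python 3 sketch with hypothetical helper names; assembling a piece's blocks into `data` is left out.

[python]
import hashlib


def piece_hashes(pieces_field):
    """Split the info dict's 'pieces' value into one 20-byte SHA-1 per piece."""
    return [pieces_field[i:i + 20] for i in range(0, len(pieces_field), 20)]


def verify_piece(data, index, hashes):
    """True if a fully downloaded piece matches the hash in the .torrent."""
    return hashlib.sha1(data).digest() == hashes[index]


def download_complete(verified, hashes):
    """Stop when every piece index has been verified, not on a network lull."""
    return len(verified) == len(hashes)
[/python]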

I’m going to set up a code page for this project in the coming week, so that I don’t have to post everything on this blog and there’s a nicer way of interacting with the code base. The model used here (one dictionary of state and buffered data per connection) isn’t super memory efficient, but it is a very simple system to work with, and I think it’s a good jumping-off point for interacting with swarms. In particular, it’s easy to add hooks that run when messages are received, and to store arbitrary state about connections.
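The hook idea also suggests a way to factor out the message types: instead of one long if/elif chain, keep a table from message id to handler functions, so each message type can live in its own module and register itself. A Python 3 sketch with hypothetical names (`on_message`, `dispatch`); `conn` stands in for the per-connection dictionary the client already keeps.

[python]
import struct

HANDLERS = {}  # message id -> list of hook functions


def on_message(msg_id):
    """Decorator that registers a hook for one peer-wire message id."""
    def register(func):
        HANDLERS.setdefault(msg_id, []).append(func)
        return func
    return register


def dispatch(conn, msg):
    """Run every hook registered for this message's id (its first byte)."""
    for hook in HANDLERS.get(msg[0], []):
        hook(conn, msg[1:])


@on_message(4)  # 'have': the payload is a 4-byte big-endian piece index
def record_have(conn, payload):
    conn.setdefault("have", set()).add(struct.unpack("!I", payload)[0])
[/python]

Adding a new behavior is then a matter of decorating another function; messages with no registered hook are simply ignored.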

[python]
# Standalone Torrent Auditor
#
import socket
import time
import sys
import getopt
import random
import benc
import binascii
import select
import hashlib
import struct
import urllib
import server

# a bit of global state about who we are
client = "AZ" + chr(0x05) + "31"
myID = "".join(chr(random.randrange(0, 256)) for i in xrange(20))
peerCache = []

# load in a .torrent file (bencoded, so read it as binary)
def readFile(filePath):
    f = open(filePath, 'rb')
    data = f.read()
    f.close()
    return benc.bdecode(data)

# register with the tracker to generate potential peers
def register(torrent, myPort):
    url = torrent['announce']
    ihash = hashlib.sha1(benc.bencode(torrent['info'])).digest()
    query = urllib.urlencode({'info_hash': ihash,
                              'peer_id': myID,
                              'port': myPort,
                              'uploaded': 0,
                              'downloaded': 0,
                              'left': 0,
                              'compact': 1,
                              'event': 'started'})
    url += "?" + query
    trackerhandle = urllib.urlopen(url)
    trackerdata = trackerhandle.read()
    trackerhandle.close()
    parseddata = benc.bdecode(trackerdata)
    # compact peer list: 4 bytes of IPv4 address + 2 bytes of port per peer
    initialnodes = parseddata['peers']
    peers = []
    while len(initialnodes) >= 6:
        ip = initialnodes[0:4]
        port = initialnodes[4:6]
        initialnodes = initialnodes[6:]
        peers.append({'state': 0,
                      'ip': socket.inet_ntoa(ip),
                      'ihash': ihash,
                      'port': ord(port[0]) * 256 + ord(port[1]),
                      'buffer': '',
                      'callback': handleData})
    return peers

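# register a new incoming connection
def onNew(sock, (ip, port)):
    # state 0: we have not sent our handshake yet, so handleData will
    # answer once the peer's handshake arrives
    obj = {'socket': sock,
           'ip': ip,
           'port': port,
           'state': 0,
           'buffer': '',
           'callback': handleData}
    handleData(obj)
    server.track_socket(obj)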

# reply to a socket with a length-prefixed torrent message
def reply(sock, mtype, payload):
    # flatten the payload in order; ints become 4-byte big-endian fields
    s = ''
    for p in payload:
        if isinstance(p, int):
            s += struct.pack('!i', p)
        else:
            s += p
    s = chr(mtype) + s
    pl = struct.pack('!i', len(s)) + s
    sock.sendall(pl)
    return pl

# parse a torrent message (msg starts with the one-byte message id)
def handleMsg(obj, msg):
    mid = ord(msg[0])
    if mid == 0:  # choke
        obj['state'] &= ~4
    elif mid == 1:  # unchoke
        obj['state'] |= 4
        # and we'd like to request a block from them: piece 0, offset 0,
        # 16 KiB (peers commonly drop connections asking for more)
        reply(obj['socket'], 6, [0, 0, 1 << 14])
    elif mid == 2:  # interested
        obj['state'] |= 8
    elif mid == 3:  # uninterested
        obj['state'] &= ~8
    elif mid == 4:  # have
        idx = struct.unpack('!i', msg[1:5])[0]
        have = obj.setdefault('have', bytearray())
        if len(have) <= idx // 8:
            have.extend(bytearray(idx // 8 + 1 - len(have)))
        # bitfields are most-significant-bit first
        have[idx // 8] |= 1 << (7 - idx % 8)
    elif mid == 5:  # bitfield
        obj['have'] = bytearray(msg[1:])
    elif mid == 6:  # request
        # ignored
        return
    elif mid == 7:  # piece
        # got our data
        print "succeeded in downloading!"
    elif mid == 8:  # cancel
        # we aren't in that business
        return

# parse incoming data
def handleData(obj):
    try:
        nbuf = obj['socket'].recv(4096)
    except socket.error, err:
        print "disconnected %s" % obj['ip']
        server.closed_socket(obj['socket'])
        return
    if not nbuf:
        print "disconnected %s" % obj['ip']
        server.closed_socket(obj['socket'])
        return
    data = obj['buffer'] + nbuf
    # Handshake: <pstrlen=19><pstr><8 reserved><info_hash><peer_id> = 68 bytes
    if obj['state'] & 2 == 0:
        if len(data) < 68 or data[0] != chr(19):
            # incomplete (or not a plaintext handshake): wait for more data
            obj['buffer'] = data
            return
        obj['ihash'] = data[28:48]
        obj['peerid'] = data[48:68]
        data = data[68:]
        obj['state'] |= 2
        if obj['state'] & 1 == 0:
            # we need to respond to the handshake
            obj['socket'].sendall(handshake(obj['ihash']))
            obj['state'] |= 1
        print "shook hands with %s" % obj['ip']

    # all other messages are prefixed by their 4-byte big-endian length
    while len(data) >= 4:
        mlen = struct.unpack('!i', data[0:4])[0]
        if len(data) < 4 + mlen:
            break  # wait for the rest of this message
        msg = data[4:4 + mlen]
        data = data[4 + mlen:]
        if len(msg):  # zero-length messages are keep-alives
            handleMsg(obj, msg)
    # save the rest of the data
    obj['buffer'] = data

# define the handshake
def handshake(ihash):
    announce = chr(19) + 'BitTorrent protocol'
    announce += chr(0) * 8  # reserved bytes: no extensions advertised
    announce += ihash
    announce += myID
    return announce

# talk with cached peers (up to five new connections per timeout)
def onTimeout():
    if len(peerCache) == 0:
        return True
    for i in range(5):
        if len(peerCache) == 0:
            break
        obj = peerCache.pop()
        obj = server.track_socket(obj)
        if not obj:
            continue
        obj['socket'].sendall(handshake(obj['ihash']))
        obj['state'] |= 1  # mark that we have sent our handshake
    return False

def usage():
    print "Usage:"
    print "client --file=loc.torrent"
    print "Will report on statistics for the desired torrent"

def main():
    filePath = "default.torrent"
    try:
        opts, args = getopt.getopt(sys.argv[1:], "hf:", ["help", "file="])
    except getopt.GetoptError, err:
        # print help information and exit:
        print str(err)  # will print something like "option -a not recognized"
        usage()
        sys.exit(2)
    for o, a in opts:
        if o in ("-h", "--help"):
            usage()
            sys.exit()
        elif o in ("-f", "--file"):
            filePath = a
        else:
            assert False, "unhandled option"
    print "Loading Info... ",
    info = readFile(filePath)
    print "okay"
    port = 6886
    print "Detecting Swarm... ",
    seeds = register(info, port)
    print len(seeds), " peers returned"
    peerCache.extend(seeds)
    print "Entering Main Loop"
    onTimeout()
    server.main_loop(onTimeout, onNew, port)
    print "Finished Snapshot"

if __name__ == "__main__":
    main()
[/python]


Auditing BitTorrent

One of the strengths of BitTorrent is that the primary data transfer protocol is entirely separate from the advertisement (tracker) protocol. This separation has also made it harder both to discover other users who have the data and to keep accurate reports of how much data was transferred.

The first problem has been worked on extensively, culminating in many extensions to the protocol that aim to make it easier to find other users, including distributed trackers, PEX, and DHT.

The second has been addressed less thoroughly, since it fundamentally cannot be solved in a distributed system: there is no real way to verify the legitimacy of the statistics a client reports, since neither the client nor any of the peers it has interacted with can be trusted.

One way to get a better sense of what is really going on is to build a client that actually interacts with the data transfer protocol, to check that reported statistics are not wildly inaccurate. This client does not participate in the traditional way; instead it infrequently connects to peers and asks them to send it data, which it can then use to estimate each peer’s bandwidth. Combined with knowledge of which clients have which portions of the data, this lets the client estimate the interactions taking place within the swarm.
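The bandwidth estimate itself can be as simple as timing a request: note when it was sent, count the bytes that come back, and divide. A Python 3 sketch with a hypothetical `RateProbe` class; the clock is a parameter so the estimate can be tested. The result only reflects what the peer chose to send this one client, so it is best read as a lower bound on that peer’s upload capacity.

[python]
import time


class RateProbe:
    """Estimate one peer's upload rate from a timed request for data."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock  # injectable so the estimate can be tested
        self.start = None
        self.received = 0

    def request_sent(self):
        """Call when the request goes out; resets the byte count."""
        self.start = self.clock()
        self.received = 0

    def data_received(self, nbytes):
        self.received += nbytes

    def rate(self):
        """Bytes per second since the request, or None if nothing is measured."""
        if self.start is None:
            return None
        elapsed = self.clock() - self.start
        return self.received / elapsed if elapsed > 0 else None
[/python]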

These estimates can then be compared against the reported statistics to detect clients that misreport.
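One simple form of that comparison, sketched in Python 3 under stated assumptions: treat the fastest rate measured for a client as a ceiling, multiply by the time it was in the swarm, and flag reported upload totals that exceed it. The function name and the 1.5 slack factor are hypothetical. Because each probe sees only part of a peer’s bandwidth, the ceiling can be too low, so a flag here is a reason to look closer rather than proof of misreporting.

[python]
def plausible_upload(reported_bytes, rate_estimates, seconds, slack=1.5):
    """Return False when a reported upload total exceeds what measurements allow.

    rate_estimates: measured upload rates (bytes/s) for one client; the fastest
    is used as a generous ceiling. slack is an arbitrary tolerance for error.
    """
    if not rate_estimates:
        return True  # no measurements, so no evidence either way
    ceiling = max(rate_estimates) * seconds * slack
    return reported_bytes <= ceiling
[/python]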

The code below is not finished. It handles the initial steps of peer discovery and connection, but cannot yet download from or monitor peers. The main focus of the next stage of work will be implementing the encryption protocol that is now standard for torrent traffic, so that the client can interact successfully with most users.

[python]
# Standalone Torrent Auditor
#
import socket
import time
import sys
import getopt
import random
import benc
import binascii
import select
import hashlib
import urllib

# Initialize a UDP socket,
# and the other global info about who this client is
client = "AZ" + str(0x05) + "31"
# find our outward-facing address by opening (and closing) a TCP connection
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("msn.com", 80))
myIP = s.getsockname()[0]
s.close()
myPort = 6886
UDPSocket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
UDPSocket.bind((myIP, myPort))
myID = "".join(chr(random.randrange(0, 256)) for i in xrange(20))
knownPeers = []

# handle sending a raw UDP datagram
def sendData(data, host, port):
    # print 'messaged %s:%d' % (host, port)
    UDPSocket.sendto(data, 0, (host, port))

# load in a .torrent file (bencoded, so read it as binary)
def readFile(filePath):
    f = open(filePath, 'rb')
    data = f.read()
    f.close()
    return benc.bdecode(data)

# register with the tracker to get peers
def register(torrent):
    url = torrent['announce']
    ihash = hashlib.sha1(benc.bencode(torrent['info'])).digest()
    query = urllib.urlencode({'info_hash': ihash,
                              'peer_id': myID,
                              'port': myPort,
                              'uploaded': 0,
                              'downloaded': 0,
                              'left': 0,
                              'compact': 1,
                              'event': 'started'})
    url += "?" + query
    trackerhandle = urllib.urlopen(url)
    trackerdata = trackerhandle.read()
    trackerhandle.close()
    parseddata = benc.bdecode(trackerdata)
    # compact peer list: 4 bytes of IPv4 address + 2 bytes of port per peer
    initialnodes = parseddata['peers']
    peers = []
    while len(initialnodes) >= 6:
        ip = initialnodes[0:4]
        port = initialnodes[4:6]
        initialnodes = initialnodes[6:]
        peers.append({'state': 0, 'ip': socket.inet_ntoa(ip), 'ihash': ihash,
                      'port': ord(port[0]) * 256 + ord(port[1])})
    return peers

def AnnouncePeer(myID, key, token, lp, host, port):
    data = {'q': 'announce_peer',
            'a': {'id': myID, 'info_hash': key, 'token': token, 'port': lp},
            'v': client, 'y': 'q', 't': str(0x05) + str(0x05)}
    sendData(benc.bencode(data), host, port)

def parseQuery():
    (msg, (hn, hp)) = UDPSocket.recvfrom(4096)  # should be more than enough
    found = 0
    for p in knownPeers:
        if p['ip'] == hn and p['port'] == hp:
            found = 1
            p['state'] |= 2  # mark the peer as having replied
            print msg
    if not found:
        print msg
        knownPeers.append({'state': 2, 'ip': hn, 'port': hp, 'ihash': 0})
    # data = benc.bdecode(msg)

    # check the type of message here, maybe
    # hisid = data['r']['id']
    # nodes = data['r']['nodes']
    # l = len(nodes) / 26
    # for i in range(0, l):
    #     nid = nodes[(26*i):(26*i+20)]
    #     nhost = nodes[(26*i+20):(26*i+24)]
    #     nport = nodes[(26*i+24):(26*i+26)]
    #     knownHosts[nid] = socket.inet_ntoa(nhost)
    #     knownPorts[nid] = ord(nport[0]) * 256 + ord(nport[1])
    #     if bitdif(nid, targetID) < bitdif(hisid, targetID):
    #         FindNodeReq(myID, targetID, knownHosts[nid], knownPorts[nid])
    # knownHosts[hisid] = hn
    # knownPorts[hisid] = int(hp)
    # return hisid

def initiateConns():
    inited = 0
    for p in knownPeers:
        if p['state'] == 0 and inited < 5:  # uncontacted
            # BitTorrent handshake: <pstrlen=19><pstr><8 reserved><info_hash><peer_id>
            # (note: peers expect this over a TCP connection, not a UDP datagram)
            announce = chr(19) + 'BitTorrent protocol'
            announce += chr(0) * 8
            announce += p['ihash']
            announce += myID
            p['state'] = 1  # contacted
            inited += 1
            sendData(announce, p['ip'], p['port'])
    return inited == 0

def MainLoop():
    print "Communicating",
    while 1:
        (x, y, z) = select.select([UDPSocket], [], [], 1)  # wait to receive something
        if len(x):
            parseQuery()
        else:
            if initiateConns():
                return
        continue  # we don't care much about errors, since it's all datagrams

def usage():
    print "Usage:"
    print "client --file=loc.torrent"
    print "Will report on statistics for the desired torrent"

def main():
    global myID, knownPeers
    filePath = "default.torrent"
    try:
        opts, args = getopt.getopt(sys.argv[1:], "hf:", ["help", "file="])
    except getopt.GetoptError, err:
        # print help information and exit:
        print str(err)  # will print something like "option -a not recognized"
        usage()
        sys.exit(2)
    for o, a in opts:
        if o in ("-h", "--help"):
            usage()
            sys.exit()
        elif o in ("-f", "--file"):
            filePath = a
        else:
            assert False, "unhandled option"
    print "Loading Info... ",
    info = readFile(filePath)
    print "okay"
    print "Detecting Swarm... ",
    seeds = register(info)
    print len(seeds), " peers returned"
    knownPeers.extend(seeds)
    print "Entering Main Loop"
    MainLoop()
    print "Finished Snapshot"
    print "Discovered Swarm State:"
    for p in knownPeers:
        print p['ip'], ": ",
        if 'has' in p:
            print p['has'],
        if 'speed' in p:
            print p['speed']
        else:
            print "unconnectable"

if __name__ == "__main__":
    main()

UDPSocket.close()
[/python]


Talking to Kad

[python]
# Standalone Mainline Kad Client
#
import socket
import time
import sys
import getopt
import random
import benc
import binascii
import select

client = "AZ" + str(0x05) + "31"
UDPSocket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
targetID = "".join(chr(random.randrange(0, 256)) for i in xrange(20))
myID = "".join(chr(random.randrange(0, 256)) for i in xrange(20))
reqs = 0  # outstanding requests, used to decide when a loop is finished
knownHosts = {}
knownPorts = {}

def sendData(data, host, port):
    global reqs
    reqs += 1
    # print 'messaged %s:%d' % (host, port)
    UDPSocket.sendto(data, 0, (host, port))

def sendPing(myID, host, port):
    # the transaction id 't' must be a string, as in the other queries
    data = {'q': 'ping', 'a': {'id': myID}, 'v': client, 'y': 'q',
            't': str(0x05) + str(0x05)}
    sendData(benc.bencode(data), host, port)

def GetPeersReq(myID, ih, host, port):
    data = {'q': 'get_peers', 'a': {'id': myID, 'info_hash': ih}, 'v': client,
            'y': 'q', 't': str(0x05) + str(0x05)}
    sendData(benc.bencode(data), host, port)

def FindNodeReq(myID, target, host, port):
    data = {'q': 'find_node', 'a': {'id': myID, 'target': target, 'want': ['n4']},
            'v': client, 'y': 'q', 't': str(0x05) + str(0x05)}
    sendData(benc.bencode(data), host, port)

def AnnouncePeer(myID, key, token, lp, host, port):
    data = {'q': 'announce_peer',
            'a': {'id': myID, 'info_hash': key, 'token': token, 'port': lp},
            'v': client, 'y': 'q', 't': str(0x05) + str(0x05)}
    sendData(benc.bencode(data), host, port)

def parseFindNodeResponse():
    (msg, (hn, hp)) = UDPSocket.recvfrom(4096)  # should be more than enough
    data = benc.bdecode(msg)
    # check the type of message here, maybe
    hisid = data['r']['id']
    # compact node info: 20-byte node id + 4-byte IPv4 address + 2-byte port
    nodes = data['r']['nodes']
    l = len(nodes) / 26
    for i in range(0, l):
        nid = nodes[(26*i):(26*i+20)]
        nhost = nodes[(26*i+20):(26*i+24)]
        nport = nodes[(26*i+24):(26*i+26)]
        knownHosts[nid] = socket.inet_ntoa(nhost)
        knownPorts[nid] = ord(nport[0]) * 256 + ord(nport[1])
        # only chase nodes that are closer to the target than the responder
        if bitdif(nid, targetID) < bitdif(hisid, targetID):
            FindNodeReq(myID, targetID, knownHosts[nid], knownPorts[nid])
    knownHosts[hisid] = hn
    knownPorts[hisid] = int(hp)
    return hisid

def parseGetDataResponse():
    (msg, host) = UDPSocket.recvfrom(4096)  # should be more than enough
    data = benc.bdecode(msg)
    # get_peers replies carry a token (needed for announce_peer) plus either
    # 'values' (peers) or 'nodes' (closer nodes)
    token = data['r'].get('token')
    if token is None:
        print data
    nodes = data['r'].get('nodes')
    print nodes
    return token

def readGetDataResponse():
    (msg, host) = UDPSocket.recvfrom(4096)  # should be more than enough
    data = benc.bdecode(msg)
    nodes = data['r'].get('nodes')
    print nodes
    return nodes

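def bitdif(ia, ib):
    # Hamming distance: the number of bit positions where the two ids differ.
    # (Kademlia itself orders nodes by XOR distance compared as an integer.)
    totalDifferences = 0
    for i in range(0, len(ia)):
        for j in range(0, 8):
            if ord(ib[i]) & (0x01 << j) != ord(ia[i]) & (0x01 << j):
                totalDifferences += 1
    return totalDifferences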

def findClosestPeerMainLoop():
    global reqs
    print "Searching",
    foundID = 0
    while 1:
        (x, y, z) = select.select([UDPSocket], [], [], 1)  # wait to receive something
        if len(x):
            print ".",
            foundID = parseFindNodeResponse()
        else:
            # a one-second lull: assume the search has converged
            print "!",
            reqs = 0
            return (knownHosts[foundID], knownPorts[foundID])
        sys.stdout.flush()
        reqs -= 1
        if reqs == 0:
            return (knownHosts[foundID], knownPorts[foundID])

def getDataMainLoop():
    global reqs
    print "Waiting",
    data = ""
    while 1:
        (x, y, z) = select.select([UDPSocket], [], [], 1)  # wait to receive something
        if len(x):
            print ".",
            data = parseGetDataResponse()
        else:
            print "!",
        sys.stdout.flush()
        reqs -= 1
        if reqs == 0:
            return data

def getDataReadLoop():
    global reqs
    print "Waiting",
    data = ""
    while 1:
        (x, y, z) = select.select([UDPSocket], [], [], 1)  # wait to receive something
        if len(x):
            print ".",
            data = readGetDataResponse()
        else:
            print "!",
        sys.stdout.flush()
        reqs -= 1
        if reqs == 0:
            return data

def announcePeerMainLoop():
    global reqs
    data = False
    while 1:
        (x, y, z) = select.select([UDPSocket], [], [], 1)  # wait to receive something
        if len(x):
            # read the reply so it isn't mistaken for the next get_peers response
            UDPSocket.recvfrom(4096)
            data = True
        else:
            print "!",
        sys.stdout.flush()
        reqs -= 1
        if reqs == 0:
            return data

def usage():
    print "Usage:"
    print "client <-i|-o> [--key=key]"
    print "where -i will store data read from stdin, and -o will retrieve data to stdout"

def main():
    global targetID, myID
    try:
        opts, args = getopt.getopt(sys.argv[1:], "hk:s:p:io", ["help", "key=", "server=", "port="])
    except getopt.GetoptError, err:
        # print help information and exit:
        print str(err)  # will print something like "option -a not recognized"
        usage()
        sys.exit(2)
    rootHost = "router.bittorrent.com"
    rootPort = 6881
    save = True
    for o, a in opts:
        if o in ("-h", "--help"):
            usage()
            sys.exit()
        elif o == "-o":
            save = False
        elif o == "-i":
            save = True
        elif o in ("-k", "--key"):
            targetID = a
            if len(targetID) != 20:
                # pad or truncate the key to a 20-byte DHT id
                targetID = targetID[0:20] + " " * (20 - len(targetID))
        elif o in ("-s", "--server"):
            rootHost = a
        elif o in ("-p", "--port"):
            rootPort = int(a)
        else:
            assert False, "unhandled option"
    # initiation
    if save:
        # to store data, we're going to
        # 0. read the data in
        # 1. find the node closest to the key
        # 2. put in a bunch of phoney announce_peers to save data there
        print "Enter Message:"
        data = ''
        for line in sys.stdin:
            data += line
        print "Finding Host in charge of key %s" % binascii.b2a_base64(targetID)
        FindNodeReq(myID, targetID, rootHost, rootPort)
        (targetHost, targetPort) = findClosestPeerMainLoop()
        print
        packets = 1 + len(data) / 20
        print "Adding Data (%d packets)" % packets,
        for i in range(0, packets):
            # each 20-byte chunk of the message is used as a node id
            buf = data[(i*20):((i+1)*20)]
            buf += " " * (20 - len(buf))
            GetPeersReq(buf, targetID, targetHost, targetPort)
            token = getDataMainLoop()
            AnnouncePeer(buf, targetID, token, 10000 + i, targetHost, targetPort)
            confirm = announcePeerMainLoop()
            print ".",
        print "Done"
        # check to see if it held
        GetPeersReq(myID, targetID, targetHost, targetPort)
        token = getDataReadLoop()
    else:
        # to retrieve data, we're going to
        # 1. find the node closest to the key
        # 2. query that node, and parse the returned data
        FindNodeReq(myID, targetID, rootHost, rootPort)
        (targetHost, targetPort) = findClosestPeerMainLoop()

if __name__ == "__main__":
    main()

UDPSocket.close()
[/python]


The “Mainline Kademlia” protocol

Kademlia describes the network interactions and RPCs that can form a distributed hash table. Today, Kad-based DHTs are one of the most common forms of distributed storage. However, since Kademlia does not specify the application-level protocol used to make calls, each implementation builds it on top of its own application protocol, which makes many of the rings incompatible with each other.
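The part of Kademlia that every ring shares, whatever its wire protocol, is the distance metric: the distance between two ids (node or key) is their bitwise XOR, read as an unsigned integer, and a lookup repeatedly asks the closest known nodes for nodes even closer to the key. A Python 3 sketch, assuming 160-bit (20-byte) ids as in BitTorrent’s DHT; k = 8 matches the bucket size in the Mainline DHT specification (BEP 5), while the Kademlia paper suggests k = 20. Helper names are hypothetical. XOR distance is not the same as counting differing bits: a difference in the highest bit outweighs all lower bits combined.

[python]
def xor_distance(a, b):
    """Kademlia distance: the XOR of two equal-length ids, read as an integer."""
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")


def bucket_index(my_id, other_id):
    """Index of the k-bucket for other_id: position of the highest differing bit."""
    return xor_distance(my_id, other_id).bit_length() - 1  # -1 for our own id


def closest(target, ids, k=8):
    """The k known ids nearest the target, as a lookup step would return them."""
    return sorted(ids, key=lambda node: xor_distance(node, target))[:k]
[/python]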

I was interested specifically in the Kad ring used by the Mainline BitTorrent client. In practice it serves as a distributed source of information about torrents, so that new peers can join a swarm even when no tracker is present. The ring is used by uTorrent, the official bittorrent.com client, and the Vuze client with an optional plugin. (Vuze has a native DHT that is incompatible with other systems.)
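In this ring the application protocol is KRPC: each query is a bencoded dictionary sent in a single UDP datagram, with a transaction id `t`, a type `y` of `q` for queries, the method name `q`, and its arguments `a`. `get_peers` is the call a client uses to find a swarm’s peers without a tracker. The Python 3 sketch below builds such a query; the helper names are hypothetical, and the minimal bencoder maps `str` to bytes via latin-1 to mimic Python 2 byte strings.

[python]
def bencode(value):
    """Minimal bencoder for ints, strings/bytes, lists and dicts."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("latin-1")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # dictionary keys must be byte strings in sorted order
        keys = sorted(value, key=lambda k: k.encode("latin-1") if isinstance(k, str) else k)
        return b"d" + b"".join(bencode(k) + bencode(value[k]) for k in keys) + b"e"
    raise TypeError("cannot bencode %s" % type(value).__name__)


def krpc_query(transaction_id, method, args):
    """A KRPC query: a bencoded dict with t (transaction id), y='q', q and a."""
    return bencode({"t": transaction_id, "y": "q", "q": method, "a": args})
[/python]

With the placeholder ids used in the examples of the DHT specification (BEP 5), `krpc_query("aa", "get_peers", {"id": "abcdefghij0123456789", "info_hash": "mnopqrstuvwxyz123456"})` yields `d1:ad2:id20:abcdefghij01234567899:info_hash20:mnopqrstuvwxyz123456e1:q9:get_peers1:t2:aa1:y1:qe`.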

Unsurprisingly, the protocol for this DHT is not well advertised. The reason is twofold. First, the amount of space is limited, and it is hard for clients to tell valid data from invalid data, so you don’t want to allow arbitrary data into the DHT. Second, the questionable legality of many torrents means that it is detrimental to give third parties an easy way to monitor the swarm.


Distributed Computing and Me

The moniker distributed computing has become a broad catchall term that has been applied to everything from parallel computation to the autoconfiguration of ad-hoc networks. In order to explore the technical problems in this field it is necessary to define clearly what the field is, and to this end I want to elaborate on what I believe is and is not distributed computing.
The first use of distributed computing was to describe the problem faced by projects like SETI@home and the genome project. These organizations needed to process huge quantities of data and did not have the budget to do so internally, so they turned to the general public. They did not depart from a client-server architecture, but allowed any computer running their client software to request and process chunks of data. The problem this created was not one of communication but of trust, since none of the clients could be expected to be either reliable or trustworthy. This problem is typically solved by checking for suspicious responses, and by sending each chunk of data to multiple clients to ensure they agree on an answer.
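The redundancy check can be sketched in a few lines of Python 3: collect each client’s answer for a chunk and accept the most common answer only once enough clients agree. The function name and the quorum of two are assumptions for illustration, not how any particular project does it.

[python]
from collections import Counter


def validate(results, quorum=2):
    """Accept a work unit's answer once enough independent clients agree.

    results: answers returned by different clients for the same chunk.
    Returns the agreed answer, or None if no answer reaches the quorum yet.
    """
    if not results:
        return None
    answer, votes = Counter(results).most_common(1)[0]
    return answer if votes >= quorum else None
[/python]

A chunk that comes back as `[42, 7]` stays unresolved and would be sent to another client; `[42, 42, 7]` is accepted as 42.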
Five years ago grid computing was the new big thing, though the term has since been eclipsed by cloud computing. Both models are related to distributed computing in that they focus on the scalability and reliability of a large number of machines. Both terms imply that the machines are entirely under one’s control, with grid computing referring specifically to such clusters internal to an organization and cloud computing referring to outsourcing that resource. The problem faced is not as different from the previous one as we might initially think. Given the large number of machines and components, failures occur regularly, and it is important to make data and computation redundant so that these individual failures are hidden. The problem is easier only in that nodes are typically not purposefully malicious in this scenario. Cloud computing, and grid computing before it, are also systems that have been built and have essentially overcome these issues: many large tech companies run enormous data centers, and services such as Google’s App Engine and Amazon’s EC2 let customers outsource their computation to the cloud.
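A back-of-the-envelope calculation shows why redundancy matters at this scale. Assuming machines fail independently with probability p over some period and each chunk of data is stored on r machines, one chunk is lost with probability p^r, but with many chunks some loss becomes likely. A Python 3 sketch (the function name is hypothetical):

[python]
def loss_probability(p, replicas, chunks):
    """Chance that at least one chunk loses every replica.

    Assumes each machine fails independently with probability p during the
    period of interest and each chunk is stored on `replicas` distinct machines.
    """
    chunk_lost = p ** replicas  # all replicas of one chunk fail
    return 1 - (1 - chunk_lost) ** chunks  # at least one of many chunks is lost
[/python]

For example, `loss_probability(0.01, 3, 10**6)` comes out at about 0.63: triple replication makes any single chunk very safe, yet across a million chunks a loss is more likely than not, which is why failures have to be detected and repaired continually rather than treated as rare events.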
Mesh networks are another seemingly disjoint field that has traditionally been linked to distributed computing. Here, a number of computers need to work together to access a common resource, like the Internet, but don’t necessarily trust each other. The typical picture is that only one computer has an Internet connection, and that resource needs to be shared with computers that are not directly connected to it. The problem here is primarily one of trust: not only are there likely reliability issues, but each node benefits from ignoring requests from others, since that leaves more bandwidth for itself. There have been several implementations of mesh networking, although none that can boast tremendous success. The best known is certainly the OLPC (One Laptop per Child) project, which attempted to use mesh networking as a solution to the intermittent access to resources found in developing countries. Beyond these bespoke solutions, the 802.11 working group has developed the 802.11s amendment, which adds standard mesh networking to 802.11 wireless networks.
To me, distributed computing is both all and none of these problems. Instead, I see the fundamental problem as one of cooperation. Users should be viewed as selfish, in that they want to get the most from a service while giving as little as possible. As evidence, look at the number of successful free web services versus those that cost money: free and ad-supported continues to rule as a business model because most users are unwilling to pay more than they need to. The challenge, then, is to get a set of strangers to cooperate so that they all end up with more than they started with.
In addition to the previously mentioned fields, this problem is faced by file-sharing networks. Here as much as anywhere we see the need for cooperation among selfish individuals. The goal for each user is to get data from others as quickly as possible while sending as little as possible, for both legal and purely selfish reasons. One attempt to solve this problem has been to build communities around the technology so that past performance is recorded and good behavior encouraged, but even this is not foolproof.
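Private torrent communities typically record past performance as a share ratio, uploaded divided by downloaded, and restrict members who fall below a threshold. A Python 3 sketch; the function names and the 0.5 minimum are arbitrary illustrative choices. It also shows why the approach is not foolproof: both inputs are statistics that clients report about themselves.

[python]
def share_ratio(uploaded, downloaded):
    """Uploaded / downloaded; a user who has downloaded nothing counts as infinite."""
    return float("inf") if downloaded == 0 else uploaded / downloaded


def may_download(uploaded, downloaded, minimum=0.5):
    """Hypothetical community rule: keep access only at or above a minimum ratio."""
    return share_ratio(uploaded, downloaded) >= minimum
[/python]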
For me, the goal of distributed computing is to provide a structure where users can burst beyond the resources they control, without worrying about peers who are malicious or self-serving. I plan to delve into this problem by examining and experimenting with existing systems, looking specifically at fault tolerance, which I see as fundamental to any solution, and then building a structure of my own.