"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."
- Leslie Lamport, May 1987
Networking researchers readily use tools such as traceroute to diagnose network problems, and use the results to decide how to mitigate them (e.g. if the server hosting a web page is down, check whether Google has cached the page). But this process is tedious and would benefit from automation. Moreover, most Internet users lack the technical skills to use and understand such tools, even though basic tools are readily available on most platforms (e.g. Microsoft Windows).
The goal of this project is to examine methods to diagnose network faults and, wherever possible, to recover from them in a "user-friendly" manner, and by doing so, make the Internet a more dependable medium for its users.
There is a long history of research into increasing network dependability, starting with Baran's initial proposal (1964, http://www.rand.org/publications/RM/RM3420/) of packet switching as a technology for military communication that survives enemy attacks directed against nodes, and continuing more recently with web service providers aiming to increase the availability of their servers through replication and load balancing. What distinguishes this project from past work is that it approaches dependability from the end-user/client's perspective, rather than from the network or server perspective. While end-users may ultimately be unable to rectify a service that is unavailable, they are certainly interested in being able to anticipate service outages, to diagnose where faults occur so that they can report the fault and learn when it may be rectified, and to determine alternate means of communicating with their target (e.g. accessing older cached information, a mirror, or the target at a later time).
To limit the scope of this project, it will initially assume that the end-user's local system has basic network connectivity, and that the faults being diagnosed are limited to the connectivity to specific targets. It will also concentrate on web browsing as the application to demonstrate fault diagnosis and recovery, but since many of the faults being diagnosed and recovered from occur below the application layer, many of the principles will readily extend to other applications. Furthermore, communication networks are inherently complicated systems that rely extensively on modularity to handle this complexity, and that modularity often obscures the cause of problems. It is not always possible even for an expert to diagnose every network fault, and we do not expect the automated systems developed as part of this project to do so either. However, we do expect to be able to automate the diagnosis of common faults, and so alleviate some of the problems experienced by end-users and reduce the cost of supporting end-users through help desks and the like.
The projects are listed in rough chronological order, i.e. in the order in which they might be used by someone experiencing a network problem:
The goal of this project is to develop a format (possibly XML-based) for describing network outages, together with software that lets operators distribute such descriptions and lets end-users filter out irrelevant ones and maintain a database of anticipated outages, which can be consulted when trying to explain observed outages.
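As a rough illustration of the idea, the sketch below parses a small XML outage notice, filters it by the hosts an end-user cares about, and looks up whether an observed failure falls inside an announced outage window. The element names (outages, outage, target, start, end, description), the sample hosts and the function names are all assumptions made for this example; the project does not define a schema here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import datetime

# Hypothetical notice format: element names are illustrative only.
SAMPLE = """
<outages>
  <outage>
    <target>mail.example.org</target>
    <start>2024-05-01T22:00:00</start>
    <end>2024-05-02T02:00:00</end>
    <description>Scheduled mail server maintenance</description>
  </outage>
  <outage>
    <target>www.example.net</target>
    <start>2024-05-03T01:00:00</start>
    <end>2024-05-03T03:00:00</end>
    <description>Router upgrade</description>
  </outage>
</outages>
"""


@dataclass
class Outage:
    target: str
    start: datetime
    end: datetime
    description: str


def parse_outages(xml_text):
    """Parse a notice document into a list of Outage records."""
    root = ET.fromstring(xml_text.strip())
    outages = []
    for node in root.findall("outage"):
        outages.append(Outage(
            target=node.findtext("target", default="").strip(),
            start=datetime.fromisoformat(node.findtext("start").strip()),
            end=datetime.fromisoformat(node.findtext("end").strip()),
            description=node.findtext("description", default="").strip(),
        ))
    return outages


def relevant(outages, hosts_of_interest):
    """Keep only the outages that affect hosts the end-user cares about."""
    wanted = {h.lower() for h in hosts_of_interest}
    return [o for o in outages if o.target.lower() in wanted]


def explain(outages, host, when):
    """Return announced outages that could explain a failure seen at `when`."""
    return [o for o in outages
            if o.target.lower() == host.lower() and o.start <= when <= o.end]
```

A client could call relevant() when notices arrive, store the result locally, and call explain() whenever a connection to a known host fails.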
To promote end-user adoption of the software, it could also be used to advertise other events, such as road traffic conditions (e.g. see this link) of interest to the end-user.
This could be extended to form a peer-to-peer system in which end-users share their knowledge of service (un)availability, and of which abnormal events have been reported or are being rectified.
To determine when problems exist, we intend to develop network performance monitoring software. This will take three forms. First, end-user software (e.g. a web browser plug-in) will measure the performance of communication with the systems that the end-user is currently interacting with (e.g. throughput and response times to web and DNS servers), and so provide timely detection of faults in communicating with the systems of highest immediate interest to the end-user. Second, software on the end-user's platform will monitor general network performance as seen from the local system, e.g. by processing data from TCP and link-layer MIBs about error rates, and by monitoring the ECN fields of received IP packets. Third, software on the end-user's platform will monitor the performance of frequently used resources (e.g. links) within the network, using tools that either already exist (e.g. see http://www.caida.org/tools/) or will be developed as part of this project, as well as services from network service providers and other "Internet weather" monitoring services (e.g. http://www.internettrafficreport.com/).
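The first form of monitoring, measuring response times to the servers a user is talking to, might be sketched as follows. This is a minimal illustration using only the Python standard library; the function name measure_response and the shape of the returned dictionary are assumptions for this example, and a real plug-in would also need to measure throughput and track results over time.

```python
import socket
import time


def measure_response(host, port=80, timeout=5.0):
    """Time the name lookup and the TCP connect to host:port.

    Returns a dict with the two durations in seconds, or with `error`
    set if either step failed.
    """
    result = {"host": host, "port": port,
              "dns_s": None, "connect_s": None, "error": None}
    try:
        # Step 1: name resolution (trivial for a literal IP address).
        t0 = time.perf_counter()
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        result["dns_s"] = time.perf_counter() - t0

        # Step 2: TCP three-way handshake to the first resolved address.
        family, socktype, proto, _canonname, addr = infos[0]
        t1 = time.perf_counter()
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(addr)
        result["connect_s"] = time.perf_counter() - t1
    except OSError as exc:
        # Covers resolution failures, refused connections and timeouts.
        result["error"] = f"{type(exc).__name__}: {exc}"
    return result
```

Comparing successive samples for the same host against a baseline is one simple way to flag degradation before a connection fails outright.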
See the Raceroute web page describing an enhanced traceroute.
The goal of this project is to create software (likely in Java, to enable use in a variety of web browsers) that lets the user enter a URL, tests what connectivity is available to that URL, and produces a report that can be submitted when the URL cannot be accessed.
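One way such a tool might work is to test connectivity stage by stage and stop at the first failure, so that the report indicates where the fault lies. The sketch below is written in Python for brevity, although the project anticipates Java. The stage names, the functions diagnose and format_report, and the report layout are assumptions for this example.

```python
import http.client
import socket
from urllib.parse import urlsplit


def diagnose(url, timeout=5.0):
    """Test a URL stage by stage; return a list of (stage, ok, detail)."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return [("parse", False, "expected an http(s) URL with a host")]
    try:
        port = parts.port or (443 if parts.scheme == "https" else 80)
    except ValueError:
        return [("parse", False, "invalid port")]
    host = parts.hostname
    report = [("parse", True, f"host={host} port={port}")]

    # Stage 1: can the host name be resolved?
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        report.append(("dns", True, infos[0][4][0]))
    except OSError as exc:
        report.append(("dns", False, str(exc)))
        return report

    # Stage 2: does the server accept a TCP connection?
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        report.append(("tcp", True, "connected"))
    except OSError as exc:
        report.append(("tcp", False, str(exc)))
        return report

    # Stage 3: does the server answer an HTTP request for this path?
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(host, port, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        status = conn.getresponse().status
        report.append(("http", status < 400, f"status {status}"))
    except (OSError, http.client.HTTPException) as exc:
        report.append(("http", False, str(exc)))
    finally:
        conn.close()
    return report


def format_report(url, report):
    """Render the stage results as plain text suitable for sending on."""
    lines = [f"Connectivity report for {url}"]
    for stage, ok, detail in report:
        lines.append(f"  {stage:<5} {'OK' if ok else 'FAIL'}  {detail}")
    return "\n".join(lines)
```

Calling format_report(url, diagnose(url)) yields one line per stage reached, with the last line identifying the stage at which access failed.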
Furthermore, networks are complicated systems, and are built in a modular fashion, e.g. exploiting the benefits of information hiding. Information hiding is desirable for the common case of normal operation (e.g. it facilitates component change, and reduces the amount of information that users need to process), but can be detrimental when problem solving, when the user wants an explanation for why they cannot obtain the desired service. This balance between transparency and opacity significantly affects the feasibility of different approaches to diagnosing network faults.
For example, UNSW subscribes to the IEEE Xplore database of papers, but the subscription has a limited number of "seats" (concurrent users). If someone cannot download a paper immediately because all seats are currently in use, could a tool repeat the request until the download succeeds?
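A tool of this kind could be as simple as a retry loop with increasing delays. The sketch below assumes a hypothetical fetch callable that raises a SeatsBusy exception when no seat is free; neither name corresponds to a real IEEE Xplore interface, and how a "seats in use" response is actually detected is left open.

```python
import time


class SeatsBusy(Exception):
    """Hypothetical error: all concurrent seats are currently in use."""


def fetch_with_retry(fetch, max_attempts=5, initial_delay=1.0,
                     backoff=2.0, sleep=time.sleep):
    """Call fetch() until it succeeds or max_attempts is reached.

    The wait between attempts grows by `backoff` each time, so that
    repeated requests do not add to the contention for seats.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except SeatsBusy:
            if attempt == max_attempts:
                raise  # give up and let the caller report the failure
            sleep(delay)
            delay *= backoff
```

The sleep parameter is injectable so that the loop can be exercised without real waiting; other errors are deliberately not retried, since they indicate a different kind of fault.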
People assume that when they send an electronic message (email, instant message, SMS etc.) it will be received by the intended recipient. But how often is this actually the case, particularly today, with the increasing use of email filters intended to protect users from spam? Furthermore, why do messages get lost? (Sometimes it is because servers crash, and because the protocols they use, e.g. SMTP, are deliberately designed to avoid end-to-end acknowledgements so that end-systems need not be available simultaneously.) Finally, what can be done to improve email dependability? For example, can a sending email client process error messages (e.g. Hotmail "exceeded disk quota" messages) and automatically try alternate email addresses (e.g. Unimail) to reach the destination? Can a sender receive an acknowledgement indicating that the message has reached the destination computer or user, without violating the destination user's privacy, i.e. their ability to read messages without others knowing? This thesis will assess current messaging systems, and develop software to prototype more dependable messaging systems.
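The idea of processing error messages and trying alternate addresses might be prototyped along the following lines. The sketch classifies a bounce by its enhanced status code as defined in RFC 3463 (class 4 is a transient failure, class 5 a permanent one, and X.2.2 means the mailbox is full), with a keyword fallback for free-text messages such as the quota example. The function names and the send callable are assumptions for this example, and treating a bounce as an immediate return value is a simplification, since real bounces usually arrive later as separate messages.

```python
import re

# Enhanced status code, e.g. "5.2.2" (class.subject.detail, RFC 3463).
STATUS_RE = re.compile(r"\b([245])\.(\d{1,3})\.(\d{1,3})\b")


def classify_bounce(text):
    """Heuristically classify the text of a delivery failure report."""
    match = STATUS_RE.search(text)
    if match:
        cls, subject, detail = match.groups()
        if (subject, detail) == ("2", "2"):
            return "mailbox_full"
        if cls == "4":
            return "transient"
        if cls == "5":
            return "permanent"
    # Fallback for free-text reports that carry no status code.
    if "quota" in text.lower():
        return "mailbox_full"
    return "unknown"


def deliver_with_fallback(addresses, send):
    """Try each address in order until one accepts the message.

    `send(address)` is a hypothetical callable that returns None on
    success or the bounce text on failure. Returns the address that
    worked (or None) and a log of (address, outcome) pairs.
    """
    attempts = []
    for address in addresses:
        bounce = send(address)
        if bounce is None:
            attempts.append((address, "delivered"))
            return address, attempts
        attempts.append((address, classify_bounce(bounce)))
    return None, attempts
```

A fuller prototype would probably retry transient failures against the same address before moving on, and would ask the sender before redirecting a message to an alternate address.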