INDE: Improving Network Dependability for End-users

Send any questions about this work to Tim Moors by email to t.moors AT unsw.edu.au

"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."
- Leslie Lamport, May 1987

Background

Two signs of the increasing maturity of the Internet are that users have become increasingly dependent on it as a communication medium, and that its constituency has expanded beyond the domain of technical experts to encompass lay men and women. This constituency faces a dilemma: they have an intense demand for resources offered on the Internet, but are often frustrated by their inability to access these resources. There can be countless explanations for their obstructed access, e.g. a web page may have moved, or a link that is essential to the path between them and a server may be unavailable. However, these users lack the technical knowledge and expertise to diagnose such problems and to determine what (if anything) can be done to mitigate them.

Networking researchers do not hesitate to use tools such as traceroute to diagnose network problems, and use the results to decide how to mitigate them (e.g. if the server hosting a web page is down, check whether Google has cached the page). But this process is tedious and would benefit from automation. Moreover, most Internet users lack the technical skills to use and understand such tools, even though basic tools are readily available on most platforms (e.g. Microsoft Windows).

The goal of this project is to examine methods to diagnose network faults and, wherever possible, to recover from them in a "user-friendly" manner, and by doing so, make the Internet a more dependable medium for its users.

There is a long history of research into increasing network dependability, starting with Baran's initial proposal (1964, http://www.rand.org/publications/RM/RM3420/) of packet switching as a technology for military communication that survives enemy attacks directed against nodes, and continuing more recently with web service providers aiming to increase the availability of their servers through replication and load balancing. What distinguishes this project from past work is that it approaches dependability from the end-user/client's perspective, rather than from the network or server perspective. While end-users may ultimately be unable to rectify a service that is unavailable, they are certainly interested in anticipating service outages, in diagnosing where faults occur so that they can report the fault and learn when it may be rectified, and in determining alternate means of communicating with their target (e.g. accessing older cached information, a mirror, or the target at a later time).

To limit the scope of this project, it will initially assume that the end-user's local system has basic network connectivity, and that the faults being diagnosed are limited to the connectivity to specific targets. It will also concentrate on web browsing as the application to demonstrate fault diagnosis and recovery, but since many of the faults being diagnosed and recovered from occur below the application layer, many of the principles will readily extend to other applications. Furthermore, communication networks are inherently complicated systems that make extensive use of modularity to handle this complexity, and in so doing often obscure the cause of problems. Even an expert cannot always diagnose every network fault, and we do not expect the automated systems developed as part of this project to do so either. However, we do expect to be able to automate the diagnosis of common faults, and so alleviate some of the problems experienced by end-users and reduce the cost of supporting end-users through help desks and the like.

Sub projects

Each of the following projects is suitable for students doing their 4th year BE thesis. The prerequisite for these projects is a strong understanding of networking and programming, as evidenced by distinction levels in TELE3018/COMP3331 (or similar networking subjects), COMP1021/1721 (or similar programming subjects), and a reasonably high overall Weighted Average Mark (WAM). If you are interested, then please email t.moors AT unsw.edu.au a copy of your academic record and resume.

The projects are listed in rough chronological order, i.e. in the order in which they might be used by someone experiencing a network problem:

Notifying users of outages and remedial actions

Some outages are planned as part of scheduled maintenance. Network operators often advertise these outages by broadcast email sent to all of their end-users. For examples, see this file. Such emails are often of limited value to end-users because they may describe outages that don't affect an individual end-user (e.g. a network link that they don't use, or the outage may be at a time when they don't need the network) and because they must be manually processed to determine which are relevant to a specific end-user, and which are not. It would also be useful for network operators to be able to advertise to their end-users known current outages, and what remedial action is being taken, so as to reduce the number of queries from end-users.

The goal of this project is to develop a form (possibly in XML) for describing network outages, and software to allow operators to distribute such forms, and for end-users to filter out irrelevant forms and even maintain a database of anticipated outages, which can be referred to when trying to explain observed outages.
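As an illustration only, the sketch below shows how an end-user tool might filter outage notices written in XML. The project has not defined a format: the element and attribute names (outages, outage, resource, start, end, reason) and the function relevant_outages are all hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical outage notice; every element and attribute name is an assumption.
NOTICE = """
<outages>
  <outage resource="link-a" start="2004-03-01T22:00" end="2004-03-02T02:00">
    <reason>Scheduled router upgrade</reason>
  </outage>
  <outage resource="mail-gw" start="2004-03-03T09:00" end="2004-03-03T10:00">
    <reason>Disk replacement</reason>
  </outage>
</outages>
"""

def relevant_outages(xml_text, used_resources, active_from, active_to):
    """Return (resource, reason) pairs for outages that affect a resource the
    user depends on and overlap the period in which the user needs the network."""
    matches = []
    for outage in ET.fromstring(xml_text).iter("outage"):
        start = datetime.fromisoformat(outage.get("start"))
        end = datetime.fromisoformat(outage.get("end"))
        overlaps = start < active_to and end > active_from
        if outage.get("resource") in used_resources and overlaps:
            matches.append((outage.get("resource"), outage.findtext("reason", "")))
    return matches
```

A user who depends only on "link-a" and works late on 1 March 2004 would be shown the first notice and not the second. The same records could be stored to build the database of anticipated outages mentioned above.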

To promote end-user adoption of the software, it could also be used to advertise other events, such as road traffic conditions (e.g. see this link) of interest to the end-user.

This could be extended to form a peer-to-peer system in which end-users share their knowledge of service (un)availability, and of which abnormal events have been reported or are being rectified.

Determining that an access problem exists

This project could, conceivably, be split into three smaller projects.

To determine when problems exist, we intend to develop network performance monitoring software. This will take three forms. First, end-user software (e.g. a web browser plug-in) will measure the performance of communication with systems that the end-user is currently interacting with (e.g. throughput and response times to web and DNS servers), and so provide timely detection of faults in communicating with the systems of highest immediate interest to the end-user. Second, software on the end-user's platform will monitor general network performance as seen from the local system, e.g. by processing data from TCP and link-layer MIBs about error rates, and by monitoring ECN fields of received IP packets. Third, software on the end-user's platform will monitor the performance of frequently used resources (e.g. links) within the network, using tools that either already exist (e.g. see http://www.caida.org/tools/) or will be developed as part of this project, together with services from network service providers and other "Internet weather" monitoring services (e.g. http://www.internettrafficreport.com/).
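The first form of monitoring could start from something as simple as timing name resolution and TCP connection setup to the server the user is talking to. The following is a minimal sketch under that assumption; measure_connect is a hypothetical helper, not part of any existing tool, and it uses only the Python standard library.

```python
import socket
import time

def measure_connect(host, port=80, timeout=5.0):
    """Return (dns_seconds, connect_seconds) for one probe of host:port.

    connect_seconds is None when the TCP connection cannot be established
    within the timeout. A name-resolution failure raises socket.gaierror."""
    t0 = time.monotonic()
    family, socktype, proto, _, addr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM)[0]
    dns_seconds = time.monotonic() - t0

    t1 = time.monotonic()
    try:
        with socket.socket(family, socktype, proto) as s:
            s.settimeout(timeout)
            s.connect(addr)
    except OSError:
        # Refused, unreachable or timed out: report the DNS time only.
        return dns_seconds, None
    return dns_seconds, time.monotonic() - t1
```

Separating the two timings matters for diagnosis: a slow or failed lookup points to the DNS server, whereas a fast lookup followed by a failed connection points to the path or the target server.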

Determining the cause of a problem

The software that determines when problems exist may be able to localise the problem precisely enough to identify its cause (e.g. a link failure). However, there will also be times when additional diagnosis is needed to determine the cause. This project will examine mechanisms to achieve such diagnosis. We will start with the traditional ICMP mechanisms (such as time-exceeded messages and extensions by the IETF's ICMP Traceback working group and its successors), and extend the venerable traceroute by adding new heuristics to estimate the path length, deal with asymmetric paths, and accelerate route tracing.
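One well-known heuristic for estimating path length, shown here only as an example and not necessarily the one this project will adopt, is to inspect the TTL of a packet received from the target and assume that the sender started from one of a few common initial TTL values. The list of initial values and the function name estimate_hops are assumptions made for this sketch.

```python
# Initial TTL values commonly used by IP stacks (an assumption of this sketch).
COMMON_INITIAL_TTLS = (32, 64, 128, 255)

def estimate_hops(received_ttl):
    """Estimate how many hops a packet travelled from the TTL it arrived with,
    assuming the sender used the smallest common initial TTL >= received_ttl."""
    if not 1 <= received_ttl <= 255:
        raise ValueError("TTL must be in the range 1..255")
    initial = next(t for t in COMMON_INITIAL_TTLS if t >= received_ttl)
    return initial - received_ttl
```

For instance, a reply arriving with TTL 52 is taken to have started at 64 and crossed 12 hops. Note that this measures the path from the target back to the observer, which may differ from the forward path when routing is asymmetric.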

See the Raceroute web page describing an enhanced traceroute.

Network measurement

This project will survey existing tools for measuring network topologies and path characteristics (e.g. traceroute, pathchar and probing using packet pairs) and will propose new techniques (e.g. mapping topologies of bridged networks by filling bridge tables by changing the MAC address with which a station transmits). Such network measurements are important for learning about network topologies, which help explain how link outages affect reachability. Measurements of path and link characteristics (e.g. bandwidth, delay, loss rate, etc) are useful for establishing baseline performance, and in identifying abnormal performance (e.g. detecting failures due to inadequate performance, and possibly suggesting imminent total failure).
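Two of the ideas above can be written down compactly. The sketch below is illustrative only: packet_pair_bandwidth applies the usual packet-pair relation (bottleneck bandwidth is roughly packet size divided by the spacing with which two back-to-back packets arrive), and is_abnormal compares a new measurement with a baseline using a simple mean-plus-k-standard-deviations rule. Both function names and the default threshold k=3 are assumptions, not choices made by this project.

```python
from statistics import mean, stdev

def packet_pair_bandwidth(packet_bytes, arrival_gap_seconds):
    """Estimate bottleneck bandwidth in bits per second from the arrival gap
    between two equal-sized packets that were sent back to back."""
    if arrival_gap_seconds <= 0:
        raise ValueError("arrival gap must be positive")
    return packet_bytes * 8 / arrival_gap_seconds

def is_abnormal(baseline, sample, k=3.0):
    """Flag a sample (e.g. a round-trip time) that lies more than k sample
    standard deviations above the mean of the baseline measurements."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    return sample > mean(baseline) + k * stdev(baseline)
```

With 1500-byte packets arriving 1.2 ms apart, the first function suggests a bottleneck of about 10 Mb/s. Real measurements are noisy (cross traffic compresses or stretches the gap), so a practical tool would filter many pairs rather than trust one.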

Informative fault report wizard

When organising the 11th IEEE International Conference On Networking (ICON), we provided a web site (www.icon2003.com) with information about the conference. This site was accessed thousands of times, but some end-users experienced problems accessing (parts of) the site. Even though most of these end-users were, by occupation, network researchers, the problem reports that they sent by email generally failed to convey the technical details needed to rectify the fault. For example, a typical fault report said "I am having problems accessing www.icon2003.com" without stating which URL they were trying to access, when they tried to access it, or whether they could reach other web servers, or the ICON server by means other than HTTP (e.g. ping or traceroute). For some examples of information that should be included in a fault report, see this web page.

The goal of this project is to create software (likely in Java to enable use on varied web browsers) that will allow the user to enter a URL and will test what connectivity is available to that URL, and create a report which can be sent to report inability to access a certain URL.
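The staged testing that such a wizard would perform can be sketched as follows. This is a minimal illustration, written in Python rather than the Java suggested above, and the function name diagnose and the report layout are hypothetical. It checks name resolution, then TCP connectivity, then the HTTP response, and stops at the first stage that fails so that the report shows how far the request got.

```python
import socket
import urllib.error
import urllib.request
from datetime import datetime, timezone
from urllib.parse import urlsplit

def diagnose(url, timeout=5.0):
    """Probe a URL in stages (DNS, TCP, HTTP) and return a plain-text report."""
    parts = urlsplit(url)
    host = parts.hostname
    if host is None:
        raise ValueError("URL has no host name: " + url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    lines = [f"URL: {url}",
             f"Time (UTC): {datetime.now(timezone.utc).isoformat()}"]
    try:
        info = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        lines.append(f"DNS: {host} resolved to {info[0][4][0]}")
    except OSError as err:
        lines.append(f"DNS: failed ({err})")
        return "\n".join(lines)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            lines.append(f"TCP: connected to port {port}")
    except OSError as err:
        lines.append(f"TCP: failed ({err})")
        return "\n".join(lines)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            lines.append(f"HTTP: status {response.status}")
    except urllib.error.HTTPError as err:
        lines.append(f"HTTP: status {err.code}")  # server answered with an error
    except OSError as err:
        lines.append(f"HTTP: failed ({err})")
    return "\n".join(lines)
```

The resulting text records the URL, the time of the attempt and the outcome of each stage, which are the details that the ICON fault reports typically omitted. A fuller tool might add ping and traceroute results and a comparison with other web servers.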

Balancing network transparency and opacity/obscurity

There is a delicate balance between network transparency, which aids diagnosis, and obscurity, which service providers often desire for security and competitive advantage. Many firewalls and routers are configured to filter out traditional ICMP diagnosis messages because of these concerns. This project involves collaborating with other researchers to develop insight into this balance between transparency and obscurity, and then developing tools and methodologies that accommodate these differing requirements. These may use techniques such as ICMP tunnels and probes to widely-permitted ports (e.g. 80) with decrementing TTLs.

Furthermore, networks are complicated systems, and are built in a modular fashion, e.g. exploiting the benefits of information hiding. Information hiding is desirable for the common case of normal operation (e.g. it facilitates component change, and reduces the amount of information that users need to process), but can be detrimental when problem solving, when the user wants an explanation for why they cannot obtain the desired service. This balance between transparency and opacity significantly affects the feasibility of different approaches to diagnosing network faults.

Recovery from faults

This project will provide tools that enable end-users to respond to network faults in ways that allow them, whenever possible, to realise their goals, rather than being stupefied by arcane error messages. These tools will include facilities to allow end-users to indicate their determination to access a web page (and so control the persistence of protocols used in attempts to access the web page), to access the web page through a proxy, or to access an older version of the page from a cache.

For example, UNSW subscribes to the IEEE Xplore database of papers, but the subscription has a limited number of "seats" (concurrent users). If someone cannot download a paper immediately because all seats are currently in use, can a tool repeat the request until the download succeeds?
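A persistent retry of this kind might look like the sketch below. It is an illustration under stated assumptions: fetch is any caller-supplied function that raises OSError when the attempt fails, and the function name fetch_with_persistence, the attempt limit and the doubling delay are hypothetical choices. The max_attempts parameter is where an end-user's "determination" to reach the page could be expressed.

```python
import time

def fetch_with_persistence(fetch, max_attempts=5, initial_delay=1.0,
                           sleep=time.sleep):
    """Call fetch() until it succeeds or max_attempts is reached.

    The delay between attempts doubles each time so that a busy server is
    not hammered. Returns fetch()'s result, or re-raises the last OSError."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except OSError:
            if attempt == max_attempts:
                raise  # give up and let the caller see the final error
            sleep(delay)
            delay *= 2
```

The sleep argument exists so that the waiting can be replaced in testing. A real tool would also need to recognise which failures are worth retrying: "all seats in use" is, but "page not found" is not.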

Email dependability

See the separate web page for more details.

People assume that when they send an electronic message (email, instant message, SMS etc.) it will be received by the intended recipient. But how often does this occur, particularly today, with the increasing use of email filters intended to protect users from spam? Furthermore, why do messages get lost? (Sometimes it is because servers crash, and because the protocols that they use, e.g. SMTP, are deliberately designed to avoid end-to-end acknowledgements so that end-systems needn't be available simultaneously.) Finally, what can be done to improve email dependability? For example, can a sending email client process error messages (e.g. Hotmail "exceeded disk quota" messages) and automatically try alternate email addresses (e.g. Unimail) to reach the destination? Can a sender receive an acknowledgement indicating that the message has reached the destination computer/user, without violating the destination user's privacy, i.e. their ability to read messages without telling others? This thesis will assess current messaging systems, and develop software to prototype more dependable messaging systems.
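As a rough illustration of automated bounce processing, the sketch below reads the Status field of a delivery status notification and decides what a sending client might do next. It assumes the bounce carries an enhanced status code in the class.subject.detail form (4.x.x for transient failures, 5.x.x for permanent ones, e.g. 5.2.2 for a full mailbox). Many real bounces are free-form text and would need other handling. The function name next_action and its return values are hypothetical.

```python
import re

# Matches a "Status: 5.2.2" style line; only failure classes 4 and 5 are of interest.
STATUS_RE = re.compile(r"^Status:\s*([45])\.(\d{1,3})\.(\d{1,3})",
                       re.MULTILINE | re.IGNORECASE)

def next_action(bounce_text, alternates):
    """Return ('retry', None), ('resend', address) or ('report', None).

    retry  - transient failure: try the same address again later
    resend - permanent failure: try the first alternate address
    report - nothing automatic can be done; tell the user"""
    match = STATUS_RE.search(bounce_text)
    if match is None:
        return ("report", None)
    if match.group(1) == "4":
        return ("retry", None)
    if alternates:
        return ("resend", alternates[0])
    return ("report", None)
```

Given a bounce containing "Status: 5.2.2" and a known alternate address for the recipient, the client would resend to that alternate. Whether it should do so without asking the sender is one of the questions this thesis would need to consider.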

Related work

The US Computer Science and Telecommunications Board has a project on Building Certifiably Dependable Systems.