Aaron Cordova's Blog

Online Data Availability Challenges

When deciding to share data with others, publish content, or create a new website or web application, one must consider where the files and code will live. As the Internet Archive project illustrates, content on the web is unsettlingly ephemeral.

For those with an interest in creating some lasting content or presence on the web, and for those concerned about data loss and high data availability, the following factors conspire to kill your data. I'll address each, and hopefully a picture of a solution will emerge:

1. The majority of internet users are connected to the net in ways that make their machines hard to reach

Wouldn't it be nice if you could make content or data on your own hard drive available to others without having to put it somewhere else? Alas, the nature of most internet users' connections means there's no good way for others to consistently find your machine from wherever they may be. First, the assignment of IP addresses to user machines is dynamic: you get a different address every time you reconnect to the network. It may not change very often, but it can't be relied on not to change, so other people have no stable address at which to find your machine and the data you may wish to make available.

There are dynamic DNS updating services to help with this, but then you run into the next hurdle: many users share a single IP address via a technology called NAT (network address translation), which only lets traffic find its way back to your machine when it is a reply to a request your machine originated, not when it is an unsolicited request from somewhere else on the web.
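To make the dynamic DNS idea concrete, here's a rough sketch of what such an updater does: it periodically checks what your public IP currently is and tells the DNS provider whenever it changes. The lookup and update URLs below are placeholders standing in for a provider's real API, not any actual service.

```python
import time
import urllib.request

# Hypothetical provider endpoint -- substitute your dynamic DNS service's real API.
UPDATE_URL = "https://dyndns.example.com/update?hostname=myhome.example.com&ip={ip}"
# A public "what is my IP" service; treat the exact URL as an assumption.
IP_LOOKUP_URL = "https://api.ipify.org"

def current_public_ip():
    """Ask an external service what address our traffic appears to come from."""
    with urllib.request.urlopen(IP_LOOKUP_URL, timeout=10) as resp:
        return resp.read().decode().strip()

def update_dns(ip):
    """Tell the dynamic DNS provider to point our hostname at the new address."""
    with urllib.request.urlopen(UPDATE_URL.format(ip=ip), timeout=10) as resp:
        return resp.status == 200

last_ip = None
while True:
    ip = current_public_ip()
    if ip != last_ip:              # address changed since the last check
        if update_dns(ip):
            last_ip = ip
    time.sleep(300)                # re-check every five minutes
```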

These constraints are probably good for security, but they eliminate the option of making data hosted on your own machine highly available to others. Furthermore, upload speeds tend to be much lower than download speeds, so only a few users could reach your data simultaneously.
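To put a rough number on the upload problem, here's a back-of-envelope calculation; the connection speed and page size are illustrative assumptions, not measurements:

```python
# Rough capacity estimate for self-hosting on a typical asymmetric connection.
# All figures below are illustrative assumptions.
upload_mbps = 5          # assumed residential upload speed, in megabits/second
page_mb = 2              # assumed size of a page with images, in megabytes

upload_mb_per_s = upload_mbps / 8            # convert megabits to megabytes
seconds_per_page = page_mb / upload_mb_per_s

print(f"{seconds_per_page:.1f} s of full upstream to serve one page")   # ~3.2 s
print(f"{3600 / seconds_per_page:.0f} pages per hour, at best")         # ~1125
```

Even a modest burst of visitors saturates that kind of link.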

This means you've got to find some service out there on the net that is set up for making data available consistently and to a large number of users.

2. Hard drives and other storage media fail fairly reliably

Conventional hard drives (not Solid State Disks) have moving parts and are often the first component of any computer to fail. It's just a matter of time. More expensive disks last longer but no disks last forever. This necessitates backing data up somewhere. But even then the backups are prone to fail. Writable CDs only last a few years, and other hard drives will also fail or may silently corrupt the data.
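As a rough illustration of why a single copy is a gamble, assume each disk fails independently with an annualized failure rate of about 4% (an illustrative figure, and the independence assumption is optimistic since no re-replication is modeled):

```python
# Back-of-envelope: probability of losing every copy within a year, assuming
# each disk fails independently with the same annualized failure rate.
afr = 0.04   # illustrative assumption, not a measured figure

for copies in (1, 2, 3):
    p_loss = afr ** copies
    print(f"{copies} copies: ~{p_loss:.6f} chance of total loss per year")
# 1 copy : 0.040000
# 2 copies: 0.001600
# 3 copies: 0.000064
```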

The safest option is to host several copies of the data on several disks that are monitored both for bit rot (not total drive failure, but a few bits of data changing or going away) and for outright failure; when either is detected, new copies are created from the remaining good copies. This can be provided via RAID, or by newer distributed filesystems such as HDFS, which is the only filesystem I know of that checks for bit rot.
[Update: as @Zooko points out, ZFS and Tahoe-LAFS also provide this feature]
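The core of that monitoring is simple: record a checksum for each piece of data when it's written, periodically re-read and re-hash it, and restore from a good copy when the hashes stop matching. HDFS and ZFS do this with per-block checksums; the sketch below just illustrates the idea on whole files and is not how either of them implements it.

```python
import hashlib
import json
from pathlib import Path

MANIFEST = Path("checksums.json")   # where the known-good hashes are recorded

def sha256(path):
    """Hash a file in chunks so large files don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def record(paths):
    """Store the current hash of each file as the trusted reference."""
    manifest = {str(p): sha256(p) for p in paths}
    MANIFEST.write_text(json.dumps(manifest, indent=2))

def verify():
    """Re-hash every recorded file and report anything that silently changed."""
    manifest = json.loads(MANIFEST.read_text())
    for path, expected in manifest.items():
        if sha256(path) != expected:
            # In a real system this would trigger re-replication from a good copy.
            print(f"bit rot or corruption detected: {path}")
```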

This means we have to find a service that makes multiple copies of your data on multiple spinning disks, monitors the copies, and automatically re-replicates as necessary. The technologies I mentioned are either widely available, as with RAID, or becoming more so, as with hosted distributed filesystems.

3. Web hosting companies can come and go

This doesn't necessarily mean data loss, since most companies would probably let you retrieve your data somehow if they went under, but that process is rarely well documented, if it exists at all. Even the big web companies have only reluctantly responded to the need to let users extract all of their data.

It would be best for data availability to have your data hosted by a strong company, with advanced data centers and enough scale to keep copies of your data on several spinning disks.

4. Web technologies, such as servers and browsers, change

This one may be the hardest challenge to overcome, as there doesn't seem to be much demand for solving it. The best one can do is stick to widespread standards, formats, and technologies to extend the usefulness of one's data over time. I believe we need more work on creating lasting data and content delivery systems that have fewer dependencies than the systems we have today.

5. The networked nature of the web gives rise to spiky page view behavior

Even if your data is protected against loss, others must be able to get to it. Because of the interlinked nature of the web, content tends to be accessed in spiky bursts as word spreads through user sharing and re-posting. If your data or content isn't hosted in a highly scalable application, some users will be denied access to it.

Modern content hosting services provide Content Delivery Networks (CDNs): geographically distributed servers that hold static content and direct each request to the nearest physical copy. For dynamic data, scale can be addressed by applications distributed across a cluster of machines that can be grown and shrunk dynamically with no downtime, with requests load balanced across them.
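In miniature, "direct requests to the nearest copy" and "load balance across a cluster" are both small routing decisions, as the sketch below shows. The edge names and latencies are invented for illustration, and real CDNs generally make this decision with DNS and anycast rather than application code.

```python
import itertools

# Hypothetical edge servers with measured round-trip latencies (ms) from a client.
edge_latencies_ms = {
    "edge-us-east": 12,
    "edge-eu-west": 95,
    "edge-ap-south": 210,
}

def nearest_edge(latencies):
    """Static content: send the request to the lowest-latency copy."""
    return min(latencies, key=latencies.get)

# Dynamic content: spread requests evenly over whatever app servers are currently up.
app_servers = ["app-1", "app-2", "app-3"]
round_robin = itertools.cycle(app_servers)

print(nearest_edge(edge_latencies_ms))   # edge-us-east
for _ in range(4):
    print(next(round_robin))             # app-1, app-2, app-3, app-1
```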

This means we need to dynamically replicate content and applications across many machines to respond to user demand. Fortunately web technology has been trending towards this with the advent of dynamically scalable platforms.

At this point, a few services more or less fit this bill, namely some of the large 'cloud computing' offerings. Each offers some form of scalable, replicated content delivery. However, we have one more factor to address.

6. Controversial content may have a hard time finding a home and staying available

Content that threatens powerful interests, that may be considered controversial, or that could be caught up in intellectual property disputes is subject to being denied hosting if the affected interests can pressure content hosting companies. The unfortunate side effect of all the other factors, which point us toward large, thriving, and technologically advanced hosting companies, is that content can more easily be made unavailable through legal or commercial pressure on just one or two entities.

Solutions to factors #6 and #3 seem to be at odds with each other.

So how does one achieve data replication, distribution, and scalability, but at the same time avoid centralized control? I believe we need more effort in this area, perhaps going back to challenge #1 of securely and consistently making content on user machines available. A few decentralized services exist to help avoid making content 'killable' through censorship efforts, and these may even get close to addressing the other requirements, but are not well adopted or integrated into the web. 

The perfect solution seems to be a decentralized, geographically distributed, widely addressable, dynamically scalable data replication and delivery service that is an essential component of several independent commercial endeavors, to ensure its longer-term support and viability. I don't think we have many candidates, but I'd love to be proven wrong.