Aaron Cordova's Blog

Big Data Reading List

Since there's so much going on in the Big Data space these days, getting up to speed quickly is important for lots of technical decision makers.

This is a list of books and articles that might be helpful for learning about the major concepts and innovations in the Big Data space. It is by no means an attempt to be comprehensive or even an unbiased representation, just useful. I've organized the list according to what I feel are a few fundamental approaches to tackling the Big Data challenge, namely:

New Distributed Architectures

Distributed architectures address the most basic problem related to Big Data - i.e. what does one do when the data no longer 'fits' on a single machine? One way or another, the data must be stored or streamed and processed somehow.

Machine Learning

Machine learning, modeling, data mining, and the like address the problem of understanding the data. Even if I can store and process the data, ultimately I need to gain some level of human understanding of the information contained therein.

Machine learning can help solve this problem via modeling - either by producing a transparent model through which a human can understand the fundamental processes that generated the data, or by producing a model that can be used in place of human understanding to help make decisions. It can also reduce dimensionality and reveal structure.

Visualization

Visualization is a different approach to helping understand the data that leverages the considerable power of the human visual cortex to help find patterns and structure in the data. 


Sometimes these approaches combine. I think that perhaps all of the above approaches are coalescing into a new field that could be termed 'Data Science'.


New Distributed Architecture Concepts


Machine Learning and New Architectures

Machine Learning

Visualization

A Few Blogs / Sites




Accumulo on EC2

I've posted a guide to running Accumulo on Amazon's EC2. Accumulo has been deployed on hundreds of machines on EC2 and it works pretty well.

Accumulo is an implementation of Google's BigTable with additional features such as cell-level security labels and programmable server-side aggregation.

As we scaled the cluster, we saw an 85% increase in the aggregate write rate each time we doubled the number of machines, reaching 1 million inserts per second at the 100-machine mark.
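
Out of curiosity, here's a tiny sketch of how that per-doubling factor projects across cluster sizes; the 1.85x figure is from the benchmark above, while the 10-machine baseline in the example is purely hypothetical.

```python
import math

def projected_rate(base_machines, base_rate, machines, per_doubling=1.85):
    """Project aggregate write rate assuming ~85% gain per doubling of machines."""
    doublings = math.log2(machines / base_machines)
    return base_rate * per_doubling ** doublings

# Working backwards from ~1 million inserts/sec at 100 machines, this model puts
# a hypothetical 10-machine cluster at roughly 130,000 inserts/sec.
print(int(projected_rate(100, 1_000_000, 10)))
```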

Netflix has shown similar results running Cassandra on EC2 on 100 machines in their benchmark.

Hadoop Genealogy

Came across this diagram of the evolution of Hadoop on the Apache Blog. Interesting that so much competition is going on down in the dirty details of features, like append. It's as if the commercial contention has visibly manifested itself in the code.



Online Data Availability Challenges

When deciding to share data with others, publish content, or create a new website or web application one must consider the problem of where the files and code must live. As the Internet Archive project illustrates, content on the web is unsettlingly ephemeral.

For those with an interest in creating some lasting content or presence on the web, and for those worried about data loss and high data availability, the following factors conspire to kill your data. I'll address each and hopefully a picture of a solution will emerge:

1. The majority of internet users are connected to the net via obfuscated means

Wouldn't it be nice if it were possible to make some content or data on your hard drive available to others without having to put it somewhere else? Alas, the nature of most internet users' connections means there's no good way for others to consistently find your machine from wherever they may be. First, the assignment of IP addresses to user machines is dynamic: you can get a different address every time you reconnect to the web. The address might not change that often, but it can't be relied on not to change, so other people have no stable way to find your machine and the data you may wish to make available.

There are dynamic DNS updating services to help with this (a sketch of what such an updater does appears at the end of this section), but then you run into the next hurdle: many users share a single IP address via a technology called NAT (network address translation), which only allows traffic returning in response to a request the user originated to find its way back to the user's machine, rather than unsolicited requests from anywhere on the web.

Dynamic addressing and NAT are probably good for security, but they completely destroy the option of making data hosted on your own machine highly available to others. Furthermore, upload speeds tend to be much lower than download speeds, so only a few users could reach your data simultaneously.

This means you've got to find some service out there on the net that is setup for making data available consistently and to a large number of users.
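
To make the dynamic DNS idea above concrete, here is a minimal sketch of what such an updater does, assuming hypothetical IP-echo and update endpoints rather than any particular provider's API.

```python
import time
import urllib.request

# Hypothetical endpoints -- substitute your dynamic DNS provider's actual API.
IP_ECHO_URL = "https://ip.example.com/"          # returns your public IP as plain text
UPDATE_URL  = "https://ddns.example.com/update"  # accepts ?host=...&ip=...

def public_ip():
    """Ask an external echo service what address our traffic appears to come from."""
    with urllib.request.urlopen(IP_ECHO_URL) as resp:
        return resp.read().decode().strip()

def update_record(hostname, ip):
    """Point the DNS name at the current address."""
    urllib.request.urlopen(f"{UPDATE_URL}?host={hostname}&ip={ip}")

last_ip = None
while True:
    ip = public_ip()
    if ip != last_ip:               # only push an update when the address changes
        update_record("myhomebox.example.net", ip)
        last_ip = ip
    time.sleep(300)                 # re-check every five minutes
```

Even with the name kept up to date, NAT will still drop unsolicited inbound connections, which is why this alone doesn't make self-hosting viable.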

2. Hard drives and other storage media fail fairly reliably

Conventional hard drives (not Solid State Disks) have moving parts and are often the first component of any computer to fail. It's just a matter of time. More expensive disks last longer but no disks last forever. This necessitates backing data up somewhere. But even then the backups are prone to fail. Writable CDs only last a few years, and other hard drives will also fail or may silently corrupt the data.

The safest option is to host several copies of the data on several disks that are monitored both for bit rot (not total drive failure, but just a few bits of data silently changing or going away) and for outright failure, upon which new copies are created from the remaining good copies. This can be provided via RAID, or newer distributed filesystems such as HDFS, which is the only filesystem I know of that checks for bit rot.
[Update: as @Zooko points out, ZFS and Tahoe-LAFS also provide this feature]
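
As a rough illustration of what such scrubbing involves, here is a minimal sketch that compares checksums across replicas and re-copies from a healthy one; the file paths are hypothetical, and real systems like HDFS keep per-block checksums alongside the data rather than comparing whole files.

```python
import hashlib
import shutil

def sha256(path):
    """Checksum a file in chunks so large files don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def scrub(replicas):
    """Compare replica checksums; overwrite any odd one out with a good copy."""
    digests = {path: sha256(path) for path in replicas}
    # Majority vote: the digest most replicas agree on is assumed to be good.
    good = max(set(digests.values()), key=list(digests.values()).count)
    healthy = next(path for path, d in digests.items() if d == good)
    for path, d in digests.items():
        if d != good:
            shutil.copyfile(healthy, path)  # re-replicate from a healthy copy

# Hypothetical example: three copies of the same file on three separate disks.
scrub(["/mnt/disk1/data.bin", "/mnt/disk2/data.bin", "/mnt/disk3/data.bin"])
```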

This means we have to find a service that offers to make multiple copies of your data on multiple spinning disks, monitor the copies, and automatically re-replicate as necessary. I mentioned a few technologies, and these are either very available, such as RAID, or are becoming more so, as with hosted distributed filesystems.

3. Web hosting companies can come and go

This doesn't necessarily threaten data loss, as most companies will probably offer to let you get your data somehow should they fail, but this process is not usually very well known if it exists at all. Even the big web companies have reluctantly responded to the need to allow users to extract all their data.

It would be best for data availability to have your data hosted by a strong company, with advanced data centers and enough scale to keep several copies of your data on separate spinning disks.

4. Web technologies, such as servers and browsers, change

This one may be the hardest challenge to overcome, as there doesn't seem to be much demand for solving it. The best one can do is stick to widespread standards, formats, and technologies to extend the usefulness of one's data over time. I believe we need more work on creating lasting data and content delivery systems that have fewer dependencies than the systems we have today.

5. The networked nature of the web gives rise to spiky page view behavior

Even if your data is protected against loss, others must be able to get to it. Because of the interlinked nature of the web, content tends to be accessed in spiky bursts as word spreads about the content through user sharing and re-posting. If your data or content isn't in a highly scalable application, some users will be denied access to it.

Modern content hosting services provide a technology known as a Content Delivery Network (CDN): essentially geographically distributed servers that hold static content and direct each request to the nearest physical copy. For dynamic data, scale can be addressed by applications distributed across a cluster of machines that can be grown and shrunk dynamically with no downtime, with requests load balanced across them.
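
To make the 'nearest physical copy' idea concrete, here is a toy sketch of the routing decision; the edge locations are made up, and real CDNs typically steer requests with DNS or anycast rather than application code like this.

```python
import math

# Hypothetical edge locations (latitude, longitude) holding copies of the content.
EDGES = {
    "us-east":  (38.9, -77.0),
    "eu-west":  (53.3,  -6.3),
    "ap-south": (19.1,  72.9),
}

def nearest_edge(lat, lon):
    """Route a request to the geographically closest replica (great-circle distance)."""
    def dist(a, b):
        la1, lo1, la2, lo2 = map(math.radians, (*a, *b))
        h = (math.sin((la2 - la1) / 2) ** 2
             + math.cos(la1) * math.cos(la2) * math.sin((lo2 - lo1) / 2) ** 2)
        return 2 * 6371 * math.asin(math.sqrt(h))  # kilometers
    return min(EDGES, key=lambda name: dist(EDGES[name], (lat, lon)))

print(nearest_edge(48.8, 2.3))  # a request from Paris is sent to "eu-west"
```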

This means we need to dynamically replicate content and applications across many machines to respond to user demand. Fortunately web technology has been trending towards this with the advent of dynamically scalable platforms.

At this point, a few services more or less fit this bill, namely some of the large 'cloud computing' offerings. Each offers some form of scalable, replicated content delivery. However, we have one more factor to address.

6. Controversial content may have a hard time finding a home and staying available

Content that threatens powerful interests, or that may be considered controversial, or that could be caught up in intellectual property disputes, is subject to being denied hosting if the affected interests can pressure content hosting companies. The unfortunate side effect of all the other factors, which point us towards large, thriving, and technologically advanced hosting companies, is that it makes it easier to have content made unavailable via legal or commercial pressure on one or two entities. 

Solutions to factors #6 and #3 seem to be at odds with each other.

So how does one achieve data replication, distribution, and scalability, but at the same time avoid centralized control? I believe we need more effort in this area, perhaps going back to challenge #1 of securely and consistently making content on user machines available. A few decentralized services exist to help avoid making content 'killable' through censorship efforts, and these may even get close to addressing the other requirements, but are not well adopted or integrated into the web. 

The perfect solution seems to be a decentralized, geographically distributed, widely addressable, dynamically scalable data replication and delivery service that is an essential component of several independent commercial endeavors, to ensure its long-term support and viability. I don't think we have many candidates, but I'd love to be proven wrong.