Aaron Cordova's Blog

Dremel vs. Tenzing vs. Sawzall

Recent buzz surrounding Google's Dremel and the potential for an open source implementation got me wondering about a similar paper Google published on a system called Tenzing, which has generated far less buzz.

Turns out both technologies are in use in Google's YouTube Data Warehouse. The slides from XLDB that describe the system highlight the tradeoffs below, which may be specific to Google's implementations but may also reveal a more fundamental tension between latency and query power. The slides also include Sawzall, a language for writing MapReduce jobs.

The slides contain the following table (note that 'high' is good in every row except latency, where low is best):

             Sawzall   Tenzing   Dremel
Latency      high      med       low
Scalability  high      high      med
SQL          none      high      med
Power        high      med       low


Looking at this chart, there appears to be a bit of a continuum. Dremel provides the lowest (best) latency, apparently at the cost of query power (no joins?).
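
The Dremel paper suggests what this reduced power looks like in practice: queries are single-table scans with aggregation over (possibly nested) columns. A minimal sketch of that shape, with placeholder table and column names:

    SELECT A, COUNT(B)
    FROM T
    GROUP BY A

No second table appears anywhere, so there is nothing to join.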

If more query power is required, moving to Tenzing appears to handle 'medium complexity analysis' with strong SQL support (i.e. more of the SQL spec is implemented, and it's likely more compatible with SQL-based systems). Tenzing sacrifices a bit of latency, but its scalability is actually considered better than Dremel's.
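
As a hedged illustration of the difference, here is the sort of query Tenzing's fuller SQL support allows but a join-less engine cannot express (the table and column names are invented for the example):

    SELECT u.country, COUNT(*) AS views
    FROM video_views v
    JOIN users u ON v.user_id = u.id
    GROUP BY u.country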

Finally, switching from declarative SQL-like queries to the procedural language Sawzall provides still more query power and control, at the cost of yet more latency.
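
To give a flavor of the procedural style, here is roughly the canonical example from the Sawzall paper (reproduced from memory, so treat the details as approximate). The script runs once per input record and emits values into aggregator tables, which the MapReduce framework combines across the whole dataset:

    count: table sum of int;             # number of records seen
    total: table sum of float;           # sum of the values
    sum_of_squares: table sum of float;  # for computing variance later
    x: float = input;                    # the current record, as a float
    emit count <- 1;
    emit total <- x;
    emit sum_of_squares <- x * x;

The loop over records, the shuffle, and the aggregation are all implicit; the programmer writes only the per-record logic.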

Open Source Options


Currently, Sawzall has been open sourced and can be found here. There is a proposal to create a Dremel implementation as an Apache Incubator project called Drill, from the guys at MapR and some other companies. There's also a project called OpenDremel.

These projects are interesting because higher scalability was long assumed to come at the cost of the interactivity and flexibility that SQL provides. Dremel demonstrates that low-latency, interactive, SQL-like queries are possible even at 'medium' scale.

I'd love to hear why the YTDW guys say Dremel doesn't scale as well as MapReduce, as I didn't get that sense from the research paper. The paper quotes 'trillions of rows in seconds', runs on 'thousands of CPUs and petabytes of data', and processes 'quadrillions of records per month'. That's 'medium' scalability only at Google. It's likely that Google's version of MapReduce scales to astronomical numbers, and that Dremel can handle the biggest datasets all but a few of us are likely to throw at it.