In this interview, we chat with Damien Katz, founder of Apache CouchDB.
CouchDB is a schema-less distributed document-oriented database that can be
queried over HTTP using a RESTful JSON API.
- Meeting the challenges associated with building a database and storage engine
- Contrasting document-oriented databases with relational ones
- Trade-offs relative to working with relational databases
- Smoothing the paradigm shift to schema-less and document-oriented databases
- The role of CouchApp and the Couchio company in the world of CouchDB
Scott Swigart: To start, please introduce yourself and talk a little bit about the CouchDB project and the Couchio company.
Damien Katz: I am the founder of the CouchDB project, and I’ve been in the industry now for 15 years. I started in the software industry on Lotus Notes, the back-end database model of which is largely the inspiration for CouchDB, although Lotus Notes was a much larger project than CouchDB is.
About five years ago, I was done with a startup and decided I didn’t want to take a normal job. Relational databases were everywhere at that point, and MySQL was getting insanely popular. I had a lot of fondness for the Lotus Notes platform, although it was very much the ugly duckling.
I saw all of its warts, but I also saw that it had some beauty, and I wanted to take that beautiful part and bring that into the modern open source web. I quit my paid software career and moved my family from Massachusetts to North Carolina and lived off of savings and occasional contract work while I started the CouchDB project.
We just founded Couchio, which is the company wrapped around CouchDB, last December. I moved my family out to California in January, and we’re doing the whole startup thing in Silicon Valley. We’re focusing on services and support and hosting right now.
Scott: Taking on developing a database and a storage engine is no small task. In fact, it’s one of the hardest things to do in computer science. In addition to considering the functionality, there are also performance and scalability considerations that become really important. Talk a little bit about taking on that challenge.
Damien: You’re right that there are a lot of issues to worry about with a database server, such as performance, concurrency, and reliability. When somebody writes something, does it actually get written? And when somebody reads something, was it supposed to be read? Was it actually committed by someone else?
At the root of these concerns are the ACID properties: atomic, consistent, isolated, and durable. It’s very difficult to accomplish that while also achieving the performance and reliability characteristics you want. It takes a lot of careful planning.
Doing it improperly can easily produce results that seem to perform well until you hit some sort of threshold where you run out of resources on the computer and everything sort of falls off. Or the computer doesn’t shut down properly and you might corrupt all your data.
There are a whole lot of challenges there that you have to really think about, and there are a surprisingly large number of ways that you can address those issues. CouchDB takes a fairly novel approach compared to most database engines.
Actually, not all database engines address those issues. Some just decide to punt by saying that you need to run on a very reliable server or on multiple machines for failover.
There are all sorts of different ways to directly solve these issues, as long as you really think about it carefully early on. Early in the CouchDB development cycle, it seems like I spent weeks or months just thinking about the design from a very high-level point of view.
I spent huge amounts of time planning how the software would interact with the clients and the disk, how multiple clients would interact with all that, and coming up with a simple design that would be reliable and provide good performance.
Scott: Another point of interest is that CouchDB is a document-oriented database. It’s far more common to be familiar with relational databases, so talk a little bit about this kind of database and the tradeoffs between it and your traditional relational database.
Damien: The traditional relational database, based on tables and rigid schemas, is an old design that’s very well studied and optimized. When you can break your design out into schemas and carefully size each column and table to match your data, you can make amazingly fast solutions.
But we often don’t have that sort of insight into our data model ahead of time when we’re developing an application. And frequently our data model just won’t fit into that kind of schema. The document model is an alternative, where the data is self-contained within each document.
To understand the model, it helps to think of business cards. Most business cards contain the same information: a person’s name, phone number, email address, possibly their actual physical mailing address, and other bits of information such as a professional title or the company name.
These cards are all documents, and they have close similarities as well as differences. This is an example of something that doesn’t necessarily fit very well into a relational database. Of course, business cards are relatively simple, and you might be able to figure out all the possible iterations and put them into a relational database.
But there are other types of documents where you don’t have that luxury ahead of time, and it can be especially difficult with something like content. That sort of data is ideal for a document database, where you’re going to have fields and attributes that vary broadly from document to document.
It doesn’t mean the documents themselves are completely different from one another, and usually they actually have a great deal in common. Usually, all the documents have a common set of attributes, but they also have unique attributes that vary document to document. Those are the types of things that relational databases have a difficult time with, but that document databases are ideal for.
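As a rough illustration of the business-card idea, the sketch below represents three cards as JSON-style documents and computes which fields every card shares and which appear only on some. The names, field names, and values are hypothetical and chosen only for the example; nothing here is specific to CouchDB itself.

```python
import json

# Hypothetical business-card documents; field names are illustrative only.
cards = [
    {"name": "Ada Example", "phone": "555-0100", "email": "ada@example.com"},
    {"name": "Bo Sample", "phone": "555-0101", "email": "bo@example.com",
     "title": "Engineer", "company": "Sample Co."},
    {"name": "Cy Demo", "email": "cy@example.com",
     "address": {"street": "1 Main St", "city": "Springfield"}},
]

# Fields present on every card versus fields that only some cards carry.
common_fields = set.intersection(*(set(card) for card in cards))
all_fields = set.union(*(set(card) for card in cards))
optional_fields = all_fields - common_fields

print(sorted(common_fields))
print(sorted(optional_fields))

# Each card serializes to JSON on its own, nested address included.
print(json.dumps(cards[2], indent=2))
```

No card had to declare its fields in advance: the third card simply carries a nested address that the others lack.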
Scott: From a programming standpoint, how do developers deal with that? To consider your business card example, it’s obviously simple to code against the stuff you’re expecting, but like you said, the whole point of a document-centered database is that it can deal with the more freeform stuff as well.
How do you see application developers dealing with unknown, free-form sets of fields?
Damien: Frequently, in a document-oriented app, you just want to capture the information at first, and you don’t necessarily even know how you want to report on it. The simplest approach is just to add the fields that you want to capture into the documents, and then later on figure out what to do with them.
For example, you might have captured a large set of foreign tax IDs, and then six months later, you decide how to create a report on all those content records. Therefore, it’s important to make it easy to get the data into the database, so you don’t have to adjust the schema, you don’t have to do an alter table or anything. You just stick the data in there and it’s captured.
Later on, you can create a new view without having to alter any of the data in the database, without having to alter any of the existing logic to create something that will show you all the documents that have these foreign tax IDs.
That’s an example of something that can be kind of difficult to do in a relational database. With a document database, the application doesn’t actually have to modify a schema somewhere in order to do this; it just basically sticks the attributes right in the document.
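The "capture now, report later" workflow can be sketched in a few lines. CouchDB views are normally written as JavaScript map functions stored in design documents; the Python below only imitates that idea in memory. The documents, the field name `foreign_tax_id`, and the helper `build_view` are all assumptions made for illustration.

```python
# Hypothetical documents that were captured with no up-front schema.
docs = [
    {"_id": "c1", "name": "Acme GmbH", "foreign_tax_id": "DE123"},
    {"_id": "c2", "name": "Local Shop"},
    {"_id": "c3", "name": "Nippon KK", "foreign_tax_id": "JP456"},
]


def map_foreign_tax_ids(doc):
    """Yield a (key, value) row only for documents that carry the field."""
    if "foreign_tax_id" in doc:
        yield doc["foreign_tax_id"], doc["name"]


def build_view(map_fn, documents):
    """Run the map function over every document and sort the rows by key."""
    rows = [
        (key, value, doc["_id"])
        for doc in documents
        for key, value in map_fn(doc)
    ]
    return sorted(rows)


view = build_view(map_foreign_tax_ids, docs)
for key, value, doc_id in view:
    print(key, value, doc_id)
```

The report is defined months after the data was stored, and the stored documents are never modified; documents without the field are simply skipped.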
Scott: Obviously, relational databases can do joins, and there are other things that they’re good at, but there are trade-offs. You have to consider what your data looks like to decide which model fits best with your application.
It also seems that document-oriented databases have strengths when it comes to distributed processing in the cloud. Parallelism used to mean multiple threads inside one computer, but more and more, it refers to nodes.
Tens, hundreds, or thousands of nodes each crunch on a piece of data, and then the results are put back together on the back end. Talk a little bit about how you see development changing as it moves to the cloud and where CouchDB fits in.
Damien: With relational databases, one of the difficulties is that, with a fixed schema, it’s difficult to have multiple instances of the same database in different places, particularly with a master/master type of setup.
A master/slave type of setup is the standard for relational databases across multiple machines. All the changes go into a single master, and you have multiple slaves to alleviate the read load. That works fine with relational databases, and you can also do that in CouchDB and other document databases.
On the other hand, in document databases, you can also have a master/master model, with geographically distributed databases that can be independently updated and can replicate their updates with each other.
With relational databases, you can do master/master setups with very careful planning of your data model ahead of time. You have to use a subset of the relational model to make sure that you don’t get into an invariant condition, and then you can do master/master.
Using relational databases, though, you can’t change the schema on one side in a master/master setup without a tremendous amount of planning and some downtime on each side. When you change the schema on one node, you create a conflict of schemas and data. You have to figure out how to translate the differences.
With a document-oriented database, each document contains its own schema, so those conflicts don’t arise. Instead, if two people edit the same document, that conflict is detected automatically by the system and can be resolved by the user or on an automated basis.
Relational technologies are designed to have a single master at any given time that is the authoritative master state of the data. Without a whole lot of planning, you have to build something on top of the relational model in order to get around that.
In the document model, it’s simple to open it up to multiple machines and move the data around. The data is independent rather than interrelated.
Document databases allow data to move between machines, within a single cluster or across clusters much more easily. Of course, that makes cloud computing much easier than with relational databases. It’s easier to partition new machines, bring new machines online, or move them across data centers.
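The conflict detection mentioned above can be sketched with a toy model in which every document carries its own revision history and two replicas are compared during replication. This is a simplified assumption for illustration, not CouchDB's actual revision tree or replication protocol; the function names and revision labels are hypothetical.

```python
import copy


def edit(replica, doc_id, changes, tag):
    """Apply an edit on one replica and append a new revision label."""
    doc = replica[doc_id]
    doc["body"].update(changes)
    doc["history"].append(f"{len(doc['history']) + 1}-{tag}")


def is_prefix(shorter, longer):
    """True when `longer` starts with every revision in `shorter`."""
    return longer[:len(shorter)] == shorter


def replicate(source, target):
    """Copy newer documents from source to target; return conflicting ids."""
    conflicts = []
    for doc_id, src in source.items():
        dst = target.get(doc_id)
        if dst is None or is_prefix(dst["history"], src["history"]):
            target[doc_id] = copy.deepcopy(src)   # source is the same or newer
        elif not is_prefix(src["history"], dst["history"]):
            conflicts.append(doc_id)              # histories have diverged
    return conflicts


east = {
    "card": {"history": ["1-a"], "body": {"phone": "555-0100"}},
    "note": {"history": ["1-a"], "body": {"text": "draft"}},
}
west = copy.deepcopy(east)

edit(east, "card", {"phone": "555-0199"}, "east")  # edited on both sides
edit(west, "card", {"title": "CTO"}, "west")
edit(east, "note", {"text": "final"}, "east")      # edited on one side only

conflicts = replicate(east, west)
print(conflicts)
print(west["note"]["body"])
```

The document edited on only one side replicates cleanly, while the document edited independently on both sides is flagged so that a person or an automated rule can resolve it.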
Scott: Schema-less databases and document-oriented databases are a paradigm shift, and they pose bigger challenges for programmers than something like picking up a new language. I would liken it to the transition from procedural to object-oriented programming.
When people try to pick up a functional programming language for the first time, they have to grasp an entirely new way of approaching the problem. I would imagine that there is a tendency by some people to try to force it to work like a relational database.
What are some of the key concepts that people need to wrap their heads around in order to use CouchDB the way it’s meant to be used, rather than shoehorning it into working the old way?
Damien: The first thing to consider is the tendency to normalize your data excessively. Although CouchDB isn’t a relational database, it is possible to use a customer ID in an order record instead of putting the customer’s name there, and that might actually be a better way in CouchDB.
Still, you could go too far and make everything relational. For example, when a customer buys something from a store, the receipt doesn’t have the store’s tax ID number on it, right? They give you the actual name of the store and the address and the phone number.
Those things could change later on. The store could change names, it could change ownership, it could move, or it could change phone numbers, and any piece of information on that receipt could potentially become out of date.
That receipt still has value to us, as a record of a transaction at a point in time. The relational paradigm says don’t do that, we need to normalize everything. The document paradigm says a receipt is perfectly fine, and we can put the name and phone number right on the receipt.
CouchDB encourages you to think more in terms of the receipt, but it can also allow you to be more relational. You don’t have to be entirely non-relational and de-normalized. The challenge for developers to wrap their heads around is where to find that sweet spot, so they can decide optimally where to normalize and denormalize in CouchDB.
That’s actually the challenge in a relational database, too. It is very easy to over-normalize a relational database. As a result, it can become difficult to work with, all the queries can become unwieldy because you’re doing way too many joins, and everything can become more obfuscated than it should be.
There is a similar issue with CouchDB, in the sense that you can go over-normalized or under-normalized, although in CouchDB, you generally want to lean more toward a de-normalized form.
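The receipt example can be made concrete with a small sketch. The receipt below keeps a reference to the store (the normalized part) and also copies the store's name and phone number at the time of sale (the denormalized part). The document shapes, field names, and the helper `make_receipt` are hypothetical.

```python
import copy

# Hypothetical store record that may change over time.
store = {"_id": "store-17", "name": "Corner Books", "phone": "555-0100"}


def make_receipt(store_doc, items):
    """Snapshot the store details onto the receipt, and keep the ID as well."""
    return {
        "type": "receipt",
        "store_id": store_doc["_id"],       # reference, useful for lookups
        "store_name": store_doc["name"],    # copied as of the time of sale
        "store_phone": store_doc["phone"],
        "items": copy.deepcopy(items),
        "total": sum(item["price"] for item in items),
    }


receipt = make_receipt(
    store,
    [{"title": "Novel", "price": 12.5}, {"title": "Map", "price": 4.0}],
)

# Later the store is renamed; the receipt still records the original name.
store["name"] = "Corner Books & Cafe"
print(receipt["store_name"])
print(receipt["total"])
```

Where to copy and where to reference is the judgment call described above; this sketch does a little of both.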
Scott: I think that data warehousing foreshadowed this transition to some extent, in the sense that, in certain scenarios you have to embrace denormalization. On one hand, you have to be concerned with breaking things down granularly and having tables and sub-tables, and joining all kinds of stuff together.
But to actually get performance, the data warehouse has to have a pre-denormalized view, to let you do that sort of stuff.
A document-centric database in some ways maps to object-oriented programming a bit better, right? A document is sort of a hierarchical object; if you think of an HTML page, it’s got a title, it’s got a body, and it’s got a bunch of elements in it that are defined on one hand, but on the other hand they can be in any order and so forth.
Talk a little bit about how a document-centric database is a more natural fit for an object-oriented programmer than this sort of data-mapping layer where we’ve always had to take our objects, shatter them into tables, push them out into a relational database, and then kind of reform the hierarchy as we pull stuff out and get it back into objects.
Damien: I’ve always felt that, at its heart, CouchDB is really an object database. The reason I don’t call it an object database is that it’s a very overloaded term in the industry, and people have a set of expectations that go along with object databases.
Nowadays, there really aren’t that many popular object databases, but back in the day, Gemstone had some popularity. Maybe it’s still popular these days, but I haven’t heard much about it. This was the idea of a transparent object-persistence layer for a programming language. Gemstone was for Smalltalk, there was Zope for Python, and I know there were others for other languages.
These are essentially databases to store programming objects natively right into a database without any sort of real translation layer. CouchDB, on the other hand, is not a seamless persistence layer where you get your objects and then you have different methods and everything also gets stored in there.
That’s the biggest difference, but there are still objects. They just don’t have the methods, and they’re not in an object-oriented programming database.
I think one of the big reasons object databases didn’t catch on is that they’re married too closely to the languages they are designed to work with. That’s convenient when you’re working with the language, but the data frequently outlives the application or the project, and you also need to share the data with other applications or other projects.
With CouchDB and JSON, you can use it from any language, but if you’re using something like Gemstone from Smalltalk to instantiate one of those objects, you’re going to have to call it from Smalltalk. If you’re going to access it from something else, you are going to have to write an access wrapper.
That becomes a big obstacle to re-using that data from another application. I think that’s one of the big reasons that object databases never really caught on and never had a chance against SQL which is its own simple language and so it didn’t have that problem.
It also plays much more easily with a variety of programming languages than any single-object database does. CouchDB’s popularity owes in part to the fact that it can play very easily with a wide range of programming languages. And it opens up your database to a lot more applications and being able to share the data, which is very important.
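Because the interface is just JSON over HTTP, any language with an HTTP client can talk to the database. The sketch below uses only Python's standard library to build, without sending, a PUT request that would store a document. The server address (CouchDB's default port 5984 on localhost), the database name `cards`, and the document ID are assumptions for illustration.

```python
import json
import urllib.request

# Assumed values: a local CouchDB server on its default port.
BASE_URL = "http://localhost:5984"


def build_put_request(db, doc_id, doc):
    """Build (but do not send) an HTTP PUT that would store a JSON document."""
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/{db}/{doc_id}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_put_request(
    "cards", "ada", {"name": "Ada Example", "email": "ada@example.com"}
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would send it, which requires a running server and an existing database. The same request could be issued from any other language, since nothing in it is specific to Python.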
Scott: As cloud development matures and evolves, what do you see as the challenges, opportunities and future direction for CouchDB?
Damien: The large data stuff hasn’t really been our primary focus at CouchDB. That’s not to say that we haven’t focused there at all, and we have always designed the CouchDB data model so it can scale up.
But our focus has been primarily on making sure that we can build applications very easily, very rapidly, so they can replicate. Building really large scale applications has always been a secondary goal for CouchDB.
We make it very easy to launch web apps, scale them up or scale them down, and not worry too much about their reliability. You can just bang them out, get them working, and deploy them. That’s primarily our interest.
Scott: Talk a little bit about Couchio and what the company does.
Damien: Right now we’re providing services and support. Our first customer that we signed up for support is Canonical, and we’re helping them with their Ubuntu One service. We are just about to launch our hosting service. We are in private beta, and we hope to initiate public beta soon.
Our focus with hosting is going to be on low-friction hosting, to get people started with as little work as possible. We aren’t focused on large-scale hosting, for people with huge data sets and things like that.
We’re building these CouchApps for things like mail, timelines, scheduling, and document management. We want to provide hosting and support for these CouchApps, so users don’t necessarily even know that these CouchApps are built on top of CouchDB, they just know that their applications are being made to run their business.
And the benefit of these CouchApps is that they can run them in the cloud, they can run them in the browser, and they can run them on a small server in their office. The whole time, we have them backed up and they’re available in the cloud via replication.
Their data’s always with them, it’s installed locally on their laptops or on their phones, and it’s always available using CouchDB for replication. So if their network connection goes down, or it’s really slow, or if they’re traveling on an airplane, they always have access to their data.
Moreover, they’re always accessing it through the same browser interface, and regardless of the network conditions, it’s always a fast, low-latency connection, because it’s local.
Scott: This notion of a CouchApp, where the app is stored in the database, is fascinating, and it makes sense because the database can store anything. And it’s automatically replicated, so if you needed to stand up more nodes or whatever behind your load balancer or those kinds of things, it’s a piece of cake.
Damien: All of those can be stored in a single design document, and these replicate around with the data. You can also use apps to put them up on the server cloud, so you point your browser at it and it’s just like a regular web browser app. We have several examples of these.
Mozilla has written one called Raindrop that’s basically a mail-style app. We’ve written a chat app and a calendar app.
You run these in your browser, and they’re just like regular web apps. The really cool thing is that once you install CouchDB locally, you can replicate these down and still run them even if you’re disconnected from the network. You have the exact same experience.
We want to get the browser plugins, so you can integrate them straight in the browser. In the CouchApps, there would be the option to use it offline, which would cause it to replicate locally.
The browser would still deliver the same user experience, like a local proxy. They would still see in the URL that they’re hitting the remote, but it would intercept all those requests and hit the local replica.
Any changes they make to the local replica immediately get replicated up to the cloud, and any changes somebody makes to the cloud version are replicated down to the local one, so you always have the most up to date version locally.
It could also be an appliance that’s rented to somebody. There’s this thing called the SheevaPlug, which is a really small Linux plug computer that’s about the size of an AirPort Express. It runs Linux, and it has a hard drive and a little web server. We’re going to put CouchDB on those types of devices.
Then people can rent these plugs that have the server and their application on there. Their cloud provider will mail it to them, they plug it into the wall, and it automatically connects to the cloud CouchDB instance.
They would get a really high bandwidth, low latency experience for these apps, because they would be served straight from the plug, which is also constantly replicating with the cloud. This solves problems for small businesses that are running off of a cable modem, and their telco has oversold.
A lot of people run off of DSL or cable modem, and frequently throughout the day, it gets slow or stops working altogether. The customers don’t have access to the network for whatever reason, so they’re completely in a holding pattern until the network comes back up. The plug makes it so the data is always right there locally.
Scott: That’s a great explanation, because it makes it clear to people how this technology can actually help them in their business.
Damien: The idea behind that little device is that now you have a little tiny server that’s always tethered to the cloud. You can serve applications to everybody in the office, and nobody has to install anything.
You’ve essentially replaced the whole Microsoft Office package. Previously, you’d have to buy Microsoft Office and install Office Server and Exchange and all sorts of different things in order to deal with the fact that you can’t have everything in the cloud because your network connection is just too unreliable.
This bridges that world, so now you’ve got this low-cost tiny plug and everything’s higher speed because it’s local, and then you’ve also got everything in the cloud, so that everything’s backed up, professionally administered, and up to date. And if the plug dies, everybody in the office just accesses the cloud version while you wait for FedEx to deliver you a brand new plug.
Scott: I want to be sensitive to the time. Are there any closing thoughts from your side?
Damien: We’ve just gone to version 0.11, which is our release candidate for 1.0. We’ve already been very stable as a database engine, and right now what we’re actually trying to stabilize on is our program interface, and we’re about ready to go.
Scott: Thanks for taking the time to talk today.
Damien: Thank you.