In this interview, we talk with Eliot Horowitz, founder of 10gen, which is the
company that created MongoDB. What's MongoDB? Well, it's an open source,
document-oriented, schema-free database designed for massive scale and
performance. If that got your attention, then I invite you to read on…
- The evolution of various data models in databases
- Advantages of document-oriented databases in terms of horizontal scaling
- Handling issues when columns are added or deleted between application versions
- Suitability of MongoDB for the cloud
- OS support and usage with MongoDB
- The company (10gen) wrapped around MongoDB
- Community involvement in the development of MongoDB
Scott Swigart: To start, can you introduce yourself, 10gen, and the role of open source behind it?
Eliot Horowitz: Sure. I’m the CTO and co-founder of 10gen, which was really created to build MongoDB, an open-source, document-based database. We started 10gen and started working on MongoDB mostly out of our own frustrations after building scalable web infrastructure for a dozen years.
It really just came to a point where we were constantly fighting the same battles and building one-off solutions to handle functional gaps in relational databases. We started building MongoDB about two years ago, and we launched it earlier this year. People are getting very excited about it.
It’s very much targeted toward the standard web developer’s needs, anywhere from a simple one-guy website, all the way up to a Facebook-level, highly scalable, highly distributed system.
Scott: Building a storage engine is no trivial task, and beyond that, for a long time the bread and butter way of doing it was with relational databases. There was Oracle on one end of the spectrum and MySQL on the other, with a lot of other stuff in between, but people were used to throwing select statements at a database, joining tables together.
Then companies like Google came along, and over time, they’ve started to store data differently. It’s sharded; it’s a schema-less store. You see things like SimpleDB up on Amazon, and I think Microsoft’s doing something similar and offering a data store. Talk a little bit about where MongoDB fits in the spectrum of databases.
Eliot: I think the first question you have to ask about any database these days is, “What’s the data model?” First, there’s the relational data model, which includes Oracle, MySQL, and all those. Then you have key-value, which has been around for a while and includes things like memcached and Dynamo from Amazon, where it’s really just “get/put” type operations.
Obviously, those have been used for a long time, and they’re great for many cases. When you need simple gets and puts, they’re highly scalable, highly distributable, and very fast. They also have a very simple data model, at the farthest possible extreme from SQL.
And then you’ve got a bunch of other data models in the middle: graph databases, tabular databases like BigTable, and document databases like Mongo.
There are a few different ways that you can think about document databases. One of the nice things about document databases is that they’re closely mapped to how most developers are writing code, whereas SQL databases were designed for accounting and banking 30 or 40 years ago, prior to the advent of web applications and the rise of object-oriented programming.
Now, people are writing in PHP, Ruby, or Python, and they’re working with rich objects (like user profiles). Trying to map those to relational data models is complex and slow. You can see the evidence of that in the number of different object-relational mapping libraries (ORMs) out there.
There are probably 1,000 different ORMs in the world, so clearly, it’s an unsolved problem, and every week you hear about a new one coming out. That whole class of problems exists because there’s a very clunky mapping from objects to relational databases. With document databases, that mapping becomes much simpler.
In many cases, it enables a direct mapping. You can take your object and write it directly to the database. And when you do need to do a mapping, the whole class of problems is massively simplified.
Another aspect of the same issue is that, when you loosen some of the relational constraints, you don’t need joins, because the data model is more suited to rich objects. For example, if you want to store tags for a blog post, you can store them as an embedded array rather than having a separate table that you have to join against. Once you get rid of joins, scaling horizontally becomes much simpler, which is a great benefit.
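The embedded-tags example Eliot describes can be sketched in a few lines. This is an illustrative plain-Python sketch, not the MongoDB API; the field names are hypothetical:

```python
# A blog post stored as one document with tags embedded as an array,
# versus the relational approach of a separate posts_tags join table.
post = {
    "_id": 1,
    "title": "Why document databases?",
    "author": "eliot",
    "tags": ["databases", "mongodb", "scaling"],  # embedded array, no join table
}

# Finding posts with a given tag is a scan over one collection,
# with no join required:
posts = [post]
tagged = [p for p in posts if "mongodb" in p["tags"]]
print(len(tagged))
```

The point is that the tags travel with the post: reading or writing the post touches one record, not two tables.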
A bunch of the new databases coming out right now are trying to completely change the database paradigms, in terms of issues like how replication works or how you query the database. One of the things we’re trying to do with MongoDB is to keep it close to what people are used to. So, when moving from a relational database to MongoDB, there aren’t too many surprises.
You still have collections that are basically equivalent to tables, you still have different databases, and you still create indexes. You can still do an ‘explain’ to see what kind of query plan you’re going to get, and you can do ad hoc queries.
That means that if you’re coming from MySQL, it’s a pretty easy transition to MongoDB. The data model changes, but a lot of the things you’re used to are still going to be the same, mostly because we don’t think there’s any need to change them.
Scott: There was a guy a number of years ago who said that object-relational mapping is the Vietnam of our industry. It’s the thing that people just keep trying over and over to do, and it always ends up being excessively complicated. Back in my coding days, I spent as much time writing the mapping as I would have spent just writing the code to hydrate and dehydrate the objects.
Still, there have been object-oriented databases in the past. Talk to me about some of the things that make MongoDB unique or interesting compared to preceding object-oriented databases.
Eliot: The first object databases came out in the early ’90s, and there were a bunch of problems. For one thing, a lot of code was being written in C++ then, and the databases were tied tightly to the language’s types, so the mapping layer really didn’t work all that well.
You could take a C++ object and save it, but that’s not how most people were working. And then you had other languages where you didn’t really have objects the same way you do now. Remember that it was still very early in the trajectory of object-oriented programming at that point, so in some ways those databases were ahead of the curve.
The applications people were writing then were often more suited to relational databases, I think. Now, web scalability is a lot more important, because things can get big pretty quickly. I think people want much richer information from their databases, and they’re trying to use them in very different ways than they used to.
If you’re building a banking system or an e-commerce system, relational databases work pretty well. It’s really when you get into Web 2.0 sites that relational databases really start having troubles. If you look at Twitter, Facebook, or those kinds of websites, the needs have changed.
Really, though, I think the biggest thing is that the programming languages have changed a lot. No one’s writing web applications in C. Most web applications are now being written in PHP, Ruby, Python, and some Java, all of which map very nicely to a document database. You can interact with any one of those languages and it all works pretty well.
I also think, going back to what we mentioned before, that the object databases before were actually more closely related to current graph databases than to document databases. The document database is really just taking MySQL, and instead of having a row, you have a document. So I think it’s a much simpler transition and it’s actually much closer to MySQL than a lot of people might think.
Scott: I’m calling it object-oriented, and you’re calling it document-oriented. Can you unpack that for me a little bit?
Eliot: In our case, a document is a BSON document, which you can think of as similar to a JSON document. You can have numbers, strings, dates, embedded arrays, and embedded objects.
If you have a user profile, instead of having first name, last name, street, city, and zip as flat fields, you can have an address sub-object with its own fields. Then you can have multiple addresses, so instead of having address1 and address2 fields, you can have an array of addresses, and it’s already much simpler.
In your code, you can map that very cleanly. You have an address object that maps to this object, and you’ve got an array of them. You could have none, or you can have as many as you want. The database doesn’t really have a clunky interface, and you’re not joining, so there’s no huge performance hit. It’s all stored in the same place on disk.
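The profile Eliot describes looks something like this as a document. This is a hypothetical sketch in plain Python (a BSON document maps naturally onto a nested dict):

```python
# A user profile as one document: the addresses live in an embedded
# array rather than a separate, joined addresses table.
user = {
    "first_name": "Ada",
    "last_name": "Lovelace",
    "addresses": [  # zero or more; no address1/address2 columns needed
        {"street": "12 St James Sq", "city": "London", "zip": "SW1"},
        {"street": "1 Main St", "city": "Hartford", "zip": "06103"},
    ],
}

# The whole profile round-trips as one value; nothing to reassemble
# from joins, and the code's address objects map straight onto it.
cities = [a["city"] for a in user["addresses"]]
```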
Scott: Is a document basically a language-agnostic object, for lack of a better term? In other words, it’s not tied to PHP or Ruby or any particular programming language, but it basically has object concepts?
Eliot: Right; it’s basically like JSON, which is used pretty heavily in transporting data across the web now. We use BSON, which is basically like JSON except that it’s binary and has more types. JSON doesn’t have a native date type, and there are more types you need for web infrastructure.
Rather than hacking JSON, we just created a binary version that’s both easier for computers to read and extensible. It has more types, so you can have as many types as you need for real web applications.
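The missing-date-type problem Eliot mentions is easy to demonstrate. A small Python sketch (the field names are made up):

```python
import json
from datetime import datetime

doc = {"user": "eliot", "created": datetime(2009, 10, 1, 12, 0)}

# Plain JSON has no native date type, so serializing a datetime fails:
try:
    json.dumps(doc)
    date_supported = True
except TypeError:
    date_supported = False

# JSON users fall back to encoding dates as strings, which loses the
# type; BSON instead carries a real date type (among others) natively.
as_string = json.dumps({"user": "eliot",
                        "created": doc["created"].isoformat()})
```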
Scott: Talk a little bit about scaling. You mentioned that because you don’t have joins and those sorts of things, and the data is all stored in one place, it scales horizontally very well. What does that actually look like, and how does it put the data back together, find which nodes different pieces of data are stored on, and resolve those kind of issues?
Eliot: The simple case is user profiles. Let’s say you have a billion user profiles and you needed 10 machines to store them all. Basically what you would do in this case is to say, “OK, I want to start my users collection,” and you could shard it on anything you want. You could shard it on email address or first name, or you could shard it geographically. You could shard it by country and then by state, depending on what kinds of queries you do.
Let’s say you sharded by using email addresses. Looking for a particular user is very fast. Now let’s say I want to find all people who live in Connecticut. It basically does a merge sort across all of the different shards.
It will go to each one and say, “OK, give me your users from Connecticut and get all of them,” and then it will aggregate them for you on the fly. From a driver/client perspective, it’s just the same if you’re sharded or if you’re not sharded. This is what people have been doing manually for years.
I’ve built many systems in the past where I needed 11 different databases to store something, and I did the sharding myself, which is a lot of manual work. What we’re doing is basically just generalizing what developers have been doing themselves in one-off solutions for years. It’s much simpler, because you don’t have to do joins.
A distributed join is very, very hard, but when your user profile is one object, it makes the problem much simpler.
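The scatter-gather query Eliot walks through can be sketched in miniature. This is a toy plain-Python illustration, not how MongoDB is implemented (in real deployments the mongos router does this fan-out for you):

```python
# Users sharded by email address across a few shards; a query on a
# non-shard-key field (state) fans out to every shard and merges.
NUM_SHARDS = 3
shards = [[] for _ in range(NUM_SHARDS)]

def shard_for(email):
    # Shard key: a stable hash of the email address.
    return sum(map(ord, email)) % NUM_SHARDS

def insert(user):
    shards[shard_for(user["email"])].append(user)

def find_by_state(state):
    # No single shard owns "state", so ask each shard and merge the
    # partial results on the fly.
    results = []
    for shard in shards:
        results.extend(u for u in shard if u["state"] == state)
    return sorted(results, key=lambda u: u["email"])

insert({"email": "alice@example.com", "state": "CT"})
insert({"email": "bob@example.com", "state": "NY"})
insert({"email": "carol@example.com", "state": "CT"})
print([u["email"] for u in find_by_state("CT")])
```

A lookup by email, by contrast, hashes straight to one shard, which is why the shard key should match the most common query.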
Scott: It’s always interesting for me to ask where the edges are. What are some scenarios where a relational database may still be a better choice? Are there other tables where, maybe, a big table store is a better choice?
Eliot: I think BigTable solves the same general problems, although I think we have some advantages over it. I think the document model is a little bit better than the tabular model, and there are a lot of subtleties there.
The big reason you use a relational database is if you need multi-object transactions. We don’t support multi-object transactions because of the performance overhead and complexity of doing them in a distributed system.
If you’re writing a banking system and you need to debit my account, credit your account, and do it completely atomically, we don’t support that. However, you can act atomically on any single object.
For example, you could say, “Increment this field on this object by five.” You can have it so lots of threads can do it, lots of different machines can do it, and it’s all atomic and safe.
Scott: Another issue that’s interesting is dealing with versioning. For example, in the course of versioning an application, you may change your objects, add properties to them, or various other things. With a relational database, it’s a relatively painful process up front, where when you change the schema, you’re changing the schema for the database.
That can mean that you have to populate a whole bunch of new columns with default values or do other things to migrate the database to the new version. With Google App Engine, it doesn’t work that way. You’re never really rolling the whole database forward to a new version, but you have to do more defensive programming.
For example, let’s say you pull a user out of the database and that user hasn’t been touched for a version or two of your app. You’re pulling out the data the way it was stored the last time that user was saved, and so you may have to do some stuff in your code to map that older user data to your newer object. How do you guys handle that?
Eliot: You could migrate the entire database, but you don’t have to, which is the big thing. The simplest case is where you want to add a field. Here, the nice thing is that you can just start using the field. If an old object doesn’t have it, it will just return null, and it’s pretty easy and clean.
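The add-a-field case Eliot describes looks like this in application code. A hypothetical sketch (the `timezone` field and helper are made up for illustration):

```python
# An "old" document saved before the field existed simply lacks it;
# a "new" one carries it. No migration is required to start reading it.
old_user = {"name": "Bob"}                        # saved by app v1
new_user = {"name": "Alice", "timezone": "EST"}   # saved by app v2

def timezone_of(user):
    # A missing field behaves like null, so the code supplies a default
    # instead of the database back-filling a column.
    return user.get("timezone") or "UTC"

print(timezone_of(old_user), timezone_of(new_user))
```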
If you want to really change the format, you could actually have a migration script that migrates all your data, but that would require you to bring your site offline for an hour or so.
That’s another big change over the last ten years: people are no longer willing to have maintenance windows. I remember working on systems where it was totally acceptable to have a two-hour maintenance window on a Saturday morning.
If Twitter went down for a two-hour maintenance window once a month, I don’t think people would be too happy, and I think databases have to handle that. In any case, these databases typically require you to do a little bit more work on the client side to make sure that you know what you’re getting out and what you’re putting back in.
Scott: On another topic, the cloud is a very important subject these days, and it seems like MongoDB is very well suited for something like Amazon EC2 and infrastructure on demand, where you’re able to install your own software and those kinds of things.
Obviously, the things that are more “platform as a service,” whether it’s ads or Google App Engine, have their own store, and to some degree you get what you get, which is what they provide. Talk a little bit about where you’re seeing usage of MongoDB in the data center versus in the cloud and some of the interesting uses you’re seeing for it.
Eliot: It does work well in the cloud, because it’s horizontally scalable, although it’s certainly not dependent on that usage. We see a lot of people running it in their own data centers, on their own hardware.
If you look at the startups that are using Mongo, a huge percentage of them are running on EC2, and I think that has less to do with MongoDB than with the fact that EC2 is a popular place to host sites.
We’re perfectly happy to run anywhere, but if you look at where people are building websites right now, it’s very heavily EC2 and other cloud-based solutions. You do not see very many startups buying and racking hardware to go on the Internet, which I did six years ago.
Horizontal scalability is a great thing for the cloud, because if you do grow quickly, you can add capacity very quickly. If you’re not horizontally scalable, the cloud doesn’t really help you that much; it’s actually worse, because you can’t buy or rent a $2 million Sun box on EC2.
Scott: If you start out with a small instance and you want to upgrade, and it’s your own hardware and your own data center, you can put more memory and CPU in it. With a cloud provider, you tend to get the type of instance you said you wanted, unless you want to shut it down and migrate the whole thing over, which causes that downtime.
Is this basically a Linux technology, or do you have significant Windows users?
Eliot: We have more Windows users than I would ever have guessed. If you asked me nine months ago if we were going to have anyone running in production with Windows, I would have said, “Maybe one or two.”
It’s a low percentage, but there are definitely many people running on Windows in production. We fully support Windows, OS X, Linux, and Solaris.
Scott: That’s interesting for the Windows developers. Is it sort of like PHP on Windows, or are people actually using it as a back end for .NET apps?
Eliot: PHP and .NET; not so much anything else. There’s a C# driver, and someone is working on a pure .NET driver. My knowledge of that right now is a little weak, but there are a bunch of different .NET things going on. I think there is a LINQ adapter, also.
Scott: We’ve talked a lot about the database and the project, so can you talk a little bit about 10gen and the company wrapped around it? What do you do around MongoDB, for instance?
Eliot: All the core MongoDB developers and the main developers of all the official drivers are 10gen employees. We basically sponsor the projects, and we are doing most of the development work. We also offer commercial support and training.
There are other interesting revenue opportunities for us, but those are the main things right now.
Scott: Talk a little bit about where it’s going. What are some of the top feature requests, or what do you see as some of the opportunities and interesting directions?
Eliot: If you look at our road map for this year, there’s no one big feature. I think the only big thing we’re doing right now is getting the auto-sharding to be fully production bulletproof. Traditional master-slave replication and all that kind of stuff is pretty much bulletproof at this point.
The rest of it is just a lot of features. We are sort of similar to MySQL or PostgreSQL in terms of how you could use us, and people want all the features that they’re used to in MySQL and PostgreSQL. These include things like full-text search, SNMP, and all the assorted add-ons providing special indexing.
Scott: That must be an interesting thing to balance, because for example, MySQL’s original claim to fame was that it was simple, it was fast, and it wasn’t trying to be your enterprise relational database.
Then it just sort of kept growing and growing, because people kept asking for more things, until people looked at it and said, “What happened to my simple, lightweight, fast database?” And then you saw people going off and starting to work on Drizzle and those sorts of things.
How do you balance that request for one more thing versus wanting not to lose sight of what you set out to build and the scenarios that you really wanted to nail?
Eliot: I think it comes down to a few key things. One is making sure that we don’t regress on performance, which is one thing we’re very careful about. We constantly monitor our performance and make sure we’re not doing things that hurt it.
If there was a feature that would hurt our performance, we would think long and hard about implementing it, and we are definitely more interested in making the basics work than we are in adding more features.
We’re also careful about which features we let in, emphasizing common web infrastructure needs. It’s not useful for us to be simple and fast if you also need another database because you can’t do x, y, and z.
It’s very much the 80/20 rule. We want to cover the most general, wide-reaching things that people need, but it really comes down to good software engineering and good product management to make sure that we don’t screw up the basics or break the simple stuff by adding extra features. Another important aspect is good documentation.
Scott: I don’t want to throw too many software companies under the bus, but there is often a willingness to regress on performance, because they figure that hardware has gotten faster, so the user won’t be affected, even if, in terms of CPU cycles, it’s taking longer to do stuff.
How hard and fast is your policy of comparing the old version and the new one on the same box to make sure it didn’t get any slower?
Eliot: The next version coming out in a couple of months will be significantly faster than the previous version on the same hardware. We’re constantly working on that, so I’d like to think that that’s a pretty hard rule.
Scott: As time goes on, ultimately you want people to be using it, and you want it to meet people’s needs. Can you talk a bit about cases where, even if a change might hurt performance but really helps you nail a usage scenario, implementing it might be the right choice?
Eliot: I think a lot of that really comes back down to software engineering. For example, you can have such a feature but make sure that people can turn it off if they’re not using it, and that it won’t hurt their performance if they do turn it off.
We’re really trying to be very good on that level.
Scott: You mentioned that the people who are writing MongoDB mostly work for 10gen. Talk a little bit about the community story that lives around that.
Eliot: We’re seeing very few contributions to the core server code, mostly because it’s big and it’s complicated. There is a huge community, and it is unbelievably helpful in terms of the drivers. We’re getting lots and lots of great patches and contributors to the official drivers, so if you look at the PHP, Perl, Ruby, and Python drivers, there are tons of contributions.
The community has also been building a lot of great tools that sit on top of the drivers. One of the things we’re trying to do is keep the drivers pretty low level, very simple, and as fast as possible. If you want to hook into a login service, or if you want to hook up Apache to MongoDB for logging, we’ve seen a lot of great projects like that.
Scott: That’s interesting, in the sense that it’s a relatively common thread across a lot of successful open source projects that the number of people who contribute to the core might be really small, but the number of people who write add-ons, for example, might be very large.
It sounds like the ancillary components, for lack of a better term, related to MongoDB largely consist of those drivers, as well as administration and reporting tools that layer on top and add a lot of value.
Eliot: Right, and for example, there’s also a pretty standard adapter for Apache to log to MySQL. People are going to naturally want that to go to MongoDB, and there are going to be a lot of other different things that people will want.
That’s where the community has been very helpful, because we can’t be experts on a thousand different technologies, so it really makes sense for the experts in those fields to build them. It really works better for everyone.
Scott: Well, we’ve gotten to the end of our time, and I think we’ve covered everything I wanted to. Thanks for taking the time to talk to me today.
Eliot: Thank you. I’ve enjoyed it.