Cloudera has built a business around the open-source Hadoop project, and in this
interview, we chat with Cloudera CTO and cofounder Amr Awadallah about:
- Background on Hadoop and its origins
- Implementing support for MapReduce to handle distributed data
- Suitability of Hadoop for cloud computing
- The relationship between Cloudera’s Hadoop version and the Apache version
- Cloudera’s business and monetization model around Hadoop
- Licensing models as they apply to Cloudera and Hadoop
Scott Swigart: Could you introduce yourself and Cloudera?
Amr Awadallah: I’m CTO, VP of engineering, and cofounder of Cloudera. I was previously VP of engineering for data systems at Yahoo!. That’s where I first got exposure to Hadoop and recognized the need to form a company to commercialize it. The company is just 13 months old now.
Scott: Can you fill in some of the blanks for our readers who may not be entirely familiar with Hadoop?
Amr: Sure. With traditional data systems, you have a bunch of filers where you store your data. A compute grid, which is essentially a bunch of computers, is connected to these filers; the computers fetch the data, run computations on it, then store the results back.
That model worked fine, but it started to break when data sizes got too big and moving the data between the storage network and the compute resources became a big stress on the network infrastructure.
There was also a big stress on the filer heads. A filer consists of a bunch of disks, and all these disks are attached to a head, which is really what does all the serving. The filer head itself starts to become a bottleneck when you’re running a job that accesses a large proportion of your data. If you’re just searching a file or two within your data, it’s still OK.
Those factors are what drove the need for Hadoop. We had massive amounts of data to process, and existing systems were not capable of scaling sufficiently. Hadoop itself came out of the open source community, though Yahoo! played a very big role, as well.
Scott: I’m always interested in the actual mechanics of how something as complex as Hadoop comes to be. Can you speak to that specifically?
Amr: Two developers named Doug Cutting and Mike Cafarella were trying to build a web-scale, distributed search engine. They built a system called Nutch about seven years ago.
Obviously, the system needed to be massively scalable to index the whole web, so they were spending a lot of energy in that area. Around that time, Google published two key papers, on Google MapReduce and the Google File System, GFS. It’s the marriage of those two technologies that enables the large scale data processing that we can do today.
GFS allows linking a lot of commodity hardware together to present one large, unified file system. MapReduce has two parts: the programming model and the job execution framework.
The MapReduce job execution framework lets you spread your job over multiple nodes, so that the first level of mapping, where you’re reading your data in and doing the first level of aggregation, is all local. All the communication is purely local between the computation resources and local disks. Therefore, you don’t take a performance hit or stress the network from passing data over it.
In the second phase, which is called the reduce phase, some inter-node communication takes place, which necessarily stresses the network infrastructure to some degree. Usually by that stage, though, the data has already been reduced in size to the subsets that you care about, so the stress on the network infrastructure is dramatically lower.
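The two phases described above can be sketched in miniature. The following word-count example is an illustration of the MapReduce programming model only, not Hadoop's actual API: each "node" maps over its local chunk independently, a shuffle groups intermediate keys, and a reduce aggregates them.

```python
from collections import defaultdict

# Map phase: each "node" processes its local chunk of data independently,
# emitting (key, value) pairs with no cross-node communication.
def map_phase(chunk):
    for line in chunk:
        for word in line.split():
            yield (word, 1)

# Shuffle: intermediate values are grouped by key across all nodes.
# This is the step that actually moves data over the network.
def shuffle(mapped_streams):
    groups = defaultdict(list)
    for stream in mapped_streams:
        for key, value in stream:
            groups[key].append(value)
    return groups

# Reduce phase: aggregate each key's values into a final result.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

# Two "nodes", each holding a local chunk of the input.
chunks = [["the quick brown fox"], ["the lazy dog", "the end"]]
counts = reduce_phase(shuffle(map_phase(c) for c in chunks))
print(counts["the"])  # 3
```

Note that the shuffle operates on data the map phase has already condensed, which is exactly why the reduce phase stresses the network far less than shipping raw data to a compute grid would.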
When Doug and Mike saw these papers, they had a &#8220;Eureka&#8221; moment and started to integrate these technologies into Nutch, their web-scale search technology.
Scott: How did the work at Yahoo! coincide with what Doug and Mike were doing with Nutch? I assume that Yahoo! must have adopted an open source approach in place of some of their proprietary algorithms. Is that true, and what went on behind the scenes with those decisions?
Amr: At the time, Yahoo! was also doing a lot of work internally on their infrastructure for serving search. As the web grew exponentially, the cost of serving a search was getting prohibitively high, and Yahoo! needed a more scalable system to cost-effectively build the web index.
They saw the MapReduce and GFS papers from Google as well, and in light of the work being done by then on the open source Hadoop project, they recognized that they had a choice. They could either build proprietary MapReduce and GFS technology from scratch, or they could adopt Hadoop internally and support the open source project.
A very big debate went on inside Yahoo! about the right direction to go. The proponents of the open source approach made the very good argument that supporting Hadoop would take away one of the key advantages that Google had. Namely, only Google had the systems to really take advantage of MapReduce and GFS, which locked in a tremendous advantage for them.
They argued successfully that it was in Yahoo!’s best interest to unlock that technology, so other startups could start adopting it and building ideas around it. The idea was that this approach would shift the power from Google.
In response to those arguments and others like them, Yahoo! hired Doug Cutting on staff and invested very heavily in building a large Hadoop team. In fact, the vast majority of the code base in Hadoop today came from Yahoo!, so they certainly deserve a lot of credit for that.
Scott: How did Hadoop move beyond its foundations in search?
Amr: Yahoo! initially integrated the technology within its web search infrastructure, but Hadoop soon started penetrating a lot of other areas of Yahoo! as well. Now it’s being used in Yahoo! Mail for spam filtering, as well as on the Yahoo! front page to model which stories to display to which users based on their interests. It’s being used for ad modeling and ad targeting, as well. Facebook was another early adopter and supporter of Hadoop.
Around that time, which was maybe two years ago, it was starting to be clear that Hadoop had implications for more than just the big web properties. The technology is useful to many other domains as well. In fact, this notion of moving computation to run as close as possible to the data is something that everybody should be doing, given how big data is growing these days.
This trend is being reinforced by the fact that commodity servers are becoming very powerful in terms of the number of cores and the number of disks you can have in them. Now you can have eight cores very economically, and within the next year you’re going to have 32 or even 64. You can have anywhere from four terabytes to even 48 terabytes of disk space on one of these standard server nodes.
It has become very clear that the trend is toward large hardware topologies, where systems will need to be very resilient to failure. They will need to know how to shift workloads and data around when things fail, so the system just heals itself.
Scott: One of the things that this model depends on is moving away from a traditional relational database that doesn’t really fragment the data across lots of nodes. Instead, those traditional systems are really based around the idea of a single, monolithic body of data.
That seems to be one of the big paradigm shifts for developers. When they start using something like BigTable where the data is distributed across a lot of nodes, they have to think a little differently about the way they build applications.
Working our way from the bottom up, talk a little bit about how you have to think differently about storing your data so that you can run a MapReduce process on top of it.
Amr: First, it’s worth making the important clarifying point that Hadoop is not a database. Hadoop is a data processing system, and in fact, I would even go as far as saying Hadoop is an operating system. The core of an operating system boils down to a file system, the storage of files, and a process scheduling system that runs applications on top of these files.
There are many other components that help with devices, credentials and user access, and so on, but that is the core. Hadoop is exactly the same thing. The core of Hadoop is the Hadoop Distributed File System, which is a file system that runs across many nodes. It links together the file systems on many local nodes to make them into one big file system. Hadoop MapReduce is really the job scheduling system that takes care of scheduling jobs on top of all those nodes.
That is the key distinction between Hadoop’s approach and that of database systems. Hadoop, at its heart, does not require any structure to your data. You can just upload files directly from anywhere, like a web server, an RFID device, or a mobile device, directly into Hadoop.
They could be images, videos, or just a bunch of bits. They don’t have to have a schema with column types and so on, which gives you tremendous agility and flexibility.
Hadoop has a very nice model that I sometimes refer to as schema on read. Defining your schema as you write the data in limits what you can store, because everything has to conform to the schema you created. Hadoop instead allows you to define the schema as you’re reading the data out.
That gives you a lot of flexibility and agility, since you can add files that have dynamic parts, like JSON, or use emerging standards like Avro, a very good project from the Hadoop community that’s similar to Protocol Buffers from Google and Thrift from Facebook. Avro attaches a schema to files as well, but these schemas are semi-structured, rather than conforming to a strict relational model.
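Schema on read can be made concrete with a small sketch. This is plain Python, not a Hadoop API, and the record fields are invented for illustration: raw records are stored exactly as they arrive, and a schema is applied only at query time.

```python
import json

# Ingest ("schema on write" would reject this): store raw records
# as-is, with no schema enforced. Extra or missing fields are fine.
raw_store = [
    '{"user": "alice", "clicks": 3}',
    '{"user": "bob", "clicks": 5, "referrer": "news"}',
]

# Read: the schema (which fields we care about and how to interpret
# them) is chosen only when a query reads the data back out.
def read_with_schema(store, fields):
    for line in store:
        record = json.loads(line)
        yield tuple(record.get(f) for f in fields)

rows = list(read_with_schema(raw_store, ["user", "clicks"]))
print(rows)  # [('alice', 3), ('bob', 5)]
```

The second record's extra `referrer` field never caused a write-time failure; a later query could define a different read schema that includes it.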
That said, it’s also important to point out that structured stuff is a subset of unstructured stuff. The fact that Hadoop at its heart is a file system doesn’t mean that it can’t do database relational stuff. It does actually, in the same way that Windows at its heart is a file system, but you can run SQL Server on top of it to get the relational services, schemas, column types, and so on.
One of the key projects on top of Hadoop is Hive, which actually came out of Facebook. Hive essentially provides a relational database on top of Hadoop that utilizes the underlying file system but has a metastore that keeps the schema of the files.
It knows that a given file is tab delimited or whatever, it knows the column types for these files, and it allows you to write SQL against these files. Hive will look up the schema and generate the MapReduce jobs for you, so you don’t have to go and learn MapReduce from scratch.
Now you have the flexibility of going either way. One approach is to get at the core of the MapReduce framework using Java MapReduce, which we sometimes refer to as being like assembly language for Hadoop. It gives you the most flexibility and performance, but it is fairly complex and difficult to learn.
Alternately, you can go in with a high level language like Hive. In this case, you can just use SQL, if that’s what you’re used to, to write your job. Hive itself has lots of optimizations. It understands the underlying MapReduce framework, so it can properly map your problem on top of your data.
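To make the contrast concrete, here is a toy sketch in plain Python (not Hive, and the table and column names are invented): the one-line aggregation a Hive user would express declaratively, next to the map and reduce steps it stands in for.

```python
from collections import defaultdict

# What a Hive user writes (illustrative query, shown only as a comment):
#   SELECT page, COUNT(*) FROM logs GROUP BY page;

logs = [("home", "alice"), ("about", "bob"), ("home", "carol")]

# The equivalent hand-written job: the map step emits the grouping
# key with a count of one for each record...
mapped = [(page, 1) for page, _user in logs]

# ...and the reduce step sums the counts for each group.
counts = defaultdict(int)
for page, one in mapped:
    counts[page] += one

print(dict(counts))  # {'home': 2, 'about': 1}
```

Hive performs this translation automatically and, as noted above, applies optimizations based on its knowledge of the underlying MapReduce framework.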
Scott: Obviously, search is a logical use of MapReduce, but I’m sure there are many others. For example, in the oil and gas industry, a single oil derrick might produce terabytes of data a day. In other industries, there might be huge amounts of financial data to plow through.
It seems that every industry is faced with figuring out ways to deal with these large bodies of data. Given that, what are some of the more interesting uses or problems you’ve seen thrown at Hadoop?
Amr: The group at Yahoo! that I came from was using Hadoop for data analytics and data warehousing. We had something like 100,000 web servers across the world, and once we collected data from across all these servers, we dumped it into Hadoop, which became the place where we stored all of the data, instead of traditional network storage.
Our reasoning for doing that was a matter of economics, given the quantity of hardware. Hadoop lets us scalably process that data, clean it up, and normalize it so we could pass it along to the systems that need it.
Hadoop is getting very wide adoption in the data warehousing and business intelligence domains. One of the biggest uses within Yahoo! right now is dealing with all of the log information from servers. Analyzing that information allows for better spam filtering, ad targeting, content targeting, A/B testing for new features, et cetera.
It’s not web-specific. For example, everybody does data warehousing, and we see very strong adoption there.
Separate from that, your example of oil companies is a very good one, as is the financial sector. Right now, we do have a couple of very large financial institutions working with us on these exact problems, taking huge amounts of data from domains like credit card processing and building predictive models for fraud that enable better decisions, for example, about whether to block or allow a given transaction.
In the stock market, Hadoop is being used to do simulations that help predict option pricing and related problems. That’s another very healthy market that we’ve seen growth in.
Another interesting market is genomics, specifically for doing gene sequence mapping and protein sequence alignment. In many ways, that is a fuzzy string matching problem, where you have a string of DNA that you need to match to a template string that has certain diseases identified in it.
Any individual task is not that big, but there are so many of them that you need to split the work over many nodes and do the matching in parallel. Government agencies have a number of implementations that involve massive bodies of data as well.
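The sequence-matching workload described above parallelizes naturally because every read can be scored independently. Below is a minimal sketch in plain Python, using a simple similarity score from the standard library; it is not a real bioinformatics pipeline, and the reference string is invented for illustration.

```python
from difflib import SequenceMatcher

reference = "ACGTACGTTTGACCA"  # toy template string

# Score one read against every position of the reference and return
# the best offset. Each read is an independent task, so a cluster
# can fan these out across many nodes and match in parallel.
def best_alignment(read):
    best = max(
        range(len(reference) - len(read) + 1),
        key=lambda i: SequenceMatcher(
            None, read, reference[i:i + len(read)]
        ).ratio(),
    )
    return read, best

reads = ["ACGT", "TTGA", "GACC"]
# Locally this is a plain map(); on a cluster the same function
# would run as the map phase over millions of reads.
alignments = dict(map(best_alignment, reads))
print(alignments["TTGA"])  # 8
```

As in the interview's description, no single task is large; the win comes from splitting the huge number of tasks over many nodes.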
Scott: Let me turn our conversation in a slightly different direction. Cloudera obviously has the word “cloud” in it, and platforms like Amazon Web Services offer Hadoop implementations.
Hadoop seems to intersect with the cloud fairly well. When I think of the cloud, I think about having an API to start up and shut down nodes, utility computing, and being able to provision and shut down relatively large numbers of nodes remotely, programmatically.
What makes the cloud and Hadoop a good fit for each other?
Amr: I completely agree with you that the cloud is essentially a set of resources that you can scale up or down with a nice API, based on dynamic demand. There are many services that fit that description, of course, and I should add that a cloud is not always external to your organization.
Even though Cloudera has the word “cloud” prominently in our name, we are not building a cloud or a cloud infrastructure like Amazon AWS or Microsoft Azure. Rather, what we’re building is software that enables cloud computing. Hadoop is at the heart of that, although we’re also building other components around Hadoop that strengthen that proposition.
Hadoop fits very well with the cloud model, because in many ways, Hadoop is like a virtual machine, albeit one that actually takes the inverse approach to what one generally associates with virtual machines.
That is, a virtual machine hypervisor allows you to partition a single server to look like many small servers. On the other hand, Hadoop is about linking together a bunch of physical servers to look like one big server, which starts looking like a cloud service.
You can have many users using this resource, and the resource can be scaled up or down, depending on the demand it’s being hit with. That’s exactly the sweet spot for Hadoop.
We’re working very closely with Amazon, and their Amazon Elastic MapReduce service gives you the option to use the Cloudera specific distribution of Hadoop, although Amazon has the generic Apache distribution of Hadoop as well. If you want to get a maintenance support contract, you obviously need to use our distribution.
Amazon’s business is about providing the cloud service itself, and not about building Hadoop. Our business is about building Hadoop itself, so that’s the synergy there.
Scott: As you mentioned, there is an Apache version of Hadoop and then there’s the Cloudera version. As different companies wrap themselves around different open source projects, they’re structured in different ways. Talk a little bit about Cloudera and what you add to the public open source version of Hadoop, in terms of additional software, support, or services.
Amr: I should start by saying that Cloudera is an enterprise software company. Open source is an enabler for us, and it’s part of what we do, but our mission is about building enterprise software for large-scale data processing in internal or external clouds.
In terms of how we go about working with the community, Hadoop is an Apache licensed product, and the Apache license is one of the most consumer-friendly licenses out there.
The Apache license allows anybody to take the code and do whatever they want to do with it, whether that means using it internally, redistributing it, even packaging and selling it. That flexibility makes it a very nice license, but of course, it means that you have to add value in order to make money from the code.
The way we make money is a bit similar to Red Hat, although Red Hat does charge for the distribution, whereas we don’t. You can get our distribution directly from our site without having to pay us anything.
There are many differences between our distribution and the Apache distribution. Primarily, our distribution is well packaged, as RPM packages for Red Hat or Deb packages for Ubuntu or Debian. That makes it much easier to install using standard tools, and in fact, we are talking right now with Microsoft about how to package it properly to run on top of the Microsoft Windows platform.
The other distinction is that the Apache organization can be a bit slow in terms of launching new Hadoop releases. We release new distributions almost every two weeks, with the latest and greatest features.
Of course, we release them as test distributions, and we tell people to be careful how they use them. As with other fast-moving distributions, the bleeding-edge stuff can have problems as well. Yahoo! also has a distribution, which is essentially the Apache Hadoop code that has been tested on the Yahoo! internal infrastructure. Everything that Yahoo! has in their Hadoop distribution makes it to the Apache Hadoop release, and similarly for any contributions that Cloudera makes to our distribution. We aren’t forking the code base; it’s just that the official Apache Hadoop release lags behind a bit.
Scott: What is the ongoing relationship between Yahoo! and Cloudera?
Amr: As I said before, Hadoop owes a lot to Yahoo!, in terms of how they made it into what it is today, and they continue to contribute a lot. Yahoo! has a very big Hadoop team, almost 100 people right now, I’d say. They have lots of open positions as well, for new people to join that team.
A lot of the Hadoop patches that we pick up were developed and tested by Yahoo!, and that’s what we include in our distribution. We do our own development and testing as well, in addition to what Yahoo! does. We listen to our customers to decide which features and improvements to work on next. Yahoo!’s development roadmap is obviously a function of their internal needs.
Scott: How does Cloudera monetize their work with Hadoop?
Amr: Right now, the only reason you would need to pay us is if you need our maintenance service. Think about software where there’s no upfront licensing fee but there is a maintenance fee. The maintenance fee buys you access to our team, and you can open tickets and ask questions.
Companies using Hadoop in a production environment need a partner they can fall back on when they have problems, and we are that partner. We can also provide advice about what network hardware to use, how to properly balance the number of hard disks and cores on your machines, and how to configure Hadoop itself.
Hadoop has something like 200 different system parameters that you need to properly configure depending on the infrastructure you have and the type of job that you’re running. Writing code using Java MapReduce is a bit of an art right now, and we can help with that to make sure your code is properly written with respect to the MapReduce model.
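To give a flavor of those parameters, here is a small configuration fragment of the kind Amr is describing. The property names shown (`mapred.reduce.tasks`, `io.sort.mb`) come from Hadoop releases of that era; names and defaults vary by version, so treat this as illustrative rather than a recommended configuration.

```xml
<!-- mapred-site.xml: two of the many knobs a Hadoop job depends on. -->
<configuration>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>8</value>   <!-- number of reduce tasks per job -->
  </property>
  <property>
    <name>io.sort.mb</name>
    <value>200</value> <!-- buffer size (MB) for sorting map output -->
  </property>
</configuration>
```

Values like these depend on the number of disks and cores per machine and on the shape of the job, which is exactly why tuning them is part of the service Cloudera describes.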
These are the kinds of services that we offer. We also offer quick patches. If you encounter a bug specific to your problem, and you want the patch right away so you can keep your cluster running, we’ll develop the fix for you very quickly. We have a number of Apache Hadoop committers on staff.
I forgot to mention that Doug Cutting, the inventor of Hadoop, left Yahoo! in September and joined our team. Tom White is another key committer we have at Cloudera. He is also the author of the O’Reilly “Hadoop: The Definitive Guide” book, which is the “bible” for Hadoop right now.
The other thing we do is training: learning this new technology, how to work with it, and how to solve problems within this new scalable way of thinking. We provide training from the very basic, fundamental material all the way to writing algorithms specific to the problem you are trying to solve.
That’s how a lot of open source companies start: maintenance fees are the main revenue stream, which allows them to make healthy revenues in their early years.
Most open source companies evolve to an open core strategy, where the core platform is free, but functionality around the core is not necessarily free. That is the direction that we will be moving in.
We’ll keep the Hadoop core free, because that’s Apache, and we’ll continue contributing to that. Indeed, more than half our engineering team is focused on making the Hadoop platform better, but at the same time, we’re now investing in building infrastructure around Hadoop to strengthen the platform.
Scott: It’s interesting to consider that dynamic among companies that are wrapped around open source projects like Joomla, WordPress, or MySQL. Every one is different, even with their similarities.
It’s interesting to me to understand how different companies are structured, because early on, there were lots and lots of different models people were trying around open source. It seems like the market is settling out a little bit.
One common approach is a sort of aquarium model, like Google or MySQL, where it’s being written by people who all work for a company, and they put the source code out there for all to see. There may be a patch from outside the company here or there, but most of the development is being done by people who all work for the same company.
On the other hand, there’s the example of the Red Hat model, where there’s a pretty vibrant community and the product draws code from a lot of different places. They do a lot of value-added stuff around it, such as providing an enterprise-grade version, fast patching, training, and consulting services.
These two seem to be the predominant models at the moment. Do you see it in a similar way, or do you see other ways companies are structuring around open source?
Amr: I completely agree. The Apache Software Foundation follows the latter model. In fact, they only allow projects to be “official projects” within Apache if the committers in the project are diversified over a number of companies.
If just one company controls the full code base, Apache will not let it become a core project, which I think is a very healthy approach. It does lead to certain logistics and coordination problems, but it allows the software to evolve much more quickly.
It also has sort of a Darwinian advantage, in the sense that when a number of companies are all working on a problem and there is more than one solution, the community can just take the best one. That said, you do lose control a bit, which might slow things down, even though it helps foster innovation, so it is certainly a balancing act.
Scott: Maybe a year ago, we interviewed Justin Erenkrantz of the Apache Software Foundation, and that was one of the things we discussed: the requirement for a project to have maintainers from at least three different companies to make it out of the incubator and become an official Apache project.
In another area, the Apache license is very different than the GPL license, for example. How does that fit into the strategy of Yahoo! or Cloudera? Do you think the Apache license versus the GPL license affects the evolution of Hadoop?
Amr: It does play a very big role. The GPL license has a viral aspect in terms of distribution: whenever you redistribute the code, you are giving up some rights over it. The Apache license doesn’t have that viral nature, which makes it a much friendlier license in terms of adoption.
We have to work harder to make our distribution, training, and maintenance services really stand out, and to provide additional value on top of the vanilla Apache Hadoop codebase. After all, customers can just go and get the code directly from Apache, without involving us at all.
Scott: There are obviously differing viewpoints on licensing, with some people who really like the GPL for what it does, and some people who really like the Apache license for what it doesn’t do, or what it doesn’t make you do. People argue about which is really the most free.
Amr: That’s a good point, and of course, I have my own opinions about it. Our CEO, Mike Olson, led the open source revolution from the beginning. Mike was the CEO of Sleepycat, one of the first open source companies, which commercialized BerkeleyDB.
Scott: We are running out of time, so I don’t want to miss the chance to ask you whether there’s anything you’d like to bring up that we haven’t discussed yet.
Amr: Hadoop has been around for a few years, but it’s still very young, relatively speaking. There is still a lot of maturing for it to do, and that’s what we are spending most of our energy on right now. We, with the Apache Hadoop community, are doing some great stuff in areas like performance, usability, availability, security, auditing, interoperability, and so on.
We are also building new data system products that augment the Hadoop platform. For example, a big problem that many companies have today is how to reliably collect data from across hundreds or thousands of servers, and bring it into a central data repository. That’s a problem that we are very excited about.
Scott: Cool. We’ll keep paying attention to what comes next. Thanks for taking the time to talk today.
Amr: Thank you.