It’s no exaggeration to say that Ian Clarke takes on some of the hardest problems in computing. Ian invented Freenet, an open-source network that remained uncracked, and is today contemplating Swarm, which has ambitious goals about moving computing closer to distributed data. But Ian isn’t just a geek’s geek. In this interview, he throws in some thoughts about the business side of software development too.
- The origins of Freenet and anonymous peer-to-peer architecture
- On the mathematics behind “Six Degrees of Kevin Bacon”
- Observations about optimal strategies for managing an open source community
- The Swarm project’s approach to distributed data
- Implementing fork/join and related concepts in distributed systems
- Relative benefits of open source and proprietary approaches to development
- Incorporating open source into an overall business strategy
Scott Swigart: To start off our conversation, could you please take a moment to introduce yourself and talk a bit about the different efforts you divide your time between?
Ian Clarke: Sure. I have a degree in computer science and artificial intelligence from the University of Edinburgh in Scotland. I started my career as a subcontractor to the European Space Agency.
Prior to graduation, I did some work on emergent systems and anonymity on the Internet. That grew into an open source project for which I act as founder and coordinator called Freenet, which started back in 1999 and has been ongoing ever since.
When I made Freenet publicly available after I graduated in 1999, it ended up getting quite a bit of publicity, and we released our first version in March 2000. In fact, a paper we released in 2000 describing Freenet ended up being the most cited computer science paper of that year.
Freenet was kind of novel, in that it was the first system to offer an anonymous peer-to-peer architecture. Freenet was also a precursor to the concept of the distributed hash table, which is a technique to efficiently store and retrieve data from distributed computers.
I moved to California and started a company, raising $4 million in venture capital, primarily from Intel. We developed a peer-to-peer content distribution technology, which we sold in 2003. All this time, Freenet was my hobby going on in the background.
I then did some consulting, and as part of that, I designed a peer-to-peer video distribution system for Janus Friis and Niklas Zennstrom, who I’m sure you know are the creators of the software you’re currently using.
I handed that off to their team in 2006, and it became a project known as Joost, which you can look at as the next big thing after Skype. I then co-founded a company called Revver, which was one of the early entrants into the short-form video-content sharing space. Revver was unique in that we were advertising against video back in 2005 and sharing the revenue with content creators.
Scott: How did that set the stage for the projects you’re working on now?
Ian: One of the problems I was working on as Chief Scientist at Revver was how to target individual videos with specific ads to maximize revenue. I became very interested in the whole field of collaborative filtering, predictive analytics, and recommender systems.
My current day job, which is a project called SenseArray, grew out of that. One area I learned a lot about with Revver concerns the challenges of scaling up a website, scaling up databases, and worrying about what data goes where and what data is cached where.
I believe that a lot of the architectural problems inherent in building a website and having a bunch of people actually start using it should not be the concern of the software engineers. Those issues should be handled by the framework in a mostly transparent way.
Those thought processes gave rise to a project called Swarm, which I spoke publicly about only a couple of months ago for the first time. Right now, it’s at a very early conceptual prototype stage, because I’m trying to get people interested in helping me work on it. Swarm is extremely ambitious and encompasses some very difficult problems.
Scott: Freenet sounds like the kind of thing that people might have become more interested in from 2002 onward, in the sense that it predates the Patriot Act and widespread eavesdropping on global communications. Freenet provides an anonymized way to get content out and communicate with other people.
What do you think made Freenet interesting at the time you launched it, and how did it become more interesting because of changes culturally and geopolitically between 2000 and today?
Ian: When we first launched Freenet, it really generated two kinds of interest which were almost orthogonal to each other. One could have easily occurred without the other, but they both happened at the same time. The first kind was academic interest in what Freenet was doing and how it was doing it.
Freenet did a couple of things that had never really been done before. One of the core problems that Freenet tried to solve was to find information quickly from data that is distributed across a large network of computers. Many of those computers are unreliable, and some could be operated by your enemies.
You cannot rely on any form of centralization in order to do that. You probably remember Napster, which was a peer-to-peer network that maintained a central directory recording where everything was. Of course, that central directory can be shut down, and that really was not an option for Freenet.
At the time, Freenet was designed to be used in countries like China, Saudi Arabia, and Iran, where governments are not constrained by constitutions and laws in the same way that they are in this country.
Freenet was intended to operate in a very hostile environment, which creates some difficult problems. How do you find a piece of information, without consulting some centralized entity that keeps track of where it is? I developed a technique for doing this that relates somewhat to small world theory.
Scott: That’s the same theoretical basis as the Six Degrees of Kevin Bacon game, right? [laughs]
Ian: Exactly. The whole notion that you can connect between any two people through about six other people actually has a lot of mathematics behind it.
It turns out that if you just randomly create a network of things, you cannot easily get from any one point to any other in a small number of hops. For that to work, it must be a small world network, and it just so happens that human relationships do form a small world network, whereas that’s not true of an arbitrary network of things.
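The small world effect Ian describes can be seen in a few lines of code. The Java sketch below (all sizes, names, and shortcut placements are invented for illustration, and have nothing to do with Freenet’s actual routing) compares the shortest path between two distant nodes in a plain ring lattice against the same lattice after adding a handful of fixed long-range shortcuts:

```java
import java.util.*;

// Toy demo: long-range shortcuts collapse path lengths in a ring lattice.
public class SmallWorldDemo {
    static final int N = 100;

    static void link(Map<Integer, Set<Integer>> adj, int a, int b) {
        adj.get(a).add(b);
        adj.get(b).add(a);   // undirected edges
    }

    // Ring lattice: node i is linked to i±1 and i±2 (mod N).
    static Map<Integer, Set<Integer>> ringLattice() {
        Map<Integer, Set<Integer>> adj = new HashMap<>();
        for (int i = 0; i < N; i++) adj.put(i, new HashSet<>());
        for (int i = 0; i < N; i++) {
            link(adj, i, (i + 1) % N);
            link(adj, i, (i + 2) % N);
        }
        return adj;
    }

    // Same lattice plus a long-range shortcut from every tenth node.
    static Map<Integer, Set<Integer>> smallWorld() {
        Map<Integer, Set<Integer>> adj = ringLattice();
        for (int i = 0; i < N; i += 10) link(adj, i, (i + 37) % N);
        return adj;
    }

    // Breadth-first search: minimum number of hops from src to dst.
    static int hops(Map<Integer, Set<Integer>> adj, int src, int dst) {
        Map<Integer, Integer> dist = new HashMap<>();
        Deque<Integer> queue = new ArrayDeque<>();
        dist.put(src, 0);
        queue.add(src);
        while (!queue.isEmpty()) {
            int u = queue.poll();
            if (u == dst) return dist.get(u);
            for (int v : adj.get(u))
                if (!dist.containsKey(v)) {
                    dist.put(v, dist.get(u) + 1);
                    queue.add(v);
                }
        }
        return -1;   // unreachable
    }

    public static void main(String[] args) {
        System.out.println(hops(ringLattice(), 0, N / 2));  // 25 hops
        System.out.println(hops(smallWorld(), 0, N / 2));   // far fewer
    }
}
```

With only ten shortcuts added, the distance between opposite sides of the ring drops sharply, which is why a randomly wired network and a small world network behave so differently for routing.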
One of the novel things in Freenet was exploiting small world theory to solve the problem of searching and retrieving data in a decentralized, efficient, scalable way. I think that’s part of what generated the academic interest, and some of the stuff we were doing with cryptography was also quite new.
We certainly did not invent any new cryptographic algorithms, but we were using existing algorithms in some novel ways. That generated interest in the academic communities, computer security communities, cryptography communities, and what later became the peer-to-peer communities, although Freenet slightly predates the term “peer-to-peer.”
The other type of interest we were getting was mainstream interest from media like the New York Times and 60 Minutes, which related primarily to the copyright implications of something like Freenet.
Around about that time, of course, Napster was getting a huge amount of publicity, but a lot of people realized that Napster would probably get shut down, and here was Freenet, which could not be shut down. That generated a lot of mainstream media interest that really didn’t have much to do with the academic interest.
In some ways, they conflicted with each other, especially since some academics dismissed Freenet because it was getting so much mainstream publicity, which is kind of a weird effect.
Scott: What did you learn from Freenet in terms of managing an open source community? A lot of people have talked about why communities succeed or fail, but what are some of your lessons learned from that experience?
Ian: I have made a lot of mistakes, and I think I will continue to, so I certainly do not see myself as an authority. I think what has worked quite well for us has been being very open in terms of allowing people to join the development effort. I know that in some open source projects, it is incredibly difficult to gain access to source control, and it is a big deal when you do.
We take the opposite tack. If somebody emails us and says they have an idea about how to do something, we give them access. If they do something wrong, we can revert it, which is the beauty of source control. We do everything we can to lower the barriers to entry for people to contribute to the project.
Another lesson has been that it’s very important to explain to people that they shouldn’t wait for someone else to tell them what to do. I find that a lot of people send messages to our mailing list, saying that they want to be part of the project and asking for something to do. Invariably, nobody will respond to them.
People need to know that nobody is going to laugh at them or shout at them or get mad at them if they just try to improve something. They can just find something that needs doing and then go do it.
One of our most prolific contributors over the years was Oskar Sandberg, who actually ended up doing a Ph.D. on stuff related to Freenet. He initially joined the project by writing a very simple Perl utility. To be honest with you, I can’t even remember exactly what it did, but it was something extremely simple, and from there he got more and more involved.
It’s really about lowering the barrier to entry, but also making it clear to people that they have to take work for themselves. They can’t wait for it to be given to them.
Scott: In other words, it’s better if someone comes to the project with their own itch that they want to scratch rather than “Hey, I just want to be part of this; put me to work.” We hear from a lot of people that the best developers are often the people who are really using it all the time, and so they know, “Hey, this is cool, but it would be even cooler if it could do X.”
To return to Swarm for a moment, one of the things you mentioned that got my attention is that this is an incredibly ambitious undertaking. What is it that makes Swarm so ambitious, and how is that likely to impact the management of the community around it?
Ian: To put that more in context, I should first talk a little bit about what Swarm is. The basic premise is to build an environment where computer programs can be written so that they can either run on one computer just like a normal program or be distributed across many computers in a way that is mostly transparent to the programmer. The fundamental concept behind Swarm is to move the computation, not the data.
All computer programs are essentially a collection of code that operates on data, and the data can potentially be distributed across multiple computers. The concept with Swarm is that, if code that is executing on one computer needs to do something to data that is on a remote computer, you should move the computation to where the data is.
The conventional approach would be to move the data to the computer where the code is executing, and the Swarm approach is actually a little bit like the MapReduce concept that you may be familiar with.
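The contrast Ian draws can be sketched in a few lines. In this hypothetical Java example, each `Node` owns a partition of the data, and instead of copying a partition across the network, the program ships a small task to the node and carries only a running total from node to node. The `Node` class and `distributedSum` helper are invented for illustration and are not Swarm’s actual API:

```java
import java.util.*;
import java.util.function.Function;

// Sketch of "move the computation, not the data."
public class MoveComputation {
    // A node owns a partition of the data; nothing here ever copies the
    // partition itself off the node.
    static class Node {
        final List<Integer> partition;
        Node(List<Integer> partition) { this.partition = partition; }

        // Run a task where the data lives; only the small result
        // "crosses the network" back to the caller.
        <R> R runHere(Function<List<Integer>, R> task) {
            return task.apply(partition);
        }
    }

    // The "program" visits each node in turn, carrying only its own small
    // state (a running total) rather than pulling every partition to one place.
    static int distributedSum(List<Node> nodes) {
        int total = 0;
        for (Node node : nodes)
            total += node.runHere(p -> p.stream().mapToInt(Integer::intValue).sum());
        return total;
    }

    public static void main(String[] args) {
        List<Node> nodes = Arrays.asList(
            new Node(Arrays.asList(1, 2, 3, 4)),
            new Node(Arrays.asList(10, 20, 30)));
        System.out.println(distributedSum(nodes));   // 70
    }
}
```

The design point is in what moves: a few bytes of task and accumulator travel between nodes, while the partitions, which could be gigabytes in a real system, stay put.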
Scott: In fact, I was just thinking that I should ask you how this approach is similar or different from Hadoop’s approach.
Ian: They both share the maxim “Move the computation, not the data,” but Swarm is really a lot more general. In order for a problem to be solvable through MapReduce, it needs to be possible to break it down into a map operation and a reduce operation.
Some problems are reducible in that way, especially stuff like indexing web pages, but many problems are not. In particular, anything that requires the equivalent of a database Join between two tables simply can’t be broken down into a MapReduce in any kind of efficient way.
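To make concrete what “breaking a problem into a map and a reduce” means, here is a minimal in-memory word count in Java, the classic example of a problem that does decompose this way. The class and method names are invented for the sketch; a real MapReduce framework would run the two phases on different machines:

```java
import java.util.*;

// Minimal in-memory MapReduce word count: a problem that cleanly splits
// into a map phase and a reduce phase. A join of two tables has no such
// clean decomposition, which is the limitation discussed above.
public class MapReduceSketch {
    // Map phase: emit a (word, 1) pair for every word in every document.
    static List<Map.Entry<String, Integer>> map(List<String> docs) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String doc : docs)
            for (String w : doc.split("\\s+"))
                pairs.add(new AbstractMap.SimpleEntry<>(w, 1));
        return pairs;
    }

    // Shuffle + reduce phase: group pairs by key, then sum each group.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("move the computation", "not the data");
        Map<String, Integer> counts = reduce(map(docs));
        System.out.println(counts.get("the"));   // 2
    }
}
```

Each map call touches one document independently and each reduce call touches one key independently, which is exactly what makes the pattern easy to distribute; a join, by contrast, needs matching rows from two different tables brought together.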
In short, then, MapReduce is the same general concept but only applicable to a specific class of problems, whereas Swarm is a lot more general. Really, you can write any computer program, and with Swarm in particular you can write your code in Scala.
I’m not sure how familiar you are with Scala, but I consider it to be the successor to Java. It has a lot of benefits of Java but without a lot of its annoyances and with some neat features that Java may never get, like closures, type inference, and that sort of thing.
You write a computer program in Scala, and it can really do anything. There is no requirement that you can break it down into MapReduce operations. It can just be an arbitrary computer program.
Scott: Can you share a few particulars about the Swarm approach to handling distributed data?
Ian: Let’s say an object on the computer that is executing happens to have a pointer to an object on a remote computer. At that point, Swarm essentially freezes your computer program, serializes it through a mechanism called continuations, and migrates the computation to the remote computer, where it starts up again.
From the perspective of the programmer, you don’t even need to know that this is happening or where your code is executing. Swarm can actually move the execution state of your code around between multiple computers in a way that the programmer doesn’t really have to think about.
Now obviously, it’s not inherently very efficient to serialize a computer program’s state and migrate it across the network, so the second component of Swarm figures out how to arrange the data to minimize the number of times that the program state has to migrate from one computer to another.
Those are essentially the two components of Swarm. Firstly, allow a computer program’s state to jump around to follow data, so that you can just distribute data across multiple computers and then the computer program just follows it where it needs to go. The second component helps us be smart about where we put the data to maximize efficiency.
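Both components can be modeled in a toy form. In this hypothetical Java sketch (the names and structure are invented for illustration; real Swarm relies on Scala’s delimited continuations, not this mechanism), the program’s entire state is one integer plus the list of keys still to visit, and whenever the next key’s data lives on a different node, the state “migrates” there and the read becomes purely local. Counting migrations shows what the data-placement optimizer would try to minimize:

```java
import java.util.*;

// Toy model of "the program follows its data": migrate the program's small
// state to whichever node owns the next piece of data, and count the hops.
public class ContinuationSketch {
    static class Node {
        final String name;
        final Map<String, Integer> data = new HashMap<>();
        Node(String name) { this.name = name; }
    }

    // Sum the values behind `keys`, hopping to whichever node owns each key.
    // Returns {sum, migrations}.
    static int[] run(List<Node> nodes, List<String> keys) {
        int sum = 0, hops = 0;
        Node current = nodes.get(0);
        for (String key : keys) {
            Node owner = nodes.stream()
                .filter(n -> n.data.containsKey(key)).findFirst().get();
            if (owner != current) {   // freeze state, migrate, resume
                hops++;
                current = owner;
            }
            sum += current.data.get(key);   // now a purely local read
        }
        return new int[]{sum, hops};
    }

    public static void main(String[] args) {
        Node a = new Node("a"), b = new Node("b");
        a.data.put("x", 1);
        a.data.put("y", 2);
        b.data.put("z", 4);
        // Visiting x and y (on a) then z (on b) costs only one migration.
        int[] r = run(Arrays.asList(a, b), Arrays.asList("x", "y", "z"));
        System.out.println(r[0] + " with " + r[1] + " migrations");
    }
}
```

If the optimizer co-located `z` with `x` and `y`, the hop count would drop to zero for this program, which is exactly the kind of rearrangement the second component of Swarm is meant to discover.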
In terms of what makes this hard, the continuation mechanism I mentioned is something called portable continuations, which are supported by very few programming languages. Even Haskell cannot support portable continuations, and you know, if you can think of an obscure capability for a programming language, chances are Haskell can do it.
It just so happens that in Scala 2.8, which is the upcoming version, somebody implemented portable continuations. In many ways, this is the kind of thing that makes Swarm even possible.
In other areas of difficulty, there are all kinds of concurrency issues when code is acting on data that can be distributed across multiple computers. You have to deal with persistence and consistency, and all of that.
Problems of consistency and concurrent access to data are part of a very active research area at the moment, and there are a lot of hard problems there. How do you know when it’s safe to delete data so you can efficiently implement garbage collection across many computers? How do you implement stuff like atomic transactions when your data is distributed across multiple computers?
All the time, you’ve got a separate overseer process that’s watching how your computer program is jumping between computers and trying to move data in order to optimize that. I’ve got a lot on my plate, and I cannot pursue all of these problems myself. At the moment, I’m trying to attract people with expertise in areas like handling concurrency and software transactional memory to the Swarm project.
Scott: I am thinking about the approach you describe and comparing it to a more traditional environment, where 100 web servers are talking to a handful of middle-tier servers, which are talking to one huge honking database. You scale that thing up as large as you possibly can, and if you can’t scale it up any larger, you fail.
I’m also thinking about the BigTable approach, where data is sharded across lots of nodes. With something like Google App Engine, you get scalability, but you lose some relational database operations like Joins. You may be limited in terms of not being able to do the equivalent of a select that returns a million rows; you’re limited to a thousand results or something like that.
You talked about Joins and a lot of these concepts, and it sounds like what you are doing with Swarm is envisioning a novel data store. The data is distributed across a lot of nodes, and there’s an optimizer that’s always positioning the data so it will be the most efficient for the code to come to it.
At the same time, though, you’re still supporting what people tend to think of as relational operations, like Joins and those types of things. Am I envisioning it correctly?
Ian: I think you’re absolutely right that you could do things like Join using Swarm, although that would really be at a higher level of abstraction. So, for example, somebody could implement a relational database on top of Swarm, but Swarm is not itself a relational database. Really, Swarm’s representation of data is as an object graph, much as data is represented in memory.
Scott: It seems like the concept of moving the execution to the data becomes more and more important as everything now just generates terabytes of data, from super colliders to oil fields.
It’s no longer practical to move the data to the computation, and you talked about your code running in one place for a little while and then having its state packaged up and running somewhere else.
Still, a lot of times what you want is for your code to sort of fan out. How does Swarm deal with the notion of making the logic fan out and go parallel when it can across nodes? And does it take that responsibility off of the developer, or is that something they would have to be conscious of?
Ian: This is something I’m thinking about at the moment. For example, one thing you could do is implement something like a fork/join approach to parallelization. I know there is a fork/join framework that will probably be part of the next major version of Java. You could absolutely implement something like that on top of Swarm, where you explicitly say to it, you can parallelize this if you want.
I wouldn’t say the parallelization is completely transparent to the programmer. It’s not like the system is examining the code and being very clever about figuring out what can we do in parallel and what we need to do serially.
What you can do is implement something like fork/join so the programmer can explicitly give the system permission to parallelize something. Then Swarm will decide where it makes sense to run on one thread or whether to fork it and send different parts of the execution to different other machines and then recombine it all at the end.
In short, the answer to your question is that you can do parallel processing with Swarm, but it probably won’t be completely transparent to the programmer.
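The fork/join framework Ian mentions did ship with Java 7 in `java.util.concurrent`, and a small example shows the division of labour he describes: the programmer explicitly marks where work may be split, and the runtime decides how to schedule the pieces. The threshold and the summing task here are arbitrary choices for the demo:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Fork/join parallel sum: the programmer grants permission to split the
// work; the pool decides which halves actually run on other workers.
public class ForkJoinSum extends RecursiveTask<Long> {
    static final int THRESHOLD = 1_000;
    final long[] values;
    final int lo, hi;

    ForkJoinSum(long[] values, int lo, int hi) {
        this.values = values;
        this.lo = lo;
        this.hi = hi;
    }

    @Override protected Long compute() {
        if (hi - lo <= THRESHOLD) {            // small enough: run serially
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += values[i];
            return sum;
        }
        int mid = (lo + hi) / 2;
        ForkJoinSum left = new ForkJoinSum(values, lo, mid);
        ForkJoinSum right = new ForkJoinSum(values, mid, hi);
        left.fork();                           // may run on another worker
        return right.compute() + left.join();  // recombine the results
    }

    public static long sum(long[] values) {
        return new ForkJoinPool().invoke(new ForkJoinSum(values, 0, values.length));
    }

    public static void main(String[] args) {
        long[] xs = new long[10_000];
        for (int i = 0; i < xs.length; i++) xs[i] = i + 1;
        System.out.println(sum(xs));   // 50005000
    }
}
```

In a Swarm-style system, the same `fork`/`join` calls could in principle dispatch the halves to other machines rather than other threads, which is the library-on-top-of-primitives arrangement discussed here.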
Scott: Still, it sounds like it would be implemented from the programming standpoint in more of a declarative way, rather than with the imperative fork/join syntax that you would have in other languages.
Ian: I think that is almost certainly true, although I should say that a lot of this really is very much still on the table. You could implement fork/join as a library on top of Swarm. Swarm gives you primitives such as fork a process or communicate between processes, and you have a lot of flexibility about what you build on top of that.
Scott: Taking the conversation in a slightly different direction, with Freenet, you manage the community, and with Swarm, you’re trying to attract people to tackle some nearly intractable problems. Then you also do proprietary software development.
Talk a little bit about the difference in the way you see things working in the proprietary world versus the open source world and where you see advantages or disadvantages to either approach for certain kinds of problems.
Ian: There is the obvious benefit in the proprietary world that when you employ people, they more or less have to do what you say, which is certainly not the case in the open source world, where you have to persuade people.
Open source can make it a lot more difficult to put processes in place, because programmers by nature often resent processes unless they strongly believe in them, although even in the commercial world, if you impose something and nobody believes it’s any good, then it’s probably not going to work very well either. You can implement structure more easily in the proprietary world, though, I think.
The proprietary world also tends to present fairly clear goals. If you’re going in the right direction, people will buy it and keep paying for it, and if you’re going in the wrong direction, they won’t buy it, and you don’t have a business.
With something like Freenet, it is hard to know what exactly you’re trying to be. For example, are we trying to serve people living in countries like the U.S., people living in China, or both? The goal of proprietary software is to make money, and hopefully you can translate that back into how you need to prioritize different features and functionality today.
Scott: I’ve never heard anyone call that out as an advantage, although it makes perfect sense. To put it another way, in proprietary software, there’s very low latency between the moves you make as a software company and the signals the marketplace sends back to you.
On the other hand, a lot of open source projects don’t even know how much of their software is out there. It’s very nebulous how much “share” a project has, so there’s a little more latency between the time a project starts to head in the wrong direction and the point where it receives clear information that the direction is negatively affecting use of the project.
Flip it around and talk about some of the advantages of open source that aren’t really present in proprietary software. For example, if I am a Fortune 500 company, I’m not in the business of building a web server, but I need one. It’s a lot more beneficial for me to devote a handful of programmers to Apache so we get the things we want, rather than building one from scratch.
Ian: That is certainly very true. From the perspective of the industry in general, open source permits people to create and collaborate on aspects of infrastructure that don’t need competitive advantages, but just need to work.
Because of that factor, you can have Microsoft and other companies, whether they like it or not, wind up cooperating with each other to the benefit of all. If you look at it from a game theory point of view, proprietary software can be more of a zero-sum game that tends to discourage collaboration, whereas open source tends to encourage, or even enforce, collaboration if it’s a viral open source license.
I certainly believe that the software industry is much better off with open source software. But on the other side, in many situations in my career, I’ve been building proprietary software, and people have said, “Oh, you should open source it. Wouldn’t that be great?”
It would be great for people in general, but it would remove the incentive to create the software in the first place. Because generally speaking, open source business models tend to be services-based business models, which do not scale as well as product-based business models.
To return to your original question, the advantages of open source versus proprietary are different, depending on whose perspective you take. You can build a business creating open source software, but I think it’s more difficult.
Scott: We talked to a VC, and his comment was “Sometimes we give open source credit for being more different than it is.” There are a lot of companies that go with kind of a “freemium” model, where they have a free version of the product, and then they try to up-sell you to the enterprise version. It’s the equivalent of a loss-leader, which isn’t new in business, or the modern equivalent of the 30-day evaluation.
Do you think it’s true that we sometimes highlight more differences between open source development and proprietary development than there really are?
Ian: It all depends on your strategy. I think there are many different reasons for open source. For example, IDEA, the Java IDE, recently released a community version. I certainly hear a lot about companies that build open source community versions and then try to up-sell people to a commercial version.
I have not seen a lot of solid data demonstrating the success or otherwise of that approach. I think there are some situations where a company has a wider strategic interest in creating a platform, as with the Google Android platform or Google Wave.
But, in those cases, a much longer-term strategic interest informs those high-risk projects. I think a lot of Google’s open source projects do not make money, and will not make money, even in the longer term. But luckily enough, they’ve got a cash cow in the form of advertising that they can use to subsidize all of these things.
Scott: You see a lot of experimenting, but it seems that notable successes are not all that prevalent. It’s also hard to determine whether success is because they went with an open source/enterprise business model, or whether something else contributed to their success.
Just to be sensitive to the time, can you add any closing thoughts?
Ian: I’m a huge user of open source, and I have several hats on my head. I’ve got a software engineering hat, where I just want there to be cool stuff, and anything that helps there be cool stuff is all right by me. I think almost all open source falls into that category. I think open source’s existence benefits the software ecosystem and innovation in general.
Then I’ve got my entrepreneur’s hat on, where I’m thinking in terms of whether being a producer of open source can make me money. The software I write is predictive analytic software, and I typically sell it to CTO-level people.
It’s quite a technical sale, so one of the challenges is to find customers and get them to pay attention for long enough that they actually understand what the software does and whether it can help them.
I have a blog where I release small, relatively self-contained snippets of code that do things that relate, in some way or another, to what my product does. I know that the type of people who are going to find these snippets of code may well be the type of people who work for a company that could use the software I create.
It’s not so much a loss-leader as it is a publicity tool, and that is definitely one way in which producing open source is in the business interests of even a small software company.
Scott: Thanks for taking the time to chat.
Ian: Thank you.