Interviewers: Scott Swigart and Sean Campbell

Interviewee: Daniel Jacobson

In this interview, we talk with Daniel Jacobson of NPR.

Sean Campbell: Can you tell us a little bit about your background and your organization?

Daniel Jacobson: I started out as a developer, initially in Microsoft technology, and from there I migrated into technologies like Java, PHP, Oracle, and MySQL. I was a developer for quite a few years, but I eventually transitioned into more of a managerial and leadership role at NPR.

That transition started with the advent of our content management system, which I led the building of and deployed back in 2002. It is still the centerpiece of everything we do at NPR, in digital media. Now I’m leading all of the application development efforts in digital media at NPR.

NPR obviously still is a traditional media organization. But digital media is now a major part of who we are, making us a multi-platform media organization. The digital media arm, of which I am a part, is basically responsible for distribution for all platforms other than radio. The CMS manages all of these distribution channels for digital content, including NPR.org, the API, podcast, RSS, and mobile platforms.

Sean: Unpack for us a little bit what you offer to the public and organizations through the API. I’ve experimented with it a bit, and it’s amazing how fast you can carve something together and throw it in something like Google Reader to get a fairly fast updating stream.

Give us a 30,000 foot view of what’s there and what people can do with it.

Daniel: Almost anything that you can get on NPR.org, you can get through the API, to the extent that we have the rights to redistribute it. In terms of slicing it to get interesting feeds of content, we have a whole slew of categorization schemes and keywords that we apply to each story.

That lets you go into the API and make a feed, identifying content by categories like topics, the program it came from, or the personality who presented it, like Nina Totenberg or Noah Adams.

You can also build mashups like an “All Things Considered” feed of content where Noah Adams talked about the Appalachian Trail. You can then parse that custom feed in one of our many output formats.
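To make the feed-building idea concrete, here is a minimal sketch of how category filters and an output format might be combined into one feed URL. The endpoint and the parameter names (`program`, `personality`, `search`, `output`) are hypothetical placeholders for illustration, not NPR's documented API.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://api.example.org/query"


def build_feed_url(filters, output="rss", base_url=BASE_URL):
    """Combine category filters and an output format into one feed URL."""
    # Sort the filters so the same request always yields the same URL.
    params = dict(sorted(filters.items()))
    params["output"] = output
    return f"{base_url}?{urlencode(params)}"


# An "All Things Considered" feed where Noah Adams covers the Appalachian Trail.
url = build_feed_url(
    {
        "program": "All Things Considered",
        "personality": "Noah Adams",
        "search": "Appalachian Trail",
    },
    output="atom",
)
print(url)
```

Changing only the `output` value is what would let the same selection of stories come back as RSS, Atom, JSON, or another format.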

Currently we offer eight output formats. NPRML is essentially an XML format that matches as closely as possible the data in the database. It’s the richest format that we have. We also have RSS, media RSS, Atom, and podcast outputs. We actually built a layer on top of the API called “Mix Your Own Podcast” so you can just throw terms in there and create a custom podcast on the fly.

We also have HTML and JavaScript widgets, and we also output through JSON. In addition to that, you can paginate through the results, so if you get 20 results in the first page, you can page through the next 20. You can also do date range searches.

Our goal is that any way you would want to search through the API, we provide a mechanism to do it.
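The pagination and date range behavior described above can be sketched as follows. This is only an illustration of the idea: the parameter names (`start`, `count`, `startDate`, `endDate`) are assumptions, not the API's actual names.

```python
from datetime import date


def page_queries(total, page_size=20, start_date=None, end_date=None):
    """Yield one parameter dict per page of results.

    All parameter names here are hypothetical.
    """
    for start in range(1, total + 1, page_size):
        # The last page may hold fewer than page_size results.
        params = {"start": start, "count": min(page_size, total - start + 1)}
        if start_date:
            params["startDate"] = start_date.isoformat()
        if end_date:
            params["endDate"] = end_date.isoformat()
        yield params


# Page through 45 matching stories published in a given date range.
pages = list(
    page_queries(45, start_date=date(2009, 7, 1), end_date=date(2009, 7, 27))
)
for page in pages:
    print(page)
```

With 45 results and a page size of 20, the sketch produces three requests, the last one asking for the remaining five stories.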

Sean: Can you give us an example of a particularly interesting or successful way this functionality has been used?

Daniel: That’s a timely question, because I’m preparing exactly that information for OSCON right now. [laughs]

Our target when we built the API included four distinct audiences. One of them was obviously the public, so we could open up the world and see the great things they would build. We also had our member stations in mind, as well as business partners and ourselves.

Later this month, we will be doing a series of major launches, and people will be able to see how we’re implementing the API on our live site. Describing the ways NPR is taking advantage of it may be good context for how other people can feel comfortable in building things and knowing that whatever they build is going to persist. This is the case because NPR’s products will be based on the same API.

The API is going to be the centerpiece of our site; it is the technical infrastructure, with every single page derived from the API itself.

We have also created another layer on top of it that enables more front line people, including editors and designers, to add custom feeds to anything on the site. We have built tools in our CMS that have enabled them to construct API feeds on the fly, drop them onto the page, and apply a style sheet to present the content in the way that they want. It will also let them configure how many stories to display and which elements to show, basically enabling the editors to control the related content for the story that they are working on.
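As a rough illustration of that kind of editor-facing tool, the sketch below models a feed widget whose settings are the number of stories, the elements to show, and a style sheet class. The `FeedWidget` class, its fields, and the sample stories are all hypothetical; the interview does not describe how the CMS tools are actually built.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class FeedWidget:
    """Hypothetical editor-facing settings for a related-content feed."""
    max_stories: int = 3
    elements: tuple = ("title", "teaser")
    css_class: str = "related"


def render(widget, stories):
    """Render the configured elements of the first few stories as HTML."""
    items = []
    for story in stories[: widget.max_stories]:
        parts = [
            f'<span class="{name}">{escape(story[name])}</span>'
            for name in widget.elements
            if name in story
        ]
        items.append("<li>" + "".join(parts) + "</li>")
    return f'<ul class="{escape(widget.css_class)}">' + "".join(items) + "</ul>"


# Sample data standing in for stories returned by a feed.
stories = [
    {"title": "Story A", "teaser": "First teaser", "byline": "Reporter 1"},
    {"title": "Story B", "teaser": "Second teaser", "byline": "Reporter 2"},
    {"title": "Story C & D", "teaser": "Third teaser", "byline": "Reporter 3"},
]

# An editor asks for two stories, titles only.
html_out = render(FeedWidget(max_stories=2, elements=("title",)), stories)
print(html_out)
```

The presentation itself stays in the style sheet; the widget only decides which stories and which fields reach the page.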

Sean: How are the ways those other audiences use the API different from the ways you use it internally?

Daniel: We really see the API as a way to get content to member stations so that they can reach their audience better. Quite a few stations out there use it in novel ways, including Minnesota Public Radio. They just did a new site re-launch and they are making extensive use of the API. Some other station re-design sites are coming up, and they have told us they are making extensive use of the API as well. North Country Public Radio is another example. They have told us that somewhere around 50% of their pages have API-derived content in them.

Partnerships are another target audience. The API gives us a very easy, fast mechanism to get content up on their platforms. Instead of working with a partner to try to get our two systems to talk to each other, we can simply point them to the API, and they can pull the content out and build it into their application.

As NPR explores other platforms, like mobile, iPhone sites and apps, Android, and others, it’s very easy for us to get our content up there. Beyond that, obviously the public is a big target audience for us.

One of our favorite apps is called “NPR Addict.” It’s on the iPhone, created by a public API user named Brad Flubacher. His app takes our feeds and streams the audio on the iPhone. Back in December, we launched a station finder API, so he is taking advantage of that as well to find member stations. There is actually another station finder app on the iPhone Store as well.

Sean: Other than the iPhone, what other platforms do you see a lot of interest around for NPR content-related apps?

Daniel: We have seen a range of code wrappers being built, including some Perl and Ruby and other implementations. One of the more interesting implementations was an audio player someone built for the KDE environment. This is a very interesting area to me, because we’re probably not going to be building apps for KDE, but somebody went out there and filled that void and reached out to an audience that otherwise wouldn’t get this kind of customization.

Some other mashups done by public users include Twitter mashups (like NPRBackstory and All Tweets Considered), Reverbiage, which displays NPR stories on a map based on their locations, a Simile Timeline mashup and some widgets focused on NPR Music. We have also built a Yahoo! widget and a Google Gadget, both of which place NPR content on those portal sites.

Scott Swigart: Tell us a little about the intersection between what you guys are doing and open source in general. I’d be interested in knowing about the open source you are using for the technology you are building, as well as hearing about the open source development audiences you are hoping to make life easy for as they connect to your API and consume content.

Daniel: We in digital media have a philosophy that to the extent possible, we want to use open source technology. We can’t always do that for a variety of reasons, but that’s our first instinct.

We’re an Apache shop, and we use Debian operating systems for all our web servers. We’re running PHP for outward facing websites, and our CMS, which we built in-house, runs in Tomcat and JSP.

The one major area where we are not using open technologies is our current database, which is Oracle. That said, we have actually built the API essentially as a data repository layer that sits on top of Oracle. It’s really just an XML layer that stores all the content.

We did this for a variety of reasons, starting with performance, and it also gives us a lot of modularity to transform content pretty quickly on the fly and get it into any of those formats that I mentioned earlier.

But in addition to that, we wanted to create an abstraction layer so that we are not dependent on any given database technology. That has real benefit to us in case we have to pull the database out of the system for whatever reason, and we actually saw an opportunity here to migrate off of Oracle. Sometime within the next year or so, we will probably migrate from Oracle to MySQL.
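A minimal sketch of that kind of abstraction layer, assuming a simple interface that the rest of the system codes against. The `StoryStore` interface and the in-memory backend are illustrative inventions; the point is only that callers never see which database, if any, sits underneath.

```python
from abc import ABC, abstractmethod
import xml.etree.ElementTree as ET


class StoryStore(ABC):
    """Everything the rest of the system is allowed to know about storage."""

    @abstractmethod
    def save(self, story_id: int, fields: dict) -> None: ...

    @abstractmethod
    def load_xml(self, story_id: int) -> str: ...


class InMemoryXmlStore(StoryStore):
    """Stand-in backend; a database-backed class could replace it."""

    def __init__(self):
        self._docs = {}

    def save(self, story_id, fields):
        root = ET.Element("story", id=str(story_id))
        for name, value in fields.items():
            ET.SubElement(root, name).text = str(value)
        self._docs[story_id] = ET.tostring(root, encoding="unicode")

    def load_xml(self, story_id):
        return self._docs[story_id]


def story_title(store: StoryStore, story_id: int) -> str:
    """Caller code depends only on the StoryStore interface."""
    return ET.fromstring(store.load_xml(story_id)).findtext("title")


store = InMemoryXmlStore()
store.save(42, {"title": "Hiking the Trail", "program": "All Things Considered"})
print(story_title(store, 42))
```

Swapping one database for another would then mean writing a new `StoryStore` subclass, while functions such as `story_title` stay untouched.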

That would be the last major step of moving entirely to open source technology. We do use some other smaller proprietary tools, but at the core, we are using open source.

Scott: What about contributing back to the open source community?

Daniel: The API was really our first venture into that, aside from RSS, which I don’t really consider to be an open source move.

The API is really the core thing that we can contribute to this world as part of NPR’s public service mission, in the sense that it allows us to give our content to the users and let them use it in meaningful ways that adhere to our terms of use.

Those terms of use are very important, because obviously, we don’t want people making improper use of the data, and if they are going to use it commercially they are going to have to talk to us and set up agreements.

But assuming they are using it within the terms of use, we really want to make the content available out there, so that people can do all those interesting things that I mentioned and hopefully more.

I think that in the next several months, you will start seeing more aggressive open source moves from NPR. We are going to start trying to contribute code and other tools to the community so people can do interesting things with them.

We’re still researching exactly how to interact most effectively with the community, opening our technology up to them so they can do whatever they want with it, while we understand our role in that (if there is one).

At OSCON, we are conducting a media mashup with the New York Times. We seeded it with several mashups that we have created and shared the code on GitHub. We are trying to facilitate the conversation about code and apps and working with other media organizations.

Scott: We interviewed Marc Frons at the New York Times, and he seemed to be fairly pragmatic about a certain ideological intersection between journalism and open source that led him to feel like open source is a good foundation to build on top of.

Of course, unlike NPR, the New York Times is a for-profit entity. He liked the idea of building proprietary intellectual property on top of open source, using things like Apache licensing that are a little bit freer in terms of being able to build IP that you don’t necessarily have to contribute back.

NPR is a little bit different ideologically, in terms of making the content freely available. You’re probably not as much out to build a business and a revenue model. Ideologically, how do you think that plays into how open source fits into NPR’s mission?

Daniel: The fact that we’re a non-profit organization is the key point. Our public service mission is to inform and educate, and we continually seek out those ways that will give us the best opportunity to fulfill that mission.

Of course, we still have to generate revenue, but we have to factor in more than just the bottom line, and part of that is how well we are serving our stations and reaching the public. Opening up our content through an API gives a lot of latitude to serve our audiences, without restrictions that commercial media outlets like the New York Times, Reuters, or the BBC have to contend with.

For example, I think ours is the only major media site that does not have any rate limiting on our API. We opened up all of our content, and you can download all of our MP3s through Mix Your Own Podcast, to the extent that we have the rights to do all this, which is most of the content.

We basically just jumped head-first into this pool, because we saw that this was the best opportunity to get people interested, engaged, and writing very interesting applications or widgets based on our content.

Ultimately, that’s going to reach a greater audience for us, because if people are, for example, writing this KDE app on a Linux platform, we’re not going to be there on our own. So now this tool allows us to reach a new audience, and it potentially exposes new people to NPR.

NPR’s position is also unique among media outlets in the sense that we’re not truly giving up the content in the way that the New York Times would if they offered full text. Their prime asset is their text, whereas one of our prime assets is audio. Of course, we’re generating more and more text, and images, and things like that, but the New York Times is a text organization, so if they’re feeding out their full text content, it’s gone.

It’s still important for us to maintain the brand and make sure that the attribution follows it. If people are really seeking out the audio, though, which is a large portion of the audience, that audio is still being distributed from our servers, and it’s always a revenue opportunity.

For example, for MP3 downloads, or for access to the M3U or WMA links, or any of that kind of streaming material, we have the opportunity to do a pre-roll on top of any of that. We’re not doing so in most cases, but we have this kind of revenue advantage, I think, because our asset is more bound to our servers, and it’s harder for the public to disseminate.

Scott: You’re obviously building technology that will stay inside of NPR. Could you envision releasing the internal code in a way that would let other media organizations replicate what you have done to expose their content?

Daniel: The big hurdle is the time, money, and resources to abstract the code so that it is portable and can be used in any kind of platform, or at least some subset of platforms.

That’s a challenge, and it’s especially a challenge because we are a non-profit organization and have fewer resources compared to other media organizations like the New York Times, USA Today, or the Washington Post.

We have five permanent in-house developers, and we sometimes bring in contractors. That’s a pretty small team to be running an operation that yields so much, so the first hurdle’s going to be resources.

The philosophical hurdle doesn’t really exist. We would love to open source as much as we can, and even if it’s not truly open source, we would like to do things like enable our member stations, who actually have a very similar plight to us, to create and distribute audio-rich content.

We would love to create tools that would enable them to do more, and to make it portable so that they can just pick it up and plug it in. We talk about that pretty routinely, and ideologically, we are all for it. Resource-wise, though, it’s a challenge.

Open sourcing the CMS would be a huge endeavor. I can imagine an app that sits on top of the XML layer that is more generic so that it can read in any XML and then convert it and yield output formats, whether it’s Atom, or RSS, whatever else. I’d love to do that.
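The generic converter imagined here could look something like the sketch below: read any XML that follows a simple item structure, then dispatch to one writer per output format. The source layout (`stories`/`story` with `title` and `link`), the function names, and the simplified RSS and JSON shapes are assumptions for illustration; the RSS output in particular omits fields a complete feed would need.

```python
import json
import xml.etree.ElementTree as ET

# Sample source document; the element names are assumptions.
SOURCE = """<stories>
  <story><title>First</title><link>https://example.org/1</link></story>
  <story><title>Second</title><link>https://example.org/2</link></story>
</stories>"""


def read_stories(source_xml, item_tag="story", fields=("title", "link")):
    """Extract the chosen fields from every item element in the source."""
    root = ET.fromstring(source_xml)
    return [
        {field: item.findtext(field, default="") for field in fields}
        for item in root.iter(item_tag)
    ]


def to_rss(stories, feed_title):
    """Write a simplified RSS-shaped document (not a complete valid feed)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    for story in stories:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = story["title"]
        ET.SubElement(item, "link").text = story["link"]
    return ET.tostring(rss, encoding="unicode")


def to_json(stories, feed_title):
    """Write the same stories as a JSON document."""
    return json.dumps({"title": feed_title, "items": stories})


# One writer per output format; adding a format means adding one entry.
FORMATS = {"rss": to_rss, "json": to_json}


def convert(source_xml, output, feed_title="Example feed"):
    """Read the source once, then hand it to the requested writer."""
    return FORMATS[output](read_stories(source_xml), feed_title)


print(convert(SOURCE, "rss"))
print(convert(SOURCE, "json"))
```

Keeping the reader separate from the writers is the design choice that would make such a tool portable: only `read_stories` needs to know the shape of a given organization's XML.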

We also publicly document everything that we can. For example, I post pretty regularly to our Inside NPR.org blog with details of our technical implementations, strategic decisions behind what we did with the API, and the challenges we’ve hit with it.

We’re pretty open with our knowledge, and we’re trying to get better about being open with tools. I think that the big-picture goal, which is a real long way out, is open sourcing our systems if possible.

Sean: Regarding your plan to migrate from Oracle to MySQL, what is your perspective on the significance of MySQL being acquired by a commercial company?

Daniel: That acquisition did give us some pause, but there are several reasons why we are still considering MySQL. Open source is certainly attractive, and as I mentioned, part of our core philosophy is to use open source platforms where possible. The fact that it is open source is not the primary reason, but it’s a nice reason.

For example, if we determined that Oracle made the most sense, because it’s better suited to the enterprise, because we need a support contract, or whatever, we would adjust our approach accordingly.

For us, however, it boils down to cost and scalability. If we want to scale our Oracle system and create redundancy, and allow us to swap in and out machines for maintenance or set up a RAC system, that’s very expensive. For a nonprofit organization with limited resources, that’s a challenge, and every time we add new CPUs, our price goes up.

With MySQL, the cost of adding new CPUs is limited to the cost of the CPUs and resources we need to implement them. The clustering environment is also a lot easier to implement, whereas RAC takes very specialized skills, so there’s a consulting cost on top of it.

MySQL is a much easier clustering environment, and it gives us that scalability for the future. We can have that redundancy, and we can always throw more hardware at a problem without worrying about those financial issues. The added benefit of it being open source, with a community that will continue to build on it, is icing on the cake.

Our view is that it is very unlikely that MySQL will be shut down as an open source project. If it is, the major alternative would be PostgreSQL, which we actually have already talked about, but I think Oracle would be foolish to shut down this open source community. That would create a major backlash that would probably also lead to forking of the code anyway.

Sean: There are light leaks around the community, and whatever Oracle would decide to do, someone would try to continue it in some form or another.

On a completely different topic, though, what are you guys running in terms of the operating system? And what are the factors that made you look more at MySQL than PostgreSQL?

Daniel: We use Debian as an OS. In terms of leaning more toward MySQL than PostgreSQL, there are two factors. The first is human resources, in the sense that our people know MySQL much more so than PostgreSQL.

There’s a little bit of a learning curve and expense that goes with a PostgreSQL implementation that MySQL wouldn’t carry. That includes setting up the server, integration of the clustering, and building database schemas that are optimized for MySQL versus PostgreSQL.

The second factor is just that anecdotally, from some really simple tests that we’ve executed and some research that we’ve seen, MySQL tends to be faster. I don’t know what the current studies show on the latest versions of each, but speed was one of the factors in making this decision.

Scott: There seems to be another movement out there, the NoSQL movement toward Bigtable and similar systems that really distribute and sort of cloud out, I guess. Is that something that you guys have any interest in?

Daniel: I’ve been reading a little bit about that lately, and I think that the XML Repository is some form of an implementation of that. So I think philosophically, I agree: thick relational systems that have all kinds of joins carry a cost, and it’s usually in terms of performance.

I like the philosophy of the NoSQL approach, and I think that creating this really flat XML layer is a step in that direction. In some ways, there is still a benefit, however, to a relational database model, and I think it depends on what the system is.

Basically our approach is that we keep content management relational and highly normalized, which is very effective for managing data for a small set of users.

We try to abstract that out, flatten it, and get it outside of the database in this XML layer, for performance and for all the reasons that the NoSQL movement is even getting a voice. I think there is a purpose for both, and that’s part of our approach to our architecture.
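The pattern of keeping management data normalized while serving a flattened copy can be sketched as below. The three-table schema and the sample rows are hypothetical, and SQLite stands in for the real database purely so the example runs on its own.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical normalized schema: stories, topics, and a join table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE story (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE topic (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE story_topic (story_id INTEGER, topic_id INTEGER);
INSERT INTO story VALUES (1, 'Hiking the Trail');
INSERT INTO topic VALUES (10, 'Travel'), (11, 'Outdoors');
INSERT INTO story_topic VALUES (1, 10), (1, 11);
""")


def flatten_story(conn, story_id):
    """Do the joins once, at publish time, and return one flat XML document."""
    title = conn.execute(
        "SELECT title FROM story WHERE id = ?", (story_id,)
    ).fetchone()[0]
    topics = conn.execute(
        "SELECT t.name FROM topic t "
        "JOIN story_topic st ON st.topic_id = t.id "
        "WHERE st.story_id = ? ORDER BY t.name",
        (story_id,),
    ).fetchall()
    root = ET.Element("story", id=str(story_id))
    ET.SubElement(root, "title").text = title
    for (name,) in topics:
        ET.SubElement(root, "topic").text = name
    return ET.tostring(root, encoding="unicode")


# Readers are served from the flat copy and never touch the joins.
flat_cache = {1: flatten_story(conn, 1)}
print(flat_cache[1])
```

The relational tables remain the place where a small set of editors manage data, while the cost of the joins is paid once per publish instead of once per request.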

Scott: Maybe it’s the nature of developers seeing things in very black and white terms, but there always seems to be a methodology du jour. You see the same kind of thing with Ruby on Rails as it comes along and PHP.

There tends to be a trend toward making something really simple, and then complexity gets added over time, until something really simple comes along to replace it. Everyone says, “Hurray—we’ve done away with all that complexity,” and then a week later, people start to complain about its limitations, so it starts to become more complex in its own right.

Daniel: I don’t think a smart technologist is someone who knows all the little nuances of a given technology or even a series of technologies. A smart technologist is someone who knows when to use which technology for which situation.

Sometimes it’s the perfect scenario to use a relational database model and really chunk out your data in normalized ways. Sometimes it makes sense to make it flat. People who understand architecture are the people that can actually make those right decisions.

Regarding simplicity versus complexity, RSS is a good example, and I think it has a place in the marketplace and in the history of the evolution of the Internet. It’s just not enough anymore in this distributed world that we live in, although it’s a great tool for certain things.

Now you need things like our API. You need more complex distribution channels, because that’s the world we live in. Maybe it will get simpler somewhere down the road, but at the moment, that’s the evolution I see.

Sean: That ties into the topic of control of information. Pandora got a good licensing deal on streaming music and then turned right around and said, “Hey, radio stations, why do you pay less for your content stream?”

We are also starting to see people sue organizations like Facebook, insisting that they can’t lock up all that profile information and not give the user the ability to have their profile federated and available to them at any time.

I’m curious about your views in this area, although I understand that you’re not speaking on behalf of NPR or anyone else.

There will come a time, I like to joke with Scott, when the Supreme Court will say, “I don’t care what agreement you signed. In order to allow the free flow of commerce, these profiles and this information have to be distributed.”

I find it hard to believe that the Facebooks of the world will be allowed to grow into behemoths that own all that data without being forced to federate it with small organizations. At the same time, I see you giving it all away so people can use it if they want.

Is this a conversation you have had over coffee at some point? Do you see this really as a small, anomalous moment in time where companies can have people sign an agreement that says, “All your family photos are ours for eternity,” and that eventually that just really won’t be able to stick?

Daniel: I’ll give you my personal take on it, although as you said, this does not represent NPR. I think that right now, we’re in a very ambiguous time in the marketplace.

Previously, there was all this lockdown, and there were subscription models that took content away, and you had to subscribe to it. No APIs were distributing this stuff out in a meaningful way, besides RSS, which was really a ploy to get users back onto people’s sites.

Then this whole API movement and widget distribution happened. I think Facebook is a contributor to this, in having the ability to create tools that enable content to be put into that framework.

All of that distribution has created a lot of ambiguity, in terms of who owns what content, whether it should be available, whether it can be locked down effectively, etc. I think that there is a real shift toward opening up all this content in meaningful ways. A lot of organizations are embracing it, although many are still locking down certain materials.

And of course, everyone still has a terms of use agreement attached to it, and your question really speaks in large part to when those terms of use will go away.

If you’re putting content on the web, it’s already out of your hands, and you’ve ceded control. If I’ve put a status update on Facebook, I have to assume that it will be widely distributed, even if it is not going to be. The fact that I’m publishing it in any form to a public space, whether it’s Facebook, NPR, my personal blog, or whatever, makes it open.

There might be some continued lock-down in terms of whether people can legally make money off of something without my permission or whether they have to give me an attribution. The murkiness lies in those details, but I have to assume that if I just put it out on the web, I’ve lost control of it.

Sean: I certainly concur with that. Do you have a read on what NPR’s position on that issue is?

Daniel: I’m not sure that NPR has quite so extreme a view on this, but I will say that we had many conversations as we talked about the API concerning whether we should release specific types of content. We talked a lot about what types of content we should hold back, in order not to miss certain business opportunities and things like that.

The prevailing argument was that, if material is already on our website, it’s a lot easier for the average person to copy and paste off of our web page than it is to parse an XML feed. So, if we’re concerned about people stealing our content in ways that we don’t want, what we really need is a really nice terms of use that enables people but also sets some boundaries.

Sean: Well, that’s a good place to wrap up, and we are at the end of our time. Thanks for talking with us today.

Daniel: Thank you.

Posted by campsean on Monday, July 27th, 2009