Interviewers:
Scott Swigart and Sean Campbell
Interviewee: Josh Berkus
In this interview with Josh, PostgreSQL Core Team Lead at Sun Microsystems Inc., we asked him about:
- How PostgreSQL fits into the landscape of open source databases
- New uses of PostgreSQL
- How new code and features are selected for PostgreSQL
- Quality control and maintenance of code
- Core maintainers and code contributors
- How contributors participate in PostgreSQL development
- Differences between open and closed source database
- Variations between PostgreSQL products
- PostgreSQL Windows users and Microsoft support
Sean Campell: Josh, tell us a bit about your role with PostgreSQL.
Josh Berkus: OK. Well, what I’m best known for is that I’m a member of the seven member core steering committee for PostgreSQL Open Source Database Project. I also work at Sun on PostgreSQL and I’m the PostgreSQL lead, which is sort of a strategic and evangelism position at Sun Microsystems in our database technology group.
I’ve been a database applications developer since about 1994. I started out actually with desktop databases and moved to Microsoft SQL Server, and from there to PostgreSQL in 1998. For the last couple of years, I worked at Greenplum, a data warehousing company. I did a lot of consulting on database performance. Now I seem to spend most of my time going to conferences.
Sean: Talk a little bit, if you would, about PostgreSQL and where it kind of fits in the ecosystem with things like MySQL, open source on one side and the proprietary Microsoft SQL Server and Oracle Database.
Josh: Well, the way I like to describe PostgreSQL is we’re sort of the high end of open source databases: very SMP scalable; capable of running large, complex queries involving multiple sub selects; and, all kinds of SQL tricks. A lot of functionality is built into the database, including triggers and views and schema, and the ability to use 11 or 12 procedural languages to write procedures in.
Mostly, people who look at PostgreSQL are also considering databases like Oracle or DB2 or SQL Server 2005 for their needs - large database installations involving possible terabytes of data and usually a dedicated database administrator.
That’s probably the majority of our usage. We also get a fairly substantial amount of sort of embedded usage. Not because PostgreSQL is really an embedded database, but because of our extremely liberal BSD open source licensing terms.
Something I would like us to be better known for is the extensibility of the database model. The origin of PostgreSQL goes back to 1986, which was the POSTGRES Project at the University of California - Berkeley. It was the second UCB database project. The reason it was called Postgres was it actually stands for post Ingres, Ingres having been the first UCB database project.
One of the principles that PostgreSQL was founded on was the idea of an object relational database. That is, the database administrator or designer should be able to modify the behavior of database objects with code, or add their own new types of database objects. On a useful basis, that’s given us a whole bunch of exotic data types that are extremely useful for keeping certain specific and unusual types of data genetic sequences, geographic data, or cryptographic data that might be hard to store in the standard SQL data types, let alone to index or do anything useful with.
Sean: Well, tell me a little bit about out that, just to dive bomb for a second. Coming from a database background, there’s been a lot of discussion about this on the SQL Server side, progressively integrating additional data types, working in an XML data type, and so on.
Josh: Right.
Sean: But obviously, one of the challenges with something like this is making sure that it’s valid and useful for the community at large. I think the deeper you get into a particular data type, you really have to have it vetted by the community, to make sure it’s actually useful, and people don’t just go back and throw it into a more generic field type.
So, tell me a little bit about the process of getting something like that included in PostgresSQ, and maybe take the most interesting example you can think of. From soup to nuts, how did it get originally proposed, and then how did it get eventually integrated into the product?
Josh:So an example of one, actually, that’s going to be fully integrated into the core code is the UUID data type. It’s coming out in version 8.3, which is the next release version, and will come out somewhere around October.
Now, I know that SQL Server has had UUIDs for a while. We’ve been actually putting off incorporating UUIDs and GUIDs, simply because there’s five or six competing ways to make up such a data type.
So, when UUIDs and GUIDs were proposed for inclusion in Postgres, one of the goals with that was to actually create a data type that would uphold the various different ways of forming a UUID, and yet have a compact representation as well as its own index types and operators.
That is, for example, equality for a UUID is the same sort of concept as it would be for an integer. Differential operators, like greater than, less than, and that sort of thing, actually don’t function the same.
Sean: Sure.
Josh: Because, among other things, UUIDs will contain information on which machine created that particular ID.
Sean: Right.
Josh:
These are all projects of genomics scientists.
One of the things that they needed to be able to do is they needed to be able to store sequences of base pairs in the database. Now, that’s what we call a multi value data type, as in it has meaning as a whole, in sequential order, but also, each component has meaning in small pairs. Initially they tried to store this, actually, sort of vertically - you have each pair as a row. But when you’re talking about millions of protein sequences, each of which can have a couple thousand base pairs, that’s really not a realistic method of storage.
So instead, they actually modified our already existing array data type which is quite fully featured, in terms of having its own special comparison operators and special ways of indexing it and modified it to hold base pairs.So, that allowed them to do meaningful comparisons of things like equality, and particularly to compare individual base pairs. That is, if the base pair in this third place equals both sequences, what percent of the base pairs are equal to those sequences?
Because in protein analysis, you’re not expecting an exact match; you’re expecting a match, basically, according to percentages. You want to be able to look up and say, “OK, I want to find all the genetic sequences that have this particular sequence of six base pairs anywhere in the proteinî and that requires a special index type which is based on our generalized index search tree. The tree allows you to create your own index types to support some of these exotic data types.
Sean:
Josh:
A second question is how good the code quality is. That’s a big deal, because we’re an open source project that’s been open source for 11 years and part of a project that’s 21 years old. Maintainability in our code base is actually, possibly, our number one priority because every one of our major contributors now was not here at the beginning of the project.
Every single one of them inherited the code from someone else, and they know how important it is to be able to pass it on. So that’s actually been a big holdup in incorporating our full text search type into the core, because it was written by some Russian developers and they had a lot of issues with doing internal documentation of the code in English.
So, in that case, for other data types and the like, there’s going to be the same sort of standards in terms of internal documentation of the code; good quality public user documentation; the code being formatted correctly and easy to read; and, having consistent sub functions and references that are easy to follow and match our other coding standards that are part of our documentation.
Another part of it is going to be, again, for the maintenance issue, whether or not we feel that the contributors are going to be with the project for the medium to long term, because if it’s something that somebody just dumps on us however good it is right now and then leaves, then what we’ve done is we’ve added to the burden of the core maintainers to keep up that extra 10,000 or 15,000 lines of code.
For an individual data type, that’s not very much. If you look at it like we add 10 new data types, then that requires us to have two more code maintainers, just to maintain that extra code base. So, if somebody is not going to be making a long term commitment to maintain the code, then the bar to add it to the core code becomes much higher.
Then, a final consideration would be external dependencies. That is, one of the things that we do to make PostgreSQL easy to install is that the dependencies to install the very core code of PostgreSQL with no additional options are extremely light. You basically need a handful of GNU utilities, certain C code building utilities, and that’s it. That’s made it very easy to install PostgreSQL on 30 different platforms; so has building it from source, so there’s a variety of means. For the people out on Windows, you can build it with either MinGW or with Visual Studio. That wouldn’t be possible if we had a whole slew of external dependencies.
So, thereís add ons that require heavy external dependencies. For example, there’s a procedural plug-in to allow you to use PHP inside the database. The reason it hasn’t been included in the PostgreSQL core code, even though it’s fairly feature complete and is reasonably popular, is specifically because in order to build it you have to have PHP and Apache installed and even configured in certain ways. So that makes it a real dependency issue in order to build that component.
That’s very tricky, and we don’t necessarily want things that are that hard to build in the core code. We’ll put them in add ons, where people realize that they have to take extra steps.
Sean:
Because different projects handle that promotion process a little bit differently.
Josh: Yeah. We don’t have a formal process. We’re actually at the sort of extreme end of open source projects in that all of our policies are negotiated and unwritten, pretty much. It’s because of the age of the project and it’s because we’ve always had a consensus process for making decisions, which has yet to break down. So there hasn’t been a need for some of the more elaborate, formal structures you would find in, for example, Apache.
Sean: Right, so you guys haven’t had to have like a voting process, per se, on things. It’s kind of been a communal decision process.
Josh: Yeah. There are some things that are conventions. You’re not going to be considered, for example, for getting commit ability on the CVS tree if you haven’t been around the project for a couple of years, contributing. Again, it’s a 10 year old project. We feel that we can wait two or three years for somebody…
Sean: [laughs] Right, right.
Josh: If they go away in that amount of time, then we didn’t want them as a committer anyway.
Sean: Right, right. Somebody that flashes onto the list and is like, “I want to be helpful! I want to be helpful!” then six months later, you’ve never heard from them. You have kind of a base vetting process just from that alone.
Josh: Yeah, yeah. So we’ve got people who’ve been around and contributing code for a couple of years. Volume is also a consideration, because somebody who’s only contributing one or two small patches per version, then there’s no particular need for them to have any greater level of access.
The big issue is having lots of free time or having time paid by your employer to work on these things, because the main thing that we need from major contributors now is actually time to review other people’s code.
Sean:
Is there a scenario where someone is funded predominantly by their employer to write code for PostgreSQL?
Josh: Yes.
Sean: And what percentage of the base of people writing the mainline code probably falls into that type of category?
Josh: We haven’t tried to do a count for about three years, but I would estimate, now, 80% to 90% of the code changes that go into any given version are written by people who were either paid directly to work on PostgreSQL development, or for whom working on PostgreSQL development is an approved use of their work time.
The second class would be large PostgreSQL users, like the staff of Afilias, for example. They’re not required by Afilias to contribute to Postgres, but if they want to spend Tuesday afternoon working on a Postgres patch, it’s completely acceptable to Afilias if they do so.
There’s a number of those. So, yeah. That’s actually one of the big myths of open source is people imagine a bunch of hobbyists. And I’ll say, in the early days of Postgres, we were hobbyists, because you couldn’t use it for much. I was actually earning my living as a SQL Server performance consultant, and working on Postgres for sort of my own stuff. But once a project gets big and commercially adopted, you’re going to find that at least three quarters of the code contribution comes from people who are paid to work on the project.
Sean: I want to give Scott a chance to chime in here, too. But one of the things that came out of one of the conversations we had I think it was with Michael Tiemann - was the concept of: alright, so you’ve got a set of developers, they work for a large company, and that company would like to get something into the product. So, they go off and squirrel away and work on some feature.
Letís say it’s the genetic discussion we were having earlier, right? That’s probably not the best example, but let’s imagine that was the case. How do you deal with the challenge of someone going and building ìxî, and they feel they’ve invested real company time and stake and equity in it, but yet, maybe it doesn’t make it in because the community writ large just doesn’t feel it’s maintainable or it doesn’t meet the standards you guys are looking for?
Josh: Yeah. Well, that’s something that needs to be handled with a fair amount of diplomacy. And there have certainly been failures on that in the past, but I think itís because all of our interactions with our contributors tend to be highly personal. That is, if something gets rejected, then it’s going to be after Bruce or one of the other reviewers had 16 or 17 different email interactions with a contributor. But no, that’s not until after we’ve given the contributor multiple chances, made it clear to them what needed to be changed, and given them multiple chances to modify their stuff, and hopefully made them understand why it was being postponed.
We’re actually struggling with that right now, because we’re having a bit of a fire hose problem with version 8.3; as in, when we hit feature freeze at the beginning of April, we had something over 100 different patches pending, some of which involved up to 50 60,000 lines of code. So, the result is we’ve actually set a very high bar for things making it into 8.3. Patches that we might have accepted in an earlier version and spent more time getting up to the acceptable standards of code are instead being held back for 8.4.
Sean:
So, from a database side in the open source world, what do you think the open source development methodology brings to a database product that either gives it more credibility, a better feature set, etc. when compared to a closed source model for developing a database product?
Because we’ve asked everybody this, in some form or other, and the responses have been really interesting. I don’t mean that in a pejorative way, I mean they’ve been very interesting to listen to.
But, we haven’t talked to anybody from the database side.
Josh: Well, actually, one of the biggest benefits is for security and reliability. We actually had Coverity run a code check on PostgreSQL a couple of years ago, something that they’re apparently going to do again for us, and one of the first things that I noticed is that the PostgreSQL core actually has possibly the lowest code count, in terms of lines of code, for any major SQL database.
What that’s indicative of is that we’ve spent a lot of time; that every time we release a new version, there is significant refactoring involved in it, and a real effort to keep the code clean and eliminate anything that’s Byzantine or hard to read.
The payoff for that is that it makes it very easy to keep the code reliable and secure. That is, if somebody reports a security issue, we can generally come up with a fix in 24 to 48 hours, because it’s very easy to zero in on exactly where the problem is happening.
It also prevents such issues from occurring in the first place, because there aren’t mystery functions that nobody understands and can’t touch. Having worked on some proprietary software, I know how that kind of stuff creeps into your proprietary software, because you’re more concerned with meeting the ship date, and the idea is that you’ll clean it up in the first update version. Only after you meet the ship date, cleaning it up becomes a low priority.
So for us, because all of the code is out there and that it’s all visible, maintainability is a primary goal. There is no postponing cleaning it up. The cleaning it up has to happen before we release. The result has been very highly secure and very highly reliable code.
Scott: So, you mention that there’s a lot of patches queued up, and a lot of them aren’t necessarily going to make it into this upcoming version. How is that a decision that’s made? Open source projects kind of have a different hierarchy and a different culture, so I’m trying to understand with PostgreSQL, is it kind of a representative democracy where the steering committee sort of votes on those or…?
Josh: Think of it as almost a pure democracy. Yeah, because most of those decisions are made by rough consensus on what we call the “hackers mailing list” which has something on the order of 7,000 subscribers. Although probably only 75 -100 of those people are really active.
Scott: Sure.
Josh: The rest of them are just monitoring what goes on. Basically what happens is, if there’s a patch, there’s a couple of other lists attached to that - the actual patches mailing list or the actual committer’s mailing list. But most of the discussion happens on “hackers.”
If somebody submits a patch, or preferably a specification before they submit the patch, then there’s going to be lots of discussion and we’ll form a rough consensus on whether it’s a good idea or not, whether it belongs in the core code, and whether it belongs in an add in, and other issues like that. Then, when the patch actually gets submitted, it becomes up to the code reviewer who will generally be one of our handful of committers, people who actually have direct access to the CVS tree who decide whether the code is of sufficient quality to make it in or whether it needs work.
If that’s an extended process, they will generally take that back onto the hackers mailing list and say, “This is a really cool feature, but the code is a mess and it needs X, Y, and Z. If the original contributor didn’t clean it up, is there someone else who cares enough about it to clean it up, or are we going to hold it back?” That will get sort of worked out there. And it’s sort of peer democracy. It’s not so much pure democracy as actually what I call “volunteerocracy.”
[laughter]
Josh: It’s that somebody can force a decision by saying, “This feature is really, really important to me and I’m going to do whatever it takes to clean it up so it can go in.”
Scott: Right.
Josh: When somebody doesn’t step forward and do that, often stuff gets held back.
Scott: OK. So there isn’t like a formalized vote, but it is pretty obvious kind of what the consensus is.
Josh: Yeah. I mean, the core team is a steering committee, but our goal is to actually do as little as possible. The main thing that we do is we set the date for feature freeze, beta and release. We handle security issues, because those need to be dealt with in a confidential forum, and that’s it.
There’s been months where we’ve gotten maybe a dozen messages total on the closed core list. The vast majority of any decision making, any reviewing, any discussion, happens on the public mailing list, particularly hackers, but also to a lesser degree, patches and committers.
There’s a few segments of PostgreSQL, things like the JDBC driver, which have their own development mailing list, so they make their own decisions within their development mailing list to submit their decisions to hackers. We take their word for it because the rest of us don’t know anything about Java.
Scott:
And the different distros are all basically kind of a fork of that kernel.
The kernel that ships in Ubuntu isn’t exactly the same as the one that’ll ship in RedHat or something like that. And part of the reason for that is there may be certain features which are very important to RedHat customers, but it’s not something that they’re able to get necessarily into the core kernel, right? I’m curious if the same kind of thing happens with PostgreSQL; if Sun, for example, had customers who really needed certain features but the consensus was that those features shouldn’t end up in the core products - at least at this time - do you end up with companies kind of making their own weak forks for their specific customer or does that not really happen on this project?
Josh: Yeah, I’ll speak for PostgreSQL in general and then I’ll tell you specifically what we’re doing at Sun. In the case of PostgreSQL in general, it wasn’t something that used to happen, even though we supported the idea. We always have had sort of a kernel model, like Linux.
If you have the PostgreSQL core code, that 13 megabytes of code, and then you have probably 100 megabytes of add ins on places like (TP)Foundry and SourceForge and our contrib modules and a whole bunch of other places. In the past, it’s sort of been up to the user or the developer to add these things together themselves. What’s been happening more recently is that the packager for the Linux distribution or the BSD distribution has made certain decisions about what packages they want to include.
But those decisions have been fairly lightweight. People have not been taking the strategy of putting together an actual distribution until recently. And what’s changed recently is that we’ve gotten a lot of startups like Greenplum and EnterpriseDB that have sort of their own special version of Postgres that are deliberately maintained forms. In the case of Greenplum, it’s a data warehouse enhanced form of Postgres, and in the case of EnterpriseDB, it’s an Oracle compatible version of Postgres. There’s been a couple of others that are older, like the old Windows version, and the multi threaded Windows version called PowerGres in Japan, which is actually about four years old. Fujitsu actually has their own version, which is called Fujitsu Supported Postgres.
So, what’s been happening with this is that a lot of those companies do develop their own features for Postgres, which they submit, but don’t necessarily get accepted immediately, and possibly don’t get accepted in their original form.
That’s going to cause problems for those companies down the road, because… Well, I’ll give you an example. Greenplum actually developed bitmap on disk indexes. We currently have bitmap in memory indexes released with Postgres. But Greenplum developed bitmap on disk indexes a couple of years ago and put them into the Bizgres open source project to be submitted into the PostgreSQL core code.
The thing is that there have been some issues with index maintenance and with code style, and particularly with conflicts with other patches that we’ve incorporated to improve the performances of indexes, and that bitmap index patch.
Because those have not been resolved, the bitmap index patch is still not in the core code of Postgres. The problem that that is going to lead to is that when that patch does make it in in 8.4 or whatever it’s quite possibly going to have a slightly different API from the version that Greenplum has been distributing with their own proprietary product.
That will put Greenplum in the position where they actually need to have support for both versions: their original version and the version that made it into the core code. What that results in for these companies that are actually making core changes and distributing them in advance of getting at least vocal approval from the community, is that they develop an increasing maintenance burden. Now, companies like Greenplum and EnterpriseDB and Fujitsu in general recognize this and try to avoid that situation. They try to wait until their patch is queued and accepted before they start distributing it. But, like with the bitmap indexes, it doesn’t always work. Since the process of actually distributing modified versions of Postgres and marketing them heavily is relatively new, except for PowerGres, then it’s a little hard to see what could happen.
I mean, what could happen is what happened with PowerGres. XRA in Japan developed a multi threaded, high performance version of PostgreSQL 7.3 for Windows. But they modified PostgreSQL heavily to make it work in that context, to the point where it no longer worked on Linux. When the PostgreSQL project decided that we were going to adopt Windows as a platform, which we finally released in version 8.0, one of the decisions was that we wanted to have the same core code with no substantial differences regardless of operating system. That is, we weren’t going to have a separate code tree for Windows because that was going to be impossible to maintain.
So as a result, when we released the official Windows version, it was substantially different from PowerGres. So now a lot of the users in Japan are in the sort of weird position where they’re stuck with PowerGres, which is no longer advancing; it’s stuck at the version 7.3 feature set.
Or, they adopt the new official version, which is already five versions and four years later, and have a different performance profile and a lot of changes that will be requiring them to change their applications. So, that’s the sort of thing you want to avoid. That’s why at Sun, with our distribution of PostgreSQL, the PostgreSQL for Solaris, one of our policies is that we actually don’t distribute anything until it is accepted into the patch queue with a very strong assurance of acceptance versus revision by the PostgreSQL community.
If it is completely separable as an add in, if it’s something that can be added at build time and no later and doesn’t modify other APIs, then we might accept something. The particular example I’m thinking there is probes, which is that we can add in additional probes non invasively. It doesn’t matter if we add in a few extra probes before those probes appear in a community version, because they don’t affect other functionalities.
Scott: If you had to kind of ballpark it, how much of the effort do you feel like goes into adding new features to the product versus how much effort it is to support such a variety of platforms?
There’s different Linux distros, obviously, and supporting Windows on the same core code base obviously presents challenges also. How much time goes into bugs and testing and things like that just to ensure really good compatibility on all these different platforms versus building out new functionality?
Josh: Well, see there’s another way… You asked earlier about what the benefits were of the open source development process, and that’s another area where having really clean code is the big benefit, in that it’s allowed us to actually minimize the amount of platform specific engineering that we do.
We’ve also made some sacrifices in order to minimize that maintenance version, that maintenance issue. For example, on Linux and Unix platforms, we only use POSIX standard interfaces. This means that we’re not making use of some other operating system interfaces specific to particular operating systems that might give us additional operating system features and performance. For example, one of the big discussions I have here at Sun is that Sun’s new file system is ZFS and has some of its how APIs that are non POSIX. The ZFS people keep telling me it’s going to give us some tremendous benefit using databases on ZFS.
But we really don’t want to do that as an open source community, because the moment that you do that, you’re dedicating some amount of hours of somebody’s time just to maintain that interface for the code. So, we’ve completely avoided doing that, and that’s allowed us to minimize the maintenance version. That sort of platform specific bug and compatibility issue then becomes less than 20% of our overall development effort.
Now, Windows in particular, because there are more maintenance issues associated with Windows is very different from the POSIX platforms. We do have a couple of people for that. For example, one of our major contributors is Magnus… I’m going to mispronounce his last name, so I won’t say it. Magnus H. from Sweden. He spends most of his contribution time to Postgres, and he probably spends somewhere around half of his work time contributing to Postgres. He spends the majority of that, actually, maintaining Windows compatibility issues. So think about that as sort of one quarter of a developer for a year. Plus, a bunch of our other contributors and maintainers, like Bruce and Tom and Dave Page particularly, spend a minority of their time dealing with Windows build specific issues.
Again though, we try to stick to standard interfaces and to minimizing any particular code paths for Windows. Now, unfortunately that does limit our level of performance on Windows and our ability to integrate with some of the Windows utilities. But in terms of preventing us from having to have a completely separate version of Windows, it works.
Scott: If you talk to somebody who’s shipping a compiler, they would expect that… Well, Intel would sort of put people on the project to make compiler optimizations for Intel architectures, because Intel would really know how to do that.
AMD would put people on the project who make the compiler optimization for AMD architecture, because they would know how to do that. So you end up with kind of this collaborative effort where companies are kind of coming together and they’re putting their expertise in on the stuff they’ve developed. Microsoft isn’t staffing anybody on making sure that Postgres runs as well as it can on Microsoft’s operating system, is that correct? I mean, it sounds to me like you’re saying that work is being done by other people who…
Josh: Yeah. The Microsoft folks have been friendly to us, particularly Microsoft Labs have been consistently friendly to us, but Microsoft doesn’t contribute any efforts or help with the project. And I’ll actually say, except for Sun, who is directly involved.
For some of the other things, for example, Intel did actually have some ICC optimization efforts, but that was through EnterpriseDB, and I don’t think it would have happened without EnterpriseDB’s involvement. So, a lot of this has been by proxy, which would be the case with Microsoft as well, if there was any huge interest in it. Microsoft has its own database product though, one that they’re pretty dedicated to promoting. Well, a lot of individual Microsoft engineers have been extremely friendly to us, but nobody actually contributing to the Postgres project works at Microsoft that I know of.
That work is done entirely by community people. And even those that are primarily Windows maintainers, like Magnus, spend as much or more of their time using, say, Linux than they do Windows. So, there hasn’t been a big push to Windows specific optimization through them.
Sean: This is good. I guess the same thing we’ve offered to everyone, Josh, are there any closing comments you would like to add to the discussion as it’s extrapolated so far? Is there something you feel we haven’t’ touched that you’d like to get on the record, I guess?
Josh
Sean: Well, both. We’re getting heavy traffic from both sides. But still, there’s definitely some Windows folks involved, so go ahead.
Josh: Well, actually one of the other things that made it possible for us to do Microsoft for it has been that in general not for our product specifically but Microsoft has actually made tools to build and run programs made for Linux and Unix a lot easier on the Windows platform than it used to be.
One thing in particular is that the PostgreSQL project was actually the first user of the open source WiX installer, something we’re actually extremely happy with. It has allowed us to make PostgreSQL vastly more accessible to users on Windows because it provides them with a really nice installer. So nice, in fact, that after we released the first Windows version, a bunch of the Linux folks began saying, “Hey, why can’t we have an installer like that in Linux?”
[laughter]
Scott: Well, that’s one of the other things too that you run into, is that in the Linux world, and even in the Unix world, people are more comfortable kind of running, making and building it for their particular configuration.
But in the Windows world, it’s really sort of mandatory to ship a binary and an installer, not so much the source. But that’s interesting that you found the WiX project to be really useful for your needs.
Josh: Yeah. It showed up at exactly the right time, because it got released like, I don’t know, three or four months before our targeted first Windows release. We were able to go out of the gate with an installer, as I recall. I wasn’t actually involved in the Windows build at the time, so I’m not sure that’s 100% accurate. But certainly very close after having the Windows release, we had a nice graphical installer that was not only a nice graphical installer, but it also has like little check boxes for all the most popular add ins.
Scott: Oh, nice.
Josh: And again, if you’re a Linux user or whatever, somebody says, “Oh, that’s in the contrib module, you just need to type in these three commands and it will be installed.” That’s not really available to Windows users. They have to have a whole tool chain that doesn’t ship with Windows. So, having that nice binary installer has made it vastly more accessible for Windows users.
Scott: Do you have any sense for how much Postgres runs on Unix versus Linux versus Windows?
Josh: It’s a little hard to tell, because where we really get a full sense is the people that are active on the mailing lists. But we’re keenly aware that that doesn’t actually represent what’s actually out in the field.
Scott: Sure.
Josh: Because we only have, like, 35,000 - 40,000 people that are active on the various mailing lists and forums. Whereas even just judging by a certain manufacturer’s distributions, who are bundling PostgreSQL in their products, there’s several tens of millions of copies out there in the field.
Just given that Windows users tend to be more used to web forums and IM than they are to mailing lists and IRC, we’re guessing that we probably have a lower level of participation from Windows users. So, on the one hand, we have this vast majority of downloads from the Windows users, but on the other hand in terms of people who actually come back to the project and ask questions, the people that we know are on Windows is a minority.
But that doesn’t tell us who’s using it. You follow me?
Scott: Yeah, sure.
Josh: Because it may be that a lot of people are using it on Windows and they’re just not joining the mailing lists. They’re getting their help in other ways.
The other thing is that having a unified code base where we have the same version for Windows and Linux, etc., means that actually for a lot of the newbies asking questions about how to do this and how to do that, until we have some interaction with them, we don’t actually know what platform they’re on.
Scott: Right.
Josh: I’ve had a number of chats on IRC where it wasn’t until I got to like the third or fourth exchange of questions that I realized that somebody was on Windows. Which is a terrific thing in terms of being able to support our user base, because that was one of the biggest things I was worried about when we released the Windows version, that all of the sudden we would have 50,000 new Windows users hitting the mailing list, and those of us who are not Windows users would not be able to help them.
Scott: Right.
Josh: And it hasn’t turned out that way. So, I would guess that probably the majority of installations are in Windows, just based on the download numbers. But how much those people are using their installations, I couldn’t tell you.
Scott: OK. Great. Well, thank you very, very much for taking the time to chat with us. This has been great. You were really good to talk to because you were able to really dig into a lot of how the process works. If you look at a lot of our interviews, process tends to be some of the stuff that we’re the most interested in. And I learned a lot, just from this conversation.
Josh: Well, I’m happy to have a chance to explain this and some stuff about how the project works. We don’t have a lot of written policy, so it can be very opaque to somebody who’s just joining. I actually recently did a presentation on developing PostgreSQL at OSCON because we realized a few people had problems with how very indecipherable this was to how we were supposed to do things…
Scott: Right.
Josh: I’m really glad to have a chance to explain that a little.


