Interviewee: Matt Gibbs
In this interview with Matt, we asked him about:
- Approaching open and closed source SDL
- Development challenges for UI Framework and Services team
- Handling security within feature development
- Incorporating SDL into ASP.NET
- Launching multiple distributions with open source
- Testing against multiple operating systems
- Testing cycle for features
- Managing bug fixes and feature stability
- Following a development cycle
- Microsoft philosophy on product development
Scott Swigart: Hi, Matt. Before we start, go ahead and introduce yourself.
Matt: I’m Matt Gibbs, and I’m the Development Manager of the UI Framework and Services team at Microsoft. My team is responsible for ASP.NET and AJAX as well as parts of Silverlight.
Scott Swigart: Thanks! We’ve been commissioned to look into the premise that open source and closed source SDLs [Software Development Lifecycles] are fundamentally different, and that the differences manifest themselves in the final products in tangible ways.
That’s not to say that one is better than the other – certainly both models have their strengths and weaknesses – but it’s really designed to get information out there for two audiences: one is software vendors who are looking at building products. One of their first choices is whether they are going to go with an open source or a closed source model, and we want to look at some of the ramifications of that choice.
The second audience is people bringing products into their organizations, and we want to uncover some of the basic expectations that they could have around open and closed source, or some of the characteristics of each model that might impact their choices.
Richard Bowler: You’ve been the Development Manager for ASP.NET for about a year, right?
Matt: Yes, about a year.
Richard: What would you say are the major development challenges with that product?
Matt: I think for us, the biggest challenge is that we have a large customer base, so as we innovate, we have to be very sensitive to backwards compatibility. We have a fairly complex platform, so innovation is challenging when you’re trying to make sure that existing applications people have built with it continue working.
Richard: It’s a pretty complex project, and it’s built on another complex project, which is the CLR; to co-develop with the CLR team, do you just take what they give you and work from there, or how does that work?
Matt: We work really closely with them. In fact, we meet regularly to go through code changes we’re making with members of the CLR team, so that we’re all in sync.
Richard: It sounds like you’ve been able to direct feedback to them, saying, “We really need this kind of capability, let’s fold it into the next release,” or something like that.
Matt: Yes. And they’re pretty sensitive to how their changes impact us. They run compatibility tests and stress tests against our scenarios, as well.
Richard: It seems like a huge challenge, when you consider the number of software products that Microsoft has on the market that interoperate. It seems like a huge challenge to keep them all working together.
Matt: For the framework we’ve actually held a pretty strict line. I am part of a committee where we review things that were classified as “potentially breaking a developer’s code,” and the default was that those weren’t allowed – that was kind of the highest priority. Things like “security fixes” would be allowed. Then there always needs to be a way to get back to an earlier version’s behavior, through a configuration parameter, for example.
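The pattern Matt describes – ship the changed behavior as the default, but leave a configuration escape hatch back to the old behavior – can be sketched as follows. This is a hypothetical illustration, not actual ASP.NET code; the flag name and the sanitizing “security fix” are invented for the example.

```python
# Hypothetical sketch of a compatibility switch: a security fix changes the
# default behavior, but a configuration flag lets existing applications opt
# back into the pre-fix behavior so they keep working.
LEGACY_HEADER_ENCODING = False  # in practice this would come from a config file

def encode_header(value, use_legacy=LEGACY_HEADER_ENCODING):
    """New default behavior strips control characters (the 'security fix');
    the legacy path reproduces the earlier version's output unchanged."""
    if use_legacy:
        return value  # old, pre-fix behavior, preserved behind the switch
    return "".join(ch for ch in value if ch >= " ")

print(encode_header("hello\r\nworld"))        # fix applied: control chars removed
print(encode_header("hello\r\nworld", True))  # legacy behavior preserved
```

The key design point is that the fix and the fallback ship together, so a customer hit by the behavior change has a supported way back without reverting the whole release.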
Richard: Speaking about security, we talked with Michael Howard, and also with James Whittaker about the SDL. It looks like it’s really paying off for you guys. Were you involved in the development internally when the SDL was being instituted at Microsoft?
Matt: Yes, actually I worked with Michael Howard on IIS where we ran into a few security issues that helped highlight the need for a more structured approach.
Richard: I know what the benefits are, because we discussed it with Michael and James. But what were the main difficulties in shoehorning those new SDL processes over the top of your existing development processes?
Matt: I don’t think it really came down to process difficulties, as much as it came down to just helping developers recognize that the mindset they had been using needed to be front-loaded with security all along.
In our first round, everybody recognized that we needed to pay even more attention to security, and make it a higher priority. But then the realization was that it can’t be one phase in the development process. It really needs to be something that we pay attention to all along.
Richard: To put it another way, then, you had to get a wrench on the culture. You had to make security important in everyone’s mind.
Matt: Right, and it really had to be the case that everybody came to the same conclusion. You couldn’t just tell people it was important – they had to see what kind of impact it had to customers, and recognize it themselves.
Richard: Did you get much internal resistance from individual developers, or was there pretty much across-the-board buy-in?
Matt: It was pretty wholesale – everyone recognized it. I think particularly since I’ve been involved in web platforms, and we saw a couple of places where we’ve made mistakes, everybody recognized, “OK, we have to do something different, and be more practiced in this,” because otherwise we’re not going to get the right outcome.
Richard: Where was ASP.NET in the process of instituting the SDL? Was Version 1 out before the SDL really came to the fore?
Matt: No, when the process was launched we were already in development. It was during that cycle where we decided to take a break, and we went back and said, “We need to invest time in this process.” Everybody did some extra security training, and then we essentially stopped development, and did a pass for just security. We looked at every feature, all the code, looking for specific issues, just to make sure we had done all the right things.
Scott: One of the things that I found looking at the open source side of things, is just the sheer number of distributions that they have to ship for a product. MySQL for example, I think has 16 distributions, because it has to run on all of these different flavors of Linux, none of which the people working on MySQL really have a lot of control over. Those versions of Linux are all rev’ing independently of each other, so the underlying operating system that it’s dependent on is changing all the time. You’ve got a whole ton of applications that are built on top of MySQL, and are therefore dependent on it.
I can’t even really imagine exactly how to test that, and I know that in the open source world there are issues about, “This version of MySQL isn’t compatible with this version of WordPress, which isn’t compatible with this version of Linux.” So, they are always fighting these compatibility issues as things just rev independently of each other all the time.
Microsoft has some rather similar challenges, in the sense that IIS revs with the server. ASP.NET revs on its own timeline, which isn’t tied to the base operating system, and Atlas is also on its own timeline to some degree.
Talk a little bit about the kind of test matrices that Microsoft builds, and the amount of resources in terms of hardware and testers that come to bear when you guys are getting ready to release something, or while you’re developing something to ensure compatibility with all the versions of all the things you ship.
Matt: It’s probably more expensive than you might initially think, because in the same way that there are a bunch of different Linux distributions, we also still test against a lot of different supported versions of the operating system.
We’ll test against XP and Windows 2000 and Win2k3, with different levels of the service packs and with components installed; with Visual Studio installed, without Visual Studio installed, and then different flavors of IIS. We’re already testing with IIS 7 on Vista where we’ve got a lot of integration work going on. When we go to do compatibility checking and functional testing, we have a fairly large lab. And the run time, unfortunately, is on the order of weeks to really ensure compatibility.
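The configuration matrix Matt describes is combinatorial: every OS crossed with every service-pack level, with and without Visual Studio, against each flavor of IIS. A minimal sketch of enumerating such a matrix is below; the dimensions and values are illustrative stand-ins, not the lab’s real list.

```python
from itertools import product

# Hypothetical configuration dimensions; the real lab matrix is larger and
# also includes language packs and component-install permutations.
operating_systems = ["Windows 2000", "Windows XP", "Windows Server 2003"]
service_packs = ["RTM", "SP1", "SP2"]
visual_studio = [True, False]
iis_versions = ["IIS 5.0", "IIS 6.0"]

def build_matrix():
    """Enumerate every configuration the lab would need to cover."""
    return [
        {"os": os_, "sp": sp, "vs_installed": vs, "iis": iis}
        for os_, sp, vs, iis in product(
            operating_systems, service_packs, visual_studio, iis_versions
        )
    ]

matrix = build_matrix()
print(len(matrix))  # 3 * 3 * 2 * 2 = 36 configurations
```

Even this toy matrix yields 36 machine configurations; adding one more dimension multiplies the count again, which is why a full compatibility run takes weeks rather than days.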
Scott: What I’m envisioning is that somebody pushes a big button, and a hundred machines get configured with tests. It all comes up and runs, and results get cranked out — test, pass, or fail. Is it something kind of like that?
Matt: Yes, and they categorize tests into different levels of priorities so they can say, “We’re going to do a quick sanity check on something,” and that can take a couple of days. Or, they can say, “We want to run the gauntlet, and run everything we’ve got,” like before a service pack or something, and that will take much longer.
It covers everything from different language packs, to different service packs installed. One time-consuming thing is that errors that are false positives come up, and they need to be investigated. These might be just test issues, and that takes time as well, because you don’t want to ship with something that might be a problem, so everything has to be investigated.
Scott: I know that Microsoft’s got certain terminology around this. I’ve heard of something called the BVT, which I think stands for Build Verification Test. There are things that have to happen before something gets released as a CTP, Community Tech Preview. Can you talk a little bit about that, in terms of what the different levels of testing are, and what kind of things have to get verified on those different levels?
Matt: There’s a scaled approach. On a developer’s machine, we’ll have what we typically call Developer Tests.
The developer cranks out a lot of small standalone tests against the feature he’s building, against his APIs. He’ll run those every time he goes to check in code. We’ll aggregate those for our dev team, so that you’re running a fairly lengthy set of them. But, they still run within just a few minutes, so that you can continue making fast development progress. Then we’ll run a superset of those before we check in.
Scott: So, that’s to make sure that one developer’s change didn’t break some other developer’s code.
Matt: Yeah, that the basic functionality is intact. Then, the test team runs what they call a set of nightlies. They’ll kick off an automated set of nightlies on every daily build that comes out. They’re small enough in scope that they can run overnight, and any failures can be investigated fairly quickly. They are what they call “Pri0” test cases – the mainstream use cases.
Then, the superset of that is something that’s maybe ten times the number of test cases, and that’ll run for days to verify the full functionality. That’s literally thousands upon thousands of test cases.
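The tiered model Matt lays out – quick developer tests, overnight Pri0 nightlies, and a full multi-day pass – amounts to selecting tests by priority for a pass of a given depth. A minimal sketch, with invented test names and priorities:

```python
# Hypothetical test metadata: each test carries a priority, where 0 means a
# mainstream use case and higher numbers mean progressively fringe scenarios.
TESTS = [
    {"name": "render_basic_page", "priority": 0},
    {"name": "postback_roundtrip", "priority": 0},
    {"name": "custom_culture_formatting", "priority": 1},
    {"name": "legacy_browser_quirk", "priority": 2},
]

def select_tests(max_priority):
    """Pick the tests to run for a pass of the given depth.

    max_priority=0 -> nightly / sanity pass (Pri0 only, runs overnight)
    max_priority=2 -> full pass before a release (runs for days)
    """
    return [t["name"] for t in TESTS if t["priority"] <= max_priority]

print(select_tests(0))  # nightly scope: only the Pri0 mainstream cases
print(select_tests(2))  # full release pass: everything, fringe cases included
```

This also explains the failure mode Matt mentions next: a scenario classified as priority two only ever runs in the deepest pass, so a regression in it can go unnoticed until just before a release.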
Richard: So, your nightlies maintain a level of sanity about the ongoing nature of the code base – yesterday it was still good, today it’s still good, tomorrow somebody’s checking in something bad. Let’s back up and fix this so you never get too far out of whack.
Matt: Then, you may find something when you go through a full test pass. When the tester originally made the test plan, they may have said it’s a fringe scenario, so it was classified as a priority two test case. You might only run it before a release, and then something breaks.
Scott: Microsoft has a lot of people who blog, so there are obviously a lot of people on the product teams who are fairly transparent about where they are and what they’re doing. I think Scott Guthrie put up a big blog post where he said, “You know, there’s a point in time where we don’t necessarily fix every bug because we have to weigh the risk of impacting the stability of the product.”
Every time we touch a line of code, there’s a risk that we’re going to hurt the stability of the product, so there’s a point where we switch from telling a certain team what we’re fixing to actually having to get permission to fix certain things because of where we are at in the stabilization process. Am I characterizing that right?
Matt: Yeah, we’ve spent a lot of time mining our bug tracking database, and you look at data points like, “For X number of bugs, how many fixes actually cause some other problem?” The closer you get to shipping, the more likely you are to look at something and say, “This is really not a very important bug, and people are just not really going to see this. It’s good that we did the due diligence and found it, but this late in the game, you run the risk of breaking something more important.” You might just leave a bug in there and fix it when you have more time to shake out the problems.
Richard: I guess if you didn’t do that, you’d never ship.
Matt: Exactly. With new engineers, you’ll regularly see them hit this kind of frustrating point when they have a bug, they want to fix it, but it’s time to ship and they get told no. Their point of view is that it’s a problem that should be fixed, and they have to be told that they can fix it, but not until the next version.
Scott: I have to admit that it’s odd to think that leaving bugs alone raises quality.
Richard: In my own process, over time we get down to a mode where we’re primarily testing an implemented product, and engineers are working on bug fixes rather than new features and behaviors. It seems like as you get closer and closer to when you need to ship, it gets harder and harder to jump through the hoop of whether a bug ought to be fixed. At some point it’s got to be an absolute showstopper to stop the shipment; it’s got to be a crash or something like that, otherwise you’re just not going to fix it until the next release.
Matt: Yeah, we get to the very end where we call it recall mode, where the question is whether you would actually take the product off the shelf in order to fix this issue. At this final point in the process, unless we’d recall the product, we don’t fix it.
Scott: Another thing that’s pretty interesting to me is the whole way that Microsoft does milestones. Again, my impression of it from the outside is that somebody will say, “OK, these are the features that we’re targeting for milestone one, for M1,” and so developers will go off and they’ll be working on those features. There will be a graph of the bug count that’s just climbing, climbing, climbing; the number of test cases written are just climbing, climbing, climbing.
At some point, it seems to switch over to where new features aren’t being written, it’s focusing on stabilization for the M1 milestone. So, there’s a trend down in the bug count. It seems like there are a lot of different variables that are being looked at in those stages; what’s happening to the bug count, what’s happening to the perf numbers…
Can you talk about how all that stuff comes together? Am I characterizing these milestones correctly? Is that sort of how it works?
Matt: In the past, that is how I would have characterized it. A sort of waterfall approach – where we crank out a bunch of features and accumulate bugs, the test team cranks out test cases, and then we play catch-up trying to satisfy all those test cases and bring the bug count down – isn’t the most efficient. We’ve actually shifted in our group to a somewhat different approach that we think is working better.
Instead, we’ve gone to what we’re calling a feature crew model where the dev, test, and PM work together from the start on defining the design, the set of test cases, and the unit tests, and they drive for more quality up front before the feature even gets checked into the product. So, they’ll work in an isolated source code branch getting a feature to a higher level of quality and then get it into the product.
Richard: It sounds like you’re moving, at least to some extent, toward a more Agile model.
Matt: Yeah, it’s ended up being a hybrid of Agile development with features of SCRUM, but you couldn’t really call it one or the other. It still falls into a sort of waterfall effect where we have a milestone of a bunch of features we’re going to try to get done by a certain time. There is still this mentality of feature crews driving for a deadline, so we picked the pieces of a bunch of different methodologies that seem to work well for us in delivering commercial software, and we are trying to make them work together.
Scott: Now, I’ve heard a little bit about these feature crews. Correct me if I’m wrong, but that seems to have been developed after the ship of Visual Studio 2005. It came out of … I think they called it like, MQ or Milestone Quality, or something like that?
Matt: Yeah, it was kind of built around the milestone for quality.
Scott: So, some of the ramifications of that are that if there’s a feature you’re dependent on to build your feature, it might take you longer to get that. But, when Microsoft releases things like Community Tech Previews, they’re going to be more stable by default, because things were only checked into the source base that passed a higher quality bar. Is that right?
Matt: That’s the goal, and we were able to benefit from the approach with the ASP.NET AJAX release. We embraced the feature crew model when it was still ramping up for the rest of the division, and it worked out pretty well for us. We learned a few things we could do better.
We also learned that some of the dependencies became a little bit difficult to manage. If you truly had a feature that needed something else to be completed, then you ended up with a staggered set of feature work waiting for future builds. You needed an equal set of things that could be done early, in order for everybody to keep being productive.
Richard: It sounds like what you’re doing is trying to make sure that your interim quality is higher by testing feature-by-feature, and by enforcing a quality standard on individual pieces instead of waiting for the entire conglomeration. Is that a correct characterization?
Matt: Yes, hold the line on quality all along. Don’t accumulate debt in order to get more features done earlier.
Richard: That seems really smart to me. There’s an old saying that you can’t test quality into a product, and that’s really true. Was that inspired by the SDL’s focus on involving quality assurance early in the process?
Matt: Yes. Before, we almost had two instances of lifecycles. You had development working on one lifecycle model, and you had the quality assurance teams running along six to eight weeks behind the dev team.
But, that wasn’t really working efficiently because all we were doing was having one team try to play catch-up with the other team. The dev team was cranking out features, and the test team was finding problems with it. Then, the dev team was trying to catch-up on the problems that the test team found.
Scott: With Visual Studio 2005 when it came out, there were a lot of issues around stability and quality, and things like that.
It was easily the most complicated development environment that Microsoft had ever built, maybe that anybody had ever built. It had amazing levels of cross-product dependencies; it folded in big products in their own right, like Team System.
Some of the impression, at least from the outside, was that it didn’t all come together and stabilize at the end the way that everybody had hoped. So, is that correct? Do you feel like maybe the initial version wasn’t as stable as everybody hoped, and so some of these things have been done as a response?
Matt: That may be the conclusion from my team’s perspective, focusing on the platform. I think we probably feel like we’re in pretty good shape because we have criteria and goals that were met.
But, I know in the MQ timeframes a lot of people focused on a lot of little things that weren’t quite right that they wanted to fix up for the next release. That was primarily targeted at the development tools environment, not the platform.
Scott: You work in the open source world in a sense with Atlas, right? Because there are a set of controls for Atlas that are effectively open source in that Microsoft puts the source out there. They’re maybe not open source in the sense that anybody in the world can just modify the source and submit changes to it, but please talk a little bit about this foray into open source with some of the Atlas controls.
Matt: The AJAX Toolkit Project is a separate team around the corner from us, and it is shared source. They take code submissions from the public. They have released the really snappy UI controls that everybody uses on top of the Atlas part of the platform.
It’s been well received, and people are able to grab the source, and build it, and contribute. It’s a little different for us in comparison to what we’ve done before. We’ll see how much the community itself drives lots of new controls, or if it’s something where they like the fact that the controls are out there, but they don’t really actively contribute a lot.
Scott: Certainly a key benefit of open source is that the source is out there for you to learn from. I think that even taking the Windows Forms controls and the ASP.NET controls, certainly a lot of best practices might have been gleaned from being able to look at how Microsoft did what it did.
Matt: Well, we’ve done that with the Atlas release now, too. That’s a first for us, and I’m really proud of it. When we released ASP.NET AJAX, we released the source symbols for it so you can attach a debugger, and step right through our source code.
It’s not the same as taking source code in from the community, but we look at it as being very open and transparent. It’s one of those things where in the open source community, everybody says, “I have the source, so I can do what I want with it.” From my perspective, though, people tend to say that but not do it.
Scott: I think they call it the Berkeley Conundrum. It’s a variant on “if a tree falls in the woods, but nobody hears it, it doesn’t really make a sound.” If there’s so much open source code out there, and not really enough eyeballs to actually look at it, does it really matter that it’s open source?
One of the things that seems to me like a huge challenge is that inside of Microsoft you can control exactly what people get to do. You can control the training that they have to have around security, and performance, and best practices. You can control the unit tests they write, and you can control the process that the code goes through.
When you take a look at open source, you just get code delivered to you and you don’t have any control, or any view into the process that was used to create that code.
From Microsoft’s perspective, you’re taking submissions from the community; how do you ensure things like security, performance, and reliability, when anybody could be submitting the code, and you don’t have control over how they got there?
Matt: I’m not intimately familiar with it, but my understanding is that the team has essentially taken on the burden, so far, of ensuring some of the security themselves by laying our security process on top of the code before it goes out to the public. I also think there’s an element of caveat emptor.
But, because it’s coming from Microsoft, there has been a certain expectation of quality. At this point, because we have a dedicated team working on it, they’re in the code and it’s still going through quite a bit of the Microsoft process.
Also, at this point, you can’t just submit and have it go right out to CodePlex. You submit as a team, and your submission is then under review for inclusion on CodePlex.
Richard: What do you think are the advantages of closed source over open source?
Matt: In my mind, it comes down to probably two things. One, the customer knows they have somebody to go back to, that there is a company that they’re buying software from that is backing it up. Two, there is a full-time developer that is working on advancing the software.
Open source has done fantastic things with having a community effort drive things to do good software. But, the fundamental difference is that with proprietary software you have an entity that’s responsible and people employed to push things forward.
Richard: What about the development of that process itself — the mechanics of designing, implementing, and shipping a piece of software. Do you see advantages to the closed source model, and if so, what are they?
Matt: I think you may have a little bit more of a culture of discipline, but I’m not sure that it’s really that much different. There’s so much passion from people in the open source community in what they’re building that it may be the same kind of thing – the same kind of respect somebody has for their full-time job.
Scott: There are two myths that I frequently encounter, one related to open source, and one related to proprietary software. The open source myth that I hear again and again is that many eyeballs looking at the code ensures security.
Michael Howard really threw down the gauntlet on that and said, “Show me the data, because there’s just no data to support the idea that you can make sure code is secure just by open sourcing it and letting people look at it.”
On the closed source side, one of the things that I often hear as being an advantage is this notion that there’s a company to go back to, there’s a company that stands behind it. My impression is, I don’t have any guarantee that a bug that I find will get fixed in the product. I don’t have any guarantee that a problem that I’m running into will get solved by the company that produced it. I can file a bug, I can call product support; maybe I’ll get a resolution, maybe I won’t.
Do you think that Microsoft or any other company really offers that level of guarantee?
Matt: In my experience having worked as a quick-fix engineer, where we investigated customer-reported problems from PSS, it was very serious. If you couldn’t find some way to work around it, it needed to be a product fix. Now, with that being said, if there was a reasonable workaround, then you’re back to the issue of whether it’s worth the risk to go patch product code.
There’s not a case I can think of where a customer had a serious problem and we didn’t either find them a workaround or patch the product in order to get them back on track. The open source community can’t guarantee every requested change will be made either.
Scott: I guess the only caveat to that might be that obviously, it’s not going to be instantaneous, in either the closed source or the open source world. I remember a problem I found, and the answer was that it would be fixed in Service Pack 2, but Service Pack 2 didn’t come out for another six months.
Matt: Very few fixes in proprietary or open source development are instantaneous. More complex issues may take longer, and again there is the concern about compatibility. When something like Service Pack 2 ships, it’s had many hours of testing devoted to it to ensure that even small fixes are getting good coverage.
Scott: Clearly, from your perspective, that guarantee of supportability is real, but there are just certain kinds of physical constraints on how fast the resolution might be available. But in your experience if there is a critical problem, either the customer gets a workaround for it or Microsoft starts working on a fix?
Matt: That’s been my experience in ten years here.
Richard: We talked to one of the guys that works on the Linux kernel. He basically said what happens is that people turn code in to the mailing list, and then everybody takes a look at it. If they’re interested and they snipe at it, then the developer makes some changes. Then, ultimately it either gets adopted by the maintainer – the person who owns that section of the kernel – or it doesn’t.
So, at least in some sense there’s no process to enforce quality during implementation beyond peer review. It seems to me intuitively that that’s a disadvantage. But, I just wanted to get your take on it.
Matt: Of course we’re not working on the kernel, but I know that before a piece of code I write is going to get anywhere that it’s going to get exercised quite a bit. It seems like inherently that’s an advantage.
Scott: Matt, talk a little bit about motivation inside of Microsoft. How much is it sort of computer-generated, and how much is it, “Developer Number 13 had this much code checked in, this few defects, found this many bugs…?” How does it work in the “big house?”
Matt: For us it’s all about our developer customers. It’s about the level of enthusiasm and people’s feedback, and it all feeds off of that. We get our customer’s excitement, what they want to see, what they’re going to be able to build. They say to us, “If only you added this feature you could save me time, and here’s what I want to see next.”
The people here in Microsoft are completely driven by what our customers want to see next. We may be in a bit of a unique position because our customer is a developer, and developers get really excited about the platform.
Scott: I’ve heard people who’ve interviewed at Microsoft come back and say, “There really is a sense that by working at Microsoft you have an opportunity to change the world.” I would guess that many of Microsoft’s products … if you take a look at Office, Visual Studio, Silverlight, and how you hope those products progress over time, most of these are big bets. In most of these cases, you are part of something that you hope is going to be really big and industry-changing.
Matt: Yes, it’s the kind of thing where I’ve had people come up to me at ASP.NET conferences and say, “You changed how I do my work. It’s been an incredible experience and I can’t wait to see what you guys do next.” It makes an impact on their life. It makes it fun.
It’s different than what Scott was driving at; the “how many defects” and “how many lines of code” kind of thing. We really run off of developer-driven enthusiasm. There’s an enthusiastic, sharp, hardworking group of people that are really focused on the customer and how to make better developer experiences.
Richard: Well, thanks very much for talking with us. I appreciate it.