Interviewee: David Campbell
In this interview, we talked with David Campbell about:
- His background with the SQL Server team and Microsoft
- The unique challenges of planning, implementing, testing, delivering, and servicing a product as high-profile and critical as SQL Server.
- The Security Development Lifecycle and SQL Server.
- Usability and SQL Server’s development.
- His experience with open source databases.
- How much the experiences of MySQL and other open source databases have affected SQL Server’s development.
Sean: Tell us a bit about your role at Microsoft with the SQL Server Team and as a Microsoft Fellow.
David: I have been working on SQL Server for roughly 13 years in a variety of roles, but my passions are in systems and product development. Currently, I am heading up a team we call “Strategy, Infrastructure and Architecture”, or SIA for short. Over the years I’ve worked on SQL Server as an SDE (Microsoft lingo: SDE = Software Development Engineer), Development Manager, Product Unit Manager, and General Manager overseeing a number of component groups within the product.
Sean: SQL Server is a product upon which enterprises bet their business. So much of what’s mission critical depends on SQL Server. What about planning, implementation, testing, delivery, and servicing do you think is unique to a product as high-profile and critical as SQL Server?
David: In the business world, enterprise database products define “mission critical”. A crash in many line-of-business applications may result in a disruption of service until the server or application is restarted, but a severe crash in a database server resulting in data loss can cripple a business. I learned this lesson the hard way before coming to Microsoft, when I worked on database systems at Digital Equipment Corporation (DEC). One day I received a page from our support staff and found that a product defect had resulted in a production-line shutdown for a major semiconductor manufacturer. They had to send folks home; some of the physical equipment suffered damage from stopping production with partially completed batches of material; and they reminded me that they were losing roughly a million dollars an hour in business while things were stopped.
Getting to your question might require a little background on SQL Server. Some folks might know that Microsoft hired a bunch of people from the database industry in the early-to-mid 1990s to work on SQL Server as Microsoft tried to become a player in enterprise database systems. We acquired a source license for Sybase 4.2 and shipped two versions, SQL Server 6.0 and SQL Server 6.5, on that architecture. We hired a number of query processing experts, and they started building a completely new query processor from a clean sheet of paper. I worked on the storage engine team, and around this time SQL Server was getting beaten up in the market because we didn’t support row-level locking; the Sybase architecture we inherited only supported page-level locking. I was responsible for the design of the row-level locking feature for SQL Server 7.0, and the more I dug into the Sybase architecture the more challenging the design became, since the entire Sybase transaction and recovery system was predicated on page-level locking and it was very difficult to do a clean row-locking design without a number of major compromises. After a number of sleepless nights and arduous design meetings, we ultimately came to the conclusion that we would have to rewrite much of the Sybase storage engine to do this feature correctly. As part of this major architectural shift, we made a decision to change the on-disk format for SQL Server 7.0. This meant customers would have to unload and reload all their data as part of the upgrade to the new release.
So, now we have the context to start answering your original question. We started what would become SQL Server 7.0 with two strikes against us: it was really going to be a V1 product with a V7.0 name, and we would require every customer to completely unload and reload their data to migrate to the new product. We knew that poor quality would mean strike three. Since we had a number of people who had experience building enterprise database systems, we weren’t lacking in design knowledge, so the real success factors for the release came down to doing a great job of architecture, implementation, and validation. On the validation front we did a number of interesting things that could probably fill a book, but I’ll highlight a couple of them here.
Given that we were going to make everyone migrate their data, we knew that we needed to make the data migration process both highly performant and rock solid. We engaged our sales force to ask our customers to help us by sharing their databases so we could convert them to the new format in the lab. We called this the 1,000 DB challenge: we wanted to get 1,000 real customer databases of all different sizes and complexities so we could run them through a database conversion lab that we created. The other interesting thing we did in this space was to write a playback system. This consisted of a capture utility that would log the customer activity on a production database and a playback utility that would allow us to play back the actual customer workload against the customer database in our lab. We could play the workload back in “real time”, which included the dwell and think time between queries, or “compressed”, where we just jammed the queries into the server as fast as we could. In this mode we could stuff a day’s worth of work into the system in an hour or so and really stress the server. We’d ask customers to take a backup of the database that corresponded to the workload, capture the actual work using the capture utility, and then send us the database and capture log. This data allowed us to test the conversion of the database from the old version of SQL Server to the new version, and it also let us test the new version of SQL Server by replaying the actual customer workload on the new system. Later on we started to compare query performance of the replay on the old vs. new versions of the product to find performance bugs.
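To make the playback idea concrete, here is a minimal Python sketch of such a harness. This is not the actual Microsoft utility: the capture-log format (a list of time offsets and SQL statements) is assumed for illustration, and SQLite stands in for the target server.

```python
import sqlite3
import time

def replay(capture_log, conn, mode="real_time"):
    """Replay a captured workload against a database connection.

    capture_log: list of (offset_seconds, sql_text) pairs, ordered by
    the time each statement was issued on the production system.
    mode: "real_time" preserves the original dwell/think time between
    statements; "compressed" issues them back-to-back to stress the server.
    """
    start = time.monotonic()
    for offset, sql in capture_log:
        if mode == "real_time":
            # Sleep until this statement's original offset from the
            # start of the capture, preserving think time.
            delay = offset - (time.monotonic() - start)
            if delay > 0:
                time.sleep(delay)
        conn.execute(sql)

# Tiny demonstration against an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, qty INTEGER)")
log = [(0.0, "INSERT INTO orders VALUES (1, 10)"),
       (0.5, "INSERT INTO orders VALUES (2, 20)"),
       (1.0, "SELECT COUNT(*) FROM orders")]
replay(log, conn, mode="compressed")
```

Timing the same replay in both modes against the old and new engines is what later made this a performance-regression tool as well as a correctness tool.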
The next interesting thing we did on the validation front came from Don Slutz, a long-time database veteran who was working in Microsoft Research at the time. He wrote a program he called RAGS that was really a model-based testing system: it used the SQL language grammar and the schema of an existing database to generate bizarre, but syntactically legal, SQL statements and feed them into the query processor. Basically, he married the state domain of an existing database schema with the one specified by the SQL grammar and was able to probe all the dark corners of the search space programmatically. He wrote an MSR technical report that is available on the Microsoft Research web site. The way this played out was pretty interesting. At first it was pretty easy for Don to crash the query processor. So he filed a bunch of bugs, the developers fixed a bunch of bugs, then Don ran it again, and so on. After a couple of iterations of this cycle, RAGS needed to generate some pretty ugly queries to crash things. You’d wind up with these five-page SQL queries that looked like random gibberish but were legal SQL syntax, and the query processing team would spend a bunch of time figuring out what the query was supposed to do and then figure out why it crashed the system. Later on, Don used RAGS against different versions of SQL Server and other database products, generating queries over equivalent schemas and comparing both the results and the performance of the different products across a wide range of complex queries.
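The flavor of this kind of grammar-driven testing is easy to sketch, though the real RAGS was far more sophisticated. Below is a hypothetical Python toy (not Don Slutz’s program) that marries a small invented schema with a fragment of the SQL expression grammar, generates random-but-legal queries, and fires them at an engine, with SQLite standing in for the query processor under test.

```python
import random
import sqlite3

# A toy schema; RAGS used the schema of a real existing database.
SCHEMA = {"t1": ["a", "b"], "t2": ["x", "y"]}

def gen_expr(table, depth=0):
    """Randomly expand a tiny slice of the SQL expression grammar."""
    col = random.choice(SCHEMA[table])
    if depth > 2 or random.random() < 0.4:
        return random.choice([col, str(random.randint(-5, 5))])
    op = random.choice(["+", "-", "*"])
    return f"({gen_expr(table, depth + 1)} {op} {gen_expr(table, depth + 1)})"

def gen_query():
    table = random.choice(list(SCHEMA))
    return (f"SELECT {gen_expr(table)} FROM {table} "
            f"WHERE {gen_expr(table)} > {gen_expr(table)}")

conn = sqlite3.connect(":memory:")
for t, cols in SCHEMA.items():
    conn.execute(f"CREATE TABLE {t} ({', '.join(c + ' INTEGER' for c in cols)})")

for _ in range(100):
    sql = gen_query()
    try:
        conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        # Anything other than a clean result or a well-formed error
        # message is a bug worth filing.
        print(f"engine rejected: {sql!r} -> {exc}")
```

Deepening the recursion and widening the grammar is exactly what turns these one-liners into the five-page monsters David describes.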
I could list more if you guys want to write a book…
Sean: It’s hard to imagine a product that has higher security requirements than a database server. It has to talk on the network, and it has to be able to store sensitive information. Since the introduction of the SDL, SQL Server has seen a dramatic reduction in vulnerabilities. How does the SDL play out on a day-to-day basis? How does it affect your architecture?
David: Great question! To fully appreciate the change you have to understand that 10 to 20 years ago there was very little widespread security knowledge in the software development world. In some sense the environment didn’t require it; most systems weren’t interconnected and remotely accessible. This meant that security breaches typically required physical breaches, and people had a model in their heads for physical security. It was easier to understand locks on doors than buffer overruns. What is interesting about the SDL and SQL Server’s evolution is that we’ve gone from a period in which security review and validation were done mostly after the coding was finished, to today’s world where we formally design security threat models as part of the design process before writing code. We also have a great training program in place that everyone touching the code needs to take, along with refreshers that developers must take for emerging threats. Additionally, we’ve developed a great set of static code analysis and run-time tools to avoid and detect potential security issues. In 2002 we had a “security push” where the entire development team stood down and reviewed every line of code in the product. We did a smaller push in 2004 for SQL Server 2005, and with our current release we have truly integrated security best practices into the development process; we don’t need a separate security push because security is simply part of our day-to-day process. One challenge in the security space is that the threats are constantly evolving, so we can’t rest: as long as the bad guys are learning new tricks, we need to up our game and have a process in place for rapid mitigation in the event of a new threat or vulnerability.
In terms of how security has affected our architecture, we are much more mindful of the threat environment each line of code executes in. For example, there is a small amount of code that performs the initial client authorization before allowing a connection into the server. Since this code runs pre-authorization, any remote code that can access the port over the network can execute it. Architecturally, we strive to keep this code to a minimum, and it is very thoroughly reviewed. Similarly, the security base is designed in a layered fashion, so you have smallish amounts of thoroughly reviewed code providing services that other aspects of the security system are built upon.
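As an illustration of that principle (emphatically not SQL Server’s actual login code), a hypothetical pre-authorization handler might look like this Python sketch: a hard cap on input size, strict bounds checks on every length field, and nothing else happening until the caller is authenticated.

```python
import struct

MAX_LOGIN_PACKET = 512  # hypothetical cap: reject oversized input pre-auth

def parse_login(packet: bytes) -> str:
    """Minimal pre-authorization parsing of a made-up login packet:
    a 2-byte version, a 2-byte name length, then the login name.
    Every field is bounds-checked because this code is reachable by
    anyone who can send bytes to the port."""
    if not 4 <= len(packet) <= MAX_LOGIN_PACKET:
        raise ValueError("malformed login packet")
    version, name_len = struct.unpack_from("<HH", packet, 0)
    if 4 + name_len > len(packet):
        # A lying length field is the classic buffer-overrun vector.
        raise ValueError("length field exceeds packet size")
    return packet[4:4 + name_len].decode("ascii")

# Only after parse_login succeeds would authentication, and then the
# much larger post-auth surface area, come into play.
print(parse_login(struct.pack("<HH", 1, 5) + b"admin"))
```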
Sean: How would you respond to a statement like, “If you really wanted to make SQL secure, you’d forget about the SDL and just open-source it so that many eyeballs could look at the code.”
David: I think there is some benefit to having more eyes on a piece of code. There is benefit from each person’s fresh perspective, benefit from varied knowledge, etc. However, eyeballs alone are not nearly enough. The SDL process evolution has been really interesting in that we have changed the culture, habits, and processes of every Microsoft developer in a way that is much more effective than having many otherwise competent programmers looking at the code. With the SDL we have many eyes looking at threat models, many looking at the design of a security feature and, ultimately, many looking at the code, following a pretty rigorous and proven process. I think there’s sufficient independent, objective evidence to say that the SDL is working for us. If you have a few minutes, go to the National Vulnerability Database at nist.gov and check out the vulnerability reports for SQL Server vs. other databases. I’ll warn you: if you try to search for Oracle flaws they are hard to find, since every major and minor release of the Oracle database server is listed separately, so you need to do a little aggregation to get the real picture.
Sean: It’s one thing to design powerful functionality; it’s another to make it easy to use. Talk about how Microsoft ensures not just functionality, but usability.
David: OK. Now you’ve hit on one of my personal passions. I believe that many technologies go through stages of evolution as they mature. I define three major phases – nascent, developing, and refined. There are many examples that can serve as a lesson for software developers.
Consider televisions: thirty to forty years ago, when TVs were a nascent technology, you almost needed to be a technician to own one. Certainly, you needed to know how to take the back off and pull the tubes out so you could go to Radio Shack and test them when one went bad. Furthermore, there were knobs on televisions that existed solely because the technology hadn’t matured enough to make them unnecessary. I often ask audiences how many people miss the horizontal and vertical control knobs on their TV, and we’re getting to the point where many in the audience don’t know that these knobs even existed to keep the picture from rolling and waving on early TVs.
Twenty years ago, TVs entered the developing phase as they became fully solid state. They were much more reliable, and the technology developed to the point where many of the knobs disappeared. In this phase they were good enough for the masses: you didn’t need to be a technician to own one, but it was nice to have one in the neighborhood. Frankly, I sort of think this is where PCs are today.
Today’s TVs are refined in that the user’s control surface captures the user’s intent rather than exposing the control surface of the underlying technology. For example, instead of fiddling directly with color temperature, saturation and hue to adjust the picture, my TV has a control that asks me if I want to watch sports, movies, or regular programming and adjusts the color parameters accordingly. Furthermore, some TVs are aware of the operating environment such as ambient light and compensate for that. You can do this same thought experiment on automobiles, microwave ovens, etc. When viewed through this perspective, most system software still has a long way to go to become a refined technology. Of course, there are cases where these maturity phases ripple and repeat through a single technology where advances happen in waves. I think my new Smartphone is an example of that. It’s way more capable than my previous cell phone but I never had to reboot my earlier one.
So, how are we doing on SQL Server? One simple example is the work we did in the database engine in SQL Server 7.0, when we got rid of many of the knobs by making the system self-tuning. SQL Server 6.5 had roughly 100 knobs, whereas SQL Server 7.0, which was much more complex in many ways, had roughly 20. Many people thought that more knobs meant more control but, in reality, we found many systems performing poorly in the field due to misconfiguration. We classified the knobs and identified those that should simply take care of themselves: things like the number of locks the server could allocate when it booted, the number of hash buckets in the cache manager, etc. These were our “horizontal and vertical control knobs”. Other knobs we set up so that the database server managed them by default, but an administrator could impose constraints if he wanted to. The amount of memory allocated to the server was in this category; by default, SQL Server manages memory in cooperation with the demands of the operating system, but you can set low and high watermarks if you want to.

In other areas we used control theory and feedback loops to have the system adapt dynamically to the environment and control things based upon instantaneous system response. The adaptive systems work we did was really interesting because the existing control knobs were often static; a good example is “sort memory”. In many database systems before SQL Server 7.0, you set aside a portion of memory for sorting, used to perform queries, build indexes, etc. During times when you weren’t building indexes or didn’t have queries that required a sort, that reserved memory was just lying fallow and wasted. Further, if you had a sort that needed a little more space than you had reserved, it would spill to disk because the sort wouldn’t fit in the reserved space, even if there was plenty of memory unused elsewhere in the system. In SQL Server 7.0 we created an internal memory broker that could use server memory for whatever purpose made the most sense over time, so things like the procedure cache, workspace memory, sort memory, and the buffer pool all cooperated to use memory in the most efficient manner. The result was fewer knobs and a more efficient system.
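A toy model of that memory-broker idea might look like the Python sketch below. The names and numbers are illustrative assumptions, not SQL Server’s internals: consumers borrow from one shared pool bounded by optional low/high watermarks instead of each sitting on a static reservation.

```python
class MemoryBroker:
    """Toy model of an internal memory broker: consumers (buffer pool,
    sort workspace, procedure cache, ...) borrow from one shared pool
    instead of each holding a static reservation that lies fallow when
    idle."""

    def __init__(self, high_mb: int, low_mb: int = 0):
        self.high = high_mb   # never grow past this watermark
        self.low = low_mb     # never shrink below this (not exercised in this toy)
        self.used = 0
        self.grants: dict[str, int] = {}

    def request(self, consumer: str, mb: int) -> int:
        # Grant what fits under the high watermark; the caller spills to
        # disk for any shortfall. A real broker would also pressure
        # low-value consumers to shrink before refusing memory.
        granted = min(mb, self.high - self.used)
        self.grants[consumer] = self.grants.get(consumer, 0) + granted
        self.used += granted
        return granted

    def release(self, consumer: str, mb: int) -> None:
        mb = min(mb, self.grants.get(consumer, 0))
        self.grants[consumer] -= mb
        self.used -= mb

broker = MemoryBroker(high_mb=4096, low_mb=512)
got = broker.request("sort", 1024)   # a sort borrows memory only while running
broker.release("sort", got)          # ...then returns it, e.g. for the buffer pool
```

The point of the design is in the last two lines: sort memory exists only while a sort is running, so nothing lies fallow between queries.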
This was great work, but we ran into a situation where we were ahead of the market in many respects. DBAs were worried we were going to put them out of a job. They thought they were getting paid, in some part, to respond to their pager at 2:00 AM when a big batch job failed because they hadn’t allocated enough lock blocks for the server. Our competitors also used this maturation against us with lines like, “How can SQL Server be a real enterprise database system? It only has 20 knobs and our system has 500!” What’s interesting is that if you look at the major database systems today, they have all made major investments in self-tuning and ease of use.
Market dynamics also demand that we make these systems easier to use and self-managing. As an industry, we’ve expanded database deployment from an installed base that was likely measured in the small 100,000s of units 20 years ago to one that is likely measured in the 100,000,000s of units today. They had better be easier to use than they were 20 years ago!
Scott: How do customer needs and requirements make it into the planning process? How do you handle situations where the customer is asking for the wrong feature? (The customer asks for a setting so they can tune X, and you realize that if a subsystem were redesigned, they wouldn’t need to tune X.)
David: We have had many situations where customers have asked for features that they have seen in other systems that didn’t make sense for SQL Server. Often customers ask for performance features that provide a large advantage in other products but, given the way SQL Server is architected, would provide little to no benefit on our architecture. One example is raw device and partitioning support. Twenty years ago, many UNIX systems didn’t have advanced I/O features such as the ability to bypass the file cache, scatter/gather I/O, or good asynchronous I/O. In fact, many early UNIX file systems had 32-bit file offsets, so an individual database file could only be 2 or 4 GB in size. NTFS, in contrast, supported 64-bit file addressing, great asynchronous I/O, and the ability to do unbuffered I/O from the start. So, whereas other products needed separate partitions over multiple files, with an I/O thread per file to simulate asynchronous I/O, SQL Server didn’t need any of this. This didn’t prevent customers from asking, though. Of course, once you get up to very large databases, partitioning makes sense for a number of reasons, such as the ability to physically manage large tables and indexes in smaller pieces, but the point is that SQL Server didn’t need partitioning to get I/O parallelism in the way some other systems did. We had to educate many customers on these points so they understood how SQL Server achieved the same performance as other systems, but through a different architecture.
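For illustration, here is a small Python sketch of the workaround David describes: one blocking reader thread per partition file, simulating asynchronous I/O. The file layout is invented for the example; an engine with true asynchronous, unbuffered I/O gets the same parallelism against a single large file without any of this machinery.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Invent a few small files standing in for database partitions.
paths = []
for _ in range(4):
    fd, path = tempfile.mkstemp(suffix=".dbf")
    os.write(fd, bytes(64 * 1024))
    os.close(fd)
    paths.append(path)

def read_partition(path: str) -> int:
    # A plain synchronous read; parallelism comes only from having one
    # thread per file, which is why those systems needed the data
    # spread over many partitions in the first place.
    with open(path, "rb") as f:
        return len(f.read())

# One I/O thread per partition file: the simulated-async pattern.
with ThreadPoolExecutor(max_workers=len(paths)) as pool:
    print(sum(pool.map(read_partition, paths)), "bytes read in parallel")

for path in paths:
    os.remove(path)
```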
I’m also mindful that control doesn’t always represent progress, and often simplicity is the best approach, especially when it affords an opportunity for the software to do a better job. One good example involves a feature known as “tempdb in RAM”. Prior to SQL Server 7.0, you could allocate a region of RAM to hold the temporary database, which is used for scratch tables and intermediate query results. In certain environments, placing tempdb in RAM could provide a significant benefit given the way SQL Server 6.x was architected. Unfortunately, it was easy to under- or over-allocate the amount of memory and wind up spilling to disk or paging excessively. Another problem was that the amount of RAM allocated to tempdb was statically determined; once set, you needed to reboot the server to change it. Since the amount of space needed for tempdb varies with the workload, the optimal caching strategy for tempdb is dynamic. We removed “tempdb in RAM” in SQL Server 7.0 and did a number of other optimizations under the covers to better manage tempdb pages in memory. The actual customer result was much, much better across a wide range of scenarios for SQL Server 7.0, but customers who had improved their performance 2-3x by using the tempdb-in-RAM feature in SQL Server 6.5 screamed loudly. I finally wrote a long mail and included some experimental results that proved the new approach was better, but I still received hate mail for several years after that decision.
Scott: Areas like the developer division have strived for greater transparency. In open source, all development and decisions are transparent. Talk about how SQL Server views transparency during development, and how you balance the expectations of customers who want full transparency against not shooting yourself in the foot by disclosing early and giving a closed source competitor like Oracle an advantage.
David: You touch upon a very real challenge. SQL Server has matured to the point where we do a very good job on the fundamentals and, as a result, it’s more important that we listen to and work with our customers to continue to produce a product that helps them better run their business. We’ve talked mostly about the core relational database engine thus far but today’s SQL Server includes data analysis, data mining, reporting, enterprise class ETL, etc. The solutions we deliver in this space touch a broader range of customers and the pace of innovation is much faster than that seen in the core relational database engine. As a result, we need to be much more in tune with our customers to produce the right product. We’re making some real progress on including key customers and experts in our design process. For some of our more complex SQL Server 2008 improvements we’ve done joint design with key customers and MVPs and they’ve helped us make many of the tough scenario and design tradeoffs required to deliver the right feature. Done well, this reduces the need for iterative field testing to get solid feedback on new features.
Disclosing early is a real risk and, yes, our competitors do listen and respond. It’s funny, they pay much more attention to what we say now than they did 10 years ago.
Sean: Do you have any experience with open source databases yourself?
David: Yes, I have experience with open source, and I certainly follow the key open source databases from the perspective of technology evolution and adoption. I don’t look at any source code from the open source databases, though, to avoid any potential IP issues.
Sean: What do you think are some things that are easy to accomplish in a closed source model that would be challenging in open source?
David: I think one thing is something I would call “consistency in the large”. This isn’t necessarily easy in the closed source world either, but in a coordinated engineering environment you can align things in a way that leads to a degree of consistency that creates customer value. For example, the products within Microsoft’s Server and Tools business have a set of “Common Engineering Criteria” (CEC) that we all follow. The criteria include things such as common processes and UE content that lead to a more uniform customer experience, and aligned features, such as shipping a Best Practices Analyzer and a management pack for our management tools when we RTM. This sort of broad consistency doesn’t happen organically and is one of the challenges that large-scale open source efforts have to wrestle with.
Sean: Have ideas from the open source databases influenced SQL Server’s feature set?
David: Certainly, and it goes both ways. Things like persisted views have been done by all the major database players. I can’t recall whether Oracle or IBM did it first, but now SQL Server, Oracle, and DB2 all have a form of persisted views with varying degrees of updatability and query-matching sophistication. Oracle was the last major database player to ship a credible, fully cost-based optimizer. I would say that SQL Server led in the advancement of real self-tuning and ease of use. Unfortunately, we did it in 1998 and tried to sell it to a market that didn’t understand its value; IBM did a great thing (for IBM) later on by coining the phrase “autonomic computing” and selling the value. I don’t think MySQL has contributed too much yet to the big three, but I do like MySQL’s notion of installable storage engines, with implementations optimized for various scenarios. What’s interesting is that SQL Server 7.0 was architected to support this concept, with the OLE-DB interface between the relational and storage engines, but MySQL has really gotten value from the concept.
Sean: If you had the opportunity to borrow more from the development model of the open source community when it comes to SQL Server what are the top couple of things you would like to bring over to SQL Server development that you see out there today in the Open Source community?
David: In general I like the notion of tapping into a large development community’s collective energy. Different people have different passions and, if you can harness the energy effectively, it can lead to some interesting results. For example, if someone gets excited about adding a specific feature or fixing a particular bug, they can often just make it happen. Of course, one challenge in the open source world is maintaining architectural and feature consistency. Smart people don’t always agree on what’s important for the customer, or on how a particular feature should be implemented in a way that maintains the system’s design tenets. I think a key aspect of great design is providing the most value with the simplest and most intuitive implementation and user model. The open source project maintainers have a very tough job keeping this in check.
Another powerful aspect of the open source world is the ability to generate an ecosystem of tools and add-ons around a core technology. This is an ecosystem effect that may or may not require open source access to the core technology. For example, imagine if MySQL were a closed source project but had a vibrant open source community building installable storage engines for various scenarios. We are starting to do some of this with SQL Server through CodePlex.
Sean: How much do you think SQL Server’s development has been impacted by Open Source development efforts in terms of how they build community around various database products?
David: Enabling the community to help the community, and giving the user base a direct voice to the development team, is one of the most exciting and powerful concepts to come out of the open source development model. We have learned a lot from this, and in our product review meetings we regularly review input from our community, including how they vote on design change requests. Rather than us guessing or asking a small sample, we can now reach out to our community quickly and efficiently. Frankly, the Internet and access to our community have enabled us to produce a much better product. Things such as Watson, SQM, and our community product provide us direct feedback that allows us to respond in ways we couldn’t even imagine 10 years ago.
Scott: Give us some understanding of the structure of the SQL Server development team: some high-level estimates of the number of testers and developers, the number of machines in the test lab, the length of time it takes test suites to run, etc. Just so folks can put the SQL Server development effort in perspective relative to other efforts they are familiar with.
David: I don’t have exact numbers in my head, but let’s just say we have hundreds of developers, hundreds of testers, and probably 10,000 machines in various test labs. We literally have millions of individual test cases, and the test system is highly automated. Many of our test machines are in offsite data centers, and we’re able to connect to and control them via IP KVM. We have evolved our test methodology to include much more model-based testing rather than individual test case generation. One interesting thing is that we completely revamped our development methodology between SQL Server 2005 and the current release. This was a huge cultural and process change, and we have learned a lot from the experience. Perhaps we can chat a bit about this next time.