Disclaimer

  • The postings on this site are my own and don’t necessarily represent Microsoft's positions, strategies or opinions.


    February 24, 2009

    Seeding The Clouds

    Since I joined Microsoft in late 2007, I have written about science policy, Federal government interactions, and national competitiveness studies, in my role as a member of PCAST and chair of the Computing Research Association (CRA). Throughout, I have emphasized the need for strategic investment in long-term, basic research, especially as part of the economic stimulus package.

    I have also discussed the rise of multicore computing, the consequent software crisis and the need for innovation in both architecture and software, including Microsoft's support for the Microsoft/Intel-funded Universal Parallel Computing Research Centers (UPCRCs) at Illinois and UC-Berkeley. I have also mused on the future of high-performance computing and its role as an enabler of scientific discovery. I have even written about my family, my rural childhood and my life experiences.

    What I have not done is write about why I came to Microsoft and what I am doing – until now. Yes, my team manages the UPCRCs in partnership with Intel. Yes, I devote time and energy to research policy, both for the community and on behalf of Microsoft. Yes, I am involved in the future of high-performance computing, both politically and technically. However, that's not the entire story.

    It's time to talk infrastructure so large it makes petascale systems seem small. It's time to talk about why I can't remember the last time I had this much fun. It's time to pull back the curtain and talk about the future of clouds. No, I'm not talking about weather forecasting, though I really enjoyed my past collaboration with the LEAD partnership.

    I came to Microsoft to lead a new research initiative in cloud computing, one that complements our production data center infrastructure and our nascent Azure cloud software platform. You can read the press release and the web site for the official story. What follows is my personal perspective.

    The Infrastructure of Our Lives

    We all know the cloud premise – Internet delivery of software and services to distributed clients, from mobile devices to desktops. We tend not to think about how dependent we now are on those delivered services, though we are, just as we depend on the telephone and our water and electrical utilities.

    Imagine a day without the web, without search engines, without social networks, without online games, without electronic commerce, without streaming audio and video. Our world has changed, and government, business, education, recreation and social interaction are now critically dependent on reliable Internet services and the hardware and software infrastructure behind them. However, more research and technology evaluation are needed to make them as trustworthy as the telephone network.

    Building Internet services infrastructure using standard, off-the-shelf technology made sense during the 1990s Internet boom. (And yes, I remember how cool Mosaic was, when I first saw it at Illinois.) The facilities were small by today's standards, and the infrastructure could be deployed quickly. Today, however, the scale is vastly larger, our social and economic dependence is much greater and the consequences of failure are profound. Web service outages are now international news, and a cyberattack is considered an act of war.

    For background on some of the challenges and problems in scaling, you might want to follow the Data Center Knowledge and High Scalability web sites. If you are new to this space, they and other reading will redefine your notions of large and reliable. You might not think 100 megawatts could be a data center design constraint, but it is. More importantly, you should fear (yea, verily, be absolutely terrified by) the wrath of 100 million unhappy customers should your Internet service fail. Every nightmare that has ever awakened a CIO in a cold sweat at 2am is real, but magnified a thousandfold. If it were easy, though, it would be neither exciting nor fun.

    Cloud Infrastructure Challenges

    Microsoft's business, like that of other cloud service providers such as Amazon, Google and Yahoo, depends on an ever-expanding network of massive data centers: hundreds of thousands of servers, many, many petabytes of data, hundreds of megawatts of power, and billions of dollars in capital and operational expenses. This enormous scale, far larger than even the largest high-performance computing facilities, brings new design, deployment and management challenges, including energy efficiency, rapid deployment, resilience, geo-distribution, composability, and graceful recovery.

    I have been a "big iron" guy for a long time, and Internet and cloud services infrastructures do have analogs with petascale and exascale computing, but the workloads and optimization axes are different. Like today's HPC systems, cloud computing facilities are being built with hardware and software technologies not originally designed for deployment at such massive scale. Consequently, they are less efficient and less flexible than they either can or should be. If we built utility power plants the same way we build cloud infrastructure, we would start by visiting The Home Depot and buying millions of gasoline-powered generators. This must change.

    Imagine a world where heterogeneous multicore processors are designed and optimized for diverse workloads, where solid state storage changes our historical notions of latency and bandwidth, where on-chip optics, system interconnects and LAN/WAN networking simplify data movement, where scalable systems are resilient to component failures, where programming abstractions facilitate functional dispersion across devices and facilities, and where new applications are developed more quickly and efficiently. This can be.

    Cloud Computing Futures

    Over the past fourteen months, I have been quietly building the Cloud Computing Futures (CCF) team, starting with a key concept. We must treat cloud service infrastructure as an integrated system—a holistic entity—and optimize all aspects of hardware and software. I have recruited hardware and software researchers, software developers and industry partners to pursue this vision. It's been a blast.

    The CCF agenda spans next-generation storage devices and memories, new processors and processor architectures, networks, system packaging, programming models and software tools. We are a research and technology transfer team, whose roles are to explore radical new alternatives – "blank sheet of paper" approaches to cloud hardware and software infrastructure – and to drive those ideas into implementation and practice.

    Effective research in this space requires changes to both hardware and software, and the resulting prototypes must be constructed and tested at a scale difficult for small teams. This type of research and technology transfer is difficult in academia, because the efforts often cross many research disciplines.

    For this reason, the CCF team is taking an integrated approach, drawing insights and lessons from Microsoft's production services and data center operations, and partnering with researchers, vendors and product teams worldwide. Our work builds on technical partnerships and collaborations across Microsoft, including Microsoft Research, Debra Chrapaty's Global Foundation Services (GFS) data center construction, operations and delivery team, and Ray Ozzie's Azure cloud services group. We are also partnering with an array of hardware-technology providers and companies as we build prototypes.

    Now You Know

    For me, CCF has been an opportunity to apply research experiences and ideas gleaned over the past twenty-five years of my academic career. Equally important, it is a chance to build prototypes at scale to test those ideas, and then help drive the promising technologies into practice. The past year has been great fun, and I have been privileged to attract some wonderful people to this adventure, including Jim Larus and Dennis Gannon, and to partner with them.

    Now you know why I came to Microsoft. It was a chance to practice what I've been preaching. It was a chance to help design the biggest of big iron. It was a chance to help invent the future. It's a pretty cool gig for a balding old geezer like me!

    February 08, 2009

    A Few Thoughts on the Stimulus Package

    The political maneuvering and theater are well underway as the U.S. Congress debates the merits of various proposals to stimulate the economy. The U.S. House of Representatives has passed H.R. 1, the American Recovery and Reinvestment Act of 2009, and the Nelson/Collins (Senators Ben Nelson and Susan Collins) adjustments to S. 336 will likely come to the floor of the U.S. Senate for a vote in a few days. If the modified bill is approved by the Senate, we will await the negotiations that follow in conference.

    Support for scientific research is a small fraction of the stimulus plan, and the House and Senate plans differ in some marked ways. ASTRA has a handy comparison of the two proposals with respect to research investment.

    If you haven't seen legislative sausage made before, it is important to understand the process. After each chamber of Congress passes its version of a bill, a conference committee reconciles the differences, and the compromise must then be approved (again) by both chambers. It is a competitive and often messy rugby scrum. Hence, we do not yet know what may emerge in support of scientific research and development.

    Steve Ballmer on Science

    Microsoft's CEO, Steve Ballmer, recently spoke to the U.S. House Democratic Caucus Retreat. Although you can read the complete speech, I would like to highlight a few excerpts that emphasize Microsoft's strong support for innovation and the importance of continued investment in basic research. In his speech, Steve noted

    … America really has to return to growth that's built on innovation and productivity, rather than leverage and private debt.  That must happen.

    He went on to say,

    We need to pursue breakthroughs over the coming years in green technology, alternative energy, bioengineering, parallel computing, quantum computing.  Without greater government investment in the basic research, there is a danger that important advances will happen in other countries.  This is truly I think not only an issue of competitiveness, but also in a sense of national security.  Companies like ours and others can do our fair share in terms of funding of basic research, but government needs to take the lead.

    I could not agree more.

    Microsoft Policy Blog

    On the subject of Microsoft and policy, the company recently launched a policy blog (Microsoft on the Issues), including support for research. A few weeks ago, I penned an entry for the Microsoft policy blog on the federal stimulus plan and scientific innovation. In addition to noting the critical importance of innovation to fuel the economy, I observed that we should treat the current crisis and any new research funds as an opportunity to rethink the way we approach university research and public/private partnerships:

    Beyond critically needed funding, the bill gives government, academia and industry a chance to rethink research partnerships and policies in ways that will harness the benefits of scientific innovation for the good of the entire nation.   …

    We now have the opportunity to further streamline our nation's research infrastructure, particularly in U.S. research universities.  …

    By rethinking public-private sector partnerships, and refining processes for acquiring and deploying information technology, we can increase research efficiency and catalyze new discoveries while reducing costs for both universities and the federal government.

    The potential influx of research funds from the stimulus package creates a great opportunity for research innovation. However, these are perilous times, and we should not (by default) assume that "business as usual" is the best approach to accelerating research. It may indeed be the best approach, but we should face the issues squarely and thoughtfully.

    What is the best way to apply information technology to science and engineering research? How can we best advance computing research itself? How can we retain our research strengths while also addressing the rising cost of higher education? What can we learn from new and effective approaches elsewhere? How can we continue to compete effectively and efficiently? As Spiderman says, "With great power, comes great responsibility."

    As always, I welcome your thoughts and ideas.

    November 23, 2008

    Reflections on SC08

    OK, I admit it: what I said in my previous post was wrong. There was singing at SC08. The conference included both a music room where attendees could perform and a music booth where one could lip sync to classic hits. Beyond singing, the conference broke all previous attendance records, with roughly 11,000 attendees, though I doubt singing had anything to do with that!

    Clouds and Accelerators

    "Cloud" was undoubtedly the buzz word of the conference. Like the word Grid in the past, cloud is now a tabula rasa on which research groups and companies are projecting their own definitions and spins. Somewhere, there's a Dennis Milleresque cultural reference lurking that invokes either Joni Mitchell

    I've looked at clouds from both sides now,
    From up and down, and still somehow
    It's cloud illusions I recall.
    I really don't know clouds at all.

    or The Rolling Stones

    I said, Hey! You! Get off of my cloud
    Hey! You! Get off of my cloud
    Hey! You! Get off of my cloud
    Don't hang around 'cause two's a crowd
    On my cloud, baby

    In either case, I'm too tired to emit such a pithy aphorism.

    On the hardware front, accelerators, notably GPUs, and solid state storage (SSDs) dominated the exhibit floor. NVIDIA was highly visible, and vendors large and small were demonstrating software tools for accelerator programming and for SSDs.

    Microsoft News

    Microsoft broke into the top ten of the Top500 list of the world's fastest machines, based on execution of the high-performance Linpack (HPL) benchmark atop Windows HPC Server 2008. Like all Top500 runs, this required long hours by a dedicated team of people who pushed the hardware and themselves to the absolute limit. Everyone who has done this (and I remember it well from my NCSA days) knows that it is a caffeine- and adrenaline-fueled, sleep-deprived process, wherever you happen to be.

    I was also pleased that HPCWire awarded its Editor's Choice Award for best industry/government collaboration to the Microsoft/Intel Universal Parallel Computing Research Center (UPCRC) program, which involves the University of Illinois at Urbana-Champaign and UC-Berkeley. Andrew Chien (Intel) and I are responsible for coordinating this program across the two companies and two universities.

    Top500 Perils

    First, one of the increasing challenges for HPL and the Top500 is the time required to complete the benchmark run. Given the scale of today's systems, regardless of hardware/software stack, the mean time between failures (MTBF) of these systems is roughly equal to the time to complete the HPL run. This alone makes benchmarking a rather stressful business.
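To make the MTBF point concrete, here is a minimal sketch that assumes failures arrive independently at a constant rate (an exponential, memoryless model, which is a simplifying assumption rather than a property of any real machine). Under that model, a run as long as the MTBF finishes without a failure only e^-1, about 37 percent, of the time; if every failure forces a restart from scratch, the expected time to finish grows to MTBF * (e^(T/MTBF) - 1). The function names and the 24-hour MTBF are hypothetical, chosen only for illustration.

```python
import math

def completion_probability(run_hours: float, mtbf_hours: float) -> float:
    """Chance that a run of run_hours sees no failure, assuming failures
    form a Poisson process with mean time between failures mtbf_hours."""
    return math.exp(-run_hours / mtbf_hours)

def expected_wall_time(run_hours: float, mtbf_hours: float) -> float:
    """Expected wall-clock hours to finish when each failure forces a restart
    from scratch (no checkpoints, zero restart overhead), under the same
    exponential model: MTBF * (exp(run / MTBF) - 1)."""
    return mtbf_hours * math.expm1(run_hours / mtbf_hours)

if __name__ == "__main__":
    mtbf = 24.0  # hypothetical system MTBF, in hours
    for run in (6.0, 12.0, 24.0):
        p = completion_probability(run, mtbf)
        t = expected_wall_time(run, mtbf)
        print(f"run {run:4.1f} h: P(no failure) = {p:.2f}, expected time = {t:.1f} h")
```

Under these assumptions, a 24-hour run on a system with a 24-hour MTBF succeeds on a given attempt only about 37 percent of the time and takes roughly 41 hours on average, which is why checkpointing and luck both matter so much on benchmark day.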

    Beyond that, at the time of benchmark runs, the hardware is normally very new, and component infant mortality is still common. Finally, one generally has only a single window to secure the highest position on the list, because new and even larger systems appear regularly. If you miss your target of opportunity for the June or November ranking, your system will slip several positions on the list.

    Maybe I am unable to generate a pithy aphorism for clouds, but I will close with an allusion to Conrad's Heart of Darkness. Considering the challenges of multicore, exascale, multidisciplinary application software and reliability, one is inclined to remark, "The horror, the horror." We have serious work ahead.

    October 28, 2008

    Beyond The Azure Blue

    From the first day I arrived at Microsoft, my academic colleagues have been asking me about Microsoft's strategy for cloud computing and when (or if) there would be public announcements. Those questions rose to a crescendo as academic groups prepared responses to the NSF eXtreme Digital (XD) TeraGrid solicitation. All I could say was that we were working on a plan, and it would become clear soon.

    I don't normally pitch Microsoft products in the blog, preferring to discuss science policy, technology research and development and global competitiveness. However, something big just happened at Microsoft, something I think will affect all of us. Moreover, as I write this, the Pacific Northwest sky is clear and azure blue, and that doesn't happen often this time of year. An omen, perhaps?

    Microsoft Azure Cloud Services

    At our Professional Developers Conference (PDC), Microsoft announced Azure, our cloud computing platform, with on-demand compute and storage to host, scale and manage Internet or cloud applications. The press release has additional business perspective and a link to the presentation. Azure is one element of the vision Ray Ozzie (See "Mind to Mind: Building Innovation") described in his 2005 Internet Services Disruption memorandum.

    The simplest description of Azure is that the initial release allows you to develop hosted Windows applications using .NET Services, though future releases will support unmanaged code and open source tools as well (Eclipse, Ruby, PHP, and Python). Within Azure, a fabric controller manages application instances and access to storage via SQL Data Services (SDS), and it hosts applications atop virtualized multicore hardware. Finally, Microsoft's Live Services offerings will be layered atop the Azure framework.

    You can read the white paper for details on the Azure design and usage approach. The software development kit (SDK) is also available for download. Besides the Azure SDK itself, there are SDKs for Visual Studio, .NET and SDS Services, as well as Java and Ruby SDKs for .NET Services. This is a Community Technology Preview (CTP), meaning Microsoft welcomes feedback on these early capabilities and will continue to expand the capabilities of Azure over the coming months.

    Science and Technology Implications

    Earlier in the year, I wrote on both my blog and in HPCWire ("Dan's Cloudy Crystal Ball") about the possibility of outsourcing research computing services and infrastructure to the cloud. I noted then that the explosive growth of computing as an enabler of scientific discovery had strained university capabilities and Federal research budgets. Given our current economic crisis, university operating budgets and Federal research expenditures will be under even greater strain and there will be increased scrutiny on the need for each investment.

    In a world of (at best) modest research budget increases, we must ask hard questions about the best use of limited funds. Cloud computing offers a potential mechanism to increase the efficiency of current research, ensure continuity of critical data and enable new kinds of research not now feasible.

    In this model, researchers focus on the higher levels of the software stack (applications and innovation), not low-level infrastructure. University and Federal research agency administrators, in turn, procure services from the providers based on capabilities and pricing. Finally, the cloud service providers deliver economies of scale and capabilities driven by a large market base and energy efficient infrastructure. Remember, computing infrastructure exists to enable discovery, not to serve as a monument to technological prowess.

    In addition to efficiency, the scalability of cloud services and infrastructure opens new research possibilities. Not only is it possible to federate multidisciplinary research data at far larger scales than is possible in a university environment (think tens to hundreds of petabytes of low-latency storage), but we can also escape the pernicious cycle of transitory research infrastructure.

    How often have we created data repositories as part of research projects, only to find few mechanisms to ensure their long-term sustainability and access by the broader research community? How often have we faced a miasma of distributed data sources with unknown provenance and incompatible metadata, each supported pro bono on a best-effort basis? (See my recent comments on digital document preservation.) Instead, imagine multidisciplinary data fusion and mining, where students can pose queries against integrated but diverse data sources using robust tools.

    Finally, by leveraging "pay as you go" models, we can trade time and scale on a continuous basis. Imagine applying 50,000 processors for one hour at the same cost as 50 processors for one thousand hours. In the cloud, the integral under the curve is the same and the costs are comparable, but the research effects are qualitatively different.
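The arithmetic behind the 50,000-versus-50 comparison can be sketched in a few lines. This minimal example assumes a flat, linear price per processor-hour (the $0.10 figure is hypothetical, not any provider's actual rate) and a job that scales perfectly, so total work in processor-hours is fixed and only the shape of the allocation changes.

```python
from dataclasses import dataclass

PRICE_PER_PROC_HOUR = 0.10  # hypothetical flat price, dollars per processor-hour

@dataclass
class Allocation:
    processors: int
    hours: float  # wall-clock time to result, assuming perfect scaling

    def processor_hours(self) -> float:
        # The "integral under the curve": processors x time.
        return self.processors * self.hours

    def cost(self, price: float = PRICE_PER_PROC_HOUR) -> float:
        # Linear pricing: cost depends only on processor-hours consumed.
        return self.processor_hours() * price

burst = Allocation(processors=50_000, hours=1.0)
steady = Allocation(processors=50, hours=1_000.0)

for name, a in (("burst", burst), ("steady", steady)):
    print(f"{name:6s}: {a.processor_hours():,.0f} processor-hours, "
          f"${a.cost():,.2f}, result after {a.hours:,.0f} h")
```

Both allocations consume 50,000 processor-hours and cost the same under this pricing assumption, but the burst allocation delivers its answer in one hour rather than roughly six weeks. Real workloads rarely scale perfectly, and real pricing may not be linear, so the equivalence is an idealization.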

    The Standard Questions

    Standard questions always arise about new approaches to computing, and cloud services and data storage inevitably raise them.

    • Is it reliable and will my data persist?
    • Is it safe, private and secure?
    • Will I be captured and become captive?
    • What does it cost and what if I can't continue paying?

    We tend to forget that there are complementary issues about local infrastructure because we have already internalized and accepted the implications and risks. Moreover, local failures are rarely publicized.

    • What happens if my disks crash?
    • What if I can't pay for backups or maintenance or physical plant or …?
    • What if my network is penetrated?

    These are the standard cost/benefit/risk tradeoffs. One must make them based on statistics, economics and practical constraints. Remember that we debated the same issues when we shifted research computing from vendor-backed HPC designs to predominantly commodity components.

    Let's Reason Together

    I welcome discussion of how we can exploit cloud services and infrastructure effectively – all cloud infrastructure, not just Microsoft's Azure. To do this, the cloud service providers, hardware vendors, universities and Federal government must work together to outline an agenda, conduct experiments at scale and speak with a united voice on the opportunities.

    It's a sunny day, but my head is in the clouds.

    July 13, 2008

    Showing Up and Two Corollaries

    "Eighty to ninety percent of life is showing up." The line has been variously attributed to Yogi Berra, Woody Allen or even an anonymous wag. It's wise, though obvious advice – showing up and doing the expected generally allows one to avoid a host of problems. Appearing for jury duty avoids one being held in contempt of court, and you can't fly if you don't show up at the airport on time.  I was reflecting on the implications of "showing up" while at a recent meeting in Italy.

    Show Up and See What Happens

    My friend, Dave Turek, IBM's Vice President for Deep Computing, once explained IBM's open source and Linux strategy by saying that IBM had a deeply considered, two-phase strategy for Linux and clusters for HPC, "Show up and see what happens." As he once remarked at an NCSA Private Sector Partners (PSP) meeting, "We've showed up. Now, we are waiting to see what happens."

    At NCSA, we partnered with IBM in 2001 to deploy two of the first large-scale commodity clusters for open scientific use: two 1-teraflop systems based on Intel Pentium III and Itanium processors. At the time, this was a radical, almost heretical idea: deploying commodity PC clusters as production HPC platforms. Of course, such commodity clusters now dominate the Top500 list.

    In a reprise of this experience, Microsoft and NCSA recently partnered to deploy Windows HPC Server 2008 on the latest incarnation of commodity cluster hardware. (The customer story has the technical details.) I don't generally evangelize for Microsoft products in this blog, but I was very impressed that Windows HPC Server achieved substantially higher performance than Linux on the same hardware. Microsoft, in the form of Kyril Faenov's HPC team, has definitely "showed up" in this space in a big way, and I think there are great opportunities to offer not only Windows compute clusters but also backend acceleration for desktop applications. Of course, all of this is ultimately connected to the ferment in cloud computing.

    Avoid the Obviously Wrong

    At the recent Cetraro meeting on High-Performance Computing and Grids, Miron Livny extended the "show up and see what happens" maxim by offering a corollary, "Show up and avoid doing something stupid." His observation was that, evolutionarily, human success was defined by avoiding being trampled by a woolly mammoth, being eaten by a hungry Bengal tiger or falling into a crevasse.

    The computing implication of Livny's corollary is that one should do reasonable things when presented with opportunities. In terms of research infrastructure, this means avoiding our academic tendency to delight in second system syndrome – building complex systems that embody all of our personally favorite features without determining if they are either needed or useful.

    At Cetraro, we debated the impact of the multicore revolution, the similarities and differences between Grids and clouds, and the commonalities between future exascale systems and the architecture of megascale data centers. (By the way, if you have not read the Department of Energy's exascale computing study, I highly recommend it.)

    There are deep technical challenges in all of these areas. However, we must avoid being trampled by the woolly mammoths; this domain is fraught with academic, government and industrial politics. I believe we need a wider dynamic range (time horizon, risk/reward and fiscal scale) of research and development projects if we are to solve these problems.

    I have made this point many times, most recently as part of the PCAST report on the U.S. NITRD program. I am scheduled to testify about this again to the House Science and Technology Committee on July 31. I will report on the hearing in August.

    Do Simple Things Quickly

    At the same Cetraro meeting, I opined that there was a second corollary, "Do the obvious, simple things quickly." I think this is the key lesson to be drawn from Web 2.0 mashups and the rapid evolution of commercial clouds. The simplicity of the APIs and hosted infrastructure encourages external groups to innovate rapidly. We have seen clear evidence of this in the explosive growth of social networking sites and in the hosted services that have appeared.

    By contrast, I think this is one of the places where we have struggled with academic Grids. The software has often been too complex, a complexity engendered by the distributed nature of the participating organizations, which requires "glue code" to integrate disparate policies and infrastructure across virtual organizations. Mashups and cloud services, however, can be deployed quickly (by academic standards) using very simple APIs and service level agreements (SLAs). It will be interesting to see how the Grid/Cloud mashup evolves.