Microsoft officials have given more details on how the company has worked to increase cloud capacity since the start of the Covid-19 global pandemic. On June 16, the company detailed what it is doing on this front, in particular how it is striving to strengthen its Teams service, demand for which has risen sharply since the spring.
Officials have previously spoken about the priority Microsoft has given to requests from first responders, healthcare workers, and other front-line workers. They had shared details of some of the less essential services they had set aside. And they had also publicly acknowledged that supply chain challenges led to a shortage of certain necessary data center components, which made it harder still to meet some demands for cloud computing.
This Tuesday, they said that Microsoft data center employees worked in shifts to install new servers (while staying at least two meters apart). Microsoft added new servers first in the hardest-hit regions.
Freeing up more capacity for in-demand services
The company also doubled the capacity of one of its own submarine cables that carry data across the Atlantic, and “negotiated with the owners of another to open additional capacity.” Engineers also tripled the capacity deployed on the cable that links Europe to America in two weeks, they added.
At the same time, product teams looked at all of Microsoft’s services running on Azure to free up more capacity for high-demand services like Teams, Office, Windows Virtual Desktop, Azure Active Directory application proxy, and Xbox, officials said. In some cases, engineers rewrote code to improve efficiency, as they did for video stream processing, which officials said was made ten times more efficient over one weekend.
Teams were formed to move reserved capacity to other data center regions within a week, rather than through the multi-month process that such a move would normally involve. In addition, Microsoft’s Azure Wide Area Network team added 110 terabytes of capacity in two months to the fiber optic network that carries Microsoft data, as well as 12 new state-of-the-art IT sites to connect the network to infrastructure owned by local internet providers, in order to reduce network congestion.
Microsoft has shifted its own internal Azure workloads to avoid worldwide demand spikes and to divert traffic from regions of high demand, the teams said. On the consumer side, Microsoft has moved game workloads out of high-demand data centers in the UK and Asia and has worked to reduce bandwidth usage during peak hours of the day.
Updating predictive models
Microsoft also had to update its forecasting models to take into account the sharp increase in cloud demand resulting from the pandemic. To its multiple predictive modeling techniques (ARIMA, additive, multiplicative, logarithmic), Microsoft added basic per-country caps to avoid over-forecasting. It also adapted its models to take into account inflection and growth patterns in usage by industry and geographic area, while adding external data sources on the impact of the health crisis by country. Throughout the process, officials said, the company was cautious and favored over-provisioning, but it also scaled back where necessary once usage patterns stabilized.
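The idea of a per-country cap can be pictured with a small sketch. It is an illustration only, not Microsoft’s model: it assumes a simple multiplicative (constant growth factor) forecast, and the function name `capped_forecast`, the sample figures, and the cap value are all hypothetical.

```python
def capped_forecast(history, cap, horizon):
    """Project demand forward with a constant growth factor, then cap it.

    history: past demand values (at least two, all positive)
    cap: hypothetical per-country ceiling on forecast demand
    horizon: number of future periods to forecast
    """
    if len(history) < 2 or min(history) <= 0:
        raise ValueError("need at least two positive observations")
    # Average per-period growth factor (geometric mean of the ratios).
    growth = (history[-1] / history[0]) ** (1 / (len(history) - 1))
    forecast = []
    level = history[-1]
    for _ in range(horizon):
        level *= growth
        # The cap keeps a short-lived spike from being extrapolated without limit.
        forecast.append(min(level, cap))
    return forecast


# Hypothetical demand that doubled each week, in a country capped at 1000 units.
print(capped_forecast([100, 200, 400], cap=1000, horizon=3))
```

Without the cap, the same growth factor would keep doubling the forecast every period, which is the kind of over-forecasting the officials describe wanting to avoid.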
Microsoft has also learned a few lessons from increasing its computing resources for Teams. By redeploying some of its micro-services onto a larger number of smaller computing clusters, the company was able to avoid some per-cluster scaling considerations, accelerate its deployments, and achieve finer-grained load rebalancing. It has also become more flexible about the type of virtual machine or CPU used to run different micro-services, so that it can focus on overall computing power or memory and make fuller use of Azure resources in each region. And the engineers were able to optimize the service code itself, reducing things like the CPU time a program consumes.
Microsoft has added new routing strategies to take advantage of unused capacity. Call and meeting traffic was routed across multiple regions to handle surges, and time-of-day load balancing helped Microsoft avoid WAN throttling, officials said. Using Azure Front Door, Microsoft was able to route traffic at the country level. It also made a number of cache and storage improvements, which reduced the payload size by 65%, the deserialization time by 40% and the serialization time by 20%.
Microsoft has also changed its incident management policy. It moved incident management rotations from a weekly to a daily cadence. It brought in more incident managers from across the company and postponed all non-critical changes across all of its services.
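Time-of-day load balancing can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a description of how Teams or Azure Front Door actually route traffic: the region names, UTC offsets, spare-capacity figures, and the 9-to-17 peak window are all hypothetical.

```python
def pick_region(utc_hour, regions, peak_start=9, peak_end=17):
    """Return the name of the region to send new traffic to.

    regions: list of (name, utc_offset_hours, spare_capacity) tuples,
             all hypothetical values for illustration.
    Off-peak regions are preferred; ties are broken by spare capacity.
    """
    def is_peak(offset):
        local_hour = (utc_hour + offset) % 24
        return peak_start <= local_hour < peak_end

    # Sort key: off-peak regions first (False sorts before True),
    # then the region with the most spare capacity.
    ranked = sorted(regions, key=lambda r: (is_peak(r[1]), -r[2]))
    return ranked[0][0]


regions = [
    ("europe", 1, 40),
    ("us-east", -5, 25),
    ("asia", 8, 30),
]
print(pick_region(10, regions))
```

The point of the sketch is that a region in the middle of its local working day is avoided even when it nominally has the most spare capacity, because that capacity is the most likely to be needed locally.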
New container-based deployments
All of this scaling up of cloud capacity will have an impact on how Microsoft builds and maintains its Azure-based services, like Teams, officials say. “What we can do today by simply changing the configuration files may have previously required the purchase of new equipment or even new buildings,” they add in a blog post.
Regarding the future of Teams, Microsoft plans to move from VM-based deployments to container-based deployments using its Azure Kubernetes Service. The company has also announced that it plans to minimize the use of REST in favor of more efficient binary protocols like gRPC.
Reports of Microsoft customers reaching Azure capacity limits in some regions did not begin only when more people started working from home during the coronavirus pandemic. Last fall, a number of East US2 Azure customers said they couldn’t even run virtual machines due to Azure capacity issues. Will these new capacity upgrades prevent future general Azure capacity issues?
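To see why a binary protocol can be more efficient than a text-based one, consider the size of a single message. The sketch below does not use gRPC or Protocol Buffers; it uses Python’s standard `struct` module as a stand-in for a binary encoding, and the message fields are hypothetical.

```python
import json
import struct

# Hypothetical call-quality sample; field names are illustrative only.
sample = {"call_id": 123456789, "jitter_ms": 12.5, "packets_lost": 3}

# Text encoding, as a REST/JSON API would typically send it.
as_json = json.dumps(sample).encode("utf-8")

# Fixed binary layout: unsigned 64-bit int, 64-bit float, unsigned 32-bit int.
# Both sides must agree on this layout in advance; gRPC typically relies on
# Protocol Buffers schemas for that agreement.
as_binary = struct.pack(
    "<QdI", sample["call_id"], sample["jitter_ms"], sample["packets_lost"]
)

print(len(as_json), len(as_binary))

# Decoding the binary form recovers the original values.
call_id, jitter_ms, packets_lost = struct.unpack("<QdI", as_binary)
```

The binary form drops the field names and punctuation that JSON repeats in every message, at the cost of requiring a shared schema between sender and receiver.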