Early in my career at Microsoft, I worked in Microsoft Consulting Services, supporting organizations looking to deploy Exchange 2007 and 2010 in their on-premises environments. During those engagements, the bulk of the conversations focused on availability and disaster recovery concepts for Exchange – things like CCR, SCR and building out the DAG to ensure performance and database availability during an outage – whether it was a disk outage, a server outage, a network outage or a datacenter outage.
Those were fun days. And by “fun”, I mean “I’m glad those days are over”.
It’s never a fun day when you have to tell a customer that they CAN have 99.999% availability (of course – who DOESN’T want five 9’s of availability??) for their email service, but it will probably cost them all the money they make in a year to get it.
Back then, BPOS (Business Productivity Online Service) wasn’t really on the radar for most organizations outside of some larger corporate and government customers.
Then on June 28, 2011, Microsoft announced the release of Office 365 – and the ballgame changed. In the years since then, Office 365 has become a hugely popular service, providing online services to tens of thousands of customers and millions of users.
As a result, more businesses are using Office 365 for their business-critical information. This, of course, is great for our customers, because they get access to a fantastic online service, but it requires a high degree of trust on the part of customers that Microsoft is doing everything possible to preserve the confidentiality, integrity and availability of their data.
A large part of that means that Microsoft must ensure that the impact of natural disasters, power outages, human attacks, and so on is mitigated as much as possible. I recently heard a talk that dealt with how Microsoft builds our datacenters and accounts for all sorts of disasters – earthquakes, floods, undersea cable cuts – even mitigations for a meteorite hitting Redmond!
It was an intriguing discussion and it’s good to hear the stories of datacenter survivability in our online services, but the truth is, customers want and need more than stories. This is evidenced by the fact that the contracts that are drawn up for Office 365 inevitably contain requirements related to defining Microsoft’s business continuity methodology.
Our enterprise customers, particularly those from regulated industries, are routinely required to perform business continuity testing to demonstrate that they are taking the steps necessary to keep their services up and running when some form of outage or disaster occurs.
The dynamics change somewhat when a customer moves to Office 365, however. These same customers now must assess the risk of outsourcing their services to a supplier, since the business continuity plans of that supplier directly impact the customer’s adherence to the regulations as well. In the case of Office 365, Microsoft is the outsourced supplier of services, so Microsoft’s Office 365 business continuity plans become very relevant.
Let’s take a simple example:
A customer named Contoso-Med has a large on-premises infrastructure. If business continuity testing were being done in-house by Contoso-Med and they failed the test, they would be held responsible for making the necessary corrections to their processes and procedures.
Now, just because Contoso-Med has moved those same business processes and data to Office 365, they are not absolved of the responsibility to ensure that the services meet the business continuity standards defined by regulators. They must still have a way of validating that Microsoft’s business continuity processes meet the standards defined by the regulations.
However, since Contoso-Med doesn’t get to sit in and offer comments on Microsoft’s internal business continuity tests, they must have another way of confirming that they are compliant with the regulations.
First…a Definition
Before I go much further, I want to clarify something.
There are several concepts that often get intermingled and, at times, used interchangeably: high availability, service resilience, disaster recovery and business continuity. We won’t dig into details on each of these concepts but suffice it to say they all have at their core the desire to keep services running for a business when something goes wrong. However, “business continuity and disaster recovery” from Microsoft’s perspective means that Microsoft will address the recovery and continuity of critical business functions, business system software, hardware, IT infrastructure services and data required to maintain an acceptable level of operations during an incident.
To accomplish that, the Microsoft Online Service Terms (http://go.microsoft.com/?linkid=9840733), sometimes referred to simply as the OST, currently states the following regarding business continuity:
- Microsoft maintains emergency and contingency plans for the facilities in which Microsoft information systems that process Customer Data are located
- Microsoft’s redundant storage and its procedures for recovering data are designed to attempt to reconstruct Customer Data in its original or last-replicated state from before the time it was lost or destroyed
Nice Definition. But How Do You Do It?
I’ve referenced the Service Trust portal in a few other blog posts and described how it can help you track things like your organization’s compliance for NIST, HIPAA or GDPR. It’s also a good resource for understanding other efforts that factor into the equation of whether Microsoft’s services can be trusted by their customers and partners.
A large part of achieving that level of trust relates to how we set up the physical infrastructure of the services.
To be clear, Microsoft online services are always on, running in an active/active configuration with resilience at the service level across multiple data centers. Microsoft has designed the online services to anticipate, plan for, and address failures at the hardware, network, and datacenter levels. Over time, we have built intelligence into our products to allow us to address failures at the application layer rather than at the datacenter layer, which would mean relying on third-party hardware.
As a result, Microsoft is able to deliver significantly higher availability and reliability for Office 365 than most customers are able to achieve in their own environments, usually at a much lower cost. The datacenters operate with high redundancy and the online services are delivering against the financially backed service level agreement of 99.9%.
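To put availability percentages like these in perspective, here is a minimal sketch that converts a target into a yearly downtime budget. The function name and the yearly measurement window are my own assumptions for illustration; the actual SLA defines its own measurement period and its own rules for what counts as downtime.

```python
# Convert an availability percentage into a downtime budget.
# Assumption: a 365-day year as the measurement window (illustrative only;
# a real SLA specifies its own period and downtime definition).
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the maximum minutes of downtime per year a target allows."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% -> {minutes:.1f} minutes/year ({minutes / 60:.2f} hours)")
```

Under these assumptions, 99.9% allows roughly 8.76 hours of downtime in a year, while five 9's allows a little over five minutes, which is why chasing that last fraction of a percent on-premises gets so expensive.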
The Office 365 core reliability design principles include:
- Redundancy is built into every layer: Physical redundancy (through the use of multiple disks, network cards, redundant servers, geographical sites, and datacenters); data redundancy (constant replication of data across datacenters); and functional redundancy (the ability for customers to work offline when network connectivity is interrupted or inconsistent).
- Resiliency: We achieve service resiliency using active load balancing and dynamic prioritization of tasks based on current loads. Additionally, we are constantly performing recovery testing across failure domains, and exercising both automated failover and manual switchover to healthy resources.
- Distributed functionality of component services: Component services of Office 365 are distributed across datacenters and regions to help limit the scope and impact of a failure in one area and to simplify all aspects of maintenance and deployment, diagnostics, repair and recovery.
- Continuous monitoring: Our services are being actively monitored 24×7, with extensive recovery and diagnostic tools to drive automated and manual recovery of the service.
- Simplification: Managing a global, online service is complex. To drive predictability, we use standardized components and processes wherever possible. Loose coupling among the software components results in less complex deployment and maintenance. Lastly, a change management process that goes through progressive stages from scope to validation before being deployed worldwide helps ensure predictable behaviors.
- Human backup: Automation and technology are critical to success, but ultimately, it’s people who make the most critical decisions during a failure, outage or disaster scenario. The online services are staffed with 24/7 on-call support to provide rapid response and information collection towards problem resolution.
These elements exist for all the online services – Azure, Office 365, Dynamics, and so on.
But how are they leveraged during business continuity testing?
Each service team tests their contingency plans at least annually to determine the plan’s effectiveness and the service team’s readiness to execute the plan. The frequency and depth of testing are linked to a confidence level, which is different for each of the online services. Confidence levels indicate the confidence and predictability of a service’s ability to recover.
For details on the confidence levels and testing frequencies for Exchange Online, SharePoint Online, OneDrive for Business and other services, please refer to the most recent EBCM Plan Validation Report available on the Office 365 Service Trust Portal.
BC/DR Plan Validation Report – FY19 Q1
A new reporting process has been developed in response to Microsoft Online Services customer expectations regarding our business continuity plan validation activities. The reporting process is designed to provide additional transparency into Microsoft’s Enterprise Business Continuity Management (EBCM) program operations.
The report will be published quarterly for the immediately preceding quarter and will be made available on the Service Trust Portal (STP). Each report will provide details from recent validations and control testing against selected online services.
For example, the FY19 Q1 report, which is posted on the Service Trust Portal (EBCM Testing Validation Report: FY19 Q1), includes information related to nine selected online services across Office 365, Azure and Dynamics, with the testing dates and testing outcomes for each of the selected services.
The current report only covers a subset of Microsoft cloud services, and we are committed to continuously improving this reporting process.
If you have any questions or feedback related to the content of the reporting, you can send an email to the Office 365 CXP team at cxprad@microsoft.com.
- Azure SOC II audit report: The Azure SOC II report discusses business continuity (BC) starting on page 59 of the report, and the auditor confirms no exceptions noted for BC control testing on page 95.
- Azure SOC Bridge Letter Oct-Dec 2018: The Azure SOC Bridge letter confirms that there have been no material changes to the system of internal control that would impact the conclusions reached in the SOC 1 type 2 and SOC 2 type 2 audit assessment reports.
- Global Data Centers provides insights into Microsoft’s framework for datacenter Threat, Vulnerability and Risk Assessments (TVRA)
- Office 365 Core – SSAE 18 SOC 2 Report 9-30-2018: Similar to the Azure SOC II audit report, this report (dated 10/1/2017 through 9/30/2018) discusses Microsoft’s position on business continuity (BC) in Section V, page 71, and the auditor confirms no exceptions noted for the CA-50 control test on page 66.
- Office 365 SOC Bridge Letter Q4 2018: SOC Bridge letter confirming no material changes to the system of internal control provided by Office 365 that would impact the conclusions reached in the SOC 1 type 2 and SOC 2 type 2 audit assessment reports.
- Compliance Manager’s Office 365 NIST 800-53 control mapping provides positive (PASS) results for all 51 Business Continuity Disaster Recovery (BCDR)-related controls within the Microsoft Managed Controls section, under Contingency Planning. For example, the Exchange Online Recovery Time Objective and Recovery Point Objective (EXO RTO/RPO) metrics are tested by the third-party auditor per NIST 800-53 control ID CP-2(3). Other workloads, such as SharePoint Online, were also audited and discussed in the same control section.
- ISO 22301: This business continuity certification has been awarded to Microsoft Azure, Microsoft Azure Government, Microsoft Cloud App Security, Microsoft Intune, and Microsoft Power BI. This is a special one. Microsoft is the first (and currently the ONLY) hyperscale cloud service provider to receive the ISO 22301 certification, which is specifically targeted at business continuity management. That’s right. Google doesn’t have it. Amazon Web Services doesn’t have it. Just Microsoft.
- The Office 365 Service Health TechNet article provides useful information and insights related to Microsoft’s notification policy and post-incident review processes
- The Exchange Online (EXO) High Availability TechNet article outlines how continuous and multiple EXO replication in geographically dispersed data centers ensures data restoration capability in the wake of messaging infrastructure failure
- Microsoft’s Office 365 Data Resiliency Overview outlines ways Microsoft has built redundancy directly into our cloud services, moving away from complex physical infrastructure toward intelligent software to build data resiliency
- Microsoft’s current SLA commitments for online services
- Current worldwide up times are reported on Office 365 Trust Center Operations Transparency
- Azure SLAs and uptime reports are found on Azure Support
As you can see, there are a lot of places where you can find information related to business continuity, service resilience and related topics for Office 365.
This type of information is very useful for partners and customers who need to understand how Microsoft “keeps the lights on” with its Office 365 service and ensures that customers are able to meet regulatory standards, even if their data is in the cloud.