Keeping the Lights On: Business Continuity for Office 365

Early in my career at Microsoft, I worked in Microsoft Consulting Services, supporting organizations looking to deploy Exchange 2007 and 2010 in their on-premises environments. During those engagements, the bulk of the conversations focused on availability and disaster recovery concepts for Exchange – things like CCR, SCR and building out the DAG to ensure performance and database availability during an outage – whether it was a disk outage, a server outage, a network outage or a datacenter outage.

Those were fun days. And by “fun”, I mean “I’m glad those days are over”.

It’s never a fun day when you have to tell a customer that they CAN have 99.999% availability (of course – who DOESN’T want five 9’s of availability??) for their email service, but it will probably cost them all the money they make in a year to get it.

Back then, BPOS (Business Productivity Online Service) wasn’t really on the radar for most organizations outside of some larger corporate and government customers.

Then on June 28, 2011, Microsoft announced the release of Office 365 – and the ballgame changed. In the years since then, Office 365 has become a hugely popular service, providing online services to tens of thousands of customers and millions of users.

As a result, more businesses are using Office 365 for their business-critical information. This, of course, is great for our customers, because they get access to a fantastic online service, but it requires a high degree of trust on the part of customers that Microsoft is doing everything possible to preserve the confidentiality, integrity and availability of their data.

A large part of that means that Microsoft must ensure that the impact of natural disasters, power outages, human attacks, and so on are mitigated as much as possible. I recently heard a talk given that dealt with how Microsoft builds our datacenters and account for all sorts of disasters – earthquakes, floods, undersea cable cuts – even mitigations for a meteorite hitting Redmond!

It was an intriguing discussion and it’s good to hear the stories of datacenter survivability in our online services, but the truth is, customers want and need more than stories. This is evidenced by the fact that the contracts that are drawn up for Office 365 inevitably contain requirements related to defining Microsoft’s business continuity methodology.

Our enterprise customers, particularly those from regulated industries, are routinely required to perform business continuity testing to demonstrate that they are taking the steps necessary to keep their services up and running when some form of outage or disaster occurs.

The dynamics change somewhat when a customer moves to Office 365, however. These same customers now must assess the risk of outsourcing their services to a supplier, since the business continuity plans of that supplier directly impact the customer’s adherence to the regulations as well. In the case of Office 365, Microsoft is the outsourced supplier of services, so Microsoft’s Office 365 business continuity plans become very relevant.

Let’s take a simple example:

A customer named Contoso-Med has a large on-premises infrastructure. If business continuity testing were being done in-house by Contoso-Med and they failed the test, they would be held responsible for making the necessary corrections to their processes and procedures.

Now, just because Contoso-Med has moved those same business processes and data to Office 365, they are not absolved of the responsibility to ensure that the services meet the business continuity standards defined by regulators. They must still have a way of validating that Microsoft’s business continuity processes meet the standards defined by the regulations.

However, since Contoso-Med doesn’t get to sit in and offer comments on Microsoft’s internal business continuity tests, they must have another way of confirming that they are compliant with the regulations.

First…a Definition

Before I go much further, I want to clarify something.

There are several concepts that often get intermingled and, at times, used interchangeably: high availability, service resilience, disaster recovery and business continuity. We won’t dig into details on each of these concepts but suffice it to say they all have at their core the desire to keep services running for a business when something goes wrong. However, “business continuity and disaster recovery” from Microsoft’s perspective means that Microsoft will address the recovery and continuity of critical business functions, business system software, hardware, IT infrastructure services and data required to maintain an acceptable level of operations during an incident.

To accomplish that, the Microsoft Online Service Terms (http://go.microsoft.com/?linkid=9840733),which is sometimes referred to as simply the OST, currently states the following regarding business continuity:

  • Microsoft maintains emergency and contingency plans for the facilities in which Microsoft information systems that process Customer Data are located
  • Microsoft’s redundant storage and its procedures for recovering data are designed to attempt to reconstruct Customer Data in its original or last-replicated state from before the time it was lost or destroyed

 

Nice Definition. But How Do You Do It?

I’ve referenced the Service Trust portal in a few other blog posts and described how it can help you track things like your organization’s compliance for NIST, HIPAA or GDPR. It’s also a good resource for understanding other efforts that factor into the equation of whether Microsoft’s services can be trusted by their customers and partners.

A large part of achieving that level of trust relates to how we set up the physical infrastructure of the services.

To be clear, Microsoft online services are always on, running in an active/active configuration with resilience at the service level across multiple data centers. Microsoft has designed the online services to anticipate, plan for, and address failures at the hardware, network, and datacenter levels. Over time, we have built intelligence into our products to allow us to address failures at the application layer rather than at the datacenter layer, which would mean relying on third-party hardware.

As a result, Microsoft is able to deliver significantly higher availability and reliability for Office 365 than most customers are able to achieve in their own environments, usually at a much lower cost. The datacenters operate with high redundancy and the online services are delivering against the financially backed service level agreement of 99.9%.

The Office 365 core reliability design principles include:

  • Redundancy is built into every layer: Physical redundancy (through the use of multiple disk, network cards, redundant servers, geographical sites, and datacenters); data redundancy (constant replication of data across datacenters); and functional redundancy (the ability for customers to work offline when network connectivity is interrupted or inconsistent).
  • Resiliency: We achieve service resiliency using active load balancing and dynamic prioritization of tasks based on current loads. Additionally, we are constantly performing recovery testing across failure domains, and exercising both automated failover and manual switchover to healthy resources.
  • Distributed functionality of component services: Component services of Office 365 are distributed across datacenters and regions to help limit the scope and impact of a failure in one area and to simplify all aspects of maintenance and deployment, diagnostics, repair and recovery.
  • Continuous monitoring: Our services are being actively monitored 24×7, with extensive recovery and diagnostic tools to drive automated and manual recovery of the service.
  • Simplification: Managing a global, online service is complex. To drive predictability, we use standardized components and processes, wherever possible. A loose coupling among the software components results in a less complex deployment and maintenance. Lastly, a change management process that goes through progressive stages from scope to validation before being deployed worldwide helps ensure predictable behaviors.
  • Human backup: Automation and technology are critical to success, but ultimately, its people who make the most critical decisions during a failure, outage or disaster scenario. The online services are staffed with 24/7 on-call support to provide rapid response and information collection towards problem resolution.

These elements exist for all the online services – Azure, Office 365, Dynamics, and so on.

But how are they leveraged during business continuity testing?

Each service team tests their contingency plans at least annually to determine the plan’s effectiveness and the service team’s readiness to execute the plan. The frequency and depth of testing is linked to a confidence level which is different for each of the online services. Confidence levels indicate the confidence and predictability of a service’s ability to recover.

For details on the confidence levels and testing frequencies for Exchange Online, SharePoint Online and OneDrive for Business, etc… please refer to the most recent ECBM Plan Validation Report available on the Office 365 Service Trust Portal.

BC/DR Plan Validation Report – FY19 Q1

A new reporting process has been developed in response to Microsoft Online Services customer expectations regarding our business continuity plan validation activities. The reporting process is designed to provide additional transparency into Microsoft’s Enterprise Business Continuity Management (EBCM) program operations.

The report will be published quarterly for the immediately preceding quarter and will be made available on the Service Trust Portal (STP). Each report will provide details from recent validations and control testing against selected online services.

For example, the FY19 Q1 report, which is posted on the Service Trust Portal (ECBM Testing Validation Report: FY19 Q1), includes information related to 9 selected online services across Office 365, Azure and Dynamics, with the testing dates and testing outcomes for each of the selected services.

The current report only covers a subset of Microsoft cloud services, and we are committed to continuously improving this reporting process.

If you have any questions or feedback related to the content of the reporting, you can send an email to the Office 365 CXP team at cxprad@microsoft.com.

Additional Business Continuity resources are available on the Trust Center , Service Trust Portal, Compliance Manager and TechNet

  1. Azure SOC II audit report:  The Azure SOC II report  discusses business continuity (BC) starting on page 59 of the report, and the auditor confirms no exceptions noted for BC control testing on page 95.
  2. Azure SOC Bridge Letter Oct-Dec 2018 : The Azure SOC Bridge letter confirms that there have been no material changes to the system of internal control that would impact the conclusions reached in the SOC 1 type 2 and SOC 2 type 2 audit assessment reports.
  3. Global Data Centers provides insights into Microsoft’s framework for datacenter Threat, Vulnerability and Risk Assessments (TVRA)
  4. Office 365 Core – SSAE 18 SOC 2 Report 9-30-2018: Similar to the Azure  365 SOC II audit report (dated 10/1/2017 through 9/30/2018) discusses Microsoft’s position on business continuity (BC) in Section V, page 71 and the auditor confirms no exceptions noted for the CA-50 control test on page 66.
  5. Office 365 SOC Bridge Letter Q4 2018 : SOC Bridge letter confirming no material changes to the system of internal control provided by Office 365 that would impact the conclusions reached in the SOC 1 type 2 and SOC 2 type 2 audit assessment reports.
  6. Compliance Manager’s Office 365 NIST 800-53 control mapping provides positive (PASS) results for all 51 Business Continuity Disaster Recovery (BCDR)-related controls within Microsoft Managed Controls section, under Contingency Planning. For example, the Exchange Online Recovery Time  Objective and Recovery Point Objective (EXO RPO/RTO) metrics are tested by the third-party auditor per NIST 800-53 control ID CP2(3). Other workloads, such as SharePoint Online, were also audited and discussed in the same control section.
  7. ISO-22301  This business continuity certification has been awarded to Microsoft Azure, Microsoft Azure Government, Microsoft Cloud App Security, Microsoft Intune, and Microsoft Power BI. This is a special one. Microsoft is the first (and currently the ONLY) hyperscale cloud service provider to receive the ISO 22301 certification, which is specifically targeted at business continuity management. That’s right. Google doesn’t have it. Amazon Web Services doesn’t have it. Just Microsoft.
  8. The Office 365 Service Health TechNet article provides useful information and insights related to Microsoft’s notification policy and post-incident review processes
  9. The Exchange Online (EXO) High Availability TechNet article outlines how continuous and multiple EXO replication in geographically dispersed data centers ensures data restoration capability in the wake of messaging infrastructure failure
  10. Microsoft’s Office 365 Data Resiliency Overview outlines ways Microsoft has built redundancy directly into our cloud services, moving away from complex physical infrastructure toward intelligent software to build data resiliency
  11. Microsoft’s current SLA commitments for online services
  12. Current worldwide up times are reported on Office 365 Trust Center Operations Transparency
  13. Azure SLAs and uptime reports are found on Azure Support

As you can see, there are a lot of places where you can find information related to business continuity, service resilience and related topics for Office 365.

This type of information is very useful for partners and customers who need to understand how Microsoft “keeps the lights on” with its Office 365 service and ensures that customers are able to meet regulatory standards, even if their data is in the cloud.

 

Leveraging the Office 365 Service Assurance Portal in Customer Scenarios

August 23, 2017
By David Branscome

In the partner organization at Microsoft, we get lots of requests from partners that are in the process of responding to an RFP for Office 365 or Azure deployments. Maybe the partner has described the Microsoft datacenters to their customers as being ISO 27001 or FedRAMP compliant. But now the customer has stated that they need to know how certain controls are implemented in Microsoft’s datacenters. In many cases, the customer is audited regularly, and they have to be able to provide evidence that their data is stored in a specific manner or that access is controlled in a specific way.

The problem is, getting access into the Microsoft datacenters is REALLY difficult. Most Microsoft employees haven’t even been in one of the cloud datacenters – including myself. (There’s a decent virtual tour here, but I’d sure like to see all the blinky lights someday.)

In any case, partners don’t have to get a datacenter tour to respond to these types of information requests from customers. The information is literally at their fingertips in the Office 365 portal – just go to the Security & Compliance section and on the left side, find the Service Assurance section.

Wait…I Don’t See it!

But wait a second.

This data isn’t available to everyone. So, a compliance officer with no special permissions in Office 365 would see something like this:

They don’t even see the Admin or Security & Compliance application icons – let alone the Service Assurance menu. Now what?

As you’d expect, not everyone with an account in Office 365 needs to see that organization’s security configuration. If there are some users who need to be able to access the Service Assurance Center, here’s how to grant those permissions:

Log in to the Office 365 portal with Global Admin credentials.

Go to the Security and Compliance app and select Permissions.

In Permissions, check the box for Service Assurance User.

Select Edit role group and in the Members area, click on Edit.

Select Choose members to add the people who should have these permissions.

Click Add and then find the user.

Finish the wizard and you’ll see the user as a member of the Service Assurance User permissions group.

When the user logs in again, they will be able to go to https://protection.office.com and see the Service Assurance center:

Okay…Now What?

Now that you have the necessary permissions, you can start digging into the content in the Service Assurance center. You could start off by looking at all the controls and audited elements, but maybe you want to be more specific in your approach.

Let’s say you want to see how Office 365 meets ISO 27001 standards.

The first thing I’d recommend is to go to the Settings area and define the region whose controls are relevant – in this case, Europe. You’ll also need to select at least one of the industries whose regulations would be relevant to your search, then click Save.

As the green box indicates, you can now go into the Compliance Reports, Trust Documents and Audited Controls and review the content for the relevant region and industry. So, let’s take a look at what’s there.
If you look in the Compliance Reports area, you’ll see the listing of the certificates that Microsoft cloud datacenters have achieved, and you can click on and download the certificate itself.

For example, if I expand the ISO reports section and scroll down, I see a report named “Office 365 Germany ISO 27001 ISO 27017 and ISO 27018 Audit Assessment Report”. If I click on it, I can open the PDF file itself, which provides me with the final report stating that Office 365 meets the expectations for compliance.

But this only tells me if Microsoft complied with the controls or not. It doesn’t tell me what was actually tested as part of the process.

For that, I can go to the Audited Controls section, where I see the ISO 27018-2014 audit report and I can download it for review.

In this case, the report is an Excel spreadsheet which details things like the title of the control, the implementation and testing details, when it was tested and who performed the testing. This kind of information is generally enough for a customer’s audit team to be reassured of Microsoft’s compliance with the standard.
Don’t forget – if you want to change the scope of the controls (the region/country where the controls are relevant, which industry regulations apply, etc..) you can change the parameters in the Settings tab.

The Trusted Cloud

Microsoft is constantly working to achieve, maintain and even exceed compliance standards in order to secure customer data and make our cloud the most trusted one on the planet. The Service Assurance section of Office 365 is one evidence of that effort. Make sure to take advantage of it!
Additionally, check out the resources in the Microsoft Trust Center for information about GDPR, security, protection of user’s personally identifiable information and Microsoft’s commitment to providing customers with the controls necessary to secure their environment and user identities.

https://www.microsoft.com/en-us/trustcenter