Office 365 Basic Fault Finding: Part One
It’s a common question. You have moved your email provision and web servers into the cloud, so you no longer have any servers to check (hallelujah!). But what do you do when your users report problems accessing that email or those websites? No servers means no logs. You can get some logs in Azure, but Office 365 is an end-user product for consumers and businesses and isn’t really designed with diagnostics in mind.
So, here is some basic guidance on how to track down what has actually gone wrong in your brave new world of cloud servers and federated authentication.
Practically everything in Office 365 and Azure is a web service, so if you encounter an HTTP error code of some sort, there is likely something wrong with a server or service somewhere between the user and your network or the application. In most decent-sized organisations, all federated authentication requests, whether for Office 365 itself or for an application hosted within it, end up back at your federation servers, which are likely hosted partly or fully on your own network. So if those servers are unavailable, for example because your Internet connection is down or maxed out, the user will probably receive an HTTP error of some sort. Common errors are:
401.x Generally authentication errors, such as a bad password, but a 401 can also simply indicate that whatever the user is trying to access doesn’t exist. This can happen for a whole variety of reasons, such as the user account not having an assigned Office 365 licence (and thus having no access to a mailbox) or files having been renamed or deleted in a shared OneDrive.
403.x Authorisation errors. These are usually returned by an application that uses your federated authentication (such as a hosted HR server) and mean that the presented credentials are correct but access to something has been denied. This is likely to be an application problem rather than a service problem, as the user may lack the required permissions within the application.
500.x Generic server errors. The sub-status code (the x part) may give more information, but usually the application has received something it doesn’t know how to handle. As you might imagine, this is more common with web applications than with web services such as ADFS. Resolution will depend on the exact error code and on what you can find in the HTTP logs for either the server or the service (assuming there are logs that you can access, of course). If the server or service is hosted (many HR and Finance systems are hosted these days), then the problem will need to be escalated to the supplier.
502.x/503.x Gateway and availability errors, such as 502 Bad Gateway or 503 Service Unavailable returned at the web application end. ADFS and Office 365 are unlikely to be the cause of such errors, but they are quite common in Azure-hosted web applications whenever a server (or Azure web app) in the cloud is rebooted after patching and the Azure load balancer hasn’t yet noticed. Generally speaking, waiting 15 minutes and trying again will let the user back in.
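Because federated sign-ins end up at your own federation servers, a quick first check is whether those servers answer at all, ideally from both inside and outside your network. The Python sketch below requests one URL and reports either the HTTP status a server returned or the fact that nothing answered. It assumes AD FS, whose federation metadata is published at `/FederationMetadata/2007-06/FederationMetadata.xml`; the helper name `probe` and the host `sts.example.com` are hypothetical placeholders, so substitute your own federation service name.

```python
import urllib.error
import urllib.request

# Standard AD FS federation metadata path. The host you prepend to it is
# yours to supply; any name used with it here is a hypothetical placeholder.
METADATA_PATH = "/FederationMetadata/2007-06/FederationMetadata.xml"


def probe(url: str, timeout: float = 10.0) -> str:
    """Request one URL and summarise what came back (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # A successful answer (urlopen follows redirects automatically).
            return f"OK {resp.status}"
    except urllib.error.HTTPError as err:
        # A server answered, but with an error status such as 401, 403, 500 or 503.
        return f"HTTP {err.code} {err.reason}"
    except urllib.error.URLError as err:
        # No HTTP answer at all: DNS failure, refused connection, TLS problem, etc.
        return f"UNREACHABLE {err.reason}"
    except OSError as err:
        # Socket-level failures, such as a read timeout, that are not wrapped in URLError.
        return f"UNREACHABLE {err}"
```

For example, `probe("https://sts.example.com" + METADATA_PATH)` run on an internal machine and again on a phone hotspot should tell you whether the federation servers themselves are down or only unreachable from the Internet. An `HTTP ...` result means some server answered; `UNREACHABLE` points at DNS, firewalls or the Internet link rather than at ADFS.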
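The list of common errors can be condensed into a small first-line triage helper. This Python sketch is not an official classification: the hint text paraphrases the guidance above, `parse_status` assumes IIS-style `status.substatus` codes such as `500.19`, and `retry_transient` assumes that roughly 15 minutes of waiting (three waits of five minutes) is enough for a rebooted server to rejoin the load balancer. All function names are hypothetical. 502 is included alongside 503 because Bad Gateway is formally HTTP 502.

```python
import time
from typing import Callable, Optional, Tuple

# First-line hints paraphrasing the guidance above (not an official
# Microsoft classification).
HINTS = {
    401: "Authentication: check the password, the Office 365 licence, "
         "and that the item has not been renamed or deleted.",
    403: "Authorisation: credentials were accepted; check the user's "
         "permissions within the application.",
    500: "Application error: check the HTTP logs, or escalate to the "
         "supplier if the system is hosted.",
    502: "Bad gateway: often a cloud server restarting after patching; wait and retry.",
    503: "Service unavailable: often a cloud server restarting after patching; wait and retry.",
}

# Status codes treated as "wait and try again".
TRANSIENT = (502, 503)


def parse_status(code: str) -> Tuple[int, Optional[int]]:
    """Split an IIS-style code such as '500.19' into (500, 19); '503' gives (503, None)."""
    main, _, sub = code.strip().partition(".")
    return int(main), (int(sub) if sub else None)


def triage(code: str) -> str:
    """Return a one-line triage hint for a status code the user reported."""
    status, sub = parse_status(code)
    hint = HINTS.get(status)
    if hint is None:
        hint = ("Other server error; check the logs." if 500 <= status <= 599
                else "No specific hint for this code.")
    if sub is not None:
        hint += f" Sub-status {sub} narrows the cause; look for it in the HTTP logs."
    return f"{status}: {hint}"


def retry_transient(fetch: Callable[[], int], attempts: int = 4,
                    wait_s: float = 300.0,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Call fetch() until it stops returning 502/503 or attempts run out.

    With the defaults this waits at most 3 x 5 minutes, i.e. about 15 minutes.
    """
    status = fetch()
    for _ in range(attempts - 1):
        if status not in TRANSIENT:
            break
        sleep(wait_s)
        status = fetch()
    return status
```

Here `fetch` is any function that returns the current HTTP status of the page the user is trying to reach. Passing `sleep` in as a parameter lets the retry loop be exercised without actually waiting 15 minutes.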
More in Part 2.