The company said this bug caused the defective software to deploy directly into its production environment, bypassing its normal validation process.
In a detailed post about the incident, Microsoft said customers encountered errors when performing authentication operations for all Microsoft and third-party applications and services that depend on Azure Active Directory for authentication.
Applications that used Azure B2C for authentication were also affected.
"Users who were not already authenticated to cloud services using Azure AD were more likely to experience issues and may have seen multiple authentication request failures corresponding to the average availability numbers shown below," Microsoft said. "These have been aggregated across different customers and workloads.
"Europe: 81% success rate for the duration of the incident.
"Americas: 17% success rate for the duration of the incident, improving to 37% just before mitigation.
"Asia: 72% success rate in the first 120 minutes of the incident. As business-hours peak traffic started, availability dropped to 32% at its lowest.
"Australia: 37% success rate for the duration of the incident."
It said service was restored for most customers by 00:23 UTC on 29 September (11.23am AEDT on 29 September), adding that infrequent authentication request failures continued after this and could have affected customers until 02:25 UTC (1.25pm AEDT on 29 September).
Explaining the bug in detail, Microsoft said: "Azure AD was designed to be a geo-distributed service deployed in an active-active configuration with multiple partitions across multiple data centres around the world, built with isolation boundaries.
"Normally, changes initially target a validation ring that contains no customer data, followed by an inner ring that contains Microsoft-only users, and lastly our production environment. These changes are deployed in phases across five rings over several days.
"In this case, the SDP system failed to correctly target the validation test ring due to a latent defect that impacted the system’s ability to interpret deployment metadata. Consequently, all rings were targeted concurrently. The incorrect deployment caused service availability to degrade.
"Within minutes of impact, we took steps to revert the change using automated rollback systems which would normally have limited the duration and severity of impact. However, the latent defect in our SDP system had corrupted the deployment metadata, and we had to resort to manual rollback processes. This significantly extended the time to mitigate the issue."
The company apologised for the incident. "We sincerely apologise for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure Platform and our processes to help ensure such incidents do not occur in the future," it said.
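The staged rollout Microsoft describes, and the way a metadata-handling flaw could cause every ring to be targeted at once, can be illustrated with a minimal sketch. This is not Microsoft's SDP code: the ring names, the `target_ring` metadata key and both resolver functions are hypothetical, and the "fall back to all rings" behaviour is only one assumed way such a latent defect might play out.

```python
# Hypothetical model of ring-based deployment targeting.
# Ring names and the "target_ring" key are illustrative assumptions,
# not details taken from Microsoft's post.
RINGS = ["validation", "inner", "prod-1", "prod-2", "prod-3"]


def resolve_targets_fail_open(metadata: dict) -> list:
    """Assumed defect: metadata that cannot be interpreted targets every ring."""
    ring = metadata.get("target_ring")
    if ring in RINGS:
        return [ring]
    # Latent defect (assumed): unknown or missing value falls through to all rings.
    return list(RINGS)


def resolve_targets_fail_closed(metadata: dict) -> list:
    """Safer variant: refuse to deploy when the target ring cannot be determined."""
    ring = metadata.get("target_ring")
    if ring not in RINGS:
        raise ValueError(f"cannot interpret deployment metadata: {metadata!r}")
    return [ring]


if __name__ == "__main__":
    good = {"target_ring": "validation"}
    bad = {"targetRing": "validation"}  # misnamed key, so the value is never read

    print(resolve_targets_fail_open(good))
    print(resolve_targets_fail_open(bad))
    try:
        resolve_targets_fail_closed(bad)
    except ValueError as err:
        print("blocked:", err)
```

In this sketch the fail-open resolver quietly widens the blast radius when it cannot read the metadata, while the fail-closed variant stops the deployment instead, which is the property a validation ring is meant to guarantee.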