Azure Platform Operations and AI Playbook (2026): Monitoring, IaC, DevOps, Recovery, Migration, and AI Services
Scenario
Your cloud center of excellence is unifying platform operations from observability to delivery pipelines and AI service adoption.
Scope
This article is updated for Azure platform guidance available as of May 18, 2026. It is intentionally implementation-focused, with practical CLI workflows, operational checks, and architecture reasoning you can use in production design reviews.
How to read this playbook
Use each section as a decision module. Start with workload shape, validate against security and operations constraints, deploy a proof-of-concept with Azure CLI, and finalize only after measurable verification. This avoids architecture decisions based on preference alone and gives your team a repeatable standard.
Cross-cutting decision framework
- Define workload behavior: bursty, steady, stateful, event-driven, or latency-sensitive.
- Define control requirements: platform-managed, partially managed, or full runtime control.
- Define resilience and recovery targets: RTO, RPO, and acceptable blast radius.
- Define governance boundaries: identity model, secrets handling, and policy enforcement.
- Define operational ownership: who patches, monitors, scales, and responds during incidents.
- Define cost model expectations: idle cost, burst cost, and growth path over 12 months.
Implementation baseline used in examples
- Region baseline: eastus for tutorial consistency
- Resource naming: short deterministic names for scriptability
- Security baseline: managed identities, least-privilege, and audit logs
- Validation baseline: deploy, load test, observe, rollback, and document
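The naming baseline can be sketched as a small shell helper. This is a minimal illustration, assuming a hypothetical convention of type prefix, workload tag, and year suffix; that convention is inferred from names such as lawops2026 used later in this playbook, and the function name is made up for the example.

```shell
#!/bin/sh
# Hypothetical naming helper: <prefix><workload><year>, e.g. lawops2026.
set -eu

LOCATION="eastus"       # region baseline from this playbook
RG="rg-ops-playbook"    # resource group used in the tutorials
WORKLOAD="ops"          # assumed workload tag
YEAR="2026"             # assumed suffix

resource_name() {
  # $1 is a short resource-type prefix such as law, appi, rsv, or mlw.
  printf '%s%s%s\n' "$1" "$WORKLOAD" "$YEAR"
}

resource_name law     # Log Analytics workspace
resource_name appi    # Application Insights component
resource_name rsv     # Recovery Services vault
```

Keeping the convention in one function means scripts and pipelines derive the same name for the same resource, which is what makes the later CLI flows repeatable.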
39) Azure Monitor or Application Insights
Decision context
When teams compare Azure Monitor and Application Insights, the usual failure mode is optimizing for a single metric such as raw latency or monthly cost. Keep in mind that Application Insights is a feature of Azure Monitor focused on application performance monitoring, so the decision is less about picking one product and more about where your telemetry standard is anchored. A durable Azure architecture needs to optimize for reliability model, operational maturity, security boundaries, release velocity, and failure containment. In production environments, this means you should decide early who owns runtime operations, what telemetry standard is mandatory, and how recovery targets are validated under incident pressure. For Monitoring workloads, this design discipline matters more than headline feature lists.
When Azure Monitor is the better anchor
Azure Monitor is usually the better anchor when your workload shape closely maps to its native control model. The strongest outcomes happen when platform teams align release workflows, scaling signals, and security policy with how the service was designed. In practice, this gives you lower cognitive load during operations, more predictable incident response, and cleaner governance reviews. You also reduce hidden coupling because your architecture matches the managed abstractions Azure already optimizes.
When Application Insights is the better anchor
Application Insights becomes the better anchor when your primary risk is tied to constraints that Azure Monitor does not solve elegantly. This can include specific protocol behavior, tenancy separation, deterministic deployment controls, or specialized tooling already used by your team. If your staff can operate Application Insights confidently and your change-management process is mature, choosing it can reduce long-term migration churn and prevent tactical workarounds from becoming permanent platform debt.
Practical tutorial
Use the following CLI flow to stand up a minimal proof-of-concept and test the assumptions before any platform-wide standard is declared.
az group create -n rg-ops-playbook -l eastus
az monitor log-analytics workspace create -g rg-ops-playbook -n lawops2026 -l eastus
az extension add --name application-insights
az monitor app-insights component create -g rg-ops-playbook -a appiops2026 -l eastus --kind web --application-type web --workspace lawops2026
After deployment, run a focused validation loop:
- Confirm security controls are attached and auditable.
- Validate scaling behavior under synthetic workload.
- Verify rollback steps are executable without portal-only actions.
- Capture baseline cost and performance metrics for a two-week window.
- Record operational friction points in a decision log.
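One concrete check for the first item is whether the Application Insights component is linked to a Log Analytics workspace. The sketch below assumes the CLI exposes the link as a workspaceResourceId field in the component output; verify that against your CLI version. The resource ID is a made-up sample so the logic runs offline.

```shell
set -eu

# Returns 0 when the given resource ID points at a Log Analytics workspace.
is_workspace_id() {
  lowered=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$lowered" in
    */providers/microsoft.operationalinsights/workspaces/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Against a live subscription (requires the application-insights extension):
#   ws_id=$(az monitor app-insights component show -g rg-ops-playbook \
#     -a appiops2026 --query workspaceResourceId -o tsv)
# The value below is a made-up sample, not a real subscription.
ws_id="/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-ops-playbook/providers/Microsoft.OperationalInsights/workspaces/lawops2026"

if is_workspace_id "$ws_id"; then
  echo "workspace-based"
else
  echo "no workspace linked"
fi
```

The comparison is lowercased because resource IDs can come back with different casing depending on which API returned them.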
Guardrails and anti-patterns
Common anti-patterns are building dual-service hybrids too early, skipping policy-as-code, and finalizing platform standards without realistic failure testing. Avoid making the decision in architecture diagrams only. Demand concrete evidence from load tests, deployment frequency analysis, and on-call playbooks. If two services look equivalent on paper, prefer the one your team can run safely at 2 AM during an incident.
Production recommendation
Treat this decision as an operating model decision, not only a feature decision. Document required capabilities, what you will not support, and the exception process. Then enforce the standard using templates, CI validation, and policy controls so project teams can move quickly without reopening the same design debate every sprint.
40) Log Analytics or Application Insights
Decision context
When teams compare Log Analytics and Application Insights, the failure mode is usually to optimize for only one metric such as raw latency or monthly cost. A durable Azure architecture needs to optimize for reliability model, operational maturity, security boundaries, release velocity, and failure containment. In production environments, this means you should decide early who owns runtime operations, what telemetry standard is mandatory, and how recovery targets are validated under incident pressure. For Monitoring workloads, this design discipline matters more than headline feature lists.
When Log Analytics is the better anchor
Log Analytics is usually the better anchor when your workload shape closely maps to its native control model. The strongest outcomes happen when platform teams align release workflows, scaling signals, and security policy with how the service was designed. In practice, this gives you lower cognitive load during operations, more predictable incident response, and cleaner governance reviews. You also reduce hidden coupling because your architecture matches the managed abstractions Azure already optimizes.
When Application Insights is the better anchor
Application Insights becomes the better anchor when your primary risk is tied to constraints that Log Analytics does not solve elegantly. This can include specific protocol behavior, tenancy separation, deterministic deployment controls, or specialized tooling already used by your team. If your staff can operate Application Insights confidently and your change-management process is mature, choosing it can reduce long-term migration churn and prevent tactical workarounds from becoming permanent platform debt.
Practical tutorial
Use the following CLI flow to stand up a minimal proof-of-concept and test the assumptions before any platform-wide standard is declared.
az monitor log-analytics workspace show -g rg-ops-playbook -n lawops2026
az monitor app-insights component show -g rg-ops-playbook -a appiops2026
After deployment, run a focused validation loop:
- Confirm security controls are attached and auditable.
- Validate scaling behavior under synthetic workload.
- Verify rollback steps are executable without portal-only actions.
- Capture baseline cost and performance metrics for a two-week window.
- Record operational friction points in a decision log.
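To capture a first performance baseline, you can query the workspace directly from the CLI. This is a sketch under stated assumptions: the component is workspace-based (so request telemetry lands in the AppRequests table), the log-analytics CLI extension may need to be installed, and WORKSPACE_GUID is a placeholder for the workspace customer ID. The script prints the command by default and runs it only when DRY_RUN=0.

```shell
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 prints the command, 0 executes it

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Assumed: the workspace customer ID (a GUID), not the workspace name.
WORKSPACE_GUID="${WORKSPACE_GUID:-<workspace-guid>}"

# Hourly request counts from the workspace-based Application Insights table.
QUERY='AppRequests | summarize requests = count() by bin(TimeGenerated, 1h)'

run az monitor log-analytics query \
  --workspace "$WORKSPACE_GUID" \
  --analytics-query "$QUERY" \
  --timespan P1D
```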
Guardrails and anti-patterns
Common anti-patterns are building dual-service hybrids too early, skipping policy-as-code, and finalizing platform standards without realistic failure testing. Avoid making the decision in architecture diagrams only. Demand concrete evidence from load tests, deployment frequency analysis, and on-call playbooks. If two services look equivalent on paper, prefer the one your team can run safely at 2 AM during an incident.
Production recommendation
Treat this decision as an operating model decision, not only a feature decision. Document required capabilities, what you will not support, and the exception process. Then enforce the standard using templates, CI validation, and policy controls so project teams can move quickly without reopening the same design debate every sprint.
41) Activity Log or Diagnostic Logs
Decision context
When teams compare Activity Log and Diagnostic Logs, the usual failure mode is optimizing for a single metric such as raw latency or monthly cost. The two cover different layers: the Activity Log records subscription-level control-plane events, while Diagnostic Logs (called resource logs in current Azure Monitor documentation) are emitted by individual resources and are collected only once a diagnostic setting routes them to a destination. A durable Azure architecture needs to optimize for reliability model, operational maturity, security boundaries, release velocity, and failure containment. In production environments, this means you should decide early who owns runtime operations, what telemetry standard is mandatory, and how recovery targets are validated under incident pressure. For Monitoring workloads, this design discipline matters more than headline feature lists.
When Activity Log is the better anchor
Activity Log is usually the better anchor when your workload shape closely maps to its native control model. The strongest outcomes happen when platform teams align release workflows, scaling signals, and security policy with how the service was designed. In practice, this gives you lower cognitive load during operations, more predictable incident response, and cleaner governance reviews. You also reduce hidden coupling because your architecture matches the managed abstractions Azure already optimizes.
When Diagnostic Logs is the better anchor
Diagnostic Logs becomes the better anchor when your primary risk is tied to constraints that Activity Log does not solve elegantly. This can include specific protocol behavior, tenancy separation, deterministic deployment controls, or specialized tooling already used by your team. If your staff can operate Diagnostic Logs confidently and your change-management process is mature, choosing it can reduce long-term migration churn and prevent tactical workarounds from becoming permanent platform debt.
Practical tutorial
Use the following CLI flow to stand up a minimal proof-of-concept and test the assumptions before any platform-wide standard is declared.
az monitor activity-log list --max-events 20
az monitor diagnostic-settings create --name send-to-law --resource <resource-id> --workspace <workspace-id> --logs '[{"category":"AuditEvent","enabled":true}]'
After deployment, run a focused validation loop:
- Confirm security controls are attached and auditable.
- Validate scaling behavior under synthetic workload.
- Verify rollback steps are executable without portal-only actions.
- Capture baseline cost and performance metrics for a two-week window.
- Record operational friction points in a decision log.
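The --logs argument of the diagnostic settings command expects a JSON array, and hand-written quoting is a common source of errors. The helper below is a minimal sketch that builds the array from category names; the function name is hypothetical, and the categories a resource supports vary by resource type.

```shell
set -eu

# Build the JSON array expected by --logs from a list of category names.
logs_json() {
  out=""
  for category in "$@"; do
    entry=$(printf '{"category":"%s","enabled":true}' "$category")
    if [ -n "$out" ]; then
      out="$out,$entry"
    else
      out="$entry"
    fi
  done
  printf '[%s]\n' "$out"
}

# To see which categories a resource supports (live subscription needed):
#   az monitor diagnostic-settings categories list --resource <resource-id>
logs_json AuditEvent
```

The result can be passed as --logs "$(logs_json AuditEvent)", which avoids escaping quotes by hand.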
Guardrails and anti-patterns
Common anti-patterns are building dual-service hybrids too early, skipping policy-as-code, and finalizing platform standards without realistic failure testing. Avoid making the decision in architecture diagrams only. Demand concrete evidence from load tests, deployment frequency analysis, and on-call playbooks. If two services look equivalent on paper, prefer the one your team can run safely at 2 AM during an incident.
Production recommendation
Treat this decision as an operating model decision, not only a feature decision. Document required capabilities, what you will not support, and the exception process. Then enforce the standard using templates, CI validation, and policy controls so project teams can move quickly without reopening the same design debate every sprint.
42) ARM Templates or Bicep
Decision context
When teams compare ARM Templates and Bicep, the usual failure mode is optimizing for a single metric such as authoring speed or tooling familiarity. Bicep compiles to ARM template JSON and both deploy through Azure Resource Manager, so the decision is mostly about authoring and review experience rather than runtime behavior. A durable Azure architecture needs to optimize for reliability model, operational maturity, security boundaries, release velocity, and failure containment. In production environments, this means you should decide early who owns runtime operations, what telemetry standard is mandatory, and how recovery targets are validated under incident pressure. For IaC workloads, this design discipline matters more than headline feature lists.
When ARM Templates is the better anchor
ARM Templates is usually the better anchor when your workload shape closely maps to its native control model. The strongest outcomes happen when platform teams align release workflows, scaling signals, and security policy with how the service was designed. In practice, this gives you lower cognitive load during operations, more predictable incident response, and cleaner governance reviews. You also reduce hidden coupling because your architecture matches the managed abstractions Azure already optimizes.
When Bicep is the better anchor
Bicep becomes the better anchor when your primary risk is tied to constraints that ARM Templates does not solve elegantly. This can include specific protocol behavior, tenancy separation, deterministic deployment controls, or specialized tooling already used by your team. If your staff can operate Bicep confidently and your change-management process is mature, choosing it can reduce long-term migration churn and prevent tactical workarounds from becoming permanent platform debt.
Practical tutorial
Use the following CLI flow to stand up a minimal proof-of-concept and test the assumptions before any platform-wide standard is declared.
az deployment group create -g rg-ops-playbook --template-file ./infra/main.json --parameters ./infra/main.parameters.json
az deployment group create -g rg-ops-playbook --template-file ./infra/main.bicep
After deployment, run a focused validation loop:
- Confirm security controls are attached and auditable.
- Validate scaling behavior under synthetic workload.
- Verify rollback steps are executable without portal-only actions.
- Capture baseline cost and performance metrics for a two-week window.
- Record operational friction points in a decision log.
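Before either deployment is applied to a shared environment, a what-if preview shows the changes Azure Resource Manager would make. The sketch below reuses the template paths from this section and prints the commands by default; set DRY_RUN=0 to execute them. The preview function name is illustrative.

```shell
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 prints the command, 0 executes it
RG="rg-ops-playbook"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

preview() {
  # $1: template path (.json or .bicep); prints or runs a what-if preview.
  run az deployment group what-if -g "$RG" --template-file "$1"
}

# Compile Bicep to ARM JSON to inspect what is actually submitted.
run az bicep build --file ./infra/main.bicep

preview ./infra/main.bicep
```

Running the same preview against the JSON and Bicep versions of a template is a quick way to confirm that both describe the same resources.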
Guardrails and anti-patterns
Common anti-patterns are building dual-service hybrids too early, skipping policy-as-code, and finalizing platform standards without realistic failure testing. Avoid making the decision in architecture diagrams only. Demand concrete evidence from load tests, deployment frequency analysis, and on-call playbooks. If two services look equivalent on paper, prefer the one your team can run safely at 2 AM during an incident.
Production recommendation
Treat this decision as an operating model decision, not only a feature decision. Document required capabilities, what you will not support, and the exception process. Then enforce the standard using templates, CI validation, and policy controls so project teams can move quickly without reopening the same design debate every sprint.
43) Bicep or Terraform (on Azure)
Decision context
When teams compare Bicep and Terraform (on Azure), the failure mode is usually to optimize for only one metric such as raw latency or monthly cost. A durable Azure architecture needs to optimize for reliability model, operational maturity, security boundaries, release velocity, and failure containment. In production environments, this means you should decide early who owns runtime operations, what telemetry standard is mandatory, and how recovery targets are validated under incident pressure. For IaC workloads, this design discipline matters more than headline feature lists.
When Bicep is the better anchor
Bicep is usually the better anchor when your workload shape closely maps to its native control model. The strongest outcomes happen when platform teams align release workflows, scaling signals, and security policy with how the service was designed. In practice, this gives you lower cognitive load during operations, more predictable incident response, and cleaner governance reviews. You also reduce hidden coupling because your architecture matches the managed abstractions Azure already optimizes.
When Terraform (on Azure) is the better anchor
Terraform (on Azure) becomes the better anchor when your primary risk is tied to constraints that Bicep does not solve elegantly. This can include specific protocol behavior, tenancy separation, deterministic deployment controls, or specialized tooling already used by your team. If your staff can operate Terraform (on Azure) confidently and your change-management process is mature, choosing it can reduce long-term migration churn and prevent tactical workarounds from becoming permanent platform debt.
Practical tutorial
Use the following CLI flow to stand up a minimal proof-of-concept and test the assumptions before any platform-wide standard is declared.
az deployment group create -g rg-ops-playbook --template-file ./infra/platform.bicep
terraform init
terraform plan -out=tfplan
terraform apply tfplan
After deployment, run a focused validation loop:
- Confirm security controls are attached and auditable.
- Validate scaling behavior under synthetic workload.
- Verify rollback steps are executable without portal-only actions.
- Capture baseline cost and performance metrics for a two-week window.
- Record operational friction points in a decision log.
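For CI validation, terraform plan supports a -detailed-exitcode flag that distinguishes "no changes" from "changes pending". The sketch below maps those exit codes to readable statuses; the function name is hypothetical, and the live usage is left in comments so the example runs without Terraform installed.

```shell
set -eu

# Map `terraform plan -detailed-exitcode` results to a readable status.
plan_status() {
  case "$1" in
    0) echo "no-changes" ;;
    1) echo "error" ;;
    2) echo "changes-pending" ;;
    *) echo "unknown" ;;
  esac
}

# Live usage (commented out so the sketch runs without Terraform):
#   rc=0
#   terraform plan -detailed-exitcode -out=tfplan || rc=$?
#   if [ "$(plan_status "$rc")" = "changes-pending" ]; then
#     terraform apply tfplan
#   fi
plan_status 2
```

A pipeline can use the same mapping as a drift check: a scheduled plan that reports changes-pending on an untouched branch means the environment no longer matches the code.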
Guardrails and anti-patterns
Common anti-patterns are building dual-service hybrids too early, skipping policy-as-code, and finalizing platform standards without realistic failure testing. Avoid making the decision in architecture diagrams only. Demand concrete evidence from load tests, deployment frequency analysis, and on-call playbooks. If two services look equivalent on paper, prefer the one your team can run safely at 2 AM during an incident.
Production recommendation
Treat this decision as an operating model decision, not only a feature decision. Document required capabilities, what you will not support, and the exception process. Then enforce the standard using templates, CI validation, and policy controls so project teams can move quickly without reopening the same design debate every sprint.
44) Azure DevOps or GitHub Actions
Decision context
When teams compare Azure DevOps and GitHub Actions, the failure mode is usually to optimize for only one metric such as raw latency or monthly cost. A durable Azure architecture needs to optimize for reliability model, operational maturity, security boundaries, release velocity, and failure containment. In production environments, this means you should decide early who owns runtime operations, what telemetry standard is mandatory, and how recovery targets are validated under incident pressure. For DevOps workloads, this design discipline matters more than headline feature lists.
When Azure DevOps is the better anchor
Azure DevOps is usually the better anchor when your workload shape closely maps to its native control model. The strongest outcomes happen when platform teams align release workflows, scaling signals, and security policy with how the service was designed. In practice, this gives you lower cognitive load during operations, more predictable incident response, and cleaner governance reviews. You also reduce hidden coupling because your architecture matches the managed abstractions Azure already optimizes.
When GitHub Actions is the better anchor
GitHub Actions becomes the better anchor when your primary risk is tied to constraints that Azure DevOps does not solve elegantly. This can include specific protocol behavior, tenancy separation, deterministic deployment controls, or specialized tooling already used by your team. If your staff can operate GitHub Actions confidently and your change-management process is mature, choosing it can reduce long-term migration churn and prevent tactical workarounds from becoming permanent platform debt.
Practical tutorial
Use the following CLI flow to stand up a minimal proof-of-concept and test the assumptions before any platform-wide standard is declared.
az extension add --name azure-devops
az devops configure --defaults organization=https://dev.azure.com/<org> project=<project>
az pipelines create --name app-ci --repository https://github.com/<org>/<repo> --branch main --yml-path azure-pipelines.yml
After deployment, run a focused validation loop:
- Confirm security controls are attached and auditable.
- Validate scaling behavior under synthetic workload.
- Verify rollback steps are executable without portal-only actions.
- Capture baseline cost and performance metrics for a two-week window.
- Record operational friction points in a decision log.
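To compare the two platforms on the same change, you can trigger a run on each from the command line. This is a sketch under assumptions: the GitHub CLI (gh) is installed and authenticated, and ci.yml is a hypothetical workflow file name that does not appear in this playbook. Commands are printed by default and executed only when DRY_RUN=0.

```shell
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 prints the command, 0 executes it

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Queue the Azure Pipelines definition created in this section.
run az pipelines run --name app-ci --branch main

# Dispatch the equivalent GitHub Actions workflow (hypothetical file name).
run gh workflow run ci.yml --ref main
```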
Guardrails and anti-patterns
Common anti-patterns are building dual-service hybrids too early, skipping policy-as-code, and finalizing platform standards without realistic failure testing. Avoid making the decision in architecture diagrams only. Demand concrete evidence from load tests, deployment frequency analysis, and on-call playbooks. If two services look equivalent on paper, prefer the one your team can run safely at 2 AM during an incident.
Production recommendation
Treat this decision as an operating model decision, not only a feature decision. Document required capabilities, what you will not support, and the exception process. Then enforce the standard using templates, CI validation, and policy controls so project teams can move quickly without reopening the same design debate every sprint.
45) Azure Site Recovery or Azure Backup
Decision context
When teams compare Azure Site Recovery and Azure Backup, the failure mode is usually to optimize for only one metric such as raw latency or monthly cost. A durable Azure architecture needs to optimize for reliability model, operational maturity, security boundaries, release velocity, and failure containment. In production environments, this means you should decide early who owns runtime operations, what telemetry standard is mandatory, and how recovery targets are validated under incident pressure. For BCDR workloads, this design discipline matters more than headline feature lists.
When Azure Site Recovery is the better anchor
Azure Site Recovery is usually the better anchor when your workload shape closely maps to its native control model. The strongest outcomes happen when platform teams align release workflows, scaling signals, and security policy with how the service was designed. In practice, this gives you lower cognitive load during operations, more predictable incident response, and cleaner governance reviews. You also reduce hidden coupling because your architecture matches the managed abstractions Azure already optimizes.
When Azure Backup is the better anchor
Azure Backup becomes the better anchor when your primary risk is tied to constraints that Azure Site Recovery does not solve elegantly. This can include specific protocol behavior, tenancy separation, deterministic deployment controls, or specialized tooling already used by your team. If your staff can operate Azure Backup confidently and your change-management process is mature, choosing it can reduce long-term migration churn and prevent tactical workarounds from becoming permanent platform debt.
Practical tutorial
Use the following CLI flow to stand up a minimal proof-of-concept and test the assumptions before any platform-wide standard is declared.
az backup vault create -g rg-ops-playbook -n rsvops2026 -l eastus
az backup vault backup-properties set -g rg-ops-playbook -n rsvops2026 --soft-delete-feature-state Enable
After deployment, run a focused validation loop:
- Confirm security controls are attached and auditable.
- Validate scaling behavior under synthetic workload.
- Verify rollback steps are executable without portal-only actions.
- Capture baseline cost and performance metrics for a two-week window.
- Record operational friction points in a decision log.
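Recovery targets only count once a drill has measured them. The following sketch compares measured durations with RTO and RPO targets; all numbers are hypothetical placeholders, and the function names are made up for the example.

```shell
set -eu

# Returns 0 when a measured duration (minutes) is within its target.
within_target() {
  [ "$1" -le "$2" ]
}

report() {
  # $1 label, $2 measured minutes, $3 target minutes
  if within_target "$2" "$3"; then
    echo "$1: PASS ($2 <= $3 min)"
  else
    echo "$1: FAIL ($2 > $3 min)"
  fi
}

# Hypothetical drill results; replace with numbers from your own tests.
report "RTO (failover drill)" 45 60
report "RPO (last recovery point age)" 20 15
```

Recording these results in the decision log gives the Site Recovery versus Backup choice an evidence trail instead of an assumption.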
Guardrails and anti-patterns
Common anti-patterns are building dual-service hybrids too early, skipping policy-as-code, and finalizing platform standards without realistic failure testing. Avoid making the decision in architecture diagrams only. Demand concrete evidence from load tests, deployment frequency analysis, and on-call playbooks. If two services look equivalent on paper, prefer the one your team can run safely at 2 AM during an incident.
Production recommendation
Treat this decision as an operating model decision, not only a feature decision. Document required capabilities, what you will not support, and the exception process. Then enforce the standard using templates, CI validation, and policy controls so project teams can move quickly without reopening the same design debate every sprint.
46) Azure Migrate or Azure Site Recovery
Decision context
When teams compare Azure Migrate and Azure Site Recovery, the failure mode is usually to optimize for only one metric such as raw latency or monthly cost. A durable Azure architecture needs to optimize for reliability model, operational maturity, security boundaries, release velocity, and failure containment. In production environments, this means you should decide early who owns runtime operations, what telemetry standard is mandatory, and how recovery targets are validated under incident pressure. For Migration workloads, this design discipline matters more than headline feature lists.
When Azure Migrate is the better anchor
Azure Migrate is usually the better anchor when your workload shape closely maps to its native control model. The strongest outcomes happen when platform teams align release workflows, scaling signals, and security policy with how the service was designed. In practice, this gives you lower cognitive load during operations, more predictable incident response, and cleaner governance reviews. You also reduce hidden coupling because your architecture matches the managed abstractions Azure already optimizes.
When Azure Site Recovery is the better anchor
Azure Site Recovery becomes the better anchor when your primary risk is tied to constraints that Azure Migrate does not solve elegantly. This can include specific protocol behavior, tenancy separation, deterministic deployment controls, or specialized tooling already used by your team. If your staff can operate Azure Site Recovery confidently and your change-management process is mature, choosing it can reduce long-term migration churn and prevent tactical workarounds from becoming permanent platform debt.
Practical tutorial
Use the following CLI flow to stand up a minimal proof-of-concept and test the assumptions before any platform-wide standard is declared.
az resource create -g rg-ops-playbook -n migrateproj2026 --resource-type Microsoft.Migrate/migrateProjects --api-version 2023-01-01 --location eastus --properties '{"publicNetworkAccess":"Enabled"}'
az backup vault show -g rg-ops-playbook -n rsvops2026
After deployment, run a focused validation loop:
- Confirm security controls are attached and auditable.
- Validate scaling behavior under synthetic workload.
- Verify rollback steps are executable without portal-only actions.
- Capture baseline cost and performance metrics for a two-week window.
- Record operational friction points in a decision log.
Guardrails and anti-patterns
Common anti-patterns are building dual-service hybrids too early, skipping policy-as-code, and finalizing platform standards without realistic failure testing. Avoid making the decision in architecture diagrams only. Demand concrete evidence from load tests, deployment frequency analysis, and on-call playbooks. If two services look equivalent on paper, prefer the one your team can run safely at 2 AM during an incident.
Production recommendation
Treat this decision as an operating model decision, not only a feature decision. Document required capabilities, what you will not support, and the exception process. Then enforce the standard using templates, CI validation, and policy controls so project teams can move quickly without reopening the same design debate every sprint.
47) Azure OpenAI or Azure Cognitive Services
Decision context
When teams compare Azure OpenAI and Azure Cognitive Services, the usual failure mode is optimizing for a single metric such as raw latency or monthly cost. (Azure Cognitive Services was rebranded as Azure AI services in 2023; this section keeps the older name to match the CLI resource kind used below.) A durable Azure architecture needs to optimize for reliability model, operational maturity, security boundaries, release velocity, and failure containment. In production environments, this means you should decide early who owns runtime operations, what telemetry standard is mandatory, and how recovery targets are validated under incident pressure. For AI/ML workloads, this design discipline matters more than headline feature lists.
When Azure OpenAI is the better anchor
Azure OpenAI is usually the better anchor when your workload shape closely maps to its native control model. The strongest outcomes happen when platform teams align release workflows, scaling signals, and security policy with how the service was designed. In practice, this gives you lower cognitive load during operations, more predictable incident response, and cleaner governance reviews. You also reduce hidden coupling because your architecture matches the managed abstractions Azure already optimizes.
When Azure Cognitive Services is the better anchor
Azure Cognitive Services becomes the better anchor when your primary risk is tied to constraints that Azure OpenAI does not solve elegantly. This can include specific protocol behavior, tenancy separation, deterministic deployment controls, or specialized tooling already used by your team. If your staff can operate Azure Cognitive Services confidently and your change-management process is mature, choosing it can reduce long-term migration churn and prevent tactical workarounds from becoming permanent platform debt.
Practical tutorial
Use the following CLI flow to stand up a minimal proof-of-concept and test the assumptions before any platform-wide standard is declared.
az cognitiveservices account create -g rg-ops-playbook -n aoaiplaybook2026 -l eastus --kind OpenAI --sku S0
az cognitiveservices account create -g rg-ops-playbook -n aiservicesplaybook2026 -l eastus --kind CognitiveServices --sku S0
After deployment, run a focused validation loop:
- Confirm security controls are attached and auditable.
- Validate scaling behavior under synthetic workload.
- Verify rollback steps are executable without portal-only actions.
- Capture baseline cost and performance metrics for a two-week window.
- Record operational friction points in a decision log.
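For the security check, one useful signal is whether key-based (local) authentication is disabled on each account, which fits the managed identity baseline. The sketch below assumes the account resource exposes properties.disableLocalAuth and properties.endpoint; confirm the field names against your CLI output. Commands are printed by default and executed only when DRY_RUN=0.

```shell
set -eu

DRY_RUN="${DRY_RUN:-1}"   # 1 prints the command, 0 executes it
RG="rg-ops-playbook"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

check_account() {
  # $1: account name; shows the endpoint and the local-auth setting.
  run az cognitiveservices account show -g "$RG" -n "$1" \
    --query "{endpoint:properties.endpoint,localAuthDisabled:properties.disableLocalAuth}"
}

for account in aoaiplaybook2026 aiservicesplaybook2026; do
  check_account "$account"
done
```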
Guardrails and anti-patterns
Common anti-patterns are building dual-service hybrids too early, skipping policy-as-code, and finalizing platform standards without realistic failure testing. Avoid making the decision in architecture diagrams only. Demand concrete evidence from load tests, deployment frequency analysis, and on-call playbooks. If two services look equivalent on paper, prefer the one your team can run safely at 2 AM during an incident.
Production recommendation
Treat this decision as an operating model decision, not only a feature decision. Document required capabilities, what you will not support, and the exception process. Then enforce the standard using templates, CI validation, and policy controls so project teams can move quickly without reopening the same design debate every sprint.
48) Azure ML or Azure OpenAI
Decision context
When teams compare Azure ML and Azure OpenAI, the failure mode is usually to optimize for only one metric such as raw latency or monthly cost. A durable Azure architecture needs to optimize for reliability model, operational maturity, security boundaries, release velocity, and failure containment. In production environments, this means you should decide early who owns runtime operations, what telemetry standard is mandatory, and how recovery targets are validated under incident pressure. For AI/ML workloads, this design discipline matters more than headline feature lists.
When Azure ML is the better anchor
Azure ML is usually the better anchor when your workload maps closely to its native control model: a workspace in which your team trains, registers, and deploys its own models on compute it manages. The strongest outcomes come when platform teams align release workflows, scaling signals, and security policy with how the service was designed. In practice, this lowers cognitive load during operations, makes incident response more predictable, and simplifies governance reviews. It also reduces hidden coupling, because the architecture matches the managed abstractions Azure already optimizes.
When Azure OpenAI is the better anchor
Azure OpenAI becomes the better anchor when your primary risk is tied to constraints that Azure ML does not solve elegantly, most commonly the need to consume hosted foundation models through a managed API rather than train and serve your own. Other such constraints include specific protocol behavior, tenancy separation, deterministic deployment controls, or specialized tooling your team already uses. If your staff can operate Azure OpenAI confidently and your change-management process is mature, choosing it can reduce long-term migration churn and prevent tactical workarounds from becoming permanent platform debt.
Practical tutorial
Use the following CLI flow to stand up a minimal proof-of-concept and test the assumptions before any platform-wide standard is declared.
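# Install the Azure ML CLI extension (required for the az ml commands)
az extension add --name ml
# Create the Azure ML workspace for the proof-of-concept
az ml workspace create -g rg-ops-playbook -n mlwops2026 -l eastus
# Confirm that the Azure OpenAI account created in section 47 is available for comparison
az cognitiveservices account show -g rg-ops-playbook -n aoaiplaybook2026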
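Before starting the validation loop, it helps to confirm that provisioning has actually finished. The following is a minimal sketch of a polling helper, assuming the wrapped command prints a provisioning state and that `Succeeded` is the value you are waiting for. The function name `wait_for_succeeded` and the retry counts are illustrative.

```shell
# wait_for_succeeded ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it prints "Succeeded" or the attempts run out.
wait_for_succeeded() {
  wfs_attempts=$1
  wfs_delay=$2
  shift 2
  wfs_try=1
  while [ "$wfs_try" -le "$wfs_attempts" ]; do
    # A failing command is treated as "state unknown" rather than aborting.
    wfs_state=$("$@" 2>/dev/null) || wfs_state=""
    if [ "$wfs_state" = "Succeeded" ]; then
      echo "ready after $wfs_try attempt(s)"
      return 0
    fi
    echo "attempt $wfs_try: state is '${wfs_state:-unknown}'"
    wfs_try=$((wfs_try + 1))
    # Sleep only when another attempt will follow.
    if [ "$wfs_try" -le "$wfs_attempts" ]; then
      sleep "$wfs_delay"
    fi
  done
  return 1
}

# Example wiring (assumes an authenticated az session):
# wait_for_succeeded 10 15 az cognitiveservices account show \
#   -g rg-ops-playbook -n aoaiplaybook2026 \
#   --query properties.provisioningState -o tsv
```

Keeping the polled command as an argument means the same helper can wrap any CLI call that reports a state, so the proof-of-concept script stays free of portal-only steps.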
After deployment, run a focused validation loop:
- Confirm security controls are attached and auditable.
- Validate scaling behavior under synthetic workload.
- Verify rollback steps are executable without portal-only actions.
- Capture baseline cost and performance metrics for a two-week window.
- Record operational friction points in a decision log.
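For the performance baseline, a single comparable number per run makes the two-week window easier to track. Here is a minimal sketch that computes a nearest-rank 95th percentile from latency samples, assuming your load-test tool can emit one numeric sample per line. The function name `p95` and the file name in the usage comment are hypothetical.

```shell
# p95 reads one numeric latency sample per line on stdin and prints the
# nearest-rank 95th percentile. Returns non-zero when there are no samples.
p95() {
  sort -n | awk '
    NF { sample[++n] = $1 }
    END {
      if (n == 0) exit 1
      rank = int((n * 95 + 99) / 100)   # ceil(0.95 * n) via integer arithmetic
      print sample[rank]
    }'
}

# Example wiring (latencies_ms.txt is a hypothetical export from your load test):
# p95 < latencies_ms.txt
```

Nearest-rank is used here because it always returns an observed sample, which keeps the number easy to explain in a decision log; an interpolating percentile would be an equally valid choice as long as both services are measured the same way.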
Guardrails and anti-patterns
Common anti-patterns include building dual-service hybrids too early, skipping policy-as-code, and finalizing platform standards without realistic failure testing. Do not make the decision from architecture diagrams alone. Demand concrete evidence from load tests, deployment frequency analysis, and on-call playbooks. If two services look equivalent on paper, prefer the one your team can run safely at 2 AM during an incident.
Production recommendation
Treat this decision as an operating model decision, not only a feature decision. Document required capabilities, what you will not support, and the exception process. Then enforce the standard using templates, CI validation, and policy controls so project teams can move quickly without reopening the same design debate every sprint.
End-to-end validation flow
After completing the pair-level proofs, run a final integrated user journey in a non-production subscription. Validate provisioning speed, deployment rollback, observability completeness, incident simulation, and teardown hygiene. Architecture decisions are only complete when the full path from deployment to failure recovery has been tested and documented.
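The integrated journey described above can be driven by a small runner that executes stages in order, stops at the first failure, and still performs teardown. This is a minimal sketch under the assumption that each stage is a shell function or command you define yourself; `run_journey` and the stage names in the usage comment are hypothetical.

```shell
# run_journey TEARDOWN STAGE [STAGE...]
# Runs each STAGE in order, stops at the first failure, and always runs
# TEARDOWN afterwards. Returns non-zero if any stage or the teardown failed.
run_journey() {
  rj_teardown=$1
  shift
  rj_rc=0
  for rj_stage in "$@"; do
    echo "== $rj_stage"
    if ! "$rj_stage"; then
      echo "FAILED: $rj_stage"
      rj_rc=1
      break
    fi
  done
  # Teardown hygiene: clean up even when an earlier stage failed.
  echo "== $rj_teardown"
  "$rj_teardown" || rj_rc=1
  return "$rj_rc"
}

# Example wiring (each name is a function you would define for your environment):
# run_journey teardown provision deploy rollback check_observability simulate_incident
```

Running teardown unconditionally is a deliberate choice for non-production subscriptions, where leftover resources distort the cost baseline. If you need to inspect a failed environment, pass a teardown function that skips deletion when a debug flag is set.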
Security, operations, and cost checklist
- Enforce least privilege on all deployment identities.
- Capture audit evidence for every control-plane change.
- Enable standardized logging and alert routing before go-live.
- Define rollback scripts and test them monthly.
- Pin module and API versions in IaC to reduce drift.
- Track cost by environment and workload tags.
- Keep a service exception process with explicit owner sign-off.
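The cost-tracking item in this checklist depends on tags being present in the first place. Here is a minimal sketch of a tag completeness check, assuming tags have already been flattened into space-separated `key=value` pairs. The required keys (`env`, `workload`, `owner`) and the function name `missing_tags` are assumptions for illustration, and how you export tags into that form is left open.

```shell
# Assumed tag keys for illustration; substitute your organization's standard.
REQUIRED_TAGS="env workload owner"

# missing_tags "key=value key=value ..."
# Prints the required keys that are absent (space-separated) and
# returns 0 only when nothing is missing.
missing_tags() {
  mt_missing=""
  for mt_key in $REQUIRED_TAGS; do
    case " $1 " in
      *" $mt_key="*) ;;                        # key present
      *) mt_missing="$mt_missing $mt_key" ;;   # key absent
    esac
  done
  # Strip the leading space before printing.
  echo "${mt_missing# }"
  [ -z "$mt_missing" ]
}

# Example:
# missing_tags "env=dev workload=ai"    # prints: owner
```

A check like this fits naturally into the same CI stage that validates templates, so untagged resources are caught before they reach a subscription where cost reporting relies on them.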
References
- https://learn.microsoft.com/en-us/azure/azure-monitor/overview
- https://learn.microsoft.com/en-us/azure/azure-monitor/app/app-insights-overview
- https://learn.microsoft.com/en-us/azure/azure-monitor/essentials/activity-log
- https://learn.microsoft.com/en-us/azure/azure-resource-manager/templates/overview
- https://learn.microsoft.com/en-us/azure/azure-resource-manager/bicep/overview
- https://learn.microsoft.com/en-us/azure/developer/terraform/overview
- https://learn.microsoft.com/en-us/azure/devops/user-guide/services
- https://learn.microsoft.com/en-us/cli/azure/pipelines
- https://learn.microsoft.com/en-us/azure/site-recovery/site-recovery-overview
- https://learn.microsoft.com/en-us/azure/backup/
- https://learn.microsoft.com/en-us/azure/templates/Microsoft.Migrate/migrateprojects
- https://learn.microsoft.com/en-us/azure/ai-foundry/openai/overview
- https://learn.microsoft.com/en-us/azure/ai-studio/concepts/what-are-ai-services
- https://learn.microsoft.com/en-us/azure/machine-learning/overview-what-is-azure-machine-learning
- https://learn.microsoft.com/en-us/azure/
- https://learn.microsoft.com/en-us/cli/azure/
- https://learn.microsoft.com/en-us/azure/architecture/