AWS Config and SSM Automation: Enforcing Immutable Infrastructure Against CI/CD Drift

Quick Summary ⚡️

Configuration drift, the deviation of a deployed resource's actual state from its defined Infrastructure as Code (IaC) template, is a critical risk in modern CI/CD pipelines. This guide focuses on using AWS Config Rules for continuous, policy-driven detection and AWS Config remediation actions for automated correction. We detail how to use the built-in CLOUDFORMATION_STACK_DRIFT_DETECTION_CHECK rule, implement custom rules via AWS Lambda or AWS CloudFormation Guard policies to police CI/CD-specific security settings (e.g., security group ingress), and integrate AWS Systems Manager (SSM) Automation to execute rollback or synchronization workflows. This approach turns drift detection from a reactive audit function into a continuously enforced control, so that the Git repository remains the single, immutable source of truth for production infrastructure.


Introduction: The CI/CD Drift Paradox

In a world governed by Infrastructure as Code (IaC), the CI/CD pipeline is the only sanctioned path to production. Yet configuration drift remains a prevalent, insidious problem. Drift occurs when the actual configuration of an AWS resource (say, an S3 bucket policy or a Lambda function's memory limit) diverges from the configuration defined in its source code (e.g., Terraform or CloudFormation). While manual "click-ops" during an emergency is a major contributor, CI/CD pipelines themselves can inadvertently introduce drift through:

  • Emergency Break-Glass Procedures: A necessary manual fix during an outage that is not immediately codified back into IaC.
  • Asynchronous Service Changes: AWS services (like RDS applying a maintenance update or Auto Scaling Groups modifying instance count) making changes outside the IaC tool's state file.
  • Conflicting Automation: Auxiliary tools or legacy scripts modifying resources that the main CI/CD pipeline owns.
  • Improper Pipeline Permissions: Pipelines with permissions broad enough to bypass explicit IaC constraints (a critical security failure).

The goal is to move beyond simply alerting on drift. We must implement a closed-loop system where drift is not just detected but automatically and surgically corrected, reverting the infrastructure to the codified state. This is where the combination of AWS Config Rules and AWS Systems Manager Automation comes into play.


Phase 1: Continuous Drift Detection with AWS Config Rules

AWS Config continuously monitors and records resource configurations. Its core value lies in the Config Rule, a mechanism that compares a resource's actual state (the Configuration Item) against a desired policy. When a deviation is found, the resource is flagged as NON_COMPLIANT, which can in turn trigger a notification or a remediation action.


[Figure: AWS Config detecting configuration drift by comparing the IaC source of truth (Git) against the live configuration]

Leveraging Managed Rules for IaC Stack Drift

For infrastructure provisioned via AWS CloudFormation, the starting point is the built-in managed rule: CLOUDFORMATION_STACK_DRIFT_DETECTION_CHECK. This rule periodically invokes CloudFormation's native drift detection functionality against a stack. While it's an excellent overall drift detector, its evaluation frequency is limited. For critical, CI/CD-induced drift, we need faster, more targeted checks to minimize the window of non-compliance.



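# AWS CLI command to enable the CloudFormation drift detection rule.
# InputParameters must be a JSON-encoded string, and JSON cannot carry inline comments.
# The role ARN below is a placeholder for a role the rule can use to run drift detection.
# MaximumExecutionFrequency sets the periodic schedule (One_Hour up to TwentyFour_Hours);
# adjust it for the cost/risk trade-off.
aws configservice put-config-rule --config-rule '{
  "ConfigRuleName": "CLOUDFORMATION_STACK_DRIFT_DETECTION_CHECK",
  "Scope": {
    "ComplianceResourceTypes": ["AWS::CloudFormation::Stack"]
  },
  "Source": {
    "Owner": "AWS",
    "SourceIdentifier": "CLOUDFORMATION_STACK_DRIFT_DETECTION_CHECK"
  },
  "MaximumExecutionFrequency": "TwentyFour_Hours",
  "InputParameters": "{\"cloudformationRoleArn\": \"arn:aws:iam::111122223333:role/ConfigDriftDetectionRole\"}"
}'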
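
Where the managed rule's schedule is too slow for a critical stack, a pipeline stage or scheduled job can also call CloudFormation's drift detection API directly. The following is a minimal sketch, assuming a boto3 CloudFormation client is passed in as cfn; the function name, polling interval, and timeout are illustrative choices, not part of the original setup.

```python
import time


def detect_stack_drift(cfn, stack_name, poll_seconds=5, timeout_seconds=300):
    """Run drift detection on one stack and return its drifted resources.

    `cfn` is expected to behave like boto3.client("cloudformation").
    Returns a list of (logical_resource_id, drift_status) tuples.
    """
    detection_id = cfn.detect_stack_drift(StackName=stack_name)["StackDriftDetectionId"]

    # Drift detection is asynchronous, so poll until it leaves IN_PROGRESS.
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = cfn.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Drift detection for {stack_name} did not finish in time")
        time.sleep(poll_seconds)

    if status["DetectionStatus"] == "DETECTION_FAILED":
        raise RuntimeError(status.get("DetectionStatusReason", "drift detection failed"))

    if status.get("StackDriftStatus") != "DRIFTED":
        return []

    # Only the first page of results is read here; large stacks need NextToken handling.
    drifts = cfn.describe_stack_resource_drifts(
        StackName=stack_name,
        StackResourceDriftStatusFilters=["MODIFIED", "DELETED"],
    )
    return [
        (d["LogicalResourceId"], d["StackResourceDriftStatus"])
        for d in drifts["StackResourceDrifts"]
    ]
```

A pipeline could call detect_stack_drift(boto3.client("cloudformation"), "my-stack") before each deployment and fail the stage when the returned list is non-empty; the stack name here is hypothetical.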


Custom Rules for Granular CI/CD Policy Enforcement

The real power in controlling CI/CD pipelines comes from Custom Rules. These rules allow us to enforce specific security or operational policies that, if violated, indicate an unauthorized manual change (drift). Custom rules can be implemented using two methods, chosen based on the complexity and scope of the policy:


Rule Type | Mechanism | Best Use Case | Trade-Off
Lambda-backed Custom Rule | Python/Node.js function evaluates configuration item data. | Complex, multi-resource logic (e.g., "Check if a Lambda function's execution role allows modification of a peer service's S3 bucket"). | Higher maintenance cost; requires managing Lambda code and permissions, but allows deep logic.
Guard Policy Custom Rule | Uses AWS CloudFormation Guard, an open-source policy-as-code DSL from AWS. | Simple, single-resource compliance checks (e.g., "Ensure all EC2 Security Groups do not have 0.0.0.0/0 ingress on port 22"). | Limited to structural and property checks; less flexible for checking external state or complex business logic.

For instance, let’s define a custom rule that flags any change to a critical security group (owned by the CI/CD pipeline) that opens port 22 (SSH) to the world, a common "break-glass" operation. This should only be allowed via a temporary, audited process, never as a permanent configuration drift.



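# Sketch of a custom Lambda rule (Python): flag SSH (port 22) ingress open to the world
SSH_PORT = 22
WORLD_CIDRS = {'0.0.0.0/0', '::/0'}


def _cidrs(rule):
    """Collect every CIDR on an ingress rule, IPv4 and IPv6."""
    found = set()
    # Entries may be plain strings or {'cidrIp': ...} dicts depending on the field,
    # so both shapes are accepted here.
    for entry in rule.get('ipRanges', []) + rule.get('ipv4Ranges', []):
        found.add(entry if isinstance(entry, str) else entry.get('cidrIp'))
    for entry in rule.get('ipv6Ranges', []):
        found.add(entry.get('cidrIpv6'))
    return found


def _covers_ssh(rule):
    """True when the rule's protocol and port range include port 22."""
    protocol = rule.get('ipProtocol')
    if protocol == '-1':  # all protocols, all ports
        return True
    if protocol != 'tcp':
        return False
    return (rule.get('fromPort') or 0) <= SSH_PORT <= (rule.get('toPort') or 65535)


def evaluate_security_group(configuration_item):
    config = configuration_item['configuration']
    sg_id = config['groupId']

    # Policy: no ingress rule may expose SSH to the whole internet
    for rule in config.get('ipPermissions', []):
        if _covers_ssh(rule) and _cidrs(rule) & WORLD_CIDRS:
            return {
                'compliance': 'NON_COMPLIANT',
                'annotation': f'Security Group {sg_id} has public ingress on port {SSH_PORT}.'
            }

    return {'compliance': 'COMPLIANT'}

# (Lambda handler logic would route the event to this function)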
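
The handler logic mentioned in that last comment could look roughly like the sketch below. It assumes the standard custom-rule invocation event (an invokingEvent JSON string plus a resultToken) and reports the verdict with the Config PutEvaluations API. The names build_evaluation and make_handler are illustrative, and the evaluator is passed in as a parameter so the sketch stays self-contained.

```python
import json

DELETED_STATUSES = ("ResourceDeleted", "ResourceDeletedNotRecorded")


def build_evaluation(event, evaluate):
    """Turn a Config custom-rule invocation event into one evaluation record.

    `evaluate` takes a configuration item and returns a dict with a
    'compliance' key and an optional 'annotation' key.
    """
    invoking_event = json.loads(event["invokingEvent"])
    item = invoking_event["configurationItem"]

    # A deleted resource can no longer violate the policy.
    if item.get("configurationItemStatus") in DELETED_STATUSES:
        verdict = {"compliance": "NOT_APPLICABLE"}
    else:
        verdict = evaluate(item)

    evaluation = {
        "ComplianceResourceType": item["resourceType"],
        "ComplianceResourceId": item["resourceId"],
        "ComplianceType": verdict["compliance"],
        "OrderingTimestamp": item["configurationItemCaptureTime"],
    }
    if "annotation" in verdict:
        # Annotations are limited to 256 characters.
        evaluation["Annotation"] = verdict["annotation"][:256]
    return evaluation


def make_handler(evaluate, config_client=None):
    """Build a Lambda entry point around an evaluator function."""
    def lambda_handler(event, context):
        client = config_client
        if client is None:
            import boto3  # imported lazily; shipped with the Lambda Python runtime
            client = boto3.client("config")
        evaluation = build_evaluation(event, evaluate)
        client.put_evaluations(
            Evaluations=[evaluation], ResultToken=event["resultToken"]
        )
        return evaluation
    return lambda_handler
```

In a deployment package, lambda_handler = make_handler(evaluate_security_group) would wire the security group check into this entry point. Injecting the client keeps the evaluation logic testable without AWS credentials.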


Phase 2: Automated Remediation with Systems Manager

Detection is only half the battle. The configuration needs to be reverted without human intervention to enforce the IaC blueprint. AWS Config provides the Remediation Action feature, which automatically triggers a corrective task, usually an AWS Systems Manager (SSM) Automation document, when a resource is marked as NON_COMPLIANT.


[Figure: AWS Config triggering a Systems Manager (SSM) Automation document to correct a drifted resource configuration and restore compliance]

The SSM Automation document is the workhorse. It can be a simple action (like calling PutBucketPolicy with the correct, codified JSON) or a complex workflow involving multiple steps and resources. For IaC-managed resources, the key is to reference the original, desired state.



# AWS CloudFormation snippet for defining a Config Rule with Remediation
Resources:
  # 1. The Config Rule: Detects the drift
  RestrictedSecurityGroupRule:
    Type: AWS::Config::ConfigRule
    # ... Rule properties defined here ...

  # 2. The Remediation Action: Automatically executed on NON_COMPLIANT status
  SecurityGroupRemediation:
    Type: AWS::Config::RemediationConfiguration
    Properties:
      ConfigRuleName: !Ref RestrictedSecurityGroupRule
      TargetType: SSM_AUTOMATION
      # Using a custom SSM document tailored for reverting this specific resource type
      TargetId: MyCustomSGRevertAutomationDoc 
      TargetVersion: "1"
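      Automatic: true  # Key: Auto-revert is enabled for true immutable infrastructure
      MaximumAutomaticAttempts: 3  # Cap failed attempts so a broken fix cannot retry forever
      RetryAttemptSeconds: 60      # Time window (seconds) that bounds those attempts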
      Parameters:
        GroupId:
          ResourceValue:
            Value: RESOURCE_ID  # Passed by Config, representing the drifted resource
        DesiredStateJson:
          StaticValue:
            Values: ["{\"ipPermissions\": [{\"ipProtocol\": \"tcp\", \"fromPort\": 443, \"toPort\": 443, \"ipRanges\": [{\"cidrIp\": \"10.0.0.0/8\"}]}]}"]


The Automatic: true flag is the critical line that transforms the rule from an auditing tool into a preventative gatekeeper. It ensures that any deviation is met with an immediate, pre-approved corrective action, embodying the GitOps principle of continuous reconciliation.


Remediation Strategy: IaC Rollback vs. Synchronization

When dealing with drift, a senior engineer must choose the correct remediation strategy based on resource state and blast radius:

  1. Property Synchronization (Revert): This is the preferred method for stateful or critical resources. It involves using the SSM Automation document to specifically revert only the drifted property, leaving the rest of the resource intact. For example, reverting an overly permissive Security Group rule (using RevokeSecurityGroupIngress) without affecting any other rules or tags.
    • Use Case: Stateful resources (RDS, EC2 instances), S3 buckets, or network resources where the identity must be preserved.
  2. External Synchronization (IaC Re-apply): For stacks managed by CloudFormation or Terraform, the remediation action can trigger an external process (via a Lambda function called from SSM) that re-applies the latest compliant IaC template. This forces the IaC tool to reconcile the live state with the template.
    • Use Case: Stateless resources (Lambda functions, ECS services), or complex stacks where the drift is too widespread for a simple property revert.
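
The property-synchronization strategy can be sketched as a small diff-and-revoke routine. This is a minimal illustration, assuming a boto3 EC2 client is passed in as ec2 and that the desired state is expressed in the same IpPermissions shape that describe_security_groups returns. The function names are hypothetical, and only IPv4 ranges are handled.

```python
def _flatten(permissions):
    """Expand IpPermissions into a set of (protocol, from_port, to_port, cidr) tuples."""
    flat = set()
    for perm in permissions:
        for ip_range in perm.get("IpRanges", []):
            flat.add((
                perm["IpProtocol"],
                perm.get("FromPort"),
                perm.get("ToPort"),
                ip_range["CidrIp"],
            ))
    return flat


def plan_revocations(live_permissions, desired_permissions):
    """Return the ingress rules present on the live group but absent from the desired state."""
    extra = _flatten(live_permissions) - _flatten(desired_permissions)
    plan = []
    for protocol, from_port, to_port, cidr in sorted(extra, key=str):
        rule = {"IpProtocol": protocol, "IpRanges": [{"CidrIp": cidr}]}
        # Rules for protocol "-1" carry no port range, so ports are added only when set.
        if from_port is not None:
            rule["FromPort"] = from_port
        if to_port is not None:
            rule["ToPort"] = to_port
        plan.append(rule)
    return plan


def revert_security_group(ec2, group_id, desired_permissions):
    """Revoke only the drifted ingress rules, leaving everything else on the group intact."""
    group = ec2.describe_security_groups(GroupIds=[group_id])["SecurityGroups"][0]
    to_revoke = plan_revocations(group["IpPermissions"], desired_permissions)
    if to_revoke:  # an already-compliant group results in no write call at all
        ec2.revoke_security_group_ingress(GroupId=group_id, IpPermissions=to_revoke)
    return to_revoke
```

Because the routine computes the difference first and skips the API call when nothing is extra, running it twice in a row produces no second configuration change.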

For high-availability backend services, the SSM document should be designed to use transactional, minimal-impact API calls, ensuring the automated correction itself does not cause an outage.


Production Implementation and Trade-Offs

Implementing this robust control mechanism introduces several architectural and operational trade-offs that impact cost, scale, and latency:


Factor | Impact on System Design | Production Risk
Latency of Correction | AWS Config evaluates resources based on a trigger (Configuration Item Change or scheduled). Remediation isn't instantaneous (typically 5-10 minutes end-to-end). | A drifted security resource (e.g., an open security port) can be exposed for a few minutes before the auto-remediation executes. This time window must be factored into the overall security architecture.
Remediation Idempotency | SSM documents must be idempotent. If the remediation itself creates a new configuration change, it can trigger Config again, leading to an infinite compliance loop. | A remediation loop can lead to excessive resource utilization, throttling, and a runaway AWS bill. Strict conditional logic is required in SSM.
Cost and Scale | AWS Config charges per recorded configuration item and per rule evaluation. High-frequency, multi-resource checks in a large environment can become expensive for a cloud deployment. | Need to optimize rule scope (tags, resource types) and choose event-driven triggers over scheduled ones where the resource has low modification frequency.
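
The "strict conditional logic" called for in the idempotency row can be illustrated with a small guard that a remediation step evaluates before changing anything. This is only a sketch under stated assumptions: the attempt history would have to be stored somewhere the automation can read (a resource tag or a small table, for instance), and the function name, attempt budget, and time window are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def should_remediate(is_compliant, past_attempts, now,
                     max_attempts=3, window=timedelta(hours=1)):
    """Decide whether an automated fix should run for one resource.

    past_attempts: datetimes of earlier remediation runs for this resource.
    Returns (decision, reason).
    """
    # Idempotency: a compliant resource must never be written to again.
    if is_compliant:
        return False, "already compliant; nothing to change"

    # Loop breaker: stop retrying once the budget for the window is spent.
    recent = [t for t in past_attempts if now - t <= window]
    if len(recent) >= max_attempts:
        return False, "attempt budget exhausted; escalate to a human"

    return True, "drift present and within attempt budget"


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    history = [now - timedelta(minutes=m) for m in (5, 10, 15)]
    print(should_remediate(False, history, now))
```

With three attempts already made in the last hour, the guard declines to run and signals escalation instead, which breaks the loop rather than feeding it.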

IAM Permissions - The Ultimate Guardrail:

The most important preventative measure is to restrict the ability for CI/CD users and all human users to make manual changes in the first place. AWS Identity and Access Management (IAM) is the first line of defense. The Config/Remediation pattern is the second line of defense for when the first fails (e.g., during a legitimate break-glass moment where manual intervention is necessary, but the configuration must be reverted rapidly). This multilayered approach provides robust security.
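
As a rough illustration of this first line of defense, the sketch below builds an IAM deny statement that blocks security group rule edits for every principal except an allow-list of roles. The role ARNs and account ID are placeholders, the function name is hypothetical, and whether such a statement is attached as a service control policy or a permissions boundary is left open.

```python
import json


def deny_manual_sg_changes(allowed_role_arns):
    """Build a policy denying security group rule edits to all but the listed roles."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenySecurityGroupEditsOutsidePipeline",
                "Effect": "Deny",
                "Action": [
                    "ec2:AuthorizeSecurityGroupIngress",
                    "ec2:RevokeSecurityGroupIngress",
                    "ec2:ModifySecurityGroupRules",
                ],
                "Resource": "*",
                # The deny applies to any principal whose ARN matches none of these.
                "Condition": {
                    "ArnNotLike": {"aws:PrincipalArn": list(allowed_role_arns)}
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = deny_manual_sg_changes([
        "arn:aws:iam::111122223333:role/cicd-deploy-role",      # placeholder pipeline role
        "arn:aws:iam::111122223333:role/config-remediation",    # placeholder remediation role
    ])
    print(json.dumps(policy, indent=2))
```

The remediation role is on the allow-list deliberately: without it, the automated revert described earlier would be blocked by the same guardrail.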


Final Thoughts

The pursuit of a robust, secure, and stable backend environment demands zero tolerance for configuration drift. The CI/CD pipeline ensures deployment consistency, but AWS Config and Systems Manager Automation provide the essential "immune system" that enforces the codified state continuously, moving us closer to truly immutable infrastructure.


By defining your configuration policies as code (Custom Config Rules) and tying them directly to self-healing workflows (SSM Remediation), you move the organization's infrastructure discipline from a subjective, audit-driven task to an objective, automatically enforced engineering principle. This minimizes human error, ensures a clean audit trail, and decisively establishes your Git repository as the single, immutable source of truth, minimizing operational risk.


The critical engineering takeaway is this: Do not rely solely on your IaC tool's state file to detect manual drift. Use an external, authoritative, and policy-driven system like AWS Config to continuously validate the live cloud environment against the policy definition, and automate the correction as a compliance enforcement mechanism. This separation of concerns (deployment vs. validation) is vital for production-grade security and stability.
