Mastering Hybrid Cloud with Google Anthos: Unified Kubernetes Management Across Any Environment

Introduction: Google Anthos provides a unified platform for managing applications across on-premises data centers, Google Cloud, and other cloud providers. This guide covers its enterprise capabilities: GKE Enterprise, Config Management, Service Mesh, and multi-cluster networking. Having implemented hybrid cloud architectures for enterprises with complex compliance and data residency requirements, I’ve found that Anthos’s main value lies in consistent Kubernetes management, policy-as-code governance, and unified observability across environments. It is a good fit for workload portability, centralized policy management, and gradual cloud migration, provided that cluster fleet management and security controls are in place from the start.

Anthos Architecture: Unified Hybrid Cloud Platform

Anthos extends Google’s Kubernetes expertise beyond GCP to any environment. GKE Enterprise provides managed Kubernetes on Google Cloud with advanced features like multi-cluster management, fleet-wide policies, and integrated security. Anthos on VMware runs Kubernetes on existing VMware infrastructure, enabling organizations to modernize applications without immediate cloud migration. Anthos on bare metal eliminates the virtualization layer for maximum performance in edge and high-performance computing scenarios.

Fleet management provides a single pane of glass for clusters across all environments. Register clusters from any provider (GKE, EKS, AKS, on-premises) into a fleet for centralized management. Fleet-scoped resources like namespaces and RBAC policies apply consistently across all member clusters. This abstraction enables platform teams to manage hundreds of clusters with consistent governance while allowing application teams to deploy without environment-specific configurations.

Connect Gateway provides secure, identity-aware access to registered clusters without exposing Kubernetes APIs to the internet. Users authenticate through Google Cloud IAM, and Connect Gateway proxies requests to clusters through an outbound connection from the cluster. This architecture eliminates the need for VPNs or public endpoints while maintaining full kubectl compatibility and audit logging.
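
As a small illustration of the access flow described above, the sketch below builds the gcloud command that writes a kubeconfig entry routed through Connect Gateway. The membership and project names are hypothetical placeholders, and the helper function is not part of any Google library; it only assembles the command line.

```python
import shlex
from typing import List


def connect_gateway_credentials_cmd(membership: str, project_id: str) -> List[str]:
    """Build the gcloud command that fetches Connect Gateway credentials
    for a fleet membership (it does not run the command)."""
    return [
        "gcloud", "container", "fleet", "memberships", "get-credentials",
        membership, "--project", project_id,
    ]


# Hypothetical names, used only for illustration.
cmd = connect_gateway_credentials_cmd("prod-cluster", "my-project")
print(shlex.join(cmd))
```

After running the printed command, ordinary kubectl calls against that context should travel through the gateway rather than a public cluster endpoint, subject to the caller's IAM permissions.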

Config Management and Policy Governance

Anthos Config Management implements GitOps for fleet-wide configuration. Store Kubernetes manifests, policies, and configurations in Git repositories, and Config Management automatically syncs them to registered clusters. Config Sync handles the synchronization, supporting Kustomize and Helm for environment-specific customization. Changes flow through standard Git workflows—pull requests, reviews, and approvals—providing audit trails and rollback capabilities.
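
To make the sync relationship concrete, here is a minimal sketch that builds a Config Sync RootSync object as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML. The repository URL, branch, and directory mirror the values used in the Terraform example later in this guide; the secret name `git-creds` is an assumption and must match a Secret you create yourself.

```python
import json
from typing import Any, Dict


def root_sync_manifest(repo: str, branch: str, directory: str,
                       secret_name: str = "git-creds") -> Dict[str, Any]:
    """Return a RootSync manifest that syncs one directory of a Git repo."""
    return {
        "apiVersion": "configsync.gke.io/v1beta1",
        "kind": "RootSync",
        "metadata": {
            "name": "root-sync",
            "namespace": "config-management-system",
        },
        "spec": {
            "sourceFormat": "unstructured",
            "git": {
                "repo": repo,
                "branch": branch,
                "dir": directory,
                "auth": "token",  # token-based auth read from the Secret below
                "secretRef": {"name": secret_name},  # assumed Secret name
            },
        },
    }


manifest = root_sync_manifest(
    "https://github.com/your-org/config-repo", "main", "clusters/prod"
)
print(json.dumps(manifest, indent=2))
```

Generating the manifest from code is only one option; most teams would commit the equivalent YAML directly to the repository.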

Policy Controller enforces guardrails across the fleet using Open Policy Agent (OPA) Gatekeeper. Define constraints that prevent policy violations before resources are created—block privileged containers, require resource limits, enforce naming conventions, or mandate specific labels. Policy bundles provide pre-built policies for common compliance requirements (CIS benchmarks, PCI-DSS, HIPAA). Audit mode allows testing policies without enforcement, identifying violations before enabling blocking.
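
The audit-then-enforce idea can be sketched as a Gatekeeper constraint whose `enforcementAction` starts as `dryrun` and later switches to `deny`. The example assumes the `K8sRequiredLabels` constraint template from the template library is installed; the constraint name and label key are hypothetical.

```python
import json
from typing import Any, Dict, List


def required_labels_constraint(name: str, labels: List[str],
                               enforce: bool = False) -> Dict[str, Any]:
    """Build a K8sRequiredLabels constraint for Namespaces.

    enforce=False yields audit-only behaviour (dryrun); enforce=True blocks
    violating resources at admission time (deny).
    """
    return {
        "apiVersion": "constraints.gatekeeper.sh/v1beta1",
        "kind": "K8sRequiredLabels",
        "metadata": {"name": name},
        "spec": {
            "enforcementAction": "deny" if enforce else "dryrun",
            "match": {"kinds": [{"apiGroups": [""], "kinds": ["Namespace"]}]},
            "parameters": {"labels": [{"key": key} for key in labels]},
        },
    }


# Hypothetical constraint: every Namespace must carry an "owner" label.
audit_only = required_labels_constraint("ns-must-have-owner", ["owner"])
enforced = required_labels_constraint("ns-must-have-owner", ["owner"], enforce=True)
print(json.dumps(audit_only, indent=2))
```

Keeping the toggle in one place makes the later switch from audit to enforcement a one-line change in the repository.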

Config Controller provides a managed control plane for provisioning GCP resources using Kubernetes Resource Model (KRM). Define GCP infrastructure (VPCs, Cloud SQL, GKE clusters) as Kubernetes manifests, and Config Controller reconciles the desired state with actual infrastructure. This enables infrastructure-as-code with the same GitOps workflows used for application configuration, unifying infrastructure and application management.
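
As a rough illustration of infrastructure expressed in KRM, the sketch below builds a Config Connector `ComputeNetwork` object, the kind of manifest Config Controller reconciles into a real VPC. The network name is hypothetical, and the namespace is passed in explicitly because it depends on how your Config Controller instance is set up.

```python
import json
from typing import Any, Dict


def compute_network_manifest(name: str, namespace: str) -> Dict[str, Any]:
    """Return a Config Connector ComputeNetwork with custom-mode subnets."""
    return {
        "apiVersion": "compute.cnrm.cloud.google.com/v1beta1",
        "kind": "ComputeNetwork",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "autoCreateSubnetworks": False,  # subnets are declared separately
            "routingMode": "REGIONAL",
        },
    }


# Hypothetical names, for illustration only.
network = compute_network_manifest("anthos-vpc", "config-control")
print(json.dumps(network, indent=2))
```

Committed to the same Git repository as application manifests, an object like this goes through the same review and sync workflow.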

Production Terraform Configuration

Here’s a comprehensive Terraform configuration for Anthos with fleet management and Config Management:

# Anthos Enterprise Configuration
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    google = { source = "hashicorp/google", version = "~> 5.0" }
  }
}

variable "project_id" { type = string }
variable "region" {
  type    = string
  default = "us-central1"
}

# Enable required APIs
resource "google_project_service" "apis" {
  for_each = toset([
    "anthos.googleapis.com",
    "gkehub.googleapis.com",
    "anthosconfigmanagement.googleapis.com",
    "mesh.googleapis.com",
    "connectgateway.googleapis.com",
    "gkeconnect.googleapis.com"
  ])
  
  service            = each.value
  disable_on_destroy = false
}

# GKE cluster for production workloads
resource "google_container_cluster" "prod" {
  name     = "prod-cluster"
  location = var.region
  
  # Enable Autopilot for managed node pools
  enable_autopilot = true
  
  # Fleet registration
  fleet {
    project = var.project_id
  }
  
  # Binary Authorization
  binary_authorization {
    evaluation_mode = "PROJECT_SINGLETON_POLICY_ENFORCE"
  }
  
  # Workload Identity
  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }
  
  # Private cluster configuration
  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = false
    master_ipv4_cidr_block  = "172.16.0.0/28"
  }
  
  # Network configuration
  network    = google_compute_network.vpc.name
  subnetwork = google_compute_subnetwork.subnet.name
  
  ip_allocation_policy {
    cluster_secondary_range_name  = "pods"
    services_secondary_range_name = "services"
  }
  
  # Security configuration
  master_authorized_networks_config {
    cidr_blocks {
      cidr_block   = "10.0.0.0/8"
      display_name = "Internal"
    }
  }
  
  # Logging and monitoring
  logging_config {
    enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"]
  }
  
  monitoring_config {
    enable_components = ["SYSTEM_COMPONENTS"]
    managed_prometheus {
      enabled = true
    }
  }
}

# VPC for clusters
resource "google_compute_network" "vpc" {
  name                    = "anthos-vpc"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "subnet" {
  name          = "anthos-subnet"
  ip_cidr_range = "10.0.0.0/20"
  region        = var.region
  network       = google_compute_network.vpc.id
  
  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.1.0.0/16"
  }
  
  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = "10.2.0.0/20"
  }
  
  private_ip_google_access = true
}

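# Fleet membership for an additional cluster (example). Despite the name, this
# still points at a GKE cluster called "external" in the same project; clusters
# outside Google Cloud have no container.googleapis.com resource link.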
resource "google_gke_hub_membership" "external" {
  membership_id = "external-cluster"
  
  endpoint {
    gke_cluster {
      resource_link = "//container.googleapis.com/projects/${var.project_id}/locations/${var.region}/clusters/external"
    }
  }
  
  authority {
    issuer = "https://container.googleapis.com/v1/projects/${var.project_id}/locations/${var.region}/clusters/external"
  }
}

# Config Management feature
resource "google_gke_hub_feature" "configmanagement" {
  name     = "configmanagement"
  location = "global"
  
  depends_on = [google_project_service.apis["anthosconfigmanagement.googleapis.com"]]
}

# Config Management for prod cluster
resource "google_gke_hub_feature_membership" "prod_config" {
  location   = "global"
  feature    = google_gke_hub_feature.configmanagement.name
  membership = google_gke_hub_membership.prod.membership_id
  
  configmanagement {
    version = "1.17.0"
    
    config_sync {
      source_format = "unstructured"
      
      git {
        sync_repo   = "https://github.com/your-org/config-repo"
        sync_branch = "main"
        policy_dir  = "clusters/prod"
        secret_type = "token"
      }
    }
    
    policy_controller {
      enabled                    = true
      template_library_installed = true
      referential_rules_enabled  = true
      
      monitoring {
        backends = ["PROMETHEUS"]
      }
    }
  }
}

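# Explicit fleet membership for the prod cluster. The fleet block on
# google_container_cluster.prod also registers the cluster, so keep only one of
# the two to avoid registering it twice.
resource "google_gke_hub_membership" "prod" {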
  membership_id = "prod-cluster"
  
  endpoint {
    gke_cluster {
      resource_link = google_container_cluster.prod.id
    }
  }
}

# Service Mesh feature
resource "google_gke_hub_feature" "servicemesh" {
  name     = "servicemesh"
  location = "global"
  
  depends_on = [google_project_service.apis["mesh.googleapis.com"]]
}

# Service Mesh for prod cluster
resource "google_gke_hub_feature_membership" "prod_mesh" {
  location   = "global"
  feature    = google_gke_hub_feature.servicemesh.name
  membership = google_gke_hub_membership.prod.membership_id
  
  mesh {
    management = "MANAGEMENT_AUTOMATIC"
  }
}

# Multi-cluster ingress
resource "google_gke_hub_feature" "multiclusteringress" {
  name     = "multiclusteringress"
  location = "global"
  
  spec {
    multiclusteringress {
      config_membership = google_gke_hub_membership.prod.id
    }
  }
}

# Fleet namespace for team isolation
resource "google_gke_hub_namespace" "team_a" {
  scope_namespace_id = "team-a"
  scope_id           = google_gke_hub_scope.teams.scope_id
  scope              = google_gke_hub_scope.teams.name
}

resource "google_gke_hub_scope" "teams" {
  scope_id = "teams"
}

# Fleet RBAC for team access
resource "google_gke_hub_scope_rbac_role_binding" "team_a_admin" {
  scope_rbac_role_binding_id = "team-a-admin"
  scope_id                   = google_gke_hub_scope.teams.scope_id
  
  role {
    predefined_role = "ADMIN"
  }
  
  group = "team-a-admins@example.com"
}

Python SDK for Fleet Management

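This Python implementation sketches patterns for Anthos fleet management and policy compliance. Membership listing, registration, and unregistration call the GKE Hub API; the Config Sync, policy, and workload methods are placeholders that only log or return stub values.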

"""Anthos Fleet Manager - Enterprise Python Implementation"""
from dataclasses import dataclass
from typing import List, Dict, Optional
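# The google-cloud-gke-hub package exposes its client as gkehub_v1;
# the alias keeps the gke_hub_v1 name used throughout this listing.
from google.cloud import gkehub_v1 as gke_hub_v1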
from google.cloud import container_v1
import logging
from datetime import datetime

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@dataclass
class ClusterInfo:
    name: str
    location: str
    membership_id: str
    state: str
    config_sync_status: str
    policy_controller_status: str

@dataclass
class PolicyViolation:
    cluster: str
    namespace: str
    resource_type: str
    resource_name: str
    constraint: str
    message: str

class AnthosFleetManager:
    """Enterprise Anthos fleet management."""
    
    def __init__(self, project_id: str):
        self.project_id = project_id
        self.hub_client = gke_hub_v1.GkeHubClient()
        self.container_client = container_v1.ClusterManagerClient()
        self.parent = f"projects/{project_id}/locations/global"
    
    def list_fleet_members(self) -> List[ClusterInfo]:
        """List all clusters in the fleet."""
        request = gke_hub_v1.ListMembershipsRequest(parent=self.parent)
        
        clusters = []
        for membership in self.hub_client.list_memberships(request=request):
            # Get config management status
            config_sync = "Unknown"
            policy_controller = "Unknown"
            
            if membership.state:
                state = membership.state.code.name
            else:
                state = "Unknown"
            
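            # A GKE resource link looks like
            # //container.googleapis.com/projects/<p>/locations/<loc>/clusters/<name>,
            # so after splitting on "/" the location is element 6 (element 5 is
            # the literal "locations"). Non-GKE members have an empty link.
            resource_link = membership.endpoint.gke_cluster.resource_link
            link_parts = resource_link.split("/") if resource_link else []

            clusters.append(ClusterInfo(
                name=membership.name.split("/")[-1],
                location=link_parts[6] if len(link_parts) > 6 else "external",
                membership_id=membership.name,  # full resource name, not the short ID
                state=state,
                config_sync_status=config_sync,
                policy_controller_status=policy_controller
            ))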
        
        return clusters
    
    def register_cluster(self, cluster_name: str, 
                        cluster_location: str) -> str:
        """Register a GKE cluster to the fleet."""
        membership = gke_hub_v1.Membership(
            endpoint=gke_hub_v1.MembershipEndpoint(
                gke_cluster=gke_hub_v1.GkeCluster(
                    resource_link=f"//container.googleapis.com/projects/{self.project_id}/locations/{cluster_location}/clusters/{cluster_name}"
                )
            )
        )
        
        request = gke_hub_v1.CreateMembershipRequest(
            parent=self.parent,
            membership_id=cluster_name,
            resource=membership
        )
        
        operation = self.hub_client.create_membership(request=request)
        result = operation.result()
        
        logger.info(f"Registered cluster: {cluster_name}")
        return result.name
    
    def unregister_cluster(self, membership_id: str) -> bool:
        """Unregister a cluster from the fleet."""
        try:
            request = gke_hub_v1.DeleteMembershipRequest(
                name=f"{self.parent}/memberships/{membership_id}"
            )
            
            operation = self.hub_client.delete_membership(request=request)
            operation.result()
            
            logger.info(f"Unregistered cluster: {membership_id}")
            return True
        except Exception as e:
            logger.error(f"Failed to unregister cluster: {e}")
            return False
    
    def get_fleet_health(self) -> Dict:
        """Get overall fleet health status."""
        clusters = self.list_fleet_members()
        
        health = {
            "total_clusters": len(clusters),
            "healthy": 0,
            "degraded": 0,
            "unhealthy": 0,
            "clusters": []
        }
        
        for cluster in clusters:
            status = "healthy"
            if cluster.state != "READY":
                status = "unhealthy"
            elif cluster.config_sync_status == "ERROR":
                status = "degraded"
            
            health["clusters"].append({
                "name": cluster.name,
                "status": status,
                "state": cluster.state
            })
            
            health[status] += 1
        
        return health
    
    def sync_config(self, membership_id: str) -> bool:
        """Trigger config sync for a cluster."""
        # This would typically interact with Config Sync API
        # For now, we'll log the action
        logger.info(f"Triggering config sync for: {membership_id}")
        return True
    
    def get_policy_violations(self) -> List[PolicyViolation]:
        """Get policy violations across the fleet."""
        # This would query Policy Controller audit logs
        # Placeholder implementation
        violations = []
        
        # In production, query Cloud Logging for policy violations
        # filter = 'resource.type="k8s_cluster" AND jsonPayload.kind="ConstraintViolation"'
        
        return violations
    
    def apply_fleet_policy(self, policy_name: str, 
                          policy_spec: Dict) -> bool:
        """Apply a policy across the fleet."""
        # This would create a constraint template and constraint
        # that Config Management syncs to all clusters
        logger.info(f"Applying fleet policy: {policy_name}")
        return True
    
    def get_workload_distribution(self) -> Dict:
        """Get workload distribution across clusters."""
        clusters = self.list_fleet_members()
        distribution = {}
        
        for cluster in clusters:
            # In production, query each cluster for workload counts
            distribution[cluster.name] = {
                "deployments": 0,
                "pods": 0,
                "services": 0
            }
        
        return distribution

class ConfigSyncManager:
    """Manage Config Sync across fleet."""
    
    def __init__(self, project_id: str):
        self.project_id = project_id
        self.hub_client = gke_hub_v1.GkeHubClient()
    
    def get_sync_status(self, membership_id: str) -> Dict:
        """Get Config Sync status for a cluster."""
        # Placeholder: a real implementation would read the configmanagement
        # feature state for this membership. The values below are hard-coded
        # stand-ins, not live data.
        return {
            "synced": True,
            "last_sync": datetime.utcnow().isoformat(),
            "commit": "abc123",
            "errors": []
        }
    
    def force_sync(self, membership_id: str) -> bool:
        """Force a sync for a specific cluster."""
        logger.info(f"Forcing sync for: {membership_id}")
        return True
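
The `list_fleet_members` method above derives a cluster's location by splitting the GKE resource link on `/` and picking a fixed position, which is easy to get wrong. A slightly more defensive sketch looks for the `locations` (or legacy `zones`) segment by name instead. The helper name is hypothetical and is not part of the GKE Hub client library.

```python
from typing import Optional


def location_from_resource_link(resource_link: str) -> Optional[str]:
    """Extract the location from a GKE resource link.

    Expects links such as
    //container.googleapis.com/projects/<p>/locations/<loc>/clusters/<name>
    and returns None when no location segment is present.
    """
    parts = resource_link.strip("/").split("/")
    for key in ("locations", "zones"):
        if key in parts:
            idx = parts.index(key)
            if idx + 1 < len(parts):
                return parts[idx + 1]
    return None


link = "//container.googleapis.com/projects/my-project/locations/us-central1/clusters/prod-cluster"
print(location_from_resource_link(link))
```

Returning `None` for links without a location lets the caller decide how to label non-GKE members, for example as "external".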

Cost Optimization and Best Practices

Anthos is billed per vCPU under management, so cost scales with the size of the clusters you register rather than with their number. The rate depends on where a cluster runs (Google Cloud, on-premises, or another cloud) and on the billing model, so check the current GKE Enterprise pricing page before budgeting. Optimize costs by right-sizing clusters and using Autopilot mode where appropriate to eliminate over-provisioning.
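
Because the charge scales with vCPUs, the effect of right-sizing can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation only: the rate is a made-up placeholder, not a published price, and the cluster sizes are hypothetical.

```python
from typing import Dict

HOURS_PER_MONTH = 730  # common approximation: 8760 hours / 12 months


def estimate_monthly_cost(vcpus_by_cluster: Dict[str, int],
                          rate_per_vcpu_hour: float) -> float:
    """Estimate the monthly per-vCPU charge for a set of registered clusters."""
    total_vcpus = sum(vcpus_by_cluster.values())
    return total_vcpus * rate_per_vcpu_hour * HOURS_PER_MONTH


ASSUMED_RATE = 0.01  # hypothetical rate per vCPU-hour; substitute the real one

before = estimate_monthly_cost({"prod": 64, "staging": 32}, ASSUMED_RATE)
after = estimate_monthly_cost({"prod": 48, "staging": 16}, ASSUMED_RATE)
print(f"estimated monthly saving: {before - after:.2f}")
```

The same function can be rerun with the actual rate for each environment to compare on-premises and cloud footprints.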

Fleet architecture significantly impacts operational efficiency. Group clusters by environment (dev, staging, prod) or by team for appropriate policy inheritance. Use fleet namespaces for multi-tenancy rather than creating separate clusters for each team. Implement cluster templates to ensure consistent configuration across new clusters.

Config Management reduces operational overhead but requires investment in GitOps practices. Start with a simple repository structure and evolve as needs grow. Use Kustomize overlays for environment-specific configuration rather than duplicating manifests. Implement policy constraints in audit mode first, then gradually enable enforcement after addressing existing violations.
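
The audit-first rollout described above can be reduced to a simple rule: a constraint moves to enforcement only once its audit violation count is at or below a threshold. The sketch below assumes you already have per-constraint violation counts from audit results; the constraint names and counts are hypothetical.

```python
from typing import Dict


def next_enforcement_actions(violations: Dict[str, int],
                             max_violations: int = 0) -> Dict[str, str]:
    """Decide the enforcement action for each constraint.

    Constraints whose audit violation count is within the threshold are
    promoted to "deny"; the rest stay in "dryrun" until they are cleaned up.
    """
    return {
        name: ("deny" if count <= max_violations else "dryrun")
        for name, count in sorted(violations.items())
    }


# Hypothetical audit results: constraint name -> open violations.
audit_counts = {"ns-must-have-owner": 0, "no-privileged-containers": 3}
plan = next_enforcement_actions(audit_counts)
print(plan)
```

Running a check like this in CI before each policy change gives a reviewable record of which constraints are safe to enforce.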

Figure: Anthos Enterprise Architecture, illustrating fleet management across GCP, on-premises, and multi-cloud environments with Config Management and Service Mesh integration.

Key Takeaways and Best Practices

Anthos provides a unified platform for managing Kubernetes across any environment with consistent policies and observability. Use fleet management to centralize cluster operations while maintaining environment-specific configurations through Config Management. Implement Policy Controller to enforce security and compliance guardrails across all clusters.

Start with GKE Enterprise for cloud-native workloads, then extend to on-premises or multi-cloud as requirements dictate. The Terraform and Python examples provided here establish patterns for production-ready Anthos deployments that scale from single clusters to enterprise-wide hybrid cloud architectures while maintaining security and operational efficiency.
