Security Lab — GitOps-managed data center networking

A reference design for managing data center networking with GitOps-style desired state, validation, staged rollout, and drift detection.

A data center network can be GitOps-managed, but the phrase needs discipline. GitOps for networking should not mean “merge a pull request and blindly push configs to every router.” The right model is Git-controlled desired state, strong validation, staged deployment, telemetry-backed verification, and a clear break-glass path.

The goal is to make network operations boring: every change is reviewed, rendered, tested, applied in a controlled sequence, and checked against the real state afterward.

Executive Recommendation

Use GitOps to manage the network operating model, not just the device configuration.

Layer	Recommendation	Why
Source of truth	Nautobot or NetBox plus Git	The inventory, IPAM, VRFs, circuits, tenants, and topology need structure before automation can be trusted.
Desired state	Git repository with declarative YAML, JSON, or data models	Pull requests become the review and approval workflow.
Validation	CI pipeline with schema checks, rendered config diffs, policy checks, and simulation	Bad route policy should fail before touching production.
Deployment	Ansible, Nornir, Scrapli, vendor APIs, or a custom controller	The deployment layer should be replaceable; the desired state should not depend on one tool.
Pre-production	EVE-NG, containerlab, Batfish, pyATS, or vendor lab gear	Networks need behavioral tests, not only syntax checks.
Verification	Streaming telemetry, route checks, reachability tests, config state, and alert correlation	The pipeline must prove that the network converged.
Drift handling	Alert or open a pull request first; auto-reconcile only low-risk drift	Blind overwrites can turn a local mistake into a fabric-wide outage.

What Belongs In Git

Git should hold the intended network state and the automation that produces device changes.

Good candidates:

sites, rooms, racks, devices, roles, platforms, and software versions
interface descriptions, cabling, LAGs, MLAG/vPC pairs, and breakout maps
VLANs, VRFs, route targets, route distinguishers, and tenant networks
BGP underlay neighbors and ASN assignments
MP-BGP EVPN overlay parameters
prefix lists, route maps, communities, and route policy templates
loopbacks, point-to-point links, management addresses, and IPAM reservations
DNS records, NTP, syslog, SNMP, telemetry, and AAA references
firewall objects and policies, where review and simulation can cover them
load balancer VIPs and pool membership
Cloudflare DNS, Access, Tunnel, Magic WAN, and security settings via Terraform
monitoring targets, alert rules, and dashboard definitions

Keep secrets out of Git. Git can reference a secret path, but the secret value belongs in Vault, a secrets manager, or another controlled secret store.
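
A CI check can enforce this rule mechanically. The sketch below walks a parsed desired-state document and reports any secret-looking key that holds a literal value instead of a reference. The `vault:` reference prefix and the list of key-name hints are assumptions made for illustration, not a convention this design prescribes.

```python
from typing import Any, Iterator

# Hypothetical convention: a key whose name suggests a secret must hold a
# reference such as "vault:network/tacacs/key", never the value itself.
SECRET_KEY_HINTS = ("password", "secret", "key", "community", "token")
REFERENCE_PREFIX = "vault:"


def find_inline_secrets(node: Any, path: str = "") -> Iterator[str]:
    """Yield the path of every secret-looking key that holds a literal string."""
    if isinstance(node, dict):
        for name, value in node.items():
            child = f"{path}.{name}" if path else str(name)
            looks_secret = any(h in str(name).lower() for h in SECRET_KEY_HINTS)
            if looks_secret and isinstance(value, str):
                if not value.startswith(REFERENCE_PREFIX):
                    yield child
            else:
                # Not a secret leaf: keep walking nested structures.
                yield from find_inline_secrets(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_inline_secrets(item, f"{path}[{index}]")
```

A pipeline would fail the pull request if `list(find_inline_secrets(desired_state))` is non-empty. Name-based matching is crude and will produce false positives, so a dedicated secret scanner is a reasonable complement.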

What Needs Extra Guardrails

Some changes should never auto-apply without extra review:

management interface changes
AAA, TACACS, RADIUS, SSH, and local user changes
underlay routing changes
default route changes
BGP transit, peering, and private-connect changes
route reflector policy
EVPN route-target changes
firewall deny rules
NAT changes
DDoS and blackhole policy
firmware or network OS upgrades
automation credential changes
any change affecting out-of-band access

These changes need explicit approval, maintenance windows where appropriate, and a pre-written rollback.

Reference Pipeline

Engineer / operator
        |
        v
Pull request
        |
        v
CI validation
  - schema check
  - source-of-truth consistency
  - render configs
  - show intended diff
  - route-policy checks
  - simulation / lab tests
  - blast-radius scoring
        |
        v
Approval
        |
        v
Staged deployment
  - pre-checks
  - lock affected devices
  - apply in batches
  - post-checks after each batch
  - stop on failure
        |
        v
Telemetry verification
  - BGP state
  - EVPN state
  - interface state
  - reachability
  - error counters
  - alert status
        |
        v
Change record and drift monitor

The pipeline should be able to answer three questions before it touches a device:

What exactly will change?
What traffic or tenants could be affected?
How will we know the change worked?
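
One way to make these three questions enforceable is to have the pipeline build a plan record and refuse to deploy while any question is unanswered. The following is a minimal sketch; `ChangePlan` and its field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ChangePlan:
    """Hypothetical record the pipeline fills in before touching a device."""
    rendered_diff: dict = field(default_factory=dict)    # device name -> diff text
    affected_tenants: set = field(default_factory=set)
    no_tenant_impact: bool = False                       # must be stated explicitly
    post_checks: list = field(default_factory=list)      # names of verification checks

    def unanswered(self) -> list:
        """Return the questions this plan cannot answer yet."""
        missing = []
        if not self.rendered_diff:
            missing.append("What exactly will change?")
        if not self.affected_tenants and not self.no_tenant_impact:
            missing.append("What traffic or tenants could be affected?")
        if not self.post_checks:
            missing.append("How will we know the change worked?")
        return missing
```

The deploy stage would proceed only when `plan.unanswered()` returns an empty list.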

Operating Modes

Do not start with full automatic production reconciliation. Move in stages.

Mode	Behavior	When to use
Read-only GitOps	Git defines intent, CI renders diffs, humans apply changes manually	First adoption phase or high-risk networks.
Assisted GitOps	Merge triggers automation into lab or staging; production needs manual approval	Good first production model.
Controlled production GitOps	Low-risk changes auto-apply; medium-risk and higher changes require approval	Mature operations with good rollback and telemetry.
Closed-loop GitOps	Drift is detected continuously; system opens PRs or reconciles safe drift	Mature networks with strong source-of-truth discipline.

The trap is jumping straight to closed-loop automation before the inventory and validation are trustworthy.

Source Of Truth Model

The source of truth should describe the network at multiple levels.

Data domain	Examples
Physical	sites, racks, devices, optics, cables, patch panels
Logical	tenants, VRFs, VLANs, VNIs, route targets, security zones
Routing	ASNs, loopbacks, BGP neighbors, route reflectors, policy
Services	DNS, NTP, syslog, telemetry collectors, AAA, TACACS/RADIUS
Cloud	VPCs, subnets, gateways, private-connect handoffs, customer VRFs
Security	firewall zones, ACLs, WAF policies, IDS sensor placement
Operations	owners, maintenance windows, criticality, rollback class

Nautobot or NetBox should carry structured inventory and IPAM. Git should carry templates, policy, environment definitions, and versioned changes. The two must be reconciled; if they disagree, the pipeline should stop.
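
The stop-on-disagreement rule can be sketched as a plain comparison between the two views. This assumes both the inventory export and the Git data have already been loaded into dictionaries keyed by object name; the function name and data shape are illustrative, and no Nautobot or NetBox API is called here.

```python
def find_disagreements(sot: dict, git: dict) -> list:
    """Compare two {object_name: {attribute: value}} views and list mismatches.

    Only attributes present in both views are compared, so each side may
    carry extra fields the other does not track.
    """
    problems = []
    for name in sorted(sot.keys() | git.keys()):
        if name not in git:
            problems.append(f"{name}: in source of truth, missing from Git")
        elif name not in sot:
            problems.append(f"{name}: in Git, missing from source of truth")
        else:
            for attr in sorted(sot[name].keys() & git[name].keys()):
                if sot[name][attr] != git[name][attr]:
                    problems.append(
                        f"{name}.{attr}: {sot[name][attr]!r} != {git[name][attr]!r}"
                    )
    return problems
```

A pipeline step would exit non-zero when the returned list is non-empty, which blocks rendering and deployment until someone resolves the mismatch.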

Validation Stack

A serious data center network pipeline needs more than YAML lint.

Tool class	Examples	Use
Schema validation	JSON Schema, Pydantic, Yamale	Reject malformed desired state.
Policy-as-code	OPA, Conftest, custom checks	Enforce standards and forbidden patterns.
Config rendering	Jinja2, Ansible templates, Nornir tasks	Produce device-specific configuration from common data.
Network analysis	Batfish	Validate reachability and policy behavior before deployment.
Lab simulation	EVE-NG, containerlab	Test routing and failure behavior in a disposable topology.
State validation	pyATS/Genie, Nornir, Scrapli, gNMI	Compare operational state against expected state.
Observability	Prometheus, Grafana, OpenTelemetry, SIEM/NDR logs	Prove health and detect regressions.

For Cisco-heavy environments, pyATS and Genie are valuable. For vendor-neutral fabrics, gNMI/OpenConfig, Nornir, Scrapli, and Batfish provide a better cross-platform base.
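
As an illustration of the schema-validation layer, the sketch below checks one tenant network entry using only the Python standard library. The field names (`vlan`, `vni`, `route_target`, `prefix`) are assumptions about the data model, and the route-target pattern accepts only the simple `ASN:number` form. In practice JSON Schema or Pydantic would express the same rules declaratively.

```python
import ipaddress
import re

RT_PATTERN = re.compile(r"^\d+:\d+$")  # simplified: ASN:number form only


def validate_tenant_network(entry: dict) -> list:
    """Return a list of validation errors; an empty list means the entry passed."""
    errors = []

    vlan = entry.get("vlan")
    if not isinstance(vlan, int) or not 1 <= vlan <= 4094:
        errors.append(f"vlan {vlan!r} is outside 1-4094")

    vni = entry.get("vni")  # a VXLAN VNI is a 24-bit value
    if not isinstance(vni, int) or not 1 <= vni <= 16777215:
        errors.append(f"vni {vni!r} is outside 1-16777215")

    rt = entry.get("route_target")
    if not isinstance(rt, str) or not RT_PATTERN.match(rt):
        errors.append(f"route_target {rt!r} is not in ASN:number form")

    try:
        # strict mode (the default) rejects prefixes with host bits set
        ipaddress.ip_network(entry.get("prefix", ""))
    except ValueError:
        errors.append(f"prefix {entry.get('prefix')!r} is not a valid network")

    return errors
```

Checks like these catch malformed input only. Whether the route target leaks a tenant route is a question for policy checks and simulation.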

Deployment Pattern

Deploy in batches, not all at once.

Recommended rollout sequence:

Render all device configs and intended API changes.
Run pre-checks on every affected device.
Lock the change scope in the automation system.
Apply to one lab or canary device first.
Apply to one leaf pair or one rack.
Verify BGP, EVPN, interfaces, reachability, and alerts.
Continue rack-by-rack or site-by-site.
Stop automatically if post-checks fail.
Record evidence in the change record.

For high-risk changes, keep a human approval gate between each batch.
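
The batch loop above can be sketched as follows. The `apply`, `post_check`, and `approve` callables are placeholders for whatever the deployment layer provides (Ansible, Nornir, a vendor API); none of them refers to a real library function.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Sequence


@dataclass
class RolloutResult:
    completed: list = field(default_factory=list)  # batches that passed post-checks
    halted_at: Optional[Sequence[str]] = None      # batch where the rollout stopped
    reason: str = ""


def staged_rollout(
    batches: Sequence[Sequence[str]],
    apply: Callable[[str], None],
    post_check: Callable[[str], bool],
    approve: Optional[Callable[[Sequence[str]], bool]] = None,
) -> RolloutResult:
    """Apply batch by batch and stop at the first failed post-check."""
    result = RolloutResult()
    for batch in batches:
        # Optional human gate between batches, intended for high-risk changes.
        if approve is not None and not approve(batch):
            result.halted_at, result.reason = batch, "approval withheld"
            return result
        for device in batch:
            apply(device)
        failed = [device for device in batch if not post_check(device)]
        if failed:
            result.halted_at = batch
            result.reason = "post-check failed on " + ", ".join(failed)
            return result
        result.completed.append(batch)
    return result
```

Ordering the batches as canary first, then one leaf pair, then the remaining racks keeps the blast radius of a bad change to a single batch. Rollback of the halted batch is deliberately left out of this sketch.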

Drift Detection

Drift falls into four classes, and each calls for a different response.

Drift type	Example	Action
Benign drift	interface counter, learned route, neighbor uptime	Observe only.
Config drift	manual config change, missing route map, changed BGP neighbor	Alert or open a correction PR.
Emergency drift	break-glass config during incident	Preserve evidence, then reconcile after review.
Dangerous drift	unauthorized AAA or route policy change	Page operator and block auto-reconcile until reviewed.

For network infrastructure, do not auto-reconcile everything. The network might be in an emergency state for a reason. The safer pattern is:

detect drift -> classify -> alert or PR -> approve -> reconcile

Only low-risk drift should self-heal automatically.
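
The classify step might look like the sketch below. The path prefixes are hypothetical stand-ins for a real data model, and treating any configuration drift seen while break-glass access is active as emergency drift is an assumption of this example.

```python
from enum import Enum


class Drift(Enum):
    BENIGN = "observe only"
    CONFIG = "alert or open a correction PR"
    EMERGENCY = "preserve evidence, then reconcile after review"
    DANGEROUS = "page operator and block auto-reconcile until reviewed"


# Hypothetical path prefixes; a real deployment would derive these from its
# own data model.
OPERATIONAL_PREFIXES = ("counters/", "routes/learned/", "neighbors/uptime")
SENSITIVE_PREFIXES = ("aaa/", "route-policy/")


def classify_drift(changed_paths: list, break_glass_active: bool = False) -> Drift:
    """Map a set of changed paths to one of the four drift classes."""
    # Pure operational-state changes are never treated as config drift.
    if all(p.startswith(OPERATIONAL_PREFIXES) for p in changed_paths):
        return Drift.BENIGN
    # Assumption: drift during a declared break-glass session is emergency drift.
    if break_glass_active:
        return Drift.EMERGENCY
    if any(p.startswith(SENSITIVE_PREFIXES) for p in changed_paths):
        return Drift.DANGEROUS
    return Drift.CONFIG
```

Each member's value is the action from the table, so the caller can log `classify_drift(paths).value` alongside the alert.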

Git Repository Shape

One practical structure:

network-gitops/
  inventory/
    sites/
    devices/
    links/
    ipam/
  intent/
    tenants/
    vrfs/
    bgp/
    evpn/
    services/
    security/
  templates/
    cisco-nxos/
    arista-eos/
    juniper-junos/
    sonic/
    frr/
  policy/
    conftest/
    route-policy/
    blast-radius/
  pipelines/
    render/
    validate/
    deploy/
    verify/
  labs/
    eve-ng/
    containerlab/
  evidence/
    README.md

Keep generated configs either as build artifacts or in a separate generated branch. Do not let generated output become the source of truth unless there is a clear reason.

Change Classes

Every change should have a class. The class decides approval and rollout behavior.

Class	Examples	Approval	Rollout
Low risk	interface description, monitoring target, non-forwarding metadata	peer review	auto or next batch
Medium risk	new VLAN, new tenant VRF, new BGP customer with route limits	network owner	staged
High risk	route policy, EVPN route targets, firewall deny, private-connect	senior network owner	maintenance window
Critical	underlay, edge BGP, management, AAA, automation credentials	change board and rollback owner	manual gate per batch

This classification is more important than the tool. Without it, GitOps becomes a faster way to make bigger mistakes.
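
A minimal sketch of how a pipeline could assign the class: map each touched domain to a class and let the highest one win. The domain labels are hypothetical, and classifying unknown domains as critical (failing closed) is an assumption of this example.

```python
from enum import IntEnum


class ChangeClass(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Hypothetical domain labels, loosely following the examples in the table.
DOMAIN_CLASS = {
    "interface-description": ChangeClass.LOW,
    "monitoring-target": ChangeClass.LOW,
    "vlan": ChangeClass.MEDIUM,
    "tenant-vrf": ChangeClass.MEDIUM,
    "route-policy": ChangeClass.HIGH,
    "evpn-route-target": ChangeClass.HIGH,
    "firewall-deny": ChangeClass.HIGH,
    "underlay": ChangeClass.CRITICAL,
    "edge-bgp": ChangeClass.CRITICAL,
    "aaa": ChangeClass.CRITICAL,
}

# (approval, rollout) per class, taken from the table.
HANDLING = {
    ChangeClass.LOW: ("peer review", "auto or next batch"),
    ChangeClass.MEDIUM: ("network owner", "staged"),
    ChangeClass.HIGH: ("senior network owner", "maintenance window"),
    ChangeClass.CRITICAL: ("change board and rollback owner", "manual gate per batch"),
}


def classify_change(domains) -> ChangeClass:
    """Return the highest class among the touched domains.

    Unknown domains and empty input are treated as CRITICAL (fail closed).
    """
    return max(
        (DOMAIN_CLASS.get(d, ChangeClass.CRITICAL) for d in domains),
        default=ChangeClass.CRITICAL,
    )
```

A pull request that touches both an interface description and a route policy is therefore handled as high risk, and `HANDLING[classify_change(domains)]` gives the approval and rollout to require.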

Security Controls

GitOps for networking needs security controls from the start.

Signed commits for production repos.
Protected branches and required reviews.
CODEOWNERS by network domain.
Short-lived deployment credentials.
Secrets stored outside Git.
Per-vendor and per-site deploy roles.
Audit logs from Git, CI, automation, and devices.
Break-glass account with separate approval and post-use review.
Drift alerts sent to SOC.
Change evidence retained with the ticket or pull request.

The automation account should not have unlimited access everywhere. A pipeline that changes leaf interface descriptions does not need authority to change edge BGP policy.
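
Scoping can be checked before any credential is issued. The sketch below models a deploy role as a set of permitted sites and change domains; `DeployRole` and the labels used with it are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployRole:
    """Hypothetical deploy role scoped by site and change domain."""
    name: str
    sites: frozenset
    domains: frozenset


def out_of_scope(role: DeployRole, sites: set, domains: set) -> list:
    """List every site or domain in the change that the role may not touch."""
    violations = [f"site {s}" for s in sorted(sites - role.sites)]
    violations += [f"domain {d}" for d in sorted(domains - role.domains)]
    return violations
```

The pipeline would request short-lived credentials for the role only when `out_of_scope(...)` returns an empty list. The device-side authorization policy still has to enforce the same limits.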

SOC Integration

A GitOps-managed network should feed the SOC.

Useful events:

pull request opened, approved, merged, or reverted
deployment started, completed, failed, or rolled back
route policy changed
BGP session changed
EVPN neighbor changed
route reflector state changed
interface error or flap after deployment
unexpected config drift
break-glass access used
automation credential failure

This ties network operations to incident response. If an alert fires five minutes after a route policy PR merged, the SOC should see the change context immediately.
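
The correlation itself is simple once change events carry timestamps. A minimal sketch, assuming each change event is a dictionary with a `merged_at` datetime and that a 30-minute look-back window is a sensible default (both are assumptions):

```python
from datetime import datetime, timedelta


def changes_before_alert(alert_time: datetime, changes: list,
                         window: timedelta = timedelta(minutes=30)) -> list:
    """Return changes merged within `window` before the alert, newest first."""
    recent = [
        c for c in changes
        if timedelta(0) <= alert_time - c["merged_at"] <= window
    ]
    return sorted(recent, key=lambda c: c["merged_at"], reverse=True)
```

In a real deployment this lookup would more likely be a SIEM query over ingested Git and CI events than application code, but the logic is the same.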

Vendor Strategy

GitOps should abstract the operating workflow, not hide every vendor detail.

Platform	Fit
SONiC	Strong for open/disaggregated fabrics with Linux and FRR heritage.
Arista EOS	Strong for cloud fabrics, APIs, telemetry, and automation.
Juniper Junos / Apstra	Strong for structured EVPN/VXLAN operations and validation.
Cisco NX-OS / IOS-XE / IOS-XR	Strong where Cisco operational skills, TAC, and certification alignment matter.
Nokia SR Linux / SR OS	Strong for service-provider-style routing and model-driven operations.
Linux + FRR	Strong for route reflectors, service routing, labs, and private-cloud routing components.

The GitOps pipeline should support multiple platforms if the business expects vendor diversity. But do not over-abstract: route policy and EVPN behavior still need platform-specific tests.

Failure Scenarios To Test

Before production adoption, test these:

bad prefix list blocks tenant reachability
wrong route target leaks a tenant route
BGP route reflector policy rejects an expected route
leaf pair deployment succeeds on one leaf and fails on the other
automation loses SSH/API access mid-change
telemetry says success but route table disagrees
manual emergency config creates drift
rollback restores config but not BGP state
source of truth and live network disagree
CI passes syntax but simulation catches reachability loss

Each failure these tests expose should produce a runbook update.

Implementation Phases

Phase 1: Inventory And Read-Only Validation

Build source of truth.
Import device inventory, interfaces, IPAM, VRFs, and BGP sessions.
Render intended configs without applying them.
Compare rendered state to live configs.
Report drift without changing devices.
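
The compare-and-report steps in this phase need nothing beyond a text diff to get started. The sketch below uses Python's standard `difflib`; it assumes the intended and live configurations are already available as strings and normalized the same way. The function name is illustrative.

```python
import difflib


def config_drift(intended: str, live: str, device: str) -> list:
    """Return a unified diff between intended and live config (empty if equal)."""
    return list(
        difflib.unified_diff(
            intended.splitlines(),
            live.splitlines(),
            fromfile=f"{device}/intended",
            tofile=f"{device}/live",
            lineterm="",
        )
    )
```

A line-based diff is sensitive to ordering and whitespace, so it will report noise on platforms that reorder configuration. Structured comparison of parsed state is the usual next step once the read-only phase has proven useful.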

Phase 2: Lab-Gated Changes

Build EVE-NG or containerlab test topologies.
Test BGP, EVPN/VXLAN, route policy, and failure behavior.
Run CI validation on every pull request.
Keep production application manual.

Phase 3: Assisted Production Deployment

Allow approved merges to trigger deployment plans.
Require a human approval before production apply.
Deploy low-risk changes to canary devices first.
Store post-check evidence.

Phase 4: Controlled Automation

Auto-apply low-risk changes.
Keep medium, high, and critical changes behind approval gates.
Add drift PRs.
Add SOC alerts for failed deploys and unexpected drift.

Phase 5: Closed-Loop Operations

Reconcile safe drift automatically.
Open pull requests for unsafe drift.
Enforce SLOs for convergence and route propagation.
Run regular failure drills.

Recommended First Lab

Use the existing EVE-NG environment to prove the workflow:

2 Linux FRR route reflectors
2 leaf switches or routers
2 tenant VRFs
BGP underlay
MP-BGP EVPN overlay
one route-policy change through Git
one simulated bad route leak blocked by CI
one drift event detected after manual change

This lab proves the operating model before touching a real production fabric.

Final Position

Yes, data center network operations can be GitOps-managed. The defensible design is:

Git holds the desired state. CI proves the change is valid. Automation applies it in controlled stages. Telemetry proves the result. Drift detection keeps the network honest. Humans retain approval over high-risk changes.

That is the model to propose for a regional cloud provider or large private cloud. It is cloud-native without being reckless, and it gives both engineers and executives something they can audit: every network change has intent, review, evidence, and rollback.