Security observability and SIEM
Audit-log pipelines, SIEM vs SOAR vs XDR, retention strategy, security SLIs, detection-as-code, incident response workflow, tabletop exercises, and the lab's central-SIEM gap.
If you have implemented every control in Modules 02 through 10, your platform is hard to attack. It will still be attacked. The next question is whether you find out, and how long it takes between the first attempt and the first responder action. That is the question security observability answers.
Two ideas anchor this module. First, all the data sources go somewhere central (a SIEM or equivalent), because correlation across sources is what catches modern attacks. Second, security has SLIs (MTTD, MTTR, false-positive rate, ATT&CK coverage), and the numbers drive the program the same way SRE SLOs drive the platform team.
What security observability is
Three data types fuel security detection.
Audit logs — who did what, when, to which resource. The Kubernetes API audit log, the cluster’s IAM logs, application-level audit events (a user moved $1,000, a developer pushed an image). Audit logs are the system of record; without them, post-incident forensics is guessing.
Security events — detections from a sensor that something specific happened. RHACS alerts, Falco rule firings, IDS/IPS hits, malware-scanner detections, login-anomaly alerts from the IdP. Each event is a hypothesis; the SIEM correlates events with audit logs to confirm or rule out incidents.
Telemetry-for-security — metrics, traces, network flow data that is collected primarily for observability but used for security analysis. A spike in 500 errors on a single endpoint, an outbound DNS query for a suspicious domain, a sudden burst in egress bytes — these signals live in the application’s observability stack but feed the SIEM via export rules.
All three flow into a central correlation engine. The SIEM is where “a developer logged in from a new country” + “the developer’s pod made an unusual outbound call” + “the cluster audit log shows a new ServiceAccount was created” become “this is an incident” instead of three uncorrelated anomalies.
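To make the correlation step concrete, here is a minimal Python sketch of the idea: group normalised events by actor and flag an actor only when every required signal lands inside one time window. The event fields (`ts`, `actor`, `signal`), the signal names, and the one-hour window are illustrative assumptions, not the schema or rule language of any particular SIEM.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical normalised events; field names are illustrative, not a SIEM schema.
EVENTS = [
    {"ts": datetime(2026, 1, 5, 3, 0), "actor": "dev-alice", "signal": "login_new_country"},
    {"ts": datetime(2026, 1, 5, 3, 12), "actor": "dev-alice", "signal": "unusual_egress"},
    {"ts": datetime(2026, 1, 5, 3, 20), "actor": "dev-alice", "signal": "serviceaccount_created"},
    {"ts": datetime(2026, 1, 5, 9, 0), "actor": "dev-bob", "signal": "login_new_country"},
]

REQUIRED = {"login_new_country", "unusual_egress", "serviceaccount_created"}
WINDOW = timedelta(hours=1)  # assumed correlation window


def correlate(events, required=REQUIRED, window=WINDOW):
    """Return actors for whom every required signal occurs within one window."""
    by_actor = defaultdict(list)
    for event in events:
        by_actor[event["actor"]].append(event)

    incidents = []
    for actor, actor_events in by_actor.items():
        actor_events.sort(key=lambda e: e["ts"])
        for i, first in enumerate(actor_events):
            # Signals seen in the window that starts at this event.
            seen = {e["signal"] for e in actor_events[i:] if e["ts"] - first["ts"] <= window}
            if required <= seen:
                incidents.append(actor)
                break
    return incidents


print(correlate(EVENTS))
```

With the sample data only `dev-alice` is flagged; `dev-bob` has a single anomaly and stays below the threshold, which is the difference between an incident and an uncorrelated signal.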
SIEM vs SOAR vs XDR — the jargon
These terms get used as if they were synonyms; they are not.
SIEM (Security Information and Event Management) is log aggregation + correlation + alerting + dashboards. You send every security-relevant log to it; it parses, normalises, retains, and alerts. The big four commercial vendors are Splunk Enterprise Security, IBM QRadar, Microsoft Sentinel, and Elastic Security. Wazuh is the dominant open-source option.
SOAR (Security Orchestration, Automation, and Response) is playbooks that act on SIEM alerts. A SIEM detects; a SOAR responds. Splunk SOAR (formerly Phantom), Palo Alto XSOAR, Tines, and Torq are the major options. The SOAR’s job is to compress the time between alert and action — “page on-call, gather diagnostics, apply a containment policy” is a 30-second SOAR playbook instead of a 30-minute manual checklist.
XDR (Extended Detection and Response) is a vendor-defined category that bundles SIEM + EDR + NDR + SOAR + threat intelligence into one product. CrowdStrike Falcon, SentinelOne Singularity, Microsoft Defender XDR, Palo Alto Cortex XDR are the leaders. XDR works when you buy the whole stack from one vendor; it fights you when you have a heterogeneous environment.
In 2026 most large enterprises run SIEM + SOAR (frequently from the same vendor) + EDR. Mid-size companies often consolidate on an XDR. The category boundaries blur every year — Sentinel is a SIEM and a SOAR; CrowdStrike’s Falcon LogScale is now also a SIEM; Elastic Security covers all three. The right question is not “what category” but “which controls are covered end-to-end.”
The audit-log architecture
Reading the diagram:
- Five data sources on the left feed a forwarder layer (Vector or Fluent Bit) before reaching the SIEM. The forwarder is the choke point for parsing, filtering, and PII redaction.
- The SIEM is the correlation engine. It writes recent data to hot storage for fast search and ships older data to a cold archive for retention.
- The SOAR consumes critical alerts from the SIEM and runs playbooks — auto-collect diagnostics, apply containment, page on-call.
- Solid black edges are batched log shipping; dashed green animated edges are real-time webhooks (RHACS to the forwarder, SIEM to SOAR, SOAR to on-call); dashed grey is the long-term archive flow.
The five sources, in roughly the order of audit relevance:
- Kubernetes audit logs — every API call against the kube-apiserver, with the user, verb, resource, and request body. Filtered (drop verbose reads) and routed to the SIEM.
- Application audit logs — domain events your apps emit when something audit-relevant happens. A “user changed password” event, a “transfer initiated” event, a “permission granted” event. Standardise the format (often a JSON schema per event class) so the SIEM can index them consistently.
- Cloud IAM logs — AWS CloudTrail, Google Cloud Audit Logs, Azure Activity Log. Every IAM action across the cloud account.
- Runtime detections — RHACS sensors, Falco rules, EDR events on hosts. Webhooked to the SIEM for low latency.
- Network observability — NetObserv, Cilium Hubble, VPC flow logs. Sampled, because the volume is otherwise prohibitive.
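The "drop verbose reads" filter on Kubernetes audit logs can be sketched as a small predicate. This is plain Python standing in for whatever the forwarder actually runs; the `verb` and `objectRef.resource` fields follow the Kubernetes audit event format, while the choice to keep reads of `secrets` is an assumption made for illustration.

```python
READ_VERBS = {"get", "list", "watch"}
# Assumption: reads of these resources stay in scope even though other reads are dropped.
SENSITIVE_RESOURCES = {"secrets"}


def keep_for_siem(event: dict) -> bool:
    """Decide whether a Kubernetes audit event is forwarded to the SIEM."""
    verb = event.get("verb", "")
    resource = (event.get("objectRef") or {}).get("resource", "")
    if verb not in READ_VERBS:
        return True  # writes, deletes, and everything else are always kept
    return resource in SENSITIVE_RESOURCES


sample = [
    {"verb": "create", "objectRef": {"resource": "serviceaccounts"}},
    {"verb": "get", "objectRef": {"resource": "secrets"}},
    {"verb": "list", "objectRef": {"resource": "pods"}},
]
print([keep_for_siem(e) for e in sample])
```

Filtering this early keeps ingest volume down, but every dropped class of event is a detection you can no longer write, so the allow-list deserves the same review as a detection rule.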
Retention strategy
Hot vs cold is the cost-vs-utility tradeoff.
- Hot storage (1 to 90 days) is searchable from the SIEM. Supports investigations and live dashboards. Expensive per gigabyte.
- Cold storage (90 days to 7+ years) is an object-storage archive — typically S3 or equivalent — restorable on demand but not real-time queryable. Cheap per gigabyte.
The compliance floor varies:
- PCI-DSS Req 10.5.1 — 12 months minimum total retention, with three months immediately available.
- SOC 2 — depends on the customer commitment, but 12 months is the de facto floor for most SaaS customer asks.
- ISO 27001 — driven by the organisation’s documented data-retention policy; the standard requires you have one and apply it consistently.
- Bank-grade — many transaction-related logs require 7 years. Some jurisdictions go longer for AML-related records.
Plan storage and cost accordingly. A typical mid-size platform generates 1 to 10 TB of security-relevant logs per month; at 24 months of combined hot and cold retention that is roughly 24 to 240 TB before compression, which is significant but manageable with object-storage tiering.
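The sizing arithmetic is simple enough to script. The sketch below assumes a steady monthly volume, a 30-day month, and no compression or deduplication, so treat the output as an upper-bound estimate rather than a quote.

```python
def retention_footprint_tb(monthly_tb: float, hot_days: int, total_months: int) -> dict:
    """Steady-state storage split between hot and cold tiers, before compression."""
    hot_months = hot_days / 30  # rough month length, enough for sizing
    hot_tb = monthly_tb * min(hot_months, total_months)
    cold_tb = monthly_tb * max(total_months - hot_months, 0)
    return {"hot_tb": hot_tb, "cold_tb": cold_tb, "total_tb": hot_tb + cold_tb}


for monthly in (1, 10):
    print(monthly, retention_footprint_tb(monthly, hot_days=90, total_months=24))
```

For the 1 to 10 TB per month range with 90 days hot and 24 months total, this gives 3 to 30 TB hot and 21 to 210 TB cold.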
What to log — and what NOT to log
Log everything that proves who did what:
- Authentication events — every login attempt, success and failure, with source IP and user agent. The IdP usually emits these directly.
- Authorization decisions — every allow/deny by the policy engine, with the principal, the resource, and the policy that decided.
- Admin actions — kubeconfig usage, IAM role assumptions, secret access, configuration changes.
- Change events — deploys, rollbacks, scaling actions, schedule changes.
- Error rates by service — not for individual error logs but for aggregate rates that surface anomalies.
Do not log:
- PII fields beyond what is needed for audit. Username and user ID are usually enough; full name, address, phone number, and account number rarely belong in security logs.
- Full request bodies. Storage explosion plus the PII risk. Log a hash plus the relevant headers.
- Passwords or their hashes. Obvious in principle; surprisingly easy to leak via debug logs.
- Card numbers. PCI scope contamination — once a card number lands in a SIEM, that SIEM is now in PCI scope.
- API keys, secrets, tokens. Same reason — once they are in the SIEM, the SIEM becomes a high-value target.
“What not to log” is where most teams fail an audit. PII leaking into log fields is the most common finding in BFSI security reviews. The fix is two-part: a forwarder-layer redaction policy (for example, a Vector remap transform that calls VRL’s redact function), and an application-layer policy that bans certain fields from log lines in the first place. Belt and braces.
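As an illustration of what a forwarder-layer redaction policy does, here is a minimal Python sketch. It is not Vector configuration; the banned field names, the `request_body` field, and the card-number pattern are assumptions, and the pattern is deliberately coarse (it will also mask other long digit runs).

```python
import hashlib
import re

# Assumption: field names that must never reach the SIEM; adjust to your schema.
BANNED_FIELDS = {"password", "password_hash", "api_key", "token", "card_number"}
# 13 to 19 digits, optionally separated by spaces or hyphens: a coarse card-number pattern.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def redact(record: dict) -> dict:
    """Return a copy of a log record that is safe to forward."""
    clean = {}
    for key, value in record.items():
        if key.lower() in BANNED_FIELDS:
            clean[key] = "[REDACTED]"
        elif key == "request_body":
            # Keep a hash instead of the body: enough to match requests, no content.
            clean["request_body_sha256"] = hashlib.sha256(str(value).encode()).hexdigest()
        elif isinstance(value, str):
            clean[key] = CARD_PATTERN.sub("[REDACTED-PAN]", value)
        else:
            clean[key] = value
    return clean


print(redact({
    "user_id": "u-1029",
    "password": "hunter2",
    "message": "paid with 4111 1111 1111 1111 ok",
    "request_body": '{"amount": 1000}',
}))
```

Pattern-based masking is the backstop, not the primary control: it catches what slipped through, while the application-layer ban keeps the fields out of log lines to begin with.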
Security SLIs and SLOs
Modern SecOps borrows the SRE concept of service-level indicators.
- MTTD (Mean Time To Detect) — minutes between the actual event happening and the first alert. Critical-severity target: under 15 minutes. The number is dominated by detection coverage and parsing latency, not by the SIEM’s own speed.
- MTTR (Mean Time To Respond) — minutes between the first alert and the first responder action. Critical target: under one hour. High target: under four hours.
- False-positive rate — percentage of alerts that turn out to be benign. Target: under 20%. Above that the on-call learns to ignore the SIEM, and the entire program becomes ceremonial.
- Coverage — percentage of MITRE ATT&CK techniques with at least one detection rule. Target: 60 to 80%. 100% is unrealistic and unnecessary; the gap is the prioritised backlog.
- Mean cost per alert — total SIEM + SOAR + analyst-time cost divided by alert count. A weekly review of the noisiest alert sources cuts this number dramatically.
Track them on a dashboard the entire security team sees. The numbers drive priorities. An MTTD of 60 minutes for a critical-severity scenario means you need more detection sources or faster parsing; a 50% false-positive rate means you need tuning before you add new rules.
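A minimal sketch of how these indicators could be computed from closed alert records follows. The record fields (`occurred`, `alerted`, `responded`, `benign`) are hypothetical, and the choice to compute MTTD and MTTR over true positives only is an assumption; some teams include benign alerts in MTTR.

```python
from datetime import datetime
from statistics import mean

# Hypothetical closed alerts. Field names are illustrative, not a SIEM export format.
ALERTS = [
    {"occurred": datetime(2026, 1, 5, 3, 0), "alerted": datetime(2026, 1, 5, 3, 10),
     "responded": datetime(2026, 1, 5, 3, 40), "benign": False},
    {"occurred": datetime(2026, 1, 9, 14, 0), "alerted": datetime(2026, 1, 9, 14, 20),
     "responded": datetime(2026, 1, 9, 15, 0), "benign": False},
    {"occurred": None, "alerted": datetime(2026, 1, 10, 8, 0),
     "responded": datetime(2026, 1, 10, 8, 5), "benign": True},
    {"occurred": None, "alerted": datetime(2026, 1, 11, 8, 0),
     "responded": datetime(2026, 1, 11, 8, 5), "benign": True},
]


def minutes(delta):
    return delta.total_seconds() / 60


def security_slis(alerts):
    """MTTD and MTTR over true positives; false-positive rate over all alerts."""
    real = [a for a in alerts if not a["benign"]]
    return {
        "mttd_min": mean(minutes(a["alerted"] - a["occurred"]) for a in real),
        "mttr_min": mean(minutes(a["responded"] - a["alerted"]) for a in real),
        "false_positive_rate": sum(a["benign"] for a in alerts) / len(alerts),
    }


print(security_slis(ALERTS))
```

On the sample data this yields an MTTD of 15 minutes, an MTTR of 35 minutes, and a false-positive rate of 50%, which by the targets above means tuning comes before new rules.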
Detection-as-code
The mature pattern is to manage detection rules the way you manage infrastructure: in git, code-reviewed, tested, deployed via pipeline.
The cross-SIEM format is Sigma. A Sigma rule is YAML that describes a detection — selectors, conditions, severity, ATT&CK mapping. The sigma-cli tool converts Sigma to Splunk SPL, Elastic KQL, Sentinel KQL, Wazuh rules, or any of two dozen other backends. Write once, deploy to whichever SIEM you happen to run.
A working detection-as-code pipeline:
- One repo with all Sigma rules, organised by source (Kubernetes, AWS IAM, application).
- Every new rule has a unit test: a known-bad event in a fixtures directory, an assertion that the rule fires. sigma-cli backend test or a custom harness.
- CI runs the tests on every PR. Failing tests block the merge.
- A deployment pipeline converts the merged rules to the target SIEM’s format and pushes via the SIEM’s API.
- Monthly tuning review — every rule with a false-positive rate above the threshold is either tuned or retired.
The trap is to write rules without tests. A rule that nobody verified for two years either does not fire (and nobody noticed) or fires constantly on benign traffic (and the on-call has tuned it out of muscle memory). Tests prevent both.
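One way a custom test harness could look is sketched below. It does not parse real Sigma YAML or call sigma-cli; the rule is a plain dictionary with a simplified "field equals one of these values" semantics, and the rule, the fixtures, and the helper names are all hypothetical. The point is the shape: one known-bad fixture that must fire, one benign fixture that must not.

```python
def lookup(event, dotted):
    """Resolve a dotted path such as 'objectRef.resource' in a nested dict."""
    node = event
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


# Simplified stand-in for a detection rule: every field must match one allowed value.
RULE = {
    "title": "Interactive exec into a pod",
    "detection": {
        "verb": ["create", "get"],
        "objectRef.resource": ["pods"],
        "objectRef.subresource": ["exec"],
    },
}


def rule_fires(rule, event):
    return all(lookup(event, field) in allowed
               for field, allowed in rule["detection"].items())


# Fixtures: in a real repo these would live as files in a fixtures directory.
KNOWN_BAD = {"verb": "create", "objectRef": {"resource": "pods", "subresource": "exec"}}
KNOWN_GOOD = {"verb": "get", "objectRef": {"resource": "pods"}}


def test_rule_fires_on_known_bad():
    assert rule_fires(RULE, KNOWN_BAD)


def test_rule_silent_on_benign():
    assert not rule_fires(RULE, KNOWN_GOOD)


test_rule_fires_on_known_bad()
test_rule_silent_on_benign()
print("detection tests passed")
```

The benign fixture matters as much as the malicious one: it is what catches the rule that fires constantly on normal traffic.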
The incident response workflow
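NIST SP 800-61 r2 (Computer Security Incident Handling Guide) defines a four-phase lifecycle: preparation; detection and analysis; containment, eradication, and recovery; and post-incident activity. The six steps below are the common expansion of that lifecycle (the same breakdown SANS teaches). Map them to your on-call runbooks.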
Preparation — done before any incident. Playbooks per scenario, on-call rotations, communication channels, communication templates, evidence-collection scripts. The expensive phase to skip.
Identification — confirm an event is an incident, not a false positive. Classify severity (P0 to P4 or similar). Open the incident channel; activate the rotation.
Containment — isolate the affected component. For a compromised pod: cordon the node, apply a deny-all NetworkPolicy, capture a memory snapshot, then delete. For a leaked credential: revoke the credential, audit the credential’s recent usage. For a malicious admin action: revoke the user’s session, audit other actions by the same actor.
Eradication — remove the underlying cause. Patch the vulnerability, rebuild the image, rotate the secret, remove the malicious workload.
Recovery — restore service. Bring the workload back, validate it is healthy, monitor closely for recurrence.
Lessons learned — postmortem. Identify what worked, what did not, what process or tooling change would have helped. Update playbooks and detection rules; the next incident of the same shape should be easier.
Map each phase to a specific on-call runbook in the lab’s docs (the pattern at /docs/openshift-platform/operations/on-call-and-escalation). A runbook with no clear phase boundary is a runbook that loses time in the handoff between phases.
The SOAR playbook example
A concrete playbook: “high-severity RHACS alert: shell spawned in customer-facing namespace.” The SOAR’s job is to do the five minutes of evidence-gathering and containment that would otherwise delay the human response.
Auto-collect, in parallel: the pod’s manifest and image digest; the last 50 cluster audit events for that pod; the recent NetworkPolicy changes in the namespace; the recent IAM activity for the pod’s ServiceAccount; the pod’s recent egress destinations from NetObserv.
Auto-action: cordon the node so no further workloads schedule on it; apply a deny-all NetworkPolicy to the pod’s namespace (cuts off the attacker’s ongoing access without killing the pod, which preserves memory for forensics).
Notify: page on-call via the incident manager; open a JIRA / Linear ticket with the collected evidence pre-attached; post to the #security-incidents Slack with a summary.
Wait: human investigation; human-driven recovery. The SOAR handed off after the auto-actions; from here it is the responder’s call.
This is the single playbook with the largest MTTD/MTTR improvement in most environments. The bottleneck for a 3am incident is rarely the diagnosis once you have the data; it is collecting the data while still sleepy.
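The collect, act, notify, hand-off shape of the playbook above can be sketched in a few lines of Python. Everything here is a stub: the collector and action functions are placeholders for real API calls to the cluster, the flow collector, and the incident manager, and the pod name is invented. What the sketch does show is parallel evidence collection where one failing source does not block containment or the page.

```python
from concurrent.futures import ThreadPoolExecutor


# Stub collectors and actions: placeholders for real API calls.
def pod_manifest(pod):
    return {"pod": pod, "image_digest": "sha256:placeholder"}


def audit_events(pod):
    return [f"placeholder audit event for {pod}"]


def egress_destinations(pod):
    raise RuntimeError("flow source unreachable")  # simulate a failing source


def cordon_node(pod):
    return f"cordoned node hosting {pod}"


def apply_deny_all(pod):
    return f"deny-all NetworkPolicy applied to namespace of {pod}"


def run_playbook(pod, collectors, actions, notify):
    """Collect evidence in parallel, run containment actions, then hand off."""
    evidence = {}
    with ThreadPoolExecutor(max_workers=len(collectors)) as pool:
        futures = {name: pool.submit(fn, pod) for name, fn in collectors.items()}
        for name, future in futures.items():
            try:
                evidence[name] = future.result()
            except Exception as exc:
                # A failed collector must not block containment or the page.
                evidence[name] = {"error": str(exc)}
    performed = [action(pod) for action in actions]
    notify(pod, evidence, performed)
    return evidence, performed


if __name__ == "__main__":
    run_playbook(
        "shop/checkout-7d9f",  # hypothetical namespace/pod
        {"manifest": pod_manifest, "audit": audit_events, "egress": egress_destinations},
        [cordon_node, apply_deny_all],
        notify=lambda pod, evidence, performed: print("PAGE on-call:", pod, performed),
    )
```

After `notify` returns, the function is done; investigation and recovery stay with the responder, matching the hand-off described above.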
Tabletop exercises and chaos
Run a tabletop quarterly. Pick a realistic scenario:
- Insider credential leak — a developer’s GitHub token leaks in a public gist; the attacker uses it to push a malicious image.
- Ransomware on build server — your CI runner is compromised; the attacker stages a malicious build.
- Zero-day on a public API — an unauthenticated RCE in a dependency you ship; exploitation in the wild within hours.
- Compromised third-party SaaS — a SaaS provider you depend on (logging, observability, IdP) is breached; what is your blast radius?
Walk through the playbook in a room. Identify gaps: missing detections, missing runbook steps, missing comms templates, missing approval authorities. Tabletops are cheap; finding the gap in production is expensive.
Chaos engineering for security is a more aggressive version. Inject a fake malicious event (kubectl exec -- bash in a test namespace, a deliberately-leaked test credential) and time how long it takes for the alert to fire and the responder to take action. Treat the numbers as a baseline; the next quarter’s drill should be faster.
Lab posture
What the lab has today:
- Audit logs going to Loki, per-spoke. Cluster audit events, application logs, ingress access logs. Hot retention 30 days, no cold archive yet.
- SigNoz for application observability — APM, metrics, traces. SigNoz is not a SIEM; it is the observability stack for application behaviour, useful as a security telemetry source but not the correlation engine.
- RHACS for runtime detection on spoke-dc-v6. Policies tuned for the brac-poc and bank-employees tenants; webhook to nothing yet.
- Compliance Operator scanning nightly (Module 10), results in cluster.
Major gaps, all currently Tier-1 or Tier-2 in the BFSI readiness review at /docs/openshift-platform/foundations/bfsi-readiness-review:
- No central SIEM. Logs are per-cluster in Loki; cross-source correlation is manual. Wazuh on a small VM is the planned starting point; Splunk Cloud is the BFSI-target endpoint.
- No SOAR. Containment is human-driven from the on-call runbook. Tines or n8n is the planned starting point.
- No detection-as-code. No Sigma rules in git, no CI for detections, no test fixtures. Greenfield exercise once the SIEM is up.
- MTTD and MTTR not measured. No baseline; first task once the SIEM is live.
- Cold-archive retention is not configured. PCI-DSS Req 10.5.1’s 12-month minimum is not met for security logs (it is met for transaction logs separately).
The order of operations is the order this module recommends: stand up the SIEM, add the data sources, write the first ten detections, instrument MTTD/MTTR, then bring in SOAR. Skipping ahead to SOAR before the SIEM is reliably ingesting is a common failure.
Try this
Exercise 1. Stand up Wazuh on a small test cluster; the upstream Helm chart works. Configure the kube-apiserver to forward its audit log to Wazuh through the audit webhook backend (an audit policy file plus a webhook configuration file). Generate test events (kubectl get secrets, kubectl create serviceaccount) and verify they appear in the Wazuh dashboard within a minute. Note the latency; that is your floor MTTD for any detection based on the audit log.
Exercise 2. Write a Sigma rule for “user added to the cluster-admin ClusterRoleBinding.” The detection looks for verb: create OR update, objectRef.resource: clusterrolebindings, and requestObject.roleRef.name: cluster-admin. Convert it to Wazuh, Splunk, and Elastic format with sigma-cli convert. Test by adding a test user to a CRB in a sandbox cluster and confirming the rule fires.
Exercise 3. Draft an IR playbook for “a secret was committed to a public git repo.” The playbook covers: detection (gitleaks scan in CI, GitHub Secret Scanning, pre-commit hook); identification (which secret? when committed? still active?); containment (revoke the secret immediately; rotate dependents); eradication (force-push the history rewrite, or treat the secret as burned and rebuild); recovery (deploy with the new secret); lessons learned (why did pre-commit fail? add a CI gate). Aim for one page of action steps an on-call can execute.
Common failure modes
SIEM ingest costs explode. Application-debug logging routed to security ingest is the usual culprit; $50K/month of DEBUG logs nobody investigates. Filter at the forwarder, not at the SIEM. Build a per-source cost dashboard so the trend is visible.
Critical alerts buried in noise. False-positive rate above 50% means the on-call ignores everything, including the one alert that mattered. Postmortems for “we had the data but missed the incident” almost always trace back here. Aggressive tuning is the answer; retire rules that do not improve precision after two cycles.
Audit logs lost during a region failover. Single-region SIEM with no cross-region replication is the canonical incident-postponement bug. For BFSI, cross-region replication is mandatory; for everyone else it is a Tier-2 issue worth budgeting for.
Detection coverage measured but never improved. The dashboard says 60% ATT&CK coverage; six months later it still says 60%. Without a process that converts coverage gaps into backlog items, the number is decorative.
SOAR playbooks that take destructive actions without confirmation. An auto-applied “deny-all NetworkPolicy” that breaks a customer-facing pod during a false positive is a worse incident than the one it was responding to. SOAR auto-actions belong on the “reversible and bounded” side of the line; anything destructive needs a human in the loop or a tight blast radius.
Logging in production but not in pre-production. An attacker stages in dev or staging where logs are off; by the time they reach prod, the early stages are invisible. Apply the same observability everywhere; the storage cost of dev logs is negligible compared to the forensic value during an incident.
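The "reversible and bounded" line for SOAR auto-actions, described in the failure modes above, can be expressed as an explicit gate. This is a sketch under stated assumptions: the `Action` type, the blast-radius measure (number of workloads affected), and the threshold of one are all hypothetical choices, not a feature of any SOAR product.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_AUTO_BLAST_RADIUS = 1  # assumed limit: auto-actions may touch one workload


@dataclass(frozen=True)
class Action:
    name: str
    reversible: bool
    blast_radius: int  # hypothetical measure: number of workloads affected
    run: Callable[[], str]


def execute(action: Action, approved_by: Optional[str] = None) -> str:
    """Run automatically only if reversible and bounded; otherwise hold for a human."""
    auto_ok = action.reversible and action.blast_radius <= MAX_AUTO_BLAST_RADIUS
    if not auto_ok and approved_by is None:
        return f"HELD {action.name}: needs human approval"
    return action.run()


label_pod = Action("quarantine-label", True, 1, lambda: "pod labelled for quarantine")
deny_all = Action("namespace-deny-all", True, 12, lambda: "deny-all applied")
delete_pod = Action("delete-pod", False, 1, lambda: "pod deleted")

print(execute(label_pod))
print(execute(deny_all))
print(execute(deny_all, approved_by="on-call"))
print(execute(delete_pod))
```

Encoding the rule in the playbook runner, rather than leaving it to each playbook author, makes the policy reviewable in one place.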
References
- NIST SP 800-61 r2, Computer Security Incident Handling Guide: https://csrc.nist.gov/publications/detail/sp/800-61/rev-2/final
- MITRE ATT&CK: https://attack.mitre.org/
- MITRE D3FEND: https://d3fend.mitre.org/
- Sigma project: https://github.com/SigmaHQ/sigma
- Sigma CLI: https://github.com/SigmaHQ/sigma-cli
- Splunk Enterprise Security: https://www.splunk.com/en_us/products/enterprise-security.html
- IBM QRadar: https://www.ibm.com/qradar
- Microsoft Sentinel: https://learn.microsoft.com/en-us/azure/sentinel/
- Elastic Security: https://www.elastic.co/security
- Wazuh: https://documentation.wazuh.com/
- Falco: https://falco.org/docs/
- Vector: https://vector.dev/docs/