Draft | Internal | ISO 9001 | ISO 27001 | ISO 14001

SW-SAAS-OPS-001

Swedwise SaaS Platform - Operations Guide

Version

1.0

Owner

SaaS Operations Manager

Effective Date

[TBD]

Review Date

[TBD]

1. Purpose

This procedure establishes comprehensive operational standards and procedures for the Swedwise SaaS platform, ensuring:

Reliable 24x7 service delivery meeting SLA commitments (99.9% uptime)
Consistent incident and problem management across all service components
Effective change management minimizing service disruption
Proactive monitoring and capacity management
Security operations aligned with ISO 27001 requirements
Continuous improvement and operational excellence

This document provides the foundation for all SaaS operations. Component-specific procedures are detailed in separate addendum documents (see Section 15).

2. Scope

This procedure applies to:

In Scope:

All Swedwise SaaS platform services and infrastructure
Swedwise Communications service (OpenText Communications + Notifications)
Shared platform components (Kubernetes, databases, networking, security)
24x7 operations center and support organization
All operational staff (operations engineers, support staff, on-call personnel)
Customer-facing SaaS services and APIs

Out of Scope:

Internal Swedwise IT systems (managed separately)
Customer-specific project deliveries (unless affecting shared infrastructure)
Professional services and consulting work
Sales and marketing systems (unless integrated with SaaS platform)

3. Relationship to IMS Procedures

The SaaS Operations procedures implement and extend the Swedwise Integrated Management System:

IMS Procedure	SaaS Operations Implementation
SW-ISMS-PRO-001: Incident Management	Security incidents follow IMS procedure; SaaS-specific technical response detailed here
SW-IMS-PRO-008: Change Management	All SaaS changes follow IMS change process; technical implementation detailed here
SW-ISMS-POL-001: Information Security Policy	Security operations implement policy requirements
SW-QMS-PRO-XXX: Service Level Management	SLA monitoring and reporting procedures

Principle: Where this guide conflicts with an IMS procedure, the IMS procedure takes precedence. This guide supplies the SaaS-specific technical implementation details.

4. Operations Organization

4.1. Operations Roles and Responsibilities

Role	Responsibility	Hours	Escalation
SaaS Operations Manager	Overall operational responsibility, SLA accountability, resource management, continuous improvement	Business hours + on-call	CEO
SaaS Operations Center (SOC)	24x7 monitoring, incident response, first-level troubleshooting, customer communication	24x7 shifts	Operations Manager
Technical Support L2	In-depth technical troubleshooting, problem analysis, escalation support	Extended hours + on-call	Operations Manager
Technical Support L3	Vendor liaison (OpenText), complex issues, architecture changes, performance optimization	On-call	CISO / CTO
Customer Success Manager	Customer liaison, SLA reporting, planned maintenance communication, escalation management	Business hours	Management Team
On-Call Engineer (Primary)	After-hours incident response, critical issue escalation	24x7 rotation	On-Call Engineer (Secondary)
On-Call Engineer (Secondary)	Backup on-call, high-severity incidents, vendor escalation	24x7 rotation	Operations Manager

4.2. On-Call Schedule

Rotation:

Primary On-Call: 1-week rotation among qualified operations engineers
Secondary On-Call: Operations Manager or designated senior engineer
Schedule Published: 4 weeks in advance, accessible via [TBD - on-call system]

On-Call Requirements:

Response time: 15 minutes for Critical incidents
Mobile phone availability
Laptop with VPN access
Access to password vault and runbooks
Handover notes between shifts

Escalation Path:

Incident → SOC → On-Call Primary → On-Call Secondary → Operations Manager → CEO

4.3. Coverage Model

Time Window	Coverage	Staffing
Business Hours (Mon-Fri 08:00-17:00 CET)	Full team	SOC + L2 + L3 + Management
Extended Hours (Mon-Fri 17:00-22:00 CET)	Remote monitoring + on-call	SOC (remote) + On-Call
Night/Weekend (Mon-Fri 22:00-08:00 CET, Sat-Sun all day)	Monitoring + on-call	Automated monitoring + On-Call

4.4. Handover Procedures

Daily Operational Handover (at shift change):

Review open incidents and status
Review planned maintenance activities
Check monitoring alerts and system health
Highlight customer issues or concerns
Update operational log and handover notes

Weekly Operations Review:

Every Monday 10:00 CET
Review previous week's incidents, changes, and performance
Plan upcoming maintenance and changes
Address operational issues and improvements

5. Service Monitoring

5.1. Monitoring Philosophy

Proactive Monitoring:

Detect issues before customers are impacted
Alert on trends before thresholds are breached
Predict capacity issues and plan scaling

Layered Monitoring:

Infrastructure Layer: Servers, networks, storage
Platform Layer: Kubernetes, databases, message queues
Application Layer: APIs, services, integrations
Business Layer: SLA metrics, customer usage, billing

5.2. 24x7 Monitoring

Monitoring Tools

Tool	Purpose	Access
Prometheus + Grafana	Metrics collection and visualization	https://metrics.swedwise.com
ELK Stack	Centralized logging and analysis	https://logs.swedwise.com
AlertManager	Alert routing and management	Integrated with PagerDuty
PagerDuty	On-call alerting and incident tracking	Mobile app + web
Kubernetes Dashboard	Container orchestration monitoring	VPN access only
Fortinet FortiGate	Firewall and network security monitoring	VPN access only

Monitoring Dashboards

1. Platform Health Dashboard (primary operations view)

Cluster health (node status, pod health, resource utilization)
Database health (connections, replication lag, query performance)
Network health (bandwidth, latency, errors)
Storage health (disk usage, IOPS, backup status)

2. Service Health Dashboard (customer-facing services)

API availability and response times
Service uptime (per component)
Error rates and types
Queue depths and processing times

3. SLA Dashboard (for SLA reporting)

Current month uptime percentage
Incident history and downtime
Customer-specific SLA status
SLA breach risk indicators

4. Capacity Dashboard (for planning)

Resource utilization trends (CPU, memory, storage)
Growth projections
Tenant usage patterns
Capacity thresholds and forecasts

5.3. Alerting Strategy

Alert Severity Levels

Severity	Definition	Response Time	Notification
Critical	Service outage or imminent failure, customer impact	15 minutes	PagerDuty → Phone call
High	Service degradation, potential customer impact	30 minutes	PagerDuty → SMS + Email
Medium	Performance issue, no immediate customer impact	1 hour	Email + Slack
Low	Informational, trending toward issue	4 hours	Slack notification
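
As an illustration only, the response times in the table above can be expressed as a deadline calculation. This is a minimal sketch: the names `RESPONSE_TIME` and `response_deadline` are hypothetical and do not refer to any existing Swedwise tooling.

```python
from datetime import datetime, timedelta

# Response times per severity, taken from the Alert Severity Levels table.
RESPONSE_TIME = {
    "critical": timedelta(minutes=15),
    "high": timedelta(minutes=30),
    "medium": timedelta(hours=1),
    "low": timedelta(hours=4),
}


def response_deadline(severity: str, raised_at: datetime) -> datetime:
    """Return the latest time by which the alert should have a response."""
    return raised_at + RESPONSE_TIME[severity.lower()]
```

For example, a Critical alert raised at 10:30 has a response deadline of 10:45 the same day.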

Key Alerts (Platform Level)

Critical Alerts:

API endpoint down (>3 consecutive failed health checks)
Database primary failure
Kubernetes node failure (>30% nodes unavailable)
Storage >95% full
Network outage (complete loss of connectivity)

High Alerts:

API response time >5 seconds (p99) for >5 minutes
Database replication lag >60 seconds
Pod restart loop (>3 restarts in 10 minutes)
Storage >85% full
Error rate >5% for >5 minutes

Medium Alerts:

API response time >2 seconds (p95) for >10 minutes
Memory utilization >85% sustained
Backup job failed
Certificate expiration <30 days
Queue depth >1000 jobs

Low Alerts:

API response time >1 second (p95) trending up
Disk utilization >70%
Unusual traffic patterns
Failed login attempts >10 per minute
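
Several of the thresholds above combine a count with a time window. The sketch below illustrates that logic for the pod restart loop alert (more than 3 restarts in 10 minutes). In practice the rule would most likely live in the monitoring stack; the function name and its inputs are assumptions made for illustration.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)  # "in 10 minutes"
MAX_RESTARTS = 3                # alert when strictly more than 3 restarts


def is_restart_loop(restart_times: list[datetime], now: datetime) -> bool:
    """True if more than MAX_RESTARTS restarts fall within the last WINDOW."""
    recent = [t for t in restart_times if now - WINDOW <= t <= now]
    return len(recent) > MAX_RESTARTS
```

Restarts older than the window are ignored, so a pod that restarted four times over an hour would not trigger the alert.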

Alert Routing

# Alertmanager routing rules example
route:
  # The root route requires a default receiver; slack-info is assumed here
  receiver: slack-info
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-critical
      repeat_interval: 15m

    - matchers:
        - severity="high"
      receiver: pagerduty-high
      repeat_interval: 30m

    - matchers:
        - severity="medium"
      receiver: slack-ops
      repeat_interval: 2h

    - matchers:
        - severity="low"
      receiver: slack-info
      repeat_interval: 24h

5.4. Health Check Endpoints

All services must expose health check endpoints:

Liveness Check (/health/live):

Purpose: Is the service process running?
Response: 200 OK or 503 Service Unavailable
Action on failure: Kubernetes restarts the pod

Readiness Check (/health/ready):

Purpose: Is the service ready to accept traffic?
Response: 200 OK (ready) or 503 (not ready)
Action on failure: Remove from load balancer

Health Check Format:

{
  "status": "healthy",
  "timestamp": "2025-01-15T10:30:45Z",
  "version": "1.2.3",
  "checks": {
    "database": "healthy",
    "cache": "healthy",
    "storage": "healthy",
    "dependencies": "healthy"
  }
}
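
A readiness handler could assemble this payload roughly as follows. This is a sketch under assumptions: the function name is hypothetical, and the value "unhealthy" for a failing check is assumed, since the format above only shows "healthy".

```python
import json
from datetime import datetime, timezone


def build_health_response(checks: dict[str, bool], version: str) -> tuple[int, str]:
    """Return (HTTP status code, JSON body) in the health check format above."""
    all_ok = all(checks.values())
    body = {
        # "unhealthy" is an assumed label; the guide only shows "healthy".
        "status": "healthy" if all_ok else "unhealthy",
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "version": version,
        "checks": {name: "healthy" if ok else "unhealthy" for name, ok in checks.items()},
    }
    # 200 when every dependency check passes, otherwise 503.
    return (200 if all_ok else 503), json.dumps(body, indent=2)
```

Returning 503 when any dependency check fails matches the readiness semantics described above, where a not-ready service is removed from the load balancer.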

6. Incident Management

6.1. Incident Management Process

For SaaS platform incidents, follow SW-ISMS-PRO-001 (Incident Management Procedure) with these SaaS-specific additions:

Incident Categories (SaaS-Specific)

Category	Examples	Typical Severity
Service Outage	API unavailable, complete service down, authentication failure	Critical
Service Degradation	Slow response times, intermittent errors, partial functionality loss	High to Medium
Data Issue	Document generation failures, data corruption, missing data	High to Critical
Integration Failure	External API failures (email, SMS), customer integration errors	Medium to High
Security Incident	Unauthorized access, DDoS attack, data breach	Critical
Infrastructure Failure	Kubernetes node failure, database failure, network outage	Critical to High
Capacity Issue	Resource exhaustion, quota exceeded, performance degradation	Medium to High

SaaS-Specific Severity Criteria

Critical Incident:

Complete service outage affecting all customers
Data breach or security compromise
Data loss affecting multiple customers
SLA breach imminent or occurring

High Incident:

Service degradation affecting multiple customers
Single customer complete outage
Integration failure affecting customer workflows
Security threat detected but contained

Medium Incident:

Performance issues with workaround available
Single customer functionality issue
Non-critical integration failure
Intermittent errors affecting small subset of users

Low Incident:

Cosmetic issues
Single user issues
Informational security events

6.2. Incident Communication

Internal Communication

Incident Channel:

Create dedicated Slack channel: #incident-YYYYMMDD-NNN
Include: SOC, On-Call, Operations Manager, Customer Success
Use channel for all incident-related communication

Status Updates:

Critical: Every 30 minutes until resolved
High: Every 1 hour until resolved
Medium: Every 4 hours or at milestones
Low: At resolution

Customer Communication

When to Notify Customers:

Service outage or degradation visible to customers
Potential data impact
Extended downtime expected (>30 minutes)
SLA breach occurring or likely

Communication Channels:

Email to designated customer contacts (primary)
Service status page update (if available)
In-application banner (if applicable)

Communication Templates:
See SW-ISMS-PRO-001 Section 7.9 for customer communication templates.

Customer Communication Approval:

Critical Incidents: Operations Manager approval required
Customer-Specific Issues: Customer Success Manager coordinates
General Service Issues: SOC can send pre-approved templates

6.3. SaaS-Specific Incident Procedures

API Service Outage

Verify outage scope (all endpoints or specific service?)
Check Kubernetes pod status and logs
Check database connectivity
Check external dependencies (OpenText, email/SMS gateways)
Attempt service restart if safe
Escalate to L2/L3 if restart doesn't resolve
Implement temporary failover if available

Database Failure

Check PostgreSQL cluster status (primary, replicas)
Verify that Patroni triggered automatic failover
If failover failed, trigger manual promotion of replica
Verify application reconnects to new primary
Investigate root cause (storage, memory, corruption)
Plan re-syncing of old primary as new replica

Kubernetes Node Failure

Check node status: kubectl get nodes
Verify pods rescheduled to healthy nodes
Check if node can be cordoned and drained
If hardware failure, work with data center partner
Monitor pod scheduling and resource availability
Add temporary node capacity if needed
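
The first step of the node failure procedure can be scripted against the JSON form of the same command (`kubectl get nodes -o json`). The sketch below only parses that output; the function name and the sample node names are made up for illustration.

```python
import json


def not_ready_nodes(node_list: dict) -> list[str]:
    """Return names of nodes whose Ready condition is not "True"."""
    failed = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        if ready is None or ready.get("status") != "True":
            failed.append(node["metadata"]["name"])
    return failed


# Trimmed sample of `kubectl get nodes -o json` output (node names are made up).
SAMPLE = json.loads("""
{"items": [
  {"metadata": {"name": "worker-1"},
   "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
  {"metadata": {"name": "worker-2"},
   "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}}
]}
""")

print(not_ready_nodes(SAMPLE))  # ['worker-2']
```

A Ready status of "Unknown" is treated as a failure here, because it usually means the node has stopped reporting to the control plane.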

Document Generation Failures

Check OpenText Communications pod status
Review application logs for errors
Check template validity and accessibility
Verify database connectivity for job queue
Check object storage access for template/asset retrieval
Test with known-good document to isolate issue

Notification Delivery Failures

Check notification service pod status
Verify external gateway connectivity (email, SMS)
Check authentication credentials for gateways
Review bounce/failure logs
Check for rate limiting or quota issues
Test with single notification to isolate issue

7. Problem Management

7.1. Purpose of Problem Management

Problem Management focuses on identifying and addressing the root causes of incidents to prevent recurrence.

Goals:

Reduce the number of incidents
Minimize impact of incidents that cannot be prevented
Improve overall service quality and stability

7.2. Problem Management Process

Incident(s) → Problem Identification → Root Cause Analysis →
Solution Development → Implementation → Verification → Closure

Problem Triggers

Problems are identified from:

Recurring Incidents: Same issue occurring multiple times
Post-Incident Reviews: Root cause identified but not yet addressed
Proactive Analysis: Monitoring trends, vulnerability assessments
Customer Feedback: Patterns in customer complaints

Problem Record

Create a Problem Record in [TBD - Problem tracking system] containing:

Problem ID and title
Related incidents (INC-YYYY-MM-####)
Affected services and components
Business impact and urgency
Root cause analysis findings
Proposed solution
Implementation plan
Verification criteria

7.3. Root Cause Analysis (RCA)

Techniques:

5 Whys: Iteratively ask "why" to drill down to root cause
Fishbone Diagram: Identify contributing factors across categories (people, process, technology, environment)
Timeline Analysis: Reconstruct event sequence leading to incident
Change Analysis: Identify recent changes that may have contributed

RCA Report Contents:

Problem summary and business impact
Timeline of events
Root cause(s) identified
Contributing factors
Lessons learned
Preventive actions recommended
Action plan with owners and deadlines

RCA Timing:

Critical Problems: Within 5 business days
High Problems: Within 10 business days
Medium Problems: Within 30 days
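
The RCA deadlines above mix business days and calendar days. A minimal sketch of the calculation follows, assuming that business days are Monday to Friday, that public holidays are ignored, and that the 30 days for Medium problems are calendar days as written. The function names are hypothetical.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Advance by the given number of Mon-Fri days (public holidays are ignored)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 = Monday-Friday
            remaining -= 1
    return current


def rca_due_date(priority: str, opened: date) -> date:
    """Return the RCA deadline for a problem opened on the given date."""
    priority = priority.lower()
    if priority == "critical":
        return add_business_days(opened, 5)
    if priority == "high":
        return add_business_days(opened, 10)
    if priority == "medium":
        return opened + timedelta(days=30)  # calendar days, as written above
    raise ValueError(f"no RCA deadline defined for priority {priority!r}")
```

For a Critical problem opened on Wednesday 15 January 2025, this gives a deadline of Wednesday 22 January 2025.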

7.4. Known Error Database

Known Errors are problems that have a documented workaround but no permanent solution yet.

Known Error Record Contains:

Problem ID reference
Symptoms and diagnosis
Workaround procedure
Status (temporary workaround, permanent fix planned, permanent fix implemented)
Date permanent solution expected

Access:

All operations staff have access to the Known Error Database
SOC uses it during incident response to apply workarounds quickly
Customer Success can share workarounds with customers (approved cases)

7.5. Problem Resolution

Resolution Options:

Permanent Fix: Address root cause, implement via Change Management
Workaround: Temporary mitigation, plan permanent fix for future
Accept Risk: If fix cost exceeds benefit, document and accept
Vendor Fix: Escalate to vendor (OpenText), track until resolution

Change Management Integration:

All permanent fixes require Change Request (SW-IMS-PRO-008)
Problem ID referenced in Change Request
Post-implementation verification confirms problem resolved

8. Change Management

8.1. Change Management Process

All SaaS platform changes follow SW-IMS-PRO-008 (Change Management Procedure).

This section provides SaaS-specific implementation details.

8.2. SaaS Change Categories

Standard Changes (Pre-Approved)

Standard Change	Procedure	Approval
Security patch deployment (non-critical systems)	Apply to test, verify 48h, deploy to production	Operations Manager
SSL certificate renewal	Automated via Let's Encrypt; manual verification	Automatic
User account provisioning/deprovisioning	Follow access control procedure	Customer Success Manager
Tenant configuration change (single tenant)	Documented procedure, testing in staging	Operations Engineer
Backup schedule adjustment	Modify backup job, verify next run	Operations Manager
Log retention policy update	Update Elasticsearch lifecycle policy	Operations Manager

Normal Changes (Require CAB Approval)

New service feature deployment
OpenText Communications version upgrade
Kubernetes cluster upgrade
Database schema changes affecting multiple tenants
Network configuration changes
Security control changes
New integration implementation
Capacity expansion (adding nodes, storage)

Emergency Changes

Security vulnerability patch (actively exploited)
Service restoration after critical outage
Data breach containment
Critical bug fix (data loss risk)

8.3. SaaS Change Windows

Standard Maintenance Windows:

Window	Schedule	Duration	Use Case
Weekly Maintenance	Wednesday 22:00-02:00 CET	4 hours	Low-risk changes, patches, configuration updates
Monthly Extended	First Sunday 00:00-06:00 CET	6 hours	Higher-risk changes, version upgrades, infrastructure changes
Emergency	Anytime	As needed	Critical security or service restoration
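
A change scheduler could check a proposed start time against the two standard windows as sketched below. The function name is hypothetical, and the input is assumed to be a naive datetime already expressed in CET local time; time zone conversion is deliberately left out.

```python
from datetime import datetime


def in_standard_window(local: datetime) -> bool:
    """True if a naive datetime (assumed CET local time) is in a standard window."""
    wd, hour = local.weekday(), local.hour  # Monday=0 ... Sunday=6
    # Weekly: Wednesday 22:00 until Thursday 02:00.
    weekly = (wd == 2 and hour >= 22) or (wd == 3 and hour < 2)
    # Monthly extended: first Sunday of the month, 00:00-06:00.
    monthly = wd == 6 and local.day <= 7 and hour < 6
    return weekly or monthly
```

The weekly window crosses midnight, so it is checked as two half-windows: late Wednesday and early Thursday.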

Customer Notification Requirements:

Change Impact	Advance Notice	Approval
No service disruption	3 business days	Operations Manager
Brief disruption (<15 min)	5 business days	Operations Manager + Customer Success
Extended disruption (>15 min)	10 business days	Management Team
Major functionality change	10 business days	Management Team + Customer notification

8.4. SaaS Change Procedures

Pre-Implementation Checklist

Change Request approved by CAB
Testing completed in staging environment
Rollback plan documented and tested
Customer notification sent (if required)
On-call engineer briefed
Backup completed and verified
Monitoring alerts configured for change verification

During Implementation

Create dedicated change Slack channel: #change-YYYYMMDD-NNN
Execute change steps per approved plan
Document each step and timestamp
Monitor metrics and logs for issues
Decision point: Proceed or rollback?

Post-Implementation

Verification tests passed
Monitoring confirms stable operation
Customer notification of completion (if applicable)
Change record updated with outcome
Post-Implementation Review scheduled
Documentation updated

8.5. Change Freeze Periods

Defined Change Freeze Periods:

Christmas/New Year: December 20 - January 5 (Hard freeze)
Summer Vacation: July 1-31 (Soft freeze)
Major Customer Deliverables: As announced (Soft freeze)

During Change Freeze:

Hard Freeze: Emergency changes only
Soft Freeze: Critical and high-risk changes require CEO approval
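
The two fixed freeze periods can be checked automatically before a change is scheduled. The sketch below is illustrative only: `freeze_type` is a hypothetical name, and freezes announced for major customer deliverables are not modelled because their dates are not fixed.

```python
from datetime import date
from typing import Optional


def freeze_type(day: date) -> Optional[str]:
    """Return "hard", "soft", or None for the fixed annual freeze periods."""
    # Christmas/New Year: December 20 - January 5 spans the year boundary.
    if (day.month == 12 and day.day >= 20) or (day.month == 1 and day.day <= 5):
        return "hard"
    # Summer vacation: July 1-31.
    if day.month == 7:
        return "soft"
    return None  # announced customer-deliverable freezes are not modelled here
```

Because the hard freeze spans the year boundary, it is tested as two separate date ranges rather than as one interval.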

9. Backup and Recovery

9.1. Backup Strategy

Objectives:

RPO (Recovery Point Objective): 6 hours maximum data loss
RTO (Recovery Time Objective): 4 hours maximum recovery time

9.2. Platform-Level Backups

Database Backups (PostgreSQL)

Backup Type	Frequency	Retention	Storage Location
Full Backup	Daily (02:00 CET)	7 days local, 30 days offsite	Primary DC + DR DC
Incremental (WAL)	Continuous	7 days	Primary DC + DR DC
Point-in-Time Recovery	Every 6 hours	7-day window	DR DC

Backup Verification:

Automated restore test: Weekly (random database)
Manual restore test: Monthly (full platform)
DR failover test: Quarterly

Object Storage Backups

Content Type	Backup Method	Retention
Templates	Cross-region replication + versioning	30 versions
Generated Documents	Cross-region replication	90 days hot, 7 years archive
Assets	Cross-region replication + versioning	10 versions
Configuration Files	Git repository + encrypted backup	Indefinite

Kubernetes Configuration Backups

Method: GitOps - all manifests in Git repository
Frequency: On every change (automated)
Retention: Full Git history
Secrets: Separate encrypted backup (Sealed Secrets or Vault)

9.3. Backup Monitoring

Backup Success Alerts:

Alert on backup job failure
Alert on backup size anomaly (>50% change)
Alert on backup duration exceeding threshold
Alert on replication lag >30 minutes
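
The size anomaly rule compares each backup with the previous one. A minimal sketch, assuming the 50% figure is measured relative to the previous backup's size and that a missing baseline should also raise an alert (both are assumptions, as is the function name):

```python
def backup_size_anomaly(previous_bytes: int, current_bytes: int, limit: float = 0.5) -> bool:
    """True if the size changed by more than `limit` relative to the previous backup."""
    if previous_bytes <= 0:
        # No usable baseline: flag it so that someone takes a look (assumed policy).
        return True
    change = abs(current_bytes - previous_bytes) / previous_bytes
    return change > limit
```

The check is symmetric: a backup that shrinks by more than half is flagged as well, which may indicate that data is missing from the backup.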

Backup Reports:

Daily backup status report (automated email)
Weekly backup verification summary
Monthly DR readiness report

9.4. Recovery Procedures

Database Recovery Scenarios

Scenario 1: Single Table Corruption

Identify affected tenant and table
Restore from most recent backup to temporary database
Export affected table data
Verify data integrity
Import corrected data to production
Verify customer functionality

Scenario 2: Database Primary Failure

Patroni automatically promotes replica to primary (~30 seconds)
Verify applications reconnect to new primary
Investigate failed primary (hardware, corruption)
Rebuild failed primary as new replica
Re-sync data from new primary
Monitor replication lag until caught up

Scenario 3: Complete Database Loss

Declare disaster, activate DR plan
Promote DR site database to primary
Restore most recent backup (if the DR site is not in sync)
Apply WAL archives to achieve minimal data loss
Update DNS to point to DR site
Verify application connectivity and functionality
Customer notification of recovery

Object Storage Recovery

Scenario 1: Accidental Deletion

Identify deleted object and timestamp
Use object versioning to restore previous version
Verify restoration with customer

Scenario 2: Object Corruption

Identify corrupted object
Restore from cross-region replica
If replica also corrupted, restore from backup

Full Platform Recovery (Disaster Recovery)

See Section 10: Disaster Recovery for complete procedures.

10. Disaster Recovery

10.1. DR Strategy

Disaster Recovery Objectives:

RTO: 4 hours (maximum time to restore service)
RPO: 6 hours (maximum data loss)
DR Site: Secondary data center in Sweden

DR Scenarios:

Complete data center failure (fire, flood, power loss)
Multiple simultaneous infrastructure failures
Prolonged network outage
Ransomware or catastrophic security breach

10.2. DR Architecture

Primary Site (Production):

Entiros DC - Karlstad, Sweden
Full Kubernetes cluster
PostgreSQL HA cluster
Object storage (primary)

Secondary Site (DR):

Entiros DC - Stockholm, Sweden
Standby Kubernetes cluster (minimal)
PostgreSQL standby (read-only, async replication)
Object storage replica (async replication)

Data Replication:

Database: WAL archiving every 6 hours
Object Storage: Near real-time async replication
Configuration: Git-based, always current

10.3. DR Activation Procedure

Activation Criteria:

Primary data center completely unavailable
Multiple critical infrastructure failures with no quick resolution
Recovery time at primary site exceeds 4 hours
Management decision to activate DR

Activation Authority:

Operations Manager + CEO (or designated deputy)

DR Activation Steps:

Phase 1: Assessment and Decision (T+0 to T+30 min)

Confirm primary site unavailable
Assess scope and estimated recovery time
Management decision: Activate DR or wait
Declare DR activation
Notify all stakeholders (internal team, customers, vendors)

Phase 2: DR Site Preparation (T+30 to T+90 min)

Scale up DR Kubernetes cluster to production capacity
Promote PostgreSQL standby to primary (writable)
Verify database connectivity and data integrity
Update application configurations to use DR database
Deploy application pods to DR cluster
Verify internal service health checks

Phase 3: DNS Cutover (T+90 to T+120 min)

Update DNS records to point to DR site IP addresses
Wait for DNS propagation (5-15 minutes with low TTL)
Monitor traffic shift from primary to DR
Verify customer API calls reaching DR site

Phase 4: Verification and Communication (T+120 to T+240 min)

Test all critical customer workflows
Verify monitoring and alerting operational
Check SLA dashboards and metrics
Send customer notification: "Service restored, operating from DR site"
Begin post-incident review planning

Phase 5: Stabilization (T+240+)

Monitor DR site performance and stability
Address any issues discovered during failover
Plan return to primary site (when available)
Coordinate with data center partner on primary site recovery

10.4. Failback to Primary Site

When to Failback:

Primary site fully recovered and tested
DR site stable for at least 24 hours
Planned maintenance window available
Customer notification completed (10 days advance)

Failback Procedure:

Sync data from DR site to primary site
Verify primary site readiness (full testing)
Plan cutover during maintenance window
Execute reverse of DR activation procedure
Update DNS back to primary site
Monitor for 48 hours post-failback
Document lessons learned

10.5. DR Testing

Test Type	Frequency	Scope
Backup Restore	Monthly	Restore single tenant database, verify integrity
Partial Failover	Quarterly	Failover non-critical services to DR, verify functionality
Full DR Exercise	Annually	Complete failover simulation, all services, customer notification simulation (not actual)
DR Plan Review	Quarterly	Review and update DR procedures, contact lists

11. Capacity Management

11.1. Capacity Management Goals

Ensure adequate resources to meet current and future demand
Avoid service degradation due to resource constraints
Optimize resource utilization and cost
Plan capacity expansion in advance of need

11.2. Capacity Monitoring

Key Capacity Metrics:

Metric	Target / Capacity	Warning	Critical
CPU Utilization (cluster average)	70%	>80% sustained	>90%
Memory Utilization (cluster average)	75%	>85% sustained	>95%
Storage Usage (total)	-	>80%	>90%
Database Connections	200 max	>150	>180
API Request Rate	1000 req/sec capacity	>700 req/sec	>900 req/sec
Document Generation Queue	-	>500 jobs queued	>1000 jobs

11.3. Capacity Planning Process

Monthly Capacity Review:

Analyze usage trends (past 30 days)
Project growth (next 90 days)
Identify capacity bottlenecks
Plan scaling actions if thresholds will be exceeded
Document in monthly operations report
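
The trend analysis and 90-day projection in the monthly review could start from something as simple as a straight-line extrapolation. The sketch below assumes one utilization sample per day and linear growth; both are simplifying assumptions, and the function name is hypothetical.

```python
def project_utilization(daily_samples: list[float], days_ahead: int = 90) -> float:
    """Extrapolate utilization linearly from the average daily change in the samples."""
    if len(daily_samples) < 2:
        raise ValueError("need at least two samples to estimate a trend")
    daily_change = (daily_samples[-1] - daily_samples[0]) / (len(daily_samples) - 1)
    return daily_samples[-1] + daily_change * days_ahead
```

The projected value can then be compared with the warning thresholds in Section 11.2 to decide whether a scaling action needs to be planned. A real forecast would also account for the customer pipeline and seasonal factors.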

Scaling Triggers:

Resource utilization trending toward warning threshold
Sustained periods near capacity
New customer onboarding with significant usage
Seasonal peaks (if known)

Scaling Actions:

Resource	Scaling Method	Timeline
Compute (Kubernetes)	Add worker nodes via cluster autoscaler	Automatic (15 min) or manual (4 hours)
Database	Vertical scaling (larger instance) or add read replicas	4-8 hours planned downtime
Storage	Add storage capacity	No downtime (elastic)
Network	Upgrade bandwidth with ISP	1-2 weeks coordination

11.4. Capacity Forecasting

Forecasting Model:

Historical growth rate (monthly)
Known customer pipeline (new customers onboarding)
Seasonal factors (if applicable)
Planned feature releases increasing resource usage

12-Month Capacity Plan:

Reviewed quarterly
Aligned with budget planning
Presented to management with cost projections

12. Security Operations

12.1. Security Operations Principles

SaaS security operations implement SW-ISMS-POL-001 (Information Security Policy) with focus on:

24x7 security monitoring and threat detection
Rapid incident response to security threats
Proactive vulnerability management
Secure access controls and logging
Compliance with ISO 27001 requirements

12.2. Security Monitoring

Security Monitoring Tools:

Fortinet FortiGate: Network-level threat detection (IDS/IPS)
ELK Security: SIEM capabilities, log correlation
Kubernetes Audit Logs: API access and configuration changes
Falco: Runtime security monitoring for containers
Database Audit Logs: Sensitive data access tracking

Security Dashboards:

Failed authentication attempts
Privilege escalation attempts
Cross-tenant access attempts (should be zero)
Unusual API patterns (potential abuse)
Security policy violations
Certificate expiration warnings

12.3. Access Control

Administrative Access:

VPN Required: All administrative access via VPN with MFA
Bastion Hosts: Jump servers for SSH access to infrastructure
Just-In-Time (JIT) Access: Temporary elevated privileges, logged and monitored
Principle of Least Privilege: Users have minimum access needed

Access Reviews:

Quarterly review of all administrative accounts
Immediate revocation for departed staff
Annual access recertification by managers

12.4. Vulnerability Management

Vulnerability Scanning:

Infrastructure Scanning: Weekly automated scans (Nessus or equivalent)
Container Scanning: On every build (integrated with CI/CD)
Dependency Scanning: Daily scans for vulnerable libraries
Penetration Testing: Annual third-party assessment

Vulnerability Response:

Severity	Response Time	Action
Critical (CVSS 9-10)	24 hours	Emergency patch, potential service disruption accepted
High (CVSS 7-8.9)	7 days	Planned patch during maintenance window
Medium (CVSS 4-6.9)	30 days	Include in next scheduled release
Low (CVSS 0-3.9)	90 days	Backlog, address when convenient
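
The table above maps directly onto a lookup by CVSS score, for example when triaging scanner output. This is a sketch with a hypothetical function name; the score bands and response times are taken from the table.

```python
from datetime import timedelta


def vulnerability_sla(cvss: float) -> tuple[str, timedelta]:
    """Map a CVSS score to (severity, response time) per the table above."""
    if not 0.0 <= cvss <= 10.0:
        raise ValueError("CVSS score must be between 0.0 and 10.0")
    if cvss >= 9.0:
        return "critical", timedelta(hours=24)
    if cvss >= 7.0:
        return "high", timedelta(days=7)
    if cvss >= 4.0:
        return "medium", timedelta(days=30)
    return "low", timedelta(days=90)
```

Checking the bands from the highest score downward keeps the boundaries unambiguous: a score of exactly 7.0 is High and 6.9 is Medium.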

12.5. Logging and Audit Trails

What is Logged:

All authentication attempts (success and failure)
API access (with tenant context)
Administrative actions (configuration changes, database access)
Data access (document generation, template changes)
Security events (firewall blocks, IDS alerts)

Log Retention:

Application Logs: 7 days (hot in Elasticsearch), 90 days (archive)
Audit Logs: 7 years (compliance requirement)
Security Logs: 2 years

Log Analysis:

Automated correlation for security events
Daily review of security logs by SOC
Weekly security summary report to CISO

12.6. Security Incident Response

Security incidents follow SW-ISMS-PRO-001 with these SaaS-specific additions:

Security Incident Types:

Unauthorized access or data breach
DDoS attack
Malware detection
Insider threat
Third-party breach (OpenText, data center)

Security Incident Response Team:

CISO: Incident command for security incidents
Operations Manager: Technical coordination
Legal/Privacy Officer: GDPR breach assessment and notification
Customer Success: Customer communication
External: Third-party forensics (if needed)

Immediate Security Actions:

Isolate affected systems (network segmentation)
Preserve evidence (logs, disk images)
Assess scope and impact
Containment actions (block IPs, disable accounts)
Notify CISO and management immediately

13. Reporting

13.1. SLA Reporting

Monthly SLA Report (delivered by 5th business day of following month):

Recipients:

Each customer (via Customer Success Manager)
Management Team
Operations Team

Report Contents:

Service uptime percentage (current month)
Planned vs. unplanned downtime breakdown
Incident summary (count by severity, MTTR)
Performance metrics (API response times)
SLA status (met/not met)
Credits applied (if SLA breach)

SLA Calculation:

Uptime % = ((Total Minutes in Month - Unplanned Downtime Minutes) / Total Minutes in Month) × 100

Target: ≥99.9% (allows ~43 minutes unplanned downtime per month)
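
As a worked example of the formula, the sketch below computes the monthly uptime percentage and the unplanned downtime budget implied by the 99.9% target. The function names are hypothetical, and the calculation assumes the month is measured in calendar minutes.

```python
import calendar


def uptime_percent(year: int, month: int, unplanned_downtime_min: float) -> float:
    """Uptime % for a calendar month, using the formula in this section."""
    total_min = calendar.monthrange(year, month)[1] * 24 * 60
    return (total_min - unplanned_downtime_min) / total_min * 100


def downtime_budget_min(year: int, month: int, target: float = 99.9) -> float:
    """Unplanned downtime (minutes) allowed in the month while still meeting the target."""
    total_min = calendar.monthrange(year, month)[1] * 24 * 60
    return total_min * (100 - target) / 100
```

A 30-day month has 43,200 minutes, so the budget is 43.2 minutes, which is where the figure of roughly 43 minutes comes from. The budget varies with month length, from about 40 minutes in a 28-day February to about 45 minutes in a 31-day month.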

SLA Exclusions:

Planned maintenance (with required advance notice)
Customer-caused issues (misconfiguration, quota exceeded)
Force majeure events
Third-party service outages beyond Swedwise control

13.2. Operational Metrics Reports

Daily Operations Summary (automated):

Incidents opened/closed (past 24 hours)
System health snapshot
Backup status
Capacity utilization

Weekly Operations Report:

Incident summary (count, MTTR, trends)
Change summary (completed, upcoming)
Capacity trends
Action items and follow-ups

Monthly Operations Report (to Management Team):

SLA performance summary (all customers)
Incident analysis and trends
Problem management summary
Change success rate
Capacity planning updates
Security summary
Continuous improvement initiatives

13.3. Customer Usage Reports

Quarterly Usage Report (to each customer):

Documents generated (count, trends)
Notifications sent (email, SMS breakdown)
Storage consumed
API call volume
User activity (active users, login frequency)
Performance trends (response times)
Recommendations (optimization opportunities)

14. Continuous Improvement

14.1. Improvement Sources

Post-Incident Reviews (lessons learned)
Problem Management findings
Customer feedback and complaints
Operations team suggestions
Industry best practices
Audit findings
Monitoring and metrics analysis

14.2. Improvement Process

Identify Improvement Opportunity
- Problem identified through various sources
- Document current state and desired state
Assess and Prioritize
- Impact: High, Medium, Low
- Effort: High, Medium, Low
- Priority: Impact vs. Effort matrix
Plan Improvement
- Define solution
- Resource requirements
- Implementation timeline
- Success criteria
Implement
- Follow Change Management procedure
- Communicate to team
- Train staff if needed
Verify and Close
- Measure against success criteria
- Document lessons learned
- Share with team

14.3. Continuous Improvement Metrics

Metric	Target	Purpose
MTTR (Mean Time to Resolve)	Trend downward	Measure incident response efficiency
Incident Recurrence Rate	<10%	Measure problem management effectiveness
Change Success Rate	≥95%	Measure change quality
Customer Satisfaction (CSAT)	≥4.0/5.0	Measure overall service quality
Unplanned Downtime	<43 min/month	Measure reliability

14.4. Regular Review Meetings

Meeting	Frequency	Purpose
Daily Standup	Daily (09:00 CET)	Operational coordination, issue escalation
Weekly Ops Review	Monday (10:00 CET)	Review previous week, plan upcoming week
Monthly Ops Meeting	First Friday of month	Metrics review, continuous improvement, planning
Quarterly Service Review	Every quarter	Strategic review with management, capacity planning

15. Component-Specific Operations

This document provides platform-wide operations procedures. For component-specific operational details, refer to:

Addendum Document	Scope
SW-SAAS-OPS-COMP-001	Operations Addendum - Communications
SW-SAAS-OPS-COMP-002	Operations Addendum - Notifications
[Future]	Operations Addendum - E-Archive
[Future]	Operations Addendum - E-Sign Integration

Component addendums must reference and comply with this main Operations Guide.

16. Inputs and Outputs

Inputs:

Monitoring alerts and metrics
Customer support tickets
Change requests
Vulnerability scan results
Capacity planning data
Audit findings

Outputs:

Incident reports and post-incident reviews
Problem records and RCA reports
Change implementation records
SLA reports (monthly)
Operations metrics reports (daily, weekly, monthly)
Capacity planning forecasts
Security event summaries

17. Records

Record	Retention Period	Location	Owner
Incident Reports	3 years	[TBD - Incident system]	Operations Manager
Problem Records	5 years	[TBD - Problem system]	Operations Manager
Change Records	3 years	[TBD - Change system]	Operations Manager
SLA Reports	7 years	[TBD - Document repository]	Customer Success Manager
Operational Logs	7 days (hot), 90 days (archive)	ELK Stack	Operations Manager
Security Audit Logs	7 years	[TBD - SIEM]	CISO
Backup Verification Reports	2 years	[TBD - Document repository]	Operations Manager
DR Test Reports	5 years	[TBD - Document repository]	Operations Manager

18. References

IMS Procedures:

SW-ISMS-PRO-001: Incident Management Procedure
SW-IMS-PRO-008: Change Management Procedure
SW-IMS-PRO-002: Risk Assessment Procedure
SW-IMS-PRO-007: Communication Procedure

SaaS Documentation:

SW-SAAS-ARCH-001: Swedwise Communications Technical Architecture
SW-SAAS-SVC-001: Service Description
SW-SAAS-OPS-COMP-001: Operations Addendum - Communications
SW-SAAS-OPS-COMP-002: Operations Addendum - Notifications
SW-SAAS-SUP-001: Support Procedures [TBD]

External Standards:

ISO 9001:2015 - Clause 8.5 (Production and service provision)
ISO 27001:2022 - Clause 8.1 (Operational planning and control)
ITIL 4 - Service management practices (incident management, problem management, change enablement)

19. Document Control

Version	Date	Author	Changes	Approved By
1.0	[TBD]	SaaS Operations Manager	Initial operations guide creation	Management Team

Next Review Date: [TBD - 6 months from effective date for initial version, then annually]

Document Classification: Internal

Document Owner: SaaS Operations Manager

This procedure is approved by Swedwise AB management and is effective from the date specified above. All operations staff are required to read, understand, and comply with this procedure.