SW-SAAS-ARCH-001
Swedwise SaaS Platform - Technical Architecture
Version: 1.0
Owner: Technical Lead
Effective Date: 2025-01-15
Review Date: 2026-01-15
Platform: Swedwise SaaS Platform
Date: 2025-01-15
Version: 1.0
Classification: Confidential
Executive Summary
This document describes the technical architecture of the Swedwise SaaS Platform, deployed on a Kubernetes-based infrastructure in a Swedish data center. The architecture is designed for multi-tenancy, high availability, security, and scalability to support enterprise-grade SaaS service components.
Platform Characteristics:
- Multi-tenant SaaS architecture with strict data isolation
- Kubernetes orchestration for automatic scaling and resilience
- 99.9% availability SLA with redundant components
- Swedish data residency for GDPR compliance
- ISO 27001 certified security controls
Service Components:
The platform hosts multiple service components, each documented in separate technical architecture addendums:
| Component | Document ID | Description |
|---|---|---|
| Communications | SW-SAAS-ARCH-COMP-001 | OpenText Exstream document generation |
| Notifications | SW-SAAS-ARCH-COMP-002 | Multi-channel notification delivery (Email, SMS) |
| [Future] | - | Additional service components |
1. High-Level Architecture
1.1. Architecture Overview
```
┌──────────────────────────────────────────────────────────────────────┐
│                            CUSTOMER LAYER                            │
│                                                                      │
│   ┌──────────────┐        ┌──────────────┐        ┌──────────────┐   │
│   │   Customer   │        │   Customer   │        │   Customer   │   │
│   │   Tenant A   │        │   Tenant B   │        │   Tenant C   │   │
│   │  Users/Apps  │        │  Users/Apps  │        │  Users/Apps  │   │
│   └──────┬───────┘        └──────┬───────┘        └──────┬───────┘   │
└──────────┼───────────────────────┼───────────────────────┼───────────┘
           └───────────────────────┼───────────────────────┘
                                   ▼
                         ┌───────────────────┐
                         │   Internet/VPN    │
                         └─────────┬─────────┘
┌──────────────────────────────────▼───────────────────────────────────┐
│                        NETWORK SECURITY LAYER                        │
│                                                                      │
│   ┌──────────────────────────────────────────────────────────────┐   │
│   │ Fortinet Next-Gen Firewall (IDS/IPS)                         │   │
│   │  - DDoS Protection                                           │   │
│   │  - Threat Intelligence                                       │   │
│   │  - SSL/TLS Inspection                                        │   │
│   └──────────────────────────────┬───────────────────────────────┘   │
└──────────────────────────────────┼───────────────────────────────────┘
┌──────────────────────────────────▼───────────────────────────────────┐
│                    LOAD BALANCING & INGRESS LAYER                    │
│                                                                      │
│   ┌──────────────────────────────────────────────────────────────┐   │
│   │ Kubernetes Ingress Controllers (Redundant)                   │   │
│   │  - TLS Termination                                           │   │
│   │  - Layer 7 Routing                                           │   │
│   │  - Rate Limiting                                             │   │
│   └──────────────────────────────┬───────────────────────────────┘   │
└──────────────────────────────────┼───────────────────────────────────┘
┌──────────────────────────────────▼───────────────────────────────────┐
│                       KUBERNETES CLUSTER LAYER                       │
│                     (OpenText Experience Cloud)                      │
│                                                                      │
│   ┌──────────────────────────────────────────────────────────────┐   │
│   │                        CONTROL PLANE                         │   │
│   │   ┌──────────┐   ┌──────────┐   ┌──────────┐                 │   │
│   │   │   etcd   │   │   API    │   │Scheduler │                 │   │
│   │   │   (HA)   │   │  Server  │   │Controller│                 │   │
│   │   └──────────┘   └──────────┘   └──────────┘                 │   │
│   └──────────────────────────────────────────────────────────────┘   │
│                                                                      │
│   ┌──────────────────────────────────────────────────────────────┐   │
│   │                   WORKER NODES (Redundant)                   │   │
│   │                                                              │   │
│   │  ┌────────────────────────────────────────────────────────┐  │   │
│   │  │                 APPLICATION POD LAYER                  │  │   │
│   │  │                                                        │  │   │
│   │  │ ┌──────────────┐   ┌──────────────┐   ┌──────────────┐ │  │   │
│   │  │ │   Service    │   │   Service    │   │    Tenant    │ │  │   │
│   │  │ │ Component A  │   │ Component B  │   │  Management  │ │  │   │
│   │  │ │    (Pods)    │   │    (Pods)    │   │   Services   │ │  │   │
│   │  │ │ Multi-tenant │   │ Multi-tenant │   │              │ │  │   │
│   │  │ └──────┬───────┘   └──────┬───────┘   └──────┬───────┘ │  │   │
│   │  └────────┼──────────────────┼──────────────────┼─────────┘  │   │
│   │  ┌────────▼──────────────────▼──────────────────▼─────────┐  │   │
│   │  │                 SHARED SERVICES LAYER                  │  │   │
│   │  │                                                        │  │   │
│   │  │ ┌──────────────┐   ┌──────────────┐   ┌──────────────┐ │  │   │
│   │  │ │  Identity &  │   │     API      │   │ Integration  │ │  │   │
│   │  │ │     Auth     │   │   Gateway    │   │    Broker    │ │  │   │
│   │  │ │  (SSO/MFA)   │   │              │   │              │ │  │   │
│   │  │ └──────────────┘   └──────────────┘   └──────────────┘ │  │   │
│   │  └────────────────────────────────────────────────────────┘  │   │
│   └──────────────────────────────────────────────────────────────┘   │
└──────────────────────────────────┬───────────────────────────────────┘
┌──────────────────────────────────▼───────────────────────────────────┐
│                        DATA PERSISTENCE LAYER                        │
│                                                                      │
│  ┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐  │
│  │    PostgreSQL    │   │     Document     │   │  Object Storage  │  │
│  │   Cluster (HA)   │   │     Database     │   │ (S3-compatible)  │  │
│  │                  │   │   (Per-Tenant)   │   │                  │  │
│  │ - Tenant Meta    │   │                  │   │ - Generated Docs │  │
│  │ - Config DB      │   │ - Templates      │   │ - Assets/Media   │  │
│  │ - User Data      │   │ - Job History    │   │ - Archived Output│  │
│  └──────────────────┘   └──────────────────┘   └──────────────────┘  │
└──────────────────────────────────┬───────────────────────────────────┘
┌──────────────────────────────────▼───────────────────────────────────┐
│                   MONITORING & OBSERVABILITY LAYER                   │
│                                                                      │
│  ┌────────────┐   ┌────────────┐   ┌────────────┐   ┌────────────┐   │
│  │ Prometheus │   │  Grafana   │   │    ELK     │   │  Alerting  │   │
│  │  Metrics   │   │ Dashboards │   │    Logs    │   │  Manager   │   │
│  └────────────┘   └────────────┘   └────────────┘   └────────────┘   │
└──────────────────────────────────┬───────────────────────────────────┘
┌──────────────────────────────────▼───────────────────────────────────┐
│                      BACKUP & DISASTER RECOVERY                      │
│                                                                      │
│  Primary DC (Sweden) ────────────────────▶ Secondary DC (Sweden)     │
│  - Every 6 hours                           - Async Replication       │
│  - 7 days retention                        - DR Site                 │
└──────────────────────────────────────────────────────────────────────┘
```
1.2. Platform Technology Stack
| Layer | Technology | Purpose |
|---|---|---|
| Orchestration | Kubernetes | Container orchestration, auto-scaling, self-healing |
| Container Runtime | Docker | Application containerization |
| Database | PostgreSQL (HA cluster) | Relational data storage |
| Object Storage | S3-compatible storage | Document, template, and asset storage |
| Cache | Redis Cluster | Session management, caching |
| Message Queue | RabbitMQ/Kafka | Asynchronous job processing |
| Load Balancing | Kubernetes Ingress / NGINX | Traffic distribution, SSL/TLS termination |
| Firewall | Fortinet Next-Gen Firewall | Network security, IDS/IPS |
| Monitoring | Prometheus + Grafana | Metrics collection and visualization |
| Logging | ELK Stack (Elasticsearch, Logstash, Kibana) | Centralized logging and analysis |
| Secrets Management | Kubernetes Secrets / HashiCorp Vault | Secure credential storage |
Service Component Technologies (see component addendums for details):
- Communications: OpenText Communications (Exstream)
- Notifications: OpenText Notifications + Email/SMS Gateways
2. Platform Component Overview
2.1. Kubernetes Cluster Architecture
The platform runs on a dedicated Kubernetes cluster with the following characteristics:
Control Plane (High Availability)
- 3x Master Nodes: Redundant control plane for fault tolerance
- etcd Cluster: Distributed key-value store for cluster state (3+ nodes)
- API Server: RESTful API for cluster management
- Scheduler: Pod placement and resource allocation
- Controller Manager: Cluster-level functions (replication, endpoints, service accounts)
Worker Nodes
- Minimum 6 Worker Nodes: Distributed across multiple physical hosts
- Auto-scaling: Dynamic node provisioning based on workload
- Taints and Tolerations: Dedicated nodes for sensitive workloads
- Node Affinity: Pod placement rules for tenant isolation
Pod Architecture
Each application component runs as a microservice in a pod:
Platform Services:
| Pod Type | Replicas | Resources | Purpose |
|---|---|---|---|
| Tenant Management | 2+ | 2 CPU, 4 GB RAM | Multi-tenant orchestration |
| API Gateway | 3+ | 2 CPU, 4 GB RAM | API routing and rate limiting |
| Auth Service | 3+ | 2 CPU, 4 GB RAM | Authentication and SSO |
| Integration Broker | 2+ | 2 CPU, 8 GB RAM | External system integration |
Service Component Pods (see component addendums for detailed specifications):
| Service Component | Document | Pod Types |
|---|---|---|
| Communications | SW-SAAS-ARCH-COMP-001 | Exstream API, Designer |
| Notifications | SW-SAAS-ARCH-COMP-002 | Notification Engine, Queue Workers |
2.2. Database Layer
PostgreSQL High-Availability Cluster
```
┌─────────────────────────────────────────────────────────┐
│             PostgreSQL HA Cluster (Patroni)             │
│                                                         │
│  ┌─────────────┐     ┌─────────────┐     ┌───────────┐  │
│  │   Primary   │────▶│  Replica 1  │────▶│ Replica 2 │  │
│  │   (Write)   │     │   (Read)    │     │  (Read)   │  │
│  └──────┬──────┘     └─────────────┘     └───────────┘  │
│         │                                               │
│         ▼                                               │
│  ┌───────────────────────────────────────────────────┐  │
│  │      WAL Archiving & Point-in-Time Recovery       │  │
│  └───────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘
```
Key Features:
- Automatic Failover: Patroni manages leader election (< 30 seconds RTO)
- Streaming Replication: Synchronous streaming replication from the primary to a designated standby replica
- Read Replicas: Scale read operations across multiple replicas
- Connection Pooling: PgBouncer for efficient connection management
Database Separation:
- Platform Database: Tenant metadata, user accounts, subscriptions
- Tenant Databases: Per-tenant data isolation (dedicated schemas or databases)
- Audit Database: Security events, access logs, change tracking
Object Storage (S3-Compatible)
Storage Classes:
- Hot Storage: Frequently accessed documents (generated output, active templates)
- Warm Storage: Archived documents (30-90 days)
- Cold Storage: Long-term archival (compliance retention)
Storage Structure per Tenant:
```
/tenants/{tenant-id}/
├── templates/      # Document templates
├── assets/         # Images, fonts, branding
├── output/         # Generated documents
│   ├── active/     # Last 30 days
│   └── archive/    # Older documents
└── uploads/        # Customer-uploaded content
```
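The per-tenant layout assumes that every object key sits under the tenant's own prefix. The sketch below shows one way an application service could build such keys and reject path traversal before it calls the object store. The function name, the tenant ID pattern and the folder whitelist are illustrative assumptions derived from the tree above. Keys are written without the leading slash, matching the storage_bucket value in the tenant configuration in Section 3.2.

```python
import posixpath
import re

# Top-level folders from the per-tenant layout above.
ALLOWED_FOLDERS = {"templates", "assets", "output", "uploads"}

# Hypothetical tenant ID format, modeled on the "tenant-abc-123" example.
TENANT_ID_RE = re.compile(r"[a-z0-9][a-z0-9-]{2,62}")


def tenant_object_key(tenant_id: str, folder: str, relative_path: str) -> str:
    """Return an object key confined to tenants/{tenant-id}/{folder}/."""
    if not TENANT_ID_RE.fullmatch(tenant_id):
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    if folder not in ALLOWED_FOLDERS:
        raise ValueError(f"unknown folder: {folder!r}")
    prefix = f"tenants/{tenant_id}/{folder}/"
    # Collapse "a/../b" style segments, then check that the result is still
    # strictly inside the tenant's folder prefix.
    key = posixpath.normpath(prefix + relative_path.lstrip("/"))
    if not key.startswith(prefix):
        raise ValueError(f"path escapes tenant prefix: {relative_path!r}")
    return key
```

A crafted upload name such as `../../tenant-xyz/output/a.pdf` normalizes to another tenant's prefix and is rejected. Bucket access policies (Layer 3 in Section 3.1) still act as an independent control.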
2.3. Platform Services
The Swedwise SaaS Platform provides foundational multi-tenant capabilities:
Core Platform Services
- Tenant Provisioning: Automated tenant creation and configuration
- Identity Management: Centralized authentication with SSO/SAML support
- API Management: Rate limiting, throttling, API versioning
- Usage Metering: Transaction tracking for billing
- Analytics Engine: Usage analytics and reporting
Integration Framework
- REST API: Standard RESTful APIs for all services
- Webhooks: Event-driven integrations
- File Transfer: SFTP/FTPS for batch processing
- Message Queue: Asynchronous job processing (Kafka/RabbitMQ)
Service Component Integration
Each service component integrates with the platform through:
- Shared authentication and authorization
- Common API gateway routing
- Unified monitoring and logging
- Centralized configuration management
For component-specific integration details, see the respective architecture addendums.
3. Multi-Tenant Architecture
3.1. Tenant Isolation Model
The platform implements a hybrid multi-tenant architecture with multiple layers of isolation:
```
┌───────────────────────────────────────────────────────────────┐
│                       ISOLATION LAYERS                        │
│                                                               │
│  Layer 1: Application Logic (Shared Pods)                     │
│  ├── Shared application code                                  │
│  ├── Per-tenant configuration injection                       │
│  └── Context-based data filtering                             │
│                                                               │
│  Layer 2: Database Isolation                                  │
│  ├── Separate database schemas per tenant                     │
│  ├── Row-level security policies                              │
│  └── Encrypted tenant keys                                    │
│                                                               │
│  Layer 3: Storage Isolation                                   │
│  ├── Tenant-specific object storage paths                     │
│  ├── Access control policies (IAM)                            │
│  └── Encryption with tenant-specific keys                     │
│                                                               │
│  Layer 4: Network Isolation (Optional, Sensitive Workloads)   │
│  ├── Dedicated namespaces                                     │
│  ├── Network policies                                         │
│  └── Private networking                                       │
└───────────────────────────────────────────────────────────────┘
```
3.2. Tenant Configuration
Each tenant has a dedicated configuration profile:
```yaml
tenant:
  id: "tenant-abc-123"
  name: "Acme Corporation"
  status: active
  tier: enterprise
  resources:
    database_schema: "tenant_abc_123"
    storage_bucket: "tenants/tenant-abc-123/"
    namespace: "default"  # or dedicated namespace
  quotas:
    max_users: 100
    max_storage_gb: 500
    max_monthly_documents: 100000
    max_monthly_notifications: 500000
    api_rate_limit: 1000/min
  features:
    sso_enabled: true
    api_access: true
    custom_branding: true
    advanced_analytics: false
  security:
    encryption_key_id: "key-abc-123"
    data_classification: "confidential"
    ip_whitelist: ["203.0.113.0/24"]
    mfa_required: true
```
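The api_rate_limit quota above is expressed as a number of requests per time window. The document does not specify the gateway's algorithm. The minimal sketch below assumes the API gateway enforces the quota with a per-tenant token bucket, which is one common technique. It parses the configured value and admits or throttles requests. The class and function names, and every unit other than "min", are hypothetical.

```python
from __future__ import annotations

import time

# Window lengths for the rate-limit suffix; only "/min" appears in the document.
_WINDOW_SECONDS = {"sec": 1, "min": 60, "hour": 3600}


def parse_rate_limit(spec: str) -> tuple[int, int]:
    """Parse a value such as "1000/min" into (requests, window_seconds)."""
    count, sep, unit = spec.partition("/")
    if not sep or unit not in _WINDOW_SECONDS or not count.isdigit():
        raise ValueError(f"unsupported rate limit: {spec!r}")
    return int(count), _WINDOW_SECONDS[unit]


class TokenBucket:
    """Per-tenant limiter: allows bursts up to the quota and refills continuously."""

    def __init__(self, spec: str, now: float | None = None) -> None:
        requests, window = parse_rate_limit(spec)
        self.capacity = float(requests)
        self.refill_per_second = requests / window
        self.tokens = self.capacity
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: float | None = None) -> bool:
        """Consume one token if one is available; False means the tenant is throttled."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With `1000/min`, a tenant can burst 1,000 requests at once and then regains roughly 16.7 requests per second. A fixed one-minute window would instead allow up to 2,000 requests around a window boundary.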
3.3. Data Isolation Strategy
Database-Level Isolation
Option 1: Schema-per-Tenant (Current Implementation)
- Each tenant has a dedicated PostgreSQL schema
- Shared database instance for operational efficiency
- Schema-level access control
- Suitable for standard tier customers
Option 2: Database-per-Tenant (Enterprise Tier)
- Dedicated PostgreSQL database for enterprise customers
- Complete logical separation
- Independent backup/restore capabilities
- Higher isolation for regulatory requirements
Application-Level Isolation
- Tenant Context Injection: Every request carries tenant ID
- Query Filtering: All database queries filtered by tenant ID
- Data Validation: Cross-tenant access attempts blocked at application layer
- Audit Logging: All data access logged with tenant context
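The tenant context injection and query filtering described above can be sketched with Python's contextvars, which keep the tenant ID isolated per request, including under asyncio. This illustrates the pattern only and is not the platform's middleware. The helper names, the wrapping subquery and the psycopg-style `%(name)s` placeholders are assumptions.

```python
from __future__ import annotations

import contextvars
from contextlib import contextmanager
from typing import Iterator

# Tenant bound to the request currently being handled; set once by
# (hypothetical) authentication middleware after the token is validated.
_current_tenant: contextvars.ContextVar[str] = contextvars.ContextVar("current_tenant")


class CrossTenantAccessError(PermissionError):
    """Raised when a request tries to touch another tenant's data."""


@contextmanager
def tenant_context(tenant_id: str) -> Iterator[None]:
    token = _current_tenant.set(tenant_id)
    try:
        yield
    finally:
        _current_tenant.reset(token)


def current_tenant() -> str:
    try:
        return _current_tenant.get()
    except LookupError:
        raise CrossTenantAccessError("no tenant context bound to this request") from None


def scoped_query(sql: str, params: dict) -> tuple[str, dict]:
    """Wrap a query so that it can only return rows of the current tenant.

    The inner query must select a tenant_id column. A caller-supplied
    tenant_id that differs from the bound context is rejected.
    """
    tenant = current_tenant()
    requested = params.get("tenant_id", tenant)
    if requested != tenant:
        raise CrossTenantAccessError(f"tenant {tenant!r} requested data of {requested!r}")
    wrapped = f"SELECT * FROM ({sql}) AS q WHERE q.tenant_id = %(tenant_id)s"
    return wrapped, {**params, "tenant_id": tenant}
```

Binding the tenant once, from the authenticated identity, means individual queries cannot pass their own tenant ID past the filter. Rejected attempts are the natural place to emit the cross-tenant audit events listed in Section 4.3.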
3.4. Resource Allocation
Pod Resource Limits (Per Tenant Workload)
```yaml
resources:
  requests:
    memory: "2Gi"
    cpu: "1000m"
  limits:
    memory: "4Gi"
    cpu: "2000m"
```
Storage Quotas
- Per-Tenant Storage Quota: Enforced at object storage level
- Database Size Monitoring: Alerts when tenant exceeds 80% of quota
- Automatic Scaling: Option to automatically increase quota (with billing)
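A quota monitor could apply the 80% alert rule as a small classification step per tenant and resource. The sketch below does that. The function name and the three status values are hypothetical. The 80% threshold comes from the text, and "exceeds" is read as strictly greater than.

```python
ALERT_RATIO = 0.8  # "Alerts when tenant exceeds 80% of quota"


def quota_status(used: float, quota: float, alert_ratio: float = ALERT_RATIO) -> str:
    """Classify usage as "ok", "warning" (above the alert ratio) or "full" (quota reached)."""
    if quota <= 0:
        raise ValueError("quota must be positive")
    ratio = used / quota
    if ratio >= 1.0:
        return "full"
    if ratio > alert_ratio:
        return "warning"
    return "ok"
```

The same function applies to any quota in the tenant profile, for example `quota_status(used_gb, 500)` for storage or `quota_status(documents_this_month, 100000)` for document volume.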
4. Network Architecture and Security
4.1. Network Topology
```
┌──────────────────────────────────────────────────────────────┐
│                           INTERNET                           │
└──────────────────────────────┬───────────────────────────────┘
                               │
                    ┌──────────▼───────────┐
                    │         DNS          │
                    │ (Cloudflare/Route53) │
                    └──────────┬───────────┘
                               │
┌──────────────────────────────▼───────────────────────────────┐
│                           DMZ ZONE                           │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │ Fortinet Next-Gen Firewall (Active/Passive HA)         │  │
│  │  - Public IP: External interface                       │  │
│  │  - Private IP: Internal interface                      │  │
│  │  - Management IP: Admin interface                      │  │
│  └───────────────────────────┬────────────────────────────┘  │
└──────────────────────────────┼───────────────────────────────┘
┌──────────────────────────────▼───────────────────────────────┐
│                       APPLICATION ZONE                       │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │ Kubernetes Cluster Network (Calico CNI)                │  │
│  │                                                        │  │
│  │  Pod Network:     10.244.0.0/16                        │  │
│  │  Service Network: 10.96.0.0/12                         │  │
│  │  Node Network:    192.168.1.0/24                       │  │
│  │                                                        │  │
│  │  ┌──────────────────────────────────────────────────┐  │  │
│  │  │ Network Policies (Microsegmentation)             │  │  │
│  │  │  - Pod-to-Pod rules                              │  │  │
│  │  │  - Namespace isolation                           │  │  │
│  │  │  - Egress filtering                              │  │  │
│  │  └──────────────────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────┬───────────────────────────────┘
┌──────────────────────────────▼───────────────────────────────┐
│                          DATA ZONE                           │
│                                                              │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐  │
│  │   PostgreSQL   │  │     Object     │  │     Backup     │  │
│  │   Private IP   │  │    Storage     │  │    Storage     │  │
│  │  192.168.2.x   │  │  Private Net   │  │   Air-gapped   │  │
│  └────────────────┘  └────────────────┘  └────────────────┘  │
└──────────────────────────────────────────────────────────────┘
```
4.2. Security Layers
Layer 1: Perimeter Security
- DDoS Protection: Cloudflare or equivalent CDN with DDoS mitigation
- Web Application Firewall (WAF): OWASP Top 10 protection
- Rate Limiting: API and HTTP rate limiting at edge
- GeoIP Filtering: Optional geographic access restrictions
Layer 2: Network Security
- Next-Gen Firewall: Fortinet FortiGate (or equivalent)
  - Intrusion Detection System (IDS)
  - Intrusion Prevention System (IPS)
  - SSL/TLS Inspection
  - Application-layer filtering
  - Threat intelligence feeds
- VPN Access: Secure administrative access
  - IPsec VPN for site-to-site connectivity
  - SSL VPN for remote administration
  - Multi-factor authentication required
Layer 3: Kubernetes Network Security
- Calico Network Policies: Pod-level microsegmentation
  ```yaml
  # Example: Restrict database access
  apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    name: postgres-access-policy
  spec:
    podSelector:
      matchLabels:
        app: postgresql
    ingress:
      - from:
          - podSelector:
              matchLabels:
                tier: application
        ports:
          - protocol: TCP
            port: 5432
  ```
- Service Mesh (Optional): Istio/Linkerd for mTLS between services
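- Pod Security Standards: Restrict privileged containers and host networking (enforced with Pod Security Admission; the older PodSecurityPolicy API was removed in Kubernetes 1.25)
- Secrets Management: Kubernetes Secrets or HashiCorp Vault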
Layer 4: Application Security
- Authentication: OpenID Connect (OIDC) / SAML 2.0
- Authorization: Role-Based Access Control (RBAC)
- API Security:
  - OAuth 2.0 for API access
  - API keys with rotation policy
  - JWT tokens with short expiration
- Input Validation: Request validation at the API gateway, following OWASP input-validation guidance
- CORS Policies: Strict cross-origin resource sharing
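The following standard-library sketch shows what "JWT tokens with short expiration" means at validation time. It signs and verifies an HS256 token and rejects the token once its exp claim has passed. HS256 is used only to keep the example self-contained. Tokens issued by an OIDC identity provider are typically signed asymmetrically (for example RS256) and verified against the provider's published keys, and in practice a maintained JWT library handles this. The 30-second leeway is an assumption.

```python
from __future__ import annotations

import base64
import hashlib
import hmac
import json
import time


def _b64url_encode(data: bytes) -> str:
    # JWTs use unpadded base64url (RFC 7515).
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(segment: str) -> bytes:
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def sign_hs256(claims: dict, key: bytes) -> str:
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    signature = hmac.new(key, f"{header}.{payload}".encode("ascii"), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(signature)}"


def verify_hs256(token: str, key: bytes, now: float | None = None, leeway: int = 30) -> dict:
    """Return the claims if the signature is valid and exp has not passed."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")  # never accept "none" or other algorithms
    expected = hmac.new(key, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("invalid signature")
    claims = json.loads(_b64url_decode(payload_b64))
    current = time.time() if now is None else now
    if "exp" not in claims or current > claims["exp"] + leeway:
        raise ValueError("token expired or missing exp")
    return claims
```

Short lifetimes limit how long a leaked token stays usable. The leeway absorbs small clock differences between the issuer and the API gateway.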
Layer 5: Data Security
- Encryption at Rest:
  - AES-256 encryption for databases
  - S3 server-side encryption (SSE)
  - Tenant-specific encryption keys (optional)
- Encryption in Transit:
  - TLS 1.3 for all external connections
  - mTLS for internal service communication
  - Certificate rotation (Let's Encrypt or enterprise CA)
- Data Loss Prevention (DLP):
  - Sensitive data detection in documents
  - PII/GDPR compliance scanning
  - Automated data classification
4.3. Security Monitoring
```
┌─────────────────────────────────────────────────────────────┐
│                  SECURITY MONITORING STACK                  │
│                                                             │
│  ┌───────────────┐   ┌───────────────┐   ┌───────────────┐  │
│  │   Firewall    │   │  Kubernetes   │   │  Application  │  │
│  │     Logs      │   │  Audit Logs   │   │     Logs      │  │
│  └───────┬───────┘   └───────┬───────┘   └───────┬───────┘  │
│          └───────────────────┼───────────────────┘          │
│                              ▼                              │
│                    ┌───────────────────┐                    │
│                    │  Log Aggregation  │                    │
│                    │ (Logstash/Fluentd)│                    │
│                    └─────────┬─────────┘                    │
│                              ▼                              │
│                    ┌───────────────────┐                    │
│                    │   SIEM Platform   │                    │
│                    │   (ELK/Splunk)    │                    │
│                    │                   │                    │
│                    │   - Correlation   │                    │
│                    │   - Alerting      │                    │
│                    │   - Dashboards    │                    │
│                    └─────────┬─────────┘                    │
│                              ▼                              │
│                    ┌───────────────────┐                    │
│                    │ Incident Response │                    │
│                    │  (PagerDuty/Ops)  │                    │
│                    └───────────────────┘                    │
└─────────────────────────────────────────────────────────────┘
```
Security Events Monitored:
- Failed authentication attempts
- Privilege escalation attempts
- Unusual API access patterns
- Cross-tenant access attempts
- Database query anomalies
- Network traffic anomalies
- Configuration changes
- Certificate expiration warnings
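Failed authentication attempts are usually detected with a sliding-window count per source. The sketch below flags a source once it exceeds 10 failures within 60 seconds, matching the Authentication failures alert in Section 7.2. Keying by source IP or user ID, and the class name, are assumptions. In this architecture the equivalent rule would more likely be implemented as a SIEM correlation rule than in application code.

```python
from __future__ import annotations

from collections import defaultdict, deque


class FailedLoginMonitor:
    """Sliding-window counter of failed logins per source (IP address or user ID)."""

    def __init__(self, threshold: int = 10, window_seconds: float = 60.0) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._failures: defaultdict[str, deque[float]] = defaultdict(deque)

    def record_failure(self, source: str, timestamp: float) -> bool:
        """Record one failure; return True if the source is now above the threshold.

        Timestamps are assumed to arrive in non-decreasing order per source.
        """
        events = self._failures[source]
        events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        while events and events[0] <= cutoff:
            events.popleft()  # drop failures that have left the window
        return len(events) > self.threshold
```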
5. Scalability and High Availability
5.1. Horizontal Scaling
Automatic Pod Scaling (HPA)
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: opentext-comms-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: opentext-comms-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: document_generation_queue_depth
        target:
          type: AverageValue
          averageValue: "100"
```
Scaling Triggers:
- CPU utilization > 70%
- Memory utilization > 80%
- Request queue depth > 100 jobs
- Response time > 2 seconds (p95)
- Custom metrics: Documents/minute, notifications/minute
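For each metric in the HPA manifest above, Kubernetes proposes a replica count of ceil(currentReplicas × currentValue / targetValue). It skips changes within a tolerance (0.1 by default), takes the largest proposal, and bounds the result by minReplicas and maxReplicas. The sketch below reproduces only that core calculation. It deliberately ignores stabilization windows, scaling policies, and the handling of missing or not-yet-ready pod metrics.

```python
from __future__ import annotations

import math


def desired_replicas(
    current_replicas: int,
    metrics: list[tuple[float, float]],
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Each (current_value, target_value) pair proposes
    ceil(current_replicas * current_value / target_value); the largest
    proposal wins and is clamped to [min_replicas, max_replicas]."""
    proposals = []
    for current_value, target_value in metrics:
        ratio = current_value / target_value
        if abs(ratio - 1.0) <= tolerance:
            proposals.append(current_replicas)  # close enough to target: no change
        else:
            proposals.append(math.ceil(current_replicas * ratio))
    desired = max(proposals, default=current_replicas)
    return max(min_replicas, min(max_replicas, desired))
```

As an example, take three running pods at 90% CPU (target 70%), 60% memory (target 80%) and an average queue depth of 250 jobs (target 100). The proposals are 4, 3 and 8 replicas, so the queue-depth metric drives the scale-out to 8.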
Cluster Autoscaling
- Kubernetes Cluster Autoscaler: Adds worker nodes when pods can't be scheduled
- Node Pools: Different node types for different workloads
  - Compute-optimized: Document generation
  - Memory-optimized: Template caching
  - General-purpose: Application services
5.2. High Availability Design
Service-Level HA
| Component | HA Configuration | RPO | RTO |
|---|---|---|---|
| Control Plane | 3 master nodes, etcd quorum | N/A | < 1 min |
| Application Pods | Min 3 replicas, anti-affinity | N/A | < 30 sec |
| PostgreSQL | Primary + 2 replicas, Patroni | < 1 min | < 30 sec |
| Object Storage | 3x replication | 0 | Immediate |
| Load Balancers | Active/Active | N/A | < 5 sec |
| Firewall | Active/Passive HA | N/A | < 10 sec |
Availability Zones
- Multi-AZ Deployment: Worker nodes distributed across 3 availability zones
- Pod Anti-Affinity: Replicas scheduled on different worker nodes (topology key kubernetes.io/hostname)
```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app
              operator: In
              values:
                - opentext-comms
        topologyKey: "kubernetes.io/hostname"
```
Health Checks
```yaml
# Liveness probe: Restart unhealthy containers
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

# Readiness probe: Remove unhealthy pods from load balancer
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2
```
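The probe settings above bound how quickly the kubelet acts on a failure. Assume the endpoint starts failing just after a successful probe. The kubelet then needs failureThreshold further failed probes, one per periodSeconds, and the last of these may wait the full timeoutSeconds. The sketch below computes that approximate upper bound. It ignores probe jitter and container restart time, and the class name is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeSettings:
    initial_delay_seconds: int
    period_seconds: int
    timeout_seconds: int
    failure_threshold: int

    def worst_case_detection_seconds(self) -> int:
        """Approximate time from an endpoint going unhealthy to the kubelet acting.

        Assumes the failure starts right after a successful probe: it then takes
        failure_threshold more probes, one per period, and the last one may run
        for the full timeout before it counts as failed.
        """
        return self.failure_threshold * self.period_seconds + self.timeout_seconds


LIVENESS = ProbeSettings(initial_delay_seconds=30, period_seconds=10, timeout_seconds=5, failure_threshold=3)
READINESS = ProbeSettings(initial_delay_seconds=10, period_seconds=5, timeout_seconds=3, failure_threshold=2)
```

With these values, a hung container stops receiving traffic after roughly 13 seconds (readiness). It is restarted after roughly 35 seconds (liveness), plus the time the new container needs to pass its readiness probe again.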
5.3. Load Balancing Strategy
External Load Balancing
- Layer 4 (TCP/UDP): Firewall load balancing to Kubernetes ingress
- Layer 7 (HTTP/HTTPS): Kubernetes Ingress Controllers
  - Session affinity (sticky sessions) for stateful operations
  - Weighted routing for blue/green deployments
  - Geographic routing (future: multi-region)
Internal Load Balancing
- Kubernetes Services: ClusterIP services for internal communication
- Service Mesh: Istio for advanced traffic management (optional)
  - Circuit breaking
  - Retry policies
  - Timeout configuration
  - A/B testing
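In Istio, circuit breaking is configured declaratively (connection-pool and outlier-detection settings in a DestinationRule) rather than written in application code. The sketch below implements the classic closed, open and half-open state machine. It is useful for services outside the mesh, or simply to show the behaviour that the mesh settings configure. The thresholds and names are illustrative assumptions.

```python
from __future__ import annotations

import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised instead of calling a dependency whose circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: float | None = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # cool-down elapsed: let a trial call through
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("dependency unavailable; failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # (re)open and restart the cool-down
            raise
        self.failures = 0
        self.opened_at = None  # any success closes the circuit
        return result
```

Failing fast while the circuit is open keeps callers from tying up threads and connections on a dependency that is already known to be down. This is the cascading-failure scenario that distributed tracing (Section 7.5) is also meant to expose.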
6. Disaster Recovery Architecture
6.1. DR Strategy
DR Objectives:
- RTO (Recovery Time Objective): 4 hours
- RPO (Recovery Point Objective): 6 hours (backup frequency)
- SLA Impact: DR events excluded from availability SLA calculation
6.2. Backup Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                     PRIMARY DATA CENTER                     │
│                                                             │
│  ┌───────────────┐   ┌───────────────┐   ┌───────────────┐  │
│  │  PostgreSQL   │   │    Object     │   │  Kubernetes   │  │
│  │  Continuous   │   │    Storage    │   │    Config     │  │
│  │   Archiving   │   │  Replication  │   │    Backups    │  │
│  └───────┬───────┘   └───────┬───────┘   └───────┬───────┘  │
│          └───────────────────┼───────────────────┘          │
│                              │                              │
│                        Every 6 hours                        │
│                              │                              │
└──────────────────────────────┼──────────────────────────────┘
                     ┌─────────▼─────────┐
                     │  Secure Transfer  │
                     │     (TLS/VPN)     │
                     └─────────┬─────────┘
┌──────────────────────────────▼──────────────────────────────┐
│               SECONDARY DATA CENTER (DR SITE)               │
│                                                             │
│  ┌───────────────┐   ┌───────────────┐   ┌───────────────┐  │
│  │  PostgreSQL   │   │    Object     │   │  Kubernetes   │  │
│  │    Standby    │   │    Storage    │   │    Standby    │  │
│  │  (Read-only)  │   │    Replica    │   │    Cluster    │  │
│  └───────────────┘   └───────────────┘   └───────────────┘  │
│                                                             │
│  - 7-day backup retention                                   │
│  - Point-in-time recovery capability                        │
│  - Quarterly DR testing                                     │
└─────────────────────────────────────────────────────────────┘
```
6.3. Backup Components
Database Backups
- Continuous WAL Archiving: PostgreSQL Write-Ahead Logs streamed to DR site
- Daily Full Backups: Automated via pg_basebackup
- Incremental Backups: Every 6 hours using WAL archiving
- Point-in-Time Recovery: Restore to any point within 7-day window
- Backup Encryption: AES-256 encryption of backup files
Object Storage Backups
- Cross-Site Replication: Async replication to DR site (near real-time)
- Versioning: Last 30 versions of each object retained
- Lifecycle Policies:
  - Active: 30 days
  - Archive: 90 days
  - Compliance: 7 years (if required)
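Read together with the storage classes in Section 2.2 (warm storage for documents 30 to 90 days old), the lifecycle policy can be interpreted as an age-based tier assignment. The sketch below encodes that interpretation. Two choices are assumptions rather than stated policy: documents without a compliance requirement become eligible for deletion after 90 days, and 7 years is approximated as 2,555 days.

```python
from datetime import date
from enum import Enum


class Tier(Enum):
    ACTIVE = "active"          # hot storage
    ARCHIVE = "archive"        # warm storage
    COMPLIANCE = "compliance"  # cold storage, only where retention is required
    EXPIRED = "expired"        # eligible for deletion (assumption)


ACTIVE_DAYS = 30
ARCHIVE_DAYS = 90
COMPLIANCE_DAYS = 7 * 365  # 7 years, ignoring leap days (assumption)


def lifecycle_tier(created: date, today: date, compliance_required: bool) -> Tier:
    age_days = (today - created).days
    if age_days < 0:
        raise ValueError("created date lies in the future")
    if age_days <= ACTIVE_DAYS:
        return Tier.ACTIVE
    if age_days <= ARCHIVE_DAYS:
        return Tier.ARCHIVE
    if compliance_required and age_days <= COMPLIANCE_DAYS:
        return Tier.COMPLIANCE
    return Tier.EXPIRED
```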
Configuration Backups
- Kubernetes Manifests: Git repository with all configurations
- Secrets: Encrypted backup of secrets (separate from configs)
- Infrastructure as Code: Terraform/Ansible scripts for cluster rebuild
6.4. DR Procedures
Failover Scenarios
Scenario 1: Single Component Failure
- Detection: Automatic via health checks
- Action: Kubernetes automatically restarts failed pods
- Impact: No expected customer impact (recovery in < 30 seconds)
- Escalation: None (automatic recovery)
Scenario 2: Database Failure
- Detection: Patroni detects primary failure
- Action: Automatic promotion of replica to primary
- Impact: 30-60 seconds of database unavailability
- Escalation: Operations team notified
Scenario 3: Availability Zone Failure
- Detection: Multiple pod/node failures
- Action: Pods rescheduled to healthy zones
- Impact: 2-5 minutes (pod startup time)
- Escalation: Incident declared, management notified
Scenario 4: Complete Data Center Failure
- Detection: All health checks fail, ops team declares disaster
- Action: Manual DR failover procedure
- Steps:
  - Activate DR site (T+0)
  - Promote standby database to primary (T+15 min)
  - Update DNS to point to DR site (T+30 min)
  - Verify all services operational (T+60 min)
  - Customer notification (T+90 min)
- Impact: Up to 4 hours RTO
- Escalation: Executive team, all customers notified
6.5. DR Testing
| Test Type | Frequency | Scope |
|---|---|---|
| Component Failover | Monthly | Single pod/database replica failover |
| Backup Restore | Monthly | Restore sample tenant from backup |
| Partial Failover | Quarterly | Failover non-critical services to DR |
| Full DR Exercise | Annually | Complete failover, customer notification simulation |
7. Monitoring and Observability Stack
7.1. Monitoring Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                    DATA COLLECTION LAYER                    │
│                                                             │
│  ┌───────────────┐   ┌───────────────┐   ┌───────────────┐  │
│  │  Prometheus   │   │     Node      │   │   cAdvisor    │  │
│  │   Exporters   │   │   Exporter    │   │  (Container)  │  │
│  └───────┬───────┘   └───────┬───────┘   └───────┬───────┘  │
│          └───────────────────┼───────────────────┘          │
│                              ▼                              │
│                   ┌─────────────────────┐                   │
│                   │     Prometheus      │                   │
│                   │      (HA Pair)      │                   │
│                   │ - Time-series       │                   │
│                   │ - Alerting          │                   │
│                   │ - 30-day retention  │                   │
│                   └──────────┬──────────┘                   │
└──────────────────────────────┼──────────────────────────────┘
┌──────────────────────────────▼──────────────────────────────┐
│                     VISUALIZATION LAYER                     │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                  Grafana Dashboards                   │  │
│  │                                                       │  │
│  │ ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  │  │
│  │ │Infrastructure│  │ Application  │  │    Tenant    │  │  │
│  │ │  Dashboard   │  │  Dashboard   │  │  Dashboard   │  │  │
│  │ └──────────────┘  └──────────────┘  └──────────────┘  │  │
│  └───────────────────────────────────────────────────────┘  │
└──────────────────────────────┼──────────────────────────────┘
┌──────────────────────────────▼──────────────────────────────┐
│                        LOGGING LAYER                        │
│                                                             │
│  ┌───────────────┐   ┌───────────────┐   ┌───────────────┐  │
│  │   Fluentd/    │   │   Logstash    │   │ Elasticsearch │  │
│  │  Fluent Bit   │──▶│    (Parse)    │──▶│    Cluster    │  │
│  │   (Collect)   │   │               │   │    (7-day)    │  │
│  └───────────────┘   └───────────────┘   └───────┬───────┘  │
│                                          ┌───────▼───────┐  │
│                                          │    Kibana     │  │
│                                          │  (Visualize)  │  │
│                                          └───────────────┘  │
└──────────────────────────────┬──────────────────────────────┘
┌──────────────────────────────▼──────────────────────────────┐
│                        TRACING LAYER                        │
│                                                             │
│  ┌───────────────┐   ┌───────────────┐   ┌───────────────┐  │
│  │    Jaeger     │   │    Zipkin     │   │ OpenTelemetry │  │
│  │   Collector   │───│   (Storage)   │   │  (Optional)   │  │
│  └───────────────┘   └───────────────┘   └───────────────┘  │
└──────────────────────────────┬──────────────────────────────┘
┌──────────────────────────────▼──────────────────────────────┐
│                  ALERTING & INCIDENT LAYER                  │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ Alert Manager (Prometheus)                            │  │
│  │ - Alert aggregation and deduplication                 │  │
│  │ - Routing rules (team, severity, time)                │  │
│  │ - Silencing and inhibition                            │  │
│  └───────────────────────────┬───────────────────────────┘  │
│                ┌─────────────┴─────────────┐                │
│      ┌─────────▼─────────┐       ┌─────────▼─────────┐      │
│      │     PagerDuty     │       │    Email/Slack    │      │
│      │     (On-call)     │       │  (Notifications)  │      │
│      └───────────────────┘       └───────────────────┘      │
└─────────────────────────────────────────────────────────────┘
```
7.2. Key Metrics
Infrastructure Metrics
| Metric | Threshold | Alert |
|---|---|---|
| Node CPU utilization | > 80% for 5 min | Warning |
| Node memory utilization | > 85% for 5 min | Warning |
| Disk utilization | > 80% | Warning, > 90% Critical |
| Network errors | > 0.1% packet loss | Warning |
| Pod restart count | > 3 in 10 min | Critical |
Application Metrics
| Metric | Threshold | Alert |
|---|---|---|
| API response time (p95) | > 2 seconds | Warning |
| API response time (p99) | > 5 seconds | Critical |
| Error rate | > 1% | Warning, > 5% Critical |
| Queue depth | > 1000 jobs | Warning |
| Authentication failures | > 10/min | Warning |
Service Component Metrics (see component addendums):
- Communications: Document generation success rate, template load time
- Notifications: Delivery rate, bounce rate, queue depth
Business Metrics
| Metric | Purpose |
|---|---|
| Documents generated/hour | Capacity planning |
| Notifications sent/hour | Capacity planning |
| Active users per tenant | Usage tracking |
| API calls per tenant | Billing verification |
| Storage consumed per tenant | Quota management |
7.3. Dashboard Structure
Operations Dashboard
- Cluster Health: Node status, pod status, resource utilization
- Service Health: Service availability, response times, error rates
- Capacity: CPU/memory/storage trends, forecasting
- Alerts: Active alerts, alert history
Tenant Dashboard (Per Customer)
- Usage Metrics: Documents generated, notifications sent
- Performance: Response times, success rates
- Quota Status: Storage used, API calls, user licenses
- SLA Status: Uptime percentage, incident history
Security Dashboard
- Authentication Events: Login attempts, failures, MFA usage
- API Security: Rate limiting triggers, blocked requests
- Network Security: Firewall blocks, IDS/IPS events
- Compliance: Audit log entries, policy violations
7.4. Log Management
Log Sources
- Application Logs: Structured JSON logs from all services
- Access Logs: HTTP access logs (ingress, API gateway)
- Audit Logs: Security events, configuration changes
- System Logs: OS, Kubernetes, database logs
Log Retention
| Log Type | Retention | Storage |
|---|---|---|
| Application logs | 7 days (hot) | Elasticsearch |
| Application logs | 90 days (warm) | S3/archive |
| Audit logs | 7 years | S3/compliance tier |
| Access logs | 30 days | Elasticsearch |
Log Analysis Use Cases
- Debugging: Trace requests across microservices
- Security: Detect suspicious patterns, intrusion attempts
- Compliance: Audit trail for data access
- Performance: Identify slow queries, bottlenecks
7.5. Distributed Tracing
OpenTelemetry Implementation:
- Trace Context Propagation: Trace ID passed through all services
- Span Collection: Each service records timing and metadata
- Sampling: 100% of errors, 10% of successful requests
- Retention: 7 days of trace data
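Keeping 100% of errors but only 10% of successful requests means the decision must be made after the outcome of a trace is known. This is tail-based sampling, for example in an OpenTelemetry Collector tier. The sketch below shows only the decision function. Errors are always kept. A successful trace is kept when a hash of its trace ID falls below the sampling rate, so every collector replica reaches the same verdict for the same trace. The function name and the use of SHA-256 are assumptions.

```python
import hashlib

SUCCESS_SAMPLE_RATE = 0.10  # 10% of successful requests; errors are always kept


def keep_trace(trace_id: str, has_error: bool, rate: float = SUCCESS_SAMPLE_RATE) -> bool:
    """Tail-sampling decision, evaluated once all spans of a trace have arrived."""
    if has_error:
        return True
    # Map the trace ID to a uniform value in [0, 1); 53 bits fit a float exactly,
    # so the value never rounds up to 1.0.
    digest = hashlib.sha256(trace_id.encode("ascii")).digest()
    bucket = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    return bucket < rate
```

A purely head-based sampler, which decides at the first span, could not guarantee that every error trace is kept. This is why the sampling policy implies buffering complete traces before the decision.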
Trace Analysis:
- Identify slow services in request chain
- Detect cascading failures
- Optimize inter-service communication
- Troubleshoot timeout issues
8. Performance Optimization
8.1. Caching Strategy
Application-Level Caching
```
┌─────────────────────────────────────────────────────────────┐
│                       CACHING LAYERS                        │
│                                                             │
│  Layer 1: CDN Cache (Cloudflare)                            │
│  ├── Static assets (images, CSS, JS)                        │
│  ├── TTL: 1 hour to 1 day                                   │
│  └── Purge on deployment                                    │
│                                                             │
│  Layer 2: Application Cache (Redis)                         │
│  ├── Session data (TTL: 24 hours)                           │
│  ├── User profiles (TTL: 1 hour)                            │
│  ├── Tenant configuration (TTL: 5 minutes)                  │
│  └── API responses (TTL: varies by endpoint)                │
│                                                             │
│  Layer 3: Database Query Cache                              │
│  ├── PostgreSQL shared_buffers (4 GB)                       │
│  ├── PgBouncer connection pooling                           │
│  └── Read replicas for read-heavy queries                   │
│                                                             │
│  Layer 4: Template Cache                                    │
│  ├── Compiled document templates                            │
│  ├── TTL: Until template version changes                    │
│  └── Pre-warming on deployment                              │
└─────────────────────────────────────────────────────────────┘
```
Redis Cluster Configuration
- Topology: 3-node cluster with replication
- Persistence: RDB snapshots every 15 minutes + AOF
- Eviction Policy: LRU (Least Recently Used)
- Max Memory: 16 GB per node
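The Redis TTLs in the caching diagram imply a cache-aside read path. The sketch below shows that path for tenant configuration with the 5-minute TTL. It depends only on `get` and `set(..., ex=seconds)`, which match the redis-py client, so either a `redis.Redis` instance or an in-memory stand-in can be passed in. The key format and the loader callback are hypothetical.

```python
from __future__ import annotations

import json
from typing import Callable, Optional, Protocol

TENANT_CONFIG_TTL_SECONDS = 300  # "Tenant configuration (TTL: 5 minutes)"


class CacheClient(Protocol):
    """The two redis-py client methods this sketch relies on."""

    def get(self, name: str) -> Optional[bytes]: ...

    def set(self, name: str, value: str, ex: Optional[int] = None) -> object: ...


def get_tenant_config(
    cache: CacheClient, tenant_id: str, load_from_db: Callable[[str], dict]
) -> dict:
    """Cache-aside read: serve from Redis if present, else load and populate with a TTL."""
    key = f"tenant-config:{tenant_id}"  # hypothetical key naming scheme
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    config = load_from_db(tenant_id)
    cache.set(key, json.dumps(config), ex=TENANT_CONFIG_TTL_SECONDS)
    return config
```

The short TTL bounds how long a changed tenant setting, such as a quota or feature flag, can remain stale without requiring explicit invalidation.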
8.2. Database Optimization
Connection Pooling
Application Pods (50 pods × 10 connections) = 500 connections
│
▼
┌───────────────────────┐
│ PgBouncer Pool │
│ (Transaction Mode) │
│ Max: 100 connections│
└───────────┬───────────┘
│
▼
┌───────────────────────┐
│ PostgreSQL Primary │
│ Max: 200 connections │
└───────────────────────┘
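The diagram relies on PgBouncer in transaction mode multiplexing many client connections onto fewer server connections. The Python arithmetic below checks the figures. The `PoolPlan` class is hypothetical, and it assumes "Max: 100 connections" means the total number of server connections PgBouncer opens to PostgreSQL. In PgBouncer terms this corresponds to `pool_mode = transaction`, `max_client_conn` of at least 500, and `default_pool_size`, which applies per user/database pair, so the total depends on how many such pairs exist.

```python
from dataclasses import dataclass


@dataclass
class PoolPlan:
    """Hypothetical model of the connection-pooling figures in the diagram."""

    app_pods: int
    conns_per_pod: int
    pgbouncer_server_conns: int
    postgres_max_connections: int

    @property
    def client_connections(self) -> int:
        return self.app_pods * self.conns_per_pod

    @property
    def multiplexing_ratio(self) -> float:
        # Client connections sharing each server connection in transaction mode.
        return self.client_connections / self.pgbouncer_server_conns

    @property
    def postgres_headroom(self) -> int:
        # Server slots left for replication, admin sessions, migrations, etc.
        return self.postgres_max_connections - self.pgbouncer_server_conns


plan = PoolPlan(app_pods=50, conns_per_pod=10, pgbouncer_server_conns=100, postgres_max_connections=200)
```

With these figures, each server connection serves five client connections, and PostgreSQL keeps 100 slots in reserve. Transaction mode only works when no session state has to survive across transactions, such as `SET` parameters or session-level advisory locks.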
Query Optimization
- Indexes: Covering indexes on frequently queried columns
- Partitioning: Time-based partitioning for large tables (job history, audit logs)
- Materialized Views: Pre-aggregated data for dashboards
- Query Plan Analysis: Regular EXPLAIN ANALYZE on slow queries
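Time-based partitioning in PostgreSQL means creating one child table per time range ahead of time. The Python sketch below generates monthly partition DDL. It assumes the parent table (called `audit_log` here, a hypothetical name) was created with `PARTITION BY RANGE` on a date or timestamp column. Range bounds are inclusive at `FROM` and exclusive at `TO`, so consecutive months do not overlap. Extensions such as pg_partman can automate the same maintenance.

```python
from datetime import date
from typing import List


def month_boundaries(first: date, months: int) -> List[date]:
    """Return months + 1 consecutive first-of-month dates, starting at first's month."""
    bounds = []
    year, month = first.year, first.month
    for _ in range(months + 1):
        bounds.append(date(year, month, 1))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return bounds


def monthly_partition_ddl(parent: str, first: date, months: int) -> List[str]:
    """Emit one CREATE TABLE ... PARTITION OF statement per month.

    Assumes `parent` is range-partitioned on a date/timestamp column.
    """
    bounds = month_boundaries(first, months)
    return [
        f"CREATE TABLE IF NOT EXISTS {parent}_{lo:%Y_%m} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{lo.isoformat()}') TO ('{hi.isoformat()}');"
        for lo, hi in zip(bounds, bounds[1:])
    ]
```

Partitioning by month also simplifies retention. Data older than the retention window can be removed by dropping whole partitions instead of running large `DELETE` statements.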
8.3. Content Delivery Optimization
Document Generation Pipeline
Request → Queue → Worker Pool → Template Cache → Generate → S3 Upload
│ │ │ │ │ │
│ │ │ │ │ └─ Async
│ │ │ │ └─ Parallel processing
│ │ │ └─ In-memory cache
│ │ └─ Auto-scaling (3-20 workers)
│ └─ RabbitMQ/Kafka (persistent)
└─ Immediate response with job ID
Optimization Techniques:
- Batch Processing: Group similar documents for efficiency
- Template Pre-compilation: Cache compiled templates
- Parallel Rendering: Multi-threaded document generation
- Output Streaming: Stream large documents to storage
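The pipeline returns a job ID immediately and renders documents asynchronously. The Python sketch below models that request flow with a `queue.Queue` and a fixed pool of worker threads. It is a simplification under stated assumptions. The in-process queue is not persistent the way RabbitMQ or Kafka are, the worker count is fixed rather than auto-scaled between 3 and 20, the results dictionary stands in for the S3 upload, and `JobService` and the `render` callable are hypothetical.

```python
import queue
import threading
import uuid
from typing import Callable, Dict, Optional, Tuple


class JobService:
    """Accept generation requests, return a job ID at once, render in a worker pool."""

    def __init__(self, render: Callable[[dict], bytes], workers: int = 3) -> None:
        self._jobs: "queue.Queue[Tuple[str, dict]]" = queue.Queue()
        self._status: Dict[str, str] = {}
        self._results: Dict[str, bytes] = {}  # stand-in for the S3 upload target
        self._lock = threading.Lock()
        self._render = render
        for _ in range(workers):
            threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, request: dict) -> str:
        """Enqueue a request and return its job ID without waiting for rendering."""
        job_id = str(uuid.uuid4())
        with self._lock:
            self._status[job_id] = "queued"
        self._jobs.put((job_id, request))
        return job_id

    def status(self, job_id: str) -> str:
        with self._lock:
            return self._status[job_id]

    def result(self, job_id: str) -> Optional[bytes]:
        with self._lock:
            return self._results.get(job_id)

    def wait_all(self) -> None:
        """Block until every queued job has been processed (useful in tests)."""
        self._jobs.join()

    def _worker(self) -> None:
        while True:
            job_id, request = self._jobs.get()
            with self._lock:
                self._status[job_id] = "running"
            try:
                output = self._render(request)
                with self._lock:
                    self._results[job_id] = output
                    self._status[job_id] = "done"
            except Exception:
                with self._lock:
                    self._status[job_id] = "failed"
            finally:
                self._jobs.task_done()
```

Clients poll `status` (or receive a callback) with the job ID. A failed render is therefore recorded against the job instead of being lost with the original HTTP request.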
8.4. Network Optimization
- HTTP/2: Multiplexing for reduced latency
- Compression: Gzip/Brotli compression for API responses
- Keep-Alive: Persistent connections to reduce overhead
- DNS Caching: Aggressive DNS caching (5 min TTL)
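Response compression depends on the client's `Accept-Encoding` header. The Python sketch below shows the negotiation for gzip only. Brotli needs a third-party package, and compression is often handled at the ingress or load balancer rather than in application code. The 1 KB threshold and the function names are hypothetical, and the parser deliberately ignores the `*` wildcard.

```python
import gzip
from typing import Dict, Tuple

MIN_COMPRESS_BYTES = 1024  # hypothetical cut-off: tiny bodies gain little from compression


def accepts_gzip(accept_encoding: str) -> bool:
    """True if the Accept-Encoding header lists gzip without q=0 (wildcard ignored)."""
    for item in accept_encoding.split(","):
        coding, _, params = item.partition(";")
        if coding.strip().lower() != "gzip":
            continue
        params = params.strip().lower()
        if params.startswith("q="):
            try:
                return float(params[2:]) > 0
            except ValueError:
                return False
        return True
    return False


def encode_body(body: bytes, accept_encoding: str) -> Tuple[bytes, Dict[str, str]]:
    """Gzip the response body when the client allows it and the body is large enough."""
    headers = {"Vary": "Accept-Encoding"}  # caches must key on the request's encoding
    if len(body) >= MIN_COMPRESS_BYTES and accepts_gzip(accept_encoding):
        headers["Content-Encoding"] = "gzip"
        return gzip.compress(body), headers
    return body, headers
```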
9. Compliance and Audit
9.1. Compliance Requirements
| Regulation | Scope | Implementation |
|---|---|---|
| GDPR | EU data protection | Swedish data residency, data processing agreements, right to erasure |
| ISO 27001 | Information security | Full ISMS implementation, regular audits |
| PCI DSS | Payment data (if applicable) | Tokenization, network segmentation (future) |
| Swedish Data Protection | National regulations | Data residency, DPA compliance |
9.2. Audit Logging
Audit Events
- Authentication: Login, logout, failed attempts, MFA events
- Authorization: Permission changes, role assignments
- Data Access: Document views, downloads, exports
- Configuration Changes: Tenant settings, user management
- Administrative Actions: Database access, system configuration
Audit Log Format
{
"timestamp": "2025-01-15T10:30:45.123Z",
"event_type": "document.view",
"actor": {
"user_id": "user-123",
"email": "john.doe@example.com",
"ip_address": "203.0.113.45",
"user_agent": "Mozilla/5.0..."
},
"tenant_id": "tenant-abc-123",
"resource": {
"type": "document",
"id": "doc-456",
"path": "/templates/invoice.docx"
},
"action": "view",
"result": "success",
"metadata": {
"session_id": "sess-789",
"request_id": "req-012"
}
}
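Services that emit audit events need to produce this structure consistently. The Python sketch below builds and serializes an event with the same fields and field order as the sample record. The `Actor`, `Resource`, and `AuditEvent` classes are hypothetical and are not an existing library. The timestamp is generated in UTC with millisecond precision and a trailing `Z` when none is supplied.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass
class Actor:
    user_id: str
    email: str
    ip_address: str
    user_agent: str


@dataclass
class Resource:
    type: str
    id: str
    path: str


@dataclass
class AuditEvent:
    event_type: str
    actor: Actor
    tenant_id: str
    resource: Resource
    action: str
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)
    timestamp: Optional[str] = None  # filled in at serialization time if not given

    def to_json(self) -> str:
        record = asdict(self)  # converts nested dataclasses to dicts as well
        timestamp = record.pop("timestamp")
        if timestamp is None:
            # UTC, millisecond precision, trailing "Z", as in the sample record.
            now = datetime.now(timezone.utc)
            timestamp = now.isoformat(timespec="milliseconds").replace("+00:00", "Z")
        # Emit "timestamp" first to mirror the documented field order.
        return json.dumps({"timestamp": timestamp, **record})
```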
9.3. Data Residency
Commitment: All customer data stored within Sweden
- Primary DC: Sweden (Entiros AB)
- DR DC: Sweden (separate facility)
- No Cross-Border Transfer: Data never leaves Swedish jurisdiction
- Subprocessor Control: All subprocessors bound by DPA
10. Deployment and Release Management
10.1. CI/CD Pipeline
┌─────────────────────────────────────────────────────────────┐
│ CI/CD PIPELINE │
│ │
│ 1. Code Commit (Git) │
│ └─▶ 2. CI Build (GitHub Actions/GitLab CI) │
│ ├─ Unit Tests │
│ ├─ Integration Tests │
│ ├─ Security Scanning (SAST) │
│ ├─ Dependency Check │
│ └─ Build Docker Image │
│ └─▶ 3. Push to Container Registry │
│ └─▶ 4. Deploy to Test Cluster │
│ ├─ Automated E2E Tests │
│ └─ Performance Tests │
│ └─▶ 5. Manual Approval │
│ └─▶ 6. Deploy to Production │
│ ├─ Canary (10%) │
│ ├─ Monitor (15 min) │
│ └─ Full Rollout │
└─────────────────────────────────────────────────────────────┘
10.2. Deployment Strategies
Blue/Green Deployment
- Two Environments: Blue (current), Green (new)
- Traffic Switch: Near-instant cutover at the load balancer; a DNS-based switch takes effect only as cached records expire (300-second TTL)
- Rollback: Switch back to Blue if issues detected
- Use Case: Major version upgrades
Canary Deployment
- Gradual Rollout: 10% → 25% → 50% → 100%
- Monitoring: Watch error rates, performance during rollout
- Automated Rollback: If metrics exceed thresholds
- Use Case: Standard releases
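The canary procedure is a loop. Shift a share of traffic, observe, and either advance to the next stage or roll back. The Python sketch below models that loop with the stages listed above. The 1% error-rate threshold and the `observe_error_rate` callback are hypothetical. In a real pipeline, the callback would query the metrics backend after the observation window, and tools such as Argo Rollouts or Flagger can automate the same loop.

```python
from typing import Callable, List, Sequence

STAGES = (10, 25, 50, 100)  # traffic percentages from the rollout plan above
MAX_ERROR_RATE = 0.01       # hypothetical threshold: 1% of canary requests failing


def run_canary(
    observe_error_rate: Callable[[int], float],
    stages: Sequence[int] = STAGES,
    max_error_rate: float = MAX_ERROR_RATE,
) -> List[str]:
    """Shift traffic stage by stage; roll back as soon as the canary breaches the threshold."""
    log: List[str] = []
    for percent in stages:
        log.append(f"shift {percent}% of traffic to canary")
        # In practice: wait out the observation window, then query the metrics backend.
        rate = observe_error_rate(percent)
        if rate > max_error_rate:
            log.append(f"rollback at {percent}%: error rate {rate:.2%}")
            return log
    log.append("rollout complete")
    return log
```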
Rolling Update
- Default Strategy: Kubernetes rolling update
- Max Unavailable: 25% of pods
- Max Surge: 25% additional pods
- Use Case: Minor updates, patches
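The 25% settings correspond to a Deployment's `strategy.rollingUpdate.maxSurge` and `maxUnavailable` fields. Kubernetes converts percentages to whole pods by rounding maxSurge up and maxUnavailable down. The Python sketch below computes the resulting bounds for a given replica count; the function name is hypothetical.

```python
def rolling_update_bounds(replicas: int, max_surge_pct: int = 25, max_unavailable_pct: int = 25) -> tuple:
    """Return (max pods during the update, min available pods) for a Deployment.

    When maxSurge and maxUnavailable are percentages, Kubernetes rounds maxSurge
    up and maxUnavailable down to whole pods.
    """
    surge = -(-replicas * max_surge_pct // 100)          # ceiling division
    unavailable = replicas * max_unavailable_pct // 100  # floor division
    return replicas + surge, replicas - unavailable
```

For example, 10 replicas give at most 13 pods and at least 8 available during an update. With 3 replicas, maxUnavailable rounds down to 0, so the rollout proceeds only by surging one extra pod at a time.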
10.3. Release Cadence
| Release Type | Frequency | Scope | Customer Impact |
|---|---|---|---|
| Hotfix | As needed | Critical bug fix | Immediate, minimal downtime |
| Patch | Monthly | Bug fixes, minor features | Scheduled maintenance window |
| Minor | Quarterly | New features, improvements | Scheduled, tested in advance |
| Major | Annually | Breaking changes, major features | Advance notice, migration support |
11. Future Architecture Enhancements
11.1. Planned Improvements (6-12 months)
| Enhancement | Benefit | Timeline |
|---|---|---|
| Service Mesh (Istio) | mTLS, advanced traffic management | Q2 2025 |
| Multi-Region Deployment | Lower latency, geographic redundancy | Q3 2025 |
| AI/ML Integration | Document intelligence, predictive analytics | Q4 2025 |
| GraphQL API | Flexible querying, reduced overfetching | Q3 2025 |
| Event-Driven Architecture | Improved scalability, decoupling | Q2 2025 |
11.2. Capacity Planning
Current Platform Capacity (Launch):
- Tenants: Up to 50 active tenants
- Users: Up to 5,000 concurrent users
- API Requests: 10,000 requests/minute
12-Month Projection:
- Tenants: 100-150 active tenants
- Users: 10,000-15,000 concurrent users
- API Requests: 50,000 requests/minute
Scaling Path:
- Compute: Add 3-5 worker nodes per quarter
- Database: Upgrade to larger instance, add read replicas
- Storage: Scales linearly with object storage (no fixed capacity ceiling)
- Network: Upgrade bandwidth as needed (10 Gbps → 40 Gbps)
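The growth figures above can be compared directly. The Python arithmetic below uses only the launch and 12-month numbers from this section; the dictionaries and function names are hypothetical.

```python
LAUNCH = {"tenants": 50, "concurrent_users": 5_000, "api_requests_per_min": 10_000}
PROJECTED_12M = {
    "tenants": (100, 150),
    "concurrent_users": (10_000, 15_000),
    "api_requests_per_min": (50_000, 50_000),  # single-point projection
}


def growth_factor(metric: str) -> tuple:
    """Low and high multiples of the launch figure implied by the 12-month projection."""
    low, high = PROJECTED_12M[metric]
    return low / LAUNCH[metric], high / LAUNCH[metric]


def requests_per_second(requests_per_min: float) -> float:
    return requests_per_min / 60


def requests_per_user_per_min(requests_per_min: float, users: float) -> float:
    return requests_per_min / users
```

On these figures, tenants and concurrent users grow 2-3x while API traffic grows 5x, from about 167 to about 833 requests per second. The projection therefore implies that each concurrent user's request rate rises from 2 to roughly 3.3-5 requests per minute, so the API tier should be sized from the request figure rather than from user counts.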
Service Component Capacity (see component addendums):
- Communications: Document generation throughput
- Notifications: Delivery throughput by channel
12. Service Component Architecture Addendums
This platform architecture document is supplemented by component-specific technical architecture addendums:
| Document ID | Title | Description |
|---|---|---|
| SW-SAAS-ARCH-COMP-001 | Communications Technical Architecture | OpenText Exstream document generation architecture |
| SW-SAAS-ARCH-COMP-002 | Notifications Technical Architecture | Multi-channel notification delivery architecture |
Each addendum provides:
- Component-specific pod configurations and resource requirements
- Component-specific APIs and integration patterns
- Component-specific monitoring, metrics, and alerting
- Component-specific performance tuning and optimization
- Component-specific backup and recovery procedures
13. Appendices
13.1. Technology Version Matrix
| Component | Version | EOL Date |
|---|---|---|
| Kubernetes | 1.28.x | Oct 2024 (EOL already reached; upgrade to 1.29 planned) |
| Docker | 24.0.x | Ongoing support |
| PostgreSQL | 15.x | Nov 2027 |
| Redis | 7.2.x | Ongoing support |
| Prometheus | 2.48.x | Ongoing support |
| Grafana | 10.2.x | Ongoing support |
| OpenText Comms | [Version TBD] | Per OpenText support policy |
13.2. Network Ports and Protocols
| Port | Protocol | Purpose | Access |
|---|---|---|---|
| 443 | HTTPS | Web application, API | Public |
| 80 | HTTP | Redirect to HTTPS | Public |
| 22 | SSH | Server administration | VPN only |
| 5432 | PostgreSQL | Database | Internal only |
| 6379 | Redis | Cache | Internal only |
| 9090 | Prometheus | Metrics | Internal only |
| 3000 | Grafana | Monitoring dashboards | VPN only |
| 5601 | Kibana | Log visualization | VPN only |
13.3. DNS Configuration
| Record Type | Name | Value | TTL |
|---|---|---|---|
| A | app.swedwise.com | [Load Balancer IP] | 300 |
| CNAME | api.swedwise.com | app.swedwise.com | 300 |
| CNAME | www.swedwise.com | app.swedwise.com | 300 |
| MX | swedwise.com | [Mail server] | 3600 |
| TXT | _dmarc.swedwise.com | [DMARC policy] | 3600 |
| TXT | swedwise.com | [SPF record] | 3600 |
13.4. SSL/TLS Configuration
- Certificate Authority: Let's Encrypt (automated renewal)
- Certificate Type: Wildcard (*.swedwise.com)
- TLS Version: TLS 1.2+ (TLS 1.3 preferred)
- Cipher Suites: Modern, secure ciphers only (no RC4, 3DES)
- HSTS: Enabled with 1-year max-age
- OCSP Stapling: Enabled
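The protocol and cipher rules above can be expressed in a TLS server configuration. The Python sketch below uses the standard `ssl` module to enforce a TLS 1.2 minimum and to restrict TLS 1.2 cipher suites to forward-secret AEAD ciphers, which excludes RC4 and 3DES. It also builds the HSTS header with a one-year max-age. The cipher string and function names are illustrative assumptions. Certificate loading is omitted. In this architecture, the equivalent settings would normally live in the ingress or load balancer, which is also where OCSP stapling is configured, since Python's `ssl` module does not staple server-side.

```python
import ssl
from typing import Tuple

HSTS_MAX_AGE_SECONDS = 365 * 24 * 60 * 60  # one year


def make_server_context() -> ssl.SSLContext:
    """TLS server context: TLS 1.2 minimum, TLS 1.3 negotiated when both sides support it."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse TLS 1.0 and 1.1
    # Limit TLS 1.2 suites to ECDHE key exchange with AES-GCM or ChaCha20 (no RC4, no 3DES).
    ctx.set_ciphers("ECDHE+AESGCM:ECDHE+CHACHA20")
    return ctx


def hsts_header() -> Tuple[str, str]:
    """HSTS response header with the one-year max-age listed above."""
    return ("Strict-Transport-Security", f"max-age={HSTS_MAX_AGE_SECONDS}")
```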
13.5. Contact Information
| Role | Responsibility | Contact |
|---|---|---|
| Technical Lead | Architecture decisions | tech-lead@swedwise.com |
| Operations Manager | 24/7 operations | ops@swedwise.com |
| Security Officer | Security incidents | security@swedwise.com |
| Data Center Partner | Infrastructure | Entiros AB - support@entiros.se |
Document Control
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-01-15 | Technical Lead | Initial platform architecture document |
| 1.1 | 2025-01-15 | Technical Lead | Refactored to platform-level; Communications and Notifications moved to addendums |
Classification: Confidential
Distribution: Internal use and customers under NDA only
Review Date: 2026-01-15
Related Documents:
- SW-SAAS-ARCH-COMP-001: Communications Technical Architecture Addendum
- SW-SAAS-ARCH-COMP-002: Notifications Technical Architecture Addendum
This document contains confidential technical information about the Swedwise SaaS Platform architecture. Unauthorized distribution or disclosure is prohibited.