Draft | Internal | ISO 9001 | ISO 27001

SW-SAAS-OPS-COMP-001

Operations Addendum - Communications

Version: 1.0
Owner: SaaS Operations Manager
Effective Date: [TBD]
Review Date: [TBD]

1. Addendum Overview

1.1. Purpose

This operations addendum provides component-specific operational procedures for the Communications (OpenText Exstream) service component within the Swedwise SaaS platform.

This document supplements SW-SAAS-OPS-001 (SaaS Platform Operations Guide). When conflicts exist, the main Operations Guide takes precedence.

1.2. Service Component Overview

Component	Technology	Purpose
OpenText Communications (Exstream)	OpenText Exstream v24.x	High-volume document composition and generation
Template Designer	OpenText Designer	Document template creation and editing
Batch Processing Engine	Kubernetes Jobs + Message Queue	Scheduled and batch document generation

1.3. Service Boundaries

In Scope:

Document generation via API and batch processing
Template management and version control
Customer-specific configurations and branding
Multi-tenant isolation and security

Out of Scope:

Customer internal systems (integration endpoints)
Integration broker operations (covered in parent document)

1.4. Key Operational Characteristics

Multi-Tenant: Shared infrastructure, isolated data per tenant
High-Volume: Designed for batch processing (thousands of documents per job)
Asynchronous: Most operations queued and processed asynchronously
Template-Dependent: Document generation quality depends on template design

2. Component Architecture

2.1. Communications (Exstream) Architecture

┌────────────────────────────────────────────────────────────┐
│              OPENTEXT COMMUNICATIONS PODS                  │
│                                                            │
│  ┌──────────────────┐  ┌──────────────────┐              │
│  │  Exstream API    │  │  Exstream        │              │
│  │  (3+ replicas)   │  │  Designer        │              │
│  │                  │  │  (2 replicas)    │              │
│  │  - REST API      │  │                  │              │
│  │  - Job Queue     │  │  - Template Edit │              │
│  │  - Rendering     │  │  - Preview       │              │
│  └────────┬─────────┘  └────────┬─────────┘              │
│           │                     │                         │
└───────────┼─────────────────────┼─────────────────────────┘
            │                     │
            ▼                     ▼
┌────────────────────────────────────────────────────────────┐
│                    DATA LAYER                              │
│                                                            │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────────┐  │
│  │ PostgreSQL   │  │  Object      │  │  Message Queue │  │
│  │ (Tenant DBs) │  │  Storage     │  │  (RabbitMQ)    │  │
│  │              │  │              │  │                │  │
│  │ - Templates  │  │ - Assets     │  │ - Job Queue    │  │
│  │ - Job Status │  │ - Output     │  │ - Batch Queue  │  │
│  │ - Metadata   │  │ - Archives   │  │ - Priority Q   │  │
│  └──────────────┘  └──────────────┘  └────────────────┘  │
└────────────────────────────────────────────────────────────┘

2.2. Component Dependencies

Component	Depends On	Impact if Unavailable
Exstream API	PostgreSQL, Object Storage, Message Queue	Cannot accept new jobs; existing jobs may fail
Exstream Designer	PostgreSQL, Object Storage	Template editing unavailable

3. Component Monitoring

3.1. Key Metrics

Metric	Normal Range	Warning	Critical	Alert Target
API Response Time (p95)	<1 second	>2 seconds	>5 seconds	SOC + On-Call
Document Generation Success Rate	>99%	<99%	<95%	SOC + On-Call
Queue Depth	<100 jobs	>500 jobs	>1000 jobs	SOC
Template Compilation Time	<30 seconds	>60 seconds	>120 seconds	Operations Manager
Pod CPU Utilization	<70%	>80%	>90%	SOC
Pod Memory Utilization	<75%	>85%	>95%	SOC + On-Call
Storage Usage (templates + assets)	-	>80% quota	>90% quota	SOC
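
As an illustration of how the thresholds above could be evaluated, here is a minimal Python sketch. The `Threshold` class and `classify` function are hypothetical names, not part of the monitoring stack, and the sketch assumes that any value that has not crossed the Warning threshold counts as normal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Warning and critical limits for one metric (hypothetical helper)."""
    warning: float
    critical: float
    higher_is_worse: bool = True  # False for metrics such as success rate


def classify(value: float, t: Threshold) -> str:
    """Return 'critical', 'warning', or 'normal' for a metric sample."""
    if t.higher_is_worse:
        if value > t.critical:
            return "critical"
        if value > t.warning:
            return "warning"
    else:
        if value < t.critical:
            return "critical"
        if value < t.warning:
            return "warning"
    return "normal"


# Limits copied from the Key Metrics table.
API_P95_SECONDS = Threshold(warning=2.0, critical=5.0)
SUCCESS_RATE_PCT = Threshold(warning=99.0, critical=95.0, higher_is_worse=False)
QUEUE_DEPTH_JOBS = Threshold(warning=500, critical=1000)

print(classify(3.2, API_P95_SECONDS))    # warning
print(classify(94.0, SUCCESS_RATE_PCT))  # critical
print(classify(80, QUEUE_DEPTH_JOBS))    # normal
```

In practice these rules would normally live in the alerting configuration; the sketch only makes the direction of each comparison explicit.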

3.2. Health Checks

Exstream API Health Check (/health):

{
  "status": "healthy",
  "timestamp": "2025-01-15T10:30:45Z",
  "version": "24.4.0",
  "checks": {
    "database": "healthy",
    "object_storage": "healthy",
    "message_queue": "healthy",
    "template_cache": "healthy",
    "license": "valid"
  },
  "metrics": {
    "active_jobs": 45,
    "queued_jobs": 12,
    "avg_generation_time_seconds": 0.75
  }
}

Exstream Designer Health Check (/health):

{
  "status": "healthy",
  "timestamp": "2025-01-15T10:30:45Z",
  "checks": {
    "database": "healthy",
    "object_storage": "healthy",
    "template_compilation": "healthy"
  }
}
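
A triage script could reduce either health response to a list of failing checks. The following is a minimal sketch, assuming the response shape shown above. The function name `failing_checks` and the non-healthy values in the sample ("degraded", "unhealthy") are assumptions, since this document only shows the healthy case.

```python
import json


def failing_checks(payload: str) -> list[str]:
    """Return the names of checks that are neither 'healthy' nor 'valid'."""
    body = json.loads(payload)
    ok_values = {"healthy", "valid"}
    failing = [name for name, state in body.get("checks", {}).items()
               if state not in ok_values]
    # Report the overall status if it is bad but no individual check explains it.
    if body.get("status") != "healthy" and not failing:
        failing.append("status")
    return sorted(failing)


# Hypothetical degraded response, for illustration only.
sample = json.dumps({
    "status": "degraded",
    "checks": {
        "database": "healthy",
        "message_queue": "unhealthy",
        "license": "valid",
    },
})
print(failing_checks(sample))  # ['message_queue']
```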

3.3. Prometheus Metrics

Custom Metrics Exported:

exstream_documents_generated_total (counter, by tenant)
exstream_document_generation_duration_seconds (histogram)
exstream_job_queue_depth (gauge)
exstream_template_cache_hit_ratio (gauge)
exstream_errors_total (counter, by error_type)

Grafana Dashboard Panels:

Documents generated per hour (trend)
Generation success rate (%)
Queue depth over time
P50, P95, P99 generation times
Error breakdown (by type and tenant)
Resource utilization (CPU, memory, storage)

4. Incident Handling - Component-Specific

4.1. Document Generation Failures

Symptoms:

Documents failing to generate (error response from API)
Increased error rate in logs
Customer reports missing documents

Triage Steps:

Check Exstream pod status: kubectl get pods -n saas-platform -l app=exstream
Check pod logs for errors: kubectl logs -n saas-platform <pod-name> --tail=100
Verify database connectivity (check health endpoint)
Check object storage access (templates, assets)
Review recent template changes (possible template corruption)

Common Causes and Solutions:

Cause	Symptoms	Solution
Template Error	Specific template fails consistently	Review template logs, validate template syntax, roll back to previous version
Missing Asset	Generation fails at specific point	Check object storage for missing images/fonts, restore from backup if needed
Database Connection Pool Exhausted	Intermittent failures, "too many connections" error	Restart pods (Kubernetes will recreate with fresh connections), investigate connection leak
Out of Memory	Pod restarts frequently, OOMKilled in logs	Increase pod memory limits, investigate memory leak in template processing
Corrupted Template Cache	Inconsistent results, works after cache clear	Clear template cache, restart affected pods
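
The two causes in the table that come with a literal error string can be spotted mechanically. This is a minimal sketch, not an existing tool: `LOG_HINTS` and `suggest_causes` are hypothetical names, and only the strings quoted in the table are matched.

```python
# Substrings quoted in the table above, mapped to the cause they suggest.
LOG_HINTS = {
    "too many connections": "Database Connection Pool Exhausted",
    "OOMKilled": "Out of Memory",
}


def suggest_causes(log_lines):
    """Return the sorted list of table causes hinted at by the given lines."""
    found = set()
    for line in log_lines:
        lowered = line.lower()
        for needle, cause in LOG_HINTS.items():
            if needle.lower() in lowered:
                found.add(cause)
    return sorted(found)


# Hypothetical log excerpt for illustration.
lines = [
    "ERROR FATAL: sorry, too many connections for role",
    "INFO job 4711 completed",
]
print(suggest_causes(lines))  # ['Database Connection Pool Exhausted']
```

A match is only a hint for where to look first; the remaining causes in the table still need manual review.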

Escalation:

If template issue: Escalate to Customer Success + Template Designer
If OpenText platform issue: Escalate to L3 + OpenText Support
If infrastructure issue: Follow standard infrastructure incident procedures

4.2. Template Designer Issues

Symptoms:

Cannot open template editor
Template save failures
Preview not working

Triage Steps:

Check Designer pod status
Check Designer logs for errors
Verify user authentication and permissions
Check database connectivity (template metadata)
Check object storage access (template files)

Common Causes and Solutions:

Cause	Symptoms	Solution
Pod Crash	500 errors, cannot load designer	Check pod logs for crash reason, restart pod, escalate if persistent
License Issue	Designer won't start, license error in logs	Verify OpenText license valid and not expired, contact OpenText to renew
Storage Full	Cannot save templates	Check tenant storage quota, increase quota or clean up old templates
Template Lock	Cannot edit template, "locked by another user"	Check database for stale locks, clear lock if user session ended abnormally

4.3. Batch Processing Failures

Symptoms:

Scheduled batch jobs not running
Batch job taking much longer than expected
Partial batch completion (some documents generated, others failed)

Triage Steps:

Check Kubernetes Job status: kubectl get jobs -n saas-platform
Check Job pod logs: kubectl logs -n saas-platform job/<job-name>
Verify job trigger (cron schedule, API call)
Check message queue depth and processing rate
Review job input data validity

Common Causes and Solutions:

Cause	Symptoms	Solution
Job Trigger Failure	Job didn't start at scheduled time	Check CronJob configuration, verify scheduler running, check job history
Input Data Error	Job fails immediately or after processing few items	Validate input data format, check for null values or missing fields, review error logs
Resource Limits	Job killed (OOMKilled), job stuck at same progress	Increase job resource limits (CPU, memory), investigate data volume or template complexity
Queue Congestion	Job runs slowly, queue depth increasing	Scale up worker pods, increase queue concurrency, optimize slow templates
Partial Failure	Some documents succeed, others fail	Review error logs for failed items, may need to retry failed subset separately
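
For the partial-failure case, retrying only the failed subset requires separating the per-document results first. A minimal sketch follows, assuming a hypothetical result format in which each document ID maps to an error message, or to None on success. The actual batch report format is not specified in this document.

```python
def split_batch_results(results):
    """Split per-document results into (succeeded, failed) lists of IDs.

    `results` maps a document ID to an error message, or to None on success.
    """
    succeeded = [doc_id for doc_id, err in results.items() if err is None]
    failed = [doc_id for doc_id, err in results.items() if err is not None]
    return succeeded, failed


# Hypothetical batch outcome for illustration.
results = {
    "doc-001": None,
    "doc-002": "missing field: customer_name",
    "doc-003": None,
}
ok, retry = split_batch_results(results)
print(retry)  # ['doc-002']
```

The `retry` list would then be submitted as a separate, smaller batch once the input data has been corrected.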

5. Upgrade Procedures

5.1. OpenText Communications Upgrade

Frequency: Quarterly (aligned with OpenText release cycle)

Upgrade Types:

Patch Release (24.4.1 → 24.4.2): Bug fixes, security patches
Minor Release (24.4 → 24.5): New features, enhancements
Major Release (24.x → 25.x): Breaking changes, major features
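
The upgrade type can be derived from the two version strings. The sketch below assumes dotted numeric versions as in the examples above and treats missing components as 0. The function name `upgrade_type` is hypothetical.

```python
def upgrade_type(current: str, target: str) -> str:
    """Classify an upgrade as 'major', 'minor', or 'patch'."""
    cur = [int(part) for part in current.split(".")]
    new = [int(part) for part in target.split(".")]
    # Pad to three components so that "24.4" compares as "24.4.0".
    cur += [0] * (3 - len(cur))
    new += [0] * (3 - len(new))
    if new <= cur:
        raise ValueError("target version must be newer than current version")
    if new[0] != cur[0]:
        return "major"
    if new[1] != cur[1]:
        return "minor"
    return "patch"


print(upgrade_type("24.4.1", "24.4.2"))  # patch
print(upgrade_type("24.4", "24.5"))      # minor
print(upgrade_type("24.4.2", "25.1"))    # major
```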

Upgrade Process:

Phase 1: Planning and Preparation (2 weeks before)

Review OpenText release notes and known issues
Assess compatibility with current templates and integrations
Identify breaking changes requiring customer notification
Schedule maintenance window (typically Monthly Extended Window)
Notify customers (10 business days in advance for major releases, 5 days for minor releases)
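
The latest permissible notification date can be worked out from the maintenance window date. This is a minimal sketch under two assumptions: both notice periods are counted in business days (the text only says so explicitly for major releases), and public holidays are ignored. All names are hypothetical.

```python
from datetime import date, timedelta


def subtract_business_days(day: date, count: int) -> date:
    """Step back `count` weekdays (Monday to Friday) from `day`."""
    current = day
    remaining = count
    while remaining > 0:
        current -= timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


# Notice periods from Phase 1; business days are assumed for both.
NOTICE_BUSINESS_DAYS = {"major": 10, "minor": 5}


def latest_notification_date(window: date, release_type: str) -> date:
    """Return the last date on which customers can still be notified in time."""
    return subtract_business_days(window, NOTICE_BUSINESS_DAYS[release_type])


# Maintenance window on Monday 17 March 2025.
print(latest_notification_date(date(2025, 3, 17), "major"))  # 2025-03-03
print(latest_notification_date(date(2025, 3, 17), "minor"))  # 2025-03-10
```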

Phase 2: Testing (1 week before)

Deploy new version to staging environment
Run automated test suite (document generation, API tests)
Test representative templates from each tenant
Benchmark performance against the current version
Verify rollback procedure

Phase 3: Backup (Day of upgrade, T-1 hour)

Take full database backup
Snapshot template repository (object storage)
Backup Kubernetes configurations
Verify backup integrity

Phase 4: Implementation (During maintenance window)

Announce maintenance start (customer notification)
Drain existing requests (allow in-flight jobs to complete)
Scale down current Exstream pods to 0
Deploy new version pods
Run database migrations (if applicable)
Scale up new version pods
Verify health checks passing
Run smoke tests (generate test documents)

Phase 5: Verification (T+15 min)

Monitor error rates and performance metrics
Test each tenant with sample document generation
Verify template designer functionality
Check integration endpoints operational
Review logs for unexpected errors

Phase 6: Monitoring and Rollback Decision (T+30 min)

If issues detected: Execute rollback procedure
If stable: Announce maintenance completion
Continue enhanced monitoring for 24 hours
Conduct Post-Implementation Review within 7 days

Rollback Procedure:

Scale down new version pods to 0
Restore database from pre-upgrade backup (if schema changed)
Scale up previous version pods
Verify service restoration
Notify customers of the rollback
Investigate issues before re-attempting upgrade

5.2. Template Migration Between Environments

Purpose: Deploy new or updated templates from staging to production

Template Migration Workflow:

Template Designer (Staging) → Review & Approval → Version Control (Git) →
Deploy to Production → Verification → Customer Notification

Steps:

Template Development (in staging environment)
- Customer Success or customer creates/modifies template
- Test with sample data
- Preview output documents
Review and Approval
- Customer Success reviews template output
- Customer approves template (if customer-created)
- Operations verifies template best practices
Version Control
- Export template from staging
- Commit to Git repository with version tag
- Document changes in release notes
Production Deployment (during change window)
- Import template to production environment
- Verify template compiles successfully
- Test generation with production-like data
- Activate template (make available to customer)
Verification
- Generate test document in production
- Compare output to staging
- Monitor error rates for template-related issues
Customer Notification (if significant change)
- Notify customer template is live
- Provide change summary if breaking changes

Template Versioning:

Semantic versioning: <major>.<minor>.<patch>
Previous versions retained for rollback (last 30 versions, consistent with Sections 6.1 and 8.2)
Template changes logged in audit trail

6. Backup Specifics

6.1. Template Backup

What is Backed Up:

Template definition files (.xsl, .xml, etc.)
Template metadata (version, author, modification date)
Template assets (images, fonts, logos)

Backup Method:

Primary: Object storage versioning (automatic)
Secondary: Cross-region replication (async)
Tertiary: Daily snapshot to separate backup storage

Retention:

Active versions: 30 versions per template
Deleted templates: 90 days in backup
Compliance: 7 years for templates used in regulatory documents

Recovery Procedures:

Single template recovery: Restore from object storage version history (<5 min)
Tenant template recovery: Restore all templates for tenant from daily backup (<30 min)
Complete template repository recovery: Restore from DR site backup (<2 hours)

6.2. Generated Document Archival

Document Archival Policy:

Document Type	Retention	Storage Tier	Purpose
Customer documents (active)	30 days	Hot (fast access)	Immediate customer retrieval
Customer documents (archive)	90 days	Warm (slower access)	Occasional retrieval
Compliance documents	7 years	Cold (archival)	Regulatory retention
Test documents	7 days	Hot	Troubleshooting, verification

Archival Process:

Automated: Lifecycle policy moves documents between tiers
Customer-initiated: Customer can request extended retention
Deletion: Documents purged after retention period (logged for audit)

Retrieval:

Hot storage: <1 second retrieval
Warm storage: <10 seconds retrieval
Cold storage: <1 hour retrieval (may require support ticket)
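
The lifecycle policy for customer documents can be expressed as a function of document age. This sketch covers only the two customer-document rows of the table and assumes that both retention periods are counted from the generation date, which the policy does not state explicitly. The function name and the "purge" label are hypothetical.

```python
def customer_document_tier(age_days: int) -> str:
    """Return the storage tier for a customer document of the given age."""
    if age_days < 0:
        raise ValueError("age_days must not be negative")
    if age_days <= 30:
        return "hot"    # active: immediate customer retrieval
    if age_days <= 90:
        return "warm"   # archive: occasional retrieval
    return "purge"      # past retention, unless extended retention was requested


print(customer_document_tier(10))   # hot
print(customer_document_tier(45))   # warm
print(customer_document_tier(120))  # purge
```

If the 90-day archive period is instead meant to start after the 30 hot days, the second limit becomes 120 days.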

7. Performance Optimization

7.1. Document Generation Performance

Optimization Techniques:

Template Caching:

Compiled templates cached in memory
Cache hit ratio target: >90%
Cache size: 2 GB per Exstream pod
Cache invalidation: On template update

Batch Processing Optimization:

Group similar documents (same template) in single batch
Parallel processing: Up to 10 concurrent documents per pod
Resource allocation: Dedicated worker pods for large batches

Data Optimization:

Minimize database queries during generation (prefetch data)
Use prepared statements for repeated queries
Implement connection pooling (PgBouncer)

Asset Optimization:

Compress images before template inclusion
Use vector graphics (SVG) where possible
Cache fonts and static assets in pod local storage

Performance Benchmarks:

Document Type	Target Generation Time	Actual (Median)	Notes
Simple (text only)	<0.5 seconds	0.3 seconds	Invoices, statements
Standard (with images)	<1 second	0.75 seconds	Marketing materials
Complex (multi-page, charts)	<3 seconds	2.1 seconds	Reports, contracts
Large batch (1000 docs)	<10 minutes	7 minutes	Parallel processing

7.2. Scaling Strategies

Horizontal Scaling Triggers:

Metric	Scale Up Threshold	Scale Down Threshold	Action
CPU Utilization	>70% for 5 min	<40% for 15 min	Add/remove pod
Queue Depth	>500 jobs	<100 jobs	Add/remove pod
API Response Time	p95 >2 seconds	p95 <0.5 seconds	Add/remove pod

Auto-Scaling Configuration:

# Exstream API HPA
minReplicas: 3
maxReplicas: 20
targetCPUUtilization: 70
targetMemoryUtilization: 80
scaleUpStabilization: 60s  # Wait 60s before scaling up
scaleDownStabilization: 300s  # Wait 5 min before scaling down
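
For capacity planning it helps to see how a utilization reading translates into a replica count. The sketch below follows the replica formula described in the Kubernetes Horizontal Pod Autoscaler documentation (desired = ceil(current replicas × current metric / target metric), with a default tolerance of 0.1), clamped to the limits above. It is a simplified model that ignores stabilization windows and pod readiness, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=3, max_replicas=20, tolerance=0.1):
    """Estimate the replica count the autoscaler would aim for."""
    ratio = current_utilization / target_utilization
    # Within the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(4, 90, 70))   # 6
print(desired_replicas(3, 35, 70))   # 3 (clamped to minReplicas)
print(desired_replicas(10, 72, 70))  # 10 (within tolerance)
```

When both CPU and memory targets are configured, the autoscaler evaluates each metric separately and uses the larger result.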

Vertical Scaling (when horizontal insufficient):

Increase pod resource limits (CPU, memory)
Requires Change Request and maintenance window
Typical resize: 4 CPU / 16 GB → 8 CPU / 32 GB per pod

8. Template Management

8.1. Template Lifecycle

Create → Test (Staging) → Review → Approve → Deploy (Production) →
Monitor → Update → Deprecate → Archive

Lifecycle Stages:

Stage	Description	Access	Duration
Draft	Initial creation, frequent changes	Designer only	Unlimited
Testing	Deployed to staging, under testing	Designer + Customer Success	1-2 weeks
Review	Under approval process	Customer Success + Customer	1 week
Active	Live in production, generating documents	All authorized users	Indefinite
Deprecated	Marked for replacement, still functional	All authorized users	90 days
Archived	Removed from production, retained for compliance	Read-only	7 years

8.2. Template Version Control

Versioning Strategy:

Major version (x.0.0): Breaking changes, incompatible with previous version
Minor version (1.x.0): New features, backward compatible
Patch version (1.0.x): Bug fixes, no functional changes
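
The versioning strategy maps directly to a small helper. This is a minimal sketch that assumes versions are always written with three numeric components. The function name `bump_version` is hypothetical.

```python
def bump_version(version: str, change: str) -> str:
    """Return the next template version for a 'major', 'minor', or 'patch' change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"   # breaking change resets minor and patch
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")


print(bump_version("1.4.2", "patch"))  # 1.4.3
print(bump_version("1.4.2", "minor"))  # 1.5.0
print(bump_version("1.4.2", "major"))  # 2.0.0
```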

Version Control Integration:

Templates stored in Git repository
Each version tagged with semantic version
Change log maintained in commit messages

Version Rollback:

Can revert to any previous version within 30-version history
Rollback requires Change Request (unless emergency)
Customer is notified of a rollback if it affects active documents

8.3. Template Best Practices

Performance Best Practices:

Keep templates simple (avoid excessive nested loops)
Optimize XSL/XSLT transformations
Minimize external data calls within template
Use cached assets where possible
Avoid large images (compress to <500 KB)

Maintainability Best Practices:

Use clear naming conventions
Document complex logic with comments
Use reusable template fragments (avoid duplication)
Test with edge cases (null data, special characters, large datasets)

Security Best Practices:

Sanitize input data (prevent injection, such as XSS in HTML output)
Validate data types and ranges
Don't embed sensitive data in templates (fetch at generation time)
Review templates for information disclosure risks

8.4. Template Troubleshooting

Common Template Issues:

Issue	Symptoms	Resolution
Compilation Error	Template fails to load, syntax error in logs	Review template XSL syntax, validate against schema, check for missing closing tags
Generation Timeout	Document generation exceeds timeout (default 60s)	Optimize template (simplify logic, reduce loops), increase timeout (with approval), investigate data volume
Missing Data	Generated document has blank fields	Check data source, verify data mapping in template, review conditional logic
Formatting Issues	Incorrect layout, overlapping elements	Review template positioning (absolute vs. relative), test with different data lengths
Memory Exhaustion	Pod OOMKilled during generation	Optimize template memory usage, increase pod memory limits, batch large documents differently
Asset Not Found	Images or fonts missing in output	Verify asset path in template, check object storage for asset file, verify asset permissions

9. Runbooks - Common Operational Procedures

9.1. Restart Exstream Service

When to Use:

Exstream pods unresponsive
Memory leak suspected
After configuration change requiring restart

Procedure:

Verify backup completed recently (check backup status)
Announce maintenance to customers if during business hours
Drain in-flight requests: kubectl scale deployment exstream-api -n saas-platform --replicas=0
Wait 60 seconds for graceful shutdown
Scale up: kubectl scale deployment exstream-api -n saas-platform --replicas=3
Monitor pod startup: kubectl get pods -n saas-platform -w -l app=exstream
Verify health checks: curl https://api.swedwise.com/health
Test document generation with sample request
Monitor error rates for 15 minutes post-restart
Announce service restored (if maintenance was announced)

Expected Downtime: 2-3 minutes

9.2. Clear Template Cache

When to Use:

Template rendering inconsistently
Recent template update not taking effect
Suspected template cache corruption

Procedure:

Identify affected tenant(s)
Connect to Exstream pod: kubectl exec -it -n saas-platform <pod-name> -- /bin/bash
Clear cache: rm -rf /opt/exstream/cache/* (or vendor-specific command)
Exit pod
Verify cache cleared: Generate test document, check logs for cache miss
Monitor performance (cache rebuild may temporarily slow generation)

Expected Impact: Temporary performance degradation (<30 min)

9.3. Restore Deleted Template

When to Use:

Customer accidentally deleted template
Template corruption requiring restoration
Regulatory requirement to restore historical template

Procedure:

Identify template to restore (tenant ID, template ID, version)

Check object storage version history:

aws s3api list-object-versions \
  --bucket tenants \
  --prefix tenant-abc-123/templates/invoice-template.xsl

Identify correct version by timestamp and version ID

Restore object:

aws s3api copy-object \
  --copy-source "tenants/tenant-abc-123/templates/invoice-template.xsl?versionId=<version-id>" \
  --bucket tenants \
  --key tenant-abc-123/templates/invoice-template.xsl

Verify template restored in Exstream Designer
Test template: Generate sample document
Notify customer of restoration
Document restoration in operations log

Expected Time: 15-30 minutes
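
Picking the correct version by timestamp can be scripted against the JSON that list-object-versions prints. The sketch below is an illustration, not part of the runbook tooling. It assumes the standard output shape (a "Versions" list with "Key", "VersionId", and "LastModified" fields) and uses hypothetical version IDs and timestamps. Delete markers, which the CLI lists separately, are not considered.

```python
import json
from datetime import datetime


def parse_ts(value):
    """Parse an ISO 8601 timestamp; a trailing 'Z' is treated as UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def version_at_or_before(listing_json, key, cutoff):
    """Return the newest VersionId of `key` modified at or before `cutoff`, or None."""
    limit = parse_ts(cutoff)
    candidates = [
        v for v in json.loads(listing_json).get("Versions", [])
        if v["Key"] == key and parse_ts(v["LastModified"]) <= limit
    ]
    if not candidates:
        return None
    newest = max(candidates, key=lambda v: parse_ts(v["LastModified"]))
    return newest["VersionId"]


# Hypothetical listing in the shape returned by list-object-versions.
TEMPLATE_KEY = "tenant-abc-123/templates/invoice-template.xsl"
listing = json.dumps({"Versions": [
    {"Key": TEMPLATE_KEY, "VersionId": "example-v2", "IsLatest": True,
     "LastModified": "2025-01-14T09:30:00+00:00"},
    {"Key": TEMPLATE_KEY, "VersionId": "example-v1", "IsLatest": False,
     "LastModified": "2025-01-10T08:00:00+00:00"},
]})
print(version_at_or_before(listing, TEMPLATE_KEY, "2025-01-12T00:00:00Z"))  # example-v1
```

The returned ID is the value to substitute for `<version-id>` in the copy-object command.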

9.4. Investigate High Queue Depth

When to Use:

Queue depth alert triggered (>1000 queued jobs, the Critical threshold in Section 3.1)
Customer reports delayed document generation
Dashboard shows increasing queue trend

Procedure:

Identify the affected queue (job, batch, or priority queue; see Section 2.1)

Check queue metrics:

# RabbitMQ example
rabbitmqctl list_queues name messages consumers

Investigate root cause:
- Low processing rate: Check worker pod count, CPU/memory utilization
- High ingress rate: Check API call volume, identify source
- Worker failures: Check pod logs for errors, restart failed pods
Take corrective action:
- Scale up workers: Increase pod replicas temporarily
```
kubectl scale deployment exstream-api -n saas-platform --replicas=10
```
- Throttle ingress: Apply rate limiting to offending tenant (if abuse)
- Fix worker issues: Restart failed pods, increase resources
Monitor queue drain rate: Calculate estimated time to clear
Communicate to customers if significant delay expected
Post-incident: Analyze root cause, adjust auto-scaling thresholds if needed

Expected Resolution Time: 15 minutes (diagnosis) + variable (queue drain time)
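
The "estimated time to clear" in the procedure above is simple arithmetic: queue depth divided by the net drain rate. A minimal sketch follows, assuming the processing and arrival rates are roughly constant and read from the dashboard. The function name and the sample numbers are hypothetical.

```python
def estimated_drain_minutes(queue_depth, processed_per_min, arriving_per_min):
    """Return minutes until the queue is empty, or None if it is not draining."""
    net_rate = processed_per_min - arriving_per_min
    if net_rate <= 0:
        return None  # arrivals match or exceed processing; scale up or throttle first
    return queue_depth / net_rate


# 1200 queued jobs, 150 processed and 90 arriving per minute.
print(estimated_drain_minutes(1200, 150, 90))  # 20.0
print(estimated_drain_minutes(1200, 80, 90))   # None
```

A result of None means corrective action (scaling up workers or throttling ingress) is needed before any estimate is meaningful.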

10. Continuous Improvement

10.1. Component-Specific Metrics

Communications Component:

Document generation success rate: Target >99%
Average generation time: Trending down
Template error rate: Target <1%
Queue processing efficiency: Target >95%

10.2. Improvement Initiatives

Current Improvement Areas:

Reduce document generation time for complex templates
Optimize batch processing for large jobs
Enhance monitoring and alerting granularity
Automate common runbook procedures

10.3. Lessons Learned

Post-Incident Reviews:

Conducted after every High or Critical incident
Focus on component-specific root causes
Identify preventive measures
Update runbooks with new procedures

Monthly Retrospectives:

Operations team reviews component performance
Discusses challenges and improvement opportunities
Prioritizes automation and optimization tasks

11. Related Documents

Main Operations Guide:

SW-SAAS-OPS-001: Swedwise SaaS Platform - Operations Guide (parent document)

Component Documentation:

SW-SAAS-COMP-001: Swedwise Communications - Component Description

Architecture and Design:

SW-SAAS-ARCH-001: Swedwise Communications Technical Architecture

IMS Procedures:

SW-ISMS-PRO-001: Incident Management Procedure
SW-IMS-PRO-008: Change Management Procedure

Vendor Documentation:

OpenText Communications (Exstream) Administration Guide

12. Document Control

Version	Date	Author	Changes	Approved By
1.0	[TBD]	SaaS Operations Manager	Initial Communications operations addendum	Operations Manager

Next Review Date: [TBD - 6 months from effective date, then annually]

Document Classification: Internal

Document Owner: SaaS Operations Manager

Parent Document: SW-SAAS-OPS-001 (Swedwise SaaS Platform - Operations Guide)

This operations addendum is supplementary to SW-SAAS-OPS-001. All general SaaS operations procedures from the main guide apply unless specifically overridden in this addendum. Operations staff must read and understand both documents.