SW-SAAS-OPS-COMP-001
Operations Addendum - Communications
Version: 1.0
Owner: SaaS Operations Manager
Effective Date: [TBD]
Review Date: [TBD]
1. Addendum Overview
1.1. Purpose
This operations addendum provides component-specific operational procedures for the Communications (OpenText Exstream) service component within the Swedwise SaaS platform.
This document supplements SW-SAAS-OPS-001 (SaaS Platform Operations Guide). When conflicts exist, the main Operations Guide takes precedence.
1.2. Service Component Overview
| Component | Technology | Purpose |
|---|---|---|
| OpenText Communications (Exstream) | OpenText Exstream v24.x | High-volume document composition and generation |
| Template Designer | OpenText Designer | Document template creation and editing |
| Batch Processing Engine | Kubernetes Jobs + Message Queue | Scheduled and batch document generation |
1.3. Service Boundaries
In Scope:
- Document generation via API and batch processing
- Template management and version control
- Customer-specific configurations and branding
- Multi-tenant isolation and security
Out of Scope:
- Customer internal systems (integration endpoints)
- Integration broker operations (covered in parent document)
1.4. Key Operational Characteristics
- Multi-Tenant: Shared infrastructure, isolated data per tenant
- High-Volume: Designed for batch processing (thousands of documents per job)
- Asynchronous: Most operations queued and processed asynchronously
- Template-Dependent: Document generation quality depends on template design
2. Component Architecture
2.1. Communications (Exstream) Architecture
┌────────────────────────────────────────────────────────────┐
│                OPENTEXT COMMUNICATIONS PODS                │
│                                                            │
│  ┌──────────────────┐  ┌──────────────────┐                │
│  │ Exstream API     │  │ Exstream         │                │
│  │ (3+ replicas)    │  │ Designer         │                │
│  │                  │  │ (2 replicas)     │                │
│  │ - REST API       │  │                  │                │
│  │ - Job Queue      │  │ - Template Edit  │                │
│  │ - Rendering      │  │ - Preview        │                │
│  └────────┬─────────┘  └────────┬─────────┘                │
│           │                     │                          │
└───────────┼─────────────────────┼──────────────────────────┘
            │                     │
            ▼                     ▼
┌────────────────────────────────────────────────────────────┐
│                         DATA LAYER                         │
│                                                            │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────────┐    │
│  │ PostgreSQL   │  │ Object       │  │ Message Queue  │    │
│  │ (Tenant DBs) │  │ Storage      │  │ (RabbitMQ)     │    │
│  │              │  │              │  │                │    │
│  │ - Templates  │  │ - Assets     │  │ - Job Queue    │    │
│  │ - Job Status │  │ - Output     │  │ - Batch Queue  │    │
│  │ - Metadata   │  │ - Archives   │  │ - Priority Q   │    │
│  └──────────────┘  └──────────────┘  └────────────────┘    │
└────────────────────────────────────────────────────────────┘
2.2. Component Dependencies
| Component | Depends On | Impact if Unavailable |
|---|---|---|
| Exstream API | PostgreSQL, Object Storage, Message Queue | Cannot accept new jobs; existing jobs may fail |
| Exstream Designer | PostgreSQL, Object Storage | Template editing unavailable |
3. Component Monitoring
3.1. Key Metrics
| Metric | Normal Range | Warning | Critical | Alert Target |
|---|---|---|---|---|
| API Response Time (p95) | <1 second | >2 seconds | >5 seconds | SOC + On-Call |
| Document Generation Success Rate | >99% | <99% | <95% | SOC + On-Call |
| Queue Depth | <100 jobs | >500 jobs | >1000 jobs | SOC |
| Template Compilation Time | <30 seconds | >60 seconds | >120 seconds | Operations Manager |
| Pod CPU Utilization | <70% | >80% | >90% | SOC |
| Pod Memory Utilization | <75% | >85% | >95% | SOC + On-Call |
| Storage Usage (templates + assets) | - | >80% quota | >90% quota | SOC |
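The thresholds in the table above can be expressed as a small lookup that maps a reading to a severity level. The following is a minimal sketch and not part of the platform's alerting configuration: the metric key names and the `classify` helper are illustrative assumptions, and only the numeric thresholds come from the table. Readings that fall between the normal range and the warning threshold are reported as `ok` here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float
    higher_is_worse: bool = True  # False for metrics such as success rate


# Numbers copied from the Key Metrics table; key names are hypothetical.
THRESHOLDS = {
    "api_response_time_p95_seconds": Threshold(2.0, 5.0),
    "generation_success_rate_percent": Threshold(99.0, 95.0, higher_is_worse=False),
    "queue_depth_jobs": Threshold(500, 1000),
    "template_compilation_seconds": Threshold(60, 120),
    "pod_cpu_percent": Threshold(80, 90),
    "pod_memory_percent": Threshold(85, 95),
    "storage_quota_percent": Threshold(80, 90),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a single metric reading."""
    t = THRESHOLDS[metric]
    if t.higher_is_worse:
        if value > t.critical:
            return "critical"
        if value > t.warning:
            return "warning"
    else:
        if value < t.critical:
            return "critical"
        if value < t.warning:
            return "warning"
    return "ok"
```

For example, `classify("api_response_time_p95_seconds", 3.0)` falls between the warning and critical thresholds, and `classify("generation_success_rate_percent", 94.0)` is below the critical floor.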
3.2. Health Checks
Exstream API Health Check (/health):
{
"status": "healthy",
"timestamp": "2025-01-15T10:30:45Z",
"version": "24.4.0",
"checks": {
"database": "healthy",
"object_storage": "healthy",
"message_queue": "healthy",
"template_cache": "healthy",
"license": "valid"
},
"metrics": {
"active_jobs": 45,
"queued_jobs": 12,
"avg_generation_time_seconds": 0.75
}
}
Exstream Designer Health Check (/health):
{
"status": "healthy",
"timestamp": "2025-01-15T10:30:45Z",
"checks": {
"database": "healthy",
"object_storage": "healthy",
"template_compilation": "healthy"
}
}
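A monitoring script could reduce either health payload to the list of failing checks. The sketch below assumes the response shape shown in the examples above and treats `healthy` and `valid` as the only passing values; the function name and the `degraded` value used in the usage note are hypothetical.

```python
import json
from typing import List

# Values treated as passing, taken from the example payloads above.
PASSING_VALUES = {"healthy", "valid"}


def unhealthy_checks(payload: str) -> List[str]:
    """Return the sorted names of checks that are not in a passing state."""
    body = json.loads(payload)
    failed = [
        name
        for name, state in body.get("checks", {}).items()
        if state not in PASSING_VALUES
    ]
    # Flag the overall status if it is bad but no individual check explains it.
    if body.get("status") != "healthy" and not failed:
        failed.append("status")
    return sorted(failed)
```

Given a payload in which `message_queue` reports `degraded`, the function returns `["message_queue"]`; for the example payloads above it returns an empty list.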
3.3. Prometheus Metrics
Custom Metrics Exported:
- exstream_documents_generated_total (counter, by tenant)
- exstream_document_generation_duration_seconds (histogram)
- exstream_job_queue_depth (gauge)
- exstream_template_cache_hit_ratio (gauge)
- exstream_errors_total (counter, by error_type)
Grafana Dashboard Panels:
- Documents generated per hour (trend)
- Generation success rate (%)
- Queue depth over time
- P50, P95, P99 generation times
- Error breakdown (by type and tenant)
- Resource utilization (CPU, memory, storage)
4. Incident Handling - Component-Specific
4.1. Document Generation Failures
Symptoms:
- Documents failing to generate (error response from API)
- Increased error rate in logs
- Customer reports missing documents
Triage Steps:
- Check Exstream pod status:
  kubectl get pods -n saas-platform -l app=exstream
- Check pod logs for errors:
  kubectl logs -n saas-platform <pod-name> --tail=100
- Verify database connectivity (check health endpoint)
- Check object storage access (templates, assets)
- Review recent template changes (possible template corruption)
Common Causes and Solutions:
| Cause | Symptoms | Solution |
|---|---|---|
| Template Error | Specific template fails consistently | Review template logs, validate template syntax, roll back to previous version |
| Missing Asset | Generation fails at specific point | Check object storage for missing images/fonts, restore from backup if needed |
| Database Connection Pool Exhausted | Intermittent failures, "too many connections" error | Restart pods (Kubernetes will recreate with fresh connections), investigate connection leak |
| Out of Memory | Pod restarts frequently, OOMKilled in logs | Increase pod memory limits, investigate memory leak in template processing |
| Corrupted Template Cache | Inconsistent results, works after cache clear | Clear template cache, restart affected pods |
Escalation:
- If template issue: Escalate to Customer Success + Template Designer
- If OpenText platform issue: Escalate to L3 + OpenText Support
- If infrastructure issue: Follow standard infrastructure incident procedures
4.2. Template Designer Issues
Symptoms:
- Cannot open template editor
- Template save failures
- Preview not working
Triage Steps:
- Check Designer pod status
- Check Designer logs for errors
- Verify user authentication and permissions
- Check database connectivity (template metadata)
- Check object storage access (template files)
Common Causes and Solutions:
| Cause | Symptoms | Solution |
|---|---|---|
| Pod Crash | 500 errors, cannot load designer | Check pod logs for crash reason, restart pod, escalate if persistent |
| License Issue | Designer won't start, license error in logs | Verify OpenText license valid and not expired, contact OpenText to renew |
| Storage Full | Cannot save templates | Check tenant storage quota, increase quota or clean up old templates |
| Template Lock | Cannot edit template, "locked by another user" | Check database for stale locks, clear lock if user session ended abnormally |
4.3. Batch Processing Failures
Symptoms:
- Scheduled batch jobs not running
- Batch job taking much longer than expected
- Partial batch completion (some documents generated, others failed)
Triage Steps:
- Check Kubernetes Job status:
  kubectl get jobs -n saas-platform
- Check Job pod logs:
  kubectl logs -n saas-platform job/<job-name>
- Verify job trigger (cron schedule, API call)
- Check message queue depth and processing rate
- Review job input data validity
Common Causes and Solutions:
| Cause | Symptoms | Solution |
|---|---|---|
| Job Trigger Failure | Job didn't start at scheduled time | Check CronJob configuration, verify scheduler running, check job history |
| Input Data Error | Job fails immediately or after processing few items | Validate input data format, check for null values or missing fields, review error logs |
| Resource Limits | Job killed (OOMKilled), job stuck at same progress | Increase job resource limits (CPU, memory), investigate data volume or template complexity |
| Queue Congestion | Job runs slowly, queue depth increasing | Scale up worker pods, increase queue concurrency, optimize slow templates |
| Partial Failure | Some documents succeed, others fail | Review error logs for failed items, may need to retry failed subset separately |
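For the Partial Failure case, retrying only the failed subset starts with separating failed items from successful ones. The sketch below assumes each batch result is available as a `(document_id, succeeded)` pair; that shape and the `summarize_batch` name are illustrative and do not describe an Exstream API.

```python
from typing import Iterable, List, Tuple


def summarize_batch(results: Iterable[Tuple[str, bool]]) -> Tuple[List[str], float]:
    """Return (failed document IDs, failure ratio) for one batch run."""
    items = list(results)
    failed = [doc_id for doc_id, succeeded in items if not succeeded]
    ratio = len(failed) / len(items) if items else 0.0
    return failed, ratio
```

The failed IDs can then be resubmitted as a separate, smaller batch, and the ratio compared against the success-rate thresholds in section 3.1.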
5. Upgrade Procedures
5.1. OpenText Communications Upgrade
Frequency: Quarterly (aligned with OpenText release cycle)
Upgrade Types:
- Patch Release (24.4.1 → 24.4.2): Bug fixes, security patches
- Minor Release (24.4 → 24.5): New features, enhancements
- Major Release (24.x → 25.x): Breaking changes, major features
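The upgrade type follows from comparing the current and target version numbers. A minimal sketch, assuming versions are dot-separated integers with a missing component treated as 0 (so `24.4` is read as `24.4.0`); the helper names are illustrative:

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse 'major.minor[.patch]' into a 3-tuple, padding with zeros."""
    parts = [int(p) for p in version.split(".")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unexpected version format: {version!r}")
    while len(parts) < 3:
        parts.append(0)
    return parts[0], parts[1], parts[2]


def upgrade_type(current: str, target: str) -> str:
    """Classify an upgrade as 'patch', 'minor', or 'major'."""
    cur, tgt = parse_version(current), parse_version(target)
    if tgt <= cur:
        raise ValueError("target version must be newer than current version")
    if tgt[0] != cur[0]:
        return "major"
    if tgt[1] != cur[1]:
        return "minor"
    return "patch"
```

The result can feed the planning phase, for example to select the customer notification lead time.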
Upgrade Process:
Phase 1: Planning and Preparation (2 weeks before)
- Review OpenText release notes and known issues
- Assess compatibility with current templates and integrations
- Identify breaking changes requiring customer notification
- Schedule maintenance window (typically Monthly Extended Window)
- Notify customers (10 business days in advance for major releases, 5 days for minor releases)
Phase 2: Testing (1 week before)
- Deploy new version to staging environment
- Run automated test suite (document generation, API tests)
- Test representative templates from each tenant
- Performance benchmarking (compare to current version)
- Verify rollback procedure
Phase 3: Backup (Day of upgrade, T-1 hour)
- Take full database backup
- Snapshot template repository (object storage)
- Backup Kubernetes configurations
- Verify backup integrity
Phase 4: Implementation (During maintenance window)
- Announce maintenance start (customer notification)
- Drain existing requests (allow in-flight jobs to complete)
- Scale down current Exstream pods to 0
- Deploy new version pods
- Run database migrations (if applicable)
- Scale up new version pods
- Verify health checks passing
- Run smoke tests (generate test documents)
Phase 5: Verification (T+15 min)
- Monitor error rates and performance metrics
- Test each tenant with sample document generation
- Verify template designer functionality
- Check integration endpoints operational
- Review logs for unexpected errors
Phase 6: Monitoring and Rollback Decision (T+30 min)
- If issues detected: Execute rollback procedure
- If stable: Announce maintenance completion
- Continue enhanced monitoring for 24 hours
- Post-Implementation Review within 7 days
Rollback Procedure:
- Scale down new version pods to 0
- Restore database from pre-upgrade backup (if schema changed)
- Scale up previous version pods
- Verify service restoration
- Customer notification of rollback
- Investigate issues before re-attempting upgrade
5.2. Template Migration Between Environments
Purpose: Deploy new or updated templates from staging to production
Template Migration Workflow:
Template Designer (Staging) → Review & Approval → Version Control (Git) →
Deploy to Production → Verification → Customer Notification
Steps:
- Template Development (in staging environment)
  - Customer Success or customer creates/modifies template
  - Test with sample data
  - Preview output documents
- Review and Approval
  - Customer Success reviews template output
  - Customer approves template (if customer-created)
  - Operations verifies template best practices
- Version Control
  - Export template from staging
  - Commit to Git repository with version tag
  - Document changes in release notes
- Production Deployment (during change window)
  - Import template to production environment
  - Verify template compiles successfully
  - Test generation with production-like data
  - Activate template (make available to customer)
- Verification
  - Generate test document in production
  - Compare output to staging
  - Monitor error rates for template-related issues
- Customer Notification (if significant change)
  - Notify customer template is live
  - Provide change summary if breaking changes
Template Versioning:
- Semantic versioning: <major>.<minor>.<patch>
- Previous versions retained for rollback (last 10 versions)
- Template changes logged in audit trail
6. Backup Specifics
6.1. Template Backup
What is Backed Up:
- Template definition files (.xsl, .xml, etc.)
- Template metadata (version, author, modification date)
- Template assets (images, fonts, logos)
Backup Method:
- Primary: Object storage versioning (automatic)
- Secondary: Cross-region replication (async)
- Tertiary: Daily snapshot to separate backup storage
Retention:
- Active versions: 30 versions per template
- Deleted templates: 90 days in backup
- Compliance: 7 years for templates used in regulatory documents
Recovery Procedures:
- Single template recovery: Restore from object storage version history (<5 min)
- Tenant template recovery: Restore all templates for tenant from daily backup (<30 min)
- Complete template repository recovery: Restore from DR site backup (<2 hours)
6.2. Generated Document Archival
Document Archival Policy:
| Document Type | Retention | Storage Tier | Purpose |
|---|---|---|---|
| Customer documents (active) | 30 days | Hot (fast access) | Immediate customer retrieval |
| Customer documents (archive) | 90 days | Warm (slower access) | Occasional retrieval |
| Compliance documents | 7 years | Cold (archival) | Regulatory retention |
| Test documents | 7 days | Hot | Troubleshooting, verification |
Archival Process:
- Automated: Lifecycle policy moves documents between tiers
- Customer-initiated: Customer can request extended retention
- Deletion: Documents purged after retention period (logged for audit)
Retrieval:
- Hot storage: <1 second retrieval
- Warm storage: <10 seconds retrieval
- Cold storage: <1 hour retrieval (may require support ticket)
7. Performance Optimization
7.1. Document Generation Performance
Optimization Techniques:
Template Caching:
- Compiled templates cached in memory
- Cache hit ratio target: >90%
- Cache size: 2 GB per Exstream pod
- Cache invalidation: On template update
Batch Processing Optimization:
- Group similar documents (same template) in single batch
- Parallel processing: Up to 10 concurrent documents per pod
- Resource allocation: Dedicated worker pods for large batches
Data Optimization:
- Minimize database queries during generation (prefetch data)
- Use prepared statements for repeated queries
- Implement connection pooling (PgBouncer)
Asset Optimization:
- Compress images before template inclusion
- Use vector graphics (SVG) where possible
- Cache fonts and static assets in pod local storage
Performance Benchmarks:
| Document Type | Target Generation Time | Actual (Median) | Notes |
|---|---|---|---|
| Simple (text only) | <0.5 seconds | 0.3 seconds | Invoices, statements |
| Standard (with images) | <1 second | 0.75 seconds | Marketing materials |
| Complex (multi-page, charts) | <3 seconds | 2.1 seconds | Reports, contracts |
| Large batch (1000 docs) | <10 minutes | 7 minutes | Parallel processing |
7.2. Scaling Strategies
Horizontal Scaling Triggers:
| Metric | Scale Up Threshold | Scale Down Threshold | Action |
|---|---|---|---|
| CPU Utilization | >70% for 5 min | <40% for 15 min | Add/remove pod |
| Queue Depth | >500 jobs | <100 jobs | Add/remove pod |
| API Response Time | p95 >2 seconds | p95 <0.5 seconds | Add/remove pod |
Auto-Scaling Configuration:
# Exstream API HPA
minReplicas: 3
maxReplicas: 20
targetCPUUtilization: 70
targetMemoryUtilization: 80
scaleUpStabilization: 60s # Wait 60s before scaling up
scaleDownStabilization: 300s # Wait 5 min before scaling down
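The replica bounds and CPU target above determine how far the autoscaler can move. As a rough illustration, the sketch below applies the formula documented for the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current × currentMetric / targetMetric), which is skipped when the ratio is within a tolerance (0.1 by default). The function name and parameters are illustrative assumptions; the real controller also considers memory targets, pod readiness, and the stabilization windows, all of which this sketch ignores.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_cpu_percent: float,
    target_cpu_percent: float = 70,   # targetCPUUtilization above
    min_replicas: int = 3,            # minReplicas above
    max_replicas: int = 20,           # maxReplicas above
    tolerance: float = 0.1,           # Kubernetes default tolerance
) -> int:
    """Approximate the HPA replica calculation for a single CPU metric."""
    ratio = current_cpu_percent / target_cpu_percent
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

With 3 replicas averaging 90% CPU the sketch asks for 4 replicas; with 15 replicas at 140% it is capped at the maximum of 20.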
Vertical Scaling (when horizontal insufficient):
- Increase pod resource limits (CPU, memory)
- Requires Change Request and maintenance window
- Typical upgrade: 4 CPU / 16 GB → 8 CPU / 32 GB
8. Template Management
8.1. Template Lifecycle
Create → Test (Staging) → Review → Approve → Deploy (Production) →
Monitor → Update → Deprecate → Archive
Lifecycle Stages:
| Stage | Description | Access | Duration |
|---|---|---|---|
| Draft | Initial creation, frequent changes | Designer only | Unlimited |
| Testing | Deployed to staging, under testing | Designer + Customer Success | 1-2 weeks |
| Review | Under approval process | Customer Success + Customer | 1 week |
| Active | Live in production, generating documents | All authorized users | Indefinite |
| Deprecated | Marked for replacement, still functional | All authorized users | 90 days |
| Archived | Removed from production, retained for compliance | Read-only | 7 years |
8.2. Template Version Control
Versioning Strategy:
- Major version (x.0.0): Breaking changes, incompatible with previous version
- Minor version (1.x.0): New features, backward compatible
- Patch version (1.0.x): Bug fixes, no functional changes
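The versioning strategy above amounts to a simple rule for computing the next version tag. A minimal sketch, assuming tags are plain `major.minor.patch` strings without prefixes or suffixes; the `bump` helper is illustrative:

```python
def bump(version: str, change: str) -> str:
    """Return the next semantic version for a 'major', 'minor', or 'patch' change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":      # breaking change: reset minor and patch
        return f"{major + 1}.0.0"
    if change == "minor":      # backward-compatible feature: reset patch
        return f"{major}.{minor + 1}.0"
    if change == "patch":      # bug fix only
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")
```

For example, a bug fix to template version `1.4.2` yields `1.4.3`, while a breaking change yields `2.0.0`.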
Version Control Integration:
- Templates stored in Git repository
- Each version tagged with semantic version
- Change log maintained in commit messages
Version Rollback:
- Can revert to any previous version within 30-version history
- Rollback requires Change Request (unless emergency)
- Customer notified of rollback if it affects active documents
8.3. Template Best Practices
Performance Best Practices:
- Keep templates simple (avoid excessive nested loops)
- Optimize XSL/XSLT transformations
- Minimize external data calls within template
- Use cached assets where possible
- Avoid large images (compress to <500 KB)
Maintainability Best Practices:
- Use clear naming conventions
- Document complex logic with comments
- Use reusable template fragments (avoid duplication)
- Test with edge cases (null data, special characters, large datasets)
Security Best Practices:
- Sanitize input data (prevent XSS in generated documents)
- Validate data types and ranges
- Don't embed sensitive data in templates (fetch at generation time)
- Review templates for information disclosure risks
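As an illustration of input sanitization, the sketch below escapes markup-significant characters in data fields before they reach a template that produces HTML or XML output. It uses Python's standard `html.escape` and is an assumption about where such a step could sit in an integration; it does not describe Exstream's own escaping behavior, and the helper names are hypothetical.

```python
import html
from typing import Any, Dict


def escape_field(value: Any) -> str:
    """Escape &, <, >, and quote characters in a single data field."""
    if value is None:
        return ""
    return html.escape(str(value), quote=True)


def escape_record(record: Dict[str, Any]) -> Dict[str, str]:
    """Escape every field of an input record, leaving keys unchanged."""
    return {key: escape_field(val) for key, val in record.items()}
```

Escaping at the data boundary keeps templates free of per-field sanitization logic, although output formats other than HTML or XML need their own encoding rules.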
8.4. Template Troubleshooting
Common Template Issues:
| Issue | Symptoms | Resolution |
|---|---|---|
| Compilation Error | Template fails to load, syntax error in logs | Review template XSL syntax, validate against schema, check for missing closing tags |
| Generation Timeout | Document generation exceeds timeout (default 60s) | Optimize template (simplify logic, reduce loops), increase timeout (with approval), investigate data volume |
| Missing Data | Generated document has blank fields | Check data source, verify data mapping in template, review conditional logic |
| Formatting Issues | Incorrect layout, overlapping elements | Review template positioning (absolute vs. relative), test with different data lengths |
| Memory Exhaustion | Pod OOMKilled during generation | Optimize template memory usage, increase pod memory limits, batch large documents differently |
| Asset Not Found | Images or fonts missing in output | Verify asset path in template, check object storage for asset file, verify asset permissions |
9. Runbooks - Common Operational Procedures
9.1. Restart Exstream Service
When to Use:
- Exstream pods unresponsive
- Memory leak suspected
- After configuration change requiring restart
Procedure:
- Verify backup completed recently (check backup status)
- Announce maintenance to customers if during business hours
- Drain in-flight requests:
  kubectl scale deployment exstream-api -n saas-platform --replicas=0
- Wait 60 seconds for graceful shutdown
- Scale up:
  kubectl scale deployment exstream-api -n saas-platform --replicas=3
- Monitor pod startup:
  kubectl get pods -n saas-platform -w -l app=exstream
- Verify health checks:
  curl https://api.swedwise.com/health
- Test document generation with sample request
- Monitor error rates for 15 minutes post-restart
- Announce service restored (if maintenance was announced)
Expected Downtime: 2-3 minutes
9.2. Clear Template Cache
When to Use:
- Template rendering inconsistently
- Recent template update not taking effect
- Suspected template cache corruption
Procedure:
- Identify affected tenant(s)
- Connect to Exstream pod:
  kubectl exec -it -n saas-platform <pod-name> -- /bin/bash
- Clear cache:
  rm -rf /opt/exstream/cache/* (or vendor-specific command)
- Exit pod
- Verify cache cleared: Generate test document, check logs for cache miss
- Monitor performance (cache rebuild may temporarily slow generation)
Expected Impact: Temporary performance degradation (<30 min)
9.3. Restore Deleted Template
When to Use:
- Customer accidentally deleted template
- Template corruption requiring restoration
- Regulatory requirement to restore historical template
Procedure:
- Identify template to restore (tenant ID, template ID, version)
- Check object storage version history:
  aws s3api list-object-versions \
    --bucket tenants \
    --prefix tenant-abc-123/templates/invoice-template.xsl
- Identify correct version by timestamp and version ID
- Restore object:
  aws s3api copy-object \
    --copy-source "tenants/tenant-abc-123/templates/invoice-template.xsl?versionId=<version-id>" \
    --bucket tenants \
    --key tenant-abc-123/templates/invoice-template.xsl
- Verify template restored in Exstream Designer
- Test template: Generate sample document
- Notify customer of restoration
- Document restoration in operations log
Expected Time: 15-30 minutes
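Identifying the correct version by timestamp can be scripted against the JSON that `aws s3api list-object-versions` prints. The sketch below assumes the standard output shape (a `Versions` array whose entries carry `Key`, `VersionId`, and `LastModified`) and picks the newest version at or before a cutoff; the function name and the sample version IDs are hypothetical.

```python
import json
from datetime import datetime
from typing import Optional


def _parse(timestamp: str) -> datetime:
    # Accept both "+00:00" and "Z" suffixed ISO 8601 timestamps.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def version_before(listing_json: str, key: str, cutoff: str) -> Optional[str]:
    """Return the VersionId of the newest version of `key` at or before `cutoff`.

    `cutoff` must include a UTC offset, e.g. "2025-01-12T00:00:00+00:00".
    Returns None when no version qualifies.
    """
    listing = json.loads(listing_json)
    limit = _parse(cutoff)
    candidates = [
        v
        for v in listing.get("Versions", [])
        # --prefix matches any key starting with the prefix, so filter exactly.
        if v["Key"] == key and _parse(v["LastModified"]) <= limit
    ]
    if not candidates:
        return None
    newest = max(candidates, key=lambda v: _parse(v["LastModified"]))
    return newest["VersionId"]
```

The returned ID is the value to substitute for `<version-id>` in the restore command.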
9.4. Investigate High Queue Depth
When to Use:
- Queue depth alert triggered (document generation queue >1000 jobs)
- Customer reports delayed document generation
- Dashboard shows increasing queue trend
Procedure:
- Identify affected queue: Document generation
- Check queue metrics:
  # RabbitMQ example
  rabbitmqctl list_queues name messages consumers
- Investigate root cause:
  - Low processing rate: Check worker pod count, CPU/memory utilization
  - High ingress rate: Check API call volume, identify source
  - Worker failures: Check pod logs for errors, restart failed pods
- Take corrective action:
  - Scale up workers: Increase pod replicas temporarily
    kubectl scale deployment exstream-api -n saas-platform --replicas=10
  - Throttle ingress: Apply rate limiting to offending tenant (if abuse)
  - Fix worker issues: Restart failed pods, increase resources
- Monitor queue drain rate: Calculate estimated time to clear
- Communicate to customers if significant delay expected
- Post-incident: Analyze root cause, adjust auto-scaling thresholds if needed
Expected Resolution Time: 15 minutes (diagnosis) + variable (queue drain time)
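The estimated time to clear the queue follows from the current depth and the net drain rate (processing rate minus ingress rate). A minimal sketch, assuming both rates are measured in jobs per minute over the same recent window; the function name is illustrative:

```python
from typing import Optional


def estimated_drain_minutes(
    queue_depth: int,
    processed_per_minute: float,
    arriving_per_minute: float,
) -> Optional[float]:
    """Return minutes until the queue empties, or None if it is not draining."""
    net_rate = processed_per_minute - arriving_per_minute
    if net_rate <= 0:
        return None  # queue is stable or growing; scale up or throttle first
    return queue_depth / net_rate
```

For example, 1200 queued jobs with 150 jobs/min processed and 50 jobs/min arriving drain in about 12 minutes; a `None` result means corrective action is still needed before an estimate is meaningful.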
10. Continuous Improvement
10.1. Component-Specific Metrics
Communications Component:
- Document generation success rate: Target >99%
- Average generation time: Trending down
- Template error rate: Target <1%
- Queue processing efficiency: Target >95%
10.2. Improvement Initiatives
Current Improvement Areas:
- Reduce document generation time for complex templates
- Optimize batch processing for large jobs
- Enhance monitoring and alerting granularity
- Automate common runbook procedures
10.3. Lessons Learned
Post-Incident Reviews:
- Conducted after every High or Critical incident
- Focus on component-specific root causes
- Identify preventive measures
- Update runbooks with new procedures
Monthly Retrospectives:
- Operations team reviews component performance
- Discusses challenges and improvement opportunities
- Prioritizes automation and optimization tasks
11. Related Documents
Main Operations Guide:
- SW-SAAS-OPS-001: Swedwise SaaS Platform - Operations Guide (parent document)
Component Documentation:
- SW-SAAS-COMP-001: Swedwise Communications - Component Description
Architecture and Design:
- SW-SAAS-ARCH-001: Swedwise Communications Technical Architecture
IMS Procedures:
- SW-ISMS-PRO-001: Incident Management Procedure
- SW-IMS-PRO-008: Change Management Procedure
Vendor Documentation:
- OpenText Communications (Exstream) Administration Guide
12. Document Control
| Version | Date | Author | Changes | Approved By |
|---|---|---|---|---|
| 1.0 | [TBD] | SaaS Operations Manager | Initial Communications operations addendum | Operations Manager |
Next Review Date: [TBD - 6 months from effective date, then annually]
Document Classification: Internal
Document Owner: SaaS Operations Manager
Parent Document: SW-SAAS-OPS-001 (Swedwise SaaS Platform - Operations Guide)
This operations addendum is supplementary to SW-SAAS-OPS-001. All general SaaS operations procedures from the main guide apply unless specifically overridden in this addendum. Operations staff must read and understand both documents.