OP.PL.4 Sizing and Capacity Management
Reference documents
- ISO/IEC 27000
- ISO/IEC 27002:2013, 12.1.3 - Capacity management
- NIST SP 800-53 rev. 4
- [SA-2] Allocation of Resources
- [AU-4] Audit Storage Capacity
Definitions
- EAL. Evaluation Assurance Level. Levels of assurance in the evaluation.
- TOE. Target of Evaluation. The product or system being evaluated.
Implementation guidance
- It is worth highlighting that this security measure is not merely technical: it has budgetary implications and must therefore be managed well in advance so that the needs are duly reflected in the budgets. If improvisation should be avoided in all security measures, this applies here with even greater reason.
- Note that in flexible environments, such as the use of cloud resources, the effective sizing of the system can be dynamic, adapting to the needs of the service.
Implementation in Legit Health Plus
1. Sizing and Capacity Management Framework
1.1 Capacity Management Strategy
Legit Health Plus implements a proactive capacity management approach that covers:
- Capacity planning based on growth projections
- Continuous monitoring of resources and performance
- Automatic scaling on cloud infrastructure
- Cost optimization through rightsizing
- Management of seasonal or exceptional demand peaks
- Medium- and long-term budget planning
1.2 Multi-Level Capacity Model
Capacity management is structured across multiple levels, as outlined in the sketch below:
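The individual levels are not enumerated in this section. As a working assumption aligned with common capacity management practice (a business / service / component split, as used in ITIL), a minimal sketch of how such levels map onto this document could look as follows; the references to sections 2, 3.2 and 5.1 come from this document, while the three-level split itself is an assumption.

# Assumed multi-level capacity model (illustrative only; the actual levels
# used by Legit Health Plus may differ).
capacity_levels = {
    "business":  "demand drivers: active users, images/day, new regions (section 3.2)",
    "service":   "SLA-facing capacity: API latency, image processing time (section 5.1)",
    "component": "infrastructure sizing: ECS, GPU, storage, network (section 2)",
}

for level, scope in capacity_levels.items():
    print(f"{level}: {scope}")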
2. Component-Level Sizing
2.1 Application Infrastructure
Core Services - Current Sizing:
Component | Current Configuration | Maximum Capacity | Target Utilization |
---|---|---|---|
API Gateway | 4x ECS Tasks (2 vCPU, 4GB) | 1000 req/sec | 70% CPU |
ML Inference | 2x GPU instances (g4dn.xlarge) | 100 concurrent | 80% GPU |
Image Processing | Auto-scaling 2-10 tasks | 500 images/min | 75% CPU |
Auth Service | 2x ECS Tasks (1 vCPU, 2GB) | 200 auth/sec | 60% CPU |
Background Jobs | 3x ECS Tasks (1 vCPU, 2GB) | 1000 jobs/hour | 70% CPU |
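As an illustrative sizing check derived from the table above: the API Gateway figures imply roughly 250 req/sec per task at full load (1,000 req/sec across 4 tasks), and the 70% CPU target caps the sustained load per task at about 175 req/sec. The sketch below applies that arithmetic; the minimum of 2 tasks is taken from the scale-in floor in section 4.1, and the helper itself is hypothetical, not part of the codebase.

import math

# Derived from the table: 4 tasks handle 1,000 req/sec at full load.
MAX_REQ_PER_TASK = 1000 / 4       # 250 req/sec per task
TARGET_UTILIZATION = 0.70         # 70% CPU target from the table

def required_api_tasks(expected_req_per_sec):
    # Sustained capacity per task once the utilization target is respected.
    sustainable = MAX_REQ_PER_TASK * TARGET_UTILIZATION   # 175 req/sec
    return max(2, math.ceil(expected_req_per_sec / sustainable))

print(required_api_tasks(600))    # -> 4 tasks (matches the current 4x setup)
print(required_api_tasks(1200))   # -> 7 tasks (scale-out required)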
2.2 Data Infrastructure
Storage - Planned Capacity:
System | Configuration | Current Capacity | 12-Month Projection | Annual Growth |
---|---|---|---|---|
DocumentDB | 3-node cluster (r6g.large) | 1TB | 5TB | 400% |
S3 Storage | Standard + IA tiers | 50TB | 200TB | 300% |
Redis Cache | 2x r6g.large cluster | 32GB | 128GB | 300% |
CloudWatch Logs | 30-day retention | 500GB/month | 2TB/month | 300% |
Backup Storage | S3 Glacier Deep Archive | 100TB | 500TB | 400% |
2.3 Network and Connectivity
Planned Bandwidth:
Network Capacity Planning:
  Internet Gateway:
    current: 10 Gbps
    projected: 50 Gbps
    bottlenecks: Image upload bursts
  Inter-AZ Traffic:
    current: 5 Gbps
    projected: 20 Gbps
    pattern: DB replication, cross-AZ failover
  VPN Connections:
    current: 2x 1 Gbps
    projected: 4x 10 Gbps
    usage: Healthcare provider integrations
  CDN (CloudFront):
    current: Unlimited
    cost_optimization: Regional caching strategy
3. Demand Models and Projections
3.1 Identified Usage Patterns
Historical Demand Analysis:
Metric | Q1 2024 | Q2 2024 | Q3 2024 | Q4 2024 | Q1 2025 Projection |
---|---|---|---|---|---|
Active Users | 5,000 | 8,000 | 12,000 | 18,000 | 27,000 |
Images/Day | 10,000 | 16,000 | 24,000 | 36,000 | 54,000 |
API Calls/Day | 100K | 160K | 240K | 360K | 540K |
Storage (TB) | 15 | 25 | 38 | 55 | 85 |
Concurrent Users | 200 | 320 | 480 | 720 | 1,080 |
3.2 Growth Factors
Demand Drivers:
- Geographic expansion: +200% users per new region
- New medical specialties: +50% per specialty
- HIS/EHR integrations: +300% API calls per integration
- Algorithm improvements: +25% accuracy → +40% retention
- Screening programs: seasonal peaks of +500%
3.3 Predictive Modeling
Forecasting Models:
# Capacity forecasting model
capacity_model = {
    'base_growth': 0.15,  # 15% monthly growth
    'seasonal_factor': {
        'Q1': 1.2,  # Peak screening season
        'Q2': 0.9,  # Low season
        'Q3': 1.0,  # Normal
        'Q4': 1.1   # Conference season
    },
    'expansion_multipliers': {
        'new_region': 2.0,
        'new_specialty': 1.5,
        'enterprise_client': 3.0
    },
    'confidence_intervals': {
        'p50': 'base_forecast',
        'p80': 'base_forecast * 1.4',
        'p95': 'base_forecast * 1.8'
    }
}
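A minimal sketch of how the model above can be applied; forecast_demand is a hypothetical helper, not an existing function in the codebase. Note that a 15% monthly base growth compounds to about 1.15³ ≈ 1.52 per quarter, consistent with the roughly 1.5x quarterly growth shown in section 3.1.

def forecast_demand(current_value, months_ahead, quarter, model=capacity_model):
    """Project a demand metric using the assumptions encoded in capacity_model."""
    base = current_value * (1 + model['base_growth']) ** months_ahead
    seasonal = base * model['seasonal_factor'][quarter]
    return {'p50': seasonal, 'p80': seasonal * 1.4, 'p95': seasonal * 1.8}

# Example: daily images three months ahead of the Q4 2024 figure (36,000/day).
print(forecast_demand(36_000, 3, 'Q1'))
# p50 ≈ 65,700 (≈ 54,750 from base growth, times the 1.2 Q1 seasonal factor)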
4. Auto-scaling and Elasticity
4.1 Auto-scaling Policies
ECS Auto-scaling Configuration:
api_service_scaling:
  target_tracking:
    cpu_utilization: 70%
    memory_utilization: 75%
  step_scaling:
    scale_out:
      - metric: RequestCount > 500/min
        scaling: +2 tasks
      - metric: RequestCount > 1000/min
        scaling: +4 tasks
    scale_in:
      - metric: RequestCount < 200/min
        scaling: -1 task (min 2)
ml_inference_scaling:
  scheduled_scaling:
    business_hours:
      min_capacity: 4 instances
      max_capacity: 20 instances
    off_hours:
      min_capacity: 2 instances
      max_capacity: 10 instances
  predictive_scaling:
    enable: true
    forecast_period: 7 days
    buffer: 20%
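As a hedged sketch of how the CPU target-tracking policy above could be registered with AWS Application Auto Scaling via boto3; the cluster and service names, capacity bounds and cooldowns below are placeholders, not the production values.

import boto3

autoscaling = boto3.client("application-autoscaling")

# Placeholder resource id; the real cluster and service names differ.
resource_id = "service/legit-health-cluster/api-service"

autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=10,
)

autoscaling.put_scaling_policy(
    PolicyName="api-cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,  # 70% CPU, as in the policy above
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)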
4.2 Database Auto-scaling
DocumentDB Scaling Strategy:
Metric | Threshold | Action |
---|---|---|
CPU > 80% | 5 min sustained | Add read replica |
Connections > 90% | 2 min sustained | Scale up instance class |
Storage > 85% | Alert only | Manual review required |
Network I/O > 80% | 10 min sustained | Consider sharding |
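The first row of the table (CPU > 80% sustained for 5 minutes) could be wired up as a CloudWatch alarm along the lines of the sketch below; the alarm name, cluster identifier and SNS topic ARN are placeholders, and adding the read replica remains a downstream action triggered from that alert.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder identifiers; the actual cluster and SNS topic differ.
cloudwatch.put_metric_alarm(
    AlarmName="docdb-cpu-high",
    AlarmDescription="DocumentDB CPU > 80% for 5 min: evaluate adding a read replica",
    Namespace="AWS/DocDB",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBClusterIdentifier", "Value": "legit-health-docdb"}],
    Statistic="Average",
    Period=60,               # 1-minute samples
    EvaluationPeriods=5,     # sustained for 5 minutes
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:capacity-critical"],
)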
4.3 Automatic Storage Tiering
S3 Lifecycle Policies:
storage_lifecycle:
  medical_images:
    - transition: Standard → IA (30 days)
    - transition: IA → Glacier (90 days)
    - transition: Glacier → Deep Archive (365 days)
    - expiration: Never (regulatory retention)
  application_logs:
    - transition: Standard → IA (7 days)
    - transition: IA → Glacier (30 days)
    - expiration: 2555 days (7 years regulatory)
  backup_data:
    - immediate: Glacier Deep Archive
    - expiration: 2920 days (8 years)
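A minimal boto3 sketch of the medical_images lifecycle rule above; the bucket name is a placeholder, and there is deliberately no Expiration action because of the regulatory retention requirement.

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="legit-health-medical-images",   # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "medical-images-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},   # apply to every object
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
                # No Expiration action: images are retained indefinitely.
            }
        ]
    },
)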
5. Monitoring and Alerting
5.1 Critical Capacity Metrics
Main Dashboard - Capacity KPIs:
Metric | SLA | Warning | Critical | Frequency |
---|---|---|---|---|
API Response Time | < 500ms p95 | > 400ms | > 800ms | 1 min |
Image Processing Time | < 30s p95 | > 25s | > 45s | 1 min |
Database CPU | < 70% avg | > 60% | > 85% | 5 min |
Storage Usage | < 80% | > 70% | > 90% | 15 min |
Concurrent Users | N/A | > 80% capacity | > 95% capacity | 1 min |
Error Rate | < 0.1% | > 0.05% | > 0.2% | 1 min |
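As an illustrative helper (not the production alerting logic), the warning/critical thresholds above can be expressed as a simple severity classification; the threshold values are copied from the table.

# Threshold values taken from the table above; the helper is a sketch only.
THRESHOLDS = {
    "api_response_time_p95_ms": {"warning": 400, "critical": 800},
    "image_processing_p95_s":   {"warning": 25,  "critical": 45},
    "database_cpu_pct":         {"warning": 60,  "critical": 85},
    "storage_usage_pct":        {"warning": 70,  "critical": 90},
    "error_rate_pct":           {"warning": 0.05, "critical": 0.2},
}

def classify(metric, value):
    limits = THRESHOLDS[metric]
    if value > limits["critical"]:
        return "critical"
    if value > limits["warning"]:
        return "warning"
    return "ok"

print(classify("database_cpu_pct", 72))   # -> "warning"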
5.2 Automated Alerting
Alert Routing Matrix:
alerts:
  capacity_warning:
    recipients: [devops, sre-team]
    escalation_time: 30min
    channels: [slack, email]
  capacity_critical:
    recipients: [oncall, cto, devops]
    escalation_time: 5min
    channels: [pagerduty, phone, slack]
    auto_actions: [trigger_scaling, create_incident]
  budget_alerts:
    recipients: [finops, cto]
    thresholds: [50%, 80%, 95%, 100%]
    frequency: daily
5.3 Advanced Observability
Monitoring Stack:
Tool | Function | Key Metrics |
---|---|---|
CloudWatch | AWS metrics, logs | Infrastructure, application metrics |
DataDog | APM, synthetics | End-to-end latency, user experience |
Grafana | Dashboards | Business metrics, capacity trends |
Prometheus | Custom metrics | Application-specific KPIs |
ELK Stack | Log analysis | Error patterns, usage analytics |
6. Cost Optimization
6.1 FinOps - Cloud Financial Management
Current Cost Structure (Monthly):
Category | Cost | % of Total | Identified Optimization |
---|---|---|---|
Compute (ECS/EC2) | $15,000 | 35% | Reserved Instances: -25% |
Storage (S3/EBS) | $8,000 | 18% | Lifecycle policies: -30% |
Database | $12,000 | 28% | Rightsizing: -20% |
Network/CDN | $5,000 | 12% | Caching optimization: -15% |
Monitoring/Logs | $3,000 | 7% | Retention policies: -25% |
Total | $43,000 | 100% | Potential savings: -23% |
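The blended savings figure can be checked directly from the table; the short verification below uses only the numbers shown above.

# Verification of the "Potential savings: -23%" figure from the table above.
costs = {"compute": 15_000, "storage": 8_000, "database": 12_000,
         "network": 5_000, "monitoring": 3_000}
savings_pct = {"compute": 0.25, "storage": 0.30, "database": 0.20,
               "network": 0.15, "monitoring": 0.25}

monthly_savings = sum(costs[k] * savings_pct[k] for k in costs)
print(monthly_savings)                                  # 10,050 USD/month
print(round(monthly_savings / sum(costs.values()), 3))  # 0.234 -> ~23%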
6.2 Optimization Strategies
Cost Optimization Roadmap:
Q1_2025:
  - implement: Reserved Instance strategy
    savings: $3,750/month
  - implement: S3 lifecycle policies
    savings: $2,400/month
Q2_2025:
  - implement: Database rightsizing
    savings: $2,400/month
  - implement: Log retention optimization
    savings: $750/month
Q3_2025:
  - implement: Spot instances for batch workloads
    savings: $1,500/month
  - implement: CDN optimization
    savings: $750/month
Annual_Savings: ~$138,600 (full run rate of all measures: $11,550/month × 12)
ROI_on_optimization: 340%
7. Capacity Planning - Budgets
7.1 Budget Projections
Budget Planning FY2025:
Quarter | Projected Users | Infrastructure Cost | Growth |
---|---|---|---|
Q1 | 30,000 | $52,000/mes | +21% |
Q2 | 40,000 | $68,000/mes | +31% |
Q3 | 55,000 | $89,000/mes | +31% |
Q4 | 75,000 | $118,000/mes | +33% |
Total FY | - | $981,000/year | +180% |
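A quick check of the annual total, treating each figure in the table as the monthly infrastructure cost during its quarter (three months each):

# FY2025 annual total implied by the quarterly monthly run rates above.
monthly_cost = {"Q1": 52_000, "Q2": 68_000, "Q3": 89_000, "Q4": 118_000}
annual_total = sum(cost * 3 for cost in monthly_cost.values())
print(annual_total)   # 981,000 USD for FY2025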
7.2 Contingency Planning
Capacity Scenarios:
Scenario | Probability | Cost Impact | Contingency |
---|---|---|---|
Base growth | 60% | Budget baseline | Planned scaling |
Accelerated growth | 25% | +40% budget | Emergency scaling fund |
Pandemic/mass screening | 10% | +200% capacity | Spot instances + CDN boost |
Recession/Slow growth | 5% | -30% demand | Scale-down automation |
8. Management of Specialized Resources
8.1 GPU Computing for ML
GPU Cluster Configuration:
ml_gpu_capacity:
  training_cluster:
    instances: 4x p3.8xlarge (V100)
    usage_pattern: Batch training jobs
    cost_optimization: Spot instances (70% savings)
  inference_cluster:
    instances: 6x g4dn.xlarge (T4)
    usage_pattern: Real-time inference
    scaling: Auto-scale 2-20 instances
  development:
    instances: 2x g4dn.xlarge
    usage_pattern: Model development
    scheduling: Shared resource pool
8.2 Specialized Storage Requirements
Medical Imaging Storage:
Type | Performance | Capacity | Cost/TB/month | Use Case |
---|---|---|---|---|
EFS (General) | 100 MB/s | 1TB | $300 | Shared model artifacts |
S3 Standard | N/A | 50TB | $23 | Active image processing |
S3 IA | N/A | 150TB | $12.50 | Recent images (30-90 days) |
S3 Glacier | Minutes | 500TB | $4 | Archive (90+ days) |
S3 Deep Archive | Hours | 2PB | $1 | Long-term regulatory |
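A rough spend check combining the capacities and unit costs in the table (2 PB is approximated as 2,000 TB; illustrative only):

# Approximate monthly storage spend implied by the table above.
tiers = {
    "efs":          (1,    300),    # TB, USD/TB/month
    "s3_standard":  (50,   23),
    "s3_ia":        (150,  12.50),
    "s3_glacier":   (500,  4),
    "deep_archive": (2000, 1),      # 2 PB approximated as 2,000 TB
}
total = sum(tb * price for tb, price in tiers.values())
print(total)   # 7,325 USD/month, in line with the ~$8,000 storage line in 6.1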
9. Business Continuity and Disaster Recovery
9.1 RTO/RPO Requirements
Recovery Objectives per Service:
Service | RTO | RPO | DR Strategy |
---|---|---|---|
API Core | 15 min | 1 min | Multi-AZ + auto-failover |
ML Inference | 30 min | 5 min | Multi-region model deployment |
Database | 1 hour | 15 min | Cross-region read replica |
File Storage | 4 hours | 1 hour | Cross-region replication |
Monitoring | 5 min | Real-time | Multi-region deployment |
9.2 Capacity for DR
DR Infrastructure Sizing:
disaster_recovery:
  primary_region: eu-west-1 (100% capacity)
  dr_region: eu-central-1 (warm standby)
  dr_capacity_allocation:
    compute: 50% of primary (scale on demand)
    storage: 100% (continuous replication)
    network: 100% (redundant connections)
  failover_scenarios:
    planned_maintenance: Zero downtime
    availability_zone_failure: < 15min RTO
    region_failure: < 4hour RTO
  cost_impact: +35% infrastructure cost
10. Compliance and Auditing
10.1 Capacity Management Audit Trail
Audit Trail of Capacity Decisions:
audit_requirements:
  capacity_changes:
    approval_required: > 25% capacity change
    documentation: Business justification + technical assessment
    retention: 7 years
  budget_variances:
    threshold: +/- 15% monthly budget
    escalation: CFO + CTO notification
    review_cycle: Monthly
  performance_slas:
    monitoring: Continuous
    reporting: Monthly SLA reports
    compliance: 99%+ SLA adherence required
10.2 Regulatory Compliance for Healthcare
Medical Device Capacity Requirements:
Regulation | Requirement | Implementation |
---|---|---|
FDA QSR | Change control for capacity | Formal approval process |
EU MDR | Performance monitoring | Continuous metrics collection |
HIPAA | Availability requirements | 99.9% uptime SLA |
GDPR | Data processing capacity | Privacy by design scaling |
11. Automation and Tooling
11.1 Infrastructure as Code
Capacity Automation Stack:
automation_tools:
  infrastructure:
    - terraform: Infrastructure provisioning
    - ansible: Configuration management
    - helm: Kubernetes deployments
  monitoring:
    - prometheus: Metrics collection
    - grafana: Visualization
    - alertmanager: Alert routing
  optimization:
    - aws_cost_explorer: Cost analysis
    - rightsizing_recommendations: AWS Compute Optimizer
    - custom_scripts: Cleanup automation
11.2 Self-Healing Infrastructure
Automated Remediation:
# Example auto-remediation handler (illustrative; helper functions are placeholders)
def handle_capacity_alert(alert_type, metrics):
    if alert_type == "high_cpu":
        if can_scale_horizontally():
            trigger_auto_scaling()
        else:
            create_incident("Manual intervention required")
    elif alert_type == "storage_full":
        if is_log_storage():
            cleanup_old_logs()
        else:
            expand_storage_tier()
    elif alert_type == "memory_pressure":
        restart_memory_intensive_services()
        if not improved():
            scale_up_instance_size()
12. Metrics and KPIs
12.1 Operational Excellence KPIs
Metric | Target | Q4 2024 | Trend |
---|---|---|---|
Capacity Utilization | 70-80% | 74% | ✅ |
Cost per Transaction | < $0.02 | $0.018 | ↓ |
Auto-scaling Events | < 50/day | 32/day | ✅ |
Capacity Planning Accuracy | ±10% | ±8% | ✅ |
Time to Scale | < 5 min | 3.2 min | ✅ |
Budget Variance | ±5% | +3% | ✅ |
12.2 Business Impact Metrics
Business KPI | Target | Current | Impact |
---|---|---|---|
Revenue per GB | $0.15 | $0.18 | ↑ 20% |
User Satisfaction | > 4.5/5 | 4.7/5 | ↑ |
Time to Market | < 2 weeks | 10 days | ↑ 30% |
Cost of Downtime | < $1K/hour | $0/hour | ✅ |
13. Evolution Roadmap
13.1 Capacity Management Maturity
Maturity Levels:
- Reactive (current): ✅ Monitoring, alerting, manual scaling
- Proactive (Q1 2025): 🔄 Predictive scaling, cost optimization
- Predictive (Q3 2025): 📋 ML-based forecasting, automated optimization
- Autonomous (2026): 📋 Self-managing infrastructure, AI-driven decisions
13.2 Technology Evolution
Emerging Technologies:
technology_roadmap:
  2025:
    - serverless_ml: AWS Lambda + SageMaker inference
    - spot_fleet: 80% cost savings on batch workloads
    - graviton_processors: 20% better price/performance
  2026:
    - kubernetes_adoption: EKS for better resource utilization
    - service_mesh: Istio for traffic management
    - chaos_engineering: Automated resilience testing
  2027:
    - quantum_ready: Quantum-safe cryptography planning
    - edge_computing: Regional inference deployment
    - ai_ops: Fully automated operations
Annex A: Capacity Sizing Calculator
class CapacityCalculator:
    def __init__(self, growth_rate=0.15, seasonal_factor=1.0):
        self.growth_rate = growth_rate
        self.seasonal_factor = seasonal_factor

    def project_capacity(self, current_usage, months_ahead):
        base_projection = current_usage * (1 + self.growth_rate) ** months_ahead
        return base_projection * self.seasonal_factor

    def calculate_cost(self, capacity_requirements):
        cost_model = {
            'compute': capacity_requirements['cpu_hours'] * 0.05,
            'storage': capacity_requirements['gb_storage'] * 0.023,
            'network': capacity_requirements['gb_transfer'] * 0.09,
            'ml_inference': capacity_requirements['ml_requests'] * 0.001
        }
        return sum(cost_model.values())
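A short usage example for the calculator above; the input figures are hypothetical and only illustrate the units each key expects.

# Hypothetical inputs, for illustration only.
calc = CapacityCalculator(growth_rate=0.15, seasonal_factor=1.2)

# Project daily images 6 months ahead from a 36,000/day baseline.
print(round(calc.project_capacity(36_000, months_ahead=6)))   # ≈ 99,924 images/day

# Estimate a monthly cost from assumed resource requirements.
print(round(calc.calculate_cost({
    "cpu_hours": 100_000,       # compute hours
    "gb_storage": 50_000,       # GB stored
    "gb_transfer": 20_000,      # GB transferred
    "ml_requests": 1_500_000,   # inference calls
})))                                                           # ≈ 9,450 USD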
Annex B: Emergency Scaling Runbook
Emergency Scaling Procedure:
- Detection: automatic or manual alert
- Assessment: determine scope and urgency (< 5 min)
- Authorization: auto-approved if < 200% of capacity
- Execution: Terraform apply + monitoring
- Validation: verify metrics within 15 min
- Communication: stakeholder notification
- Post-mortem: root cause analysis within 48 h
Signature meaning
The signatures for the approval process of this document can be found in the verified commits at the repository for the QMS. As a reference, the team members who are expected to participate in this document and their roles in the approval process, as defined in Annex I (Responsibility Matrix) of GP-001, are:
- Author: Team members involved
- Reviewer: JD-003, JD-004
- Approver: JD-001