# OpenStack Private Cloud Architecture

*Reference architecture for a production OpenStack private cloud covering control plane HA, network design, Ceph storage, compute sizing, security, and monitoring.*

A production OpenStack private cloud requires careful architecture across compute, storage, networking, and control plane layers. This guide presents a reference architecture for a medium-scale private cloud (50–200 compute nodes) using OpenStack 2024.2 (Dalmatian).
## Architecture Overview
| Layer | Components | Node Count |
| --- | --- | --- |
| Control plane | Keystone, Nova API, Neutron, Glance, Cinder API, Horizon, HAProxy | 3 (HA) |
| Network | Neutron L3/DHCP agents, OVN controllers | 2–3 |
| Compute | Nova compute, OVS/OVN, libvirt | 50–200 |
| Storage | Ceph MON/MGR/OSD or enterprise SAN | 3+ (Ceph) |
| Monitoring | Prometheus, Grafana, Alertmanager | 1–3 |
## Network Architecture
A production cloud uses four physically or logically separated networks:
| Network | Purpose | VLAN/Subnet | MTU |
| --- | --- | --- | --- |
| Management | API, database, message queue | VLAN 10 / 172.29.236.0/22 | 1500 |
| Tunnel/Overlay | VXLAN/Geneve between compute nodes | VLAN 20 / 172.29.240.0/22 | 9000 |
| Storage | Ceph replication, iSCSI traffic | VLAN 30 / 172.29.244.0/22 | 9000 |
| Provider/External | Tenant external access, floating IPs | VLAN 40 / public range | 1500 |
### Network Hardware
- Spine-leaf topology for predictable latency
- 25 GbE minimum for compute nodes (2x bonded)
- 100 GbE spine uplinks
- Jumbo frames (MTU 9000) on storage and tunnel networks
- MLAG/VPC for switch-level redundancy
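The bonding and jumbo-frame recommendations above can be sketched as a netplan fragment for a compute node. This is a minimal sketch assuming Ubuntu with netplan; the interface names (`ens1f0`/`ens1f1`), VLAN ID, and address are illustrative and must match your fabric design:

```yaml
# /etc/netplan/01-bonds.yaml -- interface names and addresses are illustrative
network:
  version: 2
  ethernets:
    ens1f0: {}
    ens1f1: {}
  bonds:
    bond0:
      interfaces: [ens1f0, ens1f1]
      parameters:
        mode: 802.3ad            # LACP, matches MLAG/VPC on the leaf pair
        lacp-rate: fast
        mii-monitor-interval: 100
  vlans:
    bond0.30:                    # storage network (VLAN 30)
      id: 30
      link: bond0
      mtu: 9000                  # jumbo frames for Ceph traffic
      addresses: [172.29.244.11/22]
```

Verify end-to-end MTU with `ping -M do -s 8972 <peer>` before putting storage traffic on the VLAN, since a single 1500-MTU hop silently fragments or drops jumbo frames.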
## Control Plane Design
The control plane runs on three nodes behind HAProxy for high availability:
```
                 +-----------+
                 |  HAProxy  |
                 |   (VIP)   |
                 +-----+-----+
                       |
         +-------------+-------------+
         |             |             |
   +-----+-----+ +-----+-----+ +-----+-----+
   |  ctrl-01  | |  ctrl-02  | |  ctrl-03  |
   | Keystone  | | Keystone  | | Keystone  |
   | Nova API  | | Nova API  | | Nova API  |
   | Neutron   | | Neutron   | | Neutron   |
   | Glance    | | Glance    | | Glance    |
   | MariaDB   | | MariaDB   | | MariaDB   |
   | RabbitMQ  | | RabbitMQ  | | RabbitMQ  |
   +-----------+ +-----------+ +-----------+
```
### Database
- MariaDB Galera Cluster across all 3 controllers
- Synchronous replication ensures consistency
- Use SSDs for database storage
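A minimal Galera configuration sketch for one controller follows; the cluster name, node names, and SST method are illustrative and the file path varies by distribution:

```ini
# /etc/mysql/conf.d/galera.cnf -- node and cluster names are illustrative
[mysqld]
binlog_format            = ROW
default_storage_engine   = InnoDB
innodb_autoinc_lock_mode = 2
wsrep_on                 = ON
wsrep_provider           = /usr/lib/galera/libgalera_smm.so
wsrep_cluster_name       = "openstack-galera"
wsrep_cluster_address    = "gcomm://ctrl-01,ctrl-02,ctrl-03"
wsrep_node_name          = "ctrl-01"
wsrep_sst_method         = mariabackup
```

With synchronous replication, write latency is bounded by the slowest node, which is another reason the controllers need SSD-backed storage and a low-latency management network.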
### Message Queue
- RabbitMQ cluster with mirrored queues
- 3-node cluster for quorum
- Monitor queue depth for capacity planning
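Queue mirroring is enabled with a policy rather than per-queue configuration. A sketch, assuming the default vhost (classic mirrored queues are deprecated in recent RabbitMQ releases; newer OpenStack deployments can instead set `rabbit_quorum_queue = true` in the `[oslo_messaging_rabbit]` section of each service's config):

```shell
# Mirror all non-amq.* queues across the 3-node cluster
rabbitmqctl set_policy -p / --apply-to queues ha-all '^(?!amq\.).*' \
    '{"ha-mode": "all", "ha-sync-mode": "automatic"}'
```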
### API Load Balancing
- HAProxy with a virtual IP (keepalived)
- Health checks on each API endpoint
- SSL termination at HAProxy
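The pattern for each API endpoint looks like the following HAProxy sketch for Keystone; the VIP, backend addresses, and certificate path are illustrative, and the same frontend/backend pair is repeated for Nova, Neutron, Glance, and the other services on their respective ports:

```
# /etc/haproxy/haproxy.cfg -- addresses and cert path are illustrative
frontend keystone_public
    bind 172.29.236.9:5000 ssl crt /etc/haproxy/certs/cloud.pem
    default_backend keystone_backend

backend keystone_backend
    balance roundrobin
    option httpchk GET /v3
    server ctrl-01 172.29.236.11:5000 check inter 2000 rise 2 fall 3
    server ctrl-02 172.29.236.12:5000 check inter 2000 rise 2 fall 3
    server ctrl-03 172.29.236.13:5000 check inter 2000 rise 2 fall 3
```

The `httpchk` probe hits a real API path, so a hung service process is removed from rotation even while its TCP port still accepts connections.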
## Compute Node Design
Each compute node runs:
| Service | Purpose |
| --- | --- |
| nova-compute | Instance lifecycle management |
| OVN controller / OVS agent | Virtual networking |
| libvirt + QEMU/KVM | Hypervisor |
| ceph-common | RBD client for Ceph storage |
| telegraf/node_exporter | Monitoring agent |
### Hardware Recommendations
| Component | Specification |
| --- | --- |
| CPU | 2x Intel Xeon or AMD EPYC (64+ cores total) |
| RAM | 512 GB–1 TB DDR5 |
| Boot disk | 2x 480 GB SSD (RAID 1) |
| NIC | 2x 25 GbE (bonded, LACP) |
| GPU (optional) | NVIDIA A100/H100 for AI workloads |
### CPU and RAM Allocation
```ini
# /etc/nova/nova.conf
[DEFAULT]
cpu_allocation_ratio    = 4.0   # 4:1 for general workloads
ram_allocation_ratio    = 1.0   # no RAM overcommit in production
reserved_host_memory_mb = 8192  # reserve 8 GB for host OS
reserved_host_cpus      = 4     # reserve 4 cores for host
```
## Storage Architecture

### Ceph (Recommended)
| Component | Specification |
| --- | --- |
| MON/MGR nodes | 3 (can colocate with controllers) |
| OSD nodes | 5+ dedicated storage nodes |
| OSD disks | NVMe for performance, HDD for capacity |
| Replication | 3x for production |
| Pools | volumes (Cinder), images (Glance), vms (Nova ephemeral) |
### Storage Tiers
Use Ceph CRUSH rules to create tiers:
```shell
# Fast tier (NVMe)
ceph osd crush rule create-replicated fast-rule default host ssd
ceph osd pool create fast-volumes 128 128 replicated fast-rule

# Bulk tier (HDD)
ceph osd crush rule create-replicated bulk-rule default host hdd
ceph osd pool create bulk-volumes 128 128 replicated bulk-rule
```
Map to Cinder volume types for user-facing storage tiers.
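A sketch of that mapping, assuming each pool is wired to a Cinder backend section whose `volume_backend_name` is `ceph-fast` or `ceph-bulk` (both names are illustrative):

```shell
# Expose the two Ceph tiers as user-selectable volume types
openstack volume type create fast
openstack volume type set fast --property volume_backend_name=ceph-fast
openstack volume type create bulk
openstack volume type set bulk --property volume_backend_name=ceph-bulk
```

Users then request a tier at creation time, e.g. `openstack volume create --type fast --size 100 db-vol`.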
## Security Architecture
| Layer | Mechanism |
| --- | --- |
| API | TLS everywhere (HAProxy termination or end-to-end) |
| Authentication | Keystone with LDAP/AD federation |
| Network | Security groups, project isolation |
| Secrets | Barbican for key management |
| Compliance | Audit logging via oslo.messaging notifications |
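For the LDAP/AD integration, Keystone supports per-domain identity backends (with `domain_specific_drivers_enabled = true` in `keystone.conf`). A sketch for a hypothetical `corp` domain; the server URL, DNs, and object class are illustrative:

```ini
# /etc/keystone/domains/keystone.corp.conf -- all values are illustrative
[identity]
driver = ldap

[ldap]
url              = ldaps://ldap.example.com
user             = cn=keystone,ou=services,dc=example,dc=com
password         = <service-account-password>
suffix           = dc=example,dc=com
user_tree_dn     = ou=people,dc=example,dc=com
user_objectclass = inetOrgPerson
```

Keeping service accounts in the default SQL-backed domain while corporate users live in the LDAP domain avoids making the control plane's own authentication dependent on directory availability.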
## Monitoring and Operations
| Tool | Purpose |
| --- | --- |
| Prometheus + Grafana | Metrics collection and dashboards |
| Alertmanager | Alert routing (PagerDuty, Slack) |
| ELK / Loki | Log aggregation |
| Ceilometer / Gnocchi | OpenStack-native metering |
| Rally / Tempest | Performance and integration testing |
### Key Metrics to Monitor
- Nova: instance count, scheduler latency, hypervisor utilization
- Neutron: agent health, port creation latency
- Ceph: cluster health, OSD latency, pool usage
- RabbitMQ: queue depth, message rates
- HAProxy: backend health, request latency
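The RabbitMQ queue-depth metric translates into a Prometheus alert rule like the sketch below; the metric name assumes a RabbitMQ exporter that emits `rabbitmq_queue_messages_ready`, and the threshold and duration are illustrative starting points:

```yaml
# alerts.yml -- metric name, threshold, and labels are illustrative
groups:
  - name: openstack
    rules:
      - alert: RabbitMQQueueBacklog
        expr: rabbitmq_queue_messages_ready > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "RabbitMQ queue {{ $labels.queue }} is backing up"
```

A sustained backlog usually means a consumer (often a nova or neutron agent) is down or overloaded, so route this alert to whoever operates the control plane.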
## Capacity Planning
| Resource | Formula |
| --- | --- |
| vCPUs available | (physical cores x allocation ratio) - reserved |
| RAM available | (physical RAM x allocation ratio) - reserved |
| Storage | (total OSD capacity / replication factor) x 0.85 |
| Instances per host | min(vCPU / flavor vCPUs, RAM / flavor RAM) |
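Worked example of the formulas above, plugging in the nova.conf values from "CPU and RAM Allocation" and the 64-core / 512 GB host from the hardware table; the 4 vCPU / 8 GB flavor is illustrative:

```shell
PHYS_CORES=64; CPU_RATIO=4; RESERVED_CORES=4
PHYS_RAM_MB=$((512 * 1024)); RESERVED_RAM_MB=8192       # 512 GB host

VCPUS=$(( PHYS_CORES * CPU_RATIO - RESERVED_CORES ))    # (cores x ratio) - reserved
RAM_MB=$(( PHYS_RAM_MB * 1 - RESERVED_RAM_MB ))         # ram_allocation_ratio = 1.0

# Instances per host for a hypothetical 4 vCPU / 8 GB flavor
FLAVOR_VCPUS=4; FLAVOR_RAM_MB=8192
BY_CPU=$(( VCPUS / FLAVOR_VCPUS ))
BY_RAM=$(( RAM_MB / FLAVOR_RAM_MB ))
INSTANCES=$(( BY_CPU < BY_RAM ? BY_CPU : BY_RAM ))
echo "${VCPUS} vCPUs, ${RAM_MB} MB RAM, ${INSTANCES} instances/host"
# prints: 252 vCPUs, 516096 MB RAM, 63 instances/host
```

With RAM at a 1.0 ratio, memory is the binding constraint on most general-purpose hosts, which is why the hardware table recommends 512 GB or more per node.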
## Deployment Tools
| Tool | Best For |
| --- | --- |
| OpenStack-Ansible | Full control, bare-metal |
| Kolla-Ansible | Containerized, easy upgrades |
| TripleO/Director | Red Hat environments |
| Sunbeam/MicroStack | Small-scale, Canonical |
## Summary
A production OpenStack private cloud requires a three-node HA control plane, spine-leaf networking with jumbo frames, Ceph distributed storage, and comprehensive monitoring. The architecture scales from 50 to 200+ compute nodes by adding hardware without changing the control plane design.