# OpenStack Private Cloud Architecture

*Reference architecture for a production OpenStack private cloud covering control plane HA, network design, Ceph storage, compute sizing, security, and monitoring.*

A production OpenStack private cloud requires careful architecture across compute, storage, networking, and control plane layers. This guide presents a reference architecture for a medium-scale private cloud (50–200 compute nodes) using OpenStack 2024.2 (Dalmatian).
## Architecture Overview
| Layer | Components | Node Count |
| --- | --- | --- |
| Control plane | Keystone, Nova API, Neutron, Glance, Cinder API, Horizon, HAProxy | 3 (HA) |
| Network | Neutron L3/DHCP agents, OVN controllers | 2–3 |
| Compute | Nova compute, OVS/OVN, libvirt | 50–200 |
| Storage | Ceph MON/MGR/OSD or enterprise SAN | 3+ (Ceph) |
| Monitoring | Prometheus, Grafana, Alertmanager | 1–3 |
## Network Architecture
A production cloud uses four physically or logically separated networks:
| Network | Purpose | VLAN/Subnet | MTU |
| --- | --- | --- | --- |
| Management | API, database, message queue | VLAN 10 / 172.29.236.0/22 | 1500 |
| Tunnel/Overlay | VXLAN/Geneve between compute nodes | VLAN 20 / 172.29.240.0/22 | 9000 |
| Storage | Ceph replication, iSCSI traffic | VLAN 30 / 172.29.244.0/22 | 9000 |
| Provider/External | Tenant external access, floating IPs | VLAN 40 / public range | 1500 |
### Network Hardware
- Spine-leaf topology for predictable latency
- 25 GbE minimum for compute nodes (2x bonded)
- 100 GbE spine uplinks
- Jumbo frames (MTU 9000) on storage and tunnel networks
- MLAG/VPC for switch-level redundancy
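The bonding and jumbo-frame recommendations above can be sketched as a netplan fragment for a compute node. This is a minimal sketch assuming Ubuntu with netplan; the interface names (`ens1f0`/`ens1f1`), VLAN ID, and address are illustrative and must match your fabric design:

```yaml
# /etc/netplan/01-bonds.yaml -- interface names and addresses are illustrative
network:
  version: 2
  ethernets:
    ens1f0: {}
    ens1f1: {}
  bonds:
    bond0:
      interfaces: [ens1f0, ens1f1]
      parameters:
        mode: 802.3ad            # LACP, matches MLAG/VPC on the leaf pair
        lacp-rate: fast
        mii-monitor-interval: 100
  vlans:
    bond0.30:                    # storage network (VLAN 30)
      id: 30
      link: bond0
      mtu: 9000                  # jumbo frames for Ceph traffic
      addresses: [172.29.244.11/22]
```

Verify end-to-end MTU with `ping -M do -s 8972 <peer>` before putting storage traffic on the VLAN, since a single 1500-MTU hop silently fragments or drops jumbo frames.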
## Control Plane Design
The control plane runs on three nodes behind HAProxy for high availability:
```
                 +-----------+
                 |  HAProxy  |
                 |   (VIP)   |
                 +-----+-----+
                       |
         +-------------+-------------+
         |             |             |
   +-----+-----+ +-----+-----+ +-----+-----+
   |  ctrl-01  | |  ctrl-02  | |  ctrl-03  |
   | Keystone  | | Keystone  | | Keystone  |
   | Nova API  | | Nova API  | | Nova API  |
   | Neutron   | | Neutron   | | Neutron   |
   | Glance    | | Glance    | | Glance    |
   | MariaDB   | | MariaDB   | | MariaDB   |
   | RabbitMQ  | | RabbitMQ  | | RabbitMQ  |
   +-----------+ +-----------+ +-----------+
```
### Database
- MariaDB Galera Cluster across all 3 controllers
- Synchronous replication ensures consistency
- Use SSDs for database storage
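A minimal Galera configuration sketch for one controller follows; the cluster name, node names, and SST method are illustrative and the file path varies by distribution:

```ini
# /etc/mysql/conf.d/galera.cnf -- node and cluster names are illustrative
[mysqld]
binlog_format            = ROW
default_storage_engine   = InnoDB
innodb_autoinc_lock_mode = 2
wsrep_on                 = ON
wsrep_provider           = /usr/lib/galera/libgalera_smm.so
wsrep_cluster_name       = "openstack-galera"
wsrep_cluster_address    = "gcomm://ctrl-01,ctrl-02,ctrl-03"
wsrep_node_name          = "ctrl-01"
wsrep_sst_method         = mariabackup
```

With synchronous replication, write latency is bounded by the slowest node, which is another reason the controllers need SSD-backed storage and a low-latency management network.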
### Message Queue
- RabbitMQ cluster with mirrored queues
- 3-node cluster for quorum
- Monitor queue depth for capacity planning
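Queue mirroring is enabled with a policy rather than per-queue configuration. A sketch, assuming the default vhost (classic mirrored queues are deprecated in recent RabbitMQ releases; newer OpenStack deployments can instead set `rabbit_quorum_queue = true` in the `[oslo_messaging_rabbit]` section of each service's config):

```shell
# Mirror all non-amq.* queues across the 3-node cluster
rabbitmqctl set_policy -p / --apply-to queues ha-all '^(?!amq\.).*' \
    '{"ha-mode": "all", "ha-sync-mode": "automatic"}'
```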
### API Load Balancing
- HAProxy with a virtual IP (keepalived)
- Health checks on each API endpoint
- SSL termination at HAProxy
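The pattern for each API endpoint looks like the following HAProxy sketch for Keystone; the VIP, backend addresses, and certificate path are illustrative, and the same frontend/backend pair is repeated for Nova, Neutron, Glance, and the other services on their respective ports:

```
# /etc/haproxy/haproxy.cfg -- addresses and cert path are illustrative
frontend keystone_public
    bind 172.29.236.9:5000 ssl crt /etc/haproxy/certs/cloud.pem
    default_backend keystone_backend

backend keystone_backend
    balance roundrobin
    option httpchk GET /v3
    server ctrl-01 172.29.236.11:5000 check inter 2000 rise 2 fall 3
    server ctrl-02 172.29.236.12:5000 check inter 2000 rise 2 fall 3
    server ctrl-03 172.29.236.13:5000 check inter 2000 rise 2 fall 3
```

The `httpchk` probe hits a real API path, so a hung service process is removed from rotation even while its TCP port still accepts connections.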
## Compute Node Design
Each compute node runs:
| Service | Purpose |
| --- | --- |
| nova-compute | Instance lifecycle management |
| OVN controller / OVS agent | Virtual networking |
| libvirt + QEMU/KVM | Hypervisor |
| ceph-common | RBD client for Ceph storage |
| telegraf/node_exporter | Monitoring agent |
### Hardware Recommendations
| Component | Specification |
| --- | --- |
| CPU | 2x Intel Xeon or AMD EPYC (64+ cores total) |
| RAM | 512 GB–1 TB DDR5 |
| Boot disk | 2x 480 GB SSD (RAID 1) |
| NIC | 2x 25 GbE (bonded, LACP) |
| GPU (optional) | NVIDIA A100/H100 for AI workloads |
### CPU and RAM Allocation
```ini
# /etc/nova/nova.conf
[DEFAULT]
cpu_allocation_ratio    = 4.0   # 4:1 for general workloads
ram_allocation_ratio    = 1.0   # no RAM overcommit in production
reserved_host_memory_mb = 8192  # reserve 8 GB for host OS
reserved_host_cpus      = 4     # reserve 4 cores for host
```
## Storage Architecture

### Ceph (Recommended)
| Component | Specification |
| --- | --- |
| MON/MGR nodes | 3 (can colocate with controllers) |
| OSD nodes | 5+ dedicated storage nodes |
| OSD disks | NVMe for performance, HDD for capacity |
| Replication | 3x for production |
| Pools | volumes (Cinder), images (Glance), vms (Nova ephemeral) |
### Storage Tiers
Use Ceph CRUSH rules to create tiers:
```shell
# Fast tier (NVMe)
ceph osd crush rule create-replicated fast-rule default host ssd
ceph osd pool create fast-volumes 128 128 replicated fast-rule

# Bulk tier (HDD)
ceph osd crush rule create-replicated bulk-rule default host hdd
ceph osd pool create bulk-volumes 128 128 replicated bulk-rule
```
Map to Cinder volume types for user-facing storage tiers.
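A sketch of that mapping, assuming each pool is wired to a Cinder backend section whose `volume_backend_name` is `ceph-fast` or `ceph-bulk` (both names are illustrative):

```shell
# Expose the two Ceph tiers as user-selectable volume types
openstack volume type create fast
openstack volume type set fast --property volume_backend_name=ceph-fast
openstack volume type create bulk
openstack volume type set bulk --property volume_backend_name=ceph-bulk
```

Users then request a tier at creation time, e.g. `openstack volume create --type fast --size 100 db-vol`.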
## Security Architecture
| Layer | Mechanism |
| --- | --- |
| API | TLS everywhere (HAProxy termination or end-to-end) |
| Authentication | Keystone with LDAP/AD federation |
| Network | Security groups, project isolation |
| Secrets | Barbican for key management |
| Compliance | Audit logging via oslo.messaging notifications |
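For the LDAP/AD integration, Keystone supports per-domain identity backends (with `domain_specific_drivers_enabled = true` in `keystone.conf`). A sketch for a hypothetical `corp` domain; the server URL, DNs, and object class are illustrative:

```ini
# /etc/keystone/domains/keystone.corp.conf -- all values are illustrative
[identity]
driver = ldap

[ldap]
url              = ldaps://ldap.example.com
user             = cn=keystone,ou=services,dc=example,dc=com
password         = <service-account-password>
suffix           = dc=example,dc=com
user_tree_dn     = ou=people,dc=example,dc=com
user_objectclass = inetOrgPerson
```

Keeping service accounts in the default SQL-backed domain while corporate users live in the LDAP domain avoids making the control plane's own authentication dependent on directory availability.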
## Monitoring and Operations
| Tool | Purpose |
| --- | --- |
| Prometheus + Grafana | Metrics collection and dashboards |
| Alertmanager | Alert routing (PagerDuty, Slack) |
| ELK / Loki | Log aggregation |
| Ceilometer / Gnocchi | OpenStack-native metering |
| Rally / Tempest | Performance and integration testing |
### Key Metrics to Monitor
- Nova: instance count, scheduler latency, hypervisor utilization
- Neutron: agent health, port creation latency
- Ceph: cluster health, OSD latency, pool usage
- RabbitMQ: queue depth, message rates
- HAProxy: backend health, request latency
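The RabbitMQ queue-depth metric translates into a Prometheus alert rule like the sketch below; the metric name assumes a RabbitMQ exporter that emits `rabbitmq_queue_messages_ready`, and the threshold and duration are illustrative starting points:

```yaml
# alerts.yml -- metric name, threshold, and labels are illustrative
groups:
  - name: openstack
    rules:
      - alert: RabbitMQQueueBacklog
        expr: rabbitmq_queue_messages_ready > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "RabbitMQ queue {{ $labels.queue }} is backing up"
```

A sustained backlog usually means a consumer (often a nova or neutron agent) is down or overloaded, so route this alert to whoever operates the control plane.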
## Capacity Planning
| Resource | Formula |
| --- | --- |
| vCPUs available | (physical cores x allocation ratio) - reserved |
| RAM available | (physical RAM x allocation ratio) - reserved |
| Storage | (total OSD capacity / replication factor) x 0.85 |
| Instances per host | min(vCPU / flavor vCPUs, RAM / flavor RAM) |
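Worked example of the formulas above, plugging in the nova.conf values from "CPU and RAM Allocation" and the 64-core / 512 GB host from the hardware table; the 4 vCPU / 8 GB flavor is illustrative:

```shell
PHYS_CORES=64; CPU_RATIO=4; RESERVED_CORES=4
PHYS_RAM_MB=$((512 * 1024)); RESERVED_RAM_MB=8192       # 512 GB host

VCPUS=$(( PHYS_CORES * CPU_RATIO - RESERVED_CORES ))    # (cores x ratio) - reserved
RAM_MB=$(( PHYS_RAM_MB * 1 - RESERVED_RAM_MB ))         # ram_allocation_ratio = 1.0

# Instances per host for a hypothetical 4 vCPU / 8 GB flavor
FLAVOR_VCPUS=4; FLAVOR_RAM_MB=8192
BY_CPU=$(( VCPUS / FLAVOR_VCPUS ))
BY_RAM=$(( RAM_MB / FLAVOR_RAM_MB ))
INSTANCES=$(( BY_CPU < BY_RAM ? BY_CPU : BY_RAM ))
echo "${VCPUS} vCPUs, ${RAM_MB} MB RAM, ${INSTANCES} instances/host"
# prints: 252 vCPUs, 516096 MB RAM, 63 instances/host
```

With RAM at a 1.0 ratio, memory is the binding constraint on most general-purpose hosts, which is why the hardware table recommends 512 GB or more per node.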
## Deployment Tools
| Tool | Best For |
| --- | --- |
| OpenStack-Ansible | Full control, bare-metal |
| Kolla-Ansible | Containerized, easy upgrades |
| TripleO/Director | Red Hat environments |
| Sunbeam/MicroStack | Small-scale, Canonical |
## Summary
A production OpenStack private cloud requires a three-node HA control plane, spine-leaf networking with jumbo frames, Ceph distributed storage, and comprehensive monitoring. The architecture scales from 50 to 200+ compute nodes by adding hardware without changing the control plane design.