ip·Solis -- Production Deployment Guide

This guide walks you through setting up the ip·Solis platform on a fresh on-premises server. No prior knowledge of the codebase is required.

Prerequisites
Get the Software
Configure Environment Variables
SSL / TLS Certificate Setup
Create the Production Compose Overlay
Start the Stack
Initial Admin Setup
- Install Your License (Pro)
Portal SSO — OpenID Connect (Portal Authentication)
Verify the Deployment
Backup & Maintenance
Updating to a New Version
High-Availability Deployments (optional)
Troubleshooting
Clean Reset (Test Environments)

1. Prerequisites

Server Requirements

Component	Minimum	Recommended
OS	Linux (Debian/Ubuntu recommended)	Ubuntu 22.04 LTS or newer
CPU	2 cores	4 cores
RAM	4 GB	8 GB
Disk	20 GB	50 GB (depends on number of managed assets)

Software

Install the following before proceeding:

Docker Engine >= 24.0 -- Install Docker
Docker Compose >= 2.20 (included with Docker Engine)
Git -- to clone the repository

After installing Docker, add the deployment user to the docker group so docker compose commands work without sudo:

sudo usermod -aG docker $USER
# Then log out and back in (or: newgrp docker)

Verify your installation:

docker --version        # Docker version 24.x or higher
docker compose version  # Docker Compose version v2.20 or higher
git --version

Network Requirements

The server needs outbound access to:

Destination	Purpose
Your Active Directory / LDAP server (port 389 or 636)	User validation, manager lookup, group membership
Your SMTP relay	Email notifications
vSphere / XenServer (if applicable)	VM lifecycle automation
SCCM server (if applicable)	Task sequence triggers

Inbound: ports 80 and 443 must be reachable from your users' browsers.

None of the destinations above are public internet — they are all systems on your own network. No outbound internet is required at runtime.

Air-gapped runtime

The browser loads nothing from external origins — all CSS, JavaScript and fonts (Tailwind, htmx, the Monaco script editor, the Inter font) are compiled/vendored into the image at build time and served from the ipSolis host itself. The internet is needed only during install/update — i.e. to build the image or docker pull a prebuilt one. Once the image is on the host, ipSolis runs fully air-gapped.

If you build the image yourself, the build (docker compose build) needs internet to fetch the Node/Tailwind toolchain and the vendored assets. The running host does not.
Set an admin logo as an uploaded image / data-URL, not an external URL — an app_logo_url pointing at the public internet would be the one browser request that leaves the host.
Optional hardening: set CSP_ENABLED=true in .env to emit a Content-Security-Policy (default-src 'self' for assets) that blocks any accidental external request. Off by default (it would block an external app_logo_url).

2. Get the Software

Fresh environment recommended: Docker volumes (database data) survive rm -rf /opt/ipsolis — they live under /var/lib/docker/volumes/ and persist until explicitly removed. For a clean first install, ensure no old volumes exist. See Clean Reset (Test Environments).

Clone the repository — no authentication required:

cd /opt
sudo git clone https://github.com/XenPool/ipsolis.git ipsolis
sudo chown -R $USER:$USER ipsolis    # own the repo so later git pull / docker compose need no sudo
cd ipsolis

ip·Solis ships as ready-to-run Docker images — there's nothing to build yourself. The images are public (no docker login needed). The pull and start commands follow in Section 6.

Licensing: ip·Solis is free for productive use with up to 25 active users (plus private and 30-day evaluation use). Productive use with more than 25 active users — including by public-sector bodies and non-profit organizations — requires a commercial volume-band license. See LICENSE; get a license at the ip·Solis license shop.

3. Configure Environment Variables

Copy the example file and edit it:

cp .env.example .env
nano .env

Required settings to change

# Secure database credentials
POSTGRES_PASSWORD=<generate-a-strong-password>

# Secure API secrets -- use random strings of 32+ characters
API_SECRET_KEY=<random-string-min-32-chars>
WEBHOOK_SECRET_TOKEN=<random-string>
ADMIN_API_KEY=<random-string-min-32-chars>

# CORS -- set to your production domain  ← replace YOUR_HOSTNAME.YOUR_COMPANY.COM
CORS_ORIGINS=https://YOUR_HOSTNAME.YOUR_COMPANY.COM
FLOWER_PASSWORD=<strong-password>

Tip: Generate secure passwords with:
openssl rand -base64 32

Secure the .env file: it stores secrets in plaintext, so restrict access to it:
chmod 600 .env
Treat ADMIN_API_KEY as a bootstrap credential. After initial setup, create RBAC admin accounts and issue per-integration API tokens (Admin → API Tokens) — these are stored hashed, are scoped, expiring and revocable, and are the preferred way to authenticate automations.

4. SSL / TLS Certificate Setup

The platform runs behind an nginx reverse proxy that terminates SSL. You need a TLS certificate and private key.

Choose one of the three options below (A, B, or C) — they are mutually exclusive alternatives, depending on how your server is reached. After you have a cert.pem + key.pem in nginx/ssl/, continue with **Configure nginx** — that step is required for all options.

Option A: Internal / Self-Signed Certificate (Intranet)

If your server is only accessible within your corporate network, use mkcert to generate a trusted certificate:

# Install mkcert (one-time)
# Ubuntu/Debian:
sudo apt install -y libnss3-tools
sudo curl -JLO "https://dl.filippo.io/mkcert/latest?for=linux/amd64"
sudo chmod +x mkcert-v*-linux-amd64
sudo mv mkcert-v*-linux-amd64 /usr/local/bin/mkcert

# Install the local CA into your system trust store
sudo mkcert -install

# Generate the certificate for your hostname  ← replace YOUR_HOSTNAME.YOUR_COMPANY.COM
sudo mkdir -p nginx/ssl
sudo mkcert -cert-file nginx/ssl/cert.pem -key-file nginx/ssl/key.pem YOUR_HOSTNAME.YOUR_COMPANY.COM

Important: For browsers on other machines to trust this certificate, you must distribute the root CA (mkcert -CAROOT shows the path) to client machines via Group Policy or your enterprise CA trust store.

Installing the root CA on a Windows client:

# On the server — make the root CA available for download
sudo cp $(sudo mkcert -CAROOT)/rootCA.pem /tmp/ipsolis-rootCA.pem
sudo chmod 644 /tmp/ipsolis-rootCA.pem

Copy the file to your Windows laptop (SCP, USB, etc.), then:

Option 1 — via double-click:

Rename the file to ipsolis-rootCA.crt
Double-click → Install Certificate
Local Machine → Trusted Root Certification Authorities
Restart your browser

Option 2 — via PowerShell (as Administrator):

certutil -addstore -f "ROOT" ipsolis-rootCA.crt

After installation Chrome, Edge and Firefox (using the Windows trust store) will trust the certificate without warnings.

Option B: Certificate from your Enterprise CA (Recommended for production)

If your organization runs an internal Certificate Authority (e.g., Active Directory Certificate Services):

Generate a CSR on the server: (replace YOUR_HOSTNAME.YOUR_COMPANY.COM)

sudo mkdir -p nginx/ssl
sudo openssl req -new -newkey rsa:2048 -nodes \
  -keyout nginx/ssl/key.pem \
  -out nginx/ssl/server.csr \
  -subj "/CN=YOUR_HOSTNAME.YOUR_COMPANY.COM"

Submit nginx/ssl/server.csr to your CA and obtain the signed certificate.
Save the signed certificate as nginx/ssl/cert.pem.

If your CA provides an intermediate/chain certificate, append it to cert.pem:

cat signed-cert.pem intermediate-ca.pem | sudo tee nginx/ssl/cert.pem > /dev/null

Option C: Let's Encrypt (Public-facing servers)

If your server is publicly accessible, you can use free certificates from Let's Encrypt:

sudo apt install -y certbot
sudo certbot certonly --standalone -d YOUR_HOSTNAME.YOUR_COMPANY.COM  # ← replace

# Symlink into the ssl directory
sudo mkdir -p nginx/ssl
sudo ln -sf /etc/letsencrypt/live/YOUR_HOSTNAME.YOUR_COMPANY.COM/fullchain.pem nginx/ssl/cert.pem
sudo ln -sf /etc/letsencrypt/live/YOUR_HOSTNAME.YOUR_COMPANY.COM/privkey.pem nginx/ssl/key.pem

Set up auto-renewal (Option C only)

# Test renewal
sudo certbot renew --dry-run

# Add a cron job to reload nginx after renewal
echo "0 3 * * * certbot renew --quiet --post-hook 'docker exec ipsolis-nginx nginx -s reload'" | sudo crontab -

Configure nginx

End of the certificate options. Whichever option (A, B, or C) you chose above, continue here — the following steps apply to all setups.

The repository already ships a ready-to-use nginx/nginx.conf with the placeholder YOUR_HOSTNAME.YOUR_COMPANY.COM. Replace it with your actual FQDN — the same hostname you used for the certificate above (sed handles both occurrences in one pass):

# ← replace ipsolis.example.com with your actual FQDN
sudo sed -i 's/YOUR_HOSTNAME.YOUR_COMPANY.COM/ipsolis.example.com/g' nginx/nginx.conf

The file will look like this afterwards (for reference):

server {
    listen 80;
    server_name YOUR_HOSTNAME.YOUR_COMPANY.COM;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl;
    server_name YOUR_HOSTNAME.YOUR_COMPANY.COM;

    ssl_certificate     /etc/nginx/ssl/cert.pem;
    ssl_certificate_key /etc/nginx/ssl/key.pem;
    ssl_protocols       TLSv1.2 TLSv1.3;
    ssl_ciphers         HIGH:!aNULL:!MD5;

    client_max_body_size 2g;

    # WebSocket / HTMX support
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";

    location / {
        proxy_pass         http://api:8000;
        proxy_set_header   Host              $host;
        proxy_set_header   X-Real-IP         $remote_addr;
        proxy_set_header   X-Forwarded-For   $proxy_add_x_forwarded_for;
        proxy_set_header   X-Forwarded-Proto $scheme;
    }
}

Use the same hostname in the certificate generation step (Option A/B/C above).

5. Production Compose Overlay

docker-compose.prod.yml is already included in the repository — no action needed. The overlay adds nginx for SSL termination and removes the dev bind-mounts from api and worker.

6. Start the Stack

Pull the prebuilt images and start the stack. Setting COMPOSE_FILE once means every later docker compose command (exec, ps, logs, down) uses the right files automatically — no need to repeat -f.

Pick which version to run:

Production: set IPSOLIS_VERSION to a specific release so the running system never changes unexpectedly. Raise it when you want to upgrade (see Section 11).
Test / first try: leave it unset to always get the latest build.

cd /opt/ipsolis

# Production — pin a tested release:
export IPSOLIS_VERSION=x.x.x   # e.g. 0.6.12

# Pre-live / test — track latest (leave IPSOLIS_VERSION unset):
# (nothing to export)

export COMPOSE_FILE=docker-compose.ghcr.yml:docker-compose.prod.yml
docker compose pull
docker compose up -d

Then — run migrations and verify:

docker compose exec -T api alembic upgrade head   # uses $COMPOSE_FILE set above
docker compose ps

Expected output -- all services should show Up (healthy):

NAME             STATUS
ipsolis-postgres      Up (healthy)
ipsolis-redis         Up (healthy)
ipsolis-api           Up (healthy)
ipsolis-worker        Up (healthy)
ipsolis-beat-1   Up
ipsolis-nginx         Up

Verify the application:

# Direct API health check
curl -f http://localhost:8000/health | python3 -m json.tool

# Through nginx (HTTPS)
curl -fsk https://YOUR_HOSTNAME.YOUR_COMPANY.COM/health | python3 -m json.tool

7. Initial Admin Setup

First-run admin account (RBAC)

Open https://YOUR_HOSTNAME.YOUR_COMPANY.COM/ui/ in your browser. On the very first visit (when admin_users is empty), the login page renders a "Create first administrator" form instead of the normal sign-in form. Fill in:

Field	Notes
Username	3–128 chars, allowed: `[a-zA-Z0-9._@-]+`. Lower-cased at write time.
Password	≥ 12 chars. PBKDF2-SHA256 / 600k iterations (OWASP-2023).
Confirm password	Must match.

Submitting creates the first superadmin and auto-logs you in. This is idempotent against races — if two operators hit the form at the same time, only one wins; the other gets a "use the sign-in form" message.

After the first superadmin exists, the form switches to the regular username + password sign-in.

Add additional admin users

Once signed in, navigate to Admin Users in the left nav (superadmin-only). Create per-user accounts in the role appropriate to each operator:

superadmin > admin > approver > auditor > helpdesk

The full role ladder, per-asset-type ACL grants, separation-of-duties enforcement, and password-policy options are configurable in the Admin UI under Settings → Access Control.

Legacy `ADMIN_API_KEY` fallback

The ADMIN_API_KEY from .env continues to authenticate as a virtual superadmin even after first-run setup, so existing scripts / X-Admin-Key headers don't break on upgrade. To use it on the login page: leave Username blank, paste the key into Password. Audit attribution shows up as admin:legacy_key so auditors can tell when the fallback path was used.

For new integrations prefer Per-integration API tokens (Admin UI → API Tokens) — named, expiring, revocable bearer tokens with optional role binding and scoped permissions. The legacy single shared key is kept for back-compat only.

Install Your License

Private use and 30-day evaluation require no license file. Any productive or organizational deployment — including public-sector and non-profit use — requires a commercial license; XenPool delivers a signed .lic file after purchase.

Install it through the Admin UI:

Navigate to Admin → License (or open https://YOUR_HOSTNAME.YOUR_COMPANY.COM/ui/license).
Click Upload license and select your ipsolis.lic file.
The page reloads showing licensee name and expiry — no restart required.

Grace period: when a license expires, a 30-day grace period applies before the license status reverts to unlicensed. The Admin UI shows an amber warning banner and the daily health alert email fires each day throughout the window.

Overwriting: upload a new .lic at any time to renew. The old file is replaced in-place; the license cache refreshes on the next request (mtime-keyed, zero downtime).

Env-var override (air-gapped / automated deployments): mount the .lic file into the container at an alternate path and set:

IPSOLIS_LICENSE_PATH=/run/secrets/ipsolis.lic

The default path is /app/license/ipsolis.lic (inside the ipsolis-api container). Docker secrets or a bind-mount both work.

Configuration Checklist

The in-app Setup checklist on the dashboard guides you through all required steps. The order below matches the checklist:

1. Set application title and logo (Essential)

Navigate to Admin > Settings → General:

Setting	Description
`app.title`	Application name shown in the portal and emails (default: `ip·Solis`)
`app.logo`	Logo upload (PNG/SVG recommended)

2. Configure SMTP (Essential)

Navigate to Admin > Settings → Email:

Setting	Description	Example
`smtp.host`	SMTP relay hostname	`smtp.yourcompany.com`
`smtp.port`	SMTP port	`587`
`smtp.user`	SMTP username (if auth required)	`selfservice@yourcompany.com`
`smtp.password`	SMTP password	(marked as secret)
`smtp.tls`	Use STARTTLS	`true`
`smtp.from`	Sender email address	`noreply@yourcompany.com`
`smtp.from_name`	Sender display name	`ip·Solis`

Navigate to Admin > Email Templates to customize notification email text.

3. Connect to Active Directory (Essential)

Navigate to Admin > Settings → Active Directory:

Setting	Description	Example
`ad.server`	AD domain controller hostname or IP	`dc01.yourcompany.com`
`ad.port`	LDAP port	`389` (or `636` for LDAPS)
`ad.base_dn`	Search base DN	`DC=yourcompany,DC=com`
`ad.domain`	NetBIOS domain name	`YOURCOMPANY`
`ad.username`	Service account (sAMAccountName)	`svc-selfservice`
`ad.password`	Service account password	(marked as secret)
`ad.use_ssl`	Use LDAPS	`true` or `false`

Required AD permissions depend on the modules and runbook steps in use. As a baseline:

Read on user objects (attributes: mail, displayName, sAMAccountName, userPrincipalName, manager, memberOf, distinguishedName)

Write member on group objects — required for AD group-based access assignment

Additional permissions (e.g. on computer objects, OUs, or other attributes) may be needed depending on the runbooks and modules deployed.

4. Enable portal SSO (OIDC) (Essential)

See Section 8 for the full setup (Entra ID, Okta, and any other OIDC provider).

5. Create your first asset type (Essential)

Go to Admin > Asset Types > New
Fill in the name, description, and category
Configure the automation strategy (Group Access, Runbook, or Composite)
Set approval requirements if needed
Optionally restrict access with an Eligible Requestors group DN
Save

6. Add at least one asset to the pool (Essential)

Go to Admin > Asset Pool > New and add at least one asset.

For pure capacity_pooled asset types (quota without dedicated instances) this step can be skipped.

Set up Runbooks (if applicable)

ip·Solis ships with a fully configured example runbook: "Virtual Machine Recycler" — a standalone runbook that includes all required script modules (XenServer/XCP-ng, SCCM, Active Directory) and can serve as a template for your own automation.

Find it under Admin > Runbooks to inspect, copy, or adapt it.

To create asset-type runbooks:

Go to Admin > Runbooks > New
Define the steps (PowerShell modules or built-in modules)
Link the runbook to an asset type

Any number of custom runbooks with any combination of steps can be created.

Recommended next steps

Microsoft Teams approval cards: Go to Admin > Settings → Email and add a Teams webhook URL — approvers receive an Adaptive Card with a one-click review link in addition to email.
Stream audit log to SIEM: Configure a Splunk HEC or webhook endpoint under Admin > Settings → Compliance.
Issue per-integration API tokens: Go to Admin > API Tokens to create named, revocable bearer tokens for ServiceNow, scripts, or Prometheus — replaces the shared X-Admin-Key.

After a DB restore: The api_tokens table is restored along with the database. Review all tokens under Admin > API Tokens — revoke any old or unused tokens and issue new, dedicated tokens for active integrations only.

8. Portal SSO — OpenID Connect (Portal Authentication)

The self-service portal authenticates end users via generic OpenID Connect (OIDC). Any standards-compliant identity provider works through a single code path — Entra ID, Okta, Ping, Google Workspace, Keycloak, Authentik, Zitadel, … — because each provider self-configures from its issuer URL via the discovery document (<issuer>/.well-known/openid-configuration). Adding a new IdP is a config entry, not a code change. On-prem AD/LDAP username+password login can be offered alongside OIDC.

Configure in Admin UI

Navigate to Admin → Settings → Authentication:

Require login to access the portal — toggle on for multi-user production. (Off = portal open with a shared anonymous identity; demo / air-gapped labs only.)
(optional) Also offer on-prem AD / LDAP login — uses the LDAP service account configured in the same page.
Under OIDC providers, click + Add provider and fill in:

Field	Description
Provider ID	Stable URL-safe slug (`a–z 0–9 _ -`), e.g. `entra`, `okta`. Appears in the callback URL and cannot be changed later.
Display name	Button label on the login page, e.g. `Entra ID`.
Issuer URL	The OIDC issuer (discovery is derived from it). See recipes below.
Client ID	Application / client ID from the IdP app registration.
Client secret	Confidential-client secret (stored encrypted; supports `vault://…` / `ccp://…` references).
Redirect URI	Leave blank to auto-derive `https://YOUR_HOST/portal/auth/<provider-id>/callback`, or set an explicit value. Register this exact URI in the IdP app.
Allowed domains	(optional) comma-separated UPN/email domain allow-list. Blank = allow any.
Advanced	Scopes (default `openid profile email`) and claim mapping (username/email/name).

Click Test to run a discovery probe (confirms the issuer is reachable and the authorization/token/JWKS endpoints resolve), then Save.

When more than one login method is enabled the portal shows a chooser at /portal/login; with exactly one it redirects straight to it.

Recipe — Microsoft Entra ID (Azure AD)

Azure Portal → App registrations → New registration.
Redirect URI (Web): https://YOUR_HOST/portal/auth/entra/callback.
Copy the Application (client) ID and Directory (tenant) ID; create a client secret under Certificates & secrets.
In ip·Solis: Provider ID entra, Issuer URL https://login.microsoftonline.com/<tenant-id>/v2.0, plus the client id/secret.

Recipe — Okta

Okta Admin → Applications → Create App Integration → OIDC / Web Application.
Sign-in redirect URI: https://YOUR_HOST/portal/auth/okta/callback.
Copy the Client ID and Client secret.
In ip·Solis: Provider ID okta, Issuer URL https://<your-org>.okta.com (or your custom authorization-server issuer), plus the client id/secret.

9. Verify the Deployment

Run through this checklist to confirm everything works:

HTTPS: https://YOUR_HOSTNAME.YOUR_COMPANY.COM loads with a valid certificate
Admin UI: https://YOUR_HOSTNAME.YOUR_COMPANY.COM/ui/ is accessible
First-run setup: visiting the admin login renders the "Create first administrator" form (or, if already done, the regular sign-in form with no error)
Setup checklist: the dashboard shows the in-app setup checklist; tick off Essential items as you configure them
Portal login: Users can sign in via an OIDC provider (Entra ID, Okta, …) — the discovery Test passes and a real login completes
AD lookup: On the order form, user validation (deputy, RDP, admin fields) resolves names
Email: Submit a test order and confirm notification email arrives
Health check: curl -fsk https://YOUR_HOSTNAME.YOUR_COMPANY.COM/health returns {"status": "ok"}
(optional) API tokens: issue a per-integration token for any automation that previously used X-Admin-Key
(optional) SIEM streaming: configure under Settings → Compliance if you have Splunk / Sentinel / a generic webhook receiver
(optional) Prometheus: scrape /metrics from your monitoring; the dashboard ships in docs/grafana/

10. Backup & Maintenance

Database Backup

The PostgreSQL data is stored in a Docker volume (postgres_data). Back it up regularly:

# Dump the database
docker compose exec -T postgres pg_dump -U xpuser ipsolis > backup_$(date +%Y%m%d).sql

# Restore from backup
cat backup_20260414.sql | docker compose exec -T postgres psql -U xpuser ipsolis

Logs

View container logs:

# All services
docker compose logs --tail=50

# Specific service
docker compose logs api --tail=100 -f    # follow mode
docker compose logs worker --tail=100

Disk Cleanup

Periodically remove old Docker images:

docker image prune -f

11. Updating to a New Version

Back up before upgrading

Always snapshot the database first — pg_dump from the Postgres container, or use the in-app Maintenance → Backups page (Admin UI) which writes a timestamped SQL dump to the bind-mounted ./backups/ directory. Configure a daily backup schedule in the same UI so the snapshot is fresh when an unexpected regression appears.

Pre-flight SSL check — run this before pulling. If either file is missing, the nginx container will start but serve no HTTPS traffic.
cd /opt/ipsolis
ls -la nginx/ssl/cert.pem nginx/ssl/key.pem
If missing, regenerate the certificate (see section 4) before proceeding.

Pull the new images. Production: bump IPSOLIS_VERSION to the new tested release. Pre-live / test: leave it unset to pull the latest build.

cd /opt/ipsolis
git pull origin main                 # refresh compose files / nginx.conf / docs
export IPSOLIS_VERSION=x.x.x          # production: set to the release you want (e.g. 0.6.12)
                                     # pre-live/test: leave unset to track :latest
export COMPOSE_FILE=docker-compose.ghcr.yml:docker-compose.prod.yml
docker compose pull                   # fetch the new images
docker compose up -d                  # recreate changed containers (no build)
docker compose exec -T api alembic upgrade head   # apply any new migrations
docker compose restart nginx          # pick up new container IPs / config
curl -fsk https://YOUR_HOSTNAME.YOUR_COMPANY.COM/health | python3 -m json.tool

Migrations are safe to run multiple times -- Alembic tracks which have already been applied and skips them. Each feature slice typically ships its own migration; review api/alembic/versions/ between upgrades for the changeset, and docker compose exec api alembic history to see the chain.

Beat HA failover during the restart

If you run multiple Beat replicas (--scale beat=N), docker compose up -d rolls the containers one at a time and the leader lock hands over to the surviving replica within ~13 s. For single-Beat installs there's a brief gap during the restart where periodic tasks aren't running — usually invisible since cadences are minutes / hours.

12. High-Availability Deployments (optional)

Optional — most deployments don't need this. A single-instance ip·Solis (the default from sections 1–7) is the right choice for the vast majority of installs. A brief outage — e.g. during an upgrade restart — is rarely critical: in-flight work resumes and scheduled tasks run on minute/hour cadences. Only consider the patterns below if you have a specific availability requirement (e.g. an SLA) or are scaling to very high order volumes. Otherwise you can safely skip this section.

ip·Solis scales horizontally at the API and worker layers. The Beat scheduler supports multi-replica HA via celery-redbeat. This section covers the two tested scaling scenarios: API replicas and worker replicas.

12.1 Multi-replica API

The API is stateless by design — every replica handles every request equally and there's no need for sticky-session affinity at the load balancer.

What makes it stateless:

Sessions use Starlette's `SessionMiddleware` in cookie-signed mode (api/app/main.py): the entire session payload (admin user id, role, csrf token) lives in the xp_session cookie itself, signed with API_SECRET_KEY. No server-side session table.
Tokenized URLs (/approve/<token>, /portal/certifications/review/<token>, etc.) are HMAC-signed with the same API_SECRET_KEY and verify-only. No replay table.
All request state lives in Postgres or Redis — both shared across replicas.

What every replica MUST share:

What	Why	How
`API_SECRET_KEY`	Signs session cookies + approval tokens. Different keys per replica = clients see "session invalid" / "approval link expired" half the time.	Pin in `.env`; load via `env_file:` in compose so every replica reads the same file.
`DATABASE_URL` / `CELERY_BROKER_URL` / `CELERY_RESULT_BACKEND`	Shared Postgres + Redis backplane.	Same as above.
Shared filesystem mounts	`licenses/`, `scripts/`, `backups/` are bind-mounted; replicas reading the same paths must see the same content. On a single host, that's automatic. On multiple hosts, use NFS / GlusterFS / a shared volume driver — or migrate the relevant content to S3-compatible object storage (a deferred slice).	Single-host deployments don't need any extra plumbing.

Scaling commands:

# Single-host: bump the api replica count via compose
docker compose -f docker-compose.ghcr.yml -f docker-compose.prod.yml \
  up -d --scale api=3

# Verify each replica is reachable through the load balancer
for i in 1 2 3; do
  curl -fsk https://YOUR_HOSTNAME.YOUR_COMPANY.COM/health \
    -H 'X-Replica-Probe: '$i
done

Load-balancer config notes:

No sticky sessions required. Round-robin or least-connections is fine.
Health check: GET /health (unauthenticated). Returns {status: ok | degraded} aggregating database, redis, and beat liveness. The endpoint is fast (one Redis ping + one DB SELECT 1) so a 5–10s LB check interval is safe.
TLS termination: keep on the load balancer (or the existing nginx sidecar from section 5). Replicas serve plain HTTP internally; the `https_only=True` flag on SessionMiddleware guards the cookie's Secure bit irrespective of where TLS terminates.

Rolling restart during upgrades: the upgrade flow in section 11 stops and restarts every replica together, which is fine for small fleets where ~30s of API downtime is acceptable. For zero-downtime rolls, fold the up -d step into a per-replica loop:

for i in 1 2 3; do
  docker compose stop api-$i
  docker compose -f docker-compose.ghcr.yml -f docker-compose.prod.yml \
    up -d --no-deps api-$i
  # Wait for the new container to pass health
  until curl -fsk http://localhost/health > /dev/null 2>&1; do
    sleep 2
  done
done

This requires an LB that can drain one backend at a time; with the default round-robin nginx upstream, in-flight requests on the restarting replica drop. The drain logic is your LB's responsibility.

12.2 Multi-replica worker

Celery workers are stateless consumers — they pull from the named Redis queues and process tasks. Adding more workers is a one-line scale-up; the worker code itself doesn't change.

Queue topology (defined in worker/tasks/__init__.py):

Queue	Tasks	Why a separate queue
`provision`	Order workflows (`dynamic_runner`, `standalone_runner`, `ps_module_installer`, `sccm_probe`) — anything that touches AD / SCCM / vSphere / XenServer.	Provisioning steps shell out to PowerShell (~5–60s/step) and hold connections to external systems. Isolating them keeps a slow vSphere call from blocking quick housekeeping tasks.
`notifications`	Email senders, Teams card delivery, approval reminders, certification reminders, cost alerts.	I/O-bound, latency-sensitive (a stuck SMTP server shouldn't queue up behind a 30s SCCM probe).
`default`	Audit retention prune, SIEM streaming, license check, update checker, cost-report snapshot, DB backup, api token purge.	Background housekeeping. Mostly cron-driven, low-frequency.
`reclaim`	Asset-expiry checks (`check_expiring_assets`).	Hourly Beat task; small but isolated so the hourly tick doesn't compete with order workflows for a worker slot.

Sizing recommendations (per-queue concurrency × replica count):

Pool size	Recommended config	Reasoning
Lab / single-team (≤50 users)	1 worker replica, `--concurrency=4 -Q provision,notifications,default,reclaim`	All queues on one process; concurrency 4 is plenty for the typical 1–2 orders/hour.
Mid (≤500 users, ≤20 orders/hour)	2 worker replicas split by queue: replica A `-Q provision --concurrency=4`, replica B `-Q notifications,default,reclaim --concurrency=2`	Provisioning latency stays bounded by replica A; replica B handles housekeeping + reminders without queue-head-of-line blocking.
Large (≥500 users, ≥50 orders/hour, regulated SLAs)	3+ worker replicas: dedicated `provision` workers (`--concurrency=8` × 2 replicas), one `notifications` replica (`--concurrency=4`), one `default,reclaim` replica (`--concurrency=2`)	Per-queue scaling matches actual load shape.

Scaling command (single-host, all queues on each replica):

docker compose -f docker-compose.ghcr.yml -f docker-compose.prod.yml \
  up -d --scale worker=3

Per-queue dedicated replicas require either separate compose service definitions (e.g. worker-provision, worker-notifications) each with its own command: overriding the default queue list, or a runtime command: override:

# docker-compose.prod.yml — per-queue split
services:
  worker-provision:
    image: ipsolis-worker
    command: celery -A tasks worker -Q provision --concurrency=8 -l info
    deploy: { replicas: 2 }
    env_file: .env

  worker-notifications:
    image: ipsolis-worker
    command: celery -A tasks worker -Q notifications --concurrency=4 -l info
    deploy: { replicas: 1 }
    env_file: .env

  worker-housekeeping:
    image: ipsolis-worker
    command: celery -A tasks worker -Q default,reclaim --concurrency=2 -l info
    deploy: { replicas: 1 }
    env_file: .env

Beat scaling: the beat container has no fixed container_name so it can be replicated for HA:

docker compose \
  -f docker-compose.ghcr.yml \
  -f docker-compose.prod.yml \
  up -d --scale beat=2

Note: Celery Beat is a singleton scheduler. Multiple beat replicas only make sense with a distributed lock backend — celery-redbeat (already configured) uses Redis locks to prevent duplicate task firing.

Liveness: each worker registers with Celery's mingle-on-startup, which means a fresh worker is visible to Beat / other workers within a couple of seconds. There's no separate health check to wire — if the worker container is Up, it's consuming.

Visibility: Flower (the existing flower service in the dev compose; see docker-compose.yml) shows live worker registration, queue depth, and task-by-task duration breakdown. For production, front it with the same nginx auth as the admin UI; Flower has no built-in authn beyond HTTP basic.

12.3 Postgres high-availability

Postgres HA (streaming replication, pgBackRest, Patroni) is architecturally possible — ip·Solis is single-primary and any connection-string switch requires only a .env change and restart. A validated step-by-step guide is not included in this version.

13. Troubleshooting

Container won't start

# Check container status and exit codes
docker compose ps -a

# Check logs for the failing service
docker compose logs <service-name> --tail=50

Health check fails through nginx but API is healthy

Nginx may have cached the old container IP. Restart the container (not just nginx -s reload — Docker bind-mounts retain the old inode otherwise):

docker compose \
  -f docker-compose.ghcr.yml \
  -f docker-compose.prod.yml \
  restart nginx

Database connection errors

# Check if postgres is running
docker compose exec postgres pg_isready -U xpuser

# Verify the connection from the API container
docker compose exec api python -c "
from sqlalchemy import create_engine, text
e = create_engine('postgresql://xpuser:<password>@postgres:5432/ipsolis')
with e.connect() as c: print(c.execute(text('SELECT 1')).scalar())
"

AD / LDAP connection issues

Verify network connectivity from the container:

docker compose exec api curl -v telnet://dc01.yourcompany.com:389

Check the AD settings in Admin > Settings

Review API logs for LDAP errors:

docker compose logs api 2>&1 | grep -i "ldap\|ad_lookup"

Emails not sending

Verify SMTP settings in Admin > Settings

Check worker logs for SMTP errors:

docker compose logs worker 2>&1 | grep -i "smtp\|mail\|notification"

Ensure the server can reach the SMTP relay:

docker compose exec api curl -v telnet://smtp.yourcompany.com:587

Permission denied on ssl directory

sudo chmod 644 nginx/ssl/cert.pem
sudo chmod 600 nginx/ssl/key.pem

14. Clean Reset (Test Environments)

Test and staging environments only. This section permanently destroys all data. Never run on a production instance.

Docker volumes (database data, Redis data) survive rm -rf /opt/ipsolis because they are stored under /var/lib/docker/volumes/ — independent of the repository directory. For a fully clean reinstall:

# 1. Stop the stack and delete volumes
cd /opt/ipsolis
docker compose \
  -f docker-compose.ghcr.yml \
  -f docker-compose.prod.yml \
  down -v

# 2. Remove the repository directory
cd /opt
sudo rm -rf ipsolis

# 3. Reinstall (continue from section 2), then start the stack via section 6
sudo git clone https://github.com/XenPool/ipsolis.git ipsolis
sudo chown -R $USER:$USER ipsolis
cd ipsolis

After this reset the database contains no users, no configuration and no assets — the initial setup (section 7) must be completed again.