Troubleshooting common deploy failures

Symptoms and concrete fixes for the most common deploy failures: SSH timeout on bootstrap, dependency install errors, migration failures, and post-deploy 502s.

This page is a flat list of failure modes most users hit at least once, with the symptom on top, the root cause underneath, and a single concrete fix that has worked for that specific cause. The deployment log shown in the dashboard is the source of truth for what actually happened; this page tells you what to look for in it.

Server bootstrap times out

Symptom: after you paste the platform's public key onto your server, the polling page sits on pending for a minute then flips to timeout.

Likely causes, in descending order of frequency:

The pasted command was truncated and only included part of the pubkey. Open ~/.ssh/authorized_keys on your server and confirm the last line begins with ssh-ed25519 AAAA and ends with the comment the platform appended.
The SSH daemon's AllowUsers or AllowGroups directive excludes the login the platform is trying to use. Run grep -i allow /etc/ssh/sshd_config; if either is set, add the login.
The pasted command was run as the wrong user. ~/.ssh must be in the home directory of the login you put on the Add Server form. Re-paste while logged in as that user.
Inbound port 22 is firewalled. Running nmap -p 22 <server-ip> from your laptop should report the port as open.
The server's clock is more than a few minutes off. OpenSSH key auth doesn't strictly require synchronised clocks, but related mechanisms (certificate authority bootstrapping, sudo timestamps) can fail when the clock is skewed.
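
The first and fourth checks above can be scripted. The following is a minimal sketch, not the platform's own verification: the function names are made up for illustration, and it assumes the key is a plain ssh-ed25519 line with no options prefix. It relies on the fact that an ed25519 public key blob always decodes to 51 bytes, so a truncated paste fails the length check.

```python
import base64
import socket


def looks_like_complete_ed25519_key(line: str) -> bool:
    """Return True if an authorized_keys line holds a whole ssh-ed25519 key."""
    parts = line.strip().split()
    # Expect: key type, base64 blob, and the trailing comment.
    if len(parts) < 3 or parts[0] != "ssh-ed25519":
        return False
    try:
        blob = base64.b64decode(parts[1], validate=True)
    except ValueError:
        # Truncated base64 usually fails here with a padding error.
        return False
    # 4-byte length + "ssh-ed25519" (11 bytes) + 4-byte length + 32-byte key.
    return len(blob) == 51


def port_22_reachable(host: str, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to port 22 on host succeeds."""
    try:
        with socket.create_connection((host, 22), timeout=timeout):
            return True
    except OSError:
        return False
```

Run the key check on the server against the last line of ~/.ssh/authorized_keys, and the port check from your laptop against the server's address.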

"pip install ... failed" during initial deploy

Symptom: the deploy log shows the pip install -r requirements.txt command exited non-zero with a long traceback ending in a wheel build failure.

Root cause: a Python package needed system headers that are not installed (most commonly libpq-dev for psycopg2, libffi-dev for cryptography, or libjpeg-dev + zlib1g-dev for Pillow).

Fix: SSH into the server and run apt install -y build-essential libpq-dev libffi-dev libjpeg-dev zlib1g-dev python3-dev (Debian/Ubuntu) or install the equivalent dnf packages on RHEL-family distributions. Redeploy. The platform installs a baseline of common build dependencies during first-deploy provisioning, but it does not cover every C extension out there.
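
To see in advance which of these headers a project is likely to need, you can scan requirements.txt before deploying. This is a rough sketch: the mapping contains only the three packages named above, and the helper name is hypothetical. Extend the table for your own stack.

```python
import re

# Mapping taken from the packages named above (Debian/Ubuntu package names).
SYSTEM_HEADERS = {
    "psycopg2": ["libpq-dev"],
    "cryptography": ["libffi-dev"],
    "pillow": ["libjpeg-dev", "zlib1g-dev"],
}


def headers_needed(requirements_text: str) -> list[str]:
    """Return the sorted system packages the listed requirements may need."""
    needed = set()
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue  # skip blanks, comments, and pip options such as -r or -e
        # The distribution name ends at the first specifier, extra, or marker.
        name = re.split(r"[\s\[<>=!~;@]", line, maxsplit=1)[0].lower()
        needed.update(SYSTEM_HEADERS.get(name, []))
    return sorted(needed)
```

Calling headers_needed(open("requirements.txt").read()) returns the packages to pass to apt install. Note that psycopg2-binary is deliberately not in the table, since it ships prebuilt wheels.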

Migration fails with "relation does not exist"

Symptom: django.db.utils.ProgrammingError: relation "<table>" does not exist partway through the migrate step.

Root cause: a migration depends on a table created by another Django app in your project, but that app is not in INSTALLED_APPS in the settings module the server uses, or the ordering of INSTALLED_APPS is wrong so the dependency graph is unsolvable.

Fix: on your laptop, run ./manage.py showmigrations --plan against the same DJANGO_SETTINGS_MODULE you configured for the deploy. The plan must complete without an error. If it doesn't, fix INSTALLED_APPS in that settings module, commit, and redeploy.

Migration fails with "constraint X on table Y is violated"

Symptom: a migration that adds a non-nullable column or unique constraint exits with a database-side integrity error.

Root cause: existing rows in the table on the server don't satisfy the new constraint. The new column has no value for them, or two rows have the same value for a column you're trying to make unique.

Fix: split the schema change into two releases, following the data-migration-then-constraint pattern the Django docs recommend. First deploy a release with a nullable column and a data migration that populates it for every existing row; then deploy a release that flips the column to non-nullable. This is the same fix as on any platform; there is no shortcut here.
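
Before deploying the second release, it can help to count the rows that would still violate the constraint. The sketch below uses the standard-library sqlite3 module as a stand-in for your production database; the function name is hypothetical, and the queries are plain SQL that should carry over to other databases, though that is an assumption to verify against your own.

```python
import sqlite3


def rows_blocking_constraints(conn, table, column):
    """Return (null_count, duplicated_values) for one column of one table."""
    # Identifiers cannot be bound as query parameters, so pass only
    # trusted table and column names.
    null_count = conn.execute(
        f'SELECT COUNT(*) FROM "{table}" WHERE "{column}" IS NULL'
    ).fetchone()[0]
    duplicates = [
        row[0]
        for row in conn.execute(
            f'SELECT "{column}" FROM "{table}" '
            f'WHERE "{column}" IS NOT NULL '
            f'GROUP BY "{column}" HAVING COUNT(*) > 1 '
            f'ORDER BY "{column}"'
        )
    ]
    return null_count, duplicates
```

A non-zero null count blocks a NOT NULL constraint, and a non-empty duplicates list blocks a unique constraint. Both should be clear before the constraint release goes out.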

Deploy succeeds, app returns 502 Bad Gateway

Symptom: the deployment row shows success, the rendered public URL responds, but every request returns a 502.

Root cause: the systemd unit started gunicorn (or the Node process) but the process crashed immediately on first request, or it bound to a different socket than the edge proxy is forwarding to.

Fix: SSH into the server and run journalctl -u <project-slug>.service --since "5 minutes ago" --no-pager. The actual Python traceback (or Node error) will be in that output. The single most common cause is DEBUG = False in the production settings without SECRET_KEY set in the environment.

A close second is DisallowedHost when the request hits a hostname the project doesn't list in ALLOWED_HOSTS — add the <uuid>.apps.codesensei.cloud wildcard or the specific hostname you're testing against.
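
Both causes can be caught at start-up rather than on the first request. The following is one possible shape for that check, not a required layout: the function name and the EXTRA_ALLOWED_HOSTS variable are hypothetical. It uses Django's leading-dot syntax, which matches a domain and all of its subdomains.

```python
import os


def production_settings(env=os.environ):
    """Build the settings that most often cause a post-deploy 502 when missing."""
    secret_key = env.get("SECRET_KEY")
    if not secret_key:
        # Fail at start-up with a clear message instead of on the first request.
        raise RuntimeError("SECRET_KEY is not set in the environment")
    # A leading dot makes Django accept the domain and every subdomain of it.
    hosts = [".apps.codesensei.cloud"]
    # EXTRA_ALLOWED_HOSTS is a hypothetical comma-separated list of custom domains.
    extra = env.get("EXTRA_ALLOWED_HOSTS", "")
    hosts += [h.strip() for h in extra.split(",") if h.strip()]
    return {"DEBUG": False, "SECRET_KEY": secret_key, "ALLOWED_HOSTS": hosts}
```

One way to use it is to call globals().update(production_settings()) in the production settings module, so a missing SECRET_KEY shows up in the journalctl output as soon as the service starts.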

Deploy succeeds, the URL renders, but static assets 404

Symptom: page HTML loads, but CSS and JS hit 404.

Root cause: either collectstatic did not run (technology field set to Static Site on what is actually a Django project) or your STATIC_URL / STATIC_ROOT pair is inconsistent.

Fix: confirm STATIC_URL ends with /static/ and STATIC_ROOT points at a path inside the release directory (BASE_DIR / "collected-static" is the convention the platform expects). Redeploy after fixing settings.
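
The two conditions in this fix are easy to check mechanically. A minimal sketch follows, assuming BASE_DIR is the release directory; the function name is illustrative only.

```python
from pathlib import Path


def static_settings_problems(static_url, static_root, base_dir):
    """Return a list of problems; an empty list means the pair looks consistent."""
    problems = []
    if not static_url.endswith("/static/"):
        problems.append(f"STATIC_URL {static_url!r} does not end with /static/")
    root = Path(static_root).resolve()
    base = Path(base_dir).resolve()
    # STATIC_ROOT must sit somewhere below the release directory.
    if base not in root.parents:
        problems.append(f"STATIC_ROOT {root} is not inside {base}")
    return problems
```

From ./manage.py shell you could call it with settings.STATIC_URL, settings.STATIC_ROOT, and settings.BASE_DIR and expect an empty list.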

Rollback runs instantly but the old version doesn't come back

Symptom: you click Roll back to this, the deploy log shows the symlink flip + systemd restart, but the browser still shows the new version.

Root cause: the browser cache or edge proxy cache is serving stale content, or the database state has moved on (a destructive migration was already run by the new release and the rolled-back release can't read the new schema).

Fix: hard refresh in your browser (Ctrl-Shift-R or Cmd-Shift-R) to rule out caching first. If the symptom persists, the database state is the issue — there is no automatic data migration rollback. Either roll forward instead, or restore the database from a backup taken before the destructive migration. The Django projects guide has the two-step pattern that avoids putting yourself in this position.

Deploy log stalls partway through

Symptom: the streamed deploy log shows a few lines of output then stops scrolling. The deployment row stays on running indefinitely.

Root cause: the deploy script hit a command that emits no output for a long time, most often an npm ci on a slow connection or a pip install compiling a large C extension from source. Network buffering between your server and the platform worker is holding the output in flight. The deploy is still progressing; only the log display has stalled.

Fix: wait it out — most stalls clear within thirty to ninety seconds. If a single command runs for longer than the configured deploy timeout (twenty minutes by default), the worker aborts and marks the row failed with a timeout reason. To preempt that on a known-slow build, increase the project's deploy timeout from the project detail page's Advanced settings expander.

Custom domain is stuck in verifying

Symptom: you added a custom domain and the verification badge has been on verifying for more than ten minutes.

Root cause: the CNAME record at your DNS provider has not propagated yet, or it points at the wrong target. The platform polls your DNS every thirty seconds and times out after one hour.

Fix: query a third-party DNS resolver (Google's 8.8.8.8 is a good choice) by running dig +short CNAME <your-domain> @8.8.8.8. The output must match the target the dashboard showed you when you submitted the form. If it doesn't, fix the record at your DNS provider. If it does match but the badge still does not update, click Re-verify on the project's custom domain page to force-poll.
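
When comparing by eye, remember that dig prints the target with a trailing dot and that DNS names are case-insensitive. The sketch below automates the comparison under those two assumptions. The function names are hypothetical, and query_cname simply shells out to a locally installed dig.

```python
import subprocess


def normalise(name: str) -> str:
    """Lower-case a DNS name and drop the trailing dot that dig prints."""
    return name.strip().rstrip(".").lower()


def cname_matches(dig_output: str, expected_target: str) -> bool:
    """Return True if the `dig +short CNAME` output names the expected target."""
    answers = [normalise(line) for line in dig_output.splitlines() if line.strip()]
    return normalise(expected_target) in answers


def query_cname(domain: str, resolver: str = "8.8.8.8") -> str:
    """Run dig against a specific resolver and return its raw output."""
    result = subprocess.run(
        ["dig", "+short", "CNAME", domain, f"@{resolver}"],
        capture_output=True, text=True, check=True, timeout=15,
    )
    return result.stdout
```

Pass the output of query_cname for your domain, together with the target from the dashboard, to cname_matches. An empty answer means the record has not propagated to that resolver yet.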