Serverless, GPU Accelerated OpenAI API Using Podman, Llama.cpp, and systemd - No OpenFaaS Here!


Recently, I began putting together a new personal server for large language model training and inference. I was really interested in experimenting more with llama models and wanted the faster iterations that a local development machine would bring.

TLDR: Script here that creates a podman container, systemd service, and socket that supports running any gguf model via llama.cpp on Nvidia GPUs in a serverless fashion.

We want

to serve GPU accelerated requests from a large language model (LLM) like Llama-2-chat-70b or the currently top-of-the-leaderboard ShiningValiant 70b model using llama-cpp-python, which provides an openAI API compatible server that can be used by the matrix-chatgpt-bot or the excellent ChatGPT-Next-Web.

The Problem

is that with dual 24GB Nvidia RTX 4090 GPUs, a 4 bit quantized 70 billion parameter model takes ~80-90% of the VRAM of both cards. If I want to do anything else with the GPUs, I need to shut down the llama.cpp instance. Not very useful for serving a local version of the openAI API!


This seems like a problem for serverless/functions as a service right? Spin something up per request? I am already using podman containers for everything, so this should be very simple, right?

OpenFaaS has been around a while. If you are not familiar with it, this is the introduction from their docs: "OpenFaaS® makes it easy for developers to deploy event-driven functions and microservices to Kubernetes without repetitive, boiler-plate coding. Package your code or an existing binary in a Docker image to get a highly scalable endpoint with auto-scaling and metrics."

Wait - Kubernetes? I don't want to set up a Kubernetes cluster! That's WAY more work than I want to do! We need another option!


As it turns out, systemd can actually help us here! Podman already integrates very nicely with systemd. Generating a .service file for a container is as simple as:

# generates container_name.service for you!
podman generate systemd -f -n container_name --new 

Based on this article, systemd supports the ability to start another .service application when a socket is opened, called "socket activation". Essentially, systemd receives a connection, starts another service, and hands that already connected socket to that newly started service. This setup is basically the same as what is in the article, just adapted to using rootful instead of rootless containers as I don't know how to get rootless containers working with Nvidia GPUs...

One issue is that many services don't support the socket activation feature. In order to work around this, systemd provides a proxy: systemd-socket-proxyd. It essentially forwards data between the already opened socket and a new socket with the application being started by socket activation.

In order to set this up, we will need a .service file for our llama-cpp-python server, a .service file for systemd-socket-proxyd, and a .socket file. Let's set up our llama-cpp-python container and generate the container service file first, as we need to modify it slightly:

# Only do this if you have not set up Podman with your Nvidia GPUs yet:
# From
# If you don't have Nvidia drivers/cuda set up, do that first!
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Optional should print:
nvidia-ctk cdi list 

# Setup a new podman container for our openAI API server
sudo podman create --name llama-oai \
    -p 18080:18080 \
    --ulimit memlock=-1:-1 \
    --device \
    --security-opt=label=disable \
    -v /where/you/put/models/:/models:Z \ \
    python3 -m llama_cpp.server \
    --model /models/shiningvaliant-1.2.Q4_K_M.gguf \
    --n_gpu_layers 90 \
    --port 18080 \
    --host \
    --n_ctx 4096

# Generate the systemd service file, removing the [Install] section,
# adding StopWhenUnneeded and ExecStartPost.
sudo podman generate systemd --new --name llama-oai | \
    sed -z 's/\[Install\]\' | \
    sed '/RequiresMountsFor=%t\/containers/a \
      StopWhenUnneeded=yes' | \
    sed '/NotifyAccess=all/a ExecStartPost=\/usr\/bin\/timeout 30 \
      sh -c 'while ! curl http:\/\/\/v1\/models >& \
      \/dev\/null; do sleep 1; done\'' | \
    sudo tee /etc/systemd/system/llama-oai.service
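
If the sed pipeline proves fragile, the same two settings could also be layered on with a systemd drop-in rather than by rewriting the generated unit. The following is a minimal sketch of that alternative, not the method used in this post. It assumes the unmodified generated unit is installed as llama-oai.service and that the server answers on 127.0.0.1:18080 (the address is an assumption; the port comes from the podman create command). DROPIN_DIR is a hypothetical staging variable so the file can be inspected before it goes into /etc/systemd/system/llama-oai.service.d/. The drop-in does not remove the [Install] section; that section is only consulted by systemctl enable, so leaving the unit disabled has the same effect.

```shell
# Sketch: write a drop-in that adds the two directives without sed.
# DROPIN_DIR is a hypothetical staging directory; the real location would be
# /etc/systemd/system/llama-oai.service.d/ (writing there requires root).
DROPIN_DIR="${DROPIN_DIR:-./llama-oai.service.d}"
mkdir -p "$DROPIN_DIR"

# The quoted 'EOF' keeps the shell from expanding anything in the unit text.
# 127.0.0.1:18080 is an assumed address; adjust it to where the server listens.
cat > "$DROPIN_DIR/override.conf" << 'EOF'
[Unit]
StopWhenUnneeded=yes

[Service]
ExecStartPost=/usr/bin/timeout 30 sh -c 'while ! curl -sf http://127.0.0.1:18080/v1/models > /dev/null 2>&1; do sleep 1; done'
EOF

echo "wrote $DROPIN_DIR/override.conf"
```

After copying the directory into /etc/systemd/system/ and running sudo systemctl daemon-reload, systemctl cat llama-oai.service should show the generated unit followed by the drop-in.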

An example of how the llama-oai.service file should look now:

Description=Podman llama-oai.service

ExecStart=/usr/bin/podman run \
    --cidfile=%t/%n.ctr-id \
    --cgroups=no-conmon \
    --rm \
    --sdnotify=conmon \
    -d \
    --replace \
    --name llama-oai \
    -p 18080:18080 \
    --ulimit memlock=-1:-1 \
    --device \
    --security-opt=label=disable \
    -v /home/user/code/llama.cpp/models/:/models:Z \ \
    python -m llama_cpp.server \
    --model /models/shiningvaliant-1.2.Q4_K_M.gguf \
    --n_gpu_layers 90 \
    --port 18080 \
    --host \
    --n_ctx 4096
ExecStop=/usr/bin/podman stop \
    --ignore -t 10 \
ExecStopPost=/usr/bin/podman rm \
    -f \
    --ignore -t 10 \
ExecStartPost=/usr/bin/timeout 30 sh -c 'while ! curl >& /dev/null; do sleep 1; done'

StopWhenUnneeded and ExecStartPost

StopWhenUnneeded and ExecStartPost are both pretty self-explanatory, but worth noting anyway.

  • StopWhenUnneeded will shut down the service when the systemd-socket-proxyd service stops. Without this, the service will boot up and continue to run until stopped manually.
  • ExecStartPost is also important. As you might expect, it runs the specified command after the ExecStart command for the service is called. This lets us keep the service from being marked as up/active until we get a real response on /v1/models from the openAI API being hosted. As it turns out, the default behavior was to mark the service as active as soon as the ExecStart command began executing, and uvicorn opens its ports right away. So the service looks up and the port looks open, but it is all lies. Any requests in this period will fail unless we tell systemd to hold off on marking our service active until /v1/models and the rest of the app are actually up and ready to serve.
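
The readiness check that ExecStartPost performs can be tried out on its own before it is baked into a unit file. Below is a minimal sketch of the same polling idea as a shell function; wait_until is a hypothetical helper name, and the address in the commented example is an assumption. Unlike timeout(1), it counts attempts rather than wall-clock time, so a slow probe command stretches the total wait.

```shell
# wait_until ATTEMPTS COMMAND [ARGS...]
# Tries COMMAND up to ATTEMPTS times, sleeping one second between attempts.
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt fails.
# (wait_until is a hypothetical helper, not part of systemd or podman.)
wait_until() {
    remaining=$1
    shift
    while ! "$@" > /dev/null 2>&1; do
        remaining=$((remaining - 1))
        if [ "$remaining" -le 0 ]; then
            return 1
        fi
        sleep 1
    done
    return 0
}

# Example: give the model list endpoint about 30 seconds to answer.
# The address is an assumption; -f makes curl fail on HTTP error codes.
# wait_until 30 curl -sf http://127.0.0.1:18080/v1/models
```

If the function returns 1 for a server that does eventually come up, the 30 second limit in ExecStartPost is probably too short for that model and disk.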

systemd-socket-proxyd Setup

Now we need to create the socket for the proxy:

cat << EOF > llama-oai-proxy.socket


And then we create the service for the proxy:

cat << EOF > llama-oai-proxy.service 

ExecStart=/usr/lib/systemd/systemd-socket-proxyd --exit-idle-time=30s

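
A minimal sketch of what the two proxy units might contain, modeled on the example in the systemd-socket-proxyd man page. The listening port 8000, the 30s idle time, and the unit names come from the commands in this post; binding to 0.0.0.0, forwarding to 127.0.0.1:18080, and the Requires=/After= wiring are assumptions rather than the exact files used here.

```shell
# Sketch: create both proxy unit files in the current directory.
# Quoted 'EOF' prevents the shell from expanding anything in the unit text.

# Socket unit: systemd listens on port 8000 and starts the same-named
# proxy service on the first connection. 0.0.0.0 is an assumption.
cat > llama-oai-proxy.socket << 'EOF'
[Socket]
ListenStream=0.0.0.0:8000

[Install]
WantedBy=sockets.target
EOF

# Proxy service: Requires=/After= pull in llama-oai.service and wait for it
# to become active before the proxy starts forwarding traffic.
# 127.0.0.1:18080 is the assumed address of the llama-cpp-python server.
cat > llama-oai-proxy.service << 'EOF'
[Unit]
Requires=llama-oai.service
After=llama-oai.service
Requires=llama-oai-proxy.socket
After=llama-oai-proxy.socket

[Service]
ExecStart=/usr/lib/systemd/systemd-socket-proxyd --exit-idle-time=30s 127.0.0.1:18080
EOF
```

With this wiring, once the proxy exits after 30 idle seconds nothing requires llama-oai.service any more, so StopWhenUnneeded=yes stops the container and frees the GPUs.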

We need to reload systemd, then enable and start our socket and services:

# Move the created files to /etc/systemd/system/:
sudo mv llama-oai-proxy* /etc/systemd/system/

# Reload, enable & start the proxy:
sudo systemctl daemon-reload
sudo systemctl enable --now llama-oai-proxy.socket

# Check that the socket is open:
netstat -ltpn | grep 8000

# Finally, test the openAI API:
curl  http://localhost.local:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "shiningvaliant-1.2.Q4_K_M.gguf", "prompt": "What is the highest resolution Phase One camera?", "temperature": 0.1, "max_tokens": 128}'

# Example response:
# {"id":"cmpl-05cbae98-64d3-4ee3-9da5-720df68a4cf7","object":"text_completion","created":1698344737,"model":"shiningvaliant-1.2.Q4_K_M.gguf","choices":[{"text":"\nThe highest resolution Phase One camera as of now (2021) is the XF IQ4 150MP, which features a 151-megapixel CMOS sensor. This medium format digital back offers exceptional image quality and detail for professional photographers in various fields such as fashion, advertising, landscape, and fine art photography.","index":0,"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":11,"completion_tokens":81,"total_tokens":92}}


  • I built my own container for llama-cpp-python with a small modification to the server - ChatGPT-Next-Web does not send a max_tokens when it requests a completion, and the server has a very low default of 64 in this case. I modified the default to be set by an environment variable and built the container from that.
  • You can set a longer or shorter idle time than 30s, depending on your use case, SSD speed, etc.
  • A reasonably fast SSD is probably required. Using an Optane 905P means I can load a ~40GB 70B Llama model in about 5 seconds, which is not a bad cold start time compared to what I have experienced with runpod. Other SSDs may be faster; the 905P is more about latency than raw bandwidth.