Run AI Locally: Complete Guide to llama.cpp + WordPress

Set up a powerful, private AI assistant on your WordPress site using local models. No API keys. No subscriptions. Full control.

Published: April 2026 | 12 min read | For Developers


Why Run AI Locally? The Privacy-First Revolution

Every time you use a cloud AI API, your data travels across the internet to someone else’s servers. Whether the provider is OpenAI, Anthropic, or Google, your requests may be logged and retained under its data policies. Your content, your users’ interactions, your business logic—it’s all third-party data.

What if you could run a powerful AI model on your own hardware, keeping everything private, offline-capable, and completely under your control?

That’s exactly what llama.cpp makes possible.

💡 What you’ll learn: This guide covers three deployment scenarios—from your laptop to a GPU rig on your LAN to a production cloud server. Pick the one that fits your setup.

The Case for Local AI

  • Privacy: Data never leaves your infrastructure
  • Cost: No per-request fees; one-time hardware investment
  • Latency: No network round-trips; responses start streaming almost instantly on modern hardware
  • Ownership: Full control over model behavior and updates
  • Offline: Works without internet (for local deployments)

The catch? You need to run the models yourself. But that’s where llama.cpp comes in—it makes that easy.


Installation & Setup

What is llama.cpp?

llama.cpp is a lightweight C++ inference engine that runs language models locally. It’s blazingly fast, supports GPU acceleration, and requires minimal dependencies. Think of it as the “Apache for AI models.”

Install llama.cpp on macOS

The fastest way is via Homebrew:

brew install llama.cpp

Verify the installation:

llama-server --help

Install on Linux

Build from source (takes 5–10 minutes):

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release

The resulting binary is at build/bin/llama-server.
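
If your build machine has an NVIDIA GPU, you can enable CUDA support at configure time. A minimal sketch, assuming a recent llama.cpp tree where the CMake option is called GGML_CUDA (older versions used a different option name):

# Enable CUDA when configuring, then rebuild
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release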

For details, visit the official llama.cpp downloads page.


Downloading & Choosing Models

What is GGUF?

GGUF is the binary format that llama.cpp uses. It’s optimized for inference speed and memory efficiency. Models are hosted on Hugging Face in multiple quantization levels.

| Quantization | Size | Quality | Speed | Best For |
|---|---|---|---|---|
| Q2_K | Smallest | Lower | Fastest | Devices with <2GB RAM |
| Q4_K_M | Small | Good | Fast | Laptops, modest servers |
| Q5_K_M | Medium | Better | Moderate | Better quality, still efficient |
| Q8_0 | Largest | Best | Slowest | High-end GPUs, unlimited RAM |

Download a Model

Install the Hugging Face CLI:

pip install -U huggingface_hub

Download TinyLlama (636 MB, great for testing):

huggingface-cli download \
  TheBloke/TinyLlama-1.1B-Chat-GGUF \
  tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
  --local-dir ~/models \
  --local-dir-use-symlinks False

Verify the download:

ls -lh ~/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

Recommended Starter Models

| Model | Size | Use Case |
|---|---|---|
| TinyLlama 1.1B (Q4_K_M) | 636 MB | Testing, ultra-low resource |
| Phi-3 Mini 3.8B (Q4_K_M) | 2.2 GB | Fast & practical |
| Mistral 7B (Q4_K_M) | 4.1 GB | High quality |
| Llama 3 8B (Q4_K_M) | 4.7 GB | Best-in-class, needs GPU or 16GB+ RAM |
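
Any of these can be downloaded the same way as TinyLlama above. As a sketch, here’s how you might pull a Q4 build of Phi-3 Mini; the repository and file names below are assumptions from memory, so confirm them on the model’s Hugging Face page first:

huggingface-cli download \
  microsoft/Phi-3-mini-4k-instruct-gguf \
  Phi-3-mini-4k-instruct-q4.gguf \
  --local-dir ~/models \
  --local-dir-use-symlinks False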

Scenario 1: Same Machine (Localhost)

This is the simplest setup. WordPress and llama.cpp both run on your laptop or desktop.

🖥️ Localhost Setup

Best for: Local development, prototyping, single-user testing

Requirements: One machine with enough RAM for your chosen model

Complexity: ⭐ (Easiest)

Step 1: Start the llama.cpp Server

llama-server --models-dir ~/models

The server will start on http://127.0.0.1:8080. You’ll see output confirming the model loaded successfully.
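
You can also verify the API from the command line before touching WordPress. llama-server speaks the OpenAI-compatible API, so a plain curl request should work (a minimal sketch; with a single loaded model you typically don’t need to pass a model name):

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in five words."}]}'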

Step 2: Open the Web UI (Optional)

Open your browser to http://127.0.0.1:8080/ and start chatting with your model immediately. This verifies everything works before WordPress integration.

Step 3: Configure the WordPress Plugin

  1. Go to Settings → AI Provider for llama.cpp
  2. Set the Server URL to http://127.0.0.1:8080
  3. Click Save

The plugin will auto-detect your available models. WordPress now has local AI powers.

✅ What you get: Instant inference on your machine. No internet needed. Responses in 1–3 seconds. Perfect for writing assistance, content generation, and brainstorming.


Scenario 2: Local Network (LAN)

Run llama.cpp on one machine (e.g., a dedicated GPU rig) and access it from WordPress on another machine on the same network.

🌐 Local Network Setup

Best for: Dedicated inference machine, multi-user teams, leveraging a GPU rig

Requirements: Two machines on the same WiFi/Ethernet network

Complexity: ⭐⭐ (Easy, with networking basics)

Step 1: Start the Server with Network Access

On the machine with the model, start the server with --host 0.0.0.0 to accept network connections:

llama-server \
  --models-dir ~/models \
  --host 0.0.0.0 \
  --port 8080

Step 2: Find the Server’s Local IP

On macOS:

ipconfig getifaddr en0

On Linux:

hostname -I

You’ll see something like 192.168.1.50. Note this down.

Step 3: Test from Your WordPress Machine

curl http://192.168.1.50:8080/v1/models

If you get a JSON response listing your models, you’re connected. If not, check:

  • Both machines are on the same network (WiFi/Ethernet)
  • Firewall isn’t blocking port 8080 (see the ufw sketch below)
  • The IP address is correct (try ping 192.168.1.50)
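
If the firewall turns out to be the problem on a Debian/Ubuntu inference machine, you can open port 8080 to the LAN only. This is a sketch assuming ufw and a 192.168.1.0/24 subnet; adjust both to your network:

# Allow LAN clients to reach llama-server; everything else stays closed
sudo ufw allow from 192.168.1.0/24 to any port 8080 proto tcp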

Step 4: Configure WordPress

  1. Go to Settings → AI Provider for llama.cpp
  2. Set the Server URL to http://192.168.1.50:8080 (replace with your IP)
  3. Click Save

💡 Pro tip: Assign a static IP in your router settings so the inference machine’s address doesn’t change.


Scenario 3: Remote Server (Internet)

Expose llama.cpp to the internet so you can access it from anywhere. This requires a secure tunnel.

☁️ Remote Server Setup

Best for: Cloud servers, production deployments, multi-location teams

Requirements: Cloud VM (AWS, DigitalOcean, etc.) and a tunnel service

Complexity: ⭐⭐⭐ (Most involved, but straightforward)

Step 1: Start with Authentication

llama-server \
  --models-dir ~/models \
  --host 0.0.0.0 \
  --api-key your-secret-key-here

⚠️ Security Critical: Never run a public llama.cpp server without an API key. Anyone on the internet could make requests and consume your resources. Always use --api-key.
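
Before tunneling anything, confirm the key is actually enforced. A quick sketch using the placeholder key from the command above; the first request should be rejected, the second should return your model list:

# No key: should be rejected
curl http://localhost:8080/v1/models

# With key: should succeed
curl -H "Authorization: Bearer your-secret-key-here" http://localhost:8080/v1/models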

Step 2: Create a Tunnel with Cloudflare (Recommended)

Install cloudflared:

On macOS:

brew install cloudflared

On Linux (Debian/Ubuntu):

curl -L https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb -o cloudflared.deb
sudo dpkg -i cloudflared.deb

Option A: Quick Tunnel (No Account Needed)

cloudflared tunnel --url http://localhost:8080

You’ll get a public HTTPS URL like https://something-random.trycloudflare.com. This URL changes when you restart.

Option B: Named Tunnel (Stable URL)

For production, use a named tunnel with a permanent URL. First, sign up for a free Cloudflare account and add your domain.

Set up (one time):

cloudflared tunnel login
cloudflared tunnel create llama
cloudflared tunnel route dns llama llama.yourdomain.com

Run it:

cloudflared tunnel run --url http://localhost:8080 llama

Now your server is available at https://llama.yourdomain.com—permanently, with automatic HTTPS.
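
Before pointing WordPress at it, it’s worth testing the full path once: public hostname, tunnel, llama-server. A sketch using the placeholder key from Step 1:

curl https://llama.yourdomain.com/v1/chat/completions \
  -H "Authorization: Bearer your-secret-key-here" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "ping"}]}'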

✨ Why Cloudflare? Free tier includes unlimited tunnels, automatic HTTPS, DDoS protection, and no bandwidth limits. Perfect for self-hosted AI.

Step 3: Configure WordPress

Set the Server URL to your tunnel URL (e.g., https://llama.yourdomain.com or https://something-random.trycloudflare.com).

🚀 You’re live: Your WordPress site now has remote AI inference. All encrypted with HTTPS.


Security & Best Practices

API Keys & Authentication

  • Always use --api-key for remote servers. Without it, anyone can abuse your inference.
  • Use a strong, random key: openssl rand -hex 32
  • Rotate keys periodically (every 90 days in production)
  • Store keys in environment variables or a secrets manager, never in code
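
Putting those points together, one way to wire it up on the server is to generate the key into a shell variable and pass it to llama-server explicitly. A sketch; LLAMA_API_KEY is just a variable name chosen here, not something llama.cpp reads on its own:

# Generate a strong random key and keep it out of your code
export LLAMA_API_KEY="$(openssl rand -hex 32)"

# Pass it to the server explicitly
llama-server --models-dir ~/models --host 0.0.0.0 --api-key "$LLAMA_API_KEY"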

Network Security

  • LAN: Use --host 0.0.0.0 only on private networks. Firewalls should block external access to port 8080.
  • Remote: Always use HTTPS (Cloudflare Tunnel provides this). Never expose HTTP to the internet.
  • Both: Consider a reverse proxy (nginx, Caddy) for rate limiting and additional authentication

Performance Optimization

Choose the Right Model Size

  • 1–3B models: 500ms–2s per response (fast interactions)
  • 7–8B models: 2–10s per response (high quality, but slower)
  • 13B+ models: 10–60s per response (production-grade, needs serious hardware)

For WordPress, start with a 3–7B model. It’s the sweet spot for quality and speed.

Quantization Impact

  • Q2_K → Fastest speed, lower quality
  • Q4_K_M → Best balance (recommended)
  • Q8_0 → Best quality, slower inference

Advanced Flags for Speed

llama-server \
  --models-dir ~/models \
  -t 8 \
  -b 256 \
  -c 2048 \
  --n-gpu-layers 99

Breakdown:

  • -t 8 — Use 8 CPU threads (adjust based on your core count)
  • -b 256 — Batch size: how many tokens are evaluated per step during prompt processing
  • -c 2048 — Context window (smaller = faster, but less context)
  • --n-gpu-layers 99 — Offload model layers to the GPU if one is available (usually the single biggest speed win)
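
To see how these settings behave on your hardware, llama.cpp also ships a llama-bench tool you can point at the same model file. A sketch, assuming llama-bench was installed alongside llama-server; -t sets threads and -ngl sets GPU layers, mirroring the server flags above:

llama-bench \
  -m ~/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
  -t 8 \
  -ngl 99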

Ready to Get Started?

You now have everything you need to run AI locally on WordPress. No cloud subscriptions. No API costs. Just pure, private, offline-first AI.

Download the AI Provider for llama.cpp plugin:

WordPress.org Plugin Directory
