Run AI Locally: Complete Guide to llama.cpp + WordPress

Set up a powerful, private AI assistant on your WordPress site using local models. No API keys. No subscriptions. Full control.

Published: April 2026 | 12 min read | For Developers


Why Run AI Locally? The Privacy-First Revolution

Every time you use a cloud AI API, your data travels across the internet to someone else’s servers. Whether the provider is OpenAI, Anthropic, or Google, your requests may be logged and retained under its data policies. Your content, your users’ interactions, your business logic—it’s all third-party data.

What if you could run a powerful AI model on your own hardware, keeping everything private, offline-capable, and completely under your control?

That’s exactly what llama.cpp makes possible.

💡 What you’ll learn: This guide covers three deployment scenarios—from your laptop to a GPU rig on your LAN to a production cloud server. Pick the one that fits your setup.

The Case for Local AI

  • Privacy: Data never leaves your infrastructure
  • Cost: No per-request fees; one-time hardware investment
  • Latency: No network round-trips; responses start streaming almost instantly on modern hardware
  • Ownership: Full control over model behavior and updates
  • Offline: Works without internet (for local deployments)

The catch? You need to run the models yourself. But that’s where llama.cpp comes in—it makes that easy.


Installation & Setup

What is llama.cpp?

llama.cpp is a lightweight C++ inference engine that runs language models locally. It’s blazingly fast, supports GPU acceleration, and requires minimal dependencies. Think of it as the “Apache for AI models.”

Install llama.cpp on macOS

The fastest way is via Homebrew:

brew install llama.cpp

Verify the installation:

llama-server --help

Install on Linux

Build from source (takes 5–10 minutes):

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release

The resulting binary is at build/bin/llama-server.
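
If your build machine has an NVIDIA GPU, you can enable CUDA support at configure time. A minimal sketch, assuming a recent llama.cpp tree where the CMake option is called GGML_CUDA (older versions used a different option name):

# Enable CUDA when configuring, then rebuild
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release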

For details, visit the official llama.cpp downloads page.


Downloading & Choosing Models

What is GGUF?

GGUF is the binary format that llama.cpp uses. It’s optimized for inference speed and memory efficiency. Models are hosted on Hugging Face in multiple quantization levels.

| Quantization | Size | Quality | Speed | Best For |
|---|---|---|---|---|
| Q2_K | Smallest | Lower | Fastest | Devices with <2GB RAM |
| Q4_K_M | Small | Good | Fast | Laptops, modest servers |
| Q5_K_M | Medium | Better | Moderate | Better quality, still efficient |
| Q8_0 | Largest | Best | Slowest | High-end GPUs, unlimited RAM |

Download a Model

Install the Hugging Face CLI:

pip install -U huggingface_hub

Download TinyLlama (636 MB, great for testing):

huggingface-cli download \
  TheBloke/TinyLlama-1.1B-Chat-GGUF \
  tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
  --local-dir ~/models \
  --local-dir-use-symlinks False

Verify the download:

ls -lh ~/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf

Recommended Starter Models

| Model | Size | Use Case |
|---|---|---|
| TinyLlama 1.1B (Q4_K_M) | 636 MB | Testing, ultra-low resource |
| Phi-3 Mini 3.8B (Q4_K_M) | 2.2 GB | Fast & practical |
| Mistral 7B (Q4_K_M) | 4.1 GB | High quality |
| Llama 3 8B (Q4_K_M) | 4.7 GB | Best-in-class, needs GPU or 16GB+ RAM |
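
Any of these can be downloaded the same way as TinyLlama above. As a sketch, here’s how you might pull a Q4 build of Phi-3 Mini; the repository and file names below are assumptions from memory, so confirm them on the model’s Hugging Face page first:

huggingface-cli download \
  microsoft/Phi-3-mini-4k-instruct-gguf \
  Phi-3-mini-4k-instruct-q4.gguf \
  --local-dir ~/models \
  --local-dir-use-symlinks False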

Scenario 1: Same Machine (Localhost)

This is the simplest setup. WordPress and llama.cpp both run on your laptop or desktop.

🖥️ Localhost Setup

Best for: Local development, prototyping, single-user testing

Requirements: One machine with enough RAM for your chosen model

Complexity: ⭐ (Easiest)

Step 1: Start the llama.cpp Server

llama-server --models-dir ~/models

The server will start on http://127.0.0.1:8080. You’ll see output confirming the model loaded successfully.
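
You can also verify the API from the command line before touching WordPress. llama-server speaks the OpenAI-compatible API, so a plain curl request should work (a minimal sketch; with a single loaded model you typically don’t need to pass a model name):

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in five words."}]}'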

Step 2: Open the Web UI (Optional)

Open your browser to http://127.0.0.1:8080/ and start chatting with your model immediately. This verifies everything works before WordPress integration.

Step 3: Configure the WordPress Plugin

  1. Go to Settings → AI Provider for llama.cpp
  2. Set the Server URL to http://127.0.0.1:8080
  3. Click Save

The plugin will auto-detect your available models. WordPress now has local AI powers.

✅ What you get: Instant inference on your machine. No internet needed. Responses in 1–3 seconds. Perfect for writing assistance, content generation, and brainstorming.


Scenario 2: Local Network (LAN)

Run llama.cpp on one machine (e.g., a dedicated GPU rig) and access it from WordPress on another machine on the same network.

🌐 Local Network Setup

Best for: Dedicated inference machine, multi-user teams, leveraging a GPU rig

Requirements: Two machines on the same WiFi/Ethernet network

Complexity: ⭐⭐ (Easy, with networking basics)

Step 1: Start the Server with Network Access

On the machine with the model, start the server with --host 0.0.0.0 to accept network connections:

llama-server \
  --models-dir ~/models \
  --host 0.0.0.0 \
  --port 8080

Step 2: Find the Server’s Local IP

On macOS:

ipconfig getifaddr en0

On Linux:

hostname -I

You’ll see something like 192.168.1.50. Note this down.

Step 3: Test from Your WordPress Machine

curl http://192.168.1.50:8080/v1/models

If you get a JSON response listing your models, you’re connected. If not, check:

  • Both machines are on the same network (WiFi/Ethernet)
  • Firewall isn’t blocking port 8080 (see the ufw sketch below)
  • The IP address is correct (try ping 192.168.1.50)
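
If the firewall turns out to be the problem on a Debian/Ubuntu inference machine, you can open port 8080 to the LAN only. This is a sketch assuming ufw and a 192.168.1.0/24 subnet; adjust both to your network:

# Allow LAN clients to reach llama-server; everything else stays closed
sudo ufw allow from 192.168.1.0/24 to any port 8080 proto tcp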

Step 4: Configure WordPress

  1. Go to Settings → AI Provider for llama.cpp
  2. Set the Server URL to http://192.168.1.50:8080 (replace with your IP)
  3. Click Save

💡 Pro tip: Assign a static IP in your router settings so the inference machine’s address doesn’t change.


Scenario 3: Remote Server (Internet)

Expose llama.cpp to the internet so you can access it from anywhere. This requires a secure tunnel.

☁️ Remote Server Setup

Best for: Cloud servers, production deployments, multi-location teams

Requirements: Cloud VM (AWS, DigitalOcean, etc.) and a tunnel service

Complexity: ⭐⭐⭐ (Most involved, but straightforward)

Step 1: Start with Authentication

llama-server \
  --models-dir ~/models \
  --host 0.0.0.0 \
  --api-key your-secret-key-here

⚠️ Security Critical: Never run a public llama.cpp server without an API key. Anyone on the internet could make requests and consume your resources. Always use --api-key.
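
Before tunneling anything, confirm the key is actually enforced. A quick sketch using the placeholder key from the command above; the first request should be rejected, the second should return your model list:

# No key: should be rejected
curl http://localhost:8080/v1/models

# With key: should succeed
curl -H "Authorization: Bearer your-secret-key-here" http://localhost:8080/v1/models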

Step 2: Create a Tunnel with Cloudflare (Recommended)

Install cloudflared:

On macOS:

brew install cloudflared

On Linux (Debian/Ubuntu):

curl -L https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb -o cloudflared.deb
sudo dpkg -i cloudflared.deb

Option A: Quick Tunnel (No Account Needed)

cloudflared tunnel --url http://localhost:8080

You’ll get a public HTTPS URL like https://something-random.trycloudflare.com. This URL changes when you restart.

Option B: Named Tunnel (Stable URL)

For production, use a named tunnel with a permanent URL. First, sign up for a free Cloudflare account and add your domain.

Set up (one time):

cloudflared tunnel login
cloudflared tunnel create llama
cloudflared tunnel route dns llama llama.yourdomain.com

Run it:

cloudflared tunnel run --url http://localhost:8080 llama

Now your server is available at https://llama.yourdomain.com—permanently, with automatic HTTPS.
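
Before pointing WordPress at it, it’s worth testing the full path once: public hostname, tunnel, llama-server. A sketch using the placeholder key from Step 1:

curl https://llama.yourdomain.com/v1/chat/completions \
  -H "Authorization: Bearer your-secret-key-here" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "ping"}]}'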

✨ Why Cloudflare? Free tier includes unlimited tunnels, automatic HTTPS, DDoS protection, and no bandwidth limits. Perfect for self-hosted AI.

Step 3: Configure WordPress

Set the Server URL to your tunnel URL (e.g., https://llama.yourdomain.com or https://something-random.trycloudflare.com).

🚀 You’re live: Your WordPress site now has remote AI inference. All encrypted with HTTPS.


Security & Best Practices

API Keys & Authentication

  • Always use --api-key for remote servers. Without it, anyone can abuse your inference.
  • Use a strong, random key: openssl rand -hex 32
  • Rotate keys periodically (every 90 days in production)
  • Store keys in environment variables or a secrets manager, never in code
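
Putting those points together, one way to wire it up on the server is to generate the key into a shell variable and pass it to llama-server explicitly. A sketch; LLAMA_API_KEY is just a variable name chosen here, not something llama.cpp reads on its own:

# Generate a strong random key and keep it out of your code
export LLAMA_API_KEY="$(openssl rand -hex 32)"

# Pass it to the server explicitly
llama-server --models-dir ~/models --host 0.0.0.0 --api-key "$LLAMA_API_KEY"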

Network Security

  • LAN: Use --host 0.0.0.0 only on private networks. Firewalls should block external access to port 8080.
  • Remote: Always use HTTPS (Cloudflare Tunnel provides this). Never expose HTTP to the internet.
  • Both: Consider a reverse proxy (nginx, Caddy) for rate limiting and additional authentication

Performance Optimization

Choose the Right Model Size

  • 1–3B models: 500ms–2s per response (fast interactions)
  • 7–8B models: 2–10s per response (high quality, but slower)
  • 13B+ models: 10–60s per response (production-grade, needs serious hardware)

For WordPress, start with a 3–7B model. It’s the sweet spot for quality and speed.

Quantization Impact

  • Q2_K → Fastest speed, lower quality
  • Q4_K_M → Best balance (recommended)
  • Q8_0 → Best quality, slower inference

Advanced Flags for Speed

llama-server \
  --models-dir ~/models \
  -t 8 \
  -b 256 \
  -c 2048 \
  --n-gpu-layers 99

Breakdown:

  • -t 8 — Use 8 CPU threads (adjust based on your core count)
  • -b 256 — Batch size: how many tokens are evaluated per step during prompt processing
  • -c 2048 — Context window (smaller = faster, but less context)
  • --n-gpu-layers 99 — Offload model layers to the GPU if one is available (usually the single biggest speed win)
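
To see how these settings behave on your hardware, llama.cpp also ships a llama-bench tool you can point at the same model file. A sketch, assuming llama-bench was installed alongside llama-server; -t sets threads and -ngl sets GPU layers, mirroring the server flags above:

llama-bench \
  -m ~/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
  -t 8 \
  -ngl 99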

Ready to Get Started?

You now have everything you need to run AI locally on WordPress. No cloud subscriptions. No API costs. Just pure, private, offline-first AI.

Download the AI Provider for llama.cpp plugin:

WordPress.org Plugin Directory
