Termux llama-server Setup Guide (July 2025 Updated)

Created: 2025-07-06 19:12:47 | Last updated: 2025-07-06 19:12:47 | Status: Public

Prerequisites & Environment Setup

1. Update Termux

pkg update && pkg upgrade -y

2. Install Required Packages

pkg install -y git cmake make clang python wget curl
pkg install -y build-essential opencl-headers

3. Install Storage Access (Optional)

termux-setup-storage

GPU Setup for Snapdragon 8 Gen 3 (July 2025)

1. Check for GPU OpenCL Support

# Check if device has OpenCL support
ls -la /system/vendor/lib64/ | grep -i opencl
ls -la /vendor/lib64/ | grep -i opencl

# You should see libOpenCL.so and libOpenCL_adreno.so

2. Copy OpenCL Libraries (CRITICAL for Termux)

# Copy Qualcomm OpenCL libraries to termux for proper detection
cp /vendor/lib64/libOpenCL.so ~/
cp /vendor/lib64/libOpenCL_adreno.so ~/

# Add library path to environment permanently
echo 'export LD_LIBRARY_PATH=$HOME:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

# Verify libraries are accessible
ls -la ~/libOpenCL*

Build llama.cpp (July 2025 Version)

1. Clone Latest Repository

cd ~
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

# Verify you have the latest version with Adreno support (post-December 2024)
git log --oneline | head -3

2. Create Build Directory

mkdir build
cd build

3. Configure with Latest Adreno Backend

# Configure with July 2025 Adreno-optimized OpenCL backend
cmake .. \
    -DLLAMA_BUILD_SERVER=ON \
    -DGGML_OPENCL=ON \
    -DGGML_OPENCL_USE_ADRENO_KERNELS=ON \
    -DCMAKE_BUILD_TYPE=Release

4. Build

cmake --build . --config Release -j$(nproc)
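
If you only need the server binary, you can build just that target instead of everything; this assumes the target name llama-server, which recent llama.cpp checkouts use:

# Build only llama-server (optional, shortens compile time)
cmake --build . --config Release -j$(nproc) --target llama-server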

5. Verify Adreno Backend Success

# Test GPU detection - should show successful Adreno setup
export LD_LIBRARY_PATH=$HOME:$LD_LIBRARY_PATH
./bin/llama-server --version

# Look for these success indicators:
# - "QUALCOMM Snapdragon(TM)" platform (not "clvk")
# - "OpenCL will use matmul kernels optimized for Adreno"
# - "using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)"
# - No "drop unsupported device" messages

Download Model

1. Create Model Directory

mkdir -p ~/models
cd ~/models

2. Download Qwen3-4B Model (Updated Recommendations)

# RECOMMENDED: Q4_0 for best performance (Adreno-optimized)
wget https://huggingface.co/bartowski/Qwen_Qwen3-4B-GGUF/resolve/main/Qwen3-4B-Q4_0.gguf

# Alternative: Q5_K_M for higher quality (CPU performs better)
# wget https://huggingface.co/bartowski/Qwen_Qwen3-4B-GGUF/resolve/main/Qwen3-4B-Q5_K_M.gguf

3. Model Performance Summary (Snapdragon 8 Gen 3 + Adreno 750)

Based on real-world testing:

Q4_0 Model:
- CPU-only: 8.30 tokens/second (recommended for power efficiency)
- GPU (25 layers): 8.81 tokens/second (marginally faster generation)
- GPU excels at prompt processing (57.86 vs 41.60 tok/s)

Q5_K_M Model:
- CPU-only: 7.15 tokens/second (much faster than GPU)
- GPU (25 layers): 2.67 tokens/second (avoid GPU for this quantization)

Recommendation: Use Q4_0 with CPU-only for best balance of performance and power efficiency.
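
To reproduce benchmarks like these on your own device, llama-bench (built alongside llama-server by default) measures prompt processing and generation speed; this is a minimal sketch that assumes the model path from the download step and sweeps the CPU-only and 25-layer GPU configurations in one run:

export LD_LIBRARY_PATH=$HOME:$LD_LIBRARY_PATH

# -ngl accepts a comma-separated list, so 0 (CPU-only) and 25 (GPU) are tested back to back
~/llama.cpp/build/bin/llama-bench \
    -m ~/models/Qwen3-4B-Q4_0.gguf \
    -t 6 \
    -ngl 0,25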

Configure llama-server

1. Create Optimal Startup Script (Q4_0, CPU-Only - Recommended)

cd ~/llama.cpp/build
cat > start_qwen3_optimal.sh << 'EOF'
#!/bin/bash

# Set OpenCL library path (required even for CPU-only)
export LD_LIBRARY_PATH=$HOME:$LD_LIBRARY_PATH

MODEL_PATH="$HOME/models/Qwen3-4B-Q4_0.gguf"
HOST="127.0.0.1"
PORT="8080"
CONTEXT_SIZE="4096"
THREADS=6  # Optimal for Snapdragon 8 Gen 3

echo "Starting Qwen3-4B server (CPU-optimized for best performance)..."
echo "Model: $MODEL_PATH"
echo "Access: http://$HOST:$PORT"
echo "Threads: $THREADS"
echo "Context: $CONTEXT_SIZE"
echo "Expected performance: ~8.30 tokens/second"
echo ""

# CPU-only for optimal performance
./bin/llama-server \
    --model "$MODEL_PATH" \
    --host "$HOST" \
    --port "$PORT" \
    --ctx-size "$CONTEXT_SIZE" \
    --threads "$THREADS" \
    --n-gpu-layers 0 \
    --chat-template chatml \
    --verbose
EOF

chmod +x start_qwen3_optimal.sh

2. Alternative GPU Script (Q4_0 with GPU - Experimental)

cat > start_qwen3_gpu.sh << 'EOF'
#!/bin/bash

# Set OpenCL library path for Adreno GPU detection
export LD_LIBRARY_PATH=$HOME:$LD_LIBRARY_PATH

MODEL_PATH="$HOME/models/Qwen3-4B-Q4_0.gguf"
HOST="127.0.0.1"
PORT="8080"
CONTEXT_SIZE="4096"
THREADS=6

echo "Starting Qwen3-4B server with GPU acceleration (experimental)..."
echo "Model: $MODEL_PATH"
echo "Access: http://$HOST:$PORT"
echo "Threads: $THREADS"
echo "Context: $CONTEXT_SIZE"
echo "Expected performance: ~8.81 tokens/second (faster prompts)"
echo ""

# GPU acceleration - only works well with Q4_0
./bin/llama-server \
    --model "$MODEL_PATH" \
    --host "$HOST" \
    --port "$PORT" \
    --ctx-size "$CONTEXT_SIZE" \
    --threads "$THREADS" \
    --n-gpu-layers 25 \
    --chat-template chatml \
    --verbose
EOF

chmod +x start_qwen3_gpu.sh

3. Legacy Q5_K_M Script (Higher Quality, CPU-Only)

cat > start_qwen3_quality.sh << 'EOF'
#!/bin/bash

MODEL_PATH="$HOME/models/Qwen3-4B-Q5_K_M.gguf"
HOST="127.0.0.1"
PORT="8080"
CONTEXT_SIZE="4096"
THREADS=6

echo "Starting Qwen3-4B server (Q5_K_M for higher quality)..."
echo "Model: $MODEL_PATH"
echo "Access: http://$HOST:$PORT"
echo "Threads: $THREADS"
echo "Context: $CONTEXT_SIZE"
echo "Expected performance: ~7.15 tokens/second"
echo ""

# CPU-only - GPU performance is poor with Q5_K_M
./bin/llama-server \
    --model "$MODEL_PATH" \
    --host "$HOST" \
    --port "$PORT" \
    --ctx-size "$CONTEXT_SIZE" \
    --threads "$THREADS" \
    --n-gpu-layers 0 \
    --chat-template chatml \
    --verbose
EOF

chmod +x start_qwen3_quality.sh

Run the Server

1. Start with Optimal Settings (Recommended)

cd ~/llama.cpp/build
./start_qwen3_optimal.sh

2. Alternative Performance Options

# For experimental GPU testing (Q4_0 only)
./start_qwen3_gpu.sh

# For higher quality output (slower)
./start_qwen3_quality.sh

3. Quick One-Liner (Optimal Settings)

export LD_LIBRARY_PATH=$HOME:$LD_LIBRARY_PATH && ./bin/llama-server -m ~/models/Qwen3-4B-Q4_0.gguf -c 4096 -t 6 -ngl 0 --chat-template chatml
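
To keep the server running while the screen is off or Termux is in the background, one option is to hold a wake lock and start the script inside a detached tmux session; the sketch below installs tmux first and uses termux-wake-lock, which ships with Termux:

# Prevent Android from suspending Termux while the server runs
termux-wake-lock

# Start the optimal script in a detached session; reattach later with: tmux attach -t llama
pkg install -y tmux
tmux new-session -d -s llama 'cd ~/llama.cpp/build && ./start_qwen3_optimal.sh'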

Test the Server

1. Check Server Status

# In another Termux session
curl http://127.0.0.1:8080/health
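
If you script against the server, you can poll /health until it responds before sending any requests:

# Block until the server answers on /health
until curl -sf http://127.0.0.1:8080/health > /dev/null; do
    sleep 1
done
echo "Server is up"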

2. Simple Chat Test

curl -X POST http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen",
        "messages": [
            {"role": "user", "content": "Hello! How are you?"}
        ],
        "temperature": 0.7,
        "max_tokens": 100
    }'
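
If you only want the assistant's reply rather than the full JSON, piping the response through jq is convenient (jq is a separate Termux package):

pkg install -y jq

curl -s -X POST http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen", "messages": [{"role": "user", "content": "Hello! How are you?"}]}' \
    | jq -r '.choices[0].message.content'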

3. Web Interface

Open in browser: http://127.0.0.1:8080

Performance Optimization (July 2025)

1. Quantization Performance (Real Benchmarks)

Based on Snapdragon 8 Gen 3 + Adreno 750 testing:

Q4_0 (Recommended)
- CPU-only: 8.30 tokens/second, 41.60 tok/s prompts
- GPU (25 layers): 8.81 tokens/second, 57.86 tok/s prompts
- Best overall performance, Adreno-optimized

Q5_K_M (Higher Quality)
- CPU-only: 7.15 tokens/second, 19.74 tok/s prompts
- GPU (25 layers): 2.67 tokens/second (avoid GPU)
- Better quality but GPU acceleration ineffective

Q6_K, Q8_0
- Not supported for GPU offload
- CPU-only performance decreases with larger quantizations

2. GPU vs CPU Decision Matrix

Use CPU-only when:
- You want strong, consistent generation speed with low power draw
- You are using Q5_K_M or higher quantizations
- Battery life is important
- You need predictable performance

Use GPU (25 layers) when:
- You are using the Q4_0 quantization specifically
- You need faster prompt processing (57.86 vs 41.60 tok/s)
- You accept higher power consumption for that prompt speed
- You want to experiment with Adreno acceleration

3. Memory and Threading

# Optimal settings for Snapdragon 8 Gen 3
--ctx-size 4096     # Standard context
--threads 6         # 6 performance cores (avoid efficiency cores)
--n-gpu-layers 0    # CPU-only recommended
--n-gpu-layers 25   # If using Q4_0 with GPU

Troubleshooting (July 2025)

GPU Detection Issues

Problem: Still shows “clvk” platform

# Solution: Ensure library path is set
export LD_LIBRARY_PATH=$HOME:$LD_LIBRARY_PATH
# Check if libraries are copied correctly
ls -la ~/libOpenCL*

Problem: “No usable GPU found”

# Solution: Verify OpenCL library copying
cp /vendor/lib64/libOpenCL*.so ~/
# Restart termux session

Performance Issues

Problem: GPU slower than CPU

# Try different layer counts
--n-gpu-layers 10   # Instead of higher values
# Use Q4_0 quantization for best GPU performance

Problem: Out of memory errors

# Reduce context size
--ctx-size 2048
# Reduce GPU layers
--n-gpu-layers 15
# Close other apps

Build Issues

Problem: OpenCL not found during build

# Install headers and copy libraries first
pkg install opencl-headers
cp /vendor/lib64/libOpenCL*.so ~/
export LD_LIBRARY_PATH=$HOME:$LD_LIBRARY_PATH
# Then rebuild
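
A clean rebuild after fixing the environment might look like this (same CMake flags as the build section above):

cd ~/llama.cpp
rm -rf build && mkdir build && cd build
cmake .. \
    -DLLAMA_BUILD_SERVER=ON \
    -DGGML_OPENCL=ON \
    -DGGML_OPENCL_USE_ADRENO_KERNELS=ON \
    -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j$(nproc)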

Expected Performance (Snapdragon 8 Gen 3 + Adreno 750) - Real Benchmarks

Q4_0 Model (Recommended):

CPU-Only (Optimal):
- Token Generation: 8.30 tokens/second
- Prompt Processing: 41.60 tokens/second
- Memory Usage: ~2.5GB for 4B model
- Power Efficiency: Excellent

GPU Acceleration (Experimental):
- Token Generation: 8.81 tokens/second
- Prompt Processing: 57.86 tokens/second
- Memory Usage: ~3GB (GPU + CPU)
- Power Efficiency: Higher consumption

Q5_K_M Model (Higher Quality):

CPU-Only (Recommended):
- Token Generation: 7.15 tokens/second
- Prompt Processing: 19.74 tokens/second
- Memory Usage: ~3GB for 4B model

GPU Acceleration (Not Recommended):
- Token Generation: 2.67 tokens/second (3x slower!)
- GPU optimization poor for K-quantizations

Performance Summary:

  • Best Overall: Q4_0 CPU-only (8.30 tok/s)
  • Fastest Prompts: Q4_0 GPU (57.86 tok/s prompts)
  • Highest Quality: Q5_K_M CPU-only (7.15 tok/s)
  • Avoid: Any K-quantization with GPU

API Usage Examples

Python Client

import requests
import json

def chat_with_qwen3(message):
    url = "http://127.0.0.1:8080/v1/chat/completions"
    data = {
        "model": "qwen",
        "messages": [{"role": "user", "content": message}],
        "temperature": 0.7,
        "max_tokens": 500
    }
    response = requests.post(url, json=data)
    return response.json()

# Usage
result = chat_with_qwen3("Explain quantum computing briefly")
print(result["choices"][0]["message"]["content"])

Streaming Example

curl -X POST http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen",
        "messages": [{"role": "user", "content": "Write a Python function"}],
        "stream": true
    }'
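
Each streamed chunk arrives as a Server-Sent Events line prefixed with "data: ". A rough way to print just the generated text from the shell (assuming jq is installed) is:

curl -s -N -X POST http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen", "messages": [{"role": "user", "content": "Write a Python function"}], "stream": true}' \
    | sed -un 's/^data: //p' \
    | while read -r chunk; do
        [ "$chunk" = "[DONE]" ] && break
        printf '%s' "$(echo "$chunk" | jq -r '.choices[0].delta.content // empty')"
    done
echo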

What’s New in July 2025

  1. Native Adreno Support: Proper OpenCL backend with Adreno-optimized kernels
  2. Real Performance Data: Actual benchmarks show CPU often outperforms GPU
  3. Quantization-Specific Optimization: Q4_0 works with GPU, K-quantizations don’t
  4. Critical Library Path Fix: Documented OpenCL library copying for Termux
  5. Optimal Configuration: 6 threads, CPU-only recommended for best performance

Final Recommendations

For Best Performance: Use Q4_0 with CPU-only (8.30 tok/s)
For Fastest Prompts: Use Q4_0 with 25 GPU layers (57.86 tok/s prompts)
For Highest Quality: Use Q5_K_M with CPU-only (7.15 tok/s)
Never Use: K-quantizations with GPU acceleration

Next Steps

  1. Start with the optimal script: ./start_qwen3_optimal.sh
  2. Test different quantizations: Compare Q4_0 vs Q5_K_M quality
  3. Monitor performance: Use the verbose output to track tok/s
  4. Experiment carefully: GPU may help with prompts but costs battery life

The server should now be running with optimal performance at http://127.0.0.1:8080!