Termux llama-server Setup Guide (July 2025 Updated)
Created: 2025-07-06 19:12:47 | Last updated: 2025-07-06 19:12:47 | Status: Public
Prerequisites & Environment Setup
1. Update Termux
pkg update && pkg upgrade -y
2. Install Required Packages
pkg install -y git cmake make clang python wget curl
pkg install -y build-essential opencl-headers
3. Grant Storage Access (Optional)
termux-setup-storage
GPU Setup for Snapdragon 8 Gen 3 (July 2025)
1. Check for GPU OpenCL Support
# Check if device has OpenCL support
ls -la /system/vendor/lib64/ | grep -i opencl
ls -la /vendor/lib64/ | grep -i opencl
# You should see libOpenCL.so and libOpenCL_adreno.so
2. Copy OpenCL Libraries (CRITICAL for Termux)
# Copy Qualcomm OpenCL libraries to termux for proper detection
cp /vendor/lib64/libOpenCL.so ~/
cp /vendor/lib64/libOpenCL_adreno.so ~/
# Add library path to environment permanently
echo 'export LD_LIBRARY_PATH=$HOME:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
# Verify libraries are accessible
ls -la ~/libOpenCL*
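On some devices the vendor libraries live under /system/vendor/lib64 rather than /vendor/lib64 (both paths were checked in step 1). A small loop that copies from whichever path actually holds them:
# Copy from whichever vendor path exists on this device
for d in /vendor/lib64 /system/vendor/lib64; do
  [ -f "$d/libOpenCL.so" ] && cp "$d"/libOpenCL*.so ~/ && break
done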
Build llama.cpp (July 2025 Version)
1. Clone Latest Repository
cd ~
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
# Verify you have the latest version with Adreno support (post-December 2024)
git log --oneline | head -3
2. Create Build Directory
mkdir build
cd build
3. Configure with Latest Adreno Backend
# Configure with July 2025 Adreno-optimized OpenCL backend
cmake .. \
-DLLAMA_BUILD_SERVER=ON \
-DGGML_OPENCL=ON \
-DGGML_OPENCL_USE_ADRENO_KERNELS=ON \
-DCMAKE_BUILD_TYPE=Release
4. Build
cmake --build . --config Release -j$(nproc)
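A quick sanity check that the build produced the server binary (it lands in build/bin):
# Confirm the binary exists and note its size
ls -lh bin/llama-server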
5. Verify Adreno Backend Success
# Test GPU detection - should show successful Adreno setup
export LD_LIBRARY_PATH=$HOME:$LD_LIBRARY_PATH
./bin/llama-server --version
# Look for these success indicators:
# - "QUALCOMM Snapdragon(TM)" platform (not "clvk")
# - "OpenCL will use matmul kernels optimized for Adreno"
# - "using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)"
# - No "drop unsupported device" messages
Download Model
1. Create Model Directory
mkdir -p ~/models
cd ~/models
2. Download Qwen3-4B Model (Updated Recommendations)
# RECOMMENDED: Q4_0 for best performance (Adreno-optimized)
wget https://huggingface.co/bartowski/Qwen_Qwen3-4B-GGUF/resolve/main/Qwen3-4B-Q4_0.gguf
# Alternative: Q5_K_M for higher quality (run it CPU-only; GPU is slow with K-quants)
# wget https://huggingface.co/bartowski/Qwen_Qwen3-4B-GGUF/resolve/main/Qwen3-4B-Q5_K_M.gguf
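A quick size check helps confirm the download completed; the Q4_0 file for a 4B model should be on the order of 2-2.5 GB:
# Verify the file downloaded completely
ls -lh ~/models/*.gguf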
3. Model Performance Summary (Snapdragon 8 Gen 3 + Adreno 750)
Based on real-world testing:
Q4_0 Model:
- CPU-only: 8.30 tokens/second (recommended)
- GPU (25 layers): 8.81 tokens/second (competitive)
- GPU excels at prompt processing (57.86 vs 41.60 tok/s)
Q5_K_M Model:
- CPU-only: 7.15 tokens/second (far faster than GPU)
- GPU (25 layers): 2.67 tokens/second (avoid GPU for this format)
Recommendation: Use Q4_0 with CPU-only for the best balance of performance and power efficiency (benchmark sketch below).
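The numbers above can be reproduced on your own device with llama-bench, which is built alongside llama-server. A minimal sketch (defaults run a pp512 prompt test and a tg128 generation test):
cd ~/llama.cpp/build
export LD_LIBRARY_PATH=$HOME:$LD_LIBRARY_PATH
# CPU-only run
./bin/llama-bench -m ~/models/Qwen3-4B-Q4_0.gguf -t 6 -ngl 0
# Same model with 25 layers offloaded to the Adreno GPU
./bin/llama-bench -m ~/models/Qwen3-4B-Q4_0.gguf -t 6 -ngl 25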
Configure llama-server
1. Create Optimized Launch Script (Q4_0 CPU-Only - Recommended)
cd ~/llama.cpp/build
cat > start_qwen3_optimal.sh << 'EOF'
#!/bin/bash
# Set OpenCL library path (required even for CPU-only)
export LD_LIBRARY_PATH=$HOME:$LD_LIBRARY_PATH
MODEL_PATH="$HOME/models/Qwen3-4B-Q4_0.gguf"
HOST="127.0.0.1"
PORT="8080"
CONTEXT_SIZE="4096"
THREADS=6 # Optimal for Snapdragon 8 Gen 3
echo "Starting Qwen3-4B server (CPU-optimized for best performance)..."
echo "Model: $MODEL_PATH"
echo "Access: http://$HOST:$PORT"
echo "Threads: $THREADS"
echo "Context: $CONTEXT_SIZE"
echo "Expected performance: ~8.30 tokens/second"
echo ""
# CPU-only for optimal performance
./bin/llama-server \
--model "$MODEL_PATH" \
--host "$HOST" \
--port "$PORT" \
--ctx-size "$CONTEXT_SIZE" \
--threads "$THREADS" \
--n-gpu-layers 0 \
--chat-template chatml \
--verbose
EOF
chmod +x start_qwen3_optimal.sh
2. Alternative GPU Script (Q4_0 with GPU - Experimental)
cat > start_qwen3_gpu.sh << 'EOF'
#!/bin/bash
# Set OpenCL library path for Adreno GPU detection
export LD_LIBRARY_PATH=$HOME:$LD_LIBRARY_PATH
MODEL_PATH="$HOME/models/Qwen3-4B-Q4_0.gguf"
HOST="127.0.0.1"
PORT="8080"
CONTEXT_SIZE="4096"
THREADS=6
echo "Starting Qwen3-4B server with GPU acceleration (experimental)..."
echo "Model: $MODEL_PATH"
echo "Access: http://$HOST:$PORT"
echo "Threads: $THREADS"
echo "Context: $CONTEXT_SIZE"
echo "Expected performance: ~8.81 tokens/second (faster prompts)"
echo ""
# GPU acceleration - only works well with Q4_0
./bin/llama-server \
--model "$MODEL_PATH" \
--host "$HOST" \
--port "$PORT" \
--ctx-size "$CONTEXT_SIZE" \
--threads "$THREADS" \
--n-gpu-layers 25 \
--chat-template chatml \
--verbose
EOF
chmod +x start_qwen3_gpu.sh
3. Legacy Q5_K_M Script (Higher Quality, CPU-Only)
cat > start_qwen3_quality.sh << 'EOF'
#!/bin/bash
MODEL_PATH="$HOME/models/Qwen3-4B-Q5_K_M.gguf"
HOST="127.0.0.1"
PORT="8080"
CONTEXT_SIZE="4096"
THREADS=6
echo "Starting Qwen3-4B server (Q5_K_M for higher quality)..."
echo "Model: $MODEL_PATH"
echo "Access: http://$HOST:$PORT"
echo "Threads: $THREADS"
echo "Context: $CONTEXT_SIZE"
echo "Expected performance: ~7.15 tokens/second"
echo ""
# CPU-only - GPU performance is poor with Q5_K_M
./bin/llama-server \
--model "$MODEL_PATH" \
--host "$HOST" \
--port "$PORT" \
--ctx-size "$CONTEXT_SIZE" \
--threads "$THREADS" \
--n-gpu-layers 0 \
--chat-template chatml \
--verbose
EOF
chmod +x start_qwen3_quality.sh
Run the Server
1. Start with Optimal Performance (Recommended)
cd ~/llama.cpp/build
./start_qwen3_optimal.sh
2. Alternative Performance Options
# For experimental GPU testing (Q4_0 only)
./start_qwen3_gpu.sh
# For higher quality output (slower)
./start_qwen3_quality.sh
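To keep the server alive when the screen turns off or the session is backgrounded, acquire a Termux wake lock and run the script in the background; a minimal sketch:
# Prevent Android from suspending Termux (release later with termux-wake-unlock)
termux-wake-lock
# Launch in the background and capture logs
cd ~/llama.cpp/build
nohup ./start_qwen3_optimal.sh > ~/llama-server.log 2>&1 &
# Watch the log for the startup indicators
tail -f ~/llama-server.log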
3. Quick One-Liner (Optimal Settings)
export LD_LIBRARY_PATH=$HOME:$LD_LIBRARY_PATH && ./bin/llama-server -m ~/models/Qwen3-4B-Q4_0.gguf -c 4096 -t 6 -ngl 0 --chat-template chatml
Test the Server
1. Check Server Status
# In another Termux session
curl http://127.0.0.1:8080/health
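While the model is still loading, llama-server answers /health with an error status, so a short poll loop is handy before sending requests; a sketch assuming that behavior:
# Wait until the server reports ready ({"status":"ok"})
until curl -sf http://127.0.0.1:8080/health > /dev/null; do
  sleep 2
done
echo "Server is ready"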
2. Simple Chat Test
curl -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen",
    "messages": [
      {"role": "user", "content": "Hello! How are you?"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
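To extract just the assistant's reply from the JSON response, jq works well (install with pkg install jq); for example:
curl -s -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 50}' \
  | jq -r '.choices[0].message.content'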
3. Web Interface
Open in browser: http://127.0.0.1:8080
Performance Optimization (July 2025)
1. Quantization Performance (Real Benchmarks)
Based on Snapdragon 8 Gen 3 + Adreno 750 testing:
Q4_0 (Recommended)
- CPU-only: 8.30 tokens/second, 41.60 tok/s prompts
- GPU (25 layers): 8.81 tokens/second, 57.86 tok/s prompts
- Best overall performance, Adreno-optimized
Q5_K_M (Higher Quality)
- CPU-only: 7.15 tokens/second, 19.74 tok/s prompts
- GPU (25 layers): 2.67 tokens/second (avoid GPU)
- Better quality but GPU acceleration ineffective
Q6_K, Q8_0
- Not supported for GPU offload
- CPU-only speed drops as the quantization size increases
2. GPU vs CPU Decision Matrix
Use CPU-only when:
- You want near-peak generation speed at much lower power draw
- Using Q5_K_M or higher quantizations
- Battery life is important
- Consistent performance needed
Use GPU (25 layers) when:
- Using Q4_0 quantization specifically
- Need faster prompt processing (57.86 vs 41.60 tok/s)
- Trading extra battery drain for faster prompt processing is acceptable
- Want to experiment with Adreno acceleration
3. Memory and Threading
# Optimal settings for Snapdragon 8 Gen 3
--ctx-size 4096 # Standard context
--threads 6 # 6 big cores (1 prime + 5 performance); avoid the 2 efficiency cores
--n-gpu-layers 0 # CPU-only recommended
--n-gpu-layers 25 # If using Q4_0 with GPU
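The 6-thread choice assumes the threads land on the big cores. One way to inspect the core layout the kernel exposes (sysfs paths vary by device and may be restricted on some ROMs):
# Big cores report higher maximum frequencies
for f in /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq; do
  echo "$f: $(cat "$f" 2>/dev/null)"
done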
Troubleshooting (July 2025)
GPU Detection Issues
Problem: Still shows “clvk” platform
# Solution: Ensure library path is set
export LD_LIBRARY_PATH=$HOME:$LD_LIBRARY_PATH
# Check if libraries are copied correctly
ls -la ~/libOpenCL*
Problem: “No usable GPU found”
# Solution: Verify OpenCL library copying
cp /vendor/lib64/libOpenCL*.so ~/
# Restart termux session
Performance Issues
Problem: GPU slower than CPU
# Try different layer counts
--n-gpu-layers 10 # Instead of higher values
# Use Q4_0 quantization for best GPU performance
Problem: Out of memory errors
# Reduce context size
--ctx-size 2048
# Reduce GPU layers
--n-gpu-layers 15
# Close other apps
Build Issues
Problem: OpenCL not found during build
# Install headers and copy libraries first
pkg install opencl-headers
cp /vendor/lib64/libOpenCL*.so ~/
export LD_LIBRARY_PATH=$HOME:$LD_LIBRARY_PATH
# Then rebuild
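A clean reconfigure after fixing the environment, mirroring the build steps above:
cd ~/llama.cpp/build
rm -rf CMakeCache.txt CMakeFiles
cmake .. -DLLAMA_BUILD_SERVER=ON -DGGML_OPENCL=ON -DGGML_OPENCL_USE_ADRENO_KERNELS=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j$(nproc)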
Expected Performance (Snapdragon 8 Gen 3 + Adreno 750) - Real Benchmarks
Q4_0 Model (Recommended):
CPU-Only (Optimal):
- Token Generation: 8.30 tokens/second
- Prompt Processing: 41.60 tokens/second
- Memory Usage: ~2.5GB for 4B model
- Power Efficiency: Excellent
GPU Acceleration (Experimental):
- Token Generation: 8.81 tokens/second
- Prompt Processing: 57.86 tokens/second
- Memory Usage: ~3GB (GPU + CPU)
- Power Efficiency: Lower (noticeably higher consumption)
Q5_K_M Model (Higher Quality):
CPU-Only (Recommended):
- Token Generation: 7.15 tokens/second
- Prompt Processing: 19.74 tokens/second
- Memory Usage: ~3GB for 4B model
GPU Acceleration (Not Recommended):
- Token Generation: 2.67 tokens/second (nearly 3x slower than CPU)
- Adreno GPU kernels are poorly optimized for K-quantizations
Performance Summary:
- Best Overall: Q4_0 CPU-only (8.30 tok/s)
- Fastest Prompts: Q4_0 GPU (57.86 tok/s prompts)
- Highest Quality: Q5_K_M CPU-only (7.15 tok/s)
- Avoid: Any K-quantization with GPU
API Usage Examples
Python Client
import requests

def chat_with_qwen3(message):
    url = "http://127.0.0.1:8080/v1/chat/completions"
    data = {
        "model": "qwen",
        "messages": [{"role": "user", "content": message}],
        "temperature": 0.7,
        "max_tokens": 500
    }
    response = requests.post(url, json=data)
    return response.json()

# Usage
result = chat_with_qwen3("Explain quantum computing briefly")
print(result["choices"][0]["message"]["content"])
Streaming Example
curl -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen",
    "messages": [{"role": "user", "content": "Write a Python function"}],
    "stream": true
  }'
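Each streamed chunk arrives as a server-sent event line (data: {...}, terminated by data: [DONE]). A sketch that prints only the text deltas, assuming that framing and using jq:
curl -sN -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen", "messages": [{"role": "user", "content": "Write a haiku"}], "stream": true}' \
  | sed -un 's/^data: //p' \
  | while read -r line; do
      [ "$line" = "[DONE]" ] && break
      printf '%s' "$line" | jq -rj '.choices[0].delta.content // empty'
    done; echo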
What’s New in July 2025
- Native Adreno Support: Proper OpenCL backend with Adreno-optimized kernels
- Real Performance Data: Actual benchmarks show CPU often outperforms GPU
- Quantization-Specific Optimization: Q4_0 works with GPU, K-quantizations don’t
- Critical Library Path Fix: Documented OpenCL library copying for Termux
- Optimal Configuration: 6 threads, CPU-only recommended for best performance
Final Recommendations
For Best Performance: Use Q4_0 with CPU-only (8.30 tok/s)
For Fastest Prompts: Use Q4_0 with 25 GPU layers (57.86 tok/s prompts)
For Highest Quality: Use Q5_K_M with CPU-only (7.15 tok/s)
Never Use: K-quantizations with GPU acceleration
Next Steps
- Start with the optimal script:
./start_qwen3_optimal.sh
- Test different quantizations: Compare Q4_0 vs Q5_K_M quality
- Monitor performance: Use the verbose output to track tok/s
- Experiment carefully: GPU may help with prompts but costs battery life
The server should now be running with optimal performance at http://127.0.0.1:8080!