Running Large Language Models on Android
Created: 2025-01-27 20:36:20 | Last updated: 2025-01-27 20:36:20 | Status: Public
This guide covers two methods for running LLMs on Android devices: llama.cpp and llamafile. Both methods use Termux, a terminal emulator for Android.
Prerequisites
- Install Termux from F-Droid store (not Google Play Store, as that version isn’t maintained)
- An Android device with sufficient RAM (8GB+ recommended; see the check below)
- Available storage space (varies by model size)
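To confirm a device meets these requirements, you can check free storage and RAM from inside Termux. A minimal check, assuming the procps package (which provides the free command) is available:
# Check free storage in Termux's home directory
df -h ~
# Check total and available RAM (free is provided by the procps package)
pkg install procps
free -h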
Method 1: llama.cpp
This method involves building llama.cpp from source.
Installation
# Install required packages
pkg update && pkg upgrade
pkg install git make clang python cmake wget
# Clone llama.cpp repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build llama.cpp (LLAMA_OPENBLAS=1 needs OpenBLAS installed; drop the flag for a plain CPU build, or see the CMake note after this block if make fails)
make LLAMA_OPENBLAS=1
# Create a models directory
mkdir models
cd models
# Download a GGUF model (example: TinyLlama; replace the URL with the actual GGUF file for your chosen model)
wget https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/resolve/main/model.gguf
# Return to main directory
cd ..
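Note: recent llama.cpp releases have replaced the Makefile build with CMake. If the make step above fails, a build along these lines should work instead (a sketch; in newer versions the binaries, such as llama-cli, land under build/bin/):
# Alternative CMake build, run from the llama.cpp directory
cmake -B build
cmake --build build --config Release -j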
Running the Model
./main -m ./models/model.gguf \
--color \
--ctx_size 2048 \
-n 256 \
-ins -p "### Human: Hello\n### Assistant:"
Configuration Options
- --ctx_size: context window size; adjust based on your device's capabilities
- -n: number of tokens to generate
- -ins: enable interactive (instruct) mode
- You can adjust temperature and other parameters as needed (see the example below)
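For example, a run with a smaller context, a longer reply, and a lower sampling temperature might look like this (a sketch; exact flag names such as --temp can vary between llama.cpp versions):
./main -m ./models/model.gguf --ctx_size 1024 -n 512 --temp 0.7 -ins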
Method 2: llamafile (Recommended)
Llamafile is generally easier to use as it packages everything into a single executable file.
Installation
# Install required packages
pkg install curl coreutils
# Create a directory
mkdir ~/llamafile
cd ~/llamafile
# Download a llamafile (example with Mistral 7B)
curl -L https://huggingface.co/jartine/mistral-7b.llamafile/resolve/main/mistral-7b-v0.1-Q4_K_M.llamafile -o mistral.llamafile
# Make it executable
chmod +x mistral.llamafile
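The download is several gigabytes, so it is worth confirming it completed before running it:
# The size shown should match the file size listed on the Hugging Face page
ls -lh mistral.llamafile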
Running the Model
Terminal mode:
./mistral.llamafile --server false
Web server mode (accessible at http://localhost:8080):
./mistral.llamafile
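The web server also exposes an HTTP API from the embedded llama.cpp server, which you can query from a second Termux session. The /completion endpoint and JSON fields below are assumptions based on llama.cpp's server and may differ between llamafile versions:
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, how are you?", "n_predict": 64}'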
Advantages of llamafile
- No compilation needed
- Includes model, weights, and dependencies in one file
- Can run in terminal or web server mode
- Prebuilt binaries already include CPU optimizations for ARM processors
- Web interface is mobile-friendly
Tips and Recommendations
- Model Selection
  - Smaller models (7B parameters or fewer) work better on mobile devices
  - Quantized models (Q4, Q5) use less RAM
  - Consider TinyLlama or Mistral 7B for a good balance of performance and resource usage
- Performance Optimization
  - Close other apps before running the model
  - Monitor your device's temperature
  - Adjust context size and other parameters based on your device's capabilities
- Troubleshooting
  - If you get memory errors, try a smaller model or reduce the context size (see the example below)
  - Web server mode may be more stable than terminal mode
  - Keep your device plugged in during long sessions
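As an example of the memory-error workaround above, re-running with a much smaller context is often enough (a sketch; adjust paths to your setup, and the -c flag for llamafile is assumed to pass through to the embedded llama.cpp):
# llama.cpp with a reduced context window
./main -m ./models/model.gguf --ctx_size 512 -n 128 -ins
# llamafile with a reduced context window
./mistral.llamafile -c 512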
Additional Resources
- Llamafile repository: https://github.com/Mozilla-Ocho/llamafile
- llama.cpp repository: https://github.com/ggerganov/llama.cpp
- Termux documentation: https://termux.dev/