Running Large Language Models on Android
Created: 2025-01-27 20:36:20 | Last updated: 2025-01-27 20:36:20 | Status: Public
This guide covers two methods for running LLMs on Android devices: llama.cpp and llamafile. Both methods use Termux, a terminal emulator for Android.
Prerequisites
- Install Termux from F-Droid store (not Google Play Store, as that version isn’t maintained)
- An Android device with sufficient RAM (8GB+ recommended; see the check below)
- Available storage space (varies by model size)
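To confirm a device meets these requirements, you can check free storage and RAM from inside Termux. A minimal check, assuming the procps package (which provides the free command) is available:
# Check free storage in Termux's home directory
df -h ~
# Check total and available RAM (free is provided by the procps package)
pkg install procps
free -h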
Method 1: llama.cpp
This method involves building llama.cpp from source.
Installation
# Install required packages
pkg update && pkg upgrade
pkg install git make clang python cmake wget
# Clone llama.cpp repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build llama.cpp (LLAMA_OPENBLAS=1 needs OpenBLAS installed; drop the flag for a plain CPU build, or see the CMake note after this block if make fails)
make LLAMA_OPENBLAS=1
# Create a models directory
mkdir models
cd models
# Download a GGUF model (example: TinyLlama; replace the URL with the actual GGUF file for your chosen model)
wget https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/resolve/main/model.gguf
# Return to main directory
cd ..
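Note: recent llama.cpp releases have replaced the Makefile build with CMake. If the make step above fails, a build along these lines should work instead (a sketch; in newer versions the binaries, such as llama-cli, land under build/bin/):
# Alternative CMake build, run from the llama.cpp directory
cmake -B build
cmake --build build --config Release -j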
Running the Model
./main -m ./models/model.gguf \
--color \
--ctx_size 2048 \
-n 256 \
-ins -p "### Human: Hello\n### Assistant:"
Configuration Options
- --ctx_size: context window size; adjust based on your device's capabilities
- -n: number of tokens to generate
- -ins: enable interactive (instruct) mode
- You can adjust temperature and other parameters as needed (see the example below)
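For example, a run with a smaller context, a longer reply, and a lower sampling temperature might look like this (a sketch; exact flag names such as --temp can vary between llama.cpp versions):
./main -m ./models/model.gguf --ctx_size 1024 -n 512 --temp 0.7 -ins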
Method 2: llamafile (Recommended)
Llamafile is generally easier to use as it packages everything into a single executable file.
Installation
# Install required packages
pkg install curl coreutils
# Create a directory
mkdir ~/llamafile
cd ~/llamafile
# Download a llamafile (example with Mistral 7B)
curl -L https://huggingface.co/jartine/mistral-7b.llamafile/resolve/main/mistral-7b-v0.1-Q4_K_M.llamafile -o mistral.llamafile
# Make it executable
chmod +x mistral.llamafile
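The download is several gigabytes, so it is worth confirming it completed before running it:
# The size shown should match the file size listed on the Hugging Face page
ls -lh mistral.llamafile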
Running the Model
Terminal mode:
./mistral.llamafile --server false
Web server mode (accessible at http://localhost:8080):
./mistral.llamafile
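The web server also exposes an HTTP API from the embedded llama.cpp server, which you can query from a second Termux session. The /completion endpoint and JSON fields below are assumptions based on llama.cpp's server and may differ between llamafile versions:
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, how are you?", "n_predict": 64}'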
Advantages of llamafile
- No compilation needed
- Includes model, weights, and dependencies in one file
- Can run in terminal or web server mode
- Prebuilt binaries already include CPU optimizations for ARM processors
- Web interface is mobile-friendly
Tips and Recommendations
- Model Selection
  - Smaller models (7B parameters or fewer) work better on mobile devices
  - Quantized models (Q4, Q5) use less RAM
  - Consider TinyLlama or Mistral 7B for a good balance of performance and resource usage
- Performance Optimization
  - Close other apps before running the model
  - Monitor your device's temperature
  - Adjust context size and other parameters based on your device's capabilities
- Troubleshooting
  - If you get memory errors, try a smaller model or reduce the context size (see the example below)
  - Web server mode may be more stable than terminal mode
  - Keep your device plugged in during long sessions
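As an example of the memory-error workaround above, re-running with a much smaller context is often enough (a sketch; adjust paths to your setup, and the -c flag for llamafile is assumed to pass through to the embedded llama.cpp):
# llama.cpp with a reduced context window
./main -m ./models/model.gguf --ctx_size 512 -n 128 -ins
# llamafile with a reduced context window
./mistral.llamafile -c 512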
Additional Resources
- Llamafile repository: https://github.com/Mozilla-Ocho/llamafile
- llama.cpp repository: https://github.com/ggerganov/llama.cpp
- Termux documentation: https://termux.dev/