Deployment
This page summarizes the currently available tools you can use to deploy the Falcon-H1 series.
Make sure to use Falcon-H1 models in torch.bfloat16, not torch.float16, for the best performance.
🤗 transformers
We advise users to install Mamba-SSM from our public fork in order to include this fix. Note that this is optional, as we observed the issue occurs only stochastically.
git clone https://github.com/younesbelkada/mamba.git && cd mamba/ && pip install -e . --no-build-isolation
Check this issue for more details.
Make sure to install the transformers library from source:
pip install git+https://github.com/huggingface/transformers.git
Then use the AutoModelForCausalLM interface, e.g.:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon-H1-1B-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Perform text generation
inputs = tokenizer("The falcon is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
vLLM
For vLLM, simply start a server by executing the command below:
# pip install vllm
vllm serve tiiuae/Falcon-H1-1B-Instruct --tensor-parallel-size 2 --data-parallel-size 1
💡 Tip: Falcon-H1’s default --max-model-len is 262,144 tokens to support very long contexts, but such a large window can slow throughput. Set --max-model-len <prompt_len + output_len> (e.g. 32768) and cap concurrency with --max-num-seqs <N> (e.g. 64) to avoid over-allocating KV-cache memory and to speed up generation.
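Once the server is up, vLLM exposes an OpenAI-compatible HTTP API. Below is a minimal client sketch, assuming the server above runs locally on vLLM's default port 8000 (adjust the host/port to your deployment):

import requests

# Query the OpenAI-compatible chat endpoint served by vLLM (default port 8000)
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "tiiuae/Falcon-H1-1B-Instruct",
        "messages": [{"role": "user", "content": "Give me a one-sentence fun fact about falcons."}],
        "max_tokens": 128,
        "temperature": 0.1,
    },
)
print(response.json()["choices"][0]["message"]["content"])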
🔧 llama.cpp
Refer to the model cards of our GGUF models and follow the installation instructions to run the model with llama.cpp. Until our changes get merged upstream, you can use our public fork of llama.cpp.
All official GGUF files can be found on our official Hugging Face collection.
🔧 llama.cpp Integration
The llama.cpp toolkit provides a lightweight C/C++ implementation for running Falcon-H1 models locally. We maintain a public fork with all the necessary patches and support:
1. Prerequisites
- CMake ≥ 3.16
- A C++17-compatible compiler (e.g., gcc, clang)
- make or ninja build tool
- (Optional) Docker, for OpenWebUI integration
2. Clone & Build
# Clone the Falcon-H1 llama.cpp fork
git clone https://github.com/tiiuae/llama.cpp-Falcon-H1.git
cd llama.cpp-Falcon-H1
# Create a build directory and compile
mkdir build && cd build
cmake .. # Configure the project
make -j$(nproc) # Build the binaries
Tip: For GPU acceleration, refer to the llama.cpp GPU guide.
3. Download a GGUF Model
Fetch the desired Falcon-H1 checkpoint from Hugging Face’s collection:
# Example: download the 1B Instruct model
wget https://huggingface.co/tiiuae/falcon-h1-6819f2795bc406da60fab8df/resolve/main/Falcon-H1-1B-Instruct-Q5_0.gguf \
-P models/
All available GGUF files: https://huggingface.co/collections/tiiuae/falcon-h1-6819f2795bc406da60fab8df
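Alternatively, the huggingface_hub Python package can fetch a GGUF file programmatically. The sketch below is illustrative: the repo id and filename are placeholders, so substitute the entry you picked from the collection.

from huggingface_hub import hf_hub_download

# Illustrative repo id and filename -- replace with the GGUF repo/file chosen from the collection above
gguf_path = hf_hub_download(
    repo_id="tiiuae/Falcon-H1-1B-Instruct-GGUF",
    filename="Falcon-H1-1B-Instruct-Q5_0.gguf",
    local_dir="models",
)
print(f"Model downloaded to: {gguf_path}")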
4. Run the llama-server
Start the HTTP server for inference:
# Flags: -c sets the context window size, --n-gpu-layers offloads layers to the GPU
# (omit it for CPU-only), --temp sets the sampling temperature, --host/--port set the bind address and listening port
./build/bin/llama-server \
  -m models/Falcon-H1-1B-Instruct-Q5_0.gguf \
  -c 4096 \
  --n-gpu-layers 512 \
  --temp 0.1 \
  --host 0.0.0.0 \
  --port 11434
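llama-server also exposes an OpenAI-compatible API, so you can sanity-check the deployment from Python before wiring up a UI. This sketch assumes the server started above is listening on port 11434:

import requests

# Send a chat request to the llama-server started above (OpenAI-compatible endpoint on port 11434)
response = requests.post(
    "http://localhost:11434/v1/chat/completions",
    json={
        "model": "Falcon-H1-1B-Instruct-Q5_0",
        "messages": [{"role": "user", "content": "Summarize Falcon-H1 in one sentence."}],
        "max_tokens": 128,
        "temperature": 0.1,
    },
)
print(response.json()["choices"][0]["message"]["content"])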
5. Web UI via OpenWebUI
Use the popular OpenWebUI frontend to chat in your browser:
# Open WebUI listens on port 8080 inside the container, so map it to 8888 on the host
docker run -d \
  --name openwebui-test \
  -e OPENAI_API_BASE_URL="http://host.docker.internal:11434/v1" \
  -p 8888:8080 \
  ghcr.io/open-webui/open-webui:main
- Open your browser at http://localhost:8888
- Select Falcon-H1-1B-Instruct-Q5_0 from the model list
- Start chatting!
For advanced tuning and custom flags, see the full llama.cpp documentation: https://github.com/ggerganov/llama.cpp
Demo
We use a MacBook with an M4 Max chip and Falcon-H1-1B-Q6_K for this demo.