Deployment

This page summarizes all the currently available tools you can use to deploy the Falcon-H1 series.

Make sure to load Falcon-H1 models in torch.bfloat16 rather than torch.float16 for the best performance.

🤗 transformers

We advise users to install Mamba-SSM from our public fork so that it includes this fix. Note that this step is optional, as we observed the issue only occurs stochastically.

git clone https://github.com/younesbelkada/mamba.git && cd mamba/ && pip install -e . --no-build-isolation

Check this issue for more details.

Make sure to install the transformers library from source:

pip install git+https://github.com/huggingface/transformers.git

Then use the AutoModelForCausalLM interface, e.g.:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon-H1-1B-Base"

model = AutoModelForCausalLM.from_pretrained(
  model_id,
  torch_dtype=torch.bfloat16,
  device_map="auto"
)

# Perform text generation
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("The Falcon-H1 architecture combines", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

vLLM

For vLLM, simply start a server by executing the command below:

# pip install vllm
vllm serve tiiuae/Falcon-H1-1B-Instruct --tensor-parallel-size 2 --data-parallel-size 1

💡 Tip: Falcon-H1’s default --max-model-len is 262,144 tokens to support very long contexts, but that large window can slow throughput. Set --max-model-len <prompt_len + output_len> (e.g. 32768) and cap concurrency with --max-num-seqs <N> (e.g. 64) to avoid over-allocating KV-cache memory and speed up generation.
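
Once the server is up, you can query it through its OpenAI-compatible API. Below is a minimal sketch using the openai Python client; it assumes the server listens on vLLM's default port 8000 and that the model name matches the checkpoint you served.

from openai import OpenAI

# Point the client at the local vLLM server (the API key is required by the client but unused)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="tiiuae/Falcon-H1-1B-Instruct",
    messages=[{"role": "user", "content": "Summarize the Falcon-H1 architecture in one sentence."}],
    max_tokens=128,
    temperature=0.1,
)
print(response.choices[0].message.content)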

🔧 llama.cpp

Falcon-H1 is natively supported in llama.cpp!

All official GGUF files can be found on our official Hugging Face collection.


1. Prerequisites

  • CMake ≥ 3.16
  • A C++17-compatible compiler (e.g., gcc, clang)
  • make or ninja build tool
  • (Optional) Docker, for OpenWebUI integration

2. Clone & Build

# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Create a build directory and compile
mkdir build && cd build
cmake ..         # Configure the project
make -j$(nproc)  # Build the binaries

Tip: For GPU acceleration, refer to the llama.cpp GPU guide.


3. Download a GGUF Model

Fetch the desired Falcon-H1 checkpoint from Hugging Face’s collection:

# Example: download the 1.5B Instruct model
wget https://huggingface.co/tiiuae/Falcon-H1-1.5B-Instruct-GGUF/resolve/main/Falcon-H1-1.5B-Instruct-Q5_K.gguf \
     -P models/

All available GGUF files: https://huggingface.co/collections/tiiuae/falcon-h1-6819f2795bc406da60fab8df
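
If you prefer Python over wget, the same download can be done with the huggingface_hub library; the sketch below assumes the package is installed and mirrors the checkpoint used above.

from huggingface_hub import hf_hub_download

# Download the quantized checkpoint into the local models/ directory
hf_hub_download(
    repo_id="tiiuae/Falcon-H1-1.5B-Instruct-GGUF",
    filename="Falcon-H1-1.5B-Instruct-Q5_K.gguf",
    local_dir="models",
)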


4. Run the llama-server

Start the HTTP server for inference:

./build/bin/llama-server \
  -m models/Falcon-H1-1.5B-Instruct-Q5_K.gguf \
  -c 4096 \
  -ngl 512 \
  --temp 0.1 \
  --host 0.0.0.0 \
  --port 11434

  • -m: path to the GGUF model file (the checkpoint downloaded in step 3)
  • -c: context window size
  • -ngl: number of layers to offload to the GPU (omit if CPU-only)
  • --temp: sampling temperature
  • --host / --port: bind address and listening port
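
llama-server exposes an OpenAI-compatible HTTP API, so you can query it from any HTTP client once it is running. Below is a minimal sketch using the requests library, assuming the host and port configured above; the prompt and sampling parameters are placeholders.

import requests

# Send a chat completion request to llama-server's OpenAI-compatible endpoint
payload = {
    "model": "Falcon-H1-1.5B-Instruct-Q5_K",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "temperature": 0.1,
    "max_tokens": 128,
}
response = requests.post("http://localhost:11434/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])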

5. Web UI via OpenWebUI

Use the popular OpenWebUI frontend to chat in your browser:

docker run -d \
  --name openwebui-test \
  -e OPENAI_API_BASE_URL="http://host.docker.internal:11434/v1" \
  -p 8888:8080 \
  ghcr.io/open-webui/open-webui:main

  1. Open your browser at http://localhost:8888
  2. Select Falcon-H1-1.5B-Instruct-Q5_K from the model list
  3. Start chatting!

For advanced tuning and custom flags, see the full llama.cpp documentation: https://github.com/ggerganov/llama.cpp

Demo

We use a MacBook with an M4 Max chip and the Falcon-H1-1B-Q6_K model for this demo.

SkyPilot

Refer to this documentation section for deploying Falcon-H1 series models using the SkyPilot library.

LM Studio

First, install LM Studio from the official website. Make sure to select the latest llama.cpp runtime: open Developer mode -> "LM Runtimes" (from the top left) and check that the llama.cpp version is greater than v1.39.0.

Jan

Jan is an open-source alternative to ChatGPT that runs 100% offline on your computer, and it also supports Falcon-H1 models with minimal configuration. First, make sure you have installed version >= 0.6.5, then navigate to Settings -> Model Providers -> Llama.cpp.

From there, click the folder icon to import your own GGUF file and select any Falcon-H1 GGUF model you have downloaded locally. Then switch to the main screen and start chatting with the model!

Docker Model API

Docker supports deploying local models through a simple API; you can use Falcon-H1 with the Docker Model API starting from Docker Desktop version 4.43.2.

First, make sure Docker Desktop is running (on Mac devices), then run:

docker model run hf.co/tiiuae/Falcon-H1-1.5B-Instruct-GGUF
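
Docker Model Runner also exposes an OpenAI-compatible API. The sketch below is an assumption-heavy example: it presumes host-side TCP access is enabled on port 12434 and that the /engines/v1/chat/completions path applies to your Docker Desktop version, so check the Docker documentation for the exact endpoint before relying on it.

import requests

# Assumed OpenAI-compatible endpoint of Docker Model Runner (verify the port and path for your Docker Desktop version)
payload = {
    "model": "hf.co/tiiuae/Falcon-H1-1.5B-Instruct-GGUF",
    "messages": [{"role": "user", "content": "Hello!"}],
}
response = requests.post("http://localhost:12434/engines/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])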