Evaluations

We summarize our evaluation settings in the following table.

| Task | Base models setting | Instruct models setting |
|------|---------------------|-------------------------|
| **General** | | |
| BBH | logprobs, 3-shot | logprobs, 3-shot |
| ARC-C | logprobs, 25-shot | logprobs, 0-shot |
| TruthfulQA | - | logprobs, 0-shot |
| HellaSwag | logprobs, 10-shot | logprobs, 0-shot |
| Winogrande | logprobs, 5-shot | - |
| MMLU | logprobs, 5-shot | logprobs, 5-shot |
| **Math** | | |
| GSM8k | strict match, 5-shot | strict match, 5-shot |
| MATH-500 | - | accuracy |
| MATH lvl5 | math verify, 4-shot | - |
| AMC-23 | - | average accuracy, 16 repetitions |
| AIME-24 | - | average accuracy, 16 repetitions |
| AIME-25 | - | average accuracy, 16 repetitions |
| **Science** | | |
| GPQA | logprobs, 5-shot | logprobs, 5-shot |
| GPQA_Diamond | - | average accuracy, 3 repetitions |
| MMLU-Pro | logprobs, 5-shot | logprobs, 5-shot |
| MMLU-stem | subset of MMLU | subset of MMLU |
| **Code** | | |
| HumanEval | pass@1 | pass@1 |
| HumanEval+ | pass@1 | pass@1 |
| MBPP | pass@1 | pass@1 |
| MBPP+ | pass@1 | pass@1 |
| LiveCodeBench | - | accuracy |
| CRUXEval | - | pass@1, input & output average |
| **Instruction Following** | | |
| IFEval | - | inst & prompt average accuracy |
| Alpaca-Eval | - | LC winrate |
| MTBench | - | turn 1 & 2 average |
| LiveBench | - | global_average |
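For the benchmarks marked "logprobs" above, each answer choice is scored by the log-likelihood the model assigns to it given the question, and the highest-scoring choice is taken as the prediction. The sketch below illustrates this scoring scheme with `transformers`; it assumes the Hugging Face checkpoint id `tiiuae/Falcon-H1-0.5B` and a `transformers` version with Falcon-H1 support, and it is only a minimal illustration, not the exact harness configuration used to produce the numbers reported below.

```python
# Minimal sketch of logprobs-style multiple-choice scoring (as used for MMLU,
# ARC-C, etc. above). Checkpoint id is assumed; shown for illustration only.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon-H1-0.5B"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of token log-probabilities of `continuation` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probability of each token given its preceding context.
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the continuation tokens (assumes the prompt tokenization is a
    # prefix of the full tokenization, which holds for typical BPE setups).
    n_cont = full_ids.shape[1] - prompt_ids.shape[1]
    return token_logprobs[0, -n_cont:].sum().item()

question = "Question: What is the capital of France?\nAnswer:"
choices = [" Paris", " London", " Berlin", " Madrid"]
scores = [continuation_logprob(question, c) for c in choices]
print(choices[scores.index(max(scores))])  # predicted choice
```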

The evaluation results for both base and instruct models are reported below.

Falcon-H1-0.5B

| Tasks | Falcon-H1-0.5B | Qwen3-0.6B | Qwen2.5-0.5B | Gemma3-1B | Llama3.2-1B | Falcon3-1B |
|-------|----------------|------------|--------------|-----------|-------------|------------|
| **General** | | | | | | |
| BBH | 40.22 | 36.07 | 32.62 | 30.26 | 30.72 | 35.24 |
| MMLU | 55.04 | 52.64 | 47.61 | 26.33 | 32.39 | 45.14 |
| ARC-C | 46.93 | 44.8 | 35.32 | 39.33 | 39.42 | 47.87 |
| HellaSwag | 56.3 | 53.51 | 51.79 | 62.94 | 65.73 | 62.3 |
| Winogrande | 59.43 | 60.54 | 56.83 | 62.59 | 62.75 | 61.17 |
| **Math** | | | | | | |
| GSM8k | 60.2 | 50.04 | 34.8 | 2.2 | 7.05 | 34.95 |
| MATH lvl5 | 15.18 | 9.29 | 4.23 | 1.21 | 0.98 | 3.4 |
| **Science** | | | | | | |
| GPQA | 29.7 | 29.11 | 27.94 | 24.66 | 23.57 | 27.85 |
| MMLU-Pro | 30.04 | 22.99 | 18.98 | 11.31 | 11.8 | 16.11 |
| MMLU-stem | 57.12 | 50.11 | 43.74 | 27.59 | 30.19 | 40.06 |
| **Code** | | | | | | |
| HumanEval | 35.98 | 31.71 | 29.27 | 6.71 | 18.9 | 10.37 |
| HumanEval+ | 31.1 | 27.44 | 25.0 | 5.49 | 16.46 | 9.15 |
| MBPP | 52.12 | 51.06 | 40.74 | 12.7 | 35.98 | 12.43 |
| MBPP+ | 43.39 | 42.33 | 34.66 | 9.52 | 29.89 | 9.52 |

Falcon-H1-0.5B-Instruct

| Tasks | Falcon-H1-0.5B | Qwen3-0.6B | Qwen2.5-0.5B | Gemma3-1B | Llama3.2-1B | Falcon3-1B |
|-------|----------------|------------|--------------|-----------|-------------|------------|
| **General** | | | | | | |
| BBH | 42.91 | 32.95 | 33.26 | 35.86 | 33.21 | 34.47 |
| ARC-C | 37.8 | 31.06 | 33.28 | 34.13 | 34.64 | 43.09 |
| TruthfulQA | 44.12 | 51.65 | 46.19 | 42.17 | 42.08 | 42.31 |
| HellaSwag | 51.93 | 42.17 | 52.38 | 42.24 | 55.3 | 58.53 |
| MMLU | 53.4 | 42.98 | 46.07 | 40.87 | 45.93 | 46.1 |
| **Math** | | | | | | |
| GSM8k | 68.39 | 42.61 | 38.51 | 42.38 | 44.28 | 44.05 |
| MATH-500 | 58.4 | 46.0 | 27.8 | 45.4 | 13.2 | 19.8 |
| AMC-23 | 33.13 | 27.97 | 12.5 | 19.22 | 7.19 | 6.87 |
| AIME-24 | 3.75 | 2.71 | 0.62 | 0.42 | 1.46 | 0.41 |
| AIME-25 | 4.38 | 1.67 | 0.21 | 1.25 | 0.0 | 0.21 |
| **Science** | | | | | | |
| GPQA | 29.95 | 26.09 | 26.85 | 28.19 | 26.59 | 26.76 |
| GPQA_Diamond | 27.95 | 25.08 | 24.24 | 21.55 | 25.08 | 31.31 |
| MMLU-Pro | 31.03 | 16.95 | 18.73 | 14.46 | 16.2 | 18.49 |
| MMLU-stem | 54.55 | 39.3 | 39.83 | 35.39 | 39.16 | 39.64 |
| **Code** | | | | | | |
| HumanEval | 51.83 | 41.46 | 36.59 | 40.85 | 34.15 | 22.56 |
| HumanEval+ | 45.12 | 37.19 | 32.32 | 37.2 | 29.88 | 20.73 |
| MBPP | 42.59 | 56.08 | 46.83 | 57.67 | 33.6 | 20.63 |
| MBPP+ | 33.07 | 47.08 | 39.68 | 50.0 | 29.37 | 17.2 |
| LiveCodeBench | 7.05 | 9.78 | 2.94 | 5.09 | 2.35 | 0.78 |
| CRUXEval | 25.75 | 23.63 | 14.88 | 12.7 | 0.06 | 15.58 |
| **Instruction Following** | | | | | | |
| IFEval | 72.07 | 62.16 | 32.11 | 61.48 | 55.34 | 54.26 |
| Alpaca-Eval | 10.79 | 9.59 | 3.26 | 17.87 | 9.38 | 6.98 |
| MTBench | 7.06 | 5.75 | 4.71 | 7.03 | 6.37 | 6.03 |
| LiveBench | 20.8 | 27.78 | 14.27 | 18.79 | 14.97 | 14.1 |

Falcon-H1-1.5B

| Tasks | Falcon-H1-1.5B | Qwen3-1.7B | Qwen2.5-1.5B | Gemma3-1B | Llama3.2-1B | Falcon3-1B |
|-------|----------------|------------|--------------|-----------|-------------|------------|
| **General** | | | | | | |
| BBH | 46.57 | 43.05 | 40.55 | 30.26 | 30.72 | 35.24 |
| MMLU | 61.81 | 62.46 | 61.13 | 26.33 | 32.39 | 45.14 |
| ARC-C | 53.24 | 55.72 | 54.27 | 39.33 | 39.42 | 47.87 |
| HellaSwag | 66.76 | 67.09 | 67.86 | 62.94 | 65.73 | 62.3 |
| Winogrande | 65.59 | 66.3 | 64.56 | 62.59 | 62.75 | 61.17 |
| **Math** | | | | | | |
| GSM8k | 52.01 | 70.74 | 63.0 | 2.2 | 7.05 | 34.95 |
| MATH lvl5 | 20.39 | 16.39 | 8.84 | 1.21 | 0.98 | 3.4 |
| **Science** | | | | | | |
| GPQA | 29.11 | 29.45 | 28.36 | 24.66 | 23.57 | 27.85 |
| MMLU-Pro | 35.53 | 33.81 | 28.72 | 11.31 | 11.8 | 16.11 |
| MMLU-stem | 63.37 | 61.53 | 54.93 | 27.59 | 30.19 | 40.06 |
| **Code** | | | | | | |
| HumanEval | 50.0 | 67.68 | 35.37 | 6.71 | 18.9 | 10.37 |
| HumanEval+ | 42.68 | 60.98 | 29.27 | 5.49 | 16.46 | 9.15 |
| MBPP | 65.08 | 67.72 | 60.05 | 12.7 | 35.98 | 12.43 |
| MBPP+ | 55.03 | 58.99 | 49.47 | 9.52 | 29.89 | 9.52 |

Falcon-H1-1.5B-Instruct

| Tasks | Falcon-H1-1.5B | Qwen3-1.7B | Qwen2.5-1.5B | Gemma3-1B | Llama3.2-1B | Falcon3-1B |
|-------|----------------|------------|--------------|-----------|-------------|------------|
| **General** | | | | | | |
| BBH | 46.47 | 35.18 | 42.41 | 35.86 | 33.21 | 34.47 |
| ARC-C | 42.06 | 34.81 | 40.53 | 34.13 | 34.64 | 43.09 |
| TruthfulQA | 45.98 | 49.39 | 47.05 | 42.17 | 42.08 | 42.31 |
| HellaSwag | 63.33 | 49.27 | 62.23 | 42.24 | 55.3 | 58.53 |
| MMLU | 62.03 | 57.04 | 59.76 | 40.87 | 45.93 | 46.1 |
| **Math** | | | | | | |
| GSM8k | 74.98 | 69.83 | 57.47 | 42.38 | 44.28 | 44.05 |
| MATH-500 | 74.0 | 73.0 | 48.4 | 45.4 | 13.2 | 19.8 |
| AMC-23 | 43.59 | 46.09 | 24.06 | 19.22 | 7.19 | 6.87 |
| AIME-24 | 11.25 | 12.5 | 2.29 | 0.42 | 1.46 | 0.41 |
| AIME-25 | 9.58 | 8.12 | 1.25 | 1.25 | 0.0 | 0.21 |
| **Science** | | | | | | |
| GPQA | 26.34 | 27.68 | 26.26 | 28.19 | 26.59 | 26.76 |
| GPQA_Diamond | 35.19 | 33.33 | 25.59 | 21.55 | 25.08 | 31.31 |
| MMLU-Pro | 37.8 | 23.54 | 28.35 | 14.46 | 16.2 | 18.49 |
| MMLU-stem | 64.13 | 54.3 | 54.04 | 35.39 | 39.16 | 39.64 |
| **Code** | | | | | | |
| HumanEval | 68.29 | 67.68 | 56.1 | 40.85 | 34.15 | 22.56 |
| HumanEval+ | 61.59 | 60.96 | 50.61 | 37.2 | 29.88 | 20.73 |
| MBPP | 64.81 | 58.73 | 64.81 | 57.67 | 33.6 | 20.63 |
| MBPP+ | 56.35 | 49.74 | 56.08 | 50.0 | 29.37 | 17.2 |
| LiveCodeBench | 17.61 | 14.87 | 12.52 | 5.09 | 2.35 | 0.78 |
| CRUXEval | 39.57 | 18.88 | 34.76 | 12.7 | 0.06 | 15.58 |
| **Instruction Following** | | | | | | |
| IFEval | 80.66 | 70.77 | 45.33 | 61.48 | 55.34 | 54.26 |
| Alpaca-Eval | 28.18 | 21.89 | 9.54 | 17.87 | 9.38 | 6.98 |
| MTBench | 8.46 | 7.61 | 7.1 | 7.03 | 6.37 | 6.03 |
| LiveBench | 34.13 | 40.73 | 21.65 | 18.79 | 14.97 | 14.1 |

Falcon-H1-1.5B-Deep

| Tasks | Falcon-H1-1.5B-deep | Qwen3-1.7B | Qwen2.5-1.5B | Gemma3-1B | Llama3.2-1B | Falcon3-1B |
|-------|---------------------|------------|--------------|-----------|-------------|------------|
| **General** | | | | | | |
| BBH | 52.37 | 43.05 | 40.55 | 30.26 | 30.72 | 35.24 |
| MMLU | 66.29 | 62.46 | 61.13 | 26.33 | 32.39 | 45.14 |
| ARC-C | 55.89 | 55.72 | 54.27 | 39.33 | 39.42 | 47.87 |
| HellaSwag | 69.72 | 67.09 | 67.86 | 62.94 | 65.73 | 62.3 |
| Winogrande | 67.09 | 66.3 | 64.56 | 62.59 | 62.75 | 61.17 |
| **Math** | | | | | | |
| GSM8k | 68.69 | 70.74 | 63.0 | 2.2 | 7.05 | 34.95 |
| MATH lvl5 | 24.77 | 16.39 | 8.84 | 1.21 | 0.98 | 3.4 |
| **Science** | | | | | | |
| GPQA | 32.8 | 29.45 | 28.36 | 24.66 | 23.57 | 27.85 |
| MMLU-Pro | 41.07 | 33.81 | 28.72 | 11.31 | 11.8 | 16.11 |
| MMLU-stem | 67.43 | 61.53 | 54.93 | 27.59 | 30.19 | 40.06 |
| **Code** | | | | | | |
| HumanEval | 52.44 | 67.68 | 35.37 | 6.71 | 18.9 | 10.37 |
| HumanEval+ | 46.34 | 60.98 | 29.27 | 5.49 | 16.46 | 9.15 |
| MBPP | 70.9 | 67.72 | 60.05 | 12.7 | 35.98 | 12.43 |
| MBPP+ | 60.32 | 58.99 | 49.47 | 9.52 | 29.89 | 9.52 |

Falcon-H1-1.5B-Deep-Instruct

| Tasks | Falcon-H1-1.5B-deep | Qwen3-1.7B | Qwen2.5-1.5B | Gemma3-1B | Llama3.2-1B | Falcon3-1B |
|-------|---------------------|------------|--------------|-----------|-------------|------------|
| **General** | | | | | | |
| BBH | 54.43 | 35.18 | 42.41 | 35.86 | 33.21 | 34.47 |
| ARC-C | 43.86 | 34.81 | 40.53 | 34.13 | 34.64 | 43.09 |
| TruthfulQA | 50.48 | 49.39 | 47.05 | 42.17 | 42.08 | 42.31 |
| HellaSwag | 65.54 | 49.27 | 62.23 | 42.24 | 55.3 | 58.53 |
| MMLU | 66.11 | 57.04 | 59.76 | 40.87 | 45.93 | 46.1 |
| **Math** | | | | | | |
| GSM8k | 82.34 | 69.83 | 57.47 | 42.38 | 44.28 | 44.05 |
| MATH-500 | 77.8 | 73.0 | 48.4 | 45.4 | 13.2 | 19.8 |
| AMC-23 | 56.56 | 46.09 | 24.06 | 19.22 | 7.19 | 6.87 |
| AIME-24 | 14.37 | 12.5 | 2.29 | 0.42 | 1.46 | 0.41 |
| AIME-25 | 11.04 | 8.12 | 1.25 | 1.25 | 0.0 | 0.21 |
| **Science** | | | | | | |
| GPQA | 33.22 | 27.68 | 26.26 | 28.19 | 26.59 | 26.76 |
| GPQA_Diamond | 40.57 | 33.33 | 25.59 | 21.55 | 25.08 | 31.31 |
| MMLU-Pro | 41.89 | 23.54 | 28.35 | 14.46 | 16.2 | 18.49 |
| MMLU-stem | 67.3 | 54.3 | 54.04 | 35.39 | 39.16 | 39.64 |
| **Code** | | | | | | |
| HumanEval | 73.78 | 67.68 | 56.1 | 40.85 | 34.15 | 22.56 |
| HumanEval+ | 68.9 | 60.96 | 50.61 | 37.2 | 29.88 | 20.73 |
| MBPP | 68.25 | 58.73 | 64.81 | 57.67 | 33.6 | 20.63 |
| MBPP+ | 56.61 | 49.74 | 56.08 | 50.0 | 29.37 | 17.2 |
| LiveCodeBench | 23.87 | 14.87 | 12.52 | 5.09 | 2.35 | 0.78 |
| CRUXEval | 52.32 | 18.88 | 34.76 | 12.7 | 0.06 | 15.58 |
| **Instruction Following** | | | | | | |
| IFEval | 83.5 | 70.77 | 45.33 | 61.48 | 55.34 | 54.26 |
| Alpaca-Eval | 27.12 | 21.89 | 9.54 | 17.87 | 9.38 | 6.98 |
| MTBench | 8.53 | 7.61 | 7.1 | 7.03 | 6.37 | 6.03 |
| LiveBench | 36.83 | 40.73 | 21.65 | 18.79 | 14.97 | 14.1 |

Falcon-H1-3B

| Tasks | Falcon-H1-3B | Qwen3-4B | Qwen2.5-3B | Gemma3-4B | Llama3.2-3B | Falcon3-3B |
|-------|--------------|----------|------------|-----------|-------------|------------|
| **General** | | | | | | |
| BBH | 53.17 | 56.88 | 46.4 | 40.41 | 39.45 | 44.02 |
| MMLU | 68.39 | 72.92 | 65.56 | 59.41 | 55.94 | 56.77 |
| ARC-C | 61.35 | 64.33 | 56.57 | 58.36 | 51.02 | 55.12 |
| HellaSwag | 73.85 | 75.74 | 74.6 | 77.62 | 76.39 | 67.13 |
| Winogrande | 68.11 | 72.3 | 71.03 | 72.77 | 72.22 | 65.11 |
| **Math** | | | | | | |
| GSM8k | 68.31 | 81.65 | 74.6 | 37.6 | 27.82 | 64.67 |
| MATH lvl5 | 25.83 | 24.47 | 16.09 | 6.95 | 1.74 | 11.56 |
| **Science** | | | | | | |
| GPQA | 32.63 | 34.9 | 28.44 | 29.78 | 28.78 | 29.78 |
| MMLU-Pro | 40.58 | 46.18 | 32.12 | 28.34 | 25.08 | 29.03 |
| MMLU-stem | 69.55 | 75.58 | 62.23 | 51.7 | 47.67 | 55.34 |
| **Code** | | | | | | |
| HumanEval | 59.15 | 74.39 | 42.68 | 33.54 | 29.27 | 36.59 |
| HumanEval+ | 53.66 | 68.9 | 35.37 | 28.05 | 26.22 | 31.71 |
| MBPP | 71.43 | 74.6 | 59.52 | 60.05 | 48.94 | 51.85 |
| MBPP+ | 57.94 | 63.76 | 50.53 | 51.32 | 39.42 | 42.06 |

Falcon-H1-3B-Instruct

| Tasks | Falcon-H1-3B | Qwen3-4B | Qwen2.5-3B | Gemma3-4B | Llama3.2-3B | Falcon3-3B |
|-------|--------------|----------|------------|-----------|-------------|------------|
| **General** | | | | | | |
| BBH | 53.69 | 51.07 | 46.55 | 50.01 | 41.47 | 45.02 |
| ARC-C | 49.57 | 37.71 | 43.77 | 44.88 | 44.88 | 48.21 |
| TruthfulQA | 53.19 | 51.75 | 58.11 | 51.68 | 50.27 | 50.06 |
| HellaSwag | 69.85 | 55.31 | 64.21 | 47.68 | 63.74 | 64.24 |
| MMLU | 68.3 | 67.01 | 65.09 | 59.53 | 61.74 | 56.76 |
| **Math** | | | | | | |
| GSM8k | 84.76 | 80.44 | 57.54 | 77.41 | 77.26 | 74.68 |
| MATH-500 | 74.2 | 85.0 | 64.2 | 76.4 | 41.2 | 54.2 |
| AMC-23 | 55.63 | 66.88 | 39.84 | 48.12 | 22.66 | 29.69 |
| AIME-24 | 11.88 | 22.29 | 6.25 | 6.67 | 11.67 | 3.96 |
| AIME-25 | 13.33 | 18.96 | 3.96 | 13.33 | 0.21 | 2.29 |
| **Science** | | | | | | |
| GPQA | 33.89 | 28.02 | 28.69 | 29.19 | 28.94 | 28.69 |
| GPQA_Diamond | 38.72 | 40.74 | 35.69 | 28.62 | 29.97 | 29.29 |
| MMLU-Pro | 43.69 | 29.75 | 32.76 | 29.71 | 27.44 | 29.71 |
| MMLU-stem | 69.93 | 67.46 | 59.78 | 52.17 | 51.92 | 56.11 |
| **Code** | | | | | | |
| HumanEval | 76.83 | 84.15 | 73.78 | 67.07 | 54.27 | 52.44 |
| HumanEval+ | 70.73 | 76.83 | 68.29 | 61.59 | 50.0 | 45.73 |
| MBPP | 79.63 | 68.78 | 72.75 | 77.78 | 62.17 | 61.9 |
| MBPP+ | 67.46 | 59.79 | 60.85 | 66.93 | 50.53 | 55.29 |
| LiveCodeBench | 26.81 | 39.92 | 11.74 | 21.14 | 2.74 | 3.13 |
| CRUXEval | 56.25 | 69.63 | 43.26 | 52.13 | 17.75 | 44.38 |
| **Instruction Following** | | | | | | |
| IFEval | 85.05 | 84.01 | 64.26 | 77.01 | 74.0 | 69.1 |
| Alpaca-Eval | 31.09 | 36.51 | 17.37 | 39.64 | 19.69 | 14.82 |
| MTBench | 8.72 | 8.45 | 7.79 | 8.24 | 7.96 | 7.79 |
| LiveBench | 36.86 | 51.34 | 27.32 | 36.7 | 26.37 | 26.01 |

Falcon-H1-7B

| Tasks | Falcon-H1-7B | Qwen3-8B | Qwen2.5-7B | Gemma3-12B | Llama3.1-8B | Falcon3-7B | Falcon3-10B |
|-------|--------------|----------|------------|------------|-------------|------------|-------------|
| **General** | | | | | | | |
| BBH | 60.61 | 58.44 | 53.72 | 54.33 | 46.52 | 50.88 | 59.3 |
| MMLU | 77.38 | 76.63 | 74.17 | 74.23 | 65.17 | 69.98 | 73.22 |
| ARC-C | 65.19 | 67.75 | 63.91 | 67.58 | 57.68 | 62.71 | 67.49 |
| HellaSwag | 81.26 | 79.6 | 80.2 | 84.22 | 81.97 | 76.69 | 79.64 |
| Winogrande | 79.01 | 76.8 | 76.01 | 79.79 | 77.11 | 73.64 | 79.01 |
| **Math** | | | | | | | |
| GSM8k | 73.46 | 83.02 | 83.09 | 71.19 | 49.51 | 76.95 | 82.11 |
| MATH lvl5 | 34.67 | 28.85 | 22.58 | 17.22 | 6.57 | 20.09 | 25.38 |
| **Science** | | | | | | | |
| GPQA | 36.58 | 35.65 | 32.3 | 34.56 | 31.46 | 35.07 | 35.4 |
| MMLU-Pro | 48.38 | 48.25 | 43.55 | 42.72 | 32.71 | 39.23 | 42.45 |
| MMLU-stem | 77.2 | 78.53 | 71.04 | 68.51 | 55.72 | 67.71 | 70.85 |
| **Code** | | | | | | | |
| HumanEval | 67.68 | 87.8 | 57.32 | 45.12 | 39.02 | 50.0 | 51.83 |
| HumanEval+ | 63.41 | 82.32 | 48.78 | 36.59 | 31.71 | 43.29 | 44.51 |
| MBPP | 78.57 | 75.13 | 76.72 | 73.02 | 61.38 | 67.99 | 73.54 |
| MBPP+ | 67.2 | 64.02 | 63.49 | 59.79 | 51.32 | 57.14 | 61.38 |

Falcon-H1-7B-Instruct

| Tasks | Falcon-H1-7B | Qwen3-8B | Qwen2.5-7B | Gemma3-12B | Llama3.1-8B | Falcon3-7B | Falcon3-10B |
|-------|--------------|----------|------------|------------|-------------|------------|-------------|
| **General** | | | | | | | |
| BBH | 62.28 | 47.47 | 53.76 | 63.36 | 48.58 | 52.12 | 58.09 |
| ARC-C | 59.98 | 42.06 | 41.38 | 51.96 | 52.39 | 54.35 | 54.44 |
| TruthfulQA | 59.91 | 53.19 | 62.41 | 61.02 | 52.99 | 55.58 | 55.05 |
| HellaSwag | 75.92 | 60.56 | 63.4 | 55.63 | 71.28 | 71.81 | 75.57 |
| MMLU | 76.83 | 71.56 | 73.64 | 72.5 | 68.67 | 70.81 | 74.01 |
| **Math** | | | | | | | |
| GSM8k | 81.65 | 78.92 | 71.95 | 87.49 | 82.49 | 81.05 | 85.06 |
| MATH-500 | 73.4 | 83.8 | 75.8 | 86.2 | 45.8 | 69.0 | 68.6 |
| AMC-23 | 56.72 | 70.78 | 53.91 | 66.88 | 22.81 | 40.0 | 45.78 |
| AIME-24 | 16.04 | 28.33 | 12.29 | 22.5 | 5.42 | 8.75 | 9.79 |
| AIME-25 | 13.96 | 19.17 | 9.58 | 18.75 | 0.42 | 6.25 | 5.42 |
| **Science** | | | | | | | |
| GPQA | 36.33 | 25.84 | 31.79 | 33.98 | 32.72 | 31.21 | 33.39 |
| GPQA_Diamond | 56.9 | 43.1 | 33.0 | 37.71 | 31.31 | 37.21 | 34.68 |
| MMLU-Pro | 51.75 | 34.64 | 43.23 | 39.88 | 36.42 | 40.73 | 44.05 |
| MMLU-stem | 77.61 | 66.89 | 69.36 | 66.54 | 59.31 | 67.43 | 70.57 |
| **Code** | | | | | | | |
| HumanEval | 86.59 | 84.75 | 82.32 | 84.76 | 68.29 | 71.95 | 82.32 |
| HumanEval+ | 81.1 | 79.27 | 73.78 | 75.61 | 61.59 | 65.85 | 75.0 |
| MBPP | 80.69 | 71.96 | 79.63 | 85.71 | 68.25 | 77.25 | 73.28 |
| MBPP+ | 68.78 | 62.7 | 68.25 | 72.22 | 55.03 | 65.87 | 64.02 |
| LiveCodeBench | 35.03 | 45.6 | 32.68 | 30.92 | 15.85 | 12.72 | 19.77 |
| CRUXEval | 66.51 | 72.7 | 56.9 | 67.67 | 21.57 | 55.0 | 59.57 |
| **Instruction Following** | | | | | | | |
| IFEval | 85.35 | 83.43 | 75.25 | 81.51 | 77.04 | 76.59 | 78.84 |
| Alpaca-Eval | 40.23 | 46.13 | 29.48 | 43.55 | 25.48 | 27.56 | 24.31 |
| MTBench | 8.85 | 8.74 | 8.45 | 8.69 | 8.29 | 8.73 | 8.46 |
| LiveBench | 45.74 | 56.19 | 37.13 | 49.23 | 31.73 | 32.35 | 34.3 |

Falcon-H1-34B

| Tasks | Falcon-H1-34B | Qwen2.5-72B | Qwen2.5-32B | Gemma3-27B | Llama3.1-70B | Llama4-scout |
|-------|---------------|-------------|-------------|------------|--------------|--------------|
| **General** | | | | | | |
| BBH | 69.36 | 67.77 | 67.45 | 61.6 | 62.78 | 61.71 |
| MMLU | 83.46 | 85.96 | 83.18 | 78.32 | 78.49 | 77.98 |
| ARC-C | 71.25 | 72.44 | 70.48 | 70.31 | 69.2 | 62.97 |
| HellaSwag | 85.68 | 87.57 | 85.13 | 86.19 | 87.78 | 84.01 |
| Winogrande | 82.72 | 83.74 | 82.32 | 82.4 | 85.32 | 78.93 |
| **Math** | | | | | | |
| GSM8k | 76.5 | 89.76 | 90.14 | 81.35 | 80.52 | 83.24 |
| MATH lvl5 | 40.71 | 38.14 | 36.4 | 25.38 | 18.81 | 27.19 |
| **Science** | | | | | | |
| GPQA | 42.7 | 42.28 | 39.68 | 35.82 | 36.49 | 35.99 |
| MMLU-Pro | 57.18 | 60.22 | 58.05 | 49.64 | 47.07 | 50.16 |
| MMLU-stem | 83.82 | 84.81 | 82.81 | 76.59 | 70.35 | 72.57 |
| **Code** | | | | | | |
| HumanEval | 70.12 | 59.15 | 59.76 | 48.78 | 57.32 | 57.32 |
| HumanEval+ | 64.63 | 51.22 | 51.83 | 40.85 | 50.61 | 48.78 |
| MBPP | 83.33 | 87.04 | 83.07 | 76.19 | 78.84 | 77.78 |
| MBPP+ | 70.37 | 70.63 | 68.78 | 61.64 | 66.67 | 64.29 |

Falcon-H1-34B-Instruct

| Tasks | Falcon-H1-34B | Qwen3-32B | Qwen2.5-72B | Qwen2.5-32B | Gemma3-27B | Llama3.3-70B | Llama4-scout |
|-------|---------------|-----------|-------------|-------------|------------|--------------|--------------|
| **General** | | | | | | | |
| BBH | 70.68 | 62.47 | 72.52 | 68.72 | 67.28 | 69.15 | 64.9 |
| ARC-C | 61.01 | 48.98 | 46.59 | 44.54 | 54.52 | 63.65 | 56.14 |
| TruthfulQA | 65.27 | 58.58 | 69.8 | 70.28 | 64.26 | 66.15 | 62.74 |
| HellaSwag | 81.94 | 68.89 | 68.79 | 73.95 | 57.25 | 70.24 | 65.03 |
| MMLU | 84.05 | 80.89 | 84.42 | 82.8 | 78.01 | 82.08 | 80.4 |
| **Math** | | | | | | | |
| GSM8k | 83.62 | 88.78 | 82.26 | 78.47 | 90.37 | 93.71 | 90.37 |
| MATH-500 | 83.8 | 82.0 | 83.6 | 82.2 | 90.0 | 70.6 | 83.2 |
| AMC-23 | 69.38 | 67.34 | 67.34 | 68.75 | 77.81 | 39.38 | 69.06 |
| AIME-24 | 23.75 | 27.71 | 17.29 | 17.92 | 27.5 | 12.92 | 27.92 |
| AIME-25 | 16.67 | 19.79 | 15.21 | 11.46 | 22.71 | 1.25 | 8.96 |
| **Science** | | | | | | | |
| GPQA | 41.53 | 30.2 | 37.67 | 34.31 | 36.49 | 31.99 | 31.8 |
| GPQA_Diamond | 49.66 | 49.49 | 44.95 | 40.74 | 47.47 | 42.09 | 51.18 |
| MMLU-Pro | 58.73 | 54.68 | 56.35 | 56.63 | 47.81 | 53.29 | 55.58 |
| MMLU-stem | 83.57 | 81.64 | 82.59 | 82.37 | 73.55 | 74.88 | 75.2 |
| **Code** | | | | | | | |
| HumanEval | 87.2 | 90.85 | 87.2 | 90.24 | 86.59 | 83.53 | 85.4 |
| HumanEval+ | 81.71 | 85.37 | 80.49 | 82.32 | 78.05 | 79.87 | 78.7 |
| MBPP | 83.86 | 86.24 | 89.68 | 87.83 | 88.36 | 88.09 | 81.5 |
| MBPP+ | 71.43 | 71.96 | 75.4 | 74.07 | 74.07 | 73.81 | 64.8 |
| LiveCodeBench | 49.71 | 45.01 | 54.6 | 49.12 | 39.53 | 40.31 | 40.12 |
| CRUXEval | 73.07 | 78.45 | 75.63 | 73.5 | 74.82 | 69.53 | 68.32 |
| **Instruction Following** | | | | | | | |
| IFEval | 89.37 | 86.97 | 86.35 | 81.79 | 83.19 | 89.94 | 86.32 |
| Alpaca-Eval | 48.32 | 64.21 | 49.29 | 39.26 | 56.16 | 38.27 | 36.26 |
| MTBench | 9.2 | 9.05 | 9.16 | 9.09 | 8.75 | 8.98 | 8.98 |
| LiveBench | 46.26 | 63.05 | 54.03 | 52.92 | 55.41 | 53.11 | 54.21 |