Evaluations
We summarize our evaluation settings in the following table.
Task | Base models setting | Instruct models setting |
---|---|---|
General | ||
BBH | logprobs, 3-shot | logprobs, 3-shot |
ARC-C | logprobs, 25-shot | logprobs, 0-shot |
TruthfulQA | - | logprobs, 0-shot |
HellaSwag | logprobs, 10-shot | logprobs, 0-shot |
Winogrande | logprobs, 5-shot | - |
MMLU | logprobs, 5-shot | logprobs, 5-shot |
Math | ||
GSM8k | strict match, 5-shot | strict match, 5-shot |
MATH-500 | - | accuracy |
MATH lvl5 | math verify, 4-shot | - |
AMC-23 | - | average accuracy, 16 repetitions |
AIME-24 | - | average accuracy, 16 repetitions |
AIME-25 | - | average accuracy, 16 repetitions |
Science | ||
GPQA | logprobs, 5-shot | logprobs, 5-shot |
GPQA_Diamond | - | average accuracy, 3 repetitions |
MMLU-Pro | logprobs, 5-shot | logprobs, 5-shot |
MMLU-stem | subset of MMLU | subset of MMLU |
Code | ||
HumanEval | pass@1 | pass@1 |
HumanEval+ | pass@1 | pass@1 |
MBPP | pass@1 | pass@1 |
MBPP+ | pass@1 | pass@1 |
LiveCodeBench | - | accuracy |
CRUXEval | - | pass@1, input & output average |
Instruction Following | ||
IFEval | - | instruction- & prompt-level average accuracy
Alpaca-Eval | - | length-controlled (LC) win rate
MTBench | - | turn 1 & 2 average |
LiveBench | - | global_average |
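For reference, the code benchmarks above report pass@1 over generated samples, and the AMC/AIME scores are accuracies averaged over repeated generations (16 repetitions). The sketch below illustrates both aggregations; it is a minimal Python example with illustrative helper names and toy data, not code from our evaluation harness.

```python
from math import comb
from typing import Sequence

def pass_at_k(num_samples: int, num_correct: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    1 - C(n - c, k) / C(n, k), with n generated samples and c passing ones."""
    if num_samples - num_correct < k:
        return 1.0
    return 1.0 - comb(num_samples - num_correct, k) / comb(num_samples, k)

def repetition_accuracy(correct_flags: Sequence[Sequence[bool]]) -> float:
    """Average accuracy over repeated generations, in the spirit of the
    AMC/AIME scores: per-problem accuracy over repetitions, then averaged
    over problems."""
    per_problem = [sum(flags) / len(flags) for flags in correct_flags]
    return sum(per_problem) / len(per_problem)

# Toy usage (hypothetical data): one coding task solved on its single sample,
# and two math problems each attempted with 16 repetitions.
print(pass_at_k(num_samples=1, num_correct=1, k=1))   # 1.0 -> pass@1
print(repetition_accuracy([[True] * 10 + [False] * 6,
                           [False] * 16]))            # 0.3125
```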
The evaluation results for the base and instruct models are reported below.
Falcon-H1-0.5B
Tasks | Falcon-H1-0.5B | Qwen3-0.6B | Qwen2.5-0.5B | Gemma3-1B | Llama3.2-1B | Falcon3-1B |
---|---|---|---|---|---|---|
General | ||||||
BBH | 40.22 | 36.07 | 32.62 | 30.26 | 30.72 | 35.24 |
MMLU | 55.04 | 52.64 | 47.61 | 26.33 | 32.39 | 45.14 |
ARC-C | 46.93 | 44.8 | 35.32 | 39.33 | 39.42 | 47.87 |
HellaSwag | 56.3 | 53.51 | 51.79 | 62.94 | 65.73 | 62.3 |
Winogrande | 59.43 | 60.54 | 56.83 | 62.59 | 62.75 | 61.17 |
Math | ||||||
GSM8k | 60.2 | 50.04 | 34.8 | 2.2 | 7.05 | 34.95 |
MATH lvl5 | 15.18 | 9.29 | 4.23 | 1.21 | 0.98 | 3.4 |
Science | ||||||
GPQA | 29.7 | 29.11 | 27.94 | 24.66 | 23.57 | 27.85 |
MMLU-Pro | 30.04 | 22.99 | 18.98 | 11.31 | 11.8 | 16.11 |
MMLU-stem | 57.12 | 50.11 | 43.74 | 27.59 | 30.19 | 40.06 |
Code | ||||||
HumanEval | 35.98 | 31.71 | 29.27 | 6.71 | 18.9 | 10.37 |
HumanEval+ | 31.1 | 27.44 | 25.0 | 5.49 | 16.46 | 9.15 |
MBPP | 52.12 | 51.06 | 40.74 | 12.7 | 35.98 | 12.43 |
MBPP+ | 43.39 | 42.33 | 34.66 | 9.52 | 29.89 | 9.52 |
Falcon-H1-0.5B-Instruct
Tasks | Falcon-H1-0.5B | Qwen3-0.6B | Qwen2.5-0.5B | Gemma3-1B | Llama3.2-1B | Falcon3-1B |
---|---|---|---|---|---|---|
General | ||||||
BBH | 42.91 | 32.95 | 33.26 | 35.86 | 33.21 | 34.47 |
ARC-C | 37.8 | 31.06 | 33.28 | 34.13 | 34.64 | 43.09 |
TruthfulQA | 44.12 | 51.65 | 46.19 | 42.17 | 42.08 | 42.31 |
HellaSwag | 51.93 | 42.17 | 52.38 | 42.24 | 55.3 | 58.53 |
MMLU | 53.4 | 42.98 | 46.07 | 40.87 | 45.93 | 46.1 |
Math | ||||||
GSM8k | 68.39 | 42.61 | 38.51 | 42.38 | 44.28 | 44.05 |
MATH-500 | 58.4 | 46.0 | 27.8 | 45.4 | 13.2 | 19.8 |
AMC-23 | 33.13 | 27.97 | 12.5 | 19.22 | 7.19 | 6.87 |
AIME-24 | 3.75 | 2.71 | 0.62 | 0.42 | 1.46 | 0.41 |
AIME-25 | 4.38 | 1.67 | 0.21 | 1.25 | 0.0 | 0.21 |
Science | ||||||
GPQA | 29.95 | 26.09 | 26.85 | 28.19 | 26.59 | 26.76 |
GPQA_Diamond | 27.95 | 25.08 | 24.24 | 21.55 | 25.08 | 31.31 |
MMLU-Pro | 31.03 | 16.95 | 18.73 | 14.46 | 16.2 | 18.49 |
MMLU-stem | 54.55 | 39.3 | 39.83 | 35.39 | 39.16 | 39.64 |
Code | ||||||
HumanEval | 51.83 | 41.46 | 36.59 | 40.85 | 34.15 | 22.56 |
HumanEval+ | 45.12 | 37.19 | 32.32 | 37.2 | 29.88 | 20.73 |
MBPP | 42.59 | 56.08 | 46.83 | 57.67 | 33.6 | 20.63 |
MBPP+ | 33.07 | 47.08 | 39.68 | 50.0 | 29.37 | 17.2 |
LiveCodeBench | 7.05 | 9.78 | 2.94 | 5.09 | 2.35 | 0.78 |
CRUXEval | 25.75 | 23.63 | 14.88 | 12.7 | 0.06 | 15.58 |
Instruction Following | ||||||
IFEval | 72.07 | 62.16 | 32.11 | 61.48 | 55.34 | 54.26 |
Alpaca-Eval | 10.79 | 9.59 | 3.26 | 17.87 | 9.38 | 6.98 |
MTBench | 7.06 | 5.75 | 4.71 | 7.03 | 6.37 | 6.03 |
LiveBench | 20.8 | 27.78 | 14.27 | 18.79 | 14.97 | 14.1 |
Falcon-H1-1.5B
Tasks | Falcon-H1-1.5B | Qwen3-1.7B | Qwen2.5-1.5B | Gemma3-1B | Llama3.2-1B | Falcon3-1B |
---|---|---|---|---|---|---|
General | ||||||
BBH | 46.57 | 43.05 | 40.55 | 30.26 | 30.72 | 35.24 |
MMLU | 61.81 | 62.46 | 61.13 | 26.33 | 32.39 | 45.14 |
ARC-C | 53.24 | 55.72 | 54.27 | 39.33 | 39.42 | 47.87 |
HellaSwag | 66.76 | 67.09 | 67.86 | 62.94 | 65.73 | 62.3 |
Winogrande | 65.59 | 66.3 | 64.56 | 62.59 | 62.75 | 61.17 |
Math | ||||||
GSM8k | 52.01 | 70.74 | 63.0 | 2.2 | 7.05 | 34.95 |
MATH lvl5 | 20.39 | 16.39 | 8.84 | 1.21 | 0.98 | 3.4 |
Science | ||||||
GPQA | 29.11 | 29.45 | 28.36 | 24.66 | 23.57 | 27.85 |
MMLU-Pro | 35.53 | 33.81 | 28.72 | 11.31 | 11.8 | 16.11 |
MMLU-stem | 63.37 | 61.53 | 54.93 | 27.59 | 30.19 | 40.06 |
Code | ||||||
HumanEval | 50.0 | 67.68 | 35.37 | 6.71 | 18.9 | 10.37 |
HumanEval+ | 42.68 | 60.98 | 29.27 | 5.49 | 16.46 | 9.15 |
MBPP | 65.08 | 67.72 | 60.05 | 12.7 | 35.98 | 12.43 |
MBPP+ | 55.03 | 58.99 | 49.47 | 9.52 | 29.89 | 9.52 |
Falcon-H1-1.5B-Instruct
Tasks | Falcon-H1-1.5B | Qwen3-1.7B | Qwen2.5-1.5B | Gemma3-1B | Llama3.2-1B | Falcon3-1B |
---|---|---|---|---|---|---|
General | ||||||
BBH | 46.47 | 35.18 | 42.41 | 35.86 | 33.21 | 34.47 |
ARC-C | 42.06 | 34.81 | 40.53 | 34.13 | 34.64 | 43.09 |
TruthfulQA | 45.98 | 49.39 | 47.05 | 42.17 | 42.08 | 42.31 |
HellaSwag | 63.33 | 49.27 | 62.23 | 42.24 | 55.3 | 58.53 |
MMLU | 62.03 | 57.04 | 59.76 | 40.87 | 45.93 | 46.1 |
Math | ||||||
GSM8k | 74.98 | 69.83 | 57.47 | 42.38 | 44.28 | 44.05 |
MATH-500 | 74.0 | 73.0 | 48.4 | 45.4 | 13.2 | 19.8 |
AMC-23 | 43.59 | 46.09 | 24.06 | 19.22 | 7.19 | 6.87 |
AIME-24 | 11.25 | 12.5 | 2.29 | 0.42 | 1.46 | 0.41 |
AIME-25 | 9.58 | 8.12 | 1.25 | 1.25 | 0.0 | 0.21 |
Science | ||||||
GPQA | 26.34 | 27.68 | 26.26 | 28.19 | 26.59 | 26.76 |
GPQA_Diamond | 35.19 | 33.33 | 25.59 | 21.55 | 25.08 | 31.31 |
MMLU-Pro | 37.8 | 23.54 | 28.35 | 14.46 | 16.2 | 18.49 |
MMLU-stem | 64.13 | 54.3 | 54.04 | 35.39 | 39.16 | 39.64 |
Code | ||||||
HumanEval | 68.29 | 67.68 | 56.1 | 40.85 | 34.15 | 22.56 |
HumanEval+ | 61.59 | 60.96 | 50.61 | 37.2 | 29.88 | 20.73 |
MBPP | 64.81 | 58.73 | 64.81 | 57.67 | 33.6 | 20.63 |
MBPP+ | 56.35 | 49.74 | 56.08 | 50.0 | 29.37 | 17.2 |
LiveCodeBench | 17.61 | 14.87 | 12.52 | 5.09 | 2.35 | 0.78 |
CRUXEval | 39.57 | 18.88 | 34.76 | 12.7 | 0.06 | 15.58 |
Instruction Following | ||||||
IFEval | 80.66 | 70.77 | 45.33 | 61.48 | 55.34 | 54.26 |
Alpaca-Eval | 28.18 | 21.89 | 9.54 | 17.87 | 9.38 | 6.98 |
MTBench | 8.46 | 7.61 | 7.1 | 7.03 | 6.37 | 6.03 |
LiveBench | 34.13 | 40.73 | 21.65 | 18.79 | 14.97 | 14.1 |
Falcon-H1-1.5B-Deep
Tasks | Falcon-H1-1.5B-Deep | Qwen3-1.7B | Qwen2.5-1.5B | Gemma3-1B | Llama3.2-1B | Falcon3-1B |
---|---|---|---|---|---|---|
General | ||||||
BBH | 52.37 | 43.05 | 40.55 | 30.26 | 30.72 | 35.24 |
MMLU | 66.29 | 62.46 | 61.13 | 26.33 | 32.39 | 45.14 |
ARC-C | 55.89 | 55.72 | 54.27 | 39.33 | 39.42 | 47.87 |
HellaSwag | 69.72 | 67.09 | 67.86 | 62.94 | 65.73 | 62.3 |
Winogrande | 67.09 | 66.3 | 64.56 | 62.59 | 62.75 | 61.17 |
Math | ||||||
GSM8k | 68.69 | 70.74 | 63.0 | 2.2 | 7.05 | 34.95 |
MATH lvl5 | 24.77 | 16.39 | 8.84 | 1.21 | 0.98 | 3.4 |
Science | ||||||
GPQA | 32.8 | 29.45 | 28.36 | 24.66 | 23.57 | 27.85 |
MMLU-Pro | 41.07 | 33.81 | 28.72 | 11.31 | 11.8 | 16.11 |
MMLU-stem | 67.43 | 61.53 | 54.93 | 27.59 | 30.19 | 40.06 |
Code | ||||||
HumanEval | 52.44 | 67.68 | 35.37 | 6.71 | 18.9 | 10.37 |
HumanEval+ | 46.34 | 60.98 | 29.27 | 5.49 | 16.46 | 9.15 |
MBPP | 70.9 | 67.72 | 60.05 | 12.7 | 35.98 | 12.43 |
MBPP+ | 60.32 | 58.99 | 49.47 | 9.52 | 29.89 | 9.52 |
Falcon-H1-1.5B-Deep-Instruct
Tasks | Falcon-H1-1.5B-Deep | Qwen3-1.7B | Qwen2.5-1.5B | Gemma3-1B | Llama3.2-1B | Falcon3-1B |
---|---|---|---|---|---|---|
General | ||||||
BBH | 54.43 | 35.18 | 42.41 | 35.86 | 33.21 | 34.47 |
ARC-C | 43.86 | 34.81 | 40.53 | 34.13 | 34.64 | 43.09 |
TruthfulQA | 50.48 | 49.39 | 47.05 | 42.17 | 42.08 | 42.31 |
HellaSwag | 65.54 | 49.27 | 62.23 | 42.24 | 55.3 | 58.53 |
MMLU | 66.11 | 57.04 | 59.76 | 40.87 | 45.93 | 46.1 |
Math | ||||||
GSM8k | 82.34 | 69.83 | 57.47 | 42.38 | 44.28 | 44.05 |
MATH-500 | 77.8 | 73.0 | 48.4 | 45.4 | 13.2 | 19.8 |
AMC-23 | 56.56 | 46.09 | 24.06 | 19.22 | 7.19 | 6.87 |
AIME-24 | 14.37 | 12.5 | 2.29 | 0.42 | 1.46 | 0.41 |
AIME-25 | 11.04 | 8.12 | 1.25 | 1.25 | 0.0 | 0.21 |
Science | ||||||
GPQA | 33.22 | 27.68 | 26.26 | 28.19 | 26.59 | 26.76 |
GPQA_Diamond | 40.57 | 33.33 | 25.59 | 21.55 | 25.08 | 31.31 |
MMLU-Pro | 41.89 | 23.54 | 28.35 | 14.46 | 16.2 | 18.49 |
MMLU-stem | 67.3 | 54.3 | 54.04 | 35.39 | 39.16 | 39.64 |
Code | ||||||
HumanEval | 73.78 | 67.68 | 56.1 | 40.85 | 34.15 | 22.56 |
HumanEval+ | 68.9 | 60.96 | 50.61 | 37.2 | 29.88 | 20.73 |
MBPP | 68.25 | 58.73 | 64.81 | 57.67 | 33.6 | 20.63 |
MBPP+ | 56.61 | 49.74 | 56.08 | 50.0 | 29.37 | 17.2 |
LiveCodeBench | 23.87 | 14.87 | 12.52 | 5.09 | 2.35 | 0.78 |
CRUXEval | 52.32 | 18.88 | 34.76 | 12.7 | 0.06 | 15.58 |
Instruction Following | ||||||
IFEval | 83.5 | 70.77 | 45.33 | 61.48 | 55.34 | 54.26 |
Alpaca-Eval | 27.12 | 21.89 | 9.54 | 17.87 | 9.38 | 6.98 |
MTBench | 8.53 | 7.61 | 7.1 | 7.03 | 6.37 | 6.03 |
LiveBench | 36.83 | 40.73 | 21.65 | 18.79 | 14.97 | 14.1 |
Falcon-H1-3B
Tasks | Falcon-H1-3B | Qwen3-4B | Qwen2.5-3B | Gemma3-4B | Llama3.2-3B | Falcon3-3B |
---|---|---|---|---|---|---|
General | ||||||
BBH | 53.17 | 56.88 | 46.4 | 40.41 | 39.45 | 44.02 |
MMLU | 68.39 | 72.92 | 65.56 | 59.41 | 55.94 | 56.77 |
ARC-C | 61.35 | 64.33 | 56.57 | 58.36 | 51.02 | 55.12 |
HellaSwag | 73.85 | 75.74 | 74.6 | 77.62 | 76.39 | 67.13 |
Winogrande | 68.11 | 72.3 | 71.03 | 72.77 | 72.22 | 65.11 |
Math | ||||||
GSM8k | 68.31 | 81.65 | 74.6 | 37.6 | 27.82 | 64.67 |
MATH lvl5 | 25.83 | 24.47 | 16.09 | 6.95 | 1.74 | 11.56 |
Science | ||||||
GPQA | 32.63 | 34.9 | 28.44 | 29.78 | 28.78 | 29.78 |
MMLU-Pro | 40.58 | 46.18 | 32.12 | 28.34 | 25.08 | 29.03 |
MMLU-stem | 69.55 | 75.58 | 62.23 | 51.7 | 47.67 | 55.34 |
Code | ||||||
HumanEval | 59.15 | 74.39 | 42.68 | 33.54 | 29.27 | 36.59 |
HumanEval+ | 53.66 | 68.9 | 35.37 | 28.05 | 26.22 | 31.71 |
MBPP | 71.43 | 74.6 | 59.52 | 60.05 | 48.94 | 51.85 |
MBPP+ | 57.94 | 63.76 | 50.53 | 51.32 | 39.42 | 42.06 |
Falcon-H1-3B-Instruct
Tasks | Falcon-H1-3B | Qwen3-4B | Qwen2.5-3B | Gemma3-4B | Llama3.2-3B | Falcon3-3B |
---|---|---|---|---|---|---|
General | ||||||
BBH | 53.69 | 51.07 | 46.55 | 50.01 | 41.47 | 45.02 |
ARC-C | 49.57 | 37.71 | 43.77 | 44.88 | 44.88 | 48.21 |
TruthfulQA | 53.19 | 51.75 | 58.11 | 51.68 | 50.27 | 50.06 |
HellaSwag | 69.85 | 55.31 | 64.21 | 47.68 | 63.74 | 64.24 |
MMLU | 68.3 | 67.01 | 65.09 | 59.53 | 61.74 | 56.76 |
Math | ||||||
GSM8k | 84.76 | 80.44 | 57.54 | 77.41 | 77.26 | 74.68 |
MATH-500 | 74.2 | 85.0 | 64.2 | 76.4 | 41.2 | 54.2 |
AMC-23 | 55.63 | 66.88 | 39.84 | 48.12 | 22.66 | 29.69 |
AIME-24 | 11.88 | 22.29 | 6.25 | 6.67 | 11.67 | 3.96 |
AIME-25 | 13.33 | 18.96 | 3.96 | 13.33 | 0.21 | 2.29 |
Science | ||||||
GPQA | 33.89 | 28.02 | 28.69 | 29.19 | 28.94 | 28.69 |
GPQA_Diamond | 38.72 | 40.74 | 35.69 | 28.62 | 29.97 | 29.29 |
MMLU-Pro | 43.69 | 29.75 | 32.76 | 29.71 | 27.44 | 29.71 |
MMLU-stem | 69.93 | 67.46 | 59.78 | 52.17 | 51.92 | 56.11 |
Code | ||||||
HumanEval | 76.83 | 84.15 | 73.78 | 67.07 | 54.27 | 52.44 |
HumanEval+ | 70.73 | 76.83 | 68.29 | 61.59 | 50.0 | 45.73 |
MBPP | 79.63 | 68.78 | 72.75 | 77.78 | 62.17 | 61.9 |
MBPP+ | 67.46 | 59.79 | 60.85 | 66.93 | 50.53 | 55.29 |
LiveCodeBench | 26.81 | 39.92 | 11.74 | 21.14 | 2.74 | 3.13 |
CRUXEval | 56.25 | 69.63 | 43.26 | 52.13 | 17.75 | 44.38 |
Instruction Following | ||||||
IFEval | 85.05 | 84.01 | 64.26 | 77.01 | 74.0 | 69.1 |
Alpaca-Eval | 31.09 | 36.51 | 17.37 | 39.64 | 19.69 | 14.82 |
MTBench | 8.72 | 8.45 | 7.79 | 8.24 | 7.96 | 7.79 |
LiveBench | 36.86 | 51.34 | 27.32 | 36.7 | 26.37 | 26.01 |
Falcon-H1-7B
Tasks | Falcon-H1-7B | Qwen3-8B | Qwen2.5-7B | Gemma3-12B | Llama3.1-8B | Falcon3-7B | Falcon3-10B |
---|---|---|---|---|---|---|---|
General | |||||||
BBH | 60.61 | 58.44 | 53.72 | 54.33 | 46.52 | 50.88 | 59.3 |
MMLU | 77.38 | 76.63 | 74.17 | 74.23 | 65.17 | 69.98 | 73.22 |
ARC-C | 65.19 | 67.75 | 63.91 | 67.58 | 57.68 | 62.71 | 67.49 |
HellaSwag | 81.26 | 79.6 | 80.2 | 84.22 | 81.97 | 76.69 | 79.64 |
Winogrande | 79.01 | 76.8 | 76.01 | 79.79 | 77.11 | 73.64 | 79.01 |
Math | |||||||
GSM8k | 73.46 | 83.02 | 83.09 | 71.19 | 49.51 | 76.95 | 82.11 |
MATH lvl5 | 34.67 | 28.85 | 22.58 | 17.22 | 6.57 | 20.09 | 25.38 |
Science | |||||||
GPQA | 36.58 | 35.65 | 32.3 | 34.56 | 31.46 | 35.07 | 35.4 |
MMLU-Pro | 48.38 | 48.25 | 43.55 | 42.72 | 32.71 | 39.23 | 42.45 |
MMLU-stem | 77.2 | 78.53 | 71.04 | 68.51 | 55.72 | 67.71 | 70.85 |
Code | |||||||
HumanEval | 67.68 | 87.8 | 57.32 | 45.12 | 39.02 | 50.0 | 51.83 |
HumanEval+ | 63.41 | 82.32 | 48.78 | 36.59 | 31.71 | 43.29 | 44.51 |
MBPP | 78.57 | 75.13 | 76.72 | 73.02 | 61.38 | 67.99 | 73.54 |
MBPP+ | 67.2 | 64.02 | 63.49 | 59.79 | 51.32 | 57.14 | 61.38 |
Falcon-H1-7B-Instruct
Tasks | Falcon-H1-7B | Qwen3-8B | Qwen2.5-7B | Gemma3-12B | Llama3.1-8B | Falcon3-7B | Falcon3-10B |
---|---|---|---|---|---|---|---|
General | |||||||
BBH | 62.28 | 47.47 | 53.76 | 63.36 | 48.58 | 52.12 | 58.09 |
ARC-C | 59.98 | 42.06 | 41.38 | 51.96 | 52.39 | 54.35 | 54.44 |
TruthfulQA | 59.91 | 53.19 | 62.41 | 61.02 | 52.99 | 55.58 | 55.05 |
HellaSwag | 75.92 | 60.56 | 63.4 | 55.63 | 71.28 | 71.81 | 75.57 |
MMLU | 76.83 | 71.56 | 73.64 | 72.5 | 68.67 | 70.81 | 74.01 |
Math | |||||||
GSM8k | 81.65 | 78.92 | 71.95 | 87.49 | 82.49 | 81.05 | 85.06 |
MATH-500 | 73.4 | 83.8 | 75.8 | 86.2 | 45.8 | 69.0 | 68.6 |
AMC-23 | 56.72 | 70.78 | 53.91 | 66.88 | 22.81 | 40.0 | 45.78 |
AIME-24 | 16.04 | 28.33 | 12.29 | 22.5 | 5.42 | 8.75 | 9.79 |
AIME-25 | 13.96 | 19.17 | 9.58 | 18.75 | 0.42 | 6.25 | 5.42 |
Science | |||||||
GPQA | 36.33 | 25.84 | 31.79 | 33.98 | 32.72 | 31.21 | 33.39 |
GPQA_Diamond | 56.9 | 43.1 | 33.0 | 37.71 | 31.31 | 37.21 | 34.68 |
MMLU-Pro | 51.75 | 34.64 | 43.23 | 39.88 | 36.42 | 40.73 | 44.05 |
MMLU-stem | 77.61 | 66.89 | 69.36 | 66.54 | 59.31 | 67.43 | 70.57 |
Code | |||||||
HumanEval | 86.59 | 84.75 | 82.32 | 84.76 | 68.29 | 71.95 | 82.32 |
HumanEval+ | 81.1 | 79.27 | 73.78 | 75.61 | 61.59 | 65.85 | 75.0 |
MBPP | 80.69 | 71.96 | 79.63 | 85.71 | 68.25 | 77.25 | 73.28 |
MBPP+ | 68.78 | 62.7 | 68.25 | 72.22 | 55.03 | 65.87 | 64.02 |
LiveCodeBench | 35.03 | 45.6 | 32.68 | 30.92 | 15.85 | 12.72 | 19.77 |
CRUXEval | 66.51 | 72.7 | 56.9 | 67.67 | 21.57 | 55.0 | 59.57 |
Instruction Following | |||||||
IFEval | 85.35 | 83.43 | 75.25 | 81.51 | 77.04 | 76.59 | 78.84 |
Alpaca-Eval | 40.23 | 46.13 | 29.48 | 43.55 | 25.48 | 27.56 | 24.31 |
MTBench | 8.85 | 8.74 | 8.45 | 8.69 | 8.29 | 8.73 | 8.46 |
LiveBench | 45.74 | 56.19 | 37.13 | 49.23 | 31.73 | 32.35 | 34.3 |
Falcon-H1-34B
Tasks | Falcon-H1-34B | Qwen2.5-72B | Qwen2.5-32B | Gemma3-27B | Llama3.1-70B | Llama4-scout |
---|---|---|---|---|---|---|
General | ||||||
BBH | 69.36 | 67.77 | 67.45 | 61.6 | 62.78 | 61.71 |
MMLU | 83.46 | 85.96 | 83.18 | 78.32 | 78.49 | 77.98 |
ARC-C | 71.25 | 72.44 | 70.48 | 70.31 | 69.2 | 62.97 |
HellaSwag | 85.68 | 87.57 | 85.13 | 86.19 | 87.78 | 84.01 |
Winogrande | 82.72 | 83.74 | 82.32 | 82.4 | 85.32 | 78.93 |
Math | ||||||
GSM8k | 76.5 | 89.76 | 90.14 | 81.35 | 80.52 | 83.24 |
MATH lvl5 | 40.71 | 38.14 | 36.4 | 25.38 | 18.81 | 27.19 |
Science | ||||||
GPQA | 42.7 | 42.28 | 39.68 | 35.82 | 36.49 | 35.99 |
MMLU-Pro | 57.18 | 60.22 | 58.05 | 49.64 | 47.07 | 50.16 |
MMLU-stem | 83.82 | 84.81 | 82.81 | 76.59 | 70.35 | 72.57 |
Code | ||||||
HumanEval | 70.12 | 59.15 | 59.76 | 48.78 | 57.32 | 57.32 |
HumanEval+ | 64.63 | 51.22 | 51.83 | 40.85 | 50.61 | 48.78 |
MBPP | 83.33 | 87.04 | 83.07 | 76.19 | 78.84 | 77.78 |
MBPP+ | 70.37 | 70.63 | 68.78 | 61.64 | 66.67 | 64.29 |
Falcon-H1-34B-Instruct
Tasks | Falcon-H1-34B | Qwen3-32B | Qwen2.5-72B | Qwen2.5-32B | Gemma3-27B | Llama3.3-70B | Llama4-scout |
---|---|---|---|---|---|---|---|
General | |||||||
BBH | 70.68 | 62.47 | 72.52 | 68.72 | 67.28 | 69.15 | 64.9 |
ARC-C | 61.01 | 48.98 | 46.59 | 44.54 | 54.52 | 63.65 | 56.14 |
TruthfulQA | 65.27 | 58.58 | 69.8 | 70.28 | 64.26 | 66.15 | 62.74 |
HellaSwag | 81.94 | 68.89 | 68.79 | 73.95 | 57.25 | 70.24 | 65.03 |
MMLU | 84.05 | 80.89 | 84.42 | 82.8 | 78.01 | 82.08 | 80.4 |
Math | |||||||
GSM8k | 83.62 | 88.78 | 82.26 | 78.47 | 90.37 | 93.71 | 90.37 |
MATH-500 | 83.8 | 82.0 | 83.6 | 82.2 | 90.0 | 70.6 | 83.2 |
AMC-23 | 69.38 | 67.34 | 67.34 | 68.75 | 77.81 | 39.38 | 69.06 |
AIME-24 | 23.75 | 27.71 | 17.29 | 17.92 | 27.5 | 12.92 | 27.92 |
AIME-25 | 16.67 | 19.79 | 15.21 | 11.46 | 22.71 | 1.25 | 8.96 |
Science | |||||||
GPQA | 41.53 | 30.2 | 37.67 | 34.31 | 36.49 | 31.99 | 31.8 |
GPQA_Diamond | 49.66 | 49.49 | 44.95 | 40.74 | 47.47 | 42.09 | 51.18 |
MMLU-Pro | 58.73 | 54.68 | 56.35 | 56.63 | 47.81 | 53.29 | 55.58 |
MMLU-stem | 83.57 | 81.64 | 82.59 | 82.37 | 73.55 | 74.88 | 75.2 |
Code | |||||||
HumanEval | 87.2 | 90.85 | 87.2 | 90.24 | 86.59 | 83.53 | 85.4 |
HumanEval+ | 81.71 | 85.37 | 80.49 | 82.32 | 78.05 | 79.87 | 78.7 |
MBPP | 83.86 | 86.24 | 89.68 | 87.83 | 88.36 | 88.09 | 81.5 |
MBPP+ | 71.43 | 71.96 | 75.4 | 74.07 | 74.07 | 73.81 | 64.8 |
LiveCodeBench | 49.71 | 45.01 | 54.6 | 49.12 | 39.53 | 40.31 | 40.12 |
CRUXEval | 73.07 | 78.45 | 75.63 | 73.5 | 74.82 | 69.53 | 68.32 |
Instruction Following | |||||||
IFEval | 89.37 | 86.97 | 86.35 | 81.79 | 83.19 | 89.94 | 86.32 |
Alpaca-Eval | 48.32 | 64.21 | 49.29 | 39.26 | 56.16 | 38.27 | 36.26 |
MTBench | 9.2 | 9.05 | 9.16 | 9.09 | 8.75 | 8.98 | 8.98 |
LiveBench | 46.26 | 63.05 | 54.03 | 52.92 | 55.41 | 53.11 | 54.21 |