## Benchmarks
When evaluating which LLM to use, it's important to consider the model's "intelligence," which you can gauge from the benchmark results below. These help you determine the model size and quality your use case requires.
Model | Params (in billions) | Function Calling ↓ | MMLU (5-shot) | GPQA (0-shot) | GSM-8K (8-shot, CoT) | MATH (4-shot, CoT) | MT-bench |
---|---|---|---|---|---|---|---|
Claude-3.5 Sonnet | - | 98.57% | 88.7 | 59.4 | - | - | - |
GPT-4o | - | 98.57% | - | 53.6 | - | - | - |
Rubra Llama-3 70B Instruct | 70.6 | 97.85% | 75.90 | 33.93 | 82.26 | 34.24 | 8.36 |
Rubra Llama-3 8B Instruct | 8.9 | 89.28% | 64.39 | 31.70 | 68.99 | 23.76 | 8.03 |
Rubra Qwen2-7B-Instruct | 8.55 | 85.71% | 68.88 | 30.36 | 75.82 | 28.72 | 8.08 |
Rubra Mistral 7B Instruct v0.3 | 8.12 | 73.57% | 59.12 | 29.91 | 43.29 | 11.14 | 7.69 |
Rubra Mistral 7B Instruct v0.2 | 8.11 | 69.28% | 58.90 | 29.91 | 34.12 | 8.36 | 7.36 |
Rubra Phi-3 Mini 128k Instruct | 4.27 | 65.71% | 66.66 | 29.24 | 74.09 | 26.84 | 7.45 |
Nexusflow/NexusRaven-V2-13B | 13 | 53.75%* | 43.23 | 28.79 | 22.67 | 7.12 | 5.36 |
Rubra Gemma-1.1 2B Instruct | 2.84 | 45.00% | 38.85 | 24.55 | 6.14 | 2.38 | 5.75 |
gorilla-llm/gorilla-openfunctions-v2 | 6.91 | 41.25%* | 49.14 | 23.66 | 48.29 | 17.54 | 5.13 |
NousResearch/Hermes-2-Pro-Llama-3-8B | 8.03 | 41.25% | 64.16 | 31.92 | 73.92 | 21.58 | 7.83 |
Mistral 7B Instruct v0.3 | 7.25 | 22.5% | 62.10 | 30.58 | 53.07 | 12.98 | 7.50 |
Qwen2-7B-Instruct | 7.62 | - | 70.78 | 32.14 | 78.54 | 30.10 | 8.29 |
Phi-3 Mini 128k Instruct | 3.82 | - | 68.17 | 30.58 | 80.44 | 28.12 | 7.92 |
Mistral 7B Instruct v0.2 | 7.24 | - | 59.27 | 27.68 | 43.21 | 10.30 | 7.50 |
Llama-3 70B Instruct | 70.6 | - | 79.90 | 38.17 | 90.67 | 44.24 | 8.88 |
Llama-3 8B Instruct | 8.03 | - | 65.69 | 31.47 | 77.41 | 27.58 | 8.07 |
Gemma-1.1 2B Instruct | 2.51 | - | 37.84 | 22.99 | 6.29 | 6.14 | 5.82 |
MT-bench for all models was run in June 2024 using GPT-4 as the judge.
MMLU, GPQA, GSM-8K & MATH were all calculated using the Language Model Evaluation Harness.
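For reference, here is a minimal sketch of how one of these numbers can be regenerated with the lm-evaluation-harness Python API. The model id and batch size are placeholders, and the exact argument names and result-dict layout may vary between harness versions:

```python
# Minimal sketch: reproducing an MMLU (5-shot) run with EleutherAI's
# lm-evaluation-harness. The model id and batch size are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face backend
    model_args="pretrained=rubra-ai/Meta-Llama-3-8B-Instruct",  # placeholder model id
    tasks=["mmlu"],
    num_fewshot=5,                                # matches the 5-shot setting above
    batch_size="auto",
)

# Aggregate scores are keyed by task/group name; the exact layout
# depends on the harness version.
print(results["results"]["mmlu"])
```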
Our proprietary function calling benchmark will be open-sourced in the coming months; half of it is composed of the quickstart examples found in gptscript.
Some of the LLMs above require custom libraries to post-process LLM-generated tool calls. We followed those models' recommendations and guidelines in our evaluation:
- mistralai/Mistral-7B-Instruct-v0.3 required the mistral-inference library to extract function calls.
- NousResearch/Hermes-2-Pro-Llama-3-8B required hermes-function-calling.
- gorilla-llm/gorilla-openfunctions-v2 required the special prompting detailed in their GitHub repo.
- Nexusflow/NexusRaven-V2-13B required nexusraven-pip.
\* Nexusflow/NexusRaven-V2-13B and gorilla-llm/gorilla-openfunctions-v2 don't accept tool observations (the result of running a tool or function once the LLM calls it), so we appended the observation to the prompt.
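As an illustration of what "appending the observation to the prompt" can look like, here is a hedged sketch (not the exact evaluation code; the helper, tool name, and prompt format are made up) in which the tool's result is folded back into the prompt as plain text before the model is asked to continue:

```python
# Illustrative sketch: for models with no dedicated "tool" role, the result
# of a function call is appended to the running prompt as plain text.
def append_tool_observation(prompt: str, call: str, observation: str) -> str:
    """Append a tool call and its observed result to the running prompt."""
    return (
        f"{prompt}\n"
        f"Function call: {call}\n"
        f"Function result: {observation}\n"
        "Continue the conversation using the result above.\n"
    )

# Example usage with a made-up weather tool:
prompt = "User: What's the weather in Paris?\nAssistant:"
prompt = append_tool_observation(
    prompt,
    call='get_weather(city="Paris")',
    observation='{"temperature_c": 21, "condition": "sunny"}',
)
print(prompt)
```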