Leaderboard

The tables below present benchmarking results for popular foundation models using zero-shot prompting. The models were evaluated either through their APIs or by running them directly. For more information about a specific model, click on its name. Individuals and organizations are welcome to submit their model's predictions to VMLU at any time, using either zero-shot or few-shot evaluation, through the submission page. Please note that your results will not be published on the Leaderboard unless you request it.

Leaderboard of from-scratch models

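| # | Model | Creator | Access | Evaluation date (DD/MM/YYYY) | STEM | Social Science | Humanities | Others | Avg |
|---|-------|---------|--------|------------------------------|------|----------------|------------|--------|-----|
| 1 | Llama-3-70B | Meta | Weight | 23/04/2024 | 61.7 | 74.91 | 68.74 | 63.53 | 66.44 |
| 2 | KiLM-13b-v24.7.1 | Zalo AI | Private | 01/08/2024 | 60.29 | 73.15 | 71.85 | 60.13 | 66.07 |
| 3 | GPT-4 | OpenAI | API | 08/01/2024 | 63.84 | 71.78 | 66.14 | 60.37 | 65.53 |
| 4 | Gpt-4o-mini | OpenAI | API | 01/08/2024 | 58 | 70.95 | 65.03 | 60.91 | 62.87 |
| 5 | gemma-2-9b-it | Google | Weight | 01/08/2024 | 56.31 | 65.79 | 60.8 | 54.44 | 59.04 |
| 6 | gemini | Google | API | 30/01/2024 | 42.8 | 60.31 | 55.35 | 51.30 | 51.03 |
| 7 | ChatGPT | OpenAI | API | 08/01/2024 | 43.24 | 51.67 | 46.96 | 46.32 | 46.33 |
| 8 | ViGPT-1.6B-v1 | Vin BigData | Private | 08/01/2024 | 35.06 | 48.72 | 47.20 | 42.54 | 42.34 |
| 9 | gemma-7b-it | Google | Weight | 22/02/2024 | 39.95 | 44.93 | 43.39 | 40.11 | 41.9 |
| 10 | microsoft/Phi-3-small-128k-instruct | Microsoft | Weight | 01/08/2024 | 39.31 | 44.82 | 41.78 | 40.65 | 41.24 |
| 11 | microsoft/Phi-3-small-8k-instruct | Microsoft | Weight | 01/08/2024 | 38.72 | 43.60 | 42.32 | 39.99 | 40.88 |
| 12 | Qwen-7B | Alibaba Cloud | Weight | 08/01/2024 | 30.64 | 35.07 | 34.15 | 32.68 | 32.81 |
| 13 | Qwen2-7B-Instruct | Alibaba Cloud | Weight | 01/08/2024 | 21.96 | 35.24 | 33.13 | 29.29 | 28.85 |
| 14 | gemma-2b-it | Google | Weight | 22/02/2024 | 24.39 | 29.59 | 31.01 | 26.81 | 27.72 |
| 15 | sealion7b | AI Singapore | Weight | 08/01/2024 | 26.28 | 28.57 | 27.66 | 27.34 | 26.73 |
| 16 | bloom-1b7 | BigScience | Weight | 08/01/2024 | 25.13 | 25.09 | 26.34 | 25.19 | 25.51 |
| 17 | bloom-7b1 | BigScience | Weight | 08/01/2024 | 25.08 | 26.26 | 25.74 | 24.59 | 25.41 |
| 18 | falcon-7b | Technology Innovation Institute | Weight | 08/01/2024 | 24.19 | 23.59 | 26.72 | 24.73 | 24.96 |
| 19 | PhoGPT-7B5-Instruct | Vin AI | Weight | 08/01/2024 | 21.97 | 25.93 | 24.32 | 26.00 | 24.01 |
| 20 | Llama-2-7b-hf | Facebook Research - Meta | Weight | 08/01/2024 | 21.48 | 23.41 | 24.10 | 23.59 | 22.95 |
| 21 | falcon-7b-instruct | Technology Innovation Institute | Weight | 08/01/2024 | 9.50 | 13.63 | 14.98 | 6.13 | 11.39 |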

Leaderboard of fine-tuned models

| # | Model | Creator | Access | Base Model | Evaluation date (DD/MM/YYYY) | STEM | Social Science | Humanities | Others | Avg |
|---|-------|---------|--------|------------|------------------------------|------|----------------|------------|--------|-----|
| 1 | Llama3-ZAI | Zalo AI | Private | Llama3-8b | 01/08/2024 | 59.17 | 71.73 | 70.98 | 61.37 | 65.34 |
| 2 | VTSNLP-8B-Instruct | VTS DASC | Private | Llama3-8b | 01/08/2024 | 51.52 | 62.42 | 60.12 | 52.37 | 56.20 |
| 3 | VNPTAI.IO-14B | VNPT AI | Private | Qwen1.5-14B-Chat | 11/03/2024 | 51.64 | 61.75 | 58.09 | 54.51 | 55.83 |
| 4 | SeaLLM-7B-v2.5 | DAMO Academy | Private | llama-2-7b | 09/04/2024 | 49.35 | 60.66 | 55.95 | 49.05 | 53.30 |
| 5 | Ml4uLLM-7B-Chat | ML4U | Weight | Mistral-7B-v0.1 | 27/05/2024 | 44.72 | 58.69 | 56.86 | 52.36 | 52.08 |
| 6 | Vistral-7B-Chat | UONLP x Ontocord | Weight | Mistral-7B-v0.1 | 16/01/2024 | 43.32 | 57.02 | 55.12 | 48.01 | 50.07 |
| 7 | SDSRV-7B-chat | SDSRV teams | Private | Mistral-7B-v0.1 | 26/04/2024 | 36.29 | 60.55 | 55.95 | 49.05 | 48.55 |
| 8 | Arcanic Cono 1.5 | Arcanic AI | Private | Mistral-7B-v0.1 | 04/05/2024 | 45.11 | 52.44 | 51.97 | 45.36 | 47.45 |
| 9 | SeaLLM-7b-v2 | DAMO Academy | Weight | llama-2-7b | 15/02/2024 | 39.95 | 52.02 | 49.38 | 45.27 | 45.79 |
| 10 | bloomz-7b1 | BigScience | Weight | Bloom-7b1 | 08/01/2024 | 32.63 | 45.73 | 41.85 | 39.89 | 38.87 |
| 11 | T-Llama-7b | FPTU HCM | Weight | llama-2-7b | 18/03/2024 | 32.2 | 43.15 | 40.31 | 36.57 | 37.28 |
| 12 | vbd-llama2-7b-50b-chat | Vin BigData | Weight | llama-2-7b | 08/01/2024 | 31.45 | 40.34 | 40.24 | 39.62 | 36.98 |
| 13 | vietcuna-3b | Virtual Interactive | Weight | bloomz-3b | 08/01/2024 | 30.12 | 39.92 | 37.86 | 33.83 | 34.79 |
| 14 | bloomz-1b7 | BigScience | Weight | Bloom-1b7 | 08/01/2024 | 29.72 | 40.17 | 34.73 | 33.41 | 33.65 |
| 15 | SeaLLM-7B-Hybrid | DAMO Academy | Weight | llama-2-7b | 08/01/2024 | 29.49 | 34.61 | 36.68 | 34.52 | 33.39 |
| 16 | ura-llama-7b | Ho Chi Minh City University of Technology | Weight | llama-2-7b | 08/01/2024 | 29.19 | 33.31 | 34.64 | 32.97 | 32.18 |
| 17 | vinallama-7b-chat | Virtual Interactive | Weight | llama-2-7b | 08/01/2024 | 25.70 | 34.50 | 33.87 | 31.41 | 30.64 |
| 18 | vietcuna-7b-v3 | Virtual Interactive | Weight | bloomz-7b | 08/01/2024 | 28.70 | 33.94 | 31.32 | 28.24 | 30.34 |
| 19 | vietnamese-llama2-7b-40GB | BKAI - HUST | Weight | llama-2-7b | 08/01/2024 | 23.22 | 25.61 | 26.71 | 26.30 | 25.19 |

DISCLAIMER: Please note that evaluating LLMs can be challenging: leaderboards are susceptible to manipulation, and a small tweak in prompting can lead to totally different results. This is especially concerning because some models are not publicly accessible. For instance, good results can be achieved by distilling answers from stronger models like GPT-4, or even from humans. It is therefore important to approach leaderboard scores with caution. Most of the models assessed here are openly accessible, with public weights or APIs available for verification.