VMLU

Leaderboard

The following table presents benchmarking results for popular foundation models using zero-shot prompting. These models were evaluated either by calling APIs or running models directly. For more information about each specific model, please click on the model's name. Individuals and organizations are welcome to submit their model's predictions to VMLU at any time, with either zero-shot or few-shot evaluation. Click here to navigate to the submission page. Please note that your results will not be published on the Leaderboard unless you request it.

Leaderboard of from-scratch models

#	Model	Creator	Access	Evaluation date	Stem	Social Science	Humanities	Others	Avg
1	QwQ-32B	Alibaba Cloud	Weight	13/03/2025	81.11	78.49	71.78	70.62	76.13
2	Qwen2.5-72B-Instruct-AWQ	Alibaba Cloud	Weight	20/02/2025	68.46	74.65	69.23	64.62	69.17
3	Llama-3-70B	Meta	Weight	23/04/2024	61.7	74.91	68.74	63.53	66.44
4	KiLM-13b-v24.7.1	Kiki AI	Private	01/08/2024	60.29	73.15	71.85	60.13	66.07
5	GPT-4	OpenAI	API	08/01/2024	63.84	71.78	66.14	60.37	65.53
6	QwQ-32B-Preview	Alibaba Cloud	Weight	20/02/2025	63.42	69.84	63.32	59.58	63.9
7	Gpt-4o-mini	OpenAI	API	01/08/2024	58.00	70.95	65.03	60.91	62.87
8	gemma-2-9b-it	Google	Weight	01/08/2024	56.31	65.79	60.80	54.44	59.04
9	Qwen2.5-7B-Instruct	Alibaba Cloud	Weight	20/02/2025	55.43	63.27	58.3	54.4	57.51
10	gemini	Google	API	30/01/2024	42.8	60.31	55.35	51.30	51.03
11	ChatGPT	OpenAI	API	08/01/2024	43.24	51.67	46.96	46.32	46.33
12	ViGPT-1.6B-v1	Vin BigData	Private	08/01/2024	35.06	48.72	47.20	42.54	42.34
13	gemma-7b-it	Google	Weight	22/02/2024	39.95	44.93	43.39	40.11	41.9
14	microsoft/Phi-3-small-128k-instruct	Microsoft	Weight	01/08/2024	39.31	44.82	41.78	40.65	41.24
15	microsoft/Phi-3-small-8k-instruct	Microsoft	Weight	01/08/2024	38.72	43.60	42.32	39.99	40.88
16	Qwen-7B	Alibaba Cloud	Weight	08/01/2024	30.64	35.07	34.15	32.68	32.81
17	Qwen2-7B-Instruct	Alibaba Cloud	Weight	01/08/2024	21.96	35.24	33.13	29.29	28.85
18	gemma-2b-it	Google	Weight	22/02/2024	24.39	29.59	31.01	26.81	27.72
19	sealion7b	AI Singapore	Weight	08/01/2024	26.28	28.57	27.66	27.34	26.73
20	bloom-1b7	BigScience	Weight	08/01/2024	25.13	25.09	26.34	25.19	25.51
21	bloom-7b1	BigScience	Weight	08/01/2024	25.08	26.26	25.74	24.59	25.41
22	falcon-7b	Technology Innovation Institute	Weight	08/01/2024	24.19	23.59	26.72	24.73	24.96
23	PhoGPT-7B5-Instruct	Vin AI	Weight	08/01/2024	21.97	25.93	24.32	26.00	24.01
24	Llama-2-7b-hf	Facebook Research - Meta	Weight	08/01/2024	21.48	23.41	24.10	23.59	22.95
25	falcon-7b-instruct	Technology Innovation Institute	Weight	08/01/2024	9.50	13.63	14.98	6.13	11.39

Leaderboard of fine-tuned models

#	Model	Creator	Access	Base Model	Evaluation date	Stem	Social Science	Humanities	Others	Avg
1	VNPTAI.IO-Medium-R1.2	VNPT AI	Private	Unknown	17/06/2025	84.36	82.22	76.42	72	79.61
2	BnK-AI-Medium-v2.1	BnK Solution	Private	Unknown	27/05/2025	81.53	81.51	75.84	75.58	78.84
3	VNPTAI.IO-Medium-R1	VNPT AI	Private	Unknown	04/04/2025	77.09	82.3	78.85	69.98	77.43
4	MISA-Llama3-v1.1	MISA JSC	Private	Llama3	25/03/2025	77.5	80.75	76.62	71.6	76.87
5	BnK-AI-Medium-v2	BnK Solution	Private	Unknown	19/03/2025	80.94	80.76	70.7	74.06	76.66
6	VNPTAI.IO-Large-v4	VNPT AI	Private	Unknown	28/02/2025	78.05	79.05	75.39	70.37	76.21
7	GreenNode-xMedium-v1	GreenNode AI	Private	Unknown	24/02/2025	75.7	81.09	75.25	69.33	75.5
8	CakebyVPBank-Large	BeFinancial	Private	Unknown	22/10/2024	77.75	78.11	70.38	67.82	73.99
9	DeepSeek-R1-Distill-Llama-70B	DeepSeek	Weight	Meta-Llama-3-70B	24/02/2025	76.77	76.23	67.98	66.82	72.41
10	BnK-AI-Medium-v1	BnK Solution	Private	UNKNOWN	07/02/2025	76.35	75.58	67.57	65.52	71.81
11	MISA-Llama3-v1.0	MISA JSC	Private	Llama3.1	10/02/2025	70.42	76.28	69.47	67.62	70.7
12	DeepSeek-R1-Distill-Qwen-32B	DeepSeek	Weight	Qwen2.5-32B	24/02/2025	73.19	75.03	67.46	65.62	70.55
13	DeepSeek-R1-Distill-Qwen-14B	DeepSeek	Weight	Qwen2.5-14B	24/02/2025	71.24	70.94	63.46	63.37	67.55
14	CakebyVPBank-Small	BeFinancial	Private	Unknown	22/10/2024	63.95	70.68	67.63	61.17	65.82
15	Llama3-ZAI	Kiki AI	Private	Llama3-8b	01/08/2024	59.17	71.73	70.98	61.37	65.34
16	BloomVN-8B-chat	bloomify.cafe	Weight	sail/Sailor2-8B-Chat	07/02/2025	50.72	62.81	60.47	55.4	56.56
17	Llama3-ViettelSolutions-8B	VTS DASC	Private	Llama3-8b	01/08/2024	51.52	62.42	60.12	52.37	56.20
18	Vintern-3B-beta	5CD-AI	Private	Qwen2.5-3B-Instruct	22/10/2024	51.70	61.01	58.41	51.98	54.81
19	SeaLLM-7B-v2.5	DAMO Academy	Private	llama-2-7b	09/04/2024	49.35	60.66	55.95	49.05	53.30
20	Ml4uLLM-7B-Chat	ML4U	Weight	Mistral-7B-v0.1	27/05/2024	44.72	58.69	56.86	52.36	52.08
21	Vistral-7B-Chat	UONLP x Ontocord	Weight	Mistral-7B-v0.1	16/01/2024	43.32	57.02	55.12	48.01	50.07
22	DeepSeek-R1-Distill-Qwen-7B	DeepSeek	Weight	Qwen2.5-7B	24/02/2025	61.14	46.56	39.84	38.86	48.56
23	SDSRV-7B-chat	SDSRV teams	Private	Mistral-7B-v0.1	26/04/2024	36.29	60.55	55.95	49.05	48.55
24	Arcanic Cono 1.5	Arcanic AI	Private	Mistral-7B-v0.1	04/05/2024	45.11	52.44	51.97	45.36	47.45
25	DeepSeek-R1-Distill-Llama-8B	DeepSeek	Weight	Meta-Llama-3-8B	24/02/2025	45.02	52.12	47.78	38.97	46.16
26	SeaLLM-7b-v2	DAMO Academy	Weight	llama-2-7b	15/02/2024	39.95	52.02	49.38	45.27	45.79
27	bloomz-7b1	BigScience	Weight	Bloom-7b1	08/01/2024	32.63	45.73	41.85	39.89	38.87
28	T-Llama-7b	FPTU HCM	Weight	llama-2-7b	18/03/2024	32.2	43.15	40.31	36.57	37.28
29	vbd-llama2-7b-50b-chat	Vin BigData	Weight	llama-2-7b	08/01/2024	31.45	40.34	40.24	39.62	36.98
30	vietcuna-3b	Virtual Interactive	Weight	bloomz-3b	08/01/2024	30.12	39.92	37.86	33.83	34.79
31	bloomz-1b7	BigScience	Weight	Bloom-1b7	08/01/2024	29.72	40.17	34.73	33.41	33.65
32	SeaLLM-7B-Hybrid	DAMO Academy	Weight	llama-2-7b	08/01/2024	29.49	34.61	36.68	34.52	33.39
33	DeepSeek-R1-Distill-Qwen-1.5B	DeepSeek	Weight	Qwen2.5-1.5B	24/02/2025	38.02	32.03	30.56	29.55	33.36
34	ura-llama-7b	Ho Chi Minh City University of Technology	Weight	llama-2-7b	08/01/2024	29.19	33.31	34.64	32.97	32.18
35	vinallama-7b-chat	Virtual Interactive	Weight	llama-2-7b	08/01/2024	25.70	34.50	33.87	31.41	30.64
36	vietcuna-7b-v3	Virtual Interactive	Weight	bloomz-7b	08/01/2024	28.70	33.94	31.32	28.24	30.34
37	BloomVN-0.5B	bloomify.cafe	Weight	Qwen/Qwen2.5-0.5B	07/02/2025	23.18	32.84	32.71	33.67	29.43
38	vietnamese-llama2-7b-40GB	BKAI - HUST	Weight	llama-2-7b	08/01/2024	23.22	25.61	26.71	26.30	25.19

DISCLAIMER: Please note that evaluating models like LLMs can be challenging, as leaderboards might be susceptible to manipulation, and a small tweak in prompting can lead to totally different results. It's especially concerning because some models are not publicly accessible. For instance, good results can be achieved through distilling answers from stronger models like GPT-4 or even from humans. Therefore, it's important to approach leaderboard scores with caution. Most of the models assessed here are public With Open Access, which have public weights or APIs for verification.