diff --git a/README.md b/README.md
index 97e135d..9ad4c14 100644
--- a/README.md
+++ b/README.md
@@ -1066,22 +1066,28 @@ We have tested the model's tool calling capabilities on our open-source Chinese
- Chinese Tool-Use Benchmark |
+ Chinese Tool-Use Benchmark (Version 20231206) |
Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
- GPT-4 | 95% | 0.90 | 15.0% |
+ GPT-4 | 98.0% | 0.953 | 23.9% |
- GPT-3.5 | 85% | 0.88 | 75.0% |
+ GPT-3.5 | 74.5% | 0.807 | 80.6% |
- Qwen-7B-Chat | 98% | 0.91 | 7.3% |
+ Qwen-1_8B-Chat | 85.0% | 0.839 | 27.6% |
- Qwen-14B-Chat | 98% | 0.93 | 2.4% |
+ Qwen-7B-Chat | 95.5% | 0.900 | 11.6% |
+ Qwen-14B-Chat | 96.9% | 0.917 | 5.6% |
+ Qwen-72B-Chat | 98.2% | 0.927 | 1.1% |
@@ -1091,127 +1097,85 @@ We have observed that Qwen performs well in terms of code executability and resu
- Executable Rate of Generated Code (%) |
- Model | Math↑ | Visualization↑ | General↑ |
- GPT-4 | 91.9 | 85.9 | 82.8 |
- GPT-3.5 | 89.2 | 65.0 | 74.1 |
- LLaMA2-7B-Chat | 41.9 | 33.1 | 24.1 |
- LLaMA2-13B-Chat | 50.0 | 40.5 | 48.3 |
- CodeLLaMA-7B-Instruct | 85.1 | 54.0 | 70.7 |
- CodeLLaMA-13B-Instruct | 93.2 | 55.8 | 74.1 |
- InternLM-7B-Chat-v1.1 | 78.4 | 44.2 | 62.1 |
- InternLM-20B-Chat | 70.3 | 44.2 | 65.5 |
- Qwen-7B-Chat | 82.4 | 64.4 | 67.2 |
- Qwen-14B-Chat | 89.2 | 84.1 | 65.5 |
-
- Accuracy of Code Execution Results (%) |
- Model | Math↑ | Visualization-Hard↑ | Visualization-Easy↑ |
- GPT-4 | 82.8 | 66.7 | 60.8 |
- GPT-3.5 | 47.3 | 33.3 | 55.7 |
- LLaMA2-7B-Chat | 3.9 | 14.3 | 39.2 |
- LLaMA2-13B-Chat | 8.3 | 8.3 | 40.5 |
- CodeLLaMA-7B-Instruct | 14.3 | 26.2 | 60.8 |
- CodeLLaMA-13B-Instruct | 28.2 | 27.4 | 62.0 |
- InternLM-7B-Chat-v1.1 | 28.5 | 4.8 | 40.5 |
- InternLM-20B-Chat | 34.6 | 21.4 | 45.6 |
- Qwen-7B-Chat | 41.9 | 40.5 | 54.4 |
- Qwen-14B-Chat | 58.4 | 53.6 | 59.5 |
+ Code Interpreter Benchmark (Version 20231206) |
+ Model | Accuracy of Code Execution Results (%) | Executable Rate of Code (%) |
+ | Math↑ | Visualization-Hard↑ | Visualization-Easy↑ | General↑ |
+ GPT-4 | 82.8 | 66.7 | 60.8 | 82.8 |
+ GPT-3.5 | 47.3 | 33.3 | 55.7 | 74.1 |
+ LLaMA2-13B-Chat | 8.3 | 1.2 | 15.2 | 48.3 |
+ CodeLLaMA-13B-Instruct | 28.2 | 15.5 | 21.5 | 74.1 |
+ InternLM-20B-Chat | 34.6 | 10.7 | 25.1 | 65.5 |
+ ChatGLM3-6B | 54.2 | 15.5 | 21.5 | 67.1 |
+ Qwen-1.8B-Chat | 25.6 | 21.4 | 22.8 | 65.5 |
+ Qwen-7B-Chat | 41.9 | 23.8 | 38.0 | 67.2 |
+ Qwen-14B-Chat | 58.4 | 31.0 | 45.6 | 65.5 |
+ Qwen-72B-Chat | 72.7 | 41.7 | 43.0 | 82.8 |
@@ -1221,62 +1185,6 @@ We have observed that Qwen performs well in terms of code executability and resu
-In addition, we also provide experimental results demonstrating that our model is capable of acting as a HuggingFace Agent. For more information, please refer to the [example documentation](examples/transformers_agent.md). The model's performance on the evaluation dataset provided by Hugging Face is as follows:
-
- HuggingFace Agent Benchmark - Run Mode |
- Model | Tool Selection↑ | Tool Used↑ | Code↑ |
- GPT-4 | 100 | 100 | 97.4 |
- GPT-3.5 | 95.4 | 96.3 | 87.0 |
- StarCoder-Base-15B | 86.1 | 87.0 | 68.9 |
- StarCoder-15B | 87.0 | 88.0 | 68.9 |
- Qwen-7B-Chat | 87.0 | 87.0 | 71.5 |
- Qwen-14B-Chat | 93.5 | 94.4 | 87.0 |
-
- HuggingFace Agent Benchmark - Chat Mode |
- Model | Tool Selection↑ | Tool Used↑ | Code↑ |
- GPT-4 | 97.9 | 97.9 | 98.5 |
- GPT-3.5 | 97.3 | 96.8 | 89.6 |
- StarCoder-Base-15B | 97.9 | 97.9 | 91.1 |
- StarCoder-15B | 97.9 | 97.9 | 89.6 |
- Qwen-7B-Chat | 94.7 | 94.7 | 85.1 |
- Qwen-14B-Chat | 97.9 | 97.9 | 95.5 |
-
## Long-Context Understanding
diff --git a/README_CN.md b/README_CN.md
index 40d99c8..1ee9f57 100644
--- a/README_CN.md
+++ b/README_CN.md
@@ -1059,22 +1059,28 @@ Qwen-Chat针对工具使用、函数调用能力进行了优化。用户可以
- 中文工具调用评测基准 |
+ 中文工具调用评测基准(版本 20231206) |
Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
- GPT-4 | 95% | 0.90 | 15.0% |
+ GPT-4 | 98.0% | 0.953 | 23.9% |
- GPT-3.5 | 85% | 0.88 | 75.0% |
+ GPT-3.5 | 74.5% | 0.807 | 80.6% |
- Qwen-7B-Chat | 98% | 0.91 | 7.3% |
+ Qwen-1_8B-Chat | 85.0% | 0.839 | 27.6% |
- Qwen-14B-Chat | 98% | 0.93 | 2.4% |
+ Qwen-7B-Chat | 95.5% | 0.900 | 11.6% |
+ Qwen-14B-Chat | 96.9% | 0.917 | 5.6% |
+ Qwen-72B-Chat | 98.2% | 0.927 | 1.1% |
@@ -1083,127 +1089,85 @@ Qwen-Chat针对工具使用、函数调用能力进行了优化。用户可以
- 生成代码的可执行率 (%) |
- Model | Math↑ | Visualization↑ | General↑ |
- GPT-4 | 91.9 | 85.9 | 82.8 |
- GPT-3.5 | 89.2 | 65.0 | 74.1 |
- LLaMA2-7B-Chat | 41.9 | 33.1 | 24.1 |
- LLaMA2-13B-Chat | 50.0 | 40.5 | 48.3 |
- CodeLLaMA-7B-Instruct | 85.1 | 54.0 | 70.7 |
- CodeLLaMA-13B-Instruct | 93.2 | 55.8 | 74.1 |
- InternLM-7B-Chat-v1.1 | 78.4 | 44.2 | 62.1 |
- InternLM-20B-Chat | 70.3 | 44.2 | 65.5 |
- Qwen-7B-Chat | 82.4 | 64.4 | 67.2 |
- Qwen-14B-Chat | 89.2 | 84.1 | 65.5 |
-
- 代码执行结果的正确率 (%) |
- Model | Math↑ | Visualization-Hard↑ | Visualization-Easy↑ |
- GPT-4 | 82.8 | 66.7 | 60.8 |
- GPT-3.5 | 47.3 | 33.3 | 55.7 |
- LLaMA2-7B-Chat | 3.9 | 14.3 | 39.2 |
- LLaMA2-13B-Chat | 8.3 | 8.3 | 40.5 |
- CodeLLaMA-7B-Instruct | 14.3 | 26.2 | 60.8 |
- CodeLLaMA-13B-Instruct | 28.2 | 27.4 | 62.0 |
- InternLM-7B-Chat-v1.1 | 28.5 | 4.8 | 40.5 |
- InternLM-20B-Chat | 34.6 | 21.4 | 45.6 |
- Qwen-7B-Chat | 41.9 | 40.5 | 54.4 |
- Qwen-14B-Chat | 58.4 | 53.6 | 59.5 |
+ Code Interpreter Benchmark (Version 20231206) |
+ Model | 代码执行结果正确性 (%) | 生成代码的可执行率 (%) |
+ | Math↑ | Visualization-Hard↑ | Visualization-Easy↑ | General↑ |
+ GPT-4 | 82.8 | 66.7 | 60.8 | 82.8 |
+ GPT-3.5 | 47.3 | 33.3 | 55.7 | 74.1 |
+ LLaMA2-13B-Chat | 8.3 | 1.2 | 15.2 | 48.3 |
+ CodeLLaMA-13B-Instruct | 28.2 | 15.5 | 21.5 | 74.1 |
+ InternLM-20B-Chat | 34.6 | 10.7 | 25.1 | 65.5 |
+ ChatGLM3-6B | 54.2 | 15.5 | 21.5 | 67.1 |
+ Qwen-1.8B-Chat | 25.6 | 21.4 | 22.8 | 65.5 |
+ Qwen-7B-Chat | 41.9 | 23.8 | 38.0 | 67.2 |
+ Qwen-14B-Chat | 58.4 | 31.0 | 45.6 | 65.5 |
+ Qwen-72B-Chat | 72.7 | 41.7 | 43.0 | 82.8 |
@@ -1213,62 +1177,6 @@ Qwen-Chat针对工具使用、函数调用能力进行了优化。用户可以
-此外，我们还提供了实验结果表明我们的模型具备扮演HuggingFace Agent的能力，详见[示例文档](examples/transformers_agent.md)了解更多信息。模型在Hugging Face提供的评测数据集上表现如下:
-
- HuggingFace Agent评测基准 - Run模式 |
- Model | Tool Selection↑ | Tool Used↑ | Code↑ |
- GPT-4 | 100 | 100 | 97.4 |
- GPT-3.5 | 95.4 | 96.3 | 87.0 |
- StarCoder-Base-15B | 86.1 | 87.0 | 68.9 |
- StarCoder-15B | 87.0 | 88.0 | 68.9 |
- Qwen-7B-Chat | 87.0 | 87.0 | 71.5 |
- Qwen-14B-Chat | 93.5 | 94.4 | 87.0 |
-
- HuggingFace Agent评测基准 - Chat模式 |
- Model | Tool Selection↑ | Tool Used↑ | Code↑ |
- GPT-4 | 97.9 | 97.9 | 98.5 |
- GPT-3.5 | 97.3 | 96.8 | 89.6 |
- StarCoder-Base-15B | 97.9 | 97.9 | 91.1 |
- StarCoder-15B | 97.9 | 97.9 | 89.6 |
- Qwen-7B-Chat | 94.7 | 94.7 | 85.1 |
- Qwen-14B-Chat | 97.9 | 97.9 | 95.5 |
-
## 长文本理解
diff --git a/README_ES.md b/README_ES.md
index 1855c89..2939aad 100644
--- a/README_ES.md
+++ b/README_ES.md
@@ -1026,22 +1026,28 @@ Hemos probado las capacidades de llamada de la herramienta del modelo en nuestro
- Chinese Tool-Use Benchmark |
+ Chinese Tool-Use Benchmark (Version 20231206) |
Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
- GPT-4 | 95% | 0.90 | 15.0% |
+ GPT-4 | 98.0% | 0.953 | 23.9% |
- GPT-3.5 | 85% | 0.88 | 75.0% |
+ GPT-3.5 | 74.5% | 0.807 | 80.6% |
- Qwen-7B-Chat | 98% | 0.91 | 7.3% |
+ Qwen-1_8B-Chat | 85.0% | 0.839 | 27.6% |
- Qwen-14B-Chat | 98% | 0.93 | 2.4% |
+ Qwen-7B-Chat | 95.5% | 0.900 | 11.6% |
+ Qwen-14B-Chat | 96.9% | 0.917 | 5.6% |
+ Qwen-72B-Chat | 98.2% | 0.927 | 1.1% |
@@ -1051,127 +1057,85 @@ Hemos observado que Qwen funciona bien en términos de ejecutabilidad del códig
- Executable Rate of Generated Code (%) |
- Model | Math↑ | Visualization↑ | General↑ |
- GPT-4 | 91.9 | 85.9 | 82.8 |
- GPT-3.5 | 89.2 | 65.0 | 74.1 |
- LLaMA2-7B-Chat | 41.9 | 33.1 | 24.1 |
- LLaMA2-13B-Chat | 50.0 | 40.5 | 48.3 |
- CodeLLaMA-7B-Instruct | 85.1 | 54.0 | 70.7 |
- CodeLLaMA-13B-Instruct | 93.2 | 55.8 | 74.1 |
- InternLM-7B-Chat-v1.1 | 78.4 | 44.2 | 62.1 |
- InternLM-20B-Chat | 70.3 | 44.2 | 65.5 |
- Qwen-7B-Chat | 82.4 | 64.4 | 67.2 |
- Qwen-14B-Chat | 89.2 | 84.1 | 65.5 |
-
- Accuracy of Code Execution Results (%) |
- Model | Math↑ | Visualization-Hard↑ | Visualization-Easy↑ |
- GPT-4 | 82.8 | 66.7 | 60.8 |
- GPT-3.5 | 47.3 | 33.3 | 55.7 |
- LLaMA2-7B-Chat | 3.9 | 14.3 | 39.2 |
- LLaMA2-13B-Chat | 8.3 | 8.3 | 40.5 |
- CodeLLaMA-7B-Instruct | 14.3 | 26.2 | 60.8 |
- CodeLLaMA-13B-Instruct | 28.2 | 27.4 | 62.0 |
- InternLM-7B-Chat-v1.1 | 28.5 | 4.8 | 40.5 |
- InternLM-20B-Chat | 34.6 | 21.4 | 45.6 |
- Qwen-7B-Chat | 41.9 | 40.5 | 54.4 |
- Qwen-14B-Chat | 58.4 | 53.6 | 59.5 |
+ Code Interpreter Benchmark (Version 20231206) |
+ Model | Accuracy of Code Execution Results (%) | Executable Rate of Code (%) |
+ | Math↑ | Visualization-Hard↑ | Visualization-Easy↑ | General↑ |
+ GPT-4 | 82.8 | 66.7 | 60.8 | 82.8 |
+ GPT-3.5 | 47.3 | 33.3 | 55.7 | 74.1 |
+ LLaMA2-13B-Chat | 8.3 | 1.2 | 15.2 | 48.3 |
+ CodeLLaMA-13B-Instruct | 28.2 | 15.5 | 21.5 | 74.1 |
+ InternLM-20B-Chat | 34.6 | 10.7 | 25.1 | 65.5 |
+ ChatGLM3-6B | 54.2 | 15.5 | 21.5 | 67.1 |
+ Qwen-1.8B-Chat | 25.6 | 21.4 | 22.8 | 65.5 |
+ Qwen-7B-Chat | 41.9 | 23.8 | 38.0 | 67.2 |
+ Qwen-14B-Chat | 58.4 | 31.0 | 45.6 | 65.5 |
+ Qwen-72B-Chat | 72.7 | 41.7 | 43.0 | 82.8 |
@@ -1181,62 +1145,6 @@ Hemos observado que Qwen funciona bien en términos de ejecutabilidad del códig
-Además, también proporcionamos resultados experimentales que demuestran que nuestro modelo es capaz de actuar como un Agente HuggingFace. Para más información, consulte la [documentación del ejemplo](examples/transformers_agent.md). El rendimiento del modelo en el conjunto de datos de evaluación proporcionado por Hugging Face es el siguiente:
-
- HuggingFace Agent Benchmark - Run Mode |
- Model | Tool Selection↑ | Tool Used↑ | Code↑ |
- GPT-4 | 100 | 100 | 97.4 |
- GPT-3.5 | 95.4 | 96.3 | 87.0 |
- StarCoder-Base-15B | 86.1 | 87.0 | 68.9 |
- StarCoder-15B | 87.0 | 88.0 | 68.9 |
- Qwen-7B-Chat | 87.0 | 87.0 | 71.5 |
- Qwen-14B-Chat | 93.5 | 94.4 | 87.0 |
-
- HuggingFace Agent Benchmark - Chat Mode |
- Model | Tool Selection↑ | Tool Used↑ | Code↑ |
- GPT-4 | 97.9 | 97.9 | 98.5 |
- GPT-3.5 | 97.3 | 96.8 | 89.6 |
- StarCoder-Base-15B | 97.9 | 97.9 | 91.1 |
- StarCoder-15B | 97.9 | 97.9 | 89.6 |
- Qwen-7B-Chat | 94.7 | 94.7 | 85.1 |
- Qwen-14B-Chat | 97.9 | 97.9 | 95.5 |
-
## Comprensión del Contexto Largo
diff --git a/README_FR.md b/README_FR.md
index 19efd9e..38a3c43 100644
--- a/README_FR.md
+++ b/README_FR.md
@@ -1029,22 +1029,28 @@ Nous avons testé les capacités d'appel d'outil du modèle sur notre benchmark
- Chinese Tool-Use Benchmark |
+ Chinese Tool-Use Benchmark (Version 20231206) |
Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
- GPT-4 | 95% | 0.90 | 15.0% |
+ GPT-4 | 98.0% | 0.953 | 23.9% |
- GPT-3.5 | 85% | 0.88 | 75.0% |
+ GPT-3.5 | 74.5% | 0.807 | 80.6% |
- Qwen-7B-Chat | 98% | 0.91 | 7.3% |
+ Qwen-1_8B-Chat | 85.0% | 0.839 | 27.6% |
- Qwen-14B-Chat | 98% | 0.93 | 2.4% |
+ Qwen-7B-Chat | 95.5% | 0.900 | 11.6% |
+ Qwen-14B-Chat | 96.9% | 0.917 | 5.6% |
+ Qwen-72B-Chat | 98.2% | 0.927 | 1.1% |
@@ -1054,127 +1060,85 @@ Nous avons observé que Qwen est performant en termes d'exécutabilité du code
- Executable Rate of Generated Code (%) |
- Model | Math↑ | Visualization↑ | General↑ |
- GPT-4 | 91.9 | 85.9 | 82.8 |
- GPT-3.5 | 89.2 | 65.0 | 74.1 |
- LLaMA2-7B-Chat | 41.9 | 33.1 | 24.1 |
- LLaMA2-13B-Chat | 50.0 | 40.5 | 48.3 |
- CodeLLaMA-7B-Instruct | 85.1 | 54.0 | 70.7 |
- CodeLLaMA-13B-Instruct | 93.2 | 55.8 | 74.1 |
- InternLM-7B-Chat-v1.1 | 78.4 | 44.2 | 62.1 |
- InternLM-20B-Chat | 70.3 | 44.2 | 65.5 |
- Qwen-7B-Chat | 82.4 | 64.4 | 67.2 |
- Qwen-14B-Chat | 89.2 | 84.1 | 65.5 |
-
- Accuracy of Code Execution Results (%) |
- Model | Math↑ | Visualization-Hard↑ | Visualization-Easy↑ |
- GPT-4 | 82.8 | 66.7 | 60.8 |
- GPT-3.5 | 47.3 | 33.3 | 55.7 |
- LLaMA2-7B-Chat | 3.9 | 14.3 | 39.2 |
- LLaMA2-13B-Chat | 8.3 | 8.3 | 40.5 |
- CodeLLaMA-7B-Instruct | 14.3 | 26.2 | 60.8 |
- CodeLLaMA-13B-Instruct | 28.2 | 27.4 | 62.0 |
- InternLM-7B-Chat-v1.1 | 28.5 | 4.8 | 40.5 |
- InternLM-20B-Chat | 34.6 | 21.4 | 45.6 |
- Qwen-7B-Chat | 41.9 | 40.5 | 54.4 |
- Qwen-14B-Chat | 58.4 | 53.6 | 59.5 |
+ Code Interpreter Benchmark (Version 20231206) |
+ Model | Accuracy of Code Execution Results (%) | Executable Rate of Code (%) |
+ | Math↑ | Visualization-Hard↑ | Visualization-Easy↑ | General↑ |
+ GPT-4 | 82.8 | 66.7 | 60.8 | 82.8 |
+ GPT-3.5 | 47.3 | 33.3 | 55.7 | 74.1 |
+ LLaMA2-13B-Chat | 8.3 | 1.2 | 15.2 | 48.3 |
+ CodeLLaMA-13B-Instruct | 28.2 | 15.5 | 21.5 | 74.1 |
+ InternLM-20B-Chat | 34.6 | 10.7 | 25.1 | 65.5 |
+ ChatGLM3-6B | 54.2 | 15.5 | 21.5 | 67.1 |
+ Qwen-1.8B-Chat | 25.6 | 21.4 | 22.8 | 65.5 |
+ Qwen-7B-Chat | 41.9 | 23.8 | 38.0 | 67.2 |
+ Qwen-14B-Chat | 58.4 | 31.0 | 45.6 | 65.5 |
+ Qwen-72B-Chat | 72.7 | 41.7 | 43.0 | 82.8 |
@@ -1184,62 +1148,6 @@ Nous avons observé que Qwen est performant en termes d'exécutabilité du code
-En outre, nous fournissons également des résultats expérimentaux démontrant que notre modèle est capable d'agir en tant qu'agent Hugging Face. Pour plus d'informations, veuillez vous référer à la [documentation de l'exemple](examples/transformers_agent.md). Les performances du modèle sur l'ensemble des données d'évaluation fournies par Hugging Face sont les suivantes:
-
- HuggingFace Agent Benchmark - Run Mode |
- Model | Tool Selection↑ | Tool Used↑ | Code↑ |
- GPT-4 | 100 | 100 | 97.4 |
- GPT-3.5 | 95.4 | 96.3 | 87.0 |
- StarCoder-Base-15B | 86.1 | 87.0 | 68.9 |
- StarCoder-15B | 87.0 | 88.0 | 68.9 |
- Qwen-7B-Chat | 87.0 | 87.0 | 71.5 |
- Qwen-14B-Chat | 93.5 | 94.4 | 87.0 |
-
- HuggingFace Agent Benchmark - Chat Mode |
- Model | Tool Selection↑ | Tool Used↑ | Code↑ |
- GPT-4 | 97.9 | 97.9 | 98.5 |
- GPT-3.5 | 97.3 | 96.8 | 89.6 |
- StarCoder-Base-15B | 97.9 | 97.9 | 91.1 |
- StarCoder-15B | 97.9 | 97.9 | 89.6 |
- Qwen-7B-Chat | 94.7 | 94.7 | 85.1 |
- Qwen-14B-Chat | 97.9 | 97.9 | 95.5 |
-
## Compréhension du Contexte Long
diff --git a/README_JA.md b/README_JA.md
index be232ed..e8646f6 100644
--- a/README_JA.md
+++ b/README_JA.md
@@ -1056,22 +1056,28 @@ ReAct プロンプトの原則に基づいてツール呼び出しを実装す
- Chinese Tool-Use Benchmark |
+ Chinese Tool-Use Benchmark (Version 20231206) |
Model | Tool Selection (Acc.↑) | Tool Input (Rouge-L↑) | False Positive Error↓ |
- GPT-4 | 95% | 0.90 | 15.0% |
+ GPT-4 | 98.0% | 0.953 | 23.9% |
- GPT-3.5 | 85% | 0.88 | 75.0% |
+ GPT-3.5 | 74.5% | 0.807 | 80.6% |
- Qwen-7B-Chat | 98% | 0.91 | 7.3% |
+ Qwen-1_8B-Chat | 85.0% | 0.839 | 27.6% |
- Qwen-14B-Chat | 98% | 0.93 | 2.4% |
+ Qwen-7B-Chat | 95.5% | 0.900 | 11.6% |
+ Qwen-14B-Chat | 96.9% | 0.917 | 5.6% |
+ Qwen-72B-Chat | 98.2% | 0.927 | 1.1% |
@@ -1081,127 +1087,85 @@ Qwen は、コード生成時のコードの実行可能性と結果の精度の
- Executable Rate of Generated Code (%) |
- Model | Math↑ | Visualization↑ | General↑ |
- GPT-4 | 91.9 | 85.9 | 82.8 |
- GPT-3.5 | 89.2 | 65.0 | 74.1 |
- LLaMA2-7B-Chat | 41.9 | 33.1 | 24.1 |
- LLaMA2-13B-Chat | 50.0 | 40.5 | 48.3 |
- CodeLLaMA-7B-Instruct | 85.1 | 54.0 | 70.7 |
- CodeLLaMA-13B-Instruct | 93.2 | 55.8 | 74.1 |
- InternLM-7B-Chat-v1.1 | 78.4 | 44.2 | 62.1 |
- InternLM-20B-Chat | 70.3 | 44.2 | 65.5 |
- Qwen-7B-Chat | 82.4 | 64.4 | 67.2 |
- Qwen-14B-Chat | 89.2 | 84.1 | 65.5 |
-
- Accuracy of Code Execution Results (%) |
- Model | Math↑ | Visualization-Hard↑ | Visualization-Easy↑ |
- GPT-4 | 82.8 | 66.7 | 60.8 |
- GPT-3.5 | 47.3 | 33.3 | 55.7 |
- LLaMA2-7B-Chat | 3.9 | 14.3 | 39.2 |
- LLaMA2-13B-Chat | 8.3 | 8.3 | 40.5 |
- CodeLLaMA-7B-Instruct | 14.3 | 26.2 | 60.8 |
- CodeLLaMA-13B-Instruct | 28.2 | 27.4 | 62.0 |
- InternLM-7B-Chat-v1.1 | 28.5 | 4.8 | 40.5 |
- InternLM-20B-Chat | 34.6 | 21.4 | 45.6 |
- Qwen-7B-Chat | 41.9 | 40.5 | 54.4 |
- Qwen-14B-Chat | 58.4 | 53.6 | 59.5 |
+ Code Interpreter Benchmark (Version 20231206) |
+ Model | Accuracy of Code Execution Results (%) | Executable Rate of Code (%) |
+ | Math↑ | Visualization-Hard↑ | Visualization-Easy↑ | General↑ |
+ GPT-4 | 82.8 | 66.7 | 60.8 | 82.8 |
+ GPT-3.5 | 47.3 | 33.3 | 55.7 | 74.1 |
+ LLaMA2-13B-Chat | 8.3 | 1.2 | 15.2 | 48.3 |
+ CodeLLaMA-13B-Instruct | 28.2 | 15.5 | 21.5 | 74.1 |
+ InternLM-20B-Chat | 34.6 | 10.7 | 25.1 | 65.5 |
+ ChatGLM3-6B | 54.2 | 15.5 | 21.5 | 67.1 |
+ Qwen-1.8B-Chat | 25.6 | 21.4 | 22.8 | 65.5 |
+ Qwen-7B-Chat | 41.9 | 23.8 | 38.0 | 67.2 |
+ Qwen-14B-Chat | 58.4 | 31.0 | 45.6 | 65.5 |
+ Qwen-72B-Chat | 72.7 | 41.7 | 43.0 | 82.8 |
@@ -1211,62 +1175,6 @@ Qwen は、コード生成時のコードの実行可能性と結果の精度の
-さらに、Qwenが HuggingFace Agent として機能できることを実証する実験結果も提供します。詳細については、[ドキュメント例](examples/transformers_agent.md) を参照してください。Hugging Face が提供する評価データセットにおけるモデルのパフォーマンスは次のとおりです。
-
- HuggingFace Agent Benchmark - Run Mode |
- Model | Tool Selection↑ | Tool Used↑ | Code↑ |
- GPT-4 | 100 | 100 | 97.4 |
- GPT-3.5 | 95.4 | 96.3 | 87.0 |
- StarCoder-Base-15B | 86.1 | 87.0 | 68.9 |
- StarCoder-15B | 87.0 | 88.0 | 68.9 |
- Qwen-7B-Chat | 87.0 | 87.0 | 71.5 |
- Qwen-14B-Chat | 93.5 | 94.4 | 87.0 |
-
- HuggingFace Agent Benchmark - Chat Mode |
- Model | Tool Selection↑ | Tool Used↑ | Code↑ |
- GPT-4 | 97.9 | 97.9 | 98.5 |
- GPT-3.5 | 97.3 | 96.8 | 89.6 |
- StarCoder-Base-15B | 97.9 | 97.9 | 91.1 |
- StarCoder-15B | 97.9 | 97.9 | 89.6 |
- Qwen-7B-Chat | 94.7 | 94.7 | 85.1 |
- Qwen-14B-Chat | 97.9 | 97.9 | 95.5 |
-
## 長い文脈の理解
diff --git a/eval/EVALUATION.md b/eval/EVALUATION.md
index 5baeb4d..b939ad2 100644
--- a/eval/EVALUATION.md
+++ b/eval/EVALUATION.md
@@ -85,9 +85,12 @@ This script is used to reproduce the results of the ReAct and Hugging Face Agent
# Qwen-7B-Chat
mkdir data;
cd data;
-wget https://qianwen-res.oss-cn-beijing.aliyuncs.com/opensource_data/exam_plugin_v1/exam_plugin_v1_react_positive.jsonl;
-wget https://qianwen-res.oss-cn-beijing.aliyuncs.com/opensource_data/exam_plugin_v1/exam_plugin_v1_react_negative.jsonl;
-cd ..;
+## Old Evaluation Dataset (Version 20230803)
+# wget https://qianwen-res.oss-cn-beijing.aliyuncs.com/opensource_data/exam_plugin_v1/exam_plugin_v1_react_positive.jsonl;
+# wget https://qianwen-res.oss-cn-beijing.aliyuncs.com/opensource_data/exam_plugin_v1/exam_plugin_v1_react_negative.jsonl;
+## New Evaluation Dataset (Version 20231206)
+wget https://qianwen-res.oss-cn-beijing.aliyuncs.com/opensource_data/exam_plugin_v20231206/exam_plugin_v20231206_react_positive.jsonl;
+wget https://qianwen-res.oss-cn-beijing.aliyuncs.com/opensource_data/exam_plugin_v20231206/exam_plugin_v20231206_react_negative.jsonl;
+cd ..;
pip install json5;
pip install jsonlines;
pip install rouge_score;
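For reference, the `False Positive Error↓` column in the tool-use tables above is the share of negative-set queries (queries that should need no tool) where the model nonetheless emits a tool call, mirroring the `bad_rate` computation in `evaluate_plugin.py`. A minimal sketch with made-up generations (the sample strings are illustrative, not real model output):

```python
# Hypothetical generations on the "negative" set: queries needing no tool.
gens = [
    "Thought: I can answer directly.\nFinal Answer: 42",
    "Thought: let me search.\nAction: web_search\nAction Input: {}",
    "Final Answer: hello",
]

# A generation counts as a false positive when it invokes a tool anyway,
# detected by the strict "\nAction: " marker used in the evaluation script.
bad_count = sum(1 for g in gens if "\nAction: " in g)
bad_rate = bad_count / len(gens)
print(f"{bad_rate:.1%}")
```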
diff --git a/eval/evaluate_plugin.py b/eval/evaluate_plugin.py
index f3b953b..94d18aa 100644
--- a/eval/evaluate_plugin.py
+++ b/eval/evaluate_plugin.py
@@ -46,7 +46,7 @@ def process_res(response):
)
except:
# print("JSON Load Error:", action_input)
- pass
+ action_input = ""
res_dict = {
"thought": thought,
"action": action,
@@ -80,7 +80,7 @@ def eval_action(job):
response = job["gen"][0]
golden = job["response"]
- if "Action:" in response:
+ if "\nAction: " in response:
response, golden = process_res(response), process_res(golden)
if is_callable(response, golden):
return True
@@ -263,7 +263,7 @@ def main(args):
filename=args.eval_react_negative_filename, model=model, tokenizer=tokenizer
)
for job in jobs:
- if "\nAction:" in job["gen"][0]:
+ if "\nAction: " in job["gen"][0]:
bad_count += 1
scores = {"bad_rate": bad_count / len(jobs)}
result.update({"react_negative": scores})
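The patch above tightens the tool-call check from the substring `"Action:"` to `"\nAction: "`, so a response that merely mentions the word inside its thought is no longer scored as a tool invocation. A minimal, self-contained sketch of ReAct-style parsing consistent with that check (the function and field names here are illustrative, not this repo's exact helpers):

```python
import json

def parse_react(response: str):
    """Split a ReAct-formatted response into thought / action / arguments.

    Returns None when the response contains no tool call, i.e. no line
    beginning with "Action: " (newline plus trailing space required, so a
    bare mention of "Action:" inside the thought does not count).
    """
    if "\nAction: " not in response:
        return None
    thought, rest = response.split("\nAction: ", 1)
    action, _, tail = rest.partition("\nAction Input: ")
    action_input = tail.split("\nObservation:", 1)[0].strip()
    try:
        args = json.loads(action_input)
    except json.JSONDecodeError:
        args = ""  # mirror the patched fallback: unparsable input becomes empty
    return {"thought": thought.strip(), "action": action.strip(), "args": args}

demo = (
    "Thought: I should call a search tool.\n"
    "Action: web_search\n"
    'Action Input: {"query": "Qwen"}\n'
    "Observation: ..."
)
print(parse_react(demo)["action"])  # web_search
```

Under this rule, a negative-set response like `"The word Action: appears, but no tool call."` parses to `None`, which is exactly what keeps it out of the `bad_rate` numerator.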