Qwen-7B ð€ | ð€  ïœ Qwen-7B-Chat ð€ | ð€  | Qwen-7B-Chat-Int4 ð€
Demo  ïœ  Report   |   Discord  ïœ  WeChat
äžæ  ïœ  English  ïœ  æ¥æ¬èª
Japanese document maintainer: Ikko Eltociear Ashimine
ããã«ã[OpenCompass](https://opencompass.org.cn/leaderboard-llm)ãå®æœãã倧èŠæš¡èšèªã¢ãã«ã®ç¬¬äžè
è©äŸ¡ã«ãããšãQwen-7BãšQwen-7B-Chatã¯7Bãã©ã¡ãŒã¿ã¢ãã«ã®ãããã§ããããã®è©äŸ¡ã¯ãèšèªç解ã»çæãã³ãŒãã£ã³ã°ãæ°åŠãæšè«ãªã©ã®è©äŸ¡ã®ããã®å€§éã®å
¬éãã³ãããŒã¯ã§æ§æãããŠããã
ãã詳现ãªå®éšçµæïŒããå€ãã®ãã³ãããŒã¯ããŒã¿ã»ããã§ã®è©³çŽ°ãªã¢ãã«æ§èœïŒã詳现ã«ã€ããŠã¯ã[ãã¡ã](tech_memo.md)ãã¯ãªãã¯ããŠæè¡ã¡ã¢ãåç
§ããŠãã ããã
## å¿
èŠæ¡ä»¶
* python 3.8 以äž
* pytorch 1.12 以äžã2.0 以äžãæšå¥š
* CUDA 11.4 以äžãæšå¥šïŒGPU ãŠãŒã¶ãŒããã©ãã·ã¥ã¢ãã³ã·ã§ã³ãŠãŒã¶ãŒåããªã©ïŒ
## ã¯ã€ãã¯ã¹ã¿ãŒã
以äžã§ã¯ãQwen-7B ãš ð€ ModelScope ãš ð€ Transformers ã®ç°¡åãªäœ¿çšäŸã瀺ããŸãã
ã³ãŒããå®è¡ããåã«ãç°å¢ã®ã»ããã¢ãããšå¿
èŠãªããã±ãŒãžã®ã€ã³ã¹ããŒã«ãæžãã§ããããšã確èªããŠãã ãããäžèšã®èŠä»¶ãæºãããŠããããšã確èªããŠãããäŸåããã©ã€ãã©ãªãã€ã³ã¹ããŒã«ããŠãã ããã
```bash
pip install -r requirements.txt
```
ã䜿ãã®ããã€ã¹ã fp16 ãŸã㯠bf16 ããµããŒãããŠããå Žåã[flash-attention](https://github.com/Dao-AILab/flash-attention) ãã€ã³ã¹ããŒã«ããããšã§ãããé«ãå¹çãšã¡ã¢ãªäœ¿çšéãæããããšãã§ããŸãã(**flash-attention ã¯ãªãã·ã§ã³ã§ãããã€ã³ã¹ããŒã«ããªããŠããããžã§ã¯ãã¯æ£åžžã«å®è¡ã§ããŸã**)
```bash
git clone -b v1.0.8 https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# 以äžã¯ãªãã·ã§ã³ã§ããã€ã³ã¹ããŒã«ã«æéããããå ŽåããããŸãã
# pip install csrc/layer_norm
# pip install csrc/rotary
```
ãã㧠ModelScope ã Transformers ã§å§ããããšãã§ããŸãã
#### ð€ Transformers
Qwen-7B-Chat ãæšè«ã«äœ¿çšããã«ã¯ã以äžã®ããã«æ°è¡ã®ã³ãŒããå
¥åããã ãã§ãã**ææ°ã®ã³ãŒãã䜿çšããŠããããšã確èªããŠãã ããã**
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
# 泚: ããã©ã«ãã®åäœã§ã¯ãã€ã³ãžã§ã¯ã·ã§ã³æ»æé²æ¢æ©èœããªãã«ãªã£ãŠããŸãã
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
# bf16 ã䜿çš
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# fp16 ã䜿çš
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# CPU ã®ã¿äœ¿çš
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
# ãªãŒãã¢ãŒãã䜿çšãããšãããã€ã¹ã«å¿ããŠèªåçã«ç²ŸåºŠãéžæãããŸãã
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True).eval()
# çæã®ããã®ãã€ããŒãã©ã¡ãŒã¿ãæå®
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)
# 第äžèœ®å¯¹è¯ 第äžå察話ã¿ãŒã³
response, history = model.chat(tokenizer, "äœ å¥œ", history=None)
print(response)
# ããã«ã¡ã¯ïŒ ã圹ã«ç«ãŠãŠããããã§ãã
# 第äºèœ®å¯¹è¯ 第äºå察話ã¿ãŒã³
response, history = model.chat(tokenizer, "ç»æ讲äžäžªå¹Žèœ»äººå¥æåäžæç»ååŸæåçæ
äºã", history=history)
print(response)
# ããã¯ãèªåã®ããžãã¹ãå§ããããšå¥®éãããããŠæåããè¥è
ã®ç©èªã§ããã
# ãã®ç©èªã®äž»äººå
¬ã¯ãå¹³å¡ãªå®¶åºã«çãŸããå¹³å¡ãªåŽåè
ã§ãã䞡芪ãæã€ææã§ããã ææã¯åäŸã®é ããèµ·æ¥å®¶ãšããŠæåããããšãç®æšãšããŠããã
# ãã®ç®æšãéæãããããææã¯çå匷ããŠå€§åŠã«å
¥ã£ãã 倧åŠæ代ã«ã¯ãããŸããŸãªèµ·æ¥å®¶ã³ã³ãã¹ãã«ç©æ¥µçã«åå ããå€ãã®è³ãç²åŸããã ãŸããäœæãå©çšããŠã€ã³ã¿ãŒã³ã·ããã«ãåå ãã貎éãªçµéšãç©ãã ã
# åæ¥åŸãææã¯èµ·æ¥ã決æããã æè³å
ãæ¢ãå§ããããäœåºŠãæãããã ãããã圌ã¯ãããããªãã£ãã 圌ã¯æžåœã«åãç¶ããããžãã¹ãã©ã³ãæ¹åããæ°ããªæè³æ©äŒãæ¢ããã
# ãããŠææã¯æè³ãåããããšã«æåããèªåã®ããžãã¹ãå§ããã 圌ã¯æ°ããã¿ã€ãã®ãœãããŠã§ã¢ã®éçºã«çŠç¹ãåœãŠããã¯ãããžãŒäŒç€Ÿãèšç«ããã 圌ã®ãªãŒããŒã·ããã®äžãäŒç€Ÿã¯æ¥éã«æé·ãããã¯ãããžãŒäŒæ¥ãšããŠæåãåããã
# ææã®æåã¯å¶ç¶ã§ã¯ãªãã 圌ã¯å€åã§ããããŸãããåéºå¥œãã§ãåžžã«åŠã³ãèªåãé«ããŠããã 圌ã®æåã¯ãŸããåªåããã°èª°ã§ãæåã§ããããšã蚌æããŠããã
# 第äžèœ®å¯¹è¯ 第äžå察話ã¿ãŒã³
response, history = model.chat(tokenizer, "ç»è¿äžªæ
äºèµ·äžäžªæ é¢", history=history)
print(response)
# ãèµ·æ¥ãžã®å¥®éïŒããè¥è
ã®æåãžã®éã
```
Qwen-7B ã®åŠç¿æžã¿ããŒã¹ã¢ãã«ã®å®è¡ãç°¡åã§ãã
Qwen-7B ã®å®è¡
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
# bf16 ã䜿çš
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, bf16=True).eval()
# fp16 ã䜿çš
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, fp16=True).eval()
# CPU ã®ã¿äœ¿çš
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="cpu", trust_remote_code=True).eval()
# ãªãŒãã¢ãŒãã䜿çšãããšãããã€ã¹ã«å¿ããŠèªåçã«ç²ŸåºŠãéžæãããŸãã
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True).eval()
# çæã®ããã®ãã€ããŒãã©ã¡ãŒã¿ãæå®
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
inputs = tokenizer('ã¢ã³ãŽã«ã®éŠéœã¯ãŠã©ã³ããŒãã«ïŒUlaanbaatarïŒ\nã¢ã€ã¹ã©ã³ãã®éŠéœã¯ã¬ã€ãã£ãã¯ïŒReykjavikïŒ\nãšããªãã¢ã®éŠéœã¯', return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# ã¢ã³ãŽã«ã®éŠéœã¯ãŠã©ã³ããŒãã«ïŒUlaanbaatarïŒ\nã¢ã€ã¹ã©ã³ãã®éŠéœã¯ã¬ã€ãã£ãã¯ïŒReykjavikïŒ\nãšããªãã¢ã®éŠéœã¯ã¢ãã£ã¹ã¢ããïŒAddis AbabaïŒ...
```
### CLI ã㢠`cli_demo.py` ã« CLI ã®ãã¢äŸãçšæããŠããŸãããŠãŒã¶ã¯ããã³ãããå ¥åããããšã§ Qwen-7B-Chat ãšå¯Ÿè©±ããããšãã§ããã¢ãã«ã¯ã¹ããªãŒãã³ã°ã¢ãŒãã§ã¢ãã«ã®åºåãè¿ããŸãã以äžã®ã³ãã³ããå®è¡ããïŒ ``` python cli_demo.py ```
## API OpenAI APIãããŒã¹ã«ããŒã«ã«APIããããã€ããæ¹æ³ãæäŸããïŒ@hanpenggitã«æè¬ïŒãå§ããåã«ãå¿ èŠãªããã±ãŒãžãã€ã³ã¹ããŒã«ããŠãã ããïŒ ```bash pip install fastapi uvicorn openai pydantic sse_starlette ``` ãããããAPIããããã€ããã³ãã³ããå®è¡ããïŒ ```bash python openai_api.py ``` ãã§ãã¯ãã€ã³ãåããã¹ã«ã¯ `-c` ãCPU ãããã€ã¡ã³ãã«ã¯ `--cpu-only` ãªã©ãåŒæ°ãå€æŽã§ããŸããAPIãããã€ã¡ã³ããèµ·åããéã«åé¡ãçºçããå Žåã¯ãããã±ãŒãžãææ°ããŒãžã§ã³ã«æŽæ°ããããšã§è§£æ±ºã§ããå¯èœæ§ããããŸãã APIã®äœ¿ãæ¹ãç°¡åã ã以äžã®äŸãã芧ãã ããïŒ ```python import openai openai.api_base = "http://localhost:8000/v1" openai.api_key = "none" # create a request activating streaming response for chunk in openai.ChatCompletion.create( model="Qwen-7B", messages=[ {"role": "user", "content": "äœ å¥œ"} ], stream=True ): if hasattr(chunk.choices[0].delta, "content"): print(chunk.choices[0].delta.content, end="", flush=True) # create a request not activating streaming response response = openai.ChatCompletion.create( model="Qwen-7B", messages=[ {"role": "user", "content": "äœ å¥œ"} ], stream=False ) print(response.choices[0].message.content) ```
## ããŒã«ã®äœ¿çš Qwen-7B-Chat ã¯ãAPIãããŒã¿ããŒã¹ãã¢ãã«ãªã©ãããŒã«ã®å©çšã«ç¹åããŠæé©åãããŠããããŠãŒã¶ã¯ç¬èªã® Qwen-7B ããŒã¹ã® LangChainããšãŒãžã§ã³ããã³ãŒãã€ã³ã¿ããªã¿ãæ§ç¯ããããšãã§ããŸããããŒã«å©çšèœåãè©äŸ¡ããããã®è©äŸ¡[ãã³ãããŒã¯](eval/EVALUATION.md)ã§ã¯ãQwen-7B ã¯å®å®ããæ§èœã«éããŠããŸãã [](https://) | Model | Tool Selection (Acc.â) | Tool Input (Rouge-Lâ) | False Positive Errorâ | |:------------|:----------------------:|:----------------------:|:----------------------:| | GPT-4 | 95% | **0.90** | 15% | | GPT-3.5 | 85% | 0.88 | 75% | | **Qwen-7B** | **99%** | 0.89 | **9.7%** | ReAct ããã³ããã®æžãæ¹ã䜿ãæ¹ã«ã€ããŠã¯ã[ReAct ã®äŸ](examples/react_prompt.md)ãåç §ããŠãã ãããããŒã«ã䜿çšããããšã§ãã¢ãã«ãããããã¿ã¹ã¯ãå®è¡ã§ããããã«ãªããŸãã ããã«ããšãŒãžã§ã³ããšããŠã®èœåã瀺ãå®éšçµæãæäŸããã詳现㯠[Hugging Face Agent](https://huggingface.co/docs/transformers/transformers_agents) ãåç §ãHugging Face ãæäŸããã©ã³ã¢ãŒããã³ãããŒã¯ã§ã®æ§èœã¯ä»¥äžã®éãã§ã: | Model | Tool Selectionâ | Tool Usedâ | Codeâ | |:---------------|:---------------:|:-----------:|:---------:| |GPT-4 | **100** | **100** | **97.41** | |GPT-3.5 | 95.37 | 96.30 | 87.04 | |StarCoder-15.5B | 87.04 | 87.96 | 68.89 | | **Qwen-7B** | 90.74 | 92.59 | 74.07 | ## é·ãæèã®ç解 ã³ã³ããã¹ãã®é·ããæ¡åŒµããèšç·Žã·ãŒã±ã³ã¹ã®é·ãã®ããã«ããã¯ã解æ¶ããããã«ãNTK ãèæ ®ããè£éããŠã£ã³ããŠã¢ãã³ã·ã§ã³ãLogN ã¢ãã³ã·ã§ã³ã¹ã±ãŒãªã³ã°ãªã©ã®æè¡ãå°å ¥ããã³ã³ããã¹ãã®é·ãã 8K ããŒã¯ã³ä»¥äžã«æ¡åŒµãããarXiv ããŒã¿ã»ãããçšã㊠PPL è©äŸ¡ã«ããèšèªã¢ããªã³ã°å®éšãè¡ããQwen-7B ãé·ãã³ã³ããã¹ãã®ã·ããªãªã«ãããŠåè¶ããæ§èœãéæã§ããããšãèŠåºããã以äžã«çµæã瀺ããŸã:
Model | Sequence Length | ||||
---|---|---|---|---|---|
1024 | 2048 | 4096 | 8192 | 16384 | |
Qwen-7B | 4.23 | 3.78 | 39.35 | 469.81 | 2645.09 |
+ dynamic_ntk | 4.23 | 3.78 | 3.59 | 3.66 | 5.71 |
+ dynamic_ntk + logn | 4.23 | 3.78 | 3.58 | 3.56 | 4.62 |
+ dynamic_ntk + logn + window_attn | 4.23 | 3.78 | 3.58 | 3.49 | 4.32 |