Isekai-Qwen/recipes/applications/retrieval/retrieval.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "245ab07a-fb2f-4cf4-ab9a-5c05a9b44daa",
   "metadata": {},
   "source": [
    "# LangChain retrieval knowledge base Q&A based on Qwen-7B-Chat"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "e8df2cb7-a69c-4231-9596-4c871d893633",
   "metadata": {},
   "source": [
    "This notebook introduces a question-answering application based on a local knowledge base using Qwen-7B-Chat with langchain. The goal is to establish a knowledge base Q&A solution that is friendly to many scenarios and open-source models, and that can run offline. The implementation process of this project includes loading files -> reading text -> segmenting text -> vectorizing text -> vectorizing questions -> matching the top k most similar text vectors with the question vectors -> incorporating the matched text as context along with the question into the prompt -> submitting to the LLM (Large Language Model) to generate an answer."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92e9c81a-45c7-4c12-91af-3c5dd52f63bb",
   "metadata": {},
   "source": [
    "## Preparation"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84cfcf88-3bef-4412-a658-4eaefeb6502a",
   "metadata": {},
   "source": [
    "Download Qwen-7B-Chat\n",
    "\n",
    "Firstly, we need to download the model. You can use the snapshot_download that comes with modelscope to download the model to a specified directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9c1f9ded-8035-42c7-82c7-444ce06572bc",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "!pip install modelscope"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7c26225c-c958-429e-b81d-2de9820670c2",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from modelscope.hub.snapshot_download import snapshot_download\n",
    "snapshot_download(\"Qwen/Qwen-7B-Chat\",cache_dir='/tmp/models') "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e8f51796-49fa-467d-a825-ae9a281eb3fd",
   "metadata": {},
   "source": [
    "Download the dependencies for langchain and Qwen."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "87fe1023-644f-4610-afaf-0b7cddc30d60",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "!pip install langchain==0.0.187 dashscope==1.0.4 sentencepiece==0.1.99 cpm_kernels==1.0.11 nltk==3.8.1 sentence_transformers==2.2.2 unstructured==0.6.5 faiss-cpu==1.7.4 icetk==0.0.7"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "853cdfa4-a2ce-4baa-919a-b9e2aecd2706",
   "metadata": {},
   "source": [
    "Download the retrieval document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8ba800dc-311d-4a83-8115-f05b09b39ffd",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "!wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/qwen_recipes/LLM_Survey_Chinese.pdf.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "07e923b3-b7ae-4983-abeb-2ce115566f15",
   "metadata": {},
   "source": [
    "Download the text2vec model, for Chinese in our case."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9a07cd8d-3cec-40f6-8d2b-eb111aaf1164",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "!wget https://atp-modelzoo-sh.oss-cn-shanghai.aliyuncs.com/release/tutorials/qwen_recipes/GanymedeNil_text2vec-large-chinese.tar.gz\n",
    "!tar -zxvf GanymedeNil_text2vec-large-chinese.tar.gz -C /tmp"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dc483af0-170e-4e61-8d25-a336d1592e34",
   "metadata": {},
   "source": [
    "## Try out the model \n",
    "\n",
    "Load the Qwen-7B-Chat model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c112cf82-0447-46c4-9c32-18f243c0a686",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "from abc import ABC\n",
    "from langchain.llms.base import LLM\n",
    "from typing import Any, List, Mapping, Optional\n",
    "from langchain.callbacks.manager import CallbackManagerForLLMRun\n",
    "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
    "\n",
    "model_path=\"/tmp/models/Qwen/Qwen-7B-Chat\"\n",
    "\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)\n",
    "model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).half().cuda()\n",
    "model.eval()\n",
    "\n",
    "class Qwen(LLM, ABC):\n",
    "    max_token: int = 10000\n",
    "    temperature: float = 0.01\n",
    "    top_p = 0.9\n",
    "    history_len: int = 3\n",
    "\n",
    "    def __init__(self):\n",
    "        super().__init__()\n",
    "\n",
    "    @property\n",
    "    def _llm_type(self) -> str:\n",
    "        return \"Qwen\"\n",
    "\n",
    "    @property\n",
    "    def _history_len(self) -> int:\n",
    "        return self.history_len\n",
    "\n",
    "    def set_history_len(self, history_len: int = 10) -> None:\n",
    "        self.history_len = history_len\n",
    "\n",
    "    def _call(\n",
    "        self,\n",
    "        prompt: str,\n",
    "        stop: Optional[List[str]] = None,\n",
    "        run_manager: Optional[CallbackManagerForLLMRun] = None,\n",
    "    ) -> str:\n",
    "        response, _ = model.chat(tokenizer, prompt, history=[])\n",
    "        return response\n",
    "    \n",
    "    @property\n",
    "    def _identifying_params(self) -> Mapping[str, Any]:\n",
    "        \"\"\"Get the identifying parameters.\"\"\"\n",
    "        return {\"max_token\": self.max_token,\n",
    "                \"temperature\": self.temperature,\n",
    "                \"top_p\": self.top_p,\n",
    "                \"history_len\": self.history_len}\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "382ed433-870f-424e-b074-210ea6f84b70",
   "metadata": {},
   "source": [
    "Specify the txt file that needs retrieval for knowledge-based Q&A."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14be706b-4a7d-4906-9369-1f03c6c99854",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import torch\n",
    "import argparse\n",
    "from langchain.vectorstores import FAISS\n",
    "from langchain.embeddings.huggingface import HuggingFaceEmbeddings\n",
    "from typing import List, Tuple\n",
    "import numpy as np\n",
    "from langchain.document_loaders import TextLoader\n",
    "from chinese_text_splitter import ChineseTextSplitter\n",
    "from langchain.docstore.document import Document\n",
    "from langchain.prompts.prompt import PromptTemplate\n",
    "from langchain.chains import RetrievalQA\n",
    "\n",
    "\n",
    "def load_file(filepath, sentence_size=100):\n",
    "    loader = TextLoader(filepath, autodetect_encoding=True)\n",
    "    textsplitter = ChineseTextSplitter(pdf=False, sentence_size=sentence_size)\n",
    "    docs = loader.load_and_split(textsplitter)\n",
    "    write_check_file(filepath, docs)\n",
    "    return docs\n",
    "\n",
    "\n",
    "def write_check_file(filepath, docs):\n",
    "    folder_path = os.path.join(os.path.dirname(filepath), \"tmp_files\")\n",
    "    if not os.path.exists(folder_path):\n",
    "        os.makedirs(folder_path)\n",
    "    fp = os.path.join(folder_path, 'load_file.txt')\n",
    "    with open(fp, 'a+', encoding='utf-8') as fout:\n",
    "        fout.write(\"filepath=%s,len=%s\" % (filepath, len(docs)))\n",
    "        fout.write('\\n')\n",
    "        for i in docs:\n",
    "            fout.write(str(i))\n",
    "            fout.write('\\n')\n",
    "        fout.close()\n",
    "\n",
    "        \n",
    "def seperate_list(ls: List[int]) -> List[List[int]]:\n",
    "    lists = []\n",
    "    ls1 = [ls[0]]\n",
    "    for i in range(1, len(ls)):\n",
    "        if ls[i - 1] + 1 == ls[i]:\n",
    "            ls1.append(ls[i])\n",
    "        else:\n",
    "            lists.append(ls1)\n",
    "            ls1 = [ls[i]]\n",
    "    lists.append(ls1)\n",
    "    return lists\n",
    "\n",
    "\n",
    "class FAISSWrapper(FAISS):\n",
    "    chunk_size = 250\n",
    "    chunk_conent = True\n",
    "    score_threshold = 0\n",
    "    \n",
    "    def similarity_search_with_score_by_vector(\n",
    "            self, embedding: List[float], k: int = 4\n",
    "    ) -> List[Tuple[Document, float]]:\n",
    "        scores, indices = self.index.search(np.array([embedding], dtype=np.float32), k)\n",
    "        docs = []\n",
    "        id_set = set()\n",
    "        store_len = len(self.index_to_docstore_id)\n",
    "        for j, i in enumerate(indices[0]):\n",
    "            if i == -1 or 0 < self.score_threshold < scores[0][j]:\n",
    "                # This happens when not enough docs are returned.\n",
    "                continue\n",
    "            _id = self.index_to_docstore_id[i]\n",
    "            doc = self.docstore.search(_id)\n",
    "            if not self.chunk_conent:\n",
    "                if not isinstance(doc, Document):\n",
    "                    raise ValueError(f\"Could not find document for id {_id}, got {doc}\")\n",
    "                doc.metadata[\"score\"] = int(scores[0][j])\n",
    "                docs.append(doc)\n",
    "                continue\n",
    "            id_set.add(i)\n",
    "            docs_len = len(doc.page_content)\n",
    "            for k in range(1, max(i, store_len - i)):\n",
    "                break_flag = False\n",
    "                for l in [i + k, i - k]:\n",
    "                    if 0 <= l < len(self.index_to_docstore_id):\n",
    "                        _id0 = self.index_to_docstore_id[l]\n",
    "                        doc0 = self.docstore.search(_id0)\n",
    "                        if docs_len + len(doc0.page_content) > self.chunk_size:\n",
    "                            break_flag = True\n",
    "                            break\n",
    "                        elif doc0.metadata[\"source\"] == doc.metadata[\"source\"]:\n",
    "                            docs_len += len(doc0.page_content)\n",
    "                            id_set.add(l)\n",
    "                if break_flag:\n",
    "                    break\n",
    "        if not self.chunk_conent:\n",
    "            return docs\n",
    "        if len(id_set) == 0 and self.score_threshold > 0:\n",
    "            return []\n",
    "        id_list = sorted(list(id_set))\n",
    "        id_lists = seperate_list(id_list)\n",
    "        for id_seq in id_lists:\n",
    "            for id in id_seq:\n",
    "                if id == id_seq[0]:\n",
    "                    _id = self.index_to_docstore_id[id]\n",
    "                    doc = self.docstore.search(_id)\n",
    "                else:\n",
    "                    _id0 = self.index_to_docstore_id[id]\n",
    "                    doc0 = self.docstore.search(_id0)\n",
    "                    doc.page_content += \" \" + doc0.page_content\n",
    "            if not isinstance(doc, Document):\n",
    "                raise ValueError(f\"Could not find document for id {_id}, got {doc}\")\n",
    "            doc_score = min([scores[0][id] for id in [indices[0].tolist().index(i) for i in id_seq if i in indices[0]]])\n",
    "            doc.metadata[\"score\"] = int(doc_score)\n",
    "            docs.append((doc, doc_score))\n",
    "        return docs\n",
    "\n",
    "\n",
    "if __name__ == '__main__':\n",
    "    # load docs\n",
    "    filepath = 'LLM_Survey_Chinese.pdf.txt'\n",
    "    # LLM name\n",
    "    LLM_TYPE = 'qwen'\n",
    "    # Embedding model name\n",
    "    EMBEDDING_MODEL = 'text2vec'\n",
    "    # 基于上下文的prompt模版，请务必保留\"{question}\"和\"{context_str}\"\n",
    "    PROMPT_TEMPLATE = \"\"\"已知信息：\n",
    "    {context_str} \n",
    "    根据上述已知信息，简洁和专业的来回答用户的问题。如果无法从中得到答案，请说 “根据已知信息无法回答该问题” 或 “没有提供足够的相关信息”，不允许在答案中添加编造成分，答案请使用中文。 问题是：{question}\"\"\"\n",
    "    # Embedding running device\n",
    "    EMBEDDING_DEVICE = \"cuda\"\n",
    "    # return top-k text chunk from vector store\n",
    "    VECTOR_SEARCH_TOP_K = 3\n",
    "    # 文本分句长度\n",
    "    SENTENCE_SIZE = 50\n",
    "    CHAIN_TYPE = 'stuff'\n",
    "    llm_model_dict = {\n",
    "        \"qwen\": QWen,\n",
    "    }\n",
    "    embedding_model_dict = {\n",
    "        \"text2vec\": \"/tmp/GanymedeNil_text2vec-large-chinese\",\n",
    "    }\n",
    "    print(\"loading model start\")\n",
    "    llm = llm_model_dict[LLM_TYPE]()\n",
    "    embeddings = HuggingFaceEmbeddings(model_name=embedding_model_dict[EMBEDDING_MODEL],model_kwargs={'device': EMBEDDING_DEVICE})\n",
    "    print(\"loading model done\")\n",
    "\n",
    "    print(\"loading documents start\")\n",
    "    docs = load_file(filepath, sentence_size=SENTENCE_SIZE)\n",
    "    print(\"loading documents done\")\n",
    "\n",
    "    print(\"embedding start\")\n",
    "    docsearch = FAISSWrapper.from_documents(docs, embeddings)\n",
    "    print(\"embedding done\")\n",
    "\n",
    "    print(\"loading qa start\")\n",
    "    prompt = PromptTemplate(\n",
    "        template=PROMPT_TEMPLATE, input_variables=[\"context_str\", \"question\"]\n",
    "    )\n",
    "\n",
    "    chain_type_kwargs = {\"prompt\": prompt, \"document_variable_name\": \"context_str\"}\n",
    "    qa = RetrievalQA.from_chain_type(\n",
    "        llm=llm,\n",
    "        chain_type=CHAIN_TYPE, \n",
    "        retriever=docsearch.as_retriever(search_kwargs={\"k\": VECTOR_SEARCH_TOP_K}), \n",
    "        chain_type_kwargs=chain_type_kwargs)\n",
    "    print(\"loading qa done\")\n",
    "\n",
    "    query = \"大模型指令微调有好的策略？\"  \n",
    "    print(qa.run(query))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}