--data_path Dahoas/rm-static \
--model_name_or_path bigscience/bloomz-560m \
--gradient_accumulation_steps 8 --lora_dim 128 --zero_stage $ZERO_STAGE \
--per_device_train_batch_size 8 --per_device_eval_batch_size 8 \
--deepspeed --output_dir $OUTPUT 2>&1 | tee $OUTPUT/training.log

Note: the settings above use about 30 GB of GPU memory. You can lower per_device_train_batch_size and per_device_eval_batch_size to reduce GPU memory usage.

Loading the model breaks down into three parts:

Load the tokenizer: AutoTokenizer.from_pretrained(…)
Load the model config: AutoConfig.from_pretrained(…)
Load the model: AutoModelForCausalLM.from_pretrained(…)

The code below shows the implementation details.

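from transformers import AutoModelForCausalLM, AutoTokenizer

from utils.model.model_utils import create_hf_model

# fast_tokenizer is the keyword as written in DS-Chat; the keyword documented
# by transformers for selecting the fast tokenizer is use_fast.
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path,
                                          fast_tokenizer=True)
model = create_hf_model(AutoModelForCausalLM,
                        args.model_name_or_path,
                        tokenizer,
                        ds_config,
                        disable_dropout=args.disable_dropout)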

The create_hf_model function is implemented as follows:

import math

from transformers import AutoConfig
# Import path used by DS-Chat at the time; newer transformers releases expose
# HfDeepSpeedConfig from transformers.integrations instead.
from transformers.deepspeed import HfDeepSpeedConfig


def create_hf_model(model_class,
                    model_name_or_path,
                    tokenizer,
                    ds_config=None,
                    rlhf_training=False,
                    disable_dropout=False):
    model_config = AutoConfig.from_pretrained(model_name_or_path)
    if disable_dropout:
        model_config.dropout = 0.0
    # Note: dschf is defined in function scope to avoid global effects
    # https://huggingface.co/docs/transformers/main_classes/deepspeed#nontrainer-deepspeed-integration
    if ds_config is not None and ds_config["zero_optimization"]["stage"] == 3:
        dschf = HfDeepSpeedConfig(ds_config)
    else:
        dschf = None
    if rlhf_training:
        # the weight loading is handled by create critic model
        model = model_class.from_config(model_config)
    else:
        model = model_class.from_pretrained(
            model_name_or_path,
            from_tf=bool(".ckpt" in model_name_or_path),
            config=model_config)

    model.config.end_token_id = tokenizer.eos_token_id
    model.config.pad_token_id = model.config.eos_token_id
    # make the vocab size multiple of 8
    model.resize_token_embeddings(int(8 * math.ceil(len(tokenizer) / 8.0)))

    return model

With the BLOOMZ family of models, none of the model-loading code needs to change. With other models, such as GLM, DS-Chat cannot load the model directly, and the code above has to be adapted.
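
The resize_token_embeddings call in create_hf_model rounds the embedding table up to the next multiple of 8. The sketch below isolates that arithmetic. The helper name padded_vocab_size is hypothetical, and the sizes are illustrative values rather than measurements of any particular tokenizer.

```python
import math


def padded_vocab_size(vocab_size: int, multiple: int = 8) -> int:
    """Round vocab_size up to the nearest multiple (same formula as create_hf_model)."""
    return int(multiple * math.ceil(vocab_size / float(multiple)))


# Illustrative sizes only; in DS-Chat the input is len(tokenizer).
for size in (50257, 32000, 1001):
    print(size, "->", padded_vocab_size(size))
```

A size that is already a multiple of 8 is left unchanged, so the resize only adds unused rows when the tokenizer length is not aligned.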

Common questions:

Out of memory during training:
Fix: reduce the batch size, for example by adding --per_device_train_batch_size 1 --per_device_eval_batch_size 1.
You can also shorten the sequence length with --max_seq_len 255.
Where models downloaded from Hugging Face are stored locally:
The default location is ~/.cache/huggingface/hub.
How to use your own model:
Set the model_name_or_path argument to the local path.
Make sure the model folder contains both tokenizer_config.json and tokenizer.json (DS-Chat does not store these two files when it saves a model).
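
The tokenizer-file check for a local model folder can be automated before launching training. This is a minimal sketch using only the standard library. The function name missing_tokenizer_files is hypothetical, and the two file names are the ones named in the note above.

```python
from pathlib import Path
from typing import List

# File names taken from the note above.
REQUIRED_TOKENIZER_FILES = ("tokenizer_config.json", "tokenizer.json")


def missing_tokenizer_files(model_dir: str) -> List[str]:
    """Return the required tokenizer files that are absent from model_dir."""
    folder = Path(model_dir)
    return [name for name in REQUIRED_TOKENIZER_FILES
            if not (folder / name).is_file()]
```

Calling missing_tokenizer_files on the path you pass as model_name_or_path returns an empty list when both files are present. Any names it returns have to be copied in from the original model before training.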

3. Replacing the data

An important part of working with large models is further optimizing them on task-specific data. A model fine-tuned on data from the relevant task usually performs better on the target task. Training with your own data in DS-Chat takes three steps:

Prepare the data and organize it in a fixed format, for example JSON.
Modify data_utils.py and raw_datasets.py to add support for the new data.
Point the training shell script at the new data and start training.

3.1 How to prepare the data

Before preparing data, you need to know the format the model expects during training, which you can find by reading raw_datasets.py. The following is one of the dataset readers implemented there:

class HelloSimpleAIHC3ChineseDataset(PromptRawDataset):

    def get_prompt(self, sample):
        if sample['question'] is not None:
            return " Human: " + sample['question'] + " Assistant:"
        return None

    def get_chosen(self, sample):
        if sample['human_answers'][0] is not None:
            return " " + sample['human_answers'][0]
        return None

    def get_prompt_and_chosen(self, sample):
        if sample['question'] is not None and sample['human_answers'][
                0] is not None:
            return " Human: " + sample['question'] + " Assistant: " + sample[
                'human_answers'][0]
        return None

    def get_rejected(self, sample):
        ...

    def get_prompt_and_rejected(self, sample):
        ...

The code above shows that this dataset exposes three kinds of fields: prompt, answer (the chosen response), and rejected, plus two combinations, prompt+answer and prompt+rejected. The basic content of the training data is therefore prompt, answer, and rejected.

The section around line 141 of data_utils.py then shows how each stage uses them:

Stage 1 calls get_prompt_and_chosen() to read training data, so Stage 1 training needs only prompt and answer.
Stage 2 calls get_prompt_and_chosen and get_prompt_and_rejected to read the data used to train the reward model, so it needs prompt, answer, and rejected.
Stage 3 calls only get_prompt, so a prompt alone is enough for Stage 3 training.
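
The per-stage requirements can be written down as a small lookup table and used to check samples before training. This is a sketch, not DS-Chat code. The dictionary keys "prompt", "chosen", and "rejected" and the helper missing_fields are hypothetical names chosen for illustration.

```python
# Fields each training stage reads, following the description above.
# "chosen" is the answer; "rejected" is the dispreferred answer.
STAGE_FIELDS = {
    1: ("prompt", "chosen"),
    2: ("prompt", "chosen", "rejected"),
    3: ("prompt",),
}


def missing_fields(sample: dict, stage: int) -> list:
    """Return the fields that the stage needs but the sample lacks (or has as None)."""
    return [name for name in STAGE_FIELDS[stage] if sample.get(name) is None]


# A sample with no rejected answer is usable for Stage 1 and Stage 3 only.
sample = {"prompt": "What is ZeRO?", "chosen": "A memory optimization in DeepSpeed."}
for stage in (1, 2, 3):
    print("stage", stage, "missing:", missing_fields(sample, stage))
```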

Training the LLMZoo models is similar to Stage 1, so the data you prepare only needs to contain prompt and answer.

To make the data easier to read, I converted the phoenix-sft-data-v1 data to a different format. Here is a JSON sample of the result:

[
  {
    "id": "0",
    "type": "Instruction",
    "from_human": "Suppose you are an Airbnb host. ... \n",
    "from_gpt": "Sorry, as an AI language model, I cannot inspect your Airbnb listing."
  },
  {
    "id": "1",
    "type": "Instruction",
    "from_human": "Suppose you are a translator. ... \n",
    "from_gpt": "\"Al dente\" means cooking the ..."
  }
]

Here from_human is the prompt and from_gpt is the answer. If you have your own data, you can prepare it in the same format.
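
Records in this format can be generated from plain (prompt, answer) pairs. The sketch below assumes every record uses the "Instruction" type seen in the sample and that ids are consecutive strings. The helper names to_sft_records and save_records are hypothetical.

```python
import json


def to_sft_records(pairs):
    """Turn (prompt, answer) pairs into records with the four fields shown above."""
    return [
        {"id": str(i), "type": "Instruction", "from_human": prompt, "from_gpt": answer}
        for i, (prompt, answer) in enumerate(pairs)
    ]


def save_records(records, path):
    """Write records as a UTF-8 JSON array, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


# Placeholder pair; replace with your own prompts and answers.
pairs = [("What is DeepSpeed?\n", "A library for training large models.")]
records = to_sft_records(pairs)
print(json.dumps(records, ensure_ascii=False, indent=2))
```

Passing ensure_ascii=False keeps Chinese text human-readable in the output file instead of escaping it as \u sequences.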

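3.2 Modifying the code to read the data

Next, we look at how to modify the code to read custom data. DS-Chat ships readers for several data formats. You can pick the reader class whose format is closest to your data and modify it, or pick one of the existing formats and prepare your data to match it, which reduces the amount of code you need to change.

The code changes are as follows (see the video for the step-by-step process):

data_utils.py
Addition: create the object for the new dataset class and hook it into the interface.
raw_datasets.py
Addition: define the new dataset reader class. To read local data with load_dataset: self.raw_datasets = load_dataset(path="/home/data/", data_files="yourData.json").
run1.3b.sh
Change: set the name of your own dataset.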
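
A reader class for the from_human / from_gpt format could look roughly like the sketch below, which mirrors the HelloSimpleAIHC3ChineseDataset methods shown earlier. The class name LocalJsonSftDataset is hypothetical, and PromptRawDataset is redefined here as an empty stand-in so the example runs on its own. In DS-Chat you would inherit from the real base class in raw_datasets.py and load the file in its constructor.

```python
class PromptRawDataset:
    """Stand-in for the DS-Chat base class so this sketch is self-contained."""


class LocalJsonSftDataset(PromptRawDataset):
    """Hypothetical reader for records with from_human / from_gpt fields."""

    def get_prompt(self, sample):
        if sample["from_human"] is not None:
            return " Human: " + sample["from_human"] + " Assistant:"
        return None

    def get_chosen(self, sample):
        if sample["from_gpt"] is not None:
            return " " + sample["from_gpt"]
        return None

    def get_prompt_and_chosen(self, sample):
        if sample["from_human"] is not None and sample["from_gpt"] is not None:
            return " Human: " + sample["from_human"] + " Assistant: " + sample["from_gpt"]
        return None

    def get_rejected(self, sample):
        # This format carries no rejected answers, so Stage 2 cannot use it.
        return None

    def get_prompt_and_rejected(self, sample):
        return None


sample = {"from_human": "What is LoRA?", "from_gpt": "A low-rank adaptation method."}
reader = LocalJsonSftDataset()
print(reader.get_prompt_and_chosen(sample))
```

Because only prompt and answer are available, a reader like this is suitable for Stage 1 style training only.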

During training, data_utils.py uses the dataset name to select the reader class and initialize the reader object. Then, in raw_datasets.py, the first call to load_dataset converts the JSON file to Arrow format and caches it under cache_dir. Later reads load the cached Arrow file directly.

Notes:
For distributed training, it is best to cache the data with a single GPU process first, because caching from multiple processes during distributed training can fail, especially with large datasets.

Also note that DS-Chat caches the data a second time on the local machine, which can take extra disk space. With large datasets this approach also leads to excessive memory consumption. The maintainers are working on this; see the link below for details. While learning, you can use a small sample or train on multiple GPUs to mitigate the problem.
https://github.com/microsoft/DeepSpeedExamples/issues/450

Data call flow
Below is the call chain I followed when modifying the code. You can use it as a reference when making your own changes.

- File: step1_supervised_finetuning/main.py
  - Line 224: train_dataset, eval_dataset = create_prompt_dataset()
    - File: training/utils/data/data_utils.py
      - Line 268: train_dataset, eval_dataset = create_dataset()
      - Line 212: raw_dataset = get_raw_dataset()
        - Line 20: def get_raw_dataset():
            return raw_datasets.Wangrui6ZhihuKOLDataset()
            - File: training/utils/data/raw_datasets.py
              - Line 307: class Wangrui6ZhihuKOLDataset(PromptRawDataset)
      - Line 220: train_dataset = create_dataset_split()
        - Line 141: if train_phase == 1:
            chosen_sentence = raw_dataset.get_prompt_and_chosen()
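
The get_raw_dataset step in this call chain maps a dataset name to a reader class. The sketch below shows one way to express that dispatch with a dictionary. It is not how DS-Chat implements it: the registry, the dataset name strings, and the LocalJsonSftDataset class are all hypothetical, and both classes are empty placeholders so the example runs on its own.

```python
class Wangrui6ZhihuKOLDataset:
    """Placeholder standing in for the class of the same name in raw_datasets.py."""


class LocalJsonSftDataset:
    """Hypothetical reader class for your own data."""


# Hypothetical name-to-class registry; the name strings are illustrative only.
DATASET_CLASSES = {
    "example/zhihu-kol": Wangrui6ZhihuKOLDataset,
    "local/my-sft-data": LocalJsonSftDataset,
}


def get_raw_dataset(dataset_name):
    """Return a reader object for dataset_name, failing loudly if it is unknown."""
    try:
        return DATASET_CLASSES[dataset_name]()
    except KeyError:
        raise RuntimeError(
            f"No reader registered for dataset {dataset_name!r}") from None


print(type(get_raw_dataset("local/my-sft-data")).__name__)
```

Whatever form the dispatch takes, the name set in the training script has to match a name the dispatch recognizes, otherwise no reader is created.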

Common questions

Q/A 1: Error: Exception: Current loss scale already at minimum – cannot decrease scale anymore. Exiting run
Description: you may hit this error during training. It is usually caused by unstable training, and increasing the batch size is recommended to make training more stable. A larger batch size uses more GPU memory; alternatives are to use multiple GPUs or to set gradient_accumulation_steps, which has the effect of a larger batch size.
If the problem persists, try float32 (this generally applies to NaN errors).
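
The trade-off between per-device batch size, gradient accumulation, and GPU count is plain multiplication. In the sketch below the helper name effective_batch_size is hypothetical, and the single-GPU count is an assumption. The values 8 and 8 come from the launch command at the top of this section.

```python
def effective_batch_size(per_device: int, accumulation_steps: int, num_gpus: int) -> int:
    """Number of samples that contribute to one optimizer step across all GPUs."""
    return per_device * accumulation_steps * num_gpus


# Launch command values: batch 8 per device, 8 accumulation steps, one GPU assumed.
print(effective_batch_size(8, 8, 1))
# Same effective size with a per-device batch of 1 and more accumulation steps.
print(effective_batch_size(1, 64, 1))
```

Both calls give the same effective batch size. The second configuration holds only one sample per forward pass, so it needs less GPU memory at the cost of more steps per update.
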
Q/A 2: Remember to delete temporary data
By default, DS-Chat caches the data several times, including:

The Hugging Face cache; for example, map operations cache their results automatically (changing the program can trigger re-caching, so remember to delete stale cache files).
load_dataset automatically caches JSON data in Arrow format.
DS-Chat caches the data on the local machine: the traindata-xxxx.pt and evaldata-xxx.pt files live under /tmp/data_files/, along with a data index file (*.npy).

Q/A 3: Data loading errors during distributed training
Run the load_dataset step once on a single GPU so the basic data processing is cached, then launch multi-node distributed training.
Q/A 4: Reducing host memory usage with large datasets
Split the data into smaller pieces (this requires matching code changes).
Loading data on demand is another option; the maintainers are working on it, and you can follow progress at: https://github.com/microsoft/DeepSpeedExamples/issues/450
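
Splitting a large JSON dataset into smaller files can be done with the standard library alone. This is a sketch of the splitting step only. The helper names, the shard file naming, and the shard size are hypothetical, and the reader code still has to be changed to consume the shards.

```python
import json
from pathlib import Path


def shard_records(records, shard_size):
    """Yield consecutive slices of records with at most shard_size items each."""
    if shard_size < 1:
        raise ValueError("shard_size must be at least 1")
    for start in range(0, len(records), shard_size):
        yield records[start:start + shard_size]


def write_shards(records, out_dir, shard_size, prefix="part"):
    """Write each shard to out_dir as prefix-00000.json, prefix-00001.json, ...; return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, shard in enumerate(shard_records(records, shard_size)):
        path = out / f"{prefix}-{i:05d}.json"
        path.write_text(json.dumps(shard, ensure_ascii=False), encoding="utf-8")
        paths.append(path)
    return paths
```

Each shard is itself a valid JSON array in the same record format, so it can be loaded in the same way as the original file.
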
Q/A 5: After editing local data, retraining still uses the old data
This is caused by DS-Chat's data caching. You need to delete the cache files on the local machine by hand:
The default cache directory is /tmp/data_files/. Delete it and restart training.
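
Deleting the cache directory can be scripted so that it runs before each training launch. In this sketch the helper name clear_data_cache is hypothetical, and the default path is the one given above. Pass a different path if your setup caches elsewhere.

```python
import shutil
from pathlib import Path


def clear_data_cache(cache_dir="/tmp/data_files/"):
    """Delete cache_dir and everything in it; return True if it existed."""
    path = Path(cache_dir)
    if not path.is_dir():
        return False
    shutil.rmtree(path)
    return True
```

The function removes the whole directory, including the .pt files and the .npy index files, so the next run rebuilds them from the current data.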