GPT2 for Text Classification Using Hugging Face Transformers

Republished by Plato

Text Classification

This notebook fine-tunes a GPT2 model for text classification on a custom dataset using the Hugging Face Transformers library.

Hugging Face was kind enough to include all the functionality needed to use GPT2 for classification tasks. Thank you, Hugging Face!

I wasn't able to find much information on how to use GPT2 for classification, so I decided to write this tutorial using a structure similar to the one used for other transformer models.

Main idea: Since GPT2 is a decoder transformer, the last token of the input sequence is used to make predictions about the next token that should follow the input. Each token can attend only to the tokens before it, so the last token is the only position that has seen the whole input. This means that the last token of the input sequence contains all the information needed for the prediction. With this in mind, we can use that information to make a prediction in a classification task instead of a generation task.

In other words, instead of using the first token embedding to make a prediction as we do in Bert, we will use the last token embedding to make a prediction with GPT2.

Since we only cared about the first token in Bert, we padded on the right. Now in GPT2 we use the last token for prediction, so we need to pad on the left. Thanks to a nice upgrade to HuggingFace Transformers, we can configure the GPT2 Tokenizer to do just that.
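
To make the padding argument concrete, here is a minimal plain-Python sketch. It does not use the real tokenizer: `pad_batch` and the pad id `0` are hypothetical names chosen only for illustration. It shows that with left padding the final position of every row holds a real token, while with right padding it mostly holds padding.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        padded.append(padding + seq if side == "left" else seq + padding)
    return padded


PAD = 0  # hypothetical pad id for this toy example
batch = [[11, 12, 13], [21, 22], [31]]

left = pad_batch(batch, PAD, side="left")
right = pad_batch(batch, PAD, side="right")

# With left padding the final column holds the last real token of every row.
last_left = [row[-1] for row in left]
# With right padding the final column is padding for every shorter row.
last_right = [row[-1] for row in right]

print(left)        # [[11, 12, 13], [0, 21, 22], [0, 0, 31]]
print(last_left)   # [13, 22, 31]
print(last_right)  # [13, 0, 0]
```

Reading the prediction from the last position is therefore only safe for a whole batch when the padding sits on the left.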

What should I know for this notebook?

Since I am using PyTorch to fine-tune our transformer models, any knowledge of PyTorch is very useful.

Knowing a little bit about the transformers library helps too.

How to use this notebook?

As with every project, I built this notebook with reusability in mind.

All changes will happen in the data processing part, where you need to customize the PyTorch Dataset, Data Collator and DataLoader to fit your own data needs.

All parameters that can be changed are under the Imports section. Each parameter is commented and structured to be as intuitive as possible.

Dataset

This notebook covers fine-tuning transformers on a custom dataset. I will use the well-known Large Movie Review Dataset, a collection of movie reviews labeled positive or negative.

The description provided on the Stanford website:

This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more details.

Why this dataset? I believe it is an easy dataset to understand and use for classification. I think sentiment data is always fun to work with.

Coding

Now let's do some coding! We will go through each coding cell in the notebook and describe what it does, what the code is, and, when relevant, show the output.

I made this format easy to follow if you decide to run each code cell in your own Python notebook.

When I learn from a tutorial I always try to replicate the results. I believe it is easy to follow along if you have the code next to the explanations.

Downloads

Download the Large Movie Review Dataset and unzip it locally.

# Download the dataset.
!wget -q -nc http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
# Unzip the dataset.
!tar -zxf /content/aclImdb_v1.tar.gz

Installs

transformers: this library needs to be installed to use all the awesome code from Hugging Face. To get the latest version I will install it straight from GitHub.
ml_things: a library used for various machine learning related tasks. I created this library to reduce the amount of code I need to write for each machine learning project.

# Install transformers library.
!pip install -q git+https://github.com/huggingface/transformers.git
# Install helper functions.
!pip install -q git+https://github.com/gmihaila/ml_things.git

Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing wheel metadata ... done
|████████████████████████████████| 2.9MB 6.7MB/s
|████████████████████████████████| 890kB 48.9MB/s
|████████████████████████████████| 1.1MB 49.0MB/s
Building wheel for transformers (PEP 517) ... done
Building wheel for sacremoses (setup.py) ... done
|████████████████████████████████| 71kB 5.2MB/s
Building wheel for ml-things (setup.py) ... done
Building wheel for ftfy (setup.py) ... done

Imports

Import all needed libraries for this notebook. Declare the parameters used for this notebook:

set_seed(123) - Always good to set a fixed seed for reproducibility.
epochs - Number of training epochs (the Bert authors recommend between 2 and 4 for fine-tuning).
batch_size - Number of examples in each batch, depending on the max sequence length and GPU memory. For a sequence length of 512, a batch of 10 usually works without CUDA memory issues. For small sequence lengths you can try a batch of 32 or higher.
max_length - Pad or truncate text sequences to a specific length. I will set it to 60 to speed up training.
device - Look for a GPU to use. Will use the CPU by default if no GPU is found.
model_name_or_path - Name of a transformers model: will use an already pretrained model. Path of a transformer model: will load your own model from local disk. In this tutorial I will use the gpt2 model.
labels_ids - Dictionary of labels and their ids, used to convert string labels to numbers.
n_labels - How many labels we are using in this dataset. This is used to decide the size of the classification head.

import io
import os
import torch
from tqdm.notebook import tqdm
from torch.utils.data import Dataset, DataLoader
from ml_things import plot_dict, plot_confusion_matrix, fix_text
from sklearn.metrics import classification_report, accuracy_score
from transformers import (set_seed,
                          TrainingArguments,
                          Trainer,
                          GPT2Config,
                          GPT2Tokenizer,
                          AdamW,
                          get_linear_schedule_with_warmup,
                          GPT2ForSequenceClassification)

# Set seed for reproducibility.
set_seed(123)

# Number of training epochs (authors on fine-tuning Bert recommend between 2 and 4).
epochs = 4

# Number of examples per batch - depending on the max sequence length and GPU memory.
# For 512 sequence length batch of 10 works without cuda memory issues.
# For small sequence length can try batch of 32 or higher.
batch_size = 32

# Pad or truncate text sequences to a specific length
# if `None` it will use maximum sequence of word piece tokens allowed by model.
max_length = 60

# Look for gpu to use. Will use `cpu` by default if no gpu found.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Name of transformers model - will use already pretrained model.
# Path of transformer model - will load your own model from local disk.
model_name_or_path = 'gpt2'

# Dictionary of labels and their id - this will be used to convert
# string labels to number ids.
labels_ids = {'neg': 0, 'pos': 1}

# How many labels are we using in training.
# This is used to decide size of classification head.
n_labels = len(labels_ids)

Helper Functions
I like to keep all classes and functions that will be used in this notebook under this section to help maintain a clean look of the notebook:
MovieReviewsDataset(Dataset)
If you have worked with PyTorch before, this is pretty standard. We need this class to read in our dataset, parse it and return texts with their associated labels.
In this class I only need to read in the content of each file, use fix_text to fix any Unicode problems, and keep track of positive and negative sentiments.
I will append all texts and labels to lists.
There are three main parts of this PyTorch Dataset class:

__init__() where we read in the dataset and store the texts and their labels. The conversion to numbers happens later, in the collator.
__len__() where we need to return the number of examples we read in. This is used when calling len(MovieReviewsDataset()).
__getitem__() always takes as input an int value that represents which example to return from our dataset. If a value of 3 is passed, we will return the example from our dataset at position 3.
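
The three parts can be tried out without PyTorch or the real data. The sketch below is only an illustration under stated assumptions: `FolderTextDataset` is a hypothetical stand-in class, and the tiny corpus is made up and written to a temporary directory that mimics the one-folder-per-label layout of the dataset.

```python
import os
import tempfile


class FolderTextDataset:
    """Toy stand-in for a PyTorch Dataset: one sub-folder per label."""

    def __init__(self, path, label_names):
        self.texts, self.labels = [], []
        for label in label_names:
            label_dir = os.path.join(path, label)
            # Sort file names so the example order is deterministic.
            for file_name in sorted(os.listdir(label_dir)):
                with open(os.path.join(label_dir, file_name), encoding="utf-8") as f:
                    self.texts.append(f.read())
                self.labels.append(label)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, item):
        return {"text": self.texts[item], "label": self.labels[item]}


# Build a tiny throwaway corpus with the same folder layout (made-up reviews).
root = tempfile.mkdtemp()
samples = {"pos": ["Loved it.", "Great cast."], "neg": ["Too long."]}
for label, reviews in samples.items():
    os.makedirs(os.path.join(root, label))
    for i, review in enumerate(reviews):
        with open(os.path.join(root, label, f"{i}.txt"), "w", encoding="utf-8") as f:
            f.write(review)

dataset = FolderTextDataset(root, ["pos", "neg"])
print(len(dataset))  # 3
print(dataset[2])    # {'text': 'Too long.', 'label': 'neg'}
```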

class MovieReviewsDataset(Dataset):
    r"""PyTorch Dataset class for loading data.

    This is where the data parsing happens.

    This class is built with reusability in mind: it can be used as is.

    Arguments:

        path (:obj:`str`):
            Path to the data partition.

        use_tokenizer:
            Not used inside this class; kept so existing calls keep working.

    """

    def __init__(self, path, use_tokenizer):

        # Check if path exists.
        if not os.path.isdir(path):
            # Raise error if path is invalid.
            raise ValueError('Invalid `path` variable! Needs to be a directory')
        self.texts = []
        self.labels = []
        # Since the labels are defined by folders with data we loop
        # through each label.
        for label in ['pos', 'neg']:
            sentiment_path = os.path.join(path, label)

            # Get all files from path.
            files_names = os.listdir(sentiment_path)#[:10] # Sample for debugging.
            # Go through each file and read its content.
            for file_name in tqdm(files_names, desc=f'{label} files'):
                file_path = os.path.join(sentiment_path, file_name)

                # Read content and close the file when done.
                with io.open(file_path, mode='r', encoding='utf-8') as file:
                    content = file.read()
                # Fix any unicode issues.
                content = fix_text(content)
                # Save content.
                self.texts.append(content)
                # Save the string label; it is encoded later by the collator.
                self.labels.append(label)

        # Number of examples.
        self.n_examples = len(self.labels)

    def __len__(self):
        r"""When used `len` return the number of examples.

        """

        return self.n_examples

    def __getitem__(self, item):
        r"""Given an index return an example from the position.

        Arguments:

            item (:obj:`int`):
                Index position to pick an example to return.

        Returns:
            :obj:`Dict[str, str]`: Dictionary of inputs that contain text and
            associated labels.

        """

        return {'text':self.texts[item],
                'label':self.labels[item]}

Gpt2ClassificationCollator
I use this class to create the Data Collator. This will be used in the DataLoader to create the batches of data that get fed to the model. I use the tokenizer and the label encoder on each sequence to convert texts and labels to numbers.
Lucky for us, Hugging Face thought of everything and made the tokenizer do all the heavy lifting (split text into tokens, padding, truncating, encode text into numbers), and it is very easy to use!
There are two main parts of this Data Collator class:

__init__() where we initialize the tokenizer we plan to use, how to encode our labels, and whether we need to set the sequence length to a different value.
__call__() used as the collate function that takes as input a batch of data examples. It needs to return an object in a format that can be fed to our model. Luckily our tokenizer does that for us and returns a dictionary of variables ready to be fed to the model in this way: model(**inputs). Since we are fine-tuning the model, I also included the labels.
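
The reason a class works here is that any object with a `__call__` method can be used wherever a function is expected. The following is a rough sketch of that pattern only: `ToyCollator` is hypothetical, and a whitespace split with truncation stands in for the real tokenizer, so it produces word lists rather than tensors.

```python
class ToyCollator:
    """Callable object usable wherever a collate function is expected."""

    def __init__(self, labels_encoder, max_words=None):
        # Settings are stored once and reused on every batch.
        self.labels_encoder = labels_encoder
        self.max_words = max_words

    def __call__(self, examples):
        # Hypothetical stand-in for a tokenizer: split on whitespace, then truncate.
        tokens = [ex["text"].split()[: self.max_words] for ex in examples]
        # Encode string labels to integer ids.
        labels = [self.labels_encoder[ex["label"]] for ex in examples]
        return {"tokens": tokens, "labels": labels}


collate = ToyCollator(labels_encoder={"neg": 0, "pos": 1}, max_words=3)
batch = collate([
    {"text": "a truly wonderful little film", "label": "pos"},
    {"text": "dull", "label": "neg"},
])
print(batch["tokens"])  # [['a', 'truly', 'wonderful'], ['dull']]
print(batch["labels"])  # [1, 0]
```

Keeping the configuration in `__init__` means the same object can be handed to several data loaders without repeating the settings.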

class Gpt2ClassificationCollator(object):
    r"""
    Data Collator used for GPT2 in a classification task.

    It uses a given tokenizer and label encoder to convert any text and labels to numbers that
    can go straight into a GPT2 model.

    This class is built with reusability in mind: it can be used as is as long
    as the `dataloader` outputs a batch in dictionary format that can be passed
    straight into the model - `model(**batch)`.

    Arguments:

        use_tokenizer (:obj:`transformers.tokenization_?`):
            Transformer type tokenizer used to process raw text into numbers.

        labels_encoder (:obj:`dict`):
            Dictionary to encode any labels names into numbers. Keys map to
            labels names and Values map to number associated to those labels.

        max_sequence_len (:obj:`int`, `optional`)
            Value to indicate the maximum desired sequence to truncate or pad text
            sequences. If no value is passed it will use the maximum sequence size
            supported by the tokenizer and model.

    """

    def __init__(self, use_tokenizer, labels_encoder, max_sequence_len=None):

        # Tokenizer to be used inside the class.
        self.use_tokenizer = use_tokenizer
        # Check max sequence length.
        self.max_sequence_len = use_tokenizer.model_max_length if max_sequence_len is None else max_sequence_len
        # Label encoder used inside the class.
        self.labels_encoder = labels_encoder

    def __call__(self, sequences):
        r"""
        This function allows the class object to be used as a function call.
        Since the PyTorch DataLoader needs a collator function, I can use this
        class as a function.

        Arguments:

            sequences (:obj:`list`):
                List of texts and labels.

        Returns:
            :obj:`Dict[str, object]`: Dictionary of inputs that feed into the model.
            It holds the statement `model(**Returned Dictionary)`.
        """

        # Get all texts from sequences list.
        texts = [sequence['text'] for sequence in sequences]
        # Get all labels from sequences list.
        labels = [sequence['label'] for sequence in sequences]
        # Encode all labels using label encoder.
        labels = [self.labels_encoder[label] for label in labels]
        # Call tokenizer on all texts to convert into tensors of numbers with
        # appropriate padding.
        inputs = self.use_tokenizer(text=texts, return_tensors="pt", padding=True, truncation=True, max_length=self.max_sequence_len)
        # Update the inputs with the associated encoded labels as tensor.
        inputs.update({'labels':torch.tensor(labels)})

        return inputs

train(dataloader, optimizer_, scheduler_, device_)
I created this function to perform a full pass through the DataLoader object (the DataLoader object is created from our Dataset object, which uses the MovieReviewsDataset class). This is basically one epoch of training over the entire dataset.
The dataloader is created from a PyTorch DataLoader, which takes the object created from the MovieReviewsDataset class and puts the examples into batches. This way we can feed our model batches of data!
The optimizer_ and scheduler_ are very common in PyTorch. They are required to update the parameters of our model and to update our learning rate during training. There is a lot more to it than that, but I won't go into details. This can be a huge rabbit hole, since a lot happens behind these functions that we don't need to worry about. Thank you, PyTorch!
In the process we keep track of the actual labels and the predicted labels along with the loss.
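
One step inside the training function that may not be obvious is the gradient clipping to a norm of 1.0. As a rough plain-Python illustration of clipping by global norm (this is a sketch of the idea, not PyTorch's implementation; `clip_by_global_norm` and the small epsilon are assumptions made for this example):

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Gradients that are already small enough are left unchanged (scale of 1.0).
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

print(norm_before)           # 5.0
print(round(norm_after, 3))  # 1.0
```

The direction of the gradient is preserved; only its length is capped, which is what limits the size of any single update.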
def train(dataloader, optimizer_, scheduler_, device_):
    r"""
    Train pytorch model on a single pass through the data loader.

    It will use the global variable `model` which is the transformer model
    loaded on `device_` that we want to train on.

    This function is built with reusability in mind: it can be used as is as long
    as the `dataloader` outputs a batch in dictionary format that can be passed
    straight into the model - `model(**batch)`.

    Arguments:

        dataloader (:obj:`torch.utils.data.dataloader.DataLoader`):
            Parsed data into batches of tensors.

        optimizer_ (:obj:`transformers.optimization.AdamW`):
            Optimizer used for training.

        scheduler_ (:obj:`torch.optim.lr_scheduler.LambdaLR`):
            PyTorch scheduler.

        device_ (:obj:`torch.device`):
            Device used to load tensors before feeding to model.

    Returns:

        :obj:`List[List[int], List[int], float]`: List of [True Labels, Predicted
            Labels, Train Average Loss].
    """

    # Use global variable for model.
    global model

    # Tracking variables.
    predictions_labels = []
    true_labels = []
    # Total loss for this epoch.
    total_loss = 0

    # Put the model into training mode.
    model.train()

    # For each batch of training data...
    for batch in tqdm(dataloader, total=len(dataloader)):

        # Add original labels - use later for evaluation.
        true_labels += batch['labels'].numpy().flatten().tolist()

        # move batch to device
        batch = {k:v.type(torch.long).to(device_) for k,v in batch.items()}

        # Always clear any previously calculated gradients before performing a
        # backward pass.
        model.zero_grad()

        # Perform a forward pass (evaluate the model on this training batch).
        # This will return the loss (rather than the model output) because we
        # have provided the `labels`.
        # The documentation for the equivalent bert model function is here:
        # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
        outputs = model(**batch)

        # The call to `model` always returns a tuple, so we need to pull the
        # loss value out of the tuple along with the logits. We will use logits
        # later to calculate training accuracy.
        loss, logits = outputs[:2]

        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. `loss` is a Tensor containing a
        # single value; the `.item()` function just returns the Python value
        # from the tensor.
        total_loss += loss.item()

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        # Use the optimizer passed in as an argument, not a global one.
        optimizer_.step()

        # Update the learning rate with the scheduler passed in as an argument.
        scheduler_.step()

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()

        # Convert these logits to list of predicted labels values.
        predictions_labels += logits.argmax(axis=-1).flatten().tolist()

    # Calculate the average loss over the training data.
    avg_epoch_loss = total_loss / len(dataloader)

    # Return all true labels and prediction for future evaluations.
    return true_labels, predictions_labels, avg_epoch_loss

validation(dataloader, device_)
I implemented this function in a very similar way to train, but without the parameter update, the backward pass and the gradient descent part. We don't need to do all of those computationally intensive tasks because we only care about our model's predictions.
I use the DataLoader in a similar way as in train to get batches to feed to our model.
In the process I keep track of the actual labels and the predicted labels along with the loss.
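
Turning the model's raw scores into predicted labels is just an argmax over each row of logits. A small plain-Python sketch of that step and of the accuracy that follows from it, using made-up logits (the helper `argmax` and all numbers here are hypothetical):

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical logits for three reviews: [score for neg, score for pos].
logits = [[2.1, -0.3], [-1.0, 0.4], [0.2, 0.9]]
true_labels = [0, 1, 0]

# The predicted label is the position of the highest score in each row.
predictions = [argmax(row) for row in logits]
accuracy = sum(p == t for p, t in zip(predictions, true_labels)) / len(true_labels)

print(predictions)         # [0, 1, 1]
print(round(accuracy, 3))  # 0.667
```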
def validation(dataloader, device_):
    r"""Validation function to evaluate model performance on a
    separate set of data.

    This function will return the true and predicted labels so we can use later
    to evaluate the model's performance.

    This function is built with reusability in mind: it can be used as is as long
    as the `dataloader` outputs a batch in dictionary format that can be passed
    straight into the model - `model(**batch)`.

    Arguments:

        dataloader (:obj:`torch.utils.data.dataloader.DataLoader`):
            Parsed data into batches of tensors.

        device_ (:obj:`torch.device`):
            Device used to load tensors before feeding to model.

    Returns:

        :obj:`List[List[int], List[int], float]`: List of [True Labels, Predicted
            Labels, Validation Average Loss]
    """

    # Use global variable for model.
    global model

    # Tracking variables
    predictions_labels = []
    true_labels = []
    # Total loss for this epoch.
    total_loss = 0

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Evaluate data for one epoch
    for batch in tqdm(dataloader, total=len(dataloader)):

        # add original labels
        true_labels += batch['labels'].numpy().flatten().tolist()

        # move batch to device
        batch = {k:v.type(torch.long).to(device_) for k,v in batch.items()}

        # Telling the model not to compute or store gradients, saving memory and
        # speeding up validation
        with torch.no_grad():

            # Forward pass, calculate logit predictions.
            # The batch includes `labels`, so the model returns the loss
            # together with the logits.
            # The documentation for the equivalent bert model function is here:
            # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
            outputs = model(**batch)

            # The call to `model` always returns a tuple, so we need to pull the
            # loss value out of the tuple along with the logits. We will use logits
            # later to calculate validation accuracy.
            loss, logits = outputs[:2]

            # Move logits and labels to CPU
            logits = logits.detach().cpu().numpy()

            # Accumulate the validation loss over all of the batches so that we can
            # calculate the average loss at the end. `loss` is a Tensor containing a
            # single value; the `.item()` function just returns the Python value
            # from the tensor.
            total_loss += loss.item()

            # get predictions to list
            predict_content = logits.argmax(axis=-1).flatten().tolist()

            # update list
            predictions_labels += predict_content

    # Calculate the average loss over the validation data.
    avg_epoch_loss = total_loss / len(dataloader)

    # Return all true labels and predictions for future evaluations.
    return true_labels, predictions_labels, avg_epoch_loss

Load Model and Tokenizer
Loading the three essential parts of the pretrained GPT2 transformer: configuration, tokenizer and model.
For this example I will use gpt2 from the HuggingFace pretrained transformers. You can use any variation of GPT2 you want.
In creating the model_config I will mention the number of labels I need for my classification task. Since I only predict two sentiments, positive and negative, I will only need two labels for num_labels.
Creating the tokenizer is pretty standard when using the Transformers library. After creating the tokenizer, it is critical for this tutorial to set the padding to the left with tokenizer.padding_side = "left" and to initialize the padding token to tokenizer.eos_token, which is GPT2's original end of sequence token. This is the most essential part of this tutorial: since GPT2 uses the last token for prediction, we need to pad to the left.
HuggingFace already did most of the work for us and added a classification layer to the GPT2 model. To create the model I used GPT2ForSequenceClassification. Since we have a custom padding token, we need to initialize it for the model using model.config.pad_token_id. Finally, we will need to move the model to the device we defined earlier.
# Get model configuration.
print('Loading configuraiton...')
model_config = GPT2Config.from_pretrained(pretrained_model_name_or_path=model_name_or_path, num_labels=n_labels)

# Get model's tokenizer.
print('Loading tokenizer...')
tokenizer = GPT2Tokenizer.from_pretrained(pretrained_model_name_or_path=model_name_or_path)
# default to left padding
tokenizer.padding_side = "left"
# Define PAD Token = EOS Token = 50256
tokenizer.pad_token = tokenizer.eos_token

# Get the actual model.
print('Loading model...')
model = GPT2ForSequenceClassification.from_pretrained(pretrained_model_name_or_path=model_name_or_path, config=model_config)

# resize model embedding to match new tokenizer
model.resize_token_embeddings(len(tokenizer))

# fix model padding token id
model.config.pad_token_id = model.config.eos_token_id

# Load model to defined device.
model.to(device)
print('Model loaded to `%s`'%device)

Loading configuraiton...
Loading tokenizer...
Loading model...
Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Model loaded to `cuda`

Dataset and Collator
This is where I create the PyTorch Dataset and DataLoader, with the Data Collator object, that will be used to feed data into our model.
This is where I use the MovieReviewsDataset class to create the PyTorch Dataset that will return texts and labels.
Since we need to input numbers to our model, we need to convert the texts and labels to numbers. This is the purpose of the collator! It takes the data output by the PyTorch Dataset and passes it through the Data Collator function to output the sequences for our model.
I'm keeping the tokenizer away from the PyTorch Dataset to make the code cleaner and better structured. You could instead use the tokenizer inside the PyTorch Dataset and output sequences that can be fed straight into the model without a Data Collator.
I strongly recommend using a validation set to determine how much training is needed in order to avoid overfitting. After you figure out which parameters yield the best results, the validation data can be incorporated into the training data for a final training run on the whole dataset.
The data collator is used to format the PyTorch Dataset outputs to match the inputs needed for GPT2.
# Create data collator to encode text and labels into numbers.
gpt2_classification_collator = Gpt2ClassificationCollator(use_tokenizer=tokenizer,
                                                          labels_encoder=labels_ids,
                                                          max_sequence_len=max_length)

print('Dealing with Train...')
# Create pytorch dataset.
train_dataset = MovieReviewsDataset(path='/content/aclImdb/train',
                                    use_tokenizer=tokenizer)
print('Created `train_dataset` with %d examples!'%len(train_dataset))

# Move pytorch dataset into dataloader.
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=gpt2_classification_collator)
print('Created `train_dataloader` with %d batches!'%len(train_dataloader))

print()

print('Dealing with Validation...')
# Create pytorch dataset.
valid_dataset = MovieReviewsDataset(path='/content/aclImdb/test',
                                    use_tokenizer=tokenizer)
print('Created `valid_dataset` with %d examples!'%len(valid_dataset))

# Move pytorch dataset into dataloader.
valid_dataloader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False, collate_fn=gpt2_classification_collator)
print('Created `eval_dataloader` with %d batches!'%len(valid_dataloader))

Dealing with Train...
pos files: 100%|████████████████████████████████|12500/12500 [01:17<00:00, 161.19it/s]
neg files: 100%|████████████████████████████████|12500/12500 [01:05<00:00, 190.72it/s]
Created `train_dataset` with 25000 examples!
Created `train_dataloader` with 782 batches!

Reading pos files...
pos files: 100%|████████████████████████████████|12500/12500 [00:54<00:00, 230.93it/s]
neg files: 100%|████████████████████████████████|12500/12500 [00:42<00:00, 291.07it/s]
Created `valid_dataset` with 25000 examples!
Created `eval_dataloader` with 782 batches!

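
The 782 batches reported above follow directly from the dataset size and the batch size. A quick check in plain Python, assuming the DataLoader keeps the final, smaller batch (its default behavior):

```python
import math

n_examples = 25000  # reviews in each of the train and test partitions
batch_size = 32     # value set in the Imports section

# The last, incomplete batch is kept, so round up.
n_batches = math.ceil(n_examples / batch_size)
last_batch_size = n_examples - (n_batches - 1) * batch_size

print(n_batches)        # 782
print(last_batch_size)  # 8
```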
Train
I created the optimizer and the scheduler that PyTorch uses in training. I used the most common parameters used by transformer models.
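
As a rough sketch of what a linear schedule with warmup does to the learning rate, the function below reproduces the shape in plain Python. It is an illustration, not the library's code: `linear_schedule_lr` is a hypothetical helper, and the numbers plugged in are the ones this notebook uses (782 batches per epoch, 4 epochs, a learning rate of 2e-5 and no warmup steps).

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining


total_steps = 782 * 4  # batches per epoch * epochs = 3128

lr_start = linear_schedule_lr(0, base_lr=2e-5, warmup_steps=0, total_steps=total_steps)
lr_middle = linear_schedule_lr(1564, base_lr=2e-5, warmup_steps=0, total_steps=total_steps)
lr_end = linear_schedule_lr(3128, base_lr=2e-5, warmup_steps=0, total_steps=total_steps)

print(lr_start, lr_middle, lr_end)
```

With no warmup the learning rate starts at its full value, is halved by the midpoint of training and reaches zero on the final step.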
I loop through the number of defined epochs and call the train and validation functions.
I'm trying to output similar info after each epoch as Keras: train_loss: - val_loss: - train_acc: - valid_acc.
After training, plot the train and validation loss and accuracy curves to check how the training went.
Note: The training plots might look a little weird: the validation accuracy starts higher than the training accuracy, and the validation loss starts lower than the training loss. Normally this would be the opposite. I assume the data split just happens to be easier for the validation part, or too hard for the training part, or both. Since this tutorial is about using GPT2 for classification, I will not worry too much about the results of the model.
# Note: AdamW is a class from the huggingface library (as opposed to pytorch)
# I believe the 'W' stands for 'Weight Decay fix'
optimizer = AdamW(model.parameters(),
                  lr = 2e-5, # default is 5e-5, our notebook had 2e-5
                  eps = 1e-8 # default is 1e-8.
                  )

# Total number of training steps is number of batches * number of epochs.
# `train_dataloader` contains batched data so `len(train_dataloader)` gives
# us the number of batches.
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)

# Store the average loss after each epoch so we can plot them.
all_loss = {'train_loss':[], 'val_loss':[]}
all_acc = {'train_acc':[], 'val_acc':[]}

# Loop through each epoch.
print('Epoch')
for epoch in tqdm(range(epochs)):
    print()
    print('Training on batches...')
    # Perform one full pass over the training set.
    train_labels, train_predict, train_loss = train(train_dataloader, optimizer, scheduler, device)
    train_acc = accuracy_score(train_labels, train_predict)

    # Get prediction from model on validation data.
    print('Validation on batches...')
    valid_labels, valid_predict, val_loss = validation(valid_dataloader, device)
    val_acc = accuracy_score(valid_labels, valid_predict)

    # Print loss and accuracy values to see how training evolves.
    print("  train_loss: %.5f - val_loss: %.5f - train_acc: %.5f - valid_acc: %.5f"%(train_loss, val_loss, train_acc, val_acc))
    print()

    # Store the loss value for plotting the learning curve.
    all_loss['train_loss'].append(train_loss)
    all_loss['val_loss'].append(val_loss)
    all_acc['train_acc'].append(train_acc)
    all_acc['val_acc'].append(val_acc)

# Plot loss curves.
plot_dict(all_loss, use_xlabel='Epochs', use_ylabel='Value', use_linestyles=['-', '--'])

# Plot accuracy curves.
plot_dict(all_acc, use_xlabel='Epochs', use_ylabel='Value', use_linestyles=['-', '--'])

Epoch
100%|████████████████████████████████|4/4 [15:11<00:00, 227.96s/it]

Training on batches...
100%|████████████████████████████████|782/782 [02:42<00:00, 4.82it/s]
Validation on batches...
100%|████████████████████████████████|782/782 [02:07<00:00, 6.13it/s]
  train_loss: 0.54128 - val_loss: 0.38758 - train_acc: 0.75288 - valid_acc: 0.81904

Training on batches...
100%|████████████████████████████████|782/782 [02:36<00:00, 5.00it/s]
Validation on batches...
100%|████████████████████████████████|782/782 [01:41<00:00, 7.68it/s]
  train_loss: 0.36716 - val_loss: 0.37620 - train_acc: 0.83288 - valid_acc: 0.82912

Training on batches...
100%|████████████████████████████████|782/782 [02:36<00:00, 5.00it/s]
Validation on batches...
100%|████████████████████████████████|782/782 [01:24<00:00, 9.24it/s]
  train_loss: 0.31409 - val_loss: 0.39384 - train_acc: 0.86304 - valid_acc: 0.83044

Training on batches...
100%|████████████████████████████████|782/782 [02:36<00:00, 5.00it/s]
Validation on batches...
100%|████████████████████████████████|782/782 [01:09<00:00, 11.29it/s]
  train_loss: 0.27358 - val_loss: 0.39798 - train_acc: 0.88432 - valid_acc: 0.83292
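
One way to read this log when deciding how long to train is to compare epochs on the validation metrics. The sketch below uses the four validation values printed above; which criterion to prefer is a judgment call, and the variable names are only illustrative.

```python
# Validation metrics copied from the training log, one value per epoch.
val_loss = [0.38758, 0.37620, 0.39384, 0.39798]
val_acc = [0.81904, 0.82912, 0.83044, 0.83292]

# Epochs are numbered from 1, list positions from 0.
best_loss_epoch = min(range(len(val_loss)), key=lambda i: val_loss[i]) + 1
best_acc_epoch = max(range(len(val_acc)), key=lambda i: val_acc[i]) + 1

print(best_loss_epoch)  # 2
print(best_acc_epoch)   # 4
```

Validation loss is lowest after the second epoch and rises afterwards while the training loss keeps falling, which is the usual sign that further epochs mostly fit the training data. Validation accuracy still creeps up, so the two criteria disagree here.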

Train and validation loss.


Train and validation accuracy.

Evaluate
When dealing with classification, it is useful to look at precision, recall and F1 score.
A good gauge to have when evaluating a model is the confusion matrix.
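
For readers who want to see what those three numbers measure, here is a minimal plain-Python sketch that computes them for one class from two label lists. The helper `precision_recall_f1` is hypothetical and the labels are made up for illustration; the notebook itself relies on scikit-learn for the real report.

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Precision, recall and F1 for one class, computed from raw label lists."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels for illustration only (0 = neg, 1 = pos).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=1)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Precision asks how many of the predicted positives were right, recall asks how many of the actual positives were found, and F1 is their harmonic mean.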
# Get prediction from model on validation data. This is where you should use
# your test data.
true_labels, predictions_labels, avg_epoch_loss = validation(valid_dataloader, device)

# Create the evaluation report.
evaluation_report = classification_report(true_labels, predictions_labels, labels=list(labels_ids.values()), target_names=list(labels_ids.keys()))
# Show the evaluation report.
print(evaluation_report)

# Plot confusion matrix.
plot_confusion_matrix(y_true=true_labels, y_pred=predictions_labels,
                      classes=list(labels_ids.keys()), normalize=True,
                      magnify=0.1,
                      );

Training on batches...
100%|████████████████████████████████|782/782 [01:09<00:00, 11.24it/s]

              precision    recall  f1-score   support

         neg       0.84      0.83      0.83     12500
         pos       0.83      0.84      0.83     12500

    accuracy                           0.83     25000
   macro avg       0.83      0.83      0.83     25000
weighted avg       0.83      0.83      0.83     25000

Confusion matrix normalized.

Final Note
If you made it this far, congrats! 🎊 And thank you! 🙏 for your interest in my tutorial!
I've been using this code for a while now, and I feel it has reached a point where it is nicely documented and easy to follow.
Of course it is easy for me to follow because I built it. That is why any feedback is welcome and helps me improve my future tutorials!
If you see something wrong, please let me know by opening an issue on my ml_things GitHub repository!
A lot of tutorials out there are mostly a one-time thing and are not maintained. I plan on keeping my tutorials up to date as much as I can.
This article was originally published on George Mihaila's personal website and re-published to TOPBOTS with permission from the author.

Source: https://www.topbots.com/gpt2-text-classification-using-hugging-face-transformers/