GPT2 for Text Classification Using Hugging Face Transformers

Republished by Plato

Text Classification

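This notebook fine-tunes a GPT2 model for text classification on a custom dataset, using the Hugging Face transformers library.

Hugging Face was kind enough to include all the functionality needed to use GPT2 in classification tasks. Thank you, Hugging Face!

I could not find much information on how to use GPT2 for classification, so I decided to write this tutorial with a structure similar to my tutorials for other transformer models.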

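Main idea: GPT2 is a decoder transformer, so the last token of the input sequence is the one used to predict the next token that should follow the input. This means the last token of the input sequence carries all the information needed for the prediction. With this in mind, we can use that information to make a prediction in a classification task instead of a generation task.

In other words, instead of using the first token embedding to make a prediction as we do in Bert, we use the last token embedding to make a prediction with GPT2.

Since we only cared about the first token in Bert, we padded on the right. In GPT2 we use the last token for prediction, so we need to pad on the left. Thanks to a nice upgrade to HuggingFace Transformers, we can configure the GPT2 Tokenizer to do just that.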
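
To make the padding argument concrete, here is a minimal library-free sketch. The helper `pad_batch` and the token ids are made up for illustration; only the pad id 50256 (GPT2's end-of-sequence id, reused as padding later in this tutorial) comes from the notebook. It shows which token ends up in the last position of each row under right and left padding.

```python
PAD_ID = 50256  # GPT2's end-of-sequence id, reused as the padding id


def pad_batch(sequences, side, pad_id=PAD_ID):
    """Pad lists of token ids to the length of the longest sequence."""
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        filler = [pad_id] * (width - len(seq))
        padded.append(filler + seq if side == "left" else seq + filler)
    return padded


batch = [[11, 12, 13, 14], [21, 22]]  # made-up token ids

right = pad_batch(batch, side="right")
left = pad_batch(batch, side="left")

# Token sitting in the last position of every row.
last_right = [row[-1] for row in right]
last_left = [row[-1] for row in left]

print(last_right)  # [14, 50256] -> the short row ends in padding
print(last_left)   # [14, 22]    -> every row ends in a real token
```

With right padding, the last position of the shorter row holds a padding token, so reading the prediction from the last position would look at padding. With left padding, the last position is always a real token.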

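What should I know for this notebook?

Since I am using PyTorch to fine-tune the transformer model, any knowledge of PyTorch is very useful.

Knowing a little about the transformers library helps too.

How to use this notebook?

As with every project, I built this notebook with reusability in mind.

All changes happen in the data processing part, where you need to customize the PyTorch Dataset, Data Collator and DataLoader to fit your own data.

All parameters that can be changed are in the Imports section. Each parameter is commented and structured to be as intuitive as possible.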

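Dataset

This notebook covers fine-tuning a pretrained transformer on a custom dataset. I use the well-known Large Movie Review Dataset, which contains movie reviews labeled as positive or negative.

The description provided on the Stanford website:

This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more details.

Why this dataset? I believe it is an easy dataset to understand and use for classification, and I think sentiment data is always fun to work with.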

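Coding

Now let's do some coding! We will go through each code cell in the notebook and describe what it does, show the code and, where relevant, show the output.

I chose this format so it is easy to follow if you decide to run each code cell in your own Python notebook.

When I learn from a tutorial I always try to replicate the results. I find it easier to follow along when the code sits next to the explanations.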

Downloads

Download the Large Movie Review Dataset and unzip it locally.

# Download the dataset.
!wget -q -nc http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
# Unzip the dataset.
!tar -zxf /content/aclImdb_v1.tar.gz

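Installs

transformers: the library needs to be installed to use all the awesome code from Hugging Face. To get the latest version I install it straight from GitHub.
ml_things: a library used for various machine learning related tasks. I created it to reduce the amount of code I need to write for each machine learning project.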

# Install transformers library.
!pip install -q git+https://github.com/huggingface/transformers.git
# Install helper functions.
!pip install -q git+https://github.com/gmihaila/ml_things.git

Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing wheel metadata ... done
|████████████████████████████████| 2.9MB 6.7MB/s
|████████████████████████████████| 890kB 48.9MB/s
|████████████████████████████████| 1.1MB 49.0MB/s
Building wheel for transformers (PEP 517) ... done
Building wheel for sacremoses (setup.py) ... done
|████████████████████████████████| 71kB 5.2MB/s
Building wheel for ml-things (setup.py) ... done
Building wheel for ftfy (setup.py) ... done

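Imports

Import all the libraries needed for this notebook and declare the parameters it uses:

set_seed(123) – It is always good to set a fixed seed for reproducibility.
epochs – Number of training epochs (the authors recommend between 2 and 4).
batch_size – Number of examples per batch – depends on the max sequence length and GPU memory. For a sequence length of 512, a batch of 10 USUALLY works without cuda memory issues. For small sequence lengths you can try a batch of 32 or higher.
max_length – Pad or truncate text sequences to a specific length. I set it to 60 to speed up training.
device – Look for a GPU to use. Uses the CPU by default if no GPU is found.
model_name_or_path – Name of a transformers model – uses an already pretrained model. Path of a transformer model – loads your own model from local disk. In this tutorial it is the gpt2 model.
labels_ids – Dictionary of labels and their ids – used to convert string labels to numbers.
n_labels – How many labels we are using in this dataset. This is used to decide the size of the classification head.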

import io
import os
import torch
from tqdm.notebook import tqdm
from torch.utils.data import Dataset, DataLoader
from ml_things import plot_dict, plot_confusion_matrix, fix_text
from sklearn.metrics import classification_report, accuracy_score
from transformers import (set_seed,
                          TrainingArguments,
                          Trainer,
                          GPT2Config,
                          GPT2Tokenizer,
                          AdamW,
                          get_linear_schedule_with_warmup,
                          GPT2ForSequenceClassification)

# Set seed for reproducibility.
set_seed(123)

# Number of training epochs (authors on fine-tuning Bert recommend between 2 and 4).
epochs = 4

# Number of examples per batch - depends on the max sequence length and GPU memory.
# For 512 sequence length batch of 10 works without cuda memory issues.
# For small sequence length can try batch of 32 or higher.
batch_size = 32

# Pad or truncate text sequences to a specific length
# if `None` it will use maximum sequence of word piece tokens allowed by model.
max_length = 60

# Look for gpu to use. Will use `cpu` by default if no gpu found.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Name of transformers model - will use already pretrained model.
# Path of transformer model - will load your own model from local disk.
model_name_or_path = 'gpt2'

# Dictionary of labels and their id - this will be used to convert
# string labels to number ids.
labels_ids = {'neg': 0, 'pos': 1}

# How many labels are we using in training.
# This is used to decide size of classification head.
n_labels = len(labels_ids)
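
Helper Functions

I like to keep all the classes and functions used in this notebook under this section to help maintain a clean look of the notebook.

MovieReviewsDataset(Dataset)

If you have worked with PyTorch before, this is pretty standard. We need this class to read in the dataset, parse it and return texts with their associated labels.
In this class I only need to read the content of each file, use fix_text to fix any Unicode problems and keep track of positive and negative sentiments.
I append all texts and labels to lists.
There are three main parts of this PyTorch Dataset class:

__init__() reads the dataset from disk and stores the raw texts and their string labels. The conversion to numbers happens later, in the collator.
__len__() returns the number of examples we read in. This is used when calling len(MovieReviewsDataset()).
__getitem__() always takes an int value as input that says which example to return from the dataset. If a value of 3 is passed, we return the example at position 3.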

class MovieReviewsDataset(Dataset):
    r"""PyTorch Dataset class for loading data.

    This is where the data parsing happens.

    This class is built with reusability in mind.

    Arguments:

        path (:obj:`str`):
            Path to the data partition.

        use_tokenizer:
            Tokenizer passed by the caller. It is not used inside this class;
            tokenization happens in the collator.
    """

    def __init__(self, path, use_tokenizer):
        # Check if path exists.
        if not os.path.isdir(path):
            # Raise error if path is invalid.
            raise ValueError('Invalid `path` variable! Needs to be a directory')
        self.texts = []
        self.labels = []
        # Since the labels are defined by folders with data we loop
        # through each label.
        for label in ['pos', 'neg']:
            sentiment_path = os.path.join(path, label)
            # Get all files from path.
            files_names = os.listdir(sentiment_path)#[:10] # Sample for debugging.
            # Go through each file and read its content.
            for file_name in tqdm(files_names, desc=f'{label} files'):
                file_path = os.path.join(sentiment_path, file_name)
                # Read content (the `with` block closes the file afterwards).
                with io.open(file_path, mode='r', encoding='utf-8') as file:
                    content = file.read()
                # Fix any unicode issues.
                content = fix_text(content)
                # Save content.
                self.texts.append(content)
                # Save the string label; it is encoded later by the collator.
                self.labels.append(label)
        # Number of examples.
        self.n_examples = len(self.labels)

    def __len__(self):
        r"""When used `len` return the number of examples.
        """
        return self.n_examples

    def __getitem__(self, item):
        r"""Given an index return an example from the position.

        Arguments:

            item (:obj:`int`):
                Index position to pick an example to return.

        Returns:
            :obj:`Dict[str, str]`: Dictionary of inputs that contain text and
            associated labels.
        """
        return {'text': self.texts[item], 'label': self.labels[item]}
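
For quick experiments without the dataset on disk, the same contract can be sketched with an in-memory class. `ToyReviewsDataset` and its example texts are hypothetical and not part of the notebook; it is a plain Python class so it runs without PyTorch, and it returns the same dictionary shape as MovieReviewsDataset.

```python
class ToyReviewsDataset:
    """Hypothetical in-memory stand-in with the same __len__/__getitem__ contract."""

    def __init__(self, examples):
        # `examples` is a list of (text, label) pairs.
        self.texts = [text for text, _ in examples]
        self.labels = [label for _, label in examples]

    def __len__(self):
        # Number of examples held in memory.
        return len(self.labels)

    def __getitem__(self, item):
        # Same dictionary format as the dataset class above.
        return {'text': self.texts[item], 'label': self.labels[item]}


toy_dataset = ToyReviewsDataset([
    ('A wonderful film.', 'pos'),
    ('Two hours I will never get back.', 'neg'),
    ('Great cast, great story.', 'pos'),
])

print(len(toy_dataset))  # 3
print(toy_dataset[1])    # {'text': 'Two hours I will never get back.', 'label': 'neg'}
```

Because only `__len__` and `__getitem__` are involved, a small object like this is handy for checking the collator and training loop on a handful of examples before reading all 25,000 files.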
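
Gpt2ClassificationCollator

I use this class to create the Data Collator. The DataLoader uses it to build the batches of data that are fed to the model. For each sequence, it uses the tokenizer and the label encoder to convert texts and labels to numbers.
Lucky for us, Hugging Face thought of everything and made the tokenizer do all the heavy lifting (splitting text into tokens, padding, truncating, encoding text into numbers), and it is very easy to use!
There are two main parts of this Data Collator class:

__init__() initializes the tokenizer we plan to use, how to encode the labels and, if needed, a different maximum sequence length.
__call__() is used as the collator function. It takes a batch of data examples as input and needs to return an object in a format that can be fed to the model. Luckily the tokenizer does that for us and returns a dictionary of variables ready to be fed to the model like this: model(**inputs). Since we are fine-tuning the model, I also include the labels.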

class Gpt2ClassificationCollator(object):
    r"""
    Data Collator used for GPT2 in a classification task.

    It uses a given tokenizer and label encoder to convert any text and labels
    to numbers that can go straight into a GPT2 model.

    This class is built with reusability in mind: it can be used as is as long
    as the `dataloader` outputs a batch in dictionary format that can be passed
    straight into the model - `model(**batch)`.

    Arguments:

        use_tokenizer (:obj:`transformers.tokenization_?`):
            Transformer type tokenizer used to process raw text into numbers.

        labels_encoder (:obj:`dict`):
            Dictionary to encode any labels names into numbers. Keys map to
            labels names and Values map to number associated to those labels.

        max_sequence_len (:obj:`int`, `optional`):
            Value to indicate the maximum desired sequence to truncate or pad
            text sequences. If no value is passed it will use the maximum
            sequence size supported by the tokenizer and model.
    """

    def __init__(self, use_tokenizer, labels_encoder, max_sequence_len=None):
        # Tokenizer to be used inside the class.
        self.use_tokenizer = use_tokenizer
        # Check max sequence length.
        self.max_sequence_len = use_tokenizer.model_max_length if max_sequence_len is None else max_sequence_len
        # Label encoder used inside the class.
        self.labels_encoder = labels_encoder

    def __call__(self, sequences):
        r"""
        This function allows the class object to be used as a function call.
        Since the PyTorch DataLoader needs a collator function, I can use this
        class as a function.

        Arguments:

            sequences (:obj:`list`):
                List of texts and labels.

        Returns:
            :obj:`Dict[str, object]`: Dictionary of inputs that feed into the
            model. It holds the statement `model(**Returned Dictionary)`.
        """
        # Get all texts from sequences list.
        texts = [sequence['text'] for sequence in sequences]
        # Get all labels from sequences list.
        labels = [sequence['label'] for sequence in sequences]
        # Encode all labels using label encoder.
        labels = [self.labels_encoder[label] for label in labels]
        # Call tokenizer on all texts to convert into tensors of numbers with
        # appropriate padding.
        inputs = self.use_tokenizer(text=texts, return_tensors="pt", padding=True, truncation=True, max_length=self.max_sequence_len)
        # Update the inputs with the associated encoded labels as tensor.
        inputs.update({'labels': torch.tensor(labels)})
        return inputs
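
To see what the collator produces without loading a real tokenizer, here is a toy sketch of the same steps: encode labels, truncate, pad on the left and build an attention mask. The function `toy_collate`, the whitespace "tokenizer" and the tiny vocabulary are assumptions made for illustration. The real GPT2 tokenizer works on subword units and returns tensors.

```python
labels_ids = {'neg': 0, 'pos': 1}


def toy_collate(examples, vocab, max_len, pad_id=0):
    """Hypothetical collator: whitespace 'tokenizer', truncation, left padding."""
    rows = []
    for example in examples:
        # Map each lower-cased word to an id; unknown words share one id.
        ids = [vocab.get(word, vocab['<unk>']) for word in example['text'].lower().split()]
        rows.append(ids[:max_len])  # truncate to at most `max_len` tokens
    # Pad every row to the longest row in this batch.
    width = max(len(ids) for ids in rows)
    input_ids, attention_mask = [], []
    for ids in rows:
        n_pad = width - len(ids)
        input_ids.append([pad_id] * n_pad + ids)             # padding goes on the left
        attention_mask.append([0] * n_pad + [1] * len(ids))  # 0 marks padding
    # Encode string labels with the label dictionary.
    labels = [labels_ids[example['label']] for example in examples]
    return {'input_ids': input_ids, 'attention_mask': attention_mask, 'labels': labels}


vocab = {'<pad>': 0, '<unk>': 1, 'great': 2, 'movie': 3, 'boring': 4}
batch = toy_collate(
    [{'text': 'Great movie', 'label': 'pos'},
     {'text': 'Boring boring boring movie', 'label': 'neg'}],
    vocab, max_len=3)

print(batch['input_ids'])       # [[0, 2, 3], [4, 4, 4]]
print(batch['attention_mask'])  # [[0, 1, 1], [1, 1, 1]]
print(batch['labels'])          # [1, 0]
```

The second review is cut to three tokens by truncation, and the first one gets a single padding id on the left, so both rows end in a real token.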
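
train(dataloader, optimizer_, scheduler_, device_)

I created this function to perform a full pass through the DataLoader object (the DataLoader object is created from a Dataset type object, here built with the MovieReviewsDataset class). This is basically one epoch of training over the entire dataset.
The dataloader is a PyTorch DataLoader that takes the object created from the MovieReviewsDataset class and groups the examples into batches. This way we can feed the model batches of data!
The optimizer_ and scheduler_ are very common in PyTorch. They are needed to update the parameters of the model and to update the learning rate during training. There is a lot more to them than that, but I won't go into details. This can be a huge rabbit hole, since A LOT happens behind these functions that we don't need to worry about. Thank you, PyTorch!
In the process we keep track of the actual labels and the predicted labels along with the loss.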
def train(dataloader, optimizer_, scheduler_, device_):
    r"""
    Train pytorch model on a single pass through the data loader.

    It will use the global variable `model` which is the transformer model
    loaded on `device_` that we want to train on.

    This function is built with reusability in mind: it can be used as is as long
    as the `dataloader` outputs a batch in dictionary format that can be passed
    straight into the model - `model(**batch)`.

    Arguments:

        dataloader (:obj:`torch.utils.data.dataloader.DataLoader`):
            Parsed data into batches of tensors.

        optimizer_ (:obj:`transformers.optimization.AdamW`):
            Optimizer used for training.

        scheduler_ (:obj:`torch.optim.lr_scheduler.LambdaLR`):
            PyTorch scheduler.

        device_ (:obj:`torch.device`):
            Device used to load tensors before feeding to model.

    Returns:

        :obj:`List[List[int], List[int], float]`: List of [True Labels, Predicted
        Labels, Train Average Loss].
    """

    # Use global variable for model.
    global model

    # Tracking variables.
    predictions_labels = []
    true_labels = []
    # Total loss for this epoch.
    total_loss = 0

    # Put the model into training mode.
    model.train()

    # For each batch of training data...
    for batch in tqdm(dataloader, total=len(dataloader)):

        # Add original labels - use later for evaluation.
        true_labels += batch['labels'].numpy().flatten().tolist()

        # move batch to device
        batch = {k:v.type(torch.long).to(device_) for k,v in batch.items()}

        # Always clear any previously calculated gradients before performing a
        # backward pass.
        model.zero_grad()

        # Perform a forward pass (evaluate the model on this training batch).
        # Because the batch contains `labels`, the output includes the loss
        # as well as the logits.
        # The equivalent call for a bert model is documented here:
        # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
        outputs = model(**batch)

        # The first two outputs are the loss and the logits. We will use logits
        # later to calculate training accuracy.
        loss, logits = outputs[:2]

        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. `loss` is a Tensor containing a
        # single value; the `.item()` function just returns the Python value
        # from the tensor.
        total_loss += loss.item()

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        # Use the objects passed as arguments rather than the globals.
        optimizer_.step()

        # Update the learning rate.
        scheduler_.step()

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()

        # Convert these logits to list of predicted labels values.
        predictions_labels += logits.argmax(axis=-1).flatten().tolist()

    # Calculate the average loss over the training data.
    avg_epoch_loss = total_loss / len(dataloader)

    # Return all true labels and prediction for future evaluations.
    return true_labels, predictions_labels, avg_epoch_loss
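
The gradient clipping step in train can be illustrated with plain numbers. The sketch below is a simplified stand-in, not the PyTorch implementation: the hypothetical helper `clip_by_global_norm` treats all gradients as one flat list, computes their L2 norm and rescales them when the norm exceeds the limit.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale factor is capped at 1.0, so small gradients are left untouched.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


# A large gradient: norm 5.0 is scaled down to roughly 1.0.
clipped, norm_before = clip_by_global_norm([3.0, 4.0])
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before)           # 5.0
print(round(norm_after, 4))  # 1.0

# A small gradient: norm 0.5 is already below the limit, nothing changes.
unchanged, small_norm = clip_by_global_norm([0.3, 0.4])
print(unchanged)             # [0.3, 0.4]
```

The direction of the gradient is preserved; only its length is limited, which is what keeps a single bad batch from producing a huge parameter update.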
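
validation(dataloader, device_)

I implemented this function in a very similar way to train, but without the parameter update, backward pass and gradient descent parts. We don't need any of those VERY computationally intensive steps because we only care about the model's predictions.
I use the DataLoader in the same way as in train to get batches to feed to the model.
In the process I keep track of the actual labels and the predicted labels along with the loss.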
def validation(dataloader, device_):
    r"""Validation function to evaluate model performance on a
    separate set of data.

    This function will return the true and predicted labels so we can use later
    to evaluate the model's performance.

    This function is built with reusability in mind: it can be used as is as long
    as the `dataloader` outputs a batch in dictionary format that can be passed
    straight into the model - `model(**batch)`.

    Arguments:

        dataloader (:obj:`torch.utils.data.dataloader.DataLoader`):
            Parsed data into batches of tensors.

        device_ (:obj:`torch.device`):
            Device used to load tensors before feeding to model.

    Returns:

        :obj:`List[List[int], List[int], float]`: List of [True Labels, Predicted
        Labels, Validation Average Loss]
    """

    # Use global variable for model.
    global model

    # Tracking variables
    predictions_labels = []
    true_labels = []
    # Total loss for this epoch.
    total_loss = 0

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Evaluate data for one epoch
    for batch in tqdm(dataloader, total=len(dataloader)):

        # add original labels
        true_labels += batch['labels'].numpy().flatten().tolist()

        # move batch to device
        batch = {k:v.type(torch.long).to(device_) for k,v in batch.items()}

        # Telling the model not to compute or store gradients, saving memory and
        # speeding up validation
        with torch.no_grad():

            # Forward pass, calculate logit predictions.
            # Because the batch contains `labels`, the output includes the
            # loss as well as the logits.
            # The equivalent call for a bert model is documented here:
            # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
            outputs = model(**batch)

            # The first two outputs are the loss and the logits. We will use
            # logits later to calculate validation accuracy.
            loss, logits = outputs[:2]

            # Move logits and labels to CPU
            logits = logits.detach().cpu().numpy()

            # Accumulate the validation loss over all of the batches so that we
            # can calculate the average loss at the end. `loss` is a Tensor
            # containing a single value; the `.item()` function just returns
            # the Python value from the tensor.
            total_loss += loss.item()

            # get predictions to list
            predict_content = logits.argmax(axis=-1).flatten().tolist()

            # update list
            predictions_labels += predict_content

    # Calculate the average loss over the validation data.
    avg_epoch_loss = total_loss / len(dataloader)

    # Return all true labels and predictions for future evaluations.
    return true_labels, predictions_labels, avg_epoch_loss
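
Both train and validation turn logits into predicted labels with an argmax over the last axis. As a small sketch with made-up logits (the numbers below are illustrative, not model output), the same step and the resulting accuracy look like this in plain Python:

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


# Made-up logits for four examples and two classes (0 = neg, 1 = pos).
logits = [[2.1, -0.3], [0.2, 1.7], [-1.0, -0.5], [0.9, 0.4]]
true_labels = [0, 1, 0, 0]

# Predicted label = position of the largest logit in each row.
predictions_labels = [argmax(row) for row in logits]

# Accuracy = fraction of predictions that match the true labels.
accuracy = sum(p == t for p, t in zip(predictions_labels, true_labels)) / len(true_labels)

print(predictions_labels)  # [0, 1, 1, 0]
print(accuracy)            # 0.75
```

No softmax is needed for the predicted label, since softmax does not change which logit is the largest.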
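
Load Model and Tokenizer

Load the three essential parts of the pretrained GPT2 transformer: configuration, tokenizer and model.
For this example I use gpt2 from the HuggingFace pretrained transformers. You can use any variation of GPT2 you want.
When creating the model_config I set the number of labels needed for the classification task. Since I only predict two sentiments, positive and negative, I need two labels for num_labels.
Creating the tokenizer is pretty standard when using the Transformers library. After creating the tokenizer, it is critical for this tutorial to set the padding side to the left with tokenizer.padding_side = "left" and to initialize the padding token to tokenizer.eos_token, which is GPT2's original end-of-sequence token. This is the most essential part of this tutorial: GPT2 uses the last token for prediction, so we need to pad on the left.
HuggingFace already did most of the work for us and added a classification layer to the GPT2 model. To create the model I use GPT2ForSequenceClassification. Since we have a custom padding token, we need to set it for the model through model.config.pad_token_id. Finally, we move the model to the device defined earlier.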
# Get model configuration.
print('Loading configuration...')
model_config = GPT2Config.from_pretrained(pretrained_model_name_or_path=model_name_or_path, num_labels=n_labels)

# Get model's tokenizer.
print('Loading tokenizer...')
tokenizer = GPT2Tokenizer.from_pretrained(pretrained_model_name_or_path=model_name_or_path)
# default to left padding
tokenizer.padding_side = "left"
# Define PAD Token = EOS Token = 50256
tokenizer.pad_token = tokenizer.eos_token

# Get the actual model.
print('Loading model...')
model = GPT2ForSequenceClassification.from_pretrained(pretrained_model_name_or_path=model_name_or_path, config=model_config)

# resize model embedding to match new tokenizer
model.resize_token_embeddings(len(tokenizer))

# fix model padding token id
model.config.pad_token_id = model.config.eos_token_id

# Load model to defined device.
model.to(device)
print('Model loaded to `%s`'%device)
Loading configuration...
Loading tokenizer...
Loading model...
Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Model loaded to `cuda`
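
Dataset and Collator

This is where I create the PyTorch Dataset and DataLoader, together with the Data Collator object, that will be used to feed data into the model.
I use the MovieReviewsDataset class to create the PyTorch Dataset that returns texts and labels.
Since the model needs numbers as input, the texts and labels have to be converted to numbers. This is the purpose of the collator! It takes the data returned by the PyTorch Dataset and turns it into the sequences the model expects.
I keep the tokenizer out of the PyTorch Dataset to make the code cleaner and better structured. You can of course use the tokenizer inside the PyTorch Dataset and output sequences that can go straight into the model without a Data Collator.
I strongly recommend using a validation set to determine how much training is needed in order to avoid overfitting. Once you have figured out which parameters yield the best results, the validation data can be merged into the training data for a final training run on the whole dataset.
The data collator is used to format the PyTorch Dataset outputs to match the inputs needed by GPT2.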
# Create data collator to encode text and labels into numbers.
gpt2_classificaiton_collator = Gpt2ClassificationCollator(use_tokenizer=tokenizer,
                                                          labels_encoder=labels_ids,
                                                          max_sequence_len=max_length)

print('Dealing with Train...')
# Create pytorch dataset.
train_dataset = MovieReviewsDataset(path='/content/aclImdb/train', use_tokenizer=tokenizer)
print('Created `train_dataset` with %d examples!'%len(train_dataset))

# Move pytorch dataset into dataloader.
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=gpt2_classificaiton_collator)
print('Created `train_dataloader` with %d batches!'%len(train_dataloader))

print()

print('Dealing with Validation...')
# Create pytorch dataset.
valid_dataset = MovieReviewsDataset(path='/content/aclImdb/test', use_tokenizer=tokenizer)
print('Created `valid_dataset` with %d examples!'%len(valid_dataset))

# Move pytorch dataset into dataloader.
valid_dataloader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False, collate_fn=gpt2_classificaiton_collator)
print('Created `eval_dataloader` with %d batches!'%len(valid_dataloader))
Dealing with Train...
pos files: 100%|████████████████████████████████|12500/12500 [01:17<00:00, 161.19it/s]
neg files: 100%|████████████████████████████████|12500/12500 [01:05<00:00, 190.72it/s]
Created `train_dataset` with 25000 examples!
Created `train_dataloader` with 782 batches!

Reading pos files...
pos files: 100%|████████████████████████████████|12500/12500 [00:54<00:00, 230.93it/s]
neg files: 100%|████████████████████████████████|12500/12500 [00:42<00:00, 291.07it/s]
Created `valid_dataset` with 25000 examples!
Created `eval_dataloader` with 782 batches!
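
The 782 batches reported above follow directly from the dataset size and the batch size. A quick check, assuming the DataLoader keeps the final incomplete batch (its default behaviour):

```python
import math

n_examples = 25000  # examples in each split, from the log above
batch_size = 32     # value set in the Imports section

# The last, smaller batch still counts as a batch.
n_batches = math.ceil(n_examples / batch_size)
last_batch_size = n_examples - (n_batches - 1) * batch_size

print(n_batches)        # 782
print(last_batch_size)  # 8
```

So each epoch consists of 781 full batches of 32 reviews plus one final batch of 8.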
Train

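I create the optimizer and scheduler that PyTorch uses during training, with the most common parameters used for transformer models.
I loop through the defined number of epochs and call the train and validation functions.
After each epoch I output information similar to what Keras prints: train_loss: - val_loss: - train_acc: - valid_acc.
After training, I plot the train and validation loss and accuracy curves to check how the training went.
Note: The training plots might look a little weird. The validation accuracy starts higher than the training accuracy and the validation loss starts lower than the training loss. Normally it is the other way around. I assume the data split just happens to be easier for the validation part, too hard for the training part, or both. Another likely contributor is that the training metrics are accumulated during the epoch, while the weights are still changing and dropout is active, whereas validation runs after the epoch has finished. Since this tutorial is about using GPT2 for classification, I will not worry too much about the results of the model.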
# Note: AdamW is a class from the huggingface library (as opposed to pytorch)
# I believe the 'W' stands for 'Weight Decay fix'
optimizer = AdamW(model.parameters(),
                  lr = 2e-5, # default is 5e-5, our notebook had 2e-5
                  eps = 1e-8 # default is 1e-8.
                  )

# Total number of training steps is number of batches * number of epochs.
# `train_dataloader` contains batched data so `len(train_dataloader)` gives
# us the number of batches.
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)

# Store the average loss after each epoch so we can plot them.
all_loss = {'train_loss':[], 'val_loss':[]}
all_acc = {'train_acc':[], 'val_acc':[]}

# Loop through each epoch.
print('Epoch')
for epoch in tqdm(range(epochs)):
    print()
    print('Training on batches...')
    # Perform one full pass over the training set.
    train_labels, train_predict, train_loss = train(train_dataloader, optimizer, scheduler, device)
    train_acc = accuracy_score(train_labels, train_predict)

    # Get prediction from model on validation data.
    print('Validation on batches...')
    valid_labels, valid_predict, val_loss = validation(valid_dataloader, device)
    val_acc = accuracy_score(valid_labels, valid_predict)

    # Print loss and accuracy values to see how training evolves.
    print("  train_loss: %.5f - val_loss: %.5f - train_acc: %.5f - valid_acc: %.5f"%(train_loss, val_loss, train_acc, val_acc))
    print()

    # Store the loss value for plotting the learning curve.
    all_loss['train_loss'].append(train_loss)
    all_loss['val_loss'].append(val_loss)
    all_acc['train_acc'].append(train_acc)
    all_acc['val_acc'].append(val_acc)

# Plot loss curves.
plot_dict(all_loss, use_xlabel='Epochs', use_ylabel='Value', use_linestyles=['-', '--'])

# Plot accuracy curves.
plot_dict(all_acc, use_xlabel='Epochs', use_ylabel='Value', use_linestyles=['-', '--'])
Epoch
100%|████████████████████████████████|4/4 [15:11<00:00, 227.96s/it]

Training on batches...
100%|████████████████████████████████|782/782 [02:42<00:00, 4.82it/s]
Validation on batches...
100%|████████████████████████████████|782/782 [02:07<00:00, 6.13it/s]
  train_loss: 0.54128 - val_loss: 0.38758 - train_acc: 0.75288 - valid_acc: 0.81904

Training on batches...
100%|████████████████████████████████|782/782 [02:36<00:00, 5.00it/s]
Validation on batches...
100%|████████████████████████████████|782/782 [01:41<00:00, 7.68it/s]
  train_loss: 0.36716 - val_loss: 0.37620 - train_acc: 0.83288 - valid_acc: 0.82912

Training on batches...
100%|████████████████████████████████|782/782 [02:36<00:00, 5.00it/s]
Validation on batches...
100%|████████████████████████████████|782/782 [01:24<00:00, 9.24it/s]
  train_loss: 0.31409 - val_loss: 0.39384 - train_acc: 0.86304 - valid_acc: 0.83044

Training on batches...
100%|████████████████████████████████|782/782 [02:36<00:00, 4.99it/s]
Validation on batches...
100%|████████████████████████████████|782/782 [01:09<00:00, 11.29it/s]
  train_loss: 0.27358 - val_loss: 0.39798 - train_acc: 0.88432 - valid_acc: 0.83292
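
The scheduler used in this run decays the learning rate linearly from its starting value to zero over all training steps. The function below is a hand-written sketch of that rule, not the library code. The name `linear_schedule_lr` is hypothetical, and the step counts come from this notebook (782 batches times 4 epochs).

```python
def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at `step` for a linear warmup followed by a linear decay."""
    if step < num_warmup_steps:
        # Warmup: grow linearly from 0 up to base_lr.
        return base_lr * step / max(1, num_warmup_steps)
    # Decay: shrink linearly from base_lr down to 0 at the last step.
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


total_steps = 782 * 4  # batches per epoch * epochs

lr_start = linear_schedule_lr(0, 2e-5, 0, total_steps)
lr_half = linear_schedule_lr(total_steps // 2, 2e-5, 0, total_steps)
lr_end = linear_schedule_lr(total_steps, 2e-5, 0, total_steps)

print(total_steps)  # 3128
print(lr_start)     # 2e-05
print(lr_half)      # 1e-05
print(lr_end)       # 0.0
```

With zero warmup steps, as in this notebook, training starts at the full learning rate of 2e-5 and the model takes smaller and smaller steps as it approaches the end of the fourth epoch.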

Training and validation loss.


Training and validation accuracy.
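
The per-epoch numbers from the training log can be used to decide how many epochs are worth training for. This is a small sketch using the validation values printed in the log for this run. Picking the epoch by lowest validation loss or by highest validation accuracy are both common choices, and here they happen to disagree.

```python
# Validation metrics copied from the four epochs in the training log.
val_loss = [0.38758, 0.37620, 0.39384, 0.39798]
val_acc = [0.81904, 0.82912, 0.83044, 0.83292]

# Epochs are numbered from 1, list positions from 0.
best_loss_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
best_acc_epoch = max(range(len(val_acc)), key=val_acc.__getitem__) + 1

print(best_loss_epoch)  # 2
print(best_acc_epoch)   # 4
```

The validation loss is lowest after epoch 2 and rises afterwards while the training loss keeps falling, which is the usual sign of overfitting starting. The validation accuracy still creeps up slightly until epoch 4.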


Evaluate

When dealing with classification, it is useful to look at precision, recall and F1 score.
Another good gauge when evaluating a model is the confusion matrix.
# Get prediction from model on validation data. This is where you should use
# your test data.
true_labels, predictions_labels, avg_epoch_loss = validation(valid_dataloader, device)

# Create the evaluation report.
evaluation_report = classification_report(true_labels, predictions_labels, labels=list(labels_ids.values()), target_names=list(labels_ids.keys()))
# Show the evaluation report.
print(evaluation_report)

# Plot confusion matrix.
plot_confusion_matrix(y_true=true_labels, y_pred=predictions_labels,
                      classes=list(labels_ids.keys()), normalize=True,
                      magnify=0.1,
                      );
Training on batches...
100%|████████████████████████████████|782/782 [01:09<00:00, 11.24it/s]
              precision    recall  f1-score   support

         neg       0.84      0.83      0.83     12500
         pos       0.83      0.84      0.83     12500

    accuracy                           0.83     25000
   macro avg       0.83      0.83      0.83     25000
weighted avg       0.83      0.83      0.83     25000
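
To show what the columns of the report mean, here is a small sketch that computes precision, recall and F1 for one class by hand. The helper `binary_report` and the eight labels are made up for illustration. The real numbers above come from classification_report.

```python
def binary_report(true_labels, predictions, positive=1):
    """Precision, recall and F1 for the class `positive`."""
    pairs = list(zip(true_labels, predictions))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels: 1 = pos, 0 = neg.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 1, 0, 1, 0]

precision, recall, f1 = binary_report(toy_true, toy_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Precision asks how many of the reviews predicted as positive really are positive. Recall asks how many of the truly positive reviews were found. F1 is their harmonic mean.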

Confusion matrix normalized.

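Final Note

If you made it this far, congrats! 🎊 And thank you! 🙏 for your interest in my tutorial!
I have been using this code for a while now and I feel it has reached a point where it is nicely documented and easy to follow.
Of course it is easy for me to follow because I built it. That is why any feedback is welcome; it helps me improve my future tutorials!
If you see something wrong, please let me know by opening an issue on my ml_things GitHub repository!
A lot of tutorials out there are mostly a one-time thing and are not maintained. I plan on keeping my tutorials up to date as much as I can.
This article was originally published on George Mihaila's personal website and re-published to TOPBOTS with permission from the author.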
Source: https://www.topbots.com/gpt2-text-classification-using-hugging-face-transformers/