Sentiment Analysis in Python: Going Beyond Bag of Words

Republished by Plato

Image created with DALL-E

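Did you know that sentiment analysis can, to some extent, predict election results? Data science is both fun and very useful when it is applied to real life rather than to synthetic datasets.

In this article, we will run a short case study using Twitter data. By the end, you will have seen a case study with significant real-world impact, which should pique your interest. But first, let's start with the basics.

Sentiment analysis is a method for predicting emotions from text, a bit like a digital psychologist. With the psychologist you build, the fate of the text you analyze is in your hands. You can go about it like the famous psychologist Freud, or simply be there like a psychologist who charges $10 per session.

Just as your psychologist listens to and understands your emotions, sentiment analysis does the same for text such as reviews, comments, or tweets, as we will do in the next section. To do so, let's start a case study on a prepared dataset.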

For the sentiment analysis, we will use a dataset from Kaggle that was collected with the Twitter API. Here is the link to the dataset: https://www.kaggle.com/datasets/kazanova/sentiment140

Now, let's start exploring the dataset.

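Exploring the Dataset

Before doing any sentiment analysis, let's explore the dataset. To read the file, pass the encoding explicitly. The file has no header row, so we add the column names afterwards. You can add more exploration methods if you like; the head, info, and describe methods already give a good first overview. Let's see the code.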

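import pandas as pd

# The file has no header row, so read it with header=None and name the columns afterwards
data = pd.read_csv('training.csv', encoding='ISO-8859-1', header=None)
column_names = ['target', 'ids', 'date', 'flag', 'user', 'text']
data.columns = column_names

head = data.head()
info = data.info()  # prints the summary; the method itself returns None
describe = data.describe()
head, info, describe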

Here is the output.


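Of course, if your project has no limit on images, you can run these methods one by one. Let's look at the insights we gathered from the exploration methods above.

Insights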

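The dataset contains 1.6 million tweets, with no missing values in any column.
Each tweet has a target sentiment (0 for negative, 2 for neutral, 4 for positive), an ID, a timestamp, a flag (the query, or "NO_QUERY"), a username, and the text.
The sentiment targets are balanced, with the same number of positive and negative labels.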
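
The claims above about label balance and missing values can be checked with two pandas calls. The sketch below is only an illustration: it uses a tiny made-up DataFrame (`toy`, with invented rows) that borrows the article's column names, so the numbers it prints say nothing about the real dataset. On the real data, the same two calls could be run on `data` instead.

```python
import pandas as pd

# Hypothetical stand-in for the real data: same column names, a few made-up rows.
toy = pd.DataFrame({
    "target": [0, 0, 4, 4],
    "ids": [1, 2, 3, 4],
    "date": ["d1", "d2", "d3", "d4"],
    "flag": ["NO_QUERY"] * 4,
    "user": ["a", "b", "c", "d"],
    "text": ["so tired today", "missed my bus", "love this song", "thanks everyone"],
})

# Count how many tweets carry each sentiment label.
label_counts = toy["target"].value_counts()

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = toy.isnull().sum()

print(label_counts)
print(missing_per_column)
```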

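Visualizing the Dataset

Great, we now have both statistical and structural knowledge of the dataset. Next, let's create some visualizations to picture it. We all know the two sharpest sentiments, positive and negative. To see which words are used for each, we will use a Python library called wordcloud.

This library visualizes your dataset according to how frequently words occur in it. Size and frequency go together: the more often a word is used, the larger it is drawn.

But first, we should select the positive and negative tweets and combine each group into a single string with Python's join method. Let's see the code.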

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Separate positive and negative tweets based on the 'target' column
positive_tweets = data[data['target'] == 4]['text']
negative_tweets = data[data['target'] == 0]['text']

# Sample some positive and negative tweets to create word clouds
sample_positive_text = " ".join(text for text in positive_tweets.sample(frac=0.1, random_state=23))
sample_negative_text = " ".join(text for text in negative_tweets.sample(frac=0.1, random_state=23))

# Generate word cloud images for both positive and negative sentiments
wordcloud_positive = WordCloud(width=800, height=400, max_words=200, background_color="white").generate(sample_positive_text)
wordcloud_negative = WordCloud(width=800, height=400, max_words=200, background_color="white").generate(sample_negative_text)

# Display the generated images using matplotlib
plt.figure(figsize=(15, 7.5))

# Positive word cloud
plt.subplot(1, 2, 1)
plt.imshow(wordcloud_positive, interpolation='bilinear')
plt.title('Positive Tweets Word Cloud')
plt.axis("off")

# Negative word cloud
plt.subplot(1, 2, 2)
plt.imshow(wordcloud_negative, interpolation='bilinear')
plt.title('Negative Tweets Word Cloud')
plt.axis("off")

plt.show()

Here is the output.


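In the word cloud on the left, "thanks" and "now" sound more positive. However, "work" and "now" are interesting, because these words seem to appear frequently in the negative tweets.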
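
An impression taken from a word cloud can be checked by counting words directly. The following is a minimal sketch, not the article's method: `top_words` is a hypothetical helper, the three tweets are made up, and the simple regex tokenizer is an assumption. WordCloud applies its own tokenization and stop-word filtering, so its ranking can differ from a raw count like this one.

```python
from collections import Counter
import re

def top_words(texts, n=3):
    """Return the n most frequent lowercase words across all texts."""
    counts = Counter()
    for text in texts:
        # Keep runs of letters and apostrophes; everything else is a separator.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(n)

# Made-up negative tweets, for illustration only.
toy_negative = ["work again today", "so much work", "no sleep, work now"]

top = top_words(toy_negative, n=1)
print(top)
```

On the real data, the same helper could be called on a sample of `negative_tweets` to see whether "work" really dominates.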

Sentiment Analysis

To perform sentiment analysis, we will follow these steps:

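Preprocessing the text data
Splitting the dataset
Vectorizing the dataset
Data conversion
Label encoding
Training a neural network
Training the model
Evaluating the model (with a plot)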
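
The first step, preprocessing, is listed here but not spelled out in the article, so the sketch below is one plausible version rather than the author's. The function name `clean_tweet` and the specific cleaning rules (lowercasing, dropping URLs and @mentions, keeping only letters) are assumptions chosen for illustration.

```python
import re

# Assumed cleaning rules; the article does not specify any.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")

def clean_tweet(text):
    """Lowercase a tweet and strip URLs, @mentions, and non-letter characters."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # remove links
    text = MENTION_RE.sub(" ", text)     # remove @user mentions
    text = NON_LETTER_RE.sub(" ", text)  # remove digits, punctuation, '#'
    return " ".join(text.split())        # collapse repeated whitespace

cleaned = clean_tweet("@bob I LOVE this!! http://t.co/abc #happy")
print(cleaned)
```

A function like this could be applied to the text column with pandas' `apply` before vectorizing. Note that this rule set also splits contractions ("don't" becomes "don t"), which may or may not be desirable.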

Now, processing 1.6 million tweets may be a heavy workload for your computer or platform; that is why I first sampled 50,000 positive and 50,000 negative tweets.

# Since we need to use a smaller dataset due to resource constraints, let's sample 100k tweets
# Balanced sampling: 50k positive and 50k negative
sample_size_per_class = 50000
positive_sample = data[data['target'] == 4].sample(n=sample_size_per_class, random_state=23)
negative_sample = data[data['target'] == 0].sample(n=sample_size_per_class, random_state=23)

# Combine the samples into one dataset
balanced_sample = pd.concat([positive_sample, negative_sample])

# Check the balance of the sampled data
balanced_sample['target'].value_counts()

Next, let's build the neural network.

import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF over unigrams and bigrams, limited to the 10,000 highest-ranked features
vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))

# Train and validation split
X_train, X_val, y_train, y_val = train_test_split(balanced_sample['text'], balanced_sample['target'], test_size=0.2, random_state=23)

# Vectorize the text data using TF-IDF (fit on the training split only)
X_train_vectorized = vectorizer.fit_transform(X_train)
X_val_vectorized = vectorizer.transform(X_val)

# Convert the sparse matrices to dense arrays
# (toarray() returns a plain NumPy array; todense() would return an np.matrix)
X_train_vectorized = X_train_vectorized.toarray()
X_val_vectorized = X_val_vectorized.toarray()

# Convert labels to one-hot encoding
encoder = LabelEncoder()
y_train_encoded = to_categorical(encoder.fit_transform(y_train))
y_val_encoded = to_categorical(encoder.transform(y_val))

# Define a simple neural network model
model = Sequential()
model.add(Dense(512, input_shape=(X_train_vectorized.shape[1],), activation='relu'))
model.add(Dense(2, activation='softmax'))  # 2 because we have two classes

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model over epochs
history = model.fit(X_train_vectorized, y_train_encoded, epochs=10, batch_size=128, validation_data=(X_val_vectorized, y_val_encoded), verbose=1)

# Plot the model accuracy over epochs
plt.figure(figsize=(10, 6))
plt.plot(history.history['accuracy'], label='Train Accuracy', marker='o')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy', marker='o')
plt.title('Model Accuracy over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()

Here is the output.
