Natural Language Processing with TensorFlow
Natural language processing (NLP) has been a hot topic in the machine learning world recently, and there are some great open-source libraries that make it easier to bring NLP into your projects. In this article, I will introduce some of the basic concepts of NLP using the libraries available in TensorFlow so you can get started right away!
Let’s get started!
Before building an NLP deep learning model, you should be familiar with two concepts: text tokenization and word embedding.
1. Text Tokenization
Text tokenization is the process of breaking text into sentences, words, or other meaningful units. In deep learning it is usually done by splitting the text on spaces and punctuation marks. TensorFlow's TextVectorization layer standardizes, tokenizes, and vectorizes our data. Tokenization is one of the first steps in any natural language processing (NLP) pipeline.
from tensorflow.keras.layers import TextVectorization
words = ['Well done',
         'Good work',
         'Great effort',
         'nice work',
         'Excellent',
         'Weak',
         'Poor effort',
         'not good',
         'poor work',
         'Could have done better']
vectorize_layer = TextVectorization(output_mode='int')
vectorize_layer.adapt(words)
print(vectorize_layer(words))
tf.Tensor(
[[10 3 0 0 ]
[ 6 2 0 0 ]
[14 7 0 0 ]
[13 2 0 0 ]
[15 0 0 0 ]
[11 0 0 0 ]
[ 4 7 0 0 ]
[12 6 0 0 ]
[ 4 2 0 0 ]
[ 8 5 3 9 ]], shape=(10, 4), dtype=int64)
# Get the unique words in the vocabulary
words_in_vocab = vectorize_layer.get_vocabulary()
top_5_words = words_in_vocab[:5] # most common tokens (notice the [UNK] token for "unknown" words)
bottom_5_words = words_in_vocab[-5:] # least common tokens
print(f"Number of words in vocab: {len(words_in_vocab)}")
print(f"Top 5 most common words: {top_5_words}")
print(f"Bottom 5 least common words: {bottom_5_words}")
Number of words in vocab: 16
Top 5 most common words: ['', '[UNK]', 'work', 'done', 'poor']
Bottom 5 least common words: ['weak', 'not', 'nice', 'great', 'excellent']
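Notice the [UNK] token at index 1: because the vocabulary was learned from only these ten phrases, any word the layer has not seen before is mapped to that index. A quick check (the phrase below is my own example, not part of the original list):
# 'fantastic' was never seen during adapt(), so it maps to the [UNK] index (1)
print(vectorize_layer(['fantastic work']))
# expected: [[1 2]] -- 'fantastic' -> [UNK], 'work' -> 2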
2. Word Embedding
Word embedding is the process of mapping a set of words into a continuous vector space, assigning each word a point in that space so that semantically similar words end up close to each other. In TensorFlow we can perform embedding with the Keras Embedding layer.
import tensorflow as tf

# Embed a 1,000-word vocabulary into 5 dimensions.
embedding_layer = tf.keras.layers.Embedding(1000, 5)
result = embedding_layer(vectorize_layer(words))
result.numpy()
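As a quick sanity check, the embedded output should contain one 5-dimensional vector per token, so its shape should match the tokenized input with an extra embedding axis:
print(result.shape)  # expected: (10, 4, 5) -- 10 phrases, 4 (padded) tokens each, 5 embedding dimensions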
Text Classification
In this section, we will train a classifier on a dataset of news headlines to determine whether a given headline is sarcastic or not.
Download Dataset
# Download Dataset
import json
!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json \
    -O /tmp/sarcasm.json

with open("/tmp/sarcasm.json", 'r') as f:
    datastore = json.load(f)
sentences = []
labels = []
for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
Split data into training and validation sets
from sklearn.model_selection import train_test_split
train_sentences, val_sentences, train_labels, val_labels = train_test_split(
    sentences, labels, test_size=0.2)
Inspect the Split Dataset
# Check the lengths
len(train_sentences), len(train_labels), len(val_sentences), len(val_labels)
(21367, 21367, 5342, 5342)
# View the first 10 training sentences and their labels
train_sentences[:10], train_labels[:10]
Tokenize Dataset
training_vec = TextVectorization(max_tokens=10000,
                                 output_mode="int",
                                 output_sequence_length=15)
training_vec.adapt(sentences)
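To see what the model will actually receive, we can vectorize a single headline; every sequence is padded or truncated to exactly 15 integer ids, and the vocabulary is capped at max_tokens. (The sanity check below is my own addition.)
sample = train_sentences[:1]
print(sample)                                # the raw headline text
print(training_vec(sample))                  # shape (1, 15): integer token ids, zero-padded to length 15
print(len(training_vec.get_vocabulary()))    # should print 10000, the max_tokens cap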
Build Model
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype=tf.string),
    training_vec,
    tf.keras.layers.Embedding(25000, 16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding (Embedding) (None, 100, 16) 400000 global_average_pooling1d (G (None, 16) 0 lobalAveragePooling1D) dense (Dense) (None, 24) 408 dense_1 (Dense) (None, 1) 25 ================================================================= Total params: 400,433 Trainable params: 400,433 Non-trainable params: 0 _________________________________________________________________
num_epochs = 10
history = model.fit(train_sentences,
                    train_labels,
                    epochs=num_epochs,
                    validation_data=(val_sentences, val_labels))
Epoch 1/10 668/668 [==============================] - 3s 5ms/step - loss: 0.0024 - accuracy: 0.9995 - val_loss: 1.7502 - val_accuracy: 0.8038
Epoch 2/10 668/668 [==============================] - 3s 5ms/step - loss: 0.0019 - accuracy: 0.9995 - val_loss: 1.8446 - val_accuracy: 0.7999
Epoch 3/10 668/668 [==============================] - 3s 5ms/step - loss: 0.0015 - accuracy: 0.9996 - val_loss: 1.9524 - val_accuracy: 0.8016
Epoch 4/10 668/668 [==============================] - 3s 5ms/step - loss: 0.0015 - accuracy: 0.9996 - val_loss: 2.0310 - val_accuracy: 0.7995
Epoch 5/10 668/668 [==============================] - 3s 4ms/step - loss: 0.0018 - accuracy: 0.9994 - val_loss: 2.1121 - val_accuracy: 0.8016
Epoch 6/10 668/668 [==============================] - 3s 4ms/step - loss: 0.0046 - accuracy: 0.9982 - val_loss: 2.1681 - val_accuracy: 0.7988
Epoch 7/10 668/668 [==============================] - 3s 5ms/step - loss: 0.0026 - accuracy: 0.9993 - val_loss: 2.2269 - val_accuracy: 0.7969
Epoch 8/10 668/668 [==============================] - 3s 4ms/step - loss: 0.0011 - accuracy: 0.9996 - val_loss: 2.2779 - val_accuracy: 0.7975
Epoch 9/10 668/668 [==============================] - 3s 5ms/step - loss: 0.0010 - accuracy: 0.9997 - val_loss: 2.3523 - val_accuracy: 0.7963
Epoch 10/10 668/668 [==============================] - 3s 5ms/step - loss: 5.5185e-04 - accuracy: 0.9998 - val_loss: 2.4191 - val_accuracy: 0.7937
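The training loss keeps falling while the validation loss climbs, which is a sign of overfitting (the near-perfect accuracy already in epoch 1 also suggests this run continued from earlier training). If you have matplotlib installed, a quick way to visualize the curves from the history object:
import matplotlib.pyplot as plt
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='val loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()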
Make Predictions
sentenceTst = train_sentences[:10]  # the first ten training headlines
prediction = model.predict(sentenceTst)
print(tf.squeeze(tf.round(prediction)))
print(tf.squeeze(tf.round(prediction)).numpy() == train_labels[:10])
print(train_labels[:10])
tf.Tensor([1. 1. 0. 0. 1. 0. 1. 1. 1. 0.], shape=(10,), dtype=float32)
[ True True True True True True True True True True]
[1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
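Since these ten sentences come straight from the training set, perfect agreement with the labels is expected. To try the model on headlines it has never seen, pass your own strings (the two below are made up for illustration, not from the dataset):
new_headlines = ["scientists discover water is wet, nation stunned",
                 "local council approves new library budget"]
probs = model.predict(new_headlines)
print(tf.squeeze(tf.round(probs)))   # 1 = predicted sarcastic, 0 = not sarcastic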