Natural Language Processing with TensorFlow

Published by Abhay Rastogi

Natural language processing (NLP) has been a hot topic in the machine learning world recently, and there are some great open-source libraries that make it easier to add NLP to your projects. In this article, I will introduce you to some of the basic concepts of NLP using the libraries available in TensorFlow so you can get started right away!

Let’s get started!

You should be familiar with Text Tokenization and Word Embedding before developing an NLP deep learning model.

1. Text Tokenization

Text tokenization is the process of breaking a text into sentences, words, or other meaningful units. In deep learning it is usually done by splitting the text on spaces and punctuation marks. TensorFlow's TextVectorization layer standardizes, tokenizes, and vectorizes our data. Text tokenization is one of the first steps in natural language processing (NLP) and machine learning.

from tensorflow.keras.layers import TextVectorization

words = ['Well done',
         'Good work',
         'Great effort',
         'nice work',
         'Excellent',
         'Weak',
         'Poor effort',
         'not good',
         'poor work',
         'Could have done better']

vectorize_layer = TextVectorization(output_mode='int')
vectorize_layer.adapt(words)

print(vectorize_layer(words))
tf.Tensor(
[[10  3  0  0 ]
 [ 6  2  0  0 ]
 [14  7  0  0 ]
 [13  2  0  0 ]
 [15  0  0  0 ]
 [11  0  0  0 ]
 [ 4  7  0  0 ]
 [12  6  0  0 ]
 [ 4  2  0  0 ]
 [ 8  5  3  9 ]], shape=(10, 4), dtype=int64)
# Get the unique words in the vocabulary
words_in_vocab = vectorize_layer.get_vocabulary()
top_5_words = words_in_vocab[:5] # most common tokens (notice the [UNK] token for "unknown" words)
bottom_5_words = words_in_vocab[-5:] # least common tokens
print(f"Number of words in vocab: {len(words_in_vocab)}")
print(f"Top 5 most common words: {top_5_words}") 
print(f"Bottom 5 least common words: {bottom_5_words}")
Number of words in vocab: 16
Top 5 most common words: ['', '[UNK]', 'work', 'done', 'poor']
Bottom 5 least common words: ['not', 'nice', 'great', 'excellent']
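
The same layer can also cap the vocabulary size and pad or truncate every sequence to a fixed length, which is how it will be configured for the classification example later in this article. Here is a minimal sketch (assuming the words list and the import from the example above; the numbers are purely illustrative):

fixed_len_vec = TextVectorization(max_tokens=20,            # keep only the 20 most common tokens
                                  output_mode='int',
                                  output_sequence_length=6)  # pad/truncate every sequence to 6 ids
fixed_len_vec.adapt(words)
# An unseen word such as "much" maps to the [UNK] token (index 1)
print(fixed_len_vec(['Could have done much better']))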

2. Word Embedding

Word embedding is the process of mapping a set of words into a real-valued vector space, assigning each word a point in that space in such a way that semantically similar words are located near each other. In TensorFlow we can perform embedding by using the Keras Embedding layer.

import tensorflow as tf

# Embed a 1,000 word vocabulary into 5 dimensions.
embedding_layer = tf.keras.layers.Embedding(1000, 5)
result = embedding_layer(vectorize_layer(words))
result.numpy()
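
As a quick sanity check (a sketch building on the example above): the ten sentences were tokenized into four integer ids each, and the Embedding layer maps every id to a 5-dimensional vector, so the result should be a (10, 4, 5) tensor.

print(result.shape)   # expected: (10, 4, 5)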

Text Classification

In this section, we will analyze a dataset of headlines to determine whether a specific headline is sarcastic or not.

Download Dataset

# Download Dataset
import json
!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json \
    -O /tmp/sarcasm.json

with open("/tmp/sarcasm.json", 'r') as f:
    datastore = json.load(f)

sentences = []
labels = []

for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
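
Before splitting, it is worth confirming how many headlines were loaded; a quick sketch (the expected count of roughly 26,709 simply matches the 21,367/5,342 split shown below):

print(len(sentences), len(labels))   # expected: 26709 26709
print(sentences[0], labels[0])       # one sample headline and its label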

Split data into training and validation sets

from sklearn.model_selection import train_test_split

train_sentences, val_sentences, train_labels, val_labels = train_test_split(sentences,
                                                                            labels,
                                                                            test_size=0.2
                                                                            ) 

Visualize the Split Dataset

# Check the lengths
len(train_sentences), len(train_labels), len(val_sentences), len(val_labels)
(21367, 21367, 5342, 5342)
# View the first 10 training sentences and their labels
train_sentences[:10], train_labels[:10]

Tokenize the Dataset

training_vec = TextVectorization(max_tokens=10000,
                                 output_mode="int",
                                 output_sequence_length=15)
# Note: the layer is adapted on all sentences here; adapting only on
# train_sentences would keep the validation vocabulary completely unseen.
training_vec.adapt(sentences)
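
To see exactly what the classifier will receive, here is a quick sketch (assuming the objects defined above) that passes a single training headline through the fitted vectorizer:

sample = train_sentences[0]
print(sample)
print(training_vec([sample]))   # shape (1, 15): integer ids, zero-padded or truncated to length 15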

Build Model

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype=tf.string),    # raw string input
    training_vec,                                    # text -> integer ids
    tf.keras.layers.Embedding(25000, 16),            # ids -> 16-dimensional vectors
    tf.keras.layers.GlobalAveragePooling1D(),        # average the word vectors per headline
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')   # probability of sarcasm
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, 100, 16)           400000    
                                                                 
 global_average_pooling1d (G  (None, 16)               0         
 lobalAveragePooling1D)                                          
                                                                 
 dense (Dense)               (None, 24)                408       
                                                                 
 dense_1 (Dense)             (None, 1)                 25        
                                                                 
=================================================================
Total params: 400,433
Trainable params: 400,433
Non-trainable params: 0
_________________________________________________________________
num_epochs = 10
history = model.fit(train_sentences, 
                    train_labels, 
                    epochs=num_epochs, 
                    validation_data=(val_sentences, val_labels), 
                    )

Epoch 1/10 668/668 [==============================] - 3s 5ms/step - loss: 0.0024 - accuracy: 0.9995 - val_loss: 1.7502 - val_accuracy: 0.8038

Epoch 2/10 668/668 [==============================] - 3s 5ms/step - loss: 0.0019 - accuracy: 0.9995 - val_loss: 1.8446 - val_accuracy: 0.7999

Epoch 3/10 668/668 [==============================] - 3s 5ms/step - loss: 0.0015 - accuracy: 0.9996 - val_loss: 1.9524 - val_accuracy: 0.8016

Epoch 4/10 668/668 [==============================] - 3s 5ms/step - loss: 0.0015 - accuracy: 0.9996 - val_loss: 2.0310 - val_accuracy: 0.7995

Epoch 5/10 668/668 [==============================] - 3s 4ms/step - loss: 0.0018 - accuracy: 0.9994 - val_loss: 2.1121 - val_accuracy: 0.8016

Epoch 6/10 668/668 [==============================] - 3s 4ms/step - loss: 0.0046 - accuracy: 0.9982 - val_loss: 2.1681 - val_accuracy: 0.7988

Epoch 7/10 668/668 [==============================] - 3s 5ms/step - loss: 0.0026 - accuracy: 0.9993 - val_loss: 2.2269 - val_accuracy: 0.7969

Epoch 8/10 668/668 [==============================] - 3s 4ms/step - loss: 0.0011 - accuracy: 0.9996 - val_loss: 2.2779 - val_accuracy: 0.7975

Epoch 9/10 668/668 [==============================] - 3s 5ms/step - loss: 0.0010 - accuracy: 0.9997 - val_loss: 2.3523 - val_accuracy: 0.7963

Epoch 10/10 668/668 [==============================] - 3s 5ms/step - loss: 5.5185e-04 - accuracy: 0.9998 - val_loss: 2.4191 - val_accuracy: 0.7937
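
The log shows near-perfect training accuracy but a validation accuracy of around 0.80, so the model is clearly overfitting the training set. As a quick sketch, you can also evaluate it directly on the validation split:

loss, accuracy = model.evaluate(val_sentences, val_labels)
print(f"Validation loss: {loss:.4f}, validation accuracy: {accuracy:.4f}")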

Make Predictions

sentenceTst = train_sentences[:10]

prediction = model.predict(sentenceTst)
print(tf.squeeze(tf.round(prediction)))                               # rounded predictions
print(tf.squeeze(tf.round(prediction)).numpy() == train_labels[:10])  # compare with the true labels
print(train_labels[:10])
tf.Tensor([1. 1. 0. 0. 1. 0. 1. 1. 1. 0.], shape=(10,), dtype=float32)
[ True  True  True  True  True  True  True  True  True  True]
[1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
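
The same model can score headlines it has never seen. The headlines below are made up purely for illustration (they are not from the dataset); values close to 1 suggest sarcasm and values close to 0 suggest a genuine headline:

new_headlines = ["scientists discover that coffee is best enjoyed while awake",
                 "local council approves new library budget"]
probs = model.predict(new_headlines)
print(probs)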
