Basics of NLP: Tokenization, Stemming, Lemmatization
Natural Language Processing (NLP) helps computers understand and process human language. Tokenization, Stemming, and Lemmatization are essential text preprocessing techniques in NLP.
1. Tokenization: Splitting Text into Words or Sentences
What is Tokenization?
Tokenization is the process of splitting text into smaller units (tokens) like words or sentences.
Example using Python (NLTK & spaCy)
```python
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
import spacy

nltk.download('punkt')

text = "Hello! How are you? I love NLP."

# Word Tokenization (NLTK)
print("Word Tokens:", word_tokenize(text))

# Sentence Tokenization (NLTK)
print("Sentence Tokens:", sent_tokenize(text))

# Using spaCy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print("Word Tokens (spaCy):", [token.text for token in doc])
```
Key Takeaways:
- Word Tokenization splits text into words.
- Sentence Tokenization splits text into sentences.
- NLTK & spaCy are popular libraries for tokenization.
2. Stemming: Reducing Words to Their Root Form
What is Stemming?
Stemming reduces a word to its root or base form but may not always produce a real word.
Example using NLTK PorterStemmer
```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "flies", "easily", "played"]
print([stemmer.stem(word) for word in words])
```
Output:
```
['run', 'fli', 'easili', 'play']
```
Key Takeaways:
- Fast and simple but may produce meaningless words (e.g., “flies” → “fli”, “easily” → “easili”).
- Common Stemmers: Porter Stemmer, Snowball Stemmer.
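Besides Porter, NLTK's SnowballStemmer ("Porter2") is a slightly more aggressive drop-in alternative that also supports several non-English languages. A minimal sketch:
```python
from nltk.stem import SnowballStemmer

# Snowball supports multiple languages
print(SnowballStemmer.languages)  # ('arabic', 'danish', 'dutch', 'english', ...)

snowball = SnowballStemmer("english")
words = ["running", "flies", "easily", "played"]
print([snowball.stem(word) for word in words])  # ['run', 'fli', 'easili', 'play'] – same as Porter here
```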
3. Lemmatization: Reducing Words to Dictionary Form
What is Lemmatization?
Lemmatization reduces a word to its base or dictionary form (lemma), ensuring it remains a valid word.
Example using WordNetLemmatizer (NLTK)
```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
words = ["running", "flies", "easily", "played"]
print([lemmatizer.lemmatize(word, pos='v') for word in words])  # 'v' means verb
```
Output:
```
['run', 'fly', 'easily', 'play']
```
Key Takeaways:
- More accurate than stemming since it considers word meaning.
- Requires POS tagging (Part-of-Speech) for best results.
- Used in search engines, chatbots, sentiment analysis, etc.
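Since WordNet needs to know the part of speech, a common pattern is to run NLTK's POS tagger first and map its Penn Treebank tags onto WordNet's constants. A minimal sketch (the `to_wordnet_pos` helper below is our own convention, not an NLTK API):
```python
import nltk
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

def to_wordnet_pos(treebank_tag):
    # Map Penn Treebank tags (NN, VB, JJ, RB, ...) to WordNet POS constants
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # sensible default

lemmatizer = WordNetLemmatizer()
tagged = pos_tag(word_tokenize("The striped bats are hanging on their feet"))
print([lemmatizer.lemmatize(token, to_wordnet_pos(tag)) for token, tag in tagged])
# e.g. ['The', 'striped', 'bat', 'be', 'hang', 'on', 'their', 'foot']
```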
4. Stemming vs. Lemmatization: Which One to Use?
| Feature | Stemming (Porter) | Lemmatization (WordNet) |
|---|---|---|
| Speed | Fast | Slower |
| Output | Chopped words | Meaningful words |
| Accuracy | Lower | Higher |
| Example | "flies" → "fli" | "flies" → "fly" |
| Use Case | Quick text processing | NLP models, search engines |
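If you prefer to see that contrast in code rather than a table, here is a minimal side-by-side run of the two NLTK tools shown above:
```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["flies", "studies", "running", "easily"]:
    print(f"{word:10} stem: {stemmer.stem(word):8} lemma: {lemmatizer.lemmatize(word, pos='v')}")
# e.g. flies → "fli" vs. "fly"; studies → "studi" vs. "study"
```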
Next Steps:
- Want to try Named Entity Recognition (NER)?
- Need help with Part-of-Speech (POS) Tagging?
- Interested in Text Classification & Sentiment Analysis?
Word Embeddings: Word2Vec, GloVe, Transformers
Word embeddings are vector representations of words that capture their semantic meaning. Traditional NLP models struggle with context and relationships between words, but Word2Vec, GloVe, and Transformers revolutionized NLP by learning rich word representations.
1. Word2Vec: Learning Word Representations from Context
Word2Vec (developed by Google) is a neural network-based method that represents words as dense vectors in a way that similar words have similar vectors.
How Does Word2Vec Work?
It learns word embeddings using one of two architectures:
- CBOW (Continuous Bag of Words) → Predicts a word from surrounding words.
- Skip-gram → Predicts surrounding words given a word.
Example: Train Word2Vec Model using Gensim
```python
import nltk
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # required by word_tokenize

# Sample text corpus
sentences = [
    "I love natural language processing",
    "Deep learning makes NLP powerful",
    "Word embeddings capture semantic meaning"
]

# Tokenize sentences
tokenized_sentences = [word_tokenize(sent.lower()) for sent in sentences]

# Train Word2Vec model (sg=1 → Skip-gram, sg=0 → CBOW)
model = Word2Vec(tokenized_sentences, vector_size=50, window=5, min_count=1, sg=1)

# Get vector for "nlp"
print(model.wv['nlp'])

# Find similar words
print(model.wv.most_similar('nlp'))
```
Key Takeaways:
- Captures relationships between words (e.g., “king” – “man” + “woman” ≈ “queen”).
- Skip-gram works better with small datasets; CBOW is faster on large datasets.
- Pre-trained Word2Vec models (Google News corpus) are available.
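To reproduce the classic king/queen analogy, you can pull Google's pre-trained News vectors through `gensim.downloader` (note: this is a large download, roughly 1.7 GB):
```python
import gensim.downloader as api

# Load pre-trained Word2Vec vectors trained on Google News
w2v = api.load("word2vec-google-news-300")

# king – man + woman ≈ queen
print(w2v.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' should be the top hit
```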
2. GloVe (Global Vectors for Word Representation)
GloVe (developed by Stanford) uses word co-occurrence statistics to create embeddings.
How Does GloVe Work?
- Instead of predicting words (like Word2Vec), GloVe analyzes word co-occurrence matrices to learn embeddings.
- Captures global relationships between words rather than just local context.
Example: Use Pre-trained GloVe Embeddings
```python
import gensim.downloader as api

# Load pre-trained GloVe embeddings (50D vectors)
glove_model = api.load("glove-wiki-gigaword-50")

# Get vector for "apple"
print(glove_model['apple'])

# Find similar words
print(glove_model.most_similar('apple'))
```
Key Takeaways:
- GloVe is better at capturing global context than Word2Vec.
- Pre-trained embeddings (GloVe.6B, GloVe.840B) are widely used.
- Works well for NLP tasks like sentiment analysis, text classification.
3. Transformers: Contextual Word Embeddings (BERT, GPT)
Why Transformers?
- Word2Vec and GloVe assign static word embeddings (e.g., “bank” has the same vector whether it’s a riverbank or a financial bank).
- Transformers generate contextualized word embeddings, meaning the representation changes based on sentence context.
Popular Transformer Models:
- BERT (Bidirectional Encoder Representations from Transformers) → Trained to understand words in context by looking at both left and right words.
- GPT (Generative Pre-trained Transformer) → Good at text generation and conversational AI.
- T5, XLNet, RoBERTa → Variants of Transformers with specialized capabilities.
Example: Use Pre-trained BERT Embeddings (Hugging Face)
```python
from transformers import BertTokenizer, BertModel
import torch

# Load BERT model & tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Example sentence
sentence = "I love NLP and deep learning."

# Tokenize input text
tokens = tokenizer(sentence, return_tensors='pt')

# Generate word embeddings
with torch.no_grad():
    output = model(**tokens)

# Extract embedding for the [CLS] token (sentence-level representation)
sentence_embedding = output.last_hidden_state[:, 0, :]
print(sentence_embedding.shape)  # (1, 768) → 768D embedding
```
Key Takeaways:
- BERT, GPT, and Transformers provide contextual embeddings that change depending on sentence meaning.
- Useful for advanced NLP tasks like question answering, text summarization, and chatbots.
- Pre-trained models (Hugging Face Transformers) are state-of-the-art for NLP.
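To make the polysemy point concrete, the sketch below extracts BERT's vector for the word "bank" in two different sentences and compares them; with a static embedding the similarity would be exactly 1.0, while BERT's context-dependent vectors differ:
```python
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

sentences = ["He sat on the river bank.", "She deposited cash at the bank."]
bank_vectors = []
for sent in sentences:
    tokens = tokenizer(sent, return_tensors="pt")
    with torch.no_grad():
        output = model(**tokens)
    # Locate the "bank" token and grab its contextual vector
    idx = tokens["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids("bank"))
    bank_vectors.append(output.last_hidden_state[0, idx])

cos = torch.nn.functional.cosine_similarity(bank_vectors[0], bank_vectors[1], dim=0)
print(f"Cosine similarity of the two 'bank' vectors: {cos.item():.3f}")  # < 1.0 → context-dependent
```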
Word2Vec vs. GloVe vs. Transformers: Comparison
| Feature | Word2Vec | GloVe | Transformers (BERT) |
|---|---|---|---|
| Embedding Type | Static | Static | Contextual |
| Training | Predicts words from context | Co-occurrence matrix | Deep learning-based self-attention |
| Handles Polysemy (multiple meanings)? | No | No | Yes |
| Computation | Fast | Fast | Expensive (large models) |
| Pre-trained Models Available? | Yes | Yes | Yes |
| Best for | Word similarity | Word relationships | Context-aware NLP tasks |
Next Steps:
- Want to train your own Word2Vec or GloVe model?
- Interested in fine-tuning BERT or GPT for text classification?
- Need help deploying NLP models into applications?
Named Entity Recognition (NER)
Named Entity Recognition (NER) is an essential Natural Language Processing (NLP) task that identifies and classifies named entities (e.g., persons, locations, organizations, dates) in text.
1. What is NER?
NER extracts important entities from text and categorizes them into predefined groups such as:
- Person Names (e.g., “Elon Musk”)
- Organizations (e.g., “Google”, “NASA”)
- Locations (e.g., “New York”, “Amazon River”)
- Dates & Time (e.g., “March 10, 2025”)
- Monetary Values (e.g., “$1,000”, “€500”)
Example:
Input: Apple Inc. was founded by Steve Jobs in Cupertino.
NER Output:
- Apple Inc. → ORG (Organization)
- Steve Jobs → PERSON (Person)
- Cupertino → GPE (Geopolitical Entity – Location)
2. Implementing NER using spaCy
spaCy is one of the most popular NLP libraries for fast and efficient NER.
Install spaCy and download the English model:
```
pip install spacy
python -m spacy download en_core_web_sm
```
NER in spaCy (Python)
```python
import spacy

# Load pre-trained spaCy model
nlp = spacy.load("en_core_web_sm")

# Example text
text = "Elon Musk is the CEO of Tesla, which is based in California."

# Process the text with spaCy
doc = nlp(text)

# Extract named entities
for ent in doc.ents:
    print(f"{ent.text} → {ent.label_}")

# Visualize named entities (Jupyter Notebook)
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)
```
Output:
```
Elon Musk → PERSON
Tesla → ORG
California → GPE
```
3. Implementing NER using NLTK
NLTK provides a rule-based approach to NER, but it’s not as accurate as deep learning-based methods.
```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk

nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')  # required by pos_tag

text = "Microsoft Corporation was founded by Bill Gates and Paul Allen in 1975."

# Tokenize and POS tag words
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)

# Apply Named Entity Recognition
entities = ne_chunk(pos_tags)
print(entities)
```
Key Takeaways:
- NLTK’s NER is rule-based, so it’s less effective than spaCy or Transformer models.
- Works well for basic NER tasks.
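Note that `print(entities)` shows an `nltk.Tree`, which is awkward to consume programmatically. A small helper (our own, not an NLTK function) can flatten it into `(text, label)` pairs:
```python
from nltk.tree import Tree

def extract_entities(ne_tree):
    """Flatten an ne_chunk tree into (entity_text, label) pairs."""
    results = []
    for subtree in ne_tree:
        if isinstance(subtree, Tree):  # named-entity chunks are subtrees
            entity = " ".join(token for token, tag in subtree.leaves())
            results.append((entity, subtree.label()))
    return results

print(extract_entities(entities))  # 'entities' from the previous snippet
# e.g. [('Microsoft', 'ORGANIZATION'), ('Bill Gates', 'PERSON'), ('Paul Allen', 'PERSON')]
```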
4. Advanced NER using Transformers (BERT, spaCy Transformer, Flair)
Deep learning models (BERT, RoBERTa, Flair, spaCy Transformer) provide contextual NER for higher accuracy.
NER using Hugging Face Transformers (BERT)
```python
from transformers import pipeline

# Load pre-trained NER model
ner_model = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")

# Example text
text = "Barack Obama was the 44th President of the United States."

# Get NER results
ner_results = ner_model(text)

# Print results
for entity in ner_results:
    print(entity)
```
Output:
```
{'word': 'Barack', 'score': 0.999, 'entity': 'B-PER'}
{'word': 'Obama', 'score': 0.999, 'entity': 'I-PER'}
{'word': 'United', 'score': 0.999, 'entity': 'B-LOC'}
{'word': 'States', 'score': 0.998, 'entity': 'I-LOC'}
```
Why Use Transformers for NER?
- More accurate than rule-based methods
- Understands context better
- Can be fine-tuned on custom datasets
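The token-level output above splits multi-word entities into B-/I- pieces. Passing `aggregation_strategy="simple"` to the pipeline merges them back into whole entities:
```python
from transformers import pipeline

ner_model = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple",  # merge B-/I- tokens into full entities
)

for entity in ner_model("Barack Obama was the 44th President of the United States."):
    print(entity["word"], "→", entity["entity_group"], f"({entity['score']:.3f})")
# e.g. Barack Obama → PER, United States → LOC
```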
5. Where is NER Used?
- Chatbots & Virtual Assistants → Extract user names, locations, dates
- Healthcare → Identify diseases, drug names, symptoms
- Finance & Business → Extract company names, financial data
- Legal & Compliance → Identify legal entities, contracts
Next Steps:
- Want to fine-tune BERT for NER on a custom dataset?
- Need help deploying an NER model into production?
- Interested in using NER for text summarization or information retrieval?
Sentiment Analysis & Chatbot Development
Sentiment Analysis and Chatbots are two essential Natural Language Processing (NLP) applications. Let’s dive into how to build sentiment analysis models and develop a chatbot using NLP & machine learning.
1. Sentiment Analysis: Understanding Emotions in Text
Sentiment Analysis (also called Opinion Mining) determines whether a text expresses positive, negative, or neutral sentiment. It is widely used in customer feedback, reviews, and social media analysis.
Approaches for Sentiment Analysis:
- Rule-based → Uses lexicons (e.g., VADER, TextBlob)
- Machine Learning-based → Uses classifiers (e.g., Logistic Regression, SVM, Random Forest); see the sketch after this list
- Deep Learning-based → Uses LSTMs, BERT, or Transformers
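The rule-based and deep-learning approaches are demonstrated in the sections below; for completeness, here is a minimal sketch of the machine-learning approach with scikit-learn (the tiny inline dataset is purely illustrative – in practice you would train on thousands of labeled examples):
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data (illustrative only)
texts = ["I love this!", "Absolutely fantastic product", "Pretty good overall",
         "Terrible, do not buy", "Worst purchase ever", "Awful customer service"]
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

# TF-IDF features + Logistic Regression classifier
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["This product is fantastic", "Terrible service"]))
# should print something like ['pos' 'neg']
```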
2. Sentiment Analysis using Python (VADER – Rule-Based)
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a pre-trained sentiment analysis model for social media text.
Install NLTK:
```
pip install nltk
```
Python Code for VADER Sentiment Analysis:
```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

# Initialize Sentiment Intensity Analyzer
sia = SentimentIntensityAnalyzer()

# Example sentences
sentences = [
    "I love this product! It's amazing.",
    "This is the worst experience I've ever had. Terrible service!",
    "It's okay, not too bad but not great either."
]

# Analyze sentiment scores
for sentence in sentences:
    score = sia.polarity_scores(sentence)
    print(f"Text: {sentence}\nSentiment Score: {score}\n")
```
VADER Output Example:
```
Text: I love this product! It's amazing.
Sentiment Score: {'neg': 0.0, 'neu': 0.32, 'pos': 0.68, 'compound': 0.85}

Text: This is the worst experience I've ever had. Terrible service!
Sentiment Score: {'neg': 0.76, 'neu': 0.24, 'pos': 0.0, 'compound': -0.91}

Text: It's okay, not too bad but not great either.
Sentiment Score: {'neg': 0.13, 'neu': 0.68, 'pos': 0.19, 'compound': 0.2}
```
Key Takeaways:
- Fast & accurate for social media text, reviews, chat messages.
- VADER is good for short texts but not suitable for deep sentiment understanding.
- For longer texts, use Machine Learning or Deep Learning models (see below).
3. Sentiment Analysis using Transformers (BERT-based Model)
If you need higher accuracy, use Transformer models like BERT.
Install Transformers:
```
pip install transformers
```
Python Code for BERT-based Sentiment Analysis:
```python
from transformers import pipeline

# Load pre-trained sentiment analysis pipeline
sentiment_pipeline = pipeline("sentiment-analysis")

# Example sentences
text = ["I love this phone, it's the best I've ever used!",
        "I hate waiting in long lines, it's so frustrating."]

# Perform sentiment analysis
results = sentiment_pipeline(text)

# Display results
for sentence, result in zip(text, results):
    print(f"Text: {sentence}\nSentiment: {result['label']}, Confidence: {result['score']:.2f}\n")
```
Key Takeaways:
- BERT-based models are state-of-the-art for sentiment analysis.
- Good for long and complex texts.
- Hugging Face Transformers make it easy to use pre-trained models.
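Note that calling `pipeline("sentiment-analysis")` with no model argument makes Hugging Face pick a default checkpoint for you (and print a warning). Pinning the model explicitly keeps results reproducible; `distilbert-base-uncased-finetuned-sst-2-english` below is the commonly used English checkpoint:
```python
from transformers import pipeline

# Pin the checkpoint instead of relying on the pipeline's default
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(sentiment_pipeline("Fine-tuning BERT on my own data was surprisingly easy!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```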
4. Chatbot Development: Building a Simple NLP Chatbot
A chatbot is an AI-powered assistant that understands and responds to human input.
Types of Chatbots:
- Rule-based Chatbots → Use predefined rules & keywords
- Retrieval-based Chatbots → Use NLP & machine learning to choose the best response from a fixed set (see the sketch after this list)
- Generative Chatbots → Use Deep Learning (LSTMs, Transformers, GPT-3, etc.) to generate responses from scratch
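Before the rule-based and generative examples below, here is a minimal sketch of the retrieval-based idea: embed a set of known questions with TF-IDF and return the answer whose question is most similar to the user's query (the FAQ data is made up for illustration):
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical FAQ the bot can "retrieve" answers from
faq = {
    "What are your opening hours?": "We are open 9am-5pm, Monday to Friday.",
    "How do I reset my password?": "Click 'Forgot password' on the login page.",
    "Where are you located?": "Our office is in New York.",
}

questions = list(faq.keys())
vectorizer = TfidfVectorizer().fit(questions)
question_vecs = vectorizer.transform(questions)

def retrieve_answer(user_query):
    # Pick the stored question most similar to the user's query
    query_vec = vectorizer.transform([user_query])
    best = cosine_similarity(query_vec, question_vecs).argmax()
    return faq[questions[best]]

print(retrieve_answer("Tell me your opening hours"))  # → "We are open 9am-5pm, Monday to Friday."
```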
5. Build a Simple NLP Chatbot using NLTK
Install NLTK:
```
pip install nltk
```
Python Code for Rule-Based Chatbot:
```python
import nltk
from nltk.chat.util import Chat, reflections

# Define chatbot responses (Rule-based)
pairs = [
    ["hi|hello|hey", ["Hello!", "Hi there!", "Hey!"]],
    ["how are you?", ["I'm doing well, thanks!", "I'm fine, what about you?"]],
    ["what is your name?", ["I am a chatbot.", "Call me ChatGPT!"]],
    ["bye|goodbye", ["Goodbye!", "See you later!", "Have a great day!"]],
]

# Create chatbot
chatbot = Chat(pairs, reflections)

# Start chatbot
print("Hello! I'm a simple chatbot. Type 'bye' to exit.")
chatbot.converse()
```
Key Takeaways:
- Basic rule-based chatbot using pattern matching.
- Good for FAQs but not ideal for complex conversations.
- For advanced chatbots, use Machine Learning (NLU, Transformers, GPT).
6. Advanced AI Chatbot using Transformers (DialoGPT)
DialoGPT (by Microsoft) is a pre-trained conversational model.
Install Transformers:
```
pip install transformers
```
Python Code for AI Chatbot using DialoGPT:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load pre-trained chatbot model (DialoGPT)
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

def chat_with_ai():
    print("Chatbot: Hello! Type 'quit' to exit.")
    chat_history_ids = None
    while True:
        user_input = input("You: ")
        if user_input.lower() == "quit":
            break
        # Encode user input
        new_input_ids = tokenizer.encode(user_input + tokenizer.eos_token, return_tensors="pt")
        # Append to chat history
        bot_input_ids = torch.cat([chat_history_ids, new_input_ids], dim=-1) if chat_history_ids is not None else new_input_ids
        # Generate chatbot response
        chat_history_ids = model.generate(bot_input_ids, max_length=1000, pad_token_id=tokenizer.eos_token_id)
        # Decode and print chatbot response
        response = tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)
        print(f"Chatbot: {response}")

# Start AI Chatbot
chat_with_ai()
```
Key Takeaways:
- DialoGPT generates human-like responses using deep learning.
- Can maintain conversational context.
- Best for real-world chatbot applications like virtual assistants, customer support.
Next Steps:
- Want to train your own Sentiment Analysis model on a custom dataset?
- Interested in deploying a chatbot as a web app?
- Need help fine-tuning a Transformer chatbot (GPT-3, ChatGPT, etc.)?
Hands-on Practice: Basics of NLP
Perform Text Sentiment Analysis using NLP libraries (spaCy, NLTK)
Here’s a hands-on guide for performing Text Sentiment Analysis using spaCy and NLTK in Python.
Sentiment Analysis using NLTK (VADER)
NLTK’s VADER (Valence Aware Dictionary and sEntiment Reasoner) is great for sentiment analysis, especially for short, informal text like tweets or reviews.
Install NLTK:
```
pip install nltk
```
NLTK VADER Sentiment Analysis Code:
```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Download VADER model
nltk.download('vader_lexicon')

# Initialize Sentiment Intensity Analyzer
sia = SentimentIntensityAnalyzer()

# Example sentences
sentences = [
    "I absolutely love this product! It's fantastic.",
    "This was the worst experience ever. So disappointed!",
    "The service was okay, nothing special."
]

# Analyze sentiment scores
for sentence in sentences:
    score = sia.polarity_scores(sentence)
    print(f"Text: {sentence}\nSentiment Score: {score}\n")
```
Sample Output:
```
Text: I absolutely love this product! It's fantastic.
Sentiment Score: {'neg': 0.0, 'neu': 0.23, 'pos': 0.77, 'compound': 0.92}

Text: This was the worst experience ever. So disappointed!
Sentiment Score: {'neg': 0.76, 'neu': 0.24, 'pos': 0.0, 'compound': -0.85}

Text: The service was okay, nothing special.
Sentiment Score: {'neg': 0.13, 'neu': 0.74, 'pos': 0.13, 'compound': 0.0}
```
Key Takeaways from NLTK VADER
- Fast & efficient for short texts like tweets, reviews
- Handles negations, emojis, and slang well
- Pre-trained, no need for manual training
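A quick sanity check of the negation and intensity handling mentioned above:
```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

print(sia.polarity_scores("The movie was good."))      # positive compound score
print(sia.polarity_scores("The movie was not good."))  # negation flips it negative
print(sia.polarity_scores("The movie was GOOD!!!"))    # caps and '!' boost intensity
```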
Sentiment Analysis using spaCy (with TextBlob)
spaCy does not have built-in sentiment analysis, but we can use TextBlob (which is built on NLTK) with it.
Install Dependencies:
```
pip install spacy textblob
python -m spacy download en_core_web_sm
```
spaCy + TextBlob Sentiment Analysis Code:
```python
import spacy
from textblob import TextBlob

# Load spaCy NLP model
nlp = spacy.load("en_core_web_sm")

# Example text
text = "I love this phone. The battery life is amazing!"

# Process text with spaCy
doc = nlp(text)

# Perform sentiment analysis using TextBlob
sentiment = TextBlob(doc.text).sentiment

# Print results
print(f"Text: {text}")
print(f"Polarity: {sentiment.polarity} (Negative: -1 to Positive: 1)")
print(f"Subjectivity: {sentiment.subjectivity} (Objective: 0 to Subjective: 1)")
```
Sample Output:
```
Text: I love this phone. The battery life is amazing!
Polarity: 0.75 (Negative: -1 to Positive: 1)
Subjectivity: 0.85 (Objective: 0 to Subjective: 1)
```
Key Takeaways from spaCy + TextBlob
- Polarity Score → Measures sentiment from -1 (negative) to +1 (positive)
- Subjectivity Score → Measures if a sentence is factual (0) or opinion-based (1)
- Good for processing longer texts, like news articles or blogs
Comparing NLTK vs. spaCy for Sentiment Analysis
| Feature | NLTK (VADER) | spaCy + TextBlob |
|---|---|---|
| Pre-trained? | Yes | Yes |
| Handles Emojis? | Yes | No |
| Best Text Length | Short texts (social media, reviews) | Longer texts (articles, blogs) |
| Requires Training? | No | No |
| Sentiment Scores | Compound Score (-1 to +1) | Polarity (-1 to +1) & Subjectivity (0 to 1) |
Next Steps:
- Want to use Deep Learning (BERT) for sentiment analysis?
- Need help deploying a sentiment analysis model into a chatbot?
Build a Chatbot using OpenAI’s GPT API
Let’s create an AI chatbot using OpenAI’s GPT API in Python. This chatbot can understand and respond to human text input in a conversational manner.
Step 1: Install Dependencies
You’ll need openai and python-dotenv to manage API keys securely.
```
pip install openai python-dotenv
```
Step 2: Get OpenAI API Key
- Sign up at OpenAI.
- Generate an API key from the OpenAI API Dashboard.
- Store the key securely in a .env file.
Create a .env file in your project directory:
```
OPENAI_API_KEY="your-api-key-here"
```
Step 3: Build the Chatbot using GPT API
```python
import os
import openai
from dotenv import load_dotenv

# Load API key from .env file
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

# Note: this uses the legacy (pre-1.0) openai SDK interface (openai.ChatCompletion)
def chat_with_gpt(prompt, chat_history=None):
    if chat_history is None:
        chat_history = []
    # Append user message to chat history
    chat_history.append({"role": "user", "content": prompt})
    # Call OpenAI API
    response = openai.ChatCompletion.create(
        model="gpt-4",  # You can use "gpt-3.5-turbo" for cheaper and faster results
        messages=chat_history,
        temperature=0.7,  # Adjust for creativity (0 = precise, 1 = more creative)
    )
    # Get chatbot response
    reply = response["choices"][0]["message"]["content"]
    # Append chatbot message to chat history
    chat_history.append({"role": "assistant", "content": reply})
    return reply, chat_history

# Run chatbot in a loop
print("Chatbot: Hello! Type 'exit' to stop.")
chat_history = []
while True:
    user_input = input("You: ")
    if user_input.lower() == "exit":
        print("Chatbot: Goodbye!")
        break
    response, chat_history = chat_with_gpt(user_input, chat_history)
    print(f"Chatbot: {response}")
```
How It Works:
- Stores conversation history for better context.
- Uses OpenAI’s GPT API to generate responses.
- Runs in a loop until the user types “exit”.
Next Steps:
- Want to deploy this chatbot as a web app?
- Need voice input/output integration for a voice assistant?
- Want to fine-tune the chatbot for specific topics?