Stemming & Lemmatization in NLP – Text Preprocessing Techniques for Beginners


Learn the difference between stemming and lemmatization in NLP. This beginner-friendly tutorial covers text normalization techniques in Python to preprocess text for machine learning and AI applications.

1. Introduction

Stemming and lemmatization are NLP techniques that reduce inflected words to a base or root form, so that variants such as "jump", "jumps" and "jumping" are treated as the same term. They are a common text preprocessing step before building machine learning or deep learning models.

Key Difference:

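  1. Stemming: Strips suffixes using heuristic rules, without consulting a dictionary. The result may not be a real word (for example, “easily” becomes “easili”).
  2. Lemmatization: Maps a word to its dictionary form (lemma) using a vocabulary and, ideally, the word's part of speech. The result is a valid word.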

Example:

  1. Word: “running”
     Stemming → “run”
     Lemmatization → “run” (when treated as a verb)
  2. Word: “better”
     Stemming → “better”
     Lemmatization → “good” (when treated as an adjective)
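
To make the contrast concrete, here is a minimal pure-Python sketch. It is a toy, not the Porter algorithm and not WordNet: the names `SUFFIXES`, `toy_stem`, `LEMMAS` and `toy_lemmatize` are hypothetical and exist only for illustration. The rule-based function chops suffixes blindly, while the lookup function relies on a small hand-written dictionary keyed by word and part of speech.

```python
# Toy illustration only: NOT the Porter algorithm and NOT WordNet.
SUFFIXES = ("ing", "ly", "s")  # hypothetical, deliberately tiny rule set


def toy_stem(word: str) -> str:
    """Strip the first matching suffix; no dictionary is consulted."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical mini-dictionary keyed by (word, part of speech).
LEMMAS = {
    ("running", "v"): "run",
    ("jumps", "v"): "jump",
    ("better", "a"): "good",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Look the word up; fall back to the word itself if it is unknown."""
    return LEMMAS.get((word, pos), word)


for word, pos in [("running", "v"), ("easily", "r"), ("better", "a")]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word, pos))
```

The toy stemmer produces non-words such as “runn” and “easi”, while the lookup only ever returns the original word or an entry from the dictionary. Real stemmers use far more careful rules, but the trade-off has the same shape.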

2. Python Example

Using NLTK for stemming and lemmatization:


import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

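# One-time download of the WordNet data used by the lemmatizer.
# Some NLTK releases may also require nltk.download('omw-1.4').
nltk.download('wordnet')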

words = ["running", "jumps", "easily", "fairly", "better"]

# Stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(w) for w in words]

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(w, pos='v') for w in words]  # pos='v' treats every word as a verb

print("Original Words:", words)
print("Stemmed Words:", stemmed_words)
print("Lemmatized Words:", lemmatized_words)

Output Example:


Original Words: ['running', 'jumps', 'easily', 'fairly', 'better']
Stemmed Words: ['run', 'jump', 'easili', 'fairli', 'better']
Lemmatized Words: ['run', 'jump', 'easily', 'fairly', 'better']

Note: “better” stays “better” here because pos='v' tells the lemmatizer to look up every word as a verb. Calling lemmatizer.lemmatize("better", pos='a') treats it as an adjective and returns “good”. Likewise, “easily” and “fairly” are adverbs, so the verb lookup leaves them unchanged.

3. Best Practices

  1. Stemming is faster but cruder. Consider it for large datasets or search, where speed matters more than readable output.
  2. Lemmatization is slower but returns valid dictionary words. Prefer it when meaning matters, and supply the part of speech when you can.
  3. Combine either technique with tokenization and stopword removal so that downstream models see a smaller, more consistent vocabulary.
  4. Choose the technique based on the task (search, classification, sentiment analysis).
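
The practices above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not a production recipe: `STOPWORDS`, `strip_plural` and `preprocess` are hypothetical names, the stopword list is deliberately tiny, and `strip_plural` is only a stand-in normalizer so that the sketch runs without any downloads.

```python
import re
from typing import Callable, List

# Hypothetical, deliberately small stopword list for illustration.
STOPWORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in"}


def strip_plural(token: str) -> str:
    """Stand-in normalizer: drop a trailing 's' from longer tokens."""
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def preprocess(text: str, normalize: Callable[[str], str] = strip_plural) -> List[str]:
    """Tokenize, remove stopwords, then normalize each remaining token."""
    tokens = re.findall(r"[a-z]+", text.lower())         # 1. tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]   # 2. remove stopwords
    return [normalize(t) for t in tokens]                # 3. stem or lemmatize


print(preprocess("The cats are chasing the mice in the gardens"))
# prints ['cat', 'chasing', 'mice', 'garden']
```

Because `normalize` is just a function from string to string, a real stemmer or lemmatizer can be swapped in, for example `PorterStemmer().stem` or a small wrapper around `WordNetLemmatizer().lemmatize`, without changing the rest of the pipeline.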

4. Outcome

After learning stemming and lemmatization, beginners will be able to:

  1. Normalize text efficiently for NLP tasks.
  2. Reduce words to base forms, which shrinks the vocabulary and can improve model performance.
  3. Preprocess text data as a critical step in AI and machine learning pipelines.