Building a Sentence Similarity Checker with Python: TF-IDF, Sentence Transformers, and Word2Vec
In the age of artificial intelligence and natural language processing, understanding how similar two sentences are can be incredibly valuable. Whether it’s for detecting plagiarism, summarizing texts, or enhancing chatbots, sentence similarity plays a crucial role. Today, we’ll explore a simple yet effective Streamlit application that computes sentence similarity using three different methods: TF-IDF, Sentence Transformers, and Word2Vec.
Introduction
Before diving into the code, let’s briefly understand the three methods we’ll be using:
- TF-IDF (Term Frequency-Inverse Document Frequency): This is a numerical statistic that reflects the importance of a word in a document relative to a collection of documents (corpus). The similarity is computed by transforming sentences into TF-IDF vectors and then calculating the cosine similarity between them.
- Sentence Transformers: This method uses pre-trained BERT (Bidirectional Encoder Representations from Transformers) models to generate embeddings for sentences. These embeddings are then compared using cosine similarity.
- Word2Vec: This approach uses word embeddings pre-trained on large corpora like Google News. Each word in a sentence is represented as a vector, and the sentence vector is obtained by averaging the word vectors. Cosine similarity is used to compare these sentence vectors.
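All three methods end with the same step: a cosine similarity between two vectors. To make that step concrete, here is a minimal from-scratch sketch of the TF-IDF route using only the standard library. The helper names `tfidf_vectors` and `cosine_similarity` are made up for this illustration, the tokenizer is a plain lowercase whitespace split (an assumption; scikit-learn's `TfidfVectorizer` tokenizes differently), and the IDF formula is the smoothed variant that scikit-learn uses by default.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Return one L2-normalised TF-IDF vector (a list of floats) per sentence."""
    # Simplified tokenizer (assumption): lowercase, split on whitespace.
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many sentences does each term occur?
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec))
        vectors.append([x / norm for x in vec] if norm else vec)
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # design choice for this sketch: undefined case maps to 0
    return dot / (norm_u * norm_v)


a, b = tfidf_vectors(["the cat sat on the mat", "the dog sat on the log"])
print(round(cosine_similarity(a, b), 4))
```

Note that `scipy.spatial.distance.cosine` returns a cosine distance, which is why the application code computes `1 - cosine(...)` to get a similarity.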
Streamlit application for TF-IDF, Sentence Transformers, and Word2Vec:
Import required packages:
# app.py
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial.distance import cosine
from sentence_transformers import SentenceTransformer
from gensim.models import KeyedVectors
import numpy as np
import streamlit as st
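Note: The Word2Vec option needs the pre-trained GoogleNews-vectors-negative300.bin file. Download it and place it in the directory from which you run the app, since the code loads it by its bare file name.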
Function to compute similarity
# Function to compute similarity
def compute_similarity(sentence1, sentence2, similarity_type):
    if similarity_type == 'tfidf':
        vectorizer = TfidfVectorizer()
        tfidf_matrix = vectorizer.fit_transform([sentence1, sentence2])
        similarity = 1 - cosine(tfidf_matrix[0].toarray()[0], tfidf_matrix[1].toarray()[0])
    elif similarity_type == "sentencetransformer":
        model = SentenceTransformer('bert-base-nli-mean-tokens')
        embeddings = model.encode([sentence1, sentence2])
        similarity = 1 - cosine(embeddings[0], embeddings[1])
    elif similarity_type == 'word2vec':
        model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True, limit=500000)  # Load a subset of the model
        embeddings = [np.mean([model[word] for word in sentence.split() if word in model] or [np.zeros(model.vector_size)], axis=0) for sentence in [sentence1, sentence2]]
        similarity = 1 - cosine(embeddings[0], embeddings[1])
    else:
        raise ValueError("Invalid similarity_type. Choose either 'tfidf', 'sentencetransformer', or 'word2vec'.")
    return similarity
Streamlit application
st.title("Sentence Similarity Checker")

# User inputs
sentence1 = st.text_input("Enter the first sentence:")
sentence2 = st.text_input("Enter the second sentence:")

# Option to select similarity types
similarity_types = st.multiselect(
    "Choose similarity type(s):",
    ['tfidf', 'sentencetransformer', 'word2vec']
)

if st.button("Compute Similarity"):
    if sentence1 and sentence2:
        if 'tfidf' in similarity_types:
            tfidf_similarity = compute_similarity(sentence1, sentence2, 'tfidf')
            st.write(f"TF-IDF Similarity: {tfidf_similarity:.4f}")
        if 'sentencetransformer' in similarity_types:
            transformer_similarity = compute_similarity(sentence1, sentence2, 'sentencetransformer')
            st.write(f"Sentence Transformer Similarity: {transformer_similarity:.4f}")
        if 'word2vec' in similarity_types:
            word2vec_similarity = compute_similarity(sentence1, sentence2, 'word2vec')
            st.write(f"Word2Vec Similarity: {word2vec_similarity:.4f}")
Run the script with the following command:
streamlit run app.py
How It Works
1. User Input: The application starts by taking two sentences as input from the user.
2. Select Similarity Types: The user can select one or more similarity computation methods from a multi-select option.
3. Compute Similarity: When the “Compute Similarity” button is clicked, the application calculates and displays the similarity scores using the selected methods.
Detailed Breakdown
- TF-IDF Similarity:
  - We use TfidfVectorizer to convert the sentences into TF-IDF vectors.
  - The cosine similarity between these vectors is computed to determine the similarity.
- Sentence Transformers Similarity:
  - We use a pre-trained BERT model (bert-base-nli-mean-tokens) to encode the sentences into embeddings.
  - The cosine similarity between these embeddings is computed.
- Word2Vec Similarity:
  - We load pre-trained word vectors from the Google News corpus.
  - Each sentence is converted to a vector by averaging the vectors of its constituent words.
  - The cosine similarity between these sentence vectors is computed.
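The averaging step in the Word2Vec breakdown can be sketched without the large Google News file. The example below uses a tiny hand-made lookup table, `TOY_VECTORS`, whose values are invented for illustration and are not real Word2Vec embeddings. The function names are likewise hypothetical. Words missing from the table are skipped, and a sentence with no known words falls back to an all-zero vector, as in the application code. Because cosine similarity is undefined for an all-zero vector, this sketch returns 0.0 in that case, which is an assumption of the example rather than what the app does.

```python
import math

# Toy 3-dimensional "embeddings"; the numbers are made up for illustration.
TOY_VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "sat": [0.1, 0.9, 0.1],
    "ran": [0.2, 0.8, 0.2],
}
DIM = 3


def sentence_vector(sentence, vectors=TOY_VECTORS, dim=DIM):
    """Average the vectors of known words; all zeros if no word is known."""
    known = [vectors[w] for w in sentence.split() if w in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension across all words.
    return [sum(column) / len(known) for column in zip(*known)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption of this sketch: treat the undefined case as 0
    return dot / (norm_u * norm_v)


v1 = sentence_vector("the cat sat")
v2 = sentence_vector("the dog ran")
print(round(cosine_similarity(v1, v2), 4))
```

Since the sentence vector is a plain average, word order has no effect: "cat sat" and "sat cat" map to the same vector.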
Conclusion
This Streamlit application provides a simple interface for computing sentence similarity with three different methods. Because it reports the selected scores together, it makes it easy to see how a word-overlap measure (TF-IDF) and two embedding-based measures (Sentence Transformers and Word2Vec) rate the same pair of sentences. Whether you’re a researcher, developer, or enthusiast, this tool can help you explore the nuances of sentence similarity in a straightforward manner.