As Generative AI becomes integral to modern workflows, evaluating the quality of model outputs has become critical. Metrics like BERTScore and cosine similarity are widely used to compare generated text with reference answers. However, recent experiments in our gen-ai-tests repository show that these metrics often overestimate similarity, even when the content is unrelated or incorrect.
In this post, we will explore how these metrics work, highlight their key shortcomings, and provide real examples, including test failures from GitHub Actions, to show why relying on them alone is risky.
How Do These Metrics Work?
- BERTScore: Uses pre-trained transformer models to compare the contextual embeddings of tokens. It calculates similarity based on token-level precision, recall, and F1 scores.
- Cosine Similarity: Measures the cosine of the angle between two high-dimensional sentence embeddings. A score closer to 1 indicates greater similarity.
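To make the cosine-similarity definition concrete, here is a minimal sketch that computes it directly from two embedding vectors with NumPy. The vectors are toy values purely for illustration, not real sentence embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" purely for illustration.
a = np.array([0.2, 0.9, 0.4, 0.1])
b = np.array([0.25, 0.85, 0.5, 0.05])
print(cosine_similarity(a, b))  # close to 1.0 because the vectors point in similar directions
```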
While these metrics are fast and easy to implement, they have critical blind spots.
Experimental Results From gen-ai-tests
We evaluated prompts and responses using prompt_eval.py.
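For readers who want to follow along, the two helpers used by the tests below could look roughly like this. This is a minimal sketch assuming the bert-score and sentence-transformers packages; the actual prompt_eval.py in the repository may differ in model choice and details.

```python
# Rough sketch of the helpers prompt_eval.py exposes; the real file may differ.
from bert_score import score as bert_score
from sentence_transformers import SentenceTransformer, util

_embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def evaluate_bertscore(candidate: str, reference: str) -> float:
    # bert_score returns precision, recall, and F1 tensors; we report F1.
    _, _, f1 = bert_score([candidate], [reference], lang="en")
    return f1.item()

def evaluate_cosine_similarity(candidate: str, reference: str) -> float:
    # Encode both sentences and take the cosine of the angle between the embeddings.
    emb = _embedder.encode([candidate, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()
```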
Example 1: High Similarity for Valid Output
"prompt": "What is the capital of France?",
"expected_answer": "Paris is the capital of France.",
"generated_answer": "The capital of France is Paris."
Results:
- BERTScore F1: approximately 0.93
- Cosine Similarity: approximately 0.99
This is expected. Both metrics perform well when the generated answer is semantically equivalent to the reference but phrased differently.
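Reproducing this case with the helper sketch above is a one-liner per metric (exact scores depend on the models and library versions used):

```python
expected = "Paris is the capital of France."
generated = "The capital of France is Paris."

print(evaluate_bertscore(generated, expected))          # roughly 0.93 in our runs
print(evaluate_cosine_similarity(generated, expected))  # roughly 0.99 in our runs
```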
Example 2: High Similarity for Unrelated Sentences
In test_prompt_eval.py, we evaluate unrelated sentences:
from prompt_eval import evaluate_bertscore, evaluate_cosine_similarity  # helpers defined in prompt_eval.py


def test_bertscore_unrelated():
    # Two sentences that are unrelated should have a low BERTScore
    # This is a simple example to showcase the limitations of BERTScore
    s1 = "The quick brown fox jumps over the lazy dog."
    s2 = "Quantum mechanics describes the behavior of particles at atomic scales."
    score = evaluate_bertscore(s1, s2)
    print(f"BERTScore between unrelated sentences: {score}")
    assert score < 0.8


def test_cosine_similarity_unrelated():
    # The same unrelated pair should also produce a low cosine similarity
    s1 = "The quick brown fox jumps over the lazy dog."
    s2 = "Quantum mechanics describes the behavior of particles at atomic scales."
    sim = evaluate_cosine_similarity(s1, s2)
    print(f"Cosine similarity between unrelated sentences: {sim}")
    assert sim < 0.8
However, test_bertscore_unrelated fails: despite the sentences being completely unrelated, BERTScore returns a score above 0.8.
Real Test Failure in CI
Here is the actual test failure from our GitHub Actions pipeline:
FAILED prompt_tests/test_prompt_eval.py::test_bertscore_unrelated - assert 0.8400543332099915 < 0.8
📎 View the GitHub Actions log here
This demonstrates how BERTScore can be misleading even in automated test pipelines, letting incorrect or irrelevant GenAI outputs pass evaluation.
Key Limitations Observed
- Overestimation of Similarity
  Common linguistic patterns and phrasing can inflate similarity scores, even when the content is semantically different.
- No Factual Awareness
  These metrics do not measure whether the generated output is correct or grounded in fact. They only compare vector embeddings.
- Insensitive to Word Order or Meaning Shift
  Sentences like “The cat chased the dog” and “The dog chased the cat” may receive similarly high scores, despite the reversed meaning.
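You can check the word-order point with the same helpers sketched earlier. Exact scores vary with the underlying models, but both values tend to come out high because the two sentences share nearly the same tokens:

```python
s1 = "The cat chased the dog."
s2 = "The dog chased the cat."

print(evaluate_bertscore(s1, s2))           # tends to be high: the token sets are nearly identical
print(evaluate_cosine_similarity(s1, s2))   # tends to be high as well, despite the reversed meaning
```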
What to Use Instead?
To evaluate GenAI reliably, especially in production, we recommend integrating context-aware, task-specific, or model-in-the-loop evaluation strategies:
- LLM-as-a-judge using GPT-4 or Claude for qualitative feedback (a minimal sketch follows this list).
- BLEURT, G-Eval, or BARTScore for learned scoring aligned with human judgments.
- Fact-checking modules, citation checkers, or hallucination detectors.
- Hybrid pipelines that combine automatic similarity scoring with targeted LLM evaluation and manual review.
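As an illustration of the LLM-as-a-judge idea, here is a minimal, provider-agnostic sketch. The call_llm function is a placeholder for whatever client you use (OpenAI, Anthropic, a local model), and the prompt wording and 1-5 scale are just one reasonable choice, not a prescribed standard.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to your LLM provider and return the raw text response."""
    raise NotImplementedError("Wire this up to your provider's client library.")

JUDGE_TEMPLATE = """You are grading the answer to a question.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Rate the candidate's factual correctness from 1 (wrong) to 5 (fully correct)
and respond with JSON: {{"score": <int>, "reason": "<one sentence>"}}"""

def judge_answer(question: str, reference: str, candidate: str) -> dict:
    prompt = JUDGE_TEMPLATE.format(question=question, reference=reference, candidate=candidate)
    # The model is asked for JSON so the verdict can be checked in a test or pipeline.
    return json.loads(call_llm(prompt))

# Example usage (requires call_llm to be implemented):
# verdict = judge_answer("What is the capital of France?",
#                        "Paris is the capital of France.",
#                        "The capital of France is Berlin.")
# assert verdict["score"] >= 4, verdict["reason"]
```

Unlike embedding similarity, a judge prompt like this can penalize an answer that is fluent but factually wrong, which is exactly the failure mode the tests above expose.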
Final Takeaway
If you are evaluating GenAI outputs for tasks like question answering, summarization, or decision support, do not rely solely on BERTScore or cosine similarity. These metrics can lead to overconfident assessments of poor-quality outputs.
You can find all code and examples here:
📁 gen-ai-tests
📄 prompt_eval.py, test_prompt_eval.py