Evaluating Evaluation Metrics — The Mirage of Hallucination Detection

Hallucinations pose a significant obstacle to the reliability and widespread adoption of language models, yet their accurate measurement remains a persistent challenge. While many task- and domain-specific metrics have been proposed to assess faithfulness and factuality concerns, the robustness and generalization of these metrics are still untested. In this paper, we conduct a large-scale empirical evaluation of 6 diverse sets of hallucination detection metrics across 4 datasets, 37 language models from 5 families, and 5 decoding methods. Our extensive investigation reveals concerning gaps in…

Published by

AI Generated Robotic Content

Tags: ai/ml, faang

7 months ago