🔬 Evaluation
Overview
We provide out-of-the-box evaluation metrics for your RAG application. You can use them to evaluate the application and to compare results across different settings of your production setup.
Currently, we support the following evaluation metrics:
- Context relevancy
- Answer relevancy
- Groundedness
- Custom metrics
Quickstart
Here is a basic example of running evaluation:
Under the hood, Embedchain does the following:
- Runs semantic search in the vector database and fetches context
- Calls the LLM with the question and the context to fetch the answer
- Runs evaluation on the following metrics: context relevancy, groundedness, and answer relevancy, and returns the result
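The three steps above can be sketched as a small pipeline. This is an illustration only, not Embedchain's implementation: evaluate_question, the retrieve and generate callables, and the toy metric are hypothetical names, and the toy functions stand in for the vector database, the LLM call, and the real metrics.

```python
from typing import Callable, Dict, List

# Hypothetical type aliases for the three pluggable steps.
Retriever = Callable[[str], List[str]]
Generator = Callable[[str, List[str]], str]
Metric = Callable[[str, List[str], str], float]


def evaluate_question(
    question: str,
    retrieve: Retriever,
    generate: Generator,
    metrics: Dict[str, Metric],
) -> Dict[str, float]:
    """Fetch context, generate an answer, then score it with every metric."""
    contexts = retrieve(question)          # step 1: semantic search
    answer = generate(question, contexts)  # step 2: LLM call
    # step 3: run each metric on the (question, contexts, answer) triple
    return {name: metric(question, contexts, answer) for name, metric in metrics.items()}


# Toy stand-ins so the sketch runs without a vector database or an LLM.
def toy_retrieve(question: str) -> List[str]:
    return ["Paris is the capital of France."]


def toy_generate(question: str, contexts: List[str]) -> str:
    return contexts[0]


def toy_overlap_metric(question: str, contexts: List[str], answer: str) -> float:
    # 1.0 if the answer appears verbatim in some context, else 0.0.
    return 1.0 if any(answer in c for c in contexts) else 0.0


result = evaluate_question(
    "What is the capital of France?",
    toy_retrieve,
    toy_generate,
    {"toy_overlap": toy_overlap_metric},
)
print(result)
```

Passing the steps in as callables keeps the flow visible and lets each one be swapped independently, for example to compare two retrieval settings with the same metrics.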
Advanced Usage
We use OpenAI’s gpt-4 model as the default LLM for automatic evaluation, so you must set OPENAI_API_KEY as an environment variable.
Step-1: Create dataset
To evaluate your RAG application, you have to set up a dataset. Each data point in the dataset consists of questions, contexts, and answer. Here is an example of how to create a dataset for evaluation:
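As an illustration of that structure, the sketch below models one data point with a plain dataclass. EvalPoint and build_dataset are hypothetical names used only for this example, and the field names are assumptions; they are not Embedchain's own classes, so check the library for the actual type it expects.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EvalPoint:
    """Hypothetical stand-in for one data point; not an Embedchain class."""
    question: str
    contexts: List[str]
    answer: str


def build_dataset(records: List[Dict]) -> List[EvalPoint]:
    """Convert plain dicts into data points, failing early on missing fields."""
    required = {"question", "contexts", "answer"}
    dataset = []
    for i, record in enumerate(records):
        missing = required - set(record)
        if missing:
            raise ValueError(f"record {i} is missing fields: {sorted(missing)}")
        dataset.append(
            EvalPoint(record["question"], list(record["contexts"]), record["answer"])
        )
    return dataset


records = [
    {
        "question": "What is the capital of France?",
        "contexts": ["Paris is the capital of France.", "France is in Europe."],
        "answer": "The capital of France is Paris.",
    },
]
dataset = build_dataset(records)
```

Validating the fields up front means a malformed record fails before any paid LLM call is made.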
Step-2: Run evaluation
Once you have created your dataset, pick the metric you want and run the evaluation on the dataset.
For example, you can run evaluation on the context relevancy metric using the following code:
You can choose a different metric or write your own. The available metrics are described in the Metrics section below.
Metrics
Context Relevancy
Context relevancy is a metric to determine how relevant the context is to the question. We use OpenAI’s gpt-4 model to determine the relevancy of the context: we prompt the model with the question and the context and ask it to return the relevant sentences from the context. We then use the following formula to determine the score:
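A minimal sketch of such a score, assuming it is the fraction of context sentences that the model returned as relevant. The sentence splitter and the function names are illustrative assumptions, and the relevant sentences are supplied by hand here instead of coming from an LLM.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def context_relevance_score(context: str, relevant_sentences: List[str]) -> float:
    """Relevant sentences divided by total sentences in the context (assumed formula)."""
    sentences = split_sentences(context)
    if not sentences:
        return 0.0
    relevant = {s.strip() for s in relevant_sentences}
    hits = sum(1 for s in sentences if s in relevant)
    return hits / len(sentences)


context = "Elon Musk runs Tesla. He was born in 1971. Tesla makes cars."
# In a real run these would be the sentences returned by the model.
relevant = ["Elon Musk runs Tesla."]
score = context_relevance_score(context, relevant)  # one of three sentences is relevant
```

Under this assumption, a score near 1 means almost everything retrieved was useful for the question, while a low score points at noisy retrieval.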
Examples
You can run the context relevancy evaluation with the following simple code:
In the above example, we used sensible defaults for the evaluation. However, you can also configure the evaluation metric as per your needs using the ContextRelevanceConfig class.
Here is a more advanced example of how to pass a custom evaluation config for evaluating on the context relevance metric:
ContextRelevanceConfig
- The model to use for the evaluation. Defaults to gpt-4. We only support OpenAI’s models for now.
- The OpenAI API key to use for the evaluation. Defaults to None. If not provided, we will use the OPENAI_API_KEY environment variable.
- The language of the dataset being evaluated. We need this to understand the context provided in the dataset. Defaults to en.
- The prompt to extract the relevant sentences from the context. Defaults to CONTEXT_RELEVANCY_PROMPT, which can be found at the embedchain.config.evaluation.base path.
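The API key fallback described above (an explicit key first, then the OPENAI_API_KEY environment variable) can be sketched as follows. resolve_api_key is a hypothetical helper written for this illustration; it is not part of Embedchain, and the error it raises is an assumption.

```python
import os
from typing import Optional


def resolve_api_key(api_key: Optional[str] = None) -> str:
    """Return the explicit key if given, else fall back to OPENAI_API_KEY."""
    key = api_key or os.environ.get("OPENAI_API_KEY")
    if not key:
        # Assumed behavior for this sketch: fail loudly when no key is available.
        raise ValueError("No API key given and OPENAI_API_KEY is not set")
    return key
```

Calling resolve_api_key() with no argument reads the environment variable, while resolve_api_key("my-key") ignores the environment entirely.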
Answer Relevancy
Answer relevancy is a metric to determine how relevant the answer is to the question. We prompt the model with the answer and ask it to generate questions from the answer. We then use the cosine similarity between the generated questions and the original question to determine the score.
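The similarity step can be sketched with plain Python. The vectors below are toy embeddings chosen by hand, and averaging the similarities over all generated questions is an assumption of this sketch; in practice the vectors would come from an embedding model.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_relevance_score(
    question_vec: Sequence[float], generated_vecs: List[Sequence[float]]
) -> float:
    """Mean similarity between the original question and each generated question."""
    if not generated_vecs:
        return 0.0
    sims = [cosine_similarity(question_vec, g) for g in generated_vecs]
    return sum(sims) / len(sims)


question_vec = [1.0, 0.0]                   # toy embedding of the original question
generated_vecs = [[1.0, 0.0], [0.0, 1.0]]   # toy embeddings of two generated questions
score = answer_relevance_score(question_vec, generated_vecs)
```

Here the first generated question points in the same direction as the original (similarity 1) and the second is orthogonal (similarity 0), so the averaged score sits halfway between.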
Examples
You can run the answer relevancy evaluation with the following simple code:
In the above example, we used sensible defaults for the evaluation. However, you can also configure the evaluation metric as per your needs using the AnswerRelevanceConfig class. Here is a more advanced example where you can provide your own evaluation config:
AnswerRelevanceConfig
- The model to use for the evaluation. Defaults to gpt-4. We only support OpenAI’s models for now.
- The embedder to use for embedding the text. Defaults to text-embedding-ada-002. We only support OpenAI’s embedders for now.
- The OpenAI API key to use for the evaluation. Defaults to None. If not provided, we will use the OPENAI_API_KEY environment variable.
- The number of questions to generate for each answer. We compare the generated questions with the original question to determine the score. Defaults to 1.
- The prompt to extract the num_gen_questions number of questions from the provided answer. Defaults to ANSWER_RELEVANCY_PROMPT, which can be found at the embedchain.config.evaluation.base path.
Groundedness
Groundedness is a metric to determine how grounded the answer is in the context. We use OpenAI’s gpt-4 model to determine the groundedness of the answer. We first prompt the model with the answer and ask it to generate claims from the answer. We then prompt the model again with the context and the generated claims to get a verdict on each claim. We then use the following formula to determine the score:
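A minimal sketch of such a score, assuming each claim receives a numeric verdict (1 if the context supports it, 0 if not) and the score is the sum of verdicts divided by the number of claims. The function name and the verdict encoding are assumptions for illustration; the verdicts are written by hand here instead of coming from an LLM.

```python
from typing import Sequence


def groundedness_score(verdicts: Sequence[float]) -> float:
    """Sum of per-claim verdicts divided by the number of claims (assumed formula)."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


# Four claims extracted from an answer; the third is not supported by the context.
verdicts = [1, 1, 0, 1]
score = groundedness_score(verdicts)
```

With this encoding, the score reads directly as the share of claims in the answer that the context backs up.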
Examples
You can run the groundedness evaluation with the following simple code:
In the above example, we used sensible defaults for the evaluation. However, you can also configure the evaluation metric as per your needs using the GroundednessConfig class. Here is a more advanced example where you can configure the evaluation config:
GroundednessConfig
- The model to use for the evaluation. Defaults to gpt-4. We only support OpenAI’s models for now.
- The OpenAI API key to use for the evaluation. Defaults to None. If not provided, we will use the OPENAI_API_KEY environment variable.
- The prompt to extract the claims from the provided answer. Defaults to GROUNDEDNESS_ANSWER_CLAIMS_PROMPT, which can be found at the embedchain.config.evaluation.base path.
- The prompt to get verdicts on the claims in the answer, given the context. Defaults to GROUNDEDNESS_CLAIMS_INFERENCE_PROMPT, which can be found at the embedchain.config.evaluation.base path.
Custom
You can also create your own evaluation metric by extending the BaseMetric class. You can find the source code for the existing metrics at the embedchain.evaluation.metrics path.
You must provide the name of your custom metric in the __init__ method of your class. This name will be used to identify your metric in the evaluation report.
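The pattern can be sketched as follows. BaseMetricSketch is a hypothetical stand-in defined here so the example runs on its own; it is not Embedchain's BaseMetric, and the evaluate method and the dict-based data points are assumptions, so check the source at the path above for the real interface.

```python
from typing import Dict, List


class BaseMetricSketch:
    """Hypothetical stand-in for a metric base class; the real interface may differ."""

    def __init__(self, name: str):
        self.name = name  # identifies the metric in the evaluation report

    def evaluate(self, dataset: List[Dict]) -> float:
        raise NotImplementedError


class AnswerInContext(BaseMetricSketch):
    """Toy metric: share of data points whose answer appears verbatim in a context."""

    def __init__(self):
        # The metric name is provided in __init__, as required above.
        super().__init__(name="answer_in_context")

    def evaluate(self, dataset: List[Dict]) -> float:
        if not dataset:
            return 0.0
        hits = sum(
            1 for point in dataset
            if any(point["answer"] in context for context in point["contexts"])
        )
        return hits / len(dataset)


dataset = [
    {"question": "Capital of France?", "contexts": ["Paris is the capital of France."], "answer": "Paris"},
    {"question": "Capital of Germany?", "contexts": ["Berlin is in Germany."], "answer": "Munich"},
]
metric = AnswerInContext()
report = {metric.name: metric.evaluate(dataset)}
```

Keying the report by metric.name is what makes the name mandatory: without it, results from several metrics could not be told apart.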