Programmatic Evaluation
Murnitur AI allows developers to run programmatic evaluations, providing a flexible and automated way to assess Large Language Models (LLMs) through code.
Note: Before running programmatic evaluations, set up the Murnitur SDK with your API key. See the SDK setup instructions for details.
Example
Each run is saved in the UI. You can view the results on the AI Evaluations page.
Knowledge Retention Evaluation
To evaluate how well your model retains information across interactions, use Murnitur’s `knowledge_retention` method. You can also set a base threshold to define acceptable retention levels.
- Import the Required Module
- Prepare Your Dataset
- Set Up a Base Threshold (Optional)
- Run the Evaluation
- Sample Output
This method returns a JSON object with a score between 0 and 1 indicating the level of knowledge retention, where a score closer to 0 means better retention. Setting a base threshold lets you check whether retention meets your desired standard.
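As a rough illustration of how a base threshold might be applied to the returned score, here is a minimal sketch. It assumes the threshold acts as an upper bound, since a score closer to 0 means better retention; the SDK's own comparison may differ. `meets_threshold` is a hypothetical helper, not part of the Murnitur SDK.

```python
def meets_threshold(score: float, threshold: float) -> bool:
    """Hypothetical helper: check a knowledge retention score against a base threshold.

    Assumption: lower scores mean better retention, so a score passes
    when it is at or below the threshold.
    """
    for name, value in (("score", score), ("threshold", threshold)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return score <= threshold


# Example: with a base threshold of 0.5, a score of 0.2 passes and 0.8 does not.
passed = meets_threshold(0.2, 0.5)
failed = meets_threshold(0.8, 0.5)
```

If the SDK treats the threshold as a lower bound instead, the comparison in the last line of the helper would need to be flipped.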
Creating dataset
Side Note: If you are creating a dataset from a dictionary, use the following data structure:
Loading dataset
Use any of the following functions to load the dataset:
Extra Params
Extra parameters that can be passed to `run_suite`:
- `save_output`: A boolean. Set to false if you do not want to store the results in Murnitur AI.
- `async_mode`: Indicates whether the requests should run asynchronously.
Evaluation Metrics
We currently support the following evaluation metrics:
- `hallucination`: Measures the accuracy of the model’s output, identifying any fabricated or incorrect information.
- `faithfulness`: Evaluates how well the model’s output adheres to the provided context or source material.
- `relevancy`: Assesses the relevance of the model’s response to the input query or prompt.
- `bias`: Detects any unfair or discriminatory tendencies in the model’s predictions.
- `context-relevancy`: Determines how relevant the input or output is in relation to the given context.
- `context-precision`: Measures the precision of the model’s output within the specific context provided.
- `toxicity`: Identifies harmful or offensive language in the model’s responses.
- `summarization`: Evaluates the quality and accuracy of summaries generated by the model.
- `pii`: Detects the presence of personally identifiable information in the model’s output.
Metrics & Required Fields
Metric | Payload |
---|---|
hallucination | input, output, context |
faithfulness | input, output, retrieval_context |
relevancy | input, output |
bias | input, output |
context-relevancy | input, output, retrieval_context |
context-precision | input, output, ground_truth, retrieval_context |
toxicity | input, output |
summarization | input, output |
pii | output (use murnitur-shield to detect PII in the input) |
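The table can be turned into a small pre-flight check that flags missing fields before a payload is sent for evaluation. The sketch below is plain Python and does not call the Murnitur SDK; `REQUIRED_FIELDS` and `missing_fields` are hypothetical names introduced here for illustration, and the field lists are copied from the table.

```python
# Required payload fields per metric, copied from the table above.
REQUIRED_FIELDS = {
    "hallucination": ["input", "output", "context"],
    "faithfulness": ["input", "output", "retrieval_context"],
    "relevancy": ["input", "output"],
    "bias": ["input", "output"],
    "context-relevancy": ["input", "output", "retrieval_context"],
    "context-precision": ["input", "output", "ground_truth", "retrieval_context"],
    "toxicity": ["input", "output"],
    "summarization": ["input", "output"],
    "pii": ["output"],
}


def missing_fields(metric: str, payload: dict) -> list:
    """Return the fields required by `metric` that are absent from `payload`."""
    if metric not in REQUIRED_FIELDS:
        raise ValueError(f"Unsupported metric: {metric}")
    return [field for field in REQUIRED_FIELDS[metric] if field not in payload]


# Example: a faithfulness payload that lacks its retrieval context.
payload = {"input": "What is the capital of France?", "output": "Paris"}
gaps = missing_fields("faithfulness", payload)
```

Running a check like this over every row of a dataset before evaluation makes a missing `retrieval_context` or `ground_truth` surface locally rather than as a failed run.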