Murnitur AI allows developers to run programmatic evaluations, providing a flexible and automated way to assess Large Language Models (LLMs) through code.

> **Note:** Before running programmatic evaluations, you need to set up the Murnitur SDK with your API key. Detailed instructions on setting up the SDK can be found here.
## Example

### Knowledge Retention Evaluation
To evaluate how well your model retains information across interactions, use Murnitur's `knowledge_retention` method. You can also set a base threshold to define acceptable retention levels.
1. **Import the Required Module**
2. **Prepare Your Dataset**
3. **Set Up a Base Threshold (Optional)**
4. **Run the Evaluation**
5. **Sample Output**
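The steps above can be sketched as follows. This is a minimal illustration, not the verified SDK API: the `murnitur` module and `knowledge_retention` call referenced in the comments are assumptions based on the surrounding text, so the SDK-specific lines are left commented out.

```python
# 1. Import the required module (commented out: requires the Murnitur SDK;
#    the module name is an assumption, not a verified import path)
# import murnitur

# 2. Prepare your dataset: one record per interaction to evaluate
#    (field names and values are illustrative)
dataset = [
    {
        "input": "Earlier I told you my order number is 482. What is it?",
        "output": "Your order number is 482.",
    },
]

# 3. Optionally set a base threshold for acceptable retention (0.0-1.0)
base_threshold = 0.7

# 4. Run the evaluation (commented out: makes network calls; the function
#    name and signature are assumptions from the text above)
# results = murnitur.knowledge_retention(dataset, threshold=base_threshold)

# 5. Sample output would be per-record retention scores
# print(results)
```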
## Creating a dataset

**Side note:** If you are creating a dataset from a dictionary, use the following data structure:
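The original structure is not reproduced here, so the sketch below is a hypothetical shape: a dictionary of parallel lists whose keys mirror the field names used in the "Metrics & Required Fields" table. Treat the exact keys and nesting as assumptions to be checked against the Murnitur docs.

```python
# Hypothetical dataset-as-dictionary shape; the record contents are invented.
data = {
    "input": ["What is the capital of France?"],
    "output": ["The capital of France is Paris."],
    # context is a list of passages per record (assumed nesting)
    "context": [["France is a country in Western Europe. Its capital is Paris."]],
}

# Each key holds one entry per record, so the lists stay aligned.
num_records = len(data["input"])
```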
## Loading a dataset

Use any of the following functions to load the dataset:

## Extra Params
Extra parameters that can be passed to `run_suite`:
- `save_output`: A boolean. Set to `false` if you do not want to store the results in Murnitur AI.
- `async_mode`: Indicates whether the requests should run asynchronously.
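A minimal sketch of passing these extra parameters. The `run_suite` call itself is commented out because it requires the Murnitur SDK and an API key, and its positional arguments (`suite`, `dataset`) are placeholders, not verified parts of the signature; only the two parameter names come from the list above.

```python
# Extra parameters for run_suite, per the list above
extra_params = {
    "save_output": False,  # do not persist results to Murnitur AI
    "async_mode": True,    # send evaluation requests asynchronously
}

# Hypothetical call shape (commented out: requires the SDK):
# run_suite(suite, dataset, **extra_params)
```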
## Evaluation Metrics

We currently support the following evaluation metrics:

- `hallucination`: Measures the accuracy of the model's output, identifying any fabricated or incorrect information.
- `faithfulness`: Evaluates how well the model's output adheres to the provided context or source material.
- `relevancy`: Assesses the relevance of the model's response to the input query or prompt.
- `bias`: Detects any unfair or discriminatory tendencies in the model's predictions.
- `context-relevancy`: Determines how relevant the input or output is in relation to the given context.
- `context-precision`: Measures the precision of the model's output within the specific context provided.
- `toxicity`: Identifies harmful or offensive language in the model's responses.
- `summarization`: Evaluates the quality and accuracy of summaries generated by the model.
- `pii`: Detects the presence of personally identifiable information in the model's output.
## Metrics & Required Fields

| Metric | Payload |
|---|---|
| `hallucination` | `input`, `output`, `context` |
| `faithfulness` | `input`, `output`, `retrieval_context` |
| `relevancy` | `input`, `output` |
| `bias` | `input`, `output` |
| `context-relevancy` | `input`, `output`, `retrieval_context` |
| `context-precision` | `input`, `output`, `ground_truth`, `retrieval_context` |
| `toxicity` | `input`, `output` |
| `summarization` | `input`, `output` |
| `pii` | `output` (use murnitur-shield to detect PII in the input) |
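To make the table concrete, here are example payloads for two metrics. The field names follow the table; the record contents themselves are invented for illustration, and whether `context`/`retrieval_context` are lists of strings is an assumption.

```python
# hallucination requires: input, output, context
hallucination_payload = {
    "input": "Who wrote 'Hamlet'?",
    "output": "Hamlet was written by William Shakespeare.",
    "context": ["Hamlet is a tragedy written by William Shakespeare."],
}

# context-precision requires: input, output, ground_truth, retrieval_context
context_precision_payload = {
    "input": "When did Apollo 11 land on the Moon?",
    "output": "Apollo 11 landed on July 20, 1969.",
    "ground_truth": "July 20, 1969",
    "retrieval_context": ["Apollo 11 landed on the Moon on July 20, 1969."],
}
```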