Murnitur AI allows developers to run programmatic evaluations, providing a flexible and automated way to assess Large Language Models (LLMs) through code.

Note: Before running programmatic evaluations, you need to set up the Murnitur SDK with your API key. Detailed instructions on how to set up the SDK can be found here.

Example
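A minimal sketch of a programmatic run, using the Evaluation class and the knowledge_retention metric covered in the walkthrough below; the conversation data here is purely illustrative.

    from murnitur.annotation.eval import Evaluation

    # Illustrative multi-turn data; the walkthrough below explains each field.
    data = [
        {
            "input": "My name is Ada and I'm planning a trip to Lagos.",
            "output": "Nice to meet you, Ada! Lagos is a great choice.",
        },
        {
            "input": "Can you remind me where I said I was travelling to?",
            "output": "You mentioned that you are planning a trip to Lagos.",
        },
    ]

    evaluation = Evaluation()
    # Run the knowledge retention metric and store the run in Murnitur AI.
    results = evaluation.knowledge_retention(dataset=data, save_output=True, threshold=0.2)
    print(results)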

Each run is saved in the UI. You can view the results on the AI Evaluations page.

Knowledge Retention Evaluation

To evaluate how well your model retains information across interactions, use Murnitur’s knowledge_retention method. You can also set up a base threshold to define acceptable retention levels.

  1. Import the Required Module:

    from murnitur.annotation.eval import Evaluation
    
  2. Prepare Your Dataset:

    data = [
      {
        "input": "How often should I run webinars to gain traction for Murnitur?",
        "output": "Once a month to start. This gives you time to prepare quality content."
      },
      {
        "input": "What topics should I cover in these webinars?",
        "output": "Focus on unique features like tracing and monitoring, and AI safety with Murnitur Shield."
      }
    ]
    
  3. Set Up a Base Threshold (Optional):

    base_threshold = 0.2  # Define your acceptable retention level
    
  4. Run the Evaluation:

    evaluation = Evaluation()
    results = evaluation.knowledge_retention(dataset=data, save_output=True, threshold=base_threshold)
    print(results)
    
  5. Sample Output:

    {
      "score": 0.1,
      "reason": "The model retains information well with minimal loss."
    }
    

This method returns a JSON object with a score between 0 and 1 indicating the level of knowledge retention, where a score closer to 0 means better retention. By setting a base threshold, you can determine whether the retention meets your desired standards.
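Continuing the walkthrough above, and assuming the result is a dict shaped like the sample output, a simple pass/fail check against the base threshold could look like this:

    # `results` and `base_threshold` come from the walkthrough above.
    # Lower scores mean better retention, so a run passes if the score
    # stays at or below the chosen threshold.
    retention_ok = results["score"] <= base_threshold
    print(f"score={results['score']} passed={retention_ok} (threshold={base_threshold})")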

Creating a Dataset

Side Note: If you are creating a dataset from a dictionary, use the following data structure:
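A sketch of that structure, assuming the field names from the Metrics & Required Fields table below; include only the fields required by the metric you run, and treat the value types as illustrative.

    # Hypothetical records: field names follow the Metrics & Required Fields table below.
    data = [
        {
            "input": "...",              # prompt sent to the model
            "output": "...",             # the model's response
            "context": "...",            # source context (used by hallucination)
            "retrieval_context": "...",  # retrieved passages (faithfulness, context metrics)
            "ground_truth": "...",       # reference answer (context-precision)
        }
    ]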

Loading a Dataset

Use any of the following functions to load the dataset:

Extra Params

Extra parameters that can be passed to run_suite

  • save_output: A boolean. Set to False if you do not want to store the results in Murnitur AI.
  • async_mode: Indicates whether the requests should run asynchronously.
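As a sketch only: assuming run_suite is exposed on the Evaluation instance and accepts the dataset and metric selection alongside these parameters (every parameter name other than save_output and async_mode is an assumption), a call might look like this:

    from murnitur.annotation.eval import Evaluation

    data = [{"input": "...", "output": "..."}]

    evaluation = Evaluation()
    # Hypothetical suite call: the dataset/metrics parameter names are assumptions;
    # save_output and async_mode are the extra params documented above.
    results = evaluation.run_suite(
        dataset=data,
        metrics=["relevancy", "toxicity"],  # metric names from the list below
        save_output=False,                  # do not store this run in Murnitur AI
        async_mode=True,                    # run evaluation requests asynchronously
    )
    print(results)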

Evaluation Metrics

We currently support the following evaluation metrics:

  • hallucination: Measures the accuracy of the model’s output, identifying any fabricated or incorrect information.
  • faithfulness: Evaluates how well the model’s output adheres to the provided context or source material.
  • relevancy: Assesses the relevance of the model’s response to the input query or prompt.
  • bias: Detects any unfair or discriminatory tendencies in the model’s predictions.
  • context-relevancy: Determines how relevant the input or output is in relation to the given context.
  • context-precision: Measures the precision of the model’s output within the specific context provided.
  • toxicity: Identifies harmful or offensive language in the model’s responses.
  • summarization: Evaluates the quality and accuracy of summaries generated by the model.
  • pii: Detects the presence of personally identifiable information in the model’s output.

Metrics & Required Fields

Metric              Payload
hallucination       input, output, context
faithfulness        input, output, retrieval_context
relevancy           input, output
bias                input, output
context-relevancy   input, output, retrieval_context
context-precision   input, output, ground_truth, retrieval_context
toxicity            input, output
summarization       input, output
pii                 output (use murnitur-shield to detect PII from the input)