# Programmatic Evaluation

Murnitur AI allows developers to run programmatic evaluations, providing a flexible and automated way to assess Large Language Models (LLMs) through code.

**Note**: Before running programmatic evaluations, set up the Murnitur SDK with your API key. See the [installation guide](/installation) for setup instructions.

<iframe width="650" height="400" src="https://www.loom.com/embed/fa1fd8667d784c5b9e815ec01bd66a1a?sid=16263b86-00a3-4aaf-855d-ddde91fabbdc" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen />

## Example

<CodeGroup>
  ```python python theme={null}

  import murnitur
  from murnitur.loaders import Loader
  from murnitur.annotation.eval import Evaluation

  # Set the Murnitur API key and the OpenAI key passed to the evaluation
  murnitur.set_api_key("mt-ey...")
  openai_key = "sk-..."

  # Named `evaluation` to avoid shadowing Python's built-in eval()
  evaluation = Evaluation(openai_key=openai_key, murnitur_key=murnitur.get_api_key())

  # Load the dataset from a CSV file
  data = Loader.load_csv("https://murnitur.github.io/murnix-kube/sample-dataset.csv")

  # Run the hallucination and PII metrics over the dataset
  results = evaluation.run_suite(metrics=["hallucination", "pii"], dataset=data)

  print(results)

  ```
</CodeGroup>

Each run is saved in the UI. You can view the results on the **AI Evaluations** page.

## Knowledge Retention Evaluation

To evaluate how well your model retains information across interactions, use Murnitur's `knowledge_retention` method. You can also set up a base threshold to define acceptable retention levels.

1. **Import the Required Module**:
   ```python theme={null}
   from murnitur.annotation.eval import Evaluation
   ```

2. **Prepare Your Dataset**:
   ```python theme={null}
   data = [
     {
       "input": "How often should I run webinars to gain traction for Murnitur?",
       "output": "Once a month to start. This gives you time to prepare quality content."
     },
     {
       "input": "What topics should I cover in these webinars?",
       "output": "Focus on unique features like tracing and monitoring, and AI safety with Murnitur Shield."
     }
   ]
   ```

3. **Set Up a Base Threshold (Optional)**:
   ```python theme={null}
   base_threshold = 0.2  # Define your acceptable retention level
   ```

4. **Run the Evaluation**:
   ```python theme={null}
   # Named `evaluation` to avoid shadowing Python's built-in eval()
   evaluation = Evaluation()
   results = evaluation.knowledge_retention(dataset=data, save_output=True, threshold=base_threshold)
   print(results)
   ```

5. **Sample Output**:
   ```json theme={null}
   {
     "score": 0.1,
     "reason": "The model retains information well with minimal loss."
   }
   ```

This method returns a JSON object with a `score` between 0 and 1 and a `reason` that explains it. Lower scores mean better retention. Comparing the score against your base threshold tells you whether retention meets your standard.
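The threshold check can also be done in code. The sketch below assumes the result is a dict shaped like the sample output, and that a score at or below the threshold counts as acceptable because lower scores are better. `meets_retention_threshold` is a hypothetical helper, not part of the Murnitur SDK.

```python
def meets_retention_threshold(result: dict, threshold: float) -> bool:
    """Return True when the retention score is at or below the threshold.

    Assumption: lower scores mean better retention, so a passing score
    is one that does not exceed the threshold.
    """
    score = result["score"]
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    return score <= threshold


# Values taken from the sample output and the base threshold in the steps above.
sample_result = {
    "score": 0.1,
    "reason": "The model retains information well with minimal loss.",
}
base_threshold = 0.2

passed = meets_retention_threshold(sample_result, base_threshold)
print("retention acceptable:", passed)
```

If the SDK returns the result as a JSON string rather than a dict, parse it with `json.loads` before passing it to the helper.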

## Creating dataset

***Side Note***: If you are building a dataset in code, use a list of dictionaries with the following structure:

<CodeGroup>
  ```python python theme={null}
  dataset = [
      {
          "input": "",
          "output": "",
          "ground_truth": "",
          "context": [""],
          "retrieval_context": [""],
      }
  ]

  ```
</CodeGroup>

## Loading dataset

Use any of the following functions to load the dataset:

<CodeGroup>
  ```python python theme={null}
  from murnitur.loaders import Loader

  dataset_from_json = Loader.load_json(path_to_json_file)

  dataset_from_csv = Loader.load_csv(path_to_csv_document)

  ```
</CodeGroup>
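As a rough sketch, a JSON file for `Loader.load_json` could be written with the standard library as shown below. It assumes the loader expects a top-level array of records that use the field names from the dataset structure on this page; that layout is an assumption, not something confirmed here. The file name `dataset.json` and the temporary directory are illustrative only.

```python
import json
import tempfile
from pathlib import Path

# Records reuse the field names from the dataset structure on this page.
records = [
    {
        "input": "How often should I run webinars to gain traction for Murnitur?",
        "output": "Once a month to start. This gives you time to prepare quality content.",
    }
]

# Write the records to a JSON file in a temporary directory.
path = Path(tempfile.mkdtemp()) / "dataset.json"
path.write_text(json.dumps(records, indent=2), encoding="utf-8")

# With the SDK installed, the file could then be loaded with:
#   from murnitur.loaders import Loader
#   dataset_from_json = Loader.load_json(str(path))
print(path)
```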

## Extra Params

`run_suite` also accepts the following extra parameters:

* `save_output`: A boolean. Set to `False` if you do not want to store the results in Murnitur AI.
* `async_mode`: Indicates whether the requests should run asynchronously.
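A minimal sketch of passing both parameters is shown below. `run_without_saving` is a hypothetical wrapper, and treating `async_mode` as a boolean is an assumption based on the description above.

```python
def run_without_saving(evaluation, dataset, metrics):
    """Run a suite asynchronously without storing the results in Murnitur AI.

    `evaluation` is expected to be an Evaluation instance exposing run_suite().
    """
    return evaluation.run_suite(
        metrics=metrics,
        dataset=dataset,
        save_output=False,  # do not store the results in Murnitur AI
        async_mode=True,    # assumption: a boolean flag
    )
```

With an `Evaluation` instance and a dataset like those in the Example section, the call would look like `run_without_saving(evaluation, data, ["hallucination", "pii"])`.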

## Evaluation Metrics

We currently support the following evaluation metrics:

* `hallucination`: Measures the accuracy of the model's output, identifying any fabricated or incorrect information.
* `faithfulness`: Evaluates how well the model's output adheres to the provided context or source material.
* `relevancy`: Assesses the relevance of the model's response to the input query or prompt.
* `bias`: Detects any unfair or discriminatory tendencies in the model's predictions.
* `context-relevancy`: Determines how relevant the input or output is in relation to the given context.
* `context-precision`: Measures the precision of the model's output within the specific context provided.
* `toxicity`: Identifies harmful or offensive language in the model's responses.
* `summarization`: Evaluates the quality and accuracy of summaries generated by the model.
* `pii`: Detects the presence of personally identifiable information in the model's output.

## Metrics & Required Fields

| Metric              | Required fields                                                             |
| ------------------- | --------------------------------------------------------------------------- |
| `hallucination`     | `input`, `output`, `context`                                                |
| `faithfulness`      | `input`, `output`, `retrieval_context`                                      |
| `relevancy`         | `input`, `output`                                                           |
| `bias`              | `input`, `output`                                                           |
| `context-relevancy` | `input`, `output`, `retrieval_context`                                      |
| `context-precision` | `input`, `output`, `ground_truth`, `retrieval_context`                      |
| `toxicity`          | `input`, `output`                                                           |
| `summarization`     | `input`, `output`                                                           |
| `pii`               | `output` (use [murnitur-shield](/murnitur-shield) to detect PII in the input) |
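Before running a suite, it can help to check that every row carries the fields its metrics need. The sketch below copies the field lists from the table above. `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, not part of the Murnitur SDK, and treating an empty value as missing is an assumption.

```python
# Required fields per metric, copied from the table above.
REQUIRED_FIELDS = {
    "hallucination": ["input", "output", "context"],
    "faithfulness": ["input", "output", "retrieval_context"],
    "relevancy": ["input", "output"],
    "bias": ["input", "output"],
    "context-relevancy": ["input", "output", "retrieval_context"],
    "context-precision": ["input", "output", "ground_truth", "retrieval_context"],
    "toxicity": ["input", "output"],
    "summarization": ["input", "output"],
    "pii": ["output"],
}


def missing_fields(dataset, metrics):
    """Return (row_index, metric, field) for each absent or empty required field."""
    problems = []
    for index, row in enumerate(dataset):
        for metric in metrics:
            for field in REQUIRED_FIELDS[metric]:
                if not row.get(field):
                    problems.append((index, metric, field))
    return problems


dataset = [
    {
        "input": "How often should I run webinars to gain traction for Murnitur?",
        "output": "Once a month to start.",
    }
]

problems = missing_fields(dataset, ["relevancy", "hallucination"])
print(problems)
```

Here the row satisfies `relevancy` but lacks `context`, so the only reported problem is `(0, 'hallucination', 'context')`.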
