Murnitur AI allows developers to run programmatic evaluations, providing a flexible and automated way to assess Large Language Models (LLMs) through code.

Note: Before running programmatic evaluations, you need to set up the Murnitur SDK with your API key. Detailed instructions on how to set up the SDK can be found here.

Example
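A minimal sketch of a programmatic run, using the Evaluation class and the knowledge_retention metric covered in the walkthrough below; the conversation data here is purely illustrative.

    from murnitur.annotation.eval import Evaluation

    # Illustrative multi-turn data; the walkthrough below explains each field.
    data = [
        {
            "input": "My name is Ada and I'm planning a trip to Lagos.",
            "output": "Nice to meet you, Ada! Lagos is a great choice.",
        },
        {
            "input": "Can you remind me where I said I was travelling to?",
            "output": "You mentioned that you are planning a trip to Lagos.",
        },
    ]

    evaluation = Evaluation()
    # Run the knowledge retention metric and store the run in Murnitur AI.
    results = evaluation.knowledge_retention(dataset=data, save_output=True, threshold=0.2)
    print(results)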

Each run is saved in the UI. You can view the results on the AI Evaluations page.

Knowledge Retention Evaluation

To evaluate how well your model retains information across interactions, use Murnitur’s knowledge_retention method. You can also set up a base threshold to define acceptable retention levels.

  1. Import the Required Module:

    from murnitur.annotation.eval import Evaluation
    
  2. Prepare Your Dataset:

    data = [
      {
        "input": "How often should I run webinars to gain traction for Murnitur?",
        "output": "Once a month to start. This gives you time to prepare quality content."
      },
      {
        "input": "What topics should I cover in these webinars?",
        "output": "Focus on unique features like tracing and monitoring, and AI safety with Murnitur Shield."
      }
    ]
    
  3. Set Up a Base Threshold (Optional):

    base_threshold = 0.2  # Define your acceptable retention level
    
  4. Run the Evaluation:

    evaluation = Evaluation()
    results = evaluation.knowledge_retention(dataset=data, save_output=True, threshold=base_threshold)
    print(results)
    
  5. Sample Output:

    {
      "score": 0.1,
      "reason": "The model retains information well with minimal loss."
    }
    

This method returns a JSON object with a score between 0 and 1 indicating the level of knowledge retention, where a score closer to 0 means better retention. By setting a base threshold, you can determine whether the retention meets your desired standards.
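Continuing the walkthrough above, and assuming the result is a dict shaped like the sample output, a simple pass/fail check against the base threshold could look like this:

    # `results` and `base_threshold` come from the walkthrough above.
    # Lower scores mean better retention, so a run passes if the score
    # stays at or below the chosen threshold.
    retention_ok = results["score"] <= base_threshold
    print(f"score={results['score']} passed={retention_ok} (threshold={base_threshold})")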

Creating a Dataset

Side Note: If you are creating a dataset from a dictionary, use the following data structure:
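A sketch of that structure, assuming the field names from the Metrics & Required Fields table below; include only the fields required by the metric you run, and treat the value types as illustrative.

    # Hypothetical records: field names follow the Metrics & Required Fields table below.
    data = [
        {
            "input": "...",              # prompt sent to the model
            "output": "...",             # the model's response
            "context": "...",            # source context (used by hallucination)
            "retrieval_context": "...",  # retrieved passages (faithfulness, context metrics)
            "ground_truth": "...",       # reference answer (context-precision)
        }
    ]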

Loading a Dataset

Use any of the following functions to load the dataset:

Extra Params

Extra parameters that can be passed to run_suite

  • save_output: A boolean. Set to False if you do not want to store the results in Murnitur AI.
  • async_mode: Indicates whether the requests should run asynchronously.
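As a sketch only: assuming run_suite is exposed on the Evaluation instance and accepts the dataset and metric selection alongside these parameters (every parameter name other than save_output and async_mode is an assumption), a call might look like this:

    from murnitur.annotation.eval import Evaluation

    data = [{"input": "...", "output": "..."}]

    evaluation = Evaluation()
    # Hypothetical suite call: the dataset/metrics parameter names are assumptions;
    # save_output and async_mode are the extra params documented above.
    results = evaluation.run_suite(
        dataset=data,
        metrics=["relevancy", "toxicity"],  # metric names from the list below
        save_output=False,                  # do not store this run in Murnitur AI
        async_mode=True,                    # run evaluation requests asynchronously
    )
    print(results)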

Evaluation Metrics

We currently support the following evaluation metrics:

  • hallucination: Measures the accuracy of the model’s output, identifying any fabricated or incorrect information.
  • faithfulness: Evaluates how well the model’s output adheres to the provided context or source material.
  • relevancy: Assesses the relevance of the model’s response to the input query or prompt.
  • bias: Detects any unfair or discriminatory tendencies in the model’s predictions.
  • context-relevancy: Determines how relevant the input or output is in relation to the given context.
  • context-precision: Measures the precision of the model’s output within the specific context provided.
  • toxicity: Identifies harmful or offensive language in the model’s responses.
  • summarization: Evaluates the quality and accuracy of summaries generated by the model.
  • pii: Detects the presence of personally identifiable information in the model’s output.

Metrics & Required Fields

Metric              Payload
hallucination       input, output, context
faithfulness        input, output, retrieval_context
relevancy           input, output
bias                input, output
context-relevancy   input, output, retrieval_context
context-precision   input, output, ground_truth, retrieval_context
toxicity            input, output
summarization       input, output
pii                 output (use murnitur-shield to detect PII from the input)