Developers can create custom evaluations in Murnitur AI to manually assess Large Language Models (LLMs) based on real-world interactions. This human evaluation process is essential for ensuring LLM applications meet quality and reliability standards, providing insights that automated metrics may miss.

Murnitur AI allows developers to design tailored evaluation criteria to fit specific use cases and objectives. Criteria can take the form of simple boolean, list, or number range evaluations, offering flexibility in grading LLM performance; number range evaluations use a 1-to-5 rating scale, streamlining the assessment process.

Types of Human Evaluation

  • Boolean Evaluation: Allows developers to assess whether the LLM performed as expected based on predefined criteria, with responses limited to true or false.

  • List Evaluation: Enables developers to create a list of predefined options for grading LLM performance, providing more flexibility in evaluation criteria.

  • Number Range Evaluation: Allows developers to assign a numerical score to rate the performance of the LLM on a scale from 1 to 5, facilitating more granular assessment.
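
To make these three types concrete, here is a minimal sketch of how a reviewer's grades might be modeled. The class and field names (BooleanEvaluation, ListEvaluation, NumberRangeEvaluation) are illustrative assumptions, not Murnitur AI's actual SDK or API surface:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical models of the three human-evaluation types described above.
# Names and fields are illustrative assumptions, not Murnitur AI's SDK.

@dataclass
class BooleanEvaluation:
    """True/false verdict against a predefined criterion."""
    criterion: str
    verdict: bool

@dataclass
class ListEvaluation:
    """Grade chosen from a predefined list of options."""
    criterion: str
    options: List[str]
    choice: str

    def __post_init__(self) -> None:
        # Constrain the grade to the predefined options.
        if self.choice not in self.options:
            raise ValueError(f"choice must be one of {self.options}")

@dataclass
class NumberRangeEvaluation:
    """Numerical score on a 1-to-5 scale for more granular grading."""
    criterion: str
    score: int

    def __post_init__(self) -> None:
        # Constrain the score to the documented 1-to-5 range.
        if not 1 <= self.score <= 5:
            raise ValueError("score must be between 1 and 5")

# Example: a reviewer grading one LLM response with each type.
grades = [
    BooleanEvaluation(criterion="Answered the user's question", verdict=True),
    ListEvaluation(
        criterion="Tone",
        options=["formal", "neutral", "casual"],
        choice="neutral",
    ),
    NumberRangeEvaluation(criterion="Factual accuracy", score=4),
]
```

Validating the choice and score at construction time mirrors the constraints described above: boolean verdicts are limited to true or false, list grades to the predefined options, and numerical scores to the 1-to-5 range.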

Human evaluation in Murnitur AI is a valuable tool for developers to validate and refine their LLM applications, ensuring they meet performance, accuracy, and reliability standards.