Evaluate LLMs - MLflow Evals, Auto Eval

Using LiteLLM with MLflow

MLflow provides an API, mlflow.evaluate(), to help evaluate your LLMs: https://mlflow.org.cn/docs/latest/llms/llm-evaluate/index.html

Prerequisites

pip install litellm

pip install mlflow

Step 1: Start the LiteLLM proxy on the CLI

LiteLLM lets you create an OpenAI-compatible server for all supported LLMs. See the LiteLLM proxy documentation for more information.

$ litellm --model huggingface/bigcode/starcoder

#INFO: Proxy running on http://0.0.0.0:8000

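Here is how to create a proxy for other supported LLMs. Set the provider's credentials as environment variables first, then start the proxy with the matching model name.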

$ export AWS_ACCESS_KEY_ID=""
$ export AWS_REGION_NAME="" # e.g. us-west-2
$ export AWS_SECRET_ACCESS_KEY=""

$ litellm --model bedrock/anthropic.claude-v2

$ export HUGGINGFACE_API_KEY=my-api-key #[OPTIONAL]

$ litellm --model huggingface/<your model name> --api_base https://k58ory32yinf1ly0.us-east-1.aws.endpoints.huggingface.cloud

$ export ANTHROPIC_API_KEY=my-api-key

$ litellm --model claude-instant-1

Assuming you are running vllm locally

$ litellm --model vllm/facebook/opt-125m

$ litellm --model openai/<model_name> --api_base <your-api-base>

$ export TOGETHERAI_API_KEY=my-api-key

$ litellm --model together_ai/lmsys/vicuna-13b-v1.5-16k

$ export REPLICATE_API_KEY=my-api-key

$ litellm \
  --model replicate/meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3

$ litellm --model petals/meta-llama/Llama-2-70b-chat-hf

$ export PALM_API_KEY=my-palm-key

$ litellm --model palm/chat-bison

$ export AZURE_API_KEY=my-api-key
$ export AZURE_API_BASE=my-api-base

$ litellm --model azure/my-deployment-name

$ export AI21_API_KEY=my-api-key

$ litellm --model j2-light

$ export COHERE_API_KEY=my-api-key

$ litellm --model command-nightly
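
Once the proxy is running, any OpenAI-compatible client can talk to it. As a minimal sketch, the snippet below builds a chat completion request for the proxy using only the Python standard library. It assumes the proxy started by the first command is listening on http://0.0.0.0:8000 and serves the OpenAI-style /chat/completions route; build_chat_request and send_chat_request are hypothetical helper names, not part of LiteLLM.

```python
import json
import urllib.request

PROXY_BASE = "http://0.0.0.0:8000"  # assumed: the address printed by the proxy at startup


def build_chat_request(prompt, model="huggingface/bigcode/starcoder", base_url=PROXY_BASE):
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",  # assumed OpenAI-compatible route
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Placeholder key: the provider keys are set on the proxy itself.
            "Authorization": "Bearer anything",
        },
        method="POST",
    )


def send_chat_request(request, timeout=30):
    """Send the request and return the decoded JSON response (needs a running proxy)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_chat_request("Write a hello world function in Python")
print(req.full_url)
# Uncomment once the proxy from Step 1 is running:
# print(send_chat_request(req))
```

Building the request separately from sending it makes it easy to inspect the URL and payload before the proxy is up.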

Step 2: Run MLflow

Before running the evaluation, set openai.api_base to the litellm proxy from Step 1

openai.api_base = "http://0.0.0.0:8000"

import mlflow
import openai
import pandas as pd

openai.api_key = "anything"             # this can be anything, we set the key on the proxy
openai.api_base = "http://0.0.0.0:8000" # set api base to the proxy from step 1

eval_data = pd.DataFrame(
    {
        "inputs": [
            "What is the largest country",
            "What is the weather in sf?",
        ],
        "ground_truth": [
            "India is a large country",
            "It's cold in SF today"
        ],
    }
)

with mlflow.start_run() as run:
    system_prompt = "Answer the following question in two sentences"
    logged_model_info = mlflow.openai.log_model(
        model="gpt-3.5",
        task=openai.ChatCompletion,
        artifact_path="model",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "{question}"},
        ],
    )

    # Use predefined question-answering metrics to evaluate our model.
    results = mlflow.evaluate(
        logged_model_info.model_uri,
        eval_data,
        targets="ground_truth",
        model_type="question-answering",
    )
    print(f"See aggregated evaluation results below: \n{results.metrics}")

    # Evaluation result for each data record is available in `results.tables`.
    eval_table = results.tables["eval_results_table"]
    print(f"See evaluation table below: \n{eval_table}")

MLflow output

{'toxicity/v1/mean': 0.00014476531214313582, 'toxicity/v1/variance': 2.5759661361262862e-12, 'toxicity/v1/p90': 0.00014604929747292773, 'toxicity/v1/ratio': 0.0, 'exact_match/v1': 0.0}
Downloading artifacts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1890.18it/s]
See evaluation table below:
                        inputs              ground_truth                                            outputs  token_count  toxicity/v1/score
0  What is the largest country  India is a large country   Russia is the largest country in the world in...           14           0.000146
1   What is the weather in sf?     It's cold in SF today   I'm sorry, I cannot provide the current weath...           36           0.000143
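
The aggregated metrics are summaries of the per-row values in the evaluation table. The sketch below recomputes them by hand from the two rounded toxicity scores in the table, so the results only approximate the logged values. It assumes that variance is the population variance, that p90 is a linearly interpolated 90th percentile, that ratio is the share of rows scoring above a 0.5 threshold, and that exact_match is the share of outputs identical to the ground truth. These are assumptions about how the aggregation works, not taken from the MLflow source; percentile, summarize_scores, and exact_match are hypothetical helper names.

```python
import statistics


def percentile(values, q):
    """Linearly interpolated percentile, with q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def summarize_scores(scores, threshold=0.5):
    """Aggregate per-row scores into mean / variance / p90 / ratio."""
    return {
        "mean": statistics.fmean(scores),
        "variance": statistics.pvariance(scores),  # population variance (assumed)
        "p90": percentile(scores, 90),
        # Share of rows above the threshold; the 0.5 cutoff is an assumption.
        "ratio": sum(s > threshold for s in scores) / len(scores),
    }


def exact_match(outputs, targets):
    """Share of outputs that equal their target string exactly."""
    return sum(o == t for o, t in zip(outputs, targets)) / len(targets)


# Rounded per-row toxicity scores copied from the evaluation table.
toxicity_scores = [0.000146, 0.000143]
summary = summarize_scores(toxicity_scores)
print(summary)

# Outputs are truncated in the table, but neither can equal its ground truth.
outputs = [
    "Russia is the largest country in the world in...",
    "I'm sorry, I cannot provide the current weath...",
]
targets = ["India is a large country", "It's cold in SF today"]
print(exact_match(outputs, targets))
```

Exact match is a strict string comparison, which is why it is 0.0 here even though the first answer is a reasonable reply to the question: the wording differs from the ground truth.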

Using LiteLLM with AutoEval

AutoEvals is a tool for quickly and easily evaluating AI model outputs using best practices. https://github.com/braintrustdata/autoevals

Prerequisites

pip install litellm

pip install autoevals

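Quick Start

In this code example, we use the Factuality() evaluator from autoevals.llm to test whether an output is factual when compared with an original (expected) value.

Autoevals uses gpt-3.5-turbo / gpt-4-turbo by default to evaluate responses

See the autoevals documentation for the supported evaluators: translation, summarization, security evaluators, and more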

# auto evals imports
from autoevals.llm import Factuality
###################
import litellm

# litellm completion call
question = "which country has the highest population"
response = litellm.completion(
    model = "gpt-3.5-turbo",
    messages = [
        {
            "role": "user",
            "content": question
        }
    ],
)
print(response)
# use the auto eval Factuality() evaluator
evaluator = Factuality()
result = evaluator(
    output=response.choices[0]["message"]["content"],       # response from litellm.completion()
    expected="India",                                       # expected output
    input=question                                          # question passed to litellm.completion
)

print(result)

Eval output - from AutoEvals

Score(
    name='Factuality', 
    score=0, 
    metadata=
        {'rationale': "The expert answer is 'India'.\nThe submitted answer is 'As of 2021, China has the highest population in the world with an estimated 1.4 billion people.'\nThe submitted answer mentions China as the country with the highest population, while the expert answer mentions India.\nThere is a disagreement between the submitted answer and the expert answer.", 
        'choice': 'D'
        }, 
    error=None
)
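
A result like this one can also be checked programmatically. The sketch below summarizes a list of evaluator results by mean score and collects the rationale of each result that falls below a passing threshold. EvalResult is a hypothetical stand-in that mirrors only the name, score, and metadata fields printed in the output, so the example runs without an API key; summarize_results and the 0.5 threshold are likewise illustrative, and the second result is made up to show a passing case.

```python
from dataclasses import dataclass, field


@dataclass
class EvalResult:
    """Hypothetical stand-in for the fields shown in the AutoEvals output."""
    name: str
    score: float
    metadata: dict = field(default_factory=dict)


def summarize_results(results, passing_score=0.5):
    """Return the mean score and the rationales of results below passing_score."""
    failures = [
        r.metadata.get("rationale", "")
        for r in results
        if r.score is None or r.score < passing_score
    ]
    scored = [r.score for r in results if r.score is not None]
    mean_score = sum(scored) / len(scored) if scored else None
    return {"mean_score": mean_score, "failures": failures}


results = [
    # Mirrors the Factuality result printed in the output.
    EvalResult(
        "Factuality",
        0,
        {
            "rationale": "There is a disagreement between the submitted answer and the expert answer.",
            "choice": "D",
        },
    ),
    # Hypothetical passing result, included only for illustration.
    EvalResult("Factuality", 1, {"rationale": "Hypothetical passing result."}),
]

report = summarize_results(results)
print(report["mean_score"])
for rationale in report["failures"]:
    print("FAILED:", rationale)
```

With a real evaluator, the objects returned by evaluator(...) would take the place of the EvalResult instances, provided they expose the same three fields.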
