# Venice Chat Benchmark
Benchmark [Venice.ai](https://venice.ai/) chat completion models with complex `tool_choice` payloads. Runs N iterations, captures detailed timing and reliability metrics, and optionally generates a 4K infographic summary.
## Features
- **Stress testing** -- run configurable iterations against any Venice chat model
- **Tool choice analysis** -- measures tool call rate, distribution across 7 defined tools, and JSON argument validity
- **Timing statistics** -- average, median, min, max, standard deviation, P90, P95, and P99
- **Error categorization** -- groups failures by type (HTTP, timeout, connection, JSON decode)
- **Token tracking** -- per-run and aggregate prompt, completion, and total token usage
- **Finish reason tracking** -- counts of `tool_calls`, `stop`, and other finish reasons
- **4K infographic** -- optional visual summary generated via the `venice-image-gen` skill
- **Intermediate saves** -- results are written to disk after every run so data is preserved if interrupted
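The timing statistics listed above can be reproduced with Python's standard `statistics` module. A minimal sketch — the nearest-rank percentile method here is an assumption; the actual script may interpolate differently:

```python
import statistics

def timing_stats(durations: list[float]) -> dict:
    """Summarize per-run durations the way the benchmark reports them."""
    ordered = sorted(durations)

    def percentile(p: float) -> float:
        # Nearest-rank percentile; an assumption about the script's method.
        idx = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
        return ordered[idx]

    return {
        "avg": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "stdev": statistics.stdev(ordered) if len(ordered) > 1 else 0.0,
        "p90": percentile(90),
        "p95": percentile(95),
        "p99": percentile(99),
    }
```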
## Prerequisites
```bash
pip install requests
export VENICE_API_KEY="your_venice_api_key"
```
For infographic generation, the `venice-image-gen` skill must be available.
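Since a long benchmark run wastes time if the key is missing, it can help to fail fast up front. A minimal sketch (the helper name is hypothetical, not part of the script):

```python
import os
import sys

def require_api_key() -> str:
    """Exit early with a clear message if VENICE_API_KEY is missing."""
    key = os.environ.get("VENICE_API_KEY")
    if not key:
        sys.exit("VENICE_API_KEY is not set; export it before benchmarking.")
    return key
```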
## Usage
### Basic benchmark (50 runs, default model)
```bash
python scripts/benchmark.py --model minimax-m27 --runs 50 --output ./chat_benchmark
```
### Custom run count and timeout
```bash
python scripts/benchmark.py --model minimax-m27 --runs 100 --timeout 60 --output ./chat_benchmark
```
### With infographic generation
```bash
python scripts/benchmark.py --model minimax-m27 --runs 50 --output ./chat_benchmark --infographic
```
## Options
| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--model` | -- | `minimax-m27` | Model ID to benchmark |
| `--runs` | -- | `50` | Number of test iterations |
| `--timeout` | -- | `120` | Request timeout in seconds |
| `--output` | -- | `~/chat_benchmark` | Output directory for results |
| `--infographic` | -- | off | Generate a 4K infographic summary when done |
## Test Payload
The benchmark sends the same fixed travel-planning scenario on every run:
- **System prompt** enforces tool-only responses (no plain text)
- **7 function tools** defined: `set_travel_dates`, `set_secondary_destinations`, `set_traveler_info`, `set_travel_priorities`, `set_budget`, `present_choices`, `suggest_primary_destinations`
- **User message** contains multiple extractable data points (dates, destinations, interests, budget)
- **`tool_choice: auto`** lets the model decide which tool(s) to call
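The payload described above can be sketched as follows. The tool names come from the list in this section; the message text and parameter schemas are placeholders, not the script's exact prompt:

```python
TOOL_NAMES = [
    "set_travel_dates", "set_secondary_destinations", "set_traveler_info",
    "set_travel_priorities", "set_budget", "present_choices",
    "suggest_primary_destinations",
]

def build_payload(model: str) -> dict:
    """Assemble a chat-completion request body with tool_choice: auto."""
    tools = [
        {
            "type": "function",
            "function": {
                "name": name,
                # Placeholder schema; the real script defines full parameters.
                "parameters": {"type": "object", "properties": {}},
            },
        }
        for name in TOOL_NAMES
    ]
    return {
        "model": model,
        "messages": [
            # Placeholder text; the real prompts are longer and data-rich.
            {"role": "system", "content": "Respond only with tool calls."},
            {"role": "user", "content": "Plan a trip ..."},
        ],
        "tools": tools,
        "tool_choice": "auto",
    }
```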
## Python Import
```python
from benchmark import run_benchmark

results = run_benchmark(
    api_key="your_key",
    model="minimax-m27",
    num_runs=10,
    output_dir="./benchmark_output",
    timeout=120,
)
print(results["stats"]["success_rate"])
```
## Response Format
The benchmark writes `benchmark_results.json` to the output directory:
```json
{
  "metadata": {
    "model": "minimax-m27",
    "num_runs": 50,
    "timeout": 120,
    "num_tools": 7,
    "tool_names": ["set_travel_dates", "..."],
    "tool_choice": "auto",
    "start_time": "2026-03-20T12:00:00",
    "end_time": "2026-03-20T12:15:00"
  },
  "runs": [
    {
      "run": 1,
      "success": true,
      "duration_seconds": 2.451,
      "finish_reason": "tool_calls",
      "has_tool_calls": true,
      "tool_calls": [{"name": "set_travel_dates", "args_valid_json": true}],
      "usage": {"prompt_tokens": 850, "completion_tokens": 120, "total_tokens": 970}
    }
  ],
  "stats": {
    "total_runs": 50,
    "success_rate": 98.0,
    "tool_call_rate": 95.0,
    "json_validity_rate": 100.0,
    "timing": {"avg": 2.5, "median": 2.3, "min": 1.1, "max": 5.2, "stdev": 0.8},
    "tool_call_distribution": {"set_travel_dates": 40, "set_budget": 8},
    "token_usage": {"avg_total_tokens": 970, "total_all_tokens": 48500}
  }
}
```
With the `--infographic` flag, a `benchmark_infographic.png` file is also generated.
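Because results are saved after every run, the file can be summarized offline at any point. A minimal sketch against the schema above — the field names mirror the example output, while the `summarize` helper itself is hypothetical:

```python
import json
from collections import Counter
from pathlib import Path

def summarize(results_path: str) -> dict:
    """Load benchmark_results.json and tally finish reasons per run."""
    data = json.loads(Path(results_path).read_text())
    reasons = Counter(run.get("finish_reason") for run in data["runs"])
    return {
        "model": data["metadata"]["model"],
        "success_rate": data["stats"]["success_rate"],
        "finish_reasons": dict(reasons),
    }
```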
## Environment Variables
| Variable | Required | Description |
|----------|----------|-------------|
| `VENICE_API_KEY` | Yes | Venice.ai API key |