# Venice Chat Benchmark
Benchmark Venice.ai chat completion models with complex tool_choice payloads. Runs N iterations, captures detailed timing and reliability metrics, and optionally generates a 4K infographic summary.
## Features
- Stress testing -- run configurable iterations against any Venice chat model
- Tool choice analysis -- measures tool call rate, distribution across 7 defined tools, and JSON argument validity
- Timing statistics -- average, median, min, max, standard deviation, P90, P95, and P99
- Error categorization -- groups failures by type (HTTP, timeout, connection, JSON decode)
- Token tracking -- per-run and aggregate prompt, completion, and total token usage
- Finish reason tracking -- counts of `tool_calls`, `stop`, and other finish reasons
- 4K infographic -- optional visual summary generated via the `venice-image-gen` skill
- Intermediate saves -- results are written to disk after every run so data is preserved if interrupted
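The error-categorization feature groups failures into the four buckets above. A minimal sketch of how such a mapping might look, assuming the `requests` library (the benchmark script's actual internals may differ, and `categorize_error` is a hypothetical helper name):

```python
import json
import requests

def categorize_error(exc):
    """Map an exception from a chat-completion request to a coarse category.

    Illustrative only: mirrors the four error groups the benchmark reports
    (timeout, connection, HTTP, JSON decode), with a catch-all for the rest.
    """
    if isinstance(exc, requests.exceptions.Timeout):
        return "timeout"
    if isinstance(exc, requests.exceptions.ConnectionError):
        return "connection"
    if isinstance(exc, requests.exceptions.HTTPError):
        return "http"
    if isinstance(exc, json.JSONDecodeError):
        return "json_decode"
    return "other"
```

Note that `requests.exceptions.ConnectTimeout` inherits from both `Timeout` and `ConnectionError`, so the timeout check comes first to classify it consistently.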
## Prerequisites

```bash
pip install requests
export VENICE_API_KEY="your_venice_api_key"
```
For infographic generation, the `venice-image-gen` skill must be available.
## Usage

Basic benchmark (50 runs, default model):

```bash
python scripts/benchmark.py --model minimax-m27 --runs 50 --output ./chat_benchmark
```

Custom run count and timeout:

```bash
python scripts/benchmark.py --model minimax-m27 --runs 100 --timeout 60 --output ./chat_benchmark
```

With infographic generation:

```bash
python scripts/benchmark.py --model minimax-m27 --runs 50 --output ./chat_benchmark --infographic
```
## Options

| Option | Short | Default | Description |
|---|---|---|---|
| `--model` | -- | `minimax-m27` | Model ID to benchmark |
| `--runs` | -- | `50` | Number of test iterations |
| `--timeout` | -- | `120` | Request timeout in seconds |
| `--output` | -- | `~/chat_benchmark` | Output directory for results |
| `--infographic` | -- | off | Generate a 4K infographic summary when done |
## Test Payload

The benchmark sends a fixed travel planning scenario to every run:
- System prompt enforces tool-only responses (no plain text)
- 7 function tools defined: `set_travel_dates`, `set_secondary_destinations`, `set_traveler_info`, `set_travel_priorities`, `set_budget`, `present_choices`, `suggest_primary_destinations`
- User message contains multiple extractable data points (dates, destinations, interests, budget)
- `tool_choice: "auto"` lets the model decide which tool(s) to call
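The payload described above can be sketched as follows. The tool names match the skill's documentation, but the parameter schemas and message wording here are illustrative placeholders, not the script's exact definitions:

```python
def build_payload(model):
    """Build a chat-completions payload like the one the benchmark sends.

    Hypothetical sketch: tool names are real, schemas are placeholders.
    """
    tool_names = [
        "set_travel_dates", "set_secondary_destinations", "set_traveler_info",
        "set_travel_priorities", "set_budget", "present_choices",
        "suggest_primary_destinations",
    ]
    tools = [
        {
            "type": "function",
            "function": {
                "name": name,
                "description": f"Placeholder description for {name}",
                "parameters": {"type": "object", "properties": {}},
            },
        }
        for name in tool_names
    ]
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Respond only with tool calls, never plain text."},
            {"role": "user",
             "content": "Plan a trip with dates, destinations, interests, and a budget."},
        ],
        "tools": tools,
        # "auto" lets the model pick which of the 7 tools to call, if any.
        "tool_choice": "auto",
    }
```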
## Python Import

```python
from benchmark import run_benchmark

results = run_benchmark(
    api_key="your_key",
    model="minimax-m27",
    num_runs=10,
    output_dir="./benchmark_output",
    timeout=120,
)
print(results["stats"]["success_rate"])
```
## Response Format

The benchmark writes `benchmark_results.json` to the output directory:

```json
{
  "metadata": {
    "model": "minimax-m27",
    "num_runs": 50,
    "timeout": 120,
    "num_tools": 7,
    "tool_names": ["set_travel_dates", "..."],
    "tool_choice": "auto",
    "start_time": "2026-03-20T12:00:00",
    "end_time": "2026-03-20T12:15:00"
  },
  "runs": [
    {
      "run": 1,
      "success": true,
      "duration_seconds": 2.451,
      "finish_reason": "tool_calls",
      "has_tool_calls": true,
      "tool_calls": [{"name": "set_travel_dates", "args_valid_json": true}],
      "usage": {"prompt_tokens": 850, "completion_tokens": 120, "total_tokens": 970}
    }
  ],
  "stats": {
    "total_runs": 50,
    "success_rate": 98.0,
    "tool_call_rate": 95.0,
    "json_validity_rate": 100.0,
    "timing": {"avg": 2.5, "median": 2.3, "min": 1.1, "max": 5.2, "stdev": 0.8},
    "tool_call_distribution": {"set_travel_dates": 40, "set_budget": 8},
    "token_usage": {"avg_total_tokens": 970, "total_all_tokens": 48500}
  }
}
```
With the `--infographic` flag, a `benchmark_infographic.png` file is also generated.
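The script already reports aggregate timing stats, but the per-run records make it easy to re-derive them yourself. A sketch, assuming the `runs`/`duration_seconds` layout shown above and using a simple nearest-rank percentile (the script's own percentile method may differ):

```python
import statistics

def summarize_durations(results):
    """Recompute timing stats from a loaded benchmark_results.json dict."""
    durations = sorted(
        r["duration_seconds"] for r in results["runs"] if r.get("success")
    )

    def pct(p):
        # Nearest-rank percentile over the sorted successful durations.
        idx = max(0, min(len(durations) - 1,
                         round(p / 100 * len(durations)) - 1))
        return durations[idx]

    return {
        "avg": statistics.mean(durations),
        "median": statistics.median(durations),
        "min": durations[0],
        "max": durations[-1],
        "p90": pct(90), "p95": pct(95), "p99": pct(99),
    }
```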
## Environment Variables

| Variable | Required | Description |
|---|---|---|
| `VENICE_API_KEY` | Yes | Venice.ai API key |