---
name: "venice-chat-benchmark"
description: "Benchmark Venice.ai chat models with complex tool_choice payloads. Runs N iterations, captures timing, tool call distribution, JSON validity, errors, token usage, and generates a 4K infographic."
version: "1.0.0"
author: "Agent JAE"
tags:
  - venice
  - api
  - benchmark
  - chat
  - tool_choice
  - testing
trigger_patterns:
  - "benchmark chat"
  - "test model"
  - "venice benchmark"
  - "tool choice test"
---

# Venice Chat Model Benchmark

Benchmark Venice.ai chat completion models with complex tool_choice payloads.

## When to Use

Use this skill when you need to:

- Stress test a Venice chat model with tool calling
- Measure response time, reliability, and tool call accuracy
- Compare model behavior across many runs
- Generate visual benchmark reports

## Usage

### Basic (50 runs, minimax-m27)

```bash
export VENICE_API_KEY="your-key"
python ~/.jae/agent/skills/venice-chat-benchmark/scripts/benchmark.py --model minimax-m27 --runs 50 --output ~/chat_benchmark
```

### With Infographic

```bash
python ~/.jae/agent/skills/venice-chat-benchmark/scripts/benchmark.py --model minimax-m27 --runs 50 --output ~/chat_benchmark --infographic
```

## Options

| Option | Default | Description |
|--------|---------|-------------|
| `--model` | `minimax-m27` | Model ID to benchmark |
| `--runs` | 50 | Number of test iterations |
| `--timeout` | 120 | Request timeout in seconds |
| `--output` | `~/chat_benchmark` | Output directory |
| `--infographic` | off | Generate a 4K infographic when done |

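To compare model behavior across many runs (one of the use cases above), the script can simply be invoked once per model. The sketch below is illustrative only: it relies on the documented flags and script path, and the second model ID is a placeholder.

```python
# Sketch: benchmark several Venice models back to back using the
# documented CLI flags. Model IDs other than the default are placeholders.
import subprocess
from pathlib import Path

SCRIPT = Path.home() / ".jae/agent/skills/venice-chat-benchmark/scripts/benchmark.py"
MODELS = ["minimax-m27", "some-other-model"]  # second ID is a placeholder

for model in MODELS:
    out_dir = Path.home() / f"chat_benchmark_{model}"
    subprocess.run(
        [
            "python", str(SCRIPT),
            "--model", model,
            "--runs", "50",
            "--timeout", "120",
            "--output", str(out_dir),
            "--infographic",
        ],
        check=True,  # stop the sweep if a run fails
    )
```
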
## What It Measures

- **Response time** (avg, median, min, max, stdev, P90, P95)
- **Success rate** (HTTP errors, timeouts, connection errors)
- **Tool call rate** (% of responses that include tool calls)
- **Tool call distribution** (which tools get selected)
- **JSON validity** (whether tool call arguments parse correctly)
- **Token usage** (prompt, completion, total)
- **Finish reasons** (`tool_calls` vs `stop` vs other)
- **Error categorization** (by type, with details)

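The latency figures above can be reproduced from the per-run timings. The following is a minimal sketch using Python's statistics module; the sample data and the nearest-rank percentile helper are illustrative, not the script's internals.

```python
# Sketch: compute the latency summary the benchmark reports,
# given per-run response times in seconds (illustrative data).
import statistics

latencies = [1.8, 2.1, 2.4, 1.9, 3.2, 2.0, 2.7]  # placeholder timings

def percentile(values, pct):
    """Nearest-rank percentile over a sorted copy of values."""
    ordered = sorted(values)
    idx = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[idx]

summary = {
    "avg": statistics.mean(latencies),
    "median": statistics.median(latencies),
    "min": min(latencies),
    "max": max(latencies),
    "stdev": statistics.stdev(latencies) if len(latencies) > 1 else 0.0,
    "p90": percentile(latencies, 90),
    "p95": percentile(latencies, 95),
}
print(summary)
```
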
## Test Payload

The benchmark uses a complex travel planning scenario with:

- A detailed system prompt enforcing tool-only responses
- 7 function tools defined (dates, destinations, traveler info, priorities, budget, choices, suggestions)
- A rich user message with multiple extractable data points
- `tool_choice: auto`

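For orientation, here is a rough sketch of the request shape (an OpenAI-style chat completion with function tools). It is not the benchmark's actual payload: the tool name and fields are invented, and only one of the seven tools is shown.

```python
# Sketch of the request shape (illustrative only; the real payload
# defines 7 tools and a much richer system and user prompt).
payload = {
    "model": "minimax-m27",
    "messages": [
        {"role": "system", "content": "Respond only by calling the provided tools."},
        {"role": "user", "content": "Plan a 10-day trip for two adults on a mid-range budget."},
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "set_travel_dates",  # hypothetical tool name
                "description": "Record the travel start and end dates.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "start_date": {"type": "string"},
                        "end_date": {"type": "string"},
                    },
                    "required": ["start_date", "end_date"],
                },
            },
        },
        # ... six more tools in the real benchmark ...
    ],
    "tool_choice": "auto",
}
```
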
## Output

- `benchmark_results.json` — Full results with all run data and computed stats
- `benchmark_infographic.png` — 4K visual summary (with the `--infographic` flag)

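Since the schema of `benchmark_results.json` is not documented here, a safe way to explore it is to load the file and inspect its top-level keys, as in this sketch:

```python
# Sketch: inspect the benchmark output without assuming its schema.
import json
from pathlib import Path

results_path = Path.home() / "chat_benchmark" / "benchmark_results.json"
with results_path.open() as f:
    results = json.load(f)

# Print the top-level structure to see which stats were recorded.
for key, value in results.items():
    print(key, type(value).__name__)
```
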
## Requirements

- `VENICE_API_KEY` environment variable
- `requests` Python package
- `venice-image-gen` skill (optional, for infographic generation)