Sentra Code Memory on Terminal-Bench 2.1
Measuring the effect of a task-scoped code memory layer on the accuracy, cost, and token efficiency of a frontier coding agent.
Results are from Sentra's internal evaluation using the official Terminal-Bench 2.1 task set and harness, pending official benchmark verification (see Limitations). Baseline figures were recomputed from the published per-task trial data for the public Codex CLI + GPT-5.5 (xhigh) entry.
Executive Summary
Every headline measure improved at once
- 88.31% mean reward · +4.94 pts vs. baseline
- $510.30 total model cost · 72.6% lower
- 663.5M display tokens · 41.2% fewer
- 68 / 89 tasks solved in all five trials
Sentra evaluated Codex CLI running GPT-5.5 at xhigh reasoning effort — the configuration that currently leads the public Terminal-Bench 2.1 leaderboard — with Sentra Code Memory made available to the agent as a memory tool. The evaluation covered all 89 Terminal-Bench 2.1 tasks at the standard five trials per task, for 445 trials in total.
- Accuracy. The Sentra-enabled agent succeeded on 393 of 445 trials, a mean reward of 88.31%, versus 371 of 445 (83.37%) for the published baseline — an improvement of 4.94 points and 22 additional successful trials.
- Cost. Total model cost was $510.30, versus $1,862.98 for the baseline, a 72.6% reduction. Cost per successful trial fell from $5.02 to $1.30.
- Tokens. The run consumed 663.5 million tokens under the leaderboard's display convention, versus 1.128 billion for the baseline, 41.2% fewer.
- Consistency. Tasks solved in all five trials rose from 63 to 68, and tasks that failed in all five fell from 5 to 3.
The base agent, model, reasoning effort, benchmark harness, and scoring were identical across the two configurations. The only change was the availability of a task-scoped code memory layer, which the agent could query for relevant development context instead of repeatedly rediscovering it.
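As a quick arithmetic check, the accuracy figures above follow directly from the trial counts. The sketch below recomputes them; it assumes a binary reward per trial (which is consistent with the reported numbers), and the variable names are illustrative rather than part of any Sentra or Terminal-Bench tooling.

```python
# Trial counts as reported in the Executive Summary.
TOTAL_TRIALS = 89 * 5          # 89 tasks x 5 trials each
SENTRA_SUCCESSES = 393
BASELINE_SUCCESSES = 371


def mean_reward(successes, trials=TOTAL_TRIALS):
    """Mean reward as a percentage, assuming each trial scores 0 or 1."""
    return 100.0 * successes / trials


sentra_reward = mean_reward(SENTRA_SUCCESSES)
baseline_reward = mean_reward(BASELINE_SUCCESSES)
gain_points = sentra_reward - baseline_reward
extra_trials = SENTRA_SUCCESSES - BASELINE_SUCCESSES

print(f"Sentra:   {sentra_reward:.2f}%")
print(f"Baseline: {baseline_reward:.2f}%")
print(f"Gain:     {gain_points:+.2f} pts ({extra_trials:+d} trials)")
```

Note that the gain is expressed in percentage points (a difference of two percentages), not as a relative improvement.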
1 · Leaderboard Context
Where the result sits
At 88.31% mean reward, the Sentra-enabled configuration scores 4.94 points above the highest published entry. Because this result has not yet passed official verification, it is presented as an internal evaluation rather than a leaderboard ranking.
Terminal-Bench 2.1 leaderboard: mean reward, k = 5.

| # | Agent | Model | Accuracy |
|---|---|---|---|
| - | Sentra Code Memory + Codex CLI (internal eval) | GPT-5.5 · xhigh | 88.31% |
| 1 | Codex CLI (v0.125.0) | GPT-5.5 | 83.4% ± 2.2 pp |
| 2 | Claude Code (v2.1.152) | Claude Opus 4.8 | 78.9% ± 2.5 pp |
| 3 | Terminus 2 (v2.0.0) | GPT-5.5 | 78.2% ± 2.4 pp |
| 4 | Terminus 2 (v2.0.0) | Claude Opus 4.8 | 74.6% ± 2.5 pp |
| 5 | Terminus 2 (v2.0.0) | Gemini 3 Pro | 74.4% ± 2.6 pp |
| 6 | Gemini CLI (v0.40.0) | Gemini 3.1 Pro | 70.7% ± 3.0 pp |
| 7 | Terminus 2 (v2.0.0) | Gemini 3.1 Pro | 70.3% ± 3.0 pp |
| 8 | Claude Code (v2.1.123) | Claude Opus 4.7 | 69.7% ± 2.8 pp |
2 · Sentra Code Memory
A memory layer, not a bigger window
What it is
Sentra builds memory infrastructure for teams and AI agents: a shared memory system that captures interactions, decisions, and evidence into a queryable structure, keeps answers connected to their supporting evidence, and tracks how information changes over time. Sentra Code Memory applies this capability to coding agents. It gives an agent a task-scoped memory of development state — repository structure, relevant code context, tool activity, file changes, test signals, and continuation context — exposed through a CLI, an SDK, and a local API. Rather than repeatedly re-scanning a repository to rebuild context, the agent retrieves compact, relevant development state at the moment it needs it.
Memory is not a bigger context window
A larger context window gives a model more room, but it does not decide what should be remembered, when a fact has gone stale, which evidence matters, or how to separate durable task state from incidental output. Sentra Code Memory operates as a memory layer around the coding workflow rather than as additional raw capacity. The model continues to write code, run commands, and reason about failures; the memory layer provides a scoped recall channel the agent can query as it works.
Why this should matter on Terminal-Bench
Terminal-Bench tasks typically demand repeated repository inspection, build-system discovery, test interpretation, and incremental debugging. Without a memory layer, an agent pays model tokens to rediscover context it has already seen, often several times within a single run. With task-scoped memory available, the agent can preserve and recall relevant state across the run, reducing redundant discovery and leaving more of its budget for edits and verification.
3 · Evaluation Methodology
Configuration and accounting
Evaluation configuration.

| Parameter | Value |
|---|---|
| Dataset | terminal-bench-2-1 (official task set) |
| Tasks | 89 |
| Trials per task | 5 (k = 5, per the leaderboard protocol) |
| Total trials | 445 |
| Base agent | Codex CLI |
| Model | GPT-5.5 |
| Reasoning effort | xhigh |
| Memory layer | Sentra Code Memory, as a task-scoped tool |
| Harness | Harbor, with Docker task environments |
| Scoring | Official Terminal-Bench reward per trial |
| Baseline | Public Codex CLI + GPT-5.5 (xhigh), v0.125.0 |
Memory isolation
Each task ran with isolated memory state. The task repository was indexed before the agent loop began, and an index watcher kept the memory current as the agent edited files during the run. Memory visible to the agent was scoped to the current task's repository rather than to any shared or global workspace.
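To make the isolation model concrete, here is a minimal sketch of a per-task index that is built once before the agent loop and then refreshed as files change. It is purely illustrative: `TaskScopedIndex`, `build`, and `refresh` are hypothetical names, the polling-based refresh stands in for a real file watcher, and nothing here reflects Sentra Code Memory's actual implementation or API.

```python
import hashlib
import tempfile
from pathlib import Path


class TaskScopedIndex:
    """Hypothetical per-task index: relative file path -> content hash."""

    def __init__(self, repo_root):
        # Scope is fixed to a single task repository; nothing outside it is read.
        self.repo_root = Path(repo_root).resolve()
        self.entries = {}

    def _scan(self):
        # Hash every regular file under this task's repository only.
        snapshot = {}
        for path in sorted(self.repo_root.rglob("*")):
            if path.is_file():
                rel = path.relative_to(self.repo_root).as_posix()
                snapshot[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
        return snapshot

    def build(self):
        # Initial indexing, done once before the agent loop starts.
        self.entries = self._scan()

    def refresh(self):
        # Polling stand-in for a watcher: report paths added, changed, or removed.
        latest = self._scan()
        changed = sorted(
            rel
            for rel in latest.keys() | self.entries.keys()
            if latest.get(rel) != self.entries.get(rel)
        )
        self.entries = latest
        return changed


def demo():
    # Each task gets its own directory and its own index instance.
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        (repo / "main.py").write_text("print('v1')\n")
        index = TaskScopedIndex(repo)
        index.build()
        (repo / "main.py").write_text("print('v2')\n")        # agent edits a file
        (repo / "test_main.py").write_text("assert True\n")   # agent adds a file
        return index.refresh()


print(demo())
```

Because every index instance is bound to one repository root, two tasks running side by side would never see each other's entries, which is the property the isolation setup above is meant to guarantee.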
Cost and token accounting
Model-cost figures reflect the source-reported model cost from the run artifacts. Sentra-side retrieval costs, limited to embedding and reranking, are not included in the model-cost field. To keep comparisons like-for-like, token counts follow the public task-detail conventions: API tokens are input + output; leaderboard-display tokens are input + output + cache. Model cost remains the most direct budget measure, since it comes from the source-reported cost field rather than a derived token formula.
4 · Results
Headline comparison
Sentra Code Memory + Codex CLI (GPT-5.5, xhigh) vs. the public baseline.

| Metric | With Sentra | Baseline | Change |
|---|---|---|---|
| Accuracy (mean reward) | 88.31% | 83.37% | +4.94 pts |
| Successful trials | 393 / 445 | 371 / 445 | +22 |
| Tasks solved in all 5 trials | 68 | 63 | +5 |
| Tasks failed in all 5 trials | 3 | 5 | −2 |
| Total model cost | $510.30 | $1,862.98 | −72.6% |
| Cost per task | $5.73 | $20.93 | −72.6% |
| Cost per trial | $1.15 | $4.19 | −72.6% |
| Cost per successful trial | $1.30 | $5.02 | −74.1% |
| API tokens (input + output) | 354.4M | 735.2M | −51.8% |
| Leaderboard-display tokens | 663.5M | 1,127.6M | −41.2% |
Reliability and consistency
Head-to-head at the task level, the Sentra-enabled configuration improved on 20 tasks, tied on 61, and regressed on 8, for a net gain of 22 successful trials. The outcome distribution shifted toward consistency: more tasks solved in every trial, fewer failing in every trial.
Distribution of successful trials per task (number of tasks in each band).

| Successful trials per task | With Sentra | Baseline |
|---|---|---|
| 5 / 5 | 68 | 63 |
| 4 / 5 | 8 | 8 |
| 3 / 5 | 3 | 4 |
| 2 / 5 | 5 | 3 |
| 1 / 5 | 2 | 6 |
| 0 / 5 | 3 | 5 |
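The band counts above can be cross-checked against the headline totals: each column should cover all 89 tasks, and weighting each band by its number of successes should reproduce the successful-trial counts. A short sketch of that check follows (the data structure and function name are illustrative).

```python
# successes per task -> (tasks with Sentra, tasks for baseline), from the table above
BANDS = {
    5: (68, 63),
    4: (8, 8),
    3: (3, 4),
    2: (5, 3),
    1: (2, 6),
    0: (3, 5),
}


def column_totals(column):
    """Return (number of tasks, number of successful trials) for one column."""
    tasks = sum(counts[column] for counts in BANDS.values())
    successes = sum(k * counts[column] for k, counts in BANDS.items())
    return tasks, successes


sentra_tasks, sentra_successes = column_totals(0)
baseline_tasks, baseline_successes = column_totals(1)

print("With Sentra:", sentra_tasks, "tasks,", sentra_successes, "successful trials")
print("Baseline:   ", baseline_tasks, "tasks,", baseline_successes, "successful trials")
```

Both columns sum to 89 tasks, and the weighted totals match the 393 and 371 successful trials reported in the headline comparison.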
5 · Cost and Token Efficiency
Spending less for more
Model cost comparison.

| Metric | With Sentra | Baseline | Baseline ÷ Sentra |
|---|---|---|---|
| Total model cost | $510.30 | $1,862.98 | 3.65× |
| Cost per task | $5.73 | $20.93 | 3.65× |
| Cost per trial | $1.15 | $4.19 | 3.65× |
| Cost per successful trial | $1.30 | $5.02 | 3.87× |
Read this as: the public baseline spent 3.65× as much model budget to achieve a lower score. Because the Sentra run also succeeded more often, the gap widens on a per-success basis, to 3.87×.
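The unit costs and ratios in the table can be re-derived from four reported numbers: each run's total model cost and its count of successful trials. The sketch below does so; the dictionary layout and function name are illustrative choices, not part of the run artifacts.

```python
TASKS, TRIALS = 89, 445

RUNS = {
    "sentra": {"cost": 510.30, "successes": 393},
    "baseline": {"cost": 1862.98, "successes": 371},
}


def unit_costs(run):
    """Break a run's total model cost down per task, per trial, and per success."""
    return {
        "total": run["cost"],
        "per_task": run["cost"] / TASKS,
        "per_trial": run["cost"] / TRIALS,
        "per_success": run["cost"] / run["successes"],
    }


sentra = unit_costs(RUNS["sentra"])
baseline = unit_costs(RUNS["baseline"])
ratios = {key: baseline[key] / sentra[key] for key in sentra}

for key, ratio in ratios.items():
    print(f"{key:12s} ${sentra[key]:8.2f} vs ${baseline[key]:8.2f}  ({ratio:.2f}x)")
```

The per-task and per-trial ratios equal the total-cost ratio because both runs share the same denominators (89 tasks, 445 trials); only the per-success ratio differs, since the success counts differ.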
Token usage by class, totals across all 445 trials.

| Token class | With Sentra | Baseline | Reduction |
|---|---|---|---|
| Input tokens | 349,168,710 | 729,230,975 | −52.1% |
| Output tokens | 5,191,939 | 5,966,373 | −13.0% |
| Cache tokens | 309,178,624 | 392,433,664 | −21.2% |
| API tokens (input + output) | 354,360,649 | 735,197,348 | −51.8% |
| Display tokens (input + output + cache) | 663,539,273 | 1,127,631,012 | −41.2% |
The savings are dominated by input tokens (−52.1%), while output tokens fell far less (−13.0%). This pattern is consistent with the agent ingesting much less repeated context while producing a broadly comparable volume of work.
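The two token conventions used in this report (API tokens = input + output; display tokens = input + output + cache) can be applied to the per-class totals to reproduce the derived rows and the reduction percentages. The following is a small sketch; the `TokenTotals` class is an illustrative helper, not a structure from the benchmark harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenTotals:
    input: int
    output: int
    cache: int

    @property
    def api(self):
        # "API tokens" convention: input + output
        return self.input + self.output

    @property
    def display(self):
        # Leaderboard-display convention: input + output + cache
        return self.api + self.cache


def reduction_pct(new, old):
    """Percentage by which `new` is lower than `old`."""
    return 100.0 * (old - new) / old


sentra = TokenTotals(input=349_168_710, output=5_191_939, cache=309_178_624)
baseline = TokenTotals(input=729_230_975, output=5_966_373, cache=392_433_664)

for name in ("input", "output", "cache", "api", "display"):
    s, b = getattr(sentra, name), getattr(baseline, name)
    print(f"{name:8s} {s:>13,} vs {b:>13,}  (-{reduction_pct(s, b):.1f}%)")
```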
6 · Task-Level Highlights
Where the gains concentrate
The largest gains concentrate in tasks that demand heavy environment and repository discovery, where the baseline spent large budgets rebuilding context. On train-fasttext, the Sentra-enabled agent flipped zero baseline successes into two while cutting cost from $138.90 to $14.70; on compile-compcert it reached 5/5 at $13.47 versus $99.89. Of the five tasks the baseline failed in every trial, the Sentra-enabled agent solved four at least once.
Tasks where Sentra Code Memory gained successful trials.

| Task | Sentra | Baseline | Δ | Sentra cost | Baseline cost |
|---|---|---|---|---|---|
| extract-moves-from-video | 3 | 0 | +3 | $47.89 | $217.35 |
| qemu-alpine-ssh | 5 | 2 | +3 | $6.68 | $29.45 |
| protein-assembly | 5 | 2 | +3 | $7.91 | $23.62 |
| configure-git-webserver | 3 | 0 | +3 | $7.00 | $10.22 |
| extract-elf | 4 | 1 | +3 | $3.33 | $6.94 |
| train-fasttext | 2 | 0 | +2 | $14.70 | $138.90 |
| pypi-server | 5 | 3 | +2 | $2.64 | $7.94 |
| kv-store-grpc | 5 | 3 | +2 | $3.50 | $5.83 |
| install-windows-3.11 | 4 | 3 | +1 | $11.48 | $101.73 |
| compile-compcert | 5 | 4 | +1 | $13.47 | $99.89 |
| video-processing | 2 | 1 | +1 | $8.79 | $25.87 |
| qemu-startup | 5 | 4 | +1 | $4.08 | $22.05 |
| largest-eigenval | 5 | 4 | +1 | $3.69 | $19.36 |
| gcode-to-text | 2 | 1 | +1 | $7.75 | $16.28 |
| dna-assembly | 2 | 1 | +1 | $8.56 | $13.40 |
| pytorch-model-recovery | 4 | 3 | +1 | $3.83 | $12.20 |
| sam-cell-seg | 5 | 4 | +1 | $4.98 | $10.92 |
| torch-tensor-parallelism | 5 | 4 | +1 | $3.37 | $5.15 |
| dna-insert | 1 | 0 | +1 | $5.07 | $4.57 |
| chess-best-move | 5 | 4 | +1 | $2.58 | $3.97 |
Across the run, the Sentra-enabled agent gained 33 trials on 20 tasks and gave back 11 trials on 8 tasks, for the net improvement of +22. The regressions are concentrated in a small set of tasks and are under analysis ahead of the official benchmark submission.
All tasks where the public baseline outperformed Sentra.

| Task | Sentra | Baseline | Δ | Sentra cost | Baseline cost |
|---|---|---|---|---|---|
| torch-pipeline-parallelism | 2 | 5 | -3 | $4.79 | $8.68 |
| vulnerable-secret | 3 | 5 | -2 | $1.74 | $2.91 |
| build-cython-ext | 4 | 5 | -1 | $9.06 | $28.08 |
| db-wal-recovery | 1 | 2 | -1 | $7.19 | $13.40 |
| make-doom-for-mips | 0 | 1 | -1 | $5.23 | $90.98 |
| make-mips-interpreter | 4 | 5 | -1 | $11.31 | $38.08 |
| query-optimize | 4 | 5 | -1 | $7.18 | $29.23 |
| raman-fitting | 0 | 1 | -1 | $7.52 | $17.95 |
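The tallies quoted in this section (33 trials gained on 20 tasks, 11 given back on 8, 61 ties) can be re-derived from the Δ columns of the two tables. A brief sketch, with the per-task deltas transcribed from those tables and illustrative variable names:

```python
TOTAL_TASKS = 89

# Per-task change in successful trials (Sentra minus baseline); ties are omitted.
DELTAS = {
    "extract-moves-from-video": 3, "qemu-alpine-ssh": 3, "protein-assembly": 3,
    "configure-git-webserver": 3, "extract-elf": 3, "train-fasttext": 2,
    "pypi-server": 2, "kv-store-grpc": 2, "install-windows-3.11": 1,
    "compile-compcert": 1, "video-processing": 1, "qemu-startup": 1,
    "largest-eigenval": 1, "gcode-to-text": 1, "dna-assembly": 1,
    "pytorch-model-recovery": 1, "sam-cell-seg": 1, "torch-tensor-parallelism": 1,
    "dna-insert": 1, "chess-best-move": 1,
    "torch-pipeline-parallelism": -3, "vulnerable-secret": -2,
    "build-cython-ext": -1, "db-wal-recovery": -1, "make-doom-for-mips": -1,
    "make-mips-interpreter": -1, "query-optimize": -1, "raman-fitting": -1,
}

improved = {task: d for task, d in DELTAS.items() if d > 0}
regressed = {task: d for task, d in DELTAS.items() if d < 0}

trials_gained = sum(improved.values())
trials_lost = -sum(regressed.values())
net_trials = trials_gained - trials_lost
tied_tasks = TOTAL_TASKS - len(DELTAS)

print(f"improved: {len(improved)} tasks, +{trials_gained} trials")
print(f"regressed: {len(regressed)} tasks, -{trials_lost} trials")
print(f"tied: {tied_tasks} tasks, net {net_trials:+d} trials")
```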
7 · Discussion
Three measures moved together
Improvements on agentic benchmarks typically trade off against cost: higher scores are usually bought with more sampling, longer runs, or heavier reasoning. In this evaluation, the Sentra-enabled configuration scored higher while spending 72.6% less and consuming 41.2% fewer tokens than the public baseline.
The most plausible explanation is reduced rediscovery. Terminal-Bench rewards sustained, multi-step work inside a repository. With task-scoped memory available, the agent recalls context on demand and directs more of its budget toward edits, tests, and verification.
The composition of the token savings supports this reading: input tokens fell by 52.1% while output tokens fell by only 13.0%. The distribution of outcomes suggests the gains reflect greater reliability rather than a handful of fortunate trials — tasks solved in all five trials rose from 63 to 68, and four of the five tasks the baseline never solved were solved at least once with memory available.
8 · Limitations and Verification
Internal evaluation, pending verification
The results in this report were produced by Sentra using the official Terminal-Bench 2.1 task set, harness, and scoring, but they have not yet been verified by the Terminal-Bench team, and Sentra Code Memory does not currently appear on the public leaderboard. Before any leaderboard claim, the configuration will be rerun through the official submission path, and that run will serve as the source of truth.
- Baseline snapshot. Baseline figures were recomputed from the published per-task trial data for the public Codex CLI + GPT-5.5 (xhigh) entry as of June 2026. Public leaderboard data may be revised over time.
- Cost scope. Model-cost figures reflect the agent's reported model usage. Sentra-side retrieval costs (embedding and reranking) are not included.
- Generalization. This evaluation covers a single benchmark and a single base agent and model configuration. Results on other benchmarks, agents, or models may differ.
9 · Availability and Next Steps
What happens next
Sentra will submit this configuration through the official Terminal-Bench 2.1 verification process and will release the complete run trajectories so the community can inspect the agent's behavior directly. Sentra Code Memory itself will be made available shortly. For questions about this report, early access, or partnership inquiries, contact ashwin@sentra.app or visit sentra.app.
References
- Terminal-Bench 2.1 leaderboard. tbench.ai/leaderboard/terminal-bench/2.1
- Public Codex CLI + GPT-5.5 (xhigh) task-detail pages, Terminal-Bench 2.1.
- Terminal-Bench benchmark index. tbench.ai/benchmarks
- Sentra. sentra.app
- Sentra Terminal-Bench 2.1 run artifacts (445 trial records; trajectories to be released publicly).
Appendix A
Complete per-task results
Per-task results for all 89 Terminal-Bench 2.1 tasks: successful trials out of five for each configuration, the difference, total model cost per task, and total tokens per task under the leaderboard display convention (input + output + cache).
| Task | Sentra | Base | Δ | Sentra $ | Base $ | Sentra tok | Base tok |
|---|---|---|---|---|---|---|---|
| adaptive-rejection-sampler | 5 | 5 | 0 | $4.94 | $11.89 | 2.6M | 9.2M |
| bn-fit-modify | 5 | 5 | 0 | $3.38 | $3.43 | 2.4M | 1.0M |
| break-filter-js-from-html | 5 | 5 | 0 | $4.21 | $15.06 | 3.7M | 6.1M |
| build-cython-ext | 4 | 5 | -1 | $9.06 | $28.08 | 16.5M | 22.0M |
| build-pmars | 5 | 5 | 0 | $6.86 | $11.82 | 10.3M | 9.5M |
| build-pov-ray | 5 | 5 | 0 | $6.36 | $14.90 | 8.1M | 12.9M |
| caffe-cifar-10 | 5 | 5 | 0 | $15.04 | $64.14 | 35.8M | 62.4M |
| cancel-async-tasks | 5 | 5 | 0 | $1.78 | $2.61 | 0.8M | 1.0M |
| chess-best-move | 5 | 4 | +1 | $2.58 | $3.97 | 1.2M | 1.5M |
| circuit-fibsqrt | 5 | 5 | 0 | $4.65 | $8.81 | 3.5M | 4.0M |
| cobol-modernization | 5 | 5 | 0 | $4.44 | $12.47 | 4.2M | 4.7M |
| code-from-image | 5 | 5 | 0 | $1.08 | $1.04 | 0.6M | 0.6M |
| compile-compcert | 5 | 4 | +1 | $13.47 | $99.89 | 27.9M | 72.1M |
| configure-git-webserver | 3 | 0 | +3 | $7.00 | $10.22 | 8.0M | 9.8M |
| constraints-scheduling | 5 | 5 | 0 | $2.59 | $2.27 | 1.3M | 0.9M |
| count-dataset-tokens | 5 | 5 | 0 | $4.86 | $5.17 | 4.8M | 3.3M |
| crack-7z-hash | 5 | 5 | 0 | $5.33 | $8.25 | 7.7M | 9.4M |
| custom-memory-heap-crash | 5 | 5 | 0 | $4.63 | $11.75 | 6.6M | 10.1M |
| db-wal-recovery | 1 | 2 | -1 | $7.19 | $13.40 | 8.0M | 9.6M |
| distribution-search | 5 | 5 | 0 | $1.89 | $2.98 | 1.3M | 1.1M |
| dna-assembly | 2 | 1 | +1 | $8.56 | $13.40 | 6.5M | 4.7M |
| dna-insert | 1 | 0 | +1 | $5.07 | $4.57 | 5.0M | 1.9M |
| extract-elf | 4 | 1 | +3 | $3.33 | $6.94 | 3.0M | 3.2M |
| extract-moves-from-video | 3 | 0 | +3 | $47.89 | $217.35 | 87.8M | 99.9M |
| feal-differential-cryptanalysis | 5 | 5 | 0 | $3.21 | $9.06 | 2.5M | 5.3M |
| feal-linear-cryptanalysis | 5 | 5 | 0 | $4.01 | $5.27 | 2.9M | 2.3M |
| filter-js-from-html | 0 | 0 | 0 | $4.06 | $7.34 | 2.5M | 3.5M |
| financial-document-processor | 5 | 5 | 0 | $8.38 | $12.42 | 8.1M | 7.3M |
| fix-code-vulnerability | 5 | 5 | 0 | $3.99 | $3.03 | 4.6M | 2.5M |
| fix-git | 5 | 5 | 0 | $2.74 | $3.71 | 2.9M | 2.8M |
| fix-ocaml-gc | 5 | 5 | 0 | $8.51 | $48.92 | 13.8M | 88.7M |
| gcode-to-text | 2 | 1 | +1 | $7.75 | $16.28 | 8.1M | 9.9M |
| git-leak-recovery | 5 | 5 | 0 | $1.74 | $2.02 | 1.3M | 1.5M |
| git-multibranch | 5 | 5 | 0 | $3.79 | $9.88 | 3.8M | 5.4M |
| gpt2-codegolf | 5 | 5 | 0 | $8.97 | $23.50 | 9.3M | 14.0M |
| headless-terminal | 5 | 5 | 0 | $2.47 | $4.38 | 1.2M | 2.1M |
| hf-model-inference | 5 | 5 | 0 | $2.82 | $7.46 | 2.3M | 5.9M |
| install-windows-3.11 | 4 | 3 | +1 | $11.48 | $101.73 | 17.7M | 69.3M |
| kv-store-grpc | 5 | 3 | +2 | $3.50 | $5.83 | 3.5M | 3.6M |
| large-scale-text-editing | 5 | 5 | 0 | $2.14 | $8.36 | 1.6M | 4.5M |
| largest-eigenval | 5 | 4 | +1 | $3.69 | $19.36 | 3.1M | 12.2M |
| llm-inference-batching-scheduler | 5 | 5 | 0 | $5.13 | $14.79 | 4.0M | 4.9M |
| log-summary-date-ranges | 5 | 5 | 0 | $1.74 | $1.61 | 1.3M | 0.6M |
| mailman | 5 | 5 | 0 | $11.35 | $51.99 | 20.7M | 31.2M |
| make-doom-for-mips | 0 | 1 | -1 | $5.23 | $90.98 | 5.9M | 80.5M |
| make-mips-interpreter | 4 | 5 | -1 | $11.31 | $38.08 | 14.7M | 25.2M |
| mcmc-sampling-stan | 5 | 5 | 0 | $12.20 | $45.05 | 26.9M | 26.6M |
| merge-diff-arc-agi-task | 5 | 5 | 0 | $3.58 | $10.82 | 3.7M | 5.2M |
| model-extraction-relu-logits | 5 | 5 | 0 | $4.53 | $10.69 | 3.0M | 2.7M |
| modernize-scientific-stack | 5 | 5 | 0 | $1.11 | $1.17 | 0.8M | 0.5M |
| mteb-leaderboard | 5 | 5 | 0 | $14.41 | $40.99 | 33.0M | 29.9M |
| mteb-retrieve | 5 | 5 | 0 | $2.60 | $6.16 | 2.7M | 3.5M |
| multi-source-data-merger | 5 | 5 | 0 | $2.07 | $2.18 | 1.5M | 0.7M |
| nginx-request-logging | 5 | 5 | 0 | $2.27 | $3.16 | 2.4M | 1.1M |
| openssl-selfsigned-cert | 5 | 5 | 0 | $0.98 | $1.65 | 0.6M | 0.6M |
| overfull-hbox | 4 | 4 | 0 | $4.47 | $13.90 | 4.1M | 8.0M |
| password-recovery | 5 | 5 | 0 | $3.44 | $6.98 | 3.2M | 2.4M |
| path-tracing | 5 | 5 | 0 | $5.39 | $42.08 | 6.4M | 24.1M |
| path-tracing-reverse | 5 | 5 | 0 | $8.68 | $20.27 | 14.3M | 10.9M |
| polyglot-c-py | 5 | 5 | 0 | $3.01 | $4.82 | 2.2M | 2.3M |
| polyglot-rust-c | 5 | 5 | 0 | $5.36 | $11.50 | 3.5M | 3.6M |
| portfolio-optimization | 5 | 5 | 0 | $1.53 | $9.38 | 1.1M | 3.1M |
| protein-assembly | 5 | 2 | +3 | $7.91 | $23.62 | 7.9M | 8.7M |
| prove-plus-comm | 5 | 5 | 0 | $0.85 | $1.68 | 0.7M | 0.7M |
| pypi-server | 5 | 3 | +2 | $2.64 | $7.94 | 2.7M | 2.9M |
| pytorch-model-cli | 5 | 5 | 0 | $3.44 | $10.22 | 2.7M | 5.2M |
| pytorch-model-recovery | 4 | 3 | +1 | $3.83 | $12.20 | 3.3M | 6.2M |
| qemu-alpine-ssh | 5 | 2 | +3 | $6.68 | $29.45 | 8.9M | 15.4M |
| qemu-startup | 5 | 4 | +1 | $4.08 | $22.05 | 4.6M | 10.2M |
| query-optimize | 4 | 5 | -1 | $7.18 | $29.23 | 8.3M | 11.9M |
| raman-fitting | 0 | 1 | -1 | $7.52 | $17.95 | 6.4M | 5.6M |
| regex-chess | 5 | 5 | 0 | $7.61 | $21.58 | 6.0M | 10.7M |
| regex-log | 5 | 5 | 0 | $2.34 | $2.45 | 1.5M | 0.9M |
| reshard-c4-data | 5 | 5 | 0 | $5.25 | $16.88 | 5.6M | 11.8M |
| rstan-to-pystan | 5 | 5 | 0 | $10.04 | $32.73 | 18.6M | 13.3M |
| sam-cell-seg | 5 | 4 | +1 | $4.98 | $10.92 | 4.1M | 3.8M |
| sanitize-git-repo | 4 | 4 | 0 | $4.03 | $33.08 | 3.3M | 13.7M |
| schemelike-metacircular-eval | 5 | 5 | 0 | $5.92 | $39.66 | 5.6M | 17.0M |
| sparql-university | 5 | 5 | 0 | $3.54 | $11.55 | 3.2M | 5.8M |
| sqlite-db-truncate | 5 | 5 | 0 | $2.03 | $2.75 | 1.2M | 0.8M |
| sqlite-with-gcov | 5 | 5 | 0 | $2.37 | $8.55 | 2.2M | 5.8M |
| torch-pipeline-parallelism | 2 | 5 | -3 | $4.79 | $8.68 | 3.0M | 2.5M |
| torch-tensor-parallelism | 5 | 4 | +1 | $3.37 | $5.15 | 2.1M | 1.4M |
| train-fasttext | 2 | 0 | +2 | $14.70 | $138.90 | 29.0M | 60.4M |
| tune-mjcf | 5 | 5 | 0 | $3.79 | $18.51 | 4.4M | 14.2M |
| video-processing | 2 | 1 | +1 | $8.79 | $25.87 | 9.7M | 13.6M |
| vulnerable-secret | 3 | 5 | -2 | $1.74 | $2.91 | 1.4M | 1.0M |
| winning-avg-corewars | 5 | 5 | 0 | $8.25 | $32.42 | 12.8M | 13.0M |
| write-compressor | 5 | 5 | 0 | $2.81 | $4.84 | 1.5M | 1.8M |
| All 89 tasks | 393 | 371 | +22 | $510.30 | $1,862.98 | 663.5M | 1,127.6M |