Sentra Code Memory on Terminal-Bench 2.1

Measuring the effect of a task-scoped code memory layer on the accuracy, cost, and token efficiency of a frontier coding agent.

Results are from Sentra's internal evaluation using the official Terminal-Bench 2.1 task set and harness, pending official benchmark verification (see Limitations). Baseline figures were recomputed from the published per-task trial data for the public Codex CLI + GPT-5.5 (xhigh) entry.

Executive Summary

Every headline measure improved at once

88.31% mean reward · +4.94 pts vs. baseline
$510.30 total model cost · 72.6% lower
663.5M display tokens · 41.2% fewer
68 / 89 tasks solved in all five trials

Sentra evaluated Codex CLI running GPT-5.5 at xhigh reasoning effort — the configuration that currently leads the public Terminal-Bench 2.1 leaderboard — with Sentra Code Memory made available to the agent as a memory tool. The evaluation covered all 89 Terminal-Bench 2.1 tasks at the standard five trials per task, for 445 trials in total.

  • Accuracy. The Sentra-enabled agent succeeded on 393 of 445 trials, a mean reward of 88.31%, versus 371 of 445 (83.37%) for the published baseline — an improvement of 4.94 points and 22 additional successful trials.
  • Cost. Total model cost was $510.30, versus $1,862.98 for the baseline, a 72.6% reduction. Cost per successful trial fell from $5.02 to $1.30.
  • Tokens. The run consumed 663.5 million tokens under the leaderboard's display convention (input + output + cache), versus 1.128 billion for the baseline: 41.2% fewer.
  • Consistency. Tasks solved in all five trials rose from 63 to 68, and tasks that failed in all five fell from 5 to 3.

The base agent, model, reasoning effort, benchmark harness, and scoring were identical across the two configurations. The only change was the availability of a task-scoped code memory layer, which the agent could query for relevant development context instead of repeatedly rediscovering it.

1 · Leaderboard Context

Where the result sits

At 88.31% mean reward, the Sentra-enabled configuration scores 4.94 points above the highest published entry. Because this result has not yet passed official verification, it is presented as an internal evaluation rather than a leaderboard ranking.

Terminal-Bench 2.1 leaderboard — mean reward, k = 5.
| # | Agent | Model | Accuracy |
|---|---|---|---|
|  | Sentra Code Memory + Codex CLI (internal eval) | GPT-5.5 · xhigh | 88.31% |
| 1 | Codex CLI 0.125.0 | GPT-5.5 | 83.4% ± 2.2 pp |
| 2 | Claude Code 2.1.152 | Claude Opus 4.8 | 78.9% ± 2.5 pp |
| 3 | Terminus 2 2.0.0 | GPT-5.5 | 78.2% ± 2.4 pp |
| 4 | Terminus 2 2.0.0 | Claude Opus 4.8 | 74.6% ± 2.5 pp |
| 5 | Terminus 2 2.0.0 | Gemini 3 Pro | 74.4% ± 2.6 pp |
| 6 | Gemini CLI 0.40.0 | Gemini 3.1 Pro | 70.7% ± 3.0 pp |
| 7 | Terminus 2 2.0.0 | Gemini 3.1 Pro | 70.3% ± 3.0 pp |
| 8 | Claude Code 2.1.123 | Claude Opus 4.7 | 69.7% ± 2.8 pp |

2 · Sentra Code Memory

A memory layer, not a bigger window

What it is

Sentra builds memory infrastructure for teams and AI agents: a shared memory system that captures interactions, decisions, and evidence into a queryable structure, keeps answers connected to their supporting evidence, and tracks how information changes over time. Sentra Code Memory applies this capability to coding agents. It gives an agent a task-scoped memory of development state — repository structure, relevant code context, tool activity, file changes, test signals, and continuation context — exposed through a CLI, an SDK, and a local API. Rather than repeatedly re-scanning a repository to rebuild context, the agent retrieves compact, relevant development state at the moment it needs it.

Memory is not a bigger context window

A larger context window gives a model more room, but it does not decide what should be remembered, when a fact has gone stale, which evidence matters, or how to separate durable task state from incidental output. Sentra Code Memory operates as a memory layer around the coding workflow rather than as additional raw capacity. The model continues to write code, run commands, and reason about failures; the memory layer provides a scoped recall channel the agent can query as it works.

Why this should matter on Terminal-Bench

Terminal-Bench tasks typically demand repeated repository inspection, build-system discovery, test interpretation, and incremental debugging. Without a memory layer, an agent pays model tokens to rediscover context it has already seen, often several times within a single run. With task-scoped memory available, the agent can preserve and recall relevant state across the run, reducing redundant discovery and leaving more of its budget for edits and verification.

3 · Evaluation Methodology

Configuration and accounting

Evaluation configuration.
Dataset: terminal-bench-2-1 (official task set)
Tasks: 89
Trials per task: 5 (k = 5, per the leaderboard protocol)
Total trials: 445
Base agent: Codex CLI
Model: GPT-5.5
Reasoning effort: xhigh
Memory layer: Sentra Code Memory, as a task-scoped tool
Harness: Harbor, with Docker task environments
Scoring: Official Terminal-Bench reward per trial
Baseline: Public Codex CLI + GPT-5.5 (xhigh), v0.125.0

Memory isolation

Each task ran with isolated memory state. The task repository was indexed before the agent loop began, and an index watcher kept the memory current as the agent edited files during the run. Memory visible to the agent was scoped to the current task's repository rather than to any shared or global workspace.
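
The lifecycle described above (index the repository before the agent loop, refresh as files change, and scope visibility to one task's repository) can be illustrated with a toy sketch. Everything in it is hypothetical: the `TaskScopedMemory` class, its `refresh` and `query` methods, and the hash-based polling are stand-ins chosen for illustration, not Sentra's implementation or API.

```python
import hashlib
import tempfile
from pathlib import Path


class TaskScopedMemory:
    """Toy, hypothetical stand-in for a task-scoped memory (not Sentra's API)."""

    def __init__(self, repo_root):
        # Only files under this one repository root are ever visible.
        self.repo_root = Path(repo_root).resolve()
        self._digests = {}   # relative path -> content hash
        self._contents = {}  # relative path -> decoded file text

    def refresh(self):
        """Index new or changed files, drop deleted ones, return changed paths."""
        changed, seen = [], set()
        for path in sorted(self.repo_root.rglob("*")):
            if not path.is_file():
                continue
            rel = path.relative_to(self.repo_root).as_posix()
            seen.add(rel)
            data = path.read_bytes()
            digest = hashlib.sha256(data).hexdigest()
            if self._digests.get(rel) != digest:
                self._digests[rel] = digest
                self._contents[rel] = data.decode("utf-8", errors="replace")
                changed.append(rel)
        for rel in set(self._digests) - seen:
            del self._digests[rel]
            del self._contents[rel]
        return changed

    def query(self, term):
        """Return indexed paths whose current text contains `term`."""
        return sorted(p for p, text in self._contents.items() if term in text)


# Demo: two tasks in separate temporary repositories.
with tempfile.TemporaryDirectory() as task_a, tempfile.TemporaryDirectory() as task_b:
    Path(task_a, "build.sh").write_text("make all\n")
    Path(task_b, "notes.txt").write_text("make all\n")

    memory = TaskScopedMemory(task_a)
    first_index = memory.refresh()          # initial index before the agent loop
    hits_before = memory.query("make all")  # task B's file is never visible

    Path(task_a, "build.sh").write_text("cmake ..\n")  # simulated agent edit
    after_edit = memory.refresh()           # watcher-style refresh sees the change
    hits_after = memory.query("make all")   # stale content is no longer returned
```

A real index watcher would react to filesystem events rather than rescanning, and retrieval would rank results rather than match substrings; the sketch only shows the scoping and staleness behaviour the paragraph describes.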

Cost and token accounting

Model-cost figures reflect the source-reported model cost from the run artifacts. Sentra-side retrieval costs, limited to embedding and reranking, are not included in the model-cost field. To keep comparisons like-for-like, token counts follow the public task-detail conventions: API tokens are input + output; leaderboard-display tokens are input + output + cache. Model cost remains the most direct budget measure, since it comes from the source-reported cost field rather than a derived token formula.
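
The two token conventions can be written down directly. The sketch below is a minimal illustration that uses the run totals reported in the token-usage table in Section 5; the `TokenUsage` class and its field names are assumptions introduced here, not part of the benchmark harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    """Hypothetical container for the three token classes reported per run."""
    input_tokens: int
    output_tokens: int
    cache_tokens: int

    @property
    def api_tokens(self) -> int:
        # API convention: input + output
        return self.input_tokens + self.output_tokens

    @property
    def display_tokens(self) -> int:
        # Leaderboard-display convention: input + output + cache
        return self.input_tokens + self.output_tokens + self.cache_tokens


sentra = TokenUsage(349_168_710, 5_191_939, 309_178_624)
baseline = TokenUsage(729_230_975, 5_966_373, 392_433_664)

for name, usage in (("sentra", sentra), ("baseline", baseline)):
    print(f"{name}: api={usage.api_tokens:,} display={usage.display_tokens:,}")
```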

4 · Results

Headline comparison

Sentra Code Memory + Codex CLI (GPT-5.5, xhigh) vs. the public baseline.
| Metric | With Sentra | Baseline | Change |
|---|---|---|---|
| Accuracy (mean reward) | 88.31% | 83.37% | +4.94 pts |
| Successful trials | 393 / 445 | 371 / 445 | +22 |
| Tasks solved in all 5 trials | 68 | 63 | +5 |
| Tasks failed in all 5 trials | 3 | 5 | −2 |
| Total model cost | $510.30 | $1,862.98 | −72.6% |
| Cost per task | $5.73 | $20.93 | −72.6% |
| Cost per trial | $1.15 | $4.19 | −72.6% |
| Cost per successful trial | $1.30 | $5.02 | −74.1% |
| API tokens (input + output) | 354.4M | 735.2M | −51.8% |
| Leaderboard-display tokens | 663.5M | 1,127.6M | −41.2% |
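
The derived rows of this comparison follow from four raw totals per configuration. The sketch below recomputes them as a sanity check; it assumes each trial's reward is binary (so mean reward equals successes divided by trials), and the function and variable names are illustrative only.

```python
TRIALS = 445
TASKS = 89

# Raw totals taken from the report.
runs = {
    "sentra": {"successes": 393, "cost": 510.30},
    "baseline": {"successes": 371, "cost": 1862.98},
}


def derive(run):
    """Compute the per-unit metrics from successes and total model cost."""
    return {
        "mean_reward_pct": run["successes"] / TRIALS * 100,
        "cost_per_task": run["cost"] / TASKS,
        "cost_per_trial": run["cost"] / TRIALS,
        "cost_per_success": run["cost"] / run["successes"],
    }


def pct_change(new, old):
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100


sentra = derive(runs["sentra"])
baseline = derive(runs["baseline"])

points_gain = sentra["mean_reward_pct"] - baseline["mean_reward_pct"]
cost_change = pct_change(runs["sentra"]["cost"], runs["baseline"]["cost"])
per_success_change = pct_change(sentra["cost_per_success"], baseline["cost_per_success"])

print(f"accuracy gain: {points_gain:+.2f} pts")
print(f"total cost change: {cost_change:+.1f}%")
print(f"cost per success change: {per_success_change:+.1f}%")
```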

Reliability and consistency

Head-to-head at the task level, the Sentra-enabled configuration improved on 20 tasks, tied on 61, and regressed on 8, for a net gain of 22 successful trials. The outcome distribution shifted toward consistency: more tasks solved in every trial, fewer failing in every trial.

Distribution of successful trials per task (number of tasks in each band).
| Successful trials per task | With Sentra | Baseline |
|---|---|---|
| 5 / 5 | 68 | 63 |
| 4 / 5 | 8 | 8 |
| 3 / 5 | 3 | 4 |
| 2 / 5 | 5 | 3 |
| 1 / 5 | 2 | 6 |
| 0 / 5 | 3 | 5 |
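
The distribution can be cross-checked against the headline totals: the band counts should sum to 89 tasks, and the successes implied by each band should sum to the reported trial counts. A minimal sketch, with illustrative names:

```python
# Number of tasks in each "successful trials out of 5" band, from the table.
distribution = {
    "sentra": {5: 68, 4: 8, 3: 3, 2: 5, 1: 2, 0: 3},
    "baseline": {5: 63, 4: 8, 3: 4, 2: 3, 1: 6, 0: 5},
}


def totals(bands):
    """Return (number of tasks, implied successful trials) for one run."""
    tasks = sum(bands.values())
    successes = sum(wins * count for wins, count in bands.items())
    return tasks, successes


sentra_totals = totals(distribution["sentra"])
baseline_totals = totals(distribution["baseline"])
print(sentra_totals, baseline_totals)
```

Both configurations account for all 89 tasks, and the implied successes match the 393 and 371 successful trials reported in the headline comparison.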

5 · Cost and Token Efficiency

Spending less for more

Model cost comparison.
| Metric | With Sentra | Baseline | Baseline ÷ Sentra |
|---|---|---|---|
| Total model cost | $510.30 | $1,862.98 | 3.65× |
| Cost per task | $5.73 | $20.93 | 3.65× |
| Cost per trial | $1.15 | $4.19 | 3.65× |
| Cost per successful trial | $1.30 | $5.02 | 3.87× |

Read this as: the public baseline spent 3.65× as much model budget to achieve a lower score. Because the Sentra run also succeeded more often, the gap widens on a per-success basis, to 3.87×.
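
The step from 3.65× to 3.87× follows from the success counts alone. Writing C for total model cost and S for successful trials (symbols introduced here for illustration), the per-success ratio is the total-cost ratio scaled by the ratio of successes:

```latex
\[
\frac{C_{\text{base}} / S_{\text{base}}}{C_{\text{sentra}} / S_{\text{sentra}}}
= \frac{C_{\text{base}}}{C_{\text{sentra}}} \cdot \frac{S_{\text{sentra}}}{S_{\text{base}}}
= \frac{1862.98}{510.30} \cdot \frac{393}{371}
\approx 3.65 \times 1.059
\approx 3.87
\]
```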

Token usage by class, totals across all 445 trials.
| Token class | With Sentra | Baseline | Reduction |
|---|---|---|---|
| Input tokens | 349,168,710 | 729,230,975 | −52.1% |
| Output tokens | 5,191,939 | 5,966,373 | −13.0% |
| Cache tokens | 309,178,624 | 392,433,664 | −21.2% |
| API tokens (input + output) | 354,360,649 | 735,197,348 | −51.8% |
| Display tokens (input + output + cache) | 663,539,273 | 1,127,631,012 | −41.2% |

The savings are dominated by input tokens (−52.1%), while output tokens fell far less (−13.0%). This pattern is consistent with the agent ingesting far less repeated context while producing a broadly comparable volume of work.
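
One way to quantify "dominated by input tokens" is to ask what share of the total display-token reduction each class contributes. The sketch below derives those shares from the totals in the table above; the shares are simple arithmetic on the reported figures and do not appear in the source data themselves.

```python
# Token totals across all 445 trials, from the token-usage table.
sentra = {"input": 349_168_710, "output": 5_191_939, "cache": 309_178_624}
baseline = {"input": 729_230_975, "output": 5_966_373, "cache": 392_433_664}

# Absolute tokens saved per class, and each class's share of the total saving.
saved = {cls: baseline[cls] - sentra[cls] for cls in baseline}
total_saved = sum(saved.values())
share = {cls: saved[cls] / total_saved * 100 for cls in saved}

for cls in ("input", "cache", "output"):
    print(f"{cls:>6}: saved {saved[cls]:>12,} tokens ({share[cls]:.1f}% of the reduction)")
```

On these figures, input tokens account for roughly 82% of the display-token reduction and cache tokens for roughly 18%, with output tokens contributing well under 1%.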

6 · Task-Level Highlights

Where the gains concentrate

The largest gains concentrate in tasks that demand heavy environment and repository discovery, where the baseline spent large budgets rebuilding context. On train-fasttext, the Sentra-enabled agent flipped zero baseline successes into two while cutting cost from $138.90 to $14.70; on compile-compcert it reached 5/5 at $13.47 versus $99.89. Of the five tasks the baseline failed in every trial, the Sentra-enabled agent solved four at least once.

Tasks where Sentra Code Memory gained successful trials.
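| Task | Sentra | Baseline | Δ | Sentra cost | Baseline cost |
|---|---|---|---|---|---|
| extract-moves-from-video | 3 | 0 | +3 | $47.89 | $217.35 |
| qemu-alpine-ssh | 5 | 2 | +3 | $6.68 | $29.45 |
| protein-assembly | 5 | 2 | +3 | $7.91 | $23.62 |
| configure-git-webserver | 3 | 0 | +3 | $7.00 | $10.22 |
| extract-elf | 4 | 1 | +3 | $3.33 | $6.94 |
| train-fasttext | 2 | 0 | +2 | $14.70 | $138.90 |
| pypi-server | 5 | 3 | +2 | $2.64 | $7.94 |
| kv-store-grpc | 5 | 3 | +2 | $3.50 | $5.83 |
| install-windows-3.11 | 4 | 3 | +1 | $11.48 | $101.73 |
| compile-compcert | 5 | 4 | +1 | $13.47 | $99.89 |
| video-processing | 2 | 1 | +1 | $8.79 | $25.87 |
| qemu-startup | 5 | 4 | +1 | $4.08 | $22.05 |
| largest-eigenval | 5 | 4 | +1 | $3.69 | $19.36 |
| gcode-to-text | 2 | 1 | +1 | $7.75 | $16.28 |
| dna-assembly | 2 | 1 | +1 | $8.56 | $13.40 |
| pytorch-model-recovery | 4 | 3 | +1 | $3.83 | $12.20 |
| sam-cell-seg | 5 | 4 | +1 | $4.98 | $10.92 |
| torch-tensor-parallelism | 5 | 4 | +1 | $3.37 | $5.15 |
| dna-insert | 1 | 0 | +1 | $5.07 | $4.57 |
| chess-best-move | 5 | 4 | +1 | $2.58 | $3.97 |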

Across the run, the Sentra-enabled agent gained 33 trials on 20 tasks and gave back 11 trials on 8 tasks, for the net improvement of +22. The regressions are concentrated in a small set of tasks and are under analysis ahead of the official benchmark submission.
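
The tally of gained and lost trials can be reproduced from the per-task success counts. The sketch below lists only the 28 tasks whose counts differ between configurations (the remaining 61 are ties and contribute nothing); variable names are illustrative.

```python
# (Sentra successes, baseline successes) out of 5, for tasks that changed.
changed_tasks = {
    "extract-moves-from-video": (3, 0), "qemu-alpine-ssh": (5, 2),
    "protein-assembly": (5, 2), "configure-git-webserver": (3, 0),
    "extract-elf": (4, 1), "train-fasttext": (2, 0),
    "pypi-server": (5, 3), "kv-store-grpc": (5, 3),
    "install-windows-3.11": (4, 3), "compile-compcert": (5, 4),
    "video-processing": (2, 1), "qemu-startup": (5, 4),
    "largest-eigenval": (5, 4), "gcode-to-text": (2, 1),
    "dna-assembly": (2, 1), "pytorch-model-recovery": (4, 3),
    "sam-cell-seg": (5, 4), "torch-tensor-parallelism": (5, 4),
    "dna-insert": (1, 0), "chess-best-move": (5, 4),
    "torch-pipeline-parallelism": (2, 5), "vulnerable-secret": (3, 5),
    "build-cython-ext": (4, 5), "db-wal-recovery": (1, 2),
    "make-doom-for-mips": (0, 1), "make-mips-interpreter": (4, 5),
    "query-optimize": (4, 5), "raman-fitting": (0, 1),
}

deltas = {task: s - b for task, (s, b) in changed_tasks.items()}
improved = [d for d in deltas.values() if d > 0]
regressed = [d for d in deltas.values() if d < 0]

gained = sum(improved)      # trials gained on improved tasks
lost = -sum(regressed)      # trials given back on regressed tasks
net = gained - lost

print(f"gained {gained} trials on {len(improved)} tasks")
print(f"lost {lost} trials on {len(regressed)} tasks")
print(f"net {net:+d}")
```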

All tasks where the public baseline outperformed Sentra.
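| Task | Sentra | Baseline | Δ | Sentra cost | Baseline cost |
|---|---|---|---|---|---|
| torch-pipeline-parallelism | 2 | 5 | −3 | $4.79 | $8.68 |
| vulnerable-secret | 3 | 5 | −2 | $1.74 | $2.91 |
| build-cython-ext | 4 | 5 | −1 | $9.06 | $28.08 |
| db-wal-recovery | 1 | 2 | −1 | $7.19 | $13.40 |
| make-doom-for-mips | 0 | 1 | −1 | $5.23 | $90.98 |
| make-mips-interpreter | 4 | 5 | −1 | $11.31 | $38.08 |
| query-optimize | 4 | 5 | −1 | $7.18 | $29.23 |
| raman-fitting | 0 | 1 | −1 | $7.52 | $17.95 |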

7 · Discussion

Three measures moved together

Improvements on agentic benchmarks typically trade off against cost: higher scores are usually bought with more sampling, longer runs, or heavier reasoning. In this evaluation, the Sentra-enabled configuration scored higher while spending 72.6% less and consuming 41.2% fewer tokens than the public baseline.

The most plausible explanation is reduced rediscovery. Terminal-Bench rewards sustained, multi-step work inside a repository. With task-scoped memory available, the agent recalls context on demand and directs more of its budget toward edits, tests, and verification.

The composition of the token savings supports this reading: input tokens fell by 52.1% while output tokens fell by only 13.0%. The distribution of outcomes suggests the gains reflect greater reliability rather than a handful of fortunate trials — tasks solved in all five trials rose from 63 to 68, and four of the five tasks the baseline never solved were solved at least once with memory available.

8 · Limitations and Verification

Internal evaluation, pending verification

The results in this report were produced by Sentra using the official Terminal-Bench 2.1 task set, harness, and scoring, but they have not yet been verified by the Terminal-Bench team, and Sentra Code Memory does not currently appear on the public leaderboard. Before any leaderboard claim, the configuration will be rerun through the official submission path, and that run will serve as the source of truth.

  • Baseline snapshot. Baseline figures were recomputed from the published per-task trial data for the public Codex CLI + GPT-5.5 (xhigh) entry as of June 2026. Public leaderboard data may be revised over time.
  • Cost scope. Model-cost figures reflect the agent's reported model usage. Sentra-side retrieval costs (embedding and reranking) are not included.
  • Generalization. This evaluation covers a single benchmark and a single base agent and model configuration. Results on other benchmarks, agents, or models may differ.

9 · Availability and Next Steps

What happens next

Sentra will submit this configuration through the official Terminal-Bench 2.1 verification process and will release the complete run trajectories so the community can inspect the agent's behavior directly. Sentra Code Memory itself will be made available shortly. For questions about this report, early access, or partnership inquiries, contact ashwin@sentra.app or visit sentra.app.

Appendix A

Complete per-task results

Per-task results for all 89 Terminal-Bench 2.1 tasks: successful trials out of five for each configuration, the difference, total model cost per task, and total tokens per task under the leaderboard display convention (input + output + cache).

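| Task | Sentra | Base | Δ | Sentra $ | Base $ | Sentra tok | Base tok |
|---|---|---|---|---|---|---|---|
| adaptive-rejection-sampler | 5 | 5 | 0 | $4.94 | $11.89 | 2.6M | 9.2M |
| bn-fit-modify | 5 | 5 | 0 | $3.38 | $3.43 | 2.4M | 1.0M |
| break-filter-js-from-html | 5 | 5 | 0 | $4.21 | $15.06 | 3.7M | 6.1M |
| build-cython-ext | 4 | 5 | −1 | $9.06 | $28.08 | 16.5M | 22.0M |
| build-pmars | 5 | 5 | 0 | $6.86 | $11.82 | 10.3M | 9.5M |
| build-pov-ray | 5 | 5 | 0 | $6.36 | $14.90 | 8.1M | 12.9M |
| caffe-cifar-10 | 5 | 5 | 0 | $15.04 | $64.14 | 35.8M | 62.4M |
| cancel-async-tasks | 5 | 5 | 0 | $1.78 | $2.61 | 0.8M | 1.0M |
| chess-best-move | 5 | 4 | +1 | $2.58 | $3.97 | 1.2M | 1.5M |
| circuit-fibsqrt | 5 | 5 | 0 | $4.65 | $8.81 | 3.5M | 4.0M |
| cobol-modernization | 5 | 5 | 0 | $4.44 | $12.47 | 4.2M | 4.7M |
| code-from-image | 5 | 5 | 0 | $1.08 | $1.04 | 0.6M | 0.6M |
| compile-compcert | 5 | 4 | +1 | $13.47 | $99.89 | 27.9M | 72.1M |
| configure-git-webserver | 3 | 0 | +3 | $7.00 | $10.22 | 8.0M | 9.8M |
| constraints-scheduling | 5 | 5 | 0 | $2.59 | $2.27 | 1.3M | 0.9M |
| count-dataset-tokens | 5 | 5 | 0 | $4.86 | $5.17 | 4.8M | 3.3M |
| crack-7z-hash | 5 | 5 | 0 | $5.33 | $8.25 | 7.7M | 9.4M |
| custom-memory-heap-crash | 5 | 5 | 0 | $4.63 | $11.75 | 6.6M | 10.1M |
| db-wal-recovery | 1 | 2 | −1 | $7.19 | $13.40 | 8.0M | 9.6M |
| distribution-search | 5 | 5 | 0 | $1.89 | $2.98 | 1.3M | 1.1M |
| dna-assembly | 2 | 1 | +1 | $8.56 | $13.40 | 6.5M | 4.7M |
| dna-insert | 1 | 0 | +1 | $5.07 | $4.57 | 5.0M | 1.9M |
| extract-elf | 4 | 1 | +3 | $3.33 | $6.94 | 3.0M | 3.2M |
| extract-moves-from-video | 3 | 0 | +3 | $47.89 | $217.35 | 87.8M | 99.9M |
| feal-differential-cryptanalysis | 5 | 5 | 0 | $3.21 | $9.06 | 2.5M | 5.3M |
| feal-linear-cryptanalysis | 5 | 5 | 0 | $4.01 | $5.27 | 2.9M | 2.3M |
| filter-js-from-html | 0 | 0 | 0 | $4.06 | $7.34 | 2.5M | 3.5M |
| financial-document-processor | 5 | 5 | 0 | $8.38 | $12.42 | 8.1M | 7.3M |
| fix-code-vulnerability | 5 | 5 | 0 | $3.99 | $3.03 | 4.6M | 2.5M |
| fix-git | 5 | 5 | 0 | $2.74 | $3.71 | 2.9M | 2.8M |
| fix-ocaml-gc | 5 | 5 | 0 | $8.51 | $48.92 | 13.8M | 88.7M |
| gcode-to-text | 2 | 1 | +1 | $7.75 | $16.28 | 8.1M | 9.9M |
| git-leak-recovery | 5 | 5 | 0 | $1.74 | $2.02 | 1.3M | 1.5M |
| git-multibranch | 5 | 5 | 0 | $3.79 | $9.88 | 3.8M | 5.4M |
| gpt2-codegolf | 5 | 5 | 0 | $8.97 | $23.50 | 9.3M | 14.0M |
| headless-terminal | 5 | 5 | 0 | $2.47 | $4.38 | 1.2M | 2.1M |
| hf-model-inference | 5 | 5 | 0 | $2.82 | $7.46 | 2.3M | 5.9M |
| install-windows-3.11 | 4 | 3 | +1 | $11.48 | $101.73 | 17.7M | 69.3M |
| kv-store-grpc | 5 | 3 | +2 | $3.50 | $5.83 | 3.5M | 3.6M |
| large-scale-text-editing | 5 | 5 | 0 | $2.14 | $8.36 | 1.6M | 4.5M |
| largest-eigenval | 5 | 4 | +1 | $3.69 | $19.36 | 3.1M | 12.2M |
| llm-inference-batching-scheduler | 5 | 5 | 0 | $5.13 | $14.79 | 4.0M | 4.9M |
| log-summary-date-ranges | 5 | 5 | 0 | $1.74 | $1.61 | 1.3M | 0.6M |
| mailman | 5 | 5 | 0 | $11.35 | $51.99 | 20.7M | 31.2M |
| make-doom-for-mips | 0 | 1 | −1 | $5.23 | $90.98 | 5.9M | 80.5M |
| make-mips-interpreter | 4 | 5 | −1 | $11.31 | $38.08 | 14.7M | 25.2M |
| mcmc-sampling-stan | 5 | 5 | 0 | $12.20 | $45.05 | 26.9M | 26.6M |
| merge-diff-arc-agi-task | 5 | 5 | 0 | $3.58 | $10.82 | 3.7M | 5.2M |
| model-extraction-relu-logits | 5 | 5 | 0 | $4.53 | $10.69 | 3.0M | 2.7M |
| modernize-scientific-stack | 5 | 5 | 0 | $1.11 | $1.17 | 0.8M | 0.5M |
| mteb-leaderboard | 5 | 5 | 0 | $14.41 | $40.99 | 33.0M | 29.9M |
| mteb-retrieve | 5 | 5 | 0 | $2.60 | $6.16 | 2.7M | 3.5M |
| multi-source-data-merger | 5 | 5 | 0 | $2.07 | $2.18 | 1.5M | 0.7M |
| nginx-request-logging | 5 | 5 | 0 | $2.27 | $3.16 | 2.4M | 1.1M |
| openssl-selfsigned-cert | 5 | 5 | 0 | $0.98 | $1.65 | 0.6M | 0.6M |
| overfull-hbox | 4 | 4 | 0 | $4.47 | $13.90 | 4.1M | 8.0M |
| password-recovery | 5 | 5 | 0 | $3.44 | $6.98 | 3.2M | 2.4M |
| path-tracing | 5 | 5 | 0 | $5.39 | $42.08 | 6.4M | 24.1M |
| path-tracing-reverse | 5 | 5 | 0 | $8.68 | $20.27 | 14.3M | 10.9M |
| polyglot-c-py | 5 | 5 | 0 | $3.01 | $4.82 | 2.2M | 2.3M |
| polyglot-rust-c | 5 | 5 | 0 | $5.36 | $11.50 | 3.5M | 3.6M |
| portfolio-optimization | 5 | 5 | 0 | $1.53 | $9.38 | 1.1M | 3.1M |
| protein-assembly | 5 | 2 | +3 | $7.91 | $23.62 | 7.9M | 8.7M |
| prove-plus-comm | 5 | 5 | 0 | $0.85 | $1.68 | 0.7M | 0.7M |
| pypi-server | 5 | 3 | +2 | $2.64 | $7.94 | 2.7M | 2.9M |
| pytorch-model-cli | 5 | 5 | 0 | $3.44 | $10.22 | 2.7M | 5.2M |
| pytorch-model-recovery | 4 | 3 | +1 | $3.83 | $12.20 | 3.3M | 6.2M |
| qemu-alpine-ssh | 5 | 2 | +3 | $6.68 | $29.45 | 8.9M | 15.4M |
| qemu-startup | 5 | 4 | +1 | $4.08 | $22.05 | 4.6M | 10.2M |
| query-optimize | 4 | 5 | −1 | $7.18 | $29.23 | 8.3M | 11.9M |
| raman-fitting | 0 | 1 | −1 | $7.52 | $17.95 | 6.4M | 5.6M |
| regex-chess | 5 | 5 | 0 | $7.61 | $21.58 | 6.0M | 10.7M |
| regex-log | 5 | 5 | 0 | $2.34 | $2.45 | 1.5M | 0.9M |
| reshard-c4-data | 5 | 5 | 0 | $5.25 | $16.88 | 5.6M | 11.8M |
| rstan-to-pystan | 5 | 5 | 0 | $10.04 | $32.73 | 18.6M | 13.3M |
| sam-cell-seg | 5 | 4 | +1 | $4.98 | $10.92 | 4.1M | 3.8M |
| sanitize-git-repo | 4 | 4 | 0 | $4.03 | $33.08 | 3.3M | 13.7M |
| schemelike-metacircular-eval | 5 | 5 | 0 | $5.92 | $39.66 | 5.6M | 17.0M |
| sparql-university | 5 | 5 | 0 | $3.54 | $11.55 | 3.2M | 5.8M |
| sqlite-db-truncate | 5 | 5 | 0 | $2.03 | $2.75 | 1.2M | 0.8M |
| sqlite-with-gcov | 5 | 5 | 0 | $2.37 | $8.55 | 2.2M | 5.8M |
| torch-pipeline-parallelism | 2 | 5 | −3 | $4.79 | $8.68 | 3.0M | 2.5M |
| torch-tensor-parallelism | 5 | 4 | +1 | $3.37 | $5.15 | 2.1M | 1.4M |
| train-fasttext | 2 | 0 | +2 | $14.70 | $138.90 | 29.0M | 60.4M |
| tune-mjcf | 5 | 5 | 0 | $3.79 | $18.51 | 4.4M | 14.2M |
| video-processing | 2 | 1 | +1 | $8.79 | $25.87 | 9.7M | 13.6M |
| vulnerable-secret | 3 | 5 | −2 | $1.74 | $2.91 | 1.4M | 1.0M |
| winning-avg-corewars | 5 | 5 | 0 | $8.25 | $32.42 | 12.8M | 13.0M |
| write-compressor | 5 | 5 | 0 | $2.81 | $4.84 | 1.5M | 1.8M |
| All 89 tasks | 393 | 371 | +22 | $510.30 | $1,862.98 | 663.5M | 1,127.6M |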

© 2026 Dynamis Labs Inc. All rights reserved.