Sentra Code Memory on Terminal-Bench 2.1

Measuring the effect of a task-scoped code memory layer on the accuracy, cost, and token efficiency of a frontier coding agent.

Results are from Sentra's internal evaluation using the official Terminal-Bench 2.1 task set and harness, pending official benchmark verification (see Limitations). Baseline figures were recomputed from the published per-task trial data for the public Codex CLI + GPT-5.5 (xhigh) entry.

Executive Summary

Every headline measure improved at once

88.31% mean reward · +4.94 pts vs. baseline

$510.30 total model cost · 72.6% lower

663.5M display tokens · 41.2% fewer

68 / 89 tasks solved in all five trials

Sentra evaluated Codex CLI running GPT-5.5 at xhigh reasoning effort — the configuration that currently leads the public Terminal-Bench 2.1 leaderboard — with Sentra Code Memory made available to the agent as a memory tool. The evaluation covered all 89 Terminal-Bench 2.1 tasks at the standard five trials per task, for 445 trials in total.

Accuracy. The Sentra-enabled agent succeeded on 393 of 445 trials, a mean reward of 88.31%, versus 371 of 445 (83.37%) for the published baseline — an improvement of 4.94 points and 22 additional successful trials.
Cost. Total model cost was $510.30, versus $1,862.98 for the baseline, a 72.6% reduction. Cost per successful trial fell from $5.02 to $1.30.
Tokens. The run consumed 663.5 million tokens under the leaderboard's display convention (input + output + cache), versus 1.128 billion for the baseline, a 41.2% reduction.
Consistency. Tasks solved in all five trials rose from 63 to 68, and tasks that failed in all five fell from 5 to 3.

The base agent, model, reasoning effort, benchmark harness, and scoring were identical across the two configurations. The only change was the availability of a task-scoped code memory layer, which the agent could query for relevant development context instead of repeatedly rediscovering it.

1 · Leaderboard Context

Where the result sits

At 88.31% mean reward, the Sentra-enabled configuration scores 4.94 points above the highest published entry. Because this result has not yet passed official verification, it is presented as an internal evaluation rather than a leaderboard ranking.

Terminal-Bench 2.1 leaderboard — mean reward, k = 5.
#	Agent	Model	Accuracy
-	Sentra Code Memory + Codex CLI (internal eval)	GPT-5.5 · xhigh	88.31%
1	Codex CLI 0.125.0	GPT-5.5	83.4% ± 2.2 pp
2	Claude Code 2.1.152	Claude Opus 4.8	78.9% ± 2.5 pp
3	Terminus 2 2.0.0	GPT-5.5	78.2% ± 2.4 pp
4	Terminus 2 2.0.0	Claude Opus 4.8	74.6% ± 2.5 pp
5	Terminus 2 2.0.0	Gemini 3 Pro	74.4% ± 2.6 pp
6	Gemini CLI 0.40.0	Gemini 3.1 Pro	70.7% ± 3.0 pp
7	Terminus 2 2.0.0	Gemini 3.1 Pro	70.3% ± 3.0 pp
8	Claude Code 2.1.123	Claude Opus 4.7	69.7% ± 2.8 pp

2 · Sentra Code Memory

A memory layer, not a bigger window

What it is

Sentra builds memory infrastructure for teams and AI agents: a shared memory system that captures interactions, decisions, and evidence into a queryable structure, keeps answers connected to their supporting evidence, and tracks how information changes over time. Sentra Code Memory applies this capability to coding agents. It gives an agent a task-scoped memory of development state — repository structure, relevant code context, tool activity, file changes, test signals, and continuation context — exposed through a CLI, an SDK, and a local API. Rather than repeatedly re-scanning a repository to rebuild context, the agent retrieves compact, relevant development state at the moment it needs it.

Memory is not a bigger context window

A larger context window gives a model more room, but it does not decide what should be remembered, when a fact has gone stale, which evidence matters, or how to separate durable task state from incidental output. Sentra Code Memory operates as a memory layer around the coding workflow rather than as additional raw capacity. The model continues to write code, run commands, and reason about failures; the memory layer provides a scoped recall channel the agent can query as it works.

Why this should matter on Terminal-Bench

Terminal-Bench tasks typically demand repeated repository inspection, build-system discovery, test interpretation, and incremental debugging. Without a memory layer, an agent pays model tokens to rediscover context it has already seen, often several times within a single run. With task-scoped memory available, the agent can preserve and recall relevant state across the run, reducing redundant discovery and leaving more of its budget for edits and verification.

3 · Evaluation Methodology

Configuration and accounting

Evaluation configuration.
Dataset	terminal-bench-2-1 (official task set)
Tasks	89
Trials per task	5 (k = 5, per the leaderboard protocol)
Total trials	445
Base agent	Codex CLI
Model	GPT-5.5
Reasoning effort	xhigh
Memory layer	Sentra Code Memory, as a task-scoped tool
Harness	Harbor, with Docker task environments
Scoring	Official Terminal-Bench reward per trial
Baseline	Public Codex CLI + GPT-5.5 (xhigh), v0.125.0

Memory isolation

Each task ran with isolated memory state. The task repository was indexed before the agent loop began, and an index watcher kept the memory current as the agent edited files during the run. Memory visible to the agent was scoped to the current task's repository rather than to any shared or global workspace.
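The isolation model described above can be pictured with a toy sketch. This is not Sentra's CLI, SDK, or API: the `TaskMemory` class, its methods, and the substring lookup are all hypothetical and exist only to illustrate the idea of one index per task, built before the agent loop and refreshed as files change. The sketch polls for changes by content hash where the real system uses an index watcher, and it matches plain substrings where the real system uses embedding and reranking.

```python
import hashlib
import tempfile
from pathlib import Path


class TaskMemory:
    """Hypothetical per-task index: file path -> text, refreshed on change."""

    def __init__(self, task_id: str, repo_root: Path) -> None:
        self.task_id = task_id
        self.repo_root = repo_root
        self._digests: dict[Path, str] = {}
        self._texts: dict[Path, str] = {}
        self.refresh()  # index the repository before the agent loop starts

    def refresh(self) -> list[Path]:
        """Re-index files whose content changed; return the updated paths."""
        updated: list[Path] = []
        seen: set[Path] = set()
        for path in sorted(self.repo_root.rglob("*")):
            if not path.is_file():
                continue
            seen.add(path)
            data = path.read_bytes()
            digest = hashlib.sha256(data).hexdigest()
            if self._digests.get(path) != digest:
                self._digests[path] = digest
                self._texts[path] = data.decode("utf-8", errors="replace")
                updated.append(path)
        # Forget files that were deleted since the last refresh.
        for path in set(self._digests) - seen:
            del self._digests[path]
            del self._texts[path]
        return updated

    def query(self, needle: str) -> list[Path]:
        """Return indexed files whose text contains the query string."""
        return sorted(p for p, text in self._texts.items() if needle in text)


# Two tasks get separate memories, so nothing is shared between them.
with tempfile.TemporaryDirectory() as tmp:
    repo_a = Path(tmp) / "task-a"
    repo_b = Path(tmp) / "task-b"
    repo_a.mkdir()
    repo_b.mkdir()
    (repo_a / "build.sh").write_text("make all\n")
    (repo_b / "train.py").write_text("print('training')\n")

    mem_a = TaskMemory("task-a", repo_a)
    mem_b = TaskMemory("task-b", repo_b)

    # The agent edits a file mid-run; a refresh picks up the change.
    (repo_a / "build.sh").write_text("make all test\n")
    changed = mem_a.refresh()

    print([p.name for p in changed])              # files re-indexed after the edit
    print([p.name for p in mem_a.query("test")])  # task-a sees its updated file
    print(mem_b.query("make"))                    # task-b has no view of task-a
```

The point of the sketch is the scoping: each `TaskMemory` instance is rooted at one task's repository, so a query issued during one task can only return state from that task.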

Cost and token accounting

Model-cost figures reflect the source-reported model cost from the run artifacts. Sentra-side retrieval costs, limited to embedding and reranking, are not included in the model-cost field. To keep comparisons like-for-like, token counts follow the public task-detail conventions: API tokens are input + output; leaderboard-display tokens are input + output + cache. Model cost remains the most direct budget measure, since it comes from the source-reported cost field rather than a derived token formula.
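The two token conventions can be checked with a short sketch. It uses the per-class token totals reported in the token-usage table in Section 5; the function and variable names are illustrative only.

```python
# Token totals across all 445 trials, as reported in Section 5.
SENTRA = {"input": 349_168_710, "output": 5_191_939, "cache": 309_178_624}
BASELINE = {"input": 729_230_975, "output": 5_966_373, "cache": 392_433_664}


def api_tokens(counts: dict[str, int]) -> int:
    """API tokens: input + output."""
    return counts["input"] + counts["output"]


def display_tokens(counts: dict[str, int]) -> int:
    """Leaderboard-display tokens: input + output + cache."""
    return api_tokens(counts) + counts["cache"]


def pct_change(new: float, old: float) -> float:
    """Signed percentage change from old to new, rounded to one decimal."""
    return round((new - old) / old * 100, 1)


for label, convention in (("API", api_tokens), ("display", display_tokens)):
    sentra, baseline = convention(SENTRA), convention(BASELINE)
    print(f"{label:>7}: {sentra:,} vs {baseline:,} "
          f"({pct_change(sentra, baseline):+.1f}%)")
```

Under either convention the comparison is like-for-like, because the same formula is applied to both runs.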

4 · Results

Headline comparison

Sentra Code Memory + Codex CLI (GPT-5.5, xhigh) vs. the public baseline.
Metric	With Sentra	Baseline	Change
Accuracy (mean reward)	88.31%	83.37%	+4.94 pts
Successful trials	393 / 445	371 / 445	+22
Tasks solved in all 5 trials	68	63	+5
Tasks failed in all 5 trials	3	5	−2
Total model cost	$510.30	$1,862.98	−72.6%
Cost per task	$5.73	$20.93	−72.6%
Cost per trial	$1.15	$4.19	−72.6%
Cost per successful trial	$1.30	$5.02	−74.1%
API tokens (input + output)	354.4M	735.2M	−51.8%
Leaderboard-display tokens	663.5M	1,127.6M	−41.2%
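The accuracy and cost rows in this table follow from four totals: successful trials and total model cost for each run. The sketch below recomputes them, assuming (as the reported figures imply) that reward is 1 for a successful trial and 0 otherwise, so that mean reward equals successes divided by trials. Names are illustrative.

```python
TASKS = 89
TRIALS = 445  # 89 tasks x 5 trials

# (successful trials, total model cost in USD), as reported in the table.
SENTRA = (393, 510.30)
BASELINE = (371, 1862.98)


def summarize(successes: int, cost: float) -> dict[str, float]:
    """Derive the per-unit headline figures from run totals."""
    return {
        "mean_reward_pct": round(successes / TRIALS * 100, 2),
        "cost_per_task": round(cost / TASKS, 2),
        "cost_per_trial": round(cost / TRIALS, 2),
        "cost_per_success": round(cost / successes, 2),
    }


def pct_change(new: float, old: float) -> float:
    """Signed percentage change from old to new, rounded to one decimal."""
    return round((new - old) / old * 100, 1)


sentra = summarize(*SENTRA)
baseline = summarize(*BASELINE)

for name in sentra:
    print(f"{name:>18}: {sentra[name]:>8.2f} vs {baseline[name]:>8.2f}")

accuracy_gain = round(sentra["mean_reward_pct"] - baseline["mean_reward_pct"], 2)
total_cost_change = pct_change(SENTRA[1], BASELINE[1])
per_success_change = pct_change(SENTRA[1] / SENTRA[0], BASELINE[1] / BASELINE[0])

print("accuracy gain (pts):", accuracy_gain)
print("total cost change (%):", total_cost_change)
print("cost per success change (%):", per_success_change)
```

Cost per task and cost per trial change by the same percentage as total cost because both runs share the same denominators (89 tasks, 445 trials); only cost per successful trial moves differently, since the two runs have different success counts.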

Reliability and consistency

Head-to-head at the task level, the Sentra-enabled configuration improved on 20 tasks, tied on 61, and regressed on 8, for a net gain of 22 successful trials. The outcome distribution shifted toward consistency: more tasks solved in every trial, fewer failing in every trial.

Distribution of successful trials per task (number of tasks in each band).
Successful trials per task	With Sentra	Baseline
5 / 5	68	63
4 / 5	8	8
3 / 5	3	4
2 / 5	5	3
1 / 5	2	6
0 / 5	3	5
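As a consistency check, the band counts above should account for all 89 tasks and reproduce the successful-trial totals of each run. A minimal sketch, with illustrative names:

```python
# successes per task -> (tasks with Sentra, tasks with baseline), from the table.
BANDS = {5: (68, 63), 4: (8, 8), 3: (3, 4), 2: (5, 3), 1: (2, 6), 0: (3, 5)}


def totals(column: int) -> tuple[int, int]:
    """Return (number of tasks, number of successful trials) for one run."""
    tasks = sum(counts[column] for counts in BANDS.values())
    successes = sum(k * counts[column] for k, counts in BANDS.items())
    return tasks, successes


sentra_tasks, sentra_successes = totals(0)
baseline_tasks, baseline_successes = totals(1)

print("Sentra:  ", sentra_tasks, "tasks,", sentra_successes, "successful trials")
print("Baseline:", baseline_tasks, "tasks,", baseline_successes, "successful trials")
```

Both columns sum to 89 tasks, and the weighted sums match the 393 and 371 successful trials reported in the headline comparison.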

5 · Cost and Token Efficiency

Spending less for more

Model cost comparison.
Metric	With Sentra	Baseline	Baseline ÷ Sentra
Total model cost	$510.30	$1,862.98	3.65×
Cost per task	$5.73	$20.93	3.65×
Cost per trial	$1.15	$4.19	3.65×
Cost per successful trial	$1.30	$5.02	3.87×

Read this as: the public baseline spent 3.65× as much model budget to achieve a lower score. Because the Sentra run also succeeded more often, the gap widens on a per-success basis, to 3.87×.
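The widening on a per-success basis follows directly from the reported totals. Writing C for total model cost and S for successful trials, with subscripts b for the baseline and s for the Sentra run (symbols introduced here for illustration only):

```latex
\[
\frac{C_b}{C_s} = \frac{1862.98}{510.30} \approx 3.65,
\qquad
\frac{C_b / S_b}{C_s / S_s}
  = \frac{C_b}{C_s} \cdot \frac{S_s}{S_b}
  \approx 3.65 \times \frac{393}{371}
  \approx 3.87 .
\]
```

The same ratio also recovers the headline reduction: 1 − 1/3.65 ≈ 0.726, the 72.6% lower total model cost.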

Token usage by class, totals across all 445 trials.
Token class	With Sentra	Baseline	Reduction
Input tokens	349,168,710	729,230,975	−52.1%
Output tokens	5,191,939	5,966,373	−13.0%
Cache tokens	309,178,624	392,433,664	−21.2%
API tokens (input + output)	354,360,649	735,197,348	−51.8%
Display tokens (input + output + cache)	663,539,273	1,127,631,012	−41.2%

The savings are dominated by input tokens (−52.1%), while output tokens fell far less (−13.0%). This pattern is consistent with the agent ingesting far less repeated context while producing a comparable volume of work.

6 · Task-Level Highlights

Where the gains concentrate

The largest gains concentrate in tasks that demand heavy environment and repository discovery, where the baseline spent large budgets rebuilding context. On train-fasttext, the Sentra-enabled agent succeeded in two of five trials where the baseline succeeded in none, while cutting cost from $138.90 to $14.70; on compile-compcert it reached 5/5 (baseline: 4/5) at $13.47 versus $99.89. Of the five tasks the baseline failed in every trial, the Sentra-enabled agent solved four at least once.

Tasks where Sentra Code Memory gained successful trials.
Task	Sentra	Baseline	Δ	Sentra cost	Baseline cost
extract-moves-from-video	3	0	+3	$47.89	$217.35
qemu-alpine-ssh	5	2	+3	$6.68	$29.45
protein-assembly	5	2	+3	$7.91	$23.62
configure-git-webserver	3	0	+3	$7.00	$10.22
extract-elf	4	1	+3	$3.33	$6.94
train-fasttext	2	0	+2	$14.70	$138.90
pypi-server	5	3	+2	$2.64	$7.94
kv-store-grpc	5	3	+2	$3.50	$5.83
install-windows-3.11	4	3	+1	$11.48	$101.73
compile-compcert	5	4	+1	$13.47	$99.89
video-processing	2	1	+1	$8.79	$25.87
qemu-startup	5	4	+1	$4.08	$22.05
largest-eigenval	5	4	+1	$3.69	$19.36
gcode-to-text	2	1	+1	$7.75	$16.28
dna-assembly	2	1	+1	$8.56	$13.40
pytorch-model-recovery	4	3	+1	$3.83	$12.20
sam-cell-seg	5	4	+1	$4.98	$10.92
torch-tensor-parallelism	5	4	+1	$3.37	$5.15
dna-insert	1	0	+1	$5.07	$4.57
chess-best-move	5	4	+1	$2.58	$3.97

Across the run, the Sentra-enabled agent gained 33 trials on 20 tasks and gave back 11 trials on 8 tasks, for the net improvement of +22. The regressions are concentrated in a small set of tasks and are under analysis ahead of the official benchmark submission.

All tasks where the public baseline outperformed Sentra.
Task	Sentra	Baseline	Δ	Sentra cost	Baseline cost
torch-pipeline-parallelism	2	5	-3	$4.79	$8.68
vulnerable-secret	3	5	-2	$1.74	$2.91
build-cython-ext	4	5	-1	$9.06	$28.08
db-wal-recovery	1	2	-1	$7.19	$13.40
make-doom-for-mips	0	1	-1	$5.23	$90.98
make-mips-interpreter	4	5	-1	$11.31	$38.08
query-optimize	4	5	-1	$7.18	$29.23
raman-fitting	0	1	-1	$7.52	$17.95
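The trial-level tally quoted above (33 gained on 20 tasks, 11 given back on 8, net +22) can be reproduced from the two tables in this section. The sketch below copies the successful-trial counts for the 28 tasks that differ between the runs; variable names are illustrative.

```python
# Successful trials out of five as (Sentra, baseline) for every task whose
# count differs between the two runs, copied from the two tables above.
RESULTS = {
    "extract-moves-from-video": (3, 0),
    "qemu-alpine-ssh": (5, 2),
    "protein-assembly": (5, 2),
    "configure-git-webserver": (3, 0),
    "extract-elf": (4, 1),
    "train-fasttext": (2, 0),
    "pypi-server": (5, 3),
    "kv-store-grpc": (5, 3),
    "install-windows-3.11": (4, 3),
    "compile-compcert": (5, 4),
    "video-processing": (2, 1),
    "qemu-startup": (5, 4),
    "largest-eigenval": (5, 4),
    "gcode-to-text": (2, 1),
    "dna-assembly": (2, 1),
    "pytorch-model-recovery": (4, 3),
    "sam-cell-seg": (5, 4),
    "torch-tensor-parallelism": (5, 4),
    "dna-insert": (1, 0),
    "chess-best-move": (5, 4),
    "torch-pipeline-parallelism": (2, 5),
    "vulnerable-secret": (3, 5),
    "build-cython-ext": (4, 5),
    "db-wal-recovery": (1, 2),
    "make-doom-for-mips": (0, 1),
    "make-mips-interpreter": (4, 5),
    "query-optimize": (4, 5),
    "raman-fitting": (0, 1),
}
TOTAL_TASKS = 89

deltas = {task: s - b for task, (s, b) in RESULTS.items()}
improved = sum(1 for d in deltas.values() if d > 0)
regressed = sum(1 for d in deltas.values() if d < 0)
tied = TOTAL_TASKS - improved - regressed  # tasks absent from both tables
gained = sum(d for d in deltas.values() if d > 0)
lost = -sum(d for d in deltas.values() if d < 0)

print(f"improved on {improved} tasks, tied on {tied}, regressed on {regressed}")
print(f"gained {gained} trials, gave back {lost}, net {gained - lost:+d}")
```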

7 · Discussion

Three measures moved together

Improvements on agentic benchmarks typically trade off against cost: higher scores are usually bought with more sampling, longer runs, or heavier reasoning. In this evaluation, the Sentra-enabled configuration scored higher while spending 72.6% less and consuming 41.2% fewer tokens than the public baseline.

The most plausible explanation is reduced rediscovery. Terminal-Bench rewards sustained, multi-step work inside a repository. With task-scoped memory available, the agent recalls context on demand and directs more of its budget toward edits, tests, and verification.

The composition of the token savings supports this reading: input tokens fell by 52.1% while output tokens fell by only 13.0%. The distribution of outcomes suggests the gains reflect greater reliability rather than a handful of fortunate trials — tasks solved in all five trials rose from 63 to 68, and four of the five tasks the baseline never solved were solved at least once with memory available.

8 · Limitations and Verification

Internal evaluation, pending verification

The results in this report were produced by Sentra using the official Terminal-Bench 2.1 task set, harness, and scoring, but they have not yet been verified by the Terminal-Bench team, and Sentra Code Memory does not currently appear on the public leaderboard. Before any leaderboard claim, the configuration will be rerun through the official submission path, and that run will serve as the source of truth.

Baseline snapshot. Baseline figures were recomputed from the published per-task trial data for the public Codex CLI + GPT-5.5 (xhigh) entry as of June 2026. Public leaderboard data may be revised over time.
Cost scope. Model-cost figures reflect the agent's reported model usage. Sentra-side retrieval costs (embedding and reranking) are not included.
Generalization. This evaluation covers a single benchmark and a single base agent and model configuration. Results on other benchmarks, agents, or models may differ.

9 · Availability and Next Steps

What happens next

Sentra will submit this configuration through the official Terminal-Bench 2.1 verification process and will release the complete run trajectories so the community can inspect the agent's behavior directly. Sentra Code Memory itself will be made available shortly. For questions about this report, early access, or partnership inquiries, contact ashwin@sentra.app or visit sentra.app.

References

Terminal-Bench 2.1 leaderboard. tbench.ai/leaderboard/terminal-bench/2.1
Public Codex CLI + GPT-5.5 (xhigh) task-detail pages, Terminal-Bench 2.1.
Terminal-Bench benchmark index. tbench.ai/benchmarks
Sentra. sentra.app
Sentra Terminal-Bench 2.1 run artifacts (445 trial records; trajectories to be released publicly).

Appendix A

Complete per-task results

Per-task results for all 89 Terminal-Bench 2.1 tasks: successful trials out of five for each configuration, the difference, total model cost per task, and total tokens per task under the leaderboard display convention (input + output + cache).

Task	Sentra	Base	Δ	Sentra $	Base $	Sentra tok	Base tok
adaptive-rejection-sampler	5	5	0	$4.94	$11.89	2.6M	9.2M
bn-fit-modify	5	5	0	$3.38	$3.43	2.4M	1.0M
break-filter-js-from-html	5	5	0	$4.21	$15.06	3.7M	6.1M
build-cython-ext	4	5	-1	$9.06	$28.08	16.5M	22.0M
build-pmars	5	5	0	$6.86	$11.82	10.3M	9.5M
build-pov-ray	5	5	0	$6.36	$14.90	8.1M	12.9M
caffe-cifar-10	5	5	0	$15.04	$64.14	35.8M	62.4M
cancel-async-tasks	5	5	0	$1.78	$2.61	0.8M	1.0M
chess-best-move	5	4	+1	$2.58	$3.97	1.2M	1.5M
circuit-fibsqrt	5	5	0	$4.65	$8.81	3.5M	4.0M
cobol-modernization	5	5	0	$4.44	$12.47	4.2M	4.7M
code-from-image	5	5	0	$1.08	$1.04	0.6M	0.6M
compile-compcert	5	4	+1	$13.47	$99.89	27.9M	72.1M
configure-git-webserver	3	0	+3	$7.00	$10.22	8.0M	9.8M
constraints-scheduling	5	5	0	$2.59	$2.27	1.3M	0.9M
count-dataset-tokens	5	5	0	$4.86	$5.17	4.8M	3.3M
crack-7z-hash	5	5	0	$5.33	$8.25	7.7M	9.4M
custom-memory-heap-crash	5	5	0	$4.63	$11.75	6.6M	10.1M
db-wal-recovery	1	2	-1	$7.19	$13.40	8.0M	9.6M
distribution-search	5	5	0	$1.89	$2.98	1.3M	1.1M
dna-assembly	2	1	+1	$8.56	$13.40	6.5M	4.7M
dna-insert	1	0	+1	$5.07	$4.57	5.0M	1.9M
extract-elf	4	1	+3	$3.33	$6.94	3.0M	3.2M
extract-moves-from-video	3	0	+3	$47.89	$217.35	87.8M	99.9M
feal-differential-cryptanalysis	5	5	0	$3.21	$9.06	2.5M	5.3M
feal-linear-cryptanalysis	5	5	0	$4.01	$5.27	2.9M	2.3M
filter-js-from-html	0	0	0	$4.06	$7.34	2.5M	3.5M
financial-document-processor	5	5	0	$8.38	$12.42	8.1M	7.3M
fix-code-vulnerability	5	5	0	$3.99	$3.03	4.6M	2.5M
fix-git	5	5	0	$2.74	$3.71	2.9M	2.8M
fix-ocaml-gc	5	5	0	$8.51	$48.92	13.8M	88.7M
gcode-to-text	2	1	+1	$7.75	$16.28	8.1M	9.9M
git-leak-recovery	5	5	0	$1.74	$2.02	1.3M	1.5M
git-multibranch	5	5	0	$3.79	$9.88	3.8M	5.4M
gpt2-codegolf	5	5	0	$8.97	$23.50	9.3M	14.0M
headless-terminal	5	5	0	$2.47	$4.38	1.2M	2.1M
hf-model-inference	5	5	0	$2.82	$7.46	2.3M	5.9M
install-windows-3.11	4	3	+1	$11.48	$101.73	17.7M	69.3M
kv-store-grpc	5	3	+2	$3.50	$5.83	3.5M	3.6M
large-scale-text-editing	5	5	0	$2.14	$8.36	1.6M	4.5M
largest-eigenval	5	4	+1	$3.69	$19.36	3.1M	12.2M
llm-inference-batching-scheduler	5	5	0	$5.13	$14.79	4.0M	4.9M
log-summary-date-ranges	5	5	0	$1.74	$1.61	1.3M	0.6M
mailman	5	5	0	$11.35	$51.99	20.7M	31.2M
make-doom-for-mips	0	1	-1	$5.23	$90.98	5.9M	80.5M
make-mips-interpreter	4	5	-1	$11.31	$38.08	14.7M	25.2M
mcmc-sampling-stan	5	5	0	$12.20	$45.05	26.9M	26.6M
merge-diff-arc-agi-task	5	5	0	$3.58	$10.82	3.7M	5.2M
model-extraction-relu-logits	5	5	0	$4.53	$10.69	3.0M	2.7M
modernize-scientific-stack	5	5	0	$1.11	$1.17	0.8M	0.5M
mteb-leaderboard	5	5	0	$14.41	$40.99	33.0M	29.9M
mteb-retrieve	5	5	0	$2.60	$6.16	2.7M	3.5M
multi-source-data-merger	5	5	0	$2.07	$2.18	1.5M	0.7M
nginx-request-logging	5	5	0	$2.27	$3.16	2.4M	1.1M
openssl-selfsigned-cert	5	5	0	$0.98	$1.65	0.6M	0.6M
overfull-hbox	4	4	0	$4.47	$13.90	4.1M	8.0M
password-recovery	5	5	0	$3.44	$6.98	3.2M	2.4M
path-tracing	5	5	0	$5.39	$42.08	6.4M	24.1M
path-tracing-reverse	5	5	0	$8.68	$20.27	14.3M	10.9M
polyglot-c-py	5	5	0	$3.01	$4.82	2.2M	2.3M
polyglot-rust-c	5	5	0	$5.36	$11.50	3.5M	3.6M
portfolio-optimization	5	5	0	$1.53	$9.38	1.1M	3.1M
protein-assembly	5	2	+3	$7.91	$23.62	7.9M	8.7M
prove-plus-comm	5	5	0	$0.85	$1.68	0.7M	0.7M
pypi-server	5	3	+2	$2.64	$7.94	2.7M	2.9M
pytorch-model-cli	5	5	0	$3.44	$10.22	2.7M	5.2M
pytorch-model-recovery	4	3	+1	$3.83	$12.20	3.3M	6.2M
qemu-alpine-ssh	5	2	+3	$6.68	$29.45	8.9M	15.4M
qemu-startup	5	4	+1	$4.08	$22.05	4.6M	10.2M
query-optimize	4	5	-1	$7.18	$29.23	8.3M	11.9M
raman-fitting	0	1	-1	$7.52	$17.95	6.4M	5.6M
regex-chess	5	5	0	$7.61	$21.58	6.0M	10.7M
regex-log	5	5	0	$2.34	$2.45	1.5M	0.9M
reshard-c4-data	5	5	0	$5.25	$16.88	5.6M	11.8M
rstan-to-pystan	5	5	0	$10.04	$32.73	18.6M	13.3M
sam-cell-seg	5	4	+1	$4.98	$10.92	4.1M	3.8M
sanitize-git-repo	4	4	0	$4.03	$33.08	3.3M	13.7M
schemelike-metacircular-eval	5	5	0	$5.92	$39.66	5.6M	17.0M
sparql-university	5	5	0	$3.54	$11.55	3.2M	5.8M
sqlite-db-truncate	5	5	0	$2.03	$2.75	1.2M	0.8M
sqlite-with-gcov	5	5	0	$2.37	$8.55	2.2M	5.8M
torch-pipeline-parallelism	2	5	-3	$4.79	$8.68	3.0M	2.5M
torch-tensor-parallelism	5	4	+1	$3.37	$5.15	2.1M	1.4M
train-fasttext	2	0	+2	$14.70	$138.90	29.0M	60.4M
tune-mjcf	5	5	0	$3.79	$18.51	4.4M	14.2M
video-processing	2	1	+1	$8.79	$25.87	9.7M	13.6M
vulnerable-secret	3	5	-2	$1.74	$2.91	1.4M	1.0M
winning-avg-corewars	5	5	0	$8.25	$32.42	12.8M	13.0M
write-compressor	5	5	0	$2.81	$4.84	1.5M	1.8M
All 89 tasks	393	371	+22	$510.30	$1,862.98	663.5M	1,127.6M