Hidden Naming Contracts in SWE-Agent Benchmarks
AI coding benchmarks now influence research priorities, product strategy, and engineering adoption decisions. Over the last year, SWE-bench has become a key benchmark for evaluating AI coding agents, and that momentum has pushed the community to build additional SWE-bench-style benchmarks beyond Python.
As I have explored in earlier analyses on single-file saturation and difficulty distribution, benchmark scores are only as trustworthy as the instances behind them. In this post, I look at a different failure mode: hidden naming contracts.
A hidden naming contract appears when benchmark tests require specific identifiers introduced in the reference solution, even though those names were never made explicit in the issue text. In that setting, an agent can produce a behaviorally correct fix and still be graded as wrong because it chose a different symbol name.
The Core Failure Mode
A typical SWE-bench-style instance has four parts:
- an issue description
- a base repository snapshot
- a reference solution
- executable tests
The intended contract is behavioral correctness: if a submitted patch fixes the issue, the tests should pass.
The problem starts when tests directly call symbols that were newly introduced in the reference solution. Evaluation then requires not only solving the behavior, but also reproducing a naming choice that may never have been stated anywhere.
A Concrete Example
Consider scikit-learn__scikit-learn-12682 from SWE-bench_Verified.
The issue reports that SparseCoder does not expose max_iter for Lasso, which leads to convergence warnings. The name transform_max_iter is not mentioned in the issue and does not exist elsewhere in the codebase at the base commit.
The reference patch introduces a new transform_max_iter parameter:
    # sklearn/decomposition/dict_learning.py
    class SparseCoder(BaseEstimator, SparseCodingMixin):
        def __init__(self, dictionary, transform_algorithm='omp',
                     transform_n_nonzero_coefs=None, transform_alpha=None,
                     split_sign=False, n_jobs=None, positive_code=False,
                     transform_max_iter=1000):
            self._set_sparse_coding_params(..., transform_max_iter)
The test patch then calls that exact parameter name:
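    # Excerpt: D_multi, X, and transform_algorithm are defined elsewhere in the test module.
    def test_max_iter():
        # A tiny iteration budget should trigger a convergence warning.
        with pytest.warns(ConvergenceWarning):
            model = SparseCoder(
                D_multi,
                transform_algorithm=transform_algorithm,
                transform_max_iter=1,
            )
            model.fit_transform(X)
        # A large iteration budget should not.
        with pytest.warns(None) as record:
            model = SparseCoder(
                D_multi,
                transform_algorithm=transform_algorithm,
                transform_max_iter=2000,
            )
            model.fit_transform(X)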
An agent could implement the same functionality with lasso_max_iter, max_transform_iterations, or another reasonable name and still fail evaluation. The behavioral fix is there, but the hidden naming contract is not satisfied.
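To make the failure concrete, here is a minimal, self-contained sketch. `CoderReference` and `CoderAlternative` are hypothetical toy classes standing in for two candidate patches; they are not scikit-learn code. The only difference between them is the keyword name, and `name_coupled_test` mimics a test that passes the reference keyword verbatim.

```python
class CoderReference:
    """Toy stand-in for the reference patch (hypothetical)."""

    def __init__(self, transform_max_iter=1000):
        self.max_iter = transform_max_iter

    def converged(self, needed_iters=500):
        return self.max_iter >= needed_iters


class CoderAlternative:
    """Same behavior, different but equally reasonable parameter name."""

    def __init__(self, lasso_max_iter=1000):
        self.max_iter = lasso_max_iter

    def converged(self, needed_iters=500):
        return self.max_iter >= needed_iters


def name_coupled_test(cls):
    """Mimics a test that uses the reference keyword name verbatim."""
    try:
        return cls(transform_max_iter=2000).converged()
    except TypeError:
        # Unknown keyword: the test errors before any behavior is checked.
        return False


reference_passes = name_coupled_test(CoderReference)
alternative_passes = name_coupled_test(CoderAlternative)
print(reference_passes, alternative_passes)
```

The alternative class behaves identically when called with its own keyword, yet the name-coupled test never gets far enough to observe that.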
What the Scan Found
I analyzed six SWE-bench-style datasets:
- SWE-bench/SWE-bench
- SWE-bench/SWE-bench_Verified
- SWE-bench/SWE-bench_Multilingual
- ScaleAI/SWE-bench_Pro
- ByteDance-Seed/Multi-SWE-bench
- AmazonScience/SWE-PolyBench
In a screening pass across all 7,567 instances, I found 2,167 instances (28.6%) where tests reference symbols newly introduced in the reference solution.
That 28.6% number is a screening signal, not a final estimate of confirmed false negatives. Some coupled symbols are fair because the name is explicit in the issue text or already established in the codebase. To isolate the highest-risk cases, I ran two refinement checks:
- Is the symbol name mentioned in the issue text?
- Does the symbol already exist elsewhere in the repository?
I treat the intersection as high-risk coupling: symbols that are neither mentioned in the issue nor present in the codebase.
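The two checks and their intersection can be sketched as a small classifier. This is an illustrative reading of the rule, not the scan's actual code: `classify_coupling` is a hypothetical name, the check is applied per instance, and plain substring and set membership stand in for whatever matching the real pipeline uses.

```python
def classify_coupling(coupled_symbols, issue_text, codebase_symbols):
    """Label one instance as 'not_coupled', 'coupled', or 'high_risk'.

    coupled_symbols: names introduced by the reference patch and used by tests.
    issue_text: the issue description shown to the agent.
    codebase_symbols: identifiers present in the repository at the base commit.
    """
    if not coupled_symbols:
        return "not_coupled"
    mentioned = any(sym in issue_text for sym in coupled_symbols)
    exists = any(sym in codebase_symbols for sym in coupled_symbols)
    # High risk only when neither source offers a naming hint.
    if not mentioned and not exists:
        return "high_risk"
    return "coupled"


label = classify_coupling(
    {"transform_max_iter"},
    "SparseCoder does not expose max_iter for Lasso",
    {"SparseCoder", "transform_alpha", "max_iter"},  # illustrative subset
)
print(label)
```

Note that the issue text contains `max_iter` but not the full name `transform_max_iter`, so the instance still counts as unhinted.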
High-Risk Coupling by Benchmark
| Benchmark | Total Instances | Coupled Instances | High-Risk Coupling (% of Total) |
|---|---|---|---|
| SWE-bench/SWE-bench | 2,294 | 574 | 34 (1.5%) |
| SWE-bench/SWE-bench_Verified | 500 | 87 | 4 (0.8%) |
| SWE-bench/SWE-bench_Multilingual | 300 | 15 | 0 (0.0%) |
| ScaleAI/SWE-bench_Pro | 731 | 363 | 80 (10.9%) |
| ByteDance-Seed/Multi-SWE-bench | 1,632 | 593 | 89 (5.5%) |
| AmazonScience/SWE-PolyBench | 2,110 | 535 | 68 (3.2%) |
The main takeaway is that this problem is concentrated rather than uniform.
- SWE-bench_Pro is the clear outlier at 10.9% high-risk coupling.
- Multi-SWE-bench is meaningfully elevated at 5.5%.
- SWE-bench_Verified and SWE-bench_Multilingual are much cleaner on this specific failure mode.
That pattern suggests curation helps, but it does not eliminate the issue.
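The screening rate and the per-benchmark percentages follow directly from the counts in the table above. The short script below recomputes them; the variable names are my own.

```python
# (total, coupled, high_risk) per benchmark, copied from the table above.
COUNTS = {
    "SWE-bench/SWE-bench": (2294, 574, 34),
    "SWE-bench/SWE-bench_Verified": (500, 87, 4),
    "SWE-bench/SWE-bench_Multilingual": (300, 15, 0),
    "ScaleAI/SWE-bench_Pro": (731, 363, 80),
    "ByteDance-Seed/Multi-SWE-bench": (1632, 593, 89),
    "AmazonScience/SWE-PolyBench": (2110, 535, 68),
}

total = sum(t for t, _, _ in COUNTS.values())
coupled = sum(c for _, c, _ in COUNTS.values())
screening_rate = round(100 * coupled / total, 1)

# High-risk instances as a percentage of all instances in each benchmark.
high_risk_rate = {
    name: round(100 * h / t, 1) for name, (t, _, h) in COUNTS.items()
}

print(total, coupled, screening_rate)
for name, rate in high_risk_rate.items():
    print(f"{name}: {rate}%")
```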
Where the Risk Concentrates
The high-risk subset becomes easier to interpret if we also look at the two refinement signals separately:
| Benchmark | Coupled | None Mentioned in Issue | None Exist in Codebase | High-Risk |
|---|---|---|---|---|
| SWE-bench/SWE-bench | 574 | 233 | 38 | 34 |
| SWE-bench/SWE-bench_Verified | 87 | 32 | 4 | 4 |
| SWE-bench/SWE-bench_Multilingual | 15 | 9 | 0 | 0 |
| ScaleAI/SWE-bench_Pro | 363 | 210 | 92 | 80 |
| ByteDance-Seed/Multi-SWE-bench | 593 | 296 | 104 | 89 |
| AmazonScience/SWE-PolyBench | 535 | 280 | 73 | 68 |
Two things stand out.
First, many coupled instances provide no naming hint in the issue text. Across the six datasets, the share of coupled instances in that bucket ranges from 37% (SWE-bench_Verified, 32 of 87) to 60% (SWE-bench_Multilingual, 9 of 15).
Second, codebase priors help in many cases, but not equally across benchmarks. In SWE-bench_Verified, only 4 of 87 coupled instances (4.6%) have no coupled symbol anywhere else in the repository, so agents usually have a chance to infer the naming pattern through exploration. In SWE-bench_Pro and Multi-SWE-bench, that safety net is much weaker: 92 of 363 (25.3%) and 104 of 593 (17.5%) coupled instances, respectively, have no such prior.
This is why the raw coupling rate and the high-risk rate both matter. The raw rate tells us how often tests are structurally tied to reference naming. The high-risk rate tells us where that coupling is most likely to generate false negatives.
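To compare the two refinement signals on a common scale, it helps to express each as a share of coupled instances rather than as a raw count. The sketch below does that with the numbers from the table in this section; the dictionary layout is my own.

```python
# (coupled, none_in_issue, none_in_codebase, high_risk), from the table above.
SIGNALS = {
    "SWE-bench/SWE-bench": (574, 233, 38, 34),
    "SWE-bench/SWE-bench_Verified": (87, 32, 4, 4),
    "SWE-bench/SWE-bench_Multilingual": (15, 9, 0, 0),
    "ScaleAI/SWE-bench_Pro": (363, 210, 92, 80),
    "ByteDance-Seed/Multi-SWE-bench": (593, 296, 104, 89),
    "AmazonScience/SWE-PolyBench": (535, 280, 73, 68),
}


def pct(part, whole):
    return round(100 * part / whole, 1)


shares = {
    name: {
        "no_issue_hint": pct(no_issue, coupled),
        "no_codebase_prior": pct(no_code, coupled),
        "high_risk": pct(high, coupled),
    }
    for name, (coupled, no_issue, no_code, high) in SIGNALS.items()
}

for name, row in shares.items():
    print(name, row)
```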
Why This Matters for Leaderboards
1. Scores can move at the margin
In SWE-bench_Pro, high-risk coupling appears in 10.9% of all instances (80 of 731). Even if only a fraction of those behave as false negatives in practice, that is enough to shift scores by one to several points, which can reorder closely clustered systems.
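A quick back-of-the-envelope calculation shows the size of the effect. The fractions below are illustrative assumptions, not measurements, and the calculation assumes the agent would otherwise have produced a behaviorally correct fix on the affected instances, so it is best read as an upper bound for any one system.

```python
PRO_TOTAL = 731      # SWE-bench_Pro instances, from the table above
PRO_HIGH_RISK = 80   # high-risk coupled instances, from the table above


def score_shift_points(false_negative_fraction):
    """Percentage points lost if this fraction of high-risk instances
    is graded wrong despite a behaviorally correct patch."""
    return 100 * PRO_HIGH_RISK * false_negative_fraction / PRO_TOTAL


# Hypothetical fractions chosen only to illustrate the range.
shifts = {f: round(score_shift_points(f), 1) for f in (0.1, 0.25, 0.5)}
print(shifts)
```

Under these assumptions, a 10% false-negative fraction costs about 1.1 points and a 50% fraction costs about 5.5 points.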
2. Cross-benchmark comparisons get noisier
A model may appear to improve or regress partly because one benchmark family embeds more hidden naming contracts than another. That makes benchmark-to-benchmark comparisons less clean than leaderboard tables suggest.
3. Curation helps, but one clean metric is not the whole story
The low rates in SWE-bench_Verified and SWE-bench_Multilingual are encouraging for this particular failure mode. But low hidden-contract rates should not be read as a blanket validation of a benchmark’s frontier-tracking quality. Benchmarks can still have other issues, including overly narrow tests, overly wide tests, or contamination.
Recent benchmark audits make the same broader point from a different direction. My analysis is narrower: it is a scalable programmatic scan for one subtype of narrow-test risk.
What Benchmark Maintainers Should Change
1. Prefer behavior-first assertions where feasible
Tests should verify the intended behavior, not accidentally require the exact reference implementation.
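As a toy illustration of the difference, consider a hypothetical `slugify` task where the reference patch adds a helper called `collapse_dashes`. Every name here is invented for the example. A behavior-first test exercises only the public function, while a name-coupled test reaches for the helper by name.

```python
import re


def collapse_dashes(slug):
    """Helper that exists only because the reference patch added it."""
    return re.sub(r"-+", "-", slug)


def slugify_reference(text):
    """Reference-style patch: delegates to the new helper."""
    return collapse_dashes(text.lower().replace(" ", "-"))


def slugify_alternative(text):
    """Alternative patch: same behavior, logic inlined, no helper."""
    return re.sub(r"-+", "-", text.lower().replace(" ", "-"))


def behavior_first_test(slugify):
    # Asserts only on the observable result of the public function.
    return slugify("Hello   World") == "hello-world"


def name_coupled_test(namespace):
    # Requires a helper whose name was never part of the task statement.
    helper = namespace.get("collapse_dashes")
    return helper is not None and helper("a--b") == "a-b"


reference_ns = {"slugify": slugify_reference, "collapse_dashes": collapse_dashes}
alternative_ns = {"slugify": slugify_alternative}

print(behavior_first_test(reference_ns["slugify"]),
      behavior_first_test(alternative_ns["slugify"]))
print(name_coupled_test(reference_ns), name_coupled_test(alternative_ns))
```

Both patches pass the behavior-first test; only the reference passes the name-coupled one.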
2. Make naming requirements explicit when they are part of the task
Some fixes genuinely require a specific API, method name, or parameter name for compatibility reasons. In those cases, the naming requirement should be written into the issue statement rather than hidden in the tests.
3. Publish diagnostic metadata alongside headline scores
Leaderboards should ideally report not just one aggregate score, but also benchmark diagnostics such as coupling prevalence, high-risk hidden-contract rates, and whether a score was computed on a filtered subset that excludes known problematic instances.
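One lightweight way to do this is to report a filtered score next to the headline score. The sketch below is a hypothetical report format with made-up instance IDs and outcomes; it only shows the mechanics of excluding flagged instances.

```python
def resolve_rate(results, exclude=frozenset()):
    """Percentage of resolved instances, optionally excluding flagged IDs."""
    kept = {iid: ok for iid, ok in results.items() if iid not in exclude}
    if not kept:
        return 0.0
    return round(100 * sum(kept.values()) / len(kept), 1)


# Hypothetical per-instance outcomes (True = resolved); not real results.
results = {"inst-1": True, "inst-2": False, "inst-3": True, "inst-4": False}
high_risk_ids = {"inst-4"}  # hypothetical instance flagged by a scan

report = {
    "score_full": resolve_rate(results),
    "score_filtered": resolve_rate(results, exclude=high_risk_ids),
    "high_risk_count": len(high_risk_ids),
    "high_risk_rate": round(100 * len(high_risk_ids) / len(results), 1),
}
print(report)
```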
4. Keep a human review loop in benchmark maintenance
The cleanest benchmarks in this analysis are also the ones with stronger curation signals. That supports a practical maintenance rule: programmatic scans can pre-screen risky instances, but human review is still needed before release.
Conclusion
Software engineering allows multiple valid implementations, and evaluations should reflect that reality. A benchmark should reward behaviorally correct fixes, not force agents to rediscover one unstated naming choice from a hidden reference patch.
Hidden naming contracts are not the entire benchmark-reliability story, but they are a measurable and actionable part of it. If we want benchmark scores to be more robust, one straightforward step is to identify and remove instances where tests silently depend on names that the task never actually required.
A coding agent should not be graded as wrong for solving the right problem with the wrong unstated identifier.
Methodology
The analysis uses a simple pipeline:
- Parse gold patches and extract symbols introduced on added lines.
- Parse test patches and look for references to those symbols.
- Filter out overly generic names that are likely to match coincidentally.
- For coupled instances, check whether the symbol appears in the issue text.
- For coupled instances, check whether the symbol already exists elsewhere in the repository at the base commit.
The symbol extraction step uses language-specific patterns for Python, Java, JavaScript or TypeScript, Go, Rust, and C or C++.
This post reports filtered counts throughout. The filtering step removes names that are so generic that a match is probably accidental rather than evidence of a real hidden naming contract.
Filtering details:
- Generic variable names: result, data, value, output, input, response, item, obj, args, kwargs, etc.
- Single-letter identifiers: x, y, i, j, k, n, etc.
- Common method names: get, set, add, remove, create, run, execute, parse, read, write, load, save, etc.
- Common class names: Base, Error, Exception, Handler, Manager, Factory, Config, etc.
- Test infrastructure names: test, setUp, tearDown, fixture, mock, patch, and any symbol matching test-naming patterns (test_*, *Test, mock_*, fake_*, stub_*)
- Placeholder names: foo, bar, baz, qux
- Built-ins: True, False, None, null, main, name, type, id
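A filter along these lines could look like the sketch below. The name set is only the subset spelled out in the list above (the original lists end in "etc."), `is_generic` is a hypothetical helper name, and treating every one-character name as generic is my reading of "single-letter identifiers".

```python
from fnmatch import fnmatchcase

# Partial set: only the names spelled out in the filtering list above.
GENERIC_NAMES = {
    "result", "data", "value", "output", "input", "response", "item",
    "obj", "args", "kwargs",
    "get", "set", "add", "remove", "create", "run", "execute", "parse",
    "read", "write", "load", "save",
    "Base", "Error", "Exception", "Handler", "Manager", "Factory", "Config",
    "test", "setUp", "tearDown", "fixture", "mock", "patch",
    "foo", "bar", "baz", "qux",
    "True", "False", "None", "null", "main", "name", "type", "id",
}
TEST_PATTERNS = ("test_*", "*Test", "mock_*", "fake_*", "stub_*")


def is_generic(symbol):
    """True if a match on this name is likely coincidental."""
    if len(symbol) == 1:
        return True
    if symbol in GENERIC_NAMES:
        return True
    # Case-sensitive glob match against the test-naming patterns.
    return any(fnmatchcase(symbol, pattern) for pattern in TEST_PATTERNS)


candidates = ["transform_max_iter", "result", "x", "test_max_iter", "FooTest"]
kept = [name for name in candidates if not is_generic(name)]
print(kept)
```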
The high-risk subset reported above is the intersection of the two refinement checks: symbols not mentioned in the issue text and not present in the codebase.
Appendix: High-Risk Instances by Benchmark
The following tables list the instances in the high-risk intersection.
SWE-bench/SWE-bench
34 instances with high-risk coupling
| Instance ID | Coupled Symbols |
|---|---|
| django__django-11389 | get_session_cookie_age |
| django__django-11742 | choice_max_length |
| django__django-13250 | supports_json_field_contains |
| django__django-13350 | upload_interrupted |
| django__django-13722 | get_formset_kwargs |
| django__django-14430 | empty_aggregate_value |
| django__django-14559 | rows_updated |
| django__django-14725 | edit_only |
| django__django-14894 | empty_result_set_value |
| django__django-15031 | list_separator |
| django__django-15108 | OrderByList |
| django__django-16302 | supports_unlimited_charfield |
| django__django-16369 | get_languages_for_item |
| django__django-16514 | get_log_entries |
| django__django-16883 | normalize_table_name |
| django__django-7188 | BaseAuthConfig |
| matplotlib__matplotlib-13908 | remove_overlapping_locs, get_remove_overlapping_locs |
| matplotlib__matplotlib-18869 | _parse_to_version_info |
| matplotlib__matplotlib-25746 | labelfontfamily |
| psf__requests-4356 | InvalidProxyURL |
| psf__requests-4718 | should_strip_auth |
| pydata__xarray-4759 | maybe_coerce_to_str |
| pylint-dev__pylint-4421 | get_numversion_from_version |
| pylint-dev__pylint-4604 | IS_PYPY |
| pylint-dev__pylint-5839 | DELETED_MESSAGES |
| pytest-dev__pytest-8124 | pytest_markeval_namespace |
| scikit-learn__scikit-learn-12682 | transform_max_iter |
| scikit-learn__scikit-learn-14806 | skip_complete |
| scikit-learn__scikit-learn-14898 | neg_brier_score |
| sphinx-doc__sphinx-7593 | KeyboardTransform |
| sphinx-doc__sphinx-8026 | docpath |
| sphinx-doc__sphinx-8095 | napoleon_preprocess_types |
| sphinx-doc__sphinx-8291 | napoleon_attr_annotations |
| sympy__sympy-11818 | from_real |
SWE-bench/SWE-bench_Verified
4 instances with high-risk coupling
| Instance ID | Coupled Symbols |
|---|---|
| django__django-14559 | rows_updated |
| django__django-14725 | edit_only |
| pylint-dev__pylint-4604 | IS_PYPY |
| scikit-learn__scikit-learn-12682 | transform_max_iter |
ByteDance-Seed/Multi-SWE-bench
89 instances with high-risk coupling
| Instance ID | Coupled Symbols |
|---|---|
| BurntSushi__ripgrep-2610 | hyperlink |
| BurntSushi__ripgrep-723 | line_number_width |
| anuraghazra__github-readme-stats-117 | ONE_DAY, THIRTY_MINUTES, CONSTANTS |
| anuraghazra__github-readme-stats-293 | defaultTitle, customTitle |
| anuraghazra__github-readme-stats-58 | retryer, fetcher |
| clap-rs__clap-2008 | before_long_help, after_long_help |
| clap-rs__clap-2360 | forbid_empty_values |
| clap-rs__clap-3453 | get_id |
| clap-rs__clap-3990 | external_subcommand_value_parser |
| clap-rs__clap-4080 | ids |
| cli__cli-10139 | transformSecurityAndAnalysisOpts, SecurityAndAnalysisStatus, SecurityAndAnalysisInput |
| cli__cli-1155 | ErrNotOnAnyBranch |
| cli__cli-1279 | StatusStringResponse, HTTPError, httpErr |
| cli__cli-1282 | listURLWithQuery, filterOptions |
| cli__cli-1534 | runPager |
| cli__cli-1639 | SetNeverPrompt |
| cli__cli-1867 | LabelsByNames |
| cli__cli-2034 | GistOwner |
| cli__cli-2058 | HostnameValidator |
| cli__cli-2138 | validateConfigEntry |
| cli__cli-2221 | generateChecksumFromAssets, generateChecksum |
| cli__cli-2224 | mergeMethodSurvey |
| cli__cli-2997 | getFilesToAdd |
| cli__cli-3490 | getExpansion |
| cli__cli-3578 | detectEmptyFiles |
| cli__cli-3833 | NewCmdCancel, CancelOptions, runCancel |
| cli__cli-3898 | AddOriginRemote |
| cli__cli-3992 | browserLauncher |
| cli__cli-4146 | ttySize, ForceTerminal |
| cli__cli-4416 | deleteAssetRun, DeleteAssetOptions, NewCmdDeleteAsset |
| cli__cli-4543 | addPage |
| cli__cli-4562 | normalizeRepoName |
| cli__cli-5069 | CheckContext, eliminateDuplicates |
| cli__cli-5108 | RepoSearchParameters, GetCodespaceRepoSuggestions |
| cli__cli-5462 | ColorFromRGB |
| cli__cli-5681 | SetAlternateScreenBufferEnabled, StartAlternateScreenBuffer, StopAlternateScreenBuffer |
| cli__cli-5799 | artifactsPayload |
| cli__cli-6074 | changedFilesNames |
| cli__cli-6158 | DefaultFilterBySimilarityOpts, FilterBySimilarity, LevenshteinDistance, ListRepos, cands, FilterBySimilarityOpts |
| cli__cli-6292 | PrCheckStatusSummaryWithColor |
| cli__cli-667 | prStateTitleWithColor, issueStateTitleWithColor |
| cli__cli-7205 | RemoveDiacritics, LatinMatchingFilter |
| cli__cli-727 | parseCloneArgs |
| cli__cli-7314 | RepoExists |
| cli__cli-7477 | sanitizeFileName |
| cli__cli-7866 | PendingError |
| cli__cli-810 | formatRemoteURL |
| cli__cli-842 | StubRepoResponseWithDefaultBranch |
| cli__cli-857 | prReopenCmd |
| cli__cli-8934 | FormatSlice |
| cli__cli-9008 | simplifyURL |
| cli__cli-970 | ExpandAlias |
| cli__cli-9933 | ErrExtensionExecutableNotFound |
| elastic__logstash-13825 | getMandatoryJvmOptions |
| elastic__logstash-14058 | getDroppedEvents |
| facebook__zstd-1080 | ZSTD_getFrameHeader_advanced |
| facebook__zstd-1105 | ZSTD_CCtx_getParameter |
| facebook__zstd-1107 | ZSTD_CCtx_resetParameters |
| facebook__zstd-1532 | ZSTD_CCtxParams_setParameter, ZSTD_CCtxParams_getParameter |
| facebook__zstd-1540 | RETURN_ERROR_IF_MSG |
| facebook__zstd-1733 | ZSTD_SRCSIZEHINT_MAX, ZSTD_SRCSIZEHINT_MIN, ZSTD_c_srcSizeHint |
| facebook__zstd-2094 | ZSTD_d_stableOutBuffer |
| facebook__zstd-3530 | ZSTD_CCtx_setParams, ZSTD_CCtx_setFParams |
| fasterxml__jackson-core-964 | setStreamReadConstraints |
| fmtlib__fmt-1361 | compute_float_boundaries |
| fmtlib__fmt-3279 | is_container_adaptor_like |
| grpc__grpc-go-2744 | appendH2ToNextProtos |
| iamkun__dayjs-1047 | localeNameRegex |
| iamkun__dayjs-379 | weekStart |
| mui__material-ui-26173 | isOptionEqualToValue |
| mui__material-ui-29954 | inheritViewBox |
| mui__material-ui-34131 | excludeVariablesFromRoot |
| mui__material-ui-36399 | unstable_level |
| mui__material-ui-37118 | getItemAsString |
| nlohmann__json-1314 | error_handler_t |
| nlohmann__json-2225 | NLOHMANN_DEFINE_TYPE_INTRUSIVE, NLOHMANN_DEFINE_TYPE_NON_INTRUSIVE |
| nlohmann__json-3523 | value_in_range_of |
| nlohmann__json-3605 | JSON_USE_GLOBAL_UDLS |
| nlohmann__json-3663 | is_c_string |
| nushell__nushell-12118 | xdg_config_empty |
| ponylang__ponyc-2865 | divmod_partial, add_partial |
| ponylang__ponyc-3293 | NullablePointer |
| tokio-rs__tokio-5200 | auto_advance, set_auto_advance |
| tokio-rs__tokio-6280 | try_join_next, try_join_next_with_id |
| zeromicro__go-zero-1907 | WithStreamClientInterceptor |
| zeromicro__go-zero-1964 | PrintRoutes |
| zeromicro__go-zero-2363 | DontTracingSpanName |
| zeromicro__go-zero-964 | NewPublisherWithAuth, NewRpcPubServerWithEtcdAuth, KeepAliveWithAuth, getClusterWithAuth, EnableAuth |
| zeromicro__go-zero-990 | ReadLink |
ScaleAI/SWE-bench_Pro
80 instances with high-risk coupling
| Instance ID | Coupled Symbols |
|---|---|
| instance_ansible__ansible-106909db8b730480615f4a33de0eb5b710944e78-v0f01c69f1e2528b935359cfe578530722bca2c59 | multipart_encoding |
| instance_ansible__ansible-185d41031660a676c43fbb781cd1335902024bfe-vba6da65a0f3baefda7a058ebbd0a8dcafb8512f5 | host_label |
| instance_ansible__ansible-29aea9ff3466e4cd2ed00524b9e56738d568ce8b-vba6da65a0f3baefda7a058ebbd0a8dcafb8512f5 | trailing_separator, default_value_name |
| instance_ansible__ansible-415e08c2970757472314e515cb63a51ad825c45e-v7eee2454f617569fd6889f2211f75bc02a35f9f8 | get_best_parsable_locale |
| instance_ansible__ansible-42355d181a11b51ebfc56f6f4b3d9c74e01cb13b-v1055803c3a812189a1133297f7f5468579283f86 | get_delegated_vars_and_hostname |
| instance_ansible__ansible-502270c804c33d3bc963930dc85e0f4ca359674d-v7eee2454f617569fd6889f2211f75bc02a35f9f8 | BaseStrategy |
| instance_ansible__ansible-be2c376ab87e3e872ca21697508f12c6909cf85a-vba6da65a0f3baefda7a058ebbd0a8dcafb8512f5 | _build_doc |
| instance_ansible__ansible-cd9c4eb5a6b2bfaf4a6709f001ce3d0c92c1eed2-v0f01c69f1e2528b935359cfe578530722bca2c59 | get_sysinfo_facts |
| instance_ansible__ansible-e64c6c1ca50d7d26a8e7747d8eb87642e767cd74-v0f01c69f1e2528b935359cfe578530722bca2c59 | _valid_time_stamp |
| instance_ansible__ansible-f86c58e2d235d8b96029d102c71ee2dfafd57997-v0f01c69f1e2528b935359cfe578530722bca2c59 | _replace_stderr_clixml |
| instance_element-hq__element-web-1077729a19c0ce902e713cf6fab42c91fb7907f1-vnan | getLastSelectedRoomIdForSpace |
| instance_element-hq__element-web-33e8edb3d508d6eefb354819ca693b7accc695e7 | isKeyComboMatch |
| instance_element-hq__element-web-41dfec20bfe9b62cddbbbf621bef2e9aa9685157-vnan | delegatedAuthentication |
| instance_element-hq__element-web-53b42e321777a598aaf2bb3eab22d710569f83a8-vnan | RoomOptionsMenu |
| instance_element-hq__element-web-772df3021201d9c73835a626df8dcb6334ad9a3e-vnan | setSelectedDeviceIds, selectedDeviceIds |
| instance_element-hq__element-web-cf3c899dd1f221aa1a1f4c5a80dffc05b9c21c85-vnan | getLiveness |
| instance_flipt-io__flipt-2ce8a0331e8a8f63f2c1b555db8277ffe5aa2e63 | preFliptAcceptServerVersion, FliptAcceptServerVersionFromContext, FliptAcceptServerVersionUnaryInterceptor |
| instance_flipt-io__flipt-36e62baffae2132f78f9d34dc300a9baa2d7ae0e | getTraceExporter |
| instance_flipt-io__flipt-a0cbc0cb65ae601270bdbe3f5313e2dfd49c80e4 | envsubst |
| instance_flipt-io__flipt-a42d38a1bb1df267c53d9d4a706cf34825ae3da9 | AuthenticationSessionCSRF |
| instance_flipt-io__flipt-b6cef5cdc0daff3ee99e5974ed60a1dc6b4b0d67 | ErrorHandler |
| instance_flipt-io__flipt-c8d71ad7ea98d97546f01cce4ccb451dbcf37d3b | SnapshotFromFS, Unwrap |
| instance_flipt-io__flipt-cd2f3b0a9d4d8b8a6d3d56afab65851ecdc408e8 | validateArrayValue |
| instance_flipt-io__flipt-e91615cf07966da41756017a7d571f9fc0fdbe80 | NewExporter, NewImporter |
| instance_flipt-io__flipt-f36bd61fb1cee4669de1f00e59da462bfeae8765 | NewFeaturesValidator |
| instance_future-architect__vuls-2923cbc645fbc7a37d50398eb2ab8febda8c3264 | rhelRebuildOSVersionToRHEL |
| instance_future-architect__vuls-36456cb151894964ba1683ce7da5c35ada789970 | searchCache |
| instance_future-architect__vuls-73f0adad95c4d227e2ccfa876c85cc95dd065e13 | GetCveContentTypes |
| instance_future-architect__vuls-83bcca6e669ba2e4102f26c4a2b52f78c7861f1a | listenIPPorts |
| instance_future-architect__vuls-8d5ea98e50cf616847f4e5a2df300395d1f719e9 | removeInactives |
| instance_future-architect__vuls-e4728e388120b311c4ed469e4f942e0347a2689b-v264a82e2f4818e30f5a25e4da53b27ba119f62b5 | CompareSeverity |
| instance_gravitational__teleport-0ecf31de0e98b272a6a2610abe1bbedd379a38a3-vce94f93ad1030e3136852817f2423c1b3ac37bc4 | NotifyExit |
| instance_gravitational__teleport-2bb3bbbd8aff1164a2353381cb79e1dc93b90d28-vee9b09fb20c43af7e520f57e9239bbcf46b7113d | billingMode |
| instance_gravitational__teleport-326fd1d7be87b03998dbc53bc706fdef90f5065c-v626ec2a48416b10a88641359a169d99e935ff037 | homeEnvVar |
| instance_gravitational__teleport-82185f232ae8974258397e121b3bc2ed0c3729ed-v626ec2a48416b10a88641359a169d99e935ff037 | buildKubeConfigUpdate |
| instance_gravitational__teleport-baeb2697c4e4870c9850ff0cd5c7a2d08e1401c9-vee9b09fb20c43af7e520f57e9239bbcf46b7113d | yubiHSMTestConfig, gcpKMSTestConfig, HSMTestConfig, awsKMSTestConfig, softHSMTestConfig, cloudHSMTestConfig |
| instance_gravitational__teleport-bb69574e02bd62e5ccd3cebb25e1c992641afb2a | LiteralNamespace |
| instance_gravitational__teleport-eefac60a350930e5f295f94a2d55b94c1988c04e-vee9b09fb20c43af7e520f57e9239bbcf46b7113d | ParseOSReleaseFromReader, DMIInfoFromFS |
| instance_internetarchive__openlibrary-0d13e6b4bf80bced6c0946b969b9a1b6963f6bce-v76304ecdb3a5954fcf13feb710e8c40fcf24b73c | remove_author_honorifics |
| instance_internetarchive__openlibrary-3aeec6afed9198d734b7ee1293f03ca94ff970e1-v13642507b4fc1f8d234172bf8129942da2c2ca26 | _get_wikipedia_link, _get_statement_values |
| instance_internetarchive__openlibrary-431442c92887a3aece3f8aa771dd029738a80eb1-v76304ecdb3a5954fcf13feb710e8c40fcf24b73c | luqum_replace_child |
| instance_internetarchive__openlibrary-4b7ea2977be2747496ba792a678940baa985f7ea-v0f5aece3601a5b4419f7ccec1dbda2071be28ee4 | AuthorRemoteIdConflictError |
| instance_internetarchive__openlibrary-5de7de19211e71b29b2f2ba3b1dff2fe065d660f-v08d8e8889ec945ab821fb156c04c7d2e2810debb | is_valid_identifier, get_identifier_forms, get_isbn_or_asin |
| instance_internetarchive__openlibrary-72321288ea790a3ace9e36f1c05b68c93f7eec43-v0f5aece3601a5b4419f7ccec1dbda2071be28ee4 | luqum_replace_field |
| instance_internetarchive__openlibrary-91efee627df01e32007abf2d6ebf73f9d9053076-vbee42ad1b72fb23c6a1c874868a720b370983ed2 | within_date_range |
| instance_internetarchive__openlibrary-c4eebe6677acc4629cb541a98d5e91311444f5d4-v13642507b4fc1f8d234172bf8129942da2c2ca26 | find_staged_or_pending |
| instance_internetarchive__openlibrary-d40ec88713dc95ea791b252f92d2f7b75e107440-v13642507b4fc1f8d234172bf8129942da2c2ca26 | author_import_record_to_author, import_record_to_edition, check_cover_url_host |
| instance_internetarchive__openlibrary-d8162c226a9d576f094dc1830c4c1ffd0be2dd17-v76304ecdb3a5954fcf13feb710e8c40fcf24b73c | get_non_isbn_asin, is_asin_only |
| instance_navidrome__navidrome-1e96b858a91c640fe64e84c5e5ad8cc0954ea38d | validateCredentials |
| instance_navidrome__navidrome-28389fb05e1523564dfc61fa43ed8eb8a10f938c | IsValidPlaylist |
| instance_navidrome__navidrome-31799662706fedddf5bcc1a76b50409d1f91d327 | tokenFromHeader |
| instance_navidrome__navidrome-69e0a266f48bae24a11312e9efbe495a337e4c84 | DecodeArtworkID, EncodeArtworkID |
| instance_navidrome__navidrome-874b17b8f614056df0ef021b5d4f977341084185 | validatePasswordChange |
| instance_navidrome__navidrome-9c3b4561652a15846993d477003e111f0df0c585 | CRLFWriter |
| instance_navidrome__navidrome-b3980532237e57ab15b2b93c49d5cd5b2d050013 | lastFMAPIKey |
| instance_navidrome__navidrome-b65e76293a917ee2dfc5d4b373b1c62e054d0dca | WithClientUniqueId |
| instance_protonmail__webclients-369fd37de29c14c690cb3b1c09a949189734026f | findHolidaysCalendarByCountryCodeAndLanguageTag |
| instance_protonmail__webclients-3a6790f480309130b5d6332dce6c9d5ccca13ee3 | getCachedChildrenCount |
| instance_protonmail__webclients-51742625834d3bd0d10fe0c7e76b8739a59c6b9f | punycodeUrl, getHostnameWithRegex |
| instance_protonmail__webclients-5f0745dd6993bb1430a951c62a49807c6635cd77 | flushPromises |
| instance_protonmail__webclients-ae36cb23a1682dcfd69587c1b311ae0227e28f39 | elementsToRemove, elementsToBypass |
| instance_qutebrowser__qutebrowser-0d2afd58f3d0e34af21cee7d8a3fc9d855594e9f-vnan | qobj_repr |
| instance_qutebrowser__qutebrowser-16de05407111ddd82fa12e54389d532362489da9-v363c8a7e5ccdf6968fc7ab84a2053ac78036691d | _get_locale_pak_path, _get_lang_override |
| instance_qutebrowser__qutebrowser-1943fa072ec3df5a87e18a23b0916f134c131016-vafb3e8e01b31319c66c4e666b8a3b1d8ba55db24 | set_pinned |
| instance_qutebrowser__qutebrowser-2dd8966fdcf11972062c540b7a787e4d0de8d372-v363c8a7e5ccdf6968fc7ab84a2053ac78036691d | qcolor_to_qsscolor |
| instance_qutebrowser__qutebrowser-35168ade46184d7e5b91dfa04ca42fe2abd82717-v363c8a7e5ccdf6968fc7ab84a2053ac78036691d | template_config_variables, frozenset |
| instance_qutebrowser__qutebrowser-473a15f7908f2bb6d670b0e908ab34a28d8cf7e2-v363c8a7e5ccdf6968fc7ab84a2053ac78036691d | _get_locale_pak_path, _get_lang_override |
| instance_qutebrowser__qutebrowser-52708364b5f91e198defb022d1a5b4b3ebd9b563-v2ef375ac784985212b1805e1d0431dc8f1b3c171 | StatusbarWidget |
| instance_qutebrowser__qutebrowser-66cfa15c372fa9e613ea5a82d3b03e4609399fb6-v363c8a7e5ccdf6968fc7ab84a2053ac78036691d | _get_locale_pak_path, _get_lang_override |
| instance_qutebrowser__qutebrowser-8f46ba3f6dc7b18375f7aa63c48a1fe461190430-v2ef375ac784985212b1805e1d0431dc8f1b3c171 | _validate_untrusted_args |
| instance_qutebrowser__qutebrowser-99029144b5109bb1b2a53964a7c129e009980cd9-va0fd88aac89cde702ec1ba84877234da33adce8a | copy_remove_setting, qt_67 |
| instance_qutebrowser__qutebrowser-9b71c1ea67a9e7eb70dd83214d881c2031db6541-v363c8a7e5ccdf6968fc7ab84a2053ac78036691d | _get_locale_pak_path, _get_lang_override |
| instance_qutebrowser__qutebrowser-a84ecfb80a00f8ab7e341372560458e3f9cfffa2-v2ef375ac784985212b1805e1d0431dc8f1b3c171 | for_cmd, EmptyCommandError |
| instance_qutebrowser__qutebrowser-bf045f7ec7c27709ea3ef61cf41a24e8fdd2e7da-v059c6fdc75567943479b23ebca7c07b5e9a7f34c | _FindFlags, to_qt |
| instance_qutebrowser__qutebrowser-c0be28ebee3e1837aaf3f30ec534ccd6d038f129-v9f8e9d96c85c85a605e382f1510bd08563afc566 | extra_suffixes_workaround |
| instance_qutebrowser__qutebrowser-ec2dcfce9eee9f808efc17a1b99e227fc4421dea-v5149fcda2a9a6fe1d35dfed1bade1444a11ef271 | _js_log_to_ui |
| instance_qutebrowser__qutebrowser-ef5ba1a0360b39f9eff027fbdc57f363597c3c3b-v363c8a7e5ccdf6968fc7ab84a2053ac78036691d | _get_locale_pak_path, _get_lang_override |
| instance_qutebrowser__qutebrowser-ff1c025ad3210506fc76e1f604d8c8c27637d88e-v363c8a7e5ccdf6968fc7ab84a2053ac78036691d | set_defaults |
| instance_tutao__tutanota-f3ffe17af6e8ab007e8d461355057ad237846d9d-vbc0d9ba8f0071fbe982809910959a6ff8884dbbf | EntropyFacade |
| instance_tutao__tutanota-fe240cbf7f0fdd6744ef7bef8cb61676bcdbb621-vc4e41fd0029957297843cb9dec4a25c7c756f029 | checkEventValidity |
AmazonScience/SWE-PolyBench
68 instances with high-risk coupling
| Instance ID | Coupled Symbols |
|---|---|
| angular__angular-37484 | clearTsConfigCache |
| apache__dubbo-4379 | whenCompleteWithContext |
| apache__dubbo-5356 | PROMPT |
| apache__dubbo-6498 | SERVICE_PATH_PREFIX, servicePathPrefix |
| apache__rocketmq-1636 | TOPIC_MAX_LENGTH |
| apache__rocketmq-3862 | incPutMessageEntireTime, initPutMessageTimeBuckets, findPutMessageEntireTimePX |
| apache__rocketmq-4122 | setStorePathDLedgerCommitLog |
| apache__rocketmq-4763 | getEnumByString |
| apache__rocketmq-5008 | ConcurrentHashMapUtils |
| apache__rocketmq-5037 | CONTROLLER_ELECT_MASTER_FAILED |
| apache__rocketmq-5834 | incBrokerGetNumsWithoutSystemTopic, BROKER_GET_NUMS_WITHOUT_SYSTEM_TOPIC, getBrokerGetNumsWithoutSystemTopic |
| apache__rocketmq-7455 | decodeCommandCustomHeaderDirectly |
| apache__rocketmq-8051 | setTraceTopic, setEnableTrace |
| apolloconfig__apollo-4119 | SpringCloudInnerDiscoveryService |
| coder__code-server-5633 | welcomeText, appName |
| google__gson-2420 | runTestNoDefaultConstructor |
| google__gson-2549 | originalTimeZone |
| huggingface__transformers-13573 | reorder_and_upcast_attn, scale_attn_by_inverse_layer_idx |
| huggingface__transformers-15831 | resize_decoder_token_embeddings, share_encoder_decoder_embeddings |
| huggingface__transformers-24510 | warn_if_padding_and_no_attention_mask |
| huggingface__transformers-29838 | get_learning_rates, get_num_trainable_parameters, get_optimizer_group |
| huggingface__transformers-31095 | on_optimizer_step |
| langchain-ai__langchain-676 | save_local, load_local |
| mrdoob__three.js-17649 | morphTargetsRelative |
| mrdoob__three.js-20991 | setFromMatrix3 |
| mrdoob__three.js-22404 | setFromAttributeAndIndices |
| mui__material-ui-13003 | StepIconComponent |
| mui__material-ui-14461 | wrapsIntrinsicElement |
| mui__material-ui-19257 | hasPopupIcon, hasClearIcon |
| mui__material-ui-33812 | collapsedIcon |
| mui__material-ui-36426 | getOptionKey |
| prettier__prettier-15408 | GQL_QUERY_WITH_CONST |
| prettier__prettier-9736 | cleanDoc |
| serverless__serverless-2584 | compileRole |
| serverless__serverless-3186 | setFunctionNames |
| serverless__serverless-3521 | getServiceObject, getServiceName |
| serverless__serverless-3622 | mergeResourceArrays |
| serverless__serverless-3700 | loadEnvVarsForLocal |
| serverless__serverless-3808 | assignDefaultOptions |
| serverless__serverless-3812 | invocationId |
| serverless__serverless-4120 | isArnRefOrImportValue |
| serverless__serverless-4293 | canUseS3TransferAcceleration, disableTransferAccelerationForCurrentDeploy, enableS3TransferAcceleration, isS3TransferAccelerationEnabled |
| serverless__serverless-4382 | conceal |
| serverless__serverless-4531 | endpointType |
| serverless__serverless-4793 | iamManagedPolicies |
| serverless__serverless-5662 | getProfile |
| serverless__serverless-5728 | suppressLogIfPrintCommand |
| serverless__serverless-5988 | envVarsFromOptions, getEnvVarsFromOptions |
| serverless__serverless-5994 | dockerArgsFromOptions, getDockerArgsFromOptions |
| serverless__serverless-6293 | validateHeaderCondition, validateIpCondition, validateQueryCondition |
| serverless__serverless-6322 | getAlbTargetGroupName, getAlbTargetGroupNameTagValue |
| serverless__serverless-6823 | getDeploymentBucketPolicyLogicalId |
| serverless__serverless-6869 | getValueStrToBool |
| serverless__serverless-6871 | cfnRoleArn |
| serverless__serverless-6960 | getResolved, getRejected |
| sveltejs__svelte-1364 | assignTrue |
| sveltejs__svelte-1627 | setData |
| sveltejs__svelte-1988 | nextTick |
| sveltejs__svelte-3430 | set_input_value |
| sveltejs__svelte-6525 | insert_hydration |
| sveltejs__svelte-6556 | claim_svg_element |
| sveltejs__svelte-705 | callAll |
| sveltejs__svelte-778 | setInputType |
| trinodb__trino-3638 | updateExecutor, setMaxConcurrentMetastoreUpdates |
| trinodb__trino-3771 | setDelegationTokenCacheTtl, setDelegationTokenCacheMaximumSize |
| trinodb__trino-4393 | validateFileBuckets |
| trinodb__trino-748 | setAwsSecretKey, setAwsAccessKey |
| yt-dlp__yt-dlp-8917 | _deprecated_multivalue_fields |
Citation
If you find this analysis useful for your research, please cite it as:
    @misc{ganhotra2026hiddencontracts,
      title={Hidden Naming Contracts in SWE-Agent Benchmarks},
      author={Ganhotra, Jatin},
      year={2026},
      month={April},
      url={https://jatinganhotra.dev/blog/swe-agents/2026/04/01/hidden-naming-contracts-in-swe-agent-benchmarks/},
      note={Blog post}
    }