Hidden Naming Contracts in SWE-Agent Benchmarks

AI coding benchmarks now influence research priorities, product strategy, and engineering adoption decisions. Over the last year, SWE-bench has become a key benchmark for evaluating AI coding agents, and that momentum has pushed the community to build additional SWE-bench-style benchmarks beyond Python.

As I have explored in earlier analyses on single-file saturation and difficulty distribution, benchmark scores are only as trustworthy as the instances behind them. In this post, I look at a different failure mode: hidden naming contracts.

A hidden naming contract appears when benchmark tests require specific identifiers introduced in the reference solution, even though those names were never made explicit in the issue text. In that setting, an agent can produce a behaviorally correct fix and still be graded as wrong because it chose a different symbol name.

The Core Failure Mode

A typical SWE-bench-style instance has four parts:

  • an issue description
  • a base repository snapshot
  • a reference solution
  • executable tests

The intended contract is behavioral correctness: if a submitted patch fixes the issue, the tests should pass.

The problem starts when tests directly call symbols that were newly introduced in the reference solution. Evaluation then requires not only solving the behavior, but also reproducing a naming choice that may never have been stated anywhere.

A Concrete Example

Consider scikit-learn__scikit-learn-12682 from SWE-bench_Verified.

The issue reports that SparseCoder does not expose max_iter for Lasso, which leads to convergence warnings. The name transform_max_iter is not mentioned in the issue and does not exist elsewhere in the codebase at the base commit.

The reference patch introduces a new transform_max_iter parameter:

# sklearn/decomposition/dict_learning.py
class SparseCoder(BaseEstimator, SparseCodingMixin):
    def __init__(self, dictionary, transform_algorithm='omp',
                 transform_n_nonzero_coefs=None, transform_alpha=None,
                 split_sign=False, n_jobs=None, positive_code=False,
                 transform_max_iter=1000):
        self._set_sparse_coding_params(..., transform_max_iter)

The test patch then calls that exact parameter name:

def test_max_iter():
    with pytest.warns(ConvergenceWarning):
        model = SparseCoder(
            D_multi,
            transform_algorithm=transform_algorithm,
            transform_max_iter=1,
        )
        model.fit_transform(X)

    with pytest.warns(None) as record:
        model = SparseCoder(
            D_multi,
            transform_algorithm=transform_algorithm,
            transform_max_iter=2000,
        )
        model.fit_transform(X)

An agent could implement the same functionality with lasso_max_iter, max_transform_iterations, or another reasonable name and still fail evaluation. The behavioral fix is there, but the hidden naming contract is not satisfied.
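
To make the failure concrete, here is a minimal, self-contained sketch. The two classes are hypothetical stand-ins rather than scikit-learn code: both store the iteration cap identically, but only one uses the name chosen by the reference patch. The check function mimics a test that passes the reference keyword directly.

```python
class ReferenceCoder:
    """Hypothetical stand-in for the reference patch: uses the reference name."""

    def __init__(self, transform_max_iter=1000):
        self.max_iter = transform_max_iter


class AlternativeCoder:
    """Same behavior, but the author picked a different, equally reasonable name."""

    def __init__(self, lasso_max_iter=1000):
        self.max_iter = lasso_max_iter


def name_coupled_check(coder_cls):
    """Mimic a test that passes the reference keyword name directly."""
    try:
        coder = coder_cls(transform_max_iter=1)
    except TypeError:
        # Unknown keyword argument: the run is graded as a failure
        # before any behavior is exercised.
        return False
    return coder.max_iter == 1


print(name_coupled_check(ReferenceCoder))    # True
print(name_coupled_check(AlternativeCoder))  # False
```

The second class fails on the constructor call itself, so the grader never observes whether the convergence behavior was fixed.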

What the Scan Found

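I analyzed six SWE-bench-style datasets: SWE-bench, SWE-bench_Verified, SWE-bench_Multilingual, SWE-bench_Pro, Multi-SWE-bench, and SWE-PolyBench.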

In a screening pass across all 7,567 instances, I found 2,167 instances (28.6%) where tests reference symbols newly introduced in the reference solution.

That 28.6% number is a screening signal, not a final estimate of confirmed false negatives. Some coupled symbols are fair because the name is explicit in the issue text or already established in the codebase. To isolate the highest-risk cases, I ran two refinement checks:

  1. Is the symbol name mentioned in the issue text?
  2. Does the symbol already exist elsewhere in the repository?

I treat the intersection as high-risk coupling: symbols that are neither mentioned in the issue nor present in the codebase.
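
The two refinement checks can be sketched as a small classifier. This is a sketch under stated assumptions, not the exact scan code: the function names, the whole-identifier matching rule, and the idea of representing the base repository as a set of identifiers are all illustrative. It also assumes an instance counts as high-risk only when none of its coupled symbols pass either check.

```python
import re


def mentions(symbol, text):
    """True if `symbol` appears in `text` as a whole identifier."""
    return re.search(rf"(?<!\w){re.escape(symbol)}(?!\w)", text) is not None


def classify(coupled_symbols, issue_text, base_repo_identifiers):
    """Apply the two refinement checks to one coupled instance.

    Assumption: high-risk means no coupled symbol is mentioned in the
    issue AND no coupled symbol exists in the base repository.
    """
    none_in_issue = not any(mentions(s, issue_text) for s in coupled_symbols)
    none_in_codebase = not any(s in base_repo_identifiers for s in coupled_symbols)
    return {
        "none_mentioned_in_issue": none_in_issue,
        "none_exist_in_codebase": none_in_codebase,
        "high_risk": none_in_issue and none_in_codebase,
    }


# Toy inputs modeled on the SparseCoder example (paraphrased, not the real issue text).
issue = "SparseCoder does not expose max_iter for Lasso"
repo_ids = {"SparseCoder", "transform_alpha", "max_iter"}
print(classify({"transform_max_iter"}, issue, repo_ids))
```

Note that the issue mentions max_iter but not transform_max_iter, so whole-identifier matching keeps the instance in the high-risk bucket.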

High-Risk Coupling by Benchmark

| Benchmark | Total Instances | Coupled Instances | High-Risk Coupling (% of Total) |
|---|---|---|---|
| SWE-bench/SWE-bench | 2,294 | 574 | 34 (1.5%) |
| SWE-bench/SWE-bench_Verified | 500 | 87 | 4 (0.8%) |
| SWE-bench/SWE-bench_Multilingual | 300 | 15 | 0 (0.0%) |
| ScaleAI/SWE-bench_Pro | 731 | 363 | 80 (10.9%) |
| ByteDance-Seed/Multi-SWE-bench | 1,632 | 593 | 89 (5.5%) |
| AmazonScience/SWE-PolyBench | 2,110 | 535 | 68 (3.2%) |
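
As a quick arithmetic check, the percentages and the aggregate screening figure can be recomputed from the counts in the table above. The dictionary keys are shortened benchmark names and the helper name is my own.

```python
# (total instances, coupled instances, high-risk instances), copied from the table above
counts = {
    "SWE-bench": (2294, 574, 34),
    "SWE-bench_Verified": (500, 87, 4),
    "SWE-bench_Multilingual": (300, 15, 0),
    "SWE-bench_Pro": (731, 363, 80),
    "Multi-SWE-bench": (1632, 593, 89),
    "SWE-PolyBench": (2110, 535, 68),
}


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


for name, (total, coupled, high_risk) in counts.items():
    print(f"{name:<24} coupled {pct(coupled, total):>5}%   high-risk {pct(high_risk, total):>5}%")

total_instances = sum(total for total, _, _ in counts.values())
total_coupled = sum(coupled for _, coupled, _ in counts.values())
print(total_instances, total_coupled, pct(total_coupled, total_instances))  # 7567 2167 28.6
```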

The main takeaway is that this problem is concentrated rather than uniform.

  • SWE-bench_Pro is the clear outlier at 10.9% high-risk coupling.
  • Multi-SWE-bench is meaningfully elevated at 5.5%.
  • SWE-bench_Verified and SWE-bench_Multilingual are much cleaner on this specific failure mode.

That pattern suggests curation helps, but it does not eliminate the issue.

Where the Risk Concentrates

The high-risk subset becomes easier to interpret if we also look at the two refinement signals separately:

| Benchmark | Coupled | None Mentioned in Issue | None Exist in Codebase | High-Risk |
|---|---|---|---|---|
| SWE-bench/SWE-bench | 574 | 233 | 38 | 34 |
| SWE-bench/SWE-bench_Verified | 87 | 32 | 4 | 4 |
| SWE-bench/SWE-bench_Multilingual | 15 | 9 | 0 | 0 |
| ScaleAI/SWE-bench_Pro | 363 | 210 | 92 | 80 |
| ByteDance-Seed/Multi-SWE-bench | 593 | 296 | 104 | 89 |
| AmazonScience/SWE-PolyBench | 535 | 280 | 73 | 68 |

Two things stand out.

First, many coupled instances provide no naming hint in the issue text. For most datasets, roughly 40% to 60% of coupled instances fall into that bucket.

Second, codebase priors help in many cases, but not equally across benchmarks. In SWE-bench_Verified, most coupled symbols already exist somewhere else in the repository, which gives agents a chance to infer the naming pattern through exploration. In SWE-bench_Pro and Multi-SWE-bench, that safety net is much weaker.
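
Both observations are easier to see as shares of the coupled instances. The short script below recomputes them from the counts in the table above; the dictionary keys are shortened benchmark names.

```python
# (coupled, none mentioned in issue, none exist in codebase), copied from the table above
signals = {
    "SWE-bench": (574, 233, 38),
    "SWE-bench_Verified": (87, 32, 4),
    "SWE-bench_Multilingual": (15, 9, 0),
    "SWE-bench_Pro": (363, 210, 92),
    "Multi-SWE-bench": (593, 296, 104),
    "SWE-PolyBench": (535, 280, 73),
}

for name, (coupled, no_issue_hint, no_codebase_prior) in signals.items():
    print(
        f"{name:<24} "
        f"no issue hint {no_issue_hint / coupled:6.1%}   "
        f"no codebase prior {no_codebase_prior / coupled:6.1%}"
    )
```

On these counts, about 25% of coupled instances in SWE-bench_Pro and about 18% in Multi-SWE-bench have no codebase prior, compared with under 5% in SWE-bench_Verified.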

This is why the raw coupling rate and the high-risk rate both matter. The raw rate tells us how often tests are structurally tied to reference naming. The high-risk rate tells us where that coupling is most likely to generate false negatives.

Why This Matters for Leaderboards

1. Scores can move at the margin

In SWE-bench_Pro, high-risk coupling appears in 10.9% of all instances. If only a fraction of those behave as false negatives in practice, that is still enough to shift scores by one to several points, which can reorder closely clustered systems.
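
A back-of-the-envelope calculation shows the scale. The fractions below are illustrative assumptions, not measurements, and the estimate further assumes the agent would otherwise have been credited with those instances.

```python
def score_shift(high_risk_rate_pct, false_negative_fraction):
    """Headline points lost if that fraction of high-risk instances is wrongly failed."""
    return high_risk_rate_pct * false_negative_fraction


# 10.9 is the SWE-bench_Pro high-risk rate reported above; the fractions are hypothetical.
for fraction in (0.1, 0.25, 0.5):
    print(f"{fraction:.0%} of high-risk instances -> {score_shift(10.9, fraction):.2f} points")
```

Under these assumptions the shift is roughly one point when 10% of high-risk instances are false negatives and more than five points at 50%.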

2. Cross-benchmark comparisons get noisier

A model may appear to improve or regress partly because one benchmark family embeds more hidden naming contracts than another. That makes benchmark-to-benchmark comparisons less clean than leaderboard tables suggest.

3. Curation helps, but one clean metric is not the whole story

The low rates in SWE-bench_Verified and SWE-bench_Multilingual are encouraging for this particular failure mode. But low hidden-contract rates should not be read as a blanket validation of a benchmark’s frontier-tracking quality. Benchmarks can still have other issues, including overly narrow tests, overly wide tests, or contamination.

Recent benchmark audits make the same broader point from a different direction. My analysis is narrower: it is a scalable programmatic scan for one subtype of narrow-test risk.

What Benchmark Maintainers Should Change

1. Prefer behavior-first assertions where feasible

Tests should verify the intended behavior, not accidentally require the exact reference implementation.
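
The difference between the two test styles can be sketched with a toy example. Everything here is hypothetical: slugify stands for an entry point that existed before the fix, and strip_accents for a helper that only the reference patch introduces.

```python
import unicodedata


def strip_accents(text):
    """Hypothetical helper introduced by the reference fix."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def slugify(title):
    """Reference version of the pre-existing public entry point."""
    return strip_accents(title).lower().replace(" ", "-")


def alt_slugify(title):
    """Alternative fix: same behavior, no helper named strip_accents."""
    normalized = unicodedata.normalize("NFKD", title)
    ascii_only = normalized.encode("ascii", "ignore").decode("ascii")
    return ascii_only.lower().replace(" ", "-")


def name_coupled_test(namespace):
    # Requires the exact helper name chosen by the reference patch.
    helper = namespace.get("strip_accents")
    return helper is not None and helper("Caf\u00e9") == "Cafe"


def behavior_first_test(namespace):
    # Exercises only the entry point that existed before the fix.
    return namespace["slugify"]("Caf\u00e9 au lait") == "cafe-au-lait"


reference = {"strip_accents": strip_accents, "slugify": slugify}
alternative = {"slugify": alt_slugify}

print(behavior_first_test(reference), behavior_first_test(alternative))  # True True
print(name_coupled_test(reference), name_coupled_test(alternative))      # True False
```

Both fixes satisfy the behavior-first test, while only the reference implementation satisfies the name-coupled one.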

2. Make naming requirements explicit when they are part of the task

Some fixes genuinely require a specific API, method name, or parameter name for compatibility reasons. In those cases, the naming requirement should be written into the issue statement rather than hidden in the tests.

3. Publish diagnostic metadata alongside headline scores

Leaderboards should ideally report not just one aggregate score, but also benchmark diagnostics such as coupling prevalence, high-risk hidden-contract rates, and whether a score was computed on a filtered subset that excludes known problematic instances.
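
One possible shape for such a report is sketched below. The field names, instance IDs, and outcomes are all made up for illustration; no existing leaderboard is known to use this format.

```python
def resolve_rate(results, exclude=frozenset()):
    """Percent of instances resolved, skipping any instance IDs in `exclude`."""
    kept = {iid: ok for iid, ok in results.items() if iid not in exclude}
    if not kept:
        raise ValueError("no instances left after filtering")
    return 100 * sum(kept.values()) / len(kept)


# Toy per-instance outcomes (illustrative, not real leaderboard data).
results = {
    "repo__proj-1": True,
    "repo__proj-2": False,
    "repo__proj-3": True,
    "repo__proj-4": False,
}
high_risk_ids = {"repo__proj-4"}

report = {
    "score_full": resolve_rate(results),
    "score_filtered": resolve_rate(results, exclude=high_risk_ids),
    "high_risk_rate_pct": 100 * len(high_risk_ids) / len(results),
    "excluded_instances": sorted(high_risk_ids),
}
print(report)
```

Publishing the full and filtered scores side by side lets readers see how much of a gap between systems depends on the flagged instances.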

4. Keep a human review loop in benchmark maintenance

The cleanest benchmarks in this analysis are also the ones with stronger curation signals. That supports a practical maintenance rule: programmatic scans can pre-screen risky instances, but human review is still needed before release.

Conclusion

Software engineering allows multiple valid implementations, and evaluations should reflect that reality. A benchmark should reward behaviorally correct fixes, not force agents to rediscover one unstated naming choice from a hidden reference patch.

Hidden naming contracts are not the entire benchmark-reliability story, but they are a measurable and actionable part of it. If we want benchmark scores to be more robust, one straightforward step is to identify and remove instances where tests silently depend on names that the task never actually required.

A coding agent should not be graded as wrong for solving the right problem with the wrong unstated identifier.


Methodology

The analysis uses a simple pipeline:

  1. Parse gold patches and extract symbols introduced on added lines.
  2. Parse test patches and look for references to those symbols.
  3. Filter out overly generic names that are likely to match coincidentally.
  4. For coupled instances, check whether the symbol appears in the issue text.
  5. For coupled instances, check whether the symbol already exists elsewhere in the repository at the base commit.

The symbol extraction step uses language-specific patterns for Python, Java, JavaScript/TypeScript, Go, Rust, and C/C++.
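
The first three pipeline steps can be sketched for Python as follows. This is a deliberately crude illustration, not the scan itself: the regular expressions, the tiny stoplist, and the function names are assumptions, and the toy patches are abridged stand-ins modeled on the SparseCoder example.

```python
import re

# Assumed patterns: Python definitions and keyword-style names only.
DEF_PATTERN = re.compile(r"^\s*(?:def|class)\s+([A-Za-z_]\w*)")
PARAM_PATTERN = re.compile(r"(?<![\w.])([A-Za-z_]\w*)\s*=(?!=)")
GENERIC = {"result", "data", "value", "get", "set", "run", "self"}  # abridged stoplist


def added_lines(patch):
    """Yield the content of lines added by a unified diff, skipping +++ headers."""
    for line in patch.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            yield line[1:]


def introduced_symbols(gold_patch):
    """Steps 1 and 3: collect names from added lines, then drop generic ones."""
    symbols = set()
    for line in added_lines(gold_patch):
        match = DEF_PATTERN.match(line)
        if match:
            symbols.add(match.group(1))
        symbols.update(PARAM_PATTERN.findall(line))
    return {s for s in symbols if s not in GENERIC and len(s) > 1}


def coupled_symbols(gold_patch, test_patch):
    """Step 2: keep introduced symbols that the test patch references."""
    test_text = "\n".join(added_lines(test_patch))
    return {
        s
        for s in introduced_symbols(gold_patch)
        if re.search(rf"(?<!\w){re.escape(s)}(?!\w)", test_text)
    }


gold = """\
--- a/sklearn/decomposition/dict_learning.py
+++ b/sklearn/decomposition/dict_learning.py
@@
+                 transform_max_iter=1000):
"""
test = """\
@@
+def test_max_iter():
+        model = SparseCoder(D_multi, transform_max_iter=1)
"""
print(coupled_symbols(gold, test))  # {'transform_max_iter'}
```

A real scan needs more care than this, for example when a reformatted signature re-adds parameter names that already existed on removed lines.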

This post reports filtered counts throughout. The filtering step removes names that are so generic that a match is probably accidental rather than evidence of a real hidden naming contract.

Filtering details:
  • Generic variable names: result, data, value, output, input, response, item, obj, args, kwargs, etc.
  • Single-letter identifiers: x, y, i, j, k, n, etc.
  • Common method names: get, set, add, remove, create, run, execute, parse, read, write, load, save, etc.
  • Common class names: Base, Error, Exception, Handler, Manager, Factory, Config, etc.
  • Test infrastructure names: test, setUp, tearDown, fixture, mock, patch, and any symbol matching test-naming patterns (test_*, *Test, mock_*, fake_*, stub_*)
  • Placeholder names: foo, bar, baz, qux
  • Built-ins: True, False, None, null, main, name, type, id
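
A filter along these lines could be written as below. The stoplist is abridged because the lists above end in "etc.", and case-sensitive matching is my assumption rather than a documented detail of the scan.

```python
from fnmatch import fnmatchcase

# Abridged stoplist built from the names listed above (the full lists are not shown).
GENERIC_NAMES = {
    "result", "data", "value", "output", "input", "response", "item", "obj", "args", "kwargs",
    "get", "set", "add", "remove", "create", "run", "execute", "parse", "read", "write",
    "load", "save",
    "Base", "Error", "Exception", "Handler", "Manager", "Factory", "Config",
    "test", "setUp", "tearDown", "fixture", "mock", "patch",
    "foo", "bar", "baz", "qux",
    "True", "False", "None", "null", "main", "name", "type", "id",
}
TEST_PATTERNS = ("test_*", "*Test", "mock_*", "fake_*", "stub_*")


def is_generic(symbol):
    """True if a symbol is too generic to count as evidence of a naming contract."""
    if len(symbol) == 1:  # single-letter identifiers
        return True
    if symbol in GENERIC_NAMES:
        return True
    return any(fnmatchcase(symbol, pattern) for pattern in TEST_PATTERNS)


candidates = ["transform_max_iter", "result", "x", "test_max_iter", "mock_client", "OrderByList"]
print([s for s in candidates if not is_generic(s)])  # ['transform_max_iter', 'OrderByList']
```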

The high-risk subset reported above is the intersection of the two refinement checks: symbols not mentioned in the issue text and not present in the codebase.

Appendix: High-Risk Instances by Benchmark

The following tables list the instances in the high-risk intersection.

SWE-bench/SWE-bench
34 instances with high-risk coupling
Instance ID Coupled Symbols
django__django-11389 get_session_cookie_age
django__django-11742 choice_max_length
django__django-13250 supports_json_field_contains
django__django-13350 upload_interrupted
django__django-13722 get_formset_kwargs
django__django-14430 empty_aggregate_value
django__django-14559 rows_updated
django__django-14725 edit_only
django__django-14894 empty_result_set_value
django__django-15031 list_separator
django__django-15108 OrderByList
django__django-16302 supports_unlimited_charfield
django__django-16369 get_languages_for_item
django__django-16514 get_log_entries
django__django-16883 normalize_table_name
django__django-7188 BaseAuthConfig
matplotlib__matplotlib-13908 remove_overlapping_locs, get_remove_overlapping_locs
matplotlib__matplotlib-18869 _parse_to_version_info
matplotlib__matplotlib-25746 labelfontfamily
psf__requests-4356 InvalidProxyURL
psf__requests-4718 should_strip_auth
pydata__xarray-4759 maybe_coerce_to_str
pylint-dev__pylint-4421 get_numversion_from_version
pylint-dev__pylint-4604 IS_PYPY
pylint-dev__pylint-5839 DELETED_MESSAGES
pytest-dev__pytest-8124 pytest_markeval_namespace
scikit-learn__scikit-learn-12682 transform_max_iter
scikit-learn__scikit-learn-14806 skip_complete
scikit-learn__scikit-learn-14898 neg_brier_score
sphinx-doc__sphinx-7593 KeyboardTransform
sphinx-doc__sphinx-8026 docpath
sphinx-doc__sphinx-8095 napoleon_preprocess_types
sphinx-doc__sphinx-8291 napoleon_attr_annotations
sympy__sympy-11818 from_real

SWE-bench/SWE-bench_Verified
4 instances with high-risk coupling
Instance ID Coupled Symbols
django__django-14559 rows_updated
django__django-14725 edit_only
pylint-dev__pylint-4604 IS_PYPY
scikit-learn__scikit-learn-12682 transform_max_iter

ByteDance-Seed/Multi-SWE-bench
89 instances with high-risk coupling
Instance ID Coupled Symbols
BurntSushi__ripgrep-2610 hyperlink
BurntSushi__ripgrep-723 line_number_width
anuraghazra__github-readme-stats-117 ONE_DAY, THIRTY_MINUTES, CONSTANTS
anuraghazra__github-readme-stats-293 defaultTitle, customTitle
anuraghazra__github-readme-stats-58 retryer, fetcher
clap-rs__clap-2008 before_long_help, after_long_help
clap-rs__clap-2360 forbid_empty_values
clap-rs__clap-3453 get_id
clap-rs__clap-3990 external_subcommand_value_parser
clap-rs__clap-4080 ids
cli__cli-10139 transformSecurityAndAnalysisOpts, SecurityAndAnalysisStatus, SecurityAndAnalysisInput
cli__cli-1155 ErrNotOnAnyBranch
cli__cli-1279 StatusStringResponse, HTTPError, httpErr
cli__cli-1282 listURLWithQuery, filterOptions
cli__cli-1534 runPager
cli__cli-1639 SetNeverPrompt
cli__cli-1867 LabelsByNames
cli__cli-2034 GistOwner
cli__cli-2058 HostnameValidator
cli__cli-2138 validateConfigEntry
cli__cli-2221 generateChecksumFromAssets, generateChecksum
cli__cli-2224 mergeMethodSurvey
cli__cli-2997 getFilesToAdd
cli__cli-3490 getExpansion
cli__cli-3578 detectEmptyFiles
cli__cli-3833 NewCmdCancel, CancelOptions, runCancel
cli__cli-3898 AddOriginRemote
cli__cli-3992 browserLauncher
cli__cli-4146 ttySize, ForceTerminal
cli__cli-4416 deleteAssetRun, DeleteAssetOptions, NewCmdDeleteAsset
cli__cli-4543 addPage
cli__cli-4562 normalizeRepoName
cli__cli-5069 CheckContext, eliminateDuplicates
cli__cli-5108 RepoSearchParameters, GetCodespaceRepoSuggestions
cli__cli-5462 ColorFromRGB
cli__cli-5681 SetAlternateScreenBufferEnabled, StartAlternateScreenBuffer, StopAlternateScreenBuffer
cli__cli-5799 artifactsPayload
cli__cli-6074 changedFilesNames
cli__cli-6158 DefaultFilterBySimilarityOpts, FilterBySimilarity, LevenshteinDistance, ListRepos, cands, FilterBySimilarityOpts
cli__cli-6292 PrCheckStatusSummaryWithColor
cli__cli-667 prStateTitleWithColor, issueStateTitleWithColor
cli__cli-7205 RemoveDiacritics, LatinMatchingFilter
cli__cli-727 parseCloneArgs
cli__cli-7314 RepoExists
cli__cli-7477 sanitizeFileName
cli__cli-7866 PendingError
cli__cli-810 formatRemoteURL
cli__cli-842 StubRepoResponseWithDefaultBranch
cli__cli-857 prReopenCmd
cli__cli-8934 FormatSlice
cli__cli-9008 simplifyURL
cli__cli-970 ExpandAlias
cli__cli-9933 ErrExtensionExecutableNotFound
elastic__logstash-13825 getMandatoryJvmOptions
elastic__logstash-14058 getDroppedEvents
facebook__zstd-1080 ZSTD_getFrameHeader_advanced
facebook__zstd-1105 ZSTD_CCtx_getParameter
facebook__zstd-1107 ZSTD_CCtx_resetParameters
facebook__zstd-1532 ZSTD_CCtxParams_setParameter, ZSTD_CCtxParams_getParameter
facebook__zstd-1540 RETURN_ERROR_IF_MSG
facebook__zstd-1733 ZSTD_SRCSIZEHINT_MAX, ZSTD_SRCSIZEHINT_MIN, ZSTD_c_srcSizeHint
facebook__zstd-2094 ZSTD_d_stableOutBuffer
facebook__zstd-3530 ZSTD_CCtx_setParams, ZSTD_CCtx_setFParams
fasterxml__jackson-core-964 setStreamReadConstraints
fmtlib__fmt-1361 compute_float_boundaries
fmtlib__fmt-3279 is_container_adaptor_like
grpc__grpc-go-2744 appendH2ToNextProtos
iamkun__dayjs-1047 localeNameRegex
iamkun__dayjs-379 weekStart
mui__material-ui-26173 isOptionEqualToValue
mui__material-ui-29954 inheritViewBox
mui__material-ui-34131 excludeVariablesFromRoot
mui__material-ui-36399 unstable_level
mui__material-ui-37118 getItemAsString
nlohmann__json-1314 error_handler_t
nlohmann__json-2225 NLOHMANN_DEFINE_TYPE_INTRUSIVE, NLOHMANN_DEFINE_TYPE_NON_INTRUSIVE
nlohmann__json-3523 value_in_range_of
nlohmann__json-3605 JSON_USE_GLOBAL_UDLS
nlohmann__json-3663 is_c_string
nushell__nushell-12118 xdg_config_empty
ponylang__ponyc-2865 divmod_partial, add_partial
ponylang__ponyc-3293 NullablePointer
tokio-rs__tokio-5200 auto_advance, set_auto_advance
tokio-rs__tokio-6280 try_join_next, try_join_next_with_id
zeromicro__go-zero-1907 WithStreamClientInterceptor
zeromicro__go-zero-1964 PrintRoutes
zeromicro__go-zero-2363 DontTracingSpanName
zeromicro__go-zero-964 NewPublisherWithAuth, NewRpcPubServerWithEtcdAuth, KeepAliveWithAuth, getClusterWithAuth, EnableAuth
zeromicro__go-zero-990 ReadLink

ScaleAI/SWE-bench_Pro
80 instances with high-risk coupling
Instance ID Coupled Symbols
instance_ansible__ansible-106909db8b730480615f4a33de0eb5b710944e78-v0f01c69f1e2528b935359cfe578530722bca2c59 multipart_encoding
instance_ansible__ansible-185d41031660a676c43fbb781cd1335902024bfe-vba6da65a0f3baefda7a058ebbd0a8dcafb8512f5 host_label
instance_ansible__ansible-29aea9ff3466e4cd2ed00524b9e56738d568ce8b-vba6da65a0f3baefda7a058ebbd0a8dcafb8512f5 trailing_separator, default_value_name
instance_ansible__ansible-415e08c2970757472314e515cb63a51ad825c45e-v7eee2454f617569fd6889f2211f75bc02a35f9f8 get_best_parsable_locale
instance_ansible__ansible-42355d181a11b51ebfc56f6f4b3d9c74e01cb13b-v1055803c3a812189a1133297f7f5468579283f86 get_delegated_vars_and_hostname
instance_ansible__ansible-502270c804c33d3bc963930dc85e0f4ca359674d-v7eee2454f617569fd6889f2211f75bc02a35f9f8 BaseStrategy
instance_ansible__ansible-be2c376ab87e3e872ca21697508f12c6909cf85a-vba6da65a0f3baefda7a058ebbd0a8dcafb8512f5 _build_doc
instance_ansible__ansible-cd9c4eb5a6b2bfaf4a6709f001ce3d0c92c1eed2-v0f01c69f1e2528b935359cfe578530722bca2c59 get_sysinfo_facts
instance_ansible__ansible-e64c6c1ca50d7d26a8e7747d8eb87642e767cd74-v0f01c69f1e2528b935359cfe578530722bca2c59 _valid_time_stamp
instance_ansible__ansible-f86c58e2d235d8b96029d102c71ee2dfafd57997-v0f01c69f1e2528b935359cfe578530722bca2c59 _replace_stderr_clixml
instance_element-hq__element-web-1077729a19c0ce902e713cf6fab42c91fb7907f1-vnan getLastSelectedRoomIdForSpace
instance_element-hq__element-web-33e8edb3d508d6eefb354819ca693b7accc695e7 isKeyComboMatch
instance_element-hq__element-web-41dfec20bfe9b62cddbbbf621bef2e9aa9685157-vnan delegatedAuthentication
instance_element-hq__element-web-53b42e321777a598aaf2bb3eab22d710569f83a8-vnan RoomOptionsMenu
instance_element-hq__element-web-772df3021201d9c73835a626df8dcb6334ad9a3e-vnan setSelectedDeviceIds, selectedDeviceIds
instance_element-hq__element-web-cf3c899dd1f221aa1a1f4c5a80dffc05b9c21c85-vnan getLiveness
instance_flipt-io__flipt-2ce8a0331e8a8f63f2c1b555db8277ffe5aa2e63 preFliptAcceptServerVersion, FliptAcceptServerVersionFromContext, FliptAcceptServerVersionUnaryInterceptor
instance_flipt-io__flipt-36e62baffae2132f78f9d34dc300a9baa2d7ae0e getTraceExporter
instance_flipt-io__flipt-a0cbc0cb65ae601270bdbe3f5313e2dfd49c80e4 envsubst
instance_flipt-io__flipt-a42d38a1bb1df267c53d9d4a706cf34825ae3da9 AuthenticationSessionCSRF
instance_flipt-io__flipt-b6cef5cdc0daff3ee99e5974ed60a1dc6b4b0d67 ErrorHandler
instance_flipt-io__flipt-c8d71ad7ea98d97546f01cce4ccb451dbcf37d3b SnapshotFromFS, Unwrap
instance_flipt-io__flipt-cd2f3b0a9d4d8b8a6d3d56afab65851ecdc408e8 validateArrayValue
instance_flipt-io__flipt-e91615cf07966da41756017a7d571f9fc0fdbe80 NewExporter, NewImporter
instance_flipt-io__flipt-f36bd61fb1cee4669de1f00e59da462bfeae8765 NewFeaturesValidator
instance_future-architect__vuls-2923cbc645fbc7a37d50398eb2ab8febda8c3264 rhelRebuildOSVersionToRHEL
instance_future-architect__vuls-36456cb151894964ba1683ce7da5c35ada789970 searchCache
instance_future-architect__vuls-73f0adad95c4d227e2ccfa876c85cc95dd065e13 GetCveContentTypes
instance_future-architect__vuls-83bcca6e669ba2e4102f26c4a2b52f78c7861f1a listenIPPorts
instance_future-architect__vuls-8d5ea98e50cf616847f4e5a2df300395d1f719e9 removeInactives
instance_future-architect__vuls-e4728e388120b311c4ed469e4f942e0347a2689b-v264a82e2f4818e30f5a25e4da53b27ba119f62b5 CompareSeverity
instance_gravitational__teleport-0ecf31de0e98b272a6a2610abe1bbedd379a38a3-vce94f93ad1030e3136852817f2423c1b3ac37bc4 NotifyExit
instance_gravitational__teleport-2bb3bbbd8aff1164a2353381cb79e1dc93b90d28-vee9b09fb20c43af7e520f57e9239bbcf46b7113d billingMode
instance_gravitational__teleport-326fd1d7be87b03998dbc53bc706fdef90f5065c-v626ec2a48416b10a88641359a169d99e935ff037 homeEnvVar
instance_gravitational__teleport-82185f232ae8974258397e121b3bc2ed0c3729ed-v626ec2a48416b10a88641359a169d99e935ff037 buildKubeConfigUpdate
instance_gravitational__teleport-baeb2697c4e4870c9850ff0cd5c7a2d08e1401c9-vee9b09fb20c43af7e520f57e9239bbcf46b7113d yubiHSMTestConfig, gcpKMSTestConfig, HSMTestConfig, awsKMSTestConfig, softHSMTestConfig, cloudHSMTestConfig
instance_gravitational__teleport-bb69574e02bd62e5ccd3cebb25e1c992641afb2a LiteralNamespace
instance_gravitational__teleport-eefac60a350930e5f295f94a2d55b94c1988c04e-vee9b09fb20c43af7e520f57e9239bbcf46b7113d ParseOSReleaseFromReader, DMIInfoFromFS
instance_internetarchive__openlibrary-0d13e6b4bf80bced6c0946b969b9a1b6963f6bce-v76304ecdb3a5954fcf13feb710e8c40fcf24b73c remove_author_honorifics
instance_internetarchive__openlibrary-3aeec6afed9198d734b7ee1293f03ca94ff970e1-v13642507b4fc1f8d234172bf8129942da2c2ca26 _get_wikipedia_link, _get_statement_values
instance_internetarchive__openlibrary-431442c92887a3aece3f8aa771dd029738a80eb1-v76304ecdb3a5954fcf13feb710e8c40fcf24b73c luqum_replace_child
instance_internetarchive__openlibrary-4b7ea2977be2747496ba792a678940baa985f7ea-v0f5aece3601a5b4419f7ccec1dbda2071be28ee4 AuthorRemoteIdConflictError
instance_internetarchive__openlibrary-5de7de19211e71b29b2f2ba3b1dff2fe065d660f-v08d8e8889ec945ab821fb156c04c7d2e2810debb is_valid_identifier, get_identifier_forms, get_isbn_or_asin
instance_internetarchive__openlibrary-72321288ea790a3ace9e36f1c05b68c93f7eec43-v0f5aece3601a5b4419f7ccec1dbda2071be28ee4 luqum_replace_field
instance_internetarchive__openlibrary-91efee627df01e32007abf2d6ebf73f9d9053076-vbee42ad1b72fb23c6a1c874868a720b370983ed2 within_date_range
instance_internetarchive__openlibrary-c4eebe6677acc4629cb541a98d5e91311444f5d4-v13642507b4fc1f8d234172bf8129942da2c2ca26 find_staged_or_pending
instance_internetarchive__openlibrary-d40ec88713dc95ea791b252f92d2f7b75e107440-v13642507b4fc1f8d234172bf8129942da2c2ca26 author_import_record_to_author, import_record_to_edition, check_cover_url_host
instance_internetarchive__openlibrary-d8162c226a9d576f094dc1830c4c1ffd0be2dd17-v76304ecdb3a5954fcf13feb710e8c40fcf24b73c get_non_isbn_asin, is_asin_only
instance_navidrome__navidrome-1e96b858a91c640fe64e84c5e5ad8cc0954ea38d validateCredentials
instance_navidrome__navidrome-28389fb05e1523564dfc61fa43ed8eb8a10f938c IsValidPlaylist
instance_navidrome__navidrome-31799662706fedddf5bcc1a76b50409d1f91d327 tokenFromHeader
instance_navidrome__navidrome-69e0a266f48bae24a11312e9efbe495a337e4c84 DecodeArtworkID, EncodeArtworkID
instance_navidrome__navidrome-874b17b8f614056df0ef021b5d4f977341084185 validatePasswordChange
instance_navidrome__navidrome-9c3b4561652a15846993d477003e111f0df0c585 CRLFWriter
instance_navidrome__navidrome-b3980532237e57ab15b2b93c49d5cd5b2d050013 lastFMAPIKey
instance_navidrome__navidrome-b65e76293a917ee2dfc5d4b373b1c62e054d0dca WithClientUniqueId
instance_protonmail__webclients-369fd37de29c14c690cb3b1c09a949189734026f findHolidaysCalendarByCountryCodeAndLanguageTag
instance_protonmail__webclients-3a6790f480309130b5d6332dce6c9d5ccca13ee3 getCachedChildrenCount
instance_protonmail__webclients-51742625834d3bd0d10fe0c7e76b8739a59c6b9f punycodeUrl, getHostnameWithRegex
instance_protonmail__webclients-5f0745dd6993bb1430a951c62a49807c6635cd77 flushPromises
instance_protonmail__webclients-ae36cb23a1682dcfd69587c1b311ae0227e28f39 elementsToRemove, elementsToBypass
instance_qutebrowser__qutebrowser-0d2afd58f3d0e34af21cee7d8a3fc9d855594e9f-vnan qobj_repr
instance_qutebrowser__qutebrowser-16de05407111ddd82fa12e54389d532362489da9-v363c8a7e5ccdf6968fc7ab84a2053ac78036691d _get_locale_pak_path, _get_lang_override
instance_qutebrowser__qutebrowser-1943fa072ec3df5a87e18a23b0916f134c131016-vafb3e8e01b31319c66c4e666b8a3b1d8ba55db24 set_pinned
instance_qutebrowser__qutebrowser-2dd8966fdcf11972062c540b7a787e4d0de8d372-v363c8a7e5ccdf6968fc7ab84a2053ac78036691d qcolor_to_qsscolor
instance_qutebrowser__qutebrowser-35168ade46184d7e5b91dfa04ca42fe2abd82717-v363c8a7e5ccdf6968fc7ab84a2053ac78036691d template_config_variables, frozenset
instance_qutebrowser__qutebrowser-473a15f7908f2bb6d670b0e908ab34a28d8cf7e2-v363c8a7e5ccdf6968fc7ab84a2053ac78036691d _get_locale_pak_path, _get_lang_override
instance_qutebrowser__qutebrowser-52708364b5f91e198defb022d1a5b4b3ebd9b563-v2ef375ac784985212b1805e1d0431dc8f1b3c171 StatusbarWidget
instance_qutebrowser__qutebrowser-66cfa15c372fa9e613ea5a82d3b03e4609399fb6-v363c8a7e5ccdf6968fc7ab84a2053ac78036691d _get_locale_pak_path, _get_lang_override
instance_qutebrowser__qutebrowser-8f46ba3f6dc7b18375f7aa63c48a1fe461190430-v2ef375ac784985212b1805e1d0431dc8f1b3c171 _validate_untrusted_args
instance_qutebrowser__qutebrowser-99029144b5109bb1b2a53964a7c129e009980cd9-va0fd88aac89cde702ec1ba84877234da33adce8a copy_remove_setting, qt_67
instance_qutebrowser__qutebrowser-9b71c1ea67a9e7eb70dd83214d881c2031db6541-v363c8a7e5ccdf6968fc7ab84a2053ac78036691d _get_locale_pak_path, _get_lang_override
instance_qutebrowser__qutebrowser-a84ecfb80a00f8ab7e341372560458e3f9cfffa2-v2ef375ac784985212b1805e1d0431dc8f1b3c171 for_cmd, EmptyCommandError
instance_qutebrowser__qutebrowser-bf045f7ec7c27709ea3ef61cf41a24e8fdd2e7da-v059c6fdc75567943479b23ebca7c07b5e9a7f34c _FindFlags, to_qt
instance_qutebrowser__qutebrowser-c0be28ebee3e1837aaf3f30ec534ccd6d038f129-v9f8e9d96c85c85a605e382f1510bd08563afc566 extra_suffixes_workaround
instance_qutebrowser__qutebrowser-ec2dcfce9eee9f808efc17a1b99e227fc4421dea-v5149fcda2a9a6fe1d35dfed1bade1444a11ef271 _js_log_to_ui
instance_qutebrowser__qutebrowser-ef5ba1a0360b39f9eff027fbdc57f363597c3c3b-v363c8a7e5ccdf6968fc7ab84a2053ac78036691d _get_locale_pak_path, _get_lang_override
instance_qutebrowser__qutebrowser-ff1c025ad3210506fc76e1f604d8c8c27637d88e-v363c8a7e5ccdf6968fc7ab84a2053ac78036691d set_defaults
instance_tutao__tutanota-f3ffe17af6e8ab007e8d461355057ad237846d9d-vbc0d9ba8f0071fbe982809910959a6ff8884dbbf EntropyFacade
instance_tutao__tutanota-fe240cbf7f0fdd6744ef7bef8cb61676bcdbb621-vc4e41fd0029957297843cb9dec4a25c7c756f029 checkEventValidity

AmazonScience/SWE-PolyBench
68 instances with high-risk coupling
Instance ID Coupled Symbols
angular__angular-37484 clearTsConfigCache
apache__dubbo-4379 whenCompleteWithContext
apache__dubbo-5356 PROMPT
apache__dubbo-6498 SERVICE_PATH_PREFIX, servicePathPrefix
apache__rocketmq-1636 TOPIC_MAX_LENGTH
apache__rocketmq-3862 incPutMessageEntireTime, initPutMessageTimeBuckets, findPutMessageEntireTimePX
apache__rocketmq-4122 setStorePathDLedgerCommitLog
apache__rocketmq-4763 getEnumByString
apache__rocketmq-5008 ConcurrentHashMapUtils
apache__rocketmq-5037 CONTROLLER_ELECT_MASTER_FAILED
apache__rocketmq-5834 incBrokerGetNumsWithoutSystemTopic, BROKER_GET_NUMS_WITHOUT_SYSTEM_TOPIC, getBrokerGetNumsWithoutSystemTopic
apache__rocketmq-7455 decodeCommandCustomHeaderDirectly
apache__rocketmq-8051 setTraceTopic, setEnableTrace
apolloconfig__apollo-4119 SpringCloudInnerDiscoveryService
coder__code-server-5633 welcomeText, appName
google__gson-2420 runTestNoDefaultConstructor
google__gson-2549 originalTimeZone
huggingface__transformers-13573 reorder_and_upcast_attn, scale_attn_by_inverse_layer_idx
huggingface__transformers-15831 resize_decoder_token_embeddings, share_encoder_decoder_embeddings
huggingface__transformers-24510 warn_if_padding_and_no_attention_mask
huggingface__transformers-29838 get_learning_rates, get_num_trainable_parameters, get_optimizer_group
huggingface__transformers-31095 on_optimizer_step
langchain-ai__langchain-676 save_local, load_local
mrdoob__three.js-17649 morphTargetsRelative
mrdoob__three.js-20991 setFromMatrix3
mrdoob__three.js-22404 setFromAttributeAndIndices
mui__material-ui-13003 StepIconComponent
mui__material-ui-14461 wrapsIntrinsicElement
mui__material-ui-19257 hasPopupIcon, hasClearIcon
mui__material-ui-33812 collapsedIcon
mui__material-ui-36426 getOptionKey
prettier__prettier-15408 GQL_QUERY_WITH_CONST
prettier__prettier-9736 cleanDoc
serverless__serverless-2584 compileRole
serverless__serverless-3186 setFunctionNames
serverless__serverless-3521 getServiceObject, getServiceName
serverless__serverless-3622 mergeResourceArrays
serverless__serverless-3700 loadEnvVarsForLocal
serverless__serverless-3808 assignDefaultOptions
serverless__serverless-3812 invocationId
serverless__serverless-4120 isArnRefOrImportValue
serverless__serverless-4293 canUseS3TransferAcceleration, disableTransferAccelerationForCurrentDeploy, enableS3TransferAcceleration, isS3TransferAccelerationEnabled
serverless__serverless-4382 conceal
serverless__serverless-4531 endpointType
serverless__serverless-4793 iamManagedPolicies
serverless__serverless-5662 getProfile
serverless__serverless-5728 suppressLogIfPrintCommand
serverless__serverless-5988 envVarsFromOptions, getEnvVarsFromOptions
serverless__serverless-5994 dockerArgsFromOptions, getDockerArgsFromOptions
serverless__serverless-6293 validateHeaderCondition, validateIpCondition, validateQueryCondition
serverless__serverless-6322 getAlbTargetGroupName, getAlbTargetGroupNameTagValue
serverless__serverless-6823 getDeploymentBucketPolicyLogicalId
serverless__serverless-6869 getValueStrToBool
serverless__serverless-6871 cfnRoleArn
serverless__serverless-6960 getResolved, getRejected
sveltejs__svelte-1364 assignTrue
sveltejs__svelte-1627 setData
sveltejs__svelte-1988 nextTick
sveltejs__svelte-3430 set_input_value
sveltejs__svelte-6525 insert_hydration
sveltejs__svelte-6556 claim_svg_element
sveltejs__svelte-705 callAll
sveltejs__svelte-778 setInputType
trinodb__trino-3638 updateExecutor, setMaxConcurrentMetastoreUpdates
trinodb__trino-3771 setDelegationTokenCacheTtl, setDelegationTokenCacheMaximumSize
trinodb__trino-4393 validateFileBuckets
trinodb__trino-748 setAwsSecretKey, setAwsAccessKey
yt-dlp__yt-dlp-8917 _deprecated_multivalue_fields

Citation

If you find this analysis useful for your research, please cite it as:

@misc{ganhotra2026hiddencontracts,
  title={Hidden Naming Contracts in SWE-Agent Benchmarks},
  author={Ganhotra, Jatin},
  year={2026},
  month={April},
  url={https://jatinganhotra.dev/blog/swe-agents/2026/04/01/hidden-naming-contracts-in-swe-agent-benchmarks/},
  note={Blog post}
}



