Test Strategy¶
This document describes the practical test approach for diffsan (a monolithic Python CLI run via pipx) with a focus on: fast local feedback, high confidence in CI, and minimal operational overhead.
Goals¶
- Catch regressions in core behaviors: diff preparation, redaction, truncation, prompt construction, schema validation, and GitLab posting.
- Ensure artifacts are always written (prompt + agent output + run status) even on failure.
- Keep tests small, deterministic, and offline by default.
- Make it easy to add new skip rules, agents, and SCM providers without rewriting tests.
Non-goals¶
- Full end-to-end integration tests on every CI run (agent calls are expensive and flaky).
- Perfect secret detection correctness (best-effort regex); tests assert no raw secret leaks rather than perfect detection coverage.
- Exhaustive GitLab position algorithm validation (we test happy paths + graceful degradation).
Test pyramid¶
1) Unit tests (offline, fast)¶
These cover deterministic pure logic and schema validation. They should run in <5s locally.
Targets
- Config merging / precedence
- Diff filtering, prioritization, truncation
- Secret scanning + redaction
- Fingerprinting + stable IDs
- Prompt assembly
- JSON parsing + Pydantic validation
- Markdown formatting of summary metadata/truncation
Tooling
- `pytest`
- `pytest-cov` (optional)
- Avoid network and subprocess calls (mock them)
2) Component tests (offline with fakes/mocks)¶
These test modules with controlled dependencies:
- Agent runners with fake subprocess execution (Cursor and Codex paths)
- GitLab client with a fake HTTP server or mocked transport
- Orchestrator pipeline with faked agent + faked GitLab client, verifying artifacts/events
These provide confidence in interactions and error handling without calling external services.
3) Smoke test (optional, manual or scheduled)¶
A single real run against:
- selected agent CLI (Cursor or Codex)
- GitLab API posting
This is expensive; keep it manual or run nightly/scheduled to avoid flakiness and cost.
Canonical test fixtures¶
All fixtures live under tests/fixtures/:
```text
tests/fixtures/
  diffs/
    small.patch
    large.patch
    secrets.patch
    only_docs.patch
  agent_outputs/
    valid_review.json
    invalid_not_json.txt
    invalid_schema.json
    mixed_markdown_and_json.txt
  gitlab/
    mr_notes_with_prior_summary.json
    mr_details_auto_merge_true.json
    create_note_response.json
    create_discussion_response.json
    create_discussion_position_error.json
```
Fixture guidance:
- `*.patch` files should be small and readable.
- `secrets.patch` should include synthetic secrets (not real ones); tests assert they are redacted.
- Agent outputs should include both parse failures (not JSON) and schema failures (JSON but wrong fields).
What to test (by module)¶
core/config.py¶
Unit tests
- Default config loads with sensible values.
- Precedence: CLI flags > env/CI vars > repo file > defaults.
- Invalid config produces `CONFIG_PARSE_ERROR`.
Example checks
- `max_diff_chars` overridden by an env var is reflected in `AppConfig`.
- Unknown-field behavior matches your policy (strict or permissive).
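The precedence rule can be sketched as a layered merge, later layers winning. This is a minimal illustration, not diffsan's actual API: `load_config`, the layer names, and the default values here are all hypothetical.

```python
# Hypothetical config-precedence sketch: CLI flags > env/CI vars > repo file > defaults.
# Names and default values are illustrative, not diffsan's real implementation.
DEFAULTS = {"max_diff_chars": 100_000, "max_files": 50}

def load_config(repo_file=None, env=None, cli=None):
    """Merge config layers in precedence order; None values never override."""
    merged = dict(DEFAULTS)
    for layer in (repo_file or {}, env or {}, cli or {}):
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged
```

A unit test then asserts, for a key set in every layer, that the CLI value wins and that untouched defaults survive the merge.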
core/diff_provider.py¶
Unit tests
- Environment parsing for CI variables (MR iid, project id, branches).
- Command construction for `git fetch origin <target>` + `git diff <target>...<head>`.
Component tests
- The subprocess wrapper returns a fixture diff for `git diff`; the provider returns a `DiffBundle`.
Failure tests
- Missing required CI vars raise `DIFF_FETCH_FAILED` with context.
- Git command non-zero exit raises `DIFF_FETCH_FAILED` (retryable: false by default).
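Command construction is pure and easy to unit-test without touching git. A sketch under assumed names (`build_diff_commands` is hypothetical; the three-dot merge-base diff form matches the bullets above):

```python
# Hypothetical command builder for the diff provider; the real module may differ.
def build_diff_commands(target_branch: str, head_sha: str) -> list[list[str]]:
    """Return argv lists for fetching the target branch and diffing against it."""
    return [
        ["git", "fetch", "origin", target_branch],
        # Three-dot form diffs from the merge base, matching MR semantics.
        ["git", "diff", f"origin/{target_branch}...{head_sha}"],
    ]
```

Unit tests assert on the argv lists directly, so no subprocess ever runs.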
core/preprocess.py¶
Unit tests
- Ignore globs exclude expected paths.
- Prioritization sorts code files ahead of docs.
- Truncation obeys:
  - `max_diff_chars`
  - `max_files`
  - `max_hunks_per_file`
- Truncation report contains `truncated`, counts, and at least one `TruncationItem` when truncated.
- Secret scanning/redaction:
  - matches produce `RedactionReport.found == true`
  - secrets are replaced with `[REDACTED]`
  - no raw secret ends up in the report, events, or stdout (only hash/length)
Edge cases
- Binary diffs / large blobs are handled without crashes (may be excluded).
- Diff with unusual encoding doesn’t crash (best-effort).
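The `max_diff_chars` limit and its report are pure logic and make a good first unit test. A minimal sketch, assuming a dict-shaped report; `truncate_diff` and the item fields are illustrative, not diffsan's real `TruncationItem` model:

```python
# Hypothetical truncation sketch: enforce max_diff_chars and report what was cut.
def truncate_diff(text: str, max_diff_chars: int):
    """Return (kept_text, report); report lists at least one item when truncated."""
    if len(text) <= max_diff_chars:
        return text, {"truncated": False, "items": []}
    kept = text[:max_diff_chars]
    report = {
        "truncated": True,
        "items": [
            {"reason": "max_diff_chars", "dropped_chars": len(text) - max_diff_chars}
        ],
    }
    return kept, report
```

Tests cover both branches: under the limit (report says `truncated: False`, no items) and over it (exact dropped-character count recorded).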
core/fingerprint.py¶
Unit tests
- Fingerprint stable for identical diff text.
- Fingerprint changes when diff text changes.
- If you compute `finding_id`: it is stable across identical findings after normalization.
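A fingerprint sketch showing the stability properties the bullets above require: hash the diff text after normalizing line endings and surrounding whitespace. The function name and 16-hex-digit truncation are assumptions, not diffsan's actual scheme.

```python
# Hypothetical fingerprint: sha256 over normalized diff text.
import hashlib

def diff_fingerprint(diff_text: str) -> str:
    """Stable for identical content, including CRLF/LF variants; changes with content."""
    normalized = diff_text.replace("\r\n", "\n").strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
```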
core/prior.py¶
Component tests
- Given fixture MR notes containing a prior summary note, returns:
  - `prior_fingerprint`
  - compact `PriorDigest.findings`
  - all prior summary markdown blocks in `PriorDigest.summaries`
  - all inline discussion comments (resolved + unresolved) in `PriorDigest.inline_comments`
Failure tests
- Notes endpoint failure maps to `GITLAB_FETCH_PRIOR_FAILED` (retryable depending on the HTTP status).
- A missing/invalid prior format returns an empty digest (it should not crash).
core/skip.py¶
Unit tests
- Auto-merge signal true => `should_skip == true` with reason `AUTO_MERGE`.
- Otherwise `should_skip == false`.
Component tests
- MR details fixture used to confirm skip decision.
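The skip rule reduces to a small pure function over the MR details payload, so both bullets above are one-line asserts. A sketch under assumptions: `decide_skip` and `SkipDecision` are hypothetical names, and the auto-merge signal is modeled here as GitLab's `merge_when_pipeline_succeeds` field.

```python
# Hypothetical skip rule: auto-merge MRs are skipped with reason AUTO_MERGE.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SkipDecision:
    should_skip: bool
    reason: Optional[str] = None

def decide_skip(mr_details: dict) -> SkipDecision:
    # Field name assumed from the GitLab MR details payload.
    if mr_details.get("merge_when_pipeline_succeeds"):
        return SkipDecision(True, "AUTO_MERGE")
    return SkipDecision(False)
```

The component test simply feeds the `mr_details_auto_merge_true.json` fixture through this function.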
core/prompt.py¶
Unit tests
- Prompt includes:
  - schema instruction ("JSON only") for Cursor only
  - prepared diff content
  - truncation disclosure + what was truncated (or an instruction to include it)
  - redaction flag (if secrets were found)
  - prior digest + "avoid repeating" instruction
  - verbosity + skills, if configured
- For Codex prompts, assert the schema and hard JSON-only sections are omitted.
Safety tests
- Prompt does not include raw secret strings (use the `secrets.patch` fixture).
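The safety test reduces to: redact before assembly, then assert the raw token never reaches the prompt. A sketch with assumed names (`redact`, `build_prompt`, and the `glpat-` token pattern are illustrative; diffsan's real scanner is broader):

```python
# Hypothetical prompt-safety sketch: redaction runs before assembly, so the
# raw secret can never appear in the prompt artifact.
import re

TOKEN_RE = re.compile(r"glpat-[A-Za-z0-9_-]{20,}")  # GitLab-PAT-style pattern

def redact(text: str) -> str:
    return TOKEN_RE.sub("[REDACTED]", text)

def build_prompt(diff_text: str, json_only: bool) -> str:
    sections = []
    if json_only:  # Cursor path; Codex path omits this section.
        sections.append("Respond with JSON only.")
    sections.append("## Diff\n" + redact(diff_text))
    return "\n\n".join(sections)
```

The test uses a synthetic secret (never a real one) and asserts on its absence, mirroring the `secrets.patch` guidance above.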
core/agent_cursor.py¶
Component tests with fake subprocess
- Attempt 1 returns invalid JSON -> repair attempt succeeds -> returns a validated `ReviewOutput`.
- Attempt N exhaustion -> raises `AGENT_OUTPUT_INVALID` and writes all raw outputs to artifacts.
Timing tests
- Stats are captured and included in `ReviewMeta.timings` (optional token counts are allowed to be empty).
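Injecting the subprocess call makes the retry/repair loop trivially testable: the fake returns invalid JSON first, then a valid payload. A sketch, with `run_with_repair` and the repair-prompt wording as assumptions:

```python
# Hypothetical retry/repair loop; run_agent is injected so tests never shell out.
import json

def run_with_repair(run_agent, prompt: str, max_attempts: int = 2):
    """Return (parsed_output, raw_outputs); raise after exhausting attempts."""
    raw_outputs = []
    for attempt in range(max_attempts):
        repair = "" if attempt == 0 else "\nReturn valid JSON only."
        raw = run_agent(prompt + repair)
        raw_outputs.append(raw)  # every raw attempt is preserved for artifacts
        try:
            return json.loads(raw), raw_outputs
        except json.JSONDecodeError:
            continue
    raise ValueError("AGENT_OUTPUT_INVALID")
```

Both bullets become direct asserts: success on attempt 2 with two raw outputs captured, and a raised error on exhaustion.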
core/parse_validate.py¶
Unit tests
- Valid JSON fixture parses to `ReviewOutput`.
- Invalid JSON fixture raises a parse error with helpful context.
- JSON missing required fields fails Pydantic validation.
- If you support stripping code fences: test mixed markdown+json fixture behavior (decide policy and test it).
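If you adopt the fence-stripping policy, the `mixed_markdown_and_json.txt` fixture pins it down. A sketch of one possible policy (tolerate a single surrounding ```` ```json ```` fence, otherwise parse as-is); `parse_agent_output` is a hypothetical name:

```python
# Hypothetical fence-tolerant parser: strips one surrounding ```json fence, if any.
import json
import re

FENCE_RE = re.compile(r"^```(?:json)?\s*(.*?)\s*```$", re.DOTALL)

def parse_agent_output(raw: str):
    raw = raw.strip()
    m = FENCE_RE.match(raw)
    if m:
        raw = m.group(1)
    return json.loads(raw)  # JSONDecodeError propagates with position context
```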
core/agent_codex.py¶
Component tests with fake subprocess
- The default command writes the schema/output files and reads the structured payload from `--output-last-message`.
- Custom command behavior preserves user flags while wiring the output schema/message flags.
- A missing/empty output file and non-zero exit map to `AGENT_EXEC_FAILED`.
core/format.py¶
Unit tests
- Produces summary note markdown that includes:
  - `<details><summary>Metadata</summary>...`
  - fingerprint, agent, duration, token usage (if available)
  - truncation disclosure clearly marked when truncated
  - truncation details in a `<details>` section listing excluded/limited items
  - a redaction warning section if secrets were found (without raw secrets)
- Builds a `PostPlan` with:
  - `discussions[]` mapped from findings
  - unpositioned findings degrading gracefully (when a position is not computable)
  - unpositioned findings rendered as per-finding `<details>` blocks whose `<summary>` shows category/severity + path/line range, with the full `body_markdown` inside
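The unpositioned-finding rendering is a pure string transform, so the contract is easy to pin down. A sketch with an assumed dict-shaped finding (field names mirror the bullets above but are illustrative):

```python
# Hypothetical renderer: a finding without a computable inline position degrades
# to a collapsible <details> block in the summary note.
def render_unpositioned(finding: dict) -> str:
    header = (
        f"{finding['category']}/{finding['severity']} "
        f"{finding['path']}:{finding['line_range']}"
    )
    return (
        f"<details><summary>{header}</summary>\n\n"
        f"{finding['body_markdown']}\n\n</details>"
    )
```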
core/gitlab.py¶
Component tests with mocked HTTP
- `POST note` success -> records the correct payload.
- `POST discussion` success -> records the correct payload.
- Retry behavior:
  - 429 then 201 -> succeeds with `retry_count` incremented
  - 5xx then 201 -> succeeds
- Non-retryable:
  - 401/403 -> `GITLAB_AUTH_ERROR` (no retries)
  - 400 invalid position -> `GITLAB_POSITION_INVALID` (no retries unless recompute is implemented)
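Injecting the transport makes every status-code branch testable with an iterator of canned responses. A sketch of the retry classification above; `post_with_retry` and the error names are illustrative, not the real client:

```python
# Hypothetical retry loop with an injected transport:
#   429 / 5xx -> retry up to max_retries; 401/403 -> fail fast; 2xx -> done.
def post_with_retry(send, payload, max_retries: int = 3):
    """send(payload) -> (status, body). Returns (body, retry_count)."""
    retry_count = 0
    while True:
        status, body = send(payload)
        if status in (200, 201):
            return body, retry_count
        if status in (401, 403):
            raise PermissionError("GITLAB_AUTH_ERROR")  # never retried
        if status == 429 or status >= 500:
            if retry_count >= max_retries:
                raise RuntimeError("GITLAB_POST_FAILED")
            retry_count += 1
            continue
        raise RuntimeError(f"GITLAB_POST_FAILED:{status}")  # e.g. 400 bad position
```

The "429 then 201" case is then a two-element iterator; no HTTP server needed.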
run.py (orchestrator)¶
Component tests
- Happy path with fake agent + fake GitLab:
  - writes artifacts: `prompt.txt`, `agent.raw.txt`, `review.json`, `post_plan.json`, `post_results.json`, `events.jsonl`, `run.json`
  - `run.json.ok == true`
- Failure path:
  - Cursor invalid after retries -> structured `AGENT_OUTPUT_INVALID`
  - Codex invalid output on a single attempt -> structured `AGENT_OUTPUT_INVALID`
  - still writes `prompt.txt`, `agent.raw.txt`, and `run.json` with a structured error
  - exits non-zero (test by calling `run()` directly or using `pytest` to capture `SystemExit`)
- Skip path:
  - writes minimal artifacts + a stdout message
  - does not call the agent or the GitLab poster
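The "artifacts even on failure" guarantee usually comes from a `finally` block around the stage loop. A sketch with an injected artifact writer so the test stays in memory; `run_pipeline` and the status shape are hypothetical stand-ins for the real orchestrator:

```python
# Hypothetical orchestrator core: run.json is written in `finally`, so it exists
# whether the stages succeed or raise. Assumes at least one stage is given.
def run_pipeline(stages, write_artifact):
    status = {"ok": True, "error": None}
    try:
        for name, stage in stages:
            stage()
    except Exception as exc:
        status = {"ok": False, "error": f"{name}: {exc}"}
        raise  # caller converts this to a non-zero exit
    finally:
        write_artifact("run.json", status)
```

Component tests pass a dict-backed writer and assert `run.json` exists with the right `ok` flag on both paths.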
Assertions that matter most (high ROI)¶
- No secret leaks
  - The redaction report never stores raw secrets.
  - The prompt artifact contains `[REDACTED]`, not the original token.
  - Summaries/discussions do not include raw secrets.
- Artifacts always exist
  - On any error, `run.json` and `events.jsonl` are written.
  - `prompt.txt` and `agent.raw.txt` are written once the prompt/agent stages begin.
- Schema contract enforcement
  - Any agent output must validate against the `AgentReviewOutput` Pydantic schema.
  - `ReviewOutput.meta` should be populated by the diffsan runtime, not by agent output.
  - Invalid outputs fail loudly (non-zero exit) and are preserved as artifacts.
- Graceful degradation
  - If a discussion position can't be computed or GitLab rejects it, the tool still posts/keeps a usable summary (and records the failure in `post_results.json`).
Suggested CI test jobs (GitHub Actions)¶
- `unit`: `pytest -q`
- `lint`/`typecheck`: ruff/black + mypy (if enabled)
- `smoke` (manual trigger): runs `diffsan` with real Cursor + a GitLab token against a controlled MR
Keep smoke tests out of the default pipeline or make them `workflow_dispatch`-only.
Adding new features without breaking tests¶
When you add:
- New skip rules: add unit tests in `test_skip.py` and extend the fixtures.
- A new agent (beyond Cursor/Codex): add a parallel fake runner and reuse the parse/validate + format tests.
- GitHub support: replicate the `core/gitlab.py` tests for a `core/github.py` module and keep the contracts stable.
Quick checklist for a new PR¶
- [ ] Updated/added fixtures if behavior changed
- [ ] Unit tests cover new logic
- [ ] Component tests cover interactions (agent or API)
- [ ] No secrets printed or stored unredacted
- [ ] Artifacts are still written on failure
- [ ] `ReviewOutput` schema remains the single source of truth