I have a few steps so far in the code-editing workflow at https://github.com/TrafficGuard/nous/blob/main/src/swe/codeE... There is a first pass I initially added because, when re-running a partially completed task, it would sometimes duplicate work that had already been done. This pass helps Aider focus on what still needs doing:
    <files>${fileContents}</files>
    <requirements>${requirements}</requirements>
    You are a senior software engineer. Your task is to review the provided user requirements against the code provided and produce an implementation design specification to give to a developer to implement the changes in the files.
    Do not provide any details of verification commands etc., as the CI/CD build will run integration tests. Only detail the changes required in the files for the pull request.
    Check if any of the requirements have already been correctly implemented in the code, so as not to duplicate work.
    Look at the existing style of the code when producing the requirements.
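Mechanically, this pass is just a prompt interpolation and a single LLM call whose output becomes the instructions handed to Aider. A minimal sketch in TypeScript, where the LlmClient shape and the designSpecPass name are placeholders of mine, not the repo's actual code:

    // Sketch only: LlmClient and designSpecPass are hypothetical names.
    interface LlmClient {
      generateText(prompt: string): Promise<string>;
    }

    // Produces the implementation design spec that is then given to Aider.
    async function designSpecPass(llm: LlmClient, fileContents: string, requirements: string): Promise<string> {
      const prompt = [
        `<files>${fileContents}</files>`,
        `<requirements>${requirements}</requirements>`,
        'You are a senior software engineer. Review the requirements against the code and produce an implementation design specification for a developer to implement the changes in the files.',
      ].join('\n');
      return llm.generateText(prompt);
    }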
Then there is a compile/lint/test loop which feeds back the error messages and, in the case of compile errors, the diff since the last compiling commit. Aider added some similar functionality recently.
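The shape of that loop is roughly the following sketch. It is written under my own assumptions: the Tools interface, the npm command strings, and the XML-ish tags are placeholders rather than the actual implementation in the repo.

    // Hypothetical shape of the compile/lint/test feedback loop.
    interface Tools {
      runAiderEdit(instructions: string): Promise<void>;
      exec(cmd: string): Promise<{ exitCode: number; output: string }>;
      diffSinceLastCompilingCommit(): Promise<string>;
    }

    async function editLoop(tools: Tools, requirements: string, maxIterations = 5): Promise<boolean> {
      let instructions = requirements;
      for (let i = 0; i < maxIterations; i++) {
        await tools.runAiderEdit(instructions);

        const compile = await tools.exec('npm run build');
        if (compile.exitCode !== 0) {
          // On compile errors, also feed back the diff since the last commit that compiled
          const diff = await tools.diffSinceLastCompilingCommit();
          instructions = `${requirements}\n<compile-errors>${compile.output}</compile-errors>\n<diff>${diff}</diff>`;
          continue;
        }

        const lint = await tools.exec('npm run lint');
        const tests = await tools.exec('npm run test');
        if (lint.exitCode === 0 && tests.exitCode === 0) return true; // clean build, stop iterating

        instructions = `${requirements}\n<errors>${lint.output}\n${tests.output}</errors>`;
      }
      return false; // gave up after maxIterations attempts
    }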
Then finally there's a review step which asks:
    Do the changes in the diff satisfy the requirements? Explain why or why not. Are there any redundant changes in the diff? Was any code removed in the changes which should not have been? Review the style of the code changes in the diff carefully against the original code. Do the changes follow all the style conventions of the original code?
This helps catch issues that Aider inadvertently introduced, or missed.
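The review is just another LLM call over the requirements and the resulting diff, roughly like the sketch below. The helper names, and what happens with the answer afterwards, are my assumptions rather than the repo's actual code.

    // Sketch of the review pass; LlmClient is the same hypothetical shape as above.
    async function reviewDiff(llm: LlmClient, requirements: string, diff: string): Promise<string> {
      const prompt = [
        `<requirements>${requirements}</requirements>`,
        `<diff>${diff}</diff>`,
        'Do the changes in the diff satisfy the requirements? Explain why or why not.',
        'Are there any redundant changes in the diff? Was any code removed which should not have been?',
        'Do the changes follow all the style conventions of the original code?',
      ].join('\n');
      // If the review flags problems, its output can be fed back in as further edit instructions.
      return llm.generateText(prompt);
    }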
I have some ideas around implementing workflows that mimic what we do. For example, if you have a tricky bug, add a .only to the relevant describe/it tests (or create tests if they don't exist), add lots of logging and assertions to pinpoint the fix required, then undo the .only and the extra logging. That's what's going to enable higher overall success rates, and you can see the progress on the SWE-bench lite leaderboard: simple RAG implementations had up to ~4% success rate with Opus, while the agentic solutions are reaching 43% pass rate on the full suite.
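In concrete terms, the agent would temporarily rewrite a test like this while iterating on the fix, then revert the .only and the logging once the bug is pinned down (a Mocha-style illustration with made-up names and values, not code from the repo):

    // calculateTotal and the expected values here are made up for illustration.
    import { describe, it } from 'mocha';
    import assert from 'node:assert';
    import { calculateTotal } from './orderTotals';

    describe('orderTotals', () => {
      // .only narrows the run to just this test while debugging
      it.only('applies the discount before tax', () => {
        const order = { subtotal: 100, discountPercent: 10, taxPercent: 20 };
        const total = calculateTotal(order);
        console.log('debug:', { order, total }); // temporary logging, removed once fixed
        assert.strictEqual(total, 108); // (100 - 10%) + 20% tax
      });
    });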