Friday, November 22, 2024

Exploring Generative AI

TDD with GitHub Copilot

by Paul Sobocinski

Will the advent of AI coding assistants such as GitHub Copilot mean that we won’t need tests? Will TDD become obsolete? To answer this, let’s examine two ways TDD helps software development: providing good feedback, and a means to “divide and conquer” when solving problems.

TDD for good feedback

Good feedback is fast and accurate. In both regards, nothing beats starting with a well-written unit test. Not manual testing, not documentation, not code review, and yes, not even Generative AI. In fact, LLMs provide irrelevant information and even hallucinate. TDD is especially needed when using AI coding assistants. For the same reasons we need fast and accurate feedback on the code we write, we need fast and accurate feedback on the code our AI coding assistant writes.

TDD to divide-and-conquer problems

Problem-solving via divide-and-conquer means that smaller problems can be solved sooner than larger ones. This enables Continuous Integration, Trunk-Based Development, and ultimately Continuous Delivery. But do we really need all this if AI assistants do the coding for us?

Yes. LLMs rarely provide the exact functionality we need after a single prompt. So iterative development is not going away yet. Also, LLMs appear to “elicit reasoning” (see linked study) when they solve problems incrementally via chain-of-thought prompting. LLM-based AI coding assistants perform best when they divide-and-conquer problems, and TDD is how we do that for software development.

TDD tips for GitHub Copilot

At Thoughtworks, we have been using GitHub Copilot with TDD since the start of the year. Our goal has been to experiment with, evaluate, and evolve a series of effective practices around use of the tool.

0. Getting started

Exploring Generative AI

Starting with a blank test file doesn’t mean starting with a blank context. We often start from a user story with some rough notes. We also talk through a starting point with our pairing partner.

This is all context that Copilot doesn’t “see” until we put it in an open file (e.g. the top of our test file). Copilot can work with typos, point-form, poor grammar — you name it. But it can’t work with a blank file.

Some examples of starting context that have worked for us:

  • ASCII art mockup
  • Acceptance Criteria
  • Guiding Assumptions such as:
    • “No GUI needed”
    • “Use Object Oriented Programming” (vs. Functional Programming)

Copilot uses open files for context, so keeping both the test and the implementation file open (e.g. side-by-side) greatly improves Copilot’s code completion ability.

1. Red

TDD represented as a three-part wheel with the 'Red' portion highlighted on the top left third

We begin by writing a descriptive test example name. The more descriptive the name, the better the performance of Copilot’s code completion.

We find that a Given-When-Then structure helps in three ways. First, it reminds us to provide business context. Second, it allows for Copilot to provide rich and expressive naming recommendations for test examples. Third, it reveals Copilot’s “understanding” of the problem from the top-of-file context (described in the prior section).

For example, if we are working on backend code, and Copilot is code-completing our test example name to be, “given the user… clicks the buy button, this tells us that we should update the top-of-file context to specify, “assume no GUI” or, “this test suite interfaces with the API endpoints of a Python Flask app”.

More “gotchas” to watch out for:

  • Copilot may code-complete multiple tests at a time. These tests are often useless (we delete them).
  • As we add more tests, Copilot will code-complete multiple lines instead of one line at-a-time. It will often infer the correct “arrange” and “act” steps from the test names.
    • Here’s the gotcha: it infers the correct “assert” step less often, so we’re especially careful here that the new test is correctly failing before moving onto the “green” step.

2. Green

TDD represented as a three-part wheel with the 'Green' portion highlighted on the top right third

Now we’re ready for Copilot to help with the implementation. An already existing, expressive and readable test suite maximizes Copilot’s potential at this step.

Having said that, Copilot often fails to take “baby steps”. For example, when adding a new method, the “baby step” means returning a hard-coded value that passes the test. To date, we haven’t been able to coax Copilot to take this approach.

Backfilling tests

Instead of taking “baby steps”, Copilot jumps ahead and provides functionality that, while often relevant, is not yet tested. As a workaround, we “backfill” the missing tests. While this diverges from the standard TDD flow, we have yet to see any serious issues with our workaround.

Delete and regenerate

For implementation code that needs updating, the most effective way to involve Copilot is to delete the implementation and have it regenerate the code from scratch. If this fails, deleting the method contents and writing out the step-by-step approach using code comments may help. Failing that, the best way forward may be to simply turn off Copilot momentarily and code out the solution manually.

3. Refactor

TDD represented as a three-part wheel with the 'Refactor' portion highlighted on the bottom third

Refactoring in TDD means making incremental changes that improve the maintainability and extensibility of the codebase, all performed while preserving behavior (and a working codebase).

For this, we’ve found Copilot’s ability limited. Consider two scenarios:

  1. “I know the refactor move I want to try”: IDE refactor shortcuts and features such as multi-cursor select get us where we want to go faster than Copilot.
  2. “I don’t know which refactor move to take”: Copilot code completion cannot guide us through a refactor. However, Copilot Chat can make code improvement suggestions right in the IDE. We have started exploring that feature, and see the promise for making useful suggestions in a small, localized scope. But we have not had much success yet for larger-scale refactoring suggestions (i.e. beyond a single method/function).

Sometimes we know the refactor move but we don’t know the syntax needed to carry it out. For example, creating a test mock that would allow us to inject a dependency. For these situations, Copilot can help provide an in-line answer when prompted via a code comment. This saves us from context-switching to documentation or web search.

Conclusion

The common saying, “garbage in, garbage out” applies to both Data Engineering as well as Generative AI and LLMs. Stated differently: higher quality inputs allow for the capability of LLMs to be better leveraged. In our case, TDD maintains a high level of code quality. This high quality input leads to better Copilot performance than is otherwise possible.

We therefore recommend using Copilot with TDD, and we hope that you find the above tips helpful for doing so.

Thanks to the “Ensembling with Copilot” team started at Thoughtworks Canada; they are the primary source of the findings covered in this memo: Om, Vivian, Nenad, Rishi, Zack, Eren, Janice, Yada, Geet, and Matthew.


Related Articles

Latest Articles