
Generative AI for Coding

by cthos


Image: AI Clacky Keyboard Tests (Owlie Productions @ Shutterstock #2372381633)

What happens when you get an extreme “GenAI Skeptic” and shove him in front of an LLM coding assistant? This, it turns out.

This is a follow-up to my last post, which was an epic rant about Generative AI. In that post, I mentioned that while I’m generally skeptical about GenAI replacing developers (for a few different reasons, but also because it can’t do what its promoters say it can), I do think the “write code” portion of a project is one of the places where it can actually be useful as a productivity augment.

I don’t think it’s actually going to be able to replace developers. The AI companies want you to think it will, but these tools aren’t currently capable of doing so. Watch this video on Debunking Devin by Internet of Bugs; it’s a good watch.

Likewise, the “writing code” portion isn’t the main point of software development, and an LLM can’t do all the fiddly human bits of creating a software product.

Before I get into how I’ve tested this myself, I want to call out a couple of posts from people I respect that cover GenAI more thoughtfully, and less “cathartic rant”-centric, than I did.

First up, Molly White wrote AI isn’t useless. But is it worth it?. Go read it; it’s great. In it, she makes a point I’m glad she made: there are other, better tools for proofreading / editing / grammar checking that use far less energy. I basically agree with the entire post (and I think it makes a lot of the same points I did, more elegantly and thoughtfully).

The other one is a Guardian article (which… wow, I’m linking to the Guardian) that talks about the AI bubble we’re definitely in.

The tl;dr, or “I’m going to have GPT summarize this for me”


I want to emphasize that these are my opinions, and if you find LLMs for coding personally useful, that’s okay.
I also want to call out that I do not use Generative AI for my writing. There are other tools for editing and grammar checking and thesauruses and so on.

For me, there’s some utility in how these things operate. That utility is variable and hard for me to properly quantify. Sometimes, it’s a time save. Sometimes, it’s a time sink. If I had to guess, I’d say it’s a net time-save right now, but that time save is not nearly enough to offset the environmental and social costs.

“But Alex, these will get exponentially better, and will eventually do everything for you”, you might be saying. That’s not a straw man I’m setting up; it’s what the AI companies are selling. I don’t actually think it’s true. My prediction is that we’re already reaching the end of the exponential growth curve, and the amount of utility we can get out of LLMs will plateau.

And look, even if I’m wrong, the pace of change here is so fast that you’re not going to be missing out if you don’t adopt LLMs for coding right now. If you’re a business owner, I’d argue that waiting a bit longer makes more business sense: I suspect that once the free money runs out and these folks need to turn a profit, the cost is going to go way up, and you’re going to have to recalculate your costs again anyway.

So maybe it’ll be more useful (for me) eventually. Maybe they’ll solve the energy requirements and you can run one of these models locally.

I’m just one guy talking about his own experiences with a coding LLM. I obviously think I’m right, otherwise I’d be singing a different tune, but I want to drop a quote here:

IOW, everything written about LLMs from the perspective of a single practitioner can be dismissed out of hand. The nature of LLMs makes it impossible to distinguish signal from noise in your own practice. - Baldur Bjarnason, The Intelligence Illusion

That is to say, any individual account, whether positive or negative for LLMs, is inherently biased. (See also - You should not be using LLMs)

If you want a counterpoint, go have a look at Simon Willison’s blog, which Molly linked to in her article. I disagree with his assessment of the ethics and the inevitability, but go have a look and learn for yourself.

Okay, with that out of the way, let’s talk about the Coding Experiment.

I set a couple of rules for myself to give this experiment some constraints and to emulate how I’d expect a competent coding assistant to be used.

  1. Minimal “prompt engineering”. The tool can inject whatever context it wants, but I’m going to use the tool the way the marketing says I should be able to.
    1. Likewise, if I need to type out more words to describe what I want than it would have taken to simply write the code… that’s not great.
  2. Work in a well-represented language: TypeScript is well-represented in GPT’s training data.
  3. Use a real project I’m working on.
  4. No googling; only do what the assistants tell me to do.

Here are the tasks I’m going to accomplish:

  1. Replace the deprecated ‘request’ module with Axios in the http class (there’s a sketch of what this kind of change looks like just after this list).
  2. Fix the failing unit tests and refactor them to use async / await.
  3. Fix a problem with getDeeperInfo related to closures / scope.
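
To make task 1 concrete, here’s a rough sketch of the kind of change it involves. This isn’t the project’s actual code - the URL and function names are hypothetical - it just shows the shape of the migration from the callback-based ‘request’ module to promise-based axios.

```typescript
import axios from "axios";

// Before: the deprecated 'request' module, callback style (roughly):
//
//   request({ url: "https://api.example.com/info", json: true }, (err, res, body) => {
//     if (err) { return done(err); }
//     done(null, body);
//   });

// After: axios, promise-based, so it slots straight into async/await code:
async function fetchInfo(url: string): Promise<unknown> {
  const response = await axios.get(url); // axios parses JSON bodies by default
  return response.data;
}
```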

I’ve chosen three different setups to run this test with:

  1. Zed with GPT-4-Turbo
  2. Continue.dev (extension) on VS Code with GPT-4-Turbo
  3. GitHub Copilot on VS Code

The first two are there because I want a “control” for the LLM (though they do use slightly different versions). The last one is just for the commercial-off-the-shelf “optimal” experience.

I don’t know what extra context the tools are injecting before sending data to their LLM counterpart, but hopefully by doing two with the same model we’ll get a decent comparison between the two.

I’m not really looking to rate any of these things as a clear “winner”, but I did want to see whether any of them clearly outperformed the others.

Copilot had the weirdest behavior of them all: it would frequently not update the code inline even when I told it to. It was also the only one that flat-out started removing braces, leading to a fun little compile error.

But overall, each of the three was eventually able to assist with the tasks. They all performed “best” at transforming existing code - all three of them were able to turn promises into async/await without too much trouble.
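
For a sense of what that transformation looks like, here’s a generic before/after on a made-up test. I’m assuming a Jest-style runner here, and the client object is a stand-in for whatever the real class is - this is not the project’s actual test code.

```typescript
// Stand-in client so this snippet type-checks; the real project's class differs.
const client = {
  getInfo: async (id: string) => ({ id, name: "Ada" }),
};

// Before: promise chain, returning the promise so the test runner waits on it
it("fetches info for a user", () => {
  return client.getInfo("123").then((info) => {
    expect(info.name).toBe("Ada");
  });
});

// After: async/await - the same test, reading top to bottom
it("fetches info for a user (async/await)", async () => {
  const info = await client.getInfo("123");
  expect(info.name).toBe("Ada");
});
```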

All three of them had some issues with creating more code than was necessary, or generating code that doesn’t work. They all did a decent job of summarizing code (with some fun little inaccuracies), and they usually helped me spot things like a missing return statement - something an IDE could catch, but is frequently not configured to.
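
As an aside, one way to get the compiler (and therefore the IDE) to catch that class of bug is the noImplicitReturns option - that’s a real tsconfig flag, though I’m not claiming it’s the exact check that applied to the specific missing return I hit.

```typescript
// tsconfig.json (excerpt):
// {
//   "compilerOptions": {
//     "noImplicitReturns": true
//   }
// }

// With the flag enabled, the compiler rejects this function because the
// fall-through path returns nothing:
// error TS7030: Not all code paths return a value.
function describeStatus(code: number) {
  if (code === 200) {
    return "ok";
  }
  // oops - no return on this path
}
```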

Like I said in the tl;dr - these things were fiddly and inconsistent. Frequently, the built-in IDE features we’ve already had were much better. What you're being sold right now is a future where this works better, which would be "fine" if the hype cycle weren't also selling these tools as a "Developer replacer" to execs (shout out to Copilot for slapping warnings of "this doesn't replace human effort" all over the actual tool... but it's not going to be enough).

So the first test (Zed with GPT-4-Turbo) actually took the longest, largely because it was the test where I had no prior experience with the issues I needed to fix. The second and third tests got progressively faster as I learned where to look for things, but the LLM got a little… “spicy”.

If you want to have a look at the test, here’s the video:

First off, it did a fine job replacing request with axios, though it (and my undercaffeinated brain) had some issues with the typings. When it came time to actually call the API, though, it introduced some mistakes in how the API was called. It’s arguable that I’d have gotten a better result with a better prompt, but looking back at the rules, I shouldn’t have to type a paragraph.
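
The typing friction is worth a quick illustration. This is a generic sketch, not the project’s actual API - the point is just that axios’s response data is `any` unless you hand it a type parameter, which is roughly the corner my brain and the LLM kept tripping over.

```typescript
import axios, { AxiosResponse } from "axios";

// Hypothetical payload shape; the real API's types aren't shown in this post.
interface InfoPayload {
  id: string;
  details: string[];
}

async function callApi(url: string): Promise<InfoPayload> {
  // Without the type parameter, response.data is `any` and the compiler
  // won't catch misuse; the generic pins it to InfoPayload.
  const response: AxiosResponse<InfoPayload> = await axios.get<InfoPayload>(url);
  return response.data;
}
```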

On the second test, it did fine, though I forgot to have it do the async/await conversion in the video. I did try it later, and it worked fine - basically the same as the others. This was probably the most consistent, but it’s also one of those things where I can complete the task / update in about 30-45 seconds and the LLM does it in around 20 seconds (inclusive of typing the prompt).

Sidebar: my best time using liberal copy/paste on that one was around 16 seconds. This is what I’m talking about with the variable time save.

For the third test, I wound up not needing the LLM’s help at all: the problem was just a missing param on a couple of calls.

Overall: This was “fine” - and it mostly stayed out of my way while I was doing the experiment, which I appreciate.

Right, so this one was the first of the two that use VS Code plugins. Continue.dev is an open-source way to call all sorts of LLMs, including locally hosted ones. I’ve given that a shot too, but the experience untethered from a GPU is not great.

For this test I thought I’d disabled the code autosuggestion, but for some reason the config didn’t stick, and it was using… some… LLM’s free trial API for it? I honestly have no idea what happened there, and it’s visible in the video.

Speaking of video, the commentated test is here:

Like the others, this one did a pretty okay job at each of the tasks, but it had some standouts:

  1. The inline updates worked consistently and produced an actual diff in the IDE, so I could accept or reject changes individually. That made it a lot easier than Zed to see when the LLM had inserted something I didn’t want (which happened a lot).
  2. Subjectively I think there were more hallucinations, but it’s the same LLM so that’s probably random.

It’s worth calling out here that all of these calls to the LLM are non-deterministic. If you were to set the temperature to 0 you could get more deterministic behavior, but that would also destroy its “creativity”, so no one really does that.
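
For reference, temperature is just a parameter on the request. With the OpenAI Node SDK the call looks roughly like this - a sketch of a direct API call, not what Zed or Continue actually send under the hood:

```typescript
import OpenAI from "openai";

// Reads OPENAI_API_KEY from the environment.
const client = new OpenAI();

async function askForRefactor(snippet: string): Promise<string | null> {
  const completion = await client.chat.completions.create({
    model: "gpt-4-turbo",
    // 0 = as repeatable as the API gets (still not fully deterministic);
    // higher values make the output more varied / "creative".
    temperature: 0,
    messages: [
      { role: "user", content: `Refactor this test to use async/await:\n${snippet}` },
    ],
  });
  return completion.choices[0].message.content;
}
```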

Like before, it completed task 1 in a very similar way to Zed, but I had to futz with the output more. It created a similar error when refactoring the actual callAPI method as well - which took a little bit longer to fix.

For task 2, it did just fine, and it was likewise able to eventually figure out there was a missing return statement.

Right, so Copilot was frustrating in more ways than one.

Sidebar: Microsoft, I need you to get your shit together and stop naming different products the same thing please.

First off, the inline editing was extremely inconsistent. If you check out the video, you’ll encounter this immediately:

For the first several attempts, it just doesn’t work. It won’t make changes, or even accurately suggest what to do - it just craters and suggests I use the chat sidebar. Eventually it starts working; I assume it’s looking for cues in the LLM’s response to trigger its inline replacements, but it messes up more than once.

The chat is “fine” - it’s a bit verbose (how much money are we burning on extra tokens?) but moderately helpful at times.

For problem 1, I wound up having to use the autosuggest to make the changes, after several failed attempts at the inline changes.

For problem 2, it worked fine - like the others it was able to both suggest the missing return (though it also added a bunch of unnecessary code) and to refactor the tests to be async.

So, for GitHub Copilot: I’d never used it before, so I got a 30-day free trial! Hooray!

For Zed and Continue, I was using my OpenAI API key (which yes, I do have an OpenAI API key - I’ve spent about $10 on it total so far for experimentation). Have a look:

I spent $1.55 for approximately 1.5 hours of “light” usage (let’s be generous and call it 2). If we assume that level of usage across all the “tasks” the AI folks envision (like, I’m using Copilot for code and the other Copilot for emails, because they’re shoving them into absolutely everything), we can “extrapolate”: 160 hours at roughly $0.75 per hour puts us at $120 of usage per month. Were I using GPT-3.5, you could pretty well divide that by 20 (ish), for a usage of $6/mo.
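
Showing my work on that back-of-the-envelope math (the “divide by 20” for GPT-3.5 is a rough per-token price ratio, not an exact figure):

```typescript
// Back-of-the-envelope extrapolation from the usage above.
const observedCost = 1.55;   // USD for the experiment
const observedHours = 2;     // closer to 1.5, but let's be generous

const hourlyRate = observedCost / observedHours;   // ~$0.78/hour, call it $0.75
const workHoursPerMonth = 160;                     // roughly a full-time month

const gpt4MonthlyCost = 0.75 * workHoursPerMonth;  // $120/month
const gpt35MonthlyCost = gpt4MonthlyCost / 20;     // ~$6/month, give or take

console.log({ hourlyRate, gpt4MonthlyCost, gpt35MonthlyCost });
```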

You might notice there’s a math problem here. Assuming Copilot is using GPT-4 and they charge $10/mo, they’re likely loss-leading a lot here. If they’re using a more efficient model… they’re probably still losing money on every subscription unless devs aren’t using the features.

I suspect this is also the case for Office 365 + the Edge browser + everything else - burning investment dollars to get everyone hooked.

I only mention this because I think they’re going to raise prices eventually, unless there’s some dramatic advance in chip technology (which Nvidia is chasing), and when they do…

So yeah, to summarize what I said at the beginning: I’m just not getting enough value from these things to justify their use, and the ethics and energy use bother me.

If we can get to a point where we’re not burning egregious amounts of energy and these models aren’t consolidating even more capital in the big tech companies, that might change. Maybe we'll get really advanced AI chips that let you run all of these locally on commodity hardware. I just think the current trend of "more power and bigger" is unsustainable.

I still also have some existential concerns with newer developers learning through the use of LLMs, when they'd be better served by really good curated documentation, but I'll leave that for another post.