Testing out code written by GPT-4 and Claude 3 Opus
Day 94 / 366
Yesterday I wrote about comparing the outputs of two flagship models, GPT-4 and Claude 3 Opus. I gave them the task of writing a Python script that fetches a user's entire YouTube watch history.
You can read about it in yesterday's blog —
Today I took both outputs and ran them to see whether they work and which, if either, is correct. Here's what I found.
Google Cloud Setup
Claude 3's instructions were concise and to the point. I was able to get the credentials, set up the Google Cloud project, and get the code running in a matter of minutes. I know for a fact that it would have taken me at least half an hour of Googling if I had to do it myself.
GPT-4's steps were more elaborate. It had an edge over Claude 3 here because it included setting up the OAuth consent screen, something Claude 3 missed.
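For reference, the Google Cloud part of both scripts boiled down to the standard OAuth flow. Here's a minimal sketch of it, assuming you've enabled the YouTube Data API v3 in a Cloud project and downloaded a client_secret.json from the console (the file name is just my placeholder, not necessarily what either model used):

```python
# Minimal OAuth sketch for the YouTube Data API v3.
# Assumes client_secret.json was downloaded from the Google Cloud console
# after enabling the API and configuring the OAuth consent screen.
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/youtube.readonly"]

def get_youtube_client():
    # Opens a browser window so you can grant read-only YouTube access
    flow = InstalledAppFlow.from_client_secrets_file("client_secret.json", SCOPES)
    credentials = flow.run_local_server(port=0)
    # Build an authenticated client for the YouTube Data API v3
    return build("youtube", "v3", credentials=credentials)
```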
Running the code
Claude 3's code ran without any errors on the first try. On the other hand, GPT-4's code threw a KeyError. I knew what the fix was, but just to test GPT-4 I sent it the error and asked for a fix. Unfortunately, the fix it gave was incorrect. I was able to fix the issue only because I know Python myself; someone new to the language might have had a tough time.
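I don't remember the exact line GPT-4 tripped on, but the failure mode was the classic one: indexing a field that isn't always present in the API response. A purely hypothetical example of that error and the kind of fix it needed:

```python
# Hypothetical illustration (not GPT-4's actual code): API responses are
# nested dicts, and not every item carries every field.
item = {"snippet": {"title": "Some video"}}  # no "contentDetails" key

# This style crashes with a KeyError when the field is missing:
# video_id = item["contentDetails"]["videoId"]

# Safer: fall back to a default when a key is absent
video_id = item.get("contentDetails", {}).get("videoId", "unknown")
print(video_id)  # -> "unknown"
```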
Accuracy
Unfortunately, neither script gave me my YouTube watch history. I would say both models suffered from hallucinations here. I found out through some Googling that the YouTube Data API no longer returns the watch history playlist. That's the playlist Claude 3 tried to fetch, and that's why its script failed.
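To make that concrete, here's a rough reconstruction of the pattern Claude 3 used (not its exact code): look up the authenticated channel's related playlists and page through the watch history one, which the API simply doesn't populate anymore.

```python
# Rough reconstruction of Claude 3's approach, not its exact code.
# The watch history playlist ("HL") stopped being served by the API
# years ago, so this request either comes back empty or errors out.
youtube = get_youtube_client()  # authenticated client from the OAuth sketch above

channel = youtube.channels().list(part="contentDetails", mine=True).execute()
related = channel["items"][0]["contentDetails"]["relatedPlaylists"]

# "HL" is the historical watch-history playlist ID; the API may not even
# report it in relatedPlaylists anymore, hence the fallback.
history_playlist_id = related.get("watchHistory", "HL")

response = youtube.playlistItems().list(
    part="snippet",
    playlistId=history_playlist_id,
    maxResults=50,
).execute()
print(response.get("items", []))  # empty in practice: watch history is restricted
```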
GPT-4 did a slightly better job, as it pointed out at the end of its response:
Note: This script might not fetch the watch history directly due to privacy changes and limitations of the YouTube Data API. Instead, it attempts to list activities associated with the authenticated user’s account. The ability to directly access or list watch history items has been restricted in recent API updates. However, this script sets a foundation for how to authenticate and make requests to the YouTube Data API.
However, the script it gave didn't do the job either.
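For comparison, GPT-4's script leaned on the activities.list endpoint instead. A rough sketch of that approach (again, my reconstruction rather than its exact code) shows why it also falls short: the endpoint returns uploads, likes, and similar account activity, but watch events were removed from it long ago.

```python
# Roughly what GPT-4's script did: list the account's recent activities.
# activities.list is a real endpoint, but it no longer includes "watch"
# events, so you get uploads/likes/etc. rather than watch history.
youtube = get_youtube_client()  # authenticated client from the OAuth sketch above

activities = youtube.activities().list(
    part="snippet,contentDetails",
    mine=True,
    maxResults=50,
).execute()

for item in activities.get("items", []):
    snippet = item.get("snippet", {})
    print(snippet.get("type"), "-", snippet.get("title"))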
Conclusion
Clearly, neither of these models is ready to replace coders. I think they need another layer of reasoning that lets them check the latest documentation and update their code accordingly. Both also suffer from hallucinations, which makes things worse: they confidently hand you a wrong answer.
But if I had to pick a winner, I would say it's GPT-4.