Tested GPT-4 against Claude 3.5 for 3 months of code reviews - the gap was bigger than I thought

My team spent about 90 days using both to check Python scripts and SQL queries. GPT-4 kept missing obvious logic errors while Claude caught them nearly every time, especially in longer functions. We logged 47 false positives from GPT-4 in that period compared to only 9 from Claude. Has anyone else run a side-by-side test like this, or am I just biased from a bad first experience?

3 comments

murray.robert 1mo ago

47 false positives vs 9 is a pretty stark difference. Were these all longer functions or did you notice a pattern with specific SQL complexity levels too?

charles720 1mo ago

@murray.robert I'd be curious if those false positives were all from one specific SQL pattern or if they were spread out. Did you notice any particular type of query that kept tripping it up?

johnson.river 1mo ago

"47 false positives vs 9" Actually it was 47 vs 9, pretty sure that's a bigger gap than you're saying it is.