GPT-5.4 was released just yesterday, and since I'm currently testing the strix autonomous AI tool for web penetration testing, the temptation to compare it with other LLMs was too strong to resist. As I already spoiled in the title, the results were pretty bad. But there may be a good explanation for this.

What I saw

First, what do I actually mean when I say “bad results”?

I tested strix + GPT-5.4 against three Hack The Box machines. If you are not familiar with Hack The Box, it is an online platform that hosts intentionally vulnerable machines for security training and capture-the-flag-style challenges. Each machine simulates a realistic attack scenario: your goal is to gain an initial foothold as a low-privileged user and capture the user.txt flag, then escalate privileges to root and capture the root.txt flag.

During my previous tests, most LLMs did a pretty good job - with GPT-5.3 Codex being a particular standout, nailing all three machines.

That's why my expectations for GPT-5.4 were pretty high… but it left me disappointed and a bit baffled.

It started well: strix + GPT-5.4 finished one HTB machine in record time. But on the other two machines, it merely identified the initial attack vector, wrote the final report, and stopped. It never attempted to leverage that vector to gain access to the machine and continue the exploitation chain - effectively ignoring three quarters of the work it was expected to do. This was strange, because up to that point everything had gone smoothly: it found those initial vectors quickly and without major distractions.

I published more detailed results on this page, so if you are interested in the data - the duration and cost of each test, plus the final reports generated by the tool - take a look there.

Why it behaved like this

I spent some time discussing this experience - and especially the differences between GPT-5.3 Codex and GPT-5.4 - with, well, ChatGPT and Claude, and this final summary of the whole conversation from Claude makes the most sense to me:

“GPT-5.4 is designed as a general frontier model optimized for professional work, emphasizing clean task completion with fewer iterations, whereas GPT-5.3-Codex is explicitly tuned for long-horizon agentic and coding tasks that require persistent exploration. This difference in optimization target likely explains why GPT-5.4 tends to stop after finding the first major issue in a penetration testing context - it interprets the core objective as met and stops, rather than continuing to enumerate.”

This explanation is somewhat at odds with OpenAI's claim that "GPT‑5.4 brings together the best of our recent advances in reasoning, coding, and agentic workflows into a single frontier model. It incorporates the industry-leading coding capabilities of GPT‑5.3‑Codex⁠…", but hey, it wouldn't be the first time marketing speak has won out over technical precision.

Conclusion

I'm sure I could make GPT-5.4 perform much better just by giving it more specific instructions about the expected results. That would very likely make it continue the investigation much longer. But, you know, I tested several other models, and all of them understood what was expected without me having to spell it out.

So I'm not saying that GPT-5.4 is a bad model. I'm just saying that it's not an ideal model for strix and similar autonomous AI frameworks.