  • Agentic AI pentesting with Strix: results from 18 LLM models

    Over the last couple of months, I spent close to a hundred hours testing an autonomous AI pentesting tool called Strix with 18 different LLM models. My goal was to evaluate which LLM model performed best with the tool in this lab setup and what that might say about autonomous AI pentesting more generally.

    After a few dead ends and a lot of discarded results (I summarized that earlier failed testing in my How not to test LLM models post), I finally arrived at a methodology that I think produces a meaningful, practical benchmark of real Strix usage under my specific provider, tier, pricing, and rate-limit constraints.

    This post contains the results of my testing and a few observations.

    Continue reading →


  • How not to test LLM models

    In the Czech Republic, we have a whole lore built around a fictitious character called Jára Cimrman. He was partially a genius (one of the greatest playwrights, composers, teachers, travellers, inventors, detectives, gynecologists and sportsmen, among many other things) but mostly a loser (“… while running away from one furious tribe, he missed the North Pole by just seven meters, thus almost becoming the first human to reach the North Pole.”) One of his strongest skills was finding dead ends. He found many ways in which things should NOT be done and helped humanity many times by being able to authoritatively say: “This isn’t the way to do it, my friends!”

    After spending several days trying to compare the performance of different LLM models, I’m sure Jára would be very proud of me.

    Continue reading →


  • How GPT-5.4 performed with Strix - and why it fell short

    GPT-5.4 was released just yesterday, and because I’m currently testing Strix, an autonomous AI tool for web penetration testing, the temptation to compare it with other LLM models was too strong to resist. As I already spoiled in the title, the results were pretty bad. But there could be a good explanation for this.

    Continue reading →


  • LLM model statistics from my Strix testing

    In my previous post, I summarized a few impressions from my Strix testing (TL;DR: I was impressed).

    Since then, I have collected some hard data and summarized it on this page. I still haven’t run enough tests to be able to objectively compare different models, but I believe that page is not a bad starting point when selecting an LLM model for your own testing.

    Beyond the numbers, here are some short personal observations for each model.

    Continue reading →


  • Strix - First impressions

    We’ve all heard it: penetration testers are over. Their job will soon be done by agentic AI frameworks that can find the same (or even more elusive) vulnerabilities for a fraction of their bloody money - and since they don’t need to sleep, eat, or have a work-life balance, they can run 24/7.

    And you, Red Teamers, are next.

    OK, doomers, you got my attention. I decided to look at one of these rising AI penetration testing superstars, Strix, and be generous enough to share my random thoughts with you. If you plan to test this tool yourself, check the APPENDIX: Practical tips for Strix testing section at the end of this post - I think I can save you some time and money.

    Here’s the TL;DR for those of you who don’t have enough time or patience to read my whole rant:

    • After this test, am I scared to death and looking for a plumbing job? No, not yet.
    • Am I impressed? Yes, I am. Actually, thinking about it, I’m very impressed.

    Continue reading →