Blog

Frontier Model Benchmark Matrix

Jul 9, 2026

A source-audited comparison of public benchmark results for GPT-5.6, Claude 5, Muse Spark 1.1, and Grok 4.5.

Thinking in RL

Apr 13, 2026

An opinionated tour through the algorithm tree of modern LLM RL: PPO, GRPO, REINFORCE, REINFORCE++, DPO, and the theoretical ideas that tie them together.

The Quest of User-Effective AI Agents

Nov 2, 2025

Exploring what makes AI agents truly effective for users, beyond benchmark performance.

The overlooked "bad" word list ☠️

Dec 15, 2024

Stop using outdated bad word lists. Use ToxicTrig instead for better toxic language analysis.