Last week, Moonshot AI released an upgraded version of their Kimi K2 model with thinking capabilities. The benchmarks are competitive and the pricing undercuts proprietary alternatives. But benchmarks don't tell the whole story, and they're not what caught my attention.
The real problem with AI coding assistants is tool use. They forget context, repeat themselves, or drift off track. For developers working in real codebases, this is a dealbreaker. You need file searches, code reads, edits, and test runs, all chained together without the model falling apart.
K2 Thinking is built and tested specifically to handle long chains of sequential tool calls while maintaining coherence. It performs continuous cycles of reasoning, searching, browsing, and coding without losing the thread. For coding tasks that involve MCP servers and extensive tool use, this changes what's possible.
I tested it on a production codebase to see how it performs on real engineering work.
What Makes K2 Thinking Different#
Before we dive into the hands-on experience, let's take a look at the technical details of the model.
Kimi K2 is open source, and unlike closed-source models, it shows you its reasoning process before responding. The model uses a Mixture-of-Experts (MoE) architecture with 1 trillion total parameters, of which only 32 billion are active per inference. It has 61 layers and 384 experts, selects 8 experts per token, and offers a 256K-token context window. The MoE design is what makes running it on consumer hardware even remotely feasible: instead of computing with all trillion parameters, you're working with 32 billion active parameters at any given time while the rest sit dormant until needed.
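To make the sparse-activation idea concrete, here's a toy Rust sketch of top-k expert routing. It illustrates the general technique only; it's not Kimi's actual implementation, and the gate scores are made up.

```rust
// Toy illustration of sparse MoE routing (not Kimi's actual code):
// the router scores every expert for a token and only the top-k experts run.
fn top_k_experts(gate_scores: &[f32], k: usize) -> Vec<usize> {
    let mut indexed: Vec<(usize, f32)> = gate_scores.iter().copied().enumerate().collect();
    // Rank experts by gate score, highest first.
    indexed.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    indexed.into_iter().take(k).map(|(idx, _)| idx).collect()
}

fn main() {
    // Fake gate scores for 384 experts; K2 routes each token to 8 of them.
    let scores: Vec<f32> = (0..384).map(|i| ((i * 37) % 97) as f32 / 97.0).collect();
    let active = top_k_experts(&scores, 8);
    println!("experts activated for this token: {active:?}");
}
```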
The Benchmark Story#
Moonshot published some strong numbers, and they're competing directly with closed models like GPT-5 and Claude Sonnet 4.5:
The SWE-Bench score is particularly interesting because that benchmark tests actual software engineering tasks - the kind of work I wanted to evaluate.
Kimi K2 Thinking on a Real Codebase#
Benchmarks tend to inflate a model's capabilities, and they don't always reflect how the model performs on actual use cases. For that reason, I'm going to test the model on a real codebase and give it a real task to solve.
The task I've chosen involves a large codebase with very large files; some of them are over 2000 lines of code. Solving the problem requires a lot of context along with reasoning, and it requires searching through the codebase to find the relevant files and pieces of code, which makes it a good test of the model's tool-calling capabilities.
The GitHub issue that I chose can be found here in the Shuttle repository.
To summarize the issue: when you create a project with the Shuttle CLI, it generates a .shuttle/config.toml file storing your project ID, but when you run shuttle project delete, the project gets removed from the platform while the stale project ID stays in your local config file. This stale value causes confusion in subsequent commands since the CLI tries to use a project ID that doesn't exist anymore. Similarly, if you delete a project and create a new one, the old ID persists instead of updating to the new project's ID. The fix needs to delete the project ID from the config when running shuttle project delete and update the config with the new project ID when running shuttle project create.
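Conceptually, the fix is small. Below is a minimal Rust sketch of the intended behavior; the function names and the simplified config format are my own placeholders for illustration, not Shuttle's actual API.

```rust
// Hypothetical sketch of the fix's behavior (not Shuttle's real code).
use std::fs;
use std::path::Path;

// Assumed config location, per the issue description.
const CONFIG_PATH: &str = ".shuttle/config.toml";

// After `project create` succeeds, persist the freshly returned project ID
// so subsequent commands point at the right project.
fn save_project_id(id: &str) -> std::io::Result<()> {
    fs::create_dir_all(".shuttle")?;
    fs::write(CONFIG_PATH, format!("id = \"{id}\"\n"))
}

// After `project delete` succeeds, drop the stale ID so the CLI stops
// referencing a project that no longer exists.
fn clear_project_id() -> std::io::Result<()> {
    if Path::new(CONFIG_PATH).exists() {
        fs::write(CONFIG_PATH, "")?;
    }
    Ok(())
}
```

The real change lives in cargo-shuttle's config handling, but the shape is the same: write the ID on create, clear it on delete.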
The Prompt#
I wanted to give it a very high-level prompt without delving into the details, giving the model a chance to understand the problem and come up with a plan by itself.
Here's the prompt:
Kimi K2 Thinking in Action#
Like any other model, K2 starts by searching the codebase and collecting context. It used find to locate project-related files, ran grep searches for "project create" and "project delete", then found and read cargo-shuttle/src/lib.rs and cargo-shuttle/src/config.rs to understand how the config management works.
Worth noting: the lib.rs file is over 2000 lines, and most AI models struggle with files this size. K2 Thinking read it, understood the structure, and correctly identified where the changes needed to happen. After the exploration phase, it presented a plan and started implementing the fix.
Testing the Implementation#
It updated the codebase and wrote some code that compiles without errors, but the real question is whether it works as expected.
The script I used for testing is below; it just compiles the code and creates an alias for the dev CLI:
Test 1: Project Creation
Running shuttle-dev project create --name my-axum-app:
Success. The project was created with ID proj_01KA3S8JN8TNWBGSJZFJRXM7PG, and checking .shuttle/config.toml:
The config file was updated correctly with the new project ID.
Test 2: Project Deletion
Running shuttle-dev project delete:
The project was deleted from the platform. Checking the config file:
Empty. The stale project ID is gone.
Test 3: Creating a New Project After Deletion
The final scenario: delete a project, create a new one, and verify the config syncs with the new ID.
Running shuttle-dev project create --name my-new-axum-app:
The config updated with the new project ID (proj_01KA3SE04947EBMCF2RMZQT3BP). Everything stayed in sync.
All three test cases passed ✅. The solution performed exactly as needed.
Wrap Up#
K2 Thinking handled this specific task well. It explored a large codebase systematically, understood the context across a 2000+ line file, and implemented a working solution on the first try. The reasoning mode showed its value in the methodical approach. Rather than jumping to conclusions, it mapped out the codebase structure before making changes.
This was a real-world coding use case, not a synthetic benchmark, and it worked well. That said, it doesn't mean it will work on all projects. Success depends on the type of complexity involved. For this particular issue (understanding config file management and modifying CLI commands), the model performed solidly.
The tradeoff is speed. This isn't a fast model. The thinking phase adds latency, and for simple tasks, that overhead might not be worth it. But for complex codebases where understanding context matters more than raw speed, the deliberate approach pays off.
Would I use it for quick one-off scripts or simple bug fixes? Probably not. For navigating unfamiliar production code and making changes that need to be right the first time? It's a solid option, especially considering it's open-source and you can run it locally (if you're rich and have the hardware).
The benchmark numbers hold up in practice. The 71.3% SWE-Bench score makes sense after seeing it work through a real engineering task. Not perfect, but capable enough to be useful.
Working with AI Coding Tools#
If you're interested in getting the most out of AI coding assistants like Claude Code, check out our comprehensive guide on best practices.
It covers practical techniques for prompt engineering, codebase navigation, debugging workflows, and how to structure your projects to work effectively with AI assistants.
Join our Discord server to stay updated on the latest AI coding tools and best practices.






