This week we’ve been experimenting with the Step-3.5-Flash-GGUF model, and we’re evaluating whether to switch all of our fleet agents to Step-3.5.

Step-3.5-Flash Overview#

Step 3.5 Flash is StepFun’s most capable open-source foundation model, engineered to deliver frontier reasoning and agentic capabilities with exceptional efficiency. Built on a sparse Mixture of Experts (MoE) architecture, it selectively activates only 11B of its 196B parameters per token.
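
To picture how sparse MoE routing keeps the active parameter count low: each token's hidden state is scored against a gate, only the top-k highest-scoring experts run, and their gate scores are renormalized into mixture weights. The expert count, top-k value, and `moe_route` helper below are illustrative assumptions, not StepFun's published router design:

```python
import math

def moe_route(gate_logits, top_k=2):
    # Toy top-k MoE router (illustrative; not Step 3.5 Flash's actual design):
    # keep the top_k highest-scoring experts and softmax-renormalize their
    # gate scores so the mixture weights sum to 1.
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:top_k]
    m = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - m) for i in top]
    z = sum(exps)
    return top, [e / z for e in exps]

# 8 hypothetical experts; only 2 are activated for this token
experts, weights = moe_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3])
print(experts)                 # [1, 4] — the two highest-scoring experts
print(round(sum(weights), 6))  # 1.0 — mixture weights are renormalized
```

All other experts contribute nothing to this token, which is why only a small slice of the total parameters is ever exercised per forward pass.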

Model Architecture & Specifications#

  • Model Architecture: Sparse Mixture of Experts (MoE) transformer
  • Backbone: 45-layer Transformer with 4,096 hidden dimensions
  • Total Parameters: 196.81B (196B backbone + 0.81B head)
  • Active Parameters: ~11B per token generation
  • Context Window: 256K tokens
  • Vocabulary: 128,896 tokens
  • Quantization: Available in GGUF format (Q4_K_S)
  • License: Apache 2.0
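
The efficiency claim follows directly from the parameter counts above: only about 5.6% of the model's weights participate in generating each token.

```python
total_params_b = 196.81   # total parameters, in billions (from the spec above)
active_params_b = 11.0    # parameters active per token
fraction = active_params_b / total_params_b
print(f"active fraction ≈ {fraction:.1%}")  # active fraction ≈ 5.6%
```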

Key Capabilities#

Deep Reasoning at Speed: Powered by 3-way Multi-Token Prediction (MTP-3), Step 3.5 Flash achieves generation throughput of 100–300 tok/s (peaking at 350 tok/s for coding tasks). This enables complex, multi-step reasoning chains with immediate responsiveness.
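
To see why multi-token prediction raises throughput: each decoding step commits one verified token plus however many of the extra speculated tokens are accepted. The independent per-token acceptance probability below is a simplifying assumption, used only to illustrate the expected tokens-per-step; real MTP acceptance depends on the verifier's distribution.

```python
def expected_tokens_per_step(accept_p, extra=3):
    # 1 guaranteed token + the expected length of the accepted run of
    # `extra` speculated tokens, assuming each is accepted independently
    # with probability accept_p (a simplification for illustration).
    expected, p = 1.0, 1.0
    for _ in range(extra):
        p *= accept_p
        expected += p
    return expected

# With 3 speculated tokens (as in MTP-3) and an assumed 80% per-token
# acceptance rate, each step commits ~2.95 tokens on average.
print(round(expected_tokens_per_step(0.8, extra=3), 3))  # 2.952
```

Under this toy model, higher acceptance rates push the multiplier toward 4x (one step, four tokens), which is consistent with the large throughput gains reported for MTP-style decoding.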

Agentic Performance: The model excels at agentic tasks, achieving:

  • 74.4% on SWE-bench Verified
  • 51.0% on Terminal-Bench 2.0

Efficient Long Context: Supports a 256K context window using a 3:1 Sliding Window Attention (SWA) ratio, interleaving three SWA layers for every full-attention layer to reduce computational overhead.
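
A 3:1 SWA ratio can be pictured as a repeating block of three sliding-window layers followed by one full-attention layer. The exact interleaving Step 3.5 Flash uses is not specified here, so the layout below is an assumed sketch:

```python
def attention_layout(num_layers=45, swa_per_full=3):
    # Assumed repeating pattern [SWA, SWA, SWA, FULL]; the real model's
    # ordering may differ, but the 3:1 ratio is preserved.
    block = ["SWA"] * swa_per_full + ["FULL"]
    return [block[i % len(block)] for i in range(num_layers)]

layers = attention_layout()
print(layers.count("SWA"), layers.count("FULL"))  # 34 11
```

Since SWA layers attend only within a fixed window, their cost stays linear in sequence length, so roughly three-quarters of the layers avoid the quadratic cost of full attention over 256K tokens.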

Local Deployment: Optimized for accessibility, the quantized model runs entirely on high-end consumer hardware (Mac Studio M4 Max, NVIDIA DGX Spark), keeping data on-device for privacy.

Performance Benchmarks#

Step 3.5 Flash demonstrates competitive performance against leading closed-source models:

| Benchmark          | Step 3.5 Flash | DeepSeek V3.2 | Kimi K2.5 |
|--------------------|----------------|---------------|-----------|
| AIME 2025          | 97.3%          | 93.1%         | 94.5%     |
| SWE-bench Verified | 74.4%          | 73.1%         | 71.3%     |
| LiveCodeBench-V6   | 86.4%          | 83.3%         | 83.1%     |

Full benchmark data available on the official model page.

Conclusion#

This experimentation shows promise for local model deployment. Step-3.5-Flash’s MoE architecture provides an excellent balance of performance and efficiency, making it suitable for resource-constrained environments while maintaining competitive results with much larger dense models.

We’ll continue to evaluate its performance and integration possibilities for our development workflows.
