This week we’ve been experimenting with the Step-3.5-Flash-GGUF model, and we’re evaluating whether to switch all of our fleet agents to Step-3.5.

Step-3.5-Flash Overview#

Step 3.5 Flash is StepFun’s most capable open-source foundation model, engineered to deliver frontier reasoning and agentic capabilities with exceptional efficiency. Built on a sparse Mixture of Experts (MoE) architecture, it selectively activates only 11B of its 196B parameters per token.
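
To picture how sparse MoE routing keeps the active parameter count low: each token's hidden state is scored against a gate, only the top-k highest-scoring experts run, and their gate scores are renormalized into mixture weights. The expert count, top-k value, and `moe_route` helper below are illustrative assumptions, not StepFun's published router design:

```python
import math

def moe_route(gate_logits, top_k=2):
    # Toy top-k MoE router (illustrative; not Step 3.5 Flash's actual design):
    # keep the top_k highest-scoring experts and softmax-renormalize their
    # gate scores so the mixture weights sum to 1.
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:top_k]
    m = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - m) for i in top]
    z = sum(exps)
    return top, [e / z for e in exps]

# 8 hypothetical experts; only 2 are activated for this token
experts, weights = moe_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3])
print(experts)                 # [1, 4] — the two highest-scoring experts
print(round(sum(weights), 6))  # 1.0 — mixture weights are renormalized
```

All other experts contribute nothing to this token, which is why only a small slice of the total parameters is ever exercised per forward pass.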

Model Architecture & Specifications#

  • Model Architecture: Sparse Mixture of Experts (MoE) transformer
  • Backbone: 45-layer Transformer with 4,096 hidden dimensions
  • Total Parameters: 196.81B (196B backbone + 0.81B head)
  • Active Parameters: ~11B per token generation
  • Context Window: 256K tokens
  • Vocabulary: 128,896 tokens
  • Quantization: Available in GGUF format (Q4_K_S)
  • License: Apache 2.0
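
The efficiency claim follows directly from the parameter counts above: only about 5.6% of the model's weights participate in generating each token.

```python
total_params_b = 196.81   # total parameters, in billions (from the spec above)
active_params_b = 11.0    # parameters active per token
fraction = active_params_b / total_params_b
print(f"active fraction ≈ {fraction:.1%}")  # active fraction ≈ 5.6%
```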

Key Capabilities#

Deep Reasoning at Speed: Powered by 3-way Multi-Token Prediction (MTP-3), Step 3.5 Flash achieves generation throughput of 100–300 tok/s (peaking at 350 tok/s for coding tasks). This enables complex, multi-step reasoning chains with immediate responsiveness.
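
To see why multi-token prediction raises throughput: each decoding step commits one verified token plus however many of the extra speculated tokens are accepted. The independent per-token acceptance probability below is a simplifying assumption, used only to illustrate the expected tokens-per-step; real MTP acceptance depends on the verifier's distribution.

```python
def expected_tokens_per_step(accept_p, extra=3):
    # 1 guaranteed token + the expected length of the accepted run of
    # `extra` speculated tokens, assuming each is accepted independently
    # with probability accept_p (a simplification for illustration).
    expected, p = 1.0, 1.0
    for _ in range(extra):
        p *= accept_p
        expected += p
    return expected

# With 3 speculated tokens (as in MTP-3) and an assumed 80% per-token
# acceptance rate, each step commits ~2.95 tokens on average.
print(round(expected_tokens_per_step(0.8, extra=3), 3))  # 2.952
```

Under this toy model, higher acceptance rates push the multiplier toward 4x (one step, four tokens), which is consistent with the large throughput gains reported for MTP-style decoding.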

Agentic Performance: The model excels at agentic tasks, achieving:

  • 74.4% on SWE-bench Verified
  • 51.0% on Terminal-Bench 2.0

Efficient Long Context: Supports a 256K context window using a 3:1 Sliding Window Attention (SWA) ratio, interleaving three SWA layers for every full-attention layer to reduce computational overhead.
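
A 3:1 SWA ratio can be pictured as a repeating block of three sliding-window layers followed by one full-attention layer. The exact interleaving Step 3.5 Flash uses is not specified here, so the layout below is an assumed sketch:

```python
def attention_layout(num_layers=45, swa_per_full=3):
    # Assumed repeating pattern [SWA, SWA, SWA, FULL]; the real model's
    # ordering may differ, but the 3:1 ratio is preserved.
    block = ["SWA"] * swa_per_full + ["FULL"]
    return [block[i % len(block)] for i in range(num_layers)]

layers = attention_layout()
print(layers.count("SWA"), layers.count("FULL"))  # 34 11
```

Since SWA layers attend only within a fixed window, their cost stays linear in sequence length, so roughly three-quarters of the layers avoid the quadratic cost of full attention over 256K tokens.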

Local Deployment: Optimized for accessibility, the quantized model runs entirely on high-end consumer hardware (Mac Studio M4 Max, NVIDIA DGX Spark), keeping data on-device for privacy.

Performance Benchmarks#

Step 3.5 Flash demonstrates competitive performance against leading closed-source models:

| Benchmark          | Step 3.5 Flash | DeepSeek V3.2 | Kimi K2.5 |
|--------------------|----------------|---------------|-----------|
| AIME 2025          | 97.3%          | 93.1%         | 94.5%     |
| SWE-bench Verified | 74.4%          | 73.1%         | 71.3%     |
| LiveCodeBench-V6   | 86.4%          | 83.3%         | 83.1%     |

Full benchmark data available on the official model page.

Conclusion#

This experimentation shows promise for local model deployment. Step-3.5-Flash’s MoE architecture provides an excellent balance of performance and efficiency, making it suitable for resource-constrained environments while maintaining competitive results with much larger dense models.

We’ll continue to evaluate its performance and integration possibilities for our development workflows.
