FUNDA

FUNDA

Deep|DeepSeek V4 vs Claude vs GPT-5.4: A 38-Task Benchmark Across Coding, Reasoning, and Financial Research

FUNDA's avatar
FUNDA
Apr 24, 2026
∙ Paid

Important note: This report is not a research report. It is an evaluation report completed by the FundaAI Engineering Team, not written by the FundaAI Analyst Team. It does not represent the views of the FundaAI Analyst Team.

All test cases are based on the actual working environment of the FundaAI Platform.

As of time of publication, GPT-5.5 has not yet officially released its API. Testing solely through Codex 5.5 may not fully reflect the complete performance of the API. We have currently only conducted urgent testing on DeepSeek V4, and will include GPT-5.5 test results as soon as its API becomes officially available.

Key Takeaways

  • Claude Opus 4.6 (Thinking) and Claude Opus 4.7 tie for #1 overall (both 8.72 weighted avg). They lead for different reasons: Opus 4.6 Thinking is strongest on coding and hard reasoning, while Opus 4.7 leads writing and full-coverage multi-step work.

  • DeepSeek V4 Pro (Thinking) has the highest completed-task multi-step score at 8.90, ahead of Opus 4.7 at 8.87, but with partial coverage: 29/38 tasks completed because several hard coding/reasoning tasks timed out.

  • DeepSeek V4 Pro received the only financial research 10/10 because it produced the strongest answer to the NVDA game theory task. It did not score highly because it was long; it scored highly because it fully developed 11 players, 18 citations, and forced-move economics.

  • GPT-5.4 remains the fastest full-suite model (105s avg), with strong coding and reasoning, but its latest composite score is 7.88 and it no longer leads the coding table.

  • DeepSeek V4 has a substantially lower estimated cost than Claude in this benchmark. Flash is ~$0.007/task, Flash Thinking is ~$0.008/task, Pro is ~$0.10/task, and Pro Thinking is ~$0.15/task, all below Claude Opus per-task cost estimates.

  • DeepSeek V4’s relative weakness is presentation format more than analysis quality. It generally produces strong markdown research, while Claude Opus 4.5 more readily produces dashboard-ready OpenUI charts, metric cards, and data tables.

  • The frontier is now a three-way race between Anthropic (writing, earnings, citation rigor), DeepSeek (analytical depth, data synthesis, cost), and OpenAI (speed, incident debugging, system design). GPT-5.5 API results still need to wait until the official API becomes available.


1. Summary

DeepSeek V4 is DeepSeek’s next-generation foundation model, shipped as two variants through the Pandora API: Pro (deeper, slower — optimized for thoroughness) and Flash (faster, more concise — optimized for production throughput). Both support multi-turn conversations with full tool calling. We obtained early access and benchmarked them against other frontier model configurations across 38 tasks spanning coding, reasoning, writing, and complex multi-step analysis with live data. DeepSeek V4 Pro performs in the same analytical tier as Claude Opus 4.7, while Claude Opus 4.6 (Thinking) and Claude Opus 4.7 tie for the top full-benchmark score. In the latest composite table, Claude Opus 4.6 (Thinking) and Claude Opus 4.7 both score 8.72 weighted avg, while DeepSeek V4 Pro follows at 8.27. DeepSeek V4 Pro (Thinking) reaches 8.90 on completed multi-step tasks, but its partial coverage makes it better interpreted as a high-quality thinking-mode result rather than a full-suite winner. The gap among models is driven by specific strengths: Opus 4.7 wins on earnings call integration, citation rigor, and writing quality; DeepSeek Pro wins on analytical depth, game theory, competitive mapping, multi-company comparison, and sentiment attribution.

Against standard Claude Opus 4.6 (8.17), DeepSeek V4 Pro (8.27) remains in the same tier while offering stronger depth on several financial research tasks. Against Claude Opus 4.5 and GPT-5.4 (both 7.88 in the latest composite table), DeepSeek V4 Pro is ahead by roughly 0.4 points. DeepSeek V4 Flash (8.01) remains competitive with standard Opus 4.6 while being 36% faster than Pro. With thinking enabled, DeepSeek Flash reaches 8.55 on completed multi-step tasks and becomes a stronger quality-per-dollar option, but its completion rate drops to 33/38.

DeepSeek V4 Flash averages 165s per task — faster than all three Claude Opus models (Opus 4.5: 138s, Opus 4.7: 227s, Opus 4.6: 267s). Pro is slower at 256s — comparable to Opus 4.7. Both models produce dense, token-efficient output: Pro averages ~4,200 output tokens and Flash ~3,800, compared to Claude’s ~7,500-8,800 characters. DeepSeek shows competitive quality per token — the content-to-padding ratio is notably higher, with fewer boilerplate disclaimers and more substantive tables, data points, and analysis per word.On cost, DeepSeek V4 shows a notable advantage in this benchmark. Flash at $0.28/1M output tokens costs ~$0.007/task, and Flash Thinking is still only ~$0.008/task. Pro at $3.48/1M output costs ~$0.10/task, while Pro Thinking is estimated at ~$0.15/task. Even with thinking enabled, DeepSeek remains below Claude Opus per-task cost estimates in this benchmark. Both support 1M context windows, 384K max output, and cache-hit discounts (Flash input drops to $0.028/1M, Pro to $0.145/1M).

DeepSeek V4 is now publicly available via api.deepseek.com (supporting both OpenAI and Anthropic API formats) — the first non-Anthropic model to match Claude Opus 4.7 on complex, tool-assisted analytical tasks. The implications:

  • For AI-assisted research workflows: Teams using Claude for deep analysis may now have a lower-cost alternative to evaluate, based on this benchmark. Flash at $0.28/1M output tokens is ~50x cheaper than Claude Opus; Pro at $3.48/1M is ~4x cheaper. Both models support 1M context with 384K max output.

  • For the competitive landscape: The benchmark suggests the gap between DeepSeek V4 and Anthropic’s flagship model has narrowed on analytical tasks. Claude retains an edge on writing quality and earnings call texture; DeepSeek leads on analytical breadth and data source synthesis.

  • For production deployments: Flash offers a strong speed-quality tradeoff at 165s avg and ~$0.007/task. Flash Thinking is slower at ~180s and has partial coverage, but its completed-task quality improves materially at only ~$0.008/task. This suggests teams should evaluate both modes against their own timeout budgets.

  • For the broader market: GPT-5.4, now tested on 37/38 tasks, scores 7.88 overall in the latest composite table. It remains fastest and strong on incident debugging/system design, but trails Claude and DeepSeek on deep implementation and financial research. GPT-5.5/Codex 5.5 appears highly capable for coding, but official API benchmarking is still pending. The top tier of frontier models is now a three-way race between Anthropic, DeepSeek, and OpenAI, with each excelling in different domains.

What We Tested

We ran 11 model configurations through a 38-task benchmark spanning 4 categories:

Source: Funda AI

Source: Funda AI

GPT-5.4 was tested with our multi-turn OpenAI-compatible runner with full tool calling support. All financial research tasks use live API data via the financial skills framework. Kimi K2.6 was tested on financial research only. Claude Opus 4.6 (Thinking) is the same model as Opus 4.6 but with extended thinking explicitly enabled (budget_tokens=10,000). The DeepSeek Thinking rows reflect public-API thinking mode with partial coverage because several hard coding tasks timed out. All other Claude models were tested in standard mode without extended thinking.


2. DeepSeek V4 Performance Deep Dive

2.1 Latency & Speed

Source: Funda AI

Where Pro is slow: Complex multi-turn tasks with heavy tool use — comps (577s), financial policy (468s), game theory (450s), DCF (439s). Pro tends to make more tool calls and produce longer intermediate reasoning before generating output.

Where Flash is fast: Earnings-adjacent tasks and simpler retrievals — semi-sentiment (49s), financial data (18s), multi-company (73s). Flash converges faster with fewer tool-use loops.

2.2 Token Efficiency

Source: Funda AI

Pro generates 30% more analytical content per financial task. This isn’t padding — it manifests as additional tables, deeper competitive breakdowns, more footnotes, and extended scenario analysis. The game theory task is the clearest example: Pro produced 4,260 words with 11 players, 18 footnotes, and forced-move economics analysis, vs Flash’s 2,800 words covering the same framework with less depth.

2.3 Success Rate & Reliability

Source: Funda AI

2.4 Performance by Category

Multi-step (21 tasks): DeepSeek V4’s strongest category. Standard Pro ties Opus 4.7 at 7-7 on the 16 financial research tasks and scores 8.29 avg across all multi-step tasks. With thinking enabled, Pro reaches 8.90 on completed multi-step tasks and Flash reaches 8.55, but both have partial coverage due to timeouts. DeepSeek excels at analytical depth (game theory, competitive mapping, multi-company comparison); Opus 4.7 excels at earnings call integration, citation rigor, and structured valuation methodology. See §4.1 for full task-by-task comparison.

Coding (8 tasks): Standard DeepSeek Pro and Flash score about 8.1 avg. DeepSeek Pro (Thinking) scores 8.48 on 4 completed coding tasks and Flash (Thinking) scores 8.44 on 6 completed coding tasks, but hard coding timeouts make the thinking-mode results less complete. Claude Opus 4.6 Thinking leads at 8.88, followed by DeepSeek Pro (Thinking), Opus 4.7, and Flash (Thinking). See §4.2 for details. Reasoning (6 tasks): Standard DeepSeek Pro scored 8.17 and Flash scored 8.12, close to GPT-5.4 (8.15) and standard Opus 4.6 (8.18). Thinking mode lifts Flash to 8.63 on 4 completed reasoning tasks and Pro to 8.53 on 4 completed tasks. Claude Opus 4.6 Thinking leads this category at 8.82. See §4.3.Writing (3 tasks): Opus 4.7 remains the writing leader at 9.17 in the detail table, with the only 10/10 writing result. DeepSeek V4 Pro (Thinking) scores 8.80, standard Pro scores 8.33, Flash Thinking scores 8.20 on its completed writing task, and standard Flash scores 7.87. See §4.4.

2.5 Thinking Mode: DeepSeek V4 with Reasoning Enabled

DeepSeek V4’s public API defaults to thinking mode, where the model reasons internally before responding. Our standard benchmark used the pre-release Pandora API without thinking; we then re-ran Pro and Flash with thinking enabled on the public API.

Source: Funda AI

Thinking mode is the largest quality lever in the DeepSeek data, especially on earnings recap, DCF, and non-financial multi-step tasks. The tradeoff is reliability: Pro Thinking completed 29/38 tasks and Flash Thinking completed 33/38, with the missing tasks concentrated in hard coding and hard reasoning timeouts. Scores for the thinking rows should therefore be read as completed-task quality, not full-suite coverage.


3. Horizontal Comparison: All Models

3.1 Latency

GPT-5.4 (105s) is the fastest full-suite model in the benchmark. DeepSeek V4 Flash and Opus 4.5 are the fastest models with complete benchmark coverage after GPT-5.4. DeepSeek Pro Thinking is the slowest tested configuration at ~350s avg, with partial coverage.

Source: Funda AI

3.2 Success Rate

Standard DeepSeek, Claude, GPT-5.4, and Kimi configurations all achieve 95%+ completion. Claude Opus 4.6 (Thinking) completed 39/40 attempted tasks; the ML pipeline failure hit a token limit. DeepSeek thinking mode has lower completion because several hard coding and reasoning tasks timed out on the first API call. Infrastructure errors (SSL, API bugs, proxy timeouts) are not counted against model capability.

Source: Funda AI

3.3 Multi-step Quality (0-10 Scale, 21 tasks)

Multi-step tasks include all 16 financial research tasks (company primers, comps, DCF, earnings, sentiment, game theory, etc.) plus 5 analytical/engineering tasks (NVDA vs AVGO research, contrarian bear case, SaaS IPO, production incident debugging, ML fraud pipeline).

Source: Funda AI

3.4 Coding Quality

Source: Funda AI

Claude Opus 4.6 (Thinking) records the highest coding-detail average in this benchmark at 8.88. DeepSeek Pro (Thinking) follows at 8.48 on 4 completed coding tasks, while Claude Opus 4.7 and DeepSeek Flash (Thinking) both score 8.44. The caveat is coverage: DeepSeek thinking-mode rows exclude hard coding tasks that timed out. GPT-5.4’s system design output was strong, but the rate limiter implementation was notably shallow compared with Opus 4.7’s production-grade implementation with 48 tests.

3.5 Reasoning Quality

Source: Funda AI

Claude Opus 4.6 (Thinking) records the highest reasoning-detail average at 8.82. DeepSeek Flash (Thinking) and Pro (Thinking) also improve substantially on completed reasoning tasks, at 8.63 and 8.53 respectively, but with only 4/6 reasoning tasks completed. GPT-5.4’s causal inference score was rescored from a prior 10/10 to 8.8 after side-by-side comparison; it remains strong, but not at the publication-ready threshold. Bayesian probability remains broadly strong across frontier models.

3.6 Writing Quality

* DeepSeek V4 Flash (Thinking) writing average reflects the completed writing task only.

Source: Funda AI

Opus 4.7’s technical blog post is the only 10/10 in the entire benchmark — publication-ready with authentic VP Engineering voice, real company details ($340K/month Datadog bill), and genuine emotional texture. DeepSeek Pro also produced an excellent blog post, and Pro Thinking improved the writing average to 8.80 on completed tasks.

3.7 Token Efficiency & Cost

Source: Funda AI

DeepSeek V4 is now publicly priced: Flash ($0.28/1M output, $0.14/1M input) is the cheapest high-quality model in this benchmark at ~$0.007/task. Flash Thinking remains nearly as cheap at ~$0.008/task despite using ~18% more output tokens. Pro ($3.48/1M output, $1.74/1M input) is estimated at ~$0.10/task, while Pro Thinking is ~$0.15/task with ~62% more output tokens. Both support cache-hit pricing at ~10-20% of standard rates. Among other models, GPT-5.4 offers strong cost/quality for standard tasks, and Claude Opus models are the most expensive while producing the highest quality output on complex writing and research tasks.

3.8 Overall Composite Ranking (0-10 Scale)

Scores represent averages across tasks in each category — a 1-point difference is directionally meaningful but not definitive. DeepSeek thinking-mode rows are included for visibility, but their coverage is partial; Kimi K2.6 is included only for financial research coverage.

Source: Funda AI

* DeepSeek thinking rows have partial coverage; weighted averages are computed over completed/scored tasks and should not be read as full-suite ranks. Kimi K2.6 was tested on financial research only (16 tasks). Direct comparison is most meaningful among full-coverage models.

4. Detailed Analysis: DeepSeek V4 vs All Models by Category

4.1 Multi-step — Detailed Comparison

Multi-step tasks (21 total) include 16 financial research tasks requiring live data retrieval, multi-turn tool calling, and analytical synthesis, plus 5 analytical/engineering tasks (research synthesis, contrarian analysis, SaaS IPO, production incident debugging, ML pipeline design).

DeepSeek V4 vs Claude Opus 4.7 (7-7 Tie on Financial Research)Opus 4.7’s advantages: Earnings call integration (management quotes, Q&A themes, market reaction analysis), citation rigor (more footnotes, data reconciliation), structured comps methodology (core-subset peer selection, PEG adjustment), and conservative DCF modeling (SBC as real cost).DeepSeek V4 areas of relative strength: Extended analytical depth on complex frameworks (game theory, multi-company comparison, sentiment analysis), broader data source utilization (insider transactions, options Greeks, supply chain details), and consistently higher word count with substantive content (30% more words on average).Task highlights:

  • Earnings Recap (Opus 4.7 wins decisively): Opus 4.7 produced 5+ management quotes, Q&A theme analysis, and market reaction tracking. DeepSeek’s recap was competent but lacked the earnings call texture that distinguishes professional equity research.

  • Game Theory (DeepSeek 10 vs Opus 9): Both produced strong NVDA strategic analysis. DeepSeek’s 28,302-character output with 11 players, 18 footnotes, and forced-move economics analysis received the only perfect score in the benchmark. Opus 4.7’s 12,700-character analysis with 8+ tables and 25 citations was also strong, but somewhat less comprehensive on the “If I Were Each Player” framework.

  • Sentiment Analysis (Opus 4.7 wins slightly): Opus delivered a 229-line report with 5-step attribution methodology, daily score tracking, and 3 named attribution chains. DeepSeek’s Pro produced strong attribution analysis with CEO interview data that Opus missed.

DeepSeek V4 vs Claude Opus 4.6 (7-7-1 Tie)

User's avatar

Continue reading this post for free, courtesy of FUNDA.

Or purchase a paid subscription.
© 2026 FundaAI · Privacy ∙ Terms ∙ Collection notice
Start your SubstackGet the app
Substack is the home for great culture