Apertus vs GPT-5.1, Gemini 3.0 Pro and Claude 4.5 Sonnet
Why a fully open Swiss model is still worth watching in a frontier-model world
This report compares Swiss AI’s Apertus model (https://publicai.co/) with three leading large language models:
- OpenAI – GPT-5.1
- Google DeepMind – Gemini 3.0 Pro
- Anthropic – Claude 4.5 “Sonnet”
The comparison spans:
- Performance on standard benchmarks and a real classroom-style coding task
- Open-source status and licensing
- Training data transparency
- Model architecture and scalability
- Deployment cost and efficiency
- Safety, alignment and guardrails
The conclusion: Apertus is not a “GPT-5 killer”, but a strategically different choice that trades a bit of raw power for openness, sovereignty and transparency.
1. Model Overview and Availability
High-level snapshot:
| Model | Developer | Scale | Open-Source? | Access / License | Notable Features |
|---|---|---|---|---|---|
| Apertus (70B & 8B) | Swiss AI Initiative (EPFL, ETHZ, CSCS) | 70B & 8B params, ~15T tokens | Yes | Apache 2.0, weights + training data public on Hugging Face | 1,800+ languages, 65k-token context, transparency and EU-AI-Act-ready documentation |
| GPT-5.1 | OpenAI | Not disclosed (successor to GPT-5) | No | Closed API (ChatGPT, OpenAI API) | “Instant” vs “Thinking” modes, ~400k-token context, fully multimodal (text, vision, audio) |
| Gemini 3.0 Pro | Google DeepMind | Not disclosed (flagship Gemini) | No | Google ecosystem (Gemini app, Vertex AI, etc.) | SOTA reasoning & multimodality, “Deep Think” mode, tops many public benchmarks |
| Claude 4.5 Sonnet | Anthropic | Not disclosed (latest Claude) | No | Claude API, AWS Bedrock, Vertex AI | Agentic design for long-running tasks, 200k–1M token context, extremely strong coding/tool use |
Key structural difference:
Apertus is the only fully open-source model here; weights, training recipes, and even training data are public. GPT-5.1, Gemini 3.0 Pro and Claude 4.5 Sonnet are proprietary, closed-weights models accessible only via API or platform integrations.
2. Performance Benchmarks (High-Level)
On standard benchmarks, GPT-5.1, Gemini 3.0 Pro and Claude 4.5 Sonnet sit at the frontier. Apertus aims instead for “LLaMA-3-era” performance with complete transparency.
Approximate picture:
| Benchmark | Apertus 70B | GPT-5.1 | Gemini 3.0 Pro | Claude 4.5 Sonnet |
|---|---|---|---|---|
| MMLU (academic knowledge) | ~70% (LLaMA-3-level, not SOTA) | ~84% (est.) | ~90% (SOTA or near) | ~89% (near SOTA) |
| GSM8K (math word problems) | No public figure; likely below SOTA | ~90%+ (higher with tools) | ~95% without tools; reported near-100% with code execution on AIME-style problems | Reported near-100% with code execution on AIME-style problems |
| HumanEval (Python coding) | No public numbers; far lower in practice (see case study) | ~92% (SOTA code generation) | ~90%+ (competitive) | ~90%+ (SOTA for coding agents) |
Roughly: Apertus ≈ top open models of 2024.
GPT-5.1 / Gemini 3.0 Pro / Claude 4.5 ≈ frontier proprietary models of 2025.
3. Case Study: Can They Actually Build a Real Interactive?
To move beyond benchmarks, I tested all models on a real classroom-style coding task:
Task:
Create a complete, self-contained HTML5 interactive on the Trigonometry Unit Circle for Sec 3–4 students (Singapore level), in a single HTML file with embedded CSS and JavaScript.
Key requirements (abridged):
- Rich educational content (Sec 3–4 trigonometry)
- Interactive controls (sliders, buttons, checkboxes, drag-and-drop)
- Dynamic 2D visualizations (unit circle, graphs, triangle view)
- Real-time feedback and live readings of sin θ and cos θ
- Toggle between degrees/radians
- Touch-friendly (mobile responsive, 44px touch targets)
- Randomised problem generator with step-by-step hints and “Why?” explanations
- Real-time analytics panel for learning assessment (timestamps, action log, state values)
- All in one HTML file, ready to drop into an LMS
Apertus
- Outcome: repeatedly produced incomplete code and got stuck at essentially the same truncated output each time.
- The model clearly understood the prompt, but could not sustain the long, complex generation needed for a full, production-ready interactive.
- Reference: Apertus test log – https://chat.publicai.co/c/75fe5835-66be-4ae4-a0b1-b90c61da0d73
GPT-5.1, Gemini 3.0 Pro, Claude 4.5, DeepSeek v2.3
- GPT-5.1, Gemini 3.0 Pro and DeepSeek v2.3 all generated complete, runnable HTML5 interactives, albeit with varying polish and bugs that a human developer would still need to fix.
- Claude 4.5 Sonnet is particularly strong at refactoring and improving such interactives, even if not always perfect on the first pass.
- These models could:
  - Maintain the long prompt requirements
  - Create a working canvas/unit circle + sliders
  - Wire up basic analytics logging in JavaScript
- Sample transcripts (for reference, not necessary in the blog):
  - GPT-5.1: https://chatgpt.com/share/69269c75-42a8-8008-8268-5389633343ab
  - Gemini 3.0 Pro: https://gemini.google.com/u/1/app/16fdb1e578a3431c?pageId=none
  - DeepSeek v2.3: https://chat.deepseek.com/a/chat/s/b485bcee-29c1-46eb-bf3b-26a25d32b829
  - Claude 4.5 Sonnet: https://claude.ai/chat/edb2b8ac-86fb-4f9a-9bf0-ce3fdeabf6ce
Takeaway from the case study
For complex, long-form, production-grade coding tasks, the frontier proprietary models are clearly ahead. Apertus behaves more like a strong open model that still needs careful chunking, scaffolding and possibly external tools to match this level of output.
4. Open-Source Status and Licensing
This is where Apertus truly stands out.
- Apertus
  - Fully open-source: weights, training code, and training data are public.
  - Licensed under Apache 2.0, allowing commercial use, modification and redistribution.
  - Downloadable from Hugging Face in 8B and 70B variants; you can self-host, fine-tune and even publish derivatives.
- GPT-5.1
  - Closed-source, proprietary model.
  - Access only via OpenAI’s APIs / ChatGPT.
  - No public weights or full training dataset.
- Gemini 3.0 Pro
  - Proprietary model embedded in Google’s ecosystem (Gemini app, Workspace, Vertex AI).
  - No weights or training data released; license is proprietary.
- Claude 4.5 Sonnet
  - Proprietary model; accessed via Anthropic API, AWS Bedrock, Vertex AI, etc.
  - No public weights or training corpus.

If you need self-hosting, auditability, and no vendor lock-in, Apertus is in a category of its own.
5. Training Data Transparency
Another area where Apertus is very different:
- Apertus
  - Trained on 15 trillion tokens from publicly available data only.
  - About 60% English / 40% non-English; explicitly includes many under-represented languages (Swiss German, Romansh, African and Asian languages).
  - Respects opt-out requests from websites.
  - Provides scripts to reconstruct the training dataset and a detailed record of the training pipeline.
  - Designed explicitly to comply with the EU AI Act’s transparency requirements.
- GPT-5.1, Gemini 3.0 Pro, Claude 4.5
  - Training data is described only in broad terms (web text, books, code, images, etc.).
  - No detailed list of sources, no way to reconstruct the corpus.
  - Users must largely trust the provider’s assurances about privacy, copyright and bias mitigation.

For researchers, regulators and public institutions, this level of openness from Apertus is rare and powerful: you can actually know what went into the model.
6. Architecture and Scalability
Apertus
- 70B-parameter decoder-only Transformer (plus an 8B variant).
- Uses modern training tricks: the xIELU activation function and the AdEMAMix optimizer.
- Trained from scratch, then aligned with supervised fine-tuning and QRPO.
- Supports a 65,536-token context – unusually long for an open model (see the quick config check below).
- All architectural details are fully documented.
- Not multimodal in practice: while the UI on some demos may allow image upload, Apertus currently behaves as a text-only model in real use.
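Because the architecture is public, claims like the context length can be checked directly against the published configuration rather than taken on faith. A minimal sketch, assuming the Hugging Face repository ID `swiss-ai/Apertus-70B-Instruct-2509` (verify the exact name on huggingface.co/swiss-ai) and the conventional config field names used by decoder-only Transformers:

```python
# Minimal sketch: inspect Apertus's published config from Hugging Face.
# Assumption: the repo ID below; check huggingface.co/swiss-ai for the exact name.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("swiss-ai/Apertus-70B-Instruct-2509")

# Decoder-only models typically expose their context window here; if the
# model card is accurate, this should report 65,536.
print("context window:", getattr(config, "max_position_embeddings", "n/a"))
print("hidden size:   ", getattr(config, "hidden_size", "n/a"))
print("layers:        ", getattr(config, "num_hidden_layers", "n/a"))
```

This kind of one-minute audit is simply impossible for the three closed models below.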
GPT-5.1
- Architecture details are proprietary, but likely a very large Transformer with mixture-of-experts and extensive tool-calling support.
- Provides:
  - ~400k-token context window.
  - Instant vs Thinking modes (adaptive compute – spending more “thinking tokens” on hard questions).
  - Full multimodality: text, images, and audio/video.
Gemini 3.0 Pro
- “Built from the ground up” as a multimodal model (text, images, code, video).
- Multiple variants (Pro, Flash, Lite, Deep Think).
- Very long context (hundreds of thousands of tokens), plus strong tool orchestration.
- Designed as a “one model for many domains” system.
Claude 4.5 Sonnet
- Transformer-based, optimized for agentic, long-horizon tasks.
- Context window: 200k tokens, extendable up to 1M for special use cases.
- Highly optimized for long sessions (e.g. 30+ hours of coding with stable memory and “checkpoints”).
- Strong code execution and file-creation tools built in.
Summary
All four models push context and scale in different ways. Apertus demonstrates what public institutions can do at 70B / 65k context; the proprietary models push further with undisclosed (and likely far larger) parameter counts, multimodality and more aggressive context scaling – but behind closed doors.
7. Cost of Deployment and Inference
Apertus
- No API fees – but you pay for hardware.
- The 70B model typically needs multiple high-VRAM GPUs for low-latency inference, especially with long contexts.
- The 8B model can run on a single 16–24 GB GPU or even on CPU (slowly).
- Can be quantized, pruned and fine-tuned by the user; supports efficient inference engines like vLLM/SGLang (see the serving sketch below).
- Attractive if you:
  - Already have GPU infrastructure (e.g. national supercomputers, institutional clusters), or
  - Expect very high usage and want to avoid per-token API bills.
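To make the self-hosting path concrete, here is a minimal sketch of serving the 8B model behind vLLM’s OpenAI-compatible server and querying it from Python. The repository ID `swiss-ai/Apertus-8B-Instruct-2509` is an assumption – check Hugging Face for the exact name:

```python
# Minimal self-hosting sketch. First launch an OpenAI-compatible server
# (assumed repo ID; verify the exact name on huggingface.co/swiss-ai):
#
#   vllm serve swiss-ai/Apertus-8B-Instruct-2509 --max-model-len 65536
#
# Then query it locally with the standard OpenAI client:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="swiss-ai/Apertus-8B-Instruct-2509",
    messages=[{"role": "user", "content": "Explain sin θ on the unit circle."}],
    max_tokens=300,
)
print(response.choices[0].message.content)
```

Because the server speaks the OpenAI wire format, existing API-based applications can often be pointed at a self-hosted Apertus by changing only the base URL.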
GPT-5.1
- Available as a paid API service.
- Typical pricing (late-2025 ballpark): roughly US$1.25 / 1M input tokens and US$10 / 1M output tokens – cheaper than GPT-4 and lower than Claude for output.
- Adaptive reasoning (Instant vs Thinking) reduces cost for simple tasks.
Gemini 3.0 Pro
- Accessed via Google Cloud (Vertex AI) and the Gemini app.
- Pricing is broadly in line with other frontier models; may be bundled into enterprise contracts.
- Highly optimized on Google TPUs; benefits from caching and batching in production.
Claude 4.5 Sonnet
- Premium pricing: ~US$3 / 1M input tokens and ~US$15 / 1M output tokens.
- Positioned as a high-end model for complex, long-running tasks.
- No self-hosting; entirely cloud/API based.
Cost pattern
- For small to medium usage: GPT-5.1 / Gemini / Claude are much easier – no hardware to manage, just pay per token.
- For very large-scale, sensitive or long-term deployments: Apertus may become cheaper and more strategic, especially if you can amortise hardware costs and use the 8B model for simple tasks while reserving 70B for harder ones (a rough break-even sketch follows below).
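To illustrate the break-even logic, here is a back-of-envelope sketch. All figures are illustrative assumptions – the API price is the ballpark quoted above, while the workload size and amortised GPU rate are hypothetical, not vendor quotes:

```python
# Back-of-envelope break-even sketch. All numbers are illustrative
# assumptions, not vendor quotes.
MONTHLY_TOKENS = 2_000_000_000   # assumed workload: 2B output tokens/month
API_PRICE_PER_M_OUT = 10.0       # US$/1M output tokens (GPT-5.1 ballpark above)

api_cost = MONTHLY_TOKENS / 1_000_000 * API_PRICE_PER_M_OUT

GPU_HOURLY = 2.50                # hypothetical amortised cost per high-VRAM GPU-hour
GPUS = 4                         # e.g. a quantized 70B deployment
HOURS_PER_MONTH = 730

self_host_cost = GPU_HOURLY * GPUS * HOURS_PER_MONTH

print(f"API:       ~US${api_cost:,.0f}/month")        # ~US$20,000
print(f"Self-host: ~US${self_host_cost:,.0f}/month")  # ~US$7,300 at these assumptions
```

The crossover point is obviously sensitive to utilisation: idle self-hosted GPUs still cost money, while API bills scale to zero with usage.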
8. Safety, Alignment and Guardrails
Apertus
- Training pipeline deliberately filters personal and toxic data; respects opt-outs.
- Post-training alignment with QRPO, but no strict output filter is shipped by default.
- The model card explicitly says that real-time content moderation is the user’s responsibility (filters are planned but not enforced yet).
- Focus is on legal and ethical compliance via transparency, rather than heavy, baked-in refusal behaviour (a sketch of a user-side policy layer follows below).
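Since Apertus leaves moderation to the deployer, a self-hosted setup needs its own policy layer. The sketch below is purely illustrative: the keyword blocklist is a toy stand-in, and a real deployment would replace it with a proper moderation classifier:

```python
# Toy policy layer around any self-hosted model backend.
# The blocklist is a placeholder; production systems should use a real
# moderation classifier, not keyword matching.
import re

BLOCKED_PATTERNS = [r"\bhow to make a weapon\b"]  # illustrative only

def violates_policy(text: str) -> bool:
    return any(re.search(p, text, re.IGNORECASE) for p in BLOCKED_PATTERNS)

def moderated_generate(generate, prompt: str) -> str:
    """Wrap any generate(prompt) -> str callable with input/output checks."""
    if violates_policy(prompt):
        return "Request declined by policy."
    output = generate(prompt)
    if violates_policy(output):
        return "Response withheld by policy."
    return output

# Usage: wrap whatever backend you run, e.g. the vLLM client from Section 7.
def fake_backend(prompt: str) -> str:
    return "sin θ is the y-coordinate of the point on the unit circle."

print(moderated_generate(fake_backend, "Explain sin θ."))
```

The point is architectural: with Apertus the policy layer is yours to inspect, test and tighten, whereas with the closed models below it is fixed by the vendor.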
GPT-5.1
- Strong, built-in content filters and refusal behaviour.
- Extensive RLHF, red-teaming and continuous safety updates; significantly fewer successful jailbreaks than GPT-4.
- Tends to be helpful but will decline disallowed content consistently.
Gemini 3.0 Pro
- Marketed as Google’s “most secure model yet”.
- Subjected to rigorous internal and external safety evaluations (e.g. by the UK AI Safety Institute).
- Improved resistance to prompt injection, sycophancy and misuse; deeply integrated with Google’s safety stack.
Claude 4.5 Sonnet
- Anthropic’s signature “Constitutional AI” approach; possibly the strongest public emphasis on safety and alignment.
- High refusal rates for dangerous content; reduced sycophancy and deceptive behaviours.
- Designed to stay safe even over very long, agentic sessions with complex tool use.
Bottom line
- The three proprietary models come with strong, maintained guardrails out of the box, but you can’t inspect or alter their safety layers.
- Apertus gives you full visibility and control, but you must build your own moderation and policy layer if you need strict guardrails.
9. Why Apertus Is Still Worth Exploring
Given all this, why should anyone bother with Apertus when GPT-5.1, Gemini 3.0 Pro and Claude 4.5 are clearly stronger on raw capability?
1. Open and Sovereign AI
Apertus embodies a “public infrastructure” model of AI:
- Built by public institutions (EPFL, ETHZ, CSCS) for the public good.
- Can be deployed entirely on national or institutional infrastructure, keeping sensitive data on local soil.
- Avoids dependence on foreign vendors’ policies, pricing and availability.
For governments, universities and regulated sectors (banks, hospitals, schools), this is strategically different from renting a black-box API from overseas.
2. Transparency and Regulatory Compliance
- Full architectural and data transparency makes Apertus auditable.
- Easier to answer questions like: “Was this model trained on copyrighted X?” or “How did we mitigate bias against group Y?”
- Aligns naturally with emerging AI governance regimes (e.g. the EU AI Act) that demand provenance and explainability, not just performance.
3. Multilingual and “Long Tail” Strengths
- Apertus dedicates a large portion of its training to non-English and under-represented languages.
- For many “long-tail” languages and local dialects, it may be stronger out of the box than frontier models that are mostly optimised for English and a few major languages.
- This makes it especially interesting for countries and communities where such languages matter for education and public services.
4. Adaptability and Research Value
- With Apache 2.0, you can (see the fine-tuning sketch after this list):
  - Fine-tune Apertus on your own domain data.
  - Experiment with new alignment techniques.
  - Add modalities (e.g. vision) or integrate into open-source toolchains.
  - Release derivative models.
- For researchers, Apertus is a live laboratory for studying large-scale LLM behaviour, bias and safety in a way that closed models simply cannot offer.
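As a concrete example of that adaptability, here is a minimal LoRA fine-tuning sketch using Hugging Face transformers + peft. The repo ID and the LLaMA-style attention module names (`q_proj`, `v_proj`) are assumptions to verify against the actual model config:

```python
# Minimal LoRA fine-tuning sketch with transformers + peft.
# Assumptions: the repo ID below, and LLaMA-style module names as LoRA targets.
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "swiss-ai/Apertus-8B-Instruct-2509"  # assumed Hugging Face ID
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Attach small trainable LoRA adapters instead of updating all 8B weights.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"]))  # verify names against the config

# Tiny in-memory dataset standing in for real domain data.
texts = ["Q: What is sin 30°? A: 1/2.", "Q: What is cos 60°? A: 1/2."]
ds = Dataset.from_dict({"text": texts}).map(
    lambda ex: tok(ex["text"], truncation=True, max_length=256),
    remove_columns=["text"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="apertus-lora",
                           per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=2e-4),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()

model.save_pretrained("apertus-lora")  # Apache 2.0 permits publishing this adapter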
5. Cost Benefits at Very Large Scale
- For organisations consuming billions of tokens per month, API costs from GPT-5.1 / Gemini / Claude add up quickly.
- Running Apertus on existing GPU clusters can be cheaper in the long run, especially with:
  - Quantisation,
  - Tiered deployment (8B for simple tasks, 70B for hard ones – see the routing sketch after this list), and
  - Careful batching.
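A tiered deployment can be as simple as a router that sends easy requests to an 8B endpoint and hard ones to 70B. The endpoint names and the difficulty heuristic below are illustrative assumptions, not a prescribed design:

```python
# Toy two-tier router: cheap 8B model for short/simple prompts,
# 70B for everything else. Endpoints and the heuristic are assumptions.
SMALL_ENDPOINT = "http://apertus-8b:8000/v1"   # hypothetical service names
LARGE_ENDPOINT = "http://apertus-70b:8000/v1"

HARD_HINTS = ("prove", "refactor", "step-by-step", "analyse")

def pick_endpoint(prompt: str) -> str:
    """Crude difficulty heuristic: length plus a few keyword hints.
    Real routers often use a small classifier or confidence-based escalation."""
    hard = len(prompt) > 800 or any(h in prompt.lower() for h in HARD_HINTS)
    return LARGE_ENDPOINT if hard else SMALL_ENDPOINT

print(pick_endpoint("What is 2 + 2?"))                     # -> 8B endpoint
print(pick_endpoint("Refactor this 500-line module ..."))  # -> 70B endpoint
```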
6. Community and Ecosystem
- Apertus lives within the open-source ecosystem: Hugging Face, community fine-tunes, open research projects.
- Over time, we can expect:
  - Domain-tuned Apertus variants (coding, legal, educational, etc.)
  - Shared prompts, evaluation sets and safety tooling.
- That community-driven evolution is already what keeps open models like LLaMA competitive; Apertus extends that story with a much higher level of transparency.
10. Conclusion
If your question is simply “Which model is the most capable right now?”, the answer in late 2025 is still:
Gemini 3.0 Pro, GPT-5.1 and Claude 4.5 Sonnet lead on frontier benchmarks and complex tasks.
Apertus does not beat them on raw performance. My own classroom-style coding test (the Trigonometry Unit Circle interactive) made that very clear: Apertus repeatedly failed to complete the full HTML5 simulation, while the proprietary models could.
But that is not the only question that matters.
If your questions are:
- “Who controls this model?”
- “Can we audit the training data?”
- “Can we run it on national infrastructure?”
- “Can we adapt and publish our own versions?”
then Apertus offers something the frontier models cannot:
complete openness, sovereignty, and a blueprint for transparent AI at scale.
In a world where GPT-5.1, Gemini 3.0 Pro and Claude 4.5 Sonnet dominate the performance charts, Apertus stands out as a credible, high-end public alternative. It deliberately trades a few percentage points of benchmark score for “sunlight” in every layer of the stack – from data to weights to training code.
As AI regulation, digital sovereignty and educational use cases mature, that trade-off is looking less like a weakness and more like a necessary second pillar alongside frontier proprietary models.