Runs in your browser — 100% private

In-Browser
AI Chat

Hermes 3 (Llama 3.1 8B) runs entirely on your device via WebGPU. No server. No account. No data ever leaves your browser.

No uploads · No account needed · No conversation logging · Unlimited messages · WebGPU accelerated
01

Model loads once

On first visit, the Hermes 3 weights download from a CDN and are cached in your browser. Subsequent visits start instantly.
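Weight caching of this kind is typically built on the browser's Cache Storage API. As an illustrative sketch only (the cache name here is an assumption, not this page's actual storage key), a page can check whether weights are already cached before showing a download warning:

```javascript
// Hedged sketch: probe the Cache Storage API for a previously cached model.
// The cache name is a hypothetical placeholder, not the runtime's real key.
async function isModelCached(cacheName = "webllm/model") {
  if (typeof caches === "undefined") {
    return false; // no Cache Storage API (e.g. non-browser runtime)
  }
  const names = await caches.keys();
  return names.includes(cacheName);
}
```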

02

Inference runs locally

Your GPU executes the model entirely inside the browser tab via WebGPU. No cloud inference, no API calls.
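Because WebGPU support varies by browser, pages like this one generally feature-detect it before downloading any weights. A minimal sketch (the reason strings are illustrative):

```javascript
// Feature-detect WebGPU before attempting to load model weights.
// Guarded so it degrades gracefully where `navigator` or `navigator.gpu`
// is missing (older browsers, non-browser runtimes).
async function checkWebGPU() {
  if (typeof navigator === "undefined" || !("gpu" in navigator)) {
    return { supported: false, reason: "WebGPU API not present" };
  }
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) {
    return { supported: false, reason: "no suitable GPU adapter" };
  }
  return { supported: true, reason: "ok" };
}
```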

03

Chat freely

Type a message, press Enter, and get a streamed response. The full conversation history stays on your device.

About Hermes 3

Hermes 3 is NousResearch's instruction-tuned variant of Meta's Llama 3.1 8B, optimised for multi-turn conversation and instruction following. The quantised build used here (q4f16_1: 4-bit weights with float16 activations) keeps quality high while fitting the model in a reasonable amount of VRAM.

Why WebGPU?

WebGPU gives browsers low-level access to your GPU, making it fast enough to run 8B-parameter models in real time. This tool uses WebLLM from the MLC project, a compiler-optimised runtime purpose-built for in-browser LLM inference.
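To make the flow concrete, here is a hedged sketch of driving MLC's WebLLM runtime with its published `CreateMLCEngine` and OpenAI-style `chat.completions` API. The model id and progress handling are assumptions for illustration, not this page's exact code:

```javascript
// Hedged sketch of in-browser inference with MLC's WebLLM
// (npm: @mlc-ai/web-llm). The model id is an assumption.
async function chatOnce(prompt, onToken) {
  const { CreateMLCEngine } = await import("@mlc-ai/web-llm");
  // First call downloads and caches the weights; later calls hit the cache.
  const engine = await CreateMLCEngine("Hermes-3-Llama-3.1-8B-q4f16_1-MLC", {
    initProgressCallback: (p) => console.log(p.text),
  });
  // OpenAI-style streaming chat completion, executed entirely on-device.
  const stream = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });
  for await (const chunk of stream) {
    onToken(chunk.choices[0]?.delta?.content ?? "");
  }
}
```

The OpenAI-compatible surface means existing chat UI code can often be pointed at the local engine with minimal changes.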

Frequently asked questions

Is this AI chat really free?
Yes, completely free with no usage limits and no account required.
Is my conversation private?
Yes. The model runs inside your browser tab. No messages are ever sent to a server — there is no backend involved at all.
Why does the first load take a while?
The model weights are about 5 GB and are downloaded on first use, then cached. After that, the model loads from cache in a few seconds.
Which browsers are supported?
Chrome 113+, Edge 113+, and recent Safari on macOS 14+ all support WebGPU. Firefox does not yet have WebGPU enabled by default.
Can the model access the internet or my files?
No. The model is entirely sandboxed inside the browser. It has no network access and cannot read your filesystem.

More free tools

More on didof.dev