Model loads once
On first visit, the Hermes 3 weights download from a CDN and are cached in your browser. Subsequent visits start instantly.
Inference runs locally
Your GPU executes the model entirely inside the browser tab via WebGPU. No cloud inference, no API calls.
Chat freely
Type a message, press Enter, and get a streamed response. The full conversation history stays on your device.
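Because everything stays client-side, the conversation itself is just data held in the tab's memory. A minimal sketch of how the on-device history might be kept as an array of OpenAI-style messages (the `record` helper and `ChatMessage` type are illustrative, not the tool's actual code):

```typescript
// Sketch: on-device conversation history as OpenAI-style messages.
// Nothing here leaves the page; the array lives only in the tab's memory.
type ChatMessage = { role: "user" | "assistant"; content: string };

const history: ChatMessage[] = [];

// Append one turn to the history (hypothetical helper).
function record(role: ChatMessage["role"], content: string): void {
  history.push({ role, content });
}

record("user", "Hello!");
record("assistant", "Hi, how can I help?");
console.log(history.length); // 2
```

Keeping the full history in a plain array is what lets each new request include the whole conversation as context without any server-side session state.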
About Hermes 3
Hermes 3 is an instruction-tuned variant of Llama 3.1 8B by
NousResearch, optimised for multi-turn conversation and instruction following.
The quantised build used here (q4f16_1) stores weights in 4-bit precision with
16-bit activations, keeping quality high while fitting the model in roughly 5 GB of VRAM.
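The VRAM saving from 4-bit weights is easy to estimate with back-of-the-envelope arithmetic. The figures below are illustrative approximations, not exact numbers for this build:

```typescript
// Rough VRAM estimate for an 8B model quantised to 4-bit weights (q4f16_1).
// Approximate figures only; the runtime, KV cache, and activations add
// further overhead on top of the raw weight storage.
const params = 8.03e9;       // approximate Llama 3.1 8B parameter count
const bitsPerWeight = 4;     // q4 weight storage
const weightBytes = (params * bitsPerWeight) / 8;
const weightGiB = weightBytes / 2 ** 30;
console.log(weightGiB.toFixed(2)); // ≈ 3.74 GiB for the weights alone
```

At 16-bit precision the same weights would need roughly four times that, which is why quantisation is what makes an 8B model practical on consumer GPUs.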
Why WebGPU?
WebGPU gives browsers low-level access to your GPU, making it fast enough to run 8B-parameter models in real time. This tool uses MLC web-llm, a compiler-optimised runtime purpose-built for in-browser LLM inference.
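web-llm exposes an OpenAI-style chat API on top of the WebGPU runtime. A minimal sketch of loading a model and streaming a reply (this assumes the `@mlc-ai/web-llm` package in a WebGPU-capable browser; the model ID is taken from web-llm's prebuilt list and may differ between releases, and this is not the tool's actual source):

```typescript
// Sketch: load a prebuilt model and stream one chat reply with MLC web-llm.
// Requires a browser with WebGPU; the first call downloads and caches the
// weights, later calls reuse the cache.
import { CreateMLCEngine } from "@mlc-ai/web-llm";

async function chatOnce(prompt: string): Promise<string> {
  const engine = await CreateMLCEngine(
    "Hermes-3-Llama-3.1-8B-q4f16_1-MLC", // assumed model ID from the prebuilt list
    { initProgressCallback: (p) => console.log(p.text) } // download/compile progress
  );
  const chunks = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });
  let reply = "";
  for await (const chunk of chunks) {
    reply += chunk.choices[0]?.delta?.content ?? "";
  }
  return reply;
}
```

Because the API mirrors OpenAI's chat completions shape, existing client code can often be pointed at the in-browser engine with few changes.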
Frequently asked questions
- Is this AI chat really free?
- Yes, completely free with no usage limits and no account required.
- Is my conversation private?
- Yes. The model runs inside your browser tab. No messages are ever sent to a server — there is no backend involved at all.
- Why does the first load take a while?
- The model weights are about 5 GB and are downloaded on first use, then cached. After that, the model loads from cache in a few seconds.
- Which browsers are supported?
- Chrome 113+, Edge 113+, and recent Safari on macOS 14+ all support WebGPU. Firefox does not yet have WebGPU enabled by default.
- Can the model access the internet or my files?
- No. The model is entirely sandboxed inside the browser. It has no network access and cannot read your filesystem.
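A page like this can probe for WebGPU before attempting a multi-gigabyte download: supporting browsers expose the API as `navigator.gpu`. A small sketch (the `supportsWebGPU` helper is hypothetical; in a real page you would pass the global `navigator`):

```typescript
// Hypothetical helper: WebGPU support is signalled by the presence of
// navigator.gpu, so checking that property is the usual capability test.
function supportsWebGPU(nav: { gpu?: unknown }): boolean {
  return nav.gpu !== undefined;
}

// In a browser you would call supportsWebGPU(navigator); here we
// simulate both outcomes with plain objects.
console.log(supportsWebGPU({ gpu: {} })); // true: proceed with the download
console.log(supportsWebGPU({}));          // false: show an unsupported-browser notice
```

Checking up front means unsupported browsers see a clear message instead of a failed model load.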
More free tools
More on didof.dev
Pro carousels for Instagram and LinkedIn. 16 templates, 12 themes, on-device AI.
Zero-knowledge QR codes for emergencies, property, vCards, and offline image sharing.
Generate click-to-chat links and QR codes for your business. No tracking.
Remove any image background instantly. AI runs on your device — no uploads.