Ollama, the gift that keeps on giving.

For the past few weeks, I’ve been developing a web component that wraps Ollama’s features, making it easy to integrate into web applications. Although the feature set isn’t complete yet—I’d like to add Retrieval-augmented generation (RAG) functionality—the core implementation with Ollama is working smoothly.

In this post, I’d like to highlight a few insights and discoveries I’ve made along the way.

Before diving in, let’s clarify the goal: using AI to bring real-time analysis to data interaction. Simply put, the objective is to help make sense of what we’re seeing.

In my current role, we have clients who routinely load datasets of up to 40,000 rows to gain operational insights. Managing this volume effectively requires better tooling, as manually interpreting it is impractical. We need assistance in querying and reasoning to understand the patterns within these records. Since this data is sensitive, keeping everything local is essential.

Using Ollama makes that task easy.

Working with models

In an ideal world, AI models would be seamlessly supported as browser features. Until that day, however, I prefer running models on my local machine to keep sensitive data secure. Ollama enables local installation and execution of AI models, with strong hardware support and a promising roadmap.

Ollama provides “out-of-the-box” compatibility with a variety of models. However, a common challenge is that models must be run via the command line, which lacks user-friendliness and discoverability. My goal with this web component is to make model access easier, especially for less tech-savvy users. I envision a web interface where users can view available models, install them, see what’s currently installed, and track version details. Thankfully, Ollama offers a REST API, which makes it much easier to experiment and interface with models once they’re up and running locally.
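For illustration, here is a minimal sketch of the two calls the component leans on for model management: listing what’s installed and pulling a new model. It assumes Ollama is running on its default port (11434); the helper names are my own.

const OLLAMA = "http://localhost:11434";

// List the models that are already installed locally.
async function installedModels() {
    const response = await fetch(`${OLLAMA}/api/tags`);
    const { models } = await response.json();
    return models.map((model) => model.name);
}

// Pull (install) a model by name; stream: false waits for completion.
async function installModel(name) {
    await fetch(`${OLLAMA}/api/pull`, {
        method: "POST",
        body: JSON.stringify({ model: name, stream: false }),
    });
}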

The main drawback is that Ollama doesn’t provide a REST endpoint for listing the catalogue of models available to download; the local API only reports what’s already installed. To work around this, I wrote a script that scrapes their model documentation to gather the necessary information, which I then store as a JSON file within the component. This approach requires manually updating the file whenever a new model is released. It’s not ideal, and a REST endpoint for this data would be much more convenient.
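The scraper itself is nothing clever. Here is a rough sketch of the idea, assuming the library page still links each model under /library/<name>; the parsing is deliberately naive and will need adjusting whenever the page changes.

// Run with Node 18+ (built-in fetch); writes the model names to models.json.
import { writeFile } from "node:fs/promises";

const page = await fetch("https://ollama.com/library").then((res) => res.text());

// Naive extraction: collect the /library/<name> links from the HTML.
const names = [...page.matchAll(/href="\/library\/([a-z0-9._-]+)"/g)]
    .map((match) => match[1]);

await writeFile("models.json", JSON.stringify([...new Set(names)], null, 2));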

Selecting the model

After installing a model, the next step is to choose the model and set the communication mode. There are two options: “generate” and “chat.” The “generate” mode provides a straightforward single-question, single-response interaction. In “chat” mode, however, the conversation history is cached and included in each call to the model. While this enables contextual conversations, the growing history may eventually exceed the model’s context window (its maximum token limit).
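The difference is easiest to see in the request payloads. A minimal sketch of both calls, non-streaming for brevity and using whichever model you happen to have installed:

// "generate": one prompt in, one answer out, no memory between calls.
const generated = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    body: JSON.stringify({
        model: "llama3.2",
        prompt: "Summarize the anomalies in this dataset.",
        stream: false,
    }),
}).then((res) => res.json());
console.log(generated.response);

// "chat": the accumulated message history travels with every call.
const chat = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    body: JSON.stringify({
        model: "llama3.2",
        messages: [
            { role: "user", content: "Which rows look unusual?" },
            { role: "assistant", content: "Rows 12 and 90 stand out." },
            { role: "user", content: "Why row 90?" },
        ],
        stream: false,
    }),
}).then((res) => res.json());
console.log(chat.message.content);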

The component currently supports three types of models: one for chat, one for single-response generation, and another for generating embeddings.
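The embedding call is just as small. A sketch against the /api/embeddings endpoint; the model name is only an example of an embedding model you might have installed.

// Turn a piece of text into a vector using a locally installed embedding model.
const { embedding } = await fetch("http://localhost:11434/api/embeddings", {
    method: "POST",
    body: JSON.stringify({
        model: "nomic-embed-text",
        prompt: "Shipment delayed at customs for three days.",
    }),
}).then((res) => res.json());

console.log(embedding.length); // the dimensionality of the vector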

Rendering responses

When receiving streams from Ollama, it’s important to render the text in a readable format. Different models may produce varied output styles, but I’ve found that treating the output as Markdown generally ensures successful rendering into semantic HTML. This approach provides a consistent and readable presentation of the results.

Ollama streams responses one token at a time, but ideally, I want to wait for each line to complete before converting it from Markdown to HTML. For security, sanitizing the HTML is essential, so I used Rust libraries to handle both the Markdown conversion and HTML sanitization. I then compiled this functionality to WebAssembly (WASM) to generate secure, formatted results efficiently.
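Stripped of the component plumbing, the read loop looks roughly like this. It is a simplified sketch: #flushRow is a hypothetical helper wrapping the snippet further down, and for brevity I assume each chunk arrives as whole lines of JSON.

// Read Ollama's newline-delimited JSON stream and gather tokens per output line.
const response = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    body: JSON.stringify({ model: "llama3.2", prompt, stream: true }),
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

for (;;) {
    const { done, value } = await reader.read();
    if (done) break;

    // Each chunk holds one or more JSON objects, one per line.
    for (const line of decoder.decode(value).split("\n")) {
        if (!line.trim()) continue;
        const { response: token } = JSON.parse(line);
        this.#row.push(token);

        // A line break in the generated text means the Markdown line is complete.
        if (token.includes("\n")) {
            await this.#flushRow();
        }
    }
}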

I now needed to append the new HTML to the existing content.

// Convert the Markdown gathered for the current line into sanitized HTML
// using the Rust/WASM module.
const html = await MarkdownModule.to_html({
    markdown: this.#row.join(""),
});

// Parse the fragment via a <template> and append it to the shadow DOM.
const template = document.createElement("template");
template.innerHTML = html;
this.shadowRoot.appendChild(template.content.cloneNode(true));

Vector database

To support embeddings, a vector database is essential. Some lightweight options exist for WASM, but for reliability, it’s crucial to choose one that is actively maintained. SurrealDB is one option: it can run locally or on a server and supports WASM with either in-memory or IndexedDB storage. However, at around 9 MB, it can be a hefty addition. Whether this size is manageable depends on the application type and hosting environment; a single networked instance may work better in some cases. SurrealDB’s plans for SurrealML, which brings data and models closer together, add to its appeal, and I’ve found it to be a flexible and user-friendly database.
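To give a flavour of what that looks like, here is a rough sketch of storing and querying embeddings through SurrealDB’s JavaScript SDK with its WASM engine. It assumes the surrealdb and @surrealdb/wasm packages, and that embedding and queryEmbedding come from the Ollama embeddings call shown earlier; the API shape varies between SDK versions.

import { Surreal } from "surrealdb";
import { surrealdbWasmEngines } from "@surrealdb/wasm";

// Run SurrealDB inside the browser, persisting to IndexedDB ("mem://" for in-memory).
const db = new Surreal({ engines: surrealdbWasmEngines() });
await db.connect("indxdb://ollama-demo");
await db.use({ namespace: "app", database: "embeddings" });

// Store a document alongside the embedding produced by Ollama.
await db.create("document", { text: "Shipment delayed at customs.", vector: embedding });

// Rank documents by cosine similarity against a query embedding.
const [matches] = await db.query(
    `SELECT text, vector::similarity::cosine(vector, $query) AS score
     FROM document ORDER BY score DESC LIMIT 5`,
    { query: queryEmbedding },
);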

A dedicated vector database API in the browser would be ideal. IndexedDB is an option, but it lacks search capabilities optimized for vector data, and, let’s face it, working with IndexedDB directly isn’t the smoothest experience.

Next steps

The next steps for the component are:

  1. Enable image processing to work with multimodal models.
  2. Enable a RAG pipeline so that you can embed documents as part of the conversation.
  3. Enable better data visualization in the component.

The better UI

A longer-term goal is to enable a voice UI through natural language, allowing users to talk to the application and receive spoken responses in kind. This approach can also enhance accessibility by reducing reliance on screen readers; users can simply interact with the application as they would with a person, all while running locally and in their own dialect. While services like ElevenLabs provide some of this capability, they rely on external servers, which reintroduces privacy concerns.

Ollama already simplifies integrating large language models into applications, but for a truly future-proof solution, we need to run models directly as browser features. This includes, but is not limited to, speech-to-text, text-to-speech, and the language models (LLMs) themselves. Libraries like Hugging Face’s transformers.js are paving the way, though they currently lack end-user discoverability and personal model customization by default.

Intent driven

The process API defines intent using JSON, and the execution pipeline ensures that intent is carried out as specified. This setup allows large language models (LLMs) to easily interact with the process API by generating the appropriate JSON. With the rise of agents, however, I need to consider how we can better leverage them. Right now, it feels like everyone is working in their own silo, but what we really need is a standards body to govern agent behavior and ensure controlled, secure interactions.
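To make that concrete, here is a hypothetical example of the kind of intent payload an LLM could be asked to produce; the shape is purely illustrative and not the actual process API schema.

// A hypothetical intent document; the real process API schema differs.
const intent = {
    intent: "filter_and_summarize",
    steps: [
        { action: "filter", field: "status", operator: "equals", value: "delayed" },
        { action: "group", field: "region" },
        { action: "summarize", measure: "count" },
    ],
};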

Summary

The world is evolving rapidly, and at first, AI felt distant—something out there that we didn’t really acknowledge, even though we’ve had it on our devices for years. Now, it feels like a new energy has been injected into the space, unlocking significant potential for client-side applications. I believe that, in time, AI agency will become a standard feature in browsers, enabling web applications to help us better understand our data. Until then, thanks to initiatives like Ollama and its REST API, adding AI-driven agency to your web applications has become much easier.