
Offline Smart Action Detection

Introduction

The future of smart, AI-powered user interfaces doesn’t need to rely on the cloud. This blog post dives into a powerful, yet lightweight implementation of offline smart action detection, using JavaScript, WebAssembly, and ONNX—all right in the browser.

We’ll unpack a complete working code sample, explore the AI model behind the scenes, discuss performance and memory usage, and understand what makes this all possible.

What This Demo Does

The page allows you to type a command like:

“save the current changes”

It then tries to detect the intent (e.g., save) and automatically executes the corresponding action (like popping up an alert: ✅ SAVE executed).

All of this happens offline in your browser using a pre-trained language model loaded via @xenova/transformers.

The Model: Xenova/bge-m3

What is it?

BGE-M3 is a sophisticated text embedding model developed by the Beijing Academy of Artificial Intelligence (BAAI), known for its versatility across multiple dimensions:

  • Multi-Functionality: BGE-M3 uniquely integrates three retrieval methods—dense retrieval, multi-vector retrieval, and sparse retrieval—within a single framework. This allows for flexible and efficient information retrieval tailored to various application needs.
  • Multi-Linguality: The model supports over 100 languages, facilitating robust multilingual and cross-lingual retrieval tasks. This extensive language coverage makes it suitable for global applications requiring comprehensive language understanding.
  • Multi-Granularity: Capable of processing inputs ranging from short sentences to long documents up to 8,192 tokens, BGE-M3 is adept at handling texts of varying lengths, enhancing its applicability across diverse textual analyses.

Model Family and Architecture:

BGE-M3 is part of the BGE (BAAI General Embedding) family of models, which are designed to provide versatile and high-performance text embeddings. It is based on the XLM-RoBERTa architecture, a robust multilingual variant of the RoBERTa model, enabling it to effectively handle multiple languages and complex retrieval tasks.

Advantages:

  • Versatility: The integration of multiple retrieval methods within a single model offers flexibility for various information retrieval scenarios.
  • Extensive Language Support: With support for over 100 languages, BGE-M3 is suitable for applications requiring multilingual capabilities.
  • Handling Long Inputs: The ability to process long documents up to 8,192 tokens makes it ideal for tasks involving extensive textual data.

Considerations:

  • Computational Resources: Processing long documents may require substantial computational resources, which could be a consideration for deployment in resource-constrained environments.
  • Performance Across Languages: While BGE-M3 supports numerous languages, performance may vary across different language families and linguistic features, necessitating evaluation for specific use cases.

In summary, BGE-M3 stands out for its comprehensive approach to text embedding, offering a blend of functionality, language support, and input flexibility. Its design makes it a compelling choice for a wide range of natural language processing tasks, particularly those requiring nuanced retrieval capabilities across multiple languages and document lengths.

Why Embeddings?

Embeddings are numerical representations of text that capture the underlying meaning of words, phrases, or entire sentences. Unlike traditional keyword matching, embeddings enable models to understand semantic relationships—the subtle nuances of meaning between different expressions. For example, if a user writes “store this info,” and the model was trained on similar phrases like “save the data,” the embedding vectors for both phrases will be very close in the multidimensional space where they’re represented. This proximity reflects their shared intent, even though the exact wording is different. In our system, we leverage this property to detect user intent: rather than relying on exact keywords, we compare embedding vectors to find the closest match in meaning. This allows the model to generalize beyond specific phrases and respond intelligently to varied natural language input. Whether the user says “keep a copy,” “archive it,” or “log this entry,” the embeddings help us recognize that they all point to the same core action—saving data.
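
Here is a minimal sketch of that idea in a browser module script, using the same feature-extraction pipeline the demo loads later (the two phrases are only illustrative):

import { pipeline } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers';

// Load the same embedding pipeline the demo uses.
const embed = await pipeline('feature-extraction', 'Xenova/bge-m3');

// Embed two differently worded phrases with the same intent.
// pooling: 'mean' and normalize: true return one unit-length sentence vector each.
const a = await embed('store this info', { pooling: 'mean', normalize: true });
const b = await embed('save the data', { pooling: 'mean', normalize: true });

// With unit-length vectors, the dot product equals the cosine similarity.
let similarity = 0;
for (let i = 0; i < a.data.length; i++) similarity += a.data[i] * b.data[i];

console.log(similarity.toFixed(4)); // close to 1.0 for phrases that mean the same thing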

Size and Performance

  • Model Size: ~100MB (quantized ONNX format)
  • Inference Time: <500ms on modern browsers
  • Memory Use: ~150MB peak usage
  • Format: Runs in ONNX format via WebAssembly (no GPU needed)

It’s a tradeoff: high accuracy with minimal setup, yet small enough for real-time inference in-browser.
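
These figures vary by device and browser. If you want to check them on your own hardware, here is a quick sketch using performance.now() and the same CDN import as the demo:

import { pipeline } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers';

// Time the one-off model load (download plus WASM initialization).
const t0 = performance.now();
const extractor = await pipeline('feature-extraction', 'Xenova/bge-m3');
console.log(`Model load: ${Math.round(performance.now() - t0)} ms`);

// Time a single embedding pass on a short command.
const t1 = performance.now();
await extractor('save the current changes', { pooling: 'mean', normalize: true });
console.log(`Single inference: ${Math.round(performance.now() - t1)} ms`);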

What is @xenova/transformers?

With the help of the @xenova/transformers library (a high-performance JavaScript and WebAssembly implementation of Hugging Face's Transformers), developers can run powerful natural language models like Xenova/bge-m3 directly in the browser or at the edge, with no backend server and no internet connection once the model has been downloaded. The library runs pre-trained models such as BERT, DistilBERT, and BGE in a web-friendly format using ONNX and WebAssembly, enabling fast, memory-efficient inference entirely within the client environment. As a result, applications can deliver advanced language understanding capabilities (semantic search, content classification, intent detection, contextual recommendations) entirely offline, with zero data leaving the user's device. Combined, @xenova/transformers and Xenova/bge-m3 let developers build rich, intelligent, privacy-preserving user experiences right in the browser.

Benefits

  • Offline-first: No API calls, no latency, no data privacy risks.
  • WebAssembly-accelerated: Good performance even without GPU.
  • HuggingFace model support: Load from Hugging Face without server setup.

Why ONNX?

ONNX (Open Neural Network Exchange) is an open, interoperable format designed to represent machine learning models in a standardized way. Developed by Microsoft and Facebook, ONNX allows models trained in various frameworks—such as PyTorch, TensorFlow, or scikit-learn—to be exported and run across different hardware and software environments without requiring framework-specific code. This universal compatibility makes it ideal for deploying machine learning models in diverse scenarios, from cloud services to mobile and edge devices.

In the browser, ONNX models are typically executed using WebAssembly (WASM), a low-level binary format that enables high-performance computation in a secure and portable manner. WebAssembly is supported by all major browsers and runs at near-native speed, making it an excellent runtime for AI inference without needing native plugins or server-side computation. When combined, ONNX and WebAssembly allow powerful machine learning models to run entirely in-browser—fast, offline, and private—opening up use cases like real-time language understanding, image recognition, and recommendation systems without ever sending data to the cloud.
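
Because everything that follows depends on WebAssembly being available, a quick capability check before loading the model gives a friendlier failure on very old browsers (a sketch, not part of the demo code below):

// Bail out early if the browser cannot run WebAssembly at all.
if (typeof WebAssembly !== 'object') {
  document.body.textContent = '⚠️ This browser does not support WebAssembly, so the offline model cannot run.';
  throw new Error('WebAssembly not supported');
}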

Code

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <title>Smart Action Detection (Offline)</title>
  <style>
    body { font-family: sans-serif; padding: 2em; }
    input[type="text"] { width: 300px; padding: 0.5em; }
    button { padding: 0.5em 1em; margin-left: 0.5em; }
    pre {
      margin-top: 1em;
      background: #f4f4f4;
      padding: 1em;
      border-radius: 5px;
      white-space: pre-wrap;
      max-height: 200px;
      overflow-y: auto;
    }
  </style>
</head>
<body>
  <h1>Type a command:</h1>
  <input type="text" id="userInput" placeholder="e.g. save the changes" />
  <button id="executeBtn">Execute</button>
  <pre id="log">🔄 Loading model...</pre>

  <script type="module">
    import { pipeline } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers';

    class SmartActionDetector {
      #extractor = null;
      #actionEmbeddings = new Map();
      #logElem = null;
      #logBuffer = [];

      constructor(logElem) {
        this.#logElem = logElem;
        this.actions = {
          save: 'store the current data safely by saving it',
          create: 'create or make a new item or entry',
          update: 'update or change something that already exists',
          fetch: 'get or retrieve information that was saved before',
        };
      }

      async initialize() {
        this.#setLog('🔄 Loading model...');
        this.#extractor = await pipeline('feature-extraction', 'Xenova/bge-m3');
        await this.#cacheActionEmbeddings();
        this.#setLog('✅ Model loaded. Enter a command.');
      }

      async detect(inputText) {
        if (!inputText.trim()) return alert('Please type something.');

        this.#clearLog();
        this.#log(`🧠 Embedding: "${inputText}"`);

        const inputVec = await this.#getMeanEmbedding(inputText);
        const results = [];

        for (const [action, cachedVec] of this.#actionEmbeddings.entries()) {
          const score = this.#cosineSimilarity(inputVec, cachedVec);
          results.push({ action, score });
          this.#log(`${action.toUpperCase()} = ${score.toFixed(4)}`);
        }

        results.sort((a, b) => b.score - a.score);
        this.#flushLog();

        const [best, second] = results;

        if (best.score - second.score > 0.005 && typeof window[best.action] === 'function') {
          window[best.action]();
        } else {
          alert('⚠️ Not confident enough. Try rephrasing.');
        }
      }

      async #cacheActionEmbeddings() {
        for (const [action, desc] of Object.entries(this.actions)) {
          const vec = await this.#getMeanEmbedding(desc);
          this.#actionEmbeddings.set(action, vec);
        }
      }

      async #getMeanEmbedding(text) {
        const { data, dims } = await this.#extractor(text);
        const [_, tokens, dim] = dims;
        const mean = new Float32Array(dim);

        for (let i = 0; i < tokens; i++) {
          for (let j = 0; j < dim; j++) {
            mean[j] += data[i * dim + j];
          }
        }

        return this.#normalize(mean.map(x => x / tokens));
      }

      #normalize(vec) {
        const norm = Math.hypot(...vec) || 1;
        return vec.map(x => x / norm);
      }

      #cosineSimilarity(a, b) {
        let sum = 0;
        for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
      }

      #log(message) {
        this.#logBuffer.push(message);
      }

      #flushLog() {
        this.#logElem.textContent = this.#logBuffer.join('\n');
        this.#logBuffer = [];
      }

      #clearLog() {
        this.#logElem.textContent = '';
        this.#logBuffer = [];
      }

      #setLog(message) {
        this.#logElem.textContent = message;
      }
    }

    // Setup
    const logElem = document.getElementById('log');
    const inputElem = document.getElementById('userInput');
    const button = document.getElementById('executeBtn');

    const detector = new SmartActionDetector(logElem);
    await detector.initialize();

    button.addEventListener('click', () => {
      detector.detect(inputElem.value);
    });

    // Action handlers
    window.save = () => alert('✅ SAVE executed');
    window.create = () => alert('✅ CREATE executed');
    window.update = () => alert('✅ UPDATE executed');
    window.fetch = () => alert('✅ FETCH executed');
  </script>
</body>
</html>

Full Breakdown of the JavaScript Code

This app performs offline smart action detection in the browser using the @xenova/transformers library and the Xenova/bge-m3 model. It maps free-text input (like “store this info”) to predefined actions like “save”, “create”, or “update” using embeddings.

1. Importing the Transformers Library

import { pipeline } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers';

This line imports the pipeline function from the @xenova/transformers library (served via CDN).

  • pipeline is used to instantiate specific tasks like text classification or feature extraction using pre-trained models.
  • This library runs entirely in-browser using ONNX + WebAssembly.
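
For reference, pipeline() also accepts an options object; here is a hedged sketch using options documented by the library (the demo itself sticks to the defaults):

// Optional third argument: request the quantized weights (the v2 default)
// and report download/initialization progress, e.g. to drive a loading bar.
const extractor = await pipeline('feature-extraction', 'Xenova/bge-m3', {
  quantized: true,
  progress_callback: (progress) => console.log(progress),
});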

2. The SmartActionDetector Class

Encapsulates all the logic for:

  • Loading the model
  • Computing embeddings
  • Comparing user input to reference embeddings
  • Executing matched actions

Private Fields

#extractor = null;
#actionEmbeddings = new Map();
#logElem = null;
#logBuffer = [];

These private fields store:

  • #extractor: the feature extraction pipeline (used to get embeddings)
  • #actionEmbeddings: a map of action name → embedding vector
  • #logElem: the DOM element used for UI logging
  • #logBuffer: stores log lines before flushing them to the UI

Action Definitions

this.actions = {
  save:   'store the current data safely by saving it',
  create: 'create or make a new item or entry',
  update: 'update or change something that already exists',
  fetch:  'get or retrieve information that was saved before',
};
  • These are semantic descriptions of what each action means.
  • Used to compute reference embeddings during initialization.
  • This is what makes the system language-aware and flexible.
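
Because the reference embeddings are computed from these descriptions during initialization, adding a new intent only takes a new description plus a matching handler. A sketch (the archive action is hypothetical, not part of the demo):

// In the constructor: describe the new action in plain language.
this.actions = {
  save:    'store the current data safely by saving it',
  create:  'create or make a new item or entry',
  update:  'update or change something that already exists',
  fetch:   'get or retrieve information that was saved before',
  archive: 'archive or put away an item for long-term storage', // new entry
};

// Outside the class: register the handler detect() will call.
window.archive = () => alert('✅ ARCHIVE executed');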

initialize()

async initialize() {
  this.#setLog('🔄 Loading model...');
  this.#extractor = await pipeline('feature-extraction', 'Xenova/bge-m3');
  await this.#cacheActionEmbeddings();
  this.#setLog('✅ Model loaded. Enter a command.');
}
  1. Sets UI to “Loading model…”
  2. Loads the Xenova/bge-m3 model using the 'feature-extraction' pipeline
  3. Caches embeddings for all defined actions
  4. Updates the UI to let the user know it’s ready

Note: This model runs in-browser and doesn’t send any data to a remote server.

detect(inputText)

  • Called when the user submits input
  • Performs the following:
    1. Skips empty input
    2. Logs the input
    3. Computes an embedding for it
    4. Compares it with cached action embeddings using cosine similarity
    5. Logs similarity scores
    6. Chooses the best match (if confidence threshold is met)
    7. Calls the corresponding action handler (e.g. window.save())
if (best.score - second.score > 0.005 && typeof window[best.action] === 'function') {
  window[best.action]();
} else {
  alert('⚠️ Not confident enough. Try rephrasing.');
}

The 0.005 margin requires the best match to beat the runner-up by a clear gap, which helps avoid acting on ambiguous input.
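
If the relative margin alone proves too permissive, one option is to also require a minimum absolute similarity; a sketch with hypothetical thresholds, not part of the demo:

// Stricter variant: require a minimum absolute score AND a clear gap.
const MIN_SCORE = 0.5;     // tune against your own test phrases
const MIN_MARGIN = 0.005;

if (best.score > MIN_SCORE &&
    best.score - second.score > MIN_MARGIN &&
    typeof window[best.action] === 'function') {
  window[best.action]();
} else {
  alert('⚠️ Not confident enough. Try rephrasing.');
}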

#cacheActionEmbeddings()

For each action in this.actions, it:

  • Converts the action’s description to an embedding
  • Caches it in #actionEmbeddings

This way, we only compute action embeddings once.

#getMeanEmbedding(text)

  • Uses the model to extract a multi-dimensional tensor (token embeddings)
  • Averages across all token embeddings to get a sentence-level embedding
  • Normalizes it to unit length (to prepare for cosine similarity)
const mean = new Float32Array(dim);
...
return this.#normalize(mean.map(x => x / tokens));

This averaging technique makes the result robust and comparable.
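
For reference, the library's feature-extraction pipeline can do this pooling and normalization itself; a drop-in sketch of that variant, functionally equivalent to the manual loops above:

// Let @xenova/transformers average the token vectors and scale to unit length.
async #getMeanEmbedding(text) {
  const output = await this.#extractor(text, { pooling: 'mean', normalize: true });
  return output.data; // Float32Array, ready for the dot-product comparison
}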

#cosineSimilarity(a, b)

Cosine similarity is a measure of how similar two vectors are, based on the angle between them—not their magnitude.

cos(θ) = (A · B) / (‖A‖ ‖B‖)

Because our embeddings are normalized to unit length (‖A‖ = ‖B‖ = 1), this reduces to the plain dot product A · B, which is exactly what #cosineSimilarity computes.

1.0 → Vectors point in the same direction (perfect similarity)
0.0 → Vectors are orthogonal (no similarity)
-1.0 → Vectors point in opposite directions (completely dissimilar)
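
A tiny worked example with two already-normalized 2-D vectors:

const a = [0.6, 0.8];
const b = [0.8, 0.6];

let sum = 0;
for (let i = 0; i < a.length; i++) sum += a[i] * b[i];

console.log(sum); // 0.96 → the vectors point in nearly the same direction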

Logging Utilities

#log(), #flushLog(), #clearLog(), #setLog()

These helper methods:

  • Accumulate messages (#log)
  • Write messages to the UI (#flushLog)
  • Clear logs or set a specific message (#clearLog, #setLog)

App Setup

const detector = new SmartActionDetector(logElem);
await detector.initialize();
  • Instantiates and initializes the smart action detector
  • Once initialized, the model is ready to accept user input
button.addEventListener('click', () => {
  detector.detect(inputElem.value);
});
  • Binds the Execute button to trigger intent detection on user input

Action Handlers

window.save = () => alert('✅ SAVE executed');
window.create = () => alert('✅ CREATE executed');
window.update = () => alert('✅ UPDATE executed');
window.fetch = () => alert('✅ FETCH executed');
  • These are the concrete behaviors triggered when a command is detected.
  • You can swap these with real app logic (e.g. calling APIs or updating UI).
  • Note that assigning window.fetch shadows the browser's built-in Fetch API; the demo gets away with it because the model is already loaded, but a real app should prefer a dedicated dispatch map over global functions.

Performance Optimizations

The app is designed with performance in mind, especially important for running AI models entirely within the browser. Here’s how it achieves efficiency without compromising accuracy:

Pre-caching embeddings

Instead of recalculating embeddings for each action every time the user submits input, the app pre-computes the embeddings for all defined actions during initialization and stores them in memory. This ensures that at runtime:

  • Only the input sentence embedding needs to be computed.
  • The comparison step is extremely fast, using cached vectors.
  • Response times are significantly reduced, especially for repeated use.

This approach trades a small amount of memory for much faster execution.

Using mean pooling

The output of the model includes embeddings for each token (word or subword). To reduce this down to a single vector that represents the entire sentence, the app uses mean pooling—i.e., averaging all token vectors.

Benefits of mean pooling:

  • Effectiveness: In many NLP tasks, mean pooling offers solid performance, especially when combined with high-quality embeddings like those from BGE-M3.
  • Simplicity: No need for complex attention mechanisms or extra model layers.
  • Speed: Aggregating by average is computationally cheap.

Float32 and manual normalization

Instead of relying on third-party math or vector libraries, the code uses:

  • Float32Array for memory-efficient numerical data.
  • Math.hypot() for calculating vector norms (used in normalization).
  • A manual normalization step to scale each embedding to unit length.

These low-level operations are both fast and lightweight, which matters when you’re running in resource-constrained environments like older mobile devices or low-power tablets.

WebAssembly backend

The model is executed entirely in-browser using the ONNX runtime with WebAssembly (WASM) as the backend:

  • WebAssembly provides a secure, sandboxed environment that runs at near-native speed.
  • It doesn’t require GPU acceleration, making it ideal for CPUs and edge devices.
  • Inference runs locally, so no data ever leaves the device—ensuring privacy and offline support.

This backend is the secret to running complex transformer models without needing server infrastructure or a client-side ML framework like TensorFlow.js.
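
The library also exposes an env object for tuning this backend. A hedged sketch (the local model directory is a hypothetical path you would host yourself; the demo simply uses the defaults):

import { env, pipeline } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers';

// Limit WASM worker threads on low-power devices.
env.backends.onnx.wasm.numThreads = 1;

// For a fully air-gapped deployment, serve the ONNX files from your own origin.
env.allowRemoteModels = false;
env.localModelPath = './models/'; // hypothetical path containing Xenova/bge-m3

const extractor = await pipeline('feature-extraction', 'Xenova/bge-m3');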


Pros and Cons

✅ Pros

  • Fully offline – works in air-gapped environments
  • Data never leaves the browser
  • Fast enough for real-time use
  • Modular – easy to expand with more actions
  • No server or backend needed

❌ Cons

  • Initial model download is ~100MB on first load, even in quantized form
  • Limited to predefined actions unless retrained or extended
  • Quality degrades with complex or long inputs
  • Memory spike during model init (~150MB)

Possible Improvements

  • Compress the model further with more aggressive quantization
  • Fine-tune embeddings to your specific domain
  • Use localStorage or IndexedDB for persistent model cache
  • Add fuzzy fallback suggestions if confidence is low (sketched below)
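
For instance, the fallback alert in detect() could surface the closest candidates instead of a bare warning; a minimal sketch meant as a drop-in for the else branch:

// Offer the two closest actions and their scores as suggestions.
const suggestions = results
  .slice(0, 2)
  .map(r => `${r.action} (${r.score.toFixed(3)})`)
  .join(' or ');
alert(`⚠️ Not confident enough. Did you mean: ${suggestions}?`);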

Try It Yourself

This code is plug-and-play. Just open the HTML file in a modern browser. You don’t need a server or internet connection (after initial model load).

Conclusion

This project shows how powerful modern in-browser ML has become. With @xenova/transformers and ONNX, you’re not just running JavaScript—you’re embedding intelligence, locally and privately.

Whether you’re building smart UIs, command palettes, or offline assistants, this is the pattern to watch.