Experimenting with Local Alt Text Generation in Firefox Nightly

As discussed on Mozilla Connect, Firefox 130 will introduce an experimental new capability to automatically generate alt text for images using a fully private on-device AI model. The feature will be available as part of Firefox’s built-in PDF editor, and our end goal is to make it available in general browsing for users with screen readers.

Why alt text?

Web pages have a fundamentally simple structure, with semantics that allow the browser to interpret the same content differently for different people and preferences. This is a big part of what we think makes the web special, and what enables the browser to act as a user agent, responsible for making the web work for people.

This is particularly useful for assistive technology such as screen readers, which are able to work alongside browser features to reduce obstacles for people to access and exchange information. For static web pages, this can generally be accomplished with very little interaction from the site, and this access has been enormously beneficial to many people.

But even for a simple static page there are certain types of information, like alternative text for images, that must be provided by the author to create an understandable experience for people using assistive technology (as required by the spec). Unfortunately, many authors don’t do this: the Web Almanac reported in 2022 that nearly half of images were missing alt text.

Until recently it has not been feasible for the browser to infer reasonably high-quality alt text for images without sending potentially sensitive data to a remote server. However, the latest developments in AI have enabled this type of image analysis to happen efficiently on-device, even on a CPU.

We are adding a feature within the PDF editor in Firefox Nightly to validate this approach. As we develop it further and learn from the deployment, our goal is to offer it to users who’d like to use it while browsing, to help them better understand images that would otherwise be inaccessible.

Generating alt text with small open source models

We are using Transformer-based machine learning models to describe images. These models are getting good at describing the contents of an image, yet are compact enough to operate on devices with limited resources. While they can’t outperform a large language model like GPT-4 Turbo with Vision or LLaVA, they are sufficiently accurate to provide valuable insights on-device across a diversity of hardware.

Model architectures like BLIP or even ViT that were trained on datasets like COCO (Common Objects in Context) or Flickr30k are good at identifying objects in an image. When combined with a text decoder like OpenAI’s GPT-2, they can produce alternative text with 200M or fewer parameters. Once quantized, these models can be under 200MB on disk, and run in a couple of seconds on a laptop – a big reduction compared to the gigabytes and resources an LLM requires.
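As a back-of-the-envelope sketch of why quantization gets a model of this size under 200MB on disk (using the 182M-parameter figure of the Firefox model described below):

// Rough disk-size arithmetic for a 182M-parameter captioning model.
const params = 182e6;
const fp32MB = (params * 4) / 1e6; // 32-bit floats: ~728 MB on disk
const int8MB = (params * 1) / 1e6; // 8-bit quantized weights: ~182 MB
console.log(`fp32: ~${fp32MB.toFixed(0)} MB, int8: ~${int8MB.toFixed(0)} MB`);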

Example output

The image below was captioned in three different ways:

  • Firefox – Our 182M-parameter model, using a distilled version of GPT-2 alongside a Vision Transformer (ViT) image encoder.
  • Baseline model – A slightly bigger ViT+GPT-2 model.
  • Human text – The description provided by the dataset annotator.

A person is standing in front of a cake with candles.

Both small models lose accuracy compared to the description provided by a person, and the baseline model is confused by the position of the hands. The Firefox model does slightly better in that case, and captures what is important.

What matters can be subjective in any case. Notice how the person did not write about the office setting or the cherries on the cake, and specified that the candles were long.

If we run the same image through a model like GPT-4o, the results are extremely detailed:

The image depicts a group of people gathered around a cake with lit candles. The focus is on the cake, which has a red jelly topping and a couple of cherries. There are several lit candles in the foreground. In the background, there is a woman smiling, wearing a gray turtleneck sweater, and a few other people can be seen, likely in an office or indoor setting. The image conveys a celebratory atmosphere, possibly a birthday or a special occasion.

But such a level of detail in alt text is overwhelming and doesn’t prioritize the most important information. Brevity is not the only goal, but it’s a helpful starting point, and pithy accuracy in a first draft allows content creators to focus their edits on missing context and details.

So if we ask the LLM for a one-sentence description, we get:

A group of people in an office celebrates with a lit birthday cake in the foreground and a smiling woman in the background.

This has more detail than our small model’s output, but it can’t be run locally without sending your image to a server.

Small is beautiful

Running inference locally with small models offers many advantages:

  1. Privacy: All operations are contained within the device, ensuring data privacy. We won’t have access to your images, PDF content, generated captions, or final captions. Your data will not be used to train the model.
  2. Resource efficiency: Small models eliminate the need for high-powered GPUs in the cloud, reducing resource consumption and making the feature more environmentally friendly.
  3. Increased transparency: In-house management of models allows for direct oversight of the training datasets, offering more transparency compared to some large language models (LLMs).
  4. Carbon footprint monitoring: Training models in-house facilitates precise tracking of CO2 emissions using tools such as CodeCarbon.
  5. Ease of improvement: Since retraining can be completed in less than a day on a single piece of hardware, it allows for frequent updates and enhancements of the model.

Integrating local inference into Firefox

Extending the Translations inference architecture

Firefox Translations uses the Bergamot project, powered by the Marian C++ inference runtime. The runtime is compiled into WASM, and there’s a model file for each translation task.

For example, if you run Firefox in French and visit an English page, Firefox will ask if you want to translate it to French and will download the English-to-French model (~20MiB) alongside the WASM runtime. This is a one-shot download: translations will happen completely offline once those files are on disk.

The WASM runtime and models are both stored in the Firefox Remote Settings service, which allows us to distribute them at scale and manage versions.
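For illustration, fetching a record and its attachment from Remote Settings in privileged Firefox code looks roughly like the sketch below; the collection and record names are hypothetical, not the ones Firefox actually uses.

// Privileged Firefox chrome code (not web content); a sketch only.
const { RemoteSettings } = ChromeUtils.importESModule(
  "resource://services-settings/remote-settings.sys.mjs"
);

const client = RemoteSettings("ml-model-files"); // hypothetical collection name
const records = await client.get();
const record = records.find((r) => r.name === "vit-gpt2.onnx"); // hypothetical record
const { buffer } = await client.attachments.download(record); // bytes of the model file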

The inference task runs in a separate process, which prevents the browser or one of its tabs from crashing if the inference runtime crashes.

ONNX and Transformers.js

We’ve decided to embed the ONNX runtime in Firefox Nightly along with the Transformers.js library, to extend the translation architecture to perform different inference work.

Like Bergamot, the ONNX runtime has a WASM distribution and can run directly in the browser. The ONNX project has recently introduced WebGPU support, which will eventually be activated in Firefox Nightly for this feature.

Transformers.js provides a JavaScript layer on top of the ONNX inference runtime, making it easy to add inference for a huge list of model architectures. The API mimics the very popular Python library. It does all the tedious work of preparing the data that is passed to the runtime and converting the output back into a usable result. It also deals with downloading models from Hugging Face and caching them.

From the project’s documentation, this is how you can run a sentiment analysis model on a text:

import { pipeline } from '@xenova/transformers';

// Allocate a pipeline for sentiment-analysis
let pipe = await pipeline('sentiment-analysis');
let out = await pipe('I love transformers!');

// [{'label': 'POSITIVE', 'score': 0.999817686}]

Using Transformers.js gives us confidence when trying out a new model with ONNX. If it’s listed in the Transformers.js documentation, that’s a good indication it will work for us.
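For the alt text use case, the same pipeline API can drive an image-to-text model. Here is a minimal sketch using the publicly available ONNX weights discussed below (not the exact model Firefox ships), with a placeholder image URL:

import { pipeline } from '@xenova/transformers';

// Allocate an image-to-text pipeline; quantized weights keep the download small.
const captioner = await pipeline(
  'image-to-text',
  'Xenova/vit-gpt2-image-captioning',
  { quantized: true }
);

const [result] = await captioner('https://example.org/cake.jpg'); // placeholder URL
console.log(result.generated_text);
// e.g. "a person standing in front of a cake with candles"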

To vendor Transformers.js into Firefox Nightly, we’ve slightly changed its release to distribute ONNX separately from Transformers.js, dropped the Node.js-specific pieces, and fixed the few eval() calls the ONNX library ships with. You can find the build script here, which was used to populate that vendor directory.
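A vendored build also needs the WASM backend pointed at local files rather than a CDN; Transformers.js exposes environment settings for this. A short sketch, where the resource path is hypothetical:

import { env } from '@xenova/transformers';

// Serve the ONNX runtime's .wasm binaries from vendored files instead of a CDN.
env.backends.onnx.wasm.wasmPaths = 'resource://gre/modules/ml/'; // hypothetical path
env.backends.onnx.wasm.numThreads = 1; // stay single-threaded if SharedArrayBuffer is unavailable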

From there, we reuse the translations architecture to run the runtime inside its own process, and have Transformers.js work with a custom model cache system.

Model caching

The Transformers.js project can use local and remote models and has a caching mechanism using the browser cache. Since we are running inference in an isolated web worker, we don’t want to provide access to the file system or store models inside the browser cache. We also don’t want to use Hugging Face as the model hub in Firefox, and want to serve model files from our own servers.

Since Transformers.js provides a callback for a custom cache, we have implemented a specific model caching layer that downloads files from our own servers and caches them in IndexedDB.
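A minimal sketch of what such a cache can look like, assuming the match/put interface Transformers.js expects from env.customCache; the IndexedDB plumbing and the server URL are illustrative, not Firefox’s actual implementation:

import { env } from '@xenova/transformers';

// Tiny IndexedDB helper (illustrative; no error handling or schema versioning).
function withStore(mode, fn) {
  return new Promise((resolve, reject) => {
    const open = indexedDB.open('model-cache', 1);
    open.onupgradeneeded = () => open.result.createObjectStore('files');
    open.onsuccess = () => {
      const store = open.result.transaction('files', mode).objectStore('files');
      const req = fn(store);
      req.onsuccess = () => resolve(req.result);
      req.onerror = () => reject(req.error);
    };
    open.onerror = () => reject(open.error);
  });
}

// Route downloads to our own servers and cache the files in IndexedDB.
env.remoteHost = 'https://model-hub.example.org/'; // hypothetical server
env.useCustomCache = true;
env.customCache = {
  async match(key) {
    const blob = await withStore('readonly', (store) => store.get(key));
    return blob && new Response(blob); // undefined signals a cache miss
  },
  async put(key, response) {
    const blob = await response.blob();
    await withStore('readwrite', (store) => store.put(blob, key));
  },
};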

As the project grows, we anticipate the browser will store more models, which can take up significant space on disk. We plan to add an interface in Firefox to manage downloaded models so our users can list them and remove some if needed.

Fine-tuning a ViT + GPT-2 model

Ankur Kumar released a popular model on Hugging Face to generate alt text for images and blogged about it. This model was also published as ONNX weights by Joshua Lochner so it could be used in Transformers.js; see https://huggingface.co/Xenova/vit-gpt2-image-captioning

The model does a good job – even if in some cases we had better results with https://huggingface.co/microsoft/git-base-coco, that architecture is not yet supported by the ONNX converters – and with less than 200M params, most of the accuracy is obtained by focusing on good training data. So we have picked ViT for our first model.

Ankur used the google/vit-base-patch16-224-in21k image encoder and the GPT-2 text decoder and fine-tuned them using the COCO dataset, a dataset of over 120k labeled images.

In order to reduce the model size and speed it up a little bit, we’ve decided to replace GPT-2 with DistilGPT-2 – which is 2 times faster and 33% smaller according to its documentation.

Using that model in Transformers.js gave good results (see the training code at GitHub – mozilla/distilvit: image-to-text model for PDF.js).

We further improved the model for our use case with an updated training dataset and some supervised learning to simplify the output and mitigate some of the biases common in image-to-text models.

Alt text generation in PDF.js

Firefox is able to add an image in a PDF using our popular open source PDF.js library:

A screenshot of the pdf.js alt text modal window

Starting in Firefox 130, we will automatically generate an alt text and let the user validate it. Every time an image is added, we get an array of pixels that we pass to the ML engine, and a few seconds later we get a string corresponding to a description of this image (see the code).
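To make that flow concrete, here is a rough sketch of the pixels-in, caption-out step using Transformers.js’s RawImage; the canvas wiring and the model name are illustrative, not the actual PDF.js integration:

import { pipeline, RawImage } from '@xenova/transformers';

// A sketch of the flow: RGBA pixels in, description string out.
const captioner = await pipeline('image-to-text', 'Xenova/vit-gpt2-image-captioning');

// In PDF.js the pixels come from the added image; here we grab them from a canvas.
const canvas = document.querySelector('canvas');
const { width, height } = canvas;
const imageData = canvas.getContext('2d').getImageData(0, 0, width, height);

const image = new RawImage(imageData.data, width, height, 4); // 4 = RGBA channels
const [result] = await captioner(image);
console.log(result.generated_text); // the generated description of the image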

The first time the user adds an image, they’ll have to wait a bit while the model is downloaded, but it is then stored locally for future use.

In the future, we want to provide an alt text for any existing image in PDFs, except images which just contain text (as is usually the case for PDFs of scanned books).

Next Steps

Our alt text generator is far from perfect, but we want to take an iterative approach and improve it in the open. The inference engine has already landed in Firefox Nightly as a new ml component along with an initial documentation page.

We are currently working on improving the image-to-text datasets and model as described in this blog post, and they will be continually updated on our Hugging Face page.

The code that produces the model lives on GitHub at https://github.com/mozilla/distilvit, and the web application we’re building for our team to improve the model is located at https://github.com/mozilla/checkvite. We want to make sure the models and datasets we build, and all the code used, are made available to the community.

Once the alt text feature in PDF.js has matured and proven to work well, we hope to make the feature available in general browsing for users with screen readers.

Senior Staff Machine Learning Engineer working on Firefox & Python expert.

More articles by Tarek Ziadé…
