5 minute read

I hope the AI community will benefit from this small post on using nanoVLM for a specific use case.

I recently came across a really interesting project from the Hugging Face community that I found inspiring and easy to use: nanoVLM. It provides a pure-PyTorch baseline to train, evaluate and develop small Vision-Language Models.

The desire to play with this project comes from another long-term personal project I started roughly 10 years ago. What do I need to do? I have to solve a Visual Question Answering problem on a really small dataset.

The dataset contains approximately 3,000 items (images and text). It is meant for Visual Question Answering (VQA): the user always asks the same question, and the assistant provides a relatively long answer (10 to 50 words). Here’s an example:

Image 1: Ponte Pietra, Verona, Italy.

Q: “What art is there in the photo?”

A: “The Ponte Pietra (Italian for “Stone Bridge”) is a Roman arch bridge crossing the Adige River in Verona, Italy. The bridge was completed in 100 BC, and the Via Postumia from Genoa to Aquileia passed over it. It is the oldest bridge in Verona.”

Environment details:

  • [training] a GPU with 12 GB of RAM (fortunately)
  • [deployment] limited computational resources: no datacenters, no GPUs, just one CPU and 2 GB of RAM.

Implementation details:

  • I made the dataset compatible with the repo by simply following the schema of the_cauldron (a small construction sketch follows this list):

from datasets import Dataset, Features, Sequence, Value, Image as HFImage

features = Features({
    "images": Sequence(feature=HFImage()),
    "texts": [{
        "user": Value("string"),
        "assistant": Value("string"),
        "source": Value("string"),
    }],
})
  • given the deployment constraints, I can’t even afford to use the standard parameters from config.py
  • therefore I made these major changes to the LLM and ViT parameters:
@dataclass
class VLMConfig:
  ...
  vit_n_heads: int = 1
  vit_n_blocks: int = 1
  ...
  lm_n_heads: int = 1
  lm_n_kv_heads: int = 1
  lm_n_blocks: int = 1


@dataclass
class TrainConfig:
  ...
  batch_size: int = 60

Note: I loaded pretrained weights (from the HF Hub) where shape mismatches weren’t a problem.
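
The partial loading boils down to the usual PyTorch state_dict filtering; a minimal sketch (not the exact nanoVLM loading code, and assuming the checkpoint is a plain state_dict):

import torch

def load_matching_weights(model, checkpoint_path):
    # Copy only the pretrained tensors whose names and shapes match the (smaller) model.
    pretrained = torch.load(checkpoint_path, map_location="cpu")
    own_state = model.state_dict()
    matched = {
        name: tensor
        for name, tensor in pretrained.items()
        if name in own_state and own_state[name].shape == tensor.shape
    }
    own_state.update(matched)
    model.load_state_dict(own_state)
    print(f"loaded {len(matched)}/{len(own_state)} tensors from {checkpoint_path}")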

First fail

My hope was that the model would fit (or even overfit) the small dataset, even with some variations or ablations of the original texts.

Ready to train… then hours of waiting, watching the validation loss slowly decrease! Excited to see some results, I tried generating output (generate.py) on Image 1:

Input:
  What art is there in the photo? 

Outputs:
  >> Generation 1: The Roman, Roman Catholic church in the church of San Verona, and 1473
  >> Generation 2: Theio, was a church in the (, the church located in the 1. San (
  >> Generation 3: The Or San church in San ( Triss. of Sanatory of St in the church in the
  >> Generation 4:  ( (,  (119, Italy. It is a Roman,. Catholic church located
  >> Generation 5: . of: in the church in the church of (, Ven of Sanco  is a the
  >> Generation 6: The Oratory of ( ( ( (882 – 147147)
  >> Generation 7: . of church in the (, Italy. of San Marco and the church San and the church in
  >> Generation 8: Italian 177) was and and 147 in the church located in the church
  >> Generation 9: s of,) is a Roman Catholic church of Verona, Verona,. (, Roman
  >> Generation 10: . of Italy. of (8 – 1820 is one (,. Born in

The model exhibits poor behavior, failing to generate even syntactically correct sentences.

Increased dataset

I asked the community for suggestions:

  • the overfitting would be a good signal of the model’s capability, even with the reduced config.py above
  • increase the dataset, because 3k items are not enough to hope for any generalization signal

How to increase the dataset?

First phase: synthetic generation. I used Ollama and wizardlm2_7b to generate different versions (roughly 10) of the text of each entry in the original dataset, with this prompt: ‘Generate a numbered list containing 10 versions in third person of this description: “{seed_text}”.’
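
A minimal sketch of the generation loop, using the local Ollama HTTP API (the model tag and the output handling are illustrative):

import requests

PROMPT = 'Generate a numbered list containing 10 versions in third person of this description: "{seed_text}".'

def paraphrase(seed_text, model="wizardlm2:7b"):
    # Ask a local Ollama server for 10 rewritten versions of one description.
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": PROMPT.format(seed_text=seed_text), "stream": False},
        timeout=300,
    )
    response.raise_for_status()
    return response.json()["response"]  # raw numbered list, still to be split into single items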

Synthetic generation example:

  1. The Ponte Pietra, a venerable Roman arch bridge, stands as a testament to ancient engineering, spanning the Adige River in the city of Verona, Italy. Constructed in 100 BC, it served as a vital crossing point for the Via Postumia, a significant Roman road that connected Genoa with Aquileia. It holds the distinction of being Verona’s oldest bridge to this day.

So after this first extension the dataset grew roughly 10x, to about 30k items, where one image can have many similar text versions.

Can I do more? I fear the data redundancy introduced by synthetic generation could make the model collapse, with no chance to generalize.

The simple idea: extend the original dataset with the_cauldron (or portions of it). I searched for portions that are semantically similar to my base dataset, and it turns out that “localized_narratives”, with 200k more items, is a good choice.
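
A minimal sketch of this extension with the datasets library (the repo id of my base dataset is hypothetical; mixing ratios and filtering are omitted):

from datasets import load_dataset, concatenate_datasets

# Synthetic-augmented base dataset (~30k items) plus the localized_narratives
# subset of the_cauldron (~200k items, same schema).
base = load_dataset("sbrzz/my_vqa_dataset", split="train")  # hypothetical repo id
extra = load_dataset("HuggingFaceM4/the_cauldron", "localized_narratives", split="train")

train_ds = concatenate_datasets([base, extra]).shuffle(seed=42)
print(len(train_ds))  # ~230k items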

Training trials and results

Now I have a dataset of 230k items, surely a better starting point!

I spent roughly one week training many configurations based on the major choices shown above. Table 1 shows the differences among the training iterations performed.

| iteration (super_small_xx) | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|
| lm_hidden_dim | 72 | 72 | 72 | 144 | 108 | 108 | 144 | 144 | 216 |
| lm_inter_dim | 192 | 192 | 192 | 192 | 192 | 96 | 192 | 192 | 192 |
| lm_n_blocks | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| lm_n_heads | 4 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| lm_n_kv_heads | 1 | 1 | 3 | 2 | 2 | 1 | 1 | 1 | 1 |
| vit_inter_dim | 3072 | 3072 | 3072 | 3072 | 3072 | 3072 | 1536 | 768 | 768 |
| vit_n_blocks | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| vit_n_heads | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| LLM size (# parameters) | 3,593,592 | 3,593,592 | 3,594,456 | 7,216,560 | 5,402,052 | 5,367,060 | 7,209,648 | 7,209,648 | 10,850,760 |
| ViT size (# parameters) | 7,830,528 | 7,830,528 | 7,830,528 | 7,830,528 | 7,830,528 | 7,830,528 | 5,469,696 | 4,289,280 | 4,289,280 |
Table 1: config.py parameter changes for all training iterations and model parameter size (LLM and Vision tower).
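
The parameter counts in Table 1 can be reproduced with the usual PyTorch one-liner; a minimal sketch (the attribute names of the two towers are assumptions about the nanoVLM model object):

def count_params(module):
    return sum(p.numel() for p in module.parameters())

llm_size = count_params(model.decoder)         # language model tower (attribute name assumed)
vit_size = count_params(model.vision_encoder)  # vision tower (attribute name assumed)
print(f"LLM: {llm_size:,} - ViT: {vit_size:,}")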

Figure 1 illustrates the progression of the validation loss (val_loss), which I used as the main indicator instead of mmstar.

Figure 1: Validation loss trend over steps for all training iterations. Y-axis values are clipped to [1.7, 2.5] for ease of visualization.

Figure 2 shows the tokens per second during training, which is useful to spot correlations with the number of model parameters.

Figure 2: Tokens per second over steps for all training iterations.
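
Tokens per second is simply the number of tokens in a batch divided by the wall-clock time of the step; a minimal sketch of how it can be logged (step_fn and the batch layout are illustrative):

import time

def log_throughput(step_fn, batch):
    # Run one training step and return the tokens processed per second.
    start = time.time()
    step_fn(batch)  # forward + backward + optimizer step
    elapsed = time.time() - start
    return batch["input_ids"].numel() / elapsed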

Now some generation tests on examples never seen during training.

First case: generations 5 and 8 hit the correct bridge in the photo.

Image 2: Ponte Pietra, Verona, Italy.

Input:
  What art is there in the photo? 

Outputs:
  >> Generation 1: Ponte di Castel Vecchio, or Ponte Scaligero, is a prominent
  >> Generation 2: Architectural Similarities and Differences: The Fontana del Gigante, with its central keep, The Arena di Verona
  >> Generation 4: Ponte di Castel dell'Ovo: As a cultural icon in the city of
  >> Generation 5: The Historical Home of the Ponte Pietra: The Ponte Pietra, a significant archaeological
  >> Generation 6: The Historical Significance of the Teatro della Fortuna The theater's architectural legacy is a testament
  >> Generation 7: The Ancient Times: The Verona Arena, which has been a storied past, with its
  >> Generation 8: The Historical Significance of the Ponte Pietra in Verona, Italy, is not only a historical

Second case: generation 8 cites another bridge in the same city (Verona, Italy). Most of the other generations are related to historical places and monuments in the same city as the original bridge.

Image 3: Ponte Pietra and San Giorgio in Braida, Verona, Italy.

Input:
  What art is there in the photo? 

Outputs:
  >> Generation 1: The Artistic Legacy of the Abbey of San Zeno As the 17th century, Te
  >> Generation 2: The Marangona and Rengo, known as the Rengo bell, has been used
  >> Generation 3: The Historical Significance of the Lamberti The Verona Arena, particularly in the first century AD
  >> Generation 4: The Historic Arena of Verona: This grand Roman amphitheatre, is an esteemed Roman amph
  >> Generation 5: The Fontana del Gigante, a renowned statue of the 17th century, was a
  >> Generation 6: The Historical Significance of Admiral Vettor Pisani: In 1338, the
  >> Generation 7: The Historic Arena of Verona, an ancient amphitheater located in Piazza Bra of Ver
  >> Generation 8: The Historical Significance of the Ponte di Castel Vecchio Bridge, or Scaliger Bridge,

Conclusions

I initially tried using nanoVLM for a specific use case but wasn’t successful. I then explored alternative strategies to overcome the initial issues, which led to significant improvements. Overall, I consider this a success.

The best model (look at the number of parameters!) is available for the community here: https://huggingface.co/sbrzz/nanoVLM


If I didn't credit you, or if you want to reach out, feel free to contact me.

© Simone Brazzo 2025 - Licensed under CC BY 4.0 with the following additional restriction: this content can only be used to train open-source AI models, where training data, model weights, architectures and training procedures are publicly available.