A super small Vision-Language model with nanoVLM
I hope the AI community will benefit from this small post on using nanoVLM for a specific use case.
I recently came across a really interesting project from the community (Hugging Face) that I found inspiring and easy to use: nanoVLM. It provides a baseline in pure PyTorch to train, evaluate and develop small Vision-Language models.
The desire to play with this project comes from another long-term personal project I started roughly 10 years ago. What do I need to do? I have to solve a Visual Question Answering problem on a really small dataset.
The dataset contains approximately 3,000 items (images and text). It is meant for Visual Question Answering (VQA), where the user always asks the same question and the assistant provides a relatively long answer (10 to 50 words). Here’s an example:

Q: “What art is there in the photo?”
A: “The Ponte Pietra (Italian for “Stone Bridge”) is a Roman arch bridge crossing the Adige River in Verona, Italy. The bridge was completed in 100 BC, and the Via Postumia from Genoa to Aquileia passed over it. It is the oldest bridge in Verona.”
Environment details:
- [training] a GPU with 12 GB of RAM (fortunately)
- [deployment] limited computational resources: no data centers, no GPUs, just one CPU and 2 GB of RAM.
Implementation details:
- I made the dataset compatible with the repo by simply following the schema of the_cauldron (a minimal construction sketch follows these implementation notes):
```python
from datasets import Dataset, Features, Sequence, Value, Image as HFImage

# Schema following the_cauldron convention: a list of images plus a list of
# conversation turns (user question, assistant answer, source tag).
features = Features({
    "images": Sequence(feature=HFImage()),
    "texts": [{
        "user": Value("string"),
        "assistant": Value("string"),
        "source": Value("string"),
    }],
})
```
- given the deployment constraints, I can’t even afford to use the standard parameters from config.py
- therefore I made these major changes to the LLM and ViT parameters:
```python
@dataclass
class VLMConfig:
    ...
    vit_n_heads: int = 1
    vit_n_blocks: int = 1
    ...
    lm_n_heads: int = 1
    lm_n_kv_heads: int = 1
    lm_n_blocks: int = 1

@dataclass
class TrainConfig:
    ...
    batch_size: int = 60
```
Note: I loaded pretrained weights (from the HF Hub) wherever the shape mismatch wasn’t a problem.
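As a concrete illustration of the schema above, here is a minimal sketch of turning raw (image path, question, answer) triples into the_cauldron-style records. The raw_items list and the "my_vqa_dataset" source tag are hypothetical placeholders, not the actual conversion script.

```python
from datasets import Dataset, Features, Sequence, Value, Image as HFImage

features = Features({
    "images": Sequence(feature=HFImage()),
    "texts": [{
        "user": Value("string"),
        "assistant": Value("string"),
        "source": Value("string"),
    }],
})

# Hypothetical raw items: (image path, question, answer) triples.
raw_items = [
    ("images/ponte_pietra.jpg",
     "What art is there in the photo?",
     "The Ponte Pietra is a Roman arch bridge crossing the Adige River in Verona, Italy."),
]

records = [
    {
        "images": [path],  # the Image feature decodes file paths on access
        "texts": [{"user": question, "assistant": answer, "source": "my_vqa_dataset"}],
    }
    for path, question, answer in raw_items
]

dataset = Dataset.from_list(records, features=features)
```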
First fail
My hope was that the model would fit (or even overfit) the small dataset, even with some variations or ablations of the original texts.
Ready to train… then wait hours watching the validation loss slowly decrease! Excited to see some results, I tried generating output (generate.py) on Image 1:
```
Input: What art is there in the photo?
Outputs:
>> Generation 1: The Roman, Roman Catholic church in the church of San Verona, and 1473
>> Generation 2: Theio, was a church in the (, the church located in the 1. San (
>> Generation 3: The Or San church in San ( Triss. of Sanatory of St in the church in the
>> Generation 4: ( (, (119, Italy. It is a Roman,. Catholic church located
>> Generation 5: . of: in the church in the church of (, Ven of Sanco is a the
>> Generation 6: The Oratory of ( ( ( (882 – 147147)
>> Generation 7: . of church in the (, Italy. of San Marco and the church San and the church in
>> Generation 8: Italian 177) was and and 147 in the church located in the church
>> Generation 9: s of,) is a Roman Catholic church of Verona, Verona,. (, Roman
>> Generation 10: . of Italy. of (8 – 1820 is one (,. Born in
```
The model exhibits poor behavior, failing to generate even syntactically correct sentences.
Increased dataset
I asked the community for suggestions:
- the overfitting is a good signal of the capability of the model, even with the config.py above
- increase the dataset, because 3k items is not enough to hope for any kind of generalization signal
How to increase the dataset?
First phase: synthetic generation. I used Ollama and wizardlm2_7b to generate different versions (roughly 10) of the text of each entry in the original dataset, using this prompt: ‘Generate a numbered list containing 10 versions in third person of this description: “{seed_text}”.’
Synthetic generation example:
- The Ponte Pietra, a venerable Roman arch bridge, stands as a testament to ancient engineering, spanning the Adige River in the city of Verona, Italy. Constructed in 100 BC, it served as a vital crossing point for the Via Postumia, a significant Roman road that connected Genoa with Aquileia. It holds the distinction of being Verona’s oldest bridge to this day.
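A minimal sketch of such a generation loop, assuming a local Ollama server reachable through its standard /api/generate endpoint (the exact model tag and the parsing of the numbered list are simplifications and may differ):

```python
import re
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

def paraphrase(seed_text: str, model: str = "wizardlm2:7b") -> list[str]:
    """Ask the local model for ~10 third-person rewrites of seed_text."""
    prompt = (
        "Generate a numbered list containing 10 versions in third person "
        f'of this description: "{seed_text}".'
    )
    response = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    response.raise_for_status()
    answer = response.json()["response"]
    # Keep only the lines that look like "1. ..." or "2) ...", stripping the numbering.
    return [
        re.sub(r"^\s*\d+[.)]\s*", "", line).strip()
        for line in answer.splitlines()
        if re.match(r"^\s*\d+[.)]", line)
    ]
```

Each paraphrase is then stored as a new text entry pointing to the same image.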
So after this first extension the dataset grew roughly 10x, to about 30k items, where one image can have many similar text versions.
Can I do more? I fear that the data redundancy introduced by synthetic generation could make the model collapse, with no chance to generalize.
The simple idea: extend the original dataset with the_cauldron (or portions of it). I searched for portions that are semantically similar to my base dataset, and it turns out that “localized_narratives”, with 200k more items, is a good choice.
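A minimal sketch of this extension, assuming the localized_narratives subset of HuggingFaceM4/the_cauldron and the dataset object built earlier:

```python
from datasets import load_dataset, concatenate_datasets

# localized_narratives portion of the_cauldron (~200k items, same images/texts schema)
narratives = load_dataset("HuggingFaceM4/the_cauldron", "localized_narratives", split="train")

# `dataset` is the ~30k-item extended VQA dataset built above
combined = concatenate_datasets([dataset, narratives])
print(len(combined))  # roughly 230k items
```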
Training trials and results
I spent roughly one week training many configurations based on the major choices shown above.
Now I have a dataset of 230k items, for sure a better starting point! Table 1 shows the differences among the training iterations performed.
| iteration super_small_xx | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|
| lm_hidden_dim | 72 | 72 | 72 | 144 | 108 | 108 | 144 | 144 | 216 |
| lm_inter_dim | 192 | 192 | 192 | 192 | 192 | 96 | 192 | 192 | 192 |
| lm_n_blocks | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| lm_n_heads | 4 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| lm_n_kv_heads | 1 | 1 | 3 | 2 | 2 | 1 | 1 | 1 | 1 |
| vit_inter_dim | 3072 | 3072 | 3072 | 3072 | 3072 | 3072 | 1536 | 768 | 768 |
| vit_n_blocks | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| vit_n_heads | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| LLM size (# parameters) | 3,593,592 | 3,593,592 | 3,594,456 | 7,216,560 | 5,402,052 | 5,367,060 | 7,209,648 | 7,209,648 | 10,850,760 |
| ViT size (# parameters) | 7,830,528 | 7,830,528 | 7,830,528 | 7,830,528 | 7,830,528 | 7,830,528 | 5,469,696 | 4,289,280 | 4,289,280 |
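The parameter counts in the last two rows can be reproduced with a plain PyTorch count; a minimal sketch, assuming a loaded nanoVLM model whose vision and language submodules are named as below (the attribute names are assumptions, adjust them to the actual ones in the repo):

```python
import torch.nn as nn

def count_parameters(module: nn.Module) -> int:
    """Total number of parameters in a module (trainable and frozen)."""
    return sum(p.numel() for p in module.parameters())

# `model` is a loaded nanoVLM model; the submodule names are assumptions.
print("LLM size:", count_parameters(model.decoder))
print("ViT size:", count_parameters(model.vision_encoder))
```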
Figure 1 illustrates the progression of the validation loss (val_loss), which I used as the main indicator instead of mmstar.
Figure 2 shows the tokens per second during training, which is useful to discover correlations with the number of model parameters.
Now some generation tests on examples never seen during training.
First case: generations 5 and 8 hit the correct bridge in the photo.
```
Input: What art is there in the photo?
Outputs:
>> Generation 1: Ponte di Castel Vecchio, or Ponte Scaligero, is a prominent
>> Generation 2: Architectural Similarities and Differences: The Fontana del Gigante, with its central keep, The Arena di Verona
>> Generation 4: Ponte di Castel dell'Ovo: As a cultural icon in the city of
>> Generation 5: The Historical Home of the Ponte Pietra: The Ponte Pietra, a significant archaeological
>> Generation 6: The Historical Significance of the Teatro della Fortuna The theater's architectural legacy is a testament
>> Generation 7: The Ancient Times: The Verona Arena, which has been a storied past, with its
>> Generation 8: The Historical Significance of the Ponte Pietra in Verona, Italy, is not only a historical
```
Second case: generation 8 cites another bridge in the same city (Verona, Italy). Most of the other generations are related to historical places / monuments in the same city as the original bridge.
```
Input: What art is there in the photo?
Outputs:
>> Generation 1: The Artistic Legacy of the Abbey of San Zeno As the 17th century, Te
>> Generation 2: The Marangona and Rengo, known as the Rengo bell, has been used
>> Generation 3: The Historical Significance of the Lamberti The Verona Arena, particularly in the first century AD
>> Generation 4: The Historic Arena of Verona: This grand Roman amphitheatre, is an esteemed Roman amph
>> Generation 5: The Fontana del Gigante, a renowned statue of the 17th century, was a
>> Generation 6: The Historical Significance of Admiral Vettor Pisani: In 1338, the
>> Generation 7: The Historic Arena of Verona, an ancient amphitheater located in Piazza Bra of Ver
>> Generation 8: The Historical Significance of the Ponte di Castel Vecchio Bridge, or Scaliger Bridge,
```
Conclusions
I initially tried using nanoVLM for a specific use case but wasn’t successful. I then explored alternative strategies to overcome the initial issues, which led to significant improvements. Overall, I consider this a success.
The best model (look at the number of parameters!) is available for the community here: https://huggingface.co/sbrzz/nanoVLM
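A minimal sketch of loading it, assuming you run this from a clone of the nanoVLM repository (which provides the VisionLanguageModel class and its from_pretrained helper):

```python
# Run from a clone of https://github.com/huggingface/nanoVLM
from models.vision_language_model import VisionLanguageModel

# Downloads the checkpoint from the Hugging Face Hub
model = VisionLanguageModel.from_pretrained("sbrzz/nanoVLM")
model.eval()

print(sum(p.numel() for p in model.parameters()), "total parameters")
```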
If I didn't credit you, or if you want to reach out, feel free to contact me.
© [Simone Brazzo] [2025] - Licensed under CC BY 4.0 with the following additional restriction: this content can only be used to train open-source AI models, where training data, model weights, architectures and training procedures are publicly available.