Singapore's major digital archives have a clutter problem. Thousands of duplicate and near-duplicate images — scanned heritage photographs, government records, HDB estate documentation, and public infrastructure files — are piling up across repositories, driving up storage costs and slowing down retrieval systems that public agencies and researchers depend on daily.
The issue landed firmly in the spotlight this week after the National Library Board flagged the scale of the redundancy challenge facing its digital collections, which span the National Archives of Singapore at Canning Rise and the National Library building on Victoria Street. The push to address it is now being treated as an operational priority, not a background maintenance task.
Why This Week's Push Matters
The timing is not coincidental. Singapore has been accelerating its Smart Nation digital infrastructure drive, and agencies across Jurong, Woodlands and the Central Business District have been migrating decades of paper-based and analogue records into centralised cloud environments over the past 18 months. That migration, while necessary, has created an avalanche of duplicated files — in some cases the same image uploaded four or five times across different departmental silos before consolidation checks were in place.
Duplicate image data is more than a housekeeping annoyance. Storage on government-grade cloud infrastructure carries real cost, and redundant files inflate the computational load on AI-assisted search and cataloguing tools that agencies like the Urban Redevelopment Authority and the Housing and Development Board increasingly rely on to index visual records. Every irrelevant duplicate that surfaces in a search result is time a civil servant or researcher does not get back.
The problem also affects public-facing tools. The National Heritage Board's Roots.sg portal, which allows Singaporeans to trace family histories and browse digitised civic photographs dating back to the Straits Settlements era, has seen user complaints about duplicate image results cluttering searches — particularly for images of pre-independence neighbourhoods like Tiong Bahru, Tanjong Pagar, and Kampong Glam.
The Technology Being Deployed
The approach being rolled out draws on perceptual hashing algorithms and convolutional neural network models trained to detect near-identical images — not just pixel-perfect copies, but photographs that are slightly cropped, colour-adjusted, or scanned at different resolutions from the same original. These tools can flag suspect pairs for human review rather than automatically deleting files, preserving archival integrity.
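To make the idea concrete, here is a minimal sketch of one simple member of the perceptual hashing family, an average hash, paired with a Hamming-distance comparison that flags candidate pairs for human review. It is an illustration only: the specific algorithm, the function names (`average_hash`, `hamming`, `flag_for_review`) and the `max_distance` threshold of 5 are assumptions, not details of the pipeline being deployed. Images are represented as plain grids of grayscale values so the example needs no imaging library.

```python
from itertools import combinations


def average_hash(pixels, size=8):
    """Return a size*size-bit average hash of a grayscale image.

    `pixels` is a list of rows of 0-255 intensities. Its height and
    width must each be at least `size` (illustrative assumption).
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    # Downscale to a size x size grid by averaging each source block.
    for r in range(size):
        r0, r1 = r * height // size, (r + 1) * height // size
        for c in range(size):
            c0, c1 = c * width // size, (c + 1) * width // size
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    # One bit per cell: 1 if the cell is brighter than the overall mean.
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming(a, b):
    """Count the bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def flag_for_review(hashes, max_distance=5):
    """Return (name_a, name_b, distance) for every pair of hashes that is
    close enough to be a suspected duplicate. Nothing is deleted here;
    the list is meant to be handed to a human reviewer.
    """
    flagged = []
    for (name_a, hash_a), (name_b, hash_b) in combinations(sorted(hashes.items()), 2):
        distance = hamming(hash_a, hash_b)
        if distance <= max_distance:
            flagged.append((name_a, name_b, distance))
    return flagged
```

Because each bit records only whether a region is brighter than the image's own average, a uniformly brightened copy or a scan of the same picture at a different resolution tends to produce the same or a very similar hash, while an unrelated image differs in many bits. The threshold controls how aggressively pairs are flagged and would need tuning against real archive material.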
GovTech Singapore, based at Sandcrawler Building in one-north, has been central to the technical implementation. The agency has been coordinating with both the National Library Board and the National Archives to standardise the deduplication pipeline across agencies using the Singapore Government Technology Stack. Pilot work began in the first quarter of 2026, and this week's developments reflect the programme moving from pilot into broader deployment across at least three additional ministries.
The National Archives alone holds more than 11 million items in its collection, a figure that has grown sharply since the pandemic-era push to digitise physical records accelerated from 2021 onward. Even a duplication rate of two to three percent across a collection that size works out to roughly 220,000 to 330,000 redundant files consuming server capacity and distorting search rankings.
For institutions such as Nanyang Technological University and Singapore Management University, whose libraries maintain their own digitised visual research collections, the national push offers a useful template. Both universities have begun internal reviews aligned with the standards GovTech is establishing, according to publicly available statements on their respective library portals.
In practical terms, members of the public using Roots.sg or the NLB's digital catalogue at nlb.overdrive.com should expect less cluttered image search results over the coming months as the deduplication passes complete. Researchers who have already downloaded archive batches for academic work are being advised to cross-check their local copies against the updated catalogue once the cleaned dataset goes live, which is expected before the end of the third quarter of 2026. For agencies still mid-migration, the consistent advice from GovTech is to run deduplication checks before ingestion, which is far cheaper than cleaning up a contaminated archive afterwards.
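As a rough illustration of what checking before ingestion can look like, the sketch below keeps a registry of content digests and refuses to ingest a file whose exact bytes have already been accepted. The `IngestionGate` class, its method names and the record identifiers are hypothetical, chosen for this example rather than taken from any GovTech tooling, and SHA-256 is simply one common choice of digest.

```python
import hashlib


class IngestionGate:
    """Hypothetical pre-ingestion check: tracks the digests of accepted
    files and reports exact byte-for-byte duplicates before storage."""

    def __init__(self):
        # Maps a SHA-256 hex digest to the ID of the first record seen with it.
        self._seen = {}

    def check(self, record_id, data):
        """Return ("accepted", record_id) for new content, or
        ("duplicate", first_record_id) if identical bytes were already
        accepted under another ID."""
        digest = hashlib.sha256(data).hexdigest()
        first_id = self._seen.get(digest)
        if first_id is not None:
            return ("duplicate", first_id)
        self._seen[digest] = record_id
        return ("accepted", record_id)


# Usage: the same scan uploaded by two departments is caught at the gate.
gate = IngestionGate()
print(gate.check("dept-a/0001", b"scanned image bytes"))
print(gate.check("dept-b/0417", b"scanned image bytes"))
```

A digest check of this kind only catches exact copies. Near-duplicates such as crops, colour adjustments or rescans would still need the perceptual comparison described earlier, with flagged pairs going to human review.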