Singapore's push to digitise everything — from HDB flat inspection reports to hawker-centre licensing documents — has produced an unintended consequence: tens of thousands of duplicate images clogging the databases of public agencies, slowing retrieval systems and inflating storage costs across the civil service. The problem did not emerge overnight. It is the accumulated result of more than a decade of digitisation drives that prioritised speed of upload over data hygiene.
The issue matters today because Singapore is in the middle of repositioning itself as a regional AI and data hub, a goal that depends on the quality of the underlying data that government systems hold. Dirty data, including repeated image files with inconsistent metadata, undermines machine-learning models before they are even trained. The Government Technology Agency of Singapore, known as GovTech, has flagged data standardisation as a prerequisite for the public-sector AI projects it is rolling out under the Smart Nation 2.0 framework announced in 2024.
How the Backlog Built Up
The duplication problem has roots in the mid-2010s, when agencies began scanning paper records in bulk. The Housing & Development Board, which manages more than one million residential flats across estates from Punggol to Queenstown, digitised decades of floor-plan photographs, renovation-permit images and structural-inspection shots in parallel batches. Because different departments used different naming conventions and uploaded to separate servers, the same photograph could enter the system three or four times under different file names. A 2023 review by the Public Service Division identified this kind of siloed digitisation as a systemic vulnerability, though the full scale of duplication across all agencies has not been made public.
The National Library Board faced a similar reckoning with its digital archive at the National Archives of Singapore on Canning Rise. Newspaper photograph collections and oral-history session images digitised before 2018 contained significant overlap because scanning was outsourced to multiple vendors without a unified deduplication protocol. The NLB subsequently invested in hash-based deduplication software as part of its ArchiveSG refresh, which began in earnest in 2022.
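To make the idea of hash-based deduplication concrete, here is a minimal Python sketch. It is an illustration only, not the NLB's software: the file names and byte contents are hypothetical, and the choice of SHA-256 is an assumption. The principle is that two files with identical bytes produce the same digest, whatever they are called.

```python
import hashlib
from collections import defaultdict


def content_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def group_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names whose contents are byte-for-byte identical.

    `files` maps a (hypothetical) file name to its raw contents.
    Only groups with more than one member are returned.
    """
    by_digest: dict[str, list[str]] = defaultdict(list)
    for name, data in files.items():
        by_digest[content_digest(data)].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical example: one scan delivered under two vendors' naming schemes.
sample = {
    "vendorA/photo_0001.tif": b"\x49\x49\x2a\x00scan-bytes",
    "vendorB/IMG-1.tif": b"\x49\x49\x2a\x00scan-bytes",
    "vendorA/photo_0002.tif": b"\x49\x49\x2a\x00other-bytes",
}
duplicate_groups = group_duplicates(sample)
```

An exact hash of this kind catches only byte-identical copies. A photograph that has been re-scanned or re-compressed yields different bytes and therefore a different digest, so it would pass through unnoticed.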
Costs compound the problem. Government cloud storage in Singapore is procured through the Government on Commercial Cloud (GCC) framework, and redundant files translate directly into redundant expenditure. Industry benchmarks suggest that unmanaged duplication can inflate storage requirements by between 20 and 40 per cent in large document repositories. Applied to the public sector's scale, that range represents material budget waste at a time when ministries are managing tighter operational budgets after the post-pandemic spending cycle.
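The arithmetic behind that 20 to 40 per cent range can be sketched in a few lines of Python. Two assumptions apply: the inflation figure is read as measured against the deduplicated data volume, and the 100 TB repository is purely hypothetical, since no agency-level figures have been published.

```python
def wasted_storage_tb(unique_tb: float, inflation: float) -> float:
    """Extra terabytes stored when duplication inflates the unique volume.

    `inflation` is a fraction, e.g. 0.20 for 20 per cent.
    """
    return unique_tb * inflation


def redundant_share(inflation: float) -> float:
    """Fraction of the total stored volume that is redundant."""
    return inflation / (1.0 + inflation)


# Hypothetical repository holding 100 TB of genuinely unique data.
unique_tb = 100.0
for inflation in (0.20, 0.40):
    extra = wasted_storage_tb(unique_tb, inflation)
    share = redundant_share(inflation)
    print(f"{inflation:.0%} inflation: {extra:.0f} TB extra, "
          f"{share:.1%} of what is stored is redundant")
```

On this reading, a repository inflated by 40 per cent has roughly 29 per cent of its stored bytes (two-sevenths) doing no useful work, and one inflated by 20 per cent has about 17 per cent (one-sixth).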
What Comes Next for Agencies and Residents
GovTech is now piloting an automated image-deduplication layer within the Whole-of-Government Data Architecture, which is designed to sit upstream of any agency-level upload portal. Under this approach, a perceptual-hashing algorithm flags near-identical images before they are committed to long-term storage, routing them to a human reviewer rather than simply deleting them — a safeguard against losing genuinely distinct files that happen to look similar.
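The flag-then-review flow can be sketched as follows. GovTech has not said which perceptual-hashing algorithm the pilot uses, so this example assumes a simple "average hash" for illustration. The function names, the tiny 4x4 greyscale grids and the distance threshold of 2 are all hypothetical, and a real pipeline would first shrink each image to a small greyscale thumbnail.

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Average hash: one bit per pixel, set when the pixel is brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def route_upload(new_hash: int, stored_hashes: list[int], threshold: int = 2) -> str:
    """Send near-duplicates to a human reviewer; never delete automatically."""
    for stored in stored_hashes:
        if hamming_distance(new_hash, stored) <= threshold:
            return "review"
    return "store"


# Hypothetical greyscale thumbnails (0 = black, 255 = white).
original = [[10, 10, 200, 200], [10, 10, 200, 200],
            [200, 200, 10, 10], [200, 200, 10, 10]]
# A slightly brighter re-scan with one noisy pixel in the corner.
rescan = [[15, 15, 205, 205], [15, 15, 205, 205],
          [205, 205, 15, 15], [205, 205, 15, 120]]
# A genuinely different image (checkerboard pattern).
different = [[200, 10, 200, 10], [10, 200, 10, 200],
             [200, 10, 200, 10], [10, 200, 10, 200]]

stored = [average_hash(original)]
rescan_decision = route_upload(average_hash(rescan), stored)
different_decision = route_upload(average_hash(different), stored)
```

Here the re-scan differs from the original by a single bit and is routed to review, while the checkerboard differs by eight bits and is stored as new. The threshold is the policy lever: set it too loosely and reviewers drown in false alarms, too tightly and re-scanned copies slip through.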
For ordinary Singaporeans, the practical effect will be felt most in interactions with MyInfo and the Singpass app, where applicants have historically uploaded the same documents (identity photographs, property images, supporting files for grant applications) again and again across different transactions without the system recognising them as identical. Streamlining that backend means faster verification and fewer requests for re-submission.
Businesses on the CorpPass system, particularly the roughly 570,000 registered entities that interact with government licensing portals, stand to benefit from shorter processing queues once legacy duplicate files are purged and active queues run on cleaner data.
The deduplication work is unglamorous and largely invisible to the public. But it is the kind of foundational remediation that determines whether Singapore's AI ambitions run on solid ground or on a landfill of repeated files. Getting the data right before the next wave of AI procurement begins is the more pragmatic path, and agencies appear to know it.