Singapore's push to digitise everything, from HDB flat applications to hawker centre licensing, has produced an unintended side effect: sprawling repositories of duplicate, near-duplicate, and mismatched images that are slowing systems, distorting AI training datasets, and quietly inflating cloud storage bills. The problem did not emerge overnight. It is the cumulative result of more than a decade of rushed digitisation campaigns, legacy system migrations, and departmental data silos that were never designed to talk to each other.
The issue matters now because the government's Smart Nation and Digital Government Office, which oversees the national data infrastructure, is in the middle of its next phase of AI deployment across public-sector agencies. If the image libraries feeding those systems are riddled with duplicates, the AI models trained on them inherit the mess. A planning algorithm shown the same Toa Payoh block 47 times and an Ang Mo Kio block zero times does not produce useful output. Garbage in, garbage out — and in Singapore's case, the garbage has been accumulating since at least the mid-2010s.
How the Duplication Built Up
The roots stretch back to 2014 and 2015, when multiple agencies ran parallel digitisation drives with no unified data standard. The Urban Redevelopment Authority digitised planning records. The Housing and Development Board scanned tens of thousands of flat inspection photographs. The National Heritage Board digitised archival imagery from the National Library on Victoria Street. Each agency used different file-naming conventions, different compression formats, and different metadata schemas. When these datasets were later consolidated onto the Government on Commercial Cloud platform — a migration that accelerated after the Public Sector Data Security Review in 2019 — duplicates from disparate sources ended up sitting side by side with no automated deduplication in place.
Commercial platforms compounded the problem. E-commerce sellers listing on Lazada and Shopee, businesses filing documents with the Accounting and Corporate Regulatory Authority, and even fitness studios uploading class schedules to ActiveSG's booking portal all contributed to bloated image stores on shared infrastructure. Industry estimates, based on comparable cloud migration audits in markets like Japan and South Korea, suggest that between 20 and 35 percent of images in large, multi-contributor repositories are duplicates or near-duplicates. Singapore's data engineers have cited that range as a working benchmark, according to publicly available presentations from GovTech's annual Stack developer conference.
The Deduplication Drive
GovTech, the agency responsible for Singapore's government technology stack, began piloting hash-based image deduplication tools in 2024 across selected datasets held at its one-north headquarters in Buona Vista. Perceptual hashing — a technique that generates a fingerprint from an image's visual content rather than its exact file data — allows near-identical images to be flagged even when they have been resized, recompressed, or had metadata stripped. The pilot covered a subset of HDB-related inspection imagery and, by GovTech's own public reporting at the 2025 Stack conference, produced a meaningful reduction in redundant storage load.
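To make the idea of a perceptual fingerprint concrete, here is a minimal sketch of one of the simplest variants, the average hash. It is an illustration only, not a description of the tools GovTech piloted: the function names, the 8x8 grid, and the distance threshold of 5 bits are all assumptions chosen for the example, and images are represented as plain lists of grayscale values so the code runs without any imaging library.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale pixel values (assumed input format)


def shrink(image: Gray, size: int = 8) -> List[List[float]]:
    """Downsample to size x size by averaging the pixels inside each grid cell."""
    height, width = len(image), len(image[0])
    if height < size or width < size:
        raise ValueError("image must be at least size x size pixels")
    small = []
    for r in range(size):
        top, bottom = r * height // size, (r + 1) * height // size
        row = []
        for c in range(size):
            left, right = c * width // size, (c + 1) * width // size
            block = [image[y][x] for y in range(top, bottom) for x in range(left, right)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def average_hash(image: Gray, size: int = 8) -> int:
    """Fingerprint the image: one bit per cell, set when the cell is brighter than the mean."""
    cells = [value for row in shrink(image, size) for value in row]
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: Gray, b: Gray, max_distance: int = 5) -> bool:
    """Flag two images as near-duplicates when their fingerprints differ in few bits.

    max_distance is a hypothetical threshold; a real deployment would tune it.
    """
    return hamming(average_hash(a), average_hash(b)) <= max_distance


# A 32x32 diagonal gradient, the same picture at 16x16, and an inverted copy.
original = [[4 * x + 4 * y for x in range(32)] for y in range(32)]
resized = [[8 * x + 8 * y + 4 for x in range(16)] for y in range(16)]
inverted = [[248 - pixel for pixel in row] for row in original]
```

Because the fingerprint depends only on which regions are brighter than the image's own average, `is_near_duplicate(original, resized)` is true even though the two pixel grids have different dimensions, while the inverted copy lands far away. A byte-level checksum would treat all three as unrelated files.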
The next step, which agencies are working through now, is harder. Deduplication tools can identify duplicates but cannot always determine which copy is the authoritative one. That requires human review workflows, clear data governance policies, and — critically — inter-agency agreements on who owns what. The Ministry of Digital Development and Information published a revised data governance framework in March 2026 that begins to address exactly these ownership questions, requiring agencies to designate a named data custodian for every shared dataset by the end of this year.
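The gap between finding duplicates and choosing the authoritative copy can be sketched in a few lines. The example below is a hypothetical triage step, not the framework's actual procedure: the record fields, the `custodian_agency` parameter, and the rule "auto-resolve only when exactly one copy belongs to the designated custodian" are assumptions made for illustration. Everything else is routed to human review.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ImageRecord:
    image_id: str   # hypothetical identifier
    agency: str     # agency that contributed the copy
    phash: int      # perceptual fingerprint computed elsewhere


@dataclass
class ReviewItem:
    members: List[ImageRecord]
    authoritative: Optional[ImageRecord]
    needs_human_review: bool


def hamming(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def cluster(records: List[ImageRecord], max_distance: int = 5) -> List[List[ImageRecord]]:
    """Group records whose fingerprints are within max_distance bits (union-find)."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if hamming(records[i].phash, records[j].phash) <= max_distance:
                parent[find(j)] = find(i)

    groups: Dict[int, List[ImageRecord]] = {}
    for i, record in enumerate(records):
        groups.setdefault(find(i), []).append(record)
    # Singletons are not duplicates, so only multi-member groups are returned.
    return [group for group in groups.values() if len(group) > 1]


def triage(records: List[ImageRecord], custodian_agency: str,
           max_distance: int = 5) -> List[ReviewItem]:
    """Auto-resolve a group only when exactly one copy comes from the custodian."""
    items = []
    for group in cluster(records, max_distance):
        owned = [r for r in group if r.agency == custodian_agency]
        if len(owned) == 1:
            items.append(ReviewItem(group, owned[0], needs_human_review=False))
        else:
            items.append(ReviewItem(group, None, needs_human_review=True))
    return items


records = [
    ImageRecord("a", "HDB", 0b11110000),
    ImageRecord("b", "URA", 0b11110001),
    ImageRecord("c", "URA", 0xFFFF0000),
    ImageRecord("d", "NHB", 0xFFFF0001),
    ImageRecord("e", "HDB", 0x0F0F0F0F),
]
queue = triage(records, custodian_agency="HDB")
```

In this toy run, the pair from HDB and URA resolves automatically to the HDB copy, while the pair from URA and NHB has no custodian-owned copy and is left for a person to decide. The pairwise comparison is quadratic in the number of records, which is acceptable for a sketch but would need an index at repository scale.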
For businesses and individuals interacting with government digital services, the practical upshot is straightforward: portals that have historically asked users to re-upload the same identity photograph or property image multiple times are being redesigned so that a single verified upload persists across agencies. Singpass-linked document vaults, already available through the LifeSG app, are the intended backbone for this. Singaporeans who have not yet linked their documents to LifeSG would do well to do so, because it is the channel through which a clean, deduplicated image record will eventually become their own.