Singapore's public sector is confronting a concrete, unglamorous problem that sits beneath its smart-nation ambitions: thousands of duplicate images embedded in official digital records, from HDB property files to National Registry of Persons documentation, are creating data integrity headaches that slow down service delivery and complicate AI-readiness goals.
The issue has been building quietly since at least 2023, when agencies accelerated the digitisation of legacy paper files under the Smart Nation and Digital Government Office's (SNDGO) broader push to migrate records onto centralised cloud platforms. Scanning large volumes of documents at speed produced inevitable redundancies — the same ID photograph appearing multiple times across different government systems, or property inspection images duplicated across MyInfo and agency-specific portals. The problem is not dramatic. But it is persistent, and it is getting harder to ignore as Singapore deepens its integration of AI tools that train on these datasets.
Why the Next Decision Window Matters
SNDGO has signalled that government agencies are expected to meet updated data governance benchmarks by the first quarter of 2027, giving departments roughly nine months to clean up their image repositories and implement deduplication protocols. The urgency is partly driven by the GovTech-led initiative to expand AI services within the Singpass ecosystem — the national digital identity platform used by more than 4.5 million residents. Dirty image data, including duplicates, directly degrades the reliability of facial recognition and document verification tools that Singpass increasingly relies on for transactions ranging from CPF withdrawals to Housing Board flat applications.
At the ground level, the operational friction shows up in places like HDB branches in Toa Payoh and Tampines, where service officers have flagged inconsistencies between images stored in internal case management systems and those retrieved through MyInfo integrations. The National Library Board's digitisation programme for heritage collections — which stores more than 1.3 million records at the Lee Kong Chian Reference Library on Victoria Street — faces a parallel challenge: duplicate photographic assets inflating storage costs and muddying search results for researchers.
The technology options on the table are not cheap. Perceptual hashing tools, which identify near-identical images even when file names or metadata differ, typically cost several hundred thousand dollars in licensing and integration alone for a government-scale deployment, according to procurement benchmarks cited in GovTech's publicly available technology stack guidance documents from 2024. Machine-learning-based deduplication, which can handle more complex cases such as cropped or slightly altered scans, costs significantly more and requires ongoing model training.
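For readers unfamiliar with the technique, the idea behind perceptual hashing can be sketched in a few lines of Python. This is a minimal illustration of one common variant (an "average hash"), not the method used by any agency or vendor named here. It assumes images have already been decoded into grids of grayscale values, and the function names and the distance threshold of 5 are hypothetical choices made for the example.

```python
from typing import List

Grid = List[List[int]]  # grayscale pixels (0-255), one inner list per row


def shrink(pixels: Grid, size: int = 8) -> List[List[float]]:
    """Downsample to size x size by averaging rectangular blocks.

    Assumes the image is at least `size` pixels in each dimension.
    """
    height, width = len(pixels), len(pixels[0])
    out = []
    for by in range(size):
        y0, y1 = by * height // size, (by + 1) * height // size
        row = []
        for bx in range(size):
            x0, x1 = bx * width // size, (bx + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def average_hash(pixels: Grid, size: int = 8) -> int:
    """Return a size*size-bit fingerprint: one bit per cell, set if above the mean."""
    flat = [v for row in shrink(pixels, size) for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: Grid, b: Grid, max_distance: int = 5) -> bool:
    """Treat two images as near-duplicates if their fingerprints are close.

    The threshold is an illustrative assumption, not a recommended setting.
    """
    return hamming(average_hash(a), average_hash(b)) <= max_distance


# Three synthetic 32x32 test images: a left-to-right gradient, a slightly
# brightened copy of it, and an unrelated top-to-bottom gradient.
original = [[x * 8 for x in range(32)] for _ in range(32)]
brighter = [[min(255, v + 10) for v in row] for row in original]
different = [[y * 8 for _ in range(32)] for y in range(32)]

print(is_near_duplicate(original, brighter))   # True
print(is_near_duplicate(original, different))  # False
```

Because the fingerprint is computed from pixel content, it ignores file names and metadata, and small changes such as a brightness shift leave it unchanged or nearly so. Cropped or rotated scans typically defeat a simple hash like this one, which is where the costlier machine-learning approaches come in.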
Three Decisions That Will Define the Outcome
Agencies face three concrete choices in the months ahead. First, whether to deploy a centralised deduplication engine shared across ministries — a model that reduces cost but requires buy-in from agencies protective of their own data pipelines — or let each ministry procure its own solution. Second, how to handle records where a duplicate image is the only surviving copy of a document; deletion without verification risks permanent data loss, and the National Archives of Singapore has its own retention rules under the National Library Board Act that complicate automatic purging. Third, whether to bring in private sector vendors, including local firms from the Launchpad community at one-north in Buona Vista, or keep the remediation effort entirely in-house within GovTech.
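The second decision, avoiding deletion of a last surviving copy, can be illustrated with a small Python sketch. It is an assumption-laden example rather than a description of any agency's process: it detects only byte-for-byte identical files using SHA-256, the `ImageRecord` fields and the `plan_review` function are hypothetical, and it deletes nothing. It always retains one copy per group and merely lists the others for human review.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class ImageRecord:
    record_id: str   # hypothetical identifier for the stored image
    system: str      # hypothetical name of the system holding it
    content: bytes   # raw file bytes


def plan_review(records: Iterable[ImageRecord]) -> Dict[str, List[str]]:
    """Map each retained record ID to the IDs of its exact duplicates.

    Every group of identical files keeps exactly one record (the lowest ID),
    so an image that exists only once is never listed as a duplicate.
    """
    groups: Dict[str, List[ImageRecord]] = defaultdict(list)
    for record in records:
        digest = hashlib.sha256(record.content).hexdigest()
        groups[digest].append(record)

    plan: Dict[str, List[str]] = {}
    for members in groups.values():
        keep, *extras = sorted(members, key=lambda r: r.record_id)
        plan[keep.record_id] = [e.record_id for e in extras]
    return plan


records = [
    ImageRecord("a1", "case-management", b"scan-of-form"),
    ImageRecord("b2", "portal", b"scan-of-form"),
    ImageRecord("c3", "portal", b"only-copy-of-photo"),
]
print(plan_review(records))  # {'a1': ['b2'], 'c3': []}
```

Producing a review list instead of purging automatically leaves room for retention rules to be checked before anything is removed.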
The staffing dimension adds pressure. GovTech employed around 3,600 people as of its last published annual report, and data engineering talent is already stretched across parallel projects including the Intelligent Nation masterplan. Contracting out the deduplication work would be faster but raises questions about data sovereignty — the same questions that surfaced during debates over the National Electronic Health Record system and its vendor arrangements.
For residents, the most tangible near-term effect is the reliability of digital services. If the deduplication effort goes well and agencies meet the Q1 2027 benchmarks, transactions through Singpass should become faster and error rates on identity verification should drop. If agencies miss the window or implement incompatible solutions, the problem compounds — more records, more duplicates, more AI models trained on flawed data. The decisions being made in the middle of this year, largely out of public view, will determine which outcome Singapore gets.