Addressing Gaps: Missing Data Challenges in AI Development

  • Writer: Kehinde Soetan
  • 5 minutes ago

Missing data is a challenge in AI that deserves consistent attention, since many decisions stem from data quality. Artificial intelligence models learn from data, and they can only be as strong as the data that feeds them. These models can make decisions, predict outcomes, understand patterns, generate contexts and classify things based on the data they learn from. This data can be structured, semi-structured or unstructured, textual or numerical, or can take various other forms.


Contrary to what many expect, perfect datasets rarely exist, and most datasets need to be cleaned before predictions made from them can be considered reliable. According to Tableau, data cleaning is the process of fixing or removing incorrect, corrupt, incorrectly formatted, duplicate or incomplete data within a dataset. Data is cleaned to correct inaccuracies, to avoid rework (which saves time in the long run), to prevent errors, to make predictions more accurate, to improve model performance, to make datasets easier to work with, and for many other reasons.
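
To make that definition concrete, here is a minimal Python sketch of a cleaning pass. It is an illustration under stated assumptions, not a prescribed method: the function name `clean_records`, the field names and the sample records are all hypothetical. It fixes formatting by trimming whitespace, removes duplicates, and sets incomplete records aside rather than silently dropping them.

```python
def clean_records(records, required_fields):
    """Return (cleaned, incomplete) from a list of dict records.

    Hypothetical helper for illustration only.
    """
    seen = set()
    cleaned, incomplete = [], []
    for record in records:
        # Fix formatting: strip stray whitespace from string values.
        fixed = {
            key: value.strip() if isinstance(value, str) else value
            for key, value in record.items()
        }
        # Remove duplicates: skip a record identical to one already seen.
        fingerprint = tuple(sorted(fixed.items(), key=lambda item: item[0]))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        # Set aside incomplete records (a required field is absent or empty).
        if any(fixed.get(field) in (None, "") for field in required_fields):
            incomplete.append(fixed)
        else:
            cleaned.append(fixed)
    return cleaned, incomplete


# Made-up sample records, not real data.
records = [
    {"name": " Ada ", "email": "ada@example.com"},
    {"name": "Ada", "email": "ada@example.com"},  # duplicate after trimming
    {"name": "Grace", "email": ""},               # incomplete: empty email
]
cleaned, incomplete = clean_records(records, required_fields=["name", "email"])
print(cleaned)     # [{'name': 'Ada', 'email': 'ada@example.com'}]
print(incomplete)  # [{'name': 'Grace', 'email': ''}]
```

Keeping the incomplete records in a separate list, instead of deleting them, leaves room to decide later how the gaps should be handled.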


Part of cleaning data is handling missing data, which can stem from human error, system limitations, system timeouts, inconsistent data collection techniques, outdated data, data corruption, data anonymisation, privacy restrictions and a host of other causes. In healthcare, for example, a patient who intentionally withholds vital information leaves a gap in the family medical record, so physicians may miss important information when diagnosing the patient or prescribing medication. A decision made on that incomplete record could lead to poor treatment outcomes for the patient and reduce other patients' trust in the medical system. In the financial sector, a system glitch or failure could cause transactions to be skipped when the transaction history is captured in a bank customer's statement. An immigration official reading such an inaccurate statement might make a decision that leads to a bad outcome for the client. The challenges of missing data are not limited to wrong decisions that affect lives: missing data can also raise ethical concerns, introduce bias and fairness concerns, reduce statistical power, and create scalability issues, trust issues and a host of other challenges.


To manage missing data, it is important to first understand how to identify whether data is missing from a dataset. This can be done by looking out for inconsistencies, understanding data categories, understanding how missing data is marked, investigating data type mismatches, checking for logical inconsistencies and profiling the data properly at an early stage, before it is fed into a model or used for decision making.
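
Several of these checks can be sketched with Python's standard library alone. The example below is a rough illustration, not a complete profiling tool, and everything specific in it is an assumption: the set of missing-data markers, the column names, the age range of 0 to 120 and the sample rows are made up for demonstration. It counts missing values per column, flags values of the wrong type, and flags values that are logically inconsistent.

```python
import csv
import io

# Assumed markers; real datasets may mark missing values differently.
MISSING_MARKERS = {"", "na", "n/a", "null", "nan", "?"}


def is_missing(value):
    """True if the value is None or a known missing-data marker."""
    return value is None or value.strip().lower() in MISSING_MARKERS


def profile_missing(rows, columns):
    """Count missing values in each column."""
    counts = {column: 0 for column in columns}
    for row in rows:
        for column in columns:
            if is_missing(row.get(column)):
                counts[column] += 1
    return counts


def find_type_mismatches(rows, column):
    """Return indices of rows whose non-missing value is not numeric."""
    bad = []
    for index, row in enumerate(rows):
        value = row.get(column)
        if is_missing(value):
            continue
        try:
            float(value)
        except ValueError:
            bad.append(index)
    return bad


def find_out_of_range(rows, column, low, high):
    """Return indices of rows whose numeric value falls outside [low, high]."""
    bad = []
    for index, row in enumerate(rows):
        value = row.get(column)
        if is_missing(value):
            continue
        try:
            number = float(value)
        except ValueError:
            continue  # non-numeric values are reported by find_type_mismatches
        if not low <= number <= high:
            bad.append(index)
    return bad


# Made-up sample data for demonstration only.
SAMPLE = """patient_id,age,family_history
1,34,diabetes
2,N/A,
3,forty,asthma
4,-5,?
"""
rows = list(csv.DictReader(io.StringIO(SAMPLE)))

print(profile_missing(rows, ["patient_id", "age", "family_history"]))
# {'patient_id': 0, 'age': 1, 'family_history': 2}
print(find_type_mismatches(rows, "age"))       # [2]  ("forty" is not a number)
print(find_out_of_range(rows, "age", 0, 120))  # [3]  (an age of -5 is not plausible)
```

Running checks like these before training or decision making shows where the gaps are and how they are marked, which is the information needed to choose a handling strategy.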


The challenges of missing data in AI development not only affect decision making but can also significantly impact the trust individuals and societies have in AI models. Domain experts, data engineers and scientists, leaders and system engineers need to collaborate better so that they can quickly identify missing data, strategise on how to handle it, and ensure that high-quality data is fed into AI models.

 
 
 