Published work. Technology. AI and ML.

 

Challenges in Data Preparation for AI and ML

 

Data is a core component of any machine learning project. Working without it is like trying to catch fish in the sand. Data preparation is also one of the most tedious parts of a machine learning project. According to a 2019 Cognilytica report, data preparation takes more than 80% of the time spent on a typical ML project, with most of that time going to data cleaning (25%), labeling (25%), augmentation (15%), aggregation (15%), and identification (5%).[1]

 

  • Challenges in data collection: Some of the most significant problems are:
      • Complicated forms: Overly complex forms can lead people to answer incorrectly or not at all.
      • Literacy can be a barrier: A respondent's level of education can prevent them from filling in the form correctly. A respondent may also have a disability that makes the form hard to complete.
      • Language can be a barrier: The form may be written in a language the respondent does not understand.
      • Insufficiently trained staff: To succeed, a survey needs a clearly defined team with an assigned leader, working from transparent methodologies.
  • Challenges in data preprocessing: Some common problems are:
      • Missing data: Values can go missing for many reasons, including poor collection and storage practices, and the resulting gaps hinder analysis.
      • Manual input: When data is entered by hand, erroneous values can be recorded, which may lead to incorrect analysis results.
      • Data inconsistency: Anything that undermines the integrity of data causes inconsistency. For example, if a customer has two home phone numbers on record, the system cannot tell which one to use.
      • Wrong data types: When a value does not match the type a field expects (for example, text in a numeric field), type mismatch errors occur at input time.
  • Challenges in data transformation: Data often lives in silos, and making sense of it collectively means merging those silos. Combining structured and unstructured data is difficult, though. Data from two different tables cannot be combined directly if their structures do not match. As a result, much of the data is either rendered useless or pushed into the learning model as-is. A model's output is only as good as its data set, so the model may then fail to detect customer patterns hidden in the unstructured data.
  • Challenges in getting adequate high-quality data:
      • Duplicates: Multiple copies of the same record may exist in the dataset. They take a toll on computation and resource usage, and when undetected they can skew results or produce incorrect answers. The remedy is data deduplication.
      • Incomplete data: Because of data-entry errors or corrupted files, the remaining data may have many missing values.
      • Inconsistent formats: If the same kind of data is stored in inconsistent formats (for example, dates written in different styles), the system may not interpret it correctly.
  • Incompatible data formats:
      • Structured data: Structured data is organized according to a fixed schema, typically in a database. It has relational keys and maps easily into pre-designed fields.
      • Semi-structured data (poly-structured data): Semi-structured data does not reside in a relational database but has some organizational properties, such as tags or keys, that make it easier to analyze than unstructured data. JSON and XML are common examples.
      • Unstructured data: Data that is not organized in a pre-defined manner and has no pre-defined data model, such as free text or images.
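
To make a few of the challenges above concrete, here is a minimal sketch in plain Python. It is not a production pipeline, and everything in it is an assumption made up for illustration: the two sample sources (crm_rows and survey_rows), the field names, the date formats, and the helper functions. It maps two silos with different structures onto one shared schema, flags missing values and wrong data types, normalizes inconsistent formats, and drops duplicates.

```python
from datetime import datetime

# Hypothetical records from two silos with different field names and formats.
crm_rows = [
    {"id": "1", "name": "Ana", "phone": "555-0100", "signup": "2021-03-04"},
    {"id": "2", "name": "Ben", "phone": "", "signup": "2021-03-05"},
    {"id": "1", "name": "Ana", "phone": "555-0100", "signup": "2021-03-04"},  # duplicate
]
survey_rows = [
    {"customer_id": 3, "full_name": "Cy", "tel": "(555) 0102", "date": "06/03/2021"},
    {"customer_id": "x4", "full_name": "Di", "tel": "555 0103", "date": "07/03/2021"},
]

# Assumed mapping from each source's field names onto one shared schema.
FIELD_MAPS = {
    "crm": {"id": "id", "name": "name", "phone": "phone", "signup": "signup"},
    "survey": {"customer_id": "id", "full_name": "name", "tel": "phone", "date": "signup"},
}
# Assumed date styles: ISO (year-month-day) and day/month/year.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def to_common_schema(row, source):
    """Rename a source row's fields to the shared schema."""
    return {target: row.get(src) for src, target in FIELD_MAPS[source].items()}


def parse_date(text):
    """Try each known date format; return None if none of them fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except (TypeError, ValueError):
            continue
    return None


def clean(row):
    """Return (cleaned_row, problems) for one row in the shared schema."""
    problems = []
    try:
        row_id = int(row["id"])
    except (TypeError, ValueError):
        row_id = None
        problems.append("wrong type: id")
    # Keep only the digits so differently formatted phone numbers compare equal.
    digits = "".join(ch for ch in str(row["phone"] or "") if ch.isdigit())
    if not digits:
        problems.append("missing: phone")
    signup = parse_date(row["signup"])
    if signup is None:
        problems.append("bad format: signup")
    cleaned = {"id": row_id, "name": row["name"], "phone": digits or None, "signup": signup}
    return cleaned, problems


def prepare(sources):
    """Merge all sources, set aside problem rows, and drop exact duplicates."""
    seen, kept, rejected = set(), [], []
    for source, rows in sources.items():
        for raw in rows:
            row, problems = clean(to_common_schema(raw, source))
            key = (row["id"], row["name"], row["phone"], row["signup"])
            if problems:
                rejected.append((raw, problems))
            elif key not in seen:
                seen.add(key)
                kept.append(row)
    return kept, rejected


kept, rejected = prepare({"crm": crm_rows, "survey": survey_rows})
for row in kept:
    print("kept:", row)
for raw, problems in rejected:
    print("rejected:", raw, problems)
```

Rows with problems are set aside together with the reason instead of being silently dropped or pushed into the model, so a person can decide whether to repair or discard them. In practice a library such as pandas would usually do this work; the sketch uses plain Python only to keep the individual steps visible.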

 

 

Well-organized, well-prepared data is necessary for the success of ML models. However, preparing data is time-consuming, delicate work that is full of challenges. For this reason, self-service data preparation tools have been created to help data scientists get data ready for ML. Such tools give data scientists the freedom to clean, prepare, and deploy data themselves.

 

 

 

[1] https://www.cognilytica.com/2019/03/06/report-data-engineering-preparation-and-labeling-for-ai-2019/

https://eliteresearch.com/what-are-some-data-collection-challenges-and-how-do-you-overcome-them-1


 
