202109281804 Data-centric AI

#ai #data-centric #machine learning #data

Data-centric AI

A paradigm shift from model-centric AI to data-centric AI, advocated by many companies and scholars recently, especially Andrew Ng and his famous talk.

Previously, we have model-centric AI where people focus mostly on

  • feature engineering
  • model architecture design
  • training algorithm design

However, those tasks soon are marginalized by the popularization of large pre-trained models where general knowledge is learned through huge datasets and parameters. Any work that tries to compete with those pre-trained models with new architectures will find themselves in need for pre-training from scratch, which will most likely burn many holes in their wallets.

In short, those pre-trained models are powerful, increasingly data-hungry, and less practically modifiable.

The easy way out for now is redirecting the attention back to data:

  • data collection
  • labeling
  • augmentation
  • slicing
  • management

Key Components in Data-centric AI

  • Data
  • Programmatic access: higher level of data manipulation that also takes into considerations such as:
    • Privacy concerns
    • Domain expertise
    • Rapid changes in the real world
    • Ethical concerns (bias, audits, lineage)
  • Expertise: labeling and modeling should be unified into a positive feedback loop instead of individual components. Labeled data help to design better models and algorithms and the modeling process provide the labeling process with guidance.