The client was building a data platform for data science use cases. The goal was to ingest large volumes of data from disparate sources, transform and enrich it into a unified representation, and serve the result to machine learning models for training and prediction.
The data flow consists of the following components:
- Ingestion Layer: Set of tools and pipelines that acquire data from different sources such as relational databases, REST APIs, semi-structured file formats, studies, and other documents (a minimal ingestion sketch follows this list).
- Data Lake: Centralized storage for all data assets, with complete governance including inventory, provenance, access control, and audit.
- Batch Layer: Set of pipelines that cleanse, format, enrich, and label data for downstream ML model training (see the feature-building sketch after this list).
- Feature Store: A single place to keep, curate, and serve features to ML models.
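
To make the ingestion layer concrete, here is a minimal sketch of two ingestion jobs, one pulling from a REST API and one copying a relational table, each landing raw files in the data lake. The endpoint URL, connection string, table name, and lake path are all hypothetical placeholders, not the client's actual configuration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd
import requests
import sqlalchemy

# Hypothetical sources and lake location -- placeholders for illustration only.
API_URL = "https://api.example.com/v1/records"
DB_URL = "postgresql://user:password@db-host/warehouse"
LAKE_ROOT = Path("/data-lake/raw")


def ingest_rest_api() -> None:
    """Land one API snapshot as a timestamped JSON file in the raw zone."""
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = LAKE_ROOT / "api" / f"records_{stamp}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(response.json()))


def ingest_relational_table(table: str) -> None:
    """Copy a database table into the raw zone as Parquet."""
    engine = sqlalchemy.create_engine(DB_URL)
    frame = pd.read_sql_table(table, engine)
    target = LAKE_ROOT / "db" / f"{table}.parquet"
    target.parent.mkdir(parents=True, exist_ok=True)
    frame.to_parquet(target)


if __name__ == "__main__":
    ingest_rest_api()
    ingest_relational_table("customers")
```

Landing raw, timestamped snapshots before any transformation keeps the lake's provenance intact: downstream pipelines can always be replayed against the original inputs.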
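The batch layer and feature store can be sketched in the same spirit. The example below cleanses and enriches a raw orders table into per-customer features, then publishes them as a Parquet snapshot; representing the feature store as versioned Parquet files is an assumption for illustration, and the `orders` schema and column names are hypothetical.

```python
from pathlib import Path

import pandas as pd

LAKE_ROOT = Path("/data-lake/raw")      # hypothetical raw zone
FEATURE_STORE = Path("/feature-store")  # hypothetical feature store location


def build_customer_features() -> pd.DataFrame:
    """Cleanse and enrich raw orders into per-customer features."""
    orders = pd.read_parquet(LAKE_ROOT / "db" / "orders.parquet")
    # Cleansing: drop rows without a customer key, normalize amounts.
    orders = orders.dropna(subset=["customer_id"])
    orders["amount"] = orders["amount"].astype(float)
    # Enrichment: aggregate order history into model-ready features.
    features = orders.groupby("customer_id").agg(
        order_count=("order_id", "count"),
        total_spend=("amount", "sum"),
        avg_order_value=("amount", "mean"),
    )
    return features.reset_index()


def publish_features(features: pd.DataFrame, name: str) -> None:
    """Serve features by writing a named Parquet snapshot to the store."""
    FEATURE_STORE.mkdir(parents=True, exist_ok=True)
    features.to_parquet(FEATURE_STORE / f"{name}.parquet")


if __name__ == "__main__":
    publish_features(build_customer_features(), "customer_features")
```

Keeping feature computation in one curated place means training and prediction read identical feature definitions, which avoids training/serving skew.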