AWS Native Serverless Data & ML Pipelines Implementation
About the client
The client has been building a data platform for data science use cases. The goal was to ingest vast amounts of data from disparate sources, transform and enrich the ingested data into a unified representation, and serve the results to machine learning models for training and prediction.
The data flow comprises the following components:
– Ingestion Layer: The set of tools and pipelines that enable data acquisition from different sources, such as relational databases, REST APIs, semi-structured file formats, studies, and other documents.
– Data Lake: The centralized storage for all data assets with complete governance including inventory, provenance, access control, and audit.
– Batch Layer: The set of pipelines that cleanse, format, enrich, and label data for further ML model training.
– Feature Store: The single place to keep, curate, and serve features to machine learning (ML) models.
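To make the Ingestion Layer and Data Lake components above more concrete, the following is a minimal sketch of an ingestion function of this kind. The bucket name, prefix convention, source URL, and environment variables are illustrative assumptions, not the client's actual implementation.

import json
import os
import urllib.request
from datetime import datetime, timezone

import boto3

# Illustrative names only; the real bucket, prefixes, and source are assumptions.
DATA_LAKE_BUCKET = os.environ.get("DATA_LAKE_BUCKET", "example-data-lake")
RAW_PREFIX = "raw/rest-api"
SOURCE_URL = os.environ.get("SOURCE_URL", "https://api.example.com/records")

s3 = boto3.client("s3")


def handler(event, context):
    """Pull a batch of records from a REST source and land it in the raw zone as-is."""
    with urllib.request.urlopen(SOURCE_URL) as response:
        payload = response.read()

    # Partition raw objects by ingestion date so downstream jobs can pick up increments.
    now = datetime.now(timezone.utc)
    key = f"{RAW_PREFIX}/dt={now:%Y-%m-%d}/{now:%H%M%S}.json"

    s3.put_object(Bucket=DATA_LAKE_BUCKET, Key=key, Body=payload)
    return {"bucket": DATA_LAKE_BUCKET, "key": key}

Each source type would get its own small function like this, which is what keeps the ingestion layer decomposed rather than monolithic.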
Challenge
The client faced high costs for development, deployment, and, most importantly, operation of the data platform, including the Data Lake, Ingestion, and ML Pipelines. The pipelines were mostly running on EC2 instances, which increased operating costs and required a significant amount of time to deploy and test the pipelines in lower environments.
Solution
The proposed solution split the monolithic ingestion service and ML data preparation pipelines into several small, autonomous serverless functions.
The first stage is the Ingestion Layer, where custom-developed AWS Lambda functions ingest data from different sources and drop it into the Data Lake as unstructured data. A set of custom AWS Step Functions workflows then retrieves the unstructured data from the Data Lake and transforms it into a format that can be consumed by the ML models.
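As a rough illustration of how such a preparation workflow can be wired up, the sketch below registers a linear Step Functions state machine that chains the cleanse, enrich, and label steps from the Batch Layer as separate Lambda tasks. All ARNs, names, and the IAM role are hypothetical placeholders, not the client's resources.

import json

import boto3

sfn = boto3.client("stepfunctions")

# Hypothetical ARNs; the actual account, region, and function names are assumptions.
CLEANSE_ARN = "arn:aws:lambda:us-east-1:123456789012:function:cleanse"
ENRICH_ARN = "arn:aws:lambda:us-east-1:123456789012:function:enrich"
LABEL_ARN = "arn:aws:lambda:us-east-1:123456789012:function:label"

# A linear workflow: each state invokes one small Lambda and passes its output onward.
definition = {
    "Comment": "Prepare raw Data Lake objects for ML training",
    "StartAt": "Cleanse",
    "States": {
        "Cleanse": {"Type": "Task", "Resource": CLEANSE_ARN, "Next": "Enrich"},
        "Enrich": {"Type": "Task", "Resource": ENRICH_ARN, "Next": "Label"},
        "Label": {"Type": "Task", "Resource": LABEL_ARN, "End": True},
    },
}

response = sfn.create_state_machine(
    name="ml-data-preparation",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/step-functions-execution",  # placeholder role
)
print(response["stateMachineArn"])

In the described solution this kind of definition would normally live in Terraform rather than be created ad hoc; the call above only shows the shape of the workflow.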
High-level solution diagram
Technologies
AWS Lambda, AWS Step Functions, Terraform, GitLab CI.
Result
The solution significantly reduced the cost and time spent on development and deployment. This was achieved by introducing infrastructure as code that quickly and seamlessly deploys the platform across multiple environments (Dev, Stage, Production). In addition, the team developed a set of Lambda functions and Step Functions workflows that trigger data processing on events, eliminating idle EC2 instances and resulting in a cost-efficient Data Lake.
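The event-driven triggering mentioned above can be sketched as an S3 event notification that invokes a processing Lambda only when a new raw object lands, so no compute sits idle between batches. In the actual solution this wiring would be expressed as Terraform infrastructure as code; the boto3 call below, with illustrative bucket and function names, only shows the equivalent configuration.

import boto3

s3 = boto3.client("s3")

# Illustrative identifiers; the real bucket and function names are assumptions.
DATA_LAKE_BUCKET = "example-data-lake"
PROCESSOR_LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:start-preparation"

# Fire the processing Lambda only when a new object appears under the raw/ prefix.
s3.put_bucket_notification_configuration(
    Bucket=DATA_LAKE_BUCKET,
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": PROCESSOR_LAMBDA_ARN,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": "raw/"}]}
                },
            }
        ]
    },
)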
OUR FOCUS IS CLOUD SOLUTIONS
Our primary goal is to deliver value to our clients by resolving technical challenges and helping them achieve their objectives. We utilize cloud solutions as a powerful toolset to make this happen. With Matoffo expertise, companies can significantly reduce the time from idea to market and rapidly scale their digital business. Additionally, Matoffo enables seamless adoption of rapidly growing tech capabilities, transforming businesses to stay competitive in the market.