Boosting ML Model Training Flow for a Healthcare Company

Executive Summary

The client’s data science team had no access to GPUs and trained ML models manually on personal computers, which made training slow. The Matoffo team used Python and the Ray Tune library to speed up training and enable cluster computing in AWS with autoscaling. We established a centralized platform to track execution results using MLFlow and created the AMIs needed for cluster autoscaling. To reduce costs, we developed an AWS Lambda function that automatically terminates unused EC2 clusters.

About the Customer

Naring Health is a healthcare company based in the US that uses Precision Nutrition and Multi Omics Diagnostic technologies to manage diseases effectively. They have succeeded in creating healthcare service organizations, designing cutting-edge technology solutions, achieving positive results in clinical settings, and transforming how people understand the relationship between food and illnesses. Naring Health believes that a healthy diet is not a punishment, but rather a delicious and enjoyable way to manage chronic health issues.

Customer Challenge

Initially, the client’s data scientists trained ML models on their personal computers. However, many modern laptops are not equipped with Graphics Processing Units (GPUs). Since the project centered on image processing tasks, the team needed GPU-powered hardware. Without it, training was slow, and launching ML runs and collecting their outcomes remained a manual process.

A part of our client’s infrastructure was already hosted in AWS Cloud. That’s why leveraging the existing cloud provider was critical. Additionally, the client asked us to use Python. The client suggested several frameworks that could potentially automate their processes. The Matoffo team was to conduct a thorough investigation to determine their feasibility.

Why Amazon EC2

AWS EC2 offers scalable computing power and flexibility to meet our client’s business needs. EC2 allows us to easily spin up and down instances and choose from a wide range of machine learning services and tools that integrate seamlessly with EC2. Additionally, AWS provides high levels of security, reliability, and availability, allowing companies to focus on developing and deploying machine learning models without worrying about infrastructure maintenance. Overall, leveraging AWS EC2 and cloud for machine learning tasks helps to reduce costs, improve efficiency, and accelerate innovation.

Why Matoffo

The Matoffo team consists of experienced developers who specialize in optimizing ML model training processes and can provide you with tailored cloud solutions to meet your specific needs. By leveraging AWS EC2 instances, we can ensure that your ML models are trained with the highest computing power available. Additionally, our team uses diverse AWS services that can be easily integrated into your infrastructure, enabling you to scale your business rapidly and efficiently. With our expertise in AWS solutions, we can help you navigate the complex cloud environment. By choosing our company as your cloud provider, you will receive reliable and efficient service that will help you achieve your business goals.

Matoffo Solution

To improve the processing speed of the client’s ML models, we conducted research and selected the Ray Tune Python library to achieve our goals. The library offers smooth integration with the AWS environment and ML libraries commonly used by our client, including PyTorch. With Ray Tune, we were able to easily spin up clusters in AWS, using the command-line interface and Infrastructure as Code (IaC) concepts. By creating a YAML file to specify the required cluster configurations, we enabled a wrapper for ML model training code to tap into the full computing power of the cluster. Ray Tune leverages AWS EC2 with autoscaling, which means the cluster size can be adjusted depending on the actual job.
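To make the cluster configuration step more concrete, here is a minimal sketch of what such a file might contain, written as a Python dictionary that follows the field names of Ray's AWS cluster launcher configuration. The cluster name, region, instance types, worker limits, idle timeout, and AMI ID are all placeholders chosen for illustration, not the values used in the project. The sketch prints JSON, which is a subset of YAML 1.2, so the output can be saved as a cluster file.

```python
import json


def build_cluster_config(name, region, ami_id, max_gpu_workers):
    """Return a dict shaped like a Ray cluster launcher config for AWS."""
    return {
        "cluster_name": name,
        "max_workers": max_gpu_workers,
        # Assumed value: idle workers are removed after 10 minutes.
        "idle_timeout_minutes": 10,
        "provider": {"type": "aws", "region": region},
        "auth": {"ssh_user": "ubuntu"},
        "head_node_type": "head",
        "available_node_types": {
            # CPU-only head node that schedules the work (placeholder type).
            "head": {
                "resources": {},
                "node_config": {"InstanceType": "m5.large", "ImageId": ami_id},
            },
            # GPU workers, scaled between 0 and max_gpu_workers on demand.
            "gpu_worker": {
                "min_workers": 0,
                "max_workers": max_gpu_workers,
                "resources": {},
                "node_config": {"InstanceType": "g4dn.xlarge", "ImageId": ami_id},
            },
        },
    }


# "ami-PLACEHOLDER" is a hypothetical ID standing in for a prebuilt AMI.
config = build_cluster_config("ml-training", "us-east-1", "ami-PLACEHOLDER", 4)
print(json.dumps(config, indent=2))
```

A file like this is typically passed to the Ray command-line interface with `ray up cluster.yaml` to start the cluster and `ray down cluster.yaml` to remove it.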

To further enhance our solution, we established a centralized platform to track execution results. We selected MLFlow and spun up an EC2 instance with a static IP address, which allowed anyone on the team to open the web interface and review the results. We integrated Ray Tune with the MLFlow server in our wrapper.
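A wrapper of this kind could look roughly like the sketch below. It is an illustration under stated assumptions rather than the project's actual code: `train_model` is a toy stand-in for the real PyTorch training loop, the experiment name and search space are invented, and the sketch assumes a Ray 2.x release in which `tune.run` is available and a function trainable may return its final metrics as a dictionary. Each trial logs its parameters and loss to the tracking server through the standard MLFlow client calls.

```python
def train_model(config):
    """Toy stand-in for the real training loop (hypothetical objective)."""
    # Quadratic loss that is smallest at lr = 0.01, plus a batch-size term.
    loss = (config["lr"] - 0.01) ** 2 + 0.1 / config["batch_size"]
    return {"loss": loss}


def tracked_trial(config):
    """Run one trial and record its parameters and loss in MLFlow."""
    import mlflow  # imported lazily so the module loads without MLFlow

    mlflow.set_tracking_uri(config["tracking_uri"])
    mlflow.set_experiment("ray-tune-demo")  # assumed experiment name
    with mlflow.start_run():
        mlflow.log_params({"lr": config["lr"], "batch_size": config["batch_size"]})
        result = train_model(config)
        mlflow.log_metric("loss", result["loss"])
    return result


def run_search(tracking_uri, num_samples=8):
    """Fan trials out over the Ray cluster, requesting one GPU per trial."""
    import ray
    from ray import tune

    ray.init(address="auto")  # attach to the cluster that is already running
    search_space = {
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([16, 32, 64]),
        "tracking_uri": tracking_uri,  # passed through to every trial
    }
    analysis = tune.run(
        tracked_trial,
        config=search_space,
        num_samples=num_samples,
        resources_per_trial={"cpu": 1, "gpu": 1},
    )
    return analysis.get_best_config(metric="loss", mode="min")
```

On the cluster head node, a call such as `run_search("http://<mlflow-host>:5000")` would start the search, where the host is a placeholder for the tracking server's static IP address. Because every trial requests a GPU, pending trials are what prompt the autoscaler to add GPU workers.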

Finally, we created the necessary AMIs for the MLFlow server and the Ray Tune worker machines, and established a private VPC to improve data security. The client gave its data scientists VPN access so they could reach the MLFlow server and review the results.

We also established a CI/CD pipeline with GitLab CI and a serverless architecture. To reduce costs, we developed an AWS Lambda function that automatically tears down Ray Tune clusters if they run for too long. This ensures that EC2 instances are terminated if someone forgets to shut them down at the end of the day or leaves runs going overnight.
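The teardown logic can be sketched as a scheduled Lambda handler along the following lines. This is a hedged illustration, not the deployed function: the 12-hour cutoff is an assumed value (the text gives no figure), and the sketch assumes cluster nodes can be recognised by a `ray-cluster-name` tag, which should be checked against the tags actually present on the instances.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=12)  # assumed cutoff; not taken from the project


def stale_instance_ids(instances, now, max_age=MAX_AGE):
    """Return the IDs of instances launched more than max_age ago."""
    return [i["InstanceId"] for i in instances if now - i["LaunchTime"] > max_age]


def lambda_handler(event, context):
    """Entry point for a scheduled invocation of the Lambda function."""
    import boto3  # bundled with the AWS Lambda Python runtime

    ec2 = boto3.client("ec2")
    filters = [
        # Assumption: cluster nodes carry a "ray-cluster-name" tag.
        {"Name": "tag-key", "Values": ["ray-cluster-name"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
    instances = []
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(Filters=filters):
        for reservation in page["Reservations"]:
            instances.extend(reservation["Instances"])

    # LaunchTime is timezone-aware, so compare against an aware "now".
    stale = stale_instance_ids(instances, datetime.now(timezone.utc))
    if stale:
        ec2.terminate_instances(InstanceIds=stale)
    return {"terminated": stale}
```

Keeping the age check in a separate pure function makes the cutoff easy to test without touching AWS. Note that this version measures age from launch time, so it would also stop a legitimately long run that exceeds the cutoff.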

Technologies

AWS EC2, Python, MLFlow, Ray Tune, AWS Lambda

Business Value

The solution we developed markedly increased the velocity of the client’s ML model development. In most cases, the execution time of a single ML model training run dropped from approximately six hours to around 20 minutes, roughly 18 times faster. Here are additional benefits of cooperation with Matoffo:

• Increased processing power
With our solution, the client was no longer limited by local compute power. Instead, they could use GPU-based AWS compute instances, which provided far more powerful computing capabilities.

• Automation and better task management
The client benefited from distributed computation, which allowed them to use clusters of instances.

• Increased accuracy of ML models
We provided the client with a centralized platform to track results and gain insights from the training outcomes. This was particularly useful because it removed the human factor that might have affected the previous, manual training processes. Overall, our solution significantly improved the efficiency and accuracy of the client’s ML model development process.
