Compute-hungry, dynamic, unpredictable AI model builds/training and static AI infrastructure

- Anil Ravindranath, Founder, President & CTO

Are you tired of ...

→ Guesswork and predefining how many GPUs you need for your AI work

→ Data scientists given limited compute and waiting for compute availability

→ Fixed, preset compute shares for your AI models

→ High operational costs (cloud and on-premise) to build AI models

→ Constant training disruptions and failures

→ Slower TTV (Time to Value) for AI models despite using faster, high-performance GPUs

→ Localized training failures at the edge (MEC), resulting in high latency

Look no further,

rapt.ai’s market-first “adaptive model-aware learning system” is a Kubernetes-integrated platform that dynamically makes your AI infrastructure in the cloud, on-premise and at the edge fungible to your AI model needs.

rapt’s software transparently analyzes AI model graphs, GPU utilization needs and business workflows, and dynamically allocates a precise fraction of a GPU from rapt’s virtual GPU pool cluster, delivering acceleration and maximum utilization at unprecedented scale and economics.

All this is done automatically with “zero” user intervention. No more guesswork or presets required. You can also set priorities and workflow policies, and the learning system will take care of the rest.

Background

AI and ML/DL are compute-hungry tasks that need precise compute resources to achieve the expected levels of accuracy and TTV (Time to Value). The compute is out there, whether in your data centers or in your public cloud deployments, but it is not used and managed efficiently, resulting in high TCO, slow TTV for your AI models and low data scientist productivity.

Static AI infrastructure, GPU utilization, business workflows

Dynamic, unpredictable AI workloads Vs. Static GPU allocations

Unlike traditional software workloads, AI workloads are “dynamic & unpredictable”. AI model building and training is all about tuning and retuning parameters to achieve high prediction accuracy, and the process repeats until the desired level of accuracy is achieved.

Add to this mix dataset changes and data drift across varied datasets, which result in varying AI workload patterns. A traditional virtual infrastructure would handle this by giving workloads fixed, static shares. But this does not solve the AI workload problem, because the pattern and profile are dynamic and unpredictable. Static shares lead to underutilization of valuable GPU resources.

Fixed shares are good when the workloads are predictable and static.
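To make the mismatch concrete, here is a minimal sketch of what a fixed share does to a job whose demand fluctuates. The demand trace, the 0.5 share and the function name `static_share_outcome` are hypothetical numbers and names chosen for illustration, not measurements or rapt APIs.

```python
def static_share_outcome(demand, share):
    """Return (idle, unmet) GPU-fraction totals for a fixed share.

    demand: per-interval GPU fraction the job could use (hypothetical trace).
    share:  fixed fraction of a GPU reserved for the job.
    """
    # Capacity reserved for the job but not used in an interval.
    idle = sum(max(share - d, 0.0) for d in demand)
    # Demand the job had in an interval but could not get.
    unmet = sum(max(d - share, 0.0) for d in demand)
    return idle, unmet


# Hypothetical demand trace: mostly low, with two peaks.
demand = [0.2, 0.2, 0.9, 0.3, 0.1, 0.8]
idle, unmet = static_share_outcome(demand, share=0.5)
print(f"idle={idle:.1f} unmet={unmet:.1f}")
```

With this trace the fixed share is both too large (reserved capacity sits idle in the quiet intervals) and too small (the peaks are throttled). Raising or lowering the share only trades one problem for the other.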

Non-availability of valuable compute resources leads to slow AI development. Imagine running a distributed training job while limited in the number of workers and GPUs. This leads to slow training and low overall performance.

Challenges

Low GPU utilization with occasional peaks, leaving GPU resources inefficiently used and underutilized.

Different AI workloads with different tuning parameters and graphs result in dynamic GPU utilization patterns.

This results in constant guesswork about how many GPU resources an AI training/build needs.

Fixed/preset share or quota GPU allocations lead to underutilization, oversubscription, and constant training disruptions and failures.

Static allocations, with each workload taking a dedicated GPU and utilizing only ~26% of it, lead to underutilization and high system costs.

Inefficient, underutilized systems with high costs and expensive resources sitting idle.
High operational costs in clouds & on-premise.
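As a back-of-the-envelope illustration of the ~26% figure above, the sketch below compares one dedicated GPU per workload with an idealized shared pool. The workload count of 8 and the function name `gpus_needed` are assumptions for illustration. The pooled number is a lower bound that ignores utilization peaks, GPU memory limits and packing constraints.

```python
import math


def gpus_needed(num_workloads, avg_utilization):
    """Compare dedicated-GPU allocation with an idealized shared pool.

    Returns (dedicated, pooled), where pooled is a lower bound that
    assumes average demand can be packed perfectly.
    """
    dedicated = num_workloads  # one GPU per workload
    pooled = math.ceil(num_workloads * avg_utilization)
    return dedicated, pooled


# Hypothetical cluster: 8 workloads, each using about 26% of its GPU.
dedicated, pooled = gpus_needed(num_workloads=8, avg_utilization=0.26)
print(f"dedicated={dedicated} pooled>={pooled}")
```

Under these assumptions, 8 dedicated GPUs carry roughly 2.08 GPUs' worth of work, so about 74% of each GPU sits idle.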

rapt solution

rapt.ai’s market-first “adaptive model-aware learning system” dynamically makes your AI infrastructure fungible to your AI model needs. rapt pools GPUs/accelerators into a virtual pool and dynamically allocates resources based on AI model graphs, parameters and profile patterns. You can also set policies based on business workflows and team requirements.

Maximize GPU utilization. Pooled, dynamic resource allocation and sharing between AI workloads, teams and business workflows.

Guaranteed shares/allocations. Dynamic GPU resource shares based on AI workloads. As AI workload profile patterns change, allocations are increased or decreased automatically. Zero user intervention.

Automatic compute consolidation. Dynamically schedule AI workloads based on GPU utilization patterns and consolidate the systems/instances.

Reduce TCO, or fit more AI jobs onto the same systems.

Maximum performance and faster training. Using rapt’s dynamic and adaptive sharing, run more AI jobs & workers and maximize performance.

Kubernetes and Horovod integrated. You can use it as a plugin to your K8s and Horovod frameworks, transparently & seamlessly.

Visibility. A dashboard provides complete visibility into GPU utilization, model history, stats, etc. for your resource consumption and planning.
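To give a feel for what compute consolidation means in practice, here is a minimal sketch that packs per-job GPU fractions onto as few GPUs as a simple heuristic finds. It uses generic first-fit-decreasing bin packing, which is an assumption chosen for illustration and not a description of rapt’s scheduler. The job fractions and the function name `consolidate` are hypothetical.

```python
def consolidate(demands, capacity=1.0):
    """Pack per-job GPU fractions onto GPUs with first-fit decreasing.

    demands:  GPU fraction each job needs (hypothetical values).
    capacity: usable fraction of one GPU.
    Returns a list of GPUs, each a list of the job demands placed on it.
    """
    gpus = []
    for d in sorted(demands, reverse=True):  # place the largest jobs first
        for placed in gpus:
            # Small tolerance guards against floating-point rounding.
            if sum(placed) + d <= capacity + 1e-9:
                placed.append(d)
                break
        else:
            gpus.append([d])  # no existing GPU has room, so open a new one
    return gpus


# Hypothetical jobs that would otherwise each hold a dedicated GPU.
jobs = [0.26, 0.26, 0.26, 0.5, 0.4, 0.3]
layout = consolidate(jobs)
print(f"{len(jobs)} jobs on {len(layout)} GPUs: {layout}")
```

In this example six jobs that would occupy six dedicated GPUs fit on three. A real scheduler also has to react as demand changes over time, which a one-shot packing like this does not capture.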

© Copyright 2020. All Rights Reserved.
