Compute-Hungry, Dynamic, Unpredictable AI Model Builds and Training vs. Static AI Infrastructure
Are you tired of ...
→ Guesswork and having to predefine how many GPUs you need for your AI work
→ Data scientists stuck with limited compute, waiting for availability
→ Fixed, preset compute shares for your AI models
→ High operational costs (cloud & on-premises) to build AI models
→ Constant training disruptions and failures
→ Slow time to value (TTV) for AI models, in spite of using faster, high-performance GPUs
→ Localized training failures at edge/MEC sites, resulting in high latency
Look no further,
rapt.ai’s market-first “adaptive model-aware learning system”, a Kubernetes-integrated
platform, dynamically transforms your AI infrastructure in the cloud, on-premises, and at
the edge, making it fungible to your AI model’s needs.
rapt’s software transparently analyzes AI model graphs, GPU utilization needs, and business
workflows, and dynamically allocates a precise fraction of a GPU from rapt’s virtual GPU
pool cluster, delivering acceleration and maximum utilization at unprecedented scale and
economics.
All this is done automatically, with zero user intervention. No more guesswork or
presets required. In addition, you can set priorities and workflow policies, and the
learning system will take care of the rest.
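To make fractional allocation concrete from a user’s point of view, here is a minimal, purely illustrative sketch of a Kubernetes pod that requests a slice of a GPU. The extended-resource name rapt.ai/gpu-milli, its units, and the container image are hypothetical; rapt’s actual interface may differ.

```python
from kubernetes import client, config

# Illustrative pod manifest requesting a fraction of a GPU.
# "rapt.ai/gpu-milli" is a hypothetical extended-resource name
# (250 milli-GPU, roughly a quarter of one device); the real name
# and units exposed by rapt's Kubernetes plugin may differ.
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "fractional-train-job"},
    "spec": {
        "restartPolicy": "Never",
        "containers": [{
            "name": "trainer",
            "image": "registry.example.com/trainer:latest",  # hypothetical image
            "resources": {"limits": {"rapt.ai/gpu-milli": "250"}},
        }],
    },
}

config.load_kube_config()  # use the local kubeconfig
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod_manifest)
```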
Background
AI and ML/DL are compute-hungry tasks that need precise, right-sized compute resources to achieve the expected levels of accuracy and time to value (TTV). That AI compute is out there, in your data centers and public cloud deployments, but it is not used and managed efficiently, resulting in high TCO, slow TTV for your AI models, and low data-scientist productivity.
[Figure: static AI infrastructure, GPU utilization, and business workflows]
Dynamic, unpredictable AI workloads vs. static GPU allocations
Unlike traditional software workloads, AI workloads are dynamic and unpredictable.
AI model building and training is all about tuning and retuning parameters to achieve
high prediction accuracy, and the process repeats until the desired level of accuracy is
achieved.
Add to the mix dataset changes and data drift across varied datasets, and the result is
constantly varying AI workload patterns.
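The schematic Python loop below shows why demand varies: each tuning trial changes the model width and batch size, so GPU memory and compute needs shift from run to run. The train_and_evaluate stub and the specific hyperparameter choices are illustrative, not rapt’s API.

```python
import random

def train_and_evaluate(width: int, batch_size: int) -> float:
    """Stub for a full training run. In practice, GPU memory and compute
    demand scale with `width` and `batch_size`, so every trial has a
    different utilization profile."""
    return random.random()  # stand-in for validation accuracy

TARGET_ACCURACY = 0.95
best = 0.0
while best < TARGET_ACCURACY:
    width = random.choice([256, 512, 1024, 2048])   # model size varies per trial
    batch_size = random.choice([32, 64, 128, 256])  # memory footprint varies too
    best = max(best, train_and_evaluate(width, batch_size))
print(f"reached {best:.2f} accuracy after repeated retuning")
```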
A traditional virtual infrastructure would solve this by giving each workload a fixed,
static share. Fixed shares work well when workloads are predictable and static, but they
do not solve the AI workload problem, because the pattern and profile are dynamic and
unpredictable: static shares lead to underutilization of valuable GPU resources.
Non-availability of valuable compute resources slows AI development. Imagine running
distributed training while limited by the number of workers and GPUs: training is slow
and overall performance suffers, as the sketch below illustrates.
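As a concrete example, here is a minimal Horovod-over-PyTorch sketch (Horovod is the distributed-training framework rapt integrates with, per the feature list below). The worker count passed to horovodrun is fixed at launch, which is exactly the static limit described above; the model and hyperparameters are placeholders.

```python
# Launch with e.g.: horovodrun -np 8 python train.py
# Throughput is capped by the -np worker count chosen up front.
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())  # pin one GPU per worker

model = torch.nn.Linear(128, 10).cuda()  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
# ... training loop elided; adding workers requires a full relaunch.
```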
Challenges
Low average GPU utilization with occasional peaks, leaving GPU resources inefficient and underutilized
Different AI workloads, with different tuning parameters and graphs, produce dynamic GPU utilization patterns
The result is constant guesswork about how many GPU resources an AI training/build needs
Fixed, preset GPU shares/quotas lead to underutilization, oversubscription, and constant training disruptions and failures
Static allocations, with each workload taking a dedicated GPU and utilizing it at only ~26%, lead to underutilization and high system costs
The net effect: inefficient, underutilized systems, high system costs, expensive resources sitting idle, and high operational costs in the cloud and on-premises (the back-of-the-envelope calculation below puts a number on it).
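A quick calculation makes the cost of the ~26% utilization figure above concrete. The $3.00/hour GPU price is an assumed on-demand cloud rate, purely for illustration:

```python
price_per_gpu_hour = 3.00  # assumed on-demand cloud rate (USD), illustrative
avg_utilization = 0.26     # average utilization cited above

effective_cost = price_per_gpu_hour / avg_utilization
print(f"effective cost per fully utilized GPU-hour: ${effective_cost:.2f}")
# ~ $11.54: nearly 4x the sticker price once idle cycles are paid for
```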
rapt.ai’s market-first “adaptive model-aware learning system” dynamically transforms your AI infrastructure to be fungible to your AI model’s needs. rapt pools GPUs/accelerators into a virtual pool and dynamically allocates resources based on AI model graphs and parameter profile patterns. You can also set policies based on business workflows and team requirements, as sketched below.
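As a rough illustration of what such policies might express, the sketch below encodes team priorities and workflow rules as plain data. Every name and field is hypothetical; rapt’s actual policy schema is not documented here.

```python
# Hypothetical policy definition; all names and fields are illustrative.
policies = {
    "teams": {
        "research":  {"priority": "high", "min_gpu_fraction": 0.50},
        "batch-etl": {"priority": "low",  "min_gpu_fraction": 0.10},
    },
    "workflows": {
        "prod-retraining": {"preemptible": False},  # never evicted
        "experiments":     {"preemptible": True},   # may yield GPUs under load
    },
}
```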
Maximize GPU utilization. Pooled, dynamic resource allocation and sharing between AI workloads, teams, and business workflows.
Guaranteed shares/allocations. Dynamic GPU resource shares based on AI workloads: as workload profile patterns change, allocations are increased or decreased automatically, with zero user intervention.
Automatic compute consolidation. Dynamically schedule AI workloads based on GPU utilization patterns and consolidate systems/instances.
Reduce TCO, or fit more AI jobs onto the same systems.
Maximum performance and faster training. With rapt’s dynamic and adaptive sharing, run more AI jobs and workers and maximize performance.
Kubernetes and Horovod integrated. Use it as a plugin to your Kubernetes and Horovod frameworks, transparently and seamlessly.
Visibility. A dashboard provides complete visibility into GPU utilization, model history, stats, and more, for resource consumption tracking and planning.