Compute Optimization & Orchestration
for AI in Healthcare

Inefficient GAN training in the cloud –
Slow data processing, fewer simulations, and lower data scientist productivity

AI in healthcare enables hospitals and clinicians to make faster, more accurate decisions using intelligent medical devices, improving patient outcomes, driving efficiency, and reducing errors.

Problem

Medical devices generate massive amounts of data that must be processed by data scientists. Training on this data involves complex AI workloads, with synthetic image generation and simulations that take hours, days, or even weeks. The training process requires large amounts of GPU compute. That compute is available in the cloud, but it is not used efficiently, leading to a slow training process, fewer simulations, and lower data scientist productivity.

Why

To train the model that processes the data, complex GANs (Progressive GANs) are applied for data simulations, and the results are reviewed. GANs generate high-quality synthetic images through “synthetic data augmentation”. GAN workloads are dynamic and unpredictable and need compute on demand, but the infrastructure is static, resulting in an inefficient training process.

GANs background

In CNN-GAN model training, multiple “generators” and a discriminator are applied to generate high-resolution images at varied sizes. The key here is “multiple generators”: the more generators there are, the more varied the generated images, which is why a run can take days or weeks to produce the required number of images. Progressive GANs (PGANs) are used to generate high-resolution images. For example, you can start with an image size of 24×24 and scale to 128×128 and all the way to 1024×1024.
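To make progressive growing concrete, here is a minimal, hypothetical PyTorch-style sketch (not rapt.ai or any production ProGAN code): a toy generator whose output resolution doubles with each training stage. The layer widths, stage schedule, and class names are illustrative assumptions.

    # Minimal sketch of progressive resolution growth, assuming PyTorch.
    import torch
    import torch.nn as nn

    class ProgressiveGenerator(nn.Module):
        """Toy generator: starts at 4x4 and doubles resolution per stage."""
        def __init__(self, latent_dim=64, channels=32, max_stages=4):
            super().__init__()
            self.channels = channels
            self.project = nn.Linear(latent_dim, channels * 4 * 4)  # latent -> 4x4 map
            # One upsampling block per stage beyond the base 4x4 resolution.
            self.blocks = nn.ModuleList([
                nn.Sequential(
                    nn.Upsample(scale_factor=2, mode="nearest"),
                    nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                    nn.LeakyReLU(0.2),
                )
                for _ in range(max_stages)
            ])
            self.to_rgb = nn.Conv2d(channels, 3, kernel_size=1)

        def forward(self, z, stage):
            # stage 0 -> 4x4, stage 1 -> 8x8, stage 2 -> 16x16, ...
            x = self.project(z).view(-1, self.channels, 4, 4)
            for block in self.blocks[:stage]:
                x = block(x)
            return self.to_rgb(x)

    g = ProgressiveGenerator()
    z = torch.randn(2, 64)
    for stage in range(5):
        print(stage, tuple(g(z, stage).shape))  # resolution doubles per stage

As each stage activates another upsampling block, the compute and memory footprint of the same training job grows, which is exactly why a fixed allocation chosen at launch stops fitting.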

GANs Infrastructure challenges

Lots of computational resources

GANs require large amounts of computational resources. For example, running an 8-generator PGAN training requires 8 GPUs in parallel (a sketch of this pattern follows). Fewer GPU resources mean slower model completion and fewer generated images.
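Below is a minimal sketch of the one-generator-per-GPU pattern described above, assuming a host where GPUs are selected via CUDA_VISIBLE_DEVICES; the worker body is a placeholder for the real training loop.

    # Sketch: pin each generator training process to its own GPU.
    import os
    import multiprocessing as mp

    def train_generator(generator_id: int, gpu_id: int) -> None:
        # Restrict this process to a single GPU before any CUDA init.
        os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
        print(f"generator {generator_id} training on GPU {gpu_id}")
        # ... build the PGAN generator and run its training loop here ...

    if __name__ == "__main__":
        num_generators = 8  # the 8-generator PGAN example above
        procs = [mp.Process(target=train_generator, args=(i, i))
                 for i in range(num_generators)]  # one GPU per generator
        for p in procs:
            p.start()
        for p in procs:
            p.join()

With fewer than 8 GPUs, generators either queue behind each other or must share devices by hand, which is where the slowdowns described above come from.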

Resource guesswork

Data scientists and ML engineers have to guess the GPU resources and shares for each GAN generator. Any variance in training progression then demands on-demand resources at run time, resulting in training disruptions and failed models.

AI trainings and data scientists waiting for compute, and time spent on manual tooling

Static allocations and resource guesswork lead to “over-provisioned” resources, so other AI training and generator jobs sit waiting for compute; the result is disruptions, delays, frustration, and trainings that have to be relaunched. To cope, data scientists manually checkpoint and sync trainings, with inconsistent trainings and ineffective results. A sketch of that manual pattern follows.
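The manual workaround usually looks something like this minimal PyTorch-style sketch; the checkpoint path, cadence, and model are placeholder assumptions.

    # Sketch of the hand-rolled checkpoint/resume loop described above.
    import os
    import torch
    import torch.nn as nn

    CKPT = "gan_checkpoint.pt"  # hypothetical checkpoint path
    model = nn.Linear(64, 64)   # stand-in for a GAN generator
    opt = torch.optim.Adam(model.parameters())

    start_step = 0
    if os.path.exists(CKPT):    # resume after a disrupted or relaunched run
        state = torch.load(CKPT)
        model.load_state_dict(state["model"])
        opt.load_state_dict(state["opt"])
        start_step = state["step"]

    for step in range(start_step, 1000):
        # ... one real training step would run here ...
        if step % 100 == 0:     # hand-tuned cadence: yet more guesswork
            torch.save({"model": model.state_dict(),
                        "opt": opt.state_dict(),
                        "step": step}, CKPT)

Every generator job needs its own copy of this scaffolding, and the checkpoint cadence and sync points all have to be tuned by hand.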

Dynamic workloads with “static” GPU allocations

GANs are dynamic, non-linear workloads; drifts in training parameters create a constant need for more resources on demand. But the infrastructure and the resources are all static.

Slow AI models – due to insufficient resources

Progressive GANs – need compute resources on demand; without them, trainings suffer disruptions, delays, and failures

High TCO – over-provisioned resources

Low utilization – due to static allocations

Data scientist manual tooling – constant guesswork, checkpoints, syncs, and relaunched trainings

rapt.ai healthcare solution

Dynamic, automatic GPU resource-share allocation based on workloads, with demand variance tracked over time, resulting in more generator trainings per GPU.

raptIQ™ engine – ML-based resource predictive analytics and automation

GAN workloads are dynamic and unpredictable, with progression-based auto-skewing driven by image augmentation and resolution needs. For example, a GAN training starts with an image size of 4×4 and, as it progresses, moves to 24×24 and then all the way to 1024×1024. The same applies to other parameters, such as generator filter size and batch size: as GAN training progresses, these adjust dynamically over time. The automation framework below analyzes, optimizes, and allocates resources on demand at run time, providing dynamic resources based on workload needs with zero intervention.

raptIQ™ Engine Resource Prediction – Analyzes GAN workloads and training configurations, predicts the GPU/compute type and GAN training utilization patterns, and dynamically optimizes resources (number of GPUs, percentage share of threads/memory). The result is precise, on-demand resource allocation for GAN trainings.
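As a concrete, purely hypothetical illustration of resolution-aware prediction (not raptIQ™ internals), this sketch estimates a per-stage GPU memory share from the current PGAN resolution and batch size; the calibration constants are invented.

    # Hypothetical sketch of resolution-aware resource estimation.
    def estimate_memory_share(resolution: int, batch_size: int,
                              gpu_memory_gb: float = 40.0) -> float:
        """Fraction of one GPU's memory a PGAN stage might need."""
        # Activation memory grows roughly with pixels per batch; the
        # 2e-6 GB-per-pixel factor is a made-up calibration constant.
        est_gb = 2e-6 * resolution * resolution * batch_size + 1.0  # +1 GB base
        return min(est_gb / gpu_memory_gb, 1.0)

    # As training progresses 4x4 -> 24x24 -> ... -> 1024x1024, the
    # predicted share (and hence the allocation) is re-evaluated per stage.
    for res in (4, 24, 128, 512, 1024):
        print(res, round(estimate_memory_share(res, batch_size=16), 3))

The point of the real engine is that these estimates come from learned utilization patterns rather than a fixed formula, and they feed the scheduler below.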

raptIQ™ Adaptive Scheduler – A Kubernetes-integrated scheduler pools the GPUs into a virtual compute environment and schedules the training jobs using a ‘packing’ method to run multiple training sessions on the GPUs without interference: the packed jobs do not affect each other’s execution.
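The packing idea can be illustrated with a toy first-fit-decreasing bin packer over predicted GPU shares; this is a sketch of the general technique, not the actual scheduler, and the job names and shares are assumptions.

    # Toy first-fit-decreasing packing of jobs onto a GPU pool by share.
    def pack_jobs(job_shares: dict[str, float], num_gpus: int) -> dict[int, list[str]]:
        placement: dict[int, list[str]] = {g: [] for g in range(num_gpus)}
        used = [0.0] * num_gpus
        # Place the biggest jobs first so small ones fill the gaps.
        for job, share in sorted(job_shares.items(), key=lambda kv: -kv[1]):
            for g in range(num_gpus):
                if used[g] + share <= 1.0:  # fits without oversubscription
                    placement[g].append(job)
                    used[g] += share
                    break
            else:
                raise RuntimeError(f"no GPU has room for {job} ({share:.0%})")
        return placement

    jobs = {"gen-0": 0.6, "gen-1": 0.5, "gen-2": 0.3,
            "gen-3": 0.3, "gen-4": 0.2, "gen-5": 0.1}
    print(pack_jobs(jobs, num_gpus=2))  # six generator jobs on two GPUs

Because no GPU is ever oversubscribed past 100% of its share budget, packed jobs do not contend for the same slice of compute, which is what “without interference” means in practice.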

raptIQ™ Virtualization – Slices GPUs from the pool for the jobs and assigns the required number of threads, cores, and memory to each GAN job.
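For illustration only, a slice handed to a job might carry limits like the following; the field names, thread counts, and API shape are all assumptions, not the raptIQ™ interface.

    # Illustrative shape of a per-job GPU slice carved from the pool.
    from dataclasses import dataclass

    @dataclass
    class GpuSlice:
        gpu_id: int
        threads: int       # worker threads granted to the job
        memory_gb: float   # memory carved out of the pooled GPU
        job: str

    def carve_slice(gpu_id: int, share: float, job: str,
                    gpu_threads: int = 8192, gpu_memory_gb: float = 40.0) -> GpuSlice:
        """Turn a predicted fractional share into concrete slice limits."""
        return GpuSlice(gpu_id=gpu_id,
                        threads=int(share * gpu_threads),
                        memory_gb=share * gpu_memory_gb,
                        job=job)

    print(carve_slice(gpu_id=0, share=0.3, job="gen-2"))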

Data scientists do not have to preset resource shares for the GAN training process. Shares are predicted and assigned on demand based on the workload, and no code changes are required in the model.

Analyze Models
Optimize Compute
Virtualize
Request a Demo



