Comparing AWS, Azure, Google Cloud for AI/ML model training, MLOps, GPU access

I have been experimenting with the main #cloud providers (#AWS, #Azure, #Google) for #AI #ML model training using #CPUs and #GPUs.

Must say Microsoft Azure and Google Cloud were the most difficult, unnecessarily complex to set up in terms of getting a GPU instance, buckets, IAM #permissions, etc. AWS was definitely way easier and quicker.

Understand that, given the wide range of user types cloud providers serve, they have predefined hierarchies/permissions, etc., but these somehow don't make much sense for small companies or single users. Need to find some smaller startup cloud providers; can't spend so much time on cloud admin work.

Biggest problem with Azure was that, despite all the request emails to multiple teams/people over days, I could still not get access to a GPU. In the end I just did not use it.

Google Cloud also has setup complexity, and CPU performance is not good; I used a simple e2-highmem-2 virtual machine. When trying to use it with Databricks, performance was bad, and the Google Cloud CLI (gcloud) installation just would not complete, despite running for hours multiple times.
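
For reference, here's a minimal sketch of what creating that e2-highmem-2 VM looks like with the google-cloud-compute Python client (not my exact setup; the project ID, zone, and image are just placeholders):

```python
# Minimal sketch: creating an e2-highmem-2 VM with the google-cloud-compute
# Python client. Project ID, zone, and image below are placeholders.
from google.cloud import compute_v1

project = "my-project-id"   # placeholder
zone = "us-central1-a"      # placeholder

instance = compute_v1.Instance()
instance.name = "ml-cpu-test"
instance.machine_type = f"zones/{zone}/machineTypes/e2-highmem-2"

# Boot disk from a public image family.
disk = compute_v1.AttachedDisk()
disk.boot = True
disk.auto_delete = True
disk.initialize_params = compute_v1.AttachedDiskInitializeParams(
    source_image="projects/debian-cloud/global/images/family/debian-12",
    disk_size_gb=50,
)
instance.disks = [disk]

# Default VPC network.
nic = compute_v1.NetworkInterface()
nic.network = f"projects/{project}/global/networks/default"
instance.network_interfaces = [nic]

client = compute_v1.InstancesClient()
operation = client.insert(project=project, zone=zone, instance_resource=instance)
operation.result()  # block until the VM is created
```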

AWS has worked the quickest and best: the CLI installed quickly, and an i3.xlarge virtual machine worked well with no integration issues.
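
For comparison, a minimal sketch of launching an i3.xlarge with boto3; the AMI ID, key pair, and security group below are placeholders, not my actual values:

```python
# Minimal sketch: launching an i3.xlarge EC2 instance with boto3.
# The AMI ID, key pair name, and security group ID are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is illustrative

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",            # placeholder, e.g. a Deep Learning AMI
    InstanceType="i3.xlarge",
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",                      # placeholder
    SecurityGroupIds=["sg-0123456789abcdef0"],  # placeholder
)

instance_id = response["Instances"][0]["InstanceId"]
print(f"Launched {instance_id}")

# Wait until the instance is running before connecting.
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```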

For GPU access I used a startup, #Vast.ai. It took 15 mins to sign up, pay, select cheap high-memory GPUs (~$0.40 to $3+, RTX / H100), and launch a prebuilt #docker image with #jupyter and related libraries installed. Everything required to train AI/ML models. Simple, cheap, no admin work. While it's primarily for GPU access, it can easily integrate with a simple cloud app building and hosting service.
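
Once the prebuilt Jupyter container is up, a quick sanity check that the rented GPU is actually visible, assuming the image ships with PyTorch (the ML-focused images generally do):

```python
# Quick GPU sanity check inside the rented container (assumes PyTorch is installed).
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("Memory (GB):", round(torch.cuda.get_device_properties(0).total_memory / 1e9, 1))
```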

Also trying #MLOps tools like #Databricks and #SageMaker with Google Cloud and AWS. Google is too complex and difficult to use, so I'm using AWS for both, as it's much easier and faster to set up and use.
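
As a rough idea of what the SageMaker side looks like, a minimal training-job sketch with the SageMaker Python SDK; the IAM role ARN, S3 paths, and train.py script are placeholders, not my actual project:

```python
# Minimal sketch: submitting a training job with the SageMaker Python SDK.
# The IAM role ARN, S3 paths, and train.py entry point are placeholders.
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder

estimator = SKLearn(
    entry_point="train.py",        # your training script (placeholder)
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    framework_version="1.2-1",
    sagemaker_session=session,
)

# Data is read from S3; the trained model artifact is written back to S3.
estimator.fit({"train": "s3://my-bucket/train/"})
```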

Tools like Databricks and SageMaker are worth learning, as a lot of companies are using them now. They have a lot of beneficial features, like managing the MLOps lifecycle, prebuilt connectors to #datalakes and #warehouses like #Snowflake, a feature store, etc.
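
For example, the Snowflake connector in Databricks is exposed as a regular Spark data source. A minimal sketch, assuming a Databricks notebook where `spark` and `dbutils` are predefined; all connection options below are placeholders:

```python
# Minimal sketch: reading a Snowflake table from a Databricks notebook via the
# built-in Spark connector. All connection options and names are placeholders.
sf_options = {
    "sfUrl": "myaccount.snowflakecomputing.com",                 # placeholder
    "sfUser": "my_user",                                         # placeholder
    "sfPassword": dbutils.secrets.get("my-scope", "sf_password"),  # stored in a secret scope
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "COMPUTE_WH",
}

df = (
    spark.read.format("snowflake")
    .options(**sf_options)
    .option("dbtable", "FEATURES")   # placeholder table
    .load()
)
df.show(5)
```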

Easy to learn (1-2 days), as long as you know the entire AI/ML #datascience lifecycle: problem definition, algorithm design, training, tuning, presenting to users, deploying models, retraining.

You mainly need to learn about model artifact logging and accessing datastores, which generally use the cloud platform you run them on or connect to the tool's own cloud storage.
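
Databricks tracks artifacts through MLflow, so a minimal logging sketch looks like this (values are made up; on Databricks the tracking setup is handled for you, while locally it writes to an ./mlruns directory):

```python
# Minimal sketch: logging params, metrics, and an artifact with MLflow.
# All values below are made up.
import mlflow

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("rmse", 0.42)
    mlflow.log_artifact("model.pkl")  # path to a local file you already saved
```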

Overall I think the big cloud providers have become too complex and clunky to use. I prefer to use smaller providers, as the services are essentially the same.