Need advice - building a private GPU cloud

deep-dev · Nov 4, 2023

Hi, new here, I've built a 3D application that I will run for some customers on my own servers. I've tested nearly every GPU cloud provider (probably all of them at this point) but it's looking likely I will need to create at least 2-3 colocation deployments of some GPU servers in the US for my needs. I will/can use some of the GPU cloud providers that I've tested but none of them give me exactly what I need for all use cases...

For instance, my network egress for some customers will be high, so I can't be capped by the bandwidth limits some providers impose, and I can't absorb the massively inflated bandwidth charges of AWS/Azure, for example. So, here are some basic features of my app; I'm looking for advice on the best bang for the buck and availability for a GPU to use in my servers.

- My app delivers low-res 3D graphics (photo realism is not required) and poly count is low compared to high-end games.
- This is more of an engineering type 3D (similar to CAD) application
- I'm leaning towards AMD for price / value and potentially availability bc it seems like it may be hard to get my hands on a few dozen (I may need up to 30-40 over the next year possibly more) NVIDIA GPUs but maybe I'm wrong about this...?
- I can run on either Windows server or Linux but likely will standardize on Linux (probably Ubuntu)

For a data center GPU that comes in under, say, $4-5K per GPU, which one would you guys recommend?

Thanks!

Aquinus · Nov 4, 2023

Have you only looked at AWS and Azure for Cloud solutions? Google Cloud Platform (GCP) has historically been priced aggressively compared to AWS. It might be worth exploring. None of this looks out-of-scope for a Cloud provider.

The only way you can figure out if costs are reasonable for a Cloud solution is to know exactly what your egress usage is predicted to be along with the amount of capacity you need at a given time. GCP has different egress pricing compared to AWS for the standard pricing tier. You get 200GiB/mo free of charge. 200GiB-10TiB is $0.085/GiB. 10-150TiB is $0.065/GiB, and 150-500TiB is $0.045/GiB.
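As a rough sketch, the tiers quoted above can be turned into a quick monthly estimate. The function name and the 300 TiB example volume are only illustrative, the tier boundaries are read as cumulative monthly usage, and nothing is assumed about pricing beyond 500 TiB since no figure is quoted for it:

```python
GIB_PER_TIB = 1024

# (upper bound of tier in GiB, USD per GiB), using the figures quoted above.
TIERS = [
    (200, 0.0),                   # first 200 GiB/mo free
    (10 * GIB_PER_TIB, 0.085),    # 200 GiB to 10 TiB
    (150 * GIB_PER_TIB, 0.065),   # 10 TiB to 150 TiB
    (500 * GIB_PER_TIB, 0.045),   # 150 TiB to 500 TiB
]


def monthly_egress_cost(gib):
    """Estimate one month's egress bill in USD for `gib` GiB transferred."""
    if gib < 0:
        raise ValueError("egress volume cannot be negative")
    if gib > TIERS[-1][0]:
        raise ValueError("no price quoted above 500 TiB")
    cost = 0.0
    lower = 0
    for upper, price in TIERS:
        if gib <= lower:
            break
        # Only the slice of traffic that falls inside this tier is billed at its rate.
        cost += (min(gib, upper) - lower) * price
        lower = upper
    return cost


# Illustrative volume only: 300 TiB in a month.
print(f"{monthly_egress_cost(300 * GIB_PER_TIB):.2f}")  # 17083.80
```

Plugging in your own predicted egress per region gives a first number to put next to a colocation bandwidth quote.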

Another thing you might want to consider is that if you're using a cloud provider, you can scale resources to the load. You don't need to provision the maximum amount you need out of the gate and eat the costs for everything you're not using. You can't do that if you're buying all the hardware upfront and running it yourself, not to mention you're incurring the running costs of the servers, hardware, and anything needed to keep everything going.

All in all, you need to do a thorough cost analysis of the two solutions. My experience with running multi-million dollar software products in the cloud is that cost wise and operationally, it's cheaper to use the cloud. However, you still need to be mindful of the resources you provision.

If you were one of my engineers telling me this, I would tell you to go write an ADR (architectural decision/design record) and come back to me with that and a cost analysis of the proposed solutions.

I guess the tl;dr is that on-prem hardware has huge costs that aren't always apparent. It's easy to just look at price tags and think it's cheaper, when the logistics tell a very different story. On-prem hardware demands staffing to maintain it. That need is reduced by a lot in the cloud, and some of the biggest costs to a business are the people.
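To make the comparison concrete, here is a minimal sketch of the kind of monthly cost model that analysis could start from. Every function and parameter name is hypothetical, and apart from the GPU count and price range mentioned in the first post, the numbers in the example call are placeholders rather than real quotes:

```python
def on_prem_monthly(hardware_capex, amortization_months, colo_fees, staffing, bandwidth):
    """Monthly on-prem cost: amortized hardware plus recurring costs (all USD)."""
    return hardware_capex / amortization_months + colo_fees + staffing + bandwidth


def cloud_monthly(gpu_hourly_rate, gpu_hours, egress_cost):
    """Monthly cloud cost: GPU hours actually used plus egress (all USD)."""
    return gpu_hourly_rate * gpu_hours + egress_cost


# Placeholder inputs: 40 GPUs at $4,500 each, written off over 36 months.
# Colo, staffing, bandwidth, hourly rate and usage are made-up values to replace.
on_prem = on_prem_monthly(
    hardware_capex=40 * 4500,
    amortization_months=36,
    colo_fees=3000,
    staffing=10000,
    bandwidth=2000,
)
cloud = cloud_monthly(gpu_hourly_rate=1.0, gpu_hours=40 * 200, egress_cost=17000)

print(f"on-prem: {on_prem:.2f}/mo, cloud: {cloud:.2f}/mo")
```

The point of writing it down this way is that staffing and unused capacity show up as explicit line items instead of being left out of the price-tag comparison.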

deep-dev · Nov 4, 2023

Thanks! All good points. The cost analysis favors hosting some of my servers closer to certain customer networks (lower latency) and for high-bandwidth use (several hundred TB/month).

Right, I get that capex and the inability to scale up/down are both negatives for private hosting. System / app management works in my favor with private hosting, and for many customers the added security / privacy of a private data center works in my favor too.

However, I did find and will use one cloud provider for use cases where latency, security/compliance aren't as critical.

I'll post back here after I test across several more GPUs and if anyone can find that useful, I'll post my findings.

unwind-protect · Nov 5, 2023

Aquinus said:
Have you only looked at AWS and Azure for Cloud solutions? Google Cloud Platform (GCP) has historically been priced aggressively compared to AWS. It might be worth exploring. None of this looks out-of-scope for a Cloud provider.

The only way you can figure out if costs are reasonable for a Cloud solution is to know exactly what your egress usage is predicted to be along with the amount of capacity you need at a given time. GCP has different egress pricing compared to AWS for the standard pricing tier. You get 200GiB/mo free of charge. 200GiB-10TiB is $0.085/GiB. 10-150TiB is $0.065/GiB, and 150-500TiB is $0.045/GiB.

Another thing you might want to consider is that if you're using a cloud provider, you can scale resources to the load. You don't need to provision the maximum amount you need out of the gate and eat the costs for everything you're not using. You can't do that if you're buying all the hardware upfront and running it yourself, not to mention you're incurring the running costs of the servers, hardware, and anything needed to keep everything going.

All in all, you need to do a thorough cost analysis of the two solutions. My experience with running multi-million dollar software products in the cloud is that cost wise and operationally, it's cheaper to use the cloud. However, you still need to be mindful of the resources you provision.

If you were one of my engineers telling me this, I would tell you to go write an ADR (architectural decision/design record) and come back to me with that and a cost analysis of the proposed solutions.

I guess the tl;dr is that on-prem hardware has huge costs that aren't always apparent. It's easy to just look at price tags and think it's cheaper, when the logistics tell a very different story. On-prem hardware demands staffing to maintain it. That need is reduced by a lot in the cloud, and some of the biggest costs to a business are the people.

I used AWS GPU instances in $oldjob. My experience was that it was often difficult to get machines in the evening EST because there was too much demand. We used in-house GPUs.

Aquinus · Nov 7, 2023

unwind-protect said:
I used AWS GPU instances in $oldjob. My experience was that it was often difficult to get machines in the evening EST because there was too much demand. We used in-house GPUs.

The solution to that is cross-region HA, which is a good thing to strive for as you scale anyway. You don't want to assume that a single region is going to work without issue (despite what AWS would like you to believe). I wouldn't call that a more expensive solution than taking on all of the costs of bringing everything on-prem, and you definitely don't gain multi-location HA on-prem unless you invest in it. I'm not an AWS guru, but I know that these are problems that can at best be solved and at least be mitigated.

A proper software implementation in the cloud will give you the knobs to turn when you need to turn them. That has been my observation. Also, if you build all of this with something like Terraform it shouldn't matter if you run it on-prem or in the cloud. In fact, do this regardless of what you choose to do. It'll make changing your mind a whole lot easier in the long run.
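One of those knobs, falling back to another region when the preferred one has no GPU capacity, could look roughly like this. It is only a sketch: `CapacityError` and the `launch` callable are hypothetical stand-ins for whatever your provider's SDK or provisioning tooling actually exposes, not a real API.

```python
class CapacityError(Exception):
    """Hypothetical error meaning a region cannot supply the requested GPUs."""


def launch_with_fallback(regions, launch):
    """Try each region in priority order and return (region, instance)
    for the first one where `launch(region)` succeeds."""
    failed = []
    for region in regions:
        try:
            return region, launch(region)
        except CapacityError:
            failed.append(region)  # remember the miss and try the next region
    raise CapacityError(f"no GPU capacity in any of: {failed}")


# Usage with a fake launcher that pretends the first region is full.
def fake_launch(region):
    if region == "us-east":
        raise CapacityError(region)
    return f"gpu-instance-in-{region}"


print(launch_with_fallback(["us-east", "us-west"], fake_launch))
```

The same ordering logic works whether the regions are cloud regions or your own colocation sites, which is part of why keeping the deployment tooling provider-agnostic pays off.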

Solaris17 · Nov 7, 2023

Aquinus said:
despite what AWS would like you to believe

good old US-EAST1

Aquinus · Nov 10, 2023

Solaris17 said:
good old US-EAST1

S3 has never gone down and Textract throughput has never been limited. :roll:

Don't get me started on API gateway payload limits not being what AWS claims.

alexanderm993 · Feb 28, 2024

Your project sounds fascinating, and navigating the complexities of GPU server deployment is no small feat. It's clear you've done a lot of groundwork on this. Your approach towards balancing cost, performance, and bandwidth is spot on, especially considering the unique demands of your 3D application.

Given your requirements, especially around network egress concerns and the potential for high bandwidth without the hefty price tag of services like AWS/Azure, have you considered looking into Seeweb for your colocation needs? While not as widely known as some cloud giants, Seeweb might offer the flexibility and scalability you’re seeking, especially with their supportive approach towards custom solutions.

I understand the dilemma of choosing between AMD and NVIDIA GPUs based on availability and price points. Seeweb has a variety of options that could potentially match your budget and technical needs, and their team is quite responsive when it comes to discussing specific requirements and helping find the best fit.

It's great to hear you're leaning towards Linux, as flexibility in OS can indeed open up more hardware and deployment options. Seeweb supports a wide range of configurations, which could be beneficial as you standardize your setup.

Of course, it’s all about finding the right partner that aligns with your project's specific needs. Just thought Seeweb might be worth adding to your radar as you explore your options. Best of luck with your deployments, and I hope you find the ideal solution for your application!
