Wednesday, March 20th 2024
Tiny Corp. Pauses Development of AMD Radeon GPU-based Tinybox AI Cluster
George Hotz and his Tiny Corporation colleagues were pinning their hopes on AMD delivering some good news earlier this month. The development of a "TinyBox" AI compute cluster project hit some major roadblocks a couple of weeks ago—at the time, Radeon RX 7900 XTX GPU firmware was not gelling with Tiny Corp.'s setup. Hotz expressed "70% confidence" in AMD approving open-sourcing certain bits of firmware. At the time of writing this has not transpired—this week the Tiny Corp. social media account has, once again, switched to an "all guns blazing" mode. Hotz and Co. have publicly disclosed that they were dabbling with Intel Arc graphics cards, as of a few weeks ago. NVIDIA hardware is another possible route, according to freshly posted open thoughts.
Yesterday, it was confirmed that the young startup organization had paused its utilization of XFX Speedster MERC310 RX 7900 XTX graphics cards: "the driver is still very unstable, and when it crashes or hangs we have no way of debugging it. We have no way of dumping the state of a GPU. Apparently it isn't just the MES causing these issues, it's also the Command Processor (CP). After seeing how open Tenstorrent is, it's hard to deal with this. With Tenstorrent, I feel confident that if there's an issue, I can debug and fix it. With AMD, I don't." The $15,000 TinyBox system relies on "cheaper" gaming-oriented GPUs, rather than traditional enterprise solutions; this oddball approach has attracted a number of customers, but the latest announcements likely signal another delay. Yesterday's tweet continued to state: "we are exploring Intel, working on adding Level Zero support to tinygrad. We also added a $400 bounty for XMX support. We are also (sadly) exploring a 6x GeForce RTX 4090 GPU box. At least we know the software is good there. We will revisit AMD once we have an open and reproducible build process for the driver and firmware. We are willing to dive really deep into hardware to make it amazing. But without access, we can't."
Another post provided a behind-the-scenes look at Hotz's diplomatic approach: "I have spoken with AMD on multiple occasions, we have gotten through to top people, and they have been quite nice to us. I believe they want to be more open, and obviously they don't want their driver to have bugs. Unfortunately, this access and responses prolonged this decision, part of me wishes they just said it's a consumer card, you get what you pay for and we could have switched earlier. We probably tried too hard to make it work. We have an amazing team at tinygrad. Someday, we are going to make our own chips, and I figure if we can make our own chips, we better be able to make the 7900XTX software great. But we can't if we don't have access. 
The firmware is complex, undocumented, closed source, and signed, all struggles we wouldn't have with our own hardware. If and when the firmware is open and installable, if we aren't too far along with a different chip, we are down to put resources into writing fuzzers and rewriting whatever needs to be rewritten. The 7900XTX hardware seems great, but we aren't going to put resources into fixing a black box."
Sources:
tinygrad Tweet, Tom's Hardware, Wccftech
36 Comments on Tiny Corp. Pauses Development of AMD Radeon GPU-based Tinybox AI Cluster
Oh no… anyways.
AMD does (did?) this with their GCN-based (CDNA now) products through ROCm, but unfortunately not yet for RDNA. The problem is that AMD does not offer any consumer/prosumer card that can be used for local development (e.g. GPGPU developers using an RTX 3090/4090). The Radeon VII was their pinnacle of success, but unfortunately got hampered by "gaming" reviews.
Eventually they will get there once ROCm is in a good state of support for the RX 7900 XTX and the PRO W7900.
Also, publicly blaming AMD could be a way out if he already took money for systems that are not ready to be sent to customers.
Also the pre-orders were only $100. The only customers who would be "angry" about this are the ones who don't want to use NVIDIA hardware for whatever reason.
He is absolutely right here, AMD's GPU software is a mess, only fanboys would disagree.
He is giving AMD a fighting chance to compete with NVIDIA by kickstarting the grassroots enthusiasm for their chips in ML. And AMD is throwing it away.
Another reason to never buy an AMD GPU, their software team is just too incompetent.
rtx 4060 laptop screen freeze - Google
AMDIntel/NVIDIA graphics drivers. Much like how I blamed ASUS (instead of AMD) for their lack of support of their G15 Advantage (5980HX/6850M XT) with the black screen issue, until they finally released a BIOS update that resolved it.
All AMD needs to do is open source the MES/CP firmware and Geohot can correct any bugs that his fork of the AMD driver is encountering. That's the main problem and why he asked for AMD's help. He cannot identify what issue the GPU is having if he does not have access to those specific GPU components. What's sad is that they allowed this on the Radeon VII (and the Vegas actually) but stopped doing this on the RDNA cards.
I have not looked into this, I don't know exactly what issue he has with RDNA3, if there is a bug in the command processor I don't even know if it can be "fixed" and who knows if that's even the problem. AMD probably has a good reason for not allowing access to it, Nvidia didn't allow it until a year or so ago as well.
I suspect the reason AMD doesn't seem to care much about consumer products in this particular segment is because they plan to change the hardware dramatically anyway, it's clear that they intend to jump on the AI train, RDNA3 still doesn't really have dedicated ML hardware blocks, no reason to overhaul the software when the hardware is likely subject to change.
Raja Koduri's Ellesmere-Polaris and (especially) Fury and Vega *started* CDNA/AI-MI compute @ AMD.
To this day, Vega 10 and Vega 20 cards are some of the best 'budget' options for 'tinkering' with LLMs, etc.
He's well-within his (and his team's) rights to have thought that 3rd Generation Navi could be used for such.
Not to mention, he wasn't told that; AMD historically 'likes to see' new uses for their products, and can more or less crowdsource off them. 'Consumer Hardware' started the AI/MI revolution.
Vega is retaining support largely because of how similar it is to currently-supported CDNA.
We're only two generations off from Raja's last pre-CDNA work, Navi 1x. (Navi 12 is something strange...)
I can't blame Hotz for poking @ AMD when, previously they'd been quite accommodating towards this kind of use.
Ex: I received VII air coolers (for an MI25 mod) from an EHW'r that *still* runs quad VIIs for his work. Quite often, I'm reminded that: Gamers =/= Enthusiasts.
You clearly have no enthusiasm for technology, beyond the FPS and the pretties on the screen...
While NVIDIA does not have a public ISA, you can get really close to direct GPU access using NVPTX (yes, I know it's a VM), but it runs code directly on the GPU.
Intel is doing something similar to PTX with Level Zero, but I haven't seen it utilized on campus yet.
I get that the modern GPU can do a lot more than just graphics, but they were designed to give you graphics, the radiance display engine, high-bandwidth HDMI and DisplayPort for higher refresh rate and FPS.
If you buy a product that's marketed and designed for a purpose (gaming) then complain it can't do the dishes for you, that's a you problem.
1/2 of scientific or tech "news" nowadays is just wishful thinking that never comes to fruition but is published to push clicks.
Intel on the other hand might be willing, if only to help jump-start their own AI cards, but I wouldn't put it past them to eventually split off development and put AI on dedicated cards while forcing restrictions on their consumer GPUs. After all, Intel has done product segmentation before, and they want both the AI and gaming sector.
At any rate, here's a friendly reminder that the black box stuff in question is not related to raster or gaming whatsoever, so don't expect improvements to game performance even if AMD and TinyCorp work out a code-sharing agreement.
Yes, it'll cook salmon, and yes, the 'salmon cooker' variety appliance is cheaper. That doesn't change the fact the hardware itself is both useful and capable of more.
IMO, Tiny was building a Dishwasher out of the Salmon Cooker. They knew full-well that's not the intended use, but were willing to build all the ancillary stuff to make it work reliably.
They're pissed, less because they were actively blocked, and more because AMD cannot 'nail down' an answer or a solution. (denying both Tiny and AMD new marketshare)
I don't think they would ever reach an agreement since Geohot actually wants the MES/CP firmware open-sourced (not just shared with him for use). As I mentioned above (again, lol), that may not be in AMD's best interests since it could mean losing out on their HPC division, since people would just get the cheaper 7900XTX/W7900 and use those instead of the Instinct Accelerators.
The main goal of the tinybox is to have the option of having an on-prem $15,000 compute cluster. The main goal of tinygrad is to have an alternative (and possibly more optimized?) framework to PyTorch (and its autograd).
They made an announcement before partnering with AMD, before qualifying a solution like any competent firm would have. If they want 8 card nodes, they should be using MI210s.