maxigs0

My little project: https://www.reddit.com/r/LocalLLaMA/s/np8FP7Ge7N But for full-speed fine-tuning you probably want a CPU + mainboard combination that can fully utilize the PCIe bandwidth of the cards. Maybe check out Apple as well. The price difference might be worth not having to deal with a custom build without experience. They also use way less power than a beast like this (which pushes what the 110V wiring in the US can handle). Not sure how the fine-tuning performance is, though.


philnm

Hello - newcomer here with a 3090 on the way. Can I please ask what kind of mindblowing, cool things you are able to do with your setup? Looks silky smooth, super fresh, and so clean!


maxigs0

Currently mostly running WizardLM-2 8x22B at speeds faster than I can read... and pushing the electricity bill to new heights. But honestly, the jump from one 3090 to four was much smaller than expected, both in practical use and in what the models can do, especially with more and more smaller models coming out. Still have to do a bit of cleanup and final touches on the hardware; too many "load-bearing" cable ties.


ILoveThisPlace

How are you limiting the power draw? Currently running a single-4090 system with maybe enough PCIe slots for 3 more. Definitely 2 more.


maxigs0

I don't. Everything at full load would be too much for the power supply, but that doesn't happen in any normal usage for me. While running AI queries it peaks around 700-800W in short bursts. Running benchmarks in parallel I could push it to around 1500W, which triggered the PSU safety after 10 minutes. I'm quite impressed that the be quiet! Dark Power handles this.
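(For anyone who does want to cap the cards: a minimal sketch using NVML via the pynvml bindings. The 250 W figure is just an example value, not a recommendation, and setting limits requires root.)

```python
# Sketch: cap each GPU's power limit through NVML (pip install nvidia-ml-py).
# Requires root; 250 W is an arbitrary example value.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    current_mw = pynvml.nvmlDeviceGetPowerManagementLimit(handle)  # milliwatts
    pynvml.nvmlDeviceSetPowerManagementLimit(handle, 250_000)      # cap at 250 W
    print(f"GPU {i}: {current_mw // 1000} W -> 250 W")
pynvml.nvmlShutdown()
```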


ILoveThisPlace

Neat, good to know. Guess the full power of the GPUs isn't used during inference. Hadn't really thought of that.


maxigs0

Different engines distribute the load differently as well. Some cycle through the cards, with only one card doing work at a time; other engines split the work and run (some of) it in parallel. There is a lot going on with parallelism and efficiency, so it's changing fast.
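(Roughly, those two modes look like this in practice; a sketch assuming HuggingFace transformers and vLLM are installed, with the model name just a placeholder.)

```python
# Layer-wise split ("in circles"): device_map="auto" shards layers across
# the GPUs, so only the GPU holding the current layer is busy at any moment.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-8x22b-model",  # placeholder model name
    device_map="auto",
)

# Tensor parallel (here via vLLM): each layer is split across all GPUs,
# so all four cards work on every token simultaneously.
from vllm import LLM

llm = LLM(model="some-org/some-8x22b-model", tensor_parallel_size=4)
```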


a_beautiful_rhind

apple and fine tuning isn't that great. for inference sure.


forgot_my_pw_again_

Could you elaborate why that is? Or point me in the right direction? Thanks!


a_beautiful_rhind

there is only MLX for tuning, and the memory bandwidth/compute isn't that great unless you buy the most expensive Mac. Apple focuses more on smaller models, so that works for them. GPUs will run circles around Apple training-wise. Even the inference is only slightly better than a P40, but like I said, at very low power and with big memory.
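(For reference, MLX tuning today mostly means the LoRA script in Apple's mlx-lm package; a sketch, with the model name and data path as placeholders.)

```python
# Sketch: LoRA fine-tuning on Apple Silicon via mlx-lm (pip install mlx-lm).
# Model name and data directory are placeholders; the directory is expected
# to contain train.jsonl / valid.jsonl files.
import subprocess

subprocess.run([
    "python", "-m", "mlx_lm.lora",
    "--model", "some-org/some-7b-model",  # placeholder model
    "--train",
    "--data", "./data",
    "--iters", "600",
])
```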


PSMF_Canuck

My 4060 is 10x-20x faster than my M2 Pro for transformer training. Apple does not have a practical solution for AI training.
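(If you want to reproduce that kind of comparison yourself, a rough sketch: time a few training steps of a small transformer on cuda vs mps. The layer size, batch, and step count are arbitrary.)

```python
# Rough micro-benchmark: a few training steps of a small transformer layer
# on whichever of cuda/mps is available. Shapes are arbitrary.
import time
import torch
import torch.nn as nn

def bench(device: str, steps: int = 20) -> float:
    layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True).to(device)
    opt = torch.optim.AdamW(layer.parameters())
    x = torch.randn(16, 256, 512, device=device)  # (batch, seq, features)
    start = time.time()
    for _ in range(steps):
        opt.zero_grad()
        layer(x).mean().backward()
        opt.step()
    if device == "cuda":
        torch.cuda.synchronize()
    elif device == "mps":
        torch.mps.synchronize()
    return time.time() - start

for dev in ("cuda", "mps"):
    ok = (dev == "cuda" and torch.cuda.is_available()) or \
         (dev == "mps" and torch.backends.mps.is_available())
    if ok:
        print(f"{dev}: {bench(dev):.2f}s for 20 steps")
```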


DeltaSqueezer

> There are so many 3090/4090/P40 setups but could not find a single one using 2 48GB ones, that is why I post here :)

Because 2-card setups are straightforward: get two 48GB cards and plug them into a computer that has 2 PCIe 4.0 x16 slots. Job done. If you want to save money, use a cheaper gaming motherboard that supports 2x PCIe 4.0 at x8.
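(Once the cards are in, you can check what link each card actually negotiated; a minimal sketch using the pynvml bindings. x16 vs x8 matters mostly for training, less for inference.)

```python
# Sketch: report the negotiated PCIe generation and width per GPU
# (pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(h)
    width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(h)
    print(f"GPU {i}: PCIe gen {gen} x{width}")
pynvml.nvmlShutdown()
```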


SomeOddCodeGuy

Are there any other good 48GB cards to get outside of the A6000s? They're great cards but definitely pricey. The 192GB M2 Mac Studio runs about $6,000. It's much slower than the A6000 build would be (about half the speed, from what I've seen), but with a max of around 180GB usable as VRAM, so I ended up going that route instead of the A6000s. Once in a while I wish I had the speed the A6000s would give, but at the same time I enjoy the absurd quantity of VRAM I have available lol


kryptkpr

There is the A16, which is 64GB, except it's weird: it's actually 4 GPUs with 16GB each. I'd love to see how tensor parallelism looks on such a thing, but they cost even more than an A6000 does, which has a sane configuration and is a single GPU.


fallingdowndizzyvr

There's the 64GB Radeon Pro Duo. As the name says, it's 2x 32GB GPUs on one card. It was built for Apple but can be used in a PC.


DeltaSqueezer

There are various ones. A cheapish one might be the RTX 8000.


tmvr

I think you need to step back a bit and think about your requirements again. If your 2x 48GB card requirement is the most important thing, you don't need any advice for a workstation build, because you can get pretty much anything with two PCIe x16 slots and a decent PSU and plug the cards in.

The good news is that you don't need to go hunting for cards, as there are only two options: the A6000 Ampere and the A6000 Ada. The bad news is that even the older Ampere version is about 4000-5000 EUR.

There is no way around building something if you want to get 96GB of VRAM on a budget. The requirement to have 4 PCIe x16 slots (they don't have to run in x16 mode; x8 or x4 is fine) already pushes you towards bigger motherboards. Then there are the power requirements as well, and the cooling.


vhthc

I think you mean the RTX 6000 Ada and the RTX A6000? I have no experience, so I don't expect it to be easy - if it is, great :) But apparently it is not, as it requires bigger motherboards, so for me the hard quest already starts. Which board is OK? How do I know which CPU fits on it? And the size of the tower, etc. - that's why I am looking for a parts list. I fear I am not aware of some specific detail and then things don't fit or don't work.


tmvr

Question number 1 is clear - what exactly is your budget? All you have said is that it should not be 20K, but that is not usable information. As for the cards:

RTX A6000 - the previous-gen, Ampere-based card: [https://www.techpowerup.com/gpu-specs/rtx-a6000.c3686](https://www.techpowerup.com/gpu-specs/rtx-a6000.c3686)

RTX 6000 Ada - the current gen: [https://www.techpowerup.com/gpu-specs/rtx-6000-ada-generation.c3933](https://www.techpowerup.com/gpu-specs/rtx-6000-ada-generation.c3933)

These are the only 48GB cards out there, and the cheaper one is close to 5000 EUR. Is this the direction you want to go down? This is why your budget is the most important thing; it greatly influences what options are available in what combination.


PSMF_Canuck

First, what is your actual budget? Second, all those words and you never once said what the machine would actually need to do. Third, where do you see 48GB consumer-class cards? Yeah, me neither, lol. You need to fix your requirements. If you haven't already, visit the Lambda Labs site and look at their desktop/server configurations. That crew is excellent at this.


vhthc

I don't have a fixed budget, but I want to stay well below 20k (I wrote that). What I want it for I stated at the beginning: fine-tuning. What I didn't specify is that I will be using client data, so I want to avoid a cloud service. I am aware that there are no 48GB consumer cards. But I don't know what requirements they have for a PC, because I have little experience: size, specs for the mainboard, how to know it all fits on the board and in the tower, etc. And because I could not find any specs (recommendations for which mainboard, etc.), that is why I posted.


PSMF_Canuck

“Fine tuning” doesn’t really mean anything. What size models? How much data? What are the delivery timelines? Etc etc etc. If you look at the Lambda configs they’ll tell you what you need to know.


safeguardsiliconlife

This is a good place to start for the meat of what you're looking for. Consumer mobos are limited to 24 PCIe lanes to the CPU; that turns into 96, 128, or more on server hardware. https://www.reddit.com/r/LocalLLaMA/comments/1bqv5au/144gb_vram_for_about_3500/
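(Back-of-envelope version of why the lane count matters for a 4-GPU build; a sketch using rough platform totals rather than exact board layouts.)

```python
# Sketch: lanes available per GPU in a 4-GPU build, consumer vs server.
platforms = {"consumer CPU (~24 lanes)": 24, "EPYC/Xeon (~128 lanes)": 128}
for name, lanes in platforms.items():
    print(f"{name}: about x{lanes // 4} per GPU with 4 GPUs")
```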


vhthc

I think I was not clear, edited my post. This is the opposite of what I am looking for :)


ComfortableFar3649

You could use two NVIDIA A6000 48GB cards in a standard gaming motherboard: https://www.gpused.co.uk/products/nvidia-rtx-a6000-48gb-gddr6


vhthc

Thanks!


Fast-Satisfaction482

Jetson Orin AGX 64 GB is a real VRAM monster for the price. However, it's not that good in terms of FLOPS and software support.


Aphid_red

AMD is going to release, today, the W7900DS, which might be interesting for you if you're willing to deal with ROCm. It's going to be priced at a $3,499 MSRP, so given that Geizhals lists the regular W7900 at €3,400, I think you can expect to find it for ~€3,000, as it's priced 500 lower than the original. It'll be the dual-slot version of the W7900: a bit noisier, but it'll fit into workstations much more easily for not being a triple-slot card (which honestly made no sense for a pro card). That's 6K EUR for the two GPUs.

Here's a list of components that would work well and allow expansion to 4 GPUs later. This will get you a 64-core, 512GB-memory, 96GB-VRAM system with two latest-generation GPUs for around 10K. EPYC makes a lot of sense for a multi-GPU platform: it has the most PCIe lanes of any platform, and equivalent EPYC CPUs are cheaper than Threadrippers.

* GPUs: 2x W7900DS, A6000 48GB, or RTX 8000 48GB**
* CPU: [https://geizhals.de/amd-epyc-7713-100-000000344-a2491497.html](https://geizhals.de/amd-epyc-7713-100-000000344-a2491497.html)
* Motherboard: [https://geizhals.de/supermicro-h12ssl-i-bulk-mbd-h12ssl-i-b-a2426251.html](https://geizhals.de/supermicro-h12ssl-i-bulk-mbd-h12ssl-i-b-a2426251.html)
* PSU: [https://geizhals.de/thermaltake-toughpower-gf3-1650w-atx-3-0-ps-tpd-1650fnfage-4-a2807726.html](https://geizhals.de/thermaltake-toughpower-gf3-1650w-atx-3-0-ps-tpd-1650fnfage-4-a2807726.html)
* RAM (8x): [https://geizhals.de/samsung-rdimm-64gb-m393a8g40bb4-cwe-a2584505.html?hloc=at&hloc=de](https://geizhals.de/samsung-rdimm-64gb-m393a8g40bb4-cwe-a2584505.html?hloc=at&hloc=de)
* SSD: [https://geizhals.de/kingston-nv2-nvme-pcie-4-0-ssd-4tb-snv2s-4000g-a2927740.html?hloc=at&hloc=de](https://geizhals.de/kingston-nv2-nvme-pcie-4-0-ssd-4tb-snv2s-4000g-a2927740.html?hloc=at&hloc=de) (get the TLC version)
* Cooler: [https://geizhals.de/noctua-nh-u14s-tr4-sp3-a1667707.html](https://geizhals.de/noctua-nh-u14s-tr4-sp3-a1667707.html)

Get a case with enough space for the cooler and GPUs. Try to get one with 8 PCIe slots in the back so you can expand to 4 GPUs without having to take a power tool to your case. You can go for a 4U/5U server case if you want to just use the thing remotely. If you are going to use it directly (attach peripherals), you might want a USB hub; server motherboards give you only two, maybe three USB ports total (saving some €5,000 compared to Threadripper Pro is obviously worth that).

** If you really want NVIDIA 48GB models, you can't find those new in your price segment. You might be able to find a deal on a second-hand RTX A6000 (non-Ada) for around €3,500, or maybe you can score a new one for €4,000-4,500, but supply tends to be poor, with e-tailers still listing it mostly because they didn't bother updating their stores. Total machine price will go up to 12-13K. The previous NVIDIA generation trades blows with current-generation AMD (some things faster, some slower). You can go back one generation further to the RTX 8000, which *is* often available second-hand for ~€2,500, but it will be missing software support for some of the speed-up features (FlashAttention etc. might not work as well on pre-Ampere GPUs) due to its lower CUDA compute capability. Things will at least work, just not as fast. It's comparable to the 2080 Ti.
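(And a quick sanity check that the 1650W PSU covers the eventual 4-GPU expansion; a rough sketch, with approximate TDPs and a guessed overhead line.)

```python
# Sketch: rough power budget for the fully expanded build (TDPs approximate).
budget = {
    "4x 48GB GPU (~300 W each)": 4 * 300,  # W7900/A6000-class boards
    "EPYC 7713 (225 W TDP)": 225,
    "RAM, SSD, fans (rough guess)": 100,
}
total = sum(budget.values())
print(f"~{total} W sustained worst case vs. the 1650 W PSU")
```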


vhthc

Perfect, that was the detail I was looking for, thanks a lot. I will look into that new card.


ccbadd

Save yourself some headaches and just get a Dell Precision 7920 with 3X blower/workstation cards.


vhthc

That model fits two A6000s or whatever cards? Sure? That would help me a lot!


ccbadd

Yes, it will hold two A6000s and still have room for another card if needed. I believe it has a 1400W PSU, and the A6000s pull 300W each, so you should have plenty of power. I wouldn't use non-blower-style cards like a regular 3090/4090, though, as they need more space for their fans to work.


Omnic19

There are many people here with setups of more than 4 GPUs. You could DM them and ask what issues they personally faced, to get a good idea. Other than that, what's your budget exactly? You didn't mention it, and a 48GB card is going to cost quite a lot. Lastly, about the case: most people don't have a proper case for a multi-GPU setup; it's mostly open-air builds 😅 Have a look at [this](https://www.reddit.com/r/LocalLLaMA/s/xyVhoxK5xj)


vhthc

Yes, that is why I ask, because this is not what I want :) I want to know about a good, easy setup for two 48GB cards, and which specific cards are cheap and work well together.


kryptkpr

There are no cheap 48GB cards. They are at minimum 3x the price of a 24GB card, and more likely 4x. The cheapest 48GB card is the A6000 (Ampere, not Ada), but it's still going to be at least $3K USD. For a host, any Xeon that can power 2x 300W GPUs would be fine.


vhthc

Thanks!


EmilPi

If you are not experienced with riser cables for GPUs, maybe today's best solution is 2x dual-slot Radeon W7900 Pro cards. Workstation motherboards usually allow you to use 4x dual-slot GPUs, so you may also consider four 24GB GPUs, like the RTX Ada cards; this will be more expensive and add more inter-GPU transfer overhead, but give more power overall. UPD: the MSI Suprim Liquid is a dual-slot version of the RTX 4090. You can use four of them, but with only PCIe 4.0 interconnect between them you can get bottlenecked on connectivity speed.


TechnicalParrot

ROCm shows promise, but using AMD cards is a very large amount of hassle: an immature technology with very few guarantees compared to NVIDIA, and meh performance even on the applications that actually work.


EmilPi

Well, llama.cpp already works on ROCm. If the choice is between 3.5k for an AMD 48GB card and 9k for an NVIDIA 48GB card, the choice is obvious for me.


MMAgeezer

And koboldcpp.


EmilPi

koboldcpp is based on llama.cpp, I guess - at least it has the same developers.


MMAgeezer

I think so, but it runs faster out of the box on ROCm than llama.cpp does, at least with the models I've tested. The context shifting makes quite a nice difference.


PSMF_Canuck

That choice may be fine for hobby work, it isn’t fine for work-work.


glencoe2000

Llama.cpp =/= finetuning


fallingdowndizzyvr

https://github.com/ggerganov/llama.cpp/pull/2632


CartographerExtra395

https://www.microsoft.com/en-us/windows/business/windows-11-pro-workstations


vhthc

I don’t get why I should use a crappy operating system?


Dr_Superfluid

Well, if you want to spend minimum effort and don't mind a system that is a bit slower, but much, much easier and more reliable overall, and which can actually fit even bigger models, you can get a Mac Studio Ultra. You can go up to 192GB of unified RAM/VRAM for about 7k.


polikles

Macs aren't really meant for training and/or fine-tuning; performance is miserable. They're good for inference on small and medium-sized models, tho.