Heat death: how to kill your gpu in less than a year

edited February 2014 in General

Hi all,

(Note, I have references here to limiting fan speed to 85%.  I've changed my limits to 65% as of 02/2014.  It's too much trouble to change every reference here.  But, I did change the config file.  Ron)

I've been mining Litecoin off and on since April 2013, using gpus.  I wanted to post some info about killing (or not) your gpu fans.  Basically, if the fans fail, your gpu is out of commission as a miner.  The problem is that gpus are designed to run a few hours per day at heavy load.  They are not designed for 24/7 operation with the fans at 100%.  There is a little blurb about this in the cgminer gpu readme file.  I have personal knowledge that running the fans at high speed on a couple of particular cards will kill them in less than a year.  I've had this happen on MSI 7850 cards with the Twin Frozr III cooling system.  RMAing these is a pain, and I have to pay shipping, so I'm phasing them out of my system.  I think their 7870 cards have the same design.  (Actually, MSI's warranty procedure is fairly painless, but uninstalling, packing, shipping, and reinstalling is a pain.)  Not only that, I lose mining capacity in the meantime.

In my opinion, what you want to do is run the card at a heavy but reasonable pace.  I have my cards set to throttle up the fan and back off the gpu clock as needed to maintain 85 deg C.  I have my upper fan limit set to 85%, as suggested in the cgminer gpu readme.  The electronics of the card should be able to run at 85 deg C for a very long time.  However, the FAN WILL NOT SURVIVE running at 85%, much less at 100%, for very long.  The fan is the weak link.  So, unless you want to keep RMAing cards until the factory tells you to stop, or to prematurely junk cards that are out of warranty, you HAVE TO PROTECT THE FAN(S) FROM PREMATURE DEATH.

I
should say that I have my cards in a conventional computer enclosure.  I
may experiment with open rigs later, but that's not the situation I'm
in now.

I'm totally making the following numbers up.  But, say the fans were designed to run 4 hours per day at high speed.  If you're running 24/7, which is 6 TIMES more, the fan will fail 6 TIMES sooner than its original design life.  If the design life is 3 years, that works out to a failure after about 6 months, which is about what I experienced.
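
To spell that arithmetic out with the same made-up numbers:

24 hr/day actual / 4 hr/day design = 6 times the design duty cycle
3 year design life / 6 = 1/2 year, or about 6 months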

The problem is the air gap between the cards.  Most
cards have the fan intake on the side.  When you put two cards side by
side, there is very little air gap between cards to allow air flow.  If
the motherboard is vertical, as in a tower case, any card which is on
top of another will not be happy.  According to the ATX spec which I
looked up, there should be 1.6" between two double width cards (circuit
cards).  If the cards are almost 1.6" thick, there will be NO ROOM for
air intake for the fans.  This will make the fans ramp up to maximum
velocity striving to draw in some air and maintain the temperature
target.  Thus, the fans wear out.  The thinner the card and the more efficient its cooling system, the longer the fans will last. 
Generally, the bottom card next to the power supply will have more air
flow anyway.  If you have a card that you know is weak, you can stick it
down there.  That way, it can work with less fan velocity.

When I
was looking around for new cards, I found one Gigabyte card specced at 1.7" thickness.  That one may not even fit on the motherboard beside another card.  Even if it did, it would get no air.  I'm not
buying that one.

I'm phasing out the MSI 7850 Twin Frozr and 7870
Twin Frozr cards.  They are very thick.  When in adjacent slots, they
have very little space between them and they kill their fans quickly.

I
have some Asus 7850 cards with their DirectCU cooling system.  These
cards are noticeably thinner than the MSIs, and the fans run noticeably
slower, say in the 40-50% range, to keep the card cool even when they're
adjacent.  Even then, there is a visibly larger air gap between them.  I
haven't owned them long enough to know when they fail.  But, since the
fans are running much slower, it should be much later.  If you give them
an even bigger air gap, they love it.  I currently have one running
overclocked from 860 MHz original to 1000 MHz, with several inches of
air next to it, running at a cool 70 deg C and 20% fan speed (which is
my minimum).

In considering new cards, I would stick to
manufacturers with a 3 year warranty or more.  That pretty much limits
things to Asus and MSI as far as I know.  XFX has some sort of lifetime
warranty but it seems like there are lots of catches when you read the
fine print.  If you want to put cards adjacent to each other in a
motherboard, don't get anything thicker than 1.5" (rounded off number). 
Thinner is much better.  I would never buy a card for mining without
dual fans.

Based on reading specs only, but not experience yet, I like the new Asus and MSI R9 200 series cards.

The Asus (R9270) cards have the DirectCU II cooling system.
The MSI (R9270) cards have the Twin Frozr IV cooling system, a version upgrade compared to the cards I own.

I
did a subjective comparison of an older MSI 7850 and an Asus 7850 by
running the fans on each one up to 100% one at a time.  To me,
subjectively, the MSI card seemed louder.  I did not do any
measurements.  However, if you keep the fans at 30% - 50%, this will be a non-issue.

Note that I'm NOT recommending that you LIMIT the fan to 50%.  I'm still setting my upper fan limit to 85%, and cgminer will bump it to 100% if the temperature exceeds the overheat number.

I AM recommending that you buy a thinner card and give it enough air space, if possible, or overclock it less, so the fan never has to exceed 50% or so to keep the card cool.

If you look at your card in gpuz, you can
find the default clock frequency, always a good number to keep in
mind.  You can also look at the specs.  If you're running Linux with the ATI driver, you can issue this command:

aticonfig --adapter=all --odgc

That
stands for overdrive get clocks.  It will give you a list of the
current clock settings and loads.  It will also show you the adjustable
range for that card.  I don't know how rigid that is, but it is at least
a guideline.
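
There's a companion query for temperature.  I'm going from memory of the same Catalyst driver here, so verify the flag with aticonfig --help on your version:

aticonfig --adapter=all --odgt

That stands for overdrive get temperature.  It's a quick way to check every card without pulling up the cgminer screen.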

The Asus 7850s, for example, have a default clock of 860 MHz and an adjustable range of 300 - 1050 MHz.  I normally
run them at the max of that range, adjacent to each other, and they have
no problems at all.  The fans on the hottest card are running about
50%.  That's not as good as 30%, but it's much better than 85%.

Note also that I have my minimum gpu engine speed set to 300 MHz.  YES, I'm
giving it permission to clock down that low.  I want cgminer to do
whatever it takes to avoid frying the card's guts.  Wearing out the fans
is one thing.  Wearing out the circuit card or components is quite
another.  I've seen an example of this underclocking in action.  Today I
had a fan failure on an MSI 7850.  I was looking at GPUz and noticed
the card clocking at 500 MHz.  The fan was still at 85%, and the typical
clock speed would have been around 1000 MHz.

Well, I discovered
that one fan had failed, essentially half the cooling system.  So, the
card was severely underclocking to keep the temperature below 85 deg C. 
That's exactly what I wanted it to do.  This is an advantage of using
the gpu-engine command in cgminer.  It can overclock and underclock, if
you let it, to protect the card.  It will completely shut down the card
if it exceeds the temp-cutoff number.
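
For reference, here are the lines from my config file (posted below) that drive all of this behavior, trimmed to a single card.  The notes in parentheses are my annotations, not part of the file:

"auto-gpu" : true,             (let cgminer adjust the engine clock)
"gpu-engine" : "300-1050",     (allowed clock range; cgminer roams within it)
"temp-target" : "84",          (clock and fan are adjusted to hold this)
"temp-overheat" : "90",        (above this, the fan is bumped to 100%)
"temp-cutoff" : "95",          (above this, cgminer shuts the card down)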

This turned out to be longer than I thought, so I don't know what I'll post later.

For those who are interested, here's my config file.  By the way, for those still running cgminer from the command line or a batch file, it's totally worth learning the config file.  Configuration is much easier that way.  You don't have to write it from scratch.  Get it running with a command line first in interactive mode, then use the Settings, Write command to write out a config file with the current settings.  You can then edit that.  It's best to start with a multi gpu setup so you know how that's structured in the file.  I've had problems doing gpu engine and fan control with multiple independent cgminer instances.  I recommend you run all gpus from one cgminer.
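
To sketch that workflow with my own pool settings (the filename cgminer suggests when writing may differ by platform):

cgminer --scrypt -o stratum+tcp://ltc-eu.give-me-coins.com:3334 -u USERNAME -p PASSWORD --auto-fan --auto-gpu --temp-target 84 --gpu-engine 300-1050 --gpu-fan 20-65

Once it's running, press S (Settings), then W (Write config file), and accept or edit the file name.  Quit, edit the file to taste, and from then on launch with:

cgminer -c cgminer.conf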

In this config file, the last value in each multi-card parameter corresponds to the card driving my monitor, so it's set differently.  If you cannot tell which card is which, drive the fans on each one to 100% one at a time while the others are at 20% or so.  Physically touch the FRAME of the card and feel which one has vibration.  DO NOT TOUCH THE FAN WITH YOUR FINGER OR STICK ANYTHING INTO THE BLADE.  Note which slot in the config file controls that card and which instance of gpuz (for example) monitors that card.  Note which items can take multiple values and which cannot.  For example, most of the temperature limits take multiple values, but temperature hysteresis does not.  Note that many of these parameters were generated by cgminer when writing the file and I have not edited them.  I have set the lines that I edited to bold.  If you add or delete cards, you will have to edit every multi-value line, whether you originally edited it or not, adding or deleting values to match the number of cards.

{
"pools" : [
    {
        "url" : "stratum+tcp://ltc-eu.give-me-coins.com:3334",
        "user" : "USERNAME",
        "pass" : "PASSWORD"
    }
]
,
"gpu-reorder" : true,
"intensity" : "19,19,15",
"vectors" : "1,1,1",
"worksize" : "256,256,256",
"kernel" : "scrypt,scrypt,scrypt",

"lookup-gap" : "0,0,0",
"thread-concurrency" : "0,0,8192",
"shaders" : "0,0,0",
"gpu-engine" : "300-1050,300-1050,300-1000",
"gpu-fan" : "20-65,20-65,20-65",
"gpu-memclock" : "0,0,0",
"gpu-memdiff" : "0,0,0",
"gpu-powertune" : "0,0,0",
"gpu-vddc" : "0.000,0.000,0.000",
"temp-cutoff" : "95,95,95",
"temp-overheat" : "90,90,90",
"temp-target" : "84,84,84",
"api-mcast-port" : "4028",
"api-port" : "4028",
"auto-fan" : true,
"auto-gpu" : true,
"expiry" : "120",
"gpu-dyninterval" : "7",
"gpu-platform" : "0",
"gpu-threads" : "1",
"hotplug" : "5",
"log" : "1",
"no-pool-disable" : true,
"queue" : "1",
"scan-time" : "30",
"scrypt" : true,
"temp-hysteresis" : "3",
"shares" : "0",
"kernel-path" : "/usr/local/bin"
}
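
As an aside, one way to do the 100% one-at-a-time identification test described above is to temporarily edit the gpu-fan line.  For example, on a three card system like mine, something like this pins the first card's fans high and the others low (you may also want to set auto-fan to false while testing so cgminer doesn't adjust them):

"gpu-fan" : "100,20,20",

Restore the real line when you're done.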

I hope this is helpful to others with similar issues and concerns.

Sincerely,

Ron

Comments

  • PS, High velocity case fans are your friends.

    Good:
    Corsair Air Series AF120 Performance Edition CO-9050003-WW 120mm High Airflow Case Fan
    http://www.newegg.com/Product/Product.aspx?Item=N82E16835181022

    Better:
    Cooler Master JetFlo 120 - POM Bearing 120mm Blue LED High Performance Silent Fan for Computer Cases, CPU Coolers, and Radiators
    http://www.newegg.com/Product/Product.aspx?Item=N82E16835103192

    Ron

  • I've overclocked my XFX 7850 to 960 MHz and the mem to 1450 MHz. I'm only seeing the fan top out at 76% and the temp at 72. Am I good to let her rip? Also, I have access to a 68 degree server room and I can vent air directly on the GPU.

  • Hi gn3tlc,

    Some of the answers you want are subjective and subject to individual preference.

    All I can say is that I killed the fans on a couple of MSI 7850s that were adjacent to other cards in less than a year, running them at full speed, even with a couple of months off mining when it wasn't profitable.  I'm trying to rearrange my cards so that the thinner, more modern cards with better cooling systems sit adjacent to each other, while the fatter cards with lesser cooling systems either sit next to a larger air space, so they can get air and run their fans slower, or get sold.  Since the fans usually have sleeve bearings, they don't have a long lifespan at high speed.

    I'm personally OK with my cards running at 84 degrees C, but everyone's preference is different.  If you raise the temperature limit or lower the overclock, the card can meet those criteria and still run the fans slower.  I'm assuming you're running the system in a case and not an open frame.  If you have an open frame, leave 2" of air space next to each card and they'll be much happier.  At this point, I'm personally trying to keep all my card fans running at less than 50%, even if I have to add more high velocity fans to the case.  That number is arbitrary, and I cannot predict the effect of running at 70%, for example, other than that the fan will expire sooner.

    If the fans fail on the card, it's essentially out of commission, although you might be able to run at 1/2 clock speed with one dead fan out of two.  I looked and looked and didn't really find a good way to replace them myself.  Yes, there are some things that can be hacked together, often with parts that don't really fit.  Yes, you can get a complete aftermarket cooler with bigger fans and a bigger heat sink, but that doesn't fit in a two slot space.  Also, both those options void the warranty.

    So, I'm keeping one MSI 7850 that's still functional down by the power supply where there's an air gap and where its fans can run slower with nothing adjacent to it.  Two others, which I will soon have back from RMA from the factory, I will sell after they've been refurbished.

    Blowing cool air on the cards, if you have that option, is a good idea.  Set up automatic gpu engine and gpu fan control as described in this thread.  Set the parameters as you see fit.  Don't force the fans to run slow.  Give them conditions so that they can run slow, but still speed up if required.  Allow underclocking if necessary to maintain the temperature target.  On my cards with more modern cooling systems, and / or a thinner profile and more air gap when adjacent, I'm able to overclock slightly and still keep the fans below 50% while the cards are running at 84 degrees.

    Monitor fan speed, load, temperature, and clock speeds.  If using Windows, you can do this with gpuz:

    http://www.techpowerup.com/gpuz/

    If running linux and using the ati drivers, you can do this with the aticonfig command (for clocks, load, and temperature) and with cgminer itself (for fan speed).
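
    For example, something like this keeps those readouts on screen, assuming the standard watch utility is installed:

    watch -n 10 'aticonfig --adapter=all --odgc; aticonfig --adapter=all --odgt'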

    If you see a dramatic increase in fan speed, or a dramatic decrease in clock speed, it's likely that the card is not cooling properly and you'd want to check the fans.  Do NOT stick anything, even a piece of paper, near the fan blade to test it.  You have to take enough cards out of the case to see the fans on that card visually and then take action if you have to.  Another advantage of running cgminer in auto-gpu mode is that it will disable a card if it severely overheats.

    Sincerely,

    Ron

  • edited January 2014

    Consider opening up your case to allow more fresh air to circulate.  In the summer you may have to point a desk fan at your gpus.  Or else get a crate from Staples and go open air.  That will bring up the life expectancy of your gpu fans by allowing full air circulation.  Here's a pic of a recent one I took out of the case.

    That brought down the temps of the gpu by 10C.  You can also replace the thermal paste on the heatsink with some better stuff, say Arctic Silver 5 or Arctic Cooling MX-4.  That will allow for better heat transfer from the gpu to the heatsink.  I also took it one step further and replaced the thermal pad for the VRMs, seeing that there have been lots of problems with VRMs blowing out.  Also, a side note on aftermarket coolers: if the memory chips don't get contact with the main heatsink, expect your gpu to have a hard time.  You won't be able to overclock due to the fact that your mem isn't being adequately cooled.  Well, happy mining :)
  • Hi Lathanium,

    Thanks for the tips about doing an open PC.  If I build another, I'll consider that.  However, if the cards are on a conventional motherboard with tight spacing and no risers, I think they're probably better off in an enclosed case with high velocity case fans.  I believe that will remove heat faster than just sitting in open air.  I don't want to mess with the heat sinks on the cards, as that would void the warranty.  For now, I'm getting the cards refurbed by the factory and then selling them.

    Sincerely,

    Ron

  • edited January 2014

    Hello,


    I have two Asus 280X DC2-3GD5 cards. They run at 1070 MHz, 1100 mV. They are in a conventional PC case with the side open and two Silent Wings 2 140 mm fans blowing on them. I left the fan settings at their defaults, which gives about 70°C and 1600 rpm on the fans.
    They are very stable and silent. I can watch TV or surf the internet close to them and not be bothered by the noise.

    On this subject, beware of Sapphire cards, which have very high after-sales return rates.

    Of course, to lower long-term fan failures and maximise the life of the graphics cards, it's advisable to switch to an open-air chassis with risers to provide enough space between the cards.
  • PS to my original posting at the top.

    @Lathanium, I was wondering what kind of risers you use in your open air system.  I've read that you can use pcie 1x risers and that doesn't adversely affect your hash rate.  Do you know if that's true?  Also, do your risers have a separate power connector?  Where do you get them?

    To all, In doing further research, I found that Gigabyte apparently offers a 3 year warranty.  As I understand it, that means MSI, ASUS, and Gigabyte all offer that.  Also, as I understand it, I believe they all only require a serial number for warranty claims.  So, I might consider purchasing from any of them, if the cards met my thinness criteria.

    FYI, I've found out that, with MSI at least, when you file a warranty claim, they may send you a different card, in a different box, with a different serial number.  You won't be able to file another warranty claim without their approval, though initial approval is automatic.  I did have to send one card back twice because the fan control system still had problems.  I requested the RMA, and they answered back a few days later with an approval.  However, if you try to send in a card again for the exact same issue, they might not let you.

    In rereading the config file I posted above, I noticed that the kernel path is of the Linux variety.  This would be different on a Windows system.  That's why it's a good idea to create the config file first from a running instance of cgminer, then edit it.  That way, you know it's right for your system.

    Sincerely,

    Ron

  • While doing research last night, I discovered some new kind of risers
    that are pcie 1x going to a pcie 16x slot (but with 1x bandwidth) and
    they are connected by usb3 cables which can be up to 1 meter long.  They
    look pretty cool.  They're not actually usb3 devices; they just use usb3 cables.  This thread talks about them:

    https://bitcointalk.org/index.php?topic=365181.0

    I'll probably look into those if I go to an open frame design.

    Sincerely,

    Ron
