Graphics Driver Headaches

Nothing like computer problems when you’re trying to wrap up your dissertation to really make your day…

Quick tl;dr version: I’ve had a lot of problems with NVIDIA Linux drivers 337.x and even more with 340.24, at least on dual Quadro K600 cards driving four displays using basemosaic. No such problems under 331, but Debian didn’t package the 331 update that supports the current xorg-server ABI. I took their 331.79 package and updated it to the upstream 331.89 release, which fixes a few bugs, adds two PCI IDs, and adds support for the newer xorg-server ABI. Here is the source package, along with the build results from my i386 and amd64 jessie cowbuilders. Use at your own risk, but they work great for me so far. You’ll have to pin the versions, otherwise apt will try to upgrade the packages after you install them.
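For the pinning, here is a minimal sketch in Python that renders apt preferences stanzas. The package names are illustrative guesses rather than the full list (check what you actually have installed), and the priority of 1001 is my assumption: anything above 1000 tells apt to stick with the pinned version even if that means a downgrade.

```python
def pin_stanzas(packages, version_glob, priority=1001):
    """Render one apt preferences stanza per package, separated by blank lines."""
    stanzas = []
    for name in packages:
        stanzas.append(
            f"Package: {name}\n"
            f"Pin: version {version_glob}\n"
            f"Pin-Priority: {priority}\n"
        )
    return "\n".join(stanzas)


if __name__ == "__main__":
    # Illustrative package names (an assumption, not the complete set).
    example_packages = [
        "nvidia-glx",
        "nvidia-kernel-dkms",
        "libgl1-nvidia-glx",
        "xserver-xorg-video-nvidia",
    ]
    print(pin_stanzas(example_packages, "331.89*"), end="")
```

Saving the output as a file under /etc/apt/preferences.d/ (any name you like) and then running `apt-cache policy nvidia-glx` should show whether the 331.89 version is the candidate.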

The standard disclaimers (no warranty, not responsible for problems, crashes, nasal demons, etc) apply: use at your own risk.

After the jump: deb packages, source packages, source debdiff, problem description, process, and lessons learned.

AMD64 debs

i386 debs

Other files, including source package

tl;dr package details: I didn’t bother making a separate “legacy” 331 branch though that would be nice. Sorry, no source control – upstream uses SVN and I couldn’t be bothered to convert it to git. Packaging changes over 331.79 are basically trivial.

Background: My computer at work (which, despite these headaches, is really nice and I’ll miss when I graduate) is a Dell Precision T3600, with a quad-core, hyperthreading Xeon E5-1620 3.6 GHz, 8 GB of RAM, and dual NVIDIA Quadro K600 graphics cards powering a combined four 20″ displays.

I’d been running Ubuntu Linux 12.04, when something went wrong somehow and I had data integrity issues that the Intel “fakeraid” didn’t notice or fix, leading me to have no graphics no matter how hard I tried. 12.04 had been a pain to install (and boot, somehow – it’s like the BIOS/UEFI gets easily confused) with an attempted dual boot on this fakeraid, but I had gotten it installed. (I’ve been using Linux for well over 10 years now, and that was the toughest install I had – and I’ve used tomsrtbt and installed Slackware, ancient Red Hat, and old boxed SuSE on vastly under-powered surplus machines with floppy drives and EISA SCSI controllers…)

When I decided I had to reinstall, I went for 14.04, figuring that as long as I had to start over, I’d go for the latest long-term release. Well, the desktop installer didn’t even want to admit that the computer had hard drives – according to some docs on Launchpad, looks like they planned to transition from dmraid to mdadm for fakeraid management, but stopped halfway and ended up with neither working. Apparently they also decided that the “alternate installer” ISO (the text based one that I typically used) was redundant and eliminated it, so I lacked that option to more carefully set it up manually. By this point I was frustrated and fed up with Ubuntu. I was still a bit intimidated by the idea of directly running Debian Testing, which I realize was totally unreasonable now that I think back to some of the first Linux experiences I’d had. I had been using Cinnamon from a PPA on Ubuntu, and I had read that Linux Mint Debian Edition was more like a de-fanged Debian Testing than whatever Linux Mint is, so I went for it. I will say that I appreciated its “OK, you picked manual partition, please set up your partitions how you like, configure this file if you need a module in initramfs, and click continue when done” approach for letting me set up my drives in that it actually worked. However, these K600 cards are pretty new and need a pretty recent driver – which LMDE didn’t have yet. I grabbed the driver from Debian Testing and carried on. (I was using DKMS so the actual things that aren’t binary were getting compiled for my system automatically, so I didn’t worry there. Anyway, LMDE was supposed to be just a testing snapshot, not far from the “real thing”)

I noticed some flickering, and I like keeping things up to date, so I updated when a 337 came out with a Debian package. Then things started going weird: if I got too many Chrome tabs open, they’d turn out fully black, or even unpainted. Sometimes the display would even hang. I could reboot over SSH, so it was only a little annoying, and it typically meant I had too many tabs open anyway, so whatever, I dealt with it.

I saw that a new “long-term stable” driver for Linux had come out from NVIDIA: 340.24. I updated, hoping it would fix the issue. Instead, it made it much worse: dragging a tab out from Chrome to form a new window was a sure-fire way to lock the graphics. Other times it would just lock on its own with no apparent proximate cause. Sometimes I’d still be able to move the mouse cursor but nothing would respond, and my only way out was an SSH reboot.

I started getting tearing/incomplete refreshes in apps when scrolling, especially in GTK-based apps (as I’m using gedit and evince for dissertation work, terrible timing… Kile and Kate looked OK, though, but I was completely unfamiliar with them).

Some searching suggests there are issues with older Clutter and COGL (like perhaps in my LMDE install) which might be related to the flickering. By this point I’ve learned that LMDE is not worth the pain: it’s not actually rolling, just frozen testing/jessie, with no updates. This means that it’s close enough to testing to make you want things as new as in testing, but old enough that pulling things in from testing via apt preferences becomes harder and harder.

I realize that I actually do just want Debian testing/jessie, so I change the apt preferences to pin repos such that I can upgrade fully to jessie (over 1000 updates, which gives you an idea how out-of-date the supposedly rolling LMDE was). I also switch the jessie repos to a priority of over 1000 to force “downgrades” from outdated versions of stuff LMDE had grabbed from deb-multimedia (which forcibly “self-pins” by using a higher epoch number in its version than real Debian packages) to the actually newer releases available in plain jessie. I get the system fully on jessie (doing this all over SSH and the console, to avoid it freezing during the process), purge all the remaining Mint junk, and return the pins to their normal levels. I even get Cinnamon installed (which required a package from unstable since apparently one of the packages, but not all, can build on s390x aka IBM System/z mainframes, which was blocking the migration to testing…) and in a more updated version than LMDE had, which is interesting since Linux Mint is the upstream for Cinnamon. I even get XFCE on there as a non-composited alternative session in case it goes bad and it’s something to do with compositing window managers derived from Metacity with weird names starting with m (metacity, mutter, muffin). System-wise, we’re very clean at this point. Unfortunately, the machine’s still freezing.
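To make the epoch “self-pinning” concrete, here is a simplified Python sketch. It only looks at the epoch (the number before the colon) and does not implement the full dpkg version comparison; the version strings are made up for illustration. For real comparisons, `dpkg --compare-versions` is the tool to trust.

```python
def split_epoch(version):
    """Return (epoch, remainder) for a Debian version string.

    A version without an "N:" prefix has an implicit epoch of 0.
    """
    epoch, sep, rest = version.partition(":")
    if sep and epoch.isdigit():
        return int(epoch), rest
    return 0, version


def wins_on_epoch(candidate, other):
    """True when candidate outranks other purely because its epoch is larger."""
    return split_epoch(candidate)[0] > split_epoch(other)[0]


if __name__ == "__main__":
    # Hypothetical versions: an older third-party build with an epoch
    # versus a newer plain Debian build without one.
    third_party = "1:1.0-dmo1"
    debian = "2.0-1"
    print(split_epoch(third_party), split_epoch(debian))
    print(wins_on_epoch(third_party, debian))
```

Because the epoch is compared before anything else, the older third-party build sorts as “newer”, which is why a pin priority above 1000 was needed to get apt to move to the plain jessie packages.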

I looked up these XID numbers I was seeing in my kernel log, and they were suggesting that it could be a software problem, a driver problem, or a hardware problem (depending on which message I looked up). This, of course, was no help. For one of them (XID 61 IIRC), it mentioned that this message is only available in “newer drivers”, and that it was indicating a “firmware checksum error”, but none of the “cause” columns in the table were even marked…
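If you want to see which XID codes you are actually getting before looking them up, a small Python sketch like the following can tally them from a saved kernel log. The regular expression is an assumption about the log format (an `NVRM: Xid (<PCI address>): <code>` prefix, with or without a `PCI:` tag), which may differ between driver versions, so adjust it to match your own log lines.

```python
import re
import sys
from collections import Counter

# Assumed message shape: "NVRM: Xid (0000:03:00): 61, ..." (optionally "PCI:0000:03:00").
XID_RE = re.compile(r"NVRM: Xid \((?:PCI:)?([^)]*)\): (\d+)")


def count_xids(lines):
    """Count occurrences of each (PCI address, XID code) pair in log lines."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


if __name__ == "__main__":
    # Usage: python3 xid_count.py /var/log/kern.log [more log files...]
    for path in sys.argv[1:]:
        with open(path, errors="replace") as log:
            for (device, code), n in count_xids(log).most_common():
                print(f"{path}: {device} Xid {code}: {n} time(s)")
```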

Eventually (yes, it seemed to get worse over time) just a few minutes of interacting in X would be enough to lock up the graphics, whether in Cinnamon, in GNOME Shell, or even in XFCE. Once it went, the display locked hard (not even mouse movements anymore), and while I could SSH in, I couldn’t reboot from there: I had to actually push and hold the power switch. When it got this bad, I saw that the NVIDIA driver was pegging one GPU at 100% according to nvidia-smi, and also flooding my kernel log with infinite repetitions of an XID error (driving both rsyslogd and systemd nuts, which I assume is why rebooting through software didn’t work).
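Checking for a pegged GPU over SSH can be scripted. This is a rough Python sketch, assuming your nvidia-smi supports the `--query-gpu` CSV interface and reports utilization for your card (some cards report it as unsupported, which the parser treats as unknown). The 95% threshold is an arbitrary choice of mine.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_utilization(csv_text):
    """Map GPU index -> utilization percent (None when not reported as a number)."""
    usage = {}
    for row in csv_text.strip().splitlines():
        index, _, util = row.partition(",")
        util = util.strip()
        usage[int(index)] = int(util) if util.isdigit() else None
    return usage


def busy_gpus(usage, threshold=95):
    """Return the indices of GPUs at or above the threshold, in index order."""
    return [i for i, u in sorted(usage.items()) if u is not None and u >= threshold]


if __name__ == "__main__":
    try:
        # The timeout guards against nvidia-smi itself hanging on a wedged GPU.
        done = subprocess.run(QUERY, capture_output=True, text=True,
                              check=True, timeout=10)
    except (OSError, subprocess.SubprocessError) as exc:
        print(f"could not query nvidia-smi: {exc}")
    else:
        print("pegged GPUs:", busy_gpus(parse_utilization(done.stdout)))
```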

A bit more poking on the NVIDIA Linux forums found a few other people with less dramatic issues than mine, but also reporting problems with modern desktops, recent cards, and the 337 and 340 drivers. While I know video drivers often contain regressions, especially when they’re operating complex systems, I didn’t think my system was that complex compared to others at the lab here, and anyway, these were only Quadro K600 cards, the baby siblings of the multi-thousand-dollar top-end Quadro cards. Thus I had upgraded in the past, hoping it would fix bugs, knowing that support for this card was fairly new and probably still being developed.

Well, I decided that it must be a driver problem. The forum people had liked 331, and it was good when I last used it (331.79, I think), but I’d since switched to pure jessie, which had an upgraded xorg-server ABI (18) that had broken all the existing NVIDIA driver packages until 340.24 was packaged. I looked around, and found that Ubuntu Trusty was using a 331 series driver with ABI 18. After realizing that the NVIDIA packages in Ubuntu and Debian have almost nothing in common, I decided to start with the Debian package instead of the Ubuntu one, and just change the version. Looking online, I discovered that there was a 331.89 update that added ABI 18 support to the upstream NVIDIA installers, and a light at the end of the tunnel was sighted. I found the Debian 331.79 package source, updated it to 331.89, which involved pretty minimal changes, built it, and installed it.

It’s been working for a few hours now (last night and this morning), and while I’d be worried about counting chickens before they’ve hatched, I will say that I have dragged Chrome tabs to new windows and similar things that used to be pretty reliable reproducible failures. I re-built the source package using my cowbuilder, and monkeyed with cowbuilder to get both a jessie-amd64 one and a jessie-i386 one so I could get the 32-bit GL libraries installed.
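For the two-cowbuilder setup, a sketch of the commands involved, written as a Python helper that only assembles the argument lists. The base path naming scheme and the .dsc file name are hypothetical, and I am assuming the `--distribution`, `--architecture`, and `--basepath` options behave as described in the pbuilder and cowbuilder man pages; this is not necessarily how my own configs do it.

```python
def base_path(dist, arch, root="/var/cache/pbuilder"):
    """Hypothetical naming scheme: one copy-on-write base per dist/arch pair."""
    return f"{root}/base-{dist}-{arch}.cow"


def create_command(dist, arch):
    """Arguments to create a fresh cowbuilder base for one dist/arch pair."""
    return [
        "cowbuilder", "--create",
        "--distribution", dist,
        "--architecture", arch,
        "--basepath", base_path(dist, arch),
    ]


def build_command(dsc, dist, arch):
    """Arguments to build a source package inside the matching base."""
    return ["cowbuilder", "--build", "--basepath", base_path(dist, arch), dsc]


if __name__ == "__main__":
    # Hypothetical .dsc name; use whatever your source package build produced.
    dsc = "nvidia-graphics-drivers_331.89-1.dsc"
    for arch in ("amd64", "i386"):
        print(" ".join(create_command("jessie", arch)))
        print(" ".join(build_command(dsc, "jessie", arch)))
```

The printed commands need root (for example via sudo), and the create step only has to be done once per architecture.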

And that, readers, is the story of the source and binary packages I posted before the break. The source package is very clean, and the binaries were built in pretty run-of-the-mill cowbuilder instances. Not much change was needed from the original Debian package for the previous version: see the debdiff below. That said, the standard disclaimers (no warranty, not responsible for problems, crashes, nasal demons, etc) apply: use at your own risk. For me, 340.24 was way riskier, but I can’t imagine my full set of issues being very common or I would have seen outrage on the internet. (That, or everybody with Quadro cards running setups like mine just goes right to an NVIDIA contact through their institutional IT people rather than posting on the NVIDIA forums or elsewhere. I suppose I’m not much better, since when I did manage to capture nvidia-bug-report.sh data I emailed it directly to NVIDIA’s Linux bugs email instead of posting on the forum, in the vain hope they’d help more quickly.)

Lessons learned:

  • Linux Mint Debian Edition is more trouble than it’s worth.
    • It’s not Linux Mint (which apparently lots of people like but isn’t for me), it’s not Ubuntu, and it’s definitely not Debian Testing.
    • Just use Debian Testing if you want to. It’s not difficult, it’s not unstable, it’s cohesive, it gets updates (that you can accept or decline, and that you’d probably be using a PPA for or rebuilding from the debian package in Ubuntu), and it doesn’t intentionally have a “mess with other files” package installed by default unlike LMDE. (Wish I could remember the package name so I could provide a citation for that one…) I now know to wait out major package transitions (like the xorg-server one) before doing a full upgrade, but looking at the package statement before pressing “Y” to update would have clued me in. That’s pretty much the only catch I can think of.

  • Hang on to old driver packages, and if you have a hint of trouble roll back, even if the new one says “long term stable.”

  • If needed, don’t be afraid to update a package of an old driver series to a newer release in that series – it was super easy.

  • If you’re having NVIDIA graphics problems, run nvidia-bug-report.sh often and use the output parameter to specify a filename describing the condition.

  • nvidia-smi can be a handy troubleshooting or inspection tool. Sadly, the “GPU Reset” feature only works on compute-only GPUs.

  • Less-related lessons:

    • Don’t screw up typing your username or password when logging into 1and1’s SSH – it will ban your IP from your whole site (actually, every site in your account), apparently until you call in?, after just a few (2 or 3?) failures. Cf. SuperUser post, annoyed blogger, forum of weather people randomly banned for uploading their data

    • Interestingly, WinSCP includes a feature to SCP into a server through an SSH tunnel that it sets up through another server of your choice, which is useful given the above lesson.

    • If going from a single cowbuilder to a more organized system of multiple cowbuilders, you’re best off deleting the existing one, putting the pbuilderrc commands you find in /etc/pbuilderrc instead of your home directory, and creating them all fresh. I’ve put my cow/pbuilder configs here, in case it helps: https://gist.github.com/rpavlik/b5faea075581489a28e7
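Back on the nvidia-bug-report.sh lesson: a small Python sketch for building a descriptive, timestamped report file name. I am assuming the script accepts `--output-file` (check `nvidia-bug-report.sh --help` on your driver version), and the naming scheme is just my own convention.

```python
import re
import time


def report_command(condition, when=None):
    """Build an nvidia-bug-report.sh command whose output name describes the condition."""
    moment = when if when is not None else time.localtime()
    stamp = time.strftime("%Y%m%d-%H%M%S", moment)
    # Reduce the free-form description to a file-name-friendly slug.
    slug = re.sub(r"[^a-z0-9]+", "-", condition.lower()).strip("-")
    return ["nvidia-bug-report.sh", "--output-file",
            f"nvidia-bug-{stamp}-{slug}.log"]


if __name__ == "__main__":
    # Print the command rather than running it; the script usually needs root.
    print(" ".join(report_command("Chrome tab drag freeze")))
```

Having the condition in the file name makes it much easier to tell a “frozen right now” report from a “just booted, all fine” one later.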
