sprezzatech blog #0017
So many problems, Chef.
Tue, 02 Apr 2013 03:40:36 -0400
so very many problems.
ughhhgh, there i was happily hacking away last evening, when suddenly everything goes to hell. advancing a line in Konsole yields a popup notification that it couldn't write to /tmp. running anything but trivial commands hits immediate read errors. those which can be run inevitably hit similar errors processing their arguments. it's just as if a network drive dropped out from under you, or you unplugged an external drive without properly unmounting it. well, indeed, intel had unplugged one of my 128GB 320 SSDs for me, an apparently well-known bug which manifests itself as a serial number of BAD_CTX [hexval] and only 8MB of zeroes available on the block device. annoyingly, this was my primary device (/, /usr, /home, everything but active torrents, a few VMs, and bulk storage). that'd not have been much of a problem, but (most unfortunately, i realized as i sat up with a bolt) last week i'd pulled apart the mdadm software RAID1 which typically backed these data, to stash aforementioned VMs. indeed, the dead SSD had three partitions on it, two of them degraded halves of my raid. yep, boom, with that i'd lost everything aside from media on my machine. at least a day lost to rebuilding it, i knew, if rebuilding was even possible.
well, i managed to rescue...most everything, i think, though i'm only finishing up now (around 0400 tuesday) after starting at 0700 monday morning, and working continuously. utterly fucking wasted day. a horrible time. i'm shocked and appalled by intel's shoddy work here, especially given the reliability premium i'd happily paid them. admittedly, the problem appears to be in the firmware as opposed to the NAND flash, and it's not as if intel makes the firmware/controller, but still, augh, the product ought not be on the market. this failure case is hard and fast, from all i can tell. i've got some new data, by the way, for anyone dealing with this issue (from the fora searches i've performed, there seem to be a great many of you):
- The 8MB being reported is definitely a firmware failure mode as opposed to some arbitrary bitrot. If you run hdparm -g to dump “physical” geometry (SSDs do not, of course, have heads or cylinders), you'll see that only 1 cylinder is being reported.
- I burned an ISO of the firmware update, and successfully booted it. It detected my SSD (so long as it was on the motherboard's builtin AHCI controller, as opposed to an expansion SAS/SATA board), but was never able to apply the update. I tried this in both native AHCI and legacy IDE configs.
- hdparm -N clearly reveals a Host Protected Area (HPA) in use, likely set up by the firmware as a world-communicable failure litmus. This makes plenty of sense as an error-signaling mechanism, or enough sense anyway. the result sure sucketh ass though.
- attempting to remove the HPA via hdparm -Np resulted in EIO
- attempting to restore original Device Configuration Overlay (DCO) information also resulted in EIO
- sending the device into hard sleep mode, then waking it, did nothing to resolve this, nor did a hard reset of the device, though that does work to effect a T13 defreeze.
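
a minimal sketch of the HPA check from the list above, for anyone wanting to compare their own drive. it assumes 512-byte logical sectors and that hdparm -N prints its usual "max sectors = visible/native" line; hpa_summary is a made-up helper name and /dev/sdX is a placeholder, neither is from the notes above.

```shell
#!/bin/sh
# sketch: summarize the "max sectors" line printed by `hdparm -N /dev/sdX`.
# assumptions (not from the original notes): 512-byte logical sectors, and
# hdparm output shaped like " max sectors   = VISIBLE/NATIVE, HPA is ...".
hpa_summary() {
    # read hdparm -N output on stdin, pull out the two sector counts
    sed -n 's|^ *max sectors *= *\([0-9]*\)/\([0-9]*\).*|\1 \2|p' |
    while read -r visible native; do
        echo "visible: $((visible * 512 / 1048576)) MiB"
        echo "native:  $((native * 512 / 1048576)) MiB"
        # fewer visible sectors than native ones means an HPA is hiding space
        if [ "$visible" -lt "$native" ]; then
            echo "HPA in use: $((native - visible)) sectors hidden"
        fi
    done
}

# typical use, as root, with /dev/sdX standing in for the suspect drive:
#   hdparm -g /dev/sdX                # geometry, look at the cylinder count
#   hdparm -N /dev/sdX | hpa_summary  # visible vs native size
```

a visible size of 8 MiB against a much larger native size would match the failure mode described here. the sketch only reads; it doesn't attempt the -Np or DCO restore steps, which returned EIO anyway.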
whilst reinstalling my box, i took a few quick notes on problems i noticed in sprezzos. ordered in order that i noticed them, allow me to reproduce this list. totally not daunting at all:
- ssmtp needs a rebuild (gnutls)
- pack net-ip-perl, io-socket-inet6-, digest-hmac-, authen-ntlm-, authen-sasl-
- pack guile-1.8 or move users
- rebuild nullmailer (gnutls)
- still not running gpm -- not even installing, actually
- when gpm is installed, warning about /usr/lib/gpm/gpm_has_mouse_control
- once installed, "systemctl start gpm" does indeed work
- ...though it still doesn't start at boot
- still not bringing up interfaces on boot
- writing allow-hotplug entries to /etc/network/interfaces twice
- need fix usb keyboards yesterday
- need run setupcon on boot
- pack surfraw
- pack iselect
- pack screenie
- pack libcache-perl, libclass-*-perl, libdata-*-perl
- pack libfeed-find-perl, libunicode-map8-perl, libaudio-scrobbler-perl
- i think we've lost our aptitude defaults
- should rsyslog really be running with systemd?
- colorize prompt by default
- really must move to physical naming for disks
- turn on syncookies by default -- lost sysctl settings?
- nvidia doesn't build out of the box against our shipped kernel
- need run sensors-detect and add discovered modules to /etc/modules
- ssh-agent is running through pam but need generate ssh key for added users
- pm-utils recommend obsolete cpufrequtils
- clutter-2.0-gst conflicts with clutter-1.0-gst
- compiz9 doesn't recommend/depend on plugin, backend, or settings manager
- udisks2 recommends obsolete cryptsetup-bin
- gconf2 APT hooks bitch about dbus when run from console
- default xdg directories are terrible (Templates? Videos? fuck you)
- need rebuild qdbus with epoch >= 4 or old one gets reinstalled over it
- konsole throws up a 'knotify crashed' dialog for invalid tab completes(!)
- top needs be colorized by default
- need solarized vim theme by default
- smartmontools ought be installed by default
- smart doesn't run automatically once installed
- smart needs a semi-sensible config
- probably want a more useful default xinitrc than "xterm"
- why is notify-osd installed to /usr/lib/notify-osd where it's useful to no one?
- compiz8 also doesn't recommend backend (does get plugins)
- compiz-kde isn't installable
- nouveau 9.1.1 doesn't appear to work -- warning about can't open nouveau dri
- looks like we're missing a nouveau_dri.so in /usr/lib/dri
- what the hell is agetty? use mingetty by default
- mpd ought start lastmp and lastfmsubmitd for me
- at the very least, lastmp ought start lastfmsubmitd or vice versa
- why the hell are we installing 3.7.2 spl/zfs
- need upgrade spl/zfs to 0.6.0
- wtf on login: "-bash: data/zeitgeist-daemon.bash_completion: NSFoDirectory"
- set up wireless interfaces as wpa-roam/manual with an example .conf
- our python-libxml2 doesn't get discovered by autotools
- raptorial needs depend on apt-file unless it wants to download contents files
- investigate this "too big file to journald" crap (maybe systemd-coredump?)
- nautilus appears to be missing most of its icons
- gnome/gtk are both horrifically ugly out of the box (fonts are not so bad!)
- gnome-session fails (use --debug to get more info)
- why does udisksd burst as much cpu as mdraid_resync+mdraid_raid6 (no gui used!)
- growlight's corners in non-fbterm console are abominably ugly
- holy fucking shit cert is only good for www.sprezzatech.com we use no www augh
- editing profile preferences crashes gnome-terminal
- when you go back an entry in growlight, if the previous entry was not on the default set of entries listed, you get that set and nothing highlighted
- growlight crashes on exit sometimes
- growlight crashes if 'h' is pressed while blocked on a slow disk during init
- ccsm doesn't have any icons or text (patch at https://bugs.launchpad.net/ubuntu/+source/compiz/+bug/1130941 works)
- lightdm doesn't start (/usr/share/xgreeters/default.desktop missing)
fuck me gently with a chainsaw!
the one positive thing is that the two 256GB SSDs i ordered to eliminate Intel 320 SSDs from my life forevermore are by....Plextor! nostalgia for my 17 year old hella warez-burning self overwhelms me :D.