====== Reverse Engineering a NAND Flash Device Management Algorithm ===== {{ http://joshuawise.com/photos/etc/sd-card/sd-fux-1.scale.jpg?350}} Around June of 2012, I had gotten myself into a very bad habit. Instead of carrying my SD card in my camera, I left it sticking out of the side of my laptop, presumably intending to do something with the photos on it eventually. On my flight home from Boston, the predictable thing happened: as I got up out of my seat, the machine fell out of my lap, and as the machine hit the ground, the SD card hit first, and was destroyed. I was otherwise ready to write off the data stored on that device, but something inside me just wasn't happy with that outcome. Before I pitched the SD card in the trash, I took a look at what remained -- as far as I could tell, although the board was badly damaged, the storage IC itself was fully intact (although with a few bent pins). The following is a description of how I went about reverse-engineering the on-flash format, and of the conclusions that I came to. My efforts over the course of about a month and a half of solid work -- and a "long tail" of another five months or so -- resulted in a full recovery of all pictures and videos that were stored on the SD card. //If you're just looking for the resources that go with this project, [[http://joshuawise.com/photos/etc/sd-card/|here are the pictures of the hardware]], and [[http://github.com/jwise/ndfslave|here is the source code]].// //You can [[https://news.ycombinator.com/item?id=8133450|discuss this article on Hacker News]].// ===== Introduction ===== It is probably fitting to start with a motivation for why this problem is complex; doing data recovery from a mass-production SD card seems like it should be a trivial operation (especially given the interface that SD cards present), but as will become clear, it is not. From there, I will discuss the different parts of the problem in detail, both in terms of how they physically work, and in terms of what it means from the standpoint of a data recovery engineer. {{http://joshuawise.com/img/h3630.jpg }} I begin with a brief history of the field. In the past ten years, solid-state data storage has become increasingly complex. Although flash memory was originally commercialized in 1988, it only began taking off as consumer mass storage recently. In August of 2000, COMPAQ (and later, HP) began producing the [[http://pdadb.net/index.php?m=specs&id=3&c=compaq_ipaq_h3630__h3635__h3650|iPAQ h3100/h3600]] series of handheld computers, which had between 16 and 64MB of flash memory. This was approximately a standard capacity for the time period; the underlying technology of the flash device was called "[[http://en.wikipedia.org/wiki/Flash_memory#NOR_flash|NOR flash]]", because of how the memory array was structured. NOR flash, in many regards, behaved like classic ROM or SRAM memories: it had a parallel bus of address pins, and it would return data on the bus I/O pins in a fixed amount of time. The only spanner in the works was that writes could only change bits that were previously ones to zeroes; in order to change a zero back to a one, a whole block of bits (generally, around 16 KBytes) had to be erased at once. This was okay, though, and we learned to work around these issues. NOR flash had a limited erase life -- perhaps only some millions of erases per block -- so filesystems on the device generally needed to be specially designed to "wear-level" (i.e., scatter their writes around the device) in order to avoid burning an unusable hole in the flash array. Even still, it still appeared a lot like a block device that operating systems knew how to deal with; indeed, since it looked so much like SRAM, it was possible to boot from it on an embedded system, given appropriate connections of the bus pins. Around 2005, however, NOR flash ran into something of a problem -- it stopped scaling. As the flash arrays became larger, the decode logic began to occupy more of the cell space; further, NOR flash is only about 60% as efficient (in terms of bits per surface area) as its successor. To continue a Moore's Law-type expansion of bits per flash IC, flash manufacturers went to a technology called [[http://en.wikipedia.org/wiki/Flash_memory#NAND_flash|NAND flash]]. Unfortunately, as much as it sounds like the difference between NOR flash and NAND flash would be entirely internal to the array, it isn't: the external interface, and characteristics of the device, changed radically. As easy as NOR flash is to work with, NAND flash is a pain. NOR flash has a parallel bus in which reads can be executed on a word-level; NAND flash is operated on with a small 8-bit wide bus in which commands are serialized, and then responses are streamed out after some delay. NOR flash has an amazing cycle life of millions of erases per block; modern NAND flashes may permit only tens of thousands, if that. NOR flash has small erase blocks of some kilobytes; NAND flash can have erase blocks of some megabytes. NOR flash allows arbitrary data to be written; NAND flash imposes constraints on correlating data between adjacent pages. Perhaps most distressingly of all, NOR flash guarantees that the system can read back the data that was written; NAND flash permits corruption of a percentage of bits. In short, where NOR flash required simply a wear-leveling algorithm, modern NAND flash requires a full device-management algorithm. The basic premises of any NAND device-management algorithm are three pieces: data decorrelation (sometimes referred to as entropy distribution), error correction, and finally, wear-leveling. Oftentimes, the device management algorithms are built into controllers that emulate a simpler (block-like) interface; the [[http://www.siliconmotion.com/download.php?t=U0wyRnpjMlYwY3k4eU1ERXhMekF5THpBNEwzQnliMlIxWTNRMU1ETXpOVFU0TWpVMUxuQmtaajA5UFZOTk1qWTRNMFZPSUZCeWIyUjFZM1FnUW5KcFpXWT1D|Silicon Motion SM2683EN]] that was in my damaged card is marketed as a "all-in-one" SD-to-NAND controller. ===== Data extraction ===== {{ http://nyus.joshuawise.com/sd-fux/sd-fux-9.scale.jpg?400}} Of course, none of the device-management information is relevant if the data can't be recovered from the flash IC itself. So, I started by building some hardware to extract the data. I had a [[http://www.digilentinc.com/Products/Detail.cfm?Prod=NEXYS2|Digilent Nexys-2 FPGA board]] lying around, which has a set of 0.1" headers on it; those headers are good to around 20MHz, which means that with some care, I should be able to interface it directly with the NAND flash. A bigger problem that I had facing me was that the NAND flash was physically damaged. The pins still had pads on them, ripped from the board; the pins were also bent. Additionally, the part was in a TSSOP package, which was too small for me to solder directly to. I first experimented with doing a "dead-bug" soldering style -- soldering AWG 36 leads directly to each pin -- but this proved ultimately too painful to carry out for the whole IC. Ultimately, I settled on using a [[http://www.schmartboard.com/index.asp?page=products|Schmartboard]]; I sliced it in half, and allowed each side to self-align. This meant that I didn't have to worry about straightening both sides at the same time -- as long as I got them each individually, I could get a functional breakout from the flash IC. (The curious reader might enjoy some [[http://joshuawise.com/photos/etc/sd-card/|photos of my various attempts to re-assemble the NAND flash]].) The next question was the mechanics of using the FPGA to transfer data back to the host. I used Digilent's on-board EPP-like controller to communicate with the FPGA; ultimately, I created a mostly-dumb state machine interface on the FPGA, which could be instructed to perform a small number of tasks. It knew how to latch a command into the FPGA, it knew how to latch data in and out of the FPGA, and it knew how to wait for the flash to become ready -- beyond that, everything was handled in a wrapper library that I wrote on the host side. ([[https://github.com/jwise/ndfslave/blob/master/fpga/NDFSlave.v|Verilog source]] for this chunk of code, which I called ''ndfslave'', is available; if you ever need some sample RTL to read for how to communicate with the Digilent EPP controller, this might be useful.) To test that my interface was working correctly and use it, I wrote three tools: * ''id'': The ''id'' tool interrogated the flash's identification registers, and decoded all of the registers as best it was able. This was a basic diagnostic proving that both commands worked and data reads worked. [[https://github.com/jwise/ndfslave/blob/master/sw/flash-det.c|(source: flash-det.c)]] * ''iobert'': IOBERT stands for "I/O Bit Error Rate Test". Instead of simply reading the identification registers once, it spun in a tight loop repeatedly rereading them, and comparing them on each iteration. When I built the interface, the length of the wires (and lack of termination) concerned me, so I wanted to make sure that there weren't any timing or integrity marginality issues. I ran the BERT test overnight, transferring some 25GB, without error. [[https://github.com/jwise/ndfslave/blob/master/sw/iobert.c|(source: iobert.c)]] * ''flash'': This was the main driver; its purpose in life was to dump the flash device. Since the flash device is split into two chip-enables, it dumps each in sequence into their own files. With some optimizations, the ''flash'' tool peaked around 1.2 MByte/second -- a far cry from the limits of either the NAND flash or the USB interface, but still sufficiently performant to dump both chip selects in about 6.5 hours. [[https://github.com/jwise/ndfslave/blob/master/sw/flash.c|(source: flash.c)]] One curious challenge was that no datasheet existed for the chip that I had. I used [[http://e2e.ti.com/cfs-filesystemfile.ashx/__key/CommunityServer-Discussions-Components-Files/100/8055.K9GBG08UXA_5F00_1.0.pdf|the closest available]], but even that did not mention some important characteristics of the chip -- for instance, it did not discuss the fact that there is an "address hole" between page 0x80000 and page 0xFFFFF, and that the rest of the data begins at 0x100000. (This comes from the fact that this flash IC is a "multi-level" flash chip, which means that each cell in the array stores three bits, not just one. The address hole exists, then, because three is not an even power of two...). The datasheet also does not mention that the chip has multiple chip-selects. The end result of this phase was a pair of 12GiB files, ''flash.cs0'' and ''flash.cs1'', which contained the raw data from the NAND flash medium. ===== Unwhitening ===== {{:projects:nand_flash_structure-01.png?direct&246 }} Modern NAND flash devices have a bunch of difficult constraints to work with, but one that might not be obvious is that it matters what data you write to it! It is possible to construct some "worst case" data patterns that can result in data becoming more easily corrupted. As it turns out, the "worst case" patterns are often when the data we write is very similar to other data nearby; to solve this, then, a stage in writing data to NAND flash is to "scramble" it first. To understand how this can be, I've included an image to the right that shows a schematic of how the building blocks of NAND flash work internally (adapted from [[http://en.wikipedia.org/wiki/File:Nand_flash_structure.svg|this drawing, from Wikipedia]]). This drawing shows one column of a NAND flash that has four pages; in total, four bits are represented here. To read from one page of flash, a charge is placed on the bitline, and then all of the wordlines (labeled 'WL') //except for the one that we wish to sense// are turned on. This means that charge can flow through the flash transistors that we're not trying to sense, and the one that we're trying to sense controls whether the charge stays on the bitline or whether it flows to ground. Then, some time later, we can sense whether that flash transistor was programmed or not by telling how much charge was left on the bit line. (This is a very simplified discussion, that may not make much sense without some background in the field. That's okay; this won't be on the final exam. [[http://en.wikipedia.org/wiki/NAND_flash#Principles_of_operation|The Wikipedia article]] on flash memory explains in some more detail, if you're interested.) The problem with this is that the bypassed transistors (that is to say, the ones that we're //skipping over// -- the ones that we're not sensing) are not perfect. Even though they are supposed to conduct when they are being bypassed, if they have been programmed, they might conduct slightly less well. If a few of the bits on that bitline have been programmed, that's not a problem; but if all of them except for the one that you're trying to read have been programmed, then it might take quite a lot longer for the bitline to become discharged, and the sense amp that reads the charge later could incorrectly read the transistor you're trying to sense as having been programmed. This has only become a problem recently, as we've moved to more modern technologies. On larger processes, the effects that caused this to happen were apparently less likely; to compound the problem, multi-level flash chips, which use multiple voltage levels to encode multiple bits in a single transistor, require even more sensitivity in the sense amps. So, if other cells on the bitline are adding extra resistance, there no longer needs to be a flip all the way from one end to the other, but a small change in the middle can happen, which could still disrupt the read. Even if you didn't understand the mechanics of it, what this means from the perspective of programming the device is that if many adjacent pages all have the same value, a page that has a differing value might be very hard to read, and might often flip to the "wrong" value when you try to read it. This affects each bit in the page individually, so if it happens to all of the bits, then even error correction will not be possible. The upshot of this is a new constraint on NAND flash: pages in the same block shouldn't have data that's excessively correlated. The solution to this problem, then, is to add another step -- I'll take the obvious name, "data decorrelation". (Sometimes, other people call it "data scrambling", "entropy distribution", or "data whitening".) The basic idea is to mix deterministic noise into the data on a per-page basis, and hope that the noise doesn't have the same pattern as the data that the user is trying to store. On flash, the data will look like garbage, so to get real data back out, you must then remove the noise -- a process that I call "unwhitening". ==== Mechanics ==== I started off without knowledge of the reason for scrambling the on-flash data, and began by looking for a one-byte XOR key that would find the ''FAT32'' filesystem marker somewhere on the device. I did an exhaustive search of all possible one-byte XOR keys, while scanning the device for the ''FAT32'' marker; one came up with a result, but not at a sensible offset, and no other plaintext that I would expect from the device (for instance, no ''DCIM'' or ''DSC0xxxxJPG''). (I originally called this tool ''[[https://github.com/jwise/ndfslave/blob/master/sw/xor-me.c|xor-me]]'', but that tool eventually morphed into a generic tool to xor two files together.) Since I wasn't recovering anything from a one-byte key, I wanted to know whether there was anything sensible to be recovered from an otherwise-short key. I figured that if there were real data around, the distribution of bytes would not be uniform, so I wrote a program that I called ''[[https://github.com/jwise/ndfslave/blob/master/sw/entropy.c|entropy]]'' to do a distribution analysis both on byte values and bits per byte. Irritatingly, I found that the distribution was pretty even, indicating that I was unlikely to find any real data around by simple methods. Around this time, I was talking with [[http://www.recovermyflashdrive.com/about-us/|Jeremy Brock, of A+ Perfect Computers]]. I was pasting him some samples of some pages on the device, and he recognized one of them as looking like an XOR key that he'd seen before. This was progress! I wrote a tool that looked for that pattern, matching on blocks in which it appears, and doing a probabilistic analysis on the rest of the block to figure out what bytes were likely to appear there (I reasoned that either zeroes or FFs were the most likely byte on the medium). Since the tool seemed to be looking for the known-pattern needle in the haystack of the image, I called it ''[[https://github.com/jwise/ndfslave/blob/master/sw/haystack-me.c|haystack-me]]''. It produced vaguely satisfactory output -- I started to see directory structures -- but some blocks didn't have the needle in them at all, and so I was still missing more than half of the medium. I was considering doing a more intelligent search, looking for common patterns (I didn't want to come up with patterns lost in the noise), but I decided to run an experiment first -- I wanted to know if the most naïve thing possible could work. I wrote a tool to do a probabilistic analysis on a byte-by-byte basis in each row, without regard to every other byte in the row; from some of the results in ''haystack-me'', I knew that the key repeated every 64 rows, so I picked the most common byte for each column in each row (mod 64). Since the algorithm was somewhat stupid, I called this tool ''[[https://github.com/jwise/ndfslave/blob/master/sw/dumb-me.c|dumb-me]]''. Both of these tools produced a "confidence" value for each byte in the extracted pattern, based on how probable the selected key byte was in relation to the second most-common byte. ''dumb-me'' produced satisfactory results with almost no further tweaking; it ended up being the method I chose to extract the key. Once I had extracted the key, I ran the entire image through a new, revised ''xor-me'', to produce a full flash image that had been unwhitened. ==== Future work ==== The probabilistic method for extracting a key, of course, kind of sucks. It's bad because it's not a sure thing (it could always be fooled by sufficiently strange data), but it's also bad because it's not how the flash controller does the job. The key that I extracted, I treated as an opaque blob -- that is to say, a chunk of data that has no real relation, and nothing interesting used to generate it. That works okay for my applications, but it can't possibly be the way a tiny flash controller does this. To understand why, it's important to look at the size of the key that I extracted -- 512 KB per chip select. This is a completely unreasonable amount of memory to have in the controller; SRAM is very area-expensive, and only relatively large microcontrollers these days have that amount of memory. Those microcontrollers likely sell for some $7 a shot, primarily driven in cost by the die area for the SRAM, so it's not really possible to drive the price lower for that much SRAM in a different application. $7 would be up to half (if not more) of the bill-of-materials cost for this SD card; clearly, this is untenable. I suspect that the key pattern is generated, instead, by a pseudo random number generator. The easiest such to implement in silicon is a LFSR -- a Linear Feedback Shift Register. (LFSRs happen to be good for quite a lot of things, actually; a basic coding theory textbook should cover many of them, and [[http://en.wikipedia.org/wiki/Linear_feedback_shift_register|Wikipedia's article]] isn't half bad, either.) Good future work for the unwhitening step, then, would be to reverse engineer the generating LFSR, and generalize that step to other flash controllers. ===== ECC recovery ===== {{:projects:uncertain-nand.svg }} Vendors manage to drive the cost of NAND flash lower by allowing the flash memory to be "imperfect". At the scale at which these flash devices are produced, it's infeasible to expect a perfect yield -- that is to say, every bit of every flash device fully functional -- so allowing manufacture-time errors permits more of the NAND flash parts produced to be sold. Additionally, data errors come intrinsically with the scale of the device: as the physical size of each transistor becomes smaller, the number of electrons that can be trapped in the gate also becomes smaller, and so it takes even fewer electrons leaking in or out to cause a bit to be flipped. All of these add up to the expectation of NAND flash being unreliable storage. Some NAND flash devices can have a bit error rate (BER) as high as 10-5 mid-way through their lifetime((See, for instance, http://www.snia.org/sites/default/files/SSSI_NAND_Reliability_White_Paper_0.pdf -- there are other remarkable things there, such as a 100-fold increase in bit error rate //just by executing read cycles//!)), which requires powerful error correcting schemes to bring the expected BER down to a reasonable level. ==== Basics of ECC ==== At first, the concept of improving a bit error rate without adding an extra low-error storage medium may seem somewhat far-fetched. Given an unreliable storage mechanism, how can any meaningful data be reliably extracted? The answer comes in the form of error correction codes -- ECC, for short. For the sake of clarity, I will introduce some of the basic concepts of ECC here; sadly, modern high-performance codes require a deep understanding of linear algebra (indeed, one that I do not have!), so it is infeasible -- and somewhat out of scope -- to go into great detail in this article. The underlying premise of any ECC is the concept of a //codeword// -- a chunk of data that conveys additional error correcting information, on top of the user-visible data (the "input word") that it is meant to encode. Needless to say, the codeword is always larger than the input word; ECC schemes are sometimes described as being //(m,n)// codes, where there are //m// bits in each codeword, conveying //n// bits of data. There is a bijection between valid codewords and input words. {{ :projects:hamming-distance.svg}} These codes derive their functioning from the idea of a //[[http://xlinux.nist.gov/dads/HTML/HammingDistance.html|Hamming distance]]// (named after Richard Hamming, one of the pioneers of modern information theory). The Hamming distance between two strings of bits is the number of bits that would need to be corrupted in order for one to be changed to the other. (By way of example, consider the strings 8'b01**10**0110 and 8'b01**01**0110; the Hamming distance between these two strings is 2, because two bits would need to be corrupted in order to change the first to the second.) In an error correction code, all codewords have at least a certain Hamming distance from any other valid codeword; this minimum distance specifies the properties of the code. Let's take some examples. A code with a minimum Hamming distance of 1 is no better than simply the input -- there exists a single bit flip that could result in having another valid codeword, with no way of distinguishing that there was even an issue to begin with. A code with a minimum Hamming distance of 2 begins to offer some protection: if there is a single bit flip, the result no longer lands on a valid codeword, but it is not "closer" to either, so it's not possible to correct the error. A code with a minimum Hamming distance of 3 now allows us to correct a single error -- if there is a single bit flip, the result is not any longer on a valid codeword, and it is also "closer" to one codeword than to any other, so it is possible to correct the error. However, since the minimum distance is 3, there exists at least one codeword for which two bit flips will result in silently "correcting" the error to the incorrect codeword; this scheme can also be described as a "single bit correction" system. By way of one more example, a minimum Hamming distance of 4 allows both the detection of two bits flipped, and the correction of one; this is a "Single Bit Correction, Double Bit Detection" (SECDED) scheme. (Such codes are very popular for volatile ECC memory with low bit rates.) ==== A simple ECC ==== {{ :projects:basic-parity.svg}} To make this somewhat more concrete, I'll provide two examples of codes that are simple enough to construct and verify by hand. The first code is extremely simple, and has been used more or less since the dawn of time: for all values of //n// bits, we can produce a //(n+1,n)// code that has a minimum Hamming distance of 2 between valid codewords. The code works by taking all of the bits in the input word, and XORing them ((Recall that XOR is defined as: 0 XOR 0 = 0; 0 XOR 1 = 1; 1 XOR 0 = 1; 1 XOR 1 = 0)) together; then, take the single output bit from the XOR operation, and append it to the end of the input. This is called a "parity" code; the bit at the end is sometimes referred to as the parity bit. You can see an example of this at right. In the first example, I've provided a valid codeword, with the parity check bit highlighted in blue. When I introduced a single bit error in the second example -- placing the codeword at a Hamming distance of 1 to the original -- a simple calculation will show that the parity bit no longer matches the data within, and so the code can detect that the codeword is incorrect. However, the code can detect only a single-bit error: in the third example, I have introduced only one more bit worth of error (for a total of two error bits), and the codeword once again appears to be valid. (This matches with our intuition for a minimum Hamming distance of 2; proof that 2 is in fact the minimum is left to the reader.) ==== Another simple ECC: row-column parity == {{ :projects:parity-matrix.svg}} The second code that I'll present is what I call a "row-column parity code". In this version, I will give a //(25,16)// code: codewords are 25 bits long, and they decode to 16 bits of data. The code has a minimum Hamming distance of 4, which means that it can detect all two-bit errors, and correct all one-bit errors. As you might expect from the name, this code is constructed by setting up a matrix, and operating on the rows and columns. To start, we will write the data word that we wish to encode -- for this example, 1011 0100 0110 1111 -- in four groups of four, and arrange the groups in a matrix. We will then compute nine parity operations: four for the rows, four for the columns, and one across all of the bits. (This is shown in the diagram at left.) {{ :projects:parity-matrix-bad.svg}} Although not terribly efficient, this "row-column" scheme provides a straightforward algorithm for decoding errors. At right, I show the same code, but with a single bit flipped; the bit that was flipped is highlighted in red. To decode, we recompute the parity bits on the received data portion, and compare them to the received parity bits. The bits that differ are called the //error syndrome// -- to be specific, the syndrome refers to the codeword that is produces by XORing the received parity bits with the computed parity bits. To decode, we can go case-by-case on the error syndrome: * If the syndrome is all zeroes -- that is to say, the received parity bits perfectly match the computed parity bits -- then there is nothing to do; we have received a correct codeword (or something indistinguishable from one). * If the syndrome contains just one bit set -- that is to say, only one parity bit differs -- then there is a single bit error, but it is in the parity bits. The coded data is intact, and needs no correction. * If the syndrome has exactly three bits set, and one is a row, one is a column, and one is the all-bits parity, then there is a single bit error in the coded word. The row and column whose parity bits mismatch point to the data bit that is in error; flip that bit, and then the error syndrome should be zero once again. * If the syndrome is anything else, then there is an error of (at least) two bits, and we cannot correct it. (It might not be immediately obvious why the all-bits parity bit is needed; I urge you to consider it carefully to understand what situations would cause this code to be unable to detect or correct an error without the addition of that bit.) ==== Linear codes ==== Before we put these codes aside, I wish to make one more observation about them. The codes above both share an important property: they are //linear codes//. Definitionally, a linear code is a code for which any two valid codewords can be added (XORed) together to produce another valid codeword. This has a handful of implications; for instance, if we consider the function //ECC(d)// to mean "create the parity bits associated with the data //d//, and return only those", then we can derive the identity //ECC(d1 XOR d2) = ECC(d1) XOR ECC(d2)//. (This will become very important shortly, so make sure you understand why this is the case.) ==== NAND flash implementation ==== Of course, neither of the codes above are really "industrial-strength". In order to be able to recover data from high-bit-error-rate devices like modern NAND flash, we need codes that are substantially more capable -- and with along that capability comes a corresponding increase in encoding and decoding complexity. Most flash devices these days demand the capabilities of a class of codes called [[http://en.wikipedia.org/wiki/BCH_code|BCH codes]] (although [[http://en.wikipedia.org/wiki/Low-density_parity-check_code|LDPC]] is also becoming popular); these codes have sizes on the order of //(1094,1024)// or larger, and work on those amounts of //bytes// at a time, rather than bits! Such codes have the capability of correcting many many bad elements in a codeword, not just one or two; the downside, however, is that their foundations are rooted deep in the theory of linear algebra. One thing to note is that even a BCH code with a codeword size of 1094 bytes is not capable of filling the quantum of storage in NAND flash (a "page"). On the NAND device that I worked with, the page size was as large as 8832 bytes -- nominally, 8192 bytes of user data, and 640 bytes of "out-of-band" data. So, to figure out the page format on my SD card, I started by looking at a page that had content that I knew -- in particular, a certain page in my data dump, once it was unwhitened, appeared to have bits of a directory structure in it. In this page, I saw the pattern: