AmazonFail: An Inside Look at What Happened
Amazon managers found that an employee who happened to work in France had filled out a field incorrectly and more than 50,000 items got flipped over to be flagged as "adult," the source said. (Technically, the flag for adult content was flipped from 'false' to 'true.')
Note: If they are telling the truth about what happened, this applies. And actually, it would apply if they lied, but worse. One error is one thing, but if this was a deliberate system-wide build that made the change, pretty much the same thing applies, but with less sympathy.
My expertise is not expertise, it is anecdata, but it's also ten builds and fifty emergency releases of professional anecdata, so take that as you will.
I am a professional tester because at some point, it occurred to people that things worked better when there was a level of testing that was specifically designed to mimic the experiences of the average user with a change to a program. Of course, they didn't use average users, they used former caseworkers and programmers, but the point stands.
I'm a professional program tester and do user acceptance, which means I am the last line of defense for users before we release a change to the program, major and minor. It's a web-based program that with three very idiotic ways to interface with it online for a user and about fifty for other agencies to do automatically, and I won't go into our vendor interfaces because it hurts me inside. I am one of thirty user acceptance testers for this program, because it's huge and covers a massive number of things and interfaces with federal and state level agencies outside of our own internal agencies. I test things straight from the hands of coders in emergency releases and also after they've gone through two other levels of testing in our quarterly builds.
This does ring true to my experience when something just goes stupid. And when I say stupid, I mean, someone accidentally killed off pieces of welfare policy with a misflag once, and that's not even the stupidest thing I've had to test when the program was built and is still coded modularly and the coders are in different parts of the country and sometimes at home in India when working on this. And none of them ever know what anyone else is doing.
While I have no idea what amazon's model looks like, to do a rollback on a change for us, even a minor one, it goes like this:
1.) Report
2.) Reproduction in one of our environments.
3.) Code fix and discussion and so many meetings, God. (emergency releases may not go through this.)
4.) DEV environment 1 (theoretical construct of the program, works very well, nothing like the real thing)
5.) DEV environment 2 (closer to the actual program, but not by much)(sometimes do not use both Dev 1 and Dev 2 both)
6.) SIT (sometimes skipped for emergency releases) (I have issues with their methodology.)
7.) User Acceptance (me! And some other people, somewhat close to field conditions with database as of last mass update, usually two to three months before)
8.) Prodfix (optional) (also me! And some other people, almost perfect mirror of field conditions with full database)
If it's really desperate, it goes to prodfix instead of or in addition to User Acceptance, which is the only environment we have that nearly-perfectly mirrors live field conditions and is fully updated with our field database as of five o'clock COB the day before. For me to do a basic test, they give me a (really horrifyingly short) version of events and if I get lucky, I get to see screenshots of the problem in progress.
[If I win the lottery, someone uploaded the specific patches themselves for me to look at, and I get to see what is going on pre-compiling. That has happened once. I did not take advantage of it. I kick myself sometimes.]
Usually, I get a fifth hand account that's gone through eight other people on what went wrong and what function I'm supposed to test and in what order to do it in. Depending on severity, I have four hours to four days to write the test (or several tests, or several variations of the same test for different user conditions, or different input conditions), send it to the person who owns the defect, have them check it, then I run the test in full, then fail or pass it. Or run it in full, fail or pass, then run it in prodfix, fail or pass it.
[Sometimes, I have a coder call me and we both stare in horror at our lot in life when both of us really don't know what the hell went wrong and hope to God this didn't break more things.]
The fastest I've ever seen an emergency release fix go through is three days from report to implementation, and at least once, we had a massive delay when they were too eager and crashed our database because the rollback didn't match the new information entered into the system since the problem started.
[And since this is welfare and under federal jurisdiction, the state gets fined by the feds when we cannot issue benefits correctly or have egrerious errors. Feds are really, really politely nasty about this sort of thing. And OIG, who audits us for errors, hates this program like you would not believe. To say there is motivation for speed is to understate the case.]
The program I test is huge, and terrifyingly complicated, and unevenly coded, and we can easily crash the servers for incredibly stupid small-seeming things. Amazon is about a hundred times larger. We do four major builds and four minor (just like major, just with a different name) per year, plus upwards of thirty emergency releases between builds. Our releases aren't live but overnight batched when the program goes to low-use after 8 PM, so we have some leeway if something goes dramatically bad or our testing isn't thorough enough. Which you know, that also happens. Amazon is always up and while it has the same constant database updates we do, I'm betting also has more frequent normal code updates, both automatic and human initiated.
If this is actually what happened, then the delay in fixing it makes sense, at least in my experience. Unless they release live code without testing it in an environment that is updated to current database conditions, which um, wow, see the thing where we crashed the state servers? The state is cheap and they suck and even they don't try to do even a minor release without at least my department getting to play with it first and give yea or nay because of that.
Short version: this matches my testing experience and also tells you more than you ever wanted to know about my daily life and times. YMMV for those who have a different model for code releases and updates.
And to add, again, if this is true, I am seriously feeling for the tech dept right now. Having to do unplanned system-wide fixes sucks. Someone is leaving really unkind post-it notes for the French coder. Not that I ever considered doing that or anything.
ETA: For us, there are two types of builds and fixes: mod (modification) and main (maintenance). The former is actual new things added to the code, like, I don't know, adding an interface or new policy or changing the color scheme. Maintenance is stuff that is already there that broke and needs to be fixed, like suddenly you can't make a page work. Emergency fixes in general are maintenance, something broken that needs fixing, with occasional mods, the legislature did something dramatic.
None of this means they aren't lying and it wasn't deliberate. My department failed an entire build once due to the errors in it.
Actually, the easiest way to find out if it was deliberate is to hunt down whoever did their testing and check the scripts they wrote, or conversely, if amazon does it all automated, the automated testing scripts will also tell you exactly what was being tested. If it was deliberate, there were several scripts specifically created to test this change.
Example:
If I wrote the user script and was running it in a near-field environment.
Step Four: Query for Beauty's Punishment from main page.
Expected Result: Does not display.
Actual Result: Does not display.
(add screenshot here)
Step Five: Query for Beauty's Punishment from Books.
Expected Result: Displays.
Actual Result: Displays.
(add screenshot here)
We're like the evidence trail. Generally, a tester has to know what they are supposed to be testing to test it. If this was live beta'ed earlier this year with just a few authors, it still had to, at some point, go through some kind of formal testing procedure and record the results. And there would be a test written specifically to see if X Story Marked Adult would appear if searched from the main page, and one specifically written to check that X Story Marked Adult was showing sales figures, either human-run or automated.
← Ctrl ← Alt
Ctrl → Alt →
← Ctrl ← Alt
Ctrl → Alt →