The Toybox

people for the conservation of limited amounts of indignation


amazon and codefixes - oh, this is something i might know something about!
seperis
A possible explanation, gakked from trobadora:

AmazonFail: An Inside Look at What Happened

Amazon managers found that an employee who happened to work in France had filled out a field incorrectly and more than 50,000 items got flipped over to be flagged as "adult," the source said. (Technically, the flag for adult content was flipped from 'false' to 'true.')


Note: If they are telling the truth about what happened, this applies. And actually, it would apply if they lied, but worse. One error is one thing, but if this was a deliberate system-wide build that made the change, pretty much the same thing applies, but with less sympathy.

My expertise is not expertise, it is anecdata, but it's also ten builds and fifty emergency releases of professional anecdata, so take that as you will.

I am a professional tester because at some point, it occurred to people that things worked better when there was a level of testing that was specifically designed to mimic the experiences of the average user with a change to a program. Of course, they didn't use average users, they used former caseworkers and programmers, but the point stands.



I'm a professional program tester and do user acceptance, which means I am the last line of defense for users before we release a change to the program, major and minor. It's a web-based program with three very idiotic ways for a user to interface with it online and about fifty for other agencies to do so automatically, and I won't go into our vendor interfaces because it hurts me inside. I am one of thirty user acceptance testers for this program, because it's huge and covers a massive number of things and interfaces with federal and state level agencies outside of our own internal agencies. I test things straight from the hands of coders in emergency releases and also after they've gone through two other levels of testing in our quarterly builds.

This does ring true to my experience when something just goes stupid. And when I say stupid, I mean, someone accidentally killed off pieces of welfare policy with a misflag once, and that's not even the stupidest thing I've had to test when the program was built and is still coded modularly and the coders are in different parts of the country and sometimes at home in India when working on this. And none of them ever know what anyone else is doing.

While I have no idea what amazon's model looks like, to do a rollback on a change for us, even a minor one, it goes like this:

1.) Report
2.) Reproduction in one of our environments.
3.) Code fix and discussion and so many meetings, God. (emergency releases may not go through this.)
4.) DEV environment 1 (theoretical construct of the program, works very well, nothing like the real thing)
5.) DEV environment 2 (closer to the actual program, but not by much) (sometimes Dev 1 and Dev 2 are not both used)
6.) SIT (sometimes skipped for emergency releases) (I have issues with their methodology.)
7.) User Acceptance (me! And some other people, somewhat close to field conditions with database as of last mass update, usually two to three months before)
8.) Prodfix (optional) (also me! And some other people, almost perfect mirror of field conditions with full database)
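
For the code-minded, the list above can be sketched roughly as follows. The stage names come straight from the list; the `Stage` class, the `route` function, and its two switches are made up for illustration. It also assumes an emergency release skips every stage marked skippable, which in practice only happens some of the time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    emergency_may_skip: bool = False  # the list marks this as skippable in an emergency
    optional: bool = False            # only run when someone asks for it


# Stage names follow the numbered list; the flags are my reading of its notes.
PIPELINE = [
    Stage("Report"),
    Stage("Reproduction"),
    Stage("Code fix and meetings", emergency_may_skip=True),
    Stage("DEV 1"),
    Stage("DEV 2"),
    Stage("SIT", emergency_may_skip=True),
    Stage("User Acceptance"),
    Stage("Prodfix", optional=True),
]


def route(emergency: bool, use_prodfix: bool) -> list[str]:
    """Return the stage names a change passes through, in order."""
    path = []
    for stage in PIPELINE:
        if emergency and stage.emergency_may_skip:
            continue  # assumption: emergencies skip every skippable stage
        if stage.optional and not use_prodfix:
            continue
        path.append(stage.name)
    return path


print(route(emergency=False, use_prodfix=False))
print(route(emergency=True, use_prodfix=True))
```

Even the shortest route here still has six stops, which is the point: there is no version of this where a fix goes from report to live in an afternoon.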

If it's really desperate, it goes to prodfix instead of or in addition to User Acceptance, which is the only environment we have that nearly-perfectly mirrors live field conditions and is fully updated with our field database as of five o'clock COB the day before. For me to do a basic test, they give me a (really horrifyingly short) version of events and if I get lucky, I get to see screenshots of the problem in progress.

[If I win the lottery, someone uploaded the specific patches themselves for me to look at, and I get to see what is going on pre-compiling. That has happened once. I did not take advantage of it. I kick myself sometimes.]

Usually, I get a fifth hand account that's gone through eight other people on what went wrong and what function I'm supposed to test and in what order to do it in. Depending on severity, I have four hours to four days to write the test (or several tests, or several variations of the same test for different user conditions, or different input conditions), send it to the person who owns the defect, have them check it, then I run the test in full, then fail or pass it. Or run it in full, fail or pass, then run it in prodfix, fail or pass it.

[Sometimes, I have a coder call me and we both stare in horror at our lot in life when both of us really don't know what the hell went wrong and hope to God this didn't break more things.]

The fastest I've ever seen an emergency release fix go through is three days from report to implementation, and at least once, we had a massive delay when they were too eager and crashed our database because the rollback didn't match the new information entered into the system since the problem started.
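
A toy illustration of why a rollback can collide with data entered since the problem started. This is not how that crash actually happened (I never saw those details); the records, the `flag` field, and both rollback functions are invented, and the "database" is just a dictionary.

```python
import copy

# Toy "database": record id -> fields. All names here are hypothetical.
snapshot = {
    1: {"name": "Record A", "flag": False},
    2: {"name": "Record B", "flag": False},
}

# The bad change flips a flag on the existing records...
live = copy.deepcopy(snapshot)
for record in live.values():
    record["flag"] = True

# ...and normal work keeps adding data before anyone notices.
live[3] = {"name": "Record C", "flag": True}  # entered after the bad change


def naive_rollback(snapshot):
    """Restore the old snapshot wholesale; anything entered since is gone."""
    return copy.deepcopy(snapshot)


def targeted_rollback(live, snapshot, field):
    """Revert one field on records that existed in the snapshot; keep newer records."""
    fixed = copy.deepcopy(live)
    for record_id, old in snapshot.items():
        if record_id in fixed:
            fixed[record_id][field] = old[field]
    return fixed


print(sorted(naive_rollback(snapshot)))                   # record 3 is missing
print(sorted(targeted_rollback(live, snapshot, "flag")))  # record 3 survives
```

Note that in the targeted version record 3 keeps whatever flag it was entered with, which is exactly the kind of thing somebody then has to check by hand, and part of why the careful route is slow.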

[And since this is welfare and under federal jurisdiction, the state gets fined by the feds when we cannot issue benefits correctly or have egregious errors. Feds are really, really politely nasty about this sort of thing. And OIG, who audits us for errors, hates this program like you would not believe. To say there is motivation for speed is to understate the case.]

The program I test is huge, and terrifyingly complicated, and unevenly coded, and we can easily crash the servers for incredibly stupid small-seeming things. Amazon is about a hundred times larger. We do four major builds and four minor (just like major, just with a different name) per year, plus upwards of thirty emergency releases between builds. Our releases aren't live but overnight batched when the program goes to low-use after 8 PM, so we have some leeway if something goes dramatically bad or our testing isn't thorough enough. Which you know, that also happens. Amazon is always up and while it has the same constant database updates we do, I'm betting also has more frequent normal code updates, both automatic and human initiated.

If this is actually what happened, then the delay in fixing it makes sense, at least in my experience. Unless they release live code without testing it in an environment that is updated to current database conditions, which um, wow, see the thing where we crashed the state servers? The state is cheap and they suck and even they don't try to do even a minor release without at least my department getting to play with it first and give yea or nay because of that.



Short version: this matches my testing experience and also tells you more than you ever wanted to know about my daily life and times. YMMV for those who have a different model for code releases and updates.

And to add, again, if this is true, I am seriously feeling for the tech dept right now. Having to do unplanned system-wide fixes sucks. Someone is leaving really unkind post-it notes for the French coder. Not that I ever considered doing that or anything.

ETA: For us, there are two types of builds and fixes: mod (modification) and main (maintenance). The former is actual new things added to the code, like, I don't know, adding an interface or new policy or changing the color scheme. Maintenance is stuff that is already there that broke and needs to be fixed, like suddenly you can't make a page work. Emergency fixes in general are maintenance, something broken that needs fixing, with occasional mods when the legislature did something dramatic.

None of this means they aren't lying and it wasn't deliberate. My department failed an entire build once due to the errors in it.

Actually, the easiest way to find out if it was deliberate is to hunt down whoever did their testing and check the scripts they wrote, or conversely, if amazon does it all automated, the automated testing scripts will also tell you exactly what was being tested. If it was deliberate, there were several scripts specifically created to test this change.

Example:

If I wrote the user script and was running it in a near-field environment.

Step Four: Query for Beauty's Punishment from main page.
Expected Result: Does not display.
Actual Result: Does not display.
(add screenshot here)

Step Five: Query for Beauty's Punishment from Books.
Expected Result: Displays.
Actual Result: Displays.
(add screenshot here)

We're like the evidence trail. Generally, a tester has to know what they are supposed to be testing to test it. If this was live beta'ed earlier this year with just a few authors, it still had to, at some point, go through some kind of formal testing procedure and record the results. And there would be a test written specifically to see if X Story Marked Adult would appear if searched from the main page, and one specifically written to check that X Story Marked Adult was showing sales figures, either human-run or automated.
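
If the testing were automated, steps four and five might look something like the sketch below. Everything in it is invented for illustration: the two-item catalog, the `search` function, and above all the assumption that a main-page search hides flagged items while a search from within Books still shows them. Nothing here reflects how amazon's search actually works.

```python
# Toy catalog and search, invented for illustration only.
CATALOG = [
    {"title": "Beauty's Punishment", "department": "Books", "adult": True},
    {"title": "Some Thriller", "department": "Books", "adult": False},  # placeholder title
]


def search(query, department=None):
    """Search from the main page (department=None) or from within a department.

    Assumption: the main-page search hides items flagged adult,
    while a department search still shows them.
    """
    results = []
    for item in CATALOG:
        if query.lower() not in item["title"].lower():
            continue
        if department is None and item["adult"]:
            continue  # flagged items are hidden from the main-page search
        if department is not None and item["department"] != department:
            continue
        results.append(item["title"])
    return results


def run_step(name, actual, expected):
    """Record one step the way a test script would: expected, actual, verdict."""
    verdict = "PASS" if actual == expected else "FAIL"
    return {"step": name, "expected": expected, "actual": actual, "verdict": verdict}


log = [
    run_step("Step Four: query from main page",
             search("Beauty's Punishment"), []),
    run_step("Step Five: query from Books",
             search("Beauty's Punishment", department="Books"),
             ["Beauty's Punishment"]),
]
for entry in log:
    print(entry)
```

The interesting part is the log, not the search: a step that expects "does not display" only exists if someone meant for it not to display.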

Great post! I'm linking this.

And I will say that only if amazon is being truthful on what happened with the coding. Though even if they're lying and meant to do this, the rollback will possibly be even more difficult, because it wouldn't be just one error that caused problems, but an entire build. My department has killed an entire build for errors before, and they had to do a month delay to implement changes.

Yeah, right: next they are going to say it was a social experiment.

You do not think it is a bit TOO easy considering how the media have been brainwashing us about blaming France for everything?

Xenophobia is not much prettier than homophobia, for anyone keeping track. I could possibly have believed this story if it were not playing on our well-known negative bias towards France.

Not that I think that French programmers are exempt from making mistakes, by the way, but I am a database admin myself and also design complex enterprise-level software, and blaming scapegoats in a different department/company/country is a well-known tactic when a system goes awry. You wouldn't believe how often I've seen it happen.

Oh, I agree, that's why I clarified--only if what they say is absolutely true would the delay make sense in terms of identification, patches, and testing.

(Deleted comment)
It does. A lot. For major builds during SIT testing, entire updates would collapse, and as I told trobadora, last August my department killed an entire build, and not even a major build, for being so error ridden it was making the program nearly impossible to use.

What troubles me more than what was flagged is that Amazon has an adult flag that I never knew about that makes items very difficult to find, and I can't turn it on or off. It's decided for me, unlike Google's "safe search" feature.

I'm wondering if the adult flag has even been active before now.

The problem with this theory is that Amazonfail only happened on the US site. The same books still had their ranks on the Canadian and the UK sites. That wouldn't be the case if this theory were true.

No, agreed, but the same delay would apply whether they are lying about it being deliberate or not. Even if they started on this first thing Sunday morning to patch it, it would still take some time to roll it back, and depending on how complex amazon is code-wise, and what else was changed at the same time, related or not, they have to remove the offending code, check the rest of it to see if it still integrates, then test it under close to field conditions.

Specifically, this doesn't defend them, just clarifies why they literally may not be able to do a full rollback very fast, even if they really, really want to.

Can I just say I think they're lying through their corporate teeth? *g* I do feel for all the Amazon people who weren't involved in the screw-up, though. This must have been a hellish day for them.

I don't particularly believe it, to be honest, but the same process would apply for a deliberate change or an accidental one. They'll have to pick apart this update and re-test all the components together without the offending code, and that can take a few days.

I agree with your caveat, and your experience matches mine.

I'm the software release manager for a very tiny arm of a Fortune 10 company and knowing both how we work and how the rest of the enterprise works, I find "it takes 3 days to change this back" credible as well.

Someone with my responsibility sat in a meeting and argued for just another day of testing, and described the process for dealing with this kind of thing, but it still comes down to someone making a call about how much is needed and how much heat the company can take.


We love you that argue for us, seriously. I've had (very careful) tantrums about rushing through emergency fixes that only do a partial fix with workaround, or we weren't given time to explore more scenarios, or the fix explanation wasn't comprehensive (or comprehensible) enough.

Someone with my responsibility sat in a meeting and argued for just another day of testing, and described the process for dealing with this kind of thing, but it still comes down to someone making a call about how much is needed and how much heat the company can take.

Yes. I've honestly wished at least the test supervisors could go to some of the meetings to explain why release is a bad idea until we can check a few more things. I've had to use a single test run for three or four separate variables and I absolutely hate having to do that. It will always come up two or three builds later when a user finds out it affected something entirely unexpected that might have been caught if we'd a.) had more information or b.) had more time.

this matches my testing experience and also tells you more than you ever wanted to know about my daily life and times.

I'm enough of a geek that I actually found that little explanation very interesting. Admittedly, the closest I get to coding is fiddling with our Access database (teeny, tiny thing, but I get the whole "I changed this little thing in X, so why the heck have the figures for A DROPPED OFF THE FACE OF THE EARTH?!" aspect and that sometimes, big things go wrong without anybody being entirely sure why or precisely how to fix it). But, still, it was interesting to read.

teeny, tiny thing, but I get the whole "I changed this little thing in X, so why the heck have the figures for A DROPPED OFF THE FACE OF THE EARTH?!" aspect and that sometimes, big things go wrong without anybody being entirely sure why or precisely how to fix it

For maintenance items, most of the time, this is how it starts. When I look up a defect (error in the program) that is in the process of being fixed, there's a log with comments as it moves from the help desk to the coders to development testing, and it usually takes a while to identify the specific issue. And even then, often it takes a while to figure out how to fix it without killing the program, especially if it's integrated with eight other things.

And that doesn't even include the arguments between coders, policy specialists, analysts, and etc.

I don't buy it. IMO no "accident" theory explains the letters authors received about adult content, or why non-explicit material about homosexuality was unranked, but explicit heterosexual material wasn't. You might be interested in this article if you haven't seen it already. Link from penknife.
http://www.feministing.com/archives/014797.html
Amazon Rep: This was not a "glitch"

Edited at 2009-04-14 04:32 am (UTC)

Yes, agreed. It's probably not a glitch, and likely if people were affected earlier, was being beta tested live. However, what I am saying here is that the process of reverting the code is the same either way. It will take a few days to pull out the code and alter all the programming that was altered so that the code would work and then test to make sure when they take it live, it will not crash.

If they are, in fact, going to remove the code. This won't apply if they have no intention of changing it. But if they do, again, it will still need time for the code to be removed, code to be rewritten, and code to be tested.

Amazon aside, that's a wonderful explanation of UAT. At what stage does your organization do load testing?

That's theoretically the unix/batch people across the hall who also update our environments. That's where things get complicated, since it's a combination of state employees, state contractors (at will employees of a state project), and the company we outsource the coding to.

From what I understand, we do load testing and--balance testing? There's another term for it--every week period because our servers suck and they go down a *lot*, as in daily. During the big updates, they do it with SIT and potentially (this is the part that tends to be weird), the night before a build goes live (usually Saturdays). There's also random testing when the servers go down more than two hours.

Now what environment they do it in is a mystery. We have two separate UAT environments, one specific for interface testing, and the other for general testing. I'd always assumed that prodfix, being almost-field conditions, was the one they used for that, since we only use it when specific tests need to be run on it and leave it alone otherwise. However, it doesn't get new code until we pass it in UAT.
Interestingly, I'll soon know more than I want to about it, because the state does not have this program in use state-wide, just in specific locations with a very, very slow, constantly delayed rollout. Right now, the userbase is comparatively tiny, and adding even a few counties will crash us fairly consistently for days. We're adding more soon, at which time our environments will collapse about once every ten minutes and many frantic emails will be sent across the hall. *G* Including from me.

*it is as if seperis speaks in foreign language* Wow. I... kind of understood parts of that.*g*

This whole thing is pretty damn amazing. I did notice one thing tonight (apropos of nothing in your post I don't think). I'd ironically just performed a ton of searching in m/m fiction just a couple of days ago -- mainly due to the release of two friends' new books. And when I signed on today, voila! it tells me my last search, recs books accordingly, and... it's basing everything off the ONE non-m/m book I looked at. Even though I looked at about 50 m/m, lol. Now, it *could* be that that was the one I looked at last, but I really, really don't think so. So it's possibly not only doing all the things everyone else has noted, but even remembering customers wrong. LOL, it's like, "no, you did not really want to find gay erotica, that was just a figment of your imagination...let me rec you all these (totally uninteresting to me) thrillers!"

Anyway, thanks for the fascinating look into your world, and how something like this might look at a micro level.

Edited to clarify what in the world I was talking about with it using my prior search.





Edited at 2009-04-14 05:25 am (UTC)

I gather that's one of the effects of eliminating rankings -- it not only keeps books from appearing near the top of search results, but refuses to consider them when creating recommendations. So when all those m/m books you'd looked at got deranked, the non-m/m was the only one "available" to use to create recs from.

I still really want a checkbox to just turn their net nanny off. It pisses me off that they filter my searches without telling me. It's like lying to my face when I ask their store to look up a book for me. This whole method is fishy. I mean, even if, say the hypothetical "Gary screws Larry IV" was flagged as adult for valid reasons, it should come up if I search for "Gary screws Larry". Or it should be clear to me why it doesn't if they want to offer users the option to not display porn, that's fine, but I shouldn't only be noticing when they accidentally filtered out a ton of books because something went massively wrong, rather than having just a few blacklisted titles they try to hide.

I never realized they had an adult filter, because they don't tell you this (at least not anywhere prominent), and I'm suspicious that they'll now backpedal in public and "fix" this for the well known books and so on, the ones that they actually didn't want to give the leper flag, but that in the end they'll still stick with this forced "adult content filtering" policy, for books that are less well known and/or have actual sex in them, those that need the search functions and related books displays and such the most for their exposure compared to books that won awards and had movie deals. But because it'll only be fewer titles again, and Amazon "fixed" the hack'n'slash method of removal (or misflagging or whatever it was) the outrage will have died down, when they quietly mess with the rankings again and you don't even notice that they filtered some gay stripper memoir you never heard of from your suggestion list again.

I still really want a checkbox to just turn their net nanny off. It pisses me off that they filter my searches without telling me.

Yes, this.

If they reply to my mail that they were filtering for adult content, "where do I turn the filter off" will be my next mail to them.

Hah! This is lovely and detailed in the way that makes me happy in my parts.

This is very interesting. Thank you for writing it up.

I get that this problem *could* have unfolded as they outlined, save that: other countries, besides the US, were not affected as systemically. I checked the Canada site when it came up and could general search the bios, for example. Also, what about the author who e-mailed Amazon and got the response that it was a policy to filter for adult content? Which again could have been someone answering without really looking at the problem, but still.

I don't know that I'm willing to entirely trust their explanation, plausibility aside.