>I really don't get the harsh kickbacks and complaints against things like "...

Jach · on Oct 18, 2011

Misrepresentation is similar to neglect in that it's typically necessary for lawyers to argue and juries to decide. In this actual case, to repeat count's analogy, why aren't filesystem users complaining that their deleted data isn't "really deleted" (ignoring that such FS software is probably a no-warranty thing)? I've had to recover rm'd data a number of times myself. I don't think "deletion" has ever meant an expectation of "completely wipe out any traces of" in the digital world. I wouldn't put it past a lawyer to be able to convince a jury otherwise though, but it'd set a dangerous precedent.

pyre · on Oct 18, 2011

In a filesystem, the data isn't scrubbed, but it's marked as space that can be overwritten. While complete removal doesn't happen at once, it's slated for complete removal at some undetermined time in the future. In the case of Facebook, they are making a conscious decision that they do not want this data to ever disappear. I doubt very much that there are policies to remove data marked as 'deleted' after a set time period, or plans to implement something like that in the future. They keep the data around because it is still useful to them, even though it is no longer useful to the user.
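To make the filesystem half of that comparison concrete, here is a minimal sketch of a toy block store in which delete only frees the block and the old bytes survive until a later write reuses it. `ToyDisk` and its methods are hypothetical names made up for illustration; no real filesystem is this simple.

```python
class ToyDisk:
    """Toy block store: delete only frees blocks; bytes stay until reused."""

    def __init__(self, n_blocks):
        self.blocks = [None] * n_blocks    # raw contents of each block
        self.free = list(range(n_blocks))  # block indices available for writes
        self.files = {}                    # filename -> block index

    def write(self, name, data):
        idx = self.free.pop(0)             # take the first free block
        self.blocks[idx] = data
        self.files[name] = idx
        return idx

    def delete(self, name):
        idx = self.files.pop(name)         # drop the directory entry
        self.free.insert(0, idx)           # block is reusable, data untouched
        return idx


disk = ToyDisk(2)
idx = disk.write("a.txt", b"secret")
disk.delete("a.txt")
recoverable = disk.blocks[idx]             # old bytes are still on "disk"
disk.write("b.txt", b"new data")           # this write reuses the freed block
after_reuse = disk.blocks[idx]             # old bytes are now gone
```

A soft-delete flag in a database differs in exactly the way described above: nothing ever marks the row's storage as reusable, so the second step never happens unless someone builds it.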

0x12 · on Oct 19, 2011

And besides that, it is not the manufacturer of the filesystem that is in control, but the person that installed it, likely the same person that is deleting the files.

For web-based services the rules change dramatically, because you are no longer in control of the data. Because the past has shown that companies seem to have a hard time playing nice with the data they store on behalf of their unsuspecting users, there now is, in some parts of the world, a government entity tasked with precisely that: making sure that users' rights with respect to their data are respected.

If you don't like the way your filesystem deletes the data you can always cut up the platters.

dredmorbius · on Oct 19, 2011

Above and beyond that, any non-trivial web architecture is going to involve multiple tiers of data and caching.

A given text object will exist in the primary database, in its replicas or clusters, and in backups. If the outfit is at all legitimate, there will be multiple backups representing frequent points in time, stored in multiple locations.

A binary object (say an image, video, or audio file) may exist in its originally uploaded format, several variants of different size, resolution, sampling rate, etc., and is often served through some sort of a content distribution network (CDN), which will have its own content management interface. Some of these are surprisingly primitive -- web-based forms in which a few score objects might be entered at a time, if you're lucky. Even script-driven purge methods are frequently limited as to the number of objects which can be included in a single request, and the number of outstanding requests which may be pending.
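As a rough sketch of what those purge limits mean for a script: the keys have to be split into requests of bounded size, and the requests into "waves" of bounded concurrency. The function names, the limits, and the object keys below are all assumptions for illustration and do not correspond to any particular CDN's API.

```python
from typing import Iterator, List, Sequence


def purge_batches(keys: Sequence[str], max_per_request: int) -> Iterator[List[str]]:
    """Split object keys into purge requests of at most max_per_request keys."""
    for start in range(0, len(keys), max_per_request):
        yield list(keys[start:start + max_per_request])


def purge_waves(keys: Sequence[str], max_per_request: int,
                max_pending: int) -> List[List[List[str]]]:
    """Group requests into waves of at most max_pending outstanding requests."""
    batches = list(purge_batches(keys, max_per_request))
    return [batches[i:i + max_pending]
            for i in range(0, len(batches), max_pending)]


# Hypothetical example: 10 objects, 3 per request, 2 requests in flight.
keys = [f"img/{n}.jpg" for n in range(10)]
waves = purge_waves(keys, max_per_request=3, max_pending=2)
```

Each wave has to finish before the next one can be submitted, which is where the wall-clock time goes once the key count gets large.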

Given the large numbers of individual objects, scaling variations, redundancy, etc., deletion overhead can easily scale to tens to hundreds of millions of objects in a relatively short period of time (days to weeks). Dealing with all of this is fairly non-trivial. Especially if the site architecture didn't take these needs into consideration.
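A back-of-the-envelope version of that arithmetic, with every figure invented purely for illustration (none of these numbers come from any real site):

```python
def copies_to_purge(deleted_objects: int, variants_per_object: int,
                    storage_locations: int) -> int:
    """Each deleted object exists as several variants in several locations."""
    return deleted_objects * variants_per_object * storage_locations


# Assumed figures: 1M deleted objects in a week, 5 renditions each
# (original plus resized variants), held in 20 places (origin, replicas,
# CDN nodes, backups).
weekly_purges = copies_to_purge(1_000_000, 5, 20)
```

With those assumed inputs the result is 100,000,000 individual copies to track down in one week.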

pyre · on Oct 19, 2011

  > Dealing with all of this is fairly non-trivial.

To Facebook's benefit, of course. I'm sure that Facebook would never think of using any user data flagged as 'deleted' in any sort of data mining...

Facebook also has no incentive to spend the time to figure out how to do deletions because the data is valuable to them. Why would they spend time and effort to make it possible to lose this valuable data?

kiiski · on Oct 19, 2011

But if you delete something from the primary database, then I would assume it will eventually get deleted from others too. While it might never be intentionally deleted from backups, the backups will be overwritten with newer ones eventually.

(I'm talking about Facebook here, not about the web in general)

dredmorbius · on Oct 20, 2011

That depends on how the databases are architected and tiered.

If they're proper slaves / replicas of one another, then yes.

If, as is commonly the case especially for marketing data, periodic cuts or dumps of the data are made at various points in time, and there's no mechanism for propagating deletions throughout the chain, then no, you're not assured of deletion. This isn't likely to be the case for a site's primary database, but could very well be the case for derived datasets. I can think of instances with, say, credit bureau reports in which erroneous data must be repeatedly deleted because it keeps getting re-injected into the system.
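A minimal sketch of that re-injection problem, assuming a made-up `merge_dump` step that folds an older data cut back into the primary store. Without a record of deletions (a "tombstone" set here, which is one common approach and not necessarily what any given site does), the deleted record comes straight back.

```python
def merge_dump(primary, dump, tombstones=frozenset()):
    """Copy records from an older dump into primary, skipping tombstoned keys."""
    merged = dict(primary)
    for key, value in dump.items():
        if key in tombstones:
            continue                     # deletion is remembered, so skip it
        merged.setdefault(key, value)    # only fill in keys primary lacks
    return merged


primary = {"u1": "ok", "u2": "erroneous"}
dump = dict(primary)                     # periodic cut taken before deletion
del primary["u2"]                        # deletion happens only in primary

naive = merge_dump(primary, dump)                          # "u2" is back
guarded = merge_dump(primary, dump, tombstones={"u2"})     # "u2" stays gone
```

The guarded version only works if the tombstones themselves are propagated to every place a dump can be merged, which is the mechanism the comment says is often missing.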

Facebook's September, 2010 outage in which cached data were being re-injected into the system exhibited a similar problem of cache coherence. http://www.facebook.com/note.php?note_id=431441338919

0x12 · on Oct 19, 2011

And at a minimum it should no longer be trivially accessible to the company via their 'normal' procedures.

Locke1689 · on Oct 19, 2011

To what extent is this just a property of Paxos that lacking delete synchronicity can cause data unavailability or differing levels of data availability? Moreover, how would that affect the Paxos read/write synchronicity? Would you have to disable read caching on the geographic layer?

I'm sorry, but this seems like huge ignorance on your part of how DSes are designed. These issues are important.

pyre · on Oct 19, 2011

So there is a way to synchronize data creation and data updates, but not data deletion? Really?

Locke1689 · on Oct 19, 2011

Data creation is possible and easy, especially in append only filesystems. Data updates can be done immutably by doing creates and rewriting the pointer structure, which obviously destroys cache fetch for highwater mark objects but doesn't affect cache marks for global id'd objects.
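One possible reading of "creates plus rewriting the pointer structure", sketched as a toy append-only store. `AppendOnlyStore` is a hypothetical class for illustration only; it does not model Paxos, caching, or any real system's layout.

```python
class AppendOnlyStore:
    """Toy store: values are only ever appended; a pointer names the latest."""

    def __init__(self):
        self.log = []     # append-only records; entries are never modified
        self.head = {}    # name -> index in log of the current version

    def put(self, name, value):
        self.log.append((name, value))       # a create, even for an "update"
        self.head[name] = len(self.log) - 1  # rewrite the pointer
        return self.head[name]

    def get(self, name):
        return self.log[self.head[name]][1]

    def delete(self, name):
        del self.head[name]                  # only the pointer goes away


store = AppendOnlyStore()
v1 = store.put("status", "first draft")
v2 = store.put("status", "edited")           # update = new record + repoint
current = store.get("status")
store.delete("status")
leftover = [value for _, value in store.log] # both versions are still stored
```

In this sketch a delete is cheap to express but removes nothing from the log, so actually erasing the bytes would need a separate compaction step that the design does not otherwise require.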

Have you actually read the Paxos papers and the rest of the literature on this?

zeteo · on Oct 19, 2011

It's not the same thing, I believe. When an OS deletes a file, it's actually deleted as far as the OS is concerned (thus they do make a best effort). When FB "deletes" something, they only make it unavailable to you, but they retain the option of using it themselves.

count · on Oct 19, 2011

That's a good point, actually. Do we know for a fact if FB keeps the data available for use in their own algorithms/processes, even if they don't allow its display?

div · on Oct 19, 2011

What matters is that they have the option of using it. They may not do so today, but tomorrow is a brand new day.