Let’s Nuke These Scraper Sites – ORGANIZING CREATIVITY

“I say we take off and nuke the entire site from orbit. That’s the only way to be sure.”
Ellen Ripley in “Aliens”

There are a lot of good reasons to blog. You put your work out there and frequently you get interesting questions and encouraging feedback. It can bring you new ideas and motivate you to carry on. Without the feedback I received here I would not have written the second edition of “Organizing Creativity”.

But there are also two negative things that come with blogging: spam and content theft via scraper sites. I have written about spam in the previous posting, so let’s focus on content theft.

Scraper Sites

While spam is annoying, content theft, especially via scraper sites, really gets me. Scraper sites are often easy to find, as they copy whole postings verbatim. They frequently use the RSS feeds most blogs offer automatically to take your content and post it 1:1 on ‘their’ blog. Why? Often to draw visitors and thus make money with advertising. Usually the import distorts the formatting, and while they copy the text, they usually link to the images on your blog, costing you bandwidth and giving visitors a crappy reading experience.

Some scrapers include source information (a link to the URL of the original article) in a vain attempt to give the theft some legal polish. However, in my view (but I am not a lawyer) this is still theft. It’s not fair use to copy postings entirely, no matter whether you give the source information or not.

How to Find Scraper Sites That Steal Your Content

If they include the source information, the scraper sites show up under the incoming links in the WordPress dashboard (which come from Google). You can also search Google for incoming links yourself; just search for

link:http://www.YOURWEBSITE

Otherwise simply go to one of your articles, select a sentence and google it (within quotation marks). If scraper sites have targeted your blog, you will find them this way.
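
If you want to check several sentences, building the search links by hand gets tedious. Here is a minimal Python sketch of the quoted-sentence search; the function name and the Google URL pattern are my own assumptions for illustration, not anything WordPress or Google provide for this purpose.

```python
from urllib.parse import quote_plus


def exact_phrase_search_url(sentence: str) -> str:
    """Return a Google search URL that looks for the sentence as an exact phrase."""
    # Collapse line breaks and repeated spaces so a sentence copied
    # from a posting is searched as one continuous phrase.
    phrase = " ".join(sentence.split())
    # The surrounding double quotes ask the search engine for an exact match.
    return "https://www.google.com/search?q=" + quote_plus(f'"{phrase}"')


print(exact_phrase_search_url("There are a lot of good reasons to blog."))
```

Open the printed link in a browser; any hit that is not your own blog deserves a closer look.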

Dealing With Scraper Sites

There are a few things you can do against content theft/scraper sites — and you should. If bloggers make it harder for the content to be stolen and send out DMCA notices (gasp!), they can poison the water for these scrapers.

1. Disable Hotlinking

Most scrapers copy the text but keep the links to the images on your blog, costing you bandwidth. However, you can disable hotlinking, meaning that images on your site will be shown on your site, but only on your site. If someone tries to show your images on their blog by using the file URL, it won’t work. There are some good instructions online on how to prevent hotlinking. Note that the image you show instead must itself not be protected from hotlinking (otherwise you can cause a redirect loop). You also need to add code to the .htaccess file, which is probably not something for newbies. With my blogs this worked very well once I followed those instructions and added the following code

# Enable URL rewriting
RewriteEngine on
# Let requests without a referrer through (direct access, some feed readers)
RewriteCond %{HTTP_REFERER} !^$
# Let requests from my own sites through (dots escaped so they match literally)
RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?organizingcreativity\.com [NC]
RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?organisingcreativity\.com [NC]
RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?ipsych\.org [NC]
RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?arkofideas\.org [NC]
RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?themobilescientist\.org [NC]
# Everyone else gets redirected to the replacement image
RewriteRule \.(jpg|jpeg|png|gif)$ http://xchange.ipsych.org/nohotlinking.png [NC,R,L]

to the .htaccess file in the directory of the WordPress installation (not the root directory). If you use an FTP program like Cyberduck, note that you have to enable “show hidden files” (usually under “View”). I think it’s really useful to do this, but take care and make backups prior to any changes you make (and as usual, anything here is without warranty).
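
If the rewrite conditions look cryptic, the logic behind them can be written out in a few lines of Python. This is only an illustration of what the rules decide, not something you install on the server; the function name and the shortened domain list are my own assumptions.

```python
import re

# Shortened, hypothetical allow list; the real rules list every domain of mine.
ALLOWED_DOMAINS = ["organizingcreativity.com", "ipsych.org"]
IMAGE_PATTERN = re.compile(r"\.(jpg|jpeg|png|gif)$", re.IGNORECASE)


def is_hotlink(path: str, referer: str, allowed=ALLOWED_DOMAINS) -> bool:
    """Return True if the request would be redirected to the replacement image."""
    # The RewriteRule only applies to image files.
    if not IMAGE_PATTERN.search(path):
        return False
    # First condition: an empty referrer is let through.
    if referer == "":
        return False
    # Remaining conditions: a referrer starting with one of my own domains is fine.
    for domain in allowed:
        pattern = r"^https?://(www\.)?" + re.escape(domain)
        if re.match(pattern, referer, re.IGNORECASE):
            return False
    # Every other referrer counts as hotlinking.
    return True


print(is_hotlink("/wp-content/uploads/figure.png", "http://scraper.example/stolen-post"))
print(is_hotlink("/wp-content/uploads/figure.png", "http://www.organizingcreativity.com/"))
```

The first call stands for a scraper page embedding the image, the second for a page on my own blog.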

2. Limit RSS Feeds

A scraper site usually simply imports the RSS feed of your site, which is otherwise a very useful way for readers to keep informed about your blog without having to visit it each day. By default the whole content of your article is transmitted via the RSS feed, giving not only your readers but also the scraper sites easy access to the content. Thus, you should limit the amount of information that is transmitted via the RSS feed to a summary (the first few lines). This will still tell readers that a new entry is available, but will make the content worthless for scraper sites.

In WordPress, go to “Settings” – “Reading” and, under “For each article in a feed, show”, select “Summary”.
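
To check whether the setting took effect, you can look at what the feed actually contains. The sketch below assumes a WordPress-style RSS 2.0 feed in which the full article travels in a content:encoded element and the summary in description; that is my understanding of how these feeds are built, so treat it as an assumption. The two tiny sample feeds are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the RSS "content" module that carries full article text.
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"

SUMMARY_FEED = (
    '<rss version="2.0"><channel><item><title>Post</title>'
    "<description>The first few lines</description></item></channel></rss>"
)
FULL_FEED = (
    '<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">'
    "<channel><item><title>Post</title>"
    "<description>The first few lines</description>"
    "<content:encoded><![CDATA[<p>The whole article</p>]]></content:encoded>"
    "</item></channel></rss>"
)


def feed_exposes_full_text(feed_xml: str) -> bool:
    """Return True if any feed item carries a content:encoded element."""
    root = ET.fromstring(feed_xml)
    tag = f"{{{CONTENT_NS}}}encoded"
    return any(item.find(tag) is not None for item in root.iter("item"))


print(feed_exposes_full_text(SUMMARY_FEED))
print(feed_exposes_full_text(FULL_FEED))
```

For a real check, save your own feed to a file and pass its text to the function instead of the samples.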

3. DMCA

I’m no fan of the music industry, or of other industries that try to protect an obsolete business model via intimidation. But we are not talking about a teenager who is sued for the millions of dollars he “would have otherwise spent on the music he copied” (yup, sure), but about outright theft. And I see the need for legal action against it. Scraping is not a mashup; there is no additional creative element. It’s just theft, plain and simple. And theft should be stopped, so DMCA notices are the way to go.

Unfortunately, there is no simple way to do this. You have to find out where the site that is stealing your content is hosted and then inform the provider that this site stole your content. It’s relatively easy with large hosting services like Blogger or WordPress, as they have an online form you can fill out:

Blogger (Google): http://support.google.com/bin/request.py?contact_type=lr_dmca&product=blogger
WordPress (Automattic): http://automattic.com/dmca/
Typepad: send an email to copyright@typepad.com

While Blogger and WordPress delete only the specific posting but leave the blog intact, Typepad actually killed the blog itself (yeah! okay, let’s face it, there wasn’t any original content and the stealing was done automatically via a script, these ‘blogs’ have to go). Personally, I think this is the way to go. It does not take more than a few seconds to identify a blog as a scraper site with no content of its own, and these blogs should be killed. So, yeah, way to go, Typepad, let’s nuke them!

Note that this only works if the blogs are actually hosted at these companies. Just because it’s a WordPress blog (i.e., it uses the software and you see a “Powered by WordPress” at the bottom of the page) does not mean that it is also hosted at WordPress (for example, this blog uses the WordPress software, but it is not hosted at WordPress, so Automattic cannot do anything). In these cases you have to find the name of the provider, e.g., via whois, and send them a DMCA notice (you can use the text from one of the forms above).
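
The whois lookup can be scripted roughly like this. It is a sketch under assumptions: it expects the whois command-line tool to be installed, the function names are mine, and it looks up the IP address rather than the domain name, because the owner of the IP range is usually the hosting company. If the site sits behind a content delivery network, the output will name that network instead of the actual host.

```python
import socket
import subprocess
from urllib.parse import urlparse


def hostname_of(url: str) -> str:
    """Extract the host name from a URL, with or without a scheme."""
    parsed = urlparse(url if "://" in url else "http://" + url)
    return parsed.hostname or ""


def whois_for_host(url: str) -> str:
    """Resolve the site's IP address and return the raw whois output for it."""
    ip_address = socket.gethostbyname(hostname_of(url))
    # Assumes a `whois` executable is available on the PATH.
    result = subprocess.run(
        ["whois", ip_address], capture_output=True, text=True, check=False
    )
    return result.stdout


# Example (needs network access and the whois tool):
# print(whois_for_host("http://scraper.example/stolen-post/"))
```

In the output, look for fields such as the organization name and an abuse contact address; that is where the notice goes.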

Hmm, if I ~~have~~ make some time I’ll probably write a nice HTML frame. It will show the takedown forms in the frame on the left side, and a couple of fields/buttons in the frame on the right. One button will copy my legal information (address and the like) into the forms. I would only need to copy the link to the scraper site posting and the original posting manually … this should reduce the amount of effort to fill out a DMCA notice.
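
Until that exists, the same idea works with a small text template: the contact details are stored once and only the two links change. The following is a rough sketch, not the frame described above. The wording of the notice is placeholder text I made up (and I am not a lawyer); in practice, use the text from the provider’s own form. All names and addresses in the example are hypothetical.

```python
from string import Template

# Placeholder wording only (an assumption, not legal advice);
# replace it with the text of the hosting provider's own form.
NOTICE_TEMPLATE = Template(
    "DMCA takedown notice\n"
    "\n"
    "Original work (my posting): $original_url\n"
    "Infringing copy: $infringing_url\n"
    "\n"
    "I have a good faith belief that the use described above is not\n"
    "authorized by the copyright owner, its agent, or the law.\n"
    "\n"
    "$name\n"
    "$address\n"
    "$email\n"
)


def build_notice(original_url: str, infringing_url: str, contact: dict) -> str:
    """Fill the template with the two links and the stored contact details."""
    return NOTICE_TEMPLATE.substitute(
        contact, original_url=original_url, infringing_url=infringing_url
    )


# Hypothetical contact details, stored once and reused for every notice.
MY_CONTACT = {
    "name": "Jane Doe",
    "address": "1 Example Street, Example Town",
    "email": "jane@example.org",
}

print(build_notice("http://www.example.org/my-post/",
                   "http://scraper.example/stolen-post/", MY_CONTACT))
```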

Anyway, if you have a blog, have a look at who steals your content, and nuke them.

(Note that this posting refers specifically to scraping sites. Someone who links to a posting or quotes part of it — that’s okay. But copying whole postings from different blogs just to make money via advertisement is not. This blog is advertisement free and I refuse to let my writing be used in combination with advertisement.)
