Live.com's Referrer Spam Has Left Me In Despair!

illustration
Sums It Up, Really

Back Then

A few months ago I noticed odd hits coming from live.com - hits from searches for very common words such as "about", "content", or "example". My initial thought was that live.com is just a bad search engine (and no one should ever use it). Searches for such vague terms shouldn't turn up my site anywhere among the first 1000 result pages. They really shouldn't. Well, I decided to ignore it for the time being.

Some weeks later the number of hits from live.com increased - again with the same kind of nonsense queries. After a quick log check I knew they were all coming from the same IP range (65.55.165.*), which resolves to "livebot-65-55-165-*.search.live.com".
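The quick log check described above can be sketched roughly like this. It is only an illustration: it assumes the Apache "combined" log format (the post does not say which format was in use), and the sample lines are made up. Only the IP prefix 65.55.165. and the live.com referrer come from the post.

```python
import re
from urllib.parse import urlsplit

# Assumption: Apache "combined" log format; the post does not say which format was used.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ '
    r'"(?P<referrer>[^"]*)" "[^"]*"$'
)

BOT_PREFIX = "65.55.165."  # the IP range named in the post


def is_livebot_hit(line):
    """Return True if the hit comes from the bot range and carries a live.com referrer."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return False  # unparseable line: do not count it
    host = urlsplit(match["referrer"]).hostname or ""
    from_live = host == "live.com" or host.endswith(".live.com")
    return match["ip"].startswith(BOT_PREFIX) and from_live


# Hypothetical sample lines, made up for illustration only.
SAMPLE = [
    '65.55.165.12 - - [10/Feb/2008:12:00:00 +0000] "GET / HTTP/1.1" 200 5120 '
    '"http://search.live.com/results.aspx?q=about" "Mozilla/4.0"',
    '192.0.2.7 - - [10/Feb/2008:12:01:00 +0000] "GET / HTTP/1.1" 200 5120 '
    '"http://search.live.com/results.aspx?q=referrer+spam" "Mozilla/5.0"',
    '65.55.165.12 - - [10/Feb/2008:12:02:00 +0000] "GET / HTTP/1.1" 200 5120 '
    '"-" "Mozilla/4.0"',
]

bot_hits = sum(is_livebot_hit(line) for line in SAMPLE)
print(bot_hits)
```

Matching on the IP prefix alone would also work; checking the referrer host as well keeps ordinary crawl requests from the same range out of the count.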

Putting it Into Context

Since my blog isn't that popular yet, it was rather easy to evaluate the access logs completely. In February I got a total of 151,302 referrers. Less than 0.4% of them were spam or invalid (e.g. nonsense referrers injected by silly personal firewalls).

Out of 436 referrer spam hits 64% are from live.com. The remaining 36% attempted to promote 34 other domains:

referrer spam by source (36% other - 64% live.com)
Figure 1: Referrer Spam by Source

Out of 285 hits from live.com only 8 were (probably) human:

user/spam ratio (2.8% users - 97.2% referrer spam)
Figure 2: User/Spam Ratio
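For anyone who wants to re-check the percentages, here is a small sketch of the arithmetic. The four counts are taken from the text. The one assumption is that the 64% share counts only the non-human live.com hits (285 minus 8), which is the reading under which the figures line up.

```python
total_referrers = 151_302  # all referrers in February
spam_hits = 436            # referrer spam hits
live_hits = 285            # all hits from live.com
live_human = 8             # live.com hits that were (probably) human

# Assumption: the live.com share of spam excludes the 8 human hits.
live_spam = live_hits - live_human

spam_share = spam_hits / total_referrers   # about 0.29%
live_share_of_spam = live_spam / spam_hits  # about 63.5%, shown as 64% in Figure 1
human_share = live_human / live_hits        # about 2.8%, as in Figure 2

print(f"{spam_share:.2%} {live_share_of_spam:.1%} {human_share:.1%}")
```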

Live.com is clearly the market leader for, well, referrer spam. That really isn't desirable if you ask me.

What Were They Thinking?

The only plausible motives that come to mind are:

  • The usual tactic to increase back-links (via publicly accessible statistics) - everyone knows this from other spammers.
  • Create the illusion of being relevant.

After doing some searches it seems that it actually boils down to this. That's just great, really.

As if that weren't embarrassing enough already, they tried to come up with awkward excuses.

Their justification for all that referrer spam is... *dramatic pause*... cloak detection. Yes, really. "Cloak detection", they say. Detecting cloaking from a single IP range with a specifically formatted query string - as if. Requests from real users are formatted differently (I never saw one that looked like the bot's requests) and they also won't come from 65.55.165.*. Cloaking your cloaking (heh) from this kind of cloak detection is about as easy as it can get.

But it Even Gets Better

"Pollute HTTP logs with inappropriate terms - Another unfortunate issue is that we were using a common list of keywords for our testing that was not site specific. We have tuned this list and you should no longer see any keywords used that are not related to the content of your site."

Actual Effect: Makes it look like legitimate hits.

"While we work on addressing your concerns, we would request that you do not actively block the IP addresses used by this quality check; blocking these IP addresses could prevent your site from being included in the Live Search index."

Actual Effect: Mafia-esque blackmailing makes you feel warm and fuzzy inside. Well, not necessarily.

Closing Words

Personally I think other search engines should ban *.live.com completely, because they are apparently spammers. Therefore they shall be removed from the index for all eternity. Same rules for everyone, right?

Cloak detection is a red herring. How is that supposed to work if it's so easy to detect? Of course many people won't bother because live.com/msn is completely irrelevant. Which brings me to my next point: If blocking the bot gets you removed from the index anyways, maybe one should block the msnbot (which sends you the live.com spambots) completely.
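If you go down that road, robots.txt is the usual place to do it, assuming msnbot actually honors it (the post itself doesn't verify that). Below is a minimal sketch that uses Python's standard urllib.robotparser to check that a rule set shuts out msnbot while leaving other crawlers alone. The rules shown and the "Googlebot" comparison are illustrative assumptions, not something taken from my setup.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule set: shut out msnbot, leave every other crawler alone.
ROBOTS_TXT = """\
User-agent: msnbot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch() answers: may this user agent request this path?
msnbot_allowed = parser.can_fetch("msnbot", "/")
other_allowed = parser.can_fetch("Googlebot", "/")
print(msnbot_allowed, other_allowed)
```

This only checks what the rules say; whether a given bot respects them is a separate matter.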

Well, you can always ignore the issue. It's only a few hundred hits each month anyways, but boy... it's so lame! However, if you do so be sure that your logs/statistics aren't publicly accessible - don't reward spammers.

Comments

I had such problems with msnbot some years ago

I had such problems with msnbot some years ago: lots of hits from msn. I checked whether msn had any results for my site, but even when I typed in the name of my web site there was nothing. The MS engine was just sucking up bandwidth for nothing, and never read the classic robots.txt.

I quickly resolved the problem by adding this to a .htaccess file:

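# Flag any request whose User-Agent starts with "msnbot" (case-insensitive)
SetEnvIfNoCase User-Agent "^msnbot" bad_bot
# Order/Allow/Deny syntax (Apache 2.2 style): allow everyone, then deny flagged requests
<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>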

I have never seen live.com on my site. I suppose it uses the same buggy msnbot agent as msn?

I do the same for ugly & crappy myspace web sites by using:

# Patterns are unanchored regexes, so these two lines match any referrer containing either host name
SetEnvIfNoCase Referer "profile\.myspace\.com" myspace_ref
SetEnvIfNoCase Referer "www\.myspace\.com" myspace_ref

and adding this inside the Limit block too:

Deny from env=myspace_ref

myspace users generally use pictures from other websites without even basic attribution, and their pages generally contain the least interesting sentences that can be found on the whole Internet...

msn==live

It's the same search engine. Blocking the live.com ref spamming bots will also get you banned on msn and if you block the msn bot (it obeys robots.txt) the live.com spam bots will also stop molesting your page.

Well, that's the deal basically.

I still don't know why they think it's a good idea to spam around. It's annoying and amazingly pointless.

Too bad there isn't a "goto hell" or foad http response code. ;)
