Archive for the ‘Search Engines’ Category

Sphere Blog Search, crawling 9 pages in 16 seconds

Friday, March 23rd, 2007

I love web crawlers. They index pages and bring readers from search engines. But some web crawlers are just annoying, like those gathering e-mail addresses for spammers. And Sphere Scout, which has a very odd hit-grab-and-run behavior.

Sphere Scout visited my blog, fetched robots.txt to check for permission to crawl, and then grabbed 9 pages in 16 seconds - and that was it.

64.40.115.32 - - [23/Mar/2007:02:42:05 -0400] "GET /robots.txt HTTP/1.0" 200 24 "-" "Sphere Scout&v4.0 (beta) - scout at sphere dot com"
64.40.115.32 - - [23/Mar/2007:02:42:06 -0400] "GET / HTTP/1.0" 200 34940 "-" "Sphere Scout&v4.0 (beta) - scout at sphere dot com"
64.40.115.32 - - [23/Mar/2007:02:42:09 -0400] "GET /2007/01/12/are-you-sure-your-backup-routines-are-sufficient/ HTTP/1.0" 200 15576 "-" "Sphe"
64.40.115.32 - - [23/Mar/2007:02:42:12 -0400] "GET /2007/02/21/creative-seo-whos-there-google-heres-a-page-just-for-you/ HTTP/1.0" 200 15177 ""
64.40.115.32 - - [23/Mar/2007:02:42:14 -0400] "GET /2007/03/12/youd-be-shocked-and-amazed-if-you-knew-what-theyre-searching-for/ HTTP/1.0" 200"
64.40.115.32 - - [23/Mar/2007:02:42:17 -0400] "GET /2007/02/09/yet-another-creative-google-clone-spammed HTTP/1.0" 200 17297 "-" "Sphere Scout"
64.40.115.32 - - [23/Mar/2007:02:42:19 -0400] "GET /2007/03/12/youd-be-shocked-and-amazed-if-you-knew-what-theyre-searching-for HTTP/1.0" 200 "
64.40.115.32 - - [23/Mar/2007:02:42:21 -0400] "GET /2007/03/22/vigilant-a-pretty-cool-word HTTP/1.0" 200 12409 "-" "Sphere Scout&v4.0 (beta) -"
64.40.115.32 - - [23/Mar/2007:02:42:23 -0400] "GET /2006/10/11/the-enormous-power-of-plain-text-e-mail-security/ HTTP/1.0" 200 14138 "-" "Sphe"
64.40.115.32 - - [23/Mar/2007:02:42:25 -0400] "GET /2007/03/22/vigilant-a-pretty-cool-word/ HTTP/1.0" 200 12409 "-" "Sphere Scout&v4.0 (beta) "
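
If you want to check how fast a bot hits your own site, the gaps between requests are easy to pull out of an access log like the one above. Here is a minimal sketch, assuming each line carries a bracketed timestamp in the usual Apache style. The function name request_gaps is made up for this example, and the sample lines are shortened versions of the first three log entries.

```python
import re
from datetime import datetime

# Matches the first bracketed field on a line, e.g. [23/Mar/2007:02:42:05 -0400].
TIMESTAMP_RE = re.compile(r"\[([^\]]+)\]")

# Assumed timestamp layout: day/month/year:hour:minute:second UTC offset.
TIMESTAMP_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def request_gaps(log_lines):
    """Return the seconds between consecutive requests, in log order.

    Lines without a bracketed field are skipped.
    """
    times = []
    for line in log_lines:
        match = TIMESTAMP_RE.search(line)
        if match:
            times.append(datetime.strptime(match.group(1), TIMESTAMP_FORMAT))
    return [
        (later - earlier).total_seconds()
        for earlier, later in zip(times, times[1:])
    ]


if __name__ == "__main__":
    sample = [
        '64.40.115.32 - - [23/Mar/2007:02:42:05 -0400] "GET /robots.txt HTTP/1.0" 200 24',
        '64.40.115.32 - - [23/Mar/2007:02:42:06 -0400] "GET / HTTP/1.0" 200 34940',
        '64.40.115.32 - - [23/Mar/2007:02:42:09 -0400] "GET /2007/01/12/are-you-sure-your-backup-routines-are-sufficient/ HTTP/1.0" 200 15576',
    ]
    print(request_gaps(sample))  # [1.0, 3.0]
```

To look at a single bot, filter the lines by IP address or User-Agent before passing them in.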

There is nothing wrong with crawling the web. Every search-engine has to. I started using Google as my #1 search engine many years ago, and I still do for two reasons:

  1. I always find exactly what I’m looking for (this may have something to do with me knowing how to use its more advanced functions)
  2. It’s fast. Result 1-10 of 3830000 in 0.05 seconds? It’s hard to make a static web page load that fast.

But some of their actions in recent years are at best very questionable, so it makes me happy to see that other search-engines are at least trying to give them competition. Like the blog-search-engine Sphere. But hammering a site with a page request every 2 seconds?

outrage.jpg

If every new & supposedly “next big thing” search-engine did that then it’d kill the web and that would be the end of it. That’s probably an overstatement, but still: Most web crawlers don’t rush. They download a page, wait a while, and then download another page. They usually take their time. This prevents a single bot, or a handful of bots that happen to hit the same site, from putting noticeable load on a webserver. But those running “Sphere Scout” don’t get that; they want all content and they want it now.
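
For the curious, waiting between downloads takes very little code. The sketch below is my own guess at how a polite bot could do it, not how Sphere Scout or any real crawler works. It reads the non-standard Crawl-delay line from robots.txt with Python’s urllib.robotparser and falls back to a default when the line is missing. The names crawl_delay_for and seconds_to_wait and the 10 second default are assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Assumed fallback when robots.txt has no Crawl-delay line (hypothetical value).
DEFAULT_DELAY = 10.0


def crawl_delay_for(robots_txt, user_agent, default=DEFAULT_DELAY):
    """Return the Crawl-delay that robots.txt requests for user_agent, in seconds."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def seconds_to_wait(last_fetch, now, delay):
    """Return how long the bot should still sleep before its next request."""
    return max(0.0, delay - (now - last_fetch))


if __name__ == "__main__":
    robots = "User-agent: *\nCrawl-delay: 10\n"
    delay = crawl_delay_for(robots, "ExampleBot")
    # Last page fetched at t=100, clock now at t=102: 8 more seconds to wait.
    print(seconds_to_wait(last_fetch=100.0, now=102.0, delay=delay))  # 8.0
```

A real crawler would call time.sleep() with that value before each request to the same host, and keep one last-fetch timestamp per host.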

What’s Sphere, anyway?

It’s a blog-search-engine. A pretty bad one at that.

Speed? Sphere is so slow it’s ridiculous. It really is very hard to make a search-engine come close to Google’s speed, but Sphere is just… way too slow.

Results? I tried a search for “911 inside job” and it only managed to find 43 links. Technorati, another way too slow blog-only search engine, has page after page of results for the term “911 inside job”. It doesn’t say how many; you have to click next, and it requires a referrer when using &start=200 etc. But from what I bothered to check (without changing start=200 using a fake referrer field, which I briefly considered), it’s got thousands of results for that term. Google, as always, p0wns them both with its incredible “about 40,850 for 911 inside job. (0.76 seconds)”.

They’ve also got a whole lot of “Tools” such as browser extensions and widgets which they encourage bloggers to install on their sites. I read their “sphere it, tools and tips” page and after careful consideration for about 0.9 seconds found that their most advanced browser extension is a search plugin which does the job Google’s related: queries do, and their widgets - which show “post-related search results” - looked like a more annoying version of Google Adsense, only without payment.

I found that the “Social bookmarks” widget I use - which I plan on rewriting, btw - has Sphere in it (it has like 60 sites you can choose between) - so I’m going to check if having the button has any effect on their crawling behaviour over the next few weeks. Will it visit more frequently, perhaps? If it does then I may actually remove the button and warn other people about having it, since a page per 2 seconds is just totally unacceptable crawler behaviour.

In bullet summary:

  • Sphere really should consider increasing their bot’s crawl-delay from 2 seconds, and
  • Their blog search engine is ridiculous: it’s slow, it finds nothing and it wouldn’t even have passed as decent back in 1998.

You’d be shocked and amazed if you knew what they’re searching for..

Monday, March 12th, 2007

I run one of the many YaCy P2P search portals out there. YaCy is a distributed P2P search-engine: if you run a node, you can search using the global index of all the nodes. Most people run their own node on their own desktops and don’t make it publicly available. I run a public search service which allows anyone to use the YaCy network.

YaCy has a nice “feature” called Search Statistics. It gives you a nice list of the latest search keywords - and the hosts used to search. This makes it very easy to follow one user through many searches in a row. YaCy doesn’t use cookies, which makes tracking over time impossible, but tracking with cookies is something most search engines do.

Regardless. Seeing even one to three searches in a row at YacySearch actually gives quite a lot of information about the person doing the search. And it may also tell you way too much; some of the strings some people search for are just.. sick. Or very strange.

I would actually prefer to turn this search-logging “feature” off and not be able to view it at all, because those few times I look at the “What are people searching for today?”-list I almost always get.. kind of upset at just how .. how do I put it.. sick? some people are. But it does give some interesting information, too, like if there has been some story in the mainstream press about some celebrity then suddenly everybody’s searching for that celeb’s name..

Anyway. Here’s a word of advice for you all about searching on the Internet:

1) Clear your cookies every time you close your browser (Firefox, and others, can be configured to do this automatically).

2) Use scrapers like Scroogle to search Yahoo (and Google).

3) Preferably, use an anonymity system like Tor to browse the Internet.

4) Spread your searches between different search-engines. If MSN knows your last 100 searches then they probably know a whole lot about you. You’re better off doing one search at MSN, one search at Google, one search at Yahoo, and so on. This means that none of them get a complete history of your searches. It’s way simpler for a search-engine to see what you’re up to when it’s got 10 search-requests in a row from you…

5) Some browsers can give you “suggested keywords” when you type in the search-box. Turn this off. It reports everything you type in the box back to a search-engine, even if you don’t actually search for anything. Worst case: You accidentally paste your computer password into the box, and now it’s been sent across the Internet to a search-engine…

Happy searching. 


Creative SEO: Who’s there? Google? Here’s a page, just for you!

Wednesday, February 21st, 2007

It’s been.. uhm.. “rumored” that some sites which require you to pay and log in to read their content treat web crawlers differently and allow them to crawl “restricted” content. Which is nice, since all you have to do to access such sites without paying is to say you’re Google.

After pretending to be Google for a few days I’ve noticed something. Many websites seem to give a different page depending on who visits. For example, this is the front page at www.bluecoat.com:

bluecoat1.jpg

Doesn’t look very fancy, does it? That is because they serve Google (and anyone/thing who pretends to be Google) a completely different page.

Their website actually looks like this - in most browsers:

bluecoat2.jpg

This is what’s called doing black-hat “search engine optimization”.

Except for one little detail. The problem with all kinds of “dirty trick” black-hat SEO is that it doesn’t work.

And it especially doesn’t work with Google. See, here’s a little dirty secret about GoogleBot: It sometimes lies about who’s there! It will fetch the root page / using the normal User-Agent, wait a while, and re-crawl / using an (outdated beta-version of a Linux-only) web browser string.

I don’t actually know what Google (or more correctly, their bot..) thinks of websites that give them a different page. But I do not think their bot likes that kind of SEO. And as mentioned, it’s not like you’re fooling anyone by trying to give search-engines a different page; most of them now check at least 1 page on your site using a “fake” (as in not their own) User-Agent string.
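
You can run the same kind of check yourself: fetch a page twice with two different User-Agent strings and compare what comes back. This is only a rough sketch of the idea, and I am not claiming it is how Google or any other search-engine does it. Both User-Agent strings, the function names and the 0.5 threshold are assumptions picked for illustration. Pages with rotating ads or timestamps differ a little on every request, so treat a low score as a hint and not as proof.

```python
import difflib
import urllib.request

# Illustrative User-Agent strings (assumptions, not taken from any log).
CRAWLER_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"


def fetch(url, user_agent):
    """Download url while announcing the given User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def similarity(url, fetcher=fetch):
    """Return a 0.0 to 1.0 similarity score between the crawler and browser pages."""
    as_crawler = fetcher(url, CRAWLER_UA)
    as_browser = fetcher(url, BROWSER_UA)
    matcher = difflib.SequenceMatcher(None, as_crawler, as_browser, autojunk=False)
    return matcher.ratio()


def looks_cloaked(url, threshold=0.5, fetcher=fetch):
    """Guess whether the site serves crawlers a different page (crude heuristic)."""
    return similarity(url, fetcher) < threshold
```

The fetcher argument is there so the comparison can be tried out with canned pages instead of real network requests. The character-by-character comparison is slow on big pages, so a real check would probably compare something smaller, such as the visible text or the list of links.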

But I actually like getting a simpler “SEO” page. It’s much simpler to find what you’re looking for using a “Web 0.1” plain text link-list - in most cases…

Just one more little detail regarding SEO: It does not work. Forget about the SE part. Just optimize your sites for human visitors. If they like it, real people will link to your site and the pages on it, and that’s the only kind of SEO which actually works. Period.


Yet another creative Google-clone spammed

Friday, February 9th, 2007

Too many people on the Internet view a free blog service as somewhere they can spam huge amounts of totally worthless advertisements. This Livelyblog blog is a good example. It is basically advertisements and random cut-and-pasted content:

Spam-blog

People sign up and create a blog like this every week, so this is nothing strange. What is strange is that the advertisements in the spam-blog pictured above are spamming links to a “search-engine” service named “cooooogle.com”. This “search engine” looks exactly like Google, and its “results” are limited to a handful of websites with little or no content and huge amounts of advertisements.

Cooooogle

It is very interesting that the “results” from this “search-engine” all link to pages with advertisements from Google’s Adsense advertisement program.

This is kind of .. a strange scam.

Someone has created a clone of Google’s web-page, which looks exactly like Google, and is spamming the links to it everywhere.. to make money using Google’s own Adsense advertisement program.

Perhaps it will work for a while. Perhaps not. How can Google accept that someone is using a “fake” version of their search-engine and spamming links to it to make money from Google’s own advertisement service? Perhaps they have no idea that Cooooogle exists. Who knows. Regardless, I do think that their scam is very.. bold.


xiando.livelyblog.com