Search Tool Data Analysis

by Daniel Nicholls (dpnick in BIT330, Fall 2008)

Questions and queries

Web search engines

The National Hockey League is one of the premier sports leagues in America. While it may not match the NFL or NBA in size or popularity, it still has a very large fan base and generates billions of dollars in revenue. After reaching a low point during the 2004-05 lockout, it has grown in strength every year since. I know that Detroit was one of the first few teams in the league, but how many teams were there when the league first started?

The query I used for this search was: “NHL number teams beginning”.

Blog search engines

This year’s election is a very close race between Barack Obama and John McCain. Both recently chose their vice presidential running mates. While Obama chose a well-known senator in Joe Biden, McCain picked a relatively unknown official in Sarah Palin. Not many people, including myself, had ever heard of Sarah Palin. I have heard that she is currently the governor of Alaska, and I was curious how many years she has served in this position.

The query I used for this search was: “Sarah Palin governor Alaska”

Data that I collected

Search engine overlap data

Web search   Live   Google   Yahoo Web
Live           15        5          10
Google                  65          25
Yahoo Web                           45
All             5

Blog search   Technorati   Google Blog   Bloglines
Technorati            15             0           5
Google Blog                         40           5
Bloglines                                       45
All                    0

Search engine ranking overlap data

This table provides a measure of how much of Google's responses are reproduced by Yahoo.

GY              Yahoo top 5   Yahoo top 10   Yahoo top 20
Google top 5              3              3              4
Google top 10             3              3              4
Google top 20             3              3              5

This table provides a measure of how much of Yahoo's responses are reproduced by Google.

YG              Google top 5   Google top 10   Google top 20
Yahoo top 5                3               3               3
Yahoo top 10               3               3               3
Yahoo top 20               4               4               5

This table provides a measure of how much of Bloglines' responses are reproduced by Google Blog Search.

BG                 Google Blog top 5   Google Blog top 10   Google Blog top 20
Bloglines top 5                    0                    0                    0
Bloglines top 10                   0                    0                    0
Bloglines top 20                   1                    1                    1

This table provides a measure of how much of Google Blog Search's responses are reproduced by Bloglines.

GB                   Bloglines top 5   Bloglines top 10   Bloglines top 20
Google Blog top 5                  0                  0                  1
Google Blog top 10                 0                  0                  1
Google Blog top 20                 0                  0                  1
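
One way to build ranking overlap tables like the four above is to count, for each pair of cutoffs, how many results two ranked lists share. The Python sketch below is an illustration only, not the procedure used in the class. The function name topk_overlap_matrix and the placeholder result lists are assumptions made for this example.

```python
def topk_overlap_matrix(ranked_a, ranked_b, cutoffs=(5, 10, 20)):
    """Rows are cutoffs on ranked_a, columns are cutoffs on ranked_b.

    Cell [i][j] counts the results in the top cutoffs[i] of ranked_a
    that also appear in the top cutoffs[j] of ranked_b.
    """
    matrix = []
    for ka in cutoffs:
        top_a = set(ranked_a[:ka])
        row = [len(top_a & set(ranked_b[:kb])) for kb in cutoffs]
        matrix.append(row)
    return matrix


# Hypothetical ranked result lists (placeholder names, not real search results).
engine_a = [f"site{i}.example" for i in range(1, 21)]
engine_b = [
    "site2.example", "site1.example", "other1.example",
    "site7.example", "other2.example",
] + [f"other{i}.example" for i in range(3, 18)]

a_vs_b = topk_overlap_matrix(engine_a, engine_b)
b_vs_a = topk_overlap_matrix(engine_b, engine_a)

for row in a_vs_b:
    print(row)
```

Swapping the two lists gives the transposed table, which is why the GY and YG tables (and the BG and GB tables) should mirror each other.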

Results

Web search

This table provides an average of the precision of Web Search for all students in BIT330 Fall 2008.

Web search   Live   Google   Yahoo Web
Live           43       18          20
Google                  54          21
Yahoo Web                           52
All            10

This table shows the average across all students in BIT 330, and every number in it is a percentage. A number where a site's row meets its own column is that site's precision. For example, the Google row and Google column meet at 54%, which means that, on average, 54% of the results Google returned for a query were relevant. A number where one site's row meets another site's column is the overlap between the two sites. For instance, Live and Yahoo Web meet at 20%, which means that, on average, 20% of the relevant results in Windows Live also appeared in Yahoo Web. Finally, the 10% under All means that, on average, 10% of the 20 results (2 results) were found by all three Web search engines.
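
The percentages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the result IDs and relevance judgments are made up, the helper names precision_pct and overlap_pct are hypothetical, and overlap is taken as a share of the 20 results examined per engine.

```python
from itertools import combinations

TOP_N = 20  # each engine's top 20 results are compared


def precision_pct(results, relevant):
    """Percentage of an engine's results that were judged relevant."""
    hits = sum(1 for r in results if r in relevant)
    return 100 * hits / len(results)


def overlap_pct(*result_lists):
    """Percentage of the TOP_N results shared by every given engine."""
    shared = set(result_lists[0]).intersection(*result_lists[1:])
    return 100 * len(shared) / TOP_N


# Hypothetical result IDs and relevance judgments (not real data).
results = {
    "Live": [f"u{i}" for i in range(1, 21)],
    "Google": [f"u{i}" for i in range(17, 37)],
    "Yahoo Web": [f"u{i}" for i in range(19, 29)]
    + [f"y{i}" for i in range(1, 11)],
}
relevant = {f"u{i}" for i in range(15, 27)}

# Diagonal of the table: one precision value per engine.
precision = {name: precision_pct(r, relevant) for name, r in results.items()}

# Off-diagonal cells: one overlap value per pair of engines.
pairwise = {
    (a, b): overlap_pct(results[a], results[b])
    for a, b in combinations(results, 2)
}

# The "All" row: results returned by all three engines.
all_three = overlap_pct(*results.values())

print(precision)
print(pairwise)
print("All:", all_three)
```

Averaging each of these values over every student's query would give a class table with the same layout as the one above.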

This table gives the average, across all BIT330 students, of how many of Google's responses are reproduced by Yahoo.

GY              Yahoo top 5   Yahoo top 10   Yahoo top 20
Google top 5           1.06           1.29           1.65
Google top 10          1.35           2.00           2.47
Google top 20          1.63           2.65           3.71

This table gives the average, across all BIT330 students, of how many of Yahoo's responses are reproduced by Google.

YG              Google top 5   Google top 10   Google top 20
Yahoo top 5             1.06            1.47            1.88
Yahoo top 10            1.18            1.94            2.65
Yahoo top 20            1.65            2.47            3.76

Both tables above show averages across all BIT330 students. At Google row 5 and Yahoo column 5 we see 1.06: on average, 1.06 of the top 5 Google results were also in the top 5 Yahoo results. At Google row 20 and Yahoo column 5 we see 1.63: on average, 1.63 of the top 5 Yahoo results were also in the top 20 Google results. Parallel cells in the two tables should match, since each counts the same shared results from the other engine's side, but some do not. For example, the overlap between the top 20 of Google and the top 20 of Yahoo is 3.71 in the first table and 3.76 in the second. This points to errors in some students' reporting.
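
The mismatch between parallel cells can be checked mechanically. The sketch below copies the class averages from the two tables above and compares each cell of the first table with its mirror cell in the second. The function name transpose_mismatches is an assumption made for this example.

```python
CUTOFFS = (5, 10, 20)

# Class averages copied from the two tables above.
google_vs_yahoo = [  # rows: Google top N, columns: Yahoo top N
    [1.06, 1.29, 1.65],
    [1.35, 2.00, 2.47],
    [1.63, 2.65, 3.71],
]
yahoo_vs_google = [  # rows: Yahoo top N, columns: Google top N
    [1.06, 1.47, 1.88],
    [1.18, 1.94, 2.65],
    [1.65, 2.47, 3.76],
]


def transpose_mismatches(a_vs_b, b_vs_a, cutoffs=CUTOFFS, tol=1e-9):
    """List the cells where a_vs_b[i][j] differs from b_vs_a[j][i]."""
    mismatches = []
    for i, ka in enumerate(cutoffs):
        for j, kb in enumerate(cutoffs):
            if abs(a_vs_b[i][j] - b_vs_a[j][i]) > tol:
                mismatches.append((ka, kb, a_vs_b[i][j], b_vs_a[j][i]))
    return mismatches


mismatches = transpose_mismatches(google_vs_yahoo, yahoo_vs_google)
for google_n, yahoo_n, gy_value, yg_value in mismatches:
    print(f"Google top {google_n} / Yahoo top {yahoo_n}: {gy_value} vs {yg_value}")
```

On these figures the check flags five of the nine cells, including the 3.71 versus 3.76 pair noted above.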

Blog search

This table provides an average of the precision of Blog Search for all students in BIT330 Fall 2008.

Blog search   Technorati   Google Blog   Bloglines
Technorati            33             4           9
Google Blog                         53           7
Bloglines                                       44
All                    1

This table shows the average across all students in BIT 330, and every number in it is a percentage. A number where a site's row meets its own column is that site's precision. For example, the Bloglines row and Bloglines column meet at 44%, which means that, on average, 44% of the results Bloglines returned for a query were relevant. A number where one site's row meets another site's column is the overlap between the two sites. For instance, Technorati and Google Blog meet at 4%, which means that, on average, 4% of the relevant results in Technorati also appeared in Google Blog. Finally, the 1% under All means that, on average, 1% of the 20 results were found by all three blog search engines.

This table gives the total, across all BIT330 students, of how many of Bloglines' responses are reproduced by Google Blog Search.

BG                 Google Blog top 5   Google Blog top 10   Google Blog top 20
Bloglines top 5                    5                    7                    9
Bloglines top 10                   6                    9                   15
Bloglines top 20                  10                   14                   19

This table gives the total, across all BIT330 students, of how many of Google Blog Search's responses are reproduced by Bloglines.

GB                   Bloglines top 5   Bloglines top 10   Bloglines top 20
Google Blog top 5                  5                  7                 12
Google Blog top 10                 6                  8                 13
Google Blog top 20                 8                 14                 18

Both tables above show totals across all BIT330 students. At Bloglines row 5 and Google column 5 we see 5: summed over all students, 5 of the top 5 Bloglines results were also in the top 5 Google results. At Bloglines row 20 and Google column 5 we see 10: summed over all students, 10 of the top 20 Bloglines results were also in the top 5 Google results. Parallel cells in the two tables should match, but some do not. For example, the overlap between the top 20 of Google and the top 20 of Bloglines is 19 in the first table and 18 in the second. This points to errors in some students' reporting.
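
Because the blog tables report totals while the Web tables report averages, it can help to put them on the same scale. The sketch below divides the Bloglines totals by the class size of 17 given in the Discussion section. It assumes that all 17 students contributed to the totals, which the source does not state, and the function name per_student is hypothetical.

```python
CLASS_SIZE = 17  # class size stated in the Discussion (assumed to cover every total)

# Totals copied from the first table above.
# Rows: Bloglines top 5/10/20, columns: Google Blog top 5/10/20.
bloglines_vs_google_totals = [
    [5, 7, 9],
    [6, 9, 15],
    [10, 14, 19],
]


def per_student(totals, n_students):
    """Convert class totals to per-student averages, rounded to 2 places."""
    return [[round(cell / n_students, 2) for cell in row] for row in totals]


averages = per_student(bloglines_vs_google_totals, CLASS_SIZE)
for row in averages:
    print(row)
```

Under that assumption, the top 20 of Bloglines and the top 20 of Google Blog share about 1.12 results per student, compared with 3.71 for Google and Yahoo in the Web search tables.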

Discussion

Web search

Looking at the two sets of data, we see that all three search engines are fairly accurate (roughly half of their results are relevant), but they get there through different pages. Of the top 20 results from Google and the top 20 from Yahoo, only 3.71 on average are the same. Given that Google's precision is 54% and Yahoo's is 52%, the two engines usually both bring back good results, but from different web sites. This is a useful finding: a user who has trouble finding information with one search engine can try another, because all three are reasonably effective (Google the most at 54%, Windows Live the least at 43%) and on average return different results. This is my biggest finding from this experiment, and I was very surprised to see how little the sites overlapped.

One question worth researching is what kind of websites the engines overlap on. For example, if all of the overlaps came from large, well-established websites, that would suggest the engines agree only on the most prominent pages and use very different techniques to rank everything else. If so, it would be worth investigating why they bring back such different results.
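
The claim that Google and Yahoo return good but mostly different results can be backed by simple arithmetic on the class averages. The sketch below assumes that precision was measured over the same top 20 results used for the overlap counts. The variable names are chosen for this example only.

```python
TOP_N = 20               # results examined per engine (assumed for precision too)
google_precision = 0.54  # class averages from the Results section
yahoo_precision = 0.52
shared_top20 = 3.71      # average results shared by the two top-20 lists

# Average number of relevant results per engine in its top 20.
google_relevant = google_precision * TOP_N
yahoo_relevant = yahoo_precision * TOP_N

# Share of a top-20 list that the other engine also returned.
shared_fraction = shared_top20 / TOP_N

# Even if every shared result were relevant, at least this many of Google's
# relevant results, on average, were missing from Yahoo's top 20.
min_google_only_relevant = google_relevant - shared_top20

print(f"Google relevant: {google_relevant:.1f}, Yahoo relevant: {yahoo_relevant:.1f}")
print(f"Shared share of top 20: {shared_fraction:.1%}")
print(f"Relevant Google results not in Yahoo (lower bound): {min_google_only_relevant:.2f}")
```

So of roughly 10.8 relevant Google results per query, at least about 7 did not appear in Yahoo's top 20, which is what makes trying a second engine worthwhile.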

Blog search

The blog search results show the same pattern as the Web search, but to an even greater extent. All of the engines were reasonably effective on average. Google Blog was the most effective, returning relevant results 53% of the time. Technorati returned relevant results only 33% of the time, but that is still fairly effective. It is striking how little the blog search engines had in common: across a class of 17 students, the top 20 results of Bloglines and Google Blog shared a total of only 19 results. To a person searching for information, I would recommend trying a variety of blog search engines when researching a topic, since they all bring back relatively accurate and unique responses. Eventually, a searcher may settle on a favorite. One thing that I learned from this exercise is that even though these search engines are effective, they can return misleading results, so one must be careful when evaluating what a search brings back. It is also important to realize that this experiment covers only three of the possibly hundreds of blog search engines. There is a wealth of research capabilities out there!
