whatwewant.www

people don't want to search it. they just want to get it.

Friday, February 13, 2004

Who is this guy dude?

Who is the first author krub? ;)

Ratanachai Sombatsrisomboon, Yutaka Matsuo, and Mitsuru Ishizuka (2003). Acquisition of Hypernyms and Hyponyms from the WWW, in Proceedings of 2nd Int'l Workshop on Active Mining (AM2003), pp.7-13, Maebashi, Japan (in conjunction with Int'l Sympo. on Methodologies for Intelligent Systems), October, 2003.

What is it about?

Posted by: bact' / 6:42 AM

Thursday, February 05, 2004

Rush Hour Intro to IR

by Mirella Lapata
(slides for COM3110 Text Processing class, Department of Computer Science, University of Sheffield)

Briefly explains Google search, IR, issues in IR, indexing, inverted file, boolean model, vector space model, TF/IDF, term weighting, evaluation, precision, recall, and F-measure.

introduction | term manipulation & evaluation

dude: Web Information Retrieval, a cool tutorial by Google's research director, Monika Henzinger

Posted by: bact' / 4:51 AM

Tuesday, February 03, 2004

Managing Gigabytes (Book)

Sometimes it's more than just 'search'. We may want it 'faster', and many times we want it 'smaller'.
(And in the case of database/index size, the smaller one is probably the faster one -- fewer things to look through.)

Managing Gigabytes: Compressing and Indexing Documents and Images by Ian H. Witten, Alistair Moffat, and Timothy C. Bell. (read reviews)

Also from the authors of the book: MG, an open-source indexing and retrieval system for text, images, and textual images.

Posted by: bact' / 10:25 PM

Google File System

How to search for things in a collection is one problem.
How to store things (in a collection) so that they can be searched is another problem.

And the latter could be a really big problem if you have to keep "3,307,998,701 web pages" as Google does.

Google File System: Technical paper, by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. This is a technical paper that explains Google's custom scalable cluster filesystem for storing their gigantic database of the entire Web across thousands of low-cost PCs.

From Google Weblog.

Posted by: bact' / 10:16 PM

Hypertext

The idea of hypertext can be traced back to "As We May Think", an article by Vannevar Bush in The Atlantic Monthly, July 1945.
Today the WWW, created in 1991 by Tim Berners-Lee (the first web page, web server, and browser), is the largest hypertext system.

Timeline of Hypertext technology

Posted by: burlight / 6:15 PM

Sunday, February 01, 2004

Vector Space Model and TF-IDF

Boolean model - each term in a document is weighted equally, as 1 (present) or 0 (absent), and the documents that satisfy an input query are returned without ranking.
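
To make the Boolean model concrete, here is a minimal sketch. The toy documents, the whitespace tokenizer, and the helper names (`term_set`, `boolean_and`, `boolean_or`) are all made up for illustration; a real system would use an inverted file instead of scanning every document.

```python
# Toy collection; these documents are invented for illustration only.
docs = {
    "d1": "google file system paper",
    "d2": "vector space model for text retrieval",
    "d3": "text compression and indexing",
}


def term_set(text):
    """Boolean view of a text: a term is either present or absent."""
    return set(text.lower().split())


# Precompute the set of terms present in each document.
index = {doc_id: term_set(text) for doc_id, text in docs.items()}


def boolean_and(query):
    """Ids of documents that contain every query term."""
    terms = term_set(query)
    # sorted() only makes the output deterministic; it is not a relevance ranking.
    return sorted(doc_id for doc_id, present in index.items() if terms <= present)


def boolean_or(query):
    """Ids of documents that contain at least one query term."""
    terms = term_set(query)
    return sorted(doc_id for doc_id, present in index.items() if terms & present)


print(boolean_and("text indexing"))  # ['d3']
print(boolean_or("text system"))     # ['d1', 'd2', 'd3']
```

Note that every matching document is equally "good" here: there is no score to rank by, which is exactly the limitation the vector space model addresses.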

Vector Space Model - views each document in a database as a vector in a vector space whose number of dimensions is the number of distinct terms across all documents in the database. The component of a vector along each dimension is determined by a weighting algorithm (TF-IDF is the most widely used). The input query is also viewed as a vector in that space, and the documents nearest the query vector are returned as the result, ranked by distance (the closer, the higher the rank).
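
A rough sketch of ranking in the vector space model follows. It makes two assumptions that are not from the text above: raw term counts are used as the weights (instead of TF-IDF), and "closeness" is measured with cosine similarity, which is a common choice but not the only one. The documents and function names are hypothetical.

```python
import math
from collections import Counter

# Toy collection; these documents are invented for illustration only.
docs = {
    "d1": "search engine index search",
    "d2": "file system cluster storage",
    "d3": "index compression",
}


def to_vector(text):
    """Sparse term vector: term -> raw count (assumed weighting)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two sparse vectors (1.0 = same direction)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query):
    """Score every document against the query, most similar first."""
    q = to_vector(query)
    scored = [(doc_id, cosine(q, to_vector(text))) for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank("search index")
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

With this toy data, d1 (which contains both query terms) comes first, d3 (one query term) second, and d2 (no query terms) last with a score of 0.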

TF-IDF - a weighting algorithm widely used in IR. It stands for term frequency - inverse document frequency. The idea is that terms which appear frequently in one document but rarely in the other documents (in the database or corpus) are considered important terms for that document (high TF-IDF weight).
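
Here is a small sketch of one TF-IDF variant, assuming raw counts for TF and log(N / df) for IDF. There are many other variants (smoothed, normalized, and so on), so treat this as an illustration rather than the definitive formula. The documents and names are made up.

```python
import math
from collections import Counter

# Toy corpus; these documents are invented for illustration only.
docs = {
    "d1": "the google file system",
    "d2": "the vector space model",
    "d3": "the file index",
}

tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenized.values() for term in set(tokens))


def tf_idf(doc_id):
    """Weight of each term in one document: count * log(N / df)."""
    counts = Counter(tokenized[doc_id])
    return {term: count * math.log(n_docs / df[term]) for term, count in counts.items()}


weights = tf_idf("d1")
for term, weight in sorted(weights.items(), key=lambda pair: pair[1], reverse=True):
    print(term, round(weight, 3))
```

"the" appears in every document, so its weight is 0, while "google" and "system" appear only in d1 and get the highest weights. "file" appears in two of the three documents and lands in between.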

Posted by: burlight / 6:29 PM

Information Need

About our slogan "people don't want to search it. they just want to get it".
There are a number of studies concerning "Information need and Information seeking".
This seems to be a good tutorial.

One of the most famous is Taylor's 4 levels of Information need.

Taken from page 10 of the slides.

Q1- Visceral need
Actual, but unexpressed need for information
Feeling of unease, doubt, uncertainty
Vague sense of dissatisfaction
Hard to express in words

or in short, .... "(sometimes) people don't know what they want. (but still) they just want to get it!"

bact': This book [R. Belew. Finding Out About: A Cognitive Perspective on Search Engines and the WWW. Cambridge University Press, Cambridge, 2000.] investigates and tries to describe IR from the cognitive perspective (how humans/users think, perceive, behave, ...).