Wednesday, November 5, 2008

Unit 10 Readings

"Web Searching Engines: Part 1"



  • Search engines crawl and index around 400 terabytes of data
  • A full crawl would saturate a 10-Gbps network link for more than 10 days
  • The simplest crawling algorithm uses a queue of URLs, initialized with one or more "seed" URLs (a minimal sketch follows this list)
  • But this simple method, fetching roughly one page per second, could only retrieve 86,400 pages per day - at that rate a 20-billion-page crawl would take about 231,000 days, or roughly 634 years. Crawler parallelism is one solution, but on its own it is still not sufficient to reach the necessary crawling rate, and it risks bombarding web servers and overloading them
  • To improve the crawl, a priority queue replaces the simple queue so that more important pages are fetched first
  • Ranking depends heavily upon link information
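A minimal sketch of that queue-based crawl, using Python's heapq as the priority queue. The seed URLs, the constant priorities, and the page limit are placeholder assumptions, not details from the article:

    import heapq
    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        """Collect href targets from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seeds, max_pages=100):
        # Priority queue of (priority, url); lower numbers are fetched first.
        frontier = [(0, url) for url in seeds]
        heapq.heapify(frontier)
        seen = set(seeds)
        fetched = 0
        while frontier and fetched < max_pages:
            _, url = heapq.heappop(frontier)
            try:
                html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except Exception:
                continue  # skip pages that fail to download
            fetched += 1
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute not in seen:
                    seen.add(absolute)
                    # A constant priority makes this behave like the simple queue;
                    # a real crawler would compute priority from link information.
                    heapq.heappush(frontier, (1, absolute))
        return fetched

Swapping that constant priority for a score derived from link information is what turns the simple queue into the priority queue the article describes.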

"Web Searching Engines: Part 2"

  • Search engines use an inverted file to rapidly identify indexing terms
  • An inverted file is a concatenation of the postings lists for each distinct term
  • Indexers create inverted files in two phases - scanning and inversion (see the sketch after this list)
  • For high-quality rankings, indexers store additional info in the postings
  • Search engines can reduce demands on disk space and memory by using compression algorithms
  • PageRank assigns different weights to links depending on the PageRank of the page they come from (a small sketch also follows this list)
  • Most search engines rank results using a combination of query-independent factors such as link popularity, spam score, click counts, etc.
  • To speed up query processing: skipping (e.g., past postings for very common words such as "and" and "the"); early termination; careful assignment of document numbers; and caching.
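A minimal sketch of the scan-then-invert process on a made-up two-document collection. Storing word positions in the postings is my stand-in for the "additional info" indexers keep for ranking; the article does not prescribe this exact layout:

    from collections import defaultdict

    def build_inverted_file(docs):
        """docs maps document number -> text; returns term -> postings list."""
        # Phase 1: scanning - record a (term, docnum, position) triple per word.
        triples = []
        for docnum, text in docs.items():
            for position, term in enumerate(text.lower().split()):
                triples.append((term, docnum, position))
        # Phase 2: inversion - group the triples by term into postings lists.
        postings = defaultdict(list)
        for term, docnum, position in sorted(triples):
            postings[term].append((docnum, position))
        return dict(postings)

    index = build_inverted_file({
        1: "the quick brown fox",
        2: "the lazy dog",
    })
    print(index["the"])  # [(1, 0), (2, 0)]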
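A small sketch of the usual PageRank iteration referred to in the PageRank bullet; the 0.85 damping factor and the toy link graph are conventional assumptions, not values taken from the article:

    def pagerank(links, damping=0.85, iterations=50):
        """links maps each page to the list of pages it links to
        (every target page must also appear as a key)."""
        pages = list(links)
        rank = {page: 1.0 / len(pages) for page in pages}
        for _ in range(iterations):
            new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
            for page, outlinks in links.items():
                if not outlinks:
                    continue
                # Each link passes on a share of its source's current rank,
                # so links from highly ranked pages carry more weight.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            rank = new_rank
        return rank

    # Toy graph: A and B both link to C; C links back to A.
    print(pagerank({"A": ["C"], "B": ["C"], "C": ["A"]}))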

"Current Developments and Future Trends for the OAI Protocol for Metadata Harvesting"

  • Initially released in 2001; developed as a means to federate access to diverse e-print archives through metadata harvesting and aggregation
  • Mission is to "develop and promote interoperability standards that aim to facilitate the efficient dissemination of content."
  • Others have begun to use the protocol to aggregate metadata relevant to their own needs
  • Provides users with browsing and searching capabilities, as well as metadata that is amenable to machine processing
  • The OAI framework is divided into data providers (repositories) and service providers (a minimal harvesting request is sketched after this list)
  • Examples: the Sheet Music Consortium and the National Science Digital Library (NSDL has the broadest vision of the OAI service)
  • Issues: completeness, searchability and browsability, and amenability to machine processing
  • Ongoing challenges: variations and problems in data-provider implementations; problems with the metadata itself; and a lack of communication among service and data providers
  • (interesting connection to the Dublin Core - controlled vocabularies will be more important as providers try to cope with all the data)
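A minimal sketch of the kind of request a service provider sends to a data provider under OAI-PMH. The ListRecords verb, the oai_dc metadata prefix, and the XML namespaces are part of the protocol; the base URL is a placeholder for a real repository's endpoint:

    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    # Placeholder endpoint - substitute a real repository's OAI-PMH base URL.
    BASE_URL = "https://example.org/oai"

    def harvest_titles(base_url):
        """Issue a ListRecords request and pull the dc:title out of each record."""
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        url = base_url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url, timeout=30) as response:
            tree = ET.parse(response)
        titles = []
        for record in tree.iter("{http://www.openarchives.org/OAI/2.0/}record"):
            for title in record.iter("{http://purl.org/dc/elements/1.1/}title"):
                titles.append(title.text)
        # A full harvester would also follow resumptionToken elements to page
        # through large result sets and would cope with deleted records.
        return titles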

"The Deep Web"

  • Traditional search engines do not probe beneath the surface - the deep web is hidden
  • Deep Web sources store their content in searchable databases that only produce results dynamically in response to a direct request - but querying each one by hand is a laborious way to search
  • BrightPlanet's search technology automates that process, making dozens of direct queries simultaneously using multiple-thread technology (a toy sketch follows this list)
  • BrightPlanet describes it as the only search technology so far that is capable of identifying, retrieving, qualifying, classifying, and organizing both "deep" and "surface" content - a "directed-query engine"
  • The deep web is about 500 times larger than the surface web, and 97.4% of deep websites are publicly available without restriction; yet they receive only about half the traffic of a typical surface website
  • Deep websites tend to return about 10% more documents than surface websites and nearly triple the number of quality documents
  • Quality = both the quality of the search and the ability to cover the subject requested
  • Most important finding: a large amount of meaningful content is not discoverable with conventional search technology, and there is little awareness that this content exists
  • If deep websites were easily searchable, users could make better judgments about the information they find
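BrightPlanet's actual technology is proprietary, so this is only a toy illustration of the directed-query idea: send the same query to several database-backed search endpoints at once from a thread pool, rather than following static links. The endpoint URLs and their query-string formats are invented placeholders:

    import urllib.parse
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical database-backed search endpoints; real deep web sources
    # each have their own query interfaces and result formats.
    SOURCES = [
        "https://example.org/library/search?q={query}",
        "https://example.net/archive/find?term={query}",
        "https://example.com/records/lookup?text={query}",
    ]

    def query_source(url_template, query):
        """Send one direct query to one source and return the raw response."""
        url = url_template.format(query=urllib.parse.quote(query))
        try:
            with urllib.request.urlopen(url, timeout=15) as response:
                return url, response.read()
        except Exception as error:
            return url, error

    def directed_query(query):
        # Issue the same query to every source simultaneously, as a
        # multi-threaded directed-query engine would.
        with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
            return list(pool.map(lambda source: query_source(source, query), SOURCES))

    # Example: results = directed_query("sheet music metadata")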

Muddiest Point: Why hasn't the deep web come to light sooner? Why haven't IS and other tech-savvy students been using it before? Are there problems with BrightPlanet's search engine? The last article does not seem to address any problems with deep search engines.

3 comments:

RAlessandria said...

Hello Jenelle, I also wondered why the deep web hadn't come to light sooner. Not to say that it was undetected, but why so little is known about something so large is hard to understand.

Monica said...

That article on the deep web is from 2001; it is outdated, and it is essentially an advertisement. Another of our classmates posted a link to a better article on the deep web. Here's their blog:
http://idontunderstandthepresent.blogspot.com/
I was just looking at BrightPlanet's website and it's a fee-for-service site that does 'deep searching'. I don't think the deep web is as deep as it was, and search engines like Google are able to search 'deep' sites more now.

Kristina Grube Lacroix said...

I found it amazing that the deep web is so much larger than the surface web, and not only is it larger, it often holds more and better search results. The only problem is that most popular search engines do not search the deep-web at all, which greatly limits what people can find on the web.