Destination Japan: Internationalization of the Lycos Web crawler


Presentation Transcript

  1. Destination Japan: Internationalization of the Lycos Search Engine Presented by: Jeff Vander Clute of Lycos, Inc. & Tina Lieu of Basis Technology Corp.

  2. Lycos is... A new generation Web company - 4 top 20 Web properties in Network - Lycos, Tripod, Angelfire, HotBot A “hub” Search Engine & Navigation - Patented search & directory technology Community & Communication E-commerce, Content Aggregation, Etc.

  3. The Search Technology Created by CMU professor (Fuzzy Mauldin) & students in 1994/95. 1. “Intelligent” spidering methods (now patented), but not internationalized. Spiders crawl the web retrieving documents for indexing. 2. Back-end database of webpages, or catalog, plus relevancy algorithms for ordering search results.

  4. First Stop: Europe • Lycos search technology initially for ASCII only. In-house work to make data paths 8-bit clean, to accommodate European languages. • Otherwise relatively straightforward. Components such as ad servers, Web servers, etc., require little if any changes. • Euro service came online in May 1997.

  5. What’s Unicode? Where’s Japan? • The more interesting problem. • Business reasons to introduce Japanese search. • But not a lot of international(ization) experience within Lycos at the time. • We needed assistance and chose Basis Technology.

  6. Goals • Quick deployment of Japanese search • 1995 to 1997, Japanese Internet more than doubling each year • Marketing need to launch in Japan ASAP • Economical and efficient solution • Produce reusable internationalized code • Poise Lycos for even quicker deployment into other languages • Get "more bang for the buck"

  7. Two Main Functions of a Search Engine • Building a catalog: Compiling an indexed catalog of webpages from the Internet • Performing a query: Delivering a list of webpages matching certain keywords and parameters input by the user

  8. Japanese Issues for Catalog • Double-Byte: Japanese characters are double-byte. • Multiple encodings: Japanese webpages use 3 encodings: Shift-JIS, EUC-JP, and ISO-2022-JP. • Options: Multiple vs. Single Catalog • Three catalogs: one in Shift-JIS, one in EUC-JP, one in ISO-2022-JP (an awkward and complicated solution to implement) OR • One catalog: all catalog data either in one Japanese encoding or in Unicode

  9. Single Catalog Options • A) Convert all data to one Japanese encoding • ISO-2022-JP, Shift-JIS, or EUC-JP • B) Convert all data to Unicode: • The quick and economical choice, Unicode is . . . • A superset of all scripts and character set encodings used on the Web, therefore reusable for other languages • More easily implemented into existing code originally written for processing single-byte ASCII

  10. The Unicode Plan • Use Unicode in catalog & internal processing • Because all electronic text on the Web maps cleanly into Unicode • Required elements: • Character encoding conversions: Unicode ↔ webpage encodings (webpage encodings: Shift-JIS, ISO-2022-JP, and EUC-JP) • Encoding auto-detection • Japanese word breaking

  11. Encoding Conversion • Purpose: Convert data between encodings used on the Web and Unicode (which is still not used universally on the Web) • From 寿司 in Shift-JIS you want 寿司 in Unicode • Functionality provided by Basis Technology's Rosette embedded in Lycos code as source • Rosette is a cross-platform C++ library for Unicode; http://www.basistech.com/products/ • Complete set of mapping tables between Unicode and major legacy encodings • Conversions performed quickly and economically with minimal impact on performance
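
The conversion step on this slide can be sketched as follows. This is only an illustration: it uses the codecs bundled with Python as a stand-in for the Rosette mapping tables, and the names `WEB_ENCODINGS`, `to_unicode`, and `from_unicode` are hypothetical, not part of the Lycos or Basis Technology code.

```python
# Illustration only: Python's bundled codecs stand in for the mapping tables
# that the slide attributes to Rosette. All names here are hypothetical.
WEB_ENCODINGS = ("shift_jis", "euc_jp", "iso2022_jp")


def to_unicode(raw: bytes, encoding: str) -> str:
    """Decode bytes in a known legacy encoding into a Unicode string."""
    return raw.decode(encoding)


def from_unicode(text: str, encoding: str) -> bytes:
    """Encode a Unicode string back into a legacy encoding (e.g. for output)."""
    return text.encode(encoding)


sushi = "寿司"
encoded = {enc: from_unicode(sushi, enc) for enc in WEB_ENCODINGS}

# The same two characters are three different byte sequences on the wire...
assert len(set(encoded.values())) == 3
# ...but each of them converts to the same Unicode text.
assert all(to_unicode(raw, enc) == sushi for enc, raw in encoded.items())
```

Note that the caller has to supply the source encoding; conversion alone cannot work it out, which is what the next slides address.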

  12. Why Encoding Auto-Detection? • In order to convert text to another encoding, you have to know where you’re starting from. Or you could get . . . • Ex. Text in EUC-JP when viewed as other encodings. EUC-JP: 寿司 コンピュータ 花見 Shift-JIS: シハ ・ウ・ヤ・蝪シ・ソ イヨクォ ASCII:
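
The effect shown on this slide can be reproduced in a few lines. The sketch below assumes Python's bundled `euc_jp` and `shift_jis` codecs and ordinary ASCII spaces between the words; the exact garbage produced depends on the decoder, so no particular output is claimed.

```python
text = "寿司 コンピュータ 花見"
euc_bytes = text.encode("euc_jp")

# Correct guess: the text round-trips unchanged.
assert euc_bytes.decode("euc_jp") == text

# Wrong guess: the same bytes read as Shift-JIS come out as garbage.
# errors="replace" substitutes U+FFFD where the bytes are not valid Shift-JIS.
as_sjis = euc_bytes.decode("shift_jis", errors="replace")
assert as_sjis != text

# 7-bit ASCII cannot represent the Japanese part at all: every byte other
# than the spaces has its high bit set.
assert all(b >= 0x80 for b in euc_bytes if b != 0x20)

print(as_sjis)
```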

  13. Encoding Auto-Detection • Purpose: to correctly identify encoding of webpage or query in order to convert properly from one encoding to another. • Functionality provided by Basis Technology's Rosette • Auto-detection on Japanese text in Shift-JIS, EUC-JP, or ISO-2022-JP encodings • Enhanced tiebreaker functionality to auto-detect very short strings (queries)
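
To make the detection problem concrete, here is a deliberately naive sketch. It is not how Rosette works (the slides do not describe its method); it simply looks for ISO-2022-JP escape sequences and otherwise tries a strict decode with Python's bundled codecs. The function name `candidate_encodings` is hypothetical.

```python
from typing import List

# Escape sequences that ISO-2022-JP uses to switch character sets (RFC 1468).
ISO2022JP_ESCAPES = (b"\x1b$@", b"\x1b$B", b"\x1b(B", b"\x1b(J")


def candidate_encodings(raw: bytes) -> List[str]:
    """Return the Japanese encodings under which `raw` is structurally valid."""
    if any(esc in raw for esc in ISO2022JP_ESCAPES):
        return ["iso2022_jp"]
    candidates = []
    for enc in ("euc_jp", "shift_jis"):
        try:
            raw.decode(enc)  # strict decode: raises on invalid byte sequences
        except UnicodeDecodeError:
            continue
        candidates.append(enc)
    return candidates


# Longer or more distinctive input usually leaves one candidate.
print(candidate_encodings("寿司".encode("shift_jis")))
# A very short string can be valid in more than one encoding: these two bytes
# are half-width katakana in Shift-JIS and a single kanji in EUC-JP.
print(candidate_encodings(b"\xbc\xca"))
```

When more than one candidate survives, validity checks alone cannot decide, which is presumably why the slide singles out tiebreaking for very short strings such as queries.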

  14. Japanese Word Breaking • Purpose: To return indexable units (words) for creating an index, or for breaking the query into words to look up in the index. • Problem: Japanese words are not delimited by spaces • Solution: Basis Technology's Japanese Morphological Analyzer (http://www.basistech.com/products/) • Dictionary-based Japanese word breaking • Elimination of stop words (ex. “a”, “the”, etc.) • Looks for longest word match
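
The three ideas on this slide (dictionary lookup, stop-word elimination, longest match) can be sketched with a greedy segmenter. This is a toy under stated assumptions, not the Basis Technology analyzer: the dictionary and stop-word list below are made up for the example, and a real morphological analyzer does far more than greedy matching.

```python
from typing import List, Set

# Toy data, invented for this example only.
DICTIONARY: Set[str] = {"花", "花見", "寿司", "に", "を", "行く", "食べる"}
STOP_WORDS: Set[str] = {"に", "を"}


def break_words(text: str, dictionary: Set[str], stop_words: Set[str]) -> List[str]:
    """Greedy longest-match segmentation that drops stop words."""
    max_len = max((len(word) for word in dictionary), default=1)
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                break
        else:
            candidate = text[i]  # unknown character: emit it on its own
        if candidate not in stop_words and not candidate.isspace():
            words.append(candidate)
        i += len(candidate)
    return words


print(break_words("花見に行く", DICTIONARY, STOP_WORDS))
```

Because the longest candidate is tried first, 花見 is returned as one unit even though 花 is also in the dictionary.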

  15. Selecting Unicode Representation (1) UCS2 characteristics • Depending on the task, either the UCS2 or UTF8 representation of Unicode was used in different parts of the Lycos search • Characteristics of UCS2 • Each coded character element is fixed width, 16 bits • Data paths must all accommodate 16 bits • Text in UCS2 is easy to manipulate and analyze (from a programming viewpoint)

  16. Selecting Unicode Representation (2) UTF8 characteristics • Characteristics of UTF8 • Each coded character is composed of one to six octets (one octet = 8 bits) • Data paths need only be "8-bit clean" • None of the octets in a multi-byte character is null (i.e., has the value of zero) • Text in UTF8 is difficult to manipulate or analyze. • "8-bit clean" = computer code which treats all 8 bits of a byte as significant. True of any computer code that processes European languages properly, but not necessarily true of code that processes only ASCII, which uses only 7 bits per character.
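
The trade-offs listed on the last two slides can be checked directly. The sketch below assumes Python's `utf-16-be` codec as a stand-in for UCS2, which is byte-for-byte identical for the characters used here; the helper name `widths` is hypothetical. One caveat on the figures: "one to six octets" matches the UTF-8 definition in force at the time, while the current definition (RFC 3629) allows at most four, and any character that fits in UCS2 needs at most three.

```python
from typing import Tuple


def widths(ch: str) -> Tuple[int, int]:
    """Return (UCS2 octets, UTF8 octets) for one Basic Multilingual Plane character."""
    # For BMP characters, big-endian UTF-16 is the same byte sequence as UCS2.
    return len(ch.encode("utf-16-be")), len(ch.encode("utf-8"))


for label, ch in [("ASCII", "a"), ("Latin with diacritic", "\u00e9"), ("Japanese", "寿")]:
    print(label, widths(ch))

# UCS2 is fixed width, but it puts a zero octet in front of every ASCII
# character, which breaks code that treats zero as end-of-string.
assert "a".encode("utf-16-be") == b"\x00a"

# UTF8 multi-byte sequences never contain a zero octet...
assert 0 not in "寿司".encode("utf-8")

# ...but they do need all 8 bits: a pipe that strips the high bit destroys the text.
stripped = bytes(b & 0x7F for b in "寿司".encode("utf-8"))
assert stripped.decode("ascii") != "寿司"
```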

  17. UCS2, UTF8, ASCII, etc. [Diagram: an ASCII character (7 bits), a Latin character with a diacritical (8 bits), and a Japanese character (double-byte, in Shift-JIS, EUC-JP, etc.), shown as UTF8 and as UCS2 against an 8-bit clean data pipe. Label: 16-bit UCS2 can’t fit :( ]

  18. Unicode in the Lycos System • UCS2: Japanese Morphological Analyzer from Basis Technology • Using UCS2 is the quick and economical way to process huge volumes of Japanese text. • UTF8: Lycos Catalog • Economy of disk space: ASCII is smaller in UTF8 • On the Web: ASCII 79%, double-byte Asian less than 5%, European encodings and others 16% • Ease of integration with existing code (a.k.a. transmissibility) • Based on the number of Web hosts on the Internet by country (total number of hosts for English-speaking domains as a percentage of the total number of hosts worldwide). Source: Survey by Network Wizards, http://www.nw.com
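
The disk-space argument can be made concrete with a back-of-the-envelope estimate. The sketch below rests on assumptions that are not in the slides: it treats the host-count shares quoted above as if they were shares of characters in the catalog, rounds "less than 5%" up to 5%, and pessimistically charges every "European and others" character two octets in UTF8 and every Asian character three.

```python
# Shares taken from the slide, used here as a rough proxy for character mix.
shares = {"ascii": 0.79, "european_other": 0.16, "asian": 0.05}

# Assumed octets per character in UTF8 (pessimistic for European text,
# where most letters are plain ASCII and cost only one octet).
utf8_octets = {"ascii": 1, "european_other": 2, "asian": 3}

utf8_avg = sum(shares[k] * utf8_octets[k] for k in shares)
ucs2_avg = 2.0  # every character costs two octets in UCS2

print(f"UTF8: {utf8_avg:.2f} octets/char, UCS2: {ucs2_avg:.2f} octets/char")
print(f"UTF8 catalog is about {utf8_avg / ucs2_avg:.0%} the size of a UCS2 one")
```

Under these assumptions the average works out to 0.79 + 0.32 + 0.15 = 1.26 octets per character, against a flat 2 for UCS2.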

  19. Project Complete: Lycos Japan (1) • Quick: Prototype of Japanese search is produced in two months. Lycos Japan: http://www.lycos.co.jp • Beta version of Japanese search debuts July 1998; enters competitive Japanese search engine race in 4th place* • Upon formal launch grabs 2nd place in October 1998* *According to Search Desk, http://www.searchdesk.com

  20. Project Complete: Lycos Japan (2) • E-conomical: Today, Lycos has spider, catalog and query software, which may easily be set to make catalogs in different languages by swapping in and out localized pieces: • Settings for target domains • Encoding detection and conversion calls • Language-specific word breaker (if needed)
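
One way to picture the "swappable localized pieces" is as a small per-language configuration object. The sketch below is purely hypothetical: the class `CatalogLocale`, the function `index_page`, and the domain and encoding values are invented for illustration and say nothing about how the Lycos software is actually organized. The Japanese word breaker is a placeholder that splits into single characters.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class CatalogLocale:
    """The three localized pieces named on the slide (hypothetical layout)."""
    target_domains: Tuple[str, ...]            # settings for target domains
    encodings: Tuple[str, ...]                 # encodings to detect and convert
    break_words: Callable[[str], List[str]]    # language-specific word breaker


def index_page(raw: bytes, encoding: str, locale: CatalogLocale) -> List[str]:
    """Convert a page to Unicode and return its indexable words."""
    if encoding not in locale.encodings:
        raise ValueError(f"unsupported encoding for this locale: {encoding}")
    return locale.break_words(raw.decode(encoding))


ENGLISH = CatalogLocale((".com",), ("ascii", "latin_1"), str.split)
JAPANESE = CatalogLocale(
    (".jp",),
    ("shift_jis", "euc_jp", "iso2022_jp"),
    lambda text: [ch for ch in text if not ch.isspace()],  # placeholder breaker
)

print(index_page(b"hello world", "ascii", ENGLISH))
print(index_page("寿司".encode("euc_jp"), "euc_jp", JAPANESE))
```

The shared `index_page` code stays the same; only the locale object changes from one language to the next.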

  21. Q&A Questions? tina@basistech.com www.basistech.com jvanderclute@lycos.com www.lycos.com Thank you!