Destination Japan: Internationalization of the Lycos Search Engine


Description
Destination Japan: Internationalization of the Lycos Search Engine. Presented by: Jeff Vander Clute of Lycos, Inc. and Tina Lieu of Basis Technology Corp.
Transcripts
Slide 1

Destination Japan: Internationalization of the Lycos Search Engine Presented by: Jeff Vander Clute of Lycos, Inc. & Tina Lieu of Basis Technology Corp.

Slide 2

Lycos is...
A new-era Web company
- 4 top-20 Web properties in the network
- Lycos, Tripod, Angelfire, HotBot
A "hub"
Search Engine & Navigation
- Patented search & catalog technology
Community & Communication
E-commerce, Content Aggregation, etc.

Slide 3

The Search Technology
Created by a CMU professor (Fuzzy Mauldin) & students in 1994/95.
1. "Intelligent" spidering technology (now patented), but not internationalized. Spiders crawl the Web retrieving documents for indexing.
2. Back-end database of Web pages, or catalog, plus relevance algorithms for ordering search results.

Slide 7

First Stop: Europe
Lycos search technology was initially ASCII only.
In-house work to make data paths 8-bit clean, to accommodate European languages. Mostly relatively straightforward.
Components such as ad servers, Web servers, etc., required few if any changes.
Euro service came online in May 1997.

Slide 8

What’s Unicode? Where’s Japan?
The more interesting problem.
Business reasons to introduce Japanese search.
However, not a lot of international(ization) experience within Lycos at the time.
We needed help and chose Basis Technology.

Slide 9

Goals
Quick deployment of Japanese search
- 1995 to 1997, Japanese Internet multiplying in size every year
- Marketing need to launch in Japan ASAP
Economical and efficient solution
- Produce reusable internationalized code
- Position Lycos for even faster deployment into other languages
- Get "more bang for the buck"

Slide 10

Two Main Functions of a Search Engine
Building a catalog
- Compiling an index of Web pages from the Internet
Performing a query
- Delivering a list of Web pages matching certain keywords and parameters input by the user
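The two functions can be illustrated with a toy inverted index. This is only a sketch under simplifying assumptions (whitespace-delimited words, an in-memory dictionary, no relevance ranking); the names `build_catalog` and `run_query` and the example URLs are hypothetical and say nothing about how the Lycos catalog was actually implemented.

```python
from collections import defaultdict


def build_catalog(pages):
    """Build a catalog: map each keyword to the set of page URLs containing it."""
    catalog = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            catalog[word].add(url)
    return catalog


def run_query(catalog, query):
    """Perform a query: return the sorted URLs that contain every keyword."""
    words = query.lower().split()
    if not words:
        return []
    matches = set(catalog.get(words[0], set()))
    for word in words[1:]:
        matches &= catalog.get(word, set())
    return sorted(matches)


# Hypothetical pages standing in for documents retrieved by a spider.
pages = {
    "http://example.com/a": "sushi restaurants in Tokyo",
    "http://example.com/b": "cherry blossom viewing in Tokyo",
}
catalog = build_catalog(pages)
print(run_query(catalog, "tokyo sushi"))
```

Splitting on whitespace works for English but not for Japanese, which is why word breaking becomes a separate component later in the deck.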

Slide 15

Japanese Issues for Catalog
Double-byte: Japanese characters are double-byte.
Multiple encodings: Japanese pages use 3 encodings: Shift-JIS, EUC-JP, and ISO-2022-JP.
Options: Multiple vs. Single Catalog
- Three catalogs: one in Shift-JIS, one in EUC-JP, one in ISO-2022-JP (an awkward and complicated solution to implement)
OR
- One catalog: all catalog data either in one Japanese encoding or in Unicode

Slide 16

Single Catalog Options
A) Convert all data to one Japanese encoding: ISO-2022-JP, Shift-JIS, or EUC-JP
B) Convert all data to Unicode: the quick and economical choice. Unicode is...
- A superset of all scripts and character set encodings used on the Web, and thus reusable for other languages
- More easily implemented into existing code originally written for processing single-byte ASCII

Slide 17

The Unicode Plan
Use Unicode in the catalog & internal processing, because all electronic text on the Web maps cleanly into Unicode.
Required components:
- Character encoding conversions between Unicode and Web page encodings (page encodings: Shift-JIS, ISO-2022-JP, and EUC-JP)
- Encoding auto-detection
- Japanese word breaking

Slide 18

Encoding Conversion
Purpose: Convert data between encodings used on the Web and Unicode (which is still not widely used on the Web).
From 寿司 in Shift-JIS you want 寿司 in Unicode.
Functionality provided by Basis Technology's Rosette, embedded in Lycos code as source.
- Rosette is a cross-platform C++ library for Unicode; http://www.basistech.com/items/
- Complete set of mapping tables between Unicode and major legacy encodings
- Conversions performed quickly and economically with minimal impact on performance
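The conversion step can be sketched with Python's built-in codecs, which are used here purely as a stand-in for Rosette's mapping tables (Rosette's own API is not shown in the slides). The helper names `to_unicode` and `from_unicode` are hypothetical.

```python
def to_unicode(raw: bytes, source_encoding: str) -> str:
    """Convert bytes in a legacy Web encoding to a Unicode string."""
    return raw.decode(source_encoding)


def from_unicode(text: str, target_encoding: str) -> bytes:
    """Convert a Unicode string back to a legacy Web encoding."""
    return text.encode(target_encoding)


sushi = "寿司"
for name in ("shift_jis", "euc_jp", "iso2022_jp"):
    raw = from_unicode(sushi, name)
    # The byte sequences differ per encoding, but all map to the same Unicode text.
    assert to_unicode(raw, name) == sushi
    print(name, raw.hex())
```

Because every legacy encoding converges on one Unicode representation, a single catalog can serve pages that arrived in any of the three encodings.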

Slide 19

Why Encoding Auto-Detection?
In order to convert text to another encoding, you have to know where you’re starting from. Otherwise you could get...
Ex. Text in EUC-JP when viewed as different encodings.
EUC-JP: 寿司 コンピュータ 花見
Shift-JIS: シハ ・ウ・ヤ・蝪シ・ソ イヨクォ
ASCII:
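The effect on this slide can be reproduced with a short sketch using Python's built-in codecs (an assumption of this example, not the tooling used by the presenters). The exact garbage characters depend on the decoder and its error handling, so they may not match the slide character for character.

```python
text = "寿司 コンピュータ 花見"
euc_bytes = text.encode("euc_jp")

# Decoding with the correct encoding recovers the original text.
as_euc = euc_bytes.decode("euc_jp")

# Decoding the same bytes with the wrong encoding produces garbage;
# errors="replace" substitutes U+FFFD for byte sequences that are invalid.
as_sjis = euc_bytes.decode("shift_jis", errors="replace")
as_ascii = euc_bytes.decode("ascii", errors="replace")

print("EUC-JP:   ", as_euc)
print("Shift-JIS:", as_sjis)
print("ASCII:    ", as_ascii)
```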

Slide 20

Encoding Auto-Detection
Purpose: to correctly identify the encoding of a Web page or query in order to convert properly from one encoding to another.
Functionality provided by Basis Technology's Rosette
- Auto-detection on Japanese text in Shift-JIS, EUC-JP, or ISO-2022-JP encodings
- Enhanced tie-breaker functionality to auto-detect short strings (queries)
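A naive way to see why detection needs a tie-breaker is trial decoding: keep every candidate encoding under which the bytes decode without error. The sketch below is an assumption-laden illustration, not Rosette's algorithm; the function name `detect_japanese_encoding` is hypothetical, and a production detector would also use statistical evidence such as byte-pair frequencies.

```python
from typing import List


def detect_japanese_encoding(raw: bytes) -> List[str]:
    """Return the candidate encodings under which `raw` decodes cleanly."""
    # ISO-2022-JP is a 7-bit encoding that switches character sets with
    # ESC sequences, so check for those before the 8-bit encodings.
    if b"\x1b$" in raw or b"\x1b(" in raw:
        try:
            raw.decode("iso2022_jp")
            return ["iso2022_jp"]
        except UnicodeDecodeError:
            pass
    candidates = []
    for name in ("euc_jp", "shift_jis"):
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        candidates.append(name)
    return candidates


page = "寿司 コンピュータ 花見"
print(detect_japanese_encoding(page.encode("iso2022_jp")))
print(detect_japanese_encoding(page.encode("euc_jp")))
print(detect_japanese_encoding(page.encode("shift_jis")))
# Pure ASCII is valid in both 8-bit encodings, so the result is ambiguous.
print(detect_japanese_encoding(b"sushi"))
```

Short inputs such as queries often decode cleanly under more than one encoding, which is the situation the tie-breaker on the slide addresses.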

Slide 21

Japanese Word Breaking
Purpose: To return indexable units (words) for creating an index, or for breaking the query into words to look up in the index.
Problem: Japanese words are not delimited by spaces.
Solution: Basis Technology's Japanese Morphological Analyzer (http://www.basistech.com/items/)
- Dictionary-based Japanese word breaking
- Elimination of stop words (ex. “a”, “the”, etc.)
- Looks for longest word match
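Dictionary-based longest-match breaking with stop-word removal can be sketched as a greedy scan. This is a minimal illustration, not the Basis Technology analyzer: the function name `break_words`, the tiny dictionary, and the stop-word set are all hypothetical, and a real morphological analyzer uses a far larger lexicon plus grammatical information.

```python
from typing import List, Set


def break_words(text: str, dictionary: Set[str], stop_words: Set[str]) -> List[str]:
    """Greedy longest-match segmentation followed by stop-word removal."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary
        # word matches; fall back to a single character if none does.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                break
        tokens.append(candidate)
        i += length
    return [t for t in tokens if t not in stop_words and not t.isspace()]


# Toy lexicon: note that both 東京 and 東京都 are present.
DICTIONARY = {"東京", "東京都", "寿司", "食べる", "で", "を"}
STOP_WORDS = {"で", "を"}

print(break_words("東京都で寿司を食べる", DICTIONARY, STOP_WORDS))
```

With both 東京 and 東京都 in the dictionary, the scan prefers the longer 東京都, which is the "longest word match" behavior named on the slide.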

Slide 22

Selecting Unicode Representation (1): UCS2 characteristics
Depending on the task, either the UCS2 or UTF8 representation of Unicode was used in different parts of the Lycos search.
Characteristics of UCS2
- Each coded character element is fixed width, 16 bits
- Data paths must all accommodate 16 bits
- Text in UCS2 is easy to manipulate and analyze (from a programming point of view)

Slide 23

Selecting Unicode Representation (2): UTF8 characteristics
Characteristics of UTF8
- Each coded character is composed of one to six octets (one octet = 8 bits)
- Data paths need only be "8-bit clean"
- None of the octets in a multi-byte character are null (i.e., has the value of zero)
- Text in UTF8 is difficult to manipulate or analyze
"8-bit clean" = computer code which treats all 8 bits of a byte as significant. True for any computer code that processes European languages properly, but not necessarily true for code that processes only ASCII, which uses only 7 bits per character.
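The contrast between the two representations can be checked with a few lines of Python. Two assumptions apply: Python's `utf-16-be` codec is used as a stand-in for UCS2 (the bytes are identical for the characters shown), and `strip_high_bit` is a hypothetical helper that simulates a data path that is not 8-bit clean. Note also that the "one to six octets" figure reflects the UTF8 definition of the time; the current definition (RFC 3629) limits a character to four octets.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Simulate a 7-bit-only data path by clearing the top bit of each byte."""
    return bytes(b & 0x7F for b in data)


for ch in ("A", "é", "寿"):
    utf8 = ch.encode("utf-8")
    ucs2 = ch.encode("utf-16-be")  # same bytes as UCS2 for these characters
    print(
        ch,
        "UTF8 octets:", len(utf8),
        "UCS2 octets:", len(ucs2),
        "zero octet in UTF8:", 0 in utf8,
        "zero octet in UCS2:", 0 in ucs2,
    )

# ASCII survives a 7-bit path unchanged, but multi-byte UTF8 text is damaged.
ascii_text = b"sushi"
utf8_text = "寿司".encode("utf-8")
print(strip_high_bit(ascii_text) == ascii_text)
print(strip_high_bit(utf8_text) == utf8_text)
```

The zero octets in UCS2 matter for existing C-style code that treats a zero byte as the end of a string, which is one reason UTF8 integrates more easily with code written for ASCII.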

Slide 24

UCS2, UTF8, ASCII, etc.
[Diagram: an 8-bit clean data pipe; 16-bit UCS2 can’t fit :( The diagram compares characters as UTF8 and as UCS2: ASCII (7 bits); Latin character (8 bits, w/ diacritical); Japanese character (double-byte, in Shift-JIS, EUC-JP, etc.)]

Slide 25

Unicode in the Lycos System
UCS2: Japanese Morphological Analyzer from Basis Technology
- Using UCS2 is the quick and economical way to process huge volumes of Japanese text.
UTF8: Lycos Catalog
- Economy of disk space: ASCII is smaller in UTF8
- On the Web: ASCII 79%, double-byte Asian under 5%, European encodings and others 16%*
- Ease of integration with existing code (a.k.a. portability)
*Based on the number of Web hosts on the Internet by country (total number of hosts for English-speaking domains as a percentage of the total number of hosts worldwide). Source: Survey by Network Wizards, http://www.nw.com
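The disk-space argument can be made concrete with rough arithmetic. The sketch below rests on strong assumptions that are not in the slides: the host-based shares are treated as if they were shares of characters, every Asian character is counted as 3 octets in UTF8, and every character in the "European and others" share is counted as 2 octets (a pessimistic figure, since most European text is plain ASCII letters).

```python
# Shares from the slide, treated here (as an assumption) as character shares.
web_mix = {"ascii": 0.79, "asian": 0.05, "european_other": 0.16}

# Assumed UTF8 cost in octets per character for each class.
utf8_octets = {"ascii": 1, "asian": 3, "european_other": 2}

avg_utf8 = sum(web_mix[k] * utf8_octets[k] for k in web_mix)
avg_ucs2 = 2.0  # UCS2 is fixed width: 2 octets per character

print(f"UTF8: about {avg_utf8:.2f} octets per character")
print(f"UCS2: {avg_ucs2:.2f} octets per character")
```

Under these assumptions UTF8 averages about 1.26 octets per character against 2 for UCS2, which is consistent with the slide's point that a mostly ASCII catalog is smaller in UTF8.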

Slide 30

Project Complete: Lycos Japan (1)
Quick:
- Prototype of Japanese search is delivered in two months.
- Lycos Japan: http://www.lycos.co.jp
- Beta version of Japanese search debuts July 1998; enters the competitive Japanese search engine race in fourth place*
- Upon formal launch, grabs second place in October 1998*
*According to Search Desk, http://www.searchdesk.com

Slide 32

Project Complete: Lycos Japan (2)
E-conomical: Today, Lycos has spider, catalog, and query software which can easily be configured to create catalogs in different languages by swapping in and out a limited set of pieces:
- Settings for target domains
- Encoding detection and conversion calls
- Language-specific word breaker (if needed)
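One way to picture the swappable pieces is a per-language profile passed to otherwise unchanged indexing code. Everything in this sketch is hypothetical (the `LanguageProfile` class, its field names, the example domains, and the placeholder word breakers); it illustrates the design idea only and does not describe the actual Lycos configuration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class LanguageProfile:
    target_domains: Tuple[str, ...]           # settings for target domains
    decode: Callable[[bytes], str]            # encoding detection/conversion
    break_words: Callable[[str], List[str]]   # language-specific word breaker


def index_terms(raw_page: bytes, profile: LanguageProfile) -> List[str]:
    """Language-independent step: decode to Unicode, then break into words."""
    text = profile.decode(raw_page)
    return profile.break_words(text)


# European profile: one fixed 8-bit encoding and whitespace word breaking.
western = LanguageProfile(
    target_domains=(".de", ".fr"),
    decode=lambda raw: raw.decode("latin-1"),
    break_words=lambda text: text.lower().split(),
)

# Japanese profile: a fixed EUC-JP decoder and a per-character breaker are
# placeholders for real auto-detection and a morphological analyzer.
japanese = LanguageProfile(
    target_domains=(".jp",),
    decode=lambda raw: raw.decode("euc_jp"),
    break_words=lambda text: [ch for ch in text if not ch.isspace()],
)

print(index_terms("Café Müller".encode("latin-1"), western))
print(index_terms("寿司".encode("euc_jp"), japanese))
```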

Slide 33

Q&A

Slide 34

Q&A Questions? tina@basistech.com www.basistech.com jvanderclute@lycos.com www.lycos.com

