Automatic Creation - PDF Document

Presentation Transcript

  1. H. P. Luhn The Automatic Creation of Literature Abstracts*

Abstract: Excerpts of technical papers and magazine articles that serve the purposes of conventional abstracts have been created entirely by automatic means. In the exploratory research described, the complete text of an article in machine-readable form is scanned by an IBM 704 data-processing machine and analyzed in accordance with a standard program. Statistical information derived from word frequency and distribution is used by the machine to compute a relative measure of significance, first for individual words and then for sentences. Sentences scoring highest in significance are extracted and printed out to become the “auto-abstract.”

Introduction

The purpose of abstracts in technical literature is to facilitate quick and accurate identification of the topic of published papers. The objective is to save a prospective reader time and effort in finding useful information in a given article or report.

The preparation of abstracts is an intellectual effort, requiring general familiarity with the subject. To bring out the salient points of an author’s argument calls for skill and experience. Consequently a considerable amount of qualified manpower that could be used to advantage in other ways must be diverted to the task of facilitating access to information. This widespread problem is being aggravated by the ever-increasing output of technical literature. But another problem, perhaps equally acute, is that of achieving consistence and objectivity in abstracts.

The abstracter’s product is almost always influenced by his background, attitude, and disposition. The abstracter’s own opinions or immediate interests may sometimes bias his interpretation of the author’s ideas. The quality of an abstract of a given article may therefore vary widely among abstracters, and if the same person were to abstract an article again at some other time, he might come up with a different product.

The application of machine methods to literature searching is currently receiving a great deal of attention and now indicates that both human effort and bias may be eliminated from the abstracting process. Although rapid progress is being made in the development of systems using modern electronic data-processing devices, their efficiency depends on availability of literary information in machine-readable form. It is evident that the transcription of existing printed text into this form would have to be done manually at this time. In the future, however, print-reading devices should be sufficiently developed for this task. For material not yet printed, tape-punching devices attached to typewriters and typesetting machines could readily produce machine-readable records as by-products.

This paper describes some exploratory research on automatic methods of obtaining abstracts. The system outlined here begins with the document in machine-readable form and proceeds by means of a programmed sampling process comparable to the scanning a human reader would do. However, instead of sampling at random, as a reader normally does when scanning, the new mechanical method selects those among all the sentences of an article that are the most representative of pertinent information. These key sentences are then enumerated to serve as clues for judging the character of the article. Thus, citations of the author’s own statements constitute the “auto-abstract.”

The programs for creating auto-abstracts must be based on properties of writing ascertained by analysis of specific types of literature. Because the use of abstracts is an established practice in science and technology, it seemed desirable to develop the method first for papers and articles in this area. A primary objective of the development was to arrive at a system that could take full advantage of the capabilities of a modern electronic data-processing system such as the IBM 704 or 705, while at the same time keeping the scheme as simple as possible.

*Presented at IRE National Convention, New York, March 24, 1958.
IBM JOURNAL, APRIL 1958

  2. Measuring significance

To determine which sentences of an article may best serve as the auto-abstract, a measure is required by which the information content of all the sentences can be compared and graded. Since the suitability of each sentence is relative, a value can be assigned to each in accordance with the quality criterion of significance.

The “significance” factor of a sentence is derived from an analysis of its words. It is here proposed that the frequency of word occurrence in an article furnishes a useful measurement of word significance. It is further proposed that the relative position within a sentence of words having given values of significance furnishes a useful measurement for determining the significance of sentences. The significance factor of a sentence will therefore be based on a combination of these two measurements.

It should be emphasized that this system is based on the capabilities of machines, not of human beings. Therefore, regrettable as it might appear, the intellectual aspects of writing and of meaning cannot serve as elements of such machine systems. To a machine, words can be only so many physical things. It can find out whether or not certain such things are similar and how many of them there are. The machine can remember such findings and can perform arithmetic on those which can be counted. It can do all of this by means of suitable program instructions. The human intellect need be relied upon only to prepare the program.

Establishing a set of significant words

The justification of measuring word significance by use-frequency is based on the fact that a writer normally repeats certain words as he advances or varies his arguments and as he elaborates on an aspect of a subject. This means of emphasis is taken as an indicator of significance. The more often certain words are found in each other’s company within a sentence, the more significance may be attributed to each of these words. Though certain other words must be present to serve the important function of tying these words together, the type of significance sought here does not reside in such words. If such common words can be segregated substantially by non-intellectual methods, they could then be excluded from consideration.

This rather unsophisticated argument on “significance” avoids such linguistic implications as grammar and syntax. In general, the method does not even propose to differentiate between word forms. Thus the variants differ, differentiate, different, differently, difference and differential could ordinarily be considered identical notions and regarded as the same word. No attention is paid to the logical and semantic relationships the author has established. In other words, an inventory is taken and a word list compiled in descending order of frequency.

Procedures as simple as these, of course, are rewarding from the standpoint of economy. The more complex the method, the more operations must the machine perform and therefore the more costly will be the process. But in this case an even more fundamental justification for simplicity can be found in the nature of technical writing. Within a technical discussion, there is a very small probability that a given word is used to reflect more than one notion. The probability is also small that an author will use different words to reflect the same notion. Even if the author makes a reasonable effort to select synonyms for stylistic reasons, he soon runs out of legitimate alternatives and falls into repetition if the notion being expressed was potentially significant in the first place.

A word list compiled in accordance with the method outlined will generally take the form of the diagram in Fig. 1. The presence in the region of highest frequency of many of the words previously described as too common to have the type of significance being sought would constitute “noise” in the system. This noise can be materially reduced by an elimination technique in which text words are compared with a stored common-word list. A simpler way might be to determine a high-frequency cutoff through statistical methods to establish “confidence limits.” If the line C in the figure represents this cutoff, only words to its right would be considered suitable for indicating significance. Since degree of frequency has been proposed as a criterion, a lower boundary, line D, would also be established to bracket the portion of the spectrum that would contain the most useful range of words. Establishing optimum locations for both lines would be a matter of experience with appropriately large samples of published articles. It should even be possible to adjust these locations to alter the characteristics of the output.

The curve for the degree of discrimination, or “resolving power,” of the bracketed words in the figure might look something like the dotted line, E. It is apparent that words that cannot be put in the category of common words may sometimes fall to the left of line C. If the program has been properly formulated, the location of these words on the diagram would indicate their loss of discriminatory power. The word “cell” in an article on biology may be an example of this. It may be anticipated that the cutoff line, once established, may be stable over many different degrees of specialization within a field, or even over many different fields. Moreover, the resolving power would increase with the need for finer resolution. The case of a common word falling in the region to the right of line C can be tolerated because of its lesser degree of interference.

Establishing relative significance of sentences

As pointed out earlier, the method to be developed here is a probabilistic one based on the physical properties of written texts. No consideration is to be given to the meaning of words or the arguments expressed by word combinations. Instead it is here argued that, whatever the topic, the closer certain words are associated, the more specifically an aspect of the subject is being treated. Therefore, wherever the greatest number of frequently occurring different words are found in greatest physical proximity to each other, the probability is very high that the information being conveyed is most representative of the article.

  3. The significance of degree of proximity is based on the characteristics of spoken and written language in that ideas most closely associated intellectually are found to be implemented by words most closely associated physically. The divisions of written text into sentences, paragraphs, chapters, et cetera, is another physical manifestation of the graduating degree of association of ideas. These aspects have been discussed in detail in an earlier paper by the writer.*

From these considerations a “significance factor” can be derived which reflects the number of occurrences of significant words within a sentence and the linear distance between them due to the intervention of non-significant words. All sentences may be ranked in order of their significance according to this factor, and one or several of the highest ranking sentences may then be selected to serve as the auto-abstract.

It must be kept in mind that, when a statistical procedure is applied to produce such rankings, the criterion is the relationship of the significant words to each other rather than their distribution over a whole sentence. It therefore appears proper to consider only those portions of sentences which are bracketed by significant words and to set a limit for the distance at which any two significant words shall be considered as being significantly related. A significant word beyond that limit would then be disregarded from consideration in a given bracket, although it might form a bracket, or cluster, in conjunction with other words in the sentence. An analysis of many documents has indicated that a useful limit is four or five non-significant words between significant words. If with this separation two or more clusters result, the highest one of the several significance factors is taken as the measure for that sentence.

A scheme for computing the significance factor is given by way of example in Fig. 2.
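The bracketing rule and the Fig. 2 computation (the square of the number of significant words in a cluster, divided by the total number of words in that cluster) can be illustrated with a short sketch. The Python below is not the IBM 704 program. It assumes the sentence has already been split into words and that the set of significant words is known. The names `significance_factor` and `max_gap` are hypothetical, and giving an isolated significant word a factor of 1.0 is an assumption that the text does not address.

```python
def significance_factor(words, significant, max_gap=4):
    """Return the highest cluster score found in one sentence.

    words: the sentence as a list of word tokens.
    significant: set of words treated as significant.
    max_gap: largest number of non-significant words allowed between two
        significant words of the same cluster (the text suggests 4 or 5).
    """
    # Positions of the significant words within the sentence.
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0

    best = 0.0
    start = prev = positions[0]   # first and latest word of the open cluster
    count = 1                     # significant words in the open cluster
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            count += 1            # close enough: extend the cluster
        else:
            # Too far away: score the finished cluster, then open a new one.
            best = max(best, count * count / (prev - start + 1))
            start = pos
            count = 1
        prev = pos
    # Score the last open cluster.
    return max(best, count * count / (prev - start + 1))


# Four significant words bracketing seven words in all, as in Fig. 2.
sentence = ["the", "alpha", "beta", "x", "gamma", "y", "z", "delta", "end"]
keywords = {"alpha", "beta", "gamma", "delta"}
print(round(significance_factor(sentence, keywords), 1))  # 2.3
```

Taking the maximum over clusters follows the statement that, when two or more clusters result, the highest factor is taken as the measure for the sentence.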
It consists of ascertaining the extent of a cluster of words by bracketing, counting the number of significant words contained in the cluster, and dividing the square of this number by the total number of words within this cluster.

*H. P. Luhn, “A Statistical Approach to Mechanized Encoding and Searching of Literary Information,” IBM Journal of Research and Development, 1, No. 4, 309-317 (October 1957).

Figure 1 Word-frequency diagram. Abscissa represents individual words arranged in order of frequency. [Diagram not reproduced; surviving labels: lines C and D, axis label WORDS.]

  4. The results based on this formula, as performed on about 50 articles ranging from 300 to 4,500 words each, have been encouraging enough for further evaluation by a psychological experiment involving 100 people. This experiment will determine on an objective basis the effectiveness of the abstracts generated.

Figure 2 Computation of significance factor. The square of the number of bracketed significant words (4) divided by the total number of bracketed words (7) = 2.3. [Diagram not reproduced; it shows a sentence with its significant words starred and the bracketed words numbered 1 to 7. Note in figure: Portion of sentence bracketed by and including significant words not more than four non-significant words apart. If eligible, the whole sentence is cited.]

The resolving power of significant words derived under the method described depends on the total number of words comprising an article and will decrease as the total number of words increases. In order to overcome this effect, the abstracting process may be performed on subdivisions of the article, and the highest ranking sentences of each of these divisions may then be selected and combined to constitute the auto-abstract. In many cases the author provides such divisions as part of the organization of his paper, and they may therefore serve for the extended process. Where such deliberate divisions are absent they can be made arbitrarily in accordance with some criteria established by experience. These divisions would be arranged in such a way that they overlap each other, for lack of any simple means of mechanically detecting the exact point of the author’s transition to a new subject subdivision.

A more detailed account of these and other computing methods, as well as details on programming electronic data-processing machines for this procedure, will be given in subsequent papers.

By way of example, two auto-abstracts are included in this paper. Exhibit 1 shows four selected sentences of a 2,326-word article from The Scientific American. A table of word frequency is also given. Exhibit 2 shows the highest ranking sentence of a 783-word article from the Science Section of The New York Times. A complete reproduction of this article is given.

Machine procedures

The abstracts described in this paper were prepared by first punching the documents on cards. Punctuation marks in the printed text not available on the standard key punch were replaced by other key-punch characters. The cards thus produced constitute the machine-readable form of the document.

The abstracting process was initiated by transcribing the card record onto magnetic tape by means of an auxiliary card-to-tape unit. The resulting tape was introduced into an IBM 704 data-processing machine, which was programmed to read the taped text to separate it into its individual words, to note the position of each word in the document, the sentence and paragraph in which it appeared, and to note the punctuation preceding and following it. Concurrently, common words such as pronouns, prepositions, and articles were deleted from the list by a table-lookup routine. This operation was followed by a sorting program which arranged the remaining words in alphabetic order.

The next step of the machine operation was a consolidation of words which are spelled in the same way at their beginning, such as similar and similarity. This procedure was a simple statistical analysis routine consisting of a letter-by-letter comparison of pairs of succeeding words in the alphabetized list. From the point where letters failed to coincide, a combined count was taken of the non-similar subsequent letters of both words. When this count was six or below, the words were assumed to be similar notions; above six, different notions. Although this method of word consolidation is not infallible, errors up to 5% did not seem to affect the final results of the abstracting process.

The machine then counted the occurrence of similar words derived in this way. Words of a stipulated low frequency were then deleted from the list and locations of the remaining words were sorted into order. These words thereby attained the status of “significant” words.

The significance factor for each sentence was determined by a computing routine in accordance with the formula previously mentioned. All sentences which scored above a predetermined cutoff value were written on an output tape along with their respective values. The basis for this cutoff value depends on the amount of detailed information needed for a given type of abstract. Results were then printed out from this tape.

Extended applications

Although a standard abstract has thus far been assumed in order to simplify the explanation of the machine process, extracts or condensations of literature are used for diverse purposes and may vary in length and orientation.
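The word-handling steps described under “Machine procedures” (common-word lookup, alphabetic sort, consolidation of succeeding words, counting, and deletion of low-frequency words) might be sketched as follows. This is an illustration in Python, not a reconstruction of the original routine. The function names, the `min_count` threshold and the sample common-word list are hypothetical. The `min_prefix` guard is an added assumption: the text speaks of words spelled the same way at their beginning but gives no minimum, and without some minimum the six-letter rule alone would merge short unrelated words.

```python
from collections import Counter


def similar(a, b, max_unmatched=6, min_prefix=3):
    """Letter-by-letter comparison of two words from the alphabetized list."""
    shared = 0
    while shared < min(len(a), len(b)) and a[shared] == b[shared]:
        shared += 1
    # Combined count of the letters of both words after the point where
    # they stop coinciding; six or below means "similar notions".
    unmatched = (len(a) - shared) + (len(b) - shared)
    # min_prefix is an assumption, not stated in the text.
    return shared >= min_prefix and unmatched <= max_unmatched


def significant_words(tokens, common_words, min_count=2):
    """Return {representative word: combined count} for significant words."""
    # Table lookup: drop common words, then count what remains.
    counts = Counter(t.lower() for t in tokens if t.lower() not in common_words)

    # Alphabetic order, then consolidate each word with its predecessor.
    groups = []
    for word in sorted(counts):
        if groups and similar(groups[-1][-1], word):
            groups[-1].append(word)
        else:
            groups.append([word])

    # Combined counts per group; delete groups of low frequency.
    result = {}
    for group in groups:
        total = sum(counts[w] for w in group)
        if total >= min_count:
            result[group[0]] = total
    return result


tokens = "The method differs from different methods and the difference matters".split()
common = {"the", "from", "and"}
print(significant_words(tokens, common))  # {'difference': 3, 'method': 2}
```

Each group is reported under its alphabetically first member; any other choice of representative would serve equally well, since only the combined count is used afterwards.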

  5. Exhibit 1

Source: The Scientific American, Vol. 196, No. 2, 86-94, February, 1957
Title: Messengers of the Nervous System
Author: Amodeo S. Marrazzi
Editor's Sub-heading: The internal communication of the body is mediated by chemicals as well as by nerve impulses. Study of their interaction has developed important leads to the understanding and therapy of mental illness.

Auto-Abstract*

It seems reasonable to credit the single-celled organisms also with a system of chemical communication by diffusion of stimulating substances through the cell, and these correspond to the chemical messengers (e.g., hormones) that carry stimuli from cell to cell in the more complex organisms. (7.0)†

Finally, in the vertebrate animals there are special glands (e.g., the adrenals) for producing chemical messengers, and the nervous and chemical communication systems are intertwined: for instance, release of adrenalin by the adrenal gland is subject to control both by nerve impulses and by chemicals brought to the gland by the blood. (6.4)

The experiments clearly demonstrated that acetylcholine (and related substances) and adrenalin (and its relatives) exert opposing actions which maintain a balanced regulation of the transmission of nerve impulses. (6.3)

It is reasonable to suppose that the tranquilizing drugs counteract the inhibitory effect of excessive adrenalin or serotonin or some related inhibitor in the human nervous system. (7.3)

*Sentences selected by means of statistical analysis as having a degree of significance of 6 and over.
†Significance factor is given at the end of each sentence.

Significant words in descending order of frequency (common words omitted).

Frequency  Words
46         nerve
40         chemical
28         system
22         communication
19         adrenalin
18         cell, synapse
16         impulses, inhibition
15         brain, transmission
13         acetylcholine, experiment, substances
12         body, effects, electrical, mental, messengers
10         signals, stimulation
 8         action, ganglion
 7         animal, blood, drugs, normal
 6         disturbance, related
 5         control, diagram, fibers, gland, mechanisms, mediators, organism, produce, regulate, serotonin
 4         accumulate, balance, block, disorders, end, excitation, health, human, outgoing, reaching, recording, release, supply, tranquilizing

Total word occurrences in the document: 2326
Different words in document:
   Total of different words: 741
   Less different common words: 170
   Different non-common words: 571
Ratio of all word occurrences to different non-common words: ~4:1
Non-common words having a frequency of occurrence of 5 and over:
   Total occurrences: 478
   Different words: 39
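The summary figures of Exhibit 1 can be cross-checked against the frequency column. The short Python sketch below is only an arithmetic illustration: it lists the frequencies of 5 and over as they appear in the exhibit and recomputes the two totals and the ratio. The variable names are hypothetical.

```python
# Frequencies of 5 and over, as listed in Exhibit 1.
frequencies = (
    [46, 40, 28, 22, 19, 18, 18, 16, 16, 15, 15, 13, 13, 13]
    + [12] * 5 + [10] * 2 + [8] * 2 + [7] * 4 + [6] * 2 + [5] * 10
)

different_words = len(frequencies)    # 39 different words
total_occurrences = sum(frequencies)  # 478 occurrences

# Different non-common words: all different words less the common ones.
non_common = 741 - 170                # 571
ratio = 2326 / non_common             # about 4.07, i.e. roughly 4:1

print(different_words, total_occurrences, non_common, round(ratio, 2))
```

The recomputed values agree with the 39 different words, 478 total occurrences and ratio of about 4:1 stated in the exhibit.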

  6. Exhibit 2

Source: The New York Times, September 8, 1957, page E11
Title: Chemistry Is Employed in a Search for New Methods to Conquer Mental Illness
Author: Robert K. Plumb

[Reproduction of the complete newspaper article; the text is not recoverable from this extraction.]

  7. Exhibit 2 Auto-Abstract

Two major recent developments have called the attention of chemists, physiologists, physicists and other scientists to mental diseases: It has been found that extremely minute quantities of chemicals can induce hallucinations and bizarre psychic disturbances in normal people, and mood-altering drugs (tranquilizers, for instance) have made long-institutionalized people amenable to therapy. (4.0)

This poses new possibilities for studying brain chemistry changes in health and sickness and their alleviation, the California researchers emphasized. (5.4)

The new studies of brain chemistry have provided practical therapeutic results and tremendous encouragement to those who must care for mental patients. (5.4)

A condensation of a document to a given fraction of the original could be readily accomplished with the system outlined by adjusting the cutoff value of sentence significance. On the other hand, a fixed number of sentences might be required irrespective of document length. Here it would be a simple matter to print out exactly that number of the highest ranking sentences which fulfilled the requirement.

In many instances condensations of documents are made emphasizing the relationship of the information in the document to a special interest or field of investigation. In such cases sentences could be weighted by assigning a premium value to a predetermined class of words.

These two features of the auto-abstract, variable length and emphasis, might at times be usefully combined. In the case of a long, comprehensive paper, several condensed versions could be prepared, each of a length suitable to the requirements of its recipient and biased to his particular sphere of interest.

Along these same lines, a specificity ranking technique might prove feasible. If none of the sentences in an article attained a certain significance factor, it would be possible to reject the article as too generalized for the purpose at hand.

In certain cases an abstract might be amplified by following it with an enumeration of specifics, such as names of persons, places, organizations, products, materials, processes, et cetera. Such specific words could be selected by the machine either because they are capitalized or by means of lookup in a stored special dictionary.

Auto-abstracting could also be used to alleviate the translation burden. To avoid total translation initially, auto-abstracts of appropriate length could be produced in the original language and only the abstracts translated for subsequent analysis.

Finally, the process of deriving key words for encoding documents for mechanical information retrieval could be simplified by auto-abstracting techniques.

Conclusions

The results so far obtained for technical articles have indicated the feasibility of automatically selecting sentences that will indicate the general subject matter, very much as do conventional abstracts. What such auto-abstracts might lack in sophistication they will more than compensate for by their uniformity of derivation. Because of the absence of the variations of human capabilities and orientation, auto-abstracts have a high degree of reliability, consistency, and stability, as they are the product of a statistical analysis of the author’s own words. In many cases the abstract obtained is the type generally referred to as the “indicative” abstract.

Once auto-abstracts are generally available, their users will learn how to interpret them and how to detect their implications. They will realize, for instance, that certain words contained in the sample sentences stand for notions which must have been elaborated upon somewhere in the article. If this were not so for a substantial portion of the words in the selected sentences, these sentences could not have attained their status based on word frequency.

There is, of course, the chance that an author’s style of writing deviates from the average to an extent that might cause the method to select sentences of inferior significance. Since the title of the paper is always given in conjunction with the auto-abstract, there is a high probability that it will favorably supplement the abstract. However, there will always be a residue of inadequate results, and it appears to be entirely feasible to establish criteria by which a machine may recognize such exceptions and earmark them for human attention.

If machines can perform satisfactorily within the range outlined in this paper, a substantial and worthwhile saving in human effort will have been realized. The auto-abstract is perhaps the first example of a machine-generated equivalent of a completely intellectual task in the field of literature evaluation.

Received December 2, 1957
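The two length options described under “Extended applications” (adjusting the cutoff value, or printing a fixed number of the highest ranking sentences) can be sketched in a few lines of Python. This is a minimal illustration that assumes each sentence already carries its significance factor. The function name, its parameters and the sample values are hypothetical. The premium weighting of a predetermined class of words is not covered, since the text does not say how the premium enters the factor.

```python
def select_sentences(scored, cutoff=None, top_n=None):
    """Pick sentences for the auto-abstract, keeping document order.

    scored: list of (sentence, significance_factor) in document order.
    cutoff: keep every sentence whose factor is at least this value.
    top_n: keep exactly this many of the highest ranking sentences
        (takes precedence over cutoff when both are given).
    """
    if top_n is not None:
        ranked = sorted(range(len(scored)),
                        key=lambda i: scored[i][1], reverse=True)
        keep = set(ranked[:top_n])
    elif cutoff is not None:
        keep = {i for i, (_, factor) in enumerate(scored) if factor >= cutoff}
    else:
        keep = set(range(len(scored)))
    # Emit the chosen sentences in their original order.
    return [scored[i][0] for i in sorted(keep)]


# Illustrative factors only.
scored = [("s1", 7.0), ("s2", 3.1), ("s3", 6.4), ("s4", 7.3)]
print(select_sentences(scored, cutoff=6.0))  # ['s1', 's3', 's4']
print(select_sentences(scored, top_n=2))     # ['s1', 's4']
```

Raising or lowering `cutoff` changes the length of the condensation with the document, while `top_n` fixes the length irrespective of the document.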