What is a thesaurus used for? The meaning of the word thesaurus

N. V. Lukashevich


B. V. Dobrov

Research Computing Center, Lomonosov Moscow State University;

ANO Center for Information Research


Keywords: thesaurus, information retrieval, automatic text processing

The vast majority of technologies that work with large text collections rely on statistical and probabilistic methods. The reason is that a lexical resource suitable for processing text collections by linguistic methods must contain tens of thousands of dictionary entries and must have a number of important properties that need to be monitored specifically while the resource is being developed. In this report we examine the basic principles of developing lexical resources for automatic processing of large text collections, using the example of RuTez, a thesaurus of the Russian language for computer text processing that has been under development since 1997 and is currently a hierarchical network of more than 42 thousand concepts. We describe the current state of the thesaurus by comparing its lexical composition with the text corpus of the University Information System RUSSIA (www.cir.ru), which contains 400 thousand documents. Examples of using the thesaurus in various automatic text processing applications are discussed.

  1. Introduction

Millions of documents are now available in electronic form, and thousands of information systems and electronic libraries have been created. At the same time, information systems that use lexical and terminological resources for search account for only a fraction of a percent of the total. This is due to the serious difficulty of creating such linguistic resources for automatic processing of modern collections of electronic documents.

First, these collections are usually very large, so the resource must include descriptions of thousands of words and terms. Second, a collection is a set of documents of different structure that use a wide variety of syntactic constructions, which makes automatic processing of individual sentences difficult. In addition, important information is often distributed across different sentences of a text.

All this sharply raises the question of what a linguistic resource should look like so that, on the one hand, it is useful for automatic processing and search in electronic collections and, on the other hand, it can be created in a foreseeable time and maintained with relatively little effort.

In this article we look at the basic principles of developing lexical resources for automatic processing of large text collections. These principles are examined using the example of RuTez, a thesaurus of the Russian language for computer text processing that has been developed by the ANO Center for Information Research since 1997. RuTez is currently a hierarchical network of more than 42 thousand concepts, which includes more than 95 thousand Russian words, expressions, and terms. We describe the current state of the thesaurus by comparing its lexical composition with the vocabulary of the text corpus of the University Information System RUSSIA, which is maintained by the Research Computing Center of Lomonosov Moscow State University and the ANO Center for Information Research. UIS RUSSIA (www.cir.ru) contains 400 thousand documents on socio-political topics (about 3 GB of text, 200 million words). The article also discusses examples of using the thesaurus in various automatic text processing applications.

  2. Principles for developing a linguistic resource for information retrieval tasks

To ensure effective automatic processing of electronic documents (automatic indexing, categorization, comparison of documents), it is necessary to build a basis for their comparison: a list of what is mentioned in each document. For such an index to be more effective than a word-by-word index, the lexical diversity of the text (synonymy, polysemy, differences in part of speech and style) has to be overcome and reduced to an invariant, a concept, which becomes the basis for comparing different texts. Thus, concepts should form the basis of the linguistic resource, while linguistic expressions (words and terms) serve only as text entries that trigger the corresponding concept.

In order to compare different but similar concepts, relations must be established between them. Traditionally, linguistic resources for automatic processing of natural-language texts have used particular sets of semantic relations, such as part, source, cause, and so on. However, when working with large and heterogeneous text collections we must recognize that, at the current state of text processing technology, a computer system cannot reliably detect these relations in a text in order to perform the procedures that we have associated with one relation or another. Therefore, the relations between concepts must first of all describe invariant properties that do not depend, or depend only weakly, on the topic of the specific text in which the concept is mentioned.
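The idea of reducing different text expressions to a single concept can be illustrated with a minimal sketch. The lexicon, the sample documents, and the function name `concept_index` are hypothetical; matching is done by naive substring search, which is an assumption made for brevity and not a description of how RuTez-based indexing is implemented.

```python
from collections import defaultdict

# Hypothetical mini-lexicon: text expression -> names of the concepts it can express.
LEXICON = {
    "forest": ["FOREST"],
    "woodland": ["FOREST"],
    "wooded area": ["FOREST"],
    "birch": ["BIRCH"],
}


def concept_index(docs):
    """Map each concept to the set of document ids whose text mentions it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        lowered = text.lower()
        for expression, concepts in LEXICON.items():
            # Naive substring match (assumption); a real system would lemmatize first.
            if expression in lowered:
                for concept in concepts:
                    index[concept].add(doc_id)
    return dict(index)


docs = {
    1: "The woodland was quiet.",
    2: "A wooded area lies north of the town.",
    3: "A birch grew by the road.",
}
index = concept_index(docs)
```

Documents 1 and 2 share no words, yet both end up under the same concept, which is what makes them comparable.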

The main function of such a relation is to answer the following question:

if it is known that a text is devoted to discussing C1, and C2 is linked to C1 by a relation R, can we say that the topic of the text is related to C2? (*)

When creating a linguistic resource for automatic processing, it is important to determine which properties of the concepts C1 and C2 allow us to establish relations between them that satisfy (*).

For example, whatever texts are written about birches, we can always say that these texts are about trees. But although the relation "a tree is part of a forest" is popular and frequently discussed, very few texts about trees are texts about forests. Note that the problem is not related to the name of the relation: a clearing is also part of a forest, and texts about clearings are texts about forests.

The invariance of a relation across the range of possible text topics in a subject area is determined largely by properties deeper than those reflected in the name of the relation, namely by its quantifier and existential properties. The quantifier properties of a relation describe whether all instances of a concept have the given relation and whether the relation persists throughout the entire life cycle of an instance. The problem with the relation tree - forest arises precisely because not every specific tree is located in a forest, whereas a clearing cannot exist outside a forest.

The existential properties of a relation describe, for example, whether the existence of the concept C1 implies the existence of the concept C2 (the existence of the concept GARAGE requires the existence of the concept AUTOMOBILE), or whether the existence of instances of C1 depends on the existence of instances of C2 (a specific FLOOD is inseparable from a specific RIVER). When a text discusses a dependent concept, especially one whose instances depend on instances of another concept, this suggests that the text is also related to the concept it depends on.

Let us consider the relation between the concepts FOREST and TREE in more detail. Strictly speaking, the part of the concept FOREST is TREE IN THE FOREST, whereas there are also a FREE-STANDING TREE, a TREE IN THE GARDEN, and so on. In any case, the subordination of the concept TREE to the concept FOREST has to be broken.

On the other hand, FOREST is a kind of COLLECTION OF TREES and does not exist without trees (the same holds for GARDEN). Thus, the concept FOREST must stand in a dependency relation to the concept TREE. Starting from an analysis of the needs of specific applications, we came to the conclusion that it is important to describe these deep properties of relations, which were previously reflected very little in linguistic resources but are of paramount importance for automatic processing of large text collections and possibly for many other tasks.

At present we model the quantifier and existential properties of concepts with a set of traditional thesaurus relations: ABOVE-BELOW, that is, generic-specific (66% of all relations), PART-WHOLE (30%), and ASSOCIATION (4%), in combination with a set of additional modifiers (20% of relations carry a modifier). Note that the PART-WHOLE and ASSOCIATION relations are interpreted in accordance with rule (*). In total, about 160 thousand direct relations between concepts are described. Taking the transitivity of relations into account, this gives more than 1,350 thousand distinct connections, so that on average each concept is connected with about 30 others.
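How a relatively small number of direct relations turns into a much larger number of connections once transitivity is taken into account can be sketched as follows. The graph fragment and the function name `ancestors` are hypothetical, and the sketch follows a single relation type only; the rules by which RuTez combines different relation types and modifiers are not reproduced here.

```python
def ancestors(concept, parents):
    """Return every concept reachable from `concept` by following parent links."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical fragment: concept -> directly linked broader concepts.
PARENTS = {
    "BIRCH": ["TREE"],
    "TREE": ["PLANT"],
    "OAK GROVE": ["GROVE"],
    "GROVE": ["FOREST"],
}

# Number of direct links and number of links after transitive closure.
direct = sum(len(targets) for targets in PARENTS.values())
total = sum(len(ancestors(concept, PARENTS)) for concept in PARENTS)

# Average from the figures quoted in the text: about 1,350 thousand
# connections over about 42 thousand concepts, i.e. roughly 30 per concept.
average = round(1_350_000 / 42_000)
```

In this toy fragment four direct links yield six connections; the ratio grows quickly with the depth of the hierarchy.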

  3. RuTez Thesaurus: general structure

The RuTez thesaurus is a hierarchical network of concepts corresponding to the meanings of individual words, text expressions or synonymous series. Thus, the main elements of a thesaurus are concepts, linguistic expressions, relationships between linguistic expressions and concepts, and relationships between concepts.

The thesaurus combines in a single system two kinds of knowledge: linguistic knowledge (descriptions of lexemes, idioms, and their relations), which is traditionally regarded as lexical and semantic knowledge, and knowledge about terms and relations within subject areas, which is traditionally the domain of terminologists and is described in information retrieval thesauri. The subject areas described in the thesaurus include economics, legislation, finance, and international relations, which are so important for everyday life that they are substantially represented in traditional explanatory dictionaries. In these areas the lexical and the terminological are closely interconnected and interact strongly with each other.

Linguistic expressions are individual lexemes (nouns, adjectives and verbs), nominal and verbal groups. Thus, the thesaurus does not currently include adverbs and function words as linguistic expressions. Multiword groups may include terms, idioms, lexical functions ( influence e).

For each linguistic expression the following is described:

Its polysemy, that is, its link to one or more concepts; a link means that the linguistic expression can serve as a text expression of that concept, and assigning an expression to several concepts is an implicit indication of its polysemy;

Its morphological composition (part of speech, number, case);

Writing features (for example, with a capital letter), etc.

Each thesaurus concept has a unique name, a list of linguistic expressions with which this concept can be expressed in the text, and a list of relationships with other concepts.

One of its unambiguous text expressions is usually chosen as a unique name for a concept. But the name of a concept can also be formed by a pair of its ambiguous text expressions - synonyms, written separated by commas and unambiguously defining it (for example, the concept THICK). An ambiguous text expression of the name of a concept can also be provided with a mark or a shortened fragment of interpretation, for example, concept CROWD (GROUP OF PEOPLE).
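The structure described in this section (a concept with a unique name, a list of text expressions, and a list of relations) can be sketched as a small data type. The class `Concept`, its field names, and the helper `ambiguous_expressions` are hypothetical illustrations, not the storage format of RuTez; the sample expressions are illustrative as well.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    name: str                                         # unique concept name
    expressions: list = field(default_factory=list)   # text entries for the concept
    relations: list = field(default_factory=list)     # (relation, target, modifier)


def ambiguous_expressions(concepts):
    """Return expressions attached to more than one concept (implicit polysemy)."""
    owners = {}
    for concept in concepts:
        for expression in concept.expressions:
            owners.setdefault(expression, set()).add(concept.name)
    return {expr for expr, names in owners.items() if len(names) > 1}


concepts = [
    Concept("FOREST", ["forest", "woodland"], [("BELOW", "GROVE", None)]),
    Concept("CHURCH (ORGANIZATION)", ["church"]),
    Concept("CHURCH (BUILDING)", ["church", "church building"]),
]
ambiguous = ambiguous_expressions(concepts)
```

Polysemy does not need a separate flag in this layout: it follows from an expression being listed under several concepts.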

  4. Example dictionary entry

We chose as an example the dictionary entry for the concept FOREST, corresponding to one of the meanings of the word forest. This dictionary entry is interesting because it includes different types of knowledge, traditionally classified as lexical (semantic) knowledge and encyclopedic knowledge (knowledge about the subject area, terminology).

Synonyms for the concept FOREST (13 in total):

forest (M), forest zone, forest environment, forest, forest quarter, forest landscape, forest area, woodland, wooded area, forest area, little forest, array of forests.

Narrower (BELOW) concepts with their synonyms:

JUNGLE (jungle);

FOREST PARK (city garden, green area, green area, forest park, forest management, forest park belt, park (M), park area);

FORESTRY;

DECIDUOUS FOREST (soft-leaved forest, hard-leaved forest);

GROVE (oak grove);

CONIFEROUS FOREST (coniferous forest, dark coniferous forest).

Part concepts with their synonyms:

WINDBREAK (windfall, windfall);

CUTTING (cutting area);

FOREST CULTURE (forest species, forestry culture);

FOREST LAND (forest lands; lands covered with forest; forest lands, forest territory; forested land, forested area);

FOREST PLANTATIONS (forest plantations, forest plantations, afforestation);

EDGE OF THE FOREST (edge, edge);

UNDERBRUSH (undergrowth);

PROSEKA;

DRY WOOD(deadwood).

Here the mark (M) indicates that the text entry is ambiguous.

The concept FOREST also has other relations, the so-called dependency relations (in the current version they are called ASC 2, asymmetric association): FOREST FIRE (forest fire, fire in the forest); FOREST USE (forest use, use of forest fund areas); FORESTRY; FOREST SCIENCE (forest science). As already noted in Section 2, the concept FOREST depends on the concept TREE, which is denoted in the thesaurus by the relation ASC 1.

In total, the concept FOREST is linked directly to 28 other concepts and, taking the transitivity of relations into account, to 235 concepts (more than 650 text entries in total).

  5. Assessment of the current state of the Russian-language thesaurus RuTez

5.1. Lexical composition

Currently, the thesaurus network includes more than 95 thousand linguistic expressions, of which 61 thousand are single-word.

A resource of this size forced us to decide which words and linguistic expressions should be described in the Thesaurus. The natural first step was to see how the most frequent words of the Russian language are represented in the thesaurus. For this purpose we used the text collection of the University Information System RUSSIA (400 thousand documents). The collection contains official documents of various bodies of the Russian Federation (55 thousand documents since 1992), press materials since 1999 (the newspapers Izvestia, Nezavisimaya Gazeta, Komsomolskaya Pravda, Argumenty i Fakty, the magazine Expert, and others), and materials from scientific journals (“Bulletin of Moscow University”, “Sociological Journal”). The list of lemmas included in the Thesaurus was compared with the list of the 100,000 most frequent lemmas of the text collection (frequency above 25).

Lexeme-by-lexeme annotation of the list showed that, of these hundred thousand lemmas, 35 thousand are already described in RuTez and only about 7 thousand more lexemes deserve inclusion in the Thesaurus; the rest are lemmatized variants of various proper names. Replenishment has therefore ceased to be a priority task and is carried out gradually, starting with the most frequent words. Once this list is largely exhausted, another comparison with the text collection of the information system will be made and new lexemes with a frequency above 25 will be selected. After that, the frequency threshold is to be lowered. The large number of usage examples in the text collection makes it possible to respond quickly to “lexical innovations” (for example, installation, blockbuster, beau monde, thriller) and to include them in the appropriate places in the hierarchical system of the Thesaurus.

Constant work with an up-to-date text collection provides unique opportunities for checking the relevance and quality of the lexical descriptions offered in dictionaries. For example, the word Mother See turned out to have an unusually high frequency (more than 400 occurrences). Checking the collection showed that the word is indeed often used as a synonym for Moscow, whereas explanatory dictionaries often mark it as obsolete. Another frequently used word (more than 300 occurrences) that dictionaries mark as obsolete is the word blissful.
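The comparison procedure described above can be sketched in a few lines. The function name `coverage`, the sample frequencies, and the sample thesaurus vocabulary are hypothetical; only the threshold of 25 is taken from the text.

```python
def coverage(lemma_frequencies, thesaurus_lemmas, threshold=25):
    """Split frequent lemmas into those already described and those still missing.

    Missing lemmas are returned most frequent first, matching the order in
    which replenishment is said to proceed.
    """
    candidates = {
        lemma for lemma, freq in lemma_frequencies.items() if freq > threshold
    }
    described = candidates & thesaurus_lemmas
    missing = sorted(
        candidates - described,
        key=lambda lemma: (-lemma_frequencies[lemma], lemma),
    )
    return described, missing


# Hypothetical sample data.
lemma_frequencies = {"forest": 900, "thriller": 60, "blockbuster": 40, "rareword": 3}
thesaurus_lemmas = {"forest", "tree"}

described, missing = coverage(lemma_frequencies, thesaurus_lemmas)
```

Lowering `threshold` on a later pass corresponds to the lowering of the viewing threshold mentioned in the text.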

5.2. Description of word meanings

Comparison with the text collection shows that many of the frequent words of the collection are well represented in the Thesaurus in at least one of their meanings (usually the basic one). Finding out to what extent the range of meanings of polysemous Russian words is represented in the Thesaurus is currently our primary task.

It is well known that different dictionaries often give different sets of meanings for polysemous words and single out different shades of meaning, and that the same type of polysemy can be described differently for different words even within one dictionary. A consistent and representative description of the meanings of lexemes is therefore an important task for the creators of any lexical resource.

If the resource is intended for automatic processing, however, a balanced description of meanings becomes much more important. An inflated number of meanings can leave the computer system unable to select the right one, which in turn significantly reduces the performance of the automatic text processing system. Thus, one of the disadvantages of WordNet as a resource for automatic text processing is the excessive number of meanings described for some words (in WordNet 1.6, 53 meanings for run, 47 for play, and so on). These meanings are difficult to distinguish even for humans when they annotate texts semantically, and a computer system clearly cannot cope with choosing the appropriate meaning either. Different authors therefore propose different ways of merging meanings in order to improve processing quality.

At the same time, the opposite factor is at work: if the meanings really differ in their set of dictionary relations (in our case, thesaurus relations), they cannot be merged into one unit (one concept), because this would also degrade the quality of automatic processing.

Consider the words school and church, each of which can be viewed both as an organization and as a building.

Each school as an organization has a building (most often one). All parts of a school building (classrooms, blackboards) are related to the school as an organization. There are no specific types of school buildings. It is therefore inappropriate to separate the description of school as a building into a separate concept. However, the description of such a combined concept SCHOOL, covering both the organization and the building, must have a specially marked relation to the concept BUILDING. To describe such relations in the Thesaurus, a mark on the relation is used: the modifier “A” (“aspect”; during automatic analysis, “confirmation” by other concepts is required before this relation is taken into account).

SCHOOL

ABOVE EDUCATIONAL INSTITUTION

ABOVE (A) PUBLIC BUILDING
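The effect of the “A” modifier can be sketched as follows. The text only states that a relation marked “A” needs “confirmation” by other concepts; the specific confirmation rule used here (another concept mentioned in the same text has an unmarked relation to the same target) is an assumption made for illustration, as are the concept TOWN HALL, the relation table, and the function name `broader_concepts`.

```python
# Hypothetical relation table: concept -> list of (broader concept, modifier).
RELATIONS = {
    "SCHOOL": [("EDUCATIONAL INSTITUTION", None), ("PUBLIC BUILDING", "A")],
    "TOWN HALL": [("PUBLIC BUILDING", None)],
}


def broader_concepts(concept, mentioned):
    """Return broader concepts of `concept` that are usable in a given text.

    Unmarked relations are always accepted. A relation marked "A" is accepted
    only if some other mentioned concept points to the same target through an
    unmarked relation (assumed confirmation rule).
    """
    accepted = set()
    for target, modifier in RELATIONS.get(concept, []):
        if modifier != "A":
            accepted.add(target)
        elif any(
            (target, None) in RELATIONS.get(other, [])
            for other in mentioned
            if other != concept
        ):
            accepted.add(target)
    return accepted


alone = broader_concepts("SCHOOL", {"SCHOOL"})
confirmed = broader_concepts("SCHOOL", {"SCHOOL", "TOWN HALL"})
```

In a text that mentions only the school, the building aspect stays inactive; it becomes active once the text also discusses another public building.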

The corresponding meanings of the word church are not that close. A church as an organization can have a large number of church buildings in different places and also owns many other buildings. A church as a building is closely tied to a religion and a confession, but it can change its affiliation to a church organization. Church as an organization and church as a building have different subtypes. That is why CHURCH (ORGANIZATION) and CHURCH (BUILDING) are represented in RuTez as different concepts.

The significant divergence in thesaurus connections correlates in an interesting way with the ability of the denotations corresponding to the meanings to exist separately from each other. Thus, a church-building does not cease to exist and even be called a church even when its use changes, unlike a school-building.

The representation of meanings in the Thesaurus is checked continuously, starting with the most frequent lemmas. For each frequent lexeme we check how its meanings are described in explanatory dictionaries, which meanings are used in the collection, and how they are represented in the Thesaurus. As a result, a list of 10,000 lexemes has been compiled whose ambiguity still requires either additional analysis or additional description. The list was obtained from the 30 thousand most frequent lemmas.

It should be noted that in the Thesaurus the problem of polysemy is partly alleviated, because thesaurus relations can be described between different meanings of a word; the highest of these concepts in the hierarchy can therefore be selected by default, since it was certainly discussed in the text. For example, the word photo has three meanings: photography as a field of activity, a photographic image, and a photo studio:

PHOTOGRAPHY (photographing, photo business, ..., photo)

PART PHOTOGRAPHIC IMAGE (photo, photograph, photo)

PART PHOTO STUDIO (photo).

Thus, if it was not possible to determine in which meaning the word photo was used, the default assumption is that photography was discussed (as a process, a result, or a place), which is sufficient for many automatic text processing applications.
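The default choice described above can be sketched as picking, among the candidate meanings of a word, the one that lies above all the others in the hierarchy. The table `WHOLE_OF` mirrors the photo example from the text; the function names and the rule of returning no default when no such meaning exists are assumptions of this sketch.

```python
# Concept -> concept it is directly subordinate to (from the photo example).
WHOLE_OF = {
    "PHOTOGRAPHIC IMAGE": "PHOTOGRAPHY",
    "PHOTO STUDIO": "PHOTOGRAPHY",
}


def chain_up(concept):
    """List the concepts above `concept`, nearest first (assumes no cycles)."""
    chain = []
    while concept in WHOLE_OF:
        concept = WHOLE_OF[concept]
        chain.append(concept)
    return chain


def default_sense(senses):
    """Return the sense that dominates all other senses, or None if there is none."""
    for candidate in senses:
        others = [sense for sense in senses if sense != candidate]
        if all(candidate in chain_up(other) for other in others):
            return candidate
    return None


photo = default_sense(["PHOTOGRAPHIC IMAGE", "PHOTOGRAPHY", "PHOTO STUDIO"])
church = default_sense(["CHURCH (ORGANIZATION)", "CHURCH (BUILDING)"])
```

For photo the default is PHOTOGRAPHY, while for church, whose two meanings are separate concepts with no link in this table, no default is available and real disambiguation is needed.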

  6. Application of the RuTez thesaurus for automatic text processing

Since 1995, the socio-political terminology of RuTez (the socio-political thesaurus) has been actively and successfully used in various automatic text processing applications, such as automatic conceptual indexing, automatic rubrication using several rubricators, and automatic annotation of texts, including English-language ones. The socio-political thesaurus (27 thousand concepts, 62 thousand text entries) is the basic search tool of the UIS RUSSIA search system (www.cir.ru).

The entire vocabulary of the RuTez thesaurus is used in procedures for automatic categorization of texts with complex hierarchical rubricators. In the existing technology, each category is described as a Boolean expression over terms, after which the original formula is expanded along the thesaurus hierarchy. The resulting Boolean expression may include hundreds or thousands of conjuncts and disjuncts.

Let us give, as an example, a fragment of a description using thesaurus concepts (and linguistic expressions after expanding the formula) of the “Image of a Woman” rubric of the SOFIST 2 rubricator, used by VTsIOM to classify public opinion poll questionnaires:

(WOMAN[N]
 || GIRL[N]
 || RELATIVE[L] (grandmother, granddaughter, cousin, daughter, sister-in-law, mother, stepmother, daughter-in-law, stepdaughter, ...))

(CHARACTER TRAIT[L] (thrifty, heartless, forgetful, frivolous, mocking, intolerant, sociable, ...)
 || IMAGE[E] (presentation, appearance, appearance, appearance, appearance, image, appearance)
 || PLEASANT[L] (..., interesting, beautiful, cute, attractive, cute, attractive, ...)
 || UNPLEASANT[L] (unsympathetic, rude, nasty, ...)
 || APPRECIATE[L] (to revere, adore, adore, worship, adore, ...)
 || PREFER[N]

The symbol “E” denotes full expansion along the thesaurus hierarchy, “L” denotes expansion along generic-specific (“BELOW”) relations only, and “N” means that the concept is not expanded.
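The three expansion marks can be sketched as a small traversal. The relation tables `BELOW` and `PARTS`, the sample concepts in them, and the function names are hypothetical; treating “E” as expansion over both tables is an assumption about what full expansion covers, and the sketch expands one disjunction only, without modeling how the groups of a rubric formula are combined.

```python
# Hypothetical fragments of the hierarchy.
BELOW = {
    "RELATIVE": ["MOTHER", "DAUGHTER"],
    "DAUGHTER": ["STEPDAUGHTER"],
    "IMAGE": ["PUBLIC IMAGE"],
}
PARTS = {"IMAGE": ["APPEARANCE"]}


def expand(concept, mark):
    """Expand a concept according to its mark: "N", "L", or "E"."""
    if mark == "N":
        return {concept}
    if mark == "L":
        tables = [BELOW]
    elif mark == "E":
        tables = [BELOW, PARTS]
    else:
        raise ValueError(f"unknown expansion mark: {mark!r}")
    result, stack = {concept}, [concept]
    while stack:
        current = stack.pop()
        for table in tables:
            for child in table.get(current, ()):
                if child not in result:
                    result.add(child)
                    stack.append(child)
    return result


def expand_disjunction(terms):
    """Union of the expansions of (concept, mark) pairs joined by ||."""
    expanded = set()
    for concept, mark in terms:
        expanded |= expand(concept, mark)
    return expanded


women = expand_disjunction([("WOMAN", "N"), ("GIRL", "N"), ("RELATIVE", "L")])
```

A text then satisfies the disjunction if it mentions any concept in the expanded set, which is how a short formula grows into hundreds of disjuncts on a real hierarchy.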

Research is being carried out to develop a combined technology for automatic text categorization, combining thesaurus knowledge and machine learning procedures.

The use of the thesaurus for expanding queries formulated in natural language is also being explored (currently only the socio-political part of the thesaurus is used, to expand terminological queries in the information retrieval system of the UIS RUSSIA), as is its use for finding answers to questions in large text collections.

7. Conclusion

The paper presents the basic principles of developing linguistic resources for automatic processing of large text collections. The created linguistic resource - Thesaurus of the Russian language RuTez - is intended for use in such automatic text processing applications as conceptual indexing of documents, automatic rubrication according to complex hierarchical rubricators, automatic expansion of natural language queries.

This work is partially supported by the Russian Humanitarian Foundation grant No. 00-04-00272a.


Thesaurus of Russian language for natural language processing of large text collections

Natalia V. Loukachevitch, Boris V. Dobrov

Keywords: thesaurus, natural language processing, information retrieval

In our presentation we consider the main principles of developing lexical resources for automatic processing of large text collections and describe the structure of the Thesaurus of Russian Language, which has been developed since 1997 specifically as a tool for automatic text processing. The Thesaurus is now a hierarchical net of 42 thousand concepts. We describe the current stage of the Thesaurus development in comparison with the 100,000 most frequent lemmas of the text collection of the University Information System RUSSIA (www.cir.ru), which includes 400 thousand documents. We also consider the use of the Thesaurus in different applications of automatic text processing.

, antonyms, paronyms, hyponyms, hypernyms, etc.) between lexical units. Thesauri are one of the most effective tools for describing individual subject areas.

In the past, the term thesaurus mostly denoted dictionaries that represented the vocabulary of a language as completely as possible, with examples of its use in texts.

The term thesaurus is also used in information theory to denote the totality of all information possessed by a subject.

In psychology, an individual's thesaurus is characterized by the perception and understanding of information. Communication theory also considers the general thesaurus of a complex system through which its elements interact.

History

One of the first thesauri is considered to be the “Dictionary of Synonyms” by Philo of Byblos. A closer match to the term is the Amara-kosha, written in Sanskrit in verse in the 6th century. The first modern English thesaurus was compiled by Peter Mark Roget in 1805. It was published in 1852 and has remained in print ever since.

In the 1970s, thesauri began to be actively used for information retrieval tasks. In such thesauri, words are mapped to descriptors through which semantic connections are established.

Thesauruses

see also

Write a review about the article "Thesaurus"

Notes

Excerpt characterizing the Thesaurus

- What a dandy you are today! – Nesvitsky said, looking at his new mantle and saddle pad.
Denisov smiled, took out a handkerchief from his bag, which smelled of perfume, and stuck it in Nesvitsky’s nose.
- I can’t, I’m going to work! I got out, brushed my teeth and put on perfume.
The dignified figure of Nesvitsky, accompanied by a Cossack, and the determination of Denisov, waving his saber and shouting desperately, had such an effect that they squeezed onto the other side of the bridge and stopped the infantry. Nesvitsky found a colonel at the exit, to whom he needed to convey the order, and, having fulfilled his instructions, went back.
Having cleared the road, Denisov stopped at the entrance to the bridge. Casually holding back the stallion rushing towards his own and kicking, he looked at the squadron moving towards him.
Transparent sounds of hooves were heard along the boards of the bridge, as if several horses were galloping, and the squadron, with officers in front, four in a row, stretched out along the bridge and began to emerge on the other side.
The stopped infantry soldiers, crowding in the trampled mud near the bridge, looked at the clean, dapper hussars marching harmoniously past them with that special unfriendly feeling of alienation and ridicule with which various branches of the army are usually encountered.
- Smart guys! If only it were on Podnovinskoe!
- What good are they? They only drive for show! - said another.
- Infantry, don't dust! - the hussar joked, under which the horse, playing, splashed mud at the infantryman.
“If I had driven you through two marches with your backpack, the laces would have been worn out,” the infantryman said, wiping the dirt from his face with his sleeve; - otherwise it’s not a person, but a bird sitting!
“If only I could put you on a horse, Zikin, if you were agile,” the corporal joked about the thin soldier, bent over from the weight of his backpack.
“Take the club between your legs, and you’ll have a horse,” responded the hussar.

The rest of the infantry hurried across the bridge, forming a funnel at the entrance. Finally, all the carts passed, the crush became less, and the last battalion entered the bridge. Only the hussars of Denisov's squadron remained on the other side of the bridge against the enemy. The enemy, visible in the distance from the opposite mountain, from below, from the bridge, was not yet visible, since from the hollow along which the river flowed, the horizon ended at the opposite elevation no more than half a mile away. Ahead there was a desert, along which here and there groups of our traveling Cossacks were moving. Suddenly, on the opposite hill of the road, troops in blue hoods and artillery appeared. These were the French. The Cossack patrol trotted away downhill. All the officers and men of Denisov’s squadron, although they tried to talk about outsiders and look around, did not stop thinking only about what was there on the mountain, and constantly peered at the spots on the horizon, which they recognized as enemy troops. The weather cleared again in the afternoon, the sun set brightly over the Danube and the dark mountains surrounding it. It was quiet, and from that mountain the sounds of horns and screams of the enemy could occasionally be heard. There was no one between the squadron and the enemies, except for small patrols. An empty space, three hundred fathoms, separated them from him. The enemy stopped shooting, and the more clearly one felt that strict, menacing, impregnable and elusive line that separates the two enemy troops.
“One step beyond this line, reminiscent of the line separating the living from the dead, and - the unknown of suffering and death. And what's there? who's there? there, beyond this field, and the tree, and the roof illuminated by the sun? Nobody knows, and I want to know; and it’s scary to cross this line, and you want to cross it; and you know that sooner or later you will have to cross it and find out what is there on the other side of the line, just as it is inevitable to find out what is there on the other side of death. And he himself is strong, healthy, cheerful and irritated, and surrounded by such healthy and irritably animated people.” So, even if he doesn’t think, every person who is in sight of the enemy feels it, and this feeling gives a special shine and joyful sharpness of impressions to everything that happens in these minutes.
The smoke of a shot appeared on the enemy’s hill, and the cannonball, whistling, flew over the heads of the hussar squadron. The officers standing together went to their places. The hussars carefully began to straighten out their horses. Everything in the squadron fell silent. Everyone looked ahead at the enemy and at the squadron commander, waiting for a command. Another, third cannonball flew by. It is obvious that they were shooting at the hussars; but the cannonball, whistling evenly quickly, flew over the heads of the hussars and struck somewhere behind. The hussars did not look back, but at every sound of a flying cannonball, as if on command, the entire squadron with its monotonously varied faces, holding back its breath while the cannonball flew, rose in its stirrups and fell again. The soldiers, without turning their heads, glanced sideways at each other, curiously looking for the impression of their comrade. On every face, from Denisov to the bugler, one common feature of struggle, irritation and excitement appeared near the lips and chin. The sergeant frowned, looking around at the soldiers, as if threatening punishment. Junker Mironov bent down with each pass of the cannonball. Rostov, standing on the left flank on his leg-touched but visible Grachik, had the happy look of a student summoned before a large audience for an exam in which he was confident that he would excel. He looked clearly and brightly at everyone, as if asking them to pay attention to how calmly he stood under the cannonballs. But in his face, too, the same feature of something new and stern, against his will, appeared near his mouth.
- Who is bowing there? Cadet Mironov! That's not right, look at me! - Denisov shouted, unable to stand still and spinning on his horse in front of the squadron.
The snub-nosed, black-haired face of Vaska Denisov and his whole small, stocky figure, with the sinewy hand (its short fingers covered with hair) in which he held the hilt of a drawn saber, were exactly the same as always, especially in the evening, after drinking two bottles. He was only redder than usual and, raising his shaggy head like a bird when it drinks, mercilessly pressing his spurs into the sides of the good Bedouin with his small feet, he galloped, as if falling backwards, to the other flank of the squadron and shouted in a hoarse voice that the pistols be inspected. He rode up to Kirsten. The staff captain, on a broad and sedate mare, rode at a walk towards Denisov. The staff captain, with his long mustache, was serious, as always, only his eyes sparkled more than usual.
- What? - he told Denisov, - it won’t come to a fight. You'll see, we'll go back.
“Who knows what they’re doing,” Denisov grumbled. “Ah! Rostov!” he shouted to the cadet, noticing his cheerful face. “Well, you got what you were waiting for.”
And he smiled approvingly, apparently rejoicing at the cadet.
Rostov felt completely happy. At this time the chief appeared on the bridge. Denisov galloped towards him.
- Your Excellency! Let me attack! I will kill them.
“What kind of attacks are there,” said the chief in a bored voice, wincing as if from a bothersome fly. - And why are you standing here? You see, the flankers are retreating. Lead the squadron back.
The squadron crossed the bridge and escaped the gunfire without losing a single man. Following him, the second squadron, which was in the chain, crossed over, and the last Cossacks cleared that side.
Two squadrons of Pavlograd residents, having crossed the bridge, one after the other, went back to the mountain. Regimental commander Karl Bogdanovich Schubert drove up to Denisov's squadron and rode at a pace not far from Rostov, not paying any attention to him, despite the fact that after the previous clash over Telyanin, they now saw each other for the first time. Rostov, feeling himself at the front in the power of a man before whom he now considered himself guilty, did not take his eyes off the athletic back, blond nape and red neck of the regimental commander. It seemed to Rostov that Bogdanich was only pretending to be inattentive, and that his whole goal now was to test the cadet’s courage, and he straightened up and looked around cheerfully; then it seemed to him that Bogdanich was deliberately riding close to show Rostov his courage. Then he thought that his enemy would now deliberately send a squadron on a desperate attack to punish him, Rostov. It was thought that after the attack he would come up to him and generously extend the hand of reconciliation to him, the wounded man.

3.1. Thesaurus concept

Thesaurus (from the Greek θησαυρός - treasure, store) or ideographic dictionary (from the Greek idea - concept, notion, idea and grapho - write, describe) - in modern linguistics: 1) a special type of dictionary of general or special vocabulary that records the semantic relationships between lexical units; 2) a dictionary for finding a word through its semantic connection with other words; 3) a particular way of organizing (arranging) words in a dictionary; 4) a way of organizing the lexical stock that makes it possible to “model the world” economically.

In its first, original meaning (repository, treasury), the term thesaurus was used by L.V. Shcherba in the article “Experience of general lexicography” (third opposition: thesaurus - ordinary (explanatory or translation) dictionary). The scientist writes: “When people say thesaurus today, they most often mean the “Thesaurus linguae latinae”, an undertaking of five German academies, begun back in 1900 and so far brought, with gaps, only up to the letter M. The characteristic feature of this type of dictionary is that it contains absolutely all the words that occur in a given language at least once, and that under each word absolutely all the quotations from the texts available in that language are given. The basis of the above opposition - thesaurus - ordinary (explanatory or translation) dictionary - is the opposition of “linguistic material” and “linguistic system”, concepts that I tried to substantiate in my article “On the threefold aspect of linguistic phenomena and on experiment in linguistics.””

The second meaning of the term is associated with the widely known thesaurus dictionary of P.M. Roget, the “Thesaurus of English Words and Expressions” (Roget's Thesaurus of English Words and Phrases, 1852), and its successor, the dictionary of O.S. Baranov.

In this interpretation, the term thesaurus denotes a certain way of organizing and arranging the lexical composition in the dictionary (see the third meaning of the term).

The fourth meaning of the term thesaurus is associated with the universal recognition of this method of organizing the lexical composition, which allows one to economically “model the world.” From this point of view, a thesaurus dictionary is “a systematic ordering of the vocabulary of any scientific or technical field, and in the most general form - general literary vocabulary, and moreover, the entire vocabulary of a given language.”

According to Yu.N. Karaulov, a general-language thesaurus, by fixing in the structure and relationships of its headings, sections, zones and areas the wide possibilities for the non-verbal connection of ideas, ensures that human values are taken into account.

A.N. Baranov and D.O. Dobrovolsky, in the preface “From the editors” to their “Dictionary-thesaurus of modern Russian idioms”, define the thesaurus as a special type of dictionary that differs from others (in particular, explanatory and bilingual ones) in the way it organizes linguistic material. In a thesaurus, language units are not presented in alphabetical order, as in a regular dictionary, but are grouped according to their meaning.

L.P. Krysin calls the thesaurus (ideographic dictionary) a special kind of explanatory dictionary, a dictionary “in reverse.” “If in an explanatory dictionary,” the scientist writes, “the “entrance” to a dictionary entry is a word, and the content of the entry is the interpretation of the meaning of this word, then in an ideographic dictionary the “entrance” is the meaning, the idea (hence the name of this type of dictionary, ideographic), and the content of the entry is a list of words expressing that meaning. And if an explanatory dictionary is an indispensable tool for understanding a text, then an ideographic dictionary can be used in generating a text: very often a person wants to express a certain thought but cannot find suitable words for it; an ideographic dictionary makes this search easier.”

There are two main types of thesauri:

linguistic thesaurus - a dictionary containing a list of natural language words selected as a result of meaningful analysis of texts and systematized in accordance with the accepted classification system;

statistical thesaurus - an information retrieval dictionary containing a list of words selected as a result of statistical analysis of texts on a specific topic and grouped into dictionary entries based on the frequency of co-occurrence of these words in the same texts.
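
The statistical grouping just described can be illustrated with a small sketch. It is only one possible reading of "frequency of co-occurrence in the same texts": the toy corpus, the whitespace tokenization, the function names and the `min_count` threshold are all assumptions made for illustration, not part of any thesaurus standard.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """For every unordered word pair, count the documents that contain both words."""
    pair_counts = Counter()
    for text in documents:
        # Each word is counted once per document, whatever its frequency there.
        words = sorted(set(text.lower().split()))
        pair_counts.update(combinations(words, 2))
    return pair_counts


def build_entries(documents, min_count=2):
    """Map each headword to the words sharing at least `min_count` documents with it."""
    entries = {}
    for (first, second), count in cooccurrence_counts(documents).items():
        if count >= min_count:
            entries.setdefault(first, set()).add(second)
            entries.setdefault(second, set()).add(first)
    return entries


# Toy corpus, invented for illustration only.
docs = [
    "budget funds recipient",
    "budget funds transfer",
    "crane bird migration",
    "crane bird nest",
]
entries = build_entries(docs)
```

With this corpus only the pairs budget/funds and crane/bird pass the threshold, so each of those words becomes a headword with a single associate. A real statistical thesaurus would normalize word forms and use a topic-specific collection.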

Information retrieval thesauri (IRT) facilitate the search for information during its automatic processing. An IRT reveals the semantic relationships between lexical units as fully as possible. As stated in the GOST standard on IRT, “a monolingual information retrieval thesaurus is a controlled and changing dictionary of lexical units, based on the vocabulary of one natural language, displaying semantic relationships between lexical units and intended for processing and retrieving information.”

The basic unit of an IRT is the descriptor term. The alphabetical, lexical-semantic part of an IRT is a set of descriptor entries.

Descriptive dictionaries are intended to describe the vocabulary of a certain field in full and to record every use attested in it. A typical example of a descriptive dictionary is the “Explanatory Dictionary of the Living Great Russian Language” by V.I. Dahl (the first edition, in four volumes, was published in 1863-1866). The goal of its creator was not to standardize the language but to describe the entire diversity of Great Russian speech, including its dialect and vernacular forms.

Each descriptor entry begins with the head descriptor. Below it, within the entry as laid down by GOST, come the synonyms of this descriptor and the other lexical units linked to the head descriptor by genus-species or associative relations.
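
The structure of a descriptor entry can be sketched as a small data type. The field names and the printed labels below are illustrative assumptions. They are not the relation labels prescribed by GOST, and the sample entry is assembled only from terms used in this section.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DescriptorEntry:
    """One descriptor entry: the head descriptor plus the units related to it."""
    descriptor: str
    synonyms: List[str] = field(default_factory=list)  # synonyms of the descriptor
    broader: List[str] = field(default_factory=list)   # genus (more general) terms
    narrower: List[str] = field(default_factory=list)  # species (more specific) terms
    related: List[str] = field(default_factory=list)   # associative relations

    def render(self) -> str:
        """Return the entry as text: the head descriptor first, relations below it."""
        lines = [self.descriptor.upper()]
        sections = (
            ("synonym", self.synonyms),
            ("broader", self.broader),
            ("narrower", self.narrower),
            ("related", self.related),
        )
        for label, terms in sections:
            for term in sorted(terms):
                lines.append(f"  {label}: {term}")
        return "\n".join(lines)


# Hypothetical sample entry built from terms that occur in this section.
entry = DescriptorEntry(
    descriptor="information retrieval thesaurus",
    synonyms=["IRT"],
    broader=["thesaurus"],
    related=["descriptor"],
)
text = entry.render()
```

Keeping the relation types in separate fields makes it possible to print the entry in whatever order and with whatever labels a given standard requires.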

Thus, thesauri, especially in electronic format, are one of the effective tools for describing individual subject areas.

A thesaurus is rarely found in its pure form. In real thesauri, the original idea is either simplified or supplemented with information that is extraneous but potentially useful to the user. The best known today are the “Russian Semantic Dictionary” by Yu.N. Karaulov, the dictionary of the same name edited by N.Yu. Shvedova, the “Thematic Dictionary of the Russian Language” by L.G. Sayakhova and others.

Summary. L.V. Shcherba used the term thesaurus for a dictionary that records, as far as possible, all the contexts in which a given word occurs. A characteristic feature of such thesauri is that they list all the words that occur in a given language at least once, and under each word give all the quotations from the texts available in that language. The content of a thesaurus dictionary is language material, and that of a regular dictionary is language material and a language system (L.V. Shcherba's terms).

This characteristic is complemented by cross-connections of various kinds, often paradigmatic (synonymic or antonymic), which indicate commonality or opposition of meanings. In addition, there are associative connections of various kinds (i.e. syntagmatic connections).

Thus, the task of a thesaurus (ideographic dictionary) is to give an idea of the semantic organization of a certain cross-section of linguistic material, showing the main semantic fields, their internal structure and external connections. A thesaurus is a clear demonstration of the systemic nature of a language, allowing one to see the many types of relationships that connect individual linguistic units and groups of units.

3.2. The history of representing conceptual knowledge about the world in the form of a thesaurus

The need to arrange words according to similarity, contiguity, and analogy of their meanings has been felt throughout the observable history of human thought.

To trace the origins of the idea of representing conceptual knowledge about the world in the form of a thesaurus, it helps to turn to the history of compiling thesauri (ideographic dictionaries).

Thus, at the dawn of civilization, when people could express their thoughts in writing only with the help of ideograms and symbols, the only possible dictionary was probably one in which words were arranged into thematic groups. It was simply difficult for a lexicographer at that time to find another criterion for classifying words other than the relationships that exist in reality itself.

Unfortunately, we have no evidence of whether the peoples who used ideographic writing actually had such dictionaries. Among the most ancient attempts at ideographic classification known to us is the Attikai Lexeis of the Greek grammarian, director of the Library of Alexandria, Aristophanes of Byzantium (died 180 BC).

In the 2nd century AD the major work “Onomasticon” appeared, compiled on Greek-language material by the lexicographer and sophist Julius Pollux (real name Polydeuces), a native of the Egyptian city of Naucratis. Pollux wrote several works, but only the “Onomasticon” has come down to us (Pollux Yu. Onomasticon. M., 1956).


Onomasticon consists of 10 books. Books are essentially separate treatises and contain the most important words related to a particular topic. Thus, the first book talks about gods and kings; in the second - about people, their lives and physiological structure; in the third - about kinship and civil relations, etc. The words included in the dictionary are accompanied by brief interpretations. In modern times, the dictionary was first published in 1502 in Venice.

Between the 2nd and 3rd centuries AD the remarkable Sanskrit dictionary “Amarakosha” appeared (Amarakosha. Paris, 1839). Its author is the ancient Indian poet, grammarian and lexicographer Amara Sinha, who was called “one of the nine pearls that adorn the throne of Vikramaditya.” Amarakosha means “the treasury of Amara.” The dictionary contains 10 thousand words. To make the interpretations of word meanings easier to remember, the dictionary entries are composed in verse. All the dictionary material is divided into 3 books. Each book includes several chapters, and a chapter, where necessary, is in turn divided into a number of sections. The first book is devoted to the sky, the gods and everything directly related to them. The second book contains words relating to the earth, settlements, plants, animals and humans (man is considered first as a living being and then as a social being; the entire caste structure of the author's contemporary society appears before our eyes: priests, as God's trustees, are at the very top, below them are military men and kings, lower still are landowners, and at the very bottom are artisans, jugglers, servants, etc.). The third book is strictly linguistic, as is clear from the titles of its six chapters.

The dictionary became known to European scholars only at the end of the 18th century, when its first part was published in Rome in 1798. It was published in full, with an English translation, in 1808 by the English Sanskrit scholar H.T. Colebrooke. In 1839 its French translation appeared, made by A.L. Deslongchamps. The further development of the idea of a semantic classification of vocabulary is associated with the problem of the so-called world language.

Summary. This, in the most general terms, is the first stage in the development of the tradition of ideographic classification of vocabulary. This stage can be called the prehistory of ideographic dictionaries. Now it is advisable to turn to the modern classification of thesaurus dictionaries.

It is easy to see how different the described works are from alphabetical dictionaries. If in alphabetic dictionaries the presentation of words is regulated by such a conventional and highly neutral instrument as the alphabet, then when constructing an ideographic dictionary, the worldview of the lexicographer himself becomes decisive.

3.3. Principles of classification of dictionaries-thesauruses

As has already been shown above, the problem of compiling a classification of thesauri is not new and for several decades has attracted the attention of a number of domestic and foreign linguists (C. Marello, V.V. Morkovkin, L.P. Stupin, V.V. Dubichinsky, etc.). The result of research in this area was the creation of alternative classifications of these lexicographic works. One of the latest classifications is based on the following criteria: 1) the type of semantic connections between vocabulary units; 2) the volume of the vocabulary; 3) the generality of the vocabulary; 4) the degree to which the meaning of lexemes is elaborated; 5) the grammatical and stylistic qualification of lexemes; 6) the demonstration of how lexemes function; 7) the number of languages represented; 8) the type of semiotic means used to semanticize lexemes. This classification builds on the earlier classifications by O.M. Karpova and I. Burkhanov (Burkhanov I. On the Ideographic Description of Stylistically and Pragmatically Relevant Aspects of Lexical Meanings. London, 1996); the terminology used in the classification was introduced into the lexicographic apparatus by V.V. Morkovkin, Yu.N. Karaulov and C. Marello.

The classification criteria were formulated by O.M. Karpova. At the same time, C. Marello distinguishes three types of thesauri:

cumulative, which are groupings of words without defining their meanings;

definitive, interpreting each lexical unit of a group of words;

bi- and multilingual thesauri for travelers (Marello C. The Thesaurus // W.D.D. 1990. V. 2. P. 1083).

Cumulative thesauri not only make it possible to find, within a given semantic field, a word that is clearer, more accurate or stylistically more appropriate, but also serve as the basis for building thematic computer data banks.

Definitive thesauri can include, along with definitions of meaning, etymological information and quotations from literary works, which shows the direct encyclopedic orientation of this type of thesaurus. In addition, dictionaries of this type introduce the user to the necessary system of concepts, explain the essence, similarities and differences of concepts, their paradigmatic and syntagmatic connections, and sometimes provide information about the pronunciation, grammatical, word-formation and other possibilities of lexical units denoting these concepts.

Bilingual and multilingual thesauri for travelers are usually created according to thematic sections: numbers, food, transport, hotels, etc. with translation equivalents of two or more languages.

To display the types of existing thesaurus dictionaries as completely as possible, a multi-level classification is created. Firstly, according to the type of semantic connections between vocabulary units, thesauri are divided into three large classes:

1. Associative thesaurus (terminology by Yu.N. Karaulov).

2. Analogical thesaurus (terminology by V.V. Morkovkin).

3. Ideographic (ideological) thesaurus (terminology by L.V. Shcherba, V.V. Morkovkin).

The above three types of thesauri reflect, respectively, the following types of semantic connections between lexemes:

1. Semantic-syntactic connections, on the basis of which words are combined into groups or pairs whose occurrence and existence are predetermined by a double connection: semantic and syntactic. Semantic connections between words are established mainly between verbs and adjectives that perform a predicative function in a sentence, on the one hand, and nouns, on the other, for example:

a) between an action and the organ (instrument) with which it is performed: to grab - hand, to see - eye, to sail - boat, etc.;

b) between verbs of action that admit only one kind of subject and that subject: to bark - dog, to neigh - horse, etc.;

c) between verbs and the particular grammatical object that they require: to chop - wood, to eat - food, etc.

Hence, an associative thesaurus is a dictionary-thesaurus that organizes lexical units based on the semantic and syntactic connections that exist between them and arranges groups in accordance with the graphic form of center words.

2. Lexico-semantic connections. Grouping with this type of connection occurs according to the main feature for words - lexical meaning. In this case, lexico-grammatical connections are also taken into account, in the form of which individual meanings of words are realized.

Thus, an analogical thesaurus is a lexicographic reference book, the main unit of macrostructure of which is the lexical-semantic group; the groups are systematized in alphabetical order of semantic dominants.

3. Subject or thematic connections, where words are combined into one group because the objects and processes they denote are similar or share a function: household items, body parts, types of clothing, buildings, etc.

Thus, an ideographic thesaurus is a lexicographic work that represents lexical units as part of subject (thematic) groups and organizes them into a hierarchical structure designed to represent conceptualized knowledge about the world.

Within the framework of the same criterion, we further subdivide the types. Thus, the ideographic thesaurus is represented by the following 4 types:


The ideographic thesaurus proper.

Thematic dictionary.

Systematic dictionary.

Thematic-systematic dictionary


The ideographic thesaurus proper is a special type of ideographic dictionary whose macrostructure is organized in accordance with an a priori synoptic map superimposed on the lexical composition of the language. Unlike other types of ideographic dictionary, the ideographic thesaurus proper is characterized by a logical and strictly ordered classification structure built on a scientific taxonomy, even when it is general vocabulary that is being described (New Webster's Thesaurus. Landoll, 1991).

A thematic dictionary is a special type of ideographic thesaurus, the main unit of macrostructure of which is a thematic group, including lexemes, united on the basis of the classification of their denotations (referents) and considered from the point of view of compliance with a specific topic.

A systematic dictionary is a special type of ideographic thesaurus whose classification structure is intended to represent the actual semantic relationships that exist between lexical units of a language. At its core, the classification structure represents the lexico-grammatical classification of the vocabulary, in other words, its paradigmatic structure, described from the point of view of subordination and composition.

A thematic-systematic dictionary is a special type of ideographic dictionary, which is a combination of a thematic and systematic dictionary.

Summary. The classification of linguistic thesauri considered here includes the following types of dictionaries: analogical thesaurus (terminology by V.V. Morkovkin); ideographic (ideological) thesaurus (terminology by L.V. Shcherba and V.V. Morkovkin); associative thesaurus (terminology by Yu.N. Karaulov). Next, popular thesauri are presented and their features described.

3.4. Popular thesauri and their features

The most famous of the available thesaurus dictionaries, to which the term itself owes its existence, was created on English-language material; it is the constantly reprinted thesaurus by P.M. Roget, Roget's Thesaurus of English Words and Phrases (1852).

It is important to note that the author of the Thesaurus of English Words and Expressions made full use of the experience available by that time. “The principle that guided me when classifying words,” writes P.M. Roget, “is the same one used in classifying individuals in the various fields of natural history. Therefore, the sections I have singled out correspond to the natural families of botany and zoology, and the series of words are held together by the same relationships that unite the natural series of plants and animals.”

P.M. Roget believed that a convincing classification of words according to their meanings is impossible until the objects of reality named by these words have been properly studied and ordered. He therefore begins his work by dividing the conceptual field of the English language into four large classes: abstract relations, space, matter and spirit (mind, will, feelings). These classes are further divided into a number of genera, which in turn are divided into a certain number of species.

Scholars note the following shortcomings of P.M. Roget's ideographic dictionary: 1) a not entirely convincing nomenclature of the main conceptual classes; 2) the predominance of abstract logic over the natural connections of words; 3) relative inconvenience of use (this shortcoming has been largely corrected in subsequent editions).

In modern Russian lexicography there are several dictionaries that should be classed as thesaurus dictionaries (ideographic dictionaries). These include, for example, the “Russian Semantic Dictionary” created under the leadership of Yu.N. Karaulov, the “Russian Semantic Dictionary” edited by N.Yu. Shvedova, the “Thematic Dictionary of the Russian Language” by L.G. Sayakhova, D.M. Khasanova and V.V. Morkovkin, the “Dictionary of Lexical-Semantic Groups of Russian Verbs” edited by E.V. Kuznetsova, the “Ideographic Dictionary of the Russian Language” by O.S. Baranov, “The Conceptosphere of the Inner World of Man in the Russian Language” by V.I. Ubiyko, and the comprehensive educational dictionary “Lexical Basis of the Russian Language” under the direction of V.V. Morkovkin.

Let's get to know some of them.

The “Dictionary-thesaurus of modern Russian idioms” edited by A.N. Baranov and D.O. Dobrovolsky includes four main parts: 1) the Synopsis; 2) the Legend; 3) the Main Body of the Dictionary-Thesaurus; 4) the Indexes. The purpose of the Synopsis is to give a general idea of the structure of the Main Body of the Thesaurus. It lists all taxa with their subtaxa and the corresponding paradigmatic references. The Main Body of the Dictionary-Thesaurus is a collection of dictionary entries arranged in groups (taxa) and subgroups (subtaxa) according to the meaning of the idioms described in them. Each entry contains an idiom and examples of its use in modern Russian. The Synopsis, Legend and Indexes are service parts of the Dictionary-Thesaurus that allow the user to work quickly and efficiently. The Legend is used when examples of the use of idioms are not needed, because it reproduces all the information except the examples. In effect, it is the word list of the Dictionary. The units of the word list are lemmas. A lemma here represents the idiom in its original (dictionary) form and includes, where possible, all its significant variants. For example, the idiom “stand still” is part of the lemma “mark time, stand still, skid in place.”

The dictionary contains two indexes. At the end of the book there is an article, “Theoretical Concept of the Dictionary-Thesaurus of Modern Russian Idiomatics”, which analyzes the scientific features of the project in detail.

The “Russian Semantic Dictionary”, created under the leadership of Yu.N. Karaulov, includes 10 thousand Russian words divided into 1600 conceptual groups. The groups are identified on the basis of elements that recur in the interpretations of words in explanatory dictionaries: for example, “action”, “property”, “tool”, etc.

“Russian semantic dictionary”, created under the leadership of academician N.Yu. Shvedova, is based on slightly different principles characteristic of the compilation of both ideographic and explanatory dictionaries. Firstly, all the words of the language are divided here into four classes: 1) indicating units (pronouns), 2) naming (notional words), 3) actual connectors (conjunctions, prepositions, linking verbs), 4) classifying (modal words, particles, interjections). Secondly, within each class, all words are distributed according to parts of speech. Thirdly, within each part of speech, sets and subsets are identified based on thematic proximity or, conversely, opposition of word meanings.

DUDEN is a book of pictures (drawings) on various subjects: on the left-hand side is a drawing whose parts are numbered (down to the smallest), and on the right-hand side the numbered list gives their names (even in two languages). For example, a whole page depicts railway equipment, stations, and tracks; on the right are the names of the switches, semaphores, spikes, etc.

The “Thematic Dictionary of the Russian Language” by L.G. Sayakhova, D.M. Khasanova and V.V. Morkovkin contains 25 thousand lexical units grouped into three large classes, “Man”, “Society” and “Nature”, which branch stepwise into smaller subclasses. For example, the class “Man” has the subclasses “Human body and organism”, “Human life”, “External appearance of a person”, “Emotional appearance of a person”, etc. Each subclass is in turn divided into still more specific ones: “Emotional world of a person” - “Mental properties of a person” - “Temperament”, “Character” - “General character traits”, etc. The meaning and use of the words in each class are illustrated by the most common phrases. For example, the word “laughter”, which belongs to the subgroup “expression of feelings, emotions” of the class “Man”, is accompanied by such combinations as cheerful laughter, joyful laughter, child's laughter, burst into laughter, etc.

Summary. One of the effective tools for describing individual subject areas, especially in electronic format, are thesauri.

The term thesaurus has long been widely used in linguistics to designate a special type of dictionary that reflects, to one degree or another, the “picture of the world”, the “linguistic model of the world” (according to Yu.N. Karaulov). The thesaurus as a “treasury” has broadened in semantic scope and acquired a new meaning: the term came to denote a dictionary that not only absorbs all the lexical riches of a language but also organizes them in a definite logical and systematic way. In a thesaurus dictionary, words are combined into groups according to the ability of a given word to convey a certain concept.

The thesaurus dictionary has always been considered in linguistics as a kind of universal system that ensures the storage of collective (for a particular society) knowledge about the world in verbal form. Unlike other dictionaries, in a thesaurus-dictionary this knowledge is stored in a structured form that reflects our ideas about the “structure of the world.”

The most famous and popular thesauri at present are the English Roget's Thesaurus, the Ideographic Dictionary of the Russian Language by O.S. Baranov, the Russian Semantic Dictionary by Yu.N. Karaulov, the Russian Semantic Dictionary of academician N.Yu. Shvedova, DUDEN, and the Thematic Dictionary of the Russian Language by L.G. Sayakhova, D.M. Khasanova and V.V. Morkovkin.

Conceptual system of a subject area. The basis of any subject area is the system of concepts of that area. Definition of a concept: a concept is a thought that reflects objects and phenomena of reality in a generalized form by fixing their properties and relationships; these properties and relationships appear in the concept as general and specific features correlated with classes of objects and phenomena (Linguistic Dictionary).


Concepts and terms. To express the concepts of a subject area in texts, words or phrases called terms are used. The set of terms of a subject area forms its terminological system. The relationship of a specific term to the other terms of the subject area's term system is specified by means of a definition.


Definition of a term

A term is a word (or combination of words) that is the exact designation of a specific concept in a special field of science, technology, art, social life, etc.; also, a special word or expression used to designate something in a particular environment or profession (Big Explanatory Dictionary of the Russian Language).


Terms - exact names of concepts

Usually, each concept in the field corresponds to at least one unambiguously understood term whose meaning is this concept. These are terms in the sense of the traditional theory of terminology.

Properties of terms as exact names of concepts:
- the term must relate directly to the concept and express it clearly;
- the meaning of the term must be precise and must not overlap in meaning with other terms;
- the meaning of the term should not depend on the context.

Terms that accurately name a concept are the subject of research in the theory of terminology.


Text terms

In real texts of the subject area, a concept can be referred to not only by basic terms but also by many other language expressions, which we call text terms:
- syntactic and word-formation variants: recipient of budget funds - budget recipient;
- lexical variants: direct write-off - undisputed write-off;
- polysemous expressions that refer to different concepts of the field depending on the context: for example, the word currency can mean national currency or foreign currency.














Descriptors with qualifiers

A qualifier is part of the descriptor name:
- cranes (lifting equipment) vs. cranes (birds);
- shells (structures) - comparison of different thesauri.

Preference for phrases: phonograph records vs. records (phonograph).

Qualifiers and the plural: wood (material) vs. woods (forested areas).






Including descriptors based on multi-word expressions

- Splitting the term increases ambiguity: plant food.
- The meaning of the expression depends on the word order: information science - scientific information.
- One of the component words is outside the scope of the thesaurus or is too general: first aid.
- The relations of the descriptor do not follow from its structure: artificial kidneys, refugee status, traffic lights.




Associative relations

- Field of activity - actor: mathematics - mathematician
- Discipline - object of study: neurology - nervous system
- Action - agent or tool: hunting - hunter
- Action - result of action: weaving - fabric
- Action - goal: bookbinding - book
- Cause - effect: death - funeral
- Quantity - unit of measurement: electric current - ampere
- Action - counteragent: allergen - antiallergic drug
- etc.


Information retrieval thesauri: stages of development

First stage:
- indexers describe the main topic of the text using arbitrary words and phrases;
- the terms obtained from many texts are brought together;
- among terms that are similar in meaning, the most representative one is selected;
- some of the remaining terms become conditional synonyms, and the rest are deleted;
- specific terms are usually not included.


Information retrieval thesauri: the art of development

- Descriptors are the terms needed to express the main topic of a document.
- Only the most necessary synonyms are included (for example, those starting with a different letter), so as not to complicate the work of the indexer.
- Closely related terms should be reduced to one term to avoid subjectivity in indexing.
- The number of hierarchy levels and the inclusion of specific terms are limited.


Information retrieval thesauri: the art of development - 2

- In complex cases, descriptors are supplied with qualifiers and comments (LIV: bombardment - bombing).
- Polysemous terms: only one meaning is represented in the thesaurus (capital); the other meanings do not fit in the thesaurus, so qualifiers are needed.
- A traditional information retrieval thesaurus is an artificial language built on the basis of real terms.




Traditional information retrieval thesauri: application in automatic processing

The main problem is the lack of knowledge about the real language of the subject area. Examples from the Legislative Indexing Vocabulary:
- in the text: TROOPS; in the thesaurus: MILITARY FORCES;
- in the text: CAPITAL in any of its meanings; in the thesaurus: only one meaning of capital.

Proposed: supplement each descriptor with lists of words and terms.

But: such words may be polysemous or relate to different descriptors, so ambiguity resolution is required.


Traditional information retrieval thesauri: automatic query expansion

Problem: associative relations.

Proposed:
- introduce weights;
- introduce names of relations: object, property, etc.

CONCLUSION: linguistic resources must be built specifically for the automatic processing of text collections.


EUROVOC thesaurus - multilingual thesaurus of the European Community

- Thesaurus in 9 languages.
- Russian version of EUROVOC: plus 5 thousand concepts reflecting Russian specifics.
- Multilingual thesaurus: each descriptor has names in different languages; ascriptors exist for some languages.


Automatic indexing according to the EUROVOC thesaurus, based on rules (Hlava, Heinebach, 1996)

Example rule:

    IF (near "Technology" AND with "Development")
        USE Community program
        USE development aid
    ENDIF

The system contains 40 thousand rules. Testing: for the 20 most frequent descriptors generated automatically for a text, recall was 42% compared with manual indexing.
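The rule-based approach can be illustrated with a minimal Python sketch. It is not the system of Hlava and Heinebach: the source does not define the operators, so the sketch assumes that a rule fires when both trigger words occur in the same sentence at most `window` tokens apart. The names `RULES`, `index_text` and `window` are hypothetical.

```python
import re

# Hypothetical rule table: (pair of trigger words, descriptors to assign).
# The single rule mirrors the example rule quoted in the text.
RULES = [
    (("technology", "development"), ["Community program", "development aid"]),
]


def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def index_text(text, rules=RULES, window=5):
    """Return descriptors whose rule fires in at least one sentence.

    Assumption: a rule fires when both trigger words occur in the same
    sentence, at most `window` tokens apart.
    """
    assigned = []
    for sentence in re.split(r"[.!?]+", text):
        tokens = tokenize(sentence)
        for (first, second), descriptors in rules:
            pos_first = [i for i, t in enumerate(tokens) if t == first]
            pos_second = [i for i, t in enumerate(tokens) if t == second]
            fires = any(abs(i - j) <= window
                        for i in pos_first for j in pos_second)
            if fires:
                for descriptor in descriptors:
                    if descriptor not in assigned:
                        assigned.append(descriptor)
    return assigned
```

With tens of thousands of rules, a real system would index rules by trigger word instead of scanning the whole table for every sentence.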


Automatic indexing based on establishing correspondence weights between words and descriptors (Steinberger et al., 2000)

Stage 1: correspondence between text words and assigned descriptors is established using statistical measures (chi-square or log-likelihood). For example, the descriptor FISHERY MANAGEMENT is associated with the following words (in descending order of weight): fishery, fish, stock, fishing, conservation, management, vessel, etc.

Stage 2: indexing itself, performed by summing the logarithms of the weights or by computing the scalar product of vectors.
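Stage 2 can be sketched in Python as follows. The numeric weights and the second descriptor are invented for illustration only; in the approach described above they would come from Stage 1. The sketch also assumes that all weights are greater than 1, so that their logarithms are positive, and it does no lemmatization.

```python
import math
import re

# Hypothetical word weights per descriptor (illustrative values only).
ASSOCIATES = {
    "FISHERY MANAGEMENT": {
        "fishery": 9.0, "fish": 7.5, "stock": 6.0, "fishing": 5.5,
        "conservation": 4.0, "management": 3.5, "vessel": 3.0,
    },
    "DEVELOPMENT AID": {"aid": 8.0, "development": 6.0, "donor": 4.0},
}


def score_descriptors(text, associates=ASSOCIATES):
    """Rank descriptors by the sum of log-weights of associated words in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {}
    for descriptor, weights in associates.items():
        total = sum(math.log(w) for word, w in weights.items() if word in words)
        if total > 0:  # keep only descriptors with at least one matching word
            scores[descriptor] = total
    return sorted(scores, key=scores.get, reverse=True)
```

For example, `score_descriptors("Fishing vessels threaten the fish stock")` matches the words fishing, fish and stock and returns only FISHERY MANAGEMENT; the word form "vessels" is missed because the sketch compares exact word forms.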


A combination of free queries and queries based on an information retrieval thesaurus

Correlations between words and descriptors are established on a manually indexed collection. The user asks a query in natural language, and the query is expanded with the thesaurus descriptors that are most strongly correlated with it (Petras 2004; Petras 2005). For example, for the query Insolvent Companies, the descriptors liquidity, indebtedness, enterprise and firm can be obtained and used to expand the query. In the experiment, precision increased by 13%.
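The idea of query expansion from a manually indexed collection can be sketched as below. This is not the method of Petras: a simple co-occurrence count stands in for the correlation measure, and the collection, the function name `expand_query` and the parameter `top_n` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical manually indexed collection: (document text, assigned descriptors).
COLLECTION = [
    ("Insolvent companies face liquidation", ["liquidity", "enterprise"]),
    ("Companies with heavy debt become insolvent", ["indebtedness", "firm"]),
    ("Fishing quotas protect fish stocks", ["fishery management"]),
]


def expand_query(query, collection=COLLECTION, top_n=3):
    """Suggest descriptors that co-occur most often with the query words.

    Assumption: the score of a descriptor is the number of query words found
    in the documents it was assigned to (a crude proxy for correlation).
    """
    query_words = set(re.findall(r"[a-z]+", query.lower()))
    counts = Counter()
    for text, descriptors in collection:
        doc_words = set(re.findall(r"[a-z]+", text.lower()))
        overlap = len(query_words & doc_words)
        for descriptor in descriptors:
            counts[descriptor] += overlap
    return [d for d, c in counts.most_common(top_n) if c > 0]
```

The returned descriptors would then be added to the user's free-text query before it is run against the collection.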



One of the new basic concepts that emerged with the development of machine methods of information processing (in particular, translation from one language to another, the search for scientific and technical information, and the creation of an information model of an enterprise in automated control systems) is the concept of an information system thesaurus. The term “thesaurus” implies a body of knowledge about the external world, the so-called thesaurus of the world T. All concepts of the external world expressed in natural language constitute this thesaurus, from which particular thesauri can be derived either by hierarchical division, taking into account the subordination of individual concepts, or by isolating parts of the general thesaurus of the world. In information retrieval systems, the thesaurus plays an important role in finding the desired document by keywords. Building a thesaurus is therefore a complex and responsible task, but one that can be automated.

Classification, in its most general definition, is the partitioning and ordering of sets: the distribution of objects and phenomena into classes based on a common feature that is inherent in them and distinguishes them from the objects and phenomena that make up other classes. If necessary, each class can be divided into subclasses. A rubricator is a special type of classification, so rubricators are created on the basis of the same general provisions:
- a scientific basis for constructing the classification;
- reflection of the current level of development of science;
- the presence of a system of links and references, as well as a reference and cross-reference apparatus (SSA).

However, the rubricator is a pragmatic classification created on the basis of information flows and the needs of specialists. This is its difference from a priori classifications, such as UDC and IPC.

The main functions of classifications and, in particular, of the rubricator are the following:
- thematic differentiation of information subsystems;
- formation of information arrays based on any characteristics;
- systematization of information materials and publications;
- current and retrospective search;
- indexing of documents and queries;
- connection with other classification schemes;
- normative functions.

Classifications are built by dividing concepts (the objects of classification) on the basis of established connections between the characteristics of these objects, in accordance with certain logical principles. The characteristic by which the division is made is called the basis of division. Classifications widely use deduction and induction to fix groups and classes and to identify connections between them; this is typical of hierarchical classifications. The depth of a classification (the number of hierarchy levels) may vary depending on its purpose. One of the widely used rubricators is the State Rubricator of Scientific and Technical Information (GRNTI).

The GRNTI rubricator is designed in such a way that it can be used together with other classifications such as UDC and IPC. The Universal Decimal Classification (UDC) has existed for more than 70 years, but still has no equal in its breadth of distribution and is used in many countries around the world. UDC covers the entire universe of knowledge and is successfully used for systematization and subsequent search for a wide variety of sources of information.

In addition to the UDC, the Library-Bibliographic Classification (LBC; BBK in Russian) is widely used in practice. The LBC is built on the principles of logical subordination and is an applied classification.
In the Russian Federation, inventions are classified and domestic collections of invention descriptions are systematized using the International Patent Classification (IPC), a rather complex multi-aspect classification built on a functional-industry principle. The same technical concepts can be found in the IPC either in special classes (by industry) or in functional classes (by principle of operation). The industry principle of distributing concepts classifies objects according to their application in a particular historically established branch of engineering and technology.

Comparative characteristics of the GRNTI, UDC, LBC (BBK) and IPC rubricators are given in Table 1.

Table 1
Characteristics of the GRNTI, UDC, LBC (BBK) and IPC rubricators

Name | Structure | Principle of placement of divisions | Scheme of section construction
 | Hierarchical | Industry | From general to specific
 | Hierarchical | Thematic |
IPC | Hierarchical | Functional-sectoral | From general to specific
LBC for scientific libraries | Hierarchical | Industry | From general to specific, by species


Thus, we can highlight the main distinctive features of rubricators and classifiers:
- they are applied in nature and industry-oriented;
- they are open systems that depend on the development of science and technology and on the needs and requests of specialists;
- they are inorganic systems, since their objects arise and develop in the environment and enter the system from it; the elements can exist independently outside the system. This feature is closely related to the second one;
- the minimum element is a concept associated with the environment; a concept represents a system of definitions;
- connections arise between concepts both “vertically” (genus-species, whole-part) and “horizontally” (species-species, part-part), which indicates that the systems are hierarchical.

Consequently, the structure and principles of organization of classifications and rubricators make it possible to automate the process of constructing subject area thesauri using the deduction method. The algorithm for constructing a thesaurus using the deduction method is shown in Fig. 1.

The basis for the formation of a thesaurus is a search image of a document, a task or an application for information search, filled out by the operator. Therefore, the first step is to research and analyze the application. At the first stage, the operator indicates the topic or problem of interest, possible keywords and their synonyms. As a result, we get a superficial understanding of the subject area.

Fig. 1. Algorithm for constructing a thesaurus using the deduction method

In addition, the deduction method forms a thesaurus of keywords (KS), which requires:
- the array of keywords specified by the user, designated MP in Figure 1;
- the array of keywords extracted from the search task, designated MZ.

However, for a more complete and in-depth understanding of the subject area, we use existing rubricators and classification schemes (GRNTI, UDC, BBK, IPC). To maximize coverage of the subject area, all available rubricators must be reviewed. The array of rubricators is designated MR. The deductive search algorithm consists of two steps:
1. Finding generic concepts (Fig. 2);
2. Finding specific terms within generic concepts (Fig. 3).


Fig. 2. Processing of the generic concept

We load the first rubricator from the array and loop over the keywords entered by the user, checking whether each is present in the rubricator. Each keyword is searched for in the rubricator and compared with a generic concept, or “nest”; then we check whether the generic concept has a link to specific terms. If such a link exists, the keyword is compared with the specific terms. If no link is found, we move on to the next generic concept. When all keywords entered by the operator have been processed, we move on to the array of keywords extracted from the task. The verification procedure is the same: we look for keywords corresponding to generic concepts, and then follow their links to specific terms.


Fig. 3. Processing of specific terms

Note that within each generic concept it is important to review all available specific terms in order to obtain the fullest possible understanding of the problem area. The result of these actions is an array of keywords that forms a complete thesaurus corresponding to the information search task or to the search image of a document.
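The deductive procedure described above can be sketched in Python. This is one possible reading of the description, not the authors' implementation. It assumes that each rubricator is a dictionary mapping a generic concept to its list of specific terms, and that a keyword matches when it equals a generic concept or one of its specific terms (case-insensitively). The function name `build_thesaurus` is hypothetical; `mp`, `mz` and `mr` follow the array names MP, MZ and MR used in the text.

```python
def build_thesaurus(mp, mz, mr):
    """Collect generic concepts and all their specific terms matched by keywords.

    mp: keywords entered by the user (MP)
    mz: keywords extracted from the search task (MZ)
    mr: list of rubricators (MR), each a dict {generic concept: [specific terms]}
    """
    thesaurus = []

    def add(term):
        # Keep the first occurrence only, preserving discovery order.
        if term not in thesaurus:
            thesaurus.append(term)

    for rubricator in mr:
        # User keywords are processed first, then keywords from the task.
        for keywords in (mp, mz):
            for keyword in keywords:
                for generic, specific_terms in rubricator.items():
                    names = [generic] + list(specific_terms)
                    if keyword.lower() in (name.lower() for name in names):
                        add(generic)
                        # Review all specific terms of the matched concept.
                        for term in specific_terms:
                            add(term)
    return thesaurus
```

A usage example with two invented rubricators:

```python
mr = [
    {"Informatics": ["information retrieval", "thesauri"], "Physics": ["optics"]},
    {"Documentation": ["abstracting", "indexing"]},
]
print(build_thesaurus(["thesauri"], ["Documentation", "chemistry"], mr))
```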

Based on a complete set of search images of documents, it is possible to create industry thesauri and a unified library classifier. Obviously, the complete set of search images itself represents a simple thesaurus.

However, using the selection criterion
, (1)
we can build industry thesauri. In this case, the set of all industry thesauri forms a complete thesaurus
, (2)
sections of which can be hierarchically structured in accordance with the requirements of GOST according to the main classifiers (GRNTI, UDC, BBK, IPC) or according to an internal unified classifier.

Automation of the process of constructing a thesaurus and classification makes it possible to make the work of an operator working with distributed information resources as easy as possible.

In addition to constructing a thesaurus based on the search image of a document, the proposed approach can be used for automatic document abstracting and text clustering.

Document abstracting is one of the tasks aimed at providing experts with the reliable information they need to make management decisions about the value of documents obtained from the Internet. Abstracting is the process of transforming documentary information that culminates in the preparation of an abstract. An abstract is a semantically adequate presentation of the main content of the primary document, characterized by economical notation and constancy of linguistic and structural characteristics, and intended to perform a variety of information and communication functions in the system of scientific communication. The document abstracting algorithm is presented in Fig. 4.


Fig. 4. Document abstracting algorithm

In general, the algorithm includes the following main stages.
1. Sentences are extracted from a document downloaded from the Internet and held in the data warehouse, using punctuation marks as boundaries, and are stored in an array.
2. Each sentence is divided into words at the separators, and the words are saved into an array, with a separate array for each sentence.
3. For each word of each sentence, we count its occurrences in the other sentences (those before and after it). The sum of these counts over all words of the sentence is the weight of the sentence.
4. A given number of sentences with the highest weights is selected for the abstract and output in the order in which they appear in the text.
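The four stages above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the splitting patterns, the function name `summarize` and the parameter `n_sentences` are assumptions, and no stop-word filtering is applied because the description does not mention it.

```python
import re


def summarize(text, n_sentences=2):
    """Select the n highest-weighted sentences, kept in document order."""
    # Stage 1: split the text into sentences at terminal punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Stage 2: split every sentence into its own list of lowercase words.
    words = [re.findall(r"\w+", s.lower()) for s in sentences]
    # Stage 3: the weight of a sentence is the number of times its words
    # occur in all the other sentences.
    weights = []
    for i, sentence_words in enumerate(words):
        weight = 0
        for word in sentence_words:
            for j, other in enumerate(words):
                if j != i:
                    weight += other.count(word)
        weights.append(weight)
    # Stage 4: take the top-weighted sentences and restore text order.
    top = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)
    return [sentences[i] for i in sorted(top[:n_sentences])]
```

Because frequent function words are counted as well, long sentences tend to receive high weights; filtering stop words or normalizing by sentence length would be a natural refinement.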

The proposed model for constructing a thesaurus and thematic catalogs of an information system provides a theoretical basis for automating semantic search. It allows an expert not only to carry out search work but also to abstract, in automated mode, the documents obtained by searching distributed information systems on the Internet.

