Lesson outline: MDK lesson "Machine translation systems for texts and computer dictionaries"

Content

Computer dictionaries and machine translation systems for texts

Computer dictionaries.

Dictionaries are necessary for translating texts from one language to another. There are thousands of dictionaries for translation between hundreds of languages (English-Russian, German-French, etc.), and each of them can contain tens of thousands of words. In paper form, the dictionary is a thick book of hundreds of pages, in which searching for the right word is a rather long and labor-intensive process.

Computer dictionaries can contain translations into different languages of hundreds of thousands of words and phrases, and also provide the user with additional capabilities. Firstly, computer dictionaries can be multilingual - giving the user the opportunity to choose languages and direction of translation (for example, English-Russian, Spanish-Russian, etc.).

Secondly, computer dictionaries can, in addition to the main dictionary of commonly used words, contain dozens of specialized dictionaries in areas of knowledge (technology, medicine, computer science, etc.).

Thirdly, computer dictionaries provide a quick search for dictionary entries: “quick typing”, when in the process of typing a word a list of similar words appears; access to frequently used words via bookmarks; the ability to enter phrases, etc.

Fourthly, computer dictionaries can be multimedia, that is, they can let the user listen to words pronounced by native speakers.
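The "quick typing" search described above can be sketched as a prefix lookup over a sorted list of headwords. This is only an illustration under stated assumptions: the five-entry dictionary and the function name suggest are hypothetical, and real dictionary programs may use other data structures.

```python
from bisect import bisect_left

# Hypothetical miniature English-Russian dictionary; a real product
# holds hundreds of thousands of entries.
ENTRIES = {
    "translate": "переводить",
    "translation": "перевод",
    "translator": "переводчик",
    "transport": "транспорт",
    "tree": "дерево",
}
HEADWORDS = sorted(ENTRIES)


def suggest(prefix, limit=5):
    """Return up to `limit` headwords that start with `prefix`."""
    # Binary search finds the first headword that is >= prefix.
    start = bisect_left(HEADWORDS, prefix)
    matches = []
    for word in HEADWORDS[start:]:
        # Matching headwords are contiguous in sorted order, so stop
        # at the first non-match or when the list is full.
        if not word.startswith(prefix) or len(matches) == limit:
            break
        matches.append(word)
    return matches


# Show the suggestion list that would appear after typing "transl".
for word in suggest("transl"):
    print(word, "-", ENTRIES[word])
```

Because the headwords are sorted, every word sharing a prefix sits in one contiguous run, so the list can be refreshed after each keystroke without scanning the whole dictionary.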

Computer translation systems.

Globalization creates a need for frequent exchange of documents between people and organizations that are located in different countries and speak different languages.

Under these conditions, traditional “manual” translation slows down the development of international contacts. Translating multi-page documentation by hand takes a long time and is expensive. A letter received by e-mail or a Web page viewed in a browser must be translated at once, and there is no time to call in a translator.

Computer-assisted translation systems can solve these problems. On the one hand, they are capable of translating multi-page documents at high speed (one page per second), and on the other hand, translating Web pages on the fly, in real time.

Computer translation systems translate texts based on formal “knowledge”: the syntax of the language (rules for constructing sentences), word formation rules and dictionaries. A translator program first analyzes the text in one language and then constructs the corresponding text in another language.
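The analyze-then-construct idea can be illustrated with a toy rule-based translator. This is a minimal sketch, not how any particular product works: the three-word English-French lexicon, the part-of-speech tags and the single reordering rule are all assumptions made for the example.

```python
# Hypothetical toy lexicon: English word -> (French word, part of speech).
# Gender agreement is ignored; "la" is hard-coded to suit "voiture".
LEXICON = {
    "the": ("la", "DET"),
    "red": ("rouge", "ADJ"),
    "car": ("voiture", "NOUN"),
}


def translate(sentence):
    """Translate a sentence word by word, then apply one syntax rule."""
    # Analysis: look up each source word and its part of speech.
    # Unknown words raise KeyError in this sketch.
    tagged = [LEXICON[word] for word in sentence.lower().split()]

    # Construction: in this toy grammar the adjective follows the noun,
    # so each ADJ + NOUN pair is emitted in swapped order.
    output = []
    i = 0
    while i < len(tagged):
        is_adj_noun = (
            i + 1 < len(tagged)
            and tagged[i][1] == "ADJ"
            and tagged[i + 1][1] == "NOUN"
        )
        if is_adj_noun:
            output += [tagged[i + 1][0], tagged[i][0]]
            i += 2
        else:
            output.append(tagged[i][0])
            i += 1
    return " ".join(output)


print(translate("The red car"))  # la voiture rouge
```

A pure word-for-word lookup would give "la rouge voiture"; the syntax rule is what produces the correct word order.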

Modern computer translation systems make it possible to translate technical documentation, business correspondence and other specialized texts with acceptable quality. However, they are not suitable for translating literary works, since they cannot adequately translate metaphors, allegories and other elements of human artistic creativity.

Control questions

1. What advantages do computer dictionaries have over traditional paper dictionaries?

2. In what cases is it advisable to use computer translation systems?

Optical document recognition systems

Optical character recognition systems.

Optical character recognition systems are used to create electronic libraries and archives by converting books and documents into digital computer format.

First, using a scanner, you need to obtain an image of a page of text in graphic format. Next, to obtain a document in a text format, it is necessary to perform text recognition, i.e., convert the elements of a graphic image into a sequence of text characters.

Optical character recognition systems first determine the structure of the text on the page and divide it into separate areas: columns, tables, images, etc. Next, the selected text fragments of the graphic image of the page are divided into images of individual characters.

For scanned documents of typographic quality (reasonably large font, no poorly printed characters or corrections), character recognition is carried out by comparing them with bitmap patterns.

The raster image of each character is superimposed in turn on the raster character templates stored in the memory of the optical recognition system. The result of recognition is the character whose template most closely matches the image (Fig. 3.16).

Fig. 3.16. The recognized character “B” is superimposed on the raster character templates (A, B, C, etc.)
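The raster method can be sketched as counting how many pixels of the scanned character coincide with each stored template. The 5x5 bitmaps, the three-letter template set and the function names below are hypothetical; real systems work with larger images and more elaborate similarity measures.

```python
# Hypothetical 5x5 bitmaps: '#' is ink, '.' is background.
TEMPLATES = {
    "I": ["..#..", "..#..", "..#..", "..#..", "..#.."],
    "L": ["#....", "#....", "#....", "#....", "#####"],
    "T": ["#####", "..#..", "..#..", "..#..", "..#.."],
}


def match_score(image, template):
    """Count the pixels that coincide when the image is laid over the template."""
    return sum(
        pixel == expected
        for image_row, template_row in zip(image, template)
        for pixel, expected in zip(image_row, template_row)
    )


def recognize(image):
    """Return the character whose template coincides with the image best."""
    return max(TEMPLATES, key=lambda char: match_score(image, TEMPLATES[char]))


# A slightly damaged "T": one pixel missing from the top bar
# and one stray pixel on the right.
scanned = ["####.", "..#..", "..#..", "..#.#", "..#.."]
print(recognize(scanned))  # T
```

Here the damaged image coincides with the "T" template in 23 of 25 pixels, with "I" in 21 and with "L" in 11, so "T" is chosen despite the defects.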

When recognizing documents with low print quality (typewritten text, faxes, etc.), the vector character recognition method is used. Geometric primitives (segments, circles, etc.) are identified in the image of the character being recognized and compared with vector character templates. The character selected is the one whose set of geometric primitives and their arrangement most closely matches the character being recognized (Fig. 3.17).

Fig. 3.17. The recognized character “B” is superimposed on the vector character templates (A, B, C, etc.)
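The vector method can be sketched in the same spirit. The example below assumes that the geometric primitives have already been extracted from the image and describes each one by a kind and a coarse position; the templates, the position labels and the overlap measure (shared primitives divided by all distinct primitives) are illustrative assumptions, not the method of any specific product.

```python
# Hypothetical vector templates: each character is a set of
# (primitive kind, coarse position) pairs.
TEMPLATES = {
    "A": {("segment", "left"), ("segment", "right"), ("segment", "middle")},
    "B": {("segment", "left"), ("arc", "upper-right"), ("arc", "lower-right")},
    "D": {("segment", "left"), ("arc", "right")},
    "P": {("segment", "left"), ("arc", "upper-right")},
}


def similarity(found, template):
    """Share of primitives the two sets have in common (0.0 to 1.0)."""
    return len(found & template) / len(found | template)


def recognize(found):
    """Return the character whose primitives best match the ones found."""
    return max(TEMPLATES, key=lambda char: similarity(found, TEMPLATES[char]))


# Primitives found in a poorly printed "B": all three expected ones
# plus one spurious segment caused by a smudge.
found = {
    ("segment", "left"),
    ("arc", "upper-right"),
    ("arc", "lower-right"),
    ("segment", "top"),
}
print(recognize(found))  # B
```

Because the comparison uses shapes and positions rather than individual pixels, a smudge or a faint stroke changes the score only slightly, which is why this approach suits low-quality print.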

Optical character recognition systems are “self-learning” (for each specific document they build an appropriate set of character templates), so the speed and quality of recognition gradually improve as a multi-page document is processed.

With the release of Apple's first Newton pocket computer in 1993, handwriting recognition systems began to be developed. Such systems convert text written on the screen of a pocket computer with a special pen into a text document.

Optical form recognition systems.

When a large number of people fill out documents (for example, when school graduates take the Unified State Exam (USE)), forms with empty fields are used. Data is entered into the fields by hand in block letters. This data is then recognized by optical form recognition systems and entered into computer databases.

The difficulty is that you need to recognize handwritten characters, which vary quite a bit from person to person. In addition, such systems must be able to determine which field the recognized text belongs to.
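The second task, deciding which field a recognized piece of text belongs to, can be sketched by comparing the position of the text with the known layout of the form. The field names, coordinates and sample data below are hypothetical, and the handwriting recognition itself is assumed to have been done already.

```python
from typing import NamedTuple, Optional


class Box(NamedTuple):
    """Rectangle on the scanned form, in pixels."""
    left: int
    top: int
    right: int
    bottom: int


# Hypothetical form layout: field name -> area of the field on the page.
FIELDS = {
    "surname": Box(100, 50, 500, 90),
    "first_name": Box(100, 100, 500, 140),
    "answer_1": Box(100, 200, 160, 240),
}


def field_for(box: Box) -> Optional[str]:
    """Return the field that contains the center of the box, if any."""
    center_x = (box.left + box.right) / 2
    center_y = (box.top + box.bottom) / 2
    for name, area in FIELDS.items():
        inside_x = area.left <= center_x <= area.right
        inside_y = area.top <= center_y <= area.bottom
        if inside_x and inside_y:
            return name
    return None


# Text fragments as a recognizer might report them: (text, position).
recognized = [
    ("IVANOV", Box(110, 55, 260, 85)),
    ("ANNA", Box(112, 104, 210, 138)),
    ("B", Box(120, 205, 140, 235)),
]

# Build the database record by assigning each fragment to its field.
record = {field_for(box): text for text, box in recognized}
print(record)
```

Using the center of the text rather than its full outline tolerates handwriting that slightly overflows the printed borders of a field.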

Control questions

1. What are the differences in text recognition technology when using raster and vector methods?

Practical computer workshop assignments recommended for completion while studying the chapter

Computer workshop

No. 8. Coding of text information.

No. 9. Create business cards based on a template.

No. 10. Setting document page parameters, inserting headers and footers and page numbers.

No. 11. Inserting formulas into a document.

No. 12. Formatting characters and paragraphs.

No. 13. Creating and formatting lists.

No. 14. Inserting a table of contents into a document containing headings.

No. 15. Inserting a table into a document, formatting it and filling it with data.

No. 16. Creating a hypertext document.

No. 17. Translation of text using a computer dictionary.

No. 18. Scanning and recognizing a paper text document.

Internet resources

With the advent of the Google and Yandex search engines, the Internet has become an invaluable encyclopedia and a translator's best friend. These systems open up very wide opportunities for the translator. In a few seconds you can check the contexts in which a word combination that causes translation difficulties occurs. Multilingual resources (most notably Wikipedia) let you switch between articles written on the same topic in different languages. As a result, the translator can compare the terminology of the source and target languages in a particular narrow subject area.

For example, if a translator is not a native English speaker and is unsure which wording is better (for example, heavy rain or strong rain), then entering both options in the Google search bar makes it possible to compare their frequency by the number of pages found (8.8 million for heavy rain and 41,900 for strong rain). For Google to search for a phrase rather than individual words, the phrase must be placed in quotation marks. Of course, you need to be very careful when using the Internet as the main source of information for making decisions. The very fact that more than forty thousand pages contained the non-idiomatic English expression strong rain shows how unreliable this source is. Fortunately, there are still more literate website authors than illiterate ones, and the above method has been successfully used by the author of this article in his translation work for several years.

With the spread of the Internet, more and more resources are becoming available over the network. Online versions of many electronic dictionaries are available, including ones that users can add to (for example, www.multitran.ru).

The Internet has connected translators living in different countries. Every day more forums, blogs and resources for translators appear, where they can exchange experience and help each other. The best known of these sites in Russia is “City of Translators”, and among Western sites, Proz.com, KudoZ, Translators' Café and Translators' Base. Some time ago, mailing lists (such as Lantra-L and Trad-Prt) were also popular among translators. Using communication programs (ICQ, Skype and others), translation companies can hire native-speaking translators from different countries. Thanks to the Internet, the translation market is becoming truly unified and global.

International translator associations are emerging (for example, the American Translators Association, or ATA for short, founded in 1959 and comprising 10,500 translators from 70 countries). In 2005, the ATA conducted a “census of translators”, from which a number of interesting conclusions can be drawn. Thus, 67.1% of translators in the association are women and only 32.9% are men. 63.6% of translators lived in the United States, and 36.4% lived outside it. Only 74.6% of translators have higher education. The role of the ATA is to certify translators.

The advent of computer technology, translation memory, electronic dictionaries, and speech and text recognition programs has made it possible to speed up and simplify the work of translators. As a result, the turnover of the translation industry has increased significantly. However, not all translators use translation memory.

Translation quality

The quality of translation depends on the subject and style of the source text, as well as the grammatical, syntactic and lexical relatedness of the languages between which the translation is made. Machine translation of literary texts is almost always of unsatisfactory quality. Nevertheless, for technical documents, if you have specialized machine dictionaries and some adjustment of the system to the peculiarities of a particular type of text, it is possible to obtain a translation of acceptable quality, which only needs minor editorial adjustments. The more formal the style of the source document, the higher the quality of the translation you can expect. The best results when using machine translation can be achieved for texts written in technical (various descriptions and manuals) and official business style.

The use of machine translation without tuning it to the subject matter (or with deliberately incorrect settings) is the source of numerous jokes on the Internet. The oldest and best known example is a translation of the documentation for a mouse driver, known as “Mouse Drovers” and presented as a “translation of computer documentation by the Poliglossum machine translation system based on medical, commercial and legal dictionaries” [comm. 1]. Among the short examples is the phrase “Our cat gave birth to three kittens - two whites and one black”, which the online translator PROMT (version 7.0, 2007) turned into “Our cat gave birth to three kittens - two white and one African American” [6]. While the “African American” could still be turned into “black” by writing “black kitten”, the “cat” could not be made to change gender: for example, female cat was translated as “female cat”.

Most often, such jokes arise because the program does not recognize the context of a phrase and translates terms literally, without distinguishing proper names from ordinary words. The same PROMT translator turned “Leo Tolstoy” into “Lion Thick” (“fat lion”), “bra-ket notation” into “Katie’s bra note”, “Lie algebra” into “algebra of Lies”, “eccentricity vector” into “vector of originality”, “Shawnee Smith” into “Shawnee Smith Indian”, etc. Google Translate, by contrast, often mistook the word “rice” for the surname of the US Secretary of State.