Please use this identifier to cite or link to this item: http://hdl.handle.net/10995/103128
Title: Language attribution of an unmarked text corpus
Authors: Tarasov, D.
Issue Date: 2020
Publisher: World Scientific and Engineering Academy and Society
Citation: Tarasov D. Language attribution of an unmarked text corpus / D. Tarasov. — DOI 10.37394/23203.2020.15.75 // WSEAS Transactions on Systems and Control. — 2020. — Vol. 15. — P. 754-759.
Abstract: Unmarked text corpora will appear increasingly often as the amount of information on the web grows. Automated analysis of Big Data in search engines and in scientific and commercial applications requires detailed information about the object under study. For text corpora, information on the language of the documents is extremely important. The situation is even more complicated when working with scanned texts. This paper further develops the idea of using fractal-inspired irregularity to attribute the language of a text. A methodology for the attribution is proposed, and an experiment on 10 European languages is conducted. The proposed approach has shown its effectiveness and promise. A sample of approximately 4000 characters (1 page of text) is sufficient to attribute the language of the text uniquely. © 2020, World Scientific and Engineering Academy and Society. All rights reserved.
Keywords: BIG DATA
FRACTAL
IRREGULARITY
LANGUAGE
URI: http://hdl.handle.net/10995/103128
Access: info:eu-repo/semantics/openAccess
SCOPUS ID: 85099953727
PURE ID: 20889886
e5a0334b-66ba-49c0-8a1e-1d903bc266fe
ISSN: 19918763
DOI: 10.37394/23203.2020.15.75
Appears in Collections: Scientific publications indexed in SCOPUS and WoS CC

Files in This Item:
File                      Description   Size      Format
2-s2.0-85099953727.pdf                  1,88 MB   Adobe PDF
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.