Author: Carson S. Cumbee
DOI:
Keywords:
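Abstract: A method of identifying the script of a line of text by first assigning a weight to each n-gram in a group of documents of known scripts, where an n-gram is a sequence of numbers representing the k-means cluster centroids which character segments of the known scripts most closely match. A line of text is identified, where the line of text is made up of pixels. The identified line of text is cropped so that only a percentage of the pixels remain. The cropped line of text is vertically and horizontally rescaled into gray-scale vertical segments. The vertical segments are replaced with the number of the k-means centroid each most closely matches. The n-grams that represent the line of text are scored against the weights of the n-grams of the known scripts. The highest score is compared to the scores of the known scripts, and the script of the line of text is determined to be the script of the document for which the score is highest.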
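
The quantize-and-score steps described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation. It assumes that the line has already been cropped and rescaled into gray-scale vertical segments (here plain tuples of floats), that the k-means centroids are already available, and that the weight of an n-gram is its relative frequency within one script's training data. The abstract does not specify the weighting scheme, so that choice is purely illustrative. The final comparison step is simplified to picking the script with the highest score. All function names are hypothetical.

```python
from collections import Counter


def nearest_centroid(segment, centroids):
    """Index of the centroid closest to `segment` (squared Euclidean distance)."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(segment, centroid))
    return min(range(len(centroids)), key=lambda i: dist2(centroids[i]))


def quantize(segments, centroids):
    """Replace each vertical segment with the number of its nearest centroid."""
    return [nearest_centroid(s, centroids) for s in segments]


def ngrams(symbols, n):
    """All contiguous n-grams of a centroid-number sequence, as tuples."""
    return [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]


def train_weights(training, n):
    """Assumed weighting: relative frequency of each n-gram within a script.

    `training` maps a script name to a list of centroid-number sequences
    taken from documents of that known script.
    """
    weights = {}
    for script, sequences in training.items():
        counts = Counter(g for seq in sequences for g in ngrams(seq, n))
        total = sum(counts.values())
        weights[script] = {g: c / total for g, c in counts.items()}
    return weights


def score_line(symbols, weights, n):
    """Sum the weights of the line's n-grams per script; unseen n-grams add 0."""
    grams = ngrams(symbols, n)
    return {
        script: sum(table.get(g, 0.0) for g in grams)
        for script, table in weights.items()
    }


def identify_script(segments, centroids, weights, n):
    """Quantize the segments, score against every script, return the best one."""
    scores = score_line(quantize(segments, centroids), weights, n)
    return max(scores, key=scores.get)
```

A call such as `identify_script(segments, centroids, train_weights(training, 2), 2)` returns the name of the best-scoring script. In practice the centroids would come from running k-means over segments of the known-script training documents, which this sketch leaves out.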