Authors: Shreeranjani Srirangamsridharan, Raghu Kiran Ganti, Mudhakar Srivatsa, Yeon-Sup Lim
DOI:
Keywords:
Abstract: Embodiments of the invention include methods, systems, and computer program products for document vectorization. Aspects include receiving, by a processor, a plurality of documents, each having a plurality of words. The processor, utilizing a vector embeddings engine, generates a vector to represent each of the plurality of words in the plurality of documents. An image representation is created for each document in the plurality of documents, and a word probability is generated for each of the plurality of words in the plurality of documents. A position for each word probability is determined in the image based on the vector associated with each word, and a compression operation is performed on the images to produce a compact representation for the plurality of documents.
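The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the claimed implementation; several details are assumptions made only for illustration. The function `embed` is a hypothetical stand-in for the vector embeddings engine (it hashes each word to a 2-D vector), the word probability is assumed to be the word's relative frequency within its document, the image is assumed to be an 8 x 8 grid, and the compression operation is assumed to be block sum-pooling. None of these choices is stated in the abstract.

```python
from collections import Counter
import hashlib

GRID = 8  # assumed image side length; the abstract does not specify one


def embed(word):
    """Hypothetical stand-in for a vector embeddings engine.

    Maps a word to a deterministic 2-D vector with components in [0, 1).
    A real system would use learned embeddings instead of a hash.
    """
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return (digest[0] / 256.0, digest[1] / 256.0)


def document_image(words, grid=GRID):
    """Build a grid x grid image whose cells hold word probabilities."""
    image = [[0.0] * grid for _ in range(grid)]
    if not words:
        return image  # an empty document yields an all-zero image
    total = len(words)
    for word, count in Counter(words).items():
        x, y = embed(word)
        # The word's vector decides where its probability is placed.
        row, col = int(y * grid), int(x * grid)
        # Assumed probability: relative frequency within the document.
        image[row][col] += count / total
    return image


def compress(image, factor=2):
    """Assumed compression: sum each factor x factor block into one cell."""
    size = len(image) // factor
    return [
        [
            sum(
                image[r * factor + i][c * factor + j]
                for i in range(factor)
                for j in range(factor)
            )
            for c in range(size)
        ]
        for r in range(size)
    ]


docs = ["the cat sat on the mat".split(), "dogs chase cats".split()]
images = [document_image(d) for d in docs]
compact = [compress(img) for img in images]
```

Under these assumptions each document becomes a 4 x 4 grid after compression, and because sum-pooling only merges neighbouring cells, the total probability mass of each document is preserved.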