Authors: Adam Schenker, Abraham Kandel, Horst Bunke
DOI:
Keywords:
Abstract: In this dissertation we introduce several novel techniques for performing data mining on web documents which utilize graph representations of document content. Graphs are more robust than typical vector representations, as they can model structural information that is usually lost when converting the original document content to a vector representation. For example, graphs can capture information such as the location, order and proximity of term occurrence, which is discarded under standard vector representation models. Many machine learning methods rely on distance computations, centroid calculations, and other numerical techniques. Thus many of these methods have not been applied to data represented by graphs, since no suitable graph-theoretical concepts were previously available. We introduce the Graph Hierarchy Construction Algorithm (GHCA), which performs topic-oriented hierarchical clustering of web search results modeled using graphs. The web search system created around this new algorithm and its prior version are compared with similar systems to gauge their usefulness. An important advantage of this approach over conventional web search systems is that the results are better organized and more easily browsed by users. Next we present extensions to classical machine learning algorithms, such as the k-means clustering algorithm and the k-Nearest Neighbors classification algorithm, which allow the use of graphs as fundamental data items instead of vectors. We perform experiments comparing the performance of the new graph-based methods with the traditional vector-based methods on three document collections. Our experimental results show an improvement for the graph approaches over the vector approaches for both clustering and classification of web documents. We also propose graph representations that allow the computation of graph similarity in polynomial time; in general, graph similarity determination is an NP-Complete problem. In fact, there were some cases where the execution time of the graph-oriented approach was faster than that of the vector approaches.