Web Page Similarity based on Size and Frequency of Tokens

Authors: Eun-Joo Lee, Woo-Sung Jung

DOI: 10.9716/KITS.2012.11.4.263

Keywords: Token frequency, Similarity (network science), Web page, Web application, Computer science, Data mining

Submitted: July 27, 2012. Revised: September 14, 2012. Accepted: October 3, 2012.
* This work was supported by the 2012 academic research support program of Chungbuk National University and by a Kyungpook National University academic research grant.
** School of Computer Science, College of IT
*** Department of Computer Engineering, College of Electronics and Information, corresponding author

Abstract: It is becoming hard to maintain web applications because of their high complexity and duplicated pages. However, most research on code clones focuses on hunks, and its targets are limited to a specific language. Thus, we propose GSIM, a language-independent statistical approach to detect similar pages based on the scarcity and frequency of customized tokens. The tokens, which are obtained by splitting pages on a set of given separators, are defined as the atomic elements for calculating the similarity between two pages. In this paper, the domain definition and the algorithms for collecting tokens and making matrices are given. We also conducted experiments on open source code for evaluation, using our GSIM tool. The results show the applicability of the proposed method and the effects of parameters such as threshold, toughness, and length on quality and performance.
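The abstract does not give GSIM's formulas, so the following is only a minimal sketch of the general idea it describes: split pages on separators, weight each token by its frequency within a page and its scarcity across pages, and report page pairs whose similarity reaches a threshold. Everything concrete here is an assumption rather than the authors' method: the separator set, the smoothed inverse-document-frequency weight standing in for "scarcity", the cosine measure, and the names `tokenize`, `token_weights`, `cosine`, and `similar_pairs`. The `min_length` parameter is a guess at what the "length" parameter refers to, and the "toughness" parameter is not modeled because the abstract does not define it.

```python
import math
import re
from collections import Counter

# Hypothetical separator set; the abstract only says the separators are "given".
SEPARATORS = r"[\s<>=/;(){}\[\],]+"


def tokenize(page, min_length=2):
    """Split a page on the separators and drop tokens shorter than min_length."""
    return [t for t in re.split(SEPARATORS, page) if len(t) >= min_length]


def token_weights(pages, min_length=2):
    """Weight each token by in-page frequency times cross-page scarcity."""
    counts = [Counter(tokenize(p, min_length)) for p in pages]
    n = len(pages)
    # Number of pages each token appears in.
    doc_freq = Counter(tok for c in counts for tok in c)
    # Smoothed inverse document frequency, used here as a stand-in for "scarcity".
    scarcity = {tok: math.log((1 + n) / (1 + df)) + 1.0
                for tok, df in doc_freq.items()}
    return [{tok: freq * scarcity[tok] for tok, freq in c.items()}
            for c in counts]


def cosine(a, b):
    """Cosine similarity between two sparse token-weight vectors."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_pairs(pages, threshold=0.8, min_length=2):
    """Return (i, j, similarity) for every page pair at or above the threshold."""
    weights = token_weights(pages, min_length)
    pairs = []
    for i in range(len(pages)):
        for j in range(i + 1, len(pages)):
            score = cosine(weights[i], weights[j])
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs
```

Because the tokenizer only splits on characters, the sketch does not depend on the language of the pages, which mirrors the language-independence claim in the abstract. Raising `threshold` reports fewer, closer pairs; raising `min_length` discards short tokens before weighting.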

References (19)
Giuseppe Scanniello, Andrea De Lucia, Genoveffa Tortora, Identifying Clones in Dynamic Web Sites Using Similarity Thresholds. International Conference on Enterprise Information Systems, pp. 391-396 (2004)
Giuseppe Scanniello, Rita Francese, Andrea De Lucia, Genoveffa Tortora, Identifying cloned navigational patterns in web applications. Journal of Web Engineering, vol. 5, pp. 150-174 (2006)
Yoshiki Higo, Toshihiro Kamiya, Shinji Kusumoto, Katsuro Inoue, ARIES: Refactoring support environment based on code clone analysis. IASTED Conference on Software Engineering and Applications, pp. 222-229 (2004)
Fabio Calefato, Filippo Lanubile, Teresa Mallardo, Function clone detection in web applications: a semiautomated approach. Journal of Web Engineering, vol. 3, pp. 3-21 (2004)
Suvda Myagmar, Shan Lu, Zhenmin Li, Yuanyuan Zhou, CP-Miner: a tool for finding copy-paste and related bugs in operating system code. Operating Systems Design and Implementation, p. 20 (2004)
G.A. Di Lucca, M. Di Penta, A.R. Fasolino, An approach to identify duplicated web pages. Computer Software and Applications Conference, pp. 481-486 (2002), DOI: 10.1109/CMPSAC.2002.1045051
V. I. Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, vol. 10, pp. 707-710 (1966)
A. De Lucia, R. Francese, G. Scanniello, G. Tortora, Understanding Cloned Patterns in Web Applications. 13th International Workshop on Program Comprehension (IWPC'05), pp. 333-336 (2005), DOI: 10.1109/WPC.2005.42
Miryung Kim, Vibha Sazawal, David Notkin, Gail Murphy, An empirical study of code clone genealogies. Foundations of Software Engineering, vol. 30, pp. 187-196 (2005), DOI: 10.1145/1081706.1081737
F. Ricca, P. Tonella, Using clustering to support the migration from static to dynamic web pages. Workshop on Program Comprehension, pp. 207-216 (2003), DOI: 10.1109/WPC.2003.1199204