An Algorithm for Differential File Comparison

作者: M. D. McIlroy , J. W. Hunt

DOI:

关键词: Imaginary lineLongest increasing subsequenceBinary search algorithmFile comparisonComputer scienceAlgorithmHash functionAppendLine (text file)Longest common subsequence problem

摘要: The program diff reports differences between two files, expressed as a minimal list of line changes to bring either file into agreement with the other. Diff has been engineered make efficient use time and space on typical inputs that arise in vetting version-to-version computer-maintained or computer-generated documents. Time usage are observed vary about sum lengths real data, although they known product worst case. central algorithm solves ‘longest common subsequence problem’ find lines do not change files. Practical efficiency is gained by attending only certain critical ‘candidate’ matches breaking which would shorten longest some pair initial segments Various techniques hashing, presorting equivalence classes, merging binary search, dynamic storage allocation used obtain good performance. [This document was scanned from Bell Laboratories Computing Science Technical Report #41, dated July 1976. Te xt converted OCR hand-corrected (last changed June, 2012). Figures were reconstructed. Some errors may remain, especially tables equations. Please report them doug@cs.dartmouth.edu.] creates what one have be it second vice versa. It based ideas several sources[1,2,7,8]. As an example its work, consider listed horizontally for brevity: b c d e f g w x y z easy see first can made following prescription, imaginary 0 understood at beginning each: append after 0: w, 3 through 4, were: into: z, delete 6 7, g. Going other way, this way: 1, was: 4 6, d, 7: Delete, operations available . indicates 1-letter

参考文章(6)
Daniel S. Hirschberg, The longest common subsequence problem. Princeton University. ,(1975)
James W. Hunt, Thomas G. Szymanski, A fast algorithm for computing longest common subsequences Communications of the ACM. ,vol. 20, pp. 350- 353 ,(1977) , 10.1145/359581.359603
D. Sankoff, Matching Sequences under Deletion/Insertion Constraints Proceedings of the National Academy of Sciences of the United States of America. ,vol. 69, pp. 4- 6 ,(1972) , 10.1073/PNAS.69.1.4
Michael L. Fredman, On computing the length of longest increasing subsequences Discrete Mathematics. ,vol. 11, pp. 29- 35 ,(1975) , 10.1016/0012-365X(75)90103-X
Saul B. Needleman, Christian D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins Journal of Molecular Biology. ,vol. 48, pp. 443- 453 ,(1970) , 10.1016/0022-2836(70)90057-4
D. S. Hirschberg, A linear space algorithm for computing maximal common subsequences Communications of The ACM. ,vol. 18, pp. 341- 343 ,(1975) , 10.1145/360825.360861