摘要: In laboratories the majority of large-scale DNA sequencing is done following theshotgun strategy, which to sequence large amount relatively short fragments randomly and then heuristically find a shortest common superstring [26]. We study mathematical frameworks, under plausible assumptions, suitable for massive automated analyzing algorithms. model problem as learning string from its drawn substrings. Under certain restrictions, this may be viewed in Valiant's distribution-free case we give an efficient algorithm quantitative bound on how many examples suffice. One major obstacle our approach turns out quite well-known open question approximate set strings, raised by number authors last 10 years [9], [29], [30]. firstprovably good approximates lengthn lengthO(n logn). The works equally well even presence negative examples, i.e., when merging some strings prohibited.