Abstract: While traditional document analysis has focused on printed media, an increasingly large portion of today's documents are generated in digital form from the start. Such “documents born digital” range from plain text, such as emails, to more sophisticated forms such as PDF and Web documents. On the one hand, the existence of an encoding eliminates the need for scanning, image processing, and character recognition in most situations (a notable exception being the prevalent use of embedded images in documents, elaborated upon in the section “Analysis of Text in Images”). On the other hand, many higher-level processing tasks remain, because the design purpose of almost all existing systems (e.g., HTML, PDF) is display or printing for human consumption, not machine-level information exchange and extraction. As such, a significant amount of processing is still required for automatic extraction, indexing, and content repurposing, and challenges exist in this process. This chapter describes in detail the key technologies for documents born digital, with a focus on processing.