DjVu (pronounced déjà vu) is a computer file format designed primarily to store scanned images, especially those containing text and line drawings. It uses technologies such as image layer separation of text and background/images, progressive loading, arithmetic coding, and lossy compression for bitonal (monochrome) images. This allows high-quality, readable images to be stored in a minimum of space, so that they can be made available on the web. DjVu has been promoted as an alternative to PDF, as it gives smaller files than PDF for most scanned documents. The DjVu developers report that color magazine pages compress to 40–70KB, black-and-white technical papers compress to 15–40KB, and ancient manuscripts compress to around 100KB; all of these are significantly smaller than the roughly 500KB typically required for a satisfactory JPEG image. Like PDF, DjVu can contain an OCRed text layer, making it easy to perform cut-and-paste and text search operations.
History
The DjVu technology was originally developed by Yann LeCun, Léon Bottou, Patrick Haffner, and Paul G. Howard at AT&T Laboratories in 1996. DjVu is a free file format: the file format specification is published, as is the source code for the reference library. The ownership rights to the commercial development of the encoding software have been transferred to different companies over the years, including AT&T and LizardTech. The original authors maintain a GPLed implementation named "DjVuLibre".
DjVu divides a single image into several separate images, then compresses them separately. To create a DjVu file, the initial image is first separated into three images: a background image, a foreground image, and a mask image. The background and foreground images are typically lower-resolution color images (e.g., 100dpi); the mask image is a high-resolution bilevel image (e.g., 300dpi) and is typically where the text is stored. The background and foreground images are then compressed using a wavelet-based compression algorithm named IW44. The mask image is compressed using a method called JB2 (similar to JBIG2). The JB2 encoding method identifies nearly identical shapes on the page, such as multiple occurrences of a particular character in a given font, style, and size. It compresses the bitmap of each unique shape separately, and then encodes the locations where each shape appears on the page. Thus, instead of compressing a letter "e" in a given font multiple times, it compresses the letter "e" once (as a compressed bit image) and then records every place on the page where it occurs.
In 2002 the DjVu file format was chosen by the Internet Archive as the format in which its Million Book Project provides scanned public domain books online (along with TIFF and PDF). The DjVu format will be used by the One Laptop per Child project to supply existing paper books easily in an eBook format.
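The shape-dictionary idea behind JB2 described above (store each distinct shape once, then record where it occurs) can be illustrated with a minimal Python sketch. This is not the JB2 codec: the function names `encode_shapes` and `decode_shapes` and the tiny 3x3 "glyphs" are hypothetical, the sketch matches only exactly identical bitmaps (JB2 also matches nearly identical ones), and it omits the arithmetic coding applied to the dictionary and positions.

```python
def encode_shapes(glyphs):
    """Split a page into a dictionary of unique shapes plus placements.

    glyphs: list of (bitmap, (x, y)), where bitmap is a tuple of
    equal-length strings of '0'/'1' (one string per pixel row).
    Assumption for this sketch: only exactly equal bitmaps are merged.
    """
    dictionary = []   # unique bitmaps, in order of first appearance
    index_of = {}     # bitmap -> its index in `dictionary`
    placements = []   # (shape index, x, y) for every occurrence
    for bitmap, (x, y) in glyphs:
        if bitmap not in index_of:
            index_of[bitmap] = len(dictionary)
            dictionary.append(bitmap)
        placements.append((index_of[bitmap], x, y))
    return dictionary, placements


def decode_shapes(dictionary, placements, width, height):
    """Rebuild the bilevel mask by stamping each shape at its positions."""
    page = [[0] * width for _ in range(height)]
    for idx, x, y in placements:
        for dy, row in enumerate(dictionary[idx]):
            for dx, bit in enumerate(row):
                if bit == "1":
                    page[y + dy][x + dx] = 1
    return page


# Two made-up 3x3 shapes; the first one occurs twice on the "page".
E = ("111", "110", "111")
T = ("111", "010", "010")
glyphs = [(E, (0, 0)), (T, (4, 0)), (E, (8, 0))]

dictionary, placements = encode_shapes(glyphs)
page = decode_shapes(dictionary, placements, width=11, height=3)
```

Here three occurrences on the page reduce to two stored bitmaps plus three small placement records; on a real page of text, where the same characters recur hundreds of times, this is where most of the saving in the mask layer comes from.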
The advantage of DjVu is that it is highly compressed and does not require any font support. [1]
Comparison with PDF
The primary difference between DjVu and PDF is that DjVu is a raster format, whereas PDF is primarily a vector format. This difference between the two formats has several consequences:
All this suggests that in the long run vector graphics will become the format of choice for the production of text documents by typesetting. On the other hand, for scanned media the following two options exist:
Roughly speaking, printed media content is a mix of text and graphics. To store the various types of scanned media, both PDF and DjVu employ a range of codecs. The simplest (and least efficient) way of storing scanned media is to treat both graphics and text as graphics. Historically, this was the first way scanned media was stored in PDF: for color and gray images the JPEG codec was used, while for bitonal (black-and-white) images one of the fax codecs was used, most notably CCITT3 and CCITT4. As a result, a typical PDF file size was several hundred kilobytes per page. It was around this time that DjVu was proposed. This new file format essentially combined two new codecs with a very simple file structure:
It is interesting to note that while Adobe Reader 5.0 was able to render JBIG2-encoded images, the encoder only appeared in Adobe Acrobat 6.0. This, along with other factors, led to the establishment of DjVu as the format of choice for storing scanned documents. At present, both PDF and DjVu have a similar arsenal for representing highly compressed images. Moreover, the codecs used are essentially the same. The difference for the end user thus comes from the differences between encoders. If one compares the JBIG2 encoder in Adobe Acrobat (in lossy mode) with the on-line service at [2], the general conclusion is that the DjVu file will be smaller, while the PDF file will have higher quality (will be more accurate).
Both formats define features that do not address the representation of the document's appearance but aim at creating a document delivery platform. Both DjVu files and PDF files can be enriched with text, a table of contents, hyperlinks, and metadata. PDF goes further by allowing sounds, interactive forms, and JavaScript programs. DjVu defines a protocol to transfer document pages on demand over the Internet. DjVu does not specify a way to certify the authenticity of a document or to define Digital Rights Management policies.
Relative advantages
With PDF documents one can zoom in on vector-based content to an arbitrary depth, or print it at an arbitrarily high resolution, without introducing the quality loss or jaggedness inherent to raster formats. But if a PDF is simply used as a container for non-vector images (such as scans), those images gain nothing from this. Also, a vector format can always be converted to a raster format, usually with irreversible loss of data, but conversion in the other direction is very difficult. PDF is most useful when the original source is an electronic document such as a Microsoft Word document or a TeX file. Such documents benefit most from the vector graphics technology that underlies PDF.
DjVu files can be marginally smaller, but they only deliver a high-resolution image, possibly enriched with the associated text. DjVu is very good for image files and has been optimized especially for scanned text and images. However, PDF could be better if the scanned raster images can be transformed into high-quality vector graphics, for instance by applying optical character recognition to the scanned image, identifying the fonts, and carefully proofreading the resulting file. This procedure often takes too much time. Suitable fonts might not be available, or one may want to preserve the original document more exactly, including signatures, marginal comments, paper texture, or other markings. In such cases, DjVu is the better choice.
Other compression methods
At present, the most advanced method for compressing scanned bitonal documents seems to be Cartesian Perceptual Compression. Its size/quality ratios are unmatched by either DjVu or PDF. However, this compression format enjoys limited popularity because it is a closed file format and codec protected by a US patent.
This article is from Wikipedia. All text is available under the terms of the GNU Free Documentation License.