Researching information on the formatting, imposition and pagination of content for my 'What Is Design for Print?' ISSUU-based print handbook publication. All of this information, combined with my own knowledge from this design module, will be utilised and rewritten in my own words for the final design outcome. All developments and design of the publication can be found on my Design Practice blog in the forthcoming weeks.
http://en.wikipedia.org/wiki/Imposition
Imposition
Imposition is one of the fundamental steps in the prepress printing process. It consists of arranging the printed product's pages on the printer's sheet in order to obtain faster printing, simplified binding and less paper waste.
Correct imposition minimizes printing time by maximizing the number
of pages per impression, reducing cost of press time and materials. To
achieve this, the printed sheet must be filled as fully as possible.
Imposition is affected by five different parameters:
- Format of the product: The size of the finished page determines how many pages can be printed on a single sheet.
- Number of pages of the printed product: The compositor must determine how many sheets are to be printed to create a finished book.
- Stitching/binding method: The compositor must understand how the sheets are placed to form the signatures that compose the finished book.
- Paper fiber direction: Many papers have a "grain," reflecting the alignment of the paper fibers. Because these fibers must run lengthwise along the fold, the grain influences the orientation, and hence the position, of the pages on the printed sheet.
- Finishing and binding
To understand how the pages are related to each other, an imposition dummy may be used. This is made by folding several sheets of paper in the way the press will print and fold the product. The result is a small mock-up, which can help when paginating the product.
As an example, consider a 16-page book prepared for printing. There are eight pages on the front of the sheet, and the corresponding eight pages on the back. After printing, the paper is folded in half vertically (page two falls against page three). Then it is folded again horizontally (page four meets page five). A third fold completes the process (page nine meets page eight), giving the final result prior to binding and trimming.
Imposition has been a requirement since the earliest days of printing. When pages were set using movable type, pages were assembled in a metal frame called a chase, and locked into place using wedges called quoins.
By the late twentieth century, most typesetting was onto photographic film. These sheets were combined manually on a light table, in a process called stripping. Skilled workers would spend many hours stripping
pieces of film together in the correct sequence and orientation. The
term stripping was also used for other changes to a prepared page, such
as a spelling correction, or a "stop press" story in a newspaper.
Digital techniques rendered stripping less necessary, but what has
forced increasing numbers to abandon it completely is the introduction
of "platesetters", which put pages directly onto printing plates; these
plates cannot be adjusted with a sharp knife. In addition, an extremely
high precision would be needed for stripping of colour work, as each ink
colour is on a separate piece of film.
Digital techniques
Manual imposition processes tend to create bottlenecks in the overall printing production. The advent of digital imposition has not only helped ensure that layout and sheet arrangement are correct, with higher register precision, but has also greatly reduced common imposition errors (e.g., slight movements of register due to parallax). An entire book may be imposed and many complex functions applied in an instant. Binding options may be changed on the fly and impositions produced on multiple output devices at once, often with no user intervention. In turn, digital techniques help to reduce material costs and time and resolve production bottlenecks. There are several different approaches to digital imposition.
- Imposition in the design application. Software packages that can be used to design single pages can often be used to design entire printed sheets, sometimes by a process as simple as copy/paste onto a larger sheet. This is still in use, especially for low volumes of work, but a popular alternative is an imposition function built in, or added in, to the design tool. This typically takes a document prepared as single pages and creates a new document with full-sheet layouts. This larger layout is then printed to film or a plate.
- Post-design imposition. A post-design application might take a PostScript or PDF file in single pages and produce a new PostScript or PDF file with imposed sheet layouts for printing. A variation of this is to take a large number of single-page source files as input. This is especially suitable for a magazine or newspaper, where pages may be worked on by different groups simultaneously.
- Print driver imposition. Some printer drivers enable the source application's single-page printed output to be sent to the printer as full sheets. This is not often found in professional production, but is popular for such things as booklet printing on office laser printers. A variation of this offers the ability to print layouts as an option in the application.
- Output device imposition. This is sometimes called "in-RIP imposition". This allows regular pages to be printed by any suitable means, and the output device handles imposition. While this offers the advantage of enabling specific tuning of the imposition for an output device, the cost is that there is no preview until the output is produced. This may mean a costly printing plate that takes some time to produce, or even (with a digital press) errors in finished copies: expensive mistakes are possible.
Where an imposition layout is viewed on screen, it may be referred to as a printer's spread. This is used to contrast with reader's spread,
which shows a finished printed piece on screen as it will appear to the
reader, rather than the printer; specifically, in a reader's spread for
a typical book, pairs of facing pages are shown side-by-side (that is,
pages 2 and 3 together).
Imposition proof
The imposition proof is the last test that is performed before beginning the print run.
This test is performed to verify, through the formation of a
prototype, that the imposition was successful. Typical checks are that
the pages are on the correct spot and the crossover bleeds work. It
cannot be used as a check proof for images or colors or layout because
it is printed on a large, low-resolution inkjet printer.
Since the inkjet printer can print on only one side of the paper, the
full proof (the front and rear sides) is printed on two separate
sheets. They are first cut along the crossover bleeds, checking to see
if they are in the correct position. The two sheets are then attached
together to form a single sheet printed on both sides, and then this
sheet is folded to form a prototype of the signature.
This proof is still known as a blue copy or blueline, today often produced as a digital plotter proof.
Impose a document for booklet printing
The
Print Booklet feature lets you create printer spreads for
professional printing. For example, if you’re editing an 8-page
booklet, the pages appear in sequential order in the layout window.
However, in printer spreads, page 2 is positioned next to page 7,
so that when the two pages are printed on the same sheet, folded, and
collated, the pages end up in the appropriate order.
The process of creating printer spreads
from layout spreads is called imposition. While imposing
pages, you can change settings to adjust spacing between pages,
margins, bleed, and creep. The layout of your InDesign document
is not affected, because the imposition is all handled in the print
stream. No pages are shuffled or rotated in the document.
Pages appear in sequence in the layout window, but are printed
in a different order so that they appear correct when folded and
bound.
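To make the printer-spread pairing concrete, here is a small Python sketch (my own illustration, not part of the source article) that computes which pages share a sheet side in a saddle-stitched booklet:

```python
def printer_spreads(page_count):
    """Pair pages into printer spreads for saddle-stitched booklet printing.

    Returns a list of (left, right) page pairs; each pair is printed side
    by side on one half of a sheet, so that folding and collating the
    sheets puts the pages back into reading order.
    """
    if page_count % 4:
        raise ValueError("saddle-stitched booklets need a multiple of 4 pages")
    spreads = []
    lo, hi = 1, page_count
    while lo < hi:
        spreads.append((hi, lo))          # front of the sheet: last page left of first
        spreads.append((lo + 1, hi - 1))  # back of the sheet: the next pair inward
        lo += 2
        hi -= 2
    return spreads

print(printer_spreads(8))
```

For an 8-page booklet this yields [(8, 1), (2, 7), (6, 3), (4, 5)]: page 2 sits next to page 7, exactly as described above.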
Pagination
Pagination is the process of dividing information (content) into discrete pages, either electronic pages or printed pages. Today the latter are usually instances of the former that have been output to a printing device, such as a desktop printer or a modern printing press. For example, printed books and magazines are created first as electronic files (for example, PDF or QXD files) and then printed. Pagination encompasses rules and algorithms for deciding where page breaks
will fall, which depends on semantic or cultural senses of which
content belongs on the same page with related content and thus should
not fall to another (e.g., widows and orphans). Pagination is sometimes a part of page layout,
and other times is merely a process of arbitrary fragmentation. The
difference is in the degree of intelligence that is required to produce
an output that the users deem acceptable or desirable. Before the rise
of information technology
(IT), pagination was a manual process, and print output was its sole
purpose. Every instance of a pagination decision was made by a human.
Today, most instances are made by machines, although humans often
override particular decisions. As years go by, software developers
continually refine the programs to increase the quality of the
machine-made decisions (make them "smarter") so that the need for manual
overrides becomes ever rarer.
In reference to books made in the pre-IT era, in a strict sense of
the word, pagination can mean the consecutive numbering to indicate the
proper order of the pages, which was rarely found in documents
pre-dating 1500, and only became common practice circa 1550, when it replaced foliation, which numbered only the front sides of folios.
Pagination in word processing, desktop publishing, digital typesetting
Word processing, desktop publishing, and digital typesetting
are technologies built on the idea of print as the intended final
output medium, although nowadays it is understood that plenty of the
content produced through these pathways will be viewed onscreen by most
users rather than being printed on paper.
All of these software tools are capable of flowing the content
through algorithms to decide the pagination. For example, they all
include automated word wrapping (to make line-ending decisions), machine-readable
paragraphing (to make paragraph-ending decisions), and automated
pagination (to make page-breaking decisions). All of those automated
capabilities can be manually overridden by the human user, via manual
line breaks (that is, forced soft returns), hard returns, and manual page breaks.
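As an illustration of these automated decisions, here is a minimal Python sketch (my own, and far simpler than what real typesetting engines do) of greedy word wrapping and fixed-height pagination:

```python
def wrap_lines(words, width):
    """Greedy word wrapping: fill each line with as many words as fit."""
    lines, current = [], ""
    for word in words:
        candidate = word if not current else current + " " + word
        if len(candidate) <= width:
            current = candidate
        else:
            lines.append(current)  # line is full; start a new one
            current = word
    if current:
        lines.append(current)
    return lines

def paginate(lines, lines_per_page):
    """Automated pagination: break a list of lines into fixed-size pages."""
    return [lines[i:i + lines_per_page]
            for i in range(0, len(lines), lines_per_page)]

print(wrap_lines("the quick brown fox".split(), 10))
```

A real pagination engine layers more intelligence on top of this, such as keeping widows and orphans off page boundaries.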
Pagination in Web content (HTML, ASP, PHP, and others)
On the Internet, pagination is used for such things as displaying a limited number of results on search engine results pages, or showing a limited number of posts when viewing a forum thread. Pagination is used in some form in almost every web application to divide returned data and display it on multiple pages. Pagination also includes the logic of preparing and displaying the links to the various pages.
Pagination can be handled client-side or server-side. Server-side pagination is more common. Client-side pagination can be used when there are very few records to be accessed, in which case all records can be returned and the client can use JavaScript to view the separate pages. By using AJAX, hybrid server/client-side pagination can be used, in which JavaScript requests the subsequent page, which is then loaded and inserted into the Document Object Model.
Server-side pagination is appropriate for large data sets, providing faster initial page load, accessibility for those not running JavaScript, and support for complex view business logic. Correctly implementing pagination can be difficult. There are many different usability questions, such as whether "previous" and "next" links should be included, how many links to pages should be displayed, and whether there should be a link to the first and last pages. The ability to choose the number of records displayed on a single page is also useful.
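A minimal sketch of the server-side arithmetic involved (my own illustration; real implementations add caching, sorting and view logic on top):

```python
import math

def page_slice(total_records, page_size, page_number):
    """Compute the OFFSET/LIMIT window for one page of results."""
    page_count = max(1, math.ceil(total_records / page_size))
    page_number = min(max(1, page_number), page_count)  # clamp out-of-range requests
    offset = (page_number - 1) * page_size
    limit = min(page_size, total_records - offset)
    return offset, limit, page_count

def page_links(current, page_count, window=2):
    """Decide which numbered page links to display around the current page."""
    start = max(1, current - window)
    end = min(page_count, current + window)
    return list(range(start, end + 1))

print(page_slice(53, 10, 6))   # last page holds the remaining 3 records
print(page_links(3, 6))
```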
Separation of presentation and content and its effect on how we classify presentation media
Today, all content, no matter which output medium is planned,
predicted, or not predicted, can be produced with technologies that
allow downstream transformations into any presentation desired, although
such best-practice preparation is still far from universal. This
usually involves a markup language (such as XML, HTML, or SGML) that tags the content semantically and machine-readably, which allows downstream technologies (such as XSLT, XSL, or CSS) to output them into whatever presentation is desired. This concept is known as the separation of presentation and content. In this paradigm, which is now the conventional one in most commercial publishing,
it is no longer possible to make a hierarchical distinction between
pagination in the print medium and pagination in the electronic medium,
because print is merely an instance of presentation of the same
underlying content.
A file format is a particular way that information is encoded for storage in a computer file.
Since a disk drive, or indeed any computer storage, can store only bits, the computer must have some way of converting information
to 0s and 1s and vice-versa. There are different kinds of formats for
different kinds of information. Within any format type, e.g., word processor documents, there will typically be several different formats. Sometimes these formats compete with each other.
File formats can be divided into proprietary and open formats.
Generality
Some file formats are designed for very particular sorts of data: PNG files, for example, store bitmapped images using lossless data compression. Other file formats, however, are designed for storage of several different types of data: the Ogg format can act as a container for many different types of multimedia, including any combination of audio and/or video, with or without text (such as subtitles), and metadata. A text file can contain any stream of characters, encoded for example as ASCII or Unicode, including possible control characters. Some file formats, such as HTML, Scalable Vector Graphics and the source code of computer software, are also text files with defined syntaxes that allow them to be used for specific purposes.
Specifications
Many file formats, including some of the most well-known file formats, have a published specification document (often with a reference implementation) that describes exactly how the data is to be encoded, and which can be used to determine whether or not a particular program
treats a particular file format correctly. There are, however, two
reasons why this is not always the case. First, some file format
developers view their specification documents as trade secrets,
and therefore do not release them to the public. Second, some file
format developers never spend time writing a separate specification
document; rather, the format is defined only implicitly, through the
program(s) that manipulate data in the format.
Using file formats without a publicly available specification can be costly. Learning how the format works will require either reverse engineering
it from a reference implementation or acquiring the specification
document for a fee from the format developers. This second approach is
possible only when there is a specification document, and typically requires the signing of a non-disclosure agreement.
Both strategies require significant time, money, or both. Therefore, as
a general rule, file formats with publicly available specifications are
supported by a large number of programs, while non-public formats are
supported by only a few programs.
Patent law, rather than copyright,
is more often used to protect a file format. Although patents for file
formats are not directly permitted under US law, some formats require
the encoding of data with patented algorithms.
For example, using compression with the GIF file format requires the
use of a patented algorithm, and although initially the patent owner did
not enforce it, they later began collecting fees for use of the
algorithm. This has resulted in a significant decrease in the use of GIFs, and is partly responsible for the development of the alternative PNG format. However, the patent expired in the US in mid-2003, and worldwide in mid-2004.
Algorithms are usually held not to be patentable under current European
law, which also includes a provision that members "shall ensure that,
wherever the use of a patented technique is needed for a significant
purpose such as ensuring conversion of the conventions used in two
different computer systems or networks so as to allow communication and
exchange of data content between them, such use is not considered to be a
patent infringement", which would apparently allow implementation of a
patented file system where necessary to allow two different computers to
interoperate.
Identifying the type of a file
A method is required to determine the format of a particular file within the filesystem—an example of metadata. Different operating systems have traditionally taken different approaches to this problem, with each approach having its own advantages and disadvantages.
Of course, most modern operating systems, and individual
applications, need to use all of these approaches to process various
files, at least to be able to read 'foreign' file formats, if not work
with them completely.
Filename extension
One popular method in use by several operating systems, including Windows, Mac OS X, CP/M, DOS, VMS, and VM/CMS,
is to determine the format of a file based on the section of its name
following the final period. This portion of the filename is known as the
filename extension. For example, HTML documents are identified by names that end with .htm (or .html), and GIF images by .gif. In the original FAT filesystem, filenames were limited to an eight-character identifier and a three-character extension, a convention known as an 8.3 filename.
Many formats thus still use three-character extensions, even though
modern operating systems and application programs no longer have this
limitation. Since there is no standard list of extensions, more than one
format can use the same extension, which can confuse the operating
system and consequently users.
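In Python, for instance, the standard library extracts the extension in exactly this way; a quick sketch:

```python
import os.path

def extension_of(filename):
    """Return the lowercased final filename extension, or '' if there is none."""
    return os.path.splitext(filename)[1].lower()

print(extension_of("report.HTML"))     # .html
print(extension_of("archive.tar.gz"))  # .gz — only the text after the final period counts
print(extension_of("README"))          # empty string: no extension at all
```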
One artifact of this approach is that the system can easily be
tricked into treating a file as a different format simply by renaming
it—an HTML file can, for instance, be easily treated as plain text by
renaming it from filename.html to filename.txt.
Although this strategy was useful to expert users who could easily
understand and manipulate this information, it was frequently confusing
to less technical users, who might accidentally make a file unusable (or
'lose' it) by renaming it incorrectly.
This led more recent operating system shells, such as Windows 95 and Mac OS X,
to hide the extension when displaying lists of recognized files. This
separates the user from the complete filename, preventing the accidental
changing of a file type, while allowing expert users to still retain
the original functionality through enabling the displaying of file
extensions.
A downside of hiding the extension is that it then becomes possible
to have what appear to be two or more identical filenames in the same
folder. This is especially true when image files are needed in more than
one format for different applications. For example, a company logo may
be needed both in .tif format (for publishing) and .gif format (for web sites). With the extensions visible, these would appear as the unique filenames "CompanyLogo.tif" and "CompanyLogo.gif". With the extensions hidden, these would both appear to have the identical filename "CompanyLogo", making it more difficult to determine which to select for a particular application.
A further downside is that hiding such information can become a security risk. On a system that relies on filename extensions, all usable files will have such an extension (for example, all JPEG images will have ".jpg" or ".jpeg" at the end of their name), so seeing file extensions is a common occurrence, and users may depend on them when looking for a file's format. With file extensions hidden, a malicious user
can create an executable program with an innocent name such as "Holiday photo.jpg.exe". In this case the ".exe" will be hidden and the user will see the file as "Holiday photo.jpg", which appears to be a JPEG image, seemingly unable to harm the machine except through bugs in the application used to view it. However, the operating system will still see the ".exe" extension and thus will run the program, which is then able to cause harm and presents a security issue.
To further trick users, it is possible to store an icon inside the program, as is done on Microsoft Windows, in which case the operating system's icon assignment can be overridden with an icon commonly used to represent JPEG images, making the program look like an image until it is opened. This issue requires users with extensions hidden to be vigilant, and never to open files which seem to have a known extension displayed despite the hiding option being enabled (since such a file must therefore have two extensions, the real one being unknown until hiding is disabled). This presents a practical problem for Windows systems, where extension hiding is turned on by default.
Internal metadata
A second way to identify a file format is to store information regarding the format inside the file itself. Usually, such information is written in one (or more) binary strings, or as tagged or raw text, placed at fixed, specific locations within the file. Since the easiest place to locate such information is at the beginning of the file, that area is usually called a file header when it is longer than a few bytes, or a magic number if it is just a few bytes long.
File header
The metadata contained in a file header are not necessarily stored only at the beginning of the file; they might be present in other areas too, often including the end of the file, depending on the file format or the type of data it contains.
Character-based (text) files have character-based human-readable headers, whereas binary formats usually feature binary headers, although that is not a rule: a human-readable file header may require more bytes, but is easily discernible with simple text or hexadecimal editors. File headers may not only contain the information required by algorithms to identify the file format, but also real metadata about the file and its contents. For example, most image file formats store information about image size, resolution, colour space/format, and optionally other authoring information, such as who made the image, when and where it was made, and what camera model and shooting parameters it was taken with (cf. Exif),
and so on. Such metadata may be used by a program reading or
interpreting the file both during the loading process and after that,
but can also be used by the operating system to quickly capture
information about the file itself without loading it all into memory.
File headers have at least two downsides as a file-format identification method. First, at least a few (initial) blocks of the file need to be read in order to obtain the information; those blocks could be fragmented across different locations of the same storage medium, requiring more seek and I/O time, which is particularly bad when identifying large quantities of files at once (as when a GUI browses a folder with thousands of files and must render file icons or thumbnails for all of them). Second, if the header is binary hard-coded (i.e. the header itself requires non-trivial interpretation in order to be recognized), especially for the sake of protecting the metadata content, there is some risk that the file format is misinterpreted at first sight, or even badly written at the source, often resulting in corrupt metadata (which, in extremely pathological cases, might even render the file unreadable).
A more logically sophisticated example of file header is that used in wrapper (or container) file formats.
Magic number
One way to incorporate such metadata, often associated with Unix
and its derivatives, is just to store a "magic number" inside the file
itself. Originally, this term was used for a specific set of 2-byte
identifiers at the beginning of a file, but since any undecoded binary
sequence can be regarded as a number, any feature of a file format which
uniquely distinguishes it can be used for identification. GIF images, for instance, always begin with the ASCII representation of either GIF87a or GIF89a,
depending upon the standard to which they adhere. Many file types, most
especially plain-text files, are harder to spot by this method. HTML
files, for example, might begin with the string <html> (which is not case sensitive), or an appropriate document type definition that starts with <!DOCTYPE, or, for XHTML, the XML identifier, which begins with <?xml. The files can also begin with HTML comments, random text, or several empty lines, but still be usable HTML.
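A small Python sketch of such a check, using the signatures mentioned here plus PNG's and PDF's (the byte strings are those formats' published signatures; the rest is my own illustration):

```python
# A tiny magic-number database; real tools such as file(1) carry thousands of entries.
MAGIC = [
    (b"GIF87a", "GIF (1987 standard)"),
    (b"GIF89a", "GIF (1989 standard)"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"%PDF-", "PDF"),
]

def sniff(first_bytes):
    """Identify a format by its magic number, or None if unrecognized."""
    for signature, name in MAGIC:
        if first_bytes.startswith(signature):
            return name
    return None

print(sniff(b"GIF89a\x00\x01"))  # GIF (1989 standard)
```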
The magic number approach offers better guarantees that the format
will be identified correctly, and can often determine more precise
information about the file. Since reasonably reliable "magic number"
tests can be fairly complex, and each file must effectively be tested
against every possibility in the magic database, this approach is
relatively inefficient, especially for displaying large lists of files
(in contrast, filename and metadata-based methods need check only one
piece of data, and match it against a sorted index). Also, data must be
read from the file itself, increasing latency as opposed to metadata
stored in the directory. Where filetypes don't lend themselves to
recognition in this way, the system must fall back to metadata. It is,
however, the best way for a program to check if a file it has been told
to process is of the correct format: while the file's name or metadata
may be altered independently of its content, failing a well-designed
magic number test is a pretty sure sign that the file is either corrupt
or of the wrong type. On the other hand a valid magic number does not
guarantee that the file is not corrupt or of a wrong type.
So-called shebang lines in script files are a special case of magic numbers. Here, the magic number is human-readable text that identifies a specific command interpreter and options to be passed to the command interpreter.
Another operating system using magic numbers is AmigaOS, where magic numbers were called "Magic Cookies" and were adopted as a standard system to recognize executables in the Hunk executable file format, and also to let individual programs, tools and utilities deal automatically with their saved data files, or any other kind of file type, when saving and loading data. This system was later enhanced with the Amiga standard Datatype recognition system. Another method was the FourCC method, originating in OSType on the Macintosh and later adapted by the Interchange File Format (IFF) and derivatives.
External metadata
A final way of storing the format of a file is to explicitly store
information about the format in the file system, rather than within the
file itself.
This approach keeps the metadata separate from both the main data and the name, but is also less portable
than either file extensions or "magic numbers", since the format has to
be converted from filesystem to filesystem. While this is also true to
an extent with filename extensions — for instance, for compatibility
with MS-DOS's
three character limit — most forms of storage have a roughly equivalent
definition of a file's data and name, but may have varying or no
representation of further metadata.
Note that zip files and other archive files solve the problem of handling metadata. A utility program collects multiple files together, along with metadata about each file and the folders/directories they came from, all within one new file (e.g. a zip file with the extension .zip). The new file is also compressed and possibly encrypted, but is now transmissible as a single file across operating systems, by FTP or attached to email. At the destination, it must be unzipped by a compatible utility to be useful, but the problems of transmission are solved this way.
Mac OS type-codes
The Mac OS' Hierarchical File System stores codes for creator and type as part of the directory entry for each file. These codes are referred to as OSTypes, and for instance a HyperCard "stack" file has a creator of WILD (from Hypercard's previous name, "WildCard") and a type of STAK.
The type code specifies the format of the file, while the creator code
specifies the default program to open it with when double-clicked by the
user. For example, the user could have several text files all with the
type code of TEXT, but which each open in a different program, due to having differing creator codes. RISC OS uses a similar system, consisting of a 12-bit number which can be looked up in a table of descriptions — e.g. the hexadecimal number FF5 is "aliased" to PoScript, representing a PostScript file.
Mac OS X Uniform Type Identifiers (UTIs)
A Uniform Type Identifier (UTI) is a method used in Mac OS X for uniquely identifying "typed" classes of entity, such as file formats. It was developed by Apple as a replacement for OSType (type & creator codes).
The UTI is a Core Foundation string, which uses a reverse-DNS string. Some common and standard types use a domain called public (e.g. public.png for a Portable Network Graphics image), while other domains can be used for third-party types (e.g. com.adobe.pdf for Portable Document Format). UTIs can be defined within a hierarchical structure, known as a conformance hierarchy. Thus, public.png conforms to a supertype of public.image, which itself conforms to a supertype of public.data. A UTI can exist in multiple hierarchies, which provides great flexibility.
In addition to file formats, UTIs can also be used for other entities which can exist in OS X, including:
- Pasteboard data
- Folders (directories)
- Translatable types (as handled by the Translation Manager)
- Bundles
- Frameworks
- Streaming data
- Aliases and symlinks
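The conformance hierarchy lends itself to a simple transitive lookup; a minimal sketch (my own illustration, using a hard-coded fragment of the hierarchy rather than the real OS X API):

```python
# A fragment of the conformance hierarchy described above.
CONFORMS_TO = {
    "public.png": ["public.image"],
    "public.image": ["public.data"],
    "public.data": [],
}

def conforms(uti, supertype):
    """True if `uti` conforms to `supertype`, directly or transitively."""
    if uti == supertype:
        return True
    return any(conforms(parent, supertype)
               for parent in CONFORMS_TO.get(uti, []))

print(conforms("public.png", "public.data"))  # True: png -> image -> data
```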
OS/2 Extended Attributes
The HPFS, FAT12 and FAT16
(but not FAT32) filesystems allow the storage of "extended attributes"
with files. These comprise an arbitrary set of triplets with a name, a
coded type for the value and a value, where the names are unique and
values can be up to 64 KB long. There are standardized meanings for
certain types and names (under OS/2). One such is that the ".TYPE"
extended attribute is used to determine the file type. Its value
comprises a list of one or more file types associated with the file,
each of which is a string, such as "Plain Text" or "HTML document". Thus
a file may have several types.
The NTFS filesystem also allows OS/2 extended attributes to be stored, as one of the file's forks, but this feature is merely present to support the OS/2 subsystem (not present in XP), so the Win32 subsystem treats this information as an opaque block of data and does not use it. Instead, it relies on other file forks to store meta-information in Win32-specific formats. OS/2 extended attributes can still be read and written by Win32 programs, but the data must be entirely parsed by applications.
POSIX extended attributes
On Unix and Unix-like systems, the ext2, ext3, ReiserFS version 3, XFS, JFS, FFS, and HFS+
filesystems allow the storage of extended attributes with files. These
include an arbitrary list of "name=value" strings, where the names are
unique and a value can be accessed through its related name.
PRONOM Unique Identifiers (PUIDs)
The PRONOM Persistent Unique Identifier (PUID) is an extensible scheme of persistent, unique and unambiguous identifiers for file formats, which has been developed by The National Archives of the UK as part of its PRONOM technical registry service. PUIDs can be expressed as Uniform Resource Identifiers using the info:pronom/ namespace. Although not yet widely used outside of UK government and some digital preservation programmes, the PUID scheme does provide greater granularity than most alternative schemes.
MIME types
MIME types are widely used in many Internet-related
applications, and increasingly elsewhere, although their usage for
on-disc type information is rare. These consist of a standardised system
of identifiers (managed by IANA) consisting of a type and a sub-type, separated by a slash — for instance, text/html or image/gif. These were originally intended as a way of identifying what type of file was attached to an e-mail, independent of the source and target operating systems. MIME types identify files on BeOS, AmigaOS 4.0 and MorphOS,
as well as store unique application signatures for application
launching. In AmigaOS and MorphOS the MIME type system works in parallel
with the Amiga-specific Datatype system.
There are problems with the MIME types though; several organisations
and people have created their own MIME types without registering them
properly with IANA, which makes the use of this standard awkward in some
cases.
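The type/sub-type structure described above can be seen with Python's standard `mimetypes` module, which maps filenames to registered MIME types from a built-in table; an extension it does not know (here the made-up `.madeup`) yields no type at all, illustrating the gap left by unregistered, non-IANA types:

```python
import mimetypes

# A registered extension maps to its "type/subtype" identifier.
mtype, _encoding = mimetypes.guess_type("index.html")

# An unregistered extension has no standard MIME type.
unknown, _ = mimetypes.guess_type("data.madeup")
```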
File format identifiers (FFIDs)
File format identifiers are another, less widely used way to identify
file formats according to their origin and their file category. The scheme was
created for the Description Explorer suite of software. An FFID takes
the form NNNNNNNNN-XX-YYYYYYY. The first part indicates the
organisation that originates or maintains the format (this number
represents a value in a company/standards-organisation database), and the two
following hexadecimal digits categorise the type of file. The final
part is the usual file extension of the file or the
international standard number of the file, padded left with zeros. For
example, the PNG file specification has the FFID 000000001-31-0015948, where 31 indicates an image file, 0015948 is the standard number and 000000001 indicates the ISO organisation.
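Given the three-field layout just described, an FFID can be decomposed mechanically. The following sketch (an illustrative helper, not part of the Description Explorer software) splits the PNG example into its organisation number, hexadecimal category code, and zero-padded standard number:

```python
def parse_ffid(ffid):
    """Split an FFID of the form NNNNNNNNN-XX-YYYYYYY into its
    organisation number, hex category code, and identifier."""
    org, category, ident = ffid.split("-")
    return {
        "organisation": int(org),         # e.g. 1 = ISO
        "category": int(category, 16),    # hexadecimal file-type code
        "identifier": ident.lstrip("0"),  # extension or standard number
    }

png = parse_ffid("000000001-31-0015948")
```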
File content based format identification
Another, less popular way to identify a file format is to look
at the file contents for patterns that distinguish file types. File
contents are a sequence of bytes, and a byte has 256
possible values (0–255). Counting how often each byte value occurs,
commonly called the byte frequency distribution, therefore yields a
pattern that can distinguish file types. Many content-based
file type identification schemes use byte frequency
distributions to build representative models of each file type and
apply statistical or data-mining techniques to classify files.
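The byte frequency distribution mentioned above is straightforward to compute: count the occurrences of each of the 256 byte values and normalise by the file length, producing the feature vector these identification schemes feed into their statistical models. A minimal sketch:

```python
from collections import Counter

def byte_frequency(data):
    """Return a 256-entry normalised byte frequency distribution,
    the feature vector used by content-based identification schemes."""
    counts = Counter(data)
    total = len(data) or 1
    return [counts.get(b, 0) / total for b in range(256)]

# Plain ASCII text concentrates its mass in the printable byte range,
# a pattern that distinguishes it from, say, compressed or image data.
dist = byte_frequency(b"hello world")
```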
File structure
There are several ways to structure data in a file. The most common ones are described below.
Unstructured formats (raw memory dumps)
Early file formats used raw data layouts, produced by directly
dumping the memory images of one or more structures into the file.
This has several drawbacks. Unless the memory images also have
reserved space for future extensions, extending and improving this type
of file is very difficult. It also creates files that may
be specific to one platform or programming language (for example, a
structure containing a Pascal string is not recognized as such in C). On the other hand, developing tools for reading and writing these types of files is very simple.
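A raw dump format of this kind amounts to writing a fixed memory layout to disk byte-for-byte. The sketch below uses Python's `struct` module to pack and unpack an invented record (two 32-bit integers and an 8-byte name field; the layout and field names are illustrative, not from any real format), showing both how simple the tooling is and how rigid the layout becomes:

```python
import struct

# A fixed-layout record dumped to disk byte-for-byte: two little-endian
# 32-bit integers followed by an 8-byte name field. Any change to this
# layout breaks every existing file, illustrating the format's rigidity.
RECORD = struct.Struct("<ii8s")

blob = RECORD.pack(640, 480, b"screen\x00\x00")   # "write" the dump
width, height, name = RECORD.unpack(blob)         # "read" it back
```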
The limitations of the unstructured formats led to the development of
other types of file formats that could be easily extended and be
backward compatible at the same time.
Chunk-based formats
Electronic Arts and Commodore-Amiga pioneered this structure in 1985 with their IFF (Interchange File Format).
In this kind of file structure, each piece of data is
embedded in a container that carries a signature identifying the data,
as well as the length of the data (for binary encoded files). This type of
container is called a "chunk". The signature is usually called a chunk id, chunk identifier, or tag identifier.
With this type of file structure, tools that encounter chunk
identifiers they do not recognize can simply skip them.
This concept has been reused repeatedly by RIFF (the Microsoft-IBM equivalent of IFF), PNG, JPEG storage, DER (Distinguished Encoding Rules) encoded streams and files (which were originally described in CCITT X.409:1984 and therefore predate IFF), and Structured Data Exchange Format (SDXF). Even XML
can be considered a kind of chunk-based format, since each data element
is surrounded by tags which are akin to chunk identifiers.
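The skip-what-you-don't-know property follows directly from the tag-plus-length layout: a reader can walk the file chunk by chunk without understanding any of them. The sketch below reads a simplified IFF-style stream (4-byte tag followed by a 32-bit big-endian length and the payload; real IFF additionally pads odd-length chunks, which is omitted here for brevity):

```python
import io
import struct

def read_chunks(stream):
    """Iterate over simplified IFF-style chunks: a 4-byte tag, a
    32-bit big-endian length, then the payload. A consumer can skip
    any tag it does not recognize, which makes the format extensible."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return
        tag, size = struct.unpack(">4sI", header)
        yield tag, stream.read(size)

# Two chunks: "NAME" carrying 5 bytes, then "MISC" carrying 2 bytes.
data = io.BytesIO(b"NAME\x00\x00\x00\x05hello" b"MISC\x00\x00\x00\x02ok")
chunks = list(read_chunks(data))
```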
Directory-based formats
This is another extensible format that closely resembles a file system (OLE
Documents are actual filesystems), where the file is composed of
'directory entries' that give the location of the data within the
file itself, as well as its signature (and in certain cases its type).
Good examples of these types of file structures are disk images, OLE documents and TIFF images.
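To make the directory-entry idea concrete, the sketch below defines a toy layout (illustrative only, not TIFF or OLE): an entry count followed by fixed-size entries of (name, offset, length), each pointing into the data region of the same file, much as TIFF's image file directory or an OLE document's directory stream does:

```python
import io
import struct

def read_directory(stream):
    """Parse a toy directory-based layout: a 32-bit entry count, then
    fixed-size (8-byte name, offset, length) entries whose offsets
    point at data elsewhere in the same file."""
    (count,) = struct.unpack("<I", stream.read(4))
    entries = [struct.unpack("<8sII", stream.read(16)) for _ in range(count)]
    out = {}
    for name, offset, length in entries:
        stream.seek(offset)                       # follow the pointer
        out[name.rstrip(b"\x00")] = stream.read(length)
    return out

# One entry: the header is 4 + 16 = 20 bytes, so the data region
# ("payload", 7 bytes) starts at offset 20.
blob = struct.pack("<I", 1) + struct.pack("<8sII", b"body", 20, 7) + b"payload"
parsed = read_directory(io.BytesIO(blob))
```

Because the entries only point at data, new sections can be appended and referenced without disturbing existing ones, which is what makes the structure extensible.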