CHAPTER FOUR
Files
AS WE MOVE FURTHER ALONG, as we dive deeper into exploring and gaining an understanding of the science behind cyber forensics, our goal is to provide useable materials to the reader, materials that will sustain the evolutionary creep of technology, materials that will not become dated or obsolete before they are published.
In order to accomplish our goal, it is necessary to explore general or broad concepts (although perhaps complex) while refraining from addressing specific software, programs, or even generalized forensic tools, which can quickly become dated and obsolete over time.
As we examine further the building blocks of cyber forensics, special attention has been paid to tools that will not quickly become dated, expire, or lose vendor support. We spend more time, for example, discussing a tool such as a HEX editor than discussing the Windows NT operating system. HEX editors have been around for nearly as long as HEX itself, whereas Windows NT is quickly becoming less and less relevant.
Read on as we continue with our exploration of the science behind cyber forensics, focusing here on files, file signatures, and their role and relevancy in cyber forensic investigations.
In Chapter 3 we:
1. Discussed HEX and the steps involved with converting this binary representation to ASCII.
2. Covered the actual conversion process in an effort to better understand the HEX character representation.
3. Went into some detail discussing the nuts and bolts of HEX.
4. Referred to HEX editors and their function, but stopped short of probing deeply into the functionality and usefulness of a HEX editor when viewing files (or pieces of files). This discussion is saved for a later chapter.
HEX, as was discussed, is useful when attempting to view a file that is partially deleted. This raises two questions:
1. Why would a partially deleted file have difficulties being opened or viewed normally?
2. What parts of a file does a HEX editor allow us to see, which otherwise would not be visible?
FILES, FILE STRUCTURES, AND FILE FORMATS
To answer the questions posed above, we need to further investigate the basics of a file, file structures, and file formats. A partially deleted file in many cases may be missing part of its formatting data, the data that identifies the file.
It is this formatting information that identifies the file to its parent or native software. If a file does not contain this formatting information, the software or operating system (OS) will most likely not be able to access or execute the file. It is this formatting information that uniquely identifies a file.
There are hundreds of different formats for data (databases, word processing, spreadsheets, images, video, etc.). There are also formats for executable programs on different platforms (Windows, Mac, Linux, Unix, etc.). Each format defines how the sequence of bits and bytes is laid out, with the ASCII-based text file being one of the simplest formats for humans to decipher.
Some file formats are designed to store very particular sorts of data: the JPEG format, for example, is designed only to store static photographic images. Other file formats, however, are designed for storage of several different types of data: the GIF format supports storage of both still images and simple animations, and the QuickTime format can act as a container for many different types of multimedia.
A text file is simply one that stores any text, in a format such as ASCII or UTF-8, with few if any control characters. Some file formats, such as HTML, or the source code of some particular programming language, are in fact also text files, but adhere to more specific rules which allow them to be used for specific purposes.1
There are a wide variety of digital file types in our ever expanding electronic universe. These various file types contain specific formatting information which allows for file access, storage, or “manipulation.” This “manipulation” may occur via the operating system itself, or it may occur via a “parent” program installed on the operating system.
A parent program is the program, possibly proprietary software, that is used to create, execute, or otherwise access the file. In most cases a file will contain data, its file signature, from which its parent software (or the operating system) will be able to identify the file and handle its operation.
This file signature information is contained in what is sometimes referred to as a file header. The data contained within a file header is not seen by the casual user, yet is very important for the file to function as designed. It is this data contained within the file header that is used to identify the format of the file.
File headers may also contain data regarding the integrity of the file as well as information about itself and its contents. This data is often referred to as metadata.
There is no one specific file format structure that fits all file types. File formats will vary as does file content. The contents of an image, as well as its format, for example, will be different from the contents and format of a word processing document.
A summary of some more common file formats along with their Windows file extensions can be found in Appendix 4A.
Within the Windows Operating System environment, file formats are easily identified by file extensions.
The Windows Operating System uses file extensions to “bind” an application to a specific file type. For example, Windows will bind Adobe Reader software to the .PDF file extension, or MS Word to the .DOC (or .DOCX) file extension.
The Windows Operating System relies heavily on file extensions; without an extension, Windows would not know how to open, process, or handle a file. The Windows Operating System looks at the extension when binding a file to an application.
Question: What would occur if the file extension of an executable (.EXE) file was changed to that of an Adobe file extension (.PDF)?
Answer: Windows would look at the file extension and see that it’s a .PDF; it would therefore hand that file over to Adobe to open. Adobe would attempt to launch or open the file and report an error since the file, regardless of its name, is not actually an Adobe file.
Windows stores this application binding information in a section of the Operating System (OS) called the registry.
Each file type has a corresponding file extension; this correlation, stored within the registry, tells the OS what type of program is needed to access a certain file type. This is Windows’ way of matching the many different types of files to their corresponding software.
When the OS identifies an extension, say .CSV (Comma Separated Values), the OS looks to the registry and finds which application is bound to this extension. In most cases, MS Excel is bound to CSVs, so Windows will hand that file over to Excel to open and process. A file extension and/or its corresponding registry information can be manipulated by a savvy user.
For example, suppose a change was made to the registry so that the .CSV file extension was associated with, and therefore opened by, an image viewer such as Windows Picture Viewer. If a user were to click on an actual CSV file, Windows would hand that file over to Windows Picture Viewer instead of the logical application (e.g., Excel); the image viewer would then attempt to open the file. If the file was an actual CSV file, a “no preview available” message or an error would be displayed, as the Windows Picture Viewer application would not be able to process a file in the CSV file format.
Say the file was an image, which had been renamed with a .CSV extension. Windows would hand that CSV file over to the image viewing software and the image would be displayed. A file with an incorrect file extension would open as long as the Windows Registry had that “incorrect” file extension associated with the correct software. Remember, changing or renaming a file’s extension does not change the content of the file; it only changes the way in which Windows OS handles the file (i.e., which application the file is sent to).
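The point that renaming changes only the name, never the content, can be checked directly. The following is a minimal Python sketch, not a forensic procedure: the file names and the stand-in "image" bytes are hypothetical, and a SHA-256 hash is used here simply as a convenient fingerprint of the file's content before and after the rename.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Return the SHA-256 digest of a file's contents as a hex string."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def rename_and_compare(content, old_name, new_name):
    """Write content to old_name, rename it to new_name, return both digests."""
    with tempfile.TemporaryDirectory() as folder:
        old_path = os.path.join(folder, old_name)
        new_path = os.path.join(folder, new_name)
        with open(old_path, "wb") as handle:
            handle.write(content)
        before = sha256_of(old_path)
        os.rename(old_path, new_path)  # only the name (and extension) changes
        after = sha256_of(new_path)
    return before, after


# Hypothetical stand-in for an image: a GIF signature plus filler bytes.
before, after = rename_and_compare(b"GIF87a" + b"\x00" * 32, "photo.gif", "report.csv")
print(before == after)
```

The two digests match because the bytes on disk are untouched; only the label Windows uses to pick an application has changed.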
So why is the way the OS handles the interpretation of a file’s extension important to the cyber forensic investigator?
What if a cyber forensic investigator receives a forensic image of a suspected child molester’s hard drive, and searches the drive’s contents for image files (e.g., JPGs)? Let’s say that the investigator is unable to find any files with image extensions such as .JPG. Is this case closed? Is the suspected child molester innocent and free to go?
Hardly; there could be plenty of images on this hard drive that have just been renamed. The fact that Windows uses file extensions creates a means by which a user can hide information by renaming file extensions.
CHANGING A FILE’S EXTENSION TO EVADE DETECTION
The process to change a file’s extension to evade detection is quite simple, as shown in the following steps.
Step 1
Create a legitimate looking folder into which you wish to place your files (see Figure 4.1).
Step 2
Open the misnamed but legitimate looking folder called My Doxuments (see Figure 4.2).
Step 3
Open the Tools tab and select Folder Options (see Figure 4.3).
Step 4
Open the View tab (see Figure 4.4).
Step 5
Uncheck “Hide extensions for known file types” (see Figure 4.5).
Step 6
File extension type is revealed (see Figure 4.6).
Step 7
Right-click on the file name to Rename the file, including providing any valid file extension type.
The file type is changed based upon the extension provided (see Figure 4.7).
Step 8
Click Hide extensions for known file types, to hide the new file extensions (see Figure 4.8).
Where there were once 10 JPEG image files there are now only six (see Figure 4.9). Scanning simply for image files will result in missing the four files with modified extensions!
Advice to the potential criminal: It may be wise to rename the file names from “racy pic” to something more inconspicuous! Also, using or renaming a less well-known folder buried further down in the directory tree may be advantageous.
Remember Windows looks at a file’s extension first, and hands that file over to the appropriate application to open. A Microsoft Word application attempting to open a .JPEG or .TIF file would attempt to launch or open the file and report an error since the file, regardless of its name, is not actually a Microsoft Word file.
In our intellectual property theft case, Ronelle Sawyer is investigating whether Jose McCarthy has potentially engaged in the unlawful distribution of his organization’s intellectual property to a competitor, Janice Witcome, Managing Director of the XYZ Company.
Ronelle is faced with examining millions of pieces of potential evidential data residing on Jose’s hard drive, looking for evidence such as any occurrence of the character string “XYZ.” To add to the complexity of Ronelle’s task, these files could have easily been renamed and moved to locations buried deep within the logical folder structure of the computer.
Figure 4.10 displays the contents of a folder (7.0.6000.374) buried within the Windows folder structure, a location which would normally contain system files such as .DLLs. The folder’s name is [C:\WINDOWS\system32\SoftwareDistribution\Setup\ServiceStartup\wups2.dll\7.0.6000.374]
Windows Folder 7.0.6000.374 contains two files: file #1, wups2.dll, and file #2, systemm32.dll.
Remember, there can be hundreds if not thousands of folders and even more files, all of which may seem inconsequential as they are scattered and stored throughout an individual’s hard drive.
So these file types (i.e., .DLL) seem normal enough, right?
Let’s look at the files with a HEX editor. There are many HEX editors available, most of which are free to download. Google is your friend; just be sure you are downloading a HEX editor and not a Trojan.
Figure 4.11 shows File #1 “wups2.dll” viewed in a HEX editor.
Figure 4.12 shows File # 2, “systemm32.dll” viewed in a HEX editor.
Contained within the file format information is the file signature, sometimes referred to as the “Magic Number.”
Magic numbers are referred to as magic because the purpose and significance of their values are not apparent without some additional knowledge. The term magic number is also used in programming to refer to a constant that is employed for some specific purpose but whose presence or value is inexplicable without additional information.2
A file signature is the binary that identifies a particular file: the data that will aid in the identification of the file to its native or parent software.
For common file formats, the file signatures conveniently represent the names of the file types. For example, image files conforming to the widely used GIF87a format in HEX equals 0x474946383761; when converted into ASCII it equates to GIF87a. ASCII is the de facto standard used by computers and communications equipment for character encoding (i.e., associating alphabetic and other characters with numbers).
Likewise, the file signature for image files having the subsequently introduced GIF89a format is 0x474946383961. For both types of GIF (Graphics Interchange Format) files, the file signature occupies the first six bytes of the file. They are then followed by additional general information (i.e., metadata) about the file.
Similarly, a commonly used file signature for JPEG (Joint Photographic Experts Group) image files is 0x4A464946, which is the ASCII equivalent of JFIF (JPEG File Interchange Format). However, JPEG file signatures are not the first bytes in the file; rather, they begin with the seventh byte. Additional examples include 0x4D546864 (ASCII “MThd”) for MIDI (Musical Instrument Digital Interface) files and 0x425A68 (ASCII “BZh”) for bzip2 compressed files.
Remember, the prefix “0x” (“zero x”) indicates HEX(adecimal) notation. Therefore, the value 0x474946383761 is not and should not be read as a decimal number; it is a hexadecimal representation of the underlying bytes, with each pair of HEX digits representing one byte.
Notice in the HEX editor view of file “systemm32.dll” (Figure 4.12) we see a file signature beginning with “d0 cf 11 e0.” This is the known file signature for MS Word documents created with Office 2003 and earlier. In fact, Microsoft picked the binary code to identify its files with some forethought, as the HEX representation of the binary (which is d0 cf 11 e0) almost spells out (if you look closely and use your imagination) the word “docfile” (d0c f11e 0). Perhaps it’s an example of tech humor or clever, albeit maybe bored, application designers?
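To make the HEX-to-ASCII relationship concrete, here is a short Python sketch. The function name signature_to_ascii is our own invention for illustration; it simply strips an optional 0x prefix and decodes each pair of HEX digits as one ASCII character.

```python
def signature_to_ascii(hex_signature):
    """Convert a hexadecimal signature such as '0x474946383761' to ASCII text."""
    digits = hex_signature
    if digits.lower().startswith("0x"):
        digits = digits[2:]  # drop the notational prefix
    return bytes.fromhex(digits).decode("ascii")


print(signature_to_ascii("0x474946383761"))  # GIF87a
print(signature_to_ascii("0x474946383961"))  # GIF89a
print(signature_to_ascii("0x4A464946"))      # JFIF

# One pair of HEX digits is one byte: HEX 47 is decimal 71, the code for "G".
print(int("47", 16), ord("G"))  # 71 71
```

Note that decoding works here only because these particular signatures happen to fall within the printable ASCII range; a signature containing bytes above 0x7F would raise an error in this sketch.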
Curious? Why would a file with a .dll file extension contain a “docfile” file signature? If we scroll down through the HEX editor some more, we will also see the actual text contained within the file (Figure 4.13).
Notice the HEX value 58 59 5a and its ASCII equivalent, “XYZ” contained within the ASCII Character Panel. HEX editors, as part of their “tool set,” will automatically convert HEX to ASCII, so the rigorous HEX to ASCII conversion process we performed in the previous discussion is not necessary here.
The binary values representing the text “XYZ” are contained within the file “systemm32.dll.” When we rename “systemm32.dll” to “systemm32.doc” and double click the file we will see that it is not a system file (.dll file) after all but a Word document (.doc).
This example shows us the importance of HEX when viewing files or attempting to view files. Since we already know the file signature for MS Word documents, “d0 cf 11 e0,” we can now search the entirety of Jose McCarthy’s drive for those specific HEX characters, revealing the existence of an MS Word document.
Notice we couldn’t search the drive for the ASCII equivalent. The ASCII equivalent of the binary represented by the HEX is displayed as “....”. Dots such as these are displayed by the HEX editor when there is no printable ASCII equivalent to the binary code, as is often the case with file signatures. See Appendix 4C for a further review and discussion of file signatures.
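The way a HEX editor fills its ASCII Character Panel can be approximated in a few lines of Python. This is a sketch of the common convention (printable ASCII shown as-is, everything else shown as a dot), not the code of any particular editor; the function names are ours.

```python
def ascii_panel(data):
    """Render bytes the way a HEX editor's ASCII panel typically does."""
    # Printable ASCII runs from 0x20 (space) to 0x7E (~); all else becomes a dot.
    return "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in data)


def hex_panel(data):
    """Render bytes as space-separated two-digit HEX values."""
    return " ".join(f"{b:02x}" for b in data)


signature = bytes.fromhex("d0cf11e0")
text = b"XYZ"
print(hex_panel(signature), "|", ascii_panel(signature))  # d0 cf 11 e0 | ....
print(hex_panel(text), "|", ascii_panel(text))            # 58 59 5a | XYZ
```

None of the four signature bytes falls in the printable range, which is why the panel shows only dots, while 58 59 5a maps cleanly to "XYZ".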
Remember, the ASCII equivalent of HEX d0 cf 11 e0, is not the file extension “.doc.”
There may not always be an ASCII equivalent for the data that identifies a file type, such as a file header (as in this case); this is one of the main reasons HEX is so important.
Remember, ASCII has limitations and was later extended by larger character sets and, eventually, Unicode. Rendered in an extended 8-bit character set such as Latin-1, the HEX D0 CF 11 E0 A1 B1 1A E1 appears as ÐÏ.à¡±.á (seen and pronounced as DIATA), with dots standing in for the two control bytes. (See Figure 4.14.) However, this isn’t something easily searchable either, as the characters are not all text based.
Open a .doc file (pre-Office 2007) in a HEX editor and you will notice dots “................” in the ASCII panel for many of the HEX values. Neither the file signature nor much of the other file header data and code has an ASCII equivalent.
The first eight (8) bytes are contained in the fixed compound document file identifier, so the file identifier would be all eight (8) bytes: D0 CF 11 E0 A1 B1 1A E1.
When searching through HEX, the first four (4) bytes would usually suffice, but it is best to be as accurate as possible, so the full file signature to search for when looking for an MS Word .doc file is D0 CF 11 E0 A1 B1 1A E1 (Figure 4.15).
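The difference between matching four bytes and matching all eight can be sketched in Python as follows. The helper names and the sample buffers are hypothetical; only the eight-byte value itself comes from the discussion above.

```python
COMPOUND_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def has_short_signature(data):
    """True if the data begins with the first four signature bytes only."""
    return data[:4] == COMPOUND_SIGNATURE[:4]


def has_full_signature(data):
    """True if the data begins with the complete eight-byte signature."""
    return data[:8] == COMPOUND_SIGNATURE


genuine = COMPOUND_SIGNATURE + b"\x00" * 16
lookalike = bytes.fromhex("D0CF11E0") + b"\xff" * 20  # shares only four bytes

print(has_short_signature(genuine), has_full_signature(genuine))      # True True
print(has_short_signature(lookalike), has_full_signature(lookalike))  # True False
```

The shorter match is faster to type into a search box but admits more false positives; the full match is the more defensible test.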
For a more detailed breakdown of the compound file header, please refer to Appendix 4D.
We see that even though a file is renamed we can still view the contents. If we were to search for the binary representation of “XYZ” across the entire drive we would find this value regardless of its modified file extension or file signature. As was discussed previously, many times in the course of normal day-to-day operations and file processing, a deleted file and its associated metadata will be partially overwritten, perhaps missing the entire file signature, or other important formatting information and even some text. However, if the binary values representing a piece of evidence (as in our case, “XYZ”) remain within the remnants, then the file can be found.
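A keyword search over raw bytes, independent of any file name, extension, or signature, can be sketched as below. The buffer called remnant is invented test data standing in for the leftovers of a partially overwritten file; find_all is our own helper name.

```python
def find_all(data, pattern):
    """Return every byte offset at which pattern occurs in data."""
    offsets = []
    start = data.find(pattern)
    while start != -1:
        offsets.append(start)
        start = data.find(pattern, start + 1)
    return offsets


# Hypothetical remnant: no file signature survives, but some text does.
remnant = b"\x00\x17\xfe" + b"Invoice for XYZ Corp" + b"\xff\x00" * 4 + b"XYZ"

keyword = "XYZ".encode("ascii")  # the bytes 58 59 5a
print(find_all(remnant, keyword))  # [15, 31]
```

This sketch searches only for the single-byte ASCII encoding of the keyword; text stored in another encoding would need its own byte pattern.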
It is important to note that a forensic examiner cannot always depend on having an intact file or a file with the authorized (correct) file extension (i.e., file type), available in its native format, on which to perform an analysis.
Office 2007 has drastically altered the MS Word file format. The previous example is true for Office 2003 and earlier versions. Microsoft Office 2007 documents are now stored in what is referred to as the Office Open XML File Format.
It is essentially a ZIP file of various XML documents describing the entire document.
The point of the previous example is not to discuss the inner workings of an MS Word file format, but to show how file formats and signatures work in general. All file signatures are different and will continue to evolve. The purpose here is not to cover all file signatures, but to provide the reader with a very practical example of the relevance of a file’s signature in locating and identifying potentially incriminating information as part of a cyber forensic investigation.
There are file signature databases and tables available on the Internet. Most forensic tools are able to identify file signatures and header information, and will verify file types in this manner. Forensic tools will convert binary to ASCII, verify file signatures, and search for binary strings (or keywords, such as “XYZ”) without much effort on the part of the forensic examiner, and now you know HOW the software accomplishes this. See Appendix 4B for an example of a file signature database.
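What a forensic tool does when it "verifies" a file type can be approximated with a small lookup table. The table below is a tiny illustrative sample, not a real signature database, and the labels and function names are our own. The "MZ" entry for Windows executables and DLLs is added as a well-known signature that does not appear in the discussion above.

```python
import os

# (signature bytes, offset within the file, label). Illustrative sample only.
SIGNATURES = [
    (bytes.fromhex("D0CF11E0A1B11AE1"), 0, "compound file"),
    (b"GIF87a", 0, "GIF image"),
    (b"GIF89a", 0, "GIF image"),
    (b"JFIF", 6, "JPEG image"),
    (b"PK\x03\x04", 0, "ZIP container"),
    (b"MZ", 0, "Windows executable"),
]

# What each extension leads us to expect. Illustrative sample only.
EXPECTED = {
    ".doc": "compound file",
    ".gif": "GIF image",
    ".jpg": "JPEG image",
    ".zip": "ZIP container",
    ".docx": "ZIP container",
    ".dll": "Windows executable",
    ".exe": "Windows executable",
}


def identify(data):
    """Return the label of the first matching signature, or None."""
    for magic, offset, label in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return label
    return None


def extension_mismatch(filename, data):
    """True when the content's signature contradicts the file's extension."""
    extension = os.path.splitext(filename)[1].lower()
    expected = EXPECTED.get(extension)
    actual = identify(data)
    return expected is not None and actual is not None and actual != expected


suspect = bytes.fromhex("D0CF11E0A1B11AE1") + b"\x00" * 8
print(identify(suspect), extension_mismatch("systemm32.dll", suspect))
```

Applied to every file on a drive, a check of this kind flags the renamed Word document regardless of the .dll extension it carries.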
COMPLEX FILES: COMPOUND, COMPRESSED, AND ENCRYPTED FILES
Before ending this section there are other more complex files worth discussing: compound, compressed, and encrypted files. The full complexities of these files are not covered here, as there are books written about each. We do, however, explain some basics and their importance in forensics.
A compound file is a file format that consists of numerous files. The compound file itself is little more than a container for those files. The structure within a compound file is similar to that of a real file system consisting of a hierarchy of storage with one parent directory.
There is a root directory folder, children contained within, and files (data streams) contained therein. Compound files are sometimes associated with Microsoft’s Compound File Binary Format (CFBF) file.
All allocations of space within a Compound File are done in “chunks” or units called sectors. The size of a sector is definable at creation time of a Compound File, and those sectors are usually 512 bytes in size. A virtual stream is made up of a sequence of sectors.
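As a rough illustration of how sector size is recorded, the Python sketch below reads the sector size from a synthetic compound file header. It rests on assumptions that should be checked against Appendix 4D or Microsoft's published format documentation: that a two-byte little-endian "sector shift" field sits at byte offset 30 of the header, that the sector size is 2 raised to that value, and that sector n begins at offset (n + 1) times the sector size because the header occupies the first sector-sized slot. The header built here is fabricated for the demonstration and is not a complete, valid file.

```python
import struct

COMPOUND_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")
SECTOR_SHIFT_OFFSET = 30  # assumed position of the sector shift field


def sector_size(header):
    """Return the sector size in bytes declared by a compound file header."""
    if header[:8] != COMPOUND_SIGNATURE:
        raise ValueError("not a compound file header")
    (shift,) = struct.unpack_from("<H", header, SECTOR_SHIFT_OFFSET)
    return 1 << shift  # e.g., a shift of 9 means 2**9 = 512 bytes


def sector_offset(sector_number, size):
    """Return the byte offset of a sector, assuming the header fills slot zero."""
    return (sector_number + 1) * size


# Synthetic header: signature, 16 unused bytes, then four 2-byte fields
# (minor version, major version, byte-order mark, sector shift = 9).
header = COMPOUND_SIGNATURE + b"\x00" * 16 + struct.pack("<HHHH", 0x003E, 3, 0xFFFE, 9)
header = header.ljust(512, b"\x00")

size = sector_size(header)
print(size, sector_offset(0, size), sector_offset(2, size))  # 512 512 1536
```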
At its simplest, the Compound File Binary Format is a container, with little restriction on what can be stored within it.
However, in forensics, the term compound files is sometimes used more loosely, representing any file that may contain a directory structure. Again, our goal is not to cover a specific file type or software, but concepts generally.
As with other files, the file header of a compound file will contain a file signature, identifying the file; it will also contain information required to interpret the rest of the file such as the file’s size and storage location.
It is this metadata that allows the software to reconstruct the file into the appropriate file format that will display the file’s specific information (i.e., size, creation date, change date, etc.). The file therefore needs to be “reconstructed” by its parent software in order for the data to be legible or otherwise accessible.
To further explain, we typically think of data storage as linear. For example, consider the information in the following data stream, “XYZ Corp.” The data is displayed in a linear contiguous pattern, X before Y and Y before Z. If that data was displayed in a nonlinear pattern we would see perhaps, “oZ pYCrX.”
If that same data now were not contiguous, other data from that same compound file may also be intertwined (e.g., ...?>>o....Z^qL p....77Ymn....C@qwerbsbdX......,,,,). The original data stream “XYZ Corp” is not as easily discernable now. Even searching for the HEX equivalent wouldn’t help us uncover the data in this example.
We would need an instruction set to reconstruct this data.
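The idea of an instruction set for reassembly can be sketched with a toy model in Python. This is not the layout of any real file format: the three-byte "sectors," the filler bytes, and the chain table are all invented to show how scattered pieces are put back in order by following a table of "which sector comes next."

```python
END_OF_CHAIN = -1


def follow_chain(sectors, next_sector, first):
    """Rebuild a data stream by visiting sectors in the order the chain dictates."""
    parts = []
    visited = set()
    current = first
    while current != END_OF_CHAIN:
        if current in visited:
            raise ValueError("chain loops back on itself")
        visited.add(current)
        parts.append(sectors[current])
        current = next_sector[current]
    return b"".join(parts)


# Five tiny sectors as they sit on disk; sectors 1 and 3 belong to other data.
sectors = [b" Co", b"^qL", b"XYZ", b"77Y", b"rp"]

# The "instruction set": start at sector 2, then go to 0, then 4, then stop.
next_sector = {2: 0, 0: 4, 4: END_OF_CHAIN}

raw = b"".join(sectors)
print(b"XYZ Corp" in raw)                    # False
print(follow_chain(sectors, next_sector, 2)) # b'XYZ Corp'
```

Without the chain table, a search of the raw bytes for "XYZ Corp" fails even though every character of it is present; with the table, the stream is trivially rebuilt.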
Files have become more complex and need to contain a lot of information. Many files contain Object Linking and Embedding (OLE) technology, in which one file may contain many files.
OLE (object linking and embedding) allows users to integrate data from different applications. Object linking allows users to share a single source of data for a particular object. The document contains the name of the file containing the data, along with a picture of the data. When the source is updated, all the documents using the data are updated as well.
With object embedding, one application (referred to as the “source”) provides data or an image that will be contained in the document of another application (referred to as the “destination”). The destination application contains the data or graphic image, but does not understand it or have the ability to edit it. It simply displays, prints, and/or plays the embedded item. To edit or update the embedded object, it must be opened in the source application that created it. This occurs automatically when you double-click the item or choose the appropriate edit command while the object is highlighted.
While embedding doesn’t allow users to have a single source of data, it does make it easier to integrate applications. An embedded object contains the actual data for the object, the name of the application that created it, and a picture of the data.3
For instance, an MS Word document may contain a JPG image; a file within a file. Compound files allow for incremental access, allowing for individual components to be accessed without the need of the entire file. This can save time and resources by not having to load an entire file, only the piece or pieces desired.
As we continue on with our discussion, compressed files are essentially compound files (and sometimes referred to as such in the forensic community) that are compressed. They work in similar fashion; however, also contained within the compound file are compression instructions.
A common file extension associated with compressed files is .ZIP. This file format has gone mainstream and is supported by many software utilities other than its parent software, PKZIP. The .ZIP file format was publicly released, making it an open format that is used by other programs and formats, including Microsoft’s Office Open XML format. The name ZIP is often used loosely to describe any archival file format. WinZip and 7-Zip are utilities that read and write ZIP archives, while formats such as 7z, gzip, and rzip are distinct compression formats with their own file signatures.
The file format of a compressed file (or .ZIP file) changes depending upon its compression algorithm. Algorithms are the mathematical operations or instructions for completing a task, in this case compressing data. It is a method of encoding data using fewer bits than used in the original encoding. Algorithms are complex to say the least and books have been written regarding this topic.
To exemplify the difference between a regular file and a more complex file (compound or compressed) it would be best to examine a similar file in both formats.
Let’s examine a letter from Jose McCarthy, seized as part of the Ronelle Sawyer investigation. The letter examined was an MS Word file format (i.e., .doc), as was made clear when viewed via the HEX editor, shown here again.
We can easily see the doc file signature, d0 cf 11 e0, displayed in the HEX editor in Figure 4.16.
By knowing the file signature for an MS Word document, we can easily identify and/or search for the text contained within this .doc file, and in doing so, find references to the “XYZ” company, as shown in Figure 4.17.
What happens, however, when an application is upgraded? How might this affect the application’s file signature? To see the result of a change in software application file formatting, let us view the same document file from Jose McCarthy with a HEX editor, when Microsoft Office 2007 rather than Office 2003 is used to generate the document.
We see in Figure 4.18 that the file signature has changed. If you search for a file signature matching 50 4B 03 04 you will notice it corresponds to a .ZIP file signature (ASCII panel shows PK . . . . format). With the release of Office 2007, Microsoft Word documents now use the same file format signature as a .ZIP file.
What is the importance to a cyber forensic investigation and what does this mean? For starters, it means that the file is a compound file consisting of other files. If we were to view the entirety of the file with our HEX editor we would not uncover any legible ASCII characters (see ASCII panel in Figure 4.17).
Why? The file structure and assembly instructions are contained within the file; thus, the file would need to be mounted by its native software in order for the contents to be viewed. As can be seen in Figure 4.19, the ASCII representation is not identifiable.
Viewing and, more importantly, searching the contents of these “complex” files are possible once they are mounted. Forensic tools incorporate the software to mount these so that searching is possible. If these complex files are not mounted then no search results will be obtained.
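The effect of mounting can be demonstrated with Python's standard zipfile module. The sketch below builds a small ZIP container in memory as a simplified stand-in for an Office 2007 document (it is not a valid Word file; the member name word/document.xml and the sample text are used only for illustration), then searches it twice: once as raw bytes and once after opening the container.

```python
import io
import zipfile


def build_container(text):
    """Create an in-memory ZIP holding one compressed XML member."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", text)
    return buffer.getvalue()


def search_mounted(raw, keyword):
    """Open the container and return the names of members containing keyword."""
    hits = []
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        for name in archive.namelist():
            if keyword in archive.read(name):  # read() decompresses the member
                hits.append(name)
    return hits


text = "<w:t>Payment received from XYZ Corp</w:t>" * 40
raw = build_container(text)

print(raw[:4])                            # b'PK\x03\x04'
print(b"XYZ Corp" in raw)                 # raw search of the compressed bytes
print(search_mounted(raw, b"XYZ Corp"))   # ['word/document.xml']
```

The raw search comes up empty because compression re-encodes the text, while the same keyword is found immediately once the container is opened and the member is decompressed.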
Encrypted files are also complex but differ in that an encryption key is required to decrypt an encrypted file.
Encryption uses an algorithm (cipher) to alter or transform the data in an attempt to prevent reconstruction by those without the secret input to that algorithm, a.k.a. the encryption key. Decryption refers to the reverse process of making the data readable or otherwise accessible.
Encryption is a method by which the confidentiality of data can be protected. For the most part, an encrypted file cannot be decrypted without the encryption key (often supplied as, or derived from, a password). The encryption process uses an algorithm or cipher to mathematically transform the plaintext along with the encryption key (password), thereby encoding it in such a manner that it is illegible or indecipherable.
With the correct decryption key (password), the encrypted data (ciphertext) is run back through its associated cipher (algorithm) and converted back to clear text, which is, by definition, decrypted. Remember, this entire process occurs in binary, as 0s and 1s.
It is the cipher that actually changes the file; the password is just a set of data which are used to “mathematically mix” and set the process in motion, turning the plaintext data into an unreadable end product.
The structure of ciphers depends upon the cipher’s type. Types of ciphers vary but generally they can be categorized by the following:
- Block or stream. Block ciphers generally work on fixed-length groups of bits called blocks. The cipher may take a 256-bit block of plaintext data and encrypt it, which results in a 256-bit block of encrypted data. In a stream cipher, the plaintext bits are encrypted one at a time by combining each with a bit of a keystream generated from the encryption key.
- Symmetric or asymmetric. In symmetric encryption, the same encryption key or password is used for both encryption and decryption, whereas with asymmetric encryption different keys are used. Symmetric key encryption is intuitive in that the same password is used to encrypt or decrypt the data. Asymmetric key encryption, or public-key cryptography, uses two different encryption keys, a public and a private key. Data is encrypted using a person’s public key, which may be freely distributed and to which anyone may have access. However, the data can be decrypted only with that person’s private key, which the individual keeps secret.
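To illustrate the symmetric and stream ideas together, here is a deliberately simple Python sketch. It is a toy and must not be used to protect real data: it is not AES or any standard cipher, and the way it stretches the password into a keystream (repeated SHA-256 hashing with a counter) is our own simplification chosen only so the example is self-contained.

```python
import hashlib


def keystream(key, length):
    """Stretch a key into `length` pseudo-random bytes (toy construction)."""
    stream = bytearray()
    counter = 0
    while len(stream) < length:
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(stream[:length])


def xor_cipher(data, key):
    """Encrypt or decrypt by combining each data byte with a keystream byte."""
    return bytes(d ^ k for d, k in zip(data, keystream(key, len(data))))


plaintext = b"Payment received from XYZ Corp"
password = b"correct horse"

ciphertext = xor_cipher(plaintext, password)
print(ciphertext.hex())
print(xor_cipher(ciphertext, password))        # same key reverses the process
print(xor_cipher(ciphertext, b"wrong guess"))  # a different key yields noise
```

Because the same function and the same key both encrypt and decrypt, this is a symmetric scheme; and because the data is combined with the keystream piece by piece rather than in fixed blocks, it behaves as a stream cipher.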
There are various encryption methods available, such as the Advanced Encryption Standard (AES), which is currently the standard adopted by the United States government and one of the most popular encryption methods available and in use today.
There are other encryption algorithms (or formats) available and many books have been written regarding each. It is not within the scope of this text to cover the various standards of encryption.
A similar attribute shared by all these “complex” file types discussed is that they contain some level or form of instruction needed to reconstruct the file. If that information is overwritten or otherwise missing, the ability to retrieve the data contained within the file will be severely compromised.
If the instructional data needed to reconstruct a compound file is missing, overwritten, destroyed, compromised, etc., the file may not be recoverable, even though the data containing the evidence (e.g., XYZ Corp) may still be contained within the file itself.
However, with that said, it may be possible to reconstruct a complex file which has been partially overwritten. Forensic analysts are creative, cutting edge, innovative, and very intelligent; they have developed solutions for some of the most complex problems. However, recovering the data with normal “point and click” methods may not always be possible.
It is important to understand that not all binary values are convertible into readable ASCII. ASCII is a code, based on the ordering of the English alphabet, and not all data contained within a computer is necessarily text (ASCII) based. There are many programs or software applications which are written in programming code which is not ASCII based.
This programming code is not meant to be viewed in ASCII, it is meant to perform a function. Recall from our earlier discussions, a computer’s functions are all based on math, not the English (nor French, Chinese, Slavic, Greek, or Arabic) language; code therefore needs to be based on mathematical principles not grammatical ones.
A file’s type or format is based upon its file signature, not a Microsoft Windows extension. The file header, including the file signature is best viewed in HEX as there is no legible or identifiable corresponding ASCII representation. As we discussed, file signature/headers are the pieces of a file which identify the file to its “parent” software, not to the user.
Thus, when we view the HEX editor and see HEX values appearing as “...............” in the ASCII Character Panel, this typically means that there is no printable ASCII representation for those HEX values. The HEX values, when they do exist, are unique and therefore searchable.
It is very easy (and potentially dangerous) to become dependent on forensic tools and to forget, or never learn, the nuts and bolts of the technological process: HOW the answer is obtained.
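The behavior of the ASCII Character Panel, and the idea that HEX values are searchable, can be sketched in a few lines. This is an illustrative approximation of what a HEX editor displays, not the code of any particular product; the function names `ascii_panel`, `hex_dump`, and `find_all` are our own, and the choice of 0x20 through 0x7E as the “printable” range is an assumption that matches common practice.

```python
def ascii_panel(chunk: bytes) -> str:
    """Show printable ASCII (0x20-0x7E) as itself and everything else as '.'."""
    return "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in chunk)


def hex_dump(data: bytes, width: int = 16) -> str:
    """Format data as offset, HEX values, and ASCII panel, one row per line."""
    pad = width * 3 - 1  # width of a full row of "XX " groups
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        lines.append(f"{offset:08X}  {hex_part:<{pad}}  {ascii_panel(chunk)}")
    return "\n".join(lines)


def find_all(data: bytes, pattern: bytes) -> list:
    """Return every offset at which pattern occurs in data."""
    offsets = []
    start = data.find(pattern)
    while start != -1:
        offsets.append(start)
        start = data.find(pattern, start + 1)
    return offsets


signature = bytes.fromhex("D0CF11E0A1B11AE1")
sample = b"\x00\x00" + signature + b"text" + signature
print(hex_dump(sample))
print("signature found at offsets:", find_all(sample, signature))
```

The signature bytes appear only as dots in the ASCII column, yet a search on the raw HEX values still locates every occurrence.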
Reliance on any “tool” without having a solid understanding of how the tool works could spell personal and professional disaster for the cyber forensic investigator.
This is akin to successfully providing the correct answers to all the questions on a mathematics exam, and still receiving a failing grade because you failed to show your work.
If you are asked to explain how an answer was obtained, or on what data analysis a conclusion was reached, and you reply, “I used tool ‘ABC’ and it provided the answer,” but are unable to explain how the tool obtained the answer or how you could validate and substantiate it, the validity and reliability of your answer could be called into question and held suspect.
Use a tool, but be certain you know how the tool works and how to replicate its results without the tool if you had to.
1. File format, retrieved January 2010, www.answers.com/topic/file-format.
2. Bellevue Linux Users Group (BLUG), “Magic Number Definition,” The Linux Information Project, August 21, 2006, retrieved January 2010, www.linfo.org/magic_number.html.
3. “Common Questions: Object Linking and Embedding, Data Exchange,” Microsoft Support, retrieved November 2011, http://support.microsoft.com/kb/122263, © 2007 Microsoft Corporation. All rights reserved. Used with permission from Microsoft.
APPENDIX 4A: COMMON FILE EXTENSIONS (a)
Common file extensions that are good to know, organized by file format.
Text Files | |
.doc | Microsoft Word Document |
.docx | Microsoft Word Open XML Document |
.log | Log File |
.msg | Outlook Mail Message |
.pages | Pages Document |
.rtf | Rich Text Format File |
.txt | Plain Text File |
.wpd | WordPerfect Document |
.wps | Microsoft Works Word Processor Document |
Data Files | |
.123 | Lotus 1-2-3 Spreadsheet |
.accdb | Access 2007 Database File |
.csv | Comma Separated Values File |
.dat | Data File |
.db | Database File |
.dll | Dynamic Link Library |
.mdb | Microsoft Access Database |
.pps | PowerPoint Slide Show |
.ppt | PowerPoint Presentation |
.pptx | Microsoft PowerPoint Open XML Document |
.sdb | OpenOffice.org Base Database File |
.sdf | Standard Data File |
.sql | Structured Query Language Data |
.vcf | vCard File |
.wks | Microsoft Works Spreadsheet |
.xls | Microsoft Excel Spreadsheet |
.xlsx | Microsoft Excel Open XML Document |
.xml | XML File |
Image Files | |
.pct | Picture File |
Raster Image Files | |
.bmp | Bitmap Image File |
.gif | Graphical Interchange Format File |
.jpg | JPEG Image File |
.png | Portable Network Graphic |
.psd | Photoshop Document |
.psp | Paint Shop Pro Image File |
.thm | Thumbnail Image File |
.tif | Tagged Image File |
Vector Image Files | |
.ai | Adobe Illustrator File |
.drw | Drawing File |
.dxf | Drawing Exchange Format File |
.eps | Encapsulated PostScript File |
.ps | PostScript File |
.svg | Scalable Vector Graphics File |
3D Image Files | |
.3dm | Rhino 3D Model |
.dwg | AutoCAD Drawing Database File |
.pln | ArchiCAD Project File |
Page Layout Files | |
.indd | Adobe InDesign File |
.pdf | Portable Document Format File |
.qxd | QuarkXPress Document |
.qxp | QuarkXPress Project File |
Audio Files | |
.aac | Advanced Audio Coding File |
.aif | Audio Interchange File Format |
.iff | Interchange File Format |
.m3u | Media Playlist File |
.mid | MIDI File |
.midi | MIDI File |
.mp3 | MP3 Audio File |
.mpa | MPEG-2 Audio File |
.ra | Real Audio File |
.wav | WAVE Audio File |
.wma | Windows Media Audio File |
Video Files | |
.3g2 | 3GPP2 Multimedia File |
.3gp | 3GPP Multimedia File |
.asf | Advanced Systems Format File |
.asx | Microsoft ASF Redirector File |
.avi | Audio Video Interleave File |
.flv | Flash Video File |
.mov | Apple QuickTime Movie |
.mp4 | MPEG-4 Video File |
.mpg | MPEG Video File |
.rm | Real Media File |
.swf | Flash Movie |
.vob | DVD Video Object File |
.wmv | Windows Media Video File |
Web Files | |
.asp | Active Server Page |
.css | Cascading Style Sheet |
.htm | Hypertext Markup Language File |
.html | Hypertext Markup Language File |
.js | JavaScript File |
.jsp | Java Server Page |
.php | Hypertext Preprocessor File |
.rss | Rich Site Summary |
.xhtml | Extensible Hypertext Markup Language File |
Font Files | |
.fnt | Windows Font File |
.fon | Generic Font File |
.otf | OpenType Font |
.ttf | TrueType Font |
Plugin Files | |
.8bi | Photoshop Plug-in |
.plugin | Mac OS X Plug-in |
.xll | Excel Add-In File |
System Files | |
.cab | Windows Cabinet File |
.cpl | Windows Control Panel |
.cur | Windows Cursor |
.dmp | Windows Memory Dump |
.drv | Device Driver |
.key | Security Key |
.lnk | File Shortcut |
.sys | Windows System File |
Settings Files | |
.cfg | Configuration File |
.ini | Windows Initialization File |
.prf | Outlook Profile File |
Executable Files | |
.app | Mac OS X Application |
.bat | DOS Batch File |
.cgi | Common Gateway Interface Script |
.com | DOS Command File |
.exe | Windows Executable File |
.pif | Program Information File |
.vb | VBScript File |
.ws | Windows Script |
Compressed Files | |
.7z | 7-Zip Compressed File |
.deb | Debian Software Package |
.gz | Gnu Zipped File |
.pkg | Mac OS X Installer Package |
.rar | WinRAR Compressed Archive |
.sit | Stuffit Archive |
.sitx | Stuffit X Archive |
.zip | Zip File |
.zipx | Extended Zip File |
Encoded Files | |
.bin | Macbinary II Encoded File |
.hqx | BinHex 4.0 Encoded File |
.mim | Multi-Purpose Internet Mail Message |
.uue | Uuencoded File |
Developer Files | |
.c | C/C++ Source Code File |
.cpp | C++ Source Code File |
.java | Java Source Code File |
.pl | Perl Script |
Backup Files | |
.bak | Backup File |
.bup | Backup File |
.gho | Norton Ghost Backup File |
.ori | Original File |
.tmp | Temporary File |
Disk Files | |
.dmg | Mac OS X Disk Image |
.iso | Disc Image File |
.toast | Toast Disc Image |
.vcd | Virtual CD |
Game Files | |
.gam | Saved Game File |
.nes | Nintendo (NES) ROM File |
.rom | N64 Game ROM File |
.sav | Saved Game |
Misc Files | |
.msi | Windows Installer Package |
.part | Partially Downloaded File |
.torrent | BitTorrent File |
.yps | Yahoo! Messenger Data File |
APPENDIX 4B: FILE SIGNATURE DATABASE
APPENDIX 4C: MAGIC NUMBER DEFINITION (b)
A magic number is a number embedded at or near the beginning of a file that indicates its file format (i.e., the type of file it is). It is also sometimes referred to as a file signature.
Magic numbers are generally not visible to users. However, they can easily be seen with the use of a HEX editor, which is a specialized program that shows and allows modification of every byte in a file.
For common file formats, the numbers conveniently represent the names of the file types. Thus, for example, the magic number for image files conforming to the widely used GIF87a format in hexadecimal (i.e., base 16) terms is 0x474946383761, which when converted into ASCII is GIF87a. ASCII is the de facto standard used by computers and communications equipment for character encoding (i.e., associating alphabetic and other characters with numbers).
Likewise, the magic number for image files having the subsequently introduced GIF89a format is 0x474946383961. For both types of GIF (Graphics Interchange Format) files, the magic number occupies the first six bytes of the file. It is then followed by additional general information (i.e., metadata) about the file.
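The conversion from these hexadecimal magic numbers to ASCII can be reproduced without a HEX editor. The short sketch below uses only the two values quoted above; the helper name `magic_to_ascii` is our own.

```python
def magic_to_ascii(magic_hex: str) -> str:
    """Convert a hexadecimal magic number (without the 0x prefix) to ASCII."""
    return bytes.fromhex(magic_hex).decode("ascii")


for magic_hex in ("474946383761", "474946383961"):
    magic = bytes.fromhex(magic_hex)
    print(f"0x{magic_hex} -> {magic_to_ascii(magic_hex)} ({len(magic)} bytes)")
```

Each pair of HEX digits is one byte, so the twelve digits yield the six bytes (and six characters) described in the text.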
Similarly, a commonly used magic number for JPEG (Joint Photographic Experts Group) image files is 0x4A464946, which is the ASCII equivalent of JFIF (JPEG File Interchange Format). However, JPEG magic numbers are not the first bytes in the file; rather, they begin with the seventh byte. Additional examples include 0x4D546864 (ASCII MThd) for MIDI (Musical Instrument Digital Interface) files and 0x425A68 (ASCII BZh) for bzip2 compressed files.
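Because a magic number need not sit at the very start of a file, a check has to take the offset into account. The sketch below is illustrative only: `has_magic_at` is a hypothetical helper, and `sample_jpeg` is a synthetic byte string we constructed (six leading bytes of the kind commonly seen at the start of a JFIF file, followed by the JFIF marker), not data from a real image.

```python
def has_magic_at(data: bytes, magic: bytes, offset: int = 0) -> bool:
    """Return True if magic appears in data starting exactly at offset."""
    return data[offset:offset + len(magic)] == magic


JFIF = bytes.fromhex("4A464946")  # ASCII "JFIF"
MIDI = bytes.fromhex("4D546864")  # ASCII "MThd"

# Synthetic example: six leading bytes, then the JFIF marker at the 7th byte.
sample_jpeg = bytes.fromhex("FFD8FFE00010") + JFIF + b"\x00"

print("JFIF at offset 0:", has_magic_at(sample_jpeg, JFIF, 0))
print("JFIF at offset 6:", has_magic_at(sample_jpeg, JFIF, 6))
print("MIDI magic as ASCII:", MIDI.decode("ascii"))
```

Offsets are counted from zero, so the seventh byte of the file is offset 6.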
Magic numbers are not always the ASCII equivalent of the name of the file format, or even something similar. For example, in some types of files they represent the name or initials of the developer of that file format. Also, in at least one type of file the magic number represents the birthday of that format’s developer.
Various programs make use of magic numbers to determine the file type. Among them is the command line (i.e., all-text mode) program named file, whose sole purpose is determining the file type.
Although they can be useful, magic numbers are not always sufficient to determine the file type. The main reason is that some file types do not have magic numbers, most notably plain text files, which include HTML (hypertext markup language), XHTML (extensible HTML), and XML (extensible markup language) files as well as source code.
Fortunately, there are also other means that can be used by programs to determine file types. One is by looking at a file’s character set (e.g., ASCII) to see if it is a plain text file. If it is determined that a file is a plain text file, then it is often possible to further categorize it on the basis of the start of the text, such as <html> for HTML files and #! (the so-called shebang) for script (i.e., short program) files.
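The two-step approach just described (first decide whether the content is plain text, then look at how the text begins) can be sketched as follows. This is a crude approximation written for illustration; the function name `classify_text` and its rule that “plain text” means printable ASCII plus tab, line feed, and carriage return are our assumptions, and a program such as file applies far more elaborate tests.

```python
def classify_text(data: bytes) -> str:
    """Roughly classify content as a script, HTML, plain text, or non-text."""
    allowed_controls = (9, 10, 13)  # tab, line feed, carriage return
    if not all(32 <= b <= 126 or b in allowed_controls for b in data):
        return "not plain ASCII text"
    text = data.decode("ascii")
    if text.startswith("#!"):
        return "script"
    if text.lstrip().lower().startswith("<html"):
        return "HTML"
    return "plain text"


for sample in (b"#!/bin/sh\necho hi\n", b"<html><body></body></html>",
               b"Dear XYZ Corp,", b"\xD0\xCF\x11\xE0"):
    print(sample[:12], "->", classify_text(sample))
```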
Another way to determine file type is through the use of filename extensions (e.g., .exe, .html, and .jpg), which are required on the various Microsoft operating systems but only to a small extent on Linux and other Unix-like operating systems. However, this approach has the disadvantage that it is relatively easy for a user to accidentally change or remove the extensions, in which case it becomes difficult to determine the file type and use the file.
Still another way that is possible in the case of some commonly used filesystems is through the use of file type information that is embedded in each file’s metadata. In Unix-like operating systems, such metadata is contained in inodes, which are data structures (i.e., efficient ways of storing information) that store all the information about files except their names and their actual data.
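As a small illustration of metadata kept outside a file’s name and content, the sketch below asks the operating system for a file’s stored attributes using Python’s standard `os.stat` call. Note the limits of what this shows: the type recorded here distinguishes regular files from directories and similar objects, not one document format from another. The helper `describe` and the file name `evidence.bin` are hypothetical.

```python
import os
import stat
import tempfile


def describe(path: str) -> dict:
    """Collect a few attributes the filesystem stores about a file."""
    info = os.stat(path)
    return {
        "inode": info.st_ino,
        "is_regular_file": stat.S_ISREG(info.st_mode),
        "is_directory": stat.S_ISDIR(info.st_mode),
        "size_bytes": info.st_size,
    }


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "evidence.bin")
    with open(path, "wb") as handle:
        handle.write(b"GIF89a")
    print(describe(path))
    print(describe(folder))
```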
Magic numbers are referred to as magic because the purpose and significance of their values are not apparent without some additional knowledge. The term magic number is also used in programming to refer to a constant that is employed for some specific purpose but whose presence or value is inexplicable without additional information.
APPENDIX 4D: COMPOUND DOCUMENT HEADER (c)
The first 512 bytes of the file may look like Table 4D.1.
00000000H | D0 CF 11 E0 A1 B1 1A E1 00 00 00 00 00 00 00 00 |
00000010H | 00 00 00 00 00 00 00 00 3B 00 03 00 FE FF 09 00 |
00000020H | 06 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 |
00000030H | 0A 00 00 00 00 00 00 00 00 10 00 00 02 00 00 00 |
00000040H | 01 00 00 00 FE FF FF FF 00 00 00 00 00 00 00 00 |
00000050H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000060H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000070H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000080H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000090H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
000000A0H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
000000B0H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
000000C0H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
000000D0H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
000000E0H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
000000F0H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000100H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000110H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000120H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000130H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000140H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000150H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000160H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000170H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000180H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000190H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
000001A0H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
000001B0H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
000001C0H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
000001D0H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
000001E0H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
000001F0H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
Examining the details of this Compound Document Header discloses the following: eight (8) bytes containing the fixed compound document file identifier (Table 4D.2).
00000000H | D0 CF 11 E0 A1 B1 1A E1 00 00 00 00 00 00 00 00 |
Sixteen (16) bytes containing a unique identifier, followed by four (4) bytes containing a revision number and a version number (Table 4D.3).
00000000H | D0 CF 11 E0 A1 B1 1A E1 00 00 00 00 00 00 00 00 |
00000010H | 00 00 00 00 00 00 00 00 3B 00 03 00 FE FF 09 00 |
Two (2) bytes containing the byte order identifier. It should always consist of the byte sequence FEH FFH (Table 4D.4).
00000010H | 00 00 00 00 00 00 00 00 3B 00 03 00 FE FF 09 00 |
Two (2) bytes containing the size of sectors, followed by two (2) bytes containing the size of short-sectors. Both values are stored as powers of two: 0009H gives a sector size of 2^9 = 512 bytes, and 0006H gives a short-sector size of 2^6 = 64 bytes here (Table 4D.5).
00000010H | 00 00 00 00 00 00 00 00 3B 00 03 00 FE FF 09 00 |
00000020H | 06 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 |
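The sector sizes can be worked out by hand from the two rows above. The sketch below assumes, based on our reading of the cited format documentation, that each size is stored as a little-endian two-byte exponent of two; the variable names are our own.

```python
import struct

# The two rows shown above (offsets 00000010H and 00000020H).
row_10 = bytes.fromhex("00 00 00 00 00 00 00 00 3B 00 03 00 FE FF 09 00")
row_20 = bytes.fromhex("06 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00")

# "<H" reads an unsigned 16-bit value, least significant byte first.
sector_exponent = struct.unpack_from("<H", row_10, 14)[0]       # bytes 09 00
short_sector_exponent = struct.unpack_from("<H", row_20, 0)[0]  # bytes 06 00

sector_size = 1 << sector_exponent              # 2 ** 9
short_sector_size = 1 << short_sector_exponent  # 2 ** 6

print("sector size:", sector_size, "bytes")
print("short-sector size:", short_sector_size, "bytes")
```

The results agree with the 512-byte and 64-byte sizes stated in the text.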
Ten (10) bytes without valid data, which can be ignored (Table 4D.6).
00000020H | 06 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 |
Four (4) bytes containing the number of sectors used by the sector allocation table (SAT). The SAT uses only one sector here (Table 4D.7).
00000020H | 06 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00 |
Four (4) bytes containing the SecID of the first sector used by the directory. The directory starts at sector 10 here (Table 4D.8).
00000030H | 0A 00 00 00 00 00 00 00 00 10 00 00 02 00 00 00 |
Four (4) bytes without valid data, which can be ignored (Table 4D.9).
00000030H | 0A 00 00 00 00 00 00 00 00 10 00 00 02 00 00 00 |
Four (4) bytes containing the minimum size of standard streams. This size is 00001000H = 4096 bytes here (Table 4D.10).
00000030H | 0A 00 00 00 00 00 00 00 00 10 00 00 02 00 00 00 |
Four (4) bytes containing the SecID of the first sector of the short-sector allocation table (Table 4D.11).
00000030H | 0A 00 00 00 00 00 00 00 00 10 00 00 02 00 00 00 |
Four (4) bytes containing the number of sectors used by the SSAT. In this example, the SSAT starts at sector 2 and uses one sector (Table 4D.12).
00000040H | 01 00 00 00 FE FF FF FF 00 00 00 00 00 00 00 00 |
Four (4) bytes containing the SecID of the first sector of the master sector allocation table, followed by four (4) bytes containing the number of sectors used by the MSAT. The SecID here is −2, which states that there is no extended MSAT in this file (Table 4D.13).
00000040H | 01 00 00 00 FE FF FF FF 00 00 00 00 00 00 00 00 |
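The value −2 is not visible as such in the HEX display; it results from reading the four bytes FE FF FF FF as a signed, little-endian 32-bit integer. A brief sketch, with the helper name `secid` being our own:

```python
import struct


def secid(raw: bytes) -> int:
    """Interpret four bytes as a signed little-endian 32-bit SecID."""
    return struct.unpack("<i", raw)[0]


row_40 = bytes.fromhex("01 00 00 00 FE FF FF FF 00 00 00 00 00 00 00 00")

msat_first_secid = secid(row_40[4:8])                       # FE FF FF FF
msat_sector_count = struct.unpack_from("<I", row_40, 8)[0]  # 00 00 00 00

print("first MSAT SecID:", msat_first_secid)
print("sectors used by the MSAT:", msat_sector_count)
print("FF FF FF FF as a SecID:", secid(bytes.fromhex("FFFFFFFF")))
```

Read as an unsigned number, the same four bytes would be 4,294,967,294, which is why the signed interpretation matters when examining a header manually.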
436 bytes containing the first 109 SecIDs of the MSAT. Only the first SecID is valid, because the SAT uses only one sector (see earlier).
Therefore, all remaining SecIDs are set to the special Free SecID with the value −1.
The only sector used by the SAT is sector 0 (Table 4D.14).
00000040H | 01 00 00 00 FE FF FF FF 00 00 00 00 00 00 00 00 |
00000050H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000060H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000070H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000080H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000090H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
000000A0H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
000000B0H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
000000C0H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
000000D0H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
000000E0H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
000000F0H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000100H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000110H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000120H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000130H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000140H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000150H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000160H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000170H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000180H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
00000190H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
000001A0H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
000001B0H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
000001C0H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
000001D0H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
000001E0H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
000001F0H | FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF |
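The final 436 bytes of the header can be decoded as an array of 109 four-byte SecIDs (109 × 4 = 436), starting at offset 76 of the 512-byte header. The sketch below rebuilds only the part of the example header that matters here: offsets below 00000040H are filled with zero placeholders, which is our simplification and not the real content of those bytes. The names `msat_secids`, `MSAT_OFFSET`, and `FREE_SECID` are our own.

```python
import struct

MSAT_OFFSET = 76    # 512-byte header minus the 436-byte SecID area
MSAT_ENTRIES = 109
FREE_SECID = -1     # FF FF FF FF read as a signed 32-bit value

row_40 = bytes.fromhex("01 00 00 00 FE FF FF FF 00 00 00 00 00 00 00 00")
# Zero placeholders stand in for offsets 00H-3FH, which are not examined here.
header = b"\x00" * 0x40 + row_40 + b"\xFF" * (512 - 0x50)


def msat_secids(header_bytes: bytes) -> list:
    """Return the 109 signed SecIDs stored at the end of the header."""
    return list(struct.unpack_from(f"<{MSAT_ENTRIES}i", header_bytes, MSAT_OFFSET))


secids = msat_secids(header)
used = [s for s in secids if s != FREE_SECID]
print(len(secids), "entries; sectors used by the SAT:", used)
```

Only the first entry is a real sector number (sector 0); the remaining 108 entries carry the Free SecID value.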
(a) The information in this Appendix came from www.fileinfo.com/common.php.
(b) Bellevue Linux Users Group (BLUG), “Magic Number Definition,” The Linux Information Project, August 21, 2006, retrieved January 2010, www.linfo.org/magic_number.html.
(c) D. Rentz, “Documentation of the Microsoft Compound Document File Format,” OpenOffice.org Source Project, August 7, 2007, retrieved February 2010, http://sc.openoffice.org/compdocfileformat.pdf.