This article is based on Tika in Action, to be published in Summer 2011. It is reproduced here by permission from Manning Publications.
The types of content within files vary vastly. We’ve picked two example file format types to examine in this article: the Hierarchical Data Format (HDF), a common file format used to capture scientific information, and Really Simple Syndication (RSS), the most commonly used format to spread news and rapidly changing information.
For example, consider a scenario in which a science instrument flown on a NASA satellite takes data, which is then downlinked via an overpass of one of any number of existing ground stations, as shown in figure 1. The downlinked data is then transferred via dedicated networks to a science data processing center for data transformation and, ultimately, for dissemination to the public.
In this scenario, the raw data arriving at the science data processing center represents engineering and housekeeping information, including raw voltages from the instrument and rudimentary location information (for example, an orbit number). This information is represented as a series of files: each file corresponds to one channel of data from the instrument (three channels in total), and there is one set of three files for each orbit of the satellite around the Earth.
Data within each channel file is stored initially in a binary data format; for the purposes of this example we’ll assume the widely used Hierarchical Data Format (HDF), version 5 (HDF-5). HDF is a binary data format that provides an external, user-facing API for writing to and reading from HDF-5 files. The HDF-5 API allows users to write data using a small canonical set of data constructs, specifically those shown in table 1 below.
All of the data and metadata from our postulated scenario is represented in a set of three HDF-5 files (one for each channel of the instrument) for each of the orbits that the satellite makes around the Earth. That means that, if the instrument is measuring a set of scientific variables, such as air temperature, wind speed, CO2, or any number of other interesting variables, then that information is represented in the HDF-5 files as sets of named scalars, vectors, and matrices.
Of course, that’s just one example of how content is stored inside of a file. Consider another scenario, such as a Really Simple Syndication (RSS) feed file that lists the latest news stories provided by CNN.com, an example of which is provided below in figure 2.
RSS files are based around a simple but powerful data model. Each RSS file is an XML file adhering to a prescribed XML schema defining the RSS vocabulary. That vocabulary consists of two main data structures. First, each RSS file typically contains a Channel, which aggregates a set of associated RSS items. Second, each RSS item typically points to some news story of interest. Every RSS Channel has a set of metadata associated with it, such as a URL and description (for example, http://www.cnn.com/sports/ for the URL and “Latest news stories about sports within the last hour” as the description), as does each RSS item tag.
In the CNN example, CNN publishes a set of RSS files, each containing an RSS Channel, one for each CNN news category (such as Top Stories, World, the United States, or any of the other left-hand column categories in figure 2). Each RSS Channel has a corresponding set of latest news stories and links that users can subscribe to via any number of different free RSS readers, including most modern web browsers.
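The Channel-and-items model described above can be sketched in a few lines of Java. This is only an illustration using the JDK's built-in DOM parser, not how Tika itself handles feeds; the class names `RssModelSketch`, `Channel`, and `Item` are hypothetical, and the sketch assumes the common `channel`, `item`, `title`, `link`, and `description` element names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical in-memory model of an RSS Channel and its items. */
final class RssModelSketch {

    /** One news story: a title plus the link it points to. */
    static final class Item {
        final String title;
        final String link;

        Item(String title, String link) {
            this.title = title;
            this.link = link;
        }
    }

    /** Channel-level metadata plus the items the channel aggregates. */
    static final class Channel {
        final String title;
        final String link;
        final String description;
        final List<Item> items = new ArrayList<>();

        Channel(String title, String link, String description) {
            this.title = title;
            this.link = link;
            this.description = description;
        }
    }

    /** Text of the first direct child element with the given name, or "". */
    private static String childText(Element parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && name.equals(n.getNodeName())) {
                return n.getTextContent().trim();
            }
        }
        return "";
    }

    /** Builds the whole document tree in memory, then walks it. */
    static Channel parse(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element channelEl = (Element) doc.getElementsByTagName("channel").item(0);
            Channel channel = new Channel(
                    childText(channelEl, "title"),
                    childText(channelEl, "link"),
                    childText(channelEl, "description"));
            NodeList itemEls = channelEl.getElementsByTagName("item");
            for (int i = 0; i < itemEls.getLength(); i++) {
                Element itemEl = (Element) itemEls.item(i);
                channel.items.add(
                        new Item(childText(itemEl, "title"), childText(itemEl, "link")));
            }
            return channel;
        } catch (Exception e) {
            // Covers malformed XML as well as a missing <channel> element.
            throw new IllegalArgumentException("Not a parseable RSS document", e);
        }
    }
}
```

Calling `RssModelSketch.parse(feedXml)` on a feed string yields a `Channel` whose `items` list holds one `Item` per story, mirroring the two data structures of the RSS vocabulary.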
Understanding the types of content is the first step towards automatically extracting information from it. We’ll go into the details of that in the next section, describing how Tika codifies the process of extracting content.
The mechanism by which a file is stored on media may convey useful information worthy of Tika’s extraction. These mechanisms include the logical representation of files in storage, such as links (for example, symbolic links), as well as the notion that a file can be a set of independent physical files linked together in some way.
Files can be physically stored on a single disk or across the network. Sometimes files are physically distributed, as in networked filesystems like Google File System (GFS) or Hadoop Distributed File System (HDFS), but centrally represented via a collection of network data blocks or some other higher-order structure. We’ll discuss how Tika’s use of the InputStream abstraction hides some of this complexity and uniqueness.
Individual files may be stored on disk as part of a larger whole of logically or physically linked files via some mechanism such as a common collection label, or a unique directory to collect the files together. Tika doesn’t care because it has the ability to exploit information from either case. Madness, you say? Read on!
Let’s postulate a simple example of software deployment to illustrate how the logical representation of files and directories may convey otherwise-hidden meaning that we’ll want to bring out in the open using Tika. Take, for example, the software deployment scenario in figure 3.
In our postulated scenario, software is extracted from a configuration management system, let’s say Apache Subversion, and then run through a deployment process that installs the latest and greatest version of the software into the /deploy directory, giving the installed software a unique version number. As a result of this process, a symbolic link titled current is also updated to point to the most recently installed version of the software.
If we expand our focus beyond the logical links between files and consider how those files are actually represented on disk, we arrive at a number of interesting information sources ripe for extraction. For example, as more and more filesystems move beyond simple local disks to federations and farms of storage devices, we are faced with an interesting challenge. How do we deal with the extraction of information from a file if we only have available to us a small unit of that file? Even worse, what do we do if that small unit is not an information-rich unit like the file header?
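The deployment step in this scenario can be sketched with the JDK's `java.nio.file` API to show where the hidden metadata lives. This is a minimal sketch under stated assumptions: the class `DeployLinkSketch`, its method names, and the idea of naming each installed directory after its version string are hypothetical, and it assumes a filesystem that supports symbolic links.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical model of the "versioned directory plus current link" layout. */
final class DeployLinkSketch {

    /** Creates deployRoot/version and repoints deployRoot/current at it. */
    static Path install(Path deployRoot, String version) {
        try {
            Path versionDir = Files.createDirectories(deployRoot.resolve(version));
            Path current = deployRoot.resolve("current");
            // Deletes only the link itself, never the directory it points to.
            Files.deleteIfExists(current);
            // Relative target, so the link stays valid if deployRoot is moved.
            Files.createSymbolicLink(current, versionDir.getFileName());
            return current;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Recovers the version number that the "current" link encodes, if any. */
    static Optional<String> currentVersion(Path deployRoot) {
        try {
            Path current = deployRoot.resolve("current");
            if (!Files.isSymbolicLink(current)) {
                return Optional.empty();
            }
            return Optional.of(Files.readSymbolicLink(current).getFileName().toString());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The point of `currentVersion` is that the version is never written inside any file: it exists only in the link's target, which is exactly the kind of otherwise-hidden information worth extracting.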
The reality is we need a technology and software approach that can abstract away the mechanism by which the file is actually stored. If the storage mechanism and physical file representation were abstracted away, then the extraction of text and agglomeration of metadata derived from a file could easily be fed into Tika’s traditional extraction processes that we’ve covered so far.
This is precisely the reason that Tika uses the InputStream as the core data-passing interface to its Parser implementations via the parse(…) method. InputStreams hide the underlying storage and protocol used to physically represent file contents (or sets of files). In fact, whether it’s a Google File System URL pointer to a file that’s distributed as “blocks” over the network or a URL pointer to a file that’s locally on disk, Tika still deals with the information as an InputStream via a call to URL.openStream. URLs aren’t the only means of getting InputStreams, either: they can be generated from Files, byte arrays, and all sorts of other objects, which makes InputStream the right choice for Tika’s abstraction over physical file storage.
We started out by showing you how file content organization can affect performance and memory properties and influence how Tika parses out information and metadata. In the case of RSS, its content organization (based on XML) allows for easy streaming and random access, whereas in the case of HDF-5, the entire file had to be read into memory, precluding streaming but supporting random access.
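To make the abstraction concrete, here is a minimal sketch that uses only the JDK, not Tika itself. The `consume` method is a hypothetical stand-in for a parser: it receives an InputStream and cannot tell whether the bytes came from memory, a local file, or a URL. The class and method names are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Shows that one consumer works unchanged across different storage sources. */
final class StreamSourceSketch {

    /** Stand-in for a parser: it sees only bytes, never the storage behind them. */
    static String consume(InputStream in) {
        try (InputStream stream = in) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = stream.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Feeds the same content through three different sources and compares. */
    static boolean sameFromEverySource(String text) {
        try {
            byte[] data = text.getBytes(StandardCharsets.UTF_8);
            Path file = Files.createTempFile("stream-sketch", ".txt");
            Files.write(file, data);

            // 1. A byte array held in memory.
            String fromMemory = consume(new ByteArrayInputStream(data));
            // 2. A file on the local disk.
            String fromDisk = consume(Files.newInputStream(file));
            // 3. A URL, opened with URL.openStream as described above.
            String fromUrl = consume(file.toUri().toURL().openStream());

            Files.delete(file);
            return fromMemory.equals(text) && fromDisk.equals(text) && fromUrl.equals(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Here the URL happens to be a local `file:` URL so the sketch runs anywhere; the same `consume` call would apply to any other URL scheme for which the JVM has a handler installed.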
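The streaming side of this contrast can be illustrated with the JDK's StAX API, which reads an XML document as a sequence of events instead of building a tree. The sketch below is an assumption-laden illustration rather than Tika's own feed handling: the class name `RssStreamingSketch` is hypothetical, and it assumes items are marked with `item` and `title` elements.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Pulls item titles out of an RSS stream one event at a time. */
final class RssStreamingSketch {

    /** Collects each item's title in document order without building a tree. */
    static List<String> itemTitles(InputStream in) {
        List<String> titles = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(in);
            boolean insideItem = false;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("item".equals(name)) {
                        insideItem = true;
                    } else if (insideItem && "title".equals(name)) {
                        // Reads the element's text and advances to its end tag.
                        titles.add(reader.getElementText().trim());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "item".equals(reader.getLocalName())) {
                    insideItem = false;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not a parseable XML stream", e);
        }
        return titles;
    }
}
```

Because only the current event is examined at any moment, memory use does not grow with the structure of the feed beyond the list of titles being collected; the channel-level title is skipped because it occurs outside any item.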
The last important aspect of files is the physical location of a file (or set of associated files) on disk. In many cases, individual files are part of some larger conglomerate, for example, in the case of directories and split files generated by archive/compression utilities.
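For the split-file case, one way to present several physical parts as a single logical file is the JDK's `SequenceInputStream`, which reads each part in turn. This is a sketch of the idea only, not a description of what Tika does; the class `SplitFileSketch` and its methods are hypothetical, and the caller is assumed to supply the parts in the correct order.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Presents an ordered set of physical parts as one logical stream. */
final class SplitFileSketch {

    /** Concatenates already-open streams; each part is read to its end in order. */
    static InputStream join(List<? extends InputStream> orderedParts) {
        return new SequenceInputStream(Collections.enumeration(orderedParts));
    }

    /** Convenience for parts on disk; a fuller version would open them lazily. */
    static InputStream joinFiles(List<Path> orderedParts) {
        List<InputStream> streams = new ArrayList<>();
        try {
            for (Path part : orderedParts) {
                streams.add(Files.newInputStream(part));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return join(streams);
    }

    /** Drains a stream into a UTF-8 string, closing it afterwards. */
    static String readAll(InputStream in) {
        try (InputStream stream = in) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = stream.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A consumer that accepts an InputStream can then read the joined stream without knowing that the content was ever split across files.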