The Standards of Metadata

This article is based on Tika in Action, to be published in Summer 2011. It is reproduced here by permission of Manning Publications. Manning publishes MEAP (Manning Early Access Program) eBooks and pBooks. MEAPs are sold exclusively through Manning.com. All pBook purchases include free PDF, mobi, and epub versions. When mobile formats become available, all customers will be contacted and upgraded. Visit Manning.com for more information. [ Use promotional code ‘java40beat’ and get 40% discount on eBooks and pBooks ]

Introduction

While PDF file properties and HTML page properties are useful for making decisions such as “Do I care to read this research paper?” or “Is this web page the one I was looking for?”, the property names themselves don’t tell you everything you need to know in order to make use of them. For example, is “PDFVersion” an integer or an alphanumeric string? Knowing this would let you compare the “PDFVersion” attributes of different files. What about “Author”? Is it multivalued, meaning that a paper can have multiple authors, or is it only single-valued?

To answer these questions, we usually turn to metadata standards or metadata models. Standards describe all sorts of information about metadata, such as the cardinality of fields, relationships between fields, valid values and ranges, and field definitions, to name a few. Some representative properties of metadata standards are given in table 1.

The International Organization for Standardization (ISO) has published a reference standard for the description of metadata elements as part of metadata models, numbered ISO/IEC 11179. ISO/IEC 11179 prescribes a generally accepted mechanism for defining metadata models. There are tons of metadata models out there, and they can be loosely classified as either general models or content-specific models, as depicted in figure 1.

Dublin Core is a general metadata model consisting of fewer than 20 attributes (such as Creator, Publisher, and Format) that are said to describe any electronic resource. On the other side of the coin are content-specific models, which are unique to a particular file type and contain only the metadata elements and descriptions relevant to that content type. Examples of these models are the Federal Geographic Data Committee (FGDC) model for describing spatial data files and Adobe XMP, a metadata standard for media files (like images and videos).

Tika supports both general and content-specific metadata standards. You can get a list of the standard metadata models supported by your version of Tika via the --list-met-models option provided by the Tika command-line interface.

java -jar tika-app-0.8-SNAPSHOT.jar --list-met-models

Or, you can print the same list programmatically by calling the TikaCLI from a Java program.
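As a minimal sketch of that programmatic route, the class below invokes TikaCLI's main method with the same option. It assumes the tika-app jar is on the classpath and that the CLI class is org.apache.tika.cli.TikaCLI. The class name ListMetadataModels and the use of reflection (so that the sketch still compiles and runs when the jar is missing) are our own illustrative choices, not part of Tika.

```java
import java.lang.reflect.Method;

class ListMetadataModels {
    // The same option used on the command line above.
    static final String OPTION = "--list-met-models";

    /**
     * Invokes TikaCLI's main method with the listing option.
     * Returns false (and prints nothing) if tika-app is not on the classpath.
     */
    static boolean listModels() throws Exception {
        Class<?> cli;
        try {
            // Assumed class name of the Tika command-line entry point.
            cli = Class.forName("org.apache.tika.cli.TikaCLI");
        } catch (ClassNotFoundException e) {
            return false;
        }
        Method main = cli.getMethod("main", String[].class);
        // The cast keeps the String[] from being spread as varargs.
        main.invoke(null, (Object) new String[] {OPTION});
        return true;
    }

    public static void main(String[] args) throws Exception {
        if (!listModels()) {
            System.err.println("tika-app jar not found on the classpath");
        }
    }
}
```

If the jar is available at compile time, calling TikaCLI.main directly with the same argument array is simpler and avoids reflection.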

The version of Tika at the time of writing this book (0.8-SNAPSHOT) supports the following metadata models, as output by the DescribeMetadata tool. Each metadata model name is shown without a left indent, and the model’s associated metadata attributes are shown capitalized and indented one space from the left.

ClimateForecast
 ACKNOWLEDGEMENT
 COMMAND_LINE
 COMMENT
 CONTACT
 CONVENTIONS
 EXPERIMENT_ID
 HISTORY
 INSTITUTION
 MODEL_NAME_ENGLISH
 PROGRAM_ID
 PROJECT_ID
 REALIZATION
 REFERENCES
 SOURCE
 TABLE_ID
CreativeCommons
 LICENSE_LOCATION
 LICENSE_URL
 WORK_TYPE
DublinCore
 CONTRIBUTOR
 COVERAGE
 CREATOR
 DATE
 DESCRIPTION
 FORMAT
 IDENTIFIER
 LANGUAGE
 MODIFIED
 PUBLISHER
 RELATION
 RIGHTS
 SOURCE
 SUBJECT
 TITLE
 TYPE
Geographic
 ALTITUDE
 LATITUDE
 LONGITUDE
HttpHeaders
 CONTENT_DISPOSITION
 CONTENT_ENCODING
 CONTENT_LANGUAGE
 CONTENT_LENGTH
 CONTENT_LOCATION
 CONTENT_MD5
 CONTENT_TYPE
 LAST_MODIFIED
 LOCATION
MSOffice
 APPLICATION_NAME
 APPLICATION_VERSION
 AUTHOR
 CATEGORY
 CHARACTER_COUNT
 CHARACTER_COUNT_WITH_SPACES
 COMMENTS
 COMPANY
 CONTENT_STATUS
 CREATION_DATE
 EDIT_TIME
 KEYWORDS
 LAST_AUTHOR
 LAST_PRINTED
 LAST_SAVED
 LINE_COUNT
 MANAGER
 NOTES
 PAGE_COUNT
 PARAGRAPH_COUNT
 PRESENTATION_FORMAT
 REVISION_NUMBER
 SECURITY
 SLIDE_COUNT
 TEMPLATE
 TOTAL_TIME
 VERSION
 WORD_COUNT
Message
 MESSAGE_BCC
 MESSAGE_CC
 MESSAGE_FROM
 MESSAGE_RECIPIENT_ADDRESS
 MESSAGE_TO
TIFF
 BITS_PER_SAMPLE
 EQUIPMENT_MAKE
 EQUIPMENT_MODEL
 EXPOSURE_TIME
 FLASH_FIRED
 FOCAL_LENGTH
 F_NUMBER
 IMAGE_LENGTH
 IMAGE_WIDTH
 ISO_SPEED_RATINGS
 ORIENTATION
 ORIGINAL_DATE
 RESOLUTION_HORIZONTAL
 RESOLUTION_UNIT
 RESOLUTION_VERTICAL
 SAMPLES_PER_PIXEL
 SOFTWARE

Now that you know which metadata models Tika supports and that there is a difference between the models (in other words, not all are created equal!), let’s explore the variations between generic and content-specific metadata models more precisely. We’ll use Tika to help us out.

General standards

Most electronic files available via the Internet share a common set of metadata properties, which together make up what we call general metadata models or general standards for metadata. General models describe electronic resources at a high level: “Who authored the content?”, “What format(s) is the content represented in?”, and the like.

To illustrate, let’s take a look at the attributes of the Dublin Core metadata model, as output by Tika, which supports Dublin Core. Recall the command we showed you above.

java -jar tika-app-0.8-SNAPSHOT.jar --list-met-models

By exploring the output of the above command a bit, and by using a simple grep command, we can filter the --list-met-models output to isolate only the Dublin Core part.

java -jar tika-app-0.8-SNAPSHOT.jar --list-met-models | grep -A16 DublinCore

Which produces the output:

DublinCore
 CONTRIBUTOR
 COVERAGE
 CREATOR
 DATE
 DESCRIPTION
 FORMAT
 IDENTIFIER
 LANGUAGE
 MODIFIED
 PUBLISHER
 RELATION
 RIGHTS
 SOURCE
 SUBJECT
 TITLE
 TYPE

Looking at some of these attributes, it’s pretty clear that most or all of them are highly representative of all electronic documents. Think back to table 1. What would the valid values be for something like the FORMAT attribute? Most of the time the metadata field is filled with a valid MIME media type. What would the cardinality be for something like the DATE attribute? Often, electronic documents have a single creation date, but perhaps many last modified dates, so the cardinality is one or more values.
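To make valid values and cardinality concrete, here is a small self-contained sketch of how a field definition could encode both. Everything in it is hypothetical and for illustration only: the FieldDefinition class, the loose MIME type pattern, the choice of ISO dates, and the assumption that FORMAT holds exactly one value while DATE holds one or more. None of it is Tika's or Dublin Core's own API.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

class FieldDefinition {
    // Loose "type/subtype" check; real MIME type validation is stricter.
    static final Pattern MIME =
        Pattern.compile("[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]*/[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]*");

    // Assumed for illustration: FORMAT holds exactly one MIME type.
    static final FieldDefinition FORMAT =
        new FieldDefinition("FORMAT", 1, 1, v -> MIME.matcher(v).matches());

    // Assumed for illustration: DATE holds one or more ISO dates (yyyy-MM-dd).
    static final FieldDefinition DATE =
        new FieldDefinition("DATE", 1, Integer.MAX_VALUE, FieldDefinition::isIsoDate);

    final String name;
    final int minValues;
    final int maxValues; // Integer.MAX_VALUE means "unbounded"
    final Predicate<String> validValue;

    FieldDefinition(String name, int minValues, int maxValues, Predicate<String> validValue) {
        this.name = name;
        this.minValues = minValues;
        this.maxValues = maxValues;
        this.validValue = validValue;
    }

    /** True if the number of values fits the cardinality and every value is valid. */
    boolean accepts(List<String> values) {
        return values.size() >= minValues
            && values.size() <= maxValues
            && values.stream().allMatch(validValue);
    }

    static boolean isIsoDate(String value) {
        try {
            LocalDate.parse(value);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }
}
```

With these definitions, FORMAT accepts a single value such as application/pdf but rejects two values or free text, while DATE accepts any non-empty list of well-formed dates.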

Let’s now focus on content-specific metadata models.

Content-specific metadata standards

Generic metadata standards and models are great because they address two fundamentally important facets of capturing and using metadata:

  • Filling in at least some value per field: Generic metadata standards provide at least some value for each field because they’re generic. For a PDF file, for example, the Dublin Core FORMAT and TITLE fields might be captured as application/pdf and mypdffile.pdf, respectively.
  • Easily comparable: Beyond usually having some value, the attributes themselves are so general that they are likely to mean the same thing from one document to the next. (It’s pretty clear what a TITLE refers to for a document.)

On the other hand, content-specific metadata standards and models are less likely to fulfill either of the above properties. First, they aren’t guaranteed to fill in a value for any particular field. Take MS Office files and their COMPANY field, listed using the same grep trickery we showed you above:

java -jar target/tika-app-0.8-SNAPSHOT.jar --list-met-models | grep -A28 MSOffice
MSOffice
 APPLICATION_NAME
 APPLICATION_VERSION
 AUTHOR
 CATEGORY
 CHARACTER_COUNT
 CHARACTER_COUNT_WITH_SPACES
 COMMENTS
 COMPANY
 CONTENT_STATUS
 CREATION_DATE
 EDIT_TIME
 KEYWORDS
 LAST_AUTHOR
 LAST_PRINTED
 LAST_SAVED
 LINE_COUNT
 MANAGER
 NOTES
 PAGE_COUNT
 PARAGRAPH_COUNT
 PRESENTATION_FORMAT
 REVISION_NUMBER
 SECURITY
 SLIDE_COUNT
 TEMPLATE
 TOTAL_TIME
 VERSION
 WORD_COUNT

COMPANY is only filled out in an MS Word file’s metadata when the company name has been entered by the user or owner of the MS Office suite installed on the computer that created the file. In short, if you didn’t fill out the Company field when registering your copy of MS Office, and you begin sharing MS Word files with your colleagues, and they want to use Tika to find out what company you work for, they are out of luck. (For privacy-minded people, though, this is a good thing!)

As for being easily comparable, this is another area where content-specific metadata models do not provide a silver bullet. The LAST_MODIFIED field in the HttpHeaders metadata model does not correspond directly to the MODIFIED field in the DublinCore model, nor does it correspond to the LAST_SAVED field in the MSOffice metadata model. So, content-specific metadata model attributes are not easily comparable across metadata models.

Most document formats have a content-specific or file-specific metadata model associated with them (even in the presence of a general model, like Dublin Core). There is the Extensible Metadata Platform (XMP), pioneered by Adobe for media file formats (images, videos, and so on); there’s a whole slew of MS Office metadata formats; there are metadata models for JPEG files; there’s the Climate Forecast metadata model for climate-related science files; and there are other metadata formats (like FITS) for science files in the astrophysics community. The good news is this: Tika already supports a slew of existing content-specific metadata models, and where it doesn’t, it’s extensible and allows you to add your own metadata models and attributes/specifications that you can leverage in your own content-specific applications.

Next, we’ll tell you a little bit about metadata quality and how it influences all sorts of things like comparing, understanding, and validating metadata.

Metadata quality

The biggest thing we’ve glossed over while informing you about the wonders of metadata until now is, “So, how does that metadata get populated?” This is a great question. There are plenty of ways. The application program that generates a particular file (for example, MS Office generates Word documents, PowerPoint files, and so on, and Adobe Photoshop generates PDF files) can be responsible for annotating a file with metadata.

An alternative is that a user may explicitly fill out metadata about the file on their own when authoring it. Many software project management tools (like MS Project or Fastrack) prompt a user to fill out basic metadata fields (Title, Duration, Start Project Date, End Project Date, and so on) when authoring the file.

Sometimes, downstream software programs author metadata about files. A classic example of this is when a web server returns metadata about the file content it is delivering in response to a user request. The web server was not the originator of the file; however, it has the ability to tell a requesting user things like file size, content type (or MIME type), and other useful properties. An example of this is depicted in figure 2.

In figure 2, a content creator authors a file in MS Word. During that process, Word annotates the file with basic MSOffice metadata, including AUTHOR and PAGE_COUNT. After the file is created, the content creator may publish her file on the Apache HTTPD web server, where it will be available for downstream users to acquire. When a downstream user requests the file from Apache HTTPD, the web server will annotate the file with HttpHeaders metadata, including CONTENT_TYPE and other metadata.
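The last step of that flow can be sketched in plain Java, without Tika. In this rough, self-contained illustration, a throwaway local server built on the JDK's com.sun.net.httpserver package stands in for Apache HTTPD, and a client reads the response headers the server attaches to the content. The class name HttpHeaderMetadata, the /report.doc path, and the header values are all made up for the example.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class HttpHeaderMetadata {

    /** Reads a few response headers that the server attaches to the content it delivers. */
    static Map<String, String> read(URI uri) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) uri.toURL().openConnection();
        try {
            Map<String, String> meta = new LinkedHashMap<>();
            meta.put("Content-Type", conn.getContentType());
            meta.put("Content-Length", Long.toString(conn.getContentLengthLong()));
            meta.put("Last-Modified", conn.getHeaderField("Last-Modified"));
            return meta;
        } finally {
            conn.disconnect();
        }
    }

    /** Starts a local stand-in server, fetches the headers once, and stops the server. */
    static Map<String, String> demo() throws IOException {
        byte[] body = "stand-in for a Word document".getBytes(StandardCharsets.UTF_8);
        // Port 0 asks the operating system for any free port.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/report.doc", exchange -> {
            // The server, not the file's author, supplies these values.
            exchange.getResponseHeaders().set("Content-Type", "application/msword");
            exchange.getResponseHeaders().set("Last-Modified", "Tue, 01 Mar 2011 12:00:00 GMT");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        try {
            int port = server.getAddress().getPort();
            return read(URI.create("http://127.0.0.1:" + port + "/report.doc"));
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws IOException {
        demo().forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```

Note that none of these values come from the document itself: the server decides them at delivery time, which is exactly why they can disagree with what the authoring application recorded.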

With all of these actors in the system, it’s no wonder that metadata quality, or the examination and assessment of captured metadata for file types, is a big concern. In any one of the steps in figure 2, the metadata for the file could be changed or simply not populated, affecting some downstream user of the file, or some software that must make sense of it later. What’s more, even if the metadata is populated, it’s often difficult to compare metadata captured in different files, even when that metadata in fact represents the same concept. This is often due to each metadata model using its own terms, potentially its own units for those terms, and, ultimately, its own definitions for those terms as well.

Metadata quality is of prime importance, especially in the case of correlating metadata for files of different types and, most often, different metadata models. For a writer of software that must deal with thousands of different file types and metadata models every day, it’s no easy challenge to tackle metadata correlation.

Here comes Tika to save the day again!

Unifying heterogeneous standards

Lucky for us, Tika’s metadata layer is designed with exactly the aforementioned metadata quality challenges in mind. Tika provides a Property class that implements the Adobe XMP standard for capturing metadata attributes. XMP defines a property (called PropertyType in Tika) as some form of metadata captured about an annotated document. XMP also defines property values that are captured for each property of metadata. In Tika we call XMP property values ValueTypes. Let’s take a quick look at a snippet of the Tika Property class.

Listing 1 Tika’s Property class and its support for XMP-like metadata.

public final class Property {
    // How a property's values are arranged (single value, ordered list, and so on)
    public static enum PropertyType {
        SIMPLE, STRUCTURE, BAG, SEQ, ALT
    }
    // What kind of value a property holds
    public static enum ValueType {
        BOOLEAN, OPEN_CHOICE, CLOSED_CHOICE, DATE, INTEGER, LOCALE,
        MIME_TYPE, PROPER_NAME, RATIONAL, REAL, TEXT, URI, URL, XPATH
    }
    // ...
}

The PropertyType and ValueType Java enums allow Tika to define a metadata attribute’s cardinality (for example, is it a SIMPLE value or a sequence of them, called SEQ for short), controlled vocabulary (for example, a CLOSED_CHOICE or a simple OPEN_CHOICE), and units (for example, a REAL or an INTEGER). Using Tika and its Property class, you are able to decide whether or not LAST_MODIFIED in the HttpHeaders metadata model is roughly equivalent, in terms of units, controlled vocabulary, and cardinality, to LAST_SAVED in the MSOffice metadata model.

These capabilities are useful in comparing, validating, and understanding metadata properties (recall from table 1 that these are important things to capture for each metadata element) and in dealing with heterogeneous metadata models and formats. Tika’s goal is to allow you to curate high-quality metadata in your software application.
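As a rough sketch of that kind of decision, the self-contained example below redeclares the two enums from listing 1 as stand-ins and compares attributes by property type and value type. The PropertyComparison and Attribute classes are hypothetical, and the typings assigned to LAST_MODIFIED, LAST_SAVED, and PAGE_COUNT are assumptions made for illustration; they are not read from Tika.

```java
class PropertyComparison {
    // Stand-ins that mirror the enums shown in listing 1.
    enum PropertyType { SIMPLE, STRUCTURE, BAG, SEQ, ALT }

    enum ValueType {
        BOOLEAN, OPEN_CHOICE, CLOSED_CHOICE, DATE, INTEGER, LOCALE,
        MIME_TYPE, PROPER_NAME, RATIONAL, REAL, TEXT, URI, URL, XPATH
    }

    /** Hypothetical description of one attribute in one metadata model. */
    static final class Attribute {
        final String model;
        final String name;
        final PropertyType propertyType;
        final ValueType valueType;

        Attribute(String model, String name, PropertyType propertyType, ValueType valueType) {
            this.model = model;
            this.name = name;
            this.propertyType = propertyType;
            this.valueType = valueType;
        }

        /** Same cardinality and same kind of value: a necessary, not sufficient, sign of equivalence. */
        boolean roughlyEquivalentTo(Attribute other) {
            return propertyType == other.propertyType && valueType == other.valueType;
        }
    }

    public static void main(String[] args) {
        // Assumed typings, chosen for illustration only.
        Attribute lastModified =
            new Attribute("HttpHeaders", "LAST_MODIFIED", PropertyType.SIMPLE, ValueType.DATE);
        Attribute lastSaved =
            new Attribute("MSOffice", "LAST_SAVED", PropertyType.SIMPLE, ValueType.DATE);
        Attribute pageCount =
            new Attribute("MSOffice", "PAGE_COUNT", PropertyType.SIMPLE, ValueType.INTEGER);

        System.out.println(lastModified.roughlyEquivalentTo(lastSaved)); // prints true
        System.out.println(lastModified.roughlyEquivalentTo(pageCount)); // prints false
    }
}
```

Matching types only show that two attributes can be compared mechanically. Whether they mean the same thing still depends on each model's definition of the term.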

Summary

We’ve helped you familiarize yourself with metadata models, Tika’s support for the different properties of metadata models, and most of the important challenges behind dealing with metadata models.

About Krishna Srinivasan

He is the founder and chief editor of JavaBeat. He has more than 8 years of experience developing web applications. He writes about Spring, DOJO, JSF, Hibernate, and many other emerging technologies on this blog.
