This article is based on ManifoldCF in Action, to be published in October 2011. It is reproduced here by permission from Manning Publications.
When you click the List Output Connections link in the navigation menu, you enter the area of the UI that manages output connection definitions. We will need an output connection definition of some kind in order to demonstrate web crawling, and one that uses the null output connector would be enough for that purpose. Nevertheless, let’s visit some UI pages so we can discuss them in greater depth.
Output connection definition list fields
The presented list of output connection definitions provides a little general information about each connection definition. While incomplete, it helps you to keep track of exactly which connection definition is which. Figure 1 shows the UI page in question. The fields are described in table 1.
Standard output connection definition tabs
The three tabs that make up the standard output connection definition tab set are the Name tab, the Type tab, and the Throttling tab. These tabs set the basic parameters of every output connection definition. It’s no accident that the information displayed in the output connection definition list comes from these same tabs.
The Name tab
Refer to figure 2 for a screenshot of the Name tab. This tab is used solely for setting a connection definition’s name and description.
As mentioned before, the name given must be unique among output connection definitions. The description can be anything, although you will benefit enormously from choosing one that is meaningful and helpful. The name field can contain any Unicode characters but is limited to 32 characters in total. The description field may also consist of any Unicode characters and can be up to 255 characters in length.
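The naming rules above can be sketched as a small validation helper. This is only an illustration, not ManifoldCF code: the class and method names are hypothetical, and it assumes that a "character" means a Unicode code point rather than a UTF-16 code unit, which the text does not specify.

```java
import java.util.Set;

// Hypothetical helper; not part of the ManifoldCF API.
class ConnectionFieldLimits {
    // Limits stated in the text above.
    static final int MAX_NAME_CHARS = 32;
    static final int MAX_DESCRIPTION_CHARS = 255;

    // Assumption: count Unicode code points, not UTF-16 code units.
    static int charCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // A name is acceptable if it fits the limit and no existing
    // output connection definition already uses it.
    static boolean isAcceptableName(String name, Set<String> existingNames) {
        return charCount(name) <= MAX_NAME_CHARS && !existingNames.contains(name);
    }

    // A description only has to fit the length limit.
    static boolean isAcceptableDescription(String description) {
        return charCount(description) <= MAX_DESCRIPTION_CHARS;
    }

    public static void main(String[] args) {
        Set<String> existing = Set.of("Null output");
        System.out.println(isAcceptableName("Solr index A", existing));
        System.out.println(isAcceptableName("Null output", existing));
    }
}
```

In this sketch the second call is rejected because the name is already taken, not because of its length.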
The Type tab
The output connection definition Type tab can be seen in figure 3. Use this tab to select the output connection type (which maps to the connector class by way of a database table).
The Throttling tab
Figure 4 shows what the Throttling tab looks like.
For an output connection definition, all you can do with this tab is set the Max Connections (per JVM) connection parameter. This parameter limits the number of active, connected instances of the underlying connector class that have exactly the same configuration. For example, you may have two Solr indexes and a connection definition for each one. Since the connection definition information differs between the two, ManifoldCF will keep two distinct pools of Solr connection instances and apply the corresponding limit to the number of instances managed by each pool.
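The idea of one pool per distinct configuration, each with its own limit, can be sketched as follows. This is a simplified model and not ManifoldCF's actual pooling code: the class name, the method names, and the use of a plain string as the configuration key are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical model of per-configuration connection limits.
class ConnectionPoolSketch {
    // One permit counter per distinct configuration key.
    private final Map<String, Semaphore> pools = new ConcurrentHashMap<>();

    // Tries to take a connection slot for the given configuration.
    // The limit is applied when the pool for that key is first created.
    boolean tryAcquire(String configKey, int maxConnections) {
        Semaphore pool = pools.computeIfAbsent(configKey, k -> new Semaphore(maxConnections));
        return pool.tryAcquire();
    }

    // Returns a slot to the pool belonging to the given configuration.
    void release(String configKey) {
        Semaphore pool = pools.get(configKey);
        if (pool != null) {
            pool.release();
        }
    }

    public static void main(String[] args) {
        ConnectionPoolSketch pools = new ConnectionPoolSketch();
        // Two Solr definitions with different configurations get separate pools.
        System.out.println(pools.tryAcquire("solr-index-a", 1));
        System.out.println(pools.tryAcquire("solr-index-a", 1));
        System.out.println(pools.tryAcquire("solr-index-b", 1));
    }
}
```

Exhausting the slots for one configuration has no effect on the other, which mirrors the two-Solr-index example.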
So, what should you set this parameter to? In part, the answer depends on the semantics of the target system. If each target system connection has a cost, or the target system limits the number of connections made to it, then you will want to control the number of instances based on that constraint. If there is no need for limits of any kind, you may safely set this parameter as high as the number of worker threads (30, by default). Setting the parameter higher than that does no harm but has little benefit, since there is a limit to the number of connection instances ManifoldCF can use at any given time in any case.
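One way to read the guidance above is as a simple rule: take the target system's connection limit if it has one, and never bother going above the worker thread count. The following sketch encodes that reading. The class and method names are hypothetical, and the rule itself is an interpretation of the text rather than anything ManifoldCF computes for you.

```java
import java.util.OptionalInt;

// Hypothetical helper for picking a Max Connections value.
class MaxConnectionsChooser {
    // Default worker thread count mentioned in the text.
    static final int DEFAULT_WORKER_THREADS = 30;

    // targetSystemLimit is empty when the target system imposes no constraint.
    static int suggest(OptionalInt targetSystemLimit, int workerThreads) {
        if (!targetSystemLimit.isPresent()) {
            // No constraint: going beyond the worker thread count adds little.
            return workerThreads;
        }
        // Respect the target's limit, but do not exceed the worker thread count.
        return Math.min(targetSystemLimit.getAsInt(), workerThreads);
    }

    public static void main(String[] args) {
        System.out.println(suggest(OptionalInt.empty(), DEFAULT_WORKER_THREADS));
        System.out.println(suggest(OptionalInt.of(10), DEFAULT_WORKER_THREADS));
    }
}
```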
Table 2 describes the target system constraints for the output connectors currently part of ManifoldCF.
Output connection status
Whenever you finish editing an output connection definition, or click the View link in the output connection definition list, you will see a connection definition information and status summary screen similar to figure 5.
This screen has two important purposes: first, letting you see what you actually have set up for your connection definition parameters, and, second, showing you the status of a connection created with your connection definition. The connection status usually represents the results of an actual interaction with the target system. It is designed to validate the connection definition parameters you’ve described. However, it can also be very helpful in diagnosing transient connection problems that might arise (for example, expired credentials or a crashed Solr instance). This makes the connection definition information and status summary screen perhaps the best invention since the Chinese food buffet, so please make it a point to visit this screen as part of your diagnostic bag of tricks.
Setting up connection definitions is a fundamental function of the ManifoldCF crawler UI. Each kind of connection definition has its own link in the navigation area. This article covered output connection definitions.