Setting up the Semantic Editor

Required modules

The Semantic Editor makes use of several modules and libraries. Please install them in the given order so that some preconfiguration can be done automatically. (See here for manual configuration.)

Wysiwyg module and TinyMCE editor

First, you need to install the 3rd party Drupal module Wysiwyg (get here). The Wysiwyg module allows for neat integration of several graphical text editors in Drupal. After enabling the Wysiwyg module, you will need to install TinyMCE. Installation instructions can be found here. In order to check whether TinyMCE is installed correctly, go to its admin page on Site Configuration -> Wysiwyg profiles. TinyMCE should appear with version information.

Note: Hiding the TinyMCE path bar does not work with the Wysiwyg module (6.x-2.4) and TinyMCE 3.5. In the WissKI editor, the path bar should generally be hidden, though, to avoid rendering issues. Please use the 3.4 branch of TinyMCE instead.


Textproc module

The Textproc module (on Github) is responsible for automatic text analysis and ontological modeling of text in WissKI.

Semantic Editor module

The Semantic Editor module (on Github) enables the graphical text editor to make semantic annotations.

For showing infoboxes about annotated instances, the Semantic Editor module makes use of the qTip library (version 1). You can download it here. The Javascript library file jquery.qtip.js should be placed under tooltip_plugin/lib/ in the module directory.

Search Entities module (optional)

The Search Entities module (on Github) is an add-on for the Semantic Editor. It provides a search dialog for quick search of entities that can be annotated. The module requires a newer version of jQuery than the one shipped with Drupal 6. Therefore, you also need the jQuery Update module. You can get it here.

Configuration of the Textproc Module

The module has two functions: Managing how text is modeled according to the ontology and providing methods to automatically analyse text.

Both functions can be configured via the four tabs on the admin pages under Site Configuration -> WissKI module settings -> Textproc.

In order to get started, you only need to specify the modeling of text as described in the next subsection.

Text and ontology


As in wikis, WissKI allows objects to be described by free text. This text is not only displayed together with the instance; the text itself is also modeled as an instance that is linked on the ontology level with the object instance. That is, you can distinguish between the object and a description of the very same object. Consequently, you can express, for example, that the description mentions some other objects without implying that the object itself somehow refers to them.

The system needs to know how to model texts and links to objects. The modeling information is provided by defining a special pathbuilder group with paths. See this tutorial on how to create paths and groups. A sample group using the ECRM (versions 101001 & 120111) can be downloaded here or here. You may import it into your path definitions. The group and paths can be set in the Document Settings tab through the four select boxes:

  • Group for documents This field should point to the special group. It defines as an instance of which ontology concept the text will be modeled. If you use the sample path definitions, you should be able to select a group called "Document".
  • Path for subject of a document This field defines how the text is linked to the instance it describes (the one with which the text is displayed). If you use the sample path definitions, set it to "Topic".
  • Path for referred objects This field defines how the text is linked to instances it mentions, i.e. that are annotated in the text. In the sample paths, it's "refers to".
  • Path for creator of a document This field is experimental. It should normally be set to "<none>".

Furthermore, you can set the default language of the texts in your system. You cannot currently set the language for individual texts; the default language setting applies to all texts in the system. Only German and English are supported. However, most pre-installed analysis methods are configured to work sufficiently well with any (European) language, so you may just try out other languages.

Automatic text analysis


The automatic text analysis can be configured on the other three tabs. The default tab List shows a list of all active components for analysing text. For each component, its user-defined name and its type (i.e. functionality) are given. Each component can be configured by clicking on the Edit link.

During the installation process of the Textproc module, the "Default vocabulary detection" component will already be configured and activated. It detects mentions of vocabulary entries, both local and imported, and marks them according to the defined pathbuilder groups.

Adding and editing an analysis method


The user may add other analysis components for a better automatic detection of entities. The Add tab shows a list of available component types together with a short description of what they can detect. By clicking on a link, the user may create and configure a new component.

Vocabulary detection


This is always pre-installed as the default vocabulary detection. It scans the text for occurrences of vocabulary entries. The method looks up entries in all vocabularies defined in the Vocabulary Control module that have been indexed. (Vocabularies from the local store are always indexed.) If a text snippet matches a vocabulary entry, an annotation is added to the snippet that links it with the entity the vocabulary entry represents. E.g. if you have defined a vocabulary of persons with their names and you run the automatic text analysis by pressing the Send button in the editor, the method will look for occurrences of the persons' names. If it finds one, it will create an annotation of group "Person" (displayed with the appropriate icon and color) that links to the specified person.

As it is pre-installed, you usually need not add that component by yourself. On the configuration page, you can tune the parameters of the analysis algorithm. This is usually not necessary as the default values should work in most cases. The fields are:

  • Place coordinates check These settings are only of interest if you have instances representing geo-referenced places and the vocabularies in question provide coordinate information via the latitude/longitude fields. Such instances may be reranked according to their geographic distance to some reference points. Reference points may either be set statically (in the "Preferred coords" textarea) or calculated from other approved annotations in the text.
    • Place classes Here you can specify which pathbuilder groups' instances should be treated as places. WissKI expects a whitespace-separated list of group IDs.
    • Use coordinates of approved annotations Toggle this if you want approved place annotations to act as reference points. If enabled, already annotated places will tend to attract new place annotations, i.e. places in the neighborhood of existing places will be preferred.
    • Preferred coords Here you can specify the static reference points, one per line, where each line contains <latitude> <longitude>.
    • Latitude factor, Longitude factor Both fields define a factor for the latitudinal/longitudinal difference that determines how much impact the latitude/longitude has on the reranking. For example, if you only want to rerank places diverging from a certain latitude, you may set the Longitude factor to 0. Negative values favour places in the neighborhood of reference points, positive values favour remote places.
  • The next 5 textfields give the weights and factors for exact and partial hits. The default values should work sufficiently well in most scenarios.
    • Rank offset exact Rank of a one-word hit, like "John", "Germany", "Chair", etc...
    • Rank offset contains Rank of a complete multi-word hit, like "John Smith", "New York", etc...
    • Rank offset length contains Additional rank for each word in a complete multi-word hit
    • Rank offset guess Rank of a partial multi-word hit, like only "John" for "John Smith", only "York" for "New York", etc...
    • Rank offset length guess Additional rank for each word in a partial multi-word hit
  • Has lemma, Has pos These fields are only of interest if you use a preprocessor that provides lemmata and/or part-of-speech (POS) tags. For each group, you may specify positive or negative weights that rerank a hit if the word(s) have lemmata or a certain POS. For example, "Bath" may be either an ordinary word or a settlement. In order to suppress misdetection as a place, you can define that a token "Bath" tagged as a common noun (as opposed to a proper noun) should be downranked. "Has lemma" accepts one factor for each field; "Has pos" accepts one weight per POS per line, the syntax being <POS> <factor>.
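To make the coordinate-based reranking concrete, here is a minimal sketch of the idea. All names and the exact formula are illustrative assumptions, not WissKI's actual implementation; the real component combines this with the rank weights described above.

```python
def rerank_place(rank, lat, lon, ref_points, lat_factor=-1.0, lon_factor=-1.0):
    """Adjust a place hit's rank by its distance to the nearest reference point.

    Negative factors favour places near the reference points; positive
    factors favour remote places (as described in the settings above).
    """
    if not ref_points:
        return rank
    # Pick the reference point with the smallest combined lat/long difference.
    dlat, dlon = min(
        ((abs(lat - rlat), abs(lon - rlon)) for rlat, rlon in ref_points),
        key=lambda d: d[0] + d[1],
    )
    return rank + dlat * lat_factor + dlon * lon_factor
```

With the default negative factors, a candidate place located exactly at a reference point keeps its rank, while more distant candidates are downranked in proportion to the coordinate difference.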

Person Name Detection


This component detects person names that are not in a vocabulary, i.e. it tries to identify persons that are so far unknown to the system. As such, it will always create an annotation for a new instance of the person group. If the vocabulary detection detects an existing person, the existing person will be linked instead of a new instance. On the configuration page, there are the following settings:

  • Group The group with which the annotations will be associated, i.e. the group that represents persons in your system. This setting is mandatory and needs to be set on each WissKI information space.
  • Database table name The component uses a table of name parts, like givennames, to detect names. The default database table is filled with name parts extracted from Wikipedia. You usually need not change it.
  • Rankings You can tell the component how you expect names to be built up in your texts: E.g., in most of Europe names are given in the form <givenname(s)> <surname> or <surname> <comma> <givenname(s)>. You may alter that scheme here or give the patterns alternate weights. Again, at least for most European languages the default patterns should be sufficient.
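The pattern-based detection can be illustrated with a toy sketch: known name parts (such as those extracted from Wikipedia) are matched against weighted token patterns. The name-part sets, pattern names, and weights below are made-up examples, not the component's real data or API.

```python
# Tiny stand-in for the database table of name parts.
GIVEN = {"john", "mary"}
SUR = {"smith", "miller"}

# (pattern name, matcher over a token sequence, weight) -- weights illustrative.
PATTERNS = [
    ("given surname",
     lambda t: len(t) == 2 and t[0].lower() in GIVEN and t[1].lower() in SUR,
     1.0),
    ("surname, given",
     lambda t: len(t) == 3 and t[1] == ","
     and t[0].lower() in SUR and t[2].lower() in GIVEN,
     1.0),
]

def detect_name(tokens):
    """Return (pattern name, weight) for the first matching pattern, else None."""
    for name, match, weight in PATTERNS:
        if match(tokens):
            return name, weight
    return None
```

A hit would then become an annotation for a new instance of the configured person group.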

Date & Time Detection


This component detects and annotates date formats like "21 August 1920". Annotations are then converted to new instances of a group that represents Dates/Time-Spans. On the configuration page you can select the group that represents Dates/Time-Spans.
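A detection rule for the date format quoted above could look roughly like the following sketch. The pattern is an assumption for illustration only; the actual component supports more formats and languages.

```python
import re

# Matches dates of the form "21 August 1920" (day, English month name, year).
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
DATE_RE = re.compile(r"\b(\d{1,2})\s+(%s)\s+(\d{4})\b" % MONTHS)

def find_dates(text):
    """Return all date strings found in the text."""
    return [m.group(0) for m in DATE_RE.finditer(text)]
```

Each match would then be converted to a new instance of the configured Dates/Time-Spans group.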

Detection by Regular Expressions


This component can be used if you want to annotate occurrences of a specific textual pattern as entities. Examples are identifiers that follow certain rules, like inventory numbers, or names with specific suffixes like -sky or -vich. In the image example, a pattern for a museum's inventory numbers is given: one of the letters A, G or X, followed by a hyphen, followed by four to six digits. This will link all occurrences of that pattern to new instances of museum objects. If the vocabulary detection component finds this pattern as a museum object in a vocabulary, the existing instance will be linked instead of creating a new instance.
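As a regular expression, the inventory-number pattern from the example reads as follows (the function and its return shape are a minimal sketch, not the module's actual code):

```python
import re

# The example pattern: one of the letters A, G or X, a hyphen, 4-6 digits.
INV_RE = re.compile(r"\b[AGX]-\d{4,6}\b")

def annotate_inventory_numbers(text):
    """Return (matched text, start, end) for each occurrence of the pattern.

    Each match would become an annotation linked to a museum-object instance.
    """
    return [(m.group(0), m.start(), m.end()) for m in INV_RE.finditer(text)]
```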


Preprocessor

In the Preprocessor tab you can define, for each supported language, an external program to preprocess (i.e. POS-tag and lemmatize) the text. Currently only German and English are supported. Using a preprocessor is not mandatory, although it yields better analysis results.

WissKI expects the following input and output formats: input must be line-based, one token/word per line. Output is also line-based, of the form <lemma>\t<tag>. The <tag> is optional.
An example of a preprocessing program with such an input/output format is the TreeTagger.
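The I/O contract can be illustrated with a trivial stand-in preprocessor. A real setup would call an external tagger such as TreeTagger; the lexicon-based lookup here is only a mock to show the expected line format.

```python
def preprocess(tokens, lexicon):
    """Mock preprocessor with the expected I/O contract.

    Input: one token per line (here, a list of tokens).
    Output: one line per token, "<lemma>\\t<tag>"; the tag is optional.
    Unknown tokens are passed through as their own lemma without a tag.
    """
    lines = []
    for tok in tokens:
        lemma, tag = lexicon.get(tok.lower(), (tok, ""))
        lines.append(f"{lemma}\t{tag}" if tag else lemma)
    return "\n".join(lines)
```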

Configuring the Semantic Editor module

The Semantic Editor module extends the TinyMCE visual editor by providing buttons and functionality for annotating object references in text and the links between them.

Referenced objects are always linked to one of the pathbuilder groups defined in the information space. Therefore, before starting to annotate text, you need to define the groups of the objects that you want to annotate.

The Semantic Editor module can be configured via the Admin tab Site Configuration -> WissKI module settings -> Editor. The admin page can be divided into two parts which are described below.

Rendering of annotations


In the first frame, you can configure the rendering of annotations in the text. For each defined group you may define a color for the annotated text and an icon that precedes the annotation. If you have installed the 3rd party Drupal modules Colorpicker or JQuery Colorpicker, a color picker dialog will appear for selecting a color. Otherwise you will have to enter the RGB color hexcode as in HTML and CSS. Both color and icon are optional; nonetheless you should provide at least one of the two to make annotations visible and stand out in your texts.

For a basic installation, you should define a color and/or an icon for each group for which you want to make annotations. Note that, of course, you may define more groups than you actually want to use for annotation. For groups that you do not use in the editor, you need not specify a color and icon.

Advanced configuration: The appearance of annotations in the text can be altered by editing the CSS files template_default.css, template_group.css, and template_group_no_icon.css.

Editor settings


The second frame groups various settings concerning how and which annotations can be created and displayed, as well as encoding options. For a basic installation, only the third and fourth options need to be set appropriately:

  • Place groups You may select one or more groups that represent places and that, at least for some instances, provide coordinate information for geo-referencing (see the respective fields of the Vocabulary Control module). When hovering over annotations belonging to these groups, the infobox will show a map of the place according to the coordinate information.
  • Input format This is the Drupal input format that the semantic editor uses. If the pre-configuration during the module installation ran correctly, you should not need to alter this.
  • Show triples before store When saving a text, a table of all triples extracted from the text's annotations will be displayed, together with the possibility of not storing the extracted triples. This is mainly useful for debugging purposes and may be disabled on production sites.
  • Groups of which instances may be created Like form-based input, the editor lets you distinguish between closed and open groups, i.e. groups that have a fixed/defined set of instances (e.g. gender, departments, ...) and groups of which the user may create new instances. In this field you should select the groups that are considered open. The editor will always provide the possibility to create a new instance of these groups by displaying a menu item in the side menu, marked with an asterisk (*). In order to correctly create triples for such groups, a vocabulary for that group backed by the local data must be defined.
  • Encode entities also in RDFa WissKI uses an HTML class-based approach to store the annotations in the text. Enabling this option will also include RDFa attributes in the text for referring to the mentioned objects/entities. This allows RDFa parsers and harvesters to read the annotations included in the text.