General settings

Well, let’s have a look at the configuration file more closely. The head of the file is:

dir: ../data/
feature_dir: ../config/feature_definitions/
outdir: ../analysis/
outfile: features_solo_level.csv

The very first line tells melfeature to look in the directory ../data for any input data. As the leading two dots indicate, the path is relative. melfeature resides in the bin subfolder of your installation directory, so the two dots .. mean “one directory up”, i.e., the ROOT directory, of which data is a subfolder. Beware: if melfeature were moved to another location on the file system, this specific configuration (and, by the way, melfeature itself) would no longer work properly. The next line specifies the directory where the Feature Definition Files are located. As you might have guessed from the path, those can be found in the config/feature_definitions subdirectory of your installation. The third line sets the output directory, and the fourth line gives the actual name of the result file, which here resolves to ROOT/analysis/features_solo_level.csv.
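The path logic described above can be sketched in a few lines of Python. This is only an illustration, not melfeature's own code: the installation directory /opt/melospy and the helper resolve_from_bin are hypothetical, and the sketch assumes that relative paths are resolved against the bin subfolder, as the text suggests.

```python
import posixpath


def resolve_from_bin(install_root: str, relative: str) -> str:
    """Resolve a config path relative to the bin subfolder of install_root."""
    bin_dir = posixpath.join(install_root, "bin")
    # normpath collapses ".." and drops a trailing slash
    return posixpath.normpath(posixpath.join(bin_dir, relative))


# Hypothetical installation directory, used only for illustration.
ROOT = "/opt/melospy"

data_dir = resolve_from_bin(ROOT, "../data/")
result_file = posixpath.join(
    resolve_from_bin(ROOT, "../analysis/"), "features_solo_level.csv"
)

print(data_dir)     # /opt/melospy/data
print(result_file)  # /opt/melospy/analysis/features_solo_level.csv
```

If bin were moved elsewhere, the same relative values would resolve to different directories, which is why the configuration would stop working.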

YAML files

A short note on the YAML syntax might be appropriate here. Each of the lines above starts with an identifier (e.g., dir), followed by a colon, followed by a blank, followed by a value. This means that each line represents a YAML block, which has a name and a value. (Such lists of (name, value) pairs are called named lists, associative arrays, or dictionaries in computer lingo.) Each value can itself be a block or a list, but here they are simple single values. (Lists will be explained later.) All blocks start at the very beginning of the line, which means that all these blocks are on the same hierarchical level. Nesting of blocks and lists is done by whitespace at the beginning of a line, which must consist of blanks. Tabulators are not allowed, so be careful not to use the TAB key to indent your YAML blocks. Some editors can be configured to substitute tabulators with blanks, so you should try to do this to avoid cryptic error messages from melfeature and the other command-line tools.
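If your editor cannot replace tabulators automatically, a small check like the following can find them before melfeature complains. It is a minimal sketch using only the Python standard library; the function name find_tab_indents and the keys in the sample text are made up for illustration.

```python
def find_tab_indents(yaml_text: str) -> list[int]:
    """Return the 1-based numbers of lines indented with at least one tab."""
    bad = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        # Leading whitespace is whatever lstrip() would remove.
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            bad.append(number)
    return bad


# Hypothetical sample: line 3 is indented with a tab, line 4 with blanks.
sample = "dir: ../data/\nouter:\n\tinner: 1\n  other: 2\n"
print(find_tab_indents(sample))  # [3]
```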

Well, so far so good. The next lines in the configuration file are:

shortnames: True
convention: en
wide_format: False
split_ids: True
NA_str: NA

The first tells melfeature to use short names for feature labels instead of long names, which are the default. Each feature is contained in a feature definition file (FDF), where it is identified by a unique label. A long name is the concatenation of the name of the FDF and the feature label, joined by a dot. A short name consists of just the feature label.
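The naming rule can be written down as a one-line function. This is a sketch of the rule as described here, not code taken from melfeature; the FDF name pitch_stats and the label mean are hypothetical examples.

```python
def feature_name(fdf_name: str, label: str, shortnames: bool) -> str:
    """Build the column name of a feature in the result file."""
    # Long name: FDF name and feature label joined by a dot.
    return label if shortnames else f"{fdf_name}.{label}"


# Hypothetical FDF name and feature label, for illustration only.
print(feature_name("pitch_stats", "mean", shortnames=True))   # mean
print(feature_name("pitch_stats", "mean", shortnames=False))  # pitch_stats.mean
```

Short names are easier to read, but long names stay unique if two FDFs happen to use the same label.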

The value of the shortnames field is True with a capital T, which is important. Try changing it to true, save the configuration file, start the demo again, and see what has changed. True and False are so-called “boolean constants” as they are defined in Python. This makes sense, since the MeloSpySuite programs are written in Python, so Python syntax will quite unavoidably pop up here and there.

The value of the convention field is not a Python constant but a string, here en, which selects the English convention. Setting it to “german”, “de”, or “Deutsch” instead selects another set of conventions for the various separators used in the result file. The main reason is that in German the comma is used as the decimal point, whereas in many other places it is the dot. This means that with the German convention commas can be used neither as field separators (between columns) nor as separators for the elements of vector-type features etc. (It’s an unnerving mess, really.) If this block is missing, the default convention (which is English, by the way) is used. This whole option is convenient for users who happen to have a German version of Excel or Calc, because result files will not open correctly there if a dot . instead of a comma , is used for floating point numbers. If you happen to be a user with a German Excel/Calc version, try switching the convention value between “german” and “English” and run the demo again. Watch what happens and wonder.
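The effect of the convention on a single line of the result file might look roughly like the sketch below. Be aware of the assumptions: the text above only says that the German convention cannot use the comma as field separator, so the semicolon used here is a guess, and the function format_row is hypothetical rather than part of melfeature.

```python
def format_row(values, convention="en"):
    """Format one row of numbers according to a separator convention."""
    german = convention.lower() in ("german", "de", "deutsch")
    # Assumption: semicolon as field separator for the German convention.
    decimal, field_sep = (",", ";") if german else (".", ",")
    return field_sep.join(str(v).replace(".", decimal) for v in values)


print(format_row([1.5, 0.25], "en"))      # 1.5,0.25
print(format_row([1.5, 0.25], "german"))  # 1,5;0,25
```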

The wide_format field relates to the output format of the feature result file. Prior to version 1.4 of the MeloSpySuite (and v1.2 of the MeloSpyGUI), the output CSV files were in wide format, i.e., features that are not scalars (vectors and matrices) were packed into a string to occupy exactly one field. This had some drawbacks for further analysis of the result files with statistics software, namely that special input routines were required. Thus, we changed the default output format to the so-called long format, where each entry of a vector feature is expanded into a row of its own, with the scalar values repeated in every row. This allows statistics software to read the result file directly. Only matrix-valued features (currently only self-similarity matrices) are not treated this way. Hence, if you want to use matrix features, you have to set wide_format to True. Since False is now the default, this line is redundant here and could be omitted.
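The difference between the two formats can be illustrated by converting one wide row into long rows. This is a minimal sketch: the column names, the sample values, and the colon used to pack the vector are assumptions made for the example, not the actual melfeature output.

```python
def wide_to_long(row: dict, vector_key: str, sep: str = ":") -> list[dict]:
    """Expand one wide row into long rows, repeating the scalar fields."""
    elements = row[vector_key].split(sep)
    scalars = {k: v for k, v in row.items() if k != vector_key}
    return [{**scalars, vector_key: element} for element in elements]


# Hypothetical wide row: one scalar feature and one packed vector feature.
wide = {"id": "solo_01", "mean_pitch": "62.5", "intervals": "2:-1:3"}
long_rows = wide_to_long(wide, "intervals")

for long_row in long_rows:
    print(long_row)
```

Each of the three resulting rows carries the same id and mean_pitch, plus one element of the vector.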

The split_ids field is also new since version 1.4 of the MeloSpySuite (and v1.2 of the MeloSpyGUI). True is the new default, so this line is, strictly speaking, also redundant. It controls the handling of the IDs for segmentations. Formerly, the IDs were absorbed into the id field in the result file. With the new default, two new columns are generated in the result file, one called seg_type and one called seg_id. If, for instance, segmentation is set to phrases, the seg_type field will constantly be set to phrases, whereas the seg_id field will contain the phrase numbers. A similar logic applies to other segmentations. Since you can mix several segmentations in one result file, the seg_type column is necessary. The reason behind this change was, again, to ease further processing of the results.
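With separate seg_type and seg_id columns, grouping the rows of a mixed result file takes only a few lines. The sketch below uses an invented result file: the columns id, seg_type and seg_id follow the description above, while some_feature, the values, and the second segmentation type chorus are placeholders.

```python
import csv
import io
from collections import defaultdict

# Invented result file with two segmentation types mixed together.
result_csv = """id,seg_type,seg_id,some_feature
solo_01,phrases,1,0.5
solo_01,phrases,2,0.7
solo_01,chorus,1,0.6
"""

by_type = defaultdict(list)
for row in csv.DictReader(io.StringIO(result_csv)):
    # seg_type tells the segmentations apart, seg_id numbers the segments.
    by_type[row["seg_type"]].append((row["id"], int(row["seg_id"])))

print(dict(by_type))
```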

The NA_str field is useful for post-processing. If melfeature is not able to calculate a feature, which might happen for several good or bad reasons, it inserts this value instead. The default is NA, which is the symbol R uses for missing values.
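When post-processing a result file outside of R, the missing-value marker has to be handled explicitly. A minimal sketch in Python, where parse_value is a hypothetical helper and the marker is assumed to match the NA_str value in the configuration:

```python
import math


def parse_value(field: str, na_str: str = "NA") -> float:
    """Convert a CSV field to float, mapping the NA marker to NaN."""
    return math.nan if field == na_str else float(field)


values = [parse_value(field) for field in ["0.5", "NA", "1.25"]]
valid = [v for v in values if not math.isnan(v)]
print(valid)  # [0.5, 1.25]
```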

Next part: Input selection.