General settings
Well, let’s have a look at the configuration file more closely. The head of the file is:
dir: ../data/
feature_dir: ../config/feature_definitions/
outdir: ../analysis/
outfile: features_solo_level.csv
The very first line tells melfeature to look in the directory ../data for any input data. As the two leading dots indicate, the path is relative. melfeature actually resides in the bin subfolder of your installation directory, so the two dots .. mean “one directory up”, i.e., the ROOT directory, where data is a subfolder. Beware: if melfeature is moved to another location on the file system, this specific configuration (and, by the way, melfeature itself) will no longer work properly. The next line specifies the directory where the Feature Definition Files are located. As you might have guessed, those can be found in the config/feature_definitions subdirectory of your installation. The third line sets the output directory, here ROOT/analysis, and the fourth line gives the name of the result file, so the results end up in ROOT/analysis/features_solo_level.csv.
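To see how the relative paths resolve, here is a minimal Python sketch. The installation root /opt/melospy is a hypothetical placeholder, and the sketch assumes, as described above, that the paths are interpreted relative to the bin subfolder.

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical installation root; substitute your own.
ROOT = PurePosixPath("/opt/melospy")
BIN_DIR = ROOT / "bin"  # the folder where melfeature resides


def resolve(relative_path: str) -> str:
    """Resolve a path from the configuration relative to the bin folder."""
    return posixpath.normpath(str(BIN_DIR / relative_path))


data_dir = resolve("../data/")
out_path = posixpath.join(resolve("../analysis/"), "features_solo_level.csv")
print(data_dir)
print(out_path)
```

If the bin folder were moved elsewhere, the same relative paths would resolve against the new location, which is why the configuration would stop working.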
YAML files
A short note on the YAML syntax might be appropriate here. Each of the lines above starts with an identifier (e.g., dir), followed by a colon, followed by a blank, followed by a value. This means that each line represents a YAML block, which has a name and a value. (Such lists of (name, value) pairs are called named lists, associative arrays, or dictionaries in computer lingo.) Each value can itself be a block or a list, but here they are simple single values. (Lists will be explained later.) All blocks start at the very beginning of the line, which means all these blocks are on the same hierarchical level. Nesting blocks and lists is done by whitespace at the beginning of a line, which must consist of blanks. Tabulators are not allowed, so be careful not to use the TAB key to indent your YAML blocks. Some editors can be configured to substitute tabulators with blanks, so you should try to do this to avoid cryptic error messages from melfeature and the other command-line tools.
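The rules above can be checked with a few lines of Python. This is only a sketch for flat name: value lines, not a YAML parser; the function name check_flat_yaml is made up for this illustration and is not part of the MeloSpySuite.

```python
def check_flat_yaml(text: str) -> dict:
    """Parse flat 'name: value' lines and reject TAB indentation.

    Illustration only: handles neither nested blocks nor lists.
    """
    pairs = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip empty lines
        # Leading whitespace of the line (blanks and/or tabs).
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            raise ValueError(f"line {number}: TAB used for indentation")
        # Identifier, then colon plus blank, then value.
        name, sep, value = line.partition(": ")
        if not sep:
            raise ValueError(f"line {number}: expected 'name: value'")
        pairs[name.strip()] = value.strip()
    return pairs


header = "dir: ../data/\noutdir: ../analysis/\noutfile: features_solo_level.csv\n"
settings = check_flat_yaml(header)
print(settings["outfile"])
```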
Well, so far so good. The next lines in the configuration file are:
shortnames: True
convention: en
wide_format: False
split_ids: True
NA_str: NA
The first tells melfeature to use short names for feature labels instead of long names, which are the default. Each feature is contained in a feature definition file (FDF), where it is identified by a unique label. Long names are a concatenation of the name of the FDF and the feature label, joined by a dot. Short names consist of just the feature label.
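The naming rule can be written out in a couple of lines. The FDF name and feature label below are invented for illustration, and feature_name is a hypothetical helper, not a function of melfeature.

```python
def feature_name(fdf_name: str, label: str, shortnames: bool) -> str:
    """Build a feature name following the rule described above."""
    if shortnames:
        return label              # short name: just the feature label
    return f"{fdf_name}.{label}"  # long name: FDF name, dot, feature label


short = feature_name("example_fdf", "example_label", shortnames=True)
long_ = feature_name("example_fdf", "example_label", shortnames=False)
print(short)
print(long_)
```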
The value of the shortnames field is True with a capital T, which is important. Try changing it to true, save the configuration file, start the demo again, and see what has changed. True and False are so-called “boolean constants” as they are defined in Python. This makes sense, since the MeloSpySuite programs are written in Python, so Python syntax will unavoidably pop up here and there. The value of the convention field is not a Python constant but a string, here en for English. Setting it to “german” (or to “de” or “Deutsch”, which have the same effect) selects another set of conventions for the various separators in the result file. The main reason is that in German the comma is used as the decimal separator, whereas in many other places it is the dot. This means that commas can be used neither as field separators (between columns) nor as separators for the elements of vector-type features, etc. (It’s an unnerving mess, really.)
If this block is missing, the default convention (which is English, by the way) will be used. This whole option is convenient for users who happen to have a German version of Excel or Calc, because result files will not open correctly there if a dot . instead of a comma , is used for floating point numbers.
If you happen to be a user with a German Excel/Calc version, try switching the convention value between “en” and “de” and run the demo again each time. Watch what happens when you open the result file and wonder.
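The difference between the two conventions can be illustrated with a short sketch. The choice of a semicolon as the field separator for the German convention is an assumption made for this example; the separators melfeature actually writes may differ. format_row is likewise a hypothetical helper.

```python
def format_row(values, convention="en"):
    """Format one row of numbers under an English or German convention."""
    if convention == "en":
        field_sep, decimal = ",", "."
    else:
        # Assumed for illustration: semicolon between fields,
        # because the comma is taken by the decimal separator.
        field_sep, decimal = ";", ","
    return field_sep.join(str(v).replace(".", decimal) for v in values)


english_row = format_row([0.5, 1.25], "en")
german_row = format_row([0.5, 1.25], "de")
print(english_row)
print(german_row)
```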
The wide_format field relates to the output format of the feature result file. Prior to version 1.4 of the MeloSpySuite (and v1.2 of the MeloSpyGUI), the output CSV files were in wide format, i.e., features that are not scalars (vectors and matrices) were packed into a string to occupy exactly one field. This had some drawbacks for further analysis of the result files with statistics software, namely that special input routines were required. Thus, we changed the default output format to the so-called long format, where each entry of a vector feature is expanded into its own row and scalar values are repeated. This allows the result file to be read directly. Only matrix-valued features (currently only self-similarity matrices) are not treated this way. Hence, if you want to use matrix features, you have to set wide_format to True. Since False is now the default, this line is redundant here and could be omitted.
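A minimal sketch of the difference between the two layouts, using made-up data. The column names, the feature values, and the colon used to pack the vector are assumptions for illustration and do not reproduce melfeature's actual output.

```python
# One melody with a scalar feature and a vector feature (made-up values).
record = {"id": "melody_1", "scalar_feature": 3, "vector_feature": [0.1, 0.2, 0.3]}


def to_wide(rec):
    """Wide format: the whole vector is packed into a single string field."""
    packed = ":".join(str(v) for v in rec["vector_feature"])
    return [{"id": rec["id"],
             "scalar_feature": rec["scalar_feature"],
             "vector_feature": packed}]


def to_long(rec):
    """Long format: one row per vector entry, scalar values repeated."""
    return [{"id": rec["id"],
             "scalar_feature": rec["scalar_feature"],
             "vector_feature": v}
            for v in rec["vector_feature"]]


for row in to_long(record):
    print(row)
```

In the long layout every field holds a single value, so statistics software can read the file without a custom routine for unpacking strings.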
The split_ids field is also new since version 1.4 of the MeloSpySuite (and v1.2 of the MeloSpyGUI). True is the new default, so strictly speaking this line is redundant as well. It controls the handling of the IDs for segmentations. Formerly, the IDs were absorbed into the id field in the result file. With the new default, two new columns are generated in the result file, one called seg_type and one called seg_id. If, for instance, segmentation is set to phrases, the seg_type field will constantly be set to phrases, whereas the seg_id field will contain the phrase numbers. A similar logic applies to other segmentations. Since you can mix several segmentations in one result file, the seg_type column is necessary. The reason behind this change was, again, to ease further processing of the results.
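With the two extra columns, picking out one segmentation in post-processing becomes a simple filter. The rows below are invented, and the feature column and the value other_segmentation are placeholders; only the column names id, seg_type and seg_id come from the description above.

```python
import csv
import io

# Invented result file content; "feature" is a placeholder column.
result_csv = """id,seg_type,seg_id,feature
solo_a,phrases,1,0.4
solo_a,phrases,2,0.7
solo_a,other_segmentation,1,0.5
"""

rows = list(csv.DictReader(io.StringIO(result_csv)))

# Keep only the rows that belong to the phrase segmentation.
phrases = [row for row in rows if row["seg_type"] == "phrases"]
phrase_ids = [int(row["seg_id"]) for row in phrases]
print(phrase_ids)
```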
The NA_str field is useful for post-processing. If melfeature is not able to calculate a feature, which might happen for several good or bad reasons, it inserts this value. The default is NA, which is the symbol R uses for missing values.
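When post-processing outside of R, the marker has to be mapped to a missing value by hand. A minimal Python sketch, assuming the default NA string and an invented two-row result file (parse_value is a hypothetical helper):

```python
import csv
import io
import math

NA_STR = "NA"  # must match the NA_str setting in the configuration


def parse_value(field: str) -> float:
    """Convert a field to float, mapping the NA marker to NaN."""
    return math.nan if field == NA_STR else float(field)


# Invented result file content for illustration.
result_csv = "id,feature\nsolo_a,0.4\nsolo_b,NA\n"
values = [parse_value(row["feature"])
          for row in csv.DictReader(io.StringIO(result_csv))]
print(values)
```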
Next part: Input selection.