\chapter{Evaluation}
\label{ch:Evaluation}

This chapter evaluates the prototype in terms of its functionality and its properties. The evaluation is carried out offline with synthetic data. All valid configurations are generated for one use case, the forestry use case. In addition, groups with explicit preferences and a configuration state (e.g.\ the currently existing forest) are generated.

\section{Metric}
\label{sec:Evaluation:Metrics}

A metric is required to carry out the evaluation. The proposed metric is a satisfaction metric. It was created because the pertinent literature does not provide metrics usable for this thesis. Satisfaction is quantified by a threshold metric. A user's preferences are used to calculate a score for each possible solution: the score of a configuration is the average of the user's preferences for the characteristics that are part of the configuration. The score allows a configuration to be compared with all other configurations and ranked by the percentage of configurations that it beats for a specific user. The threshold metric has two parameters: the threshold center $tc$ and the satisfaction distance $sd$. The threshold for a satisfied person is at $tc + sd$ and the threshold for a dissatisfied person is at $tc - sd$. If a recommendation lies between these two thresholds, the person is classified as neither satisfied nor dissatisfied with the solution. For this thesis $sd=5\%$ is used. This choice is based on the assumption that people switch from satisfied to dissatisfied rather quickly \todo{find a source psychology}. Therefore, the only parameter varied in this thesis is $tc$. As an example, consider $tc = 60\%$. A person is then satisfied with a recommendation if it is better than at least $65\%$ of all possible finished configurations. In contrast, a person is dissatisfied if the recommendation is not better than $55\%$ of all possible finished configurations. A recommendation that is better than at least $55\%$ but not better than $65\%$ of all possible solutions is considered neutral by the individual.
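
One way to write the metric down formally is sketched below. The notation ($K$, $p_u$, $s_u$, $r_u$) is introduced here purely for illustration, and the treatment of ties and of the exact threshold boundaries is an assumption, since the description above leaves it open. Let $K$ be the set of all finished configurations and $p_u(c) \in [0,1]$ the preference of user $u$ for characteristic $c$. The score and the rank of a configuration $k \in K$ can then be expressed as
\begin{equation*}
    s_u(k) = \frac{1}{|k|} \sum_{c \in k} p_u(c),
    \qquad
    r_u(k) = \frac{|\{k' \in K : s_u(k') < s_u(k)\}|}{|K|},
\end{equation*}
and the classification of user $u$ for a recommended configuration $k$ as
\begin{equation*}
    \text{class}_u(k) =
    \begin{cases}
        \text{satisfied}    & \text{if } r_u(k) \geq tc + sd, \\
        \text{dissatisfied} & \text{if } r_u(k) < tc - sd, \\
        \text{neutral}      & \text{otherwise.}
    \end{cases}
\end{equation*}
With $tc = 60\%$ and $sd = 5\%$ this reproduces the thresholds of $65\%$ and $55\%$ from the example above.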

Different $tc$ values make it possible to model different situations. A low willingness to compromise is modelled by a high $tc$. Conversely, a group with a high willingness to compromise is modelled by a low $tc$.

A classification into satisfied and dissatisfied allows groups to be measured by the number of people who are satisfied and dissatisfied. Moreover, changes in satisfaction and dissatisfaction can be compared across different parameters. A reasonable $tc$ value has to be found for each group type, otherwise the derived metrics will not show meaningful results.

\section{Evaluation Objective}
\label{sec:Evaluation:Questions}

This section poses three questions that are answered during the evaluation. The questions guide the reader through this chapter: they set the guidelines for the evaluation and define its focus. The questions are:

\begin{itemize}
    \item Main question: How does the satisfaction with a group decision, guided by the recommender, differ from the decision of a single decision maker, the dictator, who does not take the other group members' opinions into account?
    \item How many group members on average are satisfied with the group decision?
    \item How does the number of stored finished configurations relate to satisfaction with a recommendation?
\end{itemize}

The main question is addressed to understand the behaviour of the recommender and whether it benefits groups. The second question aims at providing information about the data, about what satisfaction looks like in group decisions, and about which factors influence it. The last question is a technical one. It is relevant because future work that applies the recommender to other, possibly larger, use cases depends on performance figures in relation to the number of stored configurations.

\section{Use Case}
\label{sec:Evaluation:UseCase}

To evaluate the recommender, a use case is needed. In this thesis, a forestry use case with four stakeholders is evaluated. \autoref{fig:Concept:ForestExample} presents the attributes and characteristics of this use case, but it does not show the complete use case: the rules that define invalid configurations are missing. Therefore, the constraints for this use case are listed in \emph{not with} form in \autoref{tab:Evaluation:UseCase}.

\begin{table}
    \tiny
    \begin{center}
        \setlength\tabcolsep{3pt}
        \begin{tabularx}{\columnwidth}{cl|C|C|C|C|C|C|C|C|C|C|C|C|C|C|C|C|C|C|C|C|C|}
            & & \multicolumn{3}{c|}{\textit{indigenous}} & \multicolumn{3}{c|}{\textit{resilient}} & \multicolumn{3}{c|}{\textit{usable}} & \multicolumn{3}{c|}{\textit{effort}} & \multicolumn{3}{c|}{\textit{quantity}} & \multicolumn{3}{c|}{\textit{price}} & \multicolumn{3}{c|}{\textit{accessibility}} \\
            & & \rotatebox[origin=c]{90}{low} & \rotatebox[origin=c]{90}{moderate} & \rotatebox[origin=c]{90}{high} & \rotatebox[origin=c]{90}{low} & \rotatebox[origin=c]{90}{moderate} & \rotatebox[origin=c]{90}{high} & \rotatebox[origin=c]{90}{low} & \rotatebox[origin=c]{90}{moderate} & \rotatebox[origin=c]{90}{high} & \rotatebox[origin=c]{90}{manual} & \rotatebox[origin=c]{90}{harvester} & \rotatebox[origin=c]{90}{\ autonomous} & \rotatebox[origin=c]{90}{low} & \rotatebox[origin=c]{90}{moderate} & \rotatebox[origin=c]{90}{high} & \rotatebox[origin=c]{90}{low} & \rotatebox[origin=c]{90}{moderate} & \rotatebox[origin=c]{90}{high} & \rotatebox[origin=c]{90}{low} & \rotatebox[origin=c]{90}{moderate} & \rotatebox[origin=c]{90}{high} \\

            \hline
            \multirow{3}{*}{\textit{indigenous}}    & low       & - & - & - &   &   &   &   &   &   &   &   &   &   &   &   &   &   &   &   &   &   \\ \cline{2-23}
                                                    & moderate  & - & - & - &   &   & n &   &   & n &   &   &   &   &   &   &   &   &   &   &   &   \\ \cline{2-23}
                                                    & high      & - & - & - &   &   & n &   & n & n &   &   &   &   &   & n & n &   &   &   &   &   \\ \hline

            \multirow{3}{*}{\textit{resilient}}     & low       &   &   &   & - & - & - &   &   &   &   &   &   &   &   &   &   &   &   &   &   &   \\ \cline{2-23}
                                                    & moderate  &   &   &   & - & - & - &   &   & n &   &   &   &   &   &   &   &   &   &   &   &   \\ \cline{2-23}
                                                    & high      &   & n & n & - & - & - &   & n & n &   &   &   &   &   & n & n & n &   &   &   &   \\ \hline

            \multirow{3}{*}{\textit{usable}}        & low       &   &   &   &   &   &   & - & - & - &   &   &   &   &   & n & n & n &   &   &   &   \\ \cline{2-23}
                                                    & moderate  &   &   & n &   &   & n & - & - & - &   &   &   &   &   &   &   &   &   &   &   &   \\ \cline{2-23}
                                                    & high      &   & n & n &   & n & n & - & - & - &   &   &   &   &   &   &   &   &   &   &   & n \\ \hline

            \multirow{3}{*}{\textit{effort}}        & manual    &   &   &   &   &   &   &   &   &   & - & - & - &   &   & n & n & n &   &   &   &   \\ \cline{2-23}
                                                    & harvester &   &   &   &   &   &   &   &   &   & - & - & - &   &   &   &   &   &   &   & n & n \\ \cline{2-23}
                                                    & autonomous &   &   &   &   &   &   &   &   &   & - & - & - &   &   &   &   &   &   &   & n & n \\ \hline

            \multirow{3}{*}{\textit{quantity}}      & low       &   &   &   &   &   &   &   &   &   &   &   &   & - & - & - & n & n &   &   &   &   \\ \cline{2-23}
                                                    & moderate  &   &   &   &   &   &   &   &   &   &   &   &   & - & - & - & n & n &   &   &   &   \\ \cline{2-23}
                                                    & high      &   &   & n &   &   & n & n &   &   & n &   &   & - & - & - &   &   &   &   & n & n  \\ \hline

            \multirow{3}{*}{\textit{price}}         & low       &   &   & n &   &   & n & n &   &   & n &   &   & n & n &   & - & - & - &   &   &   \\ \cline{2-23}
                                                    & moderate  &   &   &   &   &   & n & n &   &   & n &   &   & n & n &   & - & - & - &   &   &   \\ \cline{2-23}
                                                    & high      &   &   &   &   &   &   &   &   &   &   &   &   &   &   &   & - & - & - &   &   &   \\ \hline

            \multirow{3}{*}{\textit{accessibility}} & low       &   &   &   &   &   &   &   &   &   &   &   &   &   &   &   &   &   &   & - & - & - \\ \cline{2-23}
                                                    & moderate  &   &   &   &   &   &   &   &   &   &   & n & n &   &   & n &   &   &   & - & - & - \\ \cline{2-23}
                                                    & high      &   &   &   &   &   &   &   &   & n &   & n & n &   &   & n &   &   &   & - & - & - \\ \hline
        
        \end{tabularx}
        \caption[Forestry Use Case: Constraints]{Constraints in \emph{not with} form for the forestry use case.}
        \label{tab:Evaluation:UseCase}
    \end{center}
\end{table}

The stakeholders in this use case are a forest owner, an athlete, an environmentalist, and a consumer. The owner sees the forest as an investment and is interested in a high long-term profit. Consumers, on the other hand, are interested in a reasonable wood price, as they use wood for furniture and for their fireplaces. In contrast, the environmentalist is interested in a healthy forest that is not negatively impacted by human activity. Finally, the athlete is interested in good accessibility of the forest and in some plant and animal life.
Every group consists of four people who need to find a compromise. Diverging preferences make this difficult. All stakeholders have an interest in getting their way, but all parties also need the others to accept the decision. It is not in the interest of a stakeholder to have their preferences fully met while ending up with protests that arise from the deep dissatisfaction of other group members.

\section{Data Generation}
\label{sec:Evaluation:GeneratingGroups}

This section describes the data generation process shown in \autoref{fig:Evaluation:GeneratingDataProcess}, which generates data based on the use case in \autoref{sec:Evaluation:UseCase}. Group profiles are used to generate groups of four with different group member types. The exact group composition depends on the group type. For every parameter and group type, $1000$ groups are generated and converted to preferences. This number was chosen because it is the highest number for which the computations can be completed overnight on the available hardware. It is also large enough to reduce strong variability between runs. For each group, unfinished configurations are generated, and the group's preferences are paired with the generated unfinished configurations. These pairs are later used for the evaluation.

\begin{figure}
    \centering
    \includegraphics[width=1\textwidth]{./figures/60_evaluation/bpmn_evaluation_input_data_generation.pdf}
    \caption[Data Generation Process]{Data generation process for the evaluation}
    \label{fig:Evaluation:GeneratingDataProcess}
\end{figure}

\subsection{Unfinished Configurations Generation}

Unfinished configurations are generated by taking each finished configuration and keeping a subset of its characteristics. This way all generated configurations are valid and lead to valid solutions. For the results presented in this chapter, around $\frac{1}{7} \approx 14\%$ of the characteristics are kept. This value is chosen to allow the existing configuration to take effect without skewing the results through the penalty function severely limiting the possible options.

\subsection{Preference Generation}

For the forestry use case, the idea is that there are multiple types of user profiles. Each profile is represented by a neutral, negative or positive attitude towards each characteristic. During data generation, the attitude is converted to a preference using a normal distribution. Each preference lies in the interval $[0,1]$, where zero can be seen as the worst possible option and one as the best possible option. \autoref{fig:Evaluation:PreferenceDistribution} shows how the user profile can be converted to preferences. The actual group member profiles are shown in \autoref{tab:Evaluation:GroupMemberMappings}.

\pgfplotsset{height=5cm,width=\textwidth,compat=1.8}
\pgfmathdeclarefunction{gauss}{2}{%
  \pgfmathparse{1/(#2*sqrt(2*pi))*exp(-((x-#1)^2)/(2*#2^2))}%
}
\begin{figure}
    \begin{tikzpicture}
        \begin{axis}[
            every axis plot post/.append style={
                mark=none, domain=0:1, samples=50, smooth
            },
            axis x line*=bottom,
            xmin=0,
            xmax=1,
            ymin=0.1,
            xticklabel style={
                /pgf/number format/precision=3,
            },
            xtick={0,0.25, 0.5, 0.75,1},
            hide y axis]
          \addplot [draw=black, style={dashdotdotted}][very thick] {gauss(0.25,0.1)} node[text=black][above,pos=0.5] {negative};
          \addplot [draw=black, style={solid}][very thick] {gauss(0.5,0.05)} node[text=black][above,pos=0.48] {neutral};
          \addplot [draw=black, style={dotted}][very thick] {gauss(0.75,0.1)} node[text=black][above,pos=0.5] {positive};
        \end{axis}
        \end{tikzpicture}
 \caption[Preference Distribution]{Distribution of preferences for a user type.}
\label{fig:Evaluation:PreferenceDistribution}
\end{figure}
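
As a sketch of the conversion, the preference $p$ for a characteristic with attitude $a$ can be thought of as a draw from a normal distribution,
\begin{equation*}
    p \sim \mathcal{N}(\mu_a, \sigma_a^2),
    \qquad
    (\mu_a, \sigma_a) =
    \begin{cases}
        (0.25,\ 0.1)  & \text{if } a = \text{negative}, \\
        (0.5,\ 0.05)  & \text{if } a = \text{neutral}, \\
        (0.75,\ 0.1)  & \text{if } a = \text{positive}.
    \end{cases}
\end{equation*}
The symbols $\mu_a$ and $\sigma_a$ are introduced here for illustration, and the values are the ones used to draw the curves in \autoref{fig:Evaluation:PreferenceDistribution}. Whether the data generator uses exactly these parameters, and how draws outside $[0,1]$ are handled, is not specified here and is left open.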


\begin{table}
    \begin{center}
        \begin{tabular}{l|c|c|c|c}
            characteristic                              & athlete           & forest owner      & environmentalist  & consumer          \\
            \hline
            $(\textit{indigenous}, \text{low})$         & \textbf{negative} & \textit{positive} & \textbf{negative} & neutral           \\
            $(\textit{indigenous}, \text{moderate})$    & \textit{positive} & neutral           & \textbf{negative} & neutral           \\
            $(\textit{indigenous}, \text{high})$        & \textit{positive} & \textbf{negative} & \textit{positive} & \textbf{negative} \\
            \hline
            $(\textit{resilient}, \text{low})$          & neutral           & \textit{positive} & neutral           & neutral           \\
            $(\textit{resilient}, \text{moderate})$     & \textit{positive} & neutral           & neutral           & neutral           \\
            $(\textit{resilient}, \text{high})$         & \textit{positive} & \textbf{negative} & \textbf{negative} & \textbf{negative} \\
            \hline
            $(\textit{usable}, \text{low})$             & neutral           & neutral           & neutral           & \textbf{negative} \\
            $(\textit{usable}, \text{moderate})$        & neutral           & neutral           & \textbf{negative} & neutral           \\
            $(\textit{usable}, \text{high})$            & \textbf{negative} & \textit{positive} & \textbf{negative} & \textit{positive} \\
            \hline
            $(\textit{effort}, \text{manual})$          & \textbf{negative} & neutral           & \textit{positive} & \textbf{negative} \\
            $(\textit{effort}, \text{harvester})$       & \textbf{negative} & \textit{positive} & \textbf{negative} & neutral           \\
            $(\textit{effort}, \text{autonomous})$      & \textbf{negative} & \textit{positive} & \textbf{negative} & neutral           \\
            \hline
            $(\textit{quantity}, \text{low})$           & \textit{positive} & \textbf{negative} & \textit{positive} & \textbf{negative} \\
            $(\textit{quantity}, \text{moderate})$      & neutral           & \textit{positive} & neutral           & \textbf{negative} \\
            $(\textit{quantity}, \text{high})$          & \textbf{negative} & \textit{positive} & \textbf{negative} & \textit{positive} \\
            \hline
            $(\textit{price}, \text{low})$              & neutral           & neutral           & neutral           & \textit{positive} \\
            $(\textit{price}, \text{moderate})$         & neutral           & \textit{positive} & neutral           & neutral           \\
            $(\textit{price}, \text{high})$             & neutral           & \textit{positive} & neutral           & \textbf{negative} \\
            \hline
            $(\textit{accessibility}, \text{low})$      & \textbf{negative} & \textit{positive} & \textit{positive} & neutral           \\
            $(\textit{accessibility}, \text{moderate})$ & neutral           & neutral           & neutral           & neutral           \\
            $(\textit{accessibility}, \text{high})$     & \textit{positive} & \textbf{negative} & \textbf{negative} & neutral           \\
            \hline
        \end{tabular}
        \caption[Forestry Use Case: Group Member Profiles]{The attitudes of each group member profile.}
        \label{tab:Evaluation:GroupMemberMappings}
    \end{center}
\end{table}

These user profiles can be used to generate rather homogeneous groups but also groups with more conflicting interests. The following group types, with four members each, are generated:

\begin{itemize}
    \item random groups (preferences are uniformly random)
    \item heterogeneous groups (each member adheres to one of the preference profiles: forest owner, athlete, consumer, environmentalist)
    \item homogeneous groups (only one preference profile for all group members which in this evaluation is the forest owner)
\end{itemize}

The natural group type for the use case is the heterogeneous group. To widen the evaluation and to see how the recommender performs with different types of groups, the two other group types are evaluated too. As a result, more general statements about the recommender's performance can be made.

\subsection{The Effect of Stored Finished Configurations}

Another important component of the evaluation is the influence of stored finished configurations. When evaluating a subset of the stored finished configurations, it is important to avoid outliers. For this reason, a process inspired by \emph{cross validation} \todo{add reference} is used. The configuration database is randomly ordered and sliced into sub-databases of the required size. For example, if the evaluated store size is 20, a configuration database containing 100 configurations is split into five sub-databases of size 20. The evaluation is then carried out for each sub-database and the results are averaged. This avoids randomly picking a subset that performs much better or much worse than most other possible subsets. This way the data is closer to the \emph{expected value}. \todo{add reference}
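
The averaging step can be written down as follows. The symbols $D$, $D_i$, $m$, $n$ and $S$ are notation introduced here for illustration. Let $D$ be the randomly ordered configuration database, $m$ the evaluated store size, and $D_1, \dots, D_n$ the resulting sub-databases with $n = \lfloor |D| / m \rfloor$. If $S(D_i)$ denotes a result (e.g.\ the average number of satisfied group members) obtained with sub-database $D_i$, the reported value is
\begin{equation*}
    \bar{S}(m) = \frac{1}{n} \sum_{i=1}^{n} S(D_i).
\end{equation*}
For the example above, $|D| = 100$ and $m = 20$ give $n = 5$. How configurations are treated that are left over when $m$ does not divide $|D|$ is not stated, so the floor in $n$ is an assumption.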


\section{Hypotheses}
\label{sec:Evaluation:Hypotheses}

This section gives an overview of the hypotheses tested during data analysis. Each hypothesis is followed by an explanation of why it is posed. Later sections examine whether each hypothesis holds. This makes it possible to verify whether expectations about the behaviour of the recommender are met.

\begin{hypothesis}
    \begin{itshape}
        \label{hyp:Evaluation:MaximumMinimum} Improvements through group recommendations are highest when the number of people satisfied with the dictator's decision is slightly lower than two. Correspondingly, the highest reduction in dissatisfied group members occurs when around two group members are dissatisfied with the dictator's decision.
    \end{itshape} \medskip \\*    
    This is based on the assumption that, in a real situation, a group of four with on average fewer than two members satisfied with the dictator's decision has enough room for improvement that potentially three group members can be satisfied after the use of the recommender. This means that at least one more person is satisfied with the compromise. In some groups it may then be possible to shift the last person from dissatisfaction towards a neutral attitude. A higher base satisfaction is assumed to reduce the possibility of satisfying an additional group member.
\end{hypothesis}


\begin{hypothesis}
    \begin{itshape}
        \label{hyp:Evaluation:HigherTcLessSatisfied} A higher $tc$ value results in fewer satisfied people and more dissatisfied people with regard to the dictator's decision.
    \end{itshape} \medskip \\*
    A higher $tc$ value causes a person to be dissatisfied with a larger number of configurations and satisfied with fewer configurations. Therefore, when a random configuration is recommended, the chance of an individual being satisfied decreases while the chance of that person being dissatisfied increases. This change in probability alone leads to the assumption that the same result should be seen with non-random recommendations too.
\end{hypothesis}

\begin{hypothesis}
    \begin{itshape}
        \label{hyp:Evaluation:OnlyOneSatisfied} There exists a $tc$ value which causes only one person to be classified as satisfied with the dictator's decision and no one is classified as satisfied with the group recommender's decision.
    \end{itshape} \medskip \\*
    A sufficiently high $tc$ value should eventually leave only the dictator herself satisfied with the dictator's decision. The bound for satisfaction is then so high that any group recommendation causes the dictator to be dissatisfied with, or at least neutral towards, the group decision as well. This can be understood as a complete unwillingness of the group to compromise: all group members are only satisfied with their own decision. If two group members have identical interests, which is expected to be rare, this effect does not occur even in such a situation.
\end{hypothesis}

\begin{hypothesis}
    \begin{itshape}
        \label{hyp:Evaluation:HomogenousMoreSatisfied} Homogeneous groups have more members satisfied with the recommender's decision but also with the dictator's decision compared to heterogeneous groups.
    \end{itshape} \medskip \\*
    As the interests in homogeneous groups are more aligned, it is to be expected that the overall satisfaction levels for more homogeneous groups are higher. If the base level is already higher, it is likely that even a slight increase shifts recommendations for homogeneous groups to satisfaction levels not achievable by heterogeneous groups.
\end{hypothesis}

\begin{hypothesis}
    \begin{itshape}
        \label{hyp:Evaluation:HeterogenousBiggerSatisfactionIncrease} More heterogeneous groups see a bigger increase in satisfaction than less heterogeneous groups when switching from a decision of a dictator to a decision made by the recommender.
    \end{itshape} \medskip \\*
    The assumption is made that in more heterogeneous groups fewer members are satisfied with the dictator's decision. Therefore, the possible increase in satisfaction is higher. A homogeneous group in which the dictator's decision already satisfies all group members cannot see an increase in satisfaction. The assumption is therefore that the more people are dissatisfied or neutral with the dictator's decision, the more people will be lifted into satisfaction and the bigger the increase in satisfaction will be. However, a group with divergent interests might not actually be able to reach high levels of satisfaction.
\end{hypothesis}

\begin{hypothesis}
    \begin{itshape}
        \label{hyp:Evaluation:StoreSizeBetterResults} A higher amount of stored finished configurations results in a higher number of satisfied and a lower number of dissatisfied group members when the recommender is used to make the group decision.
    \end{itshape} \medskip \\*
    This hypothesis is based on the fact that being able to choose from a bigger pool of configurations increases the chances of arriving at a good recommendation. This requires the assumption that the aggregation strategies pick configurations that also fare better in the chosen satisfaction metric. If this is not the case, the hypothesis cannot be sustained.
\end{hypothesis}

\begin{hypothesis}
    \begin{itshape}
        \label{hyp:Evaluation:AggregationStrategies} The multiplication and best average aggregation strategies perform better than the least misery aggregation strategy.
    \end{itshape} \medskip \\*
    Best average and multiplication are the strategies that perform best in some of the online experiments listed by \citeauthor{Masthoff2015} \cite[~ 755f]{Masthoff2015}. Therefore, it is reasonable to assume that they perform well here too. Least misery was listed in some studies as performing worst. Accordingly, it is expected to fare worse than the other group aggregation strategies.
\end{hypothesis}

\section{Results}
\label{sec:Evaluation:Findings}

\subsection{Threshold Center Selection}
This section aims at finding a $tc$ value for the analysis. Fixing $tc$ is required to reduce the amount of data that has to be examined and to obtain meaningful results. For this purpose, all parameters except $tc$ are fixed. The preference aggregation strategy considered is multiplication because a brief look at the generated data shows good results for this strategy across the board. The configuration database contains all possible solutions (148 in total). This makes changes in satisfaction and dissatisfaction more visible, as the recommender has access to all possible configurations, and it also provides more stable and predictable results. \autoref{fig:Evaluation:tcChange} shows the satisfaction change depending on the choice of $tc$. Notably, the maximum of the satisfaction change precedes the minimum of the dissatisfaction change for all group types. Maxima and minima occur at different $tc$ values depending on the group type. Heterogeneous groups peak earliest, while homogeneous groups only show a peak towards the maximum $tc$ value. For homogeneous groups, changes in dissatisfaction are minimal even with $tc$ close to its maximum value. \autoref{fig:Evaluation:tcCount} shows the number of group members satisfied and dissatisfied with the dictator's decision. The number of satisfied people decreases with increasing $tc$, and the decrease accelerates. The dissatisfaction curve shows the reverse trend: the number of dissatisfied group members increases with $tc$, and its growth accelerates analogously. The behaviour of heterogeneous and random groups is similar, but the curve for heterogeneous groups shows less satisfaction and more dissatisfaction for a given $tc$. Also, both group types show a negative satisfaction change once $tc$ exceeds a certain value.
Homogeneous groups have only satisfied group members for most $tc$ values, but the number of satisfied members decreases rapidly for $tc$ values greater than $85\%$. The number of dissatisfied group members is zero over the whole range of $tc$ except for a barely noticeable upward tick at the end.

\begin{figure}
    \centering
    \includegraphics[width=1\textwidth]{./figures/60_evaluation/tc_change__multi__db-size-148.pdf}
    \caption[Satisfaction and Dissatisfaction: Change based on $tc$]{The average satisfaction and dissatisfaction change based on $tc$ with a database size of 148 and multiplication as aggregation strategy.}
    \label{fig:Evaluation:tcChange}
\end{figure}

\autoref{hyp:Evaluation:MaximumMinimum} states that the highest satisfaction change is expected where the average number of group members satisfied with the dictator's decision is close to two. The data does not support this hypothesis. The peaks in satisfaction change occur where this number equals $2.81$, $2.51$ and $3$ (heterogeneous, random, homogeneous). Most likely this happens because, at lower satisfaction numbers with the dictator's decision, the threshold for satisfaction is set so high that the group compromise classifies fewer group members as satisfied. Moreover, the valleys of the dissatisfaction change are not at the expected value of \textit{two} either. They are instead at $1.19$, $1.49$ and $0.04$ (heterogeneous, random, homogeneous) and thus lower than expected. However, the data for homogeneous groups seems to be cut off. Therefore, a judgement for homogeneous groups is difficult, and with slightly less heterogeneous groups this graph should show bigger effects.

\begin{figure}
    \centering
    \includegraphics[width=1\textwidth]{./figures/60_evaluation/tc_dictator__multi__db-size-148.pdf}
    \caption[Satisfaction and Dissatisfaction: Average based on $tc$]{The average satisfaction and dissatisfaction with the dictator's decision based on $tc$.}
    \label{fig:Evaluation:tcCount}
\end{figure}

The trend predicted by \autoref{hyp:Evaluation:HigherTcLessSatisfied}, that a higher $tc$ results in lower satisfaction and higher dissatisfaction with the dictator's decision, can be clearly seen in \autoref{fig:Evaluation:tcCount} and has already been described in this section. For the evaluation this means that the behaviour of the recommender is predictable, and it suggests that the metrics used model behaviour expected in reality.

\autoref{hyp:Evaluation:OnlyOneSatisfied} predicts that the number of group members satisfied with the dictator's decision eventually reaches one and that no one is satisfied with the group recommender's decision. This means the satisfaction change should decrease to minus one. \autoref{fig:Evaluation:tcCount} shows a downward trend that comes close to one for heterogeneous and random groups. The trend therefore suggests that the hypothesis holds for heterogeneous and random groups, whereas the curve for homogeneous groups only just drops below $2.8$, suggesting that the hypothesis does not hold for homogeneous groups. In addition, the satisfaction change of heterogeneous groups decreases close to minus one, while this value is fully reached neither by random nor by homogeneous groups. The hypothesis therefore holds only for heterogeneous groups. A likely reason why it does not seem to hold for random or homogeneous groups is that the highest $tc$ value still includes multiple configurations, so that a recommended configuration keeps some group members satisfied some of the time. For random groups it is also possible that a group member other than the dictator is satisfied with the group decision.
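
The value of minus one follows directly if the satisfaction change is read as the difference between the two averages. Writing $\bar{n}^{\text{rec}}_{\text{sat}}$ and $\bar{n}^{\text{dict}}_{\text{sat}}$ (notation introduced here for illustration) for the average number of group members satisfied with the recommender's and with the dictator's decision, the limiting case described by the hypothesis gives
\begin{equation*}
    \Delta_{\text{sat}} = \bar{n}^{\text{rec}}_{\text{sat}} - \bar{n}^{\text{dict}}_{\text{sat}} = 0 - 1 = -1.
\end{equation*}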

During a group decision it is better to make one less person dissatisfied than to make one more person satisfied. Therefore, this thesis uses $tc$ values that are closer to the minimum of the dissatisfaction change than to the maximum of the satisfaction change. Fixing $tc$ per group type is necessary because the analysis would otherwise be infeasible due to the size of the parameter space. For heterogeneous groups the minimum is at $tc = 70\%$, which is therefore the value used to evaluate the remaining hypotheses. For random groups the minimum of the dissatisfaction change is at $tc = 85\%$, which is the value used for all following analyses of random groups. For homogeneous groups the dissatisfaction change keeps decreasing until the highest possible value of $tc$ is reached. Therefore, $tc = 94\%$ is used for their analysis.
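
With $sd = 5\%$, the chosen threshold centers correspond to the following satisfaction and dissatisfaction thresholds ($tc + sd$ and $tc - sd$):
\begin{equation*}
    \begin{array}{l|c|c|c}
        \text{group type}    & tc   & tc + sd & tc - sd \\
        \hline
        \text{heterogeneous} & 70\% & 75\%    & 65\%    \\
        \text{random}        & 85\% & 90\%    & 80\%    \\
        \text{homogeneous}   & 94\% & 99\%    & 89\%    \\
    \end{array}
\end{equation*}
Under the reading that the rank of a recommendation is the share of configurations it beats, and with the 148 possible configurations, a satisfaction threshold of $99\%$ corresponds to $0.99 \cdot 148 \approx 146.5$ configurations that a recommendation has to beat.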

\subsection{Recommender Performance Analysis}

\begin{figure}[p]
    \centering
    \includegraphics[width=1\textwidth]{./figures/60_evaluation/heterogeneous_combined__amount-1000__tc-70}
    \caption[Satisfaction and Dissatisfaction: Heterogeneous Groups]{The satisfaction and dissatisfaction using the group recommender for heterogeneous groups with $tc = 70\%$.}
    \label{fig:Evaluation:HeteroSatisfaction}
\end{figure}

\begin{figure}[p]
    \centering
    \includegraphics[width=1\textwidth]{./figures/60_evaluation/random_combined__amount-1000__tc-85}
    \caption[Satisfaction and Dissatisfaction: Random Groups]{The satisfaction and dissatisfaction using the group recommender for random groups with $tc = 85\%$.}
    \label{fig:Evaluation:RandomSatisfaction}
\end{figure}

\begin{figure}[p]
    \centering
    \includegraphics[width=1\textwidth]{./figures/60_evaluation/homogeneous_combined__amount-1000__tc-94}
    \caption[Satisfaction and Dissatisfaction: Homogeneous Groups]{The satisfaction and dissatisfaction using the group recommender for homogeneous groups with $tc = 94\%$.}
    \label{fig:Evaluation:HomoSatisfaction}
\end{figure}

This subsection keeps $tc$ fixed at the values chosen above. It describes how the satisfaction change and the total number of people satisfied with the recommender's decision depend on the number of stored configurations.
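
As a worked example of what the fixed values imply, the thresholds below follow from $sd = 5\%$ as defined in \autoref{sec:Evaluation:Metrics}: a person is satisfied with a configuration that is better than at least $tc + sd$ of all configurations and dissatisfied with one that is not better than $tc - sd$.
\begin{center}
\begin{tabular}{lccc}
    Group type & $tc$ & $tc + sd$ (satisfied) & $tc - sd$ (dissatisfied) \\ \hline
    Heterogeneous & $70\%$ & $75\%$ & $65\%$ \\
    Random & $85\%$ & $90\%$ & $80\%$ \\
    Homogeneous & $94\%$ & $99\%$ & $89\%$ \\
\end{tabular}
\end{center}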

\autoref{fig:Evaluation:HeteroSatisfaction} shows how satisfaction and dissatisfaction relate to the number of stored configurations. The left plot covers satisfaction and the right plot dissatisfaction. In each plot, the left y-axis shows the change compared to a decision made by a dictator, and the right y-axis shows the average number of group members who are satisfied or dissatisfied, respectively. With regard to change, higher numbers are better in the left plot and lower numbers are better in the right plot. Each plot contains three graphs: one for multiplication, one for least misery and one for best average. The graphs for satisfaction resemble a logarithmic curve, i.e. the increase in satisfaction change decelerates as the number of stored configurations grows. The change in satisfaction is never negative, and more than three quarters of the maximum increase is already reached at around 25 stored configurations. Moreover, the curve for multiplication lies above all other curves across the whole range, while least misery shows the lowest change across all values. The lowest satisfaction change is $0$ for least misery and $0.1$ for best average and multiplication. The highest is around $0.3$ for least misery, $0.4$ for best average and $0.5$ for multiplication.
For dissatisfaction change, all graphs lie in the negative range. Multiplication reaches the lowest value and best average the highest. The gap between the three functions is smaller than for the satisfaction increase. Overall, the curves are also flatter, meaning that with 25 stored configurations the change already reaches close to five sixths of the minimum value. The highest dissatisfaction change is $-0.4$ for all strategies, while the lowest is around $-0.57$ for least misery, $-0.53$ for best average and $-0.63$ for multiplication.
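
For reference, the three aggregation strategies compared here can be stated formally. The following is a sketch based on their common definitions in the group recommender literature; the notation is an assumption introduced for illustration, with $G$ the group, $c$ a configuration and $s_u(c)$ the score of $c$ for group member $u$, and the prototype's implementation may differ in detail.
\[
\begin{array}{lcl}
    \mathrm{multiplication}(c) & = & \displaystyle\prod_{u \in G} s_u(c) \\[10pt]
    \mathrm{least\ misery}(c) & = & \displaystyle\min_{u \in G} s_u(c) \\[10pt]
    \mathrm{best\ average}(c) & = & \displaystyle\frac{1}{|G|} \sum_{u \in G} s_u(c)
\end{array}
\]
Under this sketch, each strategy would recommend the stored configuration with the highest aggregated value.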

The figures for homogeneous (\autoref{fig:Evaluation:HomoSatisfaction}) and random groups (\autoref{fig:Evaluation:RandomSatisfaction}) have a similar shape, but their values and slopes differ. The satisfaction change for homogeneous groups is mostly negative, starting at $-2$, and only becomes positive for more than $100$ stored configurations, with a value of $0.04$. Multiplication and best average have higher values than least misery here, too. Moreover, the dissatisfaction change is mostly positive, within $[0,1]$, and falls slightly below zero only once more than $75$ configurations are stored.
Random groups, as seen in \autoref{fig:Evaluation:RandomSatisfaction}, mostly have a positive change in satisfaction. Values range from $-0.55$ to $0.27$ for least misery, and from $-0.27$ and $-0.28$, respectively, to $0.74$ for best average and multiplication. The change is higher than for heterogeneous groups. Dissatisfaction changes similarly to heterogeneous groups, but the values for random groups reach a lower level. They range from $0$ to $-0.59$ for least misery. Multiplication and best average behave similarly: both have their highest value at around $-0.21$, and the range goes down to $-0.84$ for best average and $-0.86$ for multiplication.

\autoref{fig:Evaluation:HeteroSatisfaction} also shows the average number of group members satisfied and dissatisfied with the recommender's decision. Satisfaction with the recommender's decision starts at $2.4$ and quickly reaches $2.65$ for least misery and $2.8$ for best average and multiplication. The highest value for multiplication is $2.89$. Dissatisfaction also plateaus quickly, and the values for the different recommenders are closer together. They start between $0.74$ (least misery) and $0.78$ (best average) and go as low as $0.62$ for least misery, $0.66$ for best average and $0.56$ for multiplication.

As shown in \autoref{fig:Evaluation:HomoSatisfaction}, the value range of the total numbers is much larger for homogeneous groups, but the overall shape stays the same. Satisfaction numbers go from $0.55$ to $2.95$. Least misery performs visibly worse than multiplication and best average, reaching only $2.7$. Dissatisfaction values range from $1.21$ to $0.01$, and the strategies are barely distinguishable, except that least misery seems to have the highest number of dissatisfied group members in the range $[25,50]$.

In total numbers, random groups show lower overall satisfaction at $tc = 85\%$, as seen in \autoref{fig:Evaluation:RandomSatisfaction}. Satisfaction numbers start at $1.33$ (least misery), $1.61$ (best average) and $1.6$ (multiplication) and go up to $2.15$ for least misery and $2.62$ for best average and multiplication. The dissatisfaction numbers start at $1.5$ for least misery and $1.27$ for best average and multiplication and level off at $0.9$ (least misery), $0.65$ (best average) and $0.63$ (multiplication). There is a clearly visible difference between least misery and the other two aggregation functions.

\subsection{Discussion}

Following the description of the data, the remaining hypotheses are discussed.
\autoref{hyp:Evaluation:HomogenousMoreSatisfied} states that homogeneous groups have more satisfied members with regard to both the dictator's and the group recommender's decision. \autoref{fig:Evaluation:tcCount} shows that this holds for the dictator's decision, as satisfaction in homogeneous groups is higher than in the other groups in every instance. However, \autoref{fig:Evaluation:HeteroSatisfaction}, \autoref{fig:Evaluation:HomoSatisfaction} and \autoref{fig:Evaluation:RandomSatisfaction} show that this does not hold for satisfaction with the recommender's decision when looking at the $tc$ values where the recommender performs best for each group type. There, homogeneous groups only reach the highest satisfaction when the recommender has access to all stored configurations. With a decreasing number of stored configurations, both random and heterogeneous groups achieve a higher satisfaction. This likely happens because of the similarity between group members. A recommender with imperfect knowledge and a reduced configuration database gives results that are not good enough to compete with the dictator, who always finds the perfect individual match, with which the members of a homogeneous group are also satisfied. It is important to note that, when the same $tc$ values are used, homogeneous groups have a higher number of satisfied people across the board.

\autoref{hyp:Evaluation:HeterogenousBiggerSatisfactionIncrease} states that the increase in satisfaction should be bigger for more heterogeneous groups. However, \autoref{fig:Evaluation:HeteroSatisfaction}, \autoref{fig:Evaluation:HomoSatisfaction} and \autoref{fig:Evaluation:RandomSatisfaction} show that this is not the case. The recommendations for heterogeneous groups do cause a larger change in satisfaction than those for homogeneous groups, but random groups show a positive change of even higher magnitude. The decrease in dissatisfaction is also larger among random groups. A possible explanation is that the interests of members of random groups are more aligned than those of heterogeneous groups, so their preferences do not diverge as much and compromises can satisfy more individual members. At the same time, the preferences are still far enough apart to cause sufficient dissatisfaction and neutrality with the dictator's decision.

The data shows that a larger configuration database leads to more satisfied group members than a smaller one. For dissatisfaction the inverse is seen: a larger configuration database causes the number of dissatisfied group members to drop compared to a small database. However, in some runs least misery has shown a slight drop in performance. This can be seen in \autoref{fig:Evaluation:HeteroSatisfaction} when comparing $74$ and $148$ stored configurations. Why this happens is not entirely clear, but a possible cause is that least misery only takes the worst-off member of the group into account. It is therefore possible that a second configuration with a slightly worse least misery score actually has a slight advantage in terms of dissatisfaction. If this second-best configuration ends up in a different database partition than the best one, the smaller database can result in less dissatisfaction on average. \autoref{hyp:Evaluation:StoreSizeBetterResults} is therefore supported by the data, but it does not fully hold for least misery.
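
The conjectured effect can be illustrated with a small constructed example. The numbers are hypothetical, and it is assumed purely for simplicity that least misery aggregates the members' percentile ranks directly. With $tc = 70\%$ and $sd = 5\%$, a member is dissatisfied with a configuration that is not better than $65\%$ of all configurations. Consider two configurations $a$ and $b$ and a hypothetical group of four members $u_1, \dots, u_4$:
\[
\begin{array}{lccccc}
      & u_1 & u_2 & u_3 & u_4 & \min \\ \hline
    a & 64\% & 64\% & 64\% & 64\% & 64\% \\
    b & 63\% & 90\% & 90\% & 90\% & 63\%
\end{array}
\]
Least misery prefers $a$ because $64\% > 63\%$, which leaves all four members dissatisfied, whereas $b$ leaves only $u_1$ dissatisfied. A smaller database that happens to contain $b$ but not $a$ would therefore yield less dissatisfaction in this constructed case.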

\autoref{hyp:Evaluation:AggregationStrategies} states that least misery performs worse than multiplication. For the change in satisfaction this can be seen across the board, and multiplication also performs best for dissatisfaction change. Compared with best average, however, least misery is not worse everywhere: \autoref{fig:Evaluation:HeteroSatisfaction} shows that least misery performs better than best average in terms of dissatisfaction reduction. This behaviour possibly occurs because an average yields the same result for heavily polarised decisions and for decisions that everyone feels neutral about. Least misery, on the other hand, only takes the group member least satisfied with the decision into account and therefore performs better in this respect. In other cases, however, it performs visibly worse. It is also worth noting that multiplication performs best across the board. This supports the findings of \citeauthor{Masthoff2015}~\cite[p.~755f.]{Masthoff2015} and also shows that the satisfaction model produces results similar to those of online evaluations.
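
A small constructed example illustrates why an average cannot separate polarised from neutral decisions. The scores are hypothetical and assumed to lie in $[0,1]$; $c_1$ is a polarising configuration and $c_2$ a neutral one for a hypothetical group of four:
\[
\begin{array}{lcccc}
        & \mathrm{scores} & \mathrm{average} & \mathrm{minimum} & \mathrm{product} \\ \hline
    c_1 & (0.9,\ 0.9,\ 0.1,\ 0.1) & 0.5 & 0.1 & 0.0081 \\
    c_2 & (0.5,\ 0.5,\ 0.5,\ 0.5) & 0.5 & 0.5 & 0.0625
\end{array}
\]
Best average rates both configurations equally, while least misery and multiplication both prefer $c_2$, which leaves no member with a very low score.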

Returning to the evaluation questions posed in \autoref{sec:Evaluation:Questions}, this section has shown that the recommender performs better than a dictator for random and heterogeneous groups. The average satisfaction depends on the chosen parameters, but within the chosen value range the average satisfaction with the recommender's decision lies above two and can reach close to three satisfied group members for a high number of stored configurations and for some group types. The number of stored finished configurations plays an important role in the recommender's performance, but the recommender still yields good results with only a fraction of them. This shows that the recommender provides useful decision support and a solid basis for group decisions. Most of the recommender's decisions improve group satisfaction, which shows that it can be used to improve group decisions.