
\chapter{Evaluation}
\label{ch:Evaluation}

This chapter evaluates the prototype in terms of its functionality and its properties. The evaluation is performed offline on synthetic data. For one use case, the forest use case, all valid configurations are generated. In addition, groups with explicit preferences and a configuration state (for example, the currently existing forest) are generated.

\section{Metric}
\label{sec:Evaluation:Metrics}

The evaluation requires a metric. This thesis proposes satisfaction as that metric. It was newly defined because the existing literature did not provide a metric usable for this thesis. Satisfaction is quantified by a threshold metric. A user's preferences are used to calculate a score for each possible configuration solution: the score is the average of the user's preferences for the characteristics that are part of the configuration. With these scores, a configuration can be compared to all other configurations and ranked by the percentage of configurations that it beats for a specific user.

The threshold metric has two parameters: the threshold center $tc$ and the satisfaction distance $sd$. The threshold for a person being satisfied is $tc + sd$, and the threshold for a person being dissatisfied is $tc - sd$. If a recommendation lies between these two thresholds, the person is classified as neither satisfied nor dissatisfied with the solution. This thesis uses $sd = 5\%$. This choice is guided by the assumption that people switch from satisfied to dissatisfied rather quickly \todo{find a source psychology}. Therefore, $tc$ is the only parameter of the metric that is varied in this thesis.

As an example, consider $tc = 60\%$. A person is then satisfied with a recommendation if it is better than at least $65\%$ of all possible finished configurations. The person is dissatisfied if the recommendation is not better than $55\%$ of all possible finished configurations. A recommendation that is better than at least $55\%$ but not better than $65\%$ of all possible solutions is considered neutral by the individual.
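
As a compact restatement of the metric, the classification can be sketched as follows. The symbols $C$ (set of all finished configurations), $p_u(x)$ (preference of user $u$ for characteristic $x$), $s_u(c)$ (score) and $r_u(c)$ (rank) are notation introduced here for illustration only, and the handling of ties and of the exact boundary values is an assumption:
\[
    s_u(c) = \frac{1}{|c|} \sum_{x \in c} p_u(x),
    \qquad
    r_u(c) = \frac{\left|\{\, c' \in C : s_u(c') < s_u(c) \,\}\right|}{|C|}
\]
\[
    \text{class}_u(c) =
    \left\{
    \begin{array}{ll}
        \text{satisfied}    & \text{if } r_u(c) \geq tc + sd \\
        \text{dissatisfied} & \text{if } r_u(c) < tc - sd \\
        \text{neutral}      & \text{otherwise}
    \end{array}
    \right.
\]
For $tc = 60\%$ and $sd = 5\%$, a configuration with $r_u(c) = 70\%$ would be classified as satisfied, one with $r_u(c) = 58\%$ as neutral, and one with $r_u(c) = 40\%$ as dissatisfied.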

Different $tc$ values model different situations. A high $tc$ models a situation in which there is a low willingness to compromise. Conversely, a low $tc$ models a group with a high willingness to compromise.

Classifying group members as satisfied or dissatisfied allows groups to be measured by the number of people who are satisfied and dissatisfied. Moreover, changes in satisfaction and dissatisfaction can be compared across different parameters. A reasonable $tc$ value has to be found for each group type; otherwise, derived metrics will not show any meaningful results.

\section{Evaluation Objective}
\label{sec:Evaluation:Questions}

The evaluation is guided by the following questions:

\todo[inline]{Flesh out this section a bit: add a short introduction stating that the evaluation is guided by the following questions, and follow each question with an explanation of why it is relevant / what answering it aims at. E.g. main question -> find out whether the recommender is actually beneficial for the group.
And that, e.g., the question about the number targets the technical properties of the recommender.}

\begin{itemize}
    \item Main question: How does the satisfaction with a group decision, guided by the recommender, differ from the satisfaction with the decision of a single decision maker, the dictator, who does not take the other group members' opinions into account?
    \item How many group members are satisfied with the group decision on average?
    \item How does the number of stored finished configurations relate to satisfaction with a recommendation?
\end{itemize}

\section{Use Case}
\label{sec:Evaluation:UseCase}

To evaluate the recommender, a use case is needed. This thesis evaluates a forestry use case with four stakeholders. \autoref{fig:Concept:ForestExample} presents the attributes and characteristics of this use case, but the rules that define invalid configurations are needed to describe it completely. These constraints are listed in \emph{not with} form in \autoref{tab:Evaluation:UseCase}.

\begin{table}
    \begin{center}
        \begin{tabular}{r|l}
            \textbf{characteristic} & \textbf{not with (any of the listed) characteristics} \\
            \hline
            $(\textit{indigenous}, \text{moderate})$   & $(\textit{resilient}, \text{high})$ \\
            \hline
            $(\textit{indigenous}, \text{high})$   & $(\textit{resilient}, \text{high}), (\textit{usable}, \text{moderate}), (\textit{usable}, \text{high}),$ \\
            & $(\textit{quantity}, \text{high}), (\textit{price}, \text{low})$ \\
            \hline
            $(\textit{resilient}, \text{moderate})$   & $(\textit{usable}, \text{high})$ \\
            \hline
            $(\textit{resilient}, \text{high})$   & $(\textit{usable}, \text{high}), (\textit{usable}, \text{moderate}), (\textit{quantity}, \text{high}),$ \\
            & $(\textit{price}, \text{moderate}), (\textit{price}, \text{low})$ \\
            \hline
            $(\textit{usable}, \text{low})$ & $(\textit{quantity}, \text{high}), (\textit{price}, \text{moderate}), (\textit{price}, \text{low})$\\
            \hline
            $(\textit{usable}, \text{high})$ & $(\textit{accessibility}, \text{high})$\\
            \hline
            $(\textit{effort}, \text{manual})$ & $(\textit{quantity}, \text{high}), (\textit{price}, \text{low}), (\textit{price}, \text{moderate})$\\
            \hline
            $(\textit{effort}, \text{harvester})$ & $(\textit{accessibility}, \text{high}), (\textit{accessibility}, \text{moderate})$\\
            \hline
            $(\textit{effort}, \text{autonomous})$ & $(\textit{accessibility}, \text{high}), (\textit{accessibility}, \text{moderate})$\\
            \hline
            $(\textit{quantity}, \text{low})$ & $(\textit{price}, \text{low}), (\textit{price}, \text{moderate})$\\
            \hline
            $(\textit{quantity}, \text{moderate})$ & $(\textit{price}, \text{low}), (\textit{price}, \text{moderate})$\\
            \hline
            $(\textit{quantity}, \text{high})$ & $(\textit{accessibility}, \text{high}), (\textit{accessibility}, \text{moderate})$\\
            \hline
        \end{tabular}
        \caption{Constraints in \emph{not with} form for the forest use case.}
        \label{tab:Evaluation:UseCase}
    \end{center}
    \todo[inline]{matrix instead of table}
\end{table}

The stakeholders in this use case are a forest owner, an athlete, an environmentalist, and a consumer. The owner sees the forest as an investment and is interested in a high long-term profit. The consumer, on the other hand, is interested in a reasonable wood price, as she uses wood for furniture and for her fireplace. In contrast, the environmentalist is interested in a healthy forest that is not negatively impacted by human activity. Finally, the athlete is interested in good accessibility of the forest and in the presence of some plant and animal life.

\todo[inline]{Close the section with: the preferences here contradict each other. And: what are the stakeholders supposed to decide now? What situation are they in? How is a group composed? Of 4 people, one of each type?}

\section{Data Generation}
\label{sec:Evaluation:GeneratingGroups}

\todo[inline]{This section is not yet consistent to me. The figure is missing elements (generating preferences \& groups), and the text does not describe the pairing of preferences(?) and configurations. And: what is actually paired: preferences or groups?}

The whole process explained in \todo[inline]{create a better transition here: to evaluate the use case, data was generated based on the preceding information. The visualization...} this section is visualized in \autoref{fig:Evaluation:GeneratingDataProcess}.

\begin{figure}
    \centering
    \includegraphics[width=1\textwidth]{./figures/60_evaluation/bpmn_evaluation_input_data_generation.pdf}
    \caption{Data generation process for the evaluation}
    \label{fig:Evaluation:GeneratingDataProcess}
\end{figure}

\subsection{Unfinished Configurations Generation}

Unfinished configurations are generated by taking each finished configuration and keeping only a subset of its characteristics. This way, all generated configurations are valid and lead to valid solutions. For the results presented in this chapter, around $\frac{1}{7} \approx 14\%$ \todo{why this number} of the characteristics are kept.
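
As a rough illustration of this ratio, assume that a finished configuration contains exactly one characteristic for each of the seven attributes that appear in the tables of this chapter (\textit{indigenous}, \textit{resilient}, \textit{usable}, \textit{effort}, \textit{quantity}, \textit{price}, \textit{accessibility}). This one-characteristic-per-attribute structure is an assumption made here and not a statement about the implementation. Keeping a fraction of $\frac{1}{7}$ would then correspond to one remaining characteristic per unfinished configuration:
\[
    7 \cdot \frac{1}{7} = 1,
    \qquad
    \frac{1}{7} \approx 14.3\%
\]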

\subsection{Preference Generation}

For the forest use case, the idea is that there are multiple types of user profiles. Each user profile is represented by a neutral, negative, or positive attitude towards each characteristic. During data generation, the attitude is converted to a preference \todo{maybe mention here again that you use preferences between 0 and 1; currently this is only stated in the figure} using a normal distribution. \autoref{fig:Evaluation:DataGeneration} shows how a user profile is converted to preferences. The actual group member profiles are shown in \autoref{tab:Evaluation:GroupMemberMappings}.

\pgfplotsset{height=5cm,width=\textwidth,compat=1.8}
\pgfmathdeclarefunction{gauss}{2}{%
  \pgfmathparse{1/(#2*sqrt(2*pi))*exp(-((x-#1)^2)/(2*#2^2))}%
}
\begin{figure}
    \begin{tikzpicture}
        \begin{axis}[
            every axis plot post/.append style={
                mark=none, domain=0:1, samples=50, smooth
            },
            axis x line*=bottom,
            xmin=0,
            xmax=1,
            ymin=0.1,
            xticklabel style={
                /pgf/number format/precision=3,
            },
            xtick={0,0.25, 0.5, 0.75,1},
            hide y axis]
          \addplot [draw=black, style={dashdotdotted}][very thick] {gauss(0.25,0.1)} node[text=black][above,pos=0.5] {negative};
          \addplot [draw=black, style={solid}][very thick] {gauss(0.5,0.05)} node[text=black][above,pos=0.48] {neutral};
          \addplot [draw=black, style={dotted}][very thick] {gauss(0.75,0.1)} node[text=black][above,pos=0.5] {positive};
        \end{axis}
        \end{tikzpicture}
 \caption{Distribution of preferences for a user type.}
\label{fig:Evaluation:DataGeneration}
\end{figure}

These user profiles can be used to generate rather homogeneous groups as well as groups with more conflicting interests. The following group types are generated: \todo{what exactly do these groups look like? How many people do they consist of?}

\begin{itemize}
    \item random groups (preferences are uniformly random)
    \item heterogeneous groups (each person adheres to one of the preference profiles: forest owner, athlete, consumer, or environmentalist)
    \item homogeneous groups (all group members share one preference profile, which in this evaluation is the forest owner)
\end{itemize}
\todo[inline]{why these distinctions}

\begin{table}
    \begin{center}
        \begin{tabular}{l|c|c|c|c}
            characteristic                              & athlete           & forest owner      & environmentalist  & consumer          \\
            \hline
            $(\textit{indigenous}, \text{low})$         & \textbf{negative} & \textit{positive} & \textbf{negative} & neutral           \\
            $(\textit{indigenous}, \text{moderate})$    & \textit{positive} & neutral           & \textbf{negative} & neutral           \\
            $(\textit{indigenous}, \text{high})$        & \textit{positive} & \textbf{negative} & \textit{positive} & \textbf{negative} \\
            \hline
            $(\textit{resilient}, \text{low})$          & neutral           & \textit{positive} & neutral           & neutral           \\
            $(\textit{resilient}, \text{moderate})$     & \textit{positive} & neutral           & neutral           & neutral           \\
            $(\textit{resilient}, \text{high})$         & \textit{positive} & \textbf{negative} & \textbf{negative} & \textbf{negative} \\
            \hline
            $(\textit{usable}, \text{low})$             & neutral           & neutral           & neutral           & \textbf{negative} \\
            $(\textit{usable}, \text{moderate})$        & neutral           & neutral           & \textbf{negative} & neutral           \\
            $(\textit{usable}, \text{high})$            & \textbf{negative} & \textit{positive} & \textbf{negative} & \textit{positive} \\
            \hline
            $(\textit{effort}, \text{manual})$          & \textbf{negative} & neutral           & \textit{positive} & \textbf{negative} \\
            $(\textit{effort}, \text{harvester})$       & \textbf{negative} & \textit{positive} & \textbf{negative} & neutral           \\
            $(\textit{effort}, \text{autonomous})$      & \textbf{negative} & \textit{positive} & \textbf{negative} & neutral           \\
            \hline
            $(\textit{quantity}, \text{low})$           & \textit{positive} & \textbf{negative} & \textit{positive} & \textbf{negative} \\
            $(\textit{quantity}, \text{moderate})$      & neutral           & \textit{positive} & neutral           & \textbf{negative} \\
            $(\textit{quantity}, \text{high})$          & \textbf{negative} & \textit{positive} & \textbf{negative} & \textit{positive} \\
            \hline
            $(\textit{price}, \text{low})$              & neutral           & neutral           & neutral           & \textit{positive} \\
            $(\textit{price}, \text{moderate})$         & neutral           & \textit{positive} & neutral           & neutral           \\
            $(\textit{price}, \text{high})$             & neutral           & \textit{positive} & neutral           & \textbf{negative} \\
            \hline
            $(\textit{accessibility}, \text{low})$      & \textbf{negative} & \textit{positive} & \textit{positive} & neutral           \\
            $(\textit{accessibility}, \text{moderate})$ & neutral           & neutral           & neutral           & neutral           \\
            $(\textit{accessibility}, \text{high})$     & \textit{positive} & \textbf{negative} & \textbf{negative} & neutral           \\
            \hline
        \end{tabular}
        \caption{The attitudes of each group member profile.}
        \label{tab:Evaluation:GroupMemberMappings}
    \end{center}
\end{table}

\subsection{The Effect of Stored Finished Configurations}

When evaluating a subset of the stored finished configurations, it is important to avoid outliers. For this reason, a process inspired by \emph{cross validation} \todo{add reference} is used. The configuration database is randomly ordered and sliced into sub databases of the needed size. For example, if the evaluated stored data size is 20, a configuration database containing 100 configurations is split into five sub databases of size 20. The evaluation is then run on each sub database, and the average over all sub databases is taken as the result. This prevents the result from depending on a randomly picked subset that performs much better or much worse than most other possible subsets. This way, the result is closer to the \emph{expected value} \todo{reference}.
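
The averaging step can be sketched as follows. The symbols $N$ (size of the configuration database), $n$ (evaluated stored data size), $D_i$ (the $i$-th sub database) and $m(D_i)$ (a result measured on $D_i$, for instance the number of satisfied group members) are introduced here for illustration, and it is assumed for simplicity that $n$ divides $N$ evenly:
\[
    k = \frac{N}{n},
    \qquad
    \bar{m} = \frac{1}{k} \sum_{i=1}^{k} m(D_i)
\]
For the example above, $N = 100$ and $n = 20$ give $k = 5$, so the reported value is the mean of five separate evaluation runs.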


\section{Hypotheses}
\label{sec:Evaluation:Hypotheses}

This section gives an overview of the hypotheses used during data analysis. Each hypothesis is stated first, followed by its explanation.

\begin{hypothesis}
    \begin{itshape}
        \label{hyp:Evaluation:MaximumMinimum} The highest improvements through group recommendation occur when the number of people satisfied with the dictator's decision is slightly lower than two. The same holds analogously for dissatisfaction.
    \end{itshape} \medskip \\*
    This expectation rests on the assumption that, in a real situation, a group of four with slightly fewer than two members satisfied with the dictator's decision on average has enough room for improvement that potentially three group members can be satisfied after using the recommender. This means that at least one more person is satisfied with the compromise. In some groups, it might even be possible to lift the last person from dissatisfaction towards a neutral attitude. A higher base satisfaction is assumed to reduce the chance of satisfying an additional group member.
\end{hypothesis}


\begin{hypothesis}
    \begin{itshape}
        \label{hyp:Evaluation:HigherTcLessSatisfied} A higher $tc$ value results in fewer satisfied people and more dissatisfied people with regard to the dictator's decision.
    \end{itshape} \medskip \\*
    A higher $tc$ value causes a person to be dissatisfied with more configurations and satisfied with fewer configurations. Therefore, when a random configuration is recommended, the chance of satisfying an individual decreases while the chance of that person being dissatisfied increases. This change in probability alone suggests that the same effect should be visible for non-random recommendations, too.
\end{hypothesis}
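
The probability argument behind \autoref{hyp:Evaluation:HigherTcLessSatisfied} can be made concrete with a small worked example. It assumes that the recommended configuration is drawn uniformly at random from all finished configurations and that a person's scores contain no ties, so that the fraction of configurations beaten by the draw is approximately uniformly distributed. Both are simplifying assumptions made here for illustration:
\[
    P(\text{satisfied}) \approx 1 - (tc + sd),
    \qquad
    P(\text{dissatisfied}) \approx tc - sd
\]
With $sd = 5\%$, a value of $tc = 60\%$ gives $P(\text{satisfied}) \approx 35\%$ and $P(\text{dissatisfied}) \approx 55\%$, whereas $tc = 80\%$ gives $P(\text{satisfied}) \approx 15\%$ and $P(\text{dissatisfied}) \approx 75\%$.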

\begin{hypothesis}
    \begin{itshape}
        \label{hyp:Evaluation:OnlyOneSatisfied} There exists a $tc$ value which causes only one person to be satisfied with the dictator's decision and no one to be satisfied with the group recommender's decision.
    \end{itshape} \medskip \\*
    A sufficiently high $tc$ value should eventually leave only the dictator herself satisfied with the dictator's decision. The bar for satisfaction then lies so high that any group recommendation leaves the dictator, too, dissatisfied or at best neutral towards the group decision. This can be understood as a group in which nobody is willing to compromise, so that everyone is satisfied only with their own decision. If two members have identical interests, this effect is of course not present, but this is expected to be rare for a group size of four.
\end{hypothesis}

\begin{hypothesis}
    \begin{itshape}
        \label{hyp:Evaluation:HomogenousMoreSatisfied} Compared to heterogeneous groups, homogeneous groups have more members who are satisfied with the recommender's decision and also more members who are satisfied with the dictator's decision.
    \end{itshape} \medskip \\*
    As the interests in homogeneous groups are more aligned, the overall satisfaction level is expected to be higher for more homogeneous groups. If the base level is already higher, it is likely that even a slight increase lifts recommendations for homogeneous groups to satisfaction levels not reachable by heterogeneous groups.
\end{hypothesis}

\begin{hypothesis}
    \begin{itshape}
        \label{hyp:Evaluation:HeterogenousBiggerSatisfactionIncrease} More heterogeneous groups see a bigger satisfaction increase than less heterogeneous groups when switching from the decision of a dictator to a decision made by the recommender.
    \end{itshape} \medskip \\*
    It is assumed that in more heterogeneous groups the satisfaction with the dictator's decision is lower, which leaves more room for an increase in satisfaction. A homogeneous group whose members are all already satisfied with the dictator's decision cannot see an increase in satisfaction. Conversely, the more people are dissatisfied or not satisfied with the dictator's decision, the more people can be lifted into satisfaction, and the bigger the increase will be. However, a group with contradicting interests might not be able to reach high satisfaction levels at all.
\end{hypothesis}

\begin{hypothesis}
    \begin{itshape}
        \label{hyp:Evaluation:StoreSizeBetterResults} A higher number of stored finished configurations results in a higher number of satisfied and a lower number of dissatisfied group members.
    \end{itshape} \medskip \\*
    This hypothesis is based on the fact that a bigger pool of configurations to choose from increases the chance of finding a good recommendation. This of course requires the assumption that the configurations picked by the aggregation strategies also fare better in the chosen satisfaction metric. If that is not the case, this hypothesis should not hold.
\end{hypothesis}

\begin{hypothesis}
    \begin{itshape}
        \label{hyp:Evaluation:AggregationStrategies} Multiplication and best average aggregation strategies perform better than least misery across the board.
    \end{itshape} \medskip \\*
    Best average and multiplication are strategies that perform best in some of the online experiments listed by \citeauthor{Masthoff2015} \cite[p. 755f]{Masthoff2015}. It is therefore reasonable to assume that they perform well here, too. Least misery was reported in some studies as performing worst. It is therefore expected to perform worse than the other group aggregation strategies.
\end{hypothesis}

\section{Findings}
\label{sec:Evaluation:Findings}

\subsection{Threshold Center}

To get an understanding of the data, all parameters except $tc$ are fixed. The preference aggregation strategy considered is multiplication. The configuration database contains all possible solutions (148 in total). This results in a more visible effect, as the recommender has access to all possible configurations.

\autoref{fig:Evaluation:tcChange} shows the satisfaction change depending on the choice of $tc$. Of note is that the maximum of the satisfaction change precedes the minimum of the dissatisfaction change for all group types. Maxima and minima occur at different $tc$ values depending on the group type. Heterogeneous groups peak earliest, while homogeneous groups only show a peak towards the maximum $tc$ value. Changes in dissatisfaction are minimal even with $tc$ close to its maximum value.

\autoref{fig:Evaluation:tcCount} shows the number of group members satisfied and dissatisfied with the dictator's decision. The number of satisfied people decreases with increasing $tc$, and the decrease accelerates. The dissatisfaction curve shows the opposite trend: the number of dissatisfied group members increases with $tc$, and its growth accelerates analogously to the decline of the satisfaction curve. Heterogeneous and random groups behave similarly, but the curve for heterogeneous groups shows less satisfaction and more dissatisfaction for a given $tc$. Moreover, for both group types the satisfaction change becomes negative once $tc$ is sufficiently high. Homogeneous groups have only satisfied group members for most $tc$ values, but their number decreases rapidly for values greater than $85\%$. Their number of dissatisfied group members is zero over the whole range of $tc$, except for a barely noticeable uptick at the end.

\begin{figure}
    \centering
    \includegraphics[width=1\textwidth]{./figures/60_evaluation/tc_change__multi__db-size-148.pdf}
    \caption{The average satisfaction and dissatisfaction change based on $tc$ with a database size of 148 and multiplication as aggregation strategy.}
    \label{fig:Evaluation:tcChange}
\end{figure}

\autoref{hyp:Evaluation:MaximumMinimum} states that the highest satisfaction change is expected where the overall satisfaction with the dictator's decision is close to two. The data does not support this hypothesis. The peaks in satisfaction change occur at satisfaction values of $2.81$, $2.51$, and $3$ (heterogeneous, random, and homogeneous groups). Likewise, the valleys in dissatisfaction change are not at the expected value of \textit{two}. They are instead at $1.19$, $1.49$, and $0.04$ (heterogeneous, random, and homogeneous groups), which is lower than expected. However, the data for homogeneous groups appears to be cut off. It is therefore not possible to say whether a use case with more possible solutions would show a bigger decrease.

\begin{figure}
    \centering
    \includegraphics[width=1\textwidth]{./figures/60_evaluation/tc_dictator__multi__db-size-148.pdf}
    \caption{The average satisfaction and dissatisfaction with the dictator's decision based on $tc$.}
    \label{fig:Evaluation:tcCount}
\end{figure}

The trend predicted by \autoref{hyp:Evaluation:HigherTcLessSatisfied}, that a higher $tc$ results in lower satisfaction and higher dissatisfaction with the dictator's decision, can be clearly seen in \autoref{fig:Evaluation:tcCount} and has already been described in this section.

\autoref{hyp:Evaluation:OnlyOneSatisfied} predicts that the satisfaction with the dictator's decision eventually reaches one and that no one is satisfied with the group recommender's decision. This means that the satisfaction change should reach minus one. \autoref{fig:Evaluation:tcCount} shows a downward trend that comes close to one for heterogeneous and random groups. The trend therefore suggests that the hypothesis holds for heterogeneous and random groups. For homogeneous groups, however, the satisfaction only drops to just below $2.8$, which suggests that the hypothesis does not hold for them. Furthermore, the satisfaction change comes close to minus one for heterogeneous groups, but this value is reached neither by random groups nor by homogeneous groups. The hypothesis should therefore not be seen as confirmed in this regard either, and further investigation is needed.
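
The value of minus one follows from how the satisfaction change relates to the absolute numbers. Writing $S_{\text{rec}}$ and $S_{\text{dict}}$ for the average number of group members satisfied with the recommender's and with the dictator's decision (notation introduced here, under the assumption that the change is the plain difference of the two averages), the situation described by \autoref{hyp:Evaluation:OnlyOneSatisfied} gives:
\[
    \Delta S = S_{\text{rec}} - S_{\text{dict}} = 0 - 1 = -1
\]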

In a group decision, it is better to have one fewer dissatisfied person than one more satisfied person. Therefore, this thesis uses $tc$ values that are closer to the minimum of the dissatisfaction change than to the maximum of the satisfaction change. For heterogeneous groups, the minimum is at $tc = 70\%$, so this value is used for the evaluation of other aspects. For random groups, the minimum of the dissatisfaction change is at $tc = 85\%$, which is used for all following analyses of random groups. For homogeneous groups, the dissatisfaction change keeps decreasing until the highest possible value of $tc$ is reached. Because of that, $tc = 94\%$ is used for their analysis.

\subsection{Analysing Data}

In this subsection, $tc$ is fixed. The subsection examines the satisfaction change and the total number of people satisfied with the recommender's decision, depending on the number of stored configurations. For clarity, not all graphs of the data are included here. The remaining graphs can be found in the appendix and are referenced in the text.

\autoref{fig:Evaluation:HeteroSatisfactionIncrease} shows the relationship between the change in satisfaction and dissatisfaction and the number of stored configurations. There are three graphs each: one for multiplication, one for least misery, and one for best average. The graphs for satisfaction resemble a logarithmic curve: the increase in satisfaction change decelerates with a higher number of stored configurations. The change in satisfaction is never below zero, and more than three quarters of the maximum increase is already reached with around 25 stored configurations. Moreover, the curve for multiplication lies above all other curves over the whole range. Least misery shows the smallest change across all values. The minimum satisfaction change is $0$ for least misery and $0.1$ for best average and multiplication. The maximum is around $0.3$ for least misery, $0.4$ for best average, and $0.5$ for multiplication.
For the dissatisfaction change, all graphs are in the negative range. Multiplication reaches the lowest value and best average the highest. The gap between the three curves is smaller than for the satisfaction increase. Overall, the curves are also flatter: with 25 stored configurations, the change already reaches close to five sixths of the minimum value. The highest dissatisfaction change is $-0.4$ for all strategies, while the lowest is around $-0.57$ for least misery, $-0.53$ for best average, and $-0.63$ for multiplication.

The figures for homogeneous groups (\autoref{fig:Appendix:HomoSatisfactionIncrease}) and random groups (\autoref{fig:Appendix:RandomSatisfactionIncrease}) are in the appendix. The figures have a similar shape, but their values and slopes vary. The satisfaction change for homogeneous groups is mostly negative, starting at $-2$, and only reaches a positive level, with a value of $0.04$, for more than $100$ stored configurations. Multiplication and best average have higher values than least misery here, too. Moreover, the dissatisfaction change is positive across the board, with a value range of $[0,1]$.
Random groups, as seen in \autoref{fig:Appendix:RandomSatisfactionIncrease}, mostly have a positive change in satisfaction. The values range from $-0.55$ to $0.27$ for least misery, and from $-0.27$ and $-0.28$, respectively, to $0.74$ for best average and multiplication. The change is higher than the change for heterogeneous groups. Dissatisfaction changes similarly to heterogeneous groups, but the values for random groups reach a lower level. They range from $0$ to $-0.59$ for least misery. Multiplication and best average behave similarly: both start at around $-0.21$ and go down to $-0.84$ for best average and $-0.86$ for multiplication.

\begin{figure}
    \centering
    \includegraphics[width=1\textwidth]{./figures/60_evaluation/heterogeneous_happy_unhappy_increase_amount-1000__tc-70}
    \caption{The satisfaction and dissatisfaction change using the group recommender for heterogeneous groups with $tc = 70\%$.}
    \label{fig:Evaluation:HeteroSatisfactionIncrease}
\end{figure}

\autoref{fig:Evaluation:HeteroSatisfactionTotal} shows the total number of group members satisfied and dissatisfied with the recommender's decision. The horizontal, solid black line shows the satisfaction and dissatisfaction with the dictator's decision. The graphs show the same curves as \autoref{fig:Evaluation:HeteroSatisfactionIncrease}, but in absolute numbers. Satisfaction with the recommender's decision starts at $2.4$ and quickly reaches $2.65$ for least misery and $2.8$ for best average and multiplication. The highest value for multiplication is $2.89$. Dissatisfaction also quickly plateaus. Here, the values for the different strategies are closer together. They start between $0.74$ (least misery) and $0.78$ (best average) and go as low as $0.62$ for least misery, $0.66$ for best average, and $0.56$ for multiplication.

As shown in \autoref{fig:Appendix:HomoSatisfactionTotal}, the value range for homogeneous groups is much larger, but the overall shape stays the same. Here, satisfaction numbers go from $0.55$ to $2.95$. Least misery performs visibly worse than multiplication and best average, reaching only $2.7$. Dissatisfaction values range from $1.21$ to $0.01$. The strategies are barely distinguishable visually, except that least misery seems to have the highest number of dissatisfied group members in the range $[25,50]$.

Random groups have less overall satisfaction with $tc = 85\%$, as seen in \autoref{fig:Appendix:RandomSatisfactionTotal}. Satisfaction numbers start at $1.33$ (least misery), $1.61$ (best average), and $1.6$ (multiplication) and go up to $2.15$ for least misery and $2.62$ for best average and multiplication. The dissatisfaction numbers start at $1.5$ for least misery and $1.27$ for best average and multiplication and level off at $0.9$ (least misery), $0.65$ (best average), and $0.63$ (multiplication). There is a clearly visible difference between least misery and the other two aggregation functions.

\begin{figure}
    \centering
    \includegraphics[width=1\textwidth]{./figures/60_evaluation/heterogeneous_happy_unhappy_total_amount-1000__tc-70}
    \caption{The average satisfaction and dissatisfaction with the recommender's decision for heterogeneous groups based on $tc = 70\%$.}
    \label{fig:Evaluation:HeteroSatisfactionTotal}
\end{figure}

After the description of the data, the focus now shifts to the remaining hypotheses that have not yet been evaluated.
\autoref{hyp:Evaluation:HomogenousMoreSatisfied} states that homogeneous groups have more satisfied members with regard to both the dictator's and the group recommender's decision. \autoref{fig:Evaluation:tcCount} shows that this holds for the dictator's decision, as in every instance satisfaction in homogeneous groups is higher than in the other groups. However, \autoref{fig:Evaluation:HeteroSatisfactionTotal}, \autoref{fig:Appendix:HomoSatisfactionTotal} and \autoref{fig:Appendix:RandomSatisfactionTotal} show that this does not hold for satisfaction with the recommender's decision when looking at the $tc$ values where the recommender performs best for each segment. There, the homogeneous groups only reach the highest number of satisfied members when the recommender has access to all stored configurations. With a decreasing number of stored configurations, both random and heterogeneous groups perform better. It is important to note that when the same $tc$ values are used, homogeneous groups have a higher number of satisfied members across the board.

\autoref{hyp:Evaluation:HeterogenousBiggerSatisfactionIncrease} states that the increase in satisfaction should be bigger for more heterogeneous groups. However, \autoref{fig:Evaluation:HeteroSatisfactionIncrease}, \autoref{fig:Appendix:HomoSatisfactionIncrease} and \autoref{fig:Appendix:RandomSatisfactionIncrease} show that this is not the case. The recommendations for heterogeneous groups do cause a larger change in satisfaction than those for homogeneous groups, but random groups see a positive change of even higher magnitude. The decrease in dissatisfaction is also larger among random groups.

The data shows that recommendations based on a larger configuration database lead to more satisfied group members than recommendations based on a smaller database. For dissatisfaction the inverse holds: a larger configuration database causes the number of dissatisfied group members to drop compared to a small database. However, in some runs least misery has shown a slight drop in performance with a larger database. This can be seen in \autoref{fig:Evaluation:HeteroSatisfactionIncrease} when comparing $74$ and $148$ stored configurations. Why this happens is not entirely clear, but a possible cause is that least misery only takes into account the worst performing member of the group. It is therefore possible that there is a second solution that is slightly worse in terms of its least misery score but actually has a slight advantage in terms of dissatisfaction. This second-best configuration can end up in the second database partition, which then results in less dissatisfaction on average. \autoref{hyp:Evaluation:StoreSizeBetterResults} is therefore supported by the data, but it does not fully hold up for least misery.
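
The possible cause described above can be illustrated with a small constructed example. All numbers are hypothetical and not taken from the evaluation data, and it is assumed for simplicity that least misery aggregates the same per-member ranking values that the thresholds are applied to. Consider a group of three members $u_1, u_2, u_3$ and $tc = 70\%$ with $sd = 5\%$, so that a member is dissatisfied at or below $65\%$ and satisfied at or above $75\%$, and two configurations $A$ and $B$:
\[
\begin{array}{lccccc}
  & u_1 & u_2 & u_3 & \min & \mbox{dissatisfied members} \\
A & 66\% & 60\% & 60\% & 60\% & 2 \\
B & 58\% & 80\% & 80\% & 58\% & 1
\end{array}
\]
Least misery prefers $A$ because its minimum is higher ($60\% > 58\%$), although $B$ leaves only one member dissatisfied. In this constructed case a database that contains only $B$ would lead to less dissatisfaction than a database that contains both configurations.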

\autoref{hyp:Evaluation:AggregationStrategies} states that least misery performs worse than multiplication. For the change in satisfaction this can be seen across the board; for the change in dissatisfaction, however, it is not true everywhere. \autoref{fig:Evaluation:HeteroSatisfactionIncrease} shows that least misery performs better than best average in terms of dissatisfaction reduction. In other cases, however, it performs visibly worse. It is also worth noting that multiplication performs best across the board. This supports the findings by \citeauthor{Masthoff2015} \cite[p.~755f.]{Masthoff2015} and also shows that the satisfaction model yields results similar to those of online evaluations.
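
For reference, the three aggregation strategies compared here are commonly defined as follows. The notation is an assumption made for this illustration only: $G$ denotes the group, $c$ a configuration and $s_u(c)$ the score that group member $u$ assigns to $c$.
\[
\mathit{LM}(c) = \min_{u \in G} s_u(c), \qquad
\mathit{BA}(c) = \frac{1}{|G|} \sum_{u \in G} s_u(c), \qquad
\mathit{MUL}(c) = \prod_{u \in G} s_u(c)
\]
Least misery only considers the lowest score in the group, whereas best average and multiplication take every member's score into account. For positive scores, multiplication additionally penalizes a single low score more strongly than the average does.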


Returning to \autoref{sec:Evaluation:Questions}, this section has shown that for random and heterogeneous groups the recommender performs better than a dictator. The average satisfaction depends on the chosen parameters, but for the chosen value range the average satisfaction with the recommender's decision lies above two and can come close to three satisfied group members for a high number of stored configurations and for some group types. The number of stored finished configurations plays an important role in performance, but even with only a fraction of the configurations stored the recommender still yields good results.