fixed parts of mistakes in evaluation chapter

hannes.kuchelmeister
2020-05-06 16:43:11 +02:00
parent 93bdb51f1d
commit e1bdf719ae


@@ -1,36 +1,36 @@
\chapter{Evaluation}
\label{ch:Evaluation}
In this chapter the prototype is evaluated in terms of its functionality and its properties. The evaluation is an offline evaluation with synthetic data. All possible valid configurations are generated for one use case i.e. all possible valid configurations for the forest use case. Moreover, groups with explicit preferences and a configuration state (which would be for example the currently existing forest) are generated, too.
In this chapter the prototype is evaluated in terms of its functionality and its properties. The evaluation is an offline evaluation with synthetic data. All possible valid configurations are generated for one use case, namely the forest use case. Moreover, groups with explicit preferences and a configuration state (for example, the currently existing forest) are generated, too.
\section{Metric}
\label{sec:Evaluation:Metrics}
For the evaluation a metric to evaluate by is needed. The proposed metric for usage is that of satisfaction. This metric has been newly created because existing literature did not provide metrics usable for this thesis. Satisfaction is quantified in this thesis by a threshold metric. A user's preference is used to calculate a rating for each possible solution. Each configuration solution gets an individual score determined by the user's preferences. The score is calculated using the average of a user's preference for each characteristic that is part of the configuration. The result allows that a configuration can be compared to all other configurations and ranked according to the percentage of configurations that it beats for a specific user. The threshold metric consists of two parameters. First the threshold center $tc$ and second the satisfaction distance $sd$. The threshold for a person being satisfied is at $tc + sd$ and of a person being dissatisfied is at $tc - sd$. If a recommendation lies in between these two thresholds the person is classified to neither be satisfied nor be unsatisfied with the solution. For this thesis $sd=5\%$ will be used. This choice is guided by the assumption that people switch from satisfied to unsatisfied rather quickly \todo{find a source psychology}. Therefore the parameter considered in this thesis is the $tc$. An example is the choice of $tc = 60\%$. This results in a person being satisfied with a recommendation if it is better than at least $65\%$ of all possible finished configurations. Moreover, a person is dissatisfied if the recommendation is not better than $55\%$ of possible finished configurations. A recommendation that is better than at least $55\%$ and not better than $65\%$ of possible solutions is considered neutral by the individual.
A metric is required to carry out the evaluation. The proposed metric is satisfaction. This metric was created because the pertinent literature does not provide metrics usable for this thesis. Satisfaction is quantified in this thesis by a threshold metric. A user's preferences are used to calculate a score for each possible configuration: the score is the average of the user's preferences for the characteristics that are part of the configuration. This score allows a configuration to be compared to all other configurations and ranked according to the percentage of configurations that it beats for a specific user. The threshold metric has two parameters: the threshold center $tc$ and the satisfaction distance $sd$. The threshold for a person being satisfied is at $tc + sd$ and the threshold for a person being dissatisfied is at $tc - sd$. If a recommendation lies between these two thresholds, the person is classified as neither satisfied nor dissatisfied with the solution. For this thesis $sd=5\%$ will be used. This choice is based on the assumption that people switch from satisfied to dissatisfied rather quickly \todo{find a source psychology}. Therefore, the only parameter varied in this thesis is $tc$. An example is the choice of $tc = 60\%$. In this case a person is satisfied with a recommendation if it is better than at least $65\%$ of all possible finished configurations. In contrast, a person is dissatisfied if the recommendation is not better than $55\%$ of all possible finished configurations. A recommendation that is better than at least $55\%$ but not better than $65\%$ of all possible solutions is considered neutral by the individual.
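
The threshold metric can be written compactly. The following is only a sketch: the symbols $C$, $K(c)$, $p_u(k)$, $s_u(c)$ and $r_u(c)$ are notation assumed here for illustration, and the strict comparison for ties as well as the denominator $|C|$ are assumptions as well. Let $C$ be the set of all finished configurations, $K(c)$ the set of characteristics contained in a configuration $c$, and $p_u(k) \in [0,1]$ the preference of user $u$ for characteristic $k$. The score of $c$ and the share of configurations it beats are then
\[
s_u(c) = \frac{1}{|K(c)|} \sum_{k \in K(c)} p_u(k),
\qquad
r_u(c) = \frac{|\{c' \in C : s_u(c') < s_u(c)\}|}{|C|},
\]
and user $u$ is classified as
\[
\left\{
\begin{array}{ll}
\mbox{satisfied} & \mbox{if } r_u(c) \geq tc + sd, \\
\mbox{dissatisfied} & \mbox{if } r_u(c) < tc - sd, \\
\mbox{neutral} & \mbox{otherwise.}
\end{array}
\right.
\]
As a worked example with $tc = 60\%$, $sd = 5\%$ and $|C| = 148$: a configuration that beats $100$ others has $r_u(c) = 100/148 \approx 0.676 \geq 0.65$ and counts as satisfying, one that beats $90$ has $r_u(c) \approx 0.608$ and counts as neutral, and one that beats $80$ has $r_u(c) \approx 0.541 < 0.55$ and counts as dissatisfying.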
\todo{(optional) visualize tc value with an example configuration}
Different $tc$ values allow to model different situations. A situation where there is a low willingness to compromise is modelled by a high $tc$. A contrary situation where a group has a high willingness to compromise is modelled by a low $tc$.
Different $tc$ values make it possible to model different situations. A situation with a low willingness to compromise is modelled by a high $tc$. Conversely, a situation in which a group has a high willingness to compromise is modelled by a low $tc$.
A satisfaction and dissatisfaction classification allows groups to be measured by the number of people that are satisfied and dissatisfied. Moreover, changes in satisfaction and dissatisfaction for different parameters can be compared. A reasonable $tc$ value has to be found for groups; otherwise, the derived metrics will not show meaningful results.
\section{Evaluation Objective}
\label{sec:Evaluation:Questions}
This section poses three questions that will be answered during the evaluation. The question's aim is to guide through this chapter. They set the guidelines for this evaluation and where focuses are set. The questions answered during the evaluation are:
This section poses three questions that will be answered during the evaluation. The questions are meant to guide the reader through this chapter. They set the guidelines for this evaluation and define its focus. The questions answered during the evaluation are:
\begin{itemize}
\item Main question: How does the satisfaction with a group decision, guided by the recommender, differ from the decision of a single decision maker, the dictator, who does not take the other group member's opinions into account?
\item How many group members are satisfied with the group decision on average?
\item Main question: How does the satisfaction with a group decision, guided by the recommender, differ from the decision of a single decision maker, the dictator, who does not take the other group members' opinions into account?
\item How many group members on average are satisfied with the group decision?
	\item How does the number of stored finished configurations relate to satisfaction with a recommendation?
\end{itemize}
The main question is used to understand the usefulness of the recommender and whether it gives benefits to groups. The second question is aimed at providing information regarding the data and how satisfaction looks like in group decisions and what factors influence it. Last, a technical question is posed. This question is relevant because it shows technical aspects of the recommender. This is important because other work for using the recommender in other possibly larger use cases depend on performance figures in relation to number of stored configurations.
The main question is addressed to understand the behaviour of the recommender and whether it benefits groups. The second question is aimed at providing information about the data, what satisfaction looks like in group decisions, and which factors influence it. Last, a technical question is posed. This question is relevant because it shows technical aspects of the recommender. This is important since other work that uses the recommender in other, possibly larger, use cases depends on performance figures in relation to the number of stored configurations.
\section{Use Case}
\label{sec:Evaluation:UseCase}
To evaluate the recommender, a use case is needed. In this thesis, a forestry use case is evaluated. This is a use case with four stakeholders. \autoref{fig:Concept:ForestExample} presents the attributes and characteristics of this use case but an extension is needed to fully show the whole use case. Namely the rules of non valid configurations are missing. Therefore the constraints for this use case are listed in \emph{not with} form in \autoref{tab:Evaluation:UseCase}.
To evaluate the recommender, a use case is needed. In this thesis, a forestry use case is evaluated. This is a use case with four stakeholders. \autoref{fig:Concept:ForestExample} presents the attributes and characteristics of this use case, but an extension is needed to show the whole use case. Namely, the rules describing invalid configurations are missing. Therefore, the constraints for this use case are listed in \emph{not with} form in \autoref{tab:Evaluation:UseCase}.
\begin{table}
\tiny
@@ -75,13 +75,13 @@ To evaluate the recommender, a use case is needed. In this thesis, a forestry us
\end{center}
\end{table}
The stakeholders in this use case are: a forest owner, an athlete, an environmentalist, and a consumer. The owner sees the forest as an investment, he is interested in a high long term profit. On the other hand the consumer is interested in reasonable wood price as she uses wood for furniture and also for her fireplace. In contrast, the environmentalist is interested in a healthy forest that is not impacted negatively by human activity. Last is the athlete who is interested in good accessibility of the forest and that there is some plant and animal life.
Every group consists of four people whereby they need to try and find a compromise. Diverging preferences make this difficult. All stakeholders have an interest in getting their will but also all parties need the acceptance of others with the decision. It is not in the interest of a stakeholder to fully have their preferences met, while ending up with protests that arise from the deep dissatisfaction of other group members.
The stakeholders in this use case are: a forest owner, an athlete, an environmentalist, and a consumer. The owner sees the forest as an investment and is interested in a high long-term profit. The consumer, on the other hand, is interested in a reasonable wood price, as consumers use wood for furniture and also for their fireplaces. In contrast, the environmentalist is interested in a healthy forest that is not impacted negatively by human activity. Last is the athlete, who is interested in good accessibility of the forest and in the presence of some plant and animal life.
Every group consists of four people who need to find a compromise. Diverging preferences make this difficult. All stakeholders have an interest in getting their way, but all parties also need the others to accept the decision. It is not in the interest of a stakeholder to have their preferences fully met while ending up with protests that arise from the deep dissatisfaction of other group members.
\section{Data Generation}
\label{sec:Evaluation:GeneratingGroups}
This section describes the data generation process as seen in \autoref{fig:Evaluation:GeneratingDataProcess} that generates data based on the use case int \autoref{sec:Evaluation:UseCase}. Group profiles are used to generate groups of four with different group member types. The exact group composition depends on the group type. For every parameter and group type $1000$ groups are generated and converted to preferences. This amount was chosen as it's the highest number that allows computing times to work over night on the hardware that is available. Also this number is large enough to reduce strong variability between runs. For each group unfinished configurations are generated and its preferences are paired up with the generated unfinished configurations. These pairs later on are used for the evaluation.
This section describes the data generation process shown in \autoref{fig:Evaluation:GeneratingDataProcess}, which generates data based on the use case in \autoref{sec:Evaluation:UseCase}. Group profiles are used to generate groups of four with different group member types. The exact group composition depends on the group type. For every parameter and group type $1000$ groups are generated and converted to preferences. This number was chosen as it is the highest number for which the computation still finishes overnight on the available hardware. This number is also large enough to reduce strong variability between runs. For each group, unfinished configurations are generated and the group's preferences are paired with the generated unfinished configurations. These pairs are later used for the evaluation.
\begin{figure}
\centering
@@ -92,11 +92,11 @@ This section describes the data generation process as seen in \autoref{fig:Evalu
\subsection{Unfinished Configurations Generation}
Unfinished configurations are generated using all finished configurations and taking a subset of the contained characteristics. This way all generated configurations will be valid and lead to valid solutions. For the results that are presented in this chapter around $\frac{1}{7} \approx 15\%$ of characteristics is kept. This value is chosen to allow the existing configuration to take effect but not to skew the results due to the penalty function severely limiting possible options.
Unfinished configurations are generated using all finished configurations and taking a subset of the contained characteristics. This way all generated configurations will be valid and lead to valid solutions. For the results that are presented in this chapter around $\frac{1}{7} \approx 14\%$ of the characteristics are kept. This value is chosen to allow the existing configuration to take effect but not to skew the results due to the penalty function severely limiting possible options.
\subsection{Preference Generation}
For the forest use case, the idea is that there are multiple types of user profiles. Each group profile is represented by a neutral, negative or positive attitude towards a characteristic. During data generation the attitude is converted to a preference using a normal distribution. Each preference lies in the interval $[0,1]$. Zero can be seen as worst possible option and one as best possible option. \autoref{fig:Evaluation:PreferenceDistribution} shows how the user profile can be converted to preferences. The actual group member profiles are shown in \autoref{tab:Evaluation:GroupMemberMappings}.
For the forest use case, the idea is that there are multiple types of user profiles. Each group profile is represented by a neutral, negative or positive attitude towards a characteristic. During data generation the attitude is converted to a preference using a normal distribution. Each preference lies in the interval $[0,1]$. Zero can be seen as worst possible option and one as best possible option. \autoref{fig:Evaluation:PreferenceDistribution} shows how the user profile can be converted to preferences. The actual group member profiles are shown in \autoref{tab:Evaluation:GroupMemberMappings}.
\pgfplotsset{height=5cm,width=\textwidth,compat=1.8}
\pgfmathdeclarefunction{gauss}{2}{%
@@ -174,23 +174,23 @@ These user profiles can be used to generate rather homogenous groups but also to
\item homogeneous groups (only one preference profile for all group members which in this evaluation is the forest owner)
\end{itemize}
The natural group type for the use case is a heterogeneous group but to widen the evaluation and to see how the recommender performs with different types of groups the two other group types are evaluated, too. Therefore more general statements about the recommender's performance can be made.
The natural group type for the use case is a heterogeneous group, but to widen the evaluation and to see how the recommender performs with different types of groups, the two other group types are evaluated too. As a result, more general statements about the recommender's performance can be made.
\subsection{The Effect of Stored Finished Configurations}
Another important component of the evaluation is the influence of stored finished configurations. When evaluating a subset of stored finished configurations it is important to avoid outliers. This is the reason why a process inspired by \emph{cross validation} \todo{referenz hinzufügen} is used. The configuration database is randomly ordered and sliced into sub databases of the needed size. As an example, if the evaluated stored data size is 20, a configuration database containing 100 configurations is split into five sub databases of size 20. Now the evaluation is done on each of the sub databases and as a result the average is taken. This avoid that randomly a subset can be picked which either performs much better than most other possible combinations of databases or which performs much worse. This way the data is more aligned to the \emph{expected value} \todo{referenz}.
Another important component of the evaluation is the influence of stored finished configurations. When evaluating a subset of stored finished configurations it is important to avoid outliers. This is the reason why a process inspired by \emph{cross validation} \todo{add reference} is used. The configuration database is randomly ordered and sliced into sub-databases of the needed size. As an example, if the evaluated stored data size is 20, a configuration database containing 100 configurations is split into five sub-databases of size 20. The evaluation is then carried out for each of the sub-databases and finally the average is determined. This avoids randomly picking a subset that performs either much better or much worse than most other possible subsets of the database. This way the result is closer to the \emph{expected value}. \todo{reference}
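
The averaging step can be sketched as follows. The symbols $D$, $n$, $m$, $D_i$ and $E(\cdot)$ are notation assumed here for illustration only; in particular, rounding down when $n$ does not divide $|D|$ is an assumption, as the treatment of leftover configurations is not specified. Let $D$ be the randomly ordered configuration database, $n$ the evaluated store size, and $E(D_i)$ the value of the evaluated measure (for example the number of satisfied group members) obtained with sub-database $D_i$. The reported value is then
\[
\bar{E}_n = \frac{1}{m} \sum_{i=1}^{m} E(D_i),
\qquad
m = \left\lfloor \frac{|D|}{n} \right\rfloor .
\]
For the example above, $|D| = 100$ and $n = 20$ give $m = 5$, so the reported value is the mean over five disjoint sub-databases.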
\section{Hypotheses}
\label{sec:Evaluation:Hypotheses}
This section gives an overview over the hypotheses tested during data analysis. Each hypothesis is followed by an explanation to why the hypothesis was posed. In later sections the truthfulness of the hypothesis is tested. This allows to test if expectations about the behaviour of the recommender are true or false.
This section gives an overview of the hypotheses tested during data analysis. Each hypothesis is followed by an explanation of why the hypothesis is presented. In later sections the validity of each hypothesis is examined. This makes it possible to verify whether expectations about the behaviour of the recommender are true or false.
\begin{hypothesis}
\begin{itshape}
\label{hyp:Evaluation:MaximumMinimum} Highest improvements for group recommendations are when the amount of people satisfied with the dictator's decision is slightly lower than two and the highest reduction in dissatisfied group members can be seen at around two group members dissatisfied respectively.
		\label{hyp:Evaluation:MaximumMinimum} Improvements for group recommendations are highest when the number of people satisfied with the dictator's decision is slightly lower than two, and the highest reduction in dissatisfied group members can be seen when around two group members are dissatisfied with the dictator's decision.
\end{itshape} \medskip \\*
This stems from the assumption that in a real situation a group of four, having a few less than two satisfied members on average (with a dictator's decision), has enough room for improvement so that potentially three group members can be satisfied after the use of the recommender. Meaning that at least one more person is satisfied with the compromise. Potentially in some groups it might even be possible to then lift the last person from dissatisfaction towards a neutral attitude. A higher base satisfaction is assumed to reduce the possibility to make an additional group member satisfied.
	This is based on the assumption that in a real situation a group of four with, on average, slightly fewer than two satisfied members (with a dictator's decision) has enough room for improvement so that potentially three group members can be satisfied after the use of the recommender. This means that at least one more person is satisfied with the compromise. In some groups it may then even be possible to shift the last person from dissatisfaction towards a neutral attitude. A higher base satisfaction is assumed to reduce the possibility of making an additional group member satisfied.
\end{hypothesis}
@@ -198,33 +198,33 @@ This section gives an overview over the hypotheses tested during data analysis.
\begin{itshape}
		\label{hyp:Evaluation:HigherTcLessSatisfied} A higher $tc$ value results in fewer satisfied people and more dissatisfied people with regard to the dictator's decision.
\end{itshape} \medskip \\*
A higher $tc$ value causes a person to be unsatisfied with a higher amount of configurations. Also it causes a person to be satisfied with less configurations. Therefore recommending a random configuration causes the chance of making an individual satisfied to sink while increasing the chance of that person being unsatisfied. Already the change in probability leads to the assumption that this should be seen with non random recommendations too.
	A higher $tc$ value causes a person to be dissatisfied with a larger number of configurations. It also causes a person to be satisfied with fewer configurations. Therefore, when a random configuration is recommended, the chance of an individual being satisfied decreases while the chance of that person being dissatisfied increases. This change in probability alone leads to the assumption that the same effect should be seen with non-random recommendations too.
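
	The probabilities for a random recommendation can be made explicit in a short sketch. It rests on assumptions not stated elsewhere in this chapter: the recommendation is drawn uniformly at random from all finished configurations, and ties in a user's scores are ignored, so that the share of configurations beaten by the recommendation is approximately uniformly distributed on $[0,1]$. Under these assumptions
	\[
	P(\mbox{satisfied}) \approx 1 - (tc + sd),
	\qquad
	P(\mbox{dissatisfied}) \approx tc - sd .
	\]
	With $sd = 5\%$, a value of $tc = 60\%$ gives roughly $35\%$ and $55\%$, while $tc = 70\%$ gives roughly $25\%$ and $65\%$. Raising $tc$ therefore lowers the first probability and raises the second by the same amount.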
\end{hypothesis}
\begin{hypothesis}
\begin{itshape}
\label{hyp:Evaluation:OnlyOneSatisfied} There exists a $tc$ value which causes only one person to be classified as satisfied with the dictator's decision and no one is classified as satisfied with the group recommender's decision.
\end{itshape} \medskip \\*
A $tc$ value that reaches a high enough level eventually should make only the dictator herself satisfied with the dictator's decision. The bar for satisfaction lies so high that any group recommendation will cause the dictator to also be not satisfied or at least neutral with the group decision. This can be understood as that in a group where nobody is willing to compromise everyone is only satisfied with one's own decision. Having two members with identical interest of course results in this effect not being present but this is expected to be rare for a group size of four.
	A sufficiently high $tc$ value should eventually leave only the dictator herself satisfied with the dictator's decision. The bound for satisfaction is then so high that any group recommendation will cause the dictator, too, to be dissatisfied, or at best neutral, with the group decision. This can be understood as a complete unwillingness of a group to compromise: all group members are only satisfied with their own decision. If two group members have identical interests, which is expected to be rare, this effect does not occur.
\end{hypothesis}
\begin{hypothesis}
\begin{itshape}
\label{hyp:Evaluation:HomogenousMoreSatisfied} Homogeneous groups have more satisfied members with the recommender's decision but also with the dictator's decision compared to heterogeneous groups.
\label{hyp:Evaluation:HomogenousMoreSatisfied} Homogeneous groups have more members satisfied with the recommender's decision but also with the dictator's decision compared to heterogeneous groups.
\end{itshape} \medskip \\*
As the interest in homogenous groups is more aligned there is an expectation that the overall satisfaction levels for more homogenous groups is higher. If the base level is already higher it is likely that even just a slight increase lifts recommendations for homogenous groups to satisfaction levels not reachable by heterogeneous groups.
	As the interests in homogeneous groups are more aligned, it is to be expected that the overall satisfaction levels for more homogeneous groups are higher. If the base level is already higher, it is likely that even a slight increase shifts recommendations for homogeneous groups to satisfaction levels not achievable by heterogeneous groups.
\end{hypothesis}
\begin{hypothesis}
\begin{itshape}
\label{hyp:Evaluation:HeterogenousBiggerSatisfactionIncrease} More heterogeneous groups see a bigger satisfaction increase than less heterogeneous groups when switching from a decision of a dictator to a decision made by the recommender.
\label{hyp:Evaluation:HeterogenousBiggerSatisfactionIncrease} More heterogeneous groups see a bigger increase in satisfaction than less heterogeneous groups when switching from a decision of a dictator to a decision made by the recommender.
\end{itshape} \medskip \\*
The assumption is made that in more heterogeneous groups the satisfaction with the dictator's decision is less. Therefore there is a higher possible increase in satisfaction. A homogenous group that already satisfies all group members with the dictator's decision cannot see an increase in satisfaction therefore the assumption is made, that with a higher amount of people dissatisfied and not satisfied with the dictator's decision, there will be more people that can be lifted into satisfaction and therefore the increase will be bigger. However a group that has contradicting interest actually might not be able to reach high satisfaction levels.
	The assumption is made that in more heterogeneous groups the satisfaction with the dictator's decision is lower. Therefore, there is a higher possible increase in satisfaction. A homogeneous group in which the dictator's decision already satisfies all group members cannot see an increase in satisfaction. The assumption is therefore made that with a higher number of people dissatisfied or neutral with the dictator's decision, more people will be lifted into satisfaction and the increase in satisfaction will be bigger. However, a group with contradicting interests might not be able to reach high satisfaction levels at all.
\end{hypothesis}
\begin{hypothesis}
\begin{itshape}
\label{hyp:Evaluation:StoreSizeBetterResults} A higher amount of stored finished configurations results in a higher amount of satisfied and a lower amount of dissatisfied group members when the recommender is used to make the group decision.
		\label{hyp:Evaluation:StoreSizeBetterResults} A larger number of stored finished configurations results in a higher number of satisfied and a lower number of dissatisfied group members when the recommender is used to make the group decision.
\end{itshape} \medskip \\*
	This hypothesis is based on the fact that having a bigger pool of configurations to choose from increases the chances of finding a good recommendation. This of course requires the assumption that the aggregation strategies used to pick recommendations select configurations that also fare better in the chosen satisfaction metric. If that is not the case, this hypothesis should not hold.
\end{hypothesis}
@@ -233,7 +233,7 @@ This section gives an overview over the hypotheses tested during data analysis.
\begin{itshape}
\label{hyp:Evaluation:AggregationStrategies} Multiplication and best average aggregation strategies perform better than least misery across the board.
\end{itshape} \medskip \\
Best average and multiplication are strategies that are performing best in some of the, by \citeauthor{Masthoff2015} \cite[~ 755f]{Masthoff2015}, listed online experiments. Therefore it is reasonable to assume that they perform well here, too. Least misery was listed in some studies as performing worst. Therefore there is an expectation of it faring less good than other group aggregation strategies.
	Best average and multiplication are the strategies that perform best in some of the online experiments listed by \citeauthor{Masthoff2015} \cite[~ 755f]{Masthoff2015}. Therefore, it is reasonable to assume that they perform well here, too. Least misery was listed in some studies as performing worst. Therefore, it is expected to fare worse than the other group aggregation strategies.
\end{hypothesis}
\section{Results}
@@ -241,7 +241,7 @@ This section gives an overview over the hypotheses tested during data analysis.
\subsection{Threshold Center Selection}
In this section the goal is to find a $tc$ parameter for the analysis. This is needed to reduce dimensionality of data that has to be looked at and to get results of value. Therefore all parameters except $tc$ will be fixed. The preference aggregation strategy looked at is multiplication because this strategy shows good results across the board when briefly looking at the generated data. The configuration database is used with all possible solutions (which is 148 in total). This results in a bigger visible effect in terms of satisfaction and dissatisfaction change as the recommender has access to all possible configurations and also provides more solid and predictable results. \autoref{fig:Evaluation:tcChange} shows the satisfaction change based on choice of $tc$. Of note is that the maxima of satisfaction change precedes the minima of dissatisfaction change for all group types. Maxima and minima occur at different tc values depending on the group type. Heterogeneous groups peek earliest while homogenous groups only show a peek towards the maximum $tc$ value. Changes in dissatisfaction are minimal even with $tc$ close to its maximum value for homogeneous groups. \autoref{fig:Evaluation:tcCount} shows the amount of group members satisfied and dissatisfied with the dictator's decision. The number of satisfied people decreases with an increasing $tc$ and its downward movement accelerates. The dissatisfaction curve shows a similar trend in reverse. Here the number of dissatisfied group members increases with an increase in $tc$. The curve accelerates its growth analogous to the acceleration of the satisfaction curve. The behaviour of heterogeneous groups and random groups is similar but the curve for heterogeneous groups shows less satisfaction and more dissatisfaction for a given tc. Also both curves have a negative satisfaction change when $tc$ reaches a certain height. 
Homogeneous groups only have satisfied group members for most $tc$ values but they decrease rapidly for values greater than $85$. Dissatisfied group members are at zero for the whole value range of $tc$ except a very slight upward tick at the end that is barely noticeable.
In this section the goal is to find a $tc$ parameter for the analysis. This is needed to reduce dimensionality of data that has to be looked at and to get results of value. Therefore, all parameters except $tc$ will be fixed. The preference aggregation strategy looked at is multiplication because this strategy shows good results across the board when briefly looking at the generated data. The configuration database is used with all possible solutions (which is 148 in total). This results in a bigger visible effect in terms of satisfaction and dissatisfaction change as the recommender has access to all possible configurations and also provides more solid and predictable results. \autoref{fig:Evaluation:tcChange} shows the satisfaction change based on choice of $tc$. Of note is that the maxima of satisfaction change precedes the minima of dissatisfaction change for all group types. Maxima and minima occur at different tc values depending on the group type. Heterogeneous groups peek earliest while homogenous groups only show a peek towards the maximum $tc$ value. Changes in dissatisfaction are minimal even with $tc$ close to its maximum value for homogeneous groups. \autoref{fig:Evaluation:tcCount} shows the amount of group members satisfied and dissatisfied with the dictator's decision. The number of satisfied people decreases with an increasing $tc$ and its downward movement accelerates. The dissatisfaction curve shows a similar trend in reverse. Here the number of dissatisfied group members increases with an increase in $tc$. The curve accelerates its growth analogous to the acceleration of the satisfaction curve. The behaviour of heterogeneous groups and random groups is similar but the curve for heterogeneous groups shows less satisfaction and more dissatisfaction for a given tc. Also both curves have a negative satisfaction change when $tc$ reaches a certain height. 
For most $tc$ values, homogeneous groups contain only satisfied group members, but the number of satisfied members decreases rapidly for $tc$ values greater than $85\%$. The number of dissatisfied group members is zero over the whole range of $tc$ except for a barely noticeable uptick at the end.
\begin{figure}
\centering
@@ -250,7 +250,7 @@ In this section the goal is to find a $tc$ parameter for the analysis. This is n
\label{fig:Evaluation:tcChange}
\end{figure}
\autoref{hyp:Evaluation:MaximumMinimum} states that the highest satisfaction change is expected at places where the overall satisfaction with the dictator's decision is close to two. The data does not support this hypothesis. The peaks in satisfaction change occur when the values are equal to $2.81$, $2.51$ and $3$ (heterogeneous, random, homogeneous). Most likely this happens because, at lower satisfaction numbers with the dictator's decision, the threshold for satisfaction is set too high, which causes fewer group members to be classified as satisfied with the group compromise. Moreover, the valleys of dissatisfaction change are also not at the expected value of \textit{two}. They are instead at $1.19$, $1.49$ and $0.04$ (heterogeneous, random, homogeneous), which is lower than expected. However, the data from homogeneous groups seems to be cut off. Therefore, a judgement for homogeneous groups is difficult, and with slightly less heterogeneous groups this graph should show bigger effects.
\begin{figure}
\centering
@@ -263,7 +263,7 @@ The predicted trend that a higher $tc$ results in a lower satisfaction and a hig
\autoref{hyp:Evaluation:OnlyOneSatisfied} predicts that the satisfaction with the individual decision eventually reaches one and that no one is satisfied with the group recommender's decision. This means the satisfaction change should reach minus one. \autoref{fig:Evaluation:tcCount} shows a downward trend that comes close to one for heterogeneous and random groups. The trend therefore suggests that the hypothesis holds with regard to heterogeneous and random groups. For homogeneous groups, however, the curve only just drops below $2.8$, which suggests that the hypothesis does not hold for them. Also, satisfaction change in heterogeneous groups comes close to minus one, but this value is reached neither by random groups nor by homogeneous groups. The hypothesis therefore holds true only for heterogeneous groups. A likely reason why it does not seem to hold true for random or homogeneous groups is that the highest $tc$ value still includes multiple configurations, so that a recommended configuration keeps some group members satisfied some of the time. For random groups it is also possible that a group member other than the dictator is satisfied with the group decision.
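As a hedged reading of this hypothesis, assume that $s_{\mathrm{dict}}$ and $s_{\mathrm{rec}}$ denote the average number of group members satisfied with the dictator's and with the group recommender's decision, and that satisfaction change is their difference. These symbols are introduced here for illustration only and are not used elsewhere in this thesis. Under this assumption the predicted limit follows directly:
\[
  \Delta s = s_{\mathrm{rec}} - s_{\mathrm{dict}} = 0 - 1 = -1 .
\]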
During a group decision it is better to make one person fewer dissatisfied than to make one person more satisfied. Therefore, this thesis uses $tc$ values that are closer to the minimum of dissatisfaction change than to the maximum of satisfaction change. Fixing $tc$ in this way is needed because otherwise the analysis would be infeasible due to the parameter space being too large. The minimum for heterogeneous groups is at $tc = 70\%$; therefore, this is the value chosen for the evaluation of the remaining hypotheses. For random groups the minimum of dissatisfaction change can be found at $tc = 85\%$, which is the value used for all following analyses of random groups. For homogeneous groups dissatisfaction change keeps decreasing until the highest possible value of $tc$ is reached. Because of that, $tc = 94\%$ is used for their analysis.
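This selection rule can be sketched as follows. The symbol $\Delta d_{g}(tc)$ is a placeholder introduced only for this illustration. It stands for the dissatisfaction change of group type $g$ at a given $tc$:
\[
  tc^{*}_{g} = \arg\min_{tc} \, \Delta d_{g}(tc) .
\]
With the values reported above this yields $tc^{*} = 70\%$ for heterogeneous groups and $tc^{*} = 85\%$ for random groups. For homogeneous groups no interior minimum is observed, so the boundary value $tc^{*} = 94\%$ is taken.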
\subsection{Recommender Performance Analysis}
@@ -305,11 +305,11 @@ Random groups have less overall satisfaction with $tc = 85\%$ as seen in \autore
\subsection{Discussion}
After description of the data the remaining hypotheses are discussed.
\autoref{hyp:Evaluation:HomogenousMoreSatisfied} states that homogeneous groups have more satisfied members with regard to both the dictator's and the group recommender's decision. \autoref{fig:Evaluation:tcCount} shows that this holds true for the dictator's decision, as in every instance satisfaction in homogeneous groups is higher than that of other groups. However, \autoref{fig:Evaluation:HeteroSatisfaction}, \autoref{fig:Evaluation:HomoSatisfaction} and \autoref{fig:Evaluation:RandomSatisfaction} show that for satisfaction with the recommender's decision this does not hold when looking at the $tc$ values where the recommender performs best for each segment. There, the homogeneous group only reaches the highest number of satisfied members when the recommender has access to all stored configurations. With a decreasing number of stored configurations both random groups and heterogeneous groups achieve a higher satisfaction. This likely happens because of the similarity between group members. A recommender with imperfect knowledge and a reduced configuration database gives results that are not good enough to compete with the dictator, who always finds the perfect individual match, with which the members of homogeneous groups are satisfied as well. It is important to note that, when the same $tc$ values are used, homogeneous groups have a higher number of satisfied people across the board.
\autoref{hyp:Evaluation:HeterogenousBiggerSatisfactionIncrease} states that the increase in satisfaction should be bigger for more heterogeneous groups. However, \autoref{fig:Evaluation:HeteroSatisfaction}, \autoref{fig:Evaluation:HomoSatisfaction} and \autoref{fig:Evaluation:RandomSatisfaction} show that this is not the case. The recommendations for heterogeneous groups indeed cause a larger change in satisfaction compared to homogeneous groups, but random groups show a positive change of even higher magnitude. The decrease in dissatisfaction is also higher among random groups. This possibly happens because the interests of members of random groups are more aligned: their preferences do not diverge as much, which results in compromises that can satisfy more individual members. At the same time, the preferences within the group are still far enough apart to cause sufficient dissatisfaction and neutrality with the dictator's decision.
The data shows that a larger configuration database results in a greater number of satisfied group members than recommendations using a smaller database. For dissatisfaction the inverse is seen: a larger configuration database causes the number of dissatisfied group members to drop compared to a small database. However, in some runs there have been instances of least misery that have seen a slight drop. This can be seen in \autoref{fig:Evaluation:HeteroSatisfaction} when comparing $74$ and $148$ as the number of stored configurations. Why this happens is not entirely clear, but a possible cause is that least misery only takes into account the worst-performing member of the group. Therefore, it is possible that there is a second solution which is slightly worse in terms of its least misery score but actually has a slight advantage in terms of dissatisfaction. This second-best configuration can end up in the second database partition, thereby resulting in less dissatisfaction on average. \autoref{hyp:Evaluation:StoreSizeBetterResults} is therefore supported by the data, but it does not fully hold up when looking at least misery.
\autoref{hyp:Evaluation:AggregationStrategies} states that least misery performs worse than multiplication. For a change in satisfaction this can be seen across the board; for dissatisfaction change, however, this is not true everywhere. \autoref{fig:Evaluation:HeteroSatisfaction} shows that least misery performs better than best average in terms of dissatisfaction reduction. This behaviour possibly occurs because an average metric yields the same results for heavily polarised decisions and for decisions that everyone feels neutral about. Least misery, on the other hand, takes only the group member least satisfied with the decision into account and therefore performs better here. However, in other cases it performs visibly worse. Also of note is that multiplication performs best across the board. This supports the findings by \citeauthor{Masthoff2015} \cite[p.~755f]{Masthoff2015} and also shows that the satisfaction model produces results similar to those of online evaluations.
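The difference between the strategies can be illustrated with a small worked example. The notation is an assumption made for this sketch: $r_u(c)$ denotes the rating of group member $u$ for configuration $c$, and $G$ is the group. The definitions follow the usual formulation of these strategies and may differ in detail from the scoring used in the prototype.
\[
  \mathrm{avg}(c) = \frac{1}{|G|} \sum_{u \in G} r_u(c), \qquad
  \mathrm{lm}(c) = \min_{u \in G} r_u(c), \qquad
  \mathrm{mult}(c) = \prod_{u \in G} r_u(c)
\]
For a hypothetical group of two members, consider a polarised configuration $c_1$ with ratings $(1, 9)$ and a neutral configuration $c_2$ with ratings $(5, 5)$. Best average cannot separate them, since $\mathrm{avg}(c_1) = \mathrm{avg}(c_2) = 5$. Least misery prefers the neutral configuration, as $\mathrm{lm}(c_1) = 1 < 5 = \mathrm{lm}(c_2)$, and so does multiplication, with $\mathrm{mult}(c_1) = 9 < 25 = \mathrm{mult}(c_2)$.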