fix language mistakes in evaluation

hannes.kuchelmeister
2020-05-04 16:26:12 +02:00
parent 0ecab29e11
commit aaccd8a2bc

@@ -6,7 +6,7 @@ In this chapter the prototype is evaluated in terms of its functionality and its
\section{Metric}
\label{sec:Evaluation:Metrics}
For the evaluation a metric to evaluate by is needed. The proposed metric for usage is that of satisfaction. This metric has been newly created because existing literature did not provide metrics usable for this thesis. Satisfaction is quantified in this thesis by a threshold metric. A user's preference is used to calculate a rating for each possible solution. Each configuration solution gets an individual score determined by the user's preferences. The score is calculated using the average of a user's preference for each characteristic that is part of the configuration. The result allows that a configuration can be compared to all other configurations and ranked according to the percentage of configurations that it beats for a specific user. The threshold metric consists of two parameters. First the threshold center $tc$ and second the satisfaction distance $sd$. The threshold for a person being satisfied is at $tc + sd$ and of a person being dissatisfied is at $tc - sd$. If a recommendation lies in between these two thresholds the person is classified to neither by satisfied nor be unsatisfied with the solution. For this thesis $sd=5\%$ will be used. This choice is guided by the assumption that people switch from satisfied to unsatisfied rather quickly \todo{find a source psychology}. Therefore the parameter considered in this thesis is the $tc$. An example is the choice of $tc = 60\%$. This results in a person being satisfied with a recommendation if it is better than at least $65\%$ of all possible finished configurations. Moreover, a person is dissatisfied if the recommendation is not better than $55\%$ of possible finished configurations. A recommendation that is better than at least $55\%$ and not better than $65\%$ of possible solutions is considered neutral by the individual.
For the evaluation, a metric is needed. The proposed metric is satisfaction. This metric has been newly created because the existing literature did not provide metrics usable for this thesis. Satisfaction is quantified in this thesis by a threshold metric. A user's preferences are used to calculate a rating for each possible solution: each configuration receives an individual score, calculated as the average of the user's preferences for each characteristic that is part of the configuration. This score allows a configuration to be compared to all other configurations and ranked according to the percentage of configurations that it beats for a specific user. The threshold metric consists of two parameters: the threshold center $tc$ and the satisfaction distance $sd$. The threshold for a person being satisfied is at $tc + sd$ and the threshold for a person being dissatisfied is at $tc - sd$. If a recommendation lies between these two thresholds, the person is classified as neither satisfied nor dissatisfied with the solution. For this thesis $sd=5\%$ will be used. This choice is guided by the assumption that people switch from satisfied to dissatisfied rather quickly \todo{find a source psychology}. Therefore the parameter considered in this thesis is $tc$. An example is the choice of $tc = 60\%$. This results in a person being satisfied with a recommendation if it is better than at least $65\%$ of all possible finished configurations. Moreover, a person is dissatisfied if the recommendation is not better than $55\%$ of possible finished configurations. A recommendation that is better than at least $55\%$ and not better than $65\%$ of possible solutions is considered neutral by the individual.
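As a sketch of the classification described above, let $r_u(c)$ denote the percentage of all possible finished configurations that configuration $c$ beats for user $u$. The symbol $r_u(c)$ and the treatment of values lying exactly on a threshold are assumptions made for this illustration; they are one possible reading of the description, not a definition taken from the prototype:
\[
  \mathit{status}_u(c) =
  \left\{
  \begin{array}{ll}
    \mbox{satisfied}    & \mbox{if } r_u(c) \geq tc + sd, \\
    \mbox{dissatisfied} & \mbox{if } r_u(c) < tc - sd, \\
    \mbox{neutral}      & \mbox{otherwise.}
  \end{array}
  \right.
\]
With $tc = 60\%$ and $sd = 5\%$, a recommendation with $r_u(c) = 70\%$ would be classified as satisfied, one with $r_u(c) = 58\%$ as neutral, and one with $r_u(c) = 50\%$ as dissatisfied.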
\todo{(optional) visualize tc value with an example configuration}
@@ -76,12 +76,12 @@ To evaluate the recommender, a use case is needed. In this thesis, a forestry us
\end{table}
The stakeholders in this use case are a forest owner, an athlete, an environmentalist, and a consumer. The owner sees the forest as an investment and is interested in a high long-term profit. The consumer, on the other hand, is interested in a reasonable wood price, as she uses wood for furniture and for her fireplace. In contrast, the environmentalist is interested in a healthy forest that is not impacted negatively by human activity. Last is the athlete, who is interested in good accessibility of the forest and in there being some plant and animal life.
Every group consists of four people whereby they need to try and find a compromise. Diverging preferences make this difficult. All stakeholders have an interest in getting their will but also all parties need the acceptance of others with the decision. It is not in the interest of a stakeholder to fully have their preferences met, while ending up with protests that arise from the deep dissatisfaction of other groups members.
Every group consists of four people who need to find a compromise. Diverging preferences make this difficult. All stakeholders have an interest in getting their way, but all parties also need the others to accept the decision. It is not in the interest of a stakeholder to have their preferences fully met while ending up with protests that arise from the deep dissatisfaction of other group members.
\section{Data Generation}
\label{sec:Evaluation:GeneratingGroups}
This section describes the data generation process as seen in \autoref{fig:Concept:ConfigurationProcess} that generates data based on the use case int \autoref{sec:Evaluation:UseCase}. Group profiles are used to generate groups of four with different group member types. The exact group composition depends on the group type. For every parameter and group type $1000$ groups are generated and converted to preferences. This amount was chosen as it's the highest number that allows computing times to work over night on the hardware that is available. Also this number is large enough to reduce strong variability between runs. For each group unfinished configurations are generated and its preferences are paired up with the generated unfinished configurations. These pairs later on are used for the evaluation.
This section describes the data generation process shown in \autoref{fig:Evaluation:GeneratingDataProcess}, which generates data based on the use case in \autoref{sec:Evaluation:UseCase}. Group profiles are used to generate groups of four with different group member types. The exact group composition depends on the group type. For every parameter and group type, $1000$ groups are generated and converted to preferences. This number was chosen because it is the highest that still allows the computation to finish overnight on the available hardware. It is also large enough to reduce strong variability between runs. For each group, unfinished configurations are generated, and the group's preferences are paired with these unfinished configurations. These pairs are later used for the evaluation.
\begin{figure}
\centering
@@ -184,13 +184,13 @@ Another important component of the evaluation is the influence of stored finishe
\section{Hypotheses}
\label{sec:Evaluation:Hypotheses}
This section gives an overview over the hypothesis tested during data analysis. Each hypothesis is followed by an explanation to why the hypothesis was posed. In later sections the truthfulness of the hypothesis is tested. This allows to test if expectations about the behaviour of the recommender are true or false.
This section gives an overview of the hypotheses tested during data analysis. Each hypothesis is followed by an explanation of why it was posed. In later sections each hypothesis is tested against the data. This makes it possible to check whether expectations about the behaviour of the recommender hold.
\begin{hypothesis}
\begin{itshape}
\label{hyp:Evaluation:MaximumMinimum} Highest improvements with group recommendation are when the amount of people satisfied with the dictator's decision is slightly lower than two and the highest reduction in dissatisfied group members can be seen at around two group members dissatisfied respectively.
	\label{hyp:Evaluation:MaximumMinimum} The highest improvements from group recommendation occur when the number of people satisfied with the dictator's decision is slightly lower than two, and the highest reduction in dissatisfied group members occurs at around two dissatisfied group members.
\end{itshape} \medskip \\*
This stems from the assumption that in a real situation a group of four with having a few less than two satisfied members on average (with a dictator's decision) has enough room for improvement so that potentially three group members can be satisfied after the use of the recommender. Meaning that at least one more person is satisfied with the compromise. Potentially in some groups it might even be possible to then lift the last person from dissatisfaction towards a neutral attitude. A higher base satisfaction is assumed to reduce the possibility to make an additional group member satisfied.
	This stems from the assumption that in a real situation a group of four with slightly fewer than two satisfied members on average (with a dictator's decision) has enough room for improvement that potentially three group members can be satisfied after the use of the recommender, meaning that at least one more person is satisfied with the compromise. In some groups it might even be possible to then lift the last person from dissatisfaction towards a neutral attitude. A higher base satisfaction is assumed to reduce the possibility of making an additional group member satisfied.
\end{hypothesis}
@@ -198,7 +198,7 @@ This section gives an overview over the hypothesis tested during data analysis.
\begin{itshape}
	\label{hyp:Evaluation:HigherTcLessSatisfied} A higher $tc$ value results in fewer satisfied people and more unsatisfied people with regard to the dictator's decision.
\end{itshape} \medskip \\*
A higher $tc$ value causes a person to be unsatisfied with a higher amount of configurations. Also it causes a person to be satisfied with less configurations. Therefore recommending a random configuration causes the chance of making an individual satisfied sink while increasing the chance of that person being unsatisfied. Already the change in probability leads to the assumption that this should be seen with non random recommendations too.
	A higher $tc$ value causes a person to be unsatisfied with a higher number of configurations and satisfied with fewer configurations. Therefore, when a random configuration is recommended, the chance of making an individual satisfied sinks while the chance of that person being unsatisfied increases. This change in probability alone leads to the assumption that the same effect should be seen with non-random recommendations too.
\end{hypothesis}
\begin{hypothesis}
@@ -212,12 +212,12 @@ This section gives an overview over the hypothesis tested during data analysis.
\begin{itshape}
\label{hyp:Evaluation:HomogenousMoreSatisfied} Homogeneous groups have more satisfied members with the recommender's decision but also with the dictator's decision compared to heterogeneous groups.
\end{itshape} \medskip \\*
As the interest in homogenous groups are more aligned there is an expectation that the overall satisfaction levels for more homogenous groups is higher. If the base level is already higher it is likely that even just a slight increase lifts recommendations for homogenous groups to satisfaction levels not reachable by heterogeneous groups.
	As the interests in homogeneous groups are more aligned, the overall satisfaction levels for more homogeneous groups are expected to be higher. If the base level is already higher, it is likely that even a slight increase lifts recommendations for homogeneous groups to satisfaction levels not reachable by heterogeneous groups.
\end{hypothesis}
\begin{hypothesis}
\begin{itshape}
\label{hyp:Evaluation:HeterogenousBiggerSatisfactionIncrease} More heterogeneous groups see a bigger satisfaction increase than less heterogeneous groups when switching from the decision of a dictator to a decision made by the recommender.
\label{hyp:Evaluation:HeterogenousBiggerSatisfactionIncrease} More heterogeneous groups see a bigger satisfaction increase than less heterogeneous groups when switching from a decision of a dictator to a decision made by the recommender.
\end{itshape} \medskip \\*
	The assumption is made that in more heterogeneous groups the satisfaction with the dictator's decision is lower. Therefore there is a higher possible increase in satisfaction. A homogeneous group that already satisfies all group members with the dictator's decision cannot see an increase in satisfaction. The assumption is therefore that with a higher number of people who are dissatisfied or not satisfied with the dictator's decision, more people can be lifted into satisfaction and the increase will be bigger. However, a group with contradicting interests might not be able to reach high satisfaction levels.
\end{hypothesis}
@@ -241,7 +241,7 @@ This section gives an overview over the hypothesis tested during data analysis.
\subsection{Threshold Center Selection}
In this section the goal is to find a $tc$ parameter for the analysis. This is needed to reduce dimensionality of data that has to be looked at and to get results of value. Therefore all parameters except $tc$ will be fixed. The preference aggregation strategy looked at is multiplication because this strategy shows good results across the board when briefly looking at the generated data. The configuration database is used with all possible solutions (which is 148 in total). This results in a bigger visible effect in terms of satisfaction and dissatisfaction change as the recommender has access to all possible configurations and also provides more solid and predictable results. \autoref{fig:Evaluation:tcChange} shows the satisfaction change based on choice of $tc$. Of note is that the maxima of satisfaction change precedes the minima of dissatisfaction change for all group types. Maxima and minima occur at different tc values depending on the group type. Heterogeneous groups peek earliest while homogenous groups only show a peek towards the maximum $tc$ value. Changes in dissatisfaction are minimal even with $tc$ close to its maximum value for homogeneous groups. \autoref{fig:Evaluation:tcCount} shows the amount of group members satisfied and dissatisfied with the dictator's decision. The number of satisfied people decreases with an increasing $tc$ and its downward movement accelerates. The dissatisfaction curve shows a similar trend in reverse. Here the number of dissatisfied group members increases with an increase in $tc$. The curve accelerates its growth analogous to the acceleration of the satisfaction curve. The behaviour of heterogeneous groups and random groups is similar but the curve for heterogeneous groups show less satisfaction and more dissatisfaction for a given tc. Also both curves have a negative satisfaction change when $tc$ reaches a certain height. 
Homogeneous groups only have satisfied group members for most $tc$ values but they decrease rapidly for values greater $85$. Dissatisfied group members are at zero for the whole value range of $tc$ except a very slight upward tick at the end that is barely noticeable.
In this section the goal is to find a $tc$ parameter for the analysis. This is needed to reduce the dimensionality of the data that has to be looked at and to get meaningful results. Therefore all parameters except $tc$ are fixed. The preference aggregation strategy looked at is multiplication because this strategy shows good results across the board when briefly looking at the generated data. The configuration database is used with all possible solutions (148 in total). This results in a more visible effect in terms of satisfaction and dissatisfaction change, as the recommender has access to all possible configurations, and also provides more solid and predictable results. \autoref{fig:Evaluation:tcChange} shows the satisfaction change based on the choice of $tc$. Of note is that the maximum of satisfaction change precedes the minimum of dissatisfaction change for all group types. Maxima and minima occur at different $tc$ values depending on the group type. Heterogeneous groups peak earliest while homogeneous groups only show a peak towards the maximum $tc$ value. Changes in dissatisfaction are minimal for homogeneous groups even with $tc$ close to its maximum value. \autoref{fig:Evaluation:tcCount} shows the number of group members satisfied and dissatisfied with the dictator's decision. The number of satisfied people decreases with an increasing $tc$ and its downward movement accelerates. The dissatisfaction curve shows a similar trend in reverse: the number of dissatisfied group members increases with an increase in $tc$, and the curve accelerates its growth analogously to the acceleration of the satisfaction curve. The behaviour of heterogeneous groups and random groups is similar, but the curve for heterogeneous groups shows less satisfaction and more dissatisfaction for a given $tc$. Also, both curves have a negative satisfaction change when $tc$ reaches a certain value.
Homogeneous groups only have satisfied group members for most $tc$ values but they decrease rapidly for values greater than $85$. Dissatisfied group members are at zero for the whole value range of $tc$ except a very slight upward tick at the end that is barely noticeable.
\begin{figure}
\centering
@@ -305,13 +305,12 @@ Random groups have less overall satisfaction with $tc = 85\%$ as seen in \autore
\subsection{Discussion}
After the description of the data, the remaining hypotheses are discussed.
\autoref{hyp:Evaluation:HomogenousMoreSatisfied} states that homogenous groups have more satisfied member's with regards to the dictator's and the group recommender's decision. \autoref{fig:Evaluation:tcCount} shows that this holds true for dictator's decision as for every instance satisfaction in homogeneous groups is higher than that of other groups. However, \autoref{fig:Evaluation:HeteroSatisfaction}, \autoref{fig:Evaluation:HomoSatisfaction} and \autoref{fig:Evaluation:RandomSatisfaction} show that for satisfaction with the recommender's decision this does not hold when looking at $tc$ values where the recommender performs best for each segment. In those places the homogenous group only reaches the highest amount of satisfaction when the recommender has access to all stored configurations. With a decreasing number of stored configurations both random groups and heterogeneous groups achieve a higher satisfaction. This likely happens because of similarity between group members. A recommender with imperfect knowledge and a, in size, reduced configuration database gives results that are not good enough and cannot compete with the dictator who always finds the perfect individual match that group members of homogeneous groups are satisfied with. It is important to note, when the same $tc$ values are used homogenous groups have a higher amount of satisfied people across the board.
\autoref{hyp:Evaluation:HomogenousMoreSatisfied} states that homogeneous groups have more satisfied members with regard to the dictator's and the group recommender's decision. \autoref{fig:Evaluation:tcCount} shows that this holds true for the dictator's decision, as in every instance satisfaction in homogeneous groups is higher than that of other groups. However, \autoref{fig:Evaluation:HeteroSatisfaction}, \autoref{fig:Evaluation:HomoSatisfaction} and \autoref{fig:Evaluation:RandomSatisfaction} show that for satisfaction with the recommender's decision this does not hold when looking at the $tc$ values where the recommender performs best for each segment. There, the homogeneous group only reaches the highest amount of satisfaction when the recommender has access to all stored configurations. With a decreasing number of stored configurations, both random groups and heterogeneous groups achieve a higher satisfaction. This likely happens because of the similarity between group members. A recommender with imperfect knowledge and a smaller configuration database gives results that are not good enough to compete with the dictator, who always finds the perfect individual match, which the members of homogeneous groups are satisfied with. It is important to note that when the same $tc$ values are used, homogeneous groups have a higher number of satisfied people across the board.
\autoref{hyp:Evaluation:HeterogenousBiggerSatisfactionIncrease} states that the increase in satisfaction should be bigger for more heterogeneous groups. However, \autoref{fig:Evaluation:HeteroSatisfaction}, \autoref{fig:Evaluation:HomoSatisfaction} and \autoref{fig:Evaluation:RandomSatisfaction} show this to be not true. The recommendations for heterogeneous groups indeed cause a larger change in satisfaction compared to homogeneous groups but random groups cause a positive change of higher magnitude. Also the decrease in dissatisfaction is higher among random groups. This possibly happens due to random groups having interest that are more aligned and their preferences among group membes therefore do not diverge as much, therefore resulting in compromises for the group that can satisfy more individual members. Also the group preferences are still far apart enough to cause enough dissatisfaction and neutrality with the dictator's decision.
\autoref{hyp:Evaluation:HeterogenousBiggerSatisfactionIncrease} states that the increase in satisfaction should be bigger for more heterogeneous groups. However, \autoref{fig:Evaluation:HeteroSatisfaction}, \autoref{fig:Evaluation:HomoSatisfaction} and \autoref{fig:Evaluation:RandomSatisfaction} show this not to be true. The recommendations for heterogeneous groups indeed cause a larger change in satisfaction compared to homogeneous groups, but random groups see a positive change of higher magnitude. Also, the decrease in dissatisfaction is higher among random groups. This possibly happens because random groups have interests that are more aligned, so the preferences of their group members do not diverge as much, resulting in compromises that can satisfy more individual members. At the same time, the group preferences are still far enough apart to cause enough dissatisfaction and neutrality with the dictator's decision.
The data shows that a larger configuration database causes the number of satisfied group members to be greater than with recommendations using a smaller database. With dissatisfaction the same is seen in inverse: a larger configuration database causes the number of dissatisfied group members to drop compared to a small database. However, in some runs there have been instances of least misery that have seen a slight drop. This can be seen in \autoref{fig:Evaluation:HeteroSatisfaction} when comparing $74$ and $148$ as number of stored configurations. Why this happens is not entirely clear, but a possible cause is that least misery only takes into account the worst performing member of the group. Therefore it is possible that there is a second, slightly worse solution in terms of least misery score that actually has a slight advantage in terms of dissatisfaction. Such a second-best configuration can end up in the second database partition, therefore resulting in less dissatisfaction on average. \autoref{hyp:Evaluation:StoreSizeBetterResults} is therefore supported by the data, but it does not fully hold up when looking at least misery.
\autoref{hyp:Evaluation:AggregationStrategies} states least misery performs worse than multiplication. For a change in satisfaction this can be seen across the board however for dissatisfaction change this is not true everywhere. \autoref{fig:Evaluation:HeteroSatisfaction} shows that least misery performs better than best average in terms of dissatisfaction reduction. This behaviour possibly occurs because an average metric yields the same results for heavily polarised decisions and decisions that everyone feels neutral about. Least misery on the other hand takes only the group member least satisfied with the decision into account therefore this metric performs better. However in other cases it performs visibly worse. Also of note is multiplication performs best across the board. This supports the findings by \citeauthor{Masthoff2015} \cite[~p. 755f]{Masthoff2015} and also shows that the satisfaction model does show some similar results to online evaluations.
\autoref{hyp:Evaluation:AggregationStrategies} states that least misery performs worse than multiplication. For the change in satisfaction this can be seen across the board; for the change in dissatisfaction, however, this is not true everywhere. \autoref{fig:Evaluation:HeteroSatisfaction} shows that least misery performs better than best average in terms of dissatisfaction reduction. This behaviour possibly occurs because an average metric yields the same results for heavily polarised decisions and decisions that everyone feels neutral about. Least misery, on the other hand, takes only the group member least satisfied with the decision into account and therefore performs better here. However, in other cases it performs visibly worse. Also of note is that multiplication performs best across the board. This supports the findings by \citeauthor{Masthoff2015} \cite[~p. 755f]{Masthoff2015} and also shows that the satisfaction model shows some results similar to online evaluations.
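The contrast between the aggregation strategies can be illustrated with a small worked example. The formulas below follow the common textbook formulation of these strategies and may differ in detail from the prototype's implementation; the symbols $p_i$ and the member ratings are hypothetical values chosen for illustration, not data from the evaluation. For member ratings $p_1, \dots, p_4 \in [0, 1]$ of a configuration:
\[
  \mathit{avg} = \frac{1}{4} \sum_{i=1}^{4} p_i, \qquad
  \mathit{lm} = \min_{i} p_i, \qquad
  \mathit{mult} = \prod_{i=1}^{4} p_i .
\]
A polarised configuration with ratings $(0.9, 0.9, 0.1, 0.1)$ and a neutral configuration with ratings $(0.5, 0.5, 0.5, 0.5)$ both have $\mathit{avg} = 0.5$, so best average cannot tell them apart. Least misery yields $0.1$ versus $0.5$ and multiplication yields $0.0081$ versus $0.0625$, so both of these strategies prefer the neutral configuration.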
To go back to in \autoref{sec:Evaluation:Questions} posed evaluation questions this section has shown that for random and heterogeneous groups the recommender performs better than a dictator. The average satisfaction depends on the chosen parameters but for the chosen value range average satisfaction with the recommender decision lies above two and can reach close to three satisfied group members for a high number of stored configurations and for some group types. The amount of stored finished configurations plays an important role in performance but with a fraction of stored configurations the recommender still yields good results. This shows that the recommender provides useful decision support for helping in group decisions. It provides a solid basis for groups and can help their group decision. Most decisions the recommender makes improve group satisfaction which shows that the recommender is able to be used to improve group decisions.
Returning to the evaluation questions posed in \autoref{sec:Evaluation:Questions}, this section has shown that for random and heterogeneous groups the recommender performs better than a dictator. The average satisfaction depends on the chosen parameters, but for the chosen value range the average satisfaction with the recommender's decision lies above two and can reach close to three satisfied group members for a high number of stored configurations and for some group types. The number of stored finished configurations plays an important role in the recommender's performance, but with only a fraction of the configurations stored the recommender still yields good results. This shows that the recommender provides useful decision support for groups and a solid basis for their decisions. Most of the recommender's decisions improve group satisfaction, which shows that the recommender can be used to improve group decisions.