improve mainly recommender performance analysis and threshold center selection

2024-09-04 01:11:00 +02:00 · 2020-04-07 17:12:45 +02:00
parent 7da51a2353
commit 37c8489f53
1 changed files with 23 additions and 28 deletions
--- a/30_Thesis/sections/60_evaluation.tex
+++ b/30_Thesis/sections/60_evaluation.tex
@@ -70,15 +70,12 @@ To evaluate the recommender, a use case is needed. In this thesis, a forestry us
 \end{table}

 The stakeholders in this use case are a forest owner, an athlete, an environmentalist, and a consumer. The owner sees the forest as an investment and is interested in a high long-term profit. The consumer, on the other hand, is interested in a reasonable wood price, as she uses wood for furniture and for her fireplace. In contrast, the environmentalist is interested in a healthy forest that is not negatively impacted by human activity. Last is the athlete, who is interested in good accessibility of the forest and in the presence of some plant and animal life.
-
-\todo[inline]{Close the chapter with: the preferences here contradict each other. And: what should the stakeholders decide now? What situation are they in? How is a group composed? Of 4 people, one of each type?}
+Every group consists of four people who need to find a compromise. Diverging preferences make this difficult. Every stakeholder wants to get their way, but all parties also need the others to accept the decision. No stakeholder wants a decision that goes exactly their way but ends in protests arising from the deep dissatisfaction of other group members.

 \section{Data Generation}
 \label{sec:Evaluation:GeneratingGroups}

-\todo[inline]{This chapter is not yet consistent for me. Elements are missing from the figure (generating preferences \& groups), and the text does not describe the pairing of preferences(?) and configurations. And: what is actually paired: preferences or groups?}
-
-The whole process explained in \todo[inline]{create a better transition here: to evaluate the use case, data was generated based on the previous information. The visualization...} this section is visualized in \autoref{fig:Evaluation:GeneratingDataProcess}.
+This section describes the data generation process shown in \autoref{fig:Concept:ConfigurationProcess}, which generates data based on the use case in \autoref{sec:Evaluation:UseCase}. Group profiles are used to generate groups of four with different group member types. The exact group composition depends on the group type. For every parameter and group type, $1000$ groups are generated and converted to preferences. This number was chosen because it is the largest for which the computation finishes overnight on the available hardware. It is also large enough to reduce strong variability between runs. For each group, unfinished configurations are generated, and the group's preferences are paired with them. These pairs are later used for the evaluation.

 \begin{figure}
    \centering
@@ -89,11 +86,11 @@ The whole process explained in \todo[inline]{hier einen besseren Übergang schaf

 \subsection{Unfinished Configurations Generation}

-Unfinished configurations are generated using all finished configurations and taking a subset of the contained characteristics. This way all generated configurations will be valid and lead to valid solutions. For the results that are presented in this chapter around $\frac{1}{7} \approx 15\%$ \todo{why this number} of characteristics is kept.
+Unfinished configurations are generated by taking each finished configuration and keeping only a subset of its characteristics. This way all generated configurations are valid and lead to valid solutions. For the results presented in this chapter, around $\frac{1}{7} \approx 14\%$ of the characteristics are kept. This value is chosen so that the existing configuration takes effect without skewing the results through the penalty function severely limiting the possible options.
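The subset step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the thesis implementation: the function name `make_unfinished`, the representation of a configuration as a set of characteristic names, and the rounding rule are all hypothetical.

```python
import random


def make_unfinished(finished, keep_fraction=1 / 7, rng=None):
    """Keep a random subset of a finished configuration's characteristics.

    `finished` is assumed to be a set of characteristic names. Because the
    result is a subset of a valid finished configuration, it can always be
    completed to a valid solution.
    """
    rng = rng or random.Random()
    characteristics = sorted(finished)  # fixed order for reproducible sampling
    # Assumption: keep at least one characteristic, round to the nearest count.
    keep = max(1, round(len(characteristics) * keep_fraction))
    return frozenset(rng.sample(characteristics, keep))
```

With a finished configuration of 14 characteristics and the default fraction, two characteristics are kept.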

 \subsection{Preference Generation}

-For the forest use case, the idea is that there are multiple types of user profiles. Each group profile is represented by a neutral, negative or positive attitude towards a characteristic. During data generation the attitude is converted to a preference \todo{perhaps mention here again that you use preferences between 0 and 1; currently this is only stated in the figure} using a normal distribution. \autoref{fig:Evaluation:DataGeneration} shows how the user profile can be converted to preferences. The actual group member profiles are shown in \autoref{tab:Evaluation:GroupMemberMappings}.
+For the forest use case, the idea is that there are multiple types of user profiles. Each profile is represented by a neutral, negative, or positive attitude towards each characteristic. During data generation the attitude is converted to a preference using a normal distribution. Each preference lies in the interval $[0,1]$, where zero can be seen as the worst possible option and one as the best possible option. \autoref{fig:Evaluation:PreferenceDistribution} shows how the user profile can be converted to preferences. The actual group member profiles are shown in \autoref{tab:Evaluation:GroupMemberMappings}.
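A minimal Python sketch of this conversion follows. The mean per attitude, the standard deviation, and the clipping of samples to $[0,1]$ are assumptions made for illustration; the source only states that a normal distribution is used and that preferences lie in $[0,1]$.

```python
import random

# Hypothetical means per attitude; the thesis values are not given here.
ATTITUDE_MEAN = {"negative": 0.2, "neutral": 0.5, "positive": 0.8}


def sample_preference(attitude, sigma=0.1, rng=None):
    """Draw one preference in [0, 1] for a given attitude.

    The value is sampled from a normal distribution centred on the
    attitude's mean and clipped to the interval [0, 1] (assumed handling
    of out-of-range samples).
    """
    rng = rng or random.Random()
    value = rng.gauss(ATTITUDE_MEAN[attitude], sigma)
    return min(1.0, max(0.0, value))
```

Clipping is only one way to keep samples inside the interval; resampling until a value falls in range would be an alternative that avoids probability mass at the borders.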

 \pgfplotsset{height=5cm,width=\textwidth,compat=1.8}
 \pgfmathdeclarefunction{gauss}{2}{%
@@ -120,17 +117,9 @@ For the forest use case, the idea is that there are multiple types of user profi
        \end{axis}
        \end{tikzpicture}
 \caption{Distribution of preferences for a user type.}
-\label{fig:Evaluation:DataGeneration}
+\label{fig:Evaluation:PreferenceDistribution}
 \end{figure}

-These user profiles can be used to generate rather homogenous groups but also to create groups that have interests that are more conflicting. The following group types are generated: \todo{what exactly do these groups look like? How many people do they consist of?}
-
-\begin{itemize}
-    \item random groups (preferences are uniformly random)
-    \item heterogeneous groups (people adhere to one preference profile like forest owner, athlete, consumer, environmentalist)
-    \item homogeneous groups (only one preference profile for all group members which in this evaluation is the forest owner)
-\end{itemize}
-\todo[inline]{why these distinctions}

 \begin{table}
    \begin{center}
@@ -171,9 +160,19 @@ These user profiles can be used to generate rather homogenous groups but also to
    \end{center}
 \end{table}

+These user profiles can be used to generate rather homogeneous groups but also to create groups whose interests conflict more strongly. The following group types, with four members each, are generated:
+
+\begin{itemize}
+    \item random groups (preferences are uniformly random)
+    \item heterogeneous groups (each member adheres to one preference profile, such as forest owner, athlete, consumer, or environmentalist)
+    \item homogeneous groups (only one preference profile for all group members which in this evaluation is the forest owner)
+\end{itemize}
+
+The natural group type for the use case is a heterogeneous group, but to widen the evaluation and to see how the recommender performs with different types of groups, the two other group types are evaluated, too. Therefore, more general statements about the recommender's performance can be made.
+
 \subsection{The Effect of Stored Finished Configurations}

-When evaluating a subset of stored finished configurations it is important to avoid outliers. This is the reason why a process inspired by \emph{cross validation} \todo{add reference} is used. The configuration database is randomly ordered and sliced into sub databases of the needed size. As an example, if the evaluated stored data size is 20, a configuration database containing 100 configurations is split into five sub databases of size 20. Now the evaluation is done on each of the sub databases and as a result the average is taken. This avoid that randomly a subset can be picked which either performs much better than most other possible combinations of databases or which performs much worse. This way the data is more aligned to the \emph{expected value} \todo{reference}.
+Another important component of the evaluation is the influence of stored finished configurations. When evaluating a subset of stored finished configurations, it is important to avoid outliers. This is the reason why a process inspired by \emph{cross validation} \todo{add reference} is used. The configuration database is randomly ordered and sliced into sub databases of the needed size. As an example, if the evaluated stored data size is 20, a configuration database containing 100 configurations is split into five sub databases of size 20. The evaluation is then run on each of the sub databases, and the average is taken as the result. This avoids randomly picking a subset that performs much better or much worse than most other possible sub databases. This way the data is more closely aligned with the \emph{expected value} \todo{reference}.
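The slicing and averaging can be sketched as follows in Python. This is an illustrative sketch, not the thesis code: the function names, the `evaluate` callback, and the choice to drop leftover configurations that do not fill a whole sub database are assumptions.

```python
import random
from statistics import mean


def split_database(configurations, size, rng=None):
    """Shuffle the database and slice it into disjoint sub databases.

    Leftover configurations that do not fill a complete sub database are
    dropped (an assumption for this sketch).
    """
    rng = rng or random.Random()
    shuffled = list(configurations)
    rng.shuffle(shuffled)
    n_full = len(shuffled) // size
    return [shuffled[i * size:(i + 1) * size] for i in range(n_full)]


def averaged_evaluation(configurations, size, evaluate, rng=None):
    """Run `evaluate` on every sub database and return the mean result."""
    subs = split_database(configurations, size, rng)
    return mean(evaluate(sub) for sub in subs)
```

For a database of 100 configurations and a stored data size of 20, `split_database` yields five disjoint sub databases of 20 configurations each, matching the example in the text.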


 \section{Hypotheses}
@@ -242,10 +241,9 @@ This section gives an overview over the hypothesis tested during data analysis.
 \section{Results}
 \label{sec:Evaluation:Findings}

-\subsection{Threshold Center}
+\subsection{Threshold Center Selection}

-To get an understanding of the data \todo{be more specific!
the point here is to find a sensible value for tc for the further analyses} all parameters except the $tc$ will be fixed. The preference aggregation strategy looked at is multiplication \todo{why?}. The configuration database is used with all possible solutions (which is 148 in total). This results in a bigger visible \todo{of what?} effect as the recommender has access to all possible configurations. \autoref{fig:Evaluation:tcChange} shows the satisfaction change based on choice of $tc$. Of note is that the maxima of satisfaction change precedes the minima of dissatisfaction change for all group types. Maxima and minima occur at different tc values depending on the group type. Heterogeneous groups peek earliest while homogenous groups only show a peek towards the maximum $tc$ value. Changes in dissatisfaction are minimal even with $tc$ close to its maximum value for homogeneous groups. \autoref{fig:Evaluation:tcCount} shows the amount of group members satisfied and dissatisfied with the dictator's decision. The number of satisfied people decreases with an increasing $tc$ and its downward movement accelerates. The dissatisfaction curve shows a similar trend in reverse. Here the number of dissatisfied group members increases with an increase in $tc$. The curve accelerates its growth analogous to the acceleration of the satisfaction curve. The behaviour of heterogeneous groups and random groups is similar but the curve for heterogeneous groups show less satisfaction and more dissatisfaction for a given tc. Also both curves have a negative satisfaction change when $tc$ reaches a certain height. Homogeneous groups only have happy group members for most $tc$ values but they decrease rapidly \todo[]{why} for values greater $85$. Dissatisfied group members are at zero for the whole value range of $tc$ except a very slight upward tick at the end that is barely noticeable.
+In this section the goal is to find a suitable $tc$ value for the remaining analysis. This is needed to reduce the dimensionality of the data that has to be examined and to obtain meaningful results. Therefore, all parameters except $tc$ are fixed. The preference aggregation strategy considered is multiplication, because this strategy shows good results across the board in a first brief look at the generated data. The configuration database is used with all possible solutions (148 in total). As the recommender then has access to all possible configurations, this results in a larger visible effect in terms of satisfaction and dissatisfaction change and also provides more solid and predictable results. \autoref{fig:Evaluation:tcChange} shows the satisfaction change depending on the choice of $tc$. Of note is that the maximum of satisfaction change precedes the minimum of dissatisfaction change for all group types. Maxima and minima occur at different $tc$ values depending on the group type. Heterogeneous groups peak earliest, while homogeneous groups only show a peak towards the maximum $tc$ value. For homogeneous groups, changes in dissatisfaction are minimal even with $tc$ close to its maximum value. \autoref{fig:Evaluation:tcCount} shows the number of group members satisfied and dissatisfied with the dictator's decision. The number of satisfied people decreases with increasing $tc$, and this decrease accelerates. The dissatisfaction curve shows a similar trend in reverse: the number of dissatisfied group members increases with $tc$, and its growth accelerates analogously. The behaviour of heterogeneous and random groups is similar, but the curve for heterogeneous groups shows less satisfaction and more dissatisfaction for a given $tc$. Both group types also show a negative satisfaction change once $tc$ is high enough. Homogeneous groups have only satisfied group members for most $tc$ values, but their number decreases rapidly for values greater than $85$. The number of dissatisfied group members is zero over the whole value range of $tc$, except for a barely noticeable upward tick at the end.

 \begin{figure}
    \centering
@@ -254,7 +252,7 @@ hier geht es darum, einen sinnvollen Wert für tc für die weiteren Auswertungen
    \label{fig:Evaluation:tcChange}
 \end{figure}

-\autoref{hyp:Evaluation:MaximumMinimum} states that the highest satisfaction change is expected at places where the overall satisfaction with the dictator's decision is close to two. However, the data shows a slightly different result. This hypothesis does not hold true. When looking at the data we see peeks in satisfaction change when values are equal to $2.81, 2.51$ and $3$ (heterogeneous, random, homogenous). Therefore the expectation does not hold up \todo[inline]{this paragraph should end with a short discussion of why the hypothesis, contrary to expectation, does not hold}. Moreover, valleys for dissatisfaction change are also not at the expected value of \textit{two}. They are instead at $1.19, 1.49, 0.04$ (heterogeneous, random, homogenous). Here the valleys are lower than expected. However, the data from homogenous groups seems to be cut of. Therefore, it is not possible to say if there would be a potentially bigger decrease with a use case with more possible solutions \todo[inline]{how is the number of stored solutions related to this hypothesis?}.
+\autoref{hyp:Evaluation:MaximumMinimum} states that the highest satisfaction change is expected where the overall satisfaction with the dictator's decision is close to two. However, the data shows a slightly different result, and the hypothesis does not hold true. The peaks in satisfaction change occur where the satisfaction with the dictator's decision equals $2.81$, $2.51$, and $3$ (heterogeneous, random, homogeneous). Most likely this happens because, at lower satisfaction with the dictator's decision, the threshold for satisfaction is set too high, which causes fewer group members to be classified as satisfied with the group compromise. Moreover, the valleys of dissatisfaction change are also not at the expected value of \textit{two}. They are instead at $1.19$, $1.49$, and $0.04$ (heterogeneous, random, homogeneous), which is lower than expected. However, the data from homogeneous groups seems to be cut off. Therefore, a judgement for homogeneous groups is difficult, and with slightly less heterogeneous groups this graph should show bigger effects.

 \begin{figure}
    \centering
@@ -263,14 +261,11 @@ hier geht es darum, einen sinnvollen Wert für tc für die weiteren Auswertungen
    \label{fig:Evaluation:tcCount}
 \end{figure}

-The predicted trend that a higher $tc$ results in a lower satisfaction and a higher dissatisfaction with the dictator's decision, as predicted by \autoref{hyp:Evaluation:HigherTcLessSatisfied}, can be clearly seen in \autoref{fig:Evaluation:tcCount} and has been described in this section already \todo[inline]{discussion: what does this mean for the hypothesis or your evaluation?
The behaviour is as expected and also as it would probably be in a real setting}.
+The trend predicted by \autoref{hyp:Evaluation:HigherTcLessSatisfied}, that a higher $tc$ results in lower satisfaction and higher dissatisfaction with the dictator's decision, can be clearly seen in \autoref{fig:Evaluation:tcCount} and has already been described in this section. For the evaluation, this means that the behaviour of the recommender is predictable, and it suggests that the metrics used model behaviour expected in reality.

-\autoref{hyp:Evaluation:OnlyOneSatisfied} predicts that the satisfaction with the individual decision eventually reaches one and that no one is satisfied with the group recommender decision. This means the satisfaction change should reach minus one. \autoref{fig:Evaluation:tcCount} shows a downward trend that comes close to one for heterogeneous and random groups. Therefore, the trend suggests that the hypothesis holds with regard to heterogeneous and random groups but as the drop for homogenous groups just reaches below $2.8$ suggesting that the hypothesis does not hold for homogenous groups. Also, satisfaction change in heterogeneous groups reaches close to minus one but this value is neither reached by random groups, nor by homogenous groups. The hypothesis therefore should not be seen as confirmed in that regard as well and further investigation is needed \todo[inline]{better: the hypothesis holds for heterogeneous groups. Here again: what does that mean?
Discussion of why the hypothesis does not hold for the other two groups}.
+\autoref{hyp:Evaluation:OnlyOneSatisfied} predicts that the satisfaction with the individual decision eventually reaches one and that no one is satisfied with the group recommender decision. This means the satisfaction change should reach minus one. \autoref{fig:Evaluation:tcCount} shows a downward trend that comes close to one for heterogeneous and random groups. The trend therefore suggests that the hypothesis holds for heterogeneous and random groups, but the drop for homogeneous groups only reaches just below $2.8$, suggesting that the hypothesis does not hold for homogeneous groups. Also, the satisfaction change in heterogeneous groups comes close to minus one, but this value is reached neither by random groups nor by homogeneous groups. The hypothesis therefore holds true only for heterogeneous groups. A likely reason why it does not seem to hold true for random or homogeneous groups is that even the highest $tc$ value still includes multiple configurations, so a recommended configuration keeps some group members happy some of the time. For random groups, it is also possible that a group member other than the dictator is satisfied with the group decision.

-During \todo[inline]{here or at the beginning of the analysis, explain again why the values for tc were determined with exactly this chosen parameter setting
--maximum \# config is the most solid, a rough look at the data (eyeballing) showed multiplication to be the best, or similar} a group decision it is better to make one less person dissatisfied opposed to one more person satisfied. Therefore, this thesis uses $tc$ values that are closer to the minima of dissatisfaction change than to the maxima of satisfaction change. The minima for heterogeneous groups is at $tc = 70\%$ therefore this is the chosen value for evaluation of the remaining hypotheses. This is needed because otherwise analysis would be infeasible due to the parameter space being too large. For random groups the minima of dissatisfaction change can be found at $tc = 85\%$ which is the value used for all following analysis of random groups. For homogenous group dissatisfaction change is decreasing until the highest possible value of $tc$ is reached. Because of that $tc = 94\%$ is used for analysis.
+In a group decision it is better to make one person fewer dissatisfied than to make one person more satisfied. Therefore, this thesis uses $tc$ values that are closer to the minimum of dissatisfaction change than to the maximum of satisfaction change. Fixing $tc$ per group type is needed because the analysis would otherwise be infeasible due to the size of the parameter space. The minimum for heterogeneous groups is at $tc = 70\%$, so this value is used for the evaluation of the remaining hypotheses. For random groups the minimum of dissatisfaction change is at $tc = 85\%$, which is the value used for all following analyses of random groups. For homogeneous groups the dissatisfaction change keeps decreasing until the highest possible value of $tc$ is reached. Because of that, $tc = 94\%$ is used for their analysis.
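The selection rule can be illustrated with a short Python sketch. The helper name `select_tc` and the curve values are invented for illustration only; they are not results from the thesis.

```python
def select_tc(dissatisfaction_change):
    """Return the tc value with the lowest dissatisfaction change.

    `dissatisfaction_change` maps a tc value (in percent) to the measured
    change in dissatisfaction; more negative values are better.
    """
    return min(dissatisfaction_change, key=dissatisfaction_change.get)


# Invented illustrative numbers, not measurements from the evaluation.
example_curve = {50: -0.20, 60: -0.45, 70: -0.60, 80: -0.50, 90: -0.10}
chosen_tc = select_tc(example_curve)
```

Applying the rule once per group type gives one fixed $tc$ for each of them, which is what keeps the remaining parameter space small enough to analyse.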

 \subsection{Recommender Performance Analysis}

@@ -297,7 +292,7 @@ During \todo[inline]{hier oder am Anfang der Analyse nochmal eine Erklärung war

 This subsection keeps $tc$ fixed. It describes the satisfaction change and the total number of people satisfied with the recommender's decision depending on the number of stored configurations. For clarity, not all graphs of the data are included. The omitted graphs can be found in the appendix and are referenced in the text.

-\autoref{fig:Evaluation:HeteroSatisfaction} \todo[]{here too: explain in more detail how the graphs are structured and give the reader an aid to interpretation (on the left high values are good, on the right low ones)} shows the relationship between the change in satisfaction and dissatisfaction and the number of stored configurations. There are three graphs each. One for multiplication, one for least misery and one for best average. The graphs for satisfaction are similar to a logarithmic curve. The increase in change of satisfaction decelerates with a higher number of stored configurations. The change in satisfaction is always above zero and a satisfaction increase of more than three quarters of the maximum can already be seen at around 25 stored configurations. Moreover, the curve for multiplication is greater than all other curves for all parameters. Least misery reaches the lowest amount of change across all values. The minimum number of satisfaction change is $0$ for least misery, and $0.1$ for best average and multiplications. The highest number is around $0.3$ for least misery, $0.4$ for best average and $0.5$ for multiplication
+\autoref{fig:Evaluation:HeteroSatisfaction} shows the relationship between satisfaction and dissatisfaction and the number of stored configurations. The left y-axis shows the change in satisfaction compared to a decision made by a dictator. The right y-axis shows the average number of group members. The left figure shows the numbers for satisfaction and the right figure those for dissatisfaction. With regard to change, higher numbers are better in the left figure and lower numbers are better in the right figure. Each figure contains three graphs: one for multiplication, one for least misery, and one for best average. The graphs for satisfaction resemble a logarithmic curve. The increase in satisfaction change decelerates with a higher number of stored configurations. The change in satisfaction is always above zero, and a satisfaction increase of more than three quarters of the maximum can already be seen at around 25 stored configurations. Moreover, the curve for multiplication lies above all other curves for all parameters. Least misery reaches the lowest amount of change across all values. The minimum satisfaction change is $0$ for least misery and $0.1$ for best average and multiplication. The highest value is around $0.3$ for least misery, $0.4$ for best average, and $0.5$ for multiplication.
 When looking at dissatisfaction change, the graphs are all in the negative number range. Multiplication reaches the lowest number and best average the highest. The gap between the three functions is smaller than for the satisfaction increase. Overall the curves are also flatter, meaning the change with 25 stored configurations already reaches close to five sixths of the minimum value. The highest dissatisfaction change is $-0.4$ for all strategies, while the lowest is around $-0.57$ for least misery, $-0.53$ for best average, and $-0.63$ for multiplication.

 The figures for homogeneous (\autoref{fig:Evaluation:HomoSatisfaction}) and random groups (\autoref{fig:Evaluation:RandomSatisfaction}) have a similar shape, but their values and slopes vary. The satisfaction change for homogeneous groups is mostly negative, starting at $-2$, and only reaches a positive level, with a value of $0.04$, for more than $100$ stored configurations. Multiplication and best average have higher values than least misery here, too. Moreover, the dissatisfaction change is positive, within the range $[0,1]$, except that it falls slightly below zero once more than $75$ configurations are stored.