\section{Service Evaluation} \label{sec:se}
This section discusses how the service was evaluated from a technical standpoint and presents the results of that evaluation.
Given the goals of the project, there are two kinds of tests that need to be accounted for:
user testing, which relates to the experience of using the service, and quantitative testing, which measures properties of the service
such as the accuracy of the generated models and the response time to queries.
\subsection{Testing the model creation}
To test the system, a few datasets were selected.
They were chosen to represent different possible model sizes and different numbers of output labels.
ImageNet \cite{imagenet} was not selected as one of the test datasets, as it does not represent the target problem that this project is trying to tackle.
The tests measure the following (a sketch of how these measurements could be collected is shown after the list):
\begin{itemize}
\item Time to process and validate the entire dataset upon upload
\item Time to train a model on the dataset
\item Time to classify the image once the dataset has been trained
\item Time to extend the model
\item Accuracy of the newly created model
\end{itemize}
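The exact tooling used to collect these measurements is not important; as an illustration only, the sketch below shows how a test client could time an operation and compute accuracy over a set of labelled samples, where \texttt{classify} is a placeholder for a call to the service's API rather than a real function of the project.
\begin{verbatim}
# Illustrative sketch only: timing an operation and measuring accuracy from
# a test client; `classify` is a placeholder for a call to the service.
import time

def timed(operation):
    """Run `operation` and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = operation()
    return result, time.perf_counter() - start

def accuracy(samples, classify):
    """Fraction of (image, label) pairs the service classifies correctly."""
    correct = sum(1 for image, label in samples if classify(image) == label)
    return correct / len(samples)
\end{verbatim}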
The results are presented in Table \ref{tab:eval-results}.
\subsubsection*{MNIST}
The MNIST dataset \cite{mnist} is a large collection of handwritten digits that is commonly used to train and test machine learning systems.
This dataset was selected because of its small images: a model can be trained on it quickly, and it can be used to verify the other internal systems of the service.
During testing, only 9 of the 10 classes are used for the initial training; the 10th class is added during the retraining process. A minimal sketch of such a hold-out split is shown after the figure.
\begin{figure}[H]
\centering
\subfloat{{\includegraphics[width=.2\linewidth]{minst_1}}}
\qquad
\subfloat{{\includegraphics[width=.2\linewidth]{minst_2}}}
\caption{Examples of the images in the MNIST dataset}
\end{figure}
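How the service itself withholds a class is internal to its implementation; the following is only an illustrative sketch, assuming a torchvision-style MNIST loader, of how a dataset can be split so that one class is held back for the later retraining step.
\begin{verbatim}
# Illustrative sketch only: withhold one MNIST class so it can be
# added back during the retraining/extension step.
from torchvision import datasets, transforms
from torch.utils.data import Subset

HELD_OUT_CLASS = 9  # the class added back during retraining

mnist = datasets.MNIST(root="data", train=True, download=True,
                       transform=transforms.ToTensor())

# Samples of the first nine classes, used for the initial training.
initial_idx = [i for i, y in enumerate(mnist.targets) if y != HELD_OUT_CLASS]
# Samples of the withheld class, used when the model is extended.
extension_idx = [i for i, y in enumerate(mnist.targets) if y == HELD_OUT_CLASS]

initial_train_set = Subset(mnist, initial_idx)
extension_set = Subset(mnist, extension_idx)
\end{verbatim}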
\subsubsection*{CIFAR-10}
The CIFAR-10 \cite{cifar10} dataset contains various images that are commonly used to train and test machine learning algorithms.
This dataset was selected due to its size: it is small enough to train on quickly, but its images are larger and coloured, which makes it harder than MNIST.
During testing, only 9 of the 10 classes are used for the initial training; the 10th class is added during the retraining process.
\begin{figure}[H]
\centering
\subfloat{{\includegraphics[width=.2\linewidth]{cifar_1}}}
\qquad
\subfloat{{\includegraphics[width=.2\linewidth]{cifar_2}}}
\caption{Examples of the images in the CIFAR-10 dataset}
\end{figure}
\subsubsection*{STL-10}
The STL-10 \cite{stl10} dataset was inspired by CIFAR-10 \cite{cifar10}, but it has larger images.
This dataset was selected because of those larger images: they are bigger than the images in both CIFAR-10 and MNIST, which makes the model harder to create and train.
During testing, only 9 of the 10 classes are used for the initial training; the 10th class is added during the retraining process.
\begin{figure}[H]
\centering
\subfloat{{\includegraphics[width=.2\linewidth]{stl_1}}}
\qquad
\subfloat{{\includegraphics[width=.2\linewidth]{stl_2}}}
\caption{Examples of the images in the STL-10 dataset}
\end{figure}
\subsubsection*{ArtBench}
The ArtBench \cite{artbench} dataset contains artworks annotated with their art style and is intended for training generative models.
This dataset was selected because its images are even larger than those of the previously tested datasets.
During testing, only 9 of the 10 classes are used for the initial training; the 10th class is added during the retraining process.
\begin{figure}[H]
\centering
\subfloat{{\includegraphics[width=.2\linewidth]{artbench1}}}
\qquad
\subfloat{{\includegraphics[width=.2\linewidth]{artbench2}}}
\caption{Examples of the images in the ArtBench dataset}
\end{figure}
\subsubsection*{Incompatible datasets}
Attempts were made to test other datasets against the system, but those datasets were incompatible.
They contain images of irregular sizes, which, as mentioned previously, the system does not support.
This caused a large portion of the input images to be rejected, meaning that the model would not have had enough data to train on.
The datasets that are incompatible for this reason are the following (a sketch of the kind of size check that rejects such images is shown after the list):
\begin{multicols}{2}
\begin{itemize}
\item Caltech 256 \cite{caltech256}
\item FGVC-Aircraft \cite{fgvca}
\item IFood 2019 \cite{fooddataset}
\end{itemize}
\end{multicols}
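These datasets fail because their images do not share a single fixed size. The sketch below is not the service's actual validation code, but illustrates the kind of check involved, assuming Pillow is available and the dataset is a directory of JPEG files; the expected size and dataset path are chosen purely for illustration.
\begin{verbatim}
# Illustrative sketch only: flag images whose dimensions do not match
# the size expected by the model being created.
from pathlib import Path
from PIL import Image

EXPECTED_SIZE = (256, 256)  # (width, height); value chosen for illustration

def find_incompatible(dataset_dir: str) -> list[Path]:
    """Return paths of images that do not match EXPECTED_SIZE."""
    rejected = []
    for path in Path(dataset_dir).rglob("*.jpg"):
        with Image.open(path) as img:
            if img.size != EXPECTED_SIZE:
                rejected.append(path)
    return rejected

# A dataset with irregular image sizes yields many rejected paths,
# leaving too little data for training.
print(len(find_incompatible("data/caltech256")))  # hypothetical path
\end{verbatim}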
\subsubsection*{Results}
\begin{longtable}{ | c | c | c | c | c | c |}
\hline
Dataset & Import Time & Train Time & Classification Time & Extend Time & Accuracy \\ \hline
MNIST & $8s$ & $2m$ & $<1s$ & $50s$ & $98\%$ \\ \hline
CIFAR-10 & $6s$ & $41m 38s$ & $<1s$ & $1m 11s$ & $95.2\%$ \\ \hline
STL-10 & $1s$ & $37m 50s$ & $<1s$ & $1m 10s$ & $95.3\%$ \\ \hline
ArtBench & $10s$ & $4h 20m 31s$ & $<1s$ & $1m 41s$ & $41.86\%$ \\ \hline
\caption{Evaluation Results}
\label{tab:eval-results}
\end{longtable}
The system was able to import all the provided datasets, including the incompatible ones, remarkably quickly.
While the system could load and verify the images of the incompatible datasets, it correctly marked their images as incompatible, as can be seen in Figure \ref{fig:incompatible_images}.
Those images could therefore not be used for training, which means the model would have had almost no data to train on and would have achieved very poor accuracy.
\begin{figure}[h!]
\centering
\includegraphics[width=0.7\textwidth]{incompatible_images}
\caption{Screenshot of the web application showing many images that do not have the correct format.}
\label{fig:incompatible_images}
\end{figure}
The system was able to train, classify, and extend models for the MNIST, CIFAR-10, and STL-10 datasets with high accuracy.
This is expected, as these datasets are commonly known to be easy to train on.
The system could also train these models in a relatively short amount of time.
The classification time is excellent, with every model able to classify an image in less than a second.
The time to extend is also very promising, with the system able to add the new class fairly quickly.
The system was unable to achieve a high level of accuracy when training on the ArtBench dataset, and the training time needed to reach even that lower accuracy was much longer than for the other datasets.
The longer training time can be attributed to the larger images, which make the model harder to train, as it has to perform more computations.
Another factor in the increased training time is that the model needs to train for longer to reach a higher accuracy, due to its decreased learning rate.
As for the low accuracy rating, I hypothesise that this is caused by the nature of the dataset.
The dataset is categorized into various art styles.
Even within a single art style, artists' individual styles can vary significantly.
Given the relatively small sample size of only 5000 training images per art style, this variability poses a challenge for the model's ability to discern between distinct styles.
Another possibility is that the system simply did not generate a good enough model for this dataset.
The system was still able to classify an image fairly quickly, with the classification time remaining under a second.
The expansion time was also fairly quick, on par with the other models.
\subsubsection*{Testing limitations}
This testing has some limitations.
The biggest issue concerns the training, classification, and expansion timings, as these values depend on the hardware of the system running the model.
The small number of datasets tested is also limiting, as it does not fully prove that the system can create generalised models.
\subsection{API Performance Testing}
The performance of the application was also tested.
To test the performance of the API, a small program was written that simultaneously requests the classification of an image.
The selected image was one of the sample images provided in the MNIST dataset.
The program performs 1, 10, 100, 1000, and 10000 simultaneous requests, waiting 2 seconds between each set.
For each request, the program records how long the image classification task takes to complete, and once all requests in a set have finished, it calculates the mean and maximum request times.
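The benchmark program is not reproduced in full in this report; the sketch below is a minimal reconstruction of the behaviour just described, using Python threads to issue the simultaneous requests. The endpoint URL and payload format are assumptions made for illustration, not the project's actual API.
\begin{verbatim}
# Illustrative sketch only: fire N simultaneous classification requests,
# wait 2 seconds between batches, and report the mean and maximum latency.
import time
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://localhost:8080/classify"         # hypothetical endpoint
IMAGE = open("mnist_sample.png", "rb").read()  # one of the MNIST sample images

def classify_once(_):
    start = time.perf_counter()
    requests.post(URL, files={"image": IMAGE})
    return time.perf_counter() - start

for batch in (1, 10, 100, 1000, 10000):
    with ThreadPoolExecutor(max_workers=batch) as pool:
        latencies = list(pool.map(classify_once, range(batch)))
    print(f"{batch:>5} requests: mean={sum(latencies)/len(latencies):.3f}s "
          f"max={max(latencies):.3f}s")
    time.sleep(2)  # pause between sets, as in the described test
\end{verbatim}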
\begin{figure}[H]
\centering
\subfloat{{\includegraphics[width=.5\linewidth]{max}}}
\subfloat{{\includegraphics[width=.5\linewidth]{mean}}}
\caption{Results of the API testing}
\label{fig:results-api}
\end{figure}
The values shown in Figure \ref{fig:results-api} show that if the system is configured with only one runner, it struggles to handle large numbers of simultaneous requests.
This is expected, as a single process trying to classify such a large number of images is bound to fall behind.
In reality, the system would never be set up this way, since having only one runner in a production environment would never be acceptable.
\begin{figure}[H]
\centering
\subfloat{{\includegraphics[width=.5\linewidth]{max-no-1}}}
\subfloat{{\includegraphics[width=.5\linewidth]{mean-no-1}}}
\caption{Results of the API testing, excluding the single-runner configuration}
\label{fig:results-api-no-one}
\end{figure}
Figure \ref{fig:results-api-no-one} shows the same data as Figure \ref{fig:results-api}, but with the results for the single-runner test removed.
The graph indicates that the system was able to handle 10000 simultaneous requests in less than 30 seconds, which more than exceeds the expectations of the project.
The results also indicate that adding runners has diminishing returns, as the maximum and mean request times fall within a small range of each other.
This can have multiple causes.
One is that there were not enough requests to show a significant difference between the different numbers of runners.
Another is that the work the system has to perform to manage all the runners outweighs the benefit of having more of them.
During testing, the RAM usage was monitored but not recorded.
As expected, the memory usage increased significantly with the number of runners, but did not exceed 5 GiB.
The higher memory usage is a result of the runners caching the model in use.
The memory footprint observed here is limited by the model selected, as the model generated for the MNIST dataset is not large; larger models are expected to produce larger memory footprints.
When deploying the application, an administrator should take into consideration the expected model sizes as well as the expected usage frequency, and configure the application accordingly.
These results are very positive, since the tests were run on my personal computer rather than on professional server hardware.
This indicates that, when deployed to a production environment, the service is likely to perform extremely well.
\subsubsection*{Testing limitations}
As with the previous testing, this test also has some limitations.
It shares the same hardware limitation: different hardware will give different results.
Another limiting factor is that the test did not use different models or images, which could force the service to reload models from disk and affect performance.
\subsection{Usability}
While whether a service is usable varies greatly from user to user, the implemented system is simple enough that a user who knows nothing about image classification could upload images and obtain a working classification model.
This simplicity may, however, be limiting for users with advanced knowledge, and falls short of optimal usability for them: such a user might choose not to use the system because it does not offer the level of control that they want.
The administrator area is less user-friendly than the rest of the application, but that is less critical.
An administrator is not the target user of the application and is expected to have prior knowledge of the system in order to manage it.
\subsection{Summary}
The service can create and train models that serve the user's needs.
These models will most likely be able to achieve high accuracy targets, but in some cases the system might fail to generate a good enough model for the provided dataset.
During testing, the limitations of the strict image size requirements also became apparent: the system would have failed to train on the incompatible datasets because most of their images would have been removed before the model started training.
While classifying images, the service performed extremely well.
The API performance tests showed that, if configured correctly, a single-server configuration can handle a large number of simultaneous classification requests extremely quickly.
These results indicate that the system has the performance required to be put in a production environment and perform well.
As for the usability of the service, the system is usable by beginners but might deter more advanced users who want finer control.
Overall, the service evaluation is positive: the system was able to create and train new models while remaining user-friendly to users who might not otherwise have the skills to perform image classification.
\pagebreak