CST383 Learning Log #5

This week focused on data exploration, preprocessing, and building a K-Nearest Neighbors (KNN) classification model. One of the most interesting lessons was how important it is to understand and clean data before training a model. The diabetes dataset looked clean at first because it had no missing values, but closer exploration showed that many predictor variables contained zeros that were likely standing in for missing values. This showed me that data quality problems are not always obvious and that domain knowledge matters when analyzing a dataset.
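To make the "hidden missing values" idea concrete, here is a minimal sketch of the kind of check I have in mind. The toy DataFrame, its column names, and the choice of which columns are "suspect" are all assumptions for illustration, not the actual assignment data.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset; values and column names are hypothetical.
df = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137],
    "BloodPressure": [72, 66, 64, 0, 40],
    "BMI": [33.6, 26.6, 23.3, 28.1, 0.0],
    "Pregnancies": [6, 1, 8, 1, 0],  # a zero is a legitimate value here
    "Outcome": [1, 0, 1, 0, 1],
})

# Columns where a zero is assumed to mean "not measured" (a domain-knowledge call).
suspect_cols = ["Glucose", "BloodPressure", "BMI"]

# isna() finds nothing, which is why the data can look clean at first.
missing_reported = int(df.isna().sum().sum())

# Counting zeros per suspect column exposes the hidden gaps.
zero_counts = (df[suspect_cols] == 0).sum()

# Recode zeros as NaN in the suspect columns only, then drop the affected rows.
cleaned = df.replace({col: 0 for col in suspect_cols}, np.nan)
rows_kept = cleaned.dropna(subset=suspect_cols)

print("NaN values reported:", missing_reported)
print(zero_counts)
print("Rows kept:", len(rows_kept), "of", len(df))
```

Note that the zero in `Pregnancies` is left alone, since deciding which zeros are suspicious depends on what the column measures.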

I also learned how visualization can reveal patterns that are difficult to see in summary statistics alone. Histograms, boxplots, scatterplots, and pair plots helped identify outliers and suspicious values. Removing problematic rows and then standardizing the data before applying KNN made it clear that preprocessing can have a major impact on model performance. Because KNN classifies a point by its distance to its neighbors, a feature measured on a much larger scale than the others can dominate that distance unless the data is standardized.
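Here is a small sketch of why standardizing before KNN can matter. It uses scikit-learn with synthetic data from `make_classification` rather than the diabetes dataset, and the extra large-scale noise column, `n_neighbors=5`, and 5-fold cross-validation are all choices I made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the diabetes dataset).
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Append an uninformative feature on a much larger scale than the others.
rng = np.random.default_rng(0)
noise = rng.normal(scale=1000.0, size=(X.shape[0], 1))
X_wide = np.hstack([X, noise])

# KNN on raw features versus KNN with standardization inside a pipeline.
knn_raw = KNeighborsClassifier(n_neighbors=5)
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# Mean 5-fold cross-validated accuracy for each variant.
acc_raw = cross_val_score(knn_raw, X_wide, y, cv=5).mean()
acc_scaled = cross_val_score(knn_scaled, X_wide, y, cv=5).mean()

print(f"Unscaled KNN accuracy: {acc_raw:.3f}")
print(f"Scaled KNN accuracy:   {acc_scaled:.3f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the held-out fold does not leak into the scaling statistics.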


One concept I am still thinking about is how to decide whether unusual values should be removed, imputed, or left alone. In this assignment we removed rows with zeros in certain columns, but I wonder how much the final accuracy would change if we had used imputation instead. It seems like there is often no single "correct" answer, and the best choice depends on the situation.
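A rough sketch of the two options side by side, assuming the suspicious zeros have already been recoded as NaN. The toy values and column names are hypothetical, and median imputation is just one of several possible strategies.

```python
import numpy as np
import pandas as pd

# Toy data with gaps already marked as NaN; values and column names are hypothetical.
df = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0, 137.0, 116.0],
    "BMI": [33.6, 26.6, 23.3, np.nan, 43.1, 25.6],
    "Outcome": [1, 0, 1, 0, 1, 0],
})
features = ["Glucose", "BMI"]

# Option 1: remove every row that has a gap in any feature.
dropped = df.dropna(subset=features)

# Option 2: keep all rows and fill each gap with that column's median.
medians = df[features].median()
imputed = df.fillna(medians)

print("Rows after dropping:", len(dropped))
print("Rows after imputing:", len(imputed))
print(imputed)
```

Dropping loses two of the six rows here, while imputing keeps them at the cost of inventing plausible values. In a real workflow the medians would probably need to come from the training split only, so that the test data does not influence the fill values.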


A question I have is how sensitive KNN is to preprocessing decisions compared to other algorithms. For example, would a decision tree or random forest be affected as much by outliers and scaling? An idea I have is to compare several classification algorithms on the same cleaned dataset to see which performs best and which is most robust to imperfect data. Overall, this week reinforced the idea that successful machine learning depends just as much on understanding and preparing the data as it does on choosing the model itself.
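The comparison idea could start from something like the sketch below. It uses scikit-learn on synthetic data, not the cleaned diabetes dataset, and the model settings (`n_neighbors=5`, `n_estimators=50`, 5-fold cross-validation) are arbitrary starting points rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the diabetes dataset).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)

# KNN gets a scaler; the tree-based models are left unscaled because their
# threshold splits do not depend on the scale of each feature.
models = {
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model on the same data.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

To probe robustness, the same loop could be rerun on versions of the data with and without the outlier rows removed, comparing how much each model's score moves.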

