CST383 Learning Log #5
This week focused on data exploration, preprocessing, and building a K-Nearest Neighbors (KNN) classification model. One of the most interesting topics was learning how important it is to understand and clean data before training a model. The diabetes dataset looked clean at first because there were no missing values, but after exploring the data more closely, we discovered that many predictor variables contained zeros that were likely acting as missing values. This showed me that data quality problems are not always obvious and that domain knowledge is important when analyzing datasets. I also learned how visualization can reveal patterns that are difficult to see in summary statistics alone. Histograms, boxplots, scatterplots, and pair plots helped identify outliers and suspicious values. The process of removing problematic rows and then standardizing the data before applying KNN made it clear that preprocessing can have a major impact on model performance. One concept I am sti...
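To make the preprocessing steps described in this log concrete, here is a minimal sketch in plain Python: treat implausible zeros as missing, drop those rows, standardize the remaining predictors, then classify a new point with KNN. The toy rows, the column names (glucose, bmi), the query point, and every helper function name are assumptions made up for illustration. They are not the actual dataset values or the code used in the course, and a library implementation would normally replace the hand-written helpers.

```python
from collections import Counter
from math import dist
from statistics import mean, pstdev

# Hypothetical toy rows: (glucose, bmi, label). Values are invented for illustration.
rows = [
    (148.0, 33.6, 1),
    (85.0, 26.6, 0),
    (0.0, 23.3, 1),    # glucose of 0 is implausible, so it is treated as missing
    (89.0, 28.1, 0),
    (137.0, 43.1, 1),
    (116.0, 0.0, 0),   # bmi of 0 is implausible, so it is treated as missing
    (78.0, 31.0, 0),
    (197.0, 30.5, 1),
]

def drop_zero_rows(rows):
    """Keep only rows whose predictor values (all but the label) are nonzero."""
    return [r for r in rows if all(v != 0 for v in r[:-1])]

def fit_scaler(features):
    """Return a (mean, population std) pair for each predictor column."""
    columns = list(zip(*features))
    return [(mean(c), pstdev(c)) for c in columns]

def standardize(point, scaler):
    """Convert a point to z-scores using the fitted column means and stds."""
    return tuple((v - m) / s for v, (m, s) in zip(point, scaler))

def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

clean = drop_zero_rows(rows)
X = [r[:-1] for r in clean]
y = [r[-1] for r in clean]

scaler = fit_scaler(X)
X_scaled = [standardize(x, scaler) for x in X]

# The query must be scaled with the same means and stds as the training data.
query = standardize((150.0, 35.0), scaler)
prediction = knn_predict(X_scaled, y, query, k=3)
print(len(rows), "rows before cleaning,", len(clean), "after; predicted class:", prediction)
```

Standardizing matters here because KNN relies on distances: without it, glucose (which ranges over roughly a hundred units) would dominate bmi (which ranges over about twenty), so the neighbors would be chosen almost entirely by glucose.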