Posts

CST438 Learning Journal 1

Before starting this Software Engineering course, I expected it to focus primarily on programming techniques and learning how to write better code. I assumed most of the class would be about using different programming languages, design patterns, and frameworks to build applications. I also expected to spend time learning debugging techniques and improving coding efficiency. While I knew teamwork might be discussed, I did not realize how much emphasis software engineering places on the overall process of developing and maintaining software. After completing the first week, my perspective has changed. I now understand that software engineering is much broader than just programming. Writing code is only one part of creating successful software. The course introduced concepts like maintainability, sustainability, testing strategies, version control, and the importance of making design decisions that support long-term development. These topics highlighted that software often lives for man...

CST383 Learning Log #7

This week focused on classification models, particularly logistic regression and k-nearest neighbors (kNN), as well as evaluating model performance. One of the most useful concepts I learned was how logistic regression uses the sigmoid function to convert a linear combination of predictors into a probability between 0 and 1. Before this week, I understood classification at a high level, but I did not fully understand how a model could estimate the probability that an observation belongs to a particular class. I also learned more about confusion matrices, the counts they contain (such as false positives), and the metrics derived from them, including accuracy, precision, and recall. Working through examples helped me see that accuracy alone does not always tell the full story, especially when one type of error is more important than another. Understanding recall was particularly helpful because it measures how well a model identifies actual positive cases. Another important topic was the comparison between logistic ...
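
The sigmoid step and the confusion-matrix metrics described above can be sketched in plain Python. This is a minimal illustration, not the course's code: the labels, weights, and function names are all hypothetical, and the simple `sigmoid` here can overflow for very large negative inputs.

```python
import math


def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Logistic regression: sigmoid of a linear combination of predictors."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def confusion_counts(actual, predicted):
    """Count the four confusion-matrix cells for binary labels (1 = positive)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn


# Hypothetical labels: 3 actual positives out of 10 observations.
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)   # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted positives that are real
recall = tp / (tp + fn)              # share of real positives that were found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

With these made-up labels the accuracy is 0.70 while recall is only about 0.33, which is the kind of gap that accuracy alone hides.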

CST383 Learning Log #6

This week in class focused on several important machine learning concepts, including linear regression, classification, train/test splits, model evaluation, and hyperparameter tuning. One of the biggest takeaways for me was understanding the difference between training a model and evaluating how well it generalizes to new data. Before this week, I tended to focus on how well a model fit the training data, but I now understand that test performance is a much better indicator of how useful a model will be in practice. I also learned more about the role of hyperparameters and how they differ from model parameters. The discussion of GridSearchCV helped me understand how machine learning practitioners systematically search for better hyperparameter values rather than choosing them arbitrarily. I found it interesting that the best hyperparameters can vary significantly depending on the dataset being used. One concept that I’m still working to fully understand is the tradeo...
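
A minimal sketch of the train/test split and GridSearchCV workflow mentioned above, assuming scikit-learn is installed. The synthetic dataset, the choice of a k-nearest neighbors classifier, and the grid of `n_neighbors` values are assumptions made for illustration, not details from the class.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data stands in for a real dataset (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Hold out 25% of the rows; the model never sees them during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hyperparameter: chosen before fitting, unlike parameters learned from data.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}

# Try every value in the grid with 5-fold cross-validation on the training set.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

best_k = search.best_params_["n_neighbors"]
train_accuracy = search.score(X_train, y_train)
test_accuracy = search.score(X_test, y_test)  # estimate of generalization

print(f"best k={best_k} train={train_accuracy:.2f} test={test_accuracy:.2f}")
```

Comparing `train_accuracy` with `test_accuracy` is one way to see the gap between fitting the training data and generalizing to new data. A different dataset may well select a different `best_k`.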

CST383 Learning Log #5

This week focused on data exploration, preprocessing, and building a K-Nearest Neighbors (KNN) classification model. One of the most interesting topics was learning how important it is to understand and clean data before training a model. The diabetes dataset looked clean at first because there were no missing values, but after exploring the data more closely, we discovered that many predictor variables contained zeros that were likely acting as missing values. This showed me that data quality problems are not always obvious and that domain knowledge is important when analyzing datasets. I also learned how visualization can reveal patterns that are difficult to see in summary statistics alone. Histograms, boxplots, scatterplots, and pair plots helped identify outliers and suspicious values. The process of removing problematic rows and then standardizing the data before applying KNN made it clear that preprocessing can have a major impact on model performance. One concept I am sti...
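
The cleaning steps described above can be sketched roughly as follows, assuming pandas and NumPy are available. The rows and the column names (`glucose`, `bmi`, `outcome`) are hypothetical stand-ins, not the actual course dataset, and the standardization uses pandas' sample standard deviation rather than any particular scaler from the labs.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the column names are assumptions for illustration.
df = pd.DataFrame({
    "glucose": [148, 85, 0, 89, 137],
    "bmi": [33.6, 26.6, 23.3, 0.0, 43.1],
    "outcome": [1, 0, 1, 0, 1],
})
predictors = ["glucose", "bmi"]

# The data looks clean at first: no NaN values anywhere.
reported_missing = int(df.isna().sum().sum())

# A zero is not a plausible measurement here, so treat it as missing.
cleaned = df.copy()
for col in predictors:
    cleaned[col] = cleaned[col].replace(0, np.nan)

# Remove the problematic rows.
cleaned = cleaned.dropna(subset=predictors)

# Standardize so that no predictor dominates the KNN distance calculation.
scaled = (cleaned[predictors] - cleaned[predictors].mean()) / cleaned[predictors].std()

print(reported_missing, len(df), len(cleaned))
```

Standardizing matters for KNN in particular because it is distance-based: without it, a predictor measured on a larger scale would contribute more to the distance than one measured on a smaller scale.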

CST383 Learning Log #4

This week we focused heavily on data visualization and exploratory data analysis using Pandas, Matplotlib, and Seaborn. We worked with datasets involving campaign contributions and US Census information, and I learned that choosing the correct visualization is just as important as creating the graph itself. Different types of variables require different approaches. For example, histograms worked well for continuous variables like contribution amounts or hours worked per week, while grouped and stacked bar charts were better for comparing categories such as occupations, employment status, sex, and income level. One thing I improved on this week was using Pandas methods to summarize and prepare data before plotting. We used functions like groupby(), value_counts(), and crosstab() repeatedly. For example, this line was useful for comparing contribution amounts across categories: df.groupby('candidate')['contb_receipt_amt'].median().plot.barh(). I also learned how normalizat...
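
A small sketch of the summarize-before-plotting pattern with groupby(), value_counts(), and crosstab(), assuming pandas is installed. Only the `candidate` and `contb_receipt_amt` column names come from the post; the rows, the `employment` column, and the use of `normalize="index"` are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical contributions; the values and the employment column are made up.
df = pd.DataFrame({
    "candidate": ["A", "A", "A", "B", "B", "B"],
    "contb_receipt_amt": [25.0, 50.0, 1000.0, 10.0, 20.0, 30.0],
    "employment": ["employed", "retired", "employed",
                   "employed", "retired", "retired"],
})

# Median contribution per candidate; the median resists the 1000.0 outlier.
median_by_candidate = df.groupby("candidate")["contb_receipt_amt"].median()

# How many contributions each candidate received.
counts = df["candidate"].value_counts()

# Two-way table of counts: candidate by employment status.
table = pd.crosstab(df["candidate"], df["employment"])

# Same table as row proportions, so each candidate's row sums to 1.
shares = pd.crosstab(df["candidate"], df["employment"], normalize="index")

print(median_by_candidate)
print(table)
print(shares)
```

Any of these results can be passed on to a plotting call such as `.plot.barh()` once the summary looks right. Plotting is left out here so the sketch runs without a display.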

CST383 Learning Log #3

This week in class I spent a lot of time working with NumPy, Pandas, and data visualization in Python. Early in the week I focused on NumPy arrays, indexing, boolean masks, and vectorized operations. I also practiced using list comprehensions and learned more about the differences between Python lists and NumPy arrays. One thing I noticed is that NumPy operations become much cleaner and more efficient once you stop thinking in terms of loops and start thinking in terms of whole-array operations. Later in the week we moved into Pandas Series and DataFrames. I learned how to select columns, filter rows with boolean conditions, group data, compute statistics, and rename columns. I also became more comfortable reading dataframe summaries with functions like info() and describe(). At first I mixed up when operations returned a Series versus a DataFrame, but after working through the labs I feel much more confident about it. The visualization part of the labs was especially interesting. We c...
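
To illustrate the shift from loops to whole-array operations described above, here is a minimal sketch assuming NumPy is installed. The temperature values and variable names are hypothetical.

```python
import numpy as np

# Hypothetical temperatures in Celsius.
temps_c = np.array([12.5, 18.0, 21.5, 30.0, 8.0])

# Loop-style thinking: visit each element, test it, convert it.
warm_f_loop = [t * 9 / 5 + 32 for t in temps_c if t > 15]

# Whole-array thinking: build a boolean mask, then operate on the selection.
mask = temps_c > 15                    # one True/False per element
warm_f = temps_c[mask] * 9 / 5 + 32    # arithmetic applies to every element

print(mask)
print(warm_f)
```

Both versions produce the same numbers. The vectorized one states the condition and the conversion once each, with no explicit iteration.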

CST383 Learning Log #2

This week focused heavily on using Pandas and NumPy for data analysis and aggregation. I practiced working with Pandas Series and DataFrames, including creating columns, renaming columns, filtering rows with boolean masks, and using .loc and .iloc for indexing. I also worked with aggregation functions such as mean(), median(), value_counts(), and groupby() to analyze datasets like heart disease data, census data, penguin body mass data, and Lyft bike-sharing data. One important idea I learned is that Pandas automatically aligns data using indexes, which can be very powerful but also confusing when values are missing or indexes do not match. A concept I am still working on fully understanding is when to use groupby() versus value_counts(). I understand that value_counts() is useful for counting categories quickly, while groupby() is more flexible for computing statistics, but sometimes the two approaches seem similar. Another topic that took practice was combining boolean masks wi...
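
One way to see how value_counts() and groupby() relate, and how index alignment behaves, is the short sketch below. It assumes pandas is installed; the penguin rows, the column names, and the two small Series are hypothetical.

```python
import pandas as pd

# Hypothetical penguin measurements.
penguins = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap", "Gentoo", "Adelie"],
    "body_mass_g": [3700, 5000, 3800, 3600, 5200, 3900],
})

# Counting categories: both give the same counts, in a different order.
counts_vc = penguins["species"].value_counts()    # sorted by count, descending
counts_gb = penguins.groupby("species").size()    # sorted by group label

# groupby() goes further: any statistic per group, not only a count.
mean_mass = penguins.groupby("species")["body_mass_g"].mean()

# Index alignment: values are matched by label, not by position.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "z"])
total = a + b    # "x" has no partner in b, so its result is NaN

print(counts_vc)
print(mean_mass)
print(total)
```

The two approaches look similar because counting is the one case where they overlap. Once a statistic other than a count is needed, groupby() is the one that applies.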