Posts

CST383 Learning Log #7

This week focused on classification models, particularly logistic regression and k-nearest neighbors (kNN), as well as evaluating model performance. One of the most useful concepts I learned was how logistic regression uses the sigmoid function to convert a linear combination of predictors into a probability between 0 and 1. Before this week, I understood classification at a high level, but I did not fully understand how a model could estimate the probability that an observation belongs to a particular class. I also learned more about confusion matrices, which count outcomes such as false positives, and the metrics derived from them, including accuracy, precision, and recall. Working through examples helped me see that accuracy alone does not always tell the full story, especially when one type of error is more important than another. Understanding recall was particularly helpful because it measures the fraction of actual positive cases that a model correctly identifies. Another important topic was the comparison between logistic ...
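A minimal sketch of the two ideas in this excerpt, in plain Python: the sigmoid turning a linear combination into a probability, and accuracy, precision, and recall computed from confusion-matrix counts. The coefficients and labels are made up for illustration and do not come from the course labs.

```python
import math


def sigmoid(z):
    """Map a real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for two predictors (not fitted to any data).
intercept = -1.0
weights = [0.8, -0.5]


def predict_proba(x):
    """Probability of the positive class for one observation x."""
    z = intercept + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


# Made-up labels: three actual positives, five actual negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(predict_proba([2.0, 1.0]))
print(accuracy, precision, recall)
```

In this toy case the accuracy is 0.625 while the recall is only one third, which is the kind of gap that accuracy alone would hide.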

CST383 Learning Log #6

This week in class focused on several important machine learning concepts, including linear regression, classification, train/test splits, model evaluation, and hyperparameter tuning. One of the biggest takeaways for me was understanding the difference between training a model and evaluating how well it generalizes to new data. Before this week, I tended to focus on how well a model fit the training data, but I now understand that test performance is a much better indicator of how useful a model will be in practice. I also learned more about the role of hyperparameters and how they differ from model parameters. The discussion of GridSearchCV helped me understand how machine learning practitioners systematically search for better hyperparameter values rather than choosing them arbitrarily. I found it interesting that the best hyperparameters can vary significantly depending on the dataset being used. One concept that I’m still working to fully understand is the tradeo...
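A small sketch of a train/test split followed by a grid search, assuming scikit-learn is installed. The iris dataset, the kNN classifier, and the candidate values for `n_neighbors` are choices made for this example, not necessarily what the class used.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Built-in example dataset, used here only for illustration.
X, y = load_iris(return_X_y=True)

# Hold out a test set that the search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# n_neighbors is a hyperparameter: chosen before fitting, not learned from data.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}

# 5-fold cross-validation on the training data for each candidate value.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

# Accuracy of the best model on the held-out test set.
test_accuracy = search.score(X_test, y_test)

print(search.best_params_)
print(test_accuracy)
```

The cross-validated score picks the hyperparameter, and the test set is only touched once at the end to estimate how well the chosen model generalizes.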

CST383 Learning Log #5

This week focused on data exploration, preprocessing, and building a K-Nearest Neighbors (KNN) classification model. One of the most interesting topics was learning how important it is to understand and clean data before training a model. The diabetes dataset looked clean at first because there were no missing values, but after exploring the data more closely, we discovered that many predictor variables contained zeros that were likely acting as missing values. This showed me that data quality problems are not always obvious and that domain knowledge is important when analyzing datasets. I also learned how visualization can reveal patterns that are difficult to see in summary statistics alone. Histograms, boxplots, scatterplots, and pair plots helped identify outliers and suspicious values. The process of removing problematic rows and then standardizing the data before applying KNN made it clear that preprocessing can have a major impact on model performance. One concept I am sti...
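A rough sketch of the cleaning and standardizing steps described above, assuming pandas, NumPy, and scikit-learn are available. The tiny table and its column names (`glucose`, `bmi`, `outcome`) are invented for illustration and are not the actual diabetes dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows; a zero glucose or BMI value is not physically plausible.
df = pd.DataFrame({
    "glucose": [148.0, 85.0, 0.0, 89.0, 137.0],
    "bmi": [33.6, 26.6, 23.3, 0.0, 43.1],
    "outcome": [1, 0, 1, 0, 1],
})
predictors = ["glucose", "bmi"]

# isna() reports nothing missing, even though zeros stand in for missing values.
print(df[predictors].isna().sum())

# Treat zeros in the predictor columns as missing, then drop those rows.
cleaned = df.copy()
cleaned[predictors] = cleaned[predictors].replace(0, np.nan)
cleaned = cleaned.dropna(subset=predictors)

# Standardize so each predictor has mean 0 and unit variance.
# KNN is distance-based, so unscaled columns with large ranges would dominate.
X = StandardScaler().fit_transform(cleaned[predictors])
y = cleaned["outcome"].to_numpy()

print(cleaned)
print(X)
```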

CST383 Learning Log #4

This week we focused heavily on data visualization and exploratory data analysis using Pandas, Matplotlib, and Seaborn. We worked with datasets involving campaign contributions and US Census information, and I learned that choosing the correct visualization is just as important as creating the graph itself. Different types of variables require different approaches. For example, histograms worked well for continuous variables like contribution amounts or hours worked per week, while grouped and stacked bar charts were better for comparing categories such as occupations, employment status, sex, and income level. One thing I improved on this week was using Pandas methods to summarize and prepare data before plotting. We used functions like groupby(), value_counts(), and crosstab() repeatedly. For example, this line was useful for comparing contribution amounts across categories: df.groupby('candidate')['contb_receipt_amt'].median().plot.barh(). I also learned how normalizat...
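A short sketch of the summarizing step that comes before plotting, assuming pandas is installed. The rows are invented, and the `occupation` column name is a guess for illustration; only `candidate` and `contb_receipt_amt` appear in the post. The `.plot.barh()` call from the post is left out so the example runs without Matplotlib.

```python
import pandas as pd

# Made-up contribution records for two hypothetical candidates.
df = pd.DataFrame({
    "candidate": ["A", "A", "B", "B", "B"],
    "contb_receipt_amt": [50.0, 250.0, 100.0, 25.0, 75.0],
    "occupation": ["RETIRED", "ENGINEER", "RETIRED", "RETIRED", "ENGINEER"],
})

# Median contribution per candidate: one number per group.
medians = df.groupby("candidate")["contb_receipt_amt"].median()

# Number of contributions per candidate.
counts = df["candidate"].value_counts()

# Two-way table of counts: candidates as rows, occupations as columns.
table = pd.crosstab(df["candidate"], df["occupation"])

print(medians)
print(counts)
print(table)
```

Each of these results is already in the shape a bar chart expects, which is why the summarizing call and the plotting call chain together so naturally.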

CST383 Learning Log #3

This week in class I spent a lot of time working with NumPy, Pandas, and data visualization in Python. Early in the week I focused on NumPy arrays, indexing, boolean masks, and vectorized operations. I also practiced using list comprehensions and learned more about the differences between Python lists and NumPy arrays. One thing I noticed is that NumPy operations become much cleaner and more efficient once you stop thinking in terms of loops and start thinking in terms of whole-array operations. Later in the week we moved into Pandas Series and DataFrames. I learned how to select columns, filter rows with boolean conditions, group data, compute statistics, and rename columns. I also became more comfortable reading dataframe summaries with functions like info() and describe(). At first I mixed up when operations returned a Series versus a DataFrame, but after working through the labs I feel much more confident about it. The visualization part of the labs was especially interesting. We c...
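A small sketch of the shift from loops to whole-array operations, assuming NumPy is installed. The temperature values are made up for the example.

```python
import numpy as np

# Made-up temperatures in degrees Celsius.
temps_c = np.array([12.5, 18.0, 21.5, 30.0, 9.0])

# Loop-style version: a list comprehension that converts one element at a time.
temps_f_loop = [c * 9 / 5 + 32 for c in temps_c]

# Vectorized version: the same arithmetic applied to the whole array at once.
temps_f = temps_c * 9 / 5 + 32

# Boolean mask: one True/False per element, then used to select values.
is_warm = temps_c > 15
warm = temps_c[is_warm]

print(temps_f)
print(warm)
```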

CST383 Learning Log #2

This week focused heavily on using Pandas and NumPy for data analysis and aggregation. I practiced working with Pandas Series and DataFrames, including creating columns, renaming columns, filtering rows with boolean masks, and using .loc and .iloc for indexing. I also worked with aggregation functions such as mean(), median(), value_counts(), and groupby() to analyze datasets like heart disease data, census data, penguin body mass data, and Lyft bike-sharing data. One important idea I learned is that Pandas automatically aligns data using indexes, which can be very powerful but also confusing when values are missing or indexes do not match. A concept I am still working on fully understanding is when to use groupby() versus value_counts(). I understand that value_counts() is useful for counting categories quickly, while groupby() is more flexible for computing statistics, but sometimes the two approaches seem similar. Another topic that took practice was combining boolean masks wi...
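One way to see how value_counts() and groupby() relate, along with index alignment, assuming pandas is installed. The penguin rows and the column names `species` and `body_mass_g` are invented for this sketch.

```python
import pandas as pd

# Made-up penguin measurements.
penguins = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie"],
    "body_mass_g": [3700.0, 5000.0, 3800.0, 3600.0, 3900.0],
})

# Counting categories: value_counts() sorts by count, groupby().size() by key.
counts_vc = penguins["species"].value_counts()
counts_gb = penguins.groupby("species").size()

# groupby() also handles statistics that value_counts() cannot compute.
mean_mass = penguins.groupby("species")["body_mass_g"].mean()

# Index alignment: values are matched by label, not by position.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "z"])
total = a + b  # "x" has no partner in b, so its result is NaN

print(counts_vc)
print(counts_gb)
print(mean_mass)
print(total)
```

For plain counting the two approaches give the same numbers in a different order, so value_counts() is mostly a shortcut; groupby() becomes necessary once the summary is something other than a count.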

CST383 Learning Log #1

This was the first week of CST383 and we focused on building a strong foundation in Python for data science, especially working with NumPy and basic scripting tools. I practiced creating and manipulating arrays, including slicing, fancy indexing, and boolean masking. I also learned how vectorized operations allow computations to be performed efficiently across entire arrays without using loops. Working with both 1D and 2D arrays helped me better understand how to access specific rows and columns, as well as how to compute statistics like mean and median along different axes. In addition, I explored filtering data using conditions and using those filters to extract subsets of interest. One concept that I found a bit confusing at first was how boolean masks need to match the shape of the array they are indexing. For example, trying to index an array with a mask of a different length results in an error, which made me realize how important array dimensions are in NumPy. This made me ask: ...
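A brief sketch of the mask-shape behavior described above, assuming a reasonably recent NumPy. The arrays are made up for the example.

```python
import numpy as np

values = np.array([3, 8, 1, 9, 4])

# A mask built from the array itself always has the matching shape.
good_mask = values > 3
selected = values[good_mask]

# A mask of the wrong length is rejected with an IndexError.
bad_mask = np.array([True, False, True])
try:
    values[bad_mask]
    mismatch_error = None
except IndexError as exc:
    mismatch_error = str(exc)

# For a 2D array, a 1D mask must match the length of the axis it indexes.
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
row_mask = np.array([True, False])  # one entry per row
rows = grid[row_mask]

# Statistics along different axes: axis=0 goes down columns, axis=1 across rows.
col_means = grid.mean(axis=0)
row_means = grid.mean(axis=1)

print(selected)
print(mismatch_error)
print(rows)
print(col_means, row_means)
```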