CST383 Learning Log #2
This week focused heavily on using Pandas and NumPy for data analysis and aggregation. I practiced working with Pandas Series and DataFrames, including creating columns, renaming columns, filtering rows with boolean masks, and using .loc and .iloc for indexing. I also worked with aggregation methods such as mean(), median(), and value_counts(), along with groupby(), to analyze datasets like heart disease data, census data, penguin body mass data, and Lyft bike-sharing data. One important idea I learned is that Pandas automatically aligns data using indexes, which can be very powerful but also confusing when values are missing or indexes do not match.
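A small sketch of these ideas, using a made-up penguin-style table (the column names and values here are invented for illustration, not taken from the actual course datasets):

```python
import pandas as pd

# Hypothetical penguin-style data (invented for illustration)
df = pd.DataFrame(
    {"species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
     "body_mass_g": [3700, 5000, 3800, 3600]}
)

# Boolean mask: keep only rows where body mass exceeds 3650 g
heavy = df[df["body_mass_g"] > 3650]

# .loc indexes by label, .iloc by integer position
first_species_by_label = df.loc[0, "species"]
first_species_by_pos = df.iloc[0, 0]

# Index alignment: arithmetic matches on index labels, not position,
# so labels that appear in only one Series produce NaN
s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([10, 20], index=["b", "c"])
aligned = s1 + s2  # "b" aligns (2 + 10 = 12); "a" and "c" become NaN
```

The last two lines show why alignment can be confusing: the result silently contains NaN wherever the indexes do not overlap, rather than raising an error.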
A concept I am still working on fully understanding is when to use groupby() versus value_counts(). I understand that value_counts() is useful for quickly counting categories, while groupby() is more flexible for computing per-group statistics, but the two approaches sometimes overlap: counting categories with value_counts() gives the same numbers as grouping and taking the size of each group. Another topic that took practice was combining boolean masks with aggregation, especially filtering rows before computing statistics.
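A minimal comparison of the two approaches, using invented ride data loosely inspired by the bike-sharing example (the column names and values are assumptions, not the real dataset):

```python
import pandas as pd

# Hypothetical ride data (invented for illustration)
rides = pd.DataFrame(
    {"user_type": ["member", "casual", "member", "member", "casual"],
     "duration_min": [12.0, 35.0, 8.0, 15.0, 22.0]}
)

# value_counts(): quick category counts, sorted most-frequent first
counts = rides["user_type"].value_counts()

# groupby().size() produces the same counts (but sorted by group label)
counts_gb = rides.groupby("user_type").size()

# groupby() goes further: any statistic per group, not just a count
mean_duration = rides.groupby("user_type")["duration_min"].mean()

# Boolean mask first, then aggregate: filter rows before grouping
long_rides = rides[rides["duration_min"] > 10]
mean_long = long_rides.groupby("user_type")["duration_min"].mean()
```

So value_counts() is a shortcut for one specific aggregation (counting), while groupby() is the general tool; filtering with a mask before grouping just means the statistics are computed over the filtered rows only.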
One question I have after this week is how these aggregation techniques scale to very large datasets. Many of the datasets we used were manageable in size, but I wonder how data scientists efficiently perform the same operations on millions of rows of data. An idea I found interesting is that many real-world questions can be answered with only a few lines of Pandas code once the data is organized correctly. It makes me realize that understanding the structure of the data is often just as important as writing the code itself. Overall, this week helped me feel much more comfortable reading, filtering, grouping, and analyzing real datasets.