CST338 Learning Log #4
This week we focused heavily on data visualization and exploratory data analysis using Pandas, Matplotlib, and Seaborn. We worked with datasets involving campaign contributions and US Census information, and I learned that choosing the correct visualization is just as important as creating the graph itself. Different types of variables require different approaches. For example, histograms worked well for continuous variables like contribution amounts or hours worked per week, while grouped and stacked bar charts were better for comparing categories such as occupations, employment status, sex, and income level.
One thing I improved on this week was using Pandas methods to summarize and prepare data before plotting. We used functions like groupby(), value_counts(), and crosstab() repeatedly. For example, this line was useful for comparing the median contribution amount for each candidate:
df.groupby('candidate')['contb_receipt_amt'].median().plot.barh()
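A minimal sketch of what that line computes, using invented toy rows (only the column names candidate and contb_receipt_amt come from the line above; the values are made up for illustration). It also suggests why the median can be a safer summary than the mean when a few very large contributions are present:

```python
import pandas as pd

# Hypothetical toy data; the amounts are invented for illustration only.
df = pd.DataFrame({
    "candidate": ["A", "A", "A", "B", "B", "B"],
    "contb_receipt_amt": [25.0, 50.0, 2500.0, 100.0, 100.0, 250.0],
})

# One median per candidate, sorted so a bar chart would read smallest to largest.
medians = df.groupby("candidate")["contb_receipt_amt"].median().sort_values()

# The mean of the same groups, for comparison: the single 2500.0 row pulls A's mean up.
means = df.groupby("candidate")["contb_receipt_amt"].mean()

print(medians)
print(means)

# medians.plot.barh() would then draw one horizontal bar per candidate.
```

In this toy data candidate A has the lower median but the higher mean, so the choice of summary statistic changes which bar looks longer.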
I also learned how normalization can help compare proportions instead of raw totals:
pd.crosstab(df['sex'], df['label'], normalize='index').plot.bar()
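With normalize='index', each row (here, each sex) is converted to proportions that sum to 1, so the income-level split can be compared across groups of different sizes. This showed the relationship between income level and sex more clearly than raw counts would have.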
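To make the mechanics concrete, here is a small sketch with invented rows. The column names sex and label come from the line above; the category values "<=50K" and ">50K" are an assumption about how the income label is coded.

```python
import pandas as pd

# Hypothetical toy data: 8 rows in one group and only 2 in the other.
df = pd.DataFrame({
    "sex": ["Male"] * 8 + ["Female"] * 2,
    "label": ["<=50K"] * 4 + [">50K"] * 4 + ["<=50K"] + [">50K"],
})

# Raw counts: the larger group dominates every column.
counts = pd.crosstab(df["sex"], df["label"])

# Row-normalized shares: each row is divided by its own total.
shares = pd.crosstab(df["sex"], df["label"], normalize="index")

print(counts)
print(shares)
```

The counts table shows four times as many rows for one group, while the normalized table shows the same 50/50 split in both groups.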
Another thing I found interesting was how preprocessing affects the quality of analysis. Dropping missing values, renaming columns, creating new variables, and filtering rows made the plots much easier to understand. For example, we used .between() to focus on a useful range of values:
df[df['contb_receipt_amt'].between(0, 500)]
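The preprocessing steps mentioned above can be chained together. The following is only a sketch on invented data: the column name cand_nm and the derived column is_large are hypothetical, chosen to illustrate renaming and creating a new variable. Note that .between() includes both endpoints by default.

```python
import pandas as pd

# Hypothetical raw data with a missing amount and a negative amount.
raw = pd.DataFrame({
    "cand_nm": ["A", "B", "A", "B", "A"],
    "contb_receipt_amt": [-50.0, 0.0, 250.0, 500.0, None],
})

# Drop rows with a missing amount, then give the column a clearer name.
clean = (
    raw.dropna(subset=["contb_receipt_amt"])
       .rename(columns={"cand_nm": "candidate"})
)

# Keep amounts from 0 to 500 inclusive; .copy() avoids modifying a view of clean.
in_range = clean[clean["contb_receipt_amt"].between(0, 500)].copy()

# Create a new variable (the 250 threshold is arbitrary, for illustration).
in_range["is_large"] = in_range["contb_receipt_amt"] >= 250

print(in_range)
```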
One concept I still want more practice with is deciding which type of categorical plot communicates information best. Sometimes grouped bar charts make comparisons easier, while stacked bar charts show distributions more clearly. I understand how to create both, but I still need experience deciding which one is more appropriate for a specific situation.
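One way to practice this decision is to draw the same table both ways and compare. The sketch below uses an invented count table (the status names and numbers are hypothetical) and assumes Matplotlib is installed; the Agg backend is chosen only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts: rows are employment status, columns are income level.
table = pd.DataFrame(
    {"<=50K": [30, 10], ">50K": [10, 10]},
    index=pd.Index(["Employed", "Self-employed"], name="status"),
)

fig, (ax_grouped, ax_stacked) = plt.subplots(1, 2, figsize=(9, 3))

# Grouped: bars sit side by side, so individual values are easy to compare.
table.plot.bar(ax=ax_grouped, title="Grouped")

# Stacked: bars sit on top of each other, so each row's total and split are visible.
table.plot.bar(ax=ax_stacked, stacked=True, title="Stacked")

fig.tight_layout()
# Call fig.savefig("some_name.png") or plt.show() to view the result.
```

As a rough rule of thumb, the grouped version tends to suit questions about individual category values, while the stacked version tends to suit questions about totals and how each total is divided.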
I also noticed that percentages can completely change the interpretation of a graph. Using normalized percentages instead of counts helped avoid misleading conclusions when category sizes were very different. I understand the mechanics of using normalize='index', but I still want to get better at understanding when percentages are more meaningful than totals.
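A small sketch of how the choice of denominator changes the story, again on invented data (the occupation names and label values are hypothetical). normalize='index' answers "within each occupation, what share earns each income level?", while normalize='columns' answers "within each income level, what share comes from each occupation?".

```python
import pandas as pd

# Hypothetical toy data: 10 Sales rows (3 high income) and 4 Tech rows (2 high income).
df = pd.DataFrame({
    "occupation": ["Sales"] * 10 + ["Tech"] * 4,
    "label": [">50K"] * 3 + ["<=50K"] * 7 + [">50K"] * 2 + ["<=50K"] * 2,
})

counts = pd.crosstab(df["occupation"], df["label"])

# Each row sums to 1: income split within an occupation.
by_row = pd.crosstab(df["occupation"], df["label"], normalize="index")

# Each column sums to 1: occupation split within an income level.
by_col = pd.crosstab(df["occupation"], df["label"], normalize="columns")

print(counts)
print(by_row)
print(by_col)
```

In this toy data Sales has more high-income rows in raw counts (3 versus 2), but Tech has the higher high-income rate (0.5 versus 0.3). Percentages are usually the more meaningful view when the question is about rates and the group sizes differ; totals matter when the question is about overall volume.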
Overall, this week made me much more comfortable with using visualization as a tool for understanding data instead of just presenting it. I feel more confident cleaning data, summarizing it with Pandas, and selecting plots that match the variable types and research questions.