Data Analysis Hacks 4/25
Pandas Hacks
My CODE IS BELOW THE QUESTIONS
- make your own data using your brain, Google or ChatGPT; it should look different from mine.
- modify my code or write your own
- output your data other than a bar graph.
- write an 850+ word essay on how pandas, python or irl, affected your life. If AI score below 85%, then -1 grading point
- answer the questions below, the more explained the better.
Questions
- What are the two primary data structures in pandas and how do they differ?
- A Series is a one-dimensional labeled array (like a single column), while a DataFrame is a two-dimensional table of rows and columns.
- How do you read a CSV file into a pandas DataFrame?
- Use the pandas read_csv() function, for example pd.read_csv('file_name.csv').
- How do you select a single column from a pandas DataFrame?
- You do the command
df['column-name']
- How do you filter rows in a pandas DataFrame based on a condition?
df[df['column-name'] == value]
- How do you group rows in a pandas DataFrame by a particular column?
df.groupby('column-name')
- How do you aggregate data in a pandas DataFrame using functions like sum and mean?
df.groupby('column-name')['other-column'].mean()
- How do you handle missing values in a pandas DataFrame?
- You can drop rows or columns that contain a missing value with the command
df.dropna()
- How do you merge two pandas DataFrames together?
pd.merge(left=dfname1, right=dfname2, left_on='column-name1', right_on='column-name2')
- How do you export a pandas DataFrame to a CSV file?
df.to_csv('file_name.csv')
- What is the difference between a Series and a DataFrame in Pandas?
- A Series is only 1-D, but a DataFrame is 2-D.
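The answers above can be tried out on a small made-up table. This is a minimal sketch, assuming two hypothetical DataFrames (`books` and `genres`) with invented values; it is not the Goodreads dataset used in my code.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
books = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'author': ['Ann', 'Ann', 'Bob', 'Bob'],
    'rating': [4.5, 3.0, None, 2.0],
})
genres = pd.DataFrame({
    'writer': ['Ann', 'Bob'],
    'genre': ['fantasy', 'mystery'],
})

# Selecting one column returns a 1-D Series; the table itself is a 2-D DataFrame
ratings = books['rating']

# Drop rows that contain a missing value (returns a new DataFrame)
clean = books.dropna()

# Filter rows with a boolean condition (note the double equals sign)
ann_books = clean[clean['author'] == 'Ann']

# Group by one column and aggregate another column with mean
mean_by_author = clean.groupby('author')['rating'].mean()

# Merge two DataFrames whose key columns have different names
merged = pd.merge(left=clean, right=genres, left_on='author', right_on='writer')

print(type(ratings).__name__, ratings.ndim)
print(type(books).__name__, books.ndim)
print(mean_by_author)
print(merged[['title', 'genre']])
```

Note that `dropna()` does not change `books` itself, which is why the result is stored in a new variable.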
import pandas as pd
# read the CSV file
df = pd.read_csv('datasets/books.csv')
df = df.drop(columns=['bookID', 'isbn', 'isbn13', 'language_code', 'publication_date', 'publisher'])
print(df.head())
# drop rows with missing values (dropna returns a new DataFrame, so reassign it)
df = df.dropna()
# map an average rating to a labeled rating group
def assign_average_rating(rating):
    if rating <= 1:
        return '0-1'
    elif rating < 2:
        return '1-2'
    elif rating < 3:
        return '2-3'
    elif rating < 4:
        return '3-4'
    else:
        return '4-5'
df['Rating-Group'] = df['average_rating'].apply(assign_average_rating)
rating_counts = df.groupby('Rating-Group')['title'].count()
# print the rating group counts
print(rating_counts)
# keep only books with an average rating of at least 3.5
df_filtered = df[df['average_rating'] >= 3.5]
# sort the data by average rating in descending order
df_sorted = df.sort_values('average_rating', ascending=False)
# group the data by author and calculate the mean rating for each author
rating_by_author = df.groupby('authors')['average_rating'].mean()
# print the filtered data
print(df_filtered.head())
# print the sorted data
print(df_sorted.head())
# print the mean rating by author
print(rating_by_author)
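import matplotlib.pyplot as plt
# bin the ratings into labeled groups and count the books in each group
rating_groups = ['0-1', '1-2', '2-3', '3-4', '4-5']
rating_bins = [0, 1, 2, 3, 4, df['average_rating'].max()]
# sort_index keeps the bars in rating order instead of ordering them by count
rating_counts = pd.cut(df['average_rating'], bins=rating_bins, labels=rating_groups, include_lowest=True).value_counts().sort_index()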
plt.bar(rating_counts.index, rating_counts.values)
plt.title('Number of books in each rating group')
plt.xlabel('Rating group')
plt.ylabel('Number of books')
plt.show()
# create a scatter plot of number of ratings vs. rating
plt.scatter(df['ratings_count'], df['average_rating'])
plt.title('Ratings count VS Average rating')
plt.xlabel('Ratings count')
plt.ylabel('Average rating')
plt.show()
In my code, I looked at the ratings in a dataset of books from Goodreads. From my bar graph, I found that most books have a rating between 3 and 4, with the 4-5 group close behind. I also plotted the number of ratings (ratings count) against the average rating and did not see a clear correlation between them! However, I did notice that books with lower ratings tend to have fewer ratings, which makes sense because a low rating would deter others from reading the book!
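One way to put a number on how strongly two columns are related is the Pearson correlation coefficient, which pandas exposes as `Series.corr`. The sketch below uses a tiny invented table instead of the Goodreads file, so the printed values say nothing about the real data; the column names just mirror the ones in my code.

```python
import pandas as pd

# Invented values for illustration; not taken from books.csv
sample = pd.DataFrame({
    'ratings_count': [10, 200, 3000, 45000, 120],
    'average_rating': [3.1, 4.0, 3.9, 4.2, 2.5],
})

# Pearson correlation: +1 or -1 means a perfect linear relationship, 0 means none
r = sample['ratings_count'].corr(sample['average_rating'])

# Rank-based (Spearman) correlation, computed as the Pearson correlation of the
# ranks, so a few books with huge ratings counts do not dominate the result
rho = sample['ratings_count'].rank().corr(sample['average_rating'].rank())

print(round(r, 2), round(rho, 2))
```

Comparing the two numbers is a quick check on whether a weak Pearson value is hiding a pattern like "low-rated books have fewer ratings."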
Data Analysis / Predictive Analysis Hacks
- How can Numpy and Pandas be used to preprocess data for predictive analysis?
- Pandas is used for data analysis tasks, while NumPy is used for working with numbers because it provides many math functions. Pandas has more functionality for more data types. Both can be used for data cleaning, standardization, and transformation to preprocess data for predictive analysis. Essentially, they get the data into a format that makes the analysis easier.
- What machine learning algorithms can be used for predictive analysis, and how do they differ?
- Linear regression predicts continuous outcomes by assuming a linear relationship between the independent and dependent variables. Decision trees model decisions and their possible results. Random forests combine many decision trees and are often used in cases with lots of data. Neural networks are loosely modeled after the human brain and how it makes decisions. Support vector machines find the best possible boundary between different classes in the data.
- Can you discuss some real-world applications of predictive analysis in different industries?
- Predicting the weather or temperature, predicting which team wins a sports match, predicting whether someone has a sickness based on a scan or image, predicting whether a user will like a video, and predicting stock markets.
- Can you explain the role of feature engineering in predictive analysis, and how it can improve model accuracy?
- Feature engineering selects and processes variables when creating a predictive analysis model. It can improve model accuracy because you can isolate key information to highlight patterns.
- How can machine learning models be deployed in real-time applications for predictive analysis?
- Machine learning models can be deployed in real-time applications. For example, when a user scrolls on TikTok and likes certain videos, the model can use those likes to recommend other videos they will probably like.
- Can you discuss some limitations of Numpy and Pandas, and when it might be necessary to use other data analysis tools?
- Pandas consumes more memory, while NumPy is memory efficient. Pandas tends to perform better when there are more rows. If you have a HUGE dataset, you may need to consider using other tools. Also, the syntax can sometimes be complex. Pandas also has a steep learning curve, poor documentation, and poor 3D matrix compatibility.
- How can predictive analysis be used to improve decision-making and optimize business processes?
- Predictive analysis can help improve decision-making and optimize business processes because it predicts the most likely outcomes for your business, which lets you tailor what you offer to your customers and therefore increase profit.
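Several of the answers above (cleaning and standardizing data, engineering a feature, fitting a linear regression) can be sketched with only pandas and NumPy. The data and column names below are invented, and `np.polyfit` stands in for a full machine learning library, so treat this as an illustration rather than a recommended pipeline.

```python
import numpy as np
import pandas as pd

# Invented data: hours studied, hours slept, and a test score
raw = pd.DataFrame({
    'hours_studied': [1.0, 2.0, None, 4.0, 5.0],
    'hours_slept':   [8.0, 7.0, 6.0, 7.0, 8.0],
    'score':         [52.0, 61.0, 70.0, 79.0, 88.0],
})

# Cleaning: fill the missing value with the column mean
data = raw.fillna({'hours_studied': raw['hours_studied'].mean()})

# Feature engineering: build a new column from existing ones
data['study_share'] = data['hours_studied'] / (data['hours_studied'] + data['hours_slept'])

# Standardization: rescale a column to mean 0 and standard deviation 1
x = data['hours_studied'].to_numpy()
x_std = (x - x.mean()) / x.std()

# Linear regression: fit score = slope * hours_studied + intercept
slope, intercept = np.polyfit(x, data['score'].to_numpy(), deg=1)

# Predict the score for a new, unseen value of hours_studied
predicted = slope * 6.0 + intercept
print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```

Pandas handles the labeled table and the missing value here, while NumPy does the arithmetic on plain arrays, which matches the split described in the first answer.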
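import matplotlib.pyplot as plt
from skimage import io
# load the image as a NumPy array
photo = io.imread('../images/waldo.jpg')
type(photo)
# show the full image
plt.imshow(photo)
photo.shape
# slice the array (rows 210-349, columns 425-499) to zoom in on Waldo
plt.imshow(photo[210:350, 425:500])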
I found WALDO (as you can see in the image above)!!!!
import numpy as np
np.sum(photo)
Another NumPy function is gradient, which computes the gradient of an N-dimensional array. Below, you can see me doing that with the photo array.
np.gradient(photo)
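To see what `np.gradient` returns, it helps to run it on an array small enough to check by hand. This is a minimal sketch, assuming a made-up 2-D array in place of the photo: for an N-dimensional input, the function returns one array per axis, each holding the estimated rate of change along that axis.

```python
import numpy as np

# Made-up 3x3 "image": values grow by 10 down each column and by 1 along each row
tiny = np.array([[ 0,  1,  2],
                 [10, 11, 12],
                 [20, 21, 22]])

# One gradient array per axis: rows (axis 0) first, then columns (axis 1)
d_rows, d_cols = np.gradient(tiny)

print(d_rows)
print(d_cols)
```

If the photo loads as a 3-D array of shape (height, width, channels), the same call returns three arrays, one each for rows, columns, and color channels.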