Pandas Hacks

My CODE IS BELOW THE QUESTIONS

  1. Make your own data using your brain, Google, or ChatGPT. It should look different from mine.
  2. Modify my code or write your own.
  3. Output your data as something other than a bar graph.
  4. Write an 850+ word essay on how pandas (Python or IRL) affected your life. If the AI score is below 85%, then -1 grading point.
  5. Answer the questions below. The more explanation, the better.

Questions

  1. What are the two primary data structures in pandas and how do they differ?
    • Series and DataFrame. A Series is one-dimensional (a single labeled column of values), while a DataFrame is two-dimensional (a table of rows and columns).
  2. How do you read a CSV file into a pandas DataFrame?
    • Use the pandas read_csv() function, for example df = pd.read_csv('file_name.csv')
  3. How do you select a single column from a pandas DataFrame?
    • Use df['column-name'], which returns that column as a Series.
  4. How do you filter rows in a pandas DataFrame based on a condition?
    • df[df['column-name'] == value] (a comparison needs ==, not =)
  5. How do you group rows in a pandas DataFrame by a particular column?
    • df.groupby('column-name'), optionally followed by the column to summarize, such as df.groupby('column-name')['other-column']
  6. How do you aggregate data in a pandas DataFrame using functions like sum and mean?
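    • df.groupby('column-name').mean() or df.groupby('column-name').sum() (groupby is a method, so it takes parentheses, not square brackets)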
  7. How do you handle missing values in a pandas DataFrame?
    • You can drop rows or columns that contain a missing value with df.dropna(), or replace missing values with df.fillna(value).
  8. How do you merge two pandas DataFrames together?
    • pd.merge(left=dfname1, right=dfname2, left_on='column-name1', right_on='column-name2')
  9. How do you export a pandas DataFrame to a CSV file?
    • df.to_csv('file_name.csv')
  10. What is the difference between a Series and a DataFrame in Pandas?
    • A Series is only 1-D, but a DataFrame is 2-D. Each column of a DataFrame is itself a Series.
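
The answers above can be tried out on a tiny table. This is only a sketch: the `books` and `authors` frames and all of their values are made up for illustration and are not the Goodreads data used below.

```python
import pandas as pd

# Hypothetical data, invented for this example only.
books = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "author": ["Ann", "Bob", "Ann", "Bob"],
    "rating": [4.5, 3.0, None, 4.0],
})
authors = pd.DataFrame({"name": ["Ann", "Bob"], "country": ["US", "UK"]})

# Selecting one column gives a 1-D Series.
ratings = books["rating"]

# Filtering rows: the condition uses a comparison operator, never a single =.
high = books[books["rating"] >= 4.0]

# Grouping and aggregating: mean rating per author (missing values are skipped).
mean_by_author = books.groupby("author")["rating"].mean()

# Handling missing values: drop every row that contains one.
cleaned = books.dropna()

# Merging two DataFrames on differently named key columns.
merged = pd.merge(left=books, right=authors, left_on="author", right_on="name")

print(type(ratings).__name__)    # Series
print(list(high["title"]))       # ['A', 'D']
print(mean_by_author.to_dict())  # {'Ann': 4.5, 'Bob': 3.5}
print(len(cleaned), merged.shape)  # 3 (4, 5)
```

Note that dropna() and the filter both return new DataFrames, so `books` itself still has all four rows afterwards.
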
import pandas as pd

# read the CSV file
df = pd.read_csv('datasets/books.csv')

df = df.drop(columns=['bookID', 'isbn', 'isbn13', 'language_code', 'publication_date', 'publisher'])

print(df.head())
                                               title  \
0  Harry Potter and the Half-Blood Prince (Harry ...   
1  Harry Potter and the Order of the Phoenix (Har...   
2  Harry Potter and the Chamber of Secrets (Harry...   
3  Harry Potter and the Prisoner of Azkaban (Harr...   
4  Harry Potter Boxed Set  Books 1-5 (Harry Potte...   

                      authors  average_rating    num_pages  ratings_count  \
0  J.K. Rowling/Mary GrandPré            4.57          652        2095690   
1  J.K. Rowling/Mary GrandPré            4.49          870        2153167   
2                J.K. Rowling            4.42          352           6333   
3  J.K. Rowling/Mary GrandPré            4.56          435        2339585   
4  J.K. Rowling/Mary GrandPré            4.78         2690          41428   

   text_reviews_count  
0               27591  
1               29221  
2                 244  
3               36325  
4                 164  
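# Note: dropna() returns a new DataFrame. The result is not assigned here,
# so df itself is left unchanged.
df.dropna()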


def assign_average_rating(rating):
    if rating <= 1:
        return '0-1'
    elif rating < 2:
        return '1-2'
    elif rating < 3:
        return '2-3'
    elif rating < 4:
        return '3-4'
    else:
        return '4-5'

# the column is called 'Page-Group', but it holds the rating group of each book
df['Page-Group'] = df['average_rating'].apply(assign_average_rating)

rating_counts = df.groupby('Page-Group')['title'].count()

# print the number of books in each rating group
print(rating_counts)
Page-Group
0-1       8
2-3      15
3-4    2154
4-5    1891
Name: title, dtype: int64
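
The same kind of grouping can also be done without a hand-written function. The sketch below uses pd.cut on a few made-up ratings (not the Goodreads column). It assumes left-closed bins, so a rating of exactly 2.0 lands in '2-3' like in the function above; the one difference is that a rating of exactly 1.0 would land in '1-2' here instead of '0-1'.

```python
import numpy as np
import pandas as pd

# Made-up ratings for illustration only.
sample_ratings = pd.Series([0.5, 1.5, 2.0, 3.99, 4.0, 5.0])

labels = ["0-1", "1-2", "2-3", "3-4", "4-5"]

# right=False makes every bin include its left edge and exclude its right edge.
# The last edge is infinity so that a perfect 5.0 still falls in the '4-5' group.
groups = pd.cut(sample_ratings, bins=[0, 1, 2, 3, 4, np.inf], labels=labels, right=False)

print(groups.astype(str).tolist())
# ['0-1', '1-2', '2-3', '3-4', '4-5', '4-5']

# Count the books per group, keeping the groups in their natural order.
counts = groups.value_counts().sort_index()
print(counts.to_dict())
# {'0-1': 1, '1-2': 1, '2-3': 1, '3-4': 1, '4-5': 2}
```
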
df_filtered = df[df['average_rating'] >= 3.5]

# sort the data by average rating in descending order
df_sorted = df.sort_values('average_rating', ascending=False)

# group the data by author and calculate the mean rating for each author
rating_by_author = df.groupby('authors')['average_rating'].mean()

# print the filtered data
print(df_filtered.head())

# print the sorted data
print(df_sorted.head())

# print the mean rating by author
print(rating_by_author)
                                               title  \
0  Harry Potter and the Half-Blood Prince (Harry ...   
1  Harry Potter and the Order of the Phoenix (Har...   
2  Harry Potter and the Chamber of Secrets (Harry...   
3  Harry Potter and the Prisoner of Azkaban (Harr...   
4  Harry Potter Boxed Set  Books 1-5 (Harry Potte...   

                      authors  average_rating    num_pages  ratings_count  \
0  J.K. Rowling/Mary GrandPré            4.57          652        2095690   
1  J.K. Rowling/Mary GrandPré            4.49          870        2153167   
2                J.K. Rowling            4.42          352           6333   
3  J.K. Rowling/Mary GrandPré            4.56          435        2339585   
4  J.K. Rowling/Mary GrandPré            4.78         2690          41428   

   text_reviews_count Page-Group  
0               27591        4-5  
1               29221        4-5  
2                 244        4-5  
3               36325        4-5  
4                 164        4-5  
                                                  title  \
624   Comoediae 1: Acharenses/Equites/Nubes/Vespae/P...   
1243  Middlesex Borough (Images of America: New Jersey)   
786                   Willem de Kooning: Late Paintings   
855   Literature Circle Guide: Bridge to Terabithia:...   
4     Harry Potter Boxed Set  Books 1-5 (Harry Potte...   

                                   authors  average_rating    num_pages  \
624    Aristophanes/F.W. Hall/W.M. Geldart            5.00          364   
1243  Middlesex Borough Heritage Committee            5.00          128   
786        Julie Sylvester/David Sylvester            5.00           83   
855                         Tara MacCarthy            5.00           32   
4               J.K. Rowling/Mary GrandPré            4.78         2690   

      ratings_count  text_reviews_count Page-Group  
624               0                   0        4-5  
1243              2                   0        4-5  
786               1                   0        4-5  
855               4                   1        4-5  
4             41428                 164        4-5  
authors
A.S. Byatt                                            3.65
Abdul Rahman Munif/Peter Theroux                      4.13
Abigail Adams/Frank Shuffelton                        4.14
Abraham Lincoln/Michael McCurdy                       4.53
Adam Ginsberg                                         3.48
                                                      ... 
Zolar                                                 3.68
Zoë Heller                                            3.71
Åsne Seierstad/Ingrid Christopherson                  3.77
Émile Zola/Ernest Alfred Vizetelly/Henry Vizetelly    3.91
Éric-Emmanuel Schmitt                                 3.82
Name: average_rating, Length: 2597, dtype: float64
import matplotlib.pyplot as plt


rating_groups = ['0-1', '1-2', '2-3', '3-4', '4-5']
rating_counts = pd.cut(df['average_rating'], bins=[0, 1, 2, 3, 4, df['average_rating'].max()], labels=rating_groups, include_lowest=True).value_counts()
plt.bar(rating_counts.index, rating_counts.values)
plt.title('Number of books in each rating group')
plt.xlabel('Rating group')
plt.ylabel('Number of books')
plt.show()


# create a scatter plot of number of ratings vs. rating
plt.scatter(df['ratings_count'], df['average_rating'])
plt.title('Ratings count VS Average rating')
plt.xlabel('Ratings count')
plt.ylabel('Average rating')
plt.show()

In my code, I looked at the ratings in a dataset of books from Goodreads. My bar graph showed that most books have an average rating between 3 and 4, with the 4-5 group close behind. I also plotted the number of ratings (ratings count) against the average rating and did not see a clear correlation between them. However, I did notice that most books with lower ratings tend to have fewer ratings, which makes sense because a low rating would deter others from reading the book!
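
A scatter plot only shows a relationship by eye. One way to put a number on it is the Pearson correlation coefficient, which pandas computes with Series.corr. The sketch below uses a handful of made-up rows, not the real Goodreads values, so the number it prints says nothing about the actual dataset. The log10 step is an extra assumption on my part, added because ratings counts span several orders of magnitude.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; the column names match the ones used above.
sample = pd.DataFrame({
    "ratings_count": [10, 200, 3000, 40000, 500000],
    "average_rating": [3.2, 4.1, 3.8, 4.0, 3.9],
})

# Pearson correlation on the raw counts (always between -1 and 1).
r_raw = sample["ratings_count"].corr(sample["average_rating"])

# Same calculation on log10 of the counts, so the largest books do not dominate.
log_counts = np.log10(sample["ratings_count"])
r_log = log_counts.corr(sample["average_rating"])

print(round(r_raw, 2), round(r_log, 2))
```

A value near 0 would support the "no clear correlation" reading, while a value near 1 or -1 would suggest a strong linear relationship.
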

Data Analysis / Predictive Analysis Hacks

  1. How can Numpy and Pandas be used to preprocess data for predictive analysis?
    • Pandas is used for data analysis tasks, while NumPy is used for working with numbers because it provides many math functions. Pandas has more functionality for more data types. Both can be used for data cleaning, standardization, and transformation to preprocess data for predictive analysis. Essentially, they put the data into a format that makes the analysis easier.
  2. What machine learning algorithms can be used for predictive analysis, and how do they differ?
    • Linear regression predicts a continuous outcome by fitting a linear relationship between the independent and dependent variables. Decision trees model a series of decisions and their possible results. Random forests combine many decision trees and are often used on larger datasets. Neural networks are loosely modeled on how the human brain processes information. Support vector machines find the best possible boundary between different classes in the data.
  3. Can you discuss some real-world applications of predictive analysis in different industries?
    • Predicting the weather or temperature, predicting which team wins a sports match, predicting whether someone has an illness based on a scan or image, predicting whether a user will like a video, and predicting stock markets.
  4. Can you explain the role of feature engineering in predictive analysis, and how it can improve model accuracy?
    • Feature engineering selects and processes variables when creating a predictive analysis model. It can improve model accuracy because you can isolate key information to highlight patterns.
  5. How can machine learning models be deployed in real-time applications for predictive analysis?
    • Machine learning models can be deployed in real-time applications. For example, when a user is scrolling on TikTok and liking certain videos, the model can use those likes right away to recommend other videos the user will probably like.
  6. Can you discuss some limitations of Numpy and Pandas, and when it might be necessary to use other data analysis tools?
    • Pandas consumes more memory, while NumPy is more memory-efficient. Pandas performs better when there are more rows. If you have a HUGE dataset, you may need to consider using other tools. Also, the syntax can sometimes be complex. Pandas also has a steep learning curve, poor documentation, and poor 3D matrix compatibility.
  7. How can predictive analysis be used to improve decision-making and optimize business processes?
    • Predictive analysis can improve decision-making and optimize business processes because it predicts the most likely outcomes for your business. That lets you tailor your products to your customers and therefore increase profit.
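
To make the first two answers more concrete, here is a minimal sketch of preprocessing with pandas and NumPy followed by a simple linear regression. The `hours` and `score` columns and their values are made up, and np.polyfit stands in for a proper machine learning library just to keep the example short.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value; score follows score = 2 * hours + 1.
data = pd.DataFrame({
    "hours": [1.0, 2.0, np.nan, 4.0, 5.0],
    "score": [3.0, 5.0, 7.0, 9.0, 11.0],
})

# Cleaning: fill the missing value with the column mean (mean() skips NaN).
data["hours"] = data["hours"].fillna(data["hours"].mean())

# Standardization: rescale the feature to mean 0 and standard deviation 1.
hours_mean = data["hours"].mean()
hours_std = data["hours"].std()
data["hours_z"] = (data["hours"] - hours_mean) / hours_std

# Linear regression: fit a straight line (degree 1) to the standardized feature.
slope, intercept = np.polyfit(data["hours_z"].to_numpy(), data["score"].to_numpy(), deg=1)

# Prediction: new inputs must go through the same standardization first.
new_hours = 6.0
predicted = slope * (new_hours - hours_mean) / hours_std + intercept
print(round(predicted, 2))  # 13.0
```

The filled-in value happens to be 3.0 here, which keeps the data perfectly linear. With real data, filling with the mean is only one of several options and can distort the relationship.
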

Numpy Hacks

For your hacks, use matplotlib and numpy to slice this image to display Waldo. Also find and display one other numpy function and blog about what it is used for.

Displaying Waldo

from skimage import io

photo = io.imread('../images/waldo.jpg')
type(photo)

import matplotlib.pyplot as plt

plt.imshow(photo)
<matplotlib.image.AxesImage at 0x7ff4344637f0>

photo.shape
(461, 700, 3)
plt.imshow(photo[210:350, 425:500])
<matplotlib.image.AxesImage at 0x7ff43431c6a0>

I found WALDO (as you can see in the image above)!!!!
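
The slice works because the image is a NumPy array indexed as [rows, columns, color channels]. Here is a small sketch of the same slice on a blank stand-in array with the same shape as the photo, so it runs without the image file. The `image` and `crop` names are just for this example.

```python
import numpy as np

# Stand-in for the photo: a black image with the same shape as waldo.jpg above.
image = np.zeros((461, 700, 3), dtype=np.uint8)

# Rows (top to bottom) come first, columns (left to right) second.
# The color axis is not sliced, so all three channels are kept.
crop = image[210:350, 425:500]
print(crop.shape)  # (140, 75, 3)

# Basic slicing returns a view, so writing to the crop changes the original too.
crop[:] = 255
print(image[210, 425].tolist())  # [255, 255, 255]
print(image[0, 0].tolist())      # [0, 0, 0]
```

If you want to edit the cropped region without touching the original, call `.copy()` on the slice first.
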

Numpy function

The NumPy sum function adds up array elements over a given axis. If no axis is given, it adds up every element in the array.

Below, I display the sum of all the elements in the photo array.

import numpy as np
np.sum(photo)
151011654
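
The axis argument is what makes sum more useful than a single grand total. Below is a small sketch on a made-up array shaped like a tiny 2 x 4 image with 3 color channels (the `pixels` name and values are only for illustration).

```python
import numpy as np

# A tiny fake "image": 2 rows, 4 columns, 3 channels, filled with 0..23.
pixels = np.arange(24).reshape(2, 4, 3)

# No axis: one total over every element.
total = np.sum(pixels)
print(total)  # 276

# axis=(0, 1): collapse rows and columns, leaving one total per color channel.
per_channel = np.sum(pixels, axis=(0, 1))
print(per_channel.tolist())  # [84, 92, 100]

# axis=(1, 2): collapse columns and channels, leaving one total per row.
per_row = np.sum(pixels, axis=(1, 2))
print(per_row.tolist())  # [66, 210]
```
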

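Another NumPy function is gradient, which computes the gradient (the rate of change between neighboring values) of an N-dimensional array. It returns one array per axis, which is why the output for the 3-dimensional photo array below contains three arrays.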

np.gradient(photo)
[array([[[ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ],
         ...,
         [ 1. ,  1. ,  1. ],
         [ 1. ,  1. ,  1. ],
         [ 1. ,  1. ,  1. ]],
 
        [[ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ],
         ...,
         [ 1.5,  1.5,  1.5],
         [ 1.5,  1.5,  1.5],
         [ 1.5,  1.5,  1.5]],
 
        [[-1. ,  0. , -0.5],
         [-1. ,  0. , -0.5],
         [-1. ,  0. , -0.5],
         ...,
         [ 1. ,  1. ,  1. ],
         [ 1. ,  1. ,  1. ],
         [ 1. ,  1. ,  1. ]],
 
        ...,
 
        [[-0.5, -1. ,  1.5],
         [ 0. ,  0. ,  1. ],
         [ 0. ,  0.5, -1. ],
         ...,
         [ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ]],
 
        [[-0.5, -0.5,  0.5],
         [ 0. ,  0. ,  1. ],
         [ 0. ,  0.5, -1. ],
         ...,
         [ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ]],
 
        [[ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ],
         ...,
         [ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ]]]),
 array([[[ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ],
         ...,
         [ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ]],
 
        [[ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ],
         ...,
         [ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ]],
 
        [[ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ],
         ...,
         [ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ]],
 
        ...,
 
        [[-1. , -1. , -1. ],
         [-0.5, -1. ,  1.5],
         [-0.5, -1. ,  1.5],
         ...,
         [-0.5, -0.5, -0.5],
         [-1. , -1. , -1. ],
         [-1. , -1. , -1. ]],
 
        [[ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ],
         ...,
         [-0.5, -0.5, -0.5],
         [-1. , -1. , -1. ],
         [-1. , -1. , -1. ]],
 
        [[ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ],
         [ 0. ,  0. ,  0. ],
         ...,
         [-0.5, -0.5, -0.5],
         [-1. , -1. , -1. ],
         [-1. , -1. , -1. ]]]),
 array([[[  37. ,   21.5,    6. ],
         [  37. ,   21.5,    6. ],
         [  37. ,   21.5,    6. ],
         ...,
         [  76. ,   36. ,   -4. ],
         [  76. ,   36. ,   -4. ],
         [  76. ,   36. ,   -4. ]],
 
        [[  37. ,   21.5,    6. ],
         [  37. ,   21.5,    6. ],
         [  37. ,   21.5,    6. ],
         ...,
         [  76. ,   36. ,   -4. ],
         [  76. ,   36. ,   -4. ],
         [  76. ,   36. ,   -4. ]],
 
        [[  37. ,   21.5,    6. ],
         [  37. ,   21.5,    6. ],
         [  37. ,   21.5,    6. ],
         ...,
         [  76. ,   36. ,   -4. ],
         [  76. ,   36. ,   -4. ],
         [  76. ,   36. ,   -4. ]],
 
        ...,
 
        [[  15. ,  -52. , -119. ],
         [  15. ,  -52. , -119. ],
         [  14. ,  -50. , -114. ],
         ...,
         [  15. ,  -52. , -119. ],
         [  15. ,  -52. , -119. ],
         [  15. ,  -52. , -119. ]],
 
        [[  15. ,  -51. , -117. ],
         [  15. ,  -51. , -117. ],
         [  15. ,  -51. , -117. ],
         ...,
         [  15. ,  -52. , -119. ],
         [  15. ,  -52. , -119. ],
         [  15. ,  -52. , -119. ]],
 
        [[  15. ,  -51. , -117. ],
         [  15. ,  -51. , -117. ],
         [  15. ,  -51. , -117. ],
         ...,
         [  15. ,  -52. , -119. ],
         [  15. ,  -52. , -119. ],
         [  15. ,  -52. , -119. ]]])]
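
The output above is hard to read because the photo is so large. A sketch on two tiny made-up arrays shows what gradient actually calculates: the difference between neighbors at the edges, and half the difference between the two neighbors everywhere else.

```python
import numpy as np

# 1-D case: a single array of slopes comes back.
values = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
slope = np.gradient(values)
print(slope.tolist())  # [1.0, 1.5, 2.5, 3.5, 4.0]
# edges: 2 - 1 = 1 and 11 - 7 = 4; middle: (4 - 1) / 2 = 1.5, and so on

# 2-D case: one result per axis (change down the rows, change across the columns).
grid = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])
d_rows, d_cols = np.gradient(grid)
print(d_rows.tolist())  # [[3.0, 3.0, 3.0], [3.0, 3.0, 3.0]]
print(d_cols.tolist())  # [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
```

For an image, large gradient values mark places where the color changes quickly, such as edges.
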