Pandas and DataFrames

In this lesson we will be exploring data analysis using Pandas.

  • College Board talks about ideas like
    • Tools. "the ability to process data depends on users capabilities and their tools"
    • Combining Data. "combine county data sets"
    • Status on Data. "determining the artist with the greatest attendance during a particular month"
    • Data poses challenge. "the need to clean data", "incomplete data"
  • From Pandas Overview -- When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame.

DataFrame

'''Pandas is used to gather data sets through its DataFrames implementation'''
import pandas as pd

Cleaning Data

When looking at a data set, check to see what data needs to be cleaned. Examples include:

  • Missing Data Points
  • Invalid Data
  • Inaccurate Data

Run the following code to see what needs to be cleaned.

df = pd.read_json('grade.json')

print(df)
# What part of the data set needs to be cleaned?
# From PBL learning, what is a good time to clean data?  Hint, remember Garbage in, Garbage out?
   Student ID Year in School   GPA
0         123             12  3.57
1         246             10  4.00
2         578             12  2.78
3         469             11  3.45
4         324         Junior  4.75
5         313             20  3.33
6         145             12  2.95
7         167             10  3.90
8         235      9th Grade  3.15
9         nil              9  2.80
10        469             11  3.45
11        456             10  2.75
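
One possible way to clean the table above is sketched below. It is only an illustration: the data is retyped by hand so the code runs without grade.json, and the label mapping ("Junior" to 11, "9th Grade" to 9) and the valid grade range of 9 through 12 are assumptions, not rules from the lesson. Whether a GPA of 4.75 is invalid depends on the GPA scale in use, so the sketch leaves it alone.

```python
import pandas as pd

# In-memory copy of the table printed above, so the sketch runs without grade.json
raw = pd.DataFrame({
    "Student ID": [123, 246, 578, 469, 324, 313, 145, 167, 235, "nil", 469, 456],
    "Year in School": [12, 10, 12, 11, "Junior", 20, 12, 10, "9th Grade", 9, 11, 10],
    "GPA": [3.57, 4.00, 2.78, 3.45, 4.75, 3.33, 2.95, 3.90, 3.15, 2.80, 3.45, 2.75],
})
clean = raw.copy()

# Missing data: "nil" is not a number, so it becomes NaN and the row is dropped
clean["Student ID"] = pd.to_numeric(clean["Student ID"], errors="coerce")
clean = clean.dropna(subset=["Student ID"])
clean["Student ID"] = clean["Student ID"].astype(int)

# Inconsistent data: translate text labels to grade numbers (assumed mapping)
year_labels = {"Junior": 11, "9th Grade": 9}
clean["Year in School"] = pd.to_numeric(
    clean["Year in School"].map(lambda value: year_labels.get(value, value))
)

# Invalid data: keep only grades 9 through 12 (assumed valid range)
clean = clean[clean["Year in School"].between(9, 12)]

# Duplicate data: identical rows are kept once
clean = clean.drop_duplicates().reset_index(drop=True)

print(clean)
```

Cleaning before any analysis keeps the later sorts, filters, and max/min lookups from being skewed by bad rows (Garbage in, Garbage out).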

Extracting Info

Take a look at some features of the Pandas library that extract info from the dataset.

DataFrame Extract Column

print(df[['GPA']])

print()

#try two columns and remove the index from print statement
print(df[['Student ID','GPA']].to_string(index=False))
     GPA
0   3.57
1   4.00
2   2.78
3   3.45
4   4.75
5   3.33
6   2.95
7   3.90
8   3.15
9   2.80
10  3.45
11  2.75

Student ID  GPA
       123 3.57
       246 4.00
       578 2.78
       469 3.45
       324 4.75
       313 3.33
       145 2.95
       167 3.90
       235 3.15
       nil 2.80
       469 3.45
       456 2.75
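
The double brackets in df[['GPA']] matter. As a small sketch (using a hand-typed three-row table rather than grade.json), single brackets return a Series and double brackets return a DataFrame with one column:

```python
import pandas as pd

# Small hand-typed table for illustration
df = pd.DataFrame({"Student ID": [123, 246, 578], "GPA": [3.57, 4.00, 2.78]})

gpa_series = df["GPA"]     # single brackets: a Series
gpa_frame = df[["GPA"]]    # double brackets: a DataFrame with one column

print(type(gpa_series).__name__)
print(type(gpa_frame).__name__)

# A Series is handy for quick calculations on one column
gpa_mean = gpa_series.mean()
print(gpa_mean)
```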

DataFrame Sort

print(df.sort_values(by=['GPA']))

print()

#sort the values in reverse order
print(df.sort_values(by=['GPA'], ascending=False))
   Student ID Year in School   GPA
11        456             10  2.75
2         578             12  2.78
9         nil              9  2.80
6         145             12  2.95
8         235      9th Grade  3.15
5         313             20  3.33
3         469             11  3.45
10        469             11  3.45
0         123             12  3.57
7         167             10  3.90
1         246             10  4.00
4         324         Junior  4.75

   Student ID Year in School   GPA
4         324         Junior  4.75
1         246             10  4.00
7         167             10  3.90
0         123             12  3.57
3         469             11  3.45
10        469             11  3.45
5         313             20  3.33
8         235      9th Grade  3.15
6         145             12  2.95
9         nil              9  2.80
2         578             12  2.78
11        456             10  2.75
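
Two rows above share a GPA of 3.45, so their relative order is decided only by where they sat in the table. A sketch of one way to break ties is to sort by more than one column; the small table here is hand-typed for illustration:

```python
import pandas as pd

# Small hand-typed table with a tie on GPA
df = pd.DataFrame({
    "Student ID": [123, 469, 246, 313],
    "GPA": [3.57, 3.45, 4.00, 3.45],
})

# GPA from high to low; ties broken by Student ID from low to high
ranked = df.sort_values(by=["GPA", "Student ID"], ascending=[False, True])
ranked = ranked.reset_index(drop=True)

print(ranked)
```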

DataFrame Selection or Filter

print(df[df.GPA > 3.00])
   Student ID Year in School   GPA
0         123             12  3.57
1         246             10  4.00
3         469             11  3.45
4         324         Junior  4.75
5         313             20  3.33
7         167             10  3.90
8         235      9th Grade  3.15
10        469             11  3.45
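
Filters can also be combined. The sketch below uses a hand-typed table with numeric years (it assumes the year column has already been cleaned). Each condition goes in its own parentheses, with & for "and" and | for "or":

```python
import pandas as pd

# Small hand-typed table; years are assumed to be cleaned to numbers already
df = pd.DataFrame({
    "Student ID": [123, 246, 578, 145],
    "Year in School": [12, 10, 12, 12],
    "GPA": [3.57, 4.00, 2.78, 2.95],
})

# Both conditions must be true
seniors_above_3 = df[(df["GPA"] > 3.0) & (df["Year in School"] == 12)]

# Either condition may be true
sophomores_or_below_3 = df[(df["Year in School"] == 10) | (df["GPA"] < 3.0)]

print(seniors_above_3)
print(sophomores_or_below_3)
```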

DataFrame Selection Max and Min

print(df[df.GPA == df.GPA.max()])
print()
print(df[df.GPA == df.GPA.min()])
  Student ID Year in School   GPA
4        324         Junior  4.75

   Student ID Year in School   GPA
11        456             10  2.75
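
As an alternative sketch, idxmax, nlargest, and nsmallest reach the same kind of rows without writing a comparison. The table is hand-typed for illustration:

```python
import pandas as pd

# Small hand-typed table for illustration
df = pd.DataFrame({
    "Student ID": [123, 246, 578, 324],
    "GPA": [3.57, 4.00, 2.78, 4.75],
})

# idxmax returns the index label of the first row holding the maximum
top_row = df.loc[df["GPA"].idxmax()]

# nlargest and nsmallest return whole rows, already ordered
top_two = df.nlargest(2, "GPA")
lowest = df.nsmallest(1, "GPA")

print(top_row)
print(top_two)
print(lowest)
```

Note that idxmax returns a single row even when several rows tie for the maximum, while the df[df.GPA == df.GPA.max()] form returns all of them.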

Create your own DataFrame

Using Pandas allows you to create your own DataFrame in Python.

Python Dictionary to Pandas DataFrame

import pandas as pd

#the data can be stored as a python dictionary
#(named data rather than dict, so Python's built-in dict type is not shadowed)
data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}
#stores the data in a data frame
print("-------------Dict_to_DF------------------")
df = pd.DataFrame(data)
print(df)

print("----------Dict_to_DF_labels--------------")

#or with the index argument, you can label rows.
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
-------------Dict_to_DF------------------
   calories  duration
0       420        50
1       380        40
2       390        45
----------Dict_to_DF_labels--------------
      calories  duration
day1       420        50
day2       380        40
day3       390        45

Examine DataFrame Rows

print("-------Examine Selected Rows---------")
#use a list for multiple labels:
print(df.loc[["day1", "day3"]])

#refer to the row index:
print("--------Examine Single Row-----------")
print(df.loc["day1"])
-------Examine Selected Rows---------
      calories  duration
day1       420        50
day3       390        45
--------Examine Single Row-----------
calories    420
duration     50
Name: day1, dtype: int64
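
loc looks rows up by label. A short sketch of its position-based counterpart, iloc, on the same three-day table:

```python
import pandas as pd

df = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

first_by_label = df.loc["day1"]        # row chosen by its label
first_by_position = df.iloc[0]         # row chosen by its position
day2_calories = df.loc["day2", "calories"]   # one cell: row label, column label
last_two = df.iloc[-2:]                # slices work on positions too

print(first_by_position)
print(day2_calories)
print(last_two)
```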

Pandas DataFrame Information

print(df.info())
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, day1 to day3
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   calories  3 non-null      int64
 1   duration  3 non-null      int64
dtypes: int64(2)
memory usage: 180.0+ bytes
None
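
The None at the end of the output above appears because info() prints its report itself and returns None, which print then displays. A small sketch of calling it directly, or capturing the report as text through the buf parameter:

```python
import io
import pandas as pd

df = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

# info() writes the report to the screen and returns None
result = df.info()

# To keep the report as a string, hand info() a text buffer
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
```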

Example of larger data set

Pandas can read CSV and many other types of files. Run the following code to see more features with a larger data set.

import pandas as pd

#read csv and sort 'Duration' largest to smallest
df = pd.read_csv('files/data.csv').sort_values(by=['Duration'], ascending=False)

print("--Duration Top 10---------")
print(df.head(10))

print("--Duration Bottom 10------")
print(df.tail(10))
--Duration Top 10---------
     Duration  Pulse  Maxpulse  Calories
69        300    108       143    1500.2
79        270    100       131    1729.0
109       210    137       184    1860.4
60        210    108       160    1376.0
106       180     90       120     800.3
90        180    101       127     600.1
65        180     90       130     800.4
61        160    110       137    1034.4
62        160    109       135     853.0
67        150    107       130     816.0
--Duration Bottom 10------
     Duration  Pulse  Maxpulse  Calories
68         20    106       136     110.4
100        20     95       112      77.7
89         20     83       107      50.3
135        20    136       156     189.0
94         20    150       171     127.4
95         20    151       168     229.4
139        20    141       162     222.4
64         20    110       130     131.4
112        15    124       139     124.2
93         15     80       100      50.5

APIs are a Source for Writing Programs with Data

3rd Party APIs are a great source for creating Pandas Data Frames.

  • Data can be fetched and the resulting JSON can be placed into a Data Frame
  • Observe the output; it looks very similar to a database table
'''Pandas can be used to analyze data'''
import pandas as pd
import requests

def fetch():
    '''Obtain data from an endpoint'''
    url = "https://flask.nighthawkcodingsociety.com/api/covid/"
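    response = requests.get(url)  # named response so it does not shadow the fetch function
    data = response.json()        # named data so it does not shadow the json module name

    # filter data for requirement
    df = pd.DataFrame(data['countries_stat'])  # filter endpoint for country stats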
    print(df.loc[0:5, 'country_name':'deaths']) # show row 0 through 5 and columns country_name through deaths
    
fetch()
  country_name       cases     deaths
0          USA  82,649,779  1,018,316
1        India  43,057,545    522,193
2       Brazil  30,345,654    662,663
3       France  28,244,977    145,020
4      Germany  24,109,433    134,624
5           UK  21,933,206    173,352
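
The commas in the cases and deaths columns suggest the values arrive as text, not numbers (an assumption based only on the printed output). If so, sorting or doing math on them would not behave numerically. The sketch below uses three hand-copied rows instead of calling the endpoint, and the death_rate column is a made-up name for illustration:

```python
import pandas as pd

# Three rows copied by hand from the output above; no network call is made
sample = [
    {"country_name": "USA", "cases": "82,649,779", "deaths": "1,018,316"},
    {"country_name": "India", "cases": "43,057,545", "deaths": "522,193"},
    {"country_name": "Brazil", "cases": "30,345,654", "deaths": "662,663"},
]
df = pd.DataFrame(sample)

# Strip the thousands separators, then convert the text to integers
for col in ["cases", "deaths"]:
    df[col] = df[col].str.replace(",", "", regex=False).astype(int)

# With real numbers, arithmetic and numeric sorting work as expected
df["death_rate"] = df["deaths"] / df["cases"]
by_deaths = df.sort_values(by="deaths", ascending=False)

print(by_deaths)
```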

Hacks

Early Seed award

  • Add this Blog to your own Blogging site.
  • Have all lecture files saved to your files directory before Tech Talk starts. Have data.csv open in vscode. Don't tell anyone. Show to Teacher.

AP Prep

  • Add this Blog to your own Blogging site. In the Blog, add notes and observations on each code cell.
  • In the Blog, add College Board practice problems for 2.3.

Over the next 4 weeks, Teachers want you to improve your understanding of data. Look at this blog and others on Unit 2. Your goal is to find ideas that differentiate your individual College Board project.

  • Create or find your own dataset. The suggestion is to use a JSON file; integrating it with your PBL project would be amazing.

  • When choosing a data set, think about the following...

    • Does it have a good sample size?
    • Is there bias in the data?
    • Does the data set need to be cleaned?
    • What is the purpose of the data set?
    • ...
  • Continue this Blog using Pandas to extract info from that dataset (ex. max, min, mean, median, mode, etc.)
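
As a starting point for the statistics listed in the last bullet, here is a minimal sketch on a hand-typed GPA column. Swap in a column from your own dataset:

```python
import pandas as pd

# Hand-typed values for illustration; replace with a column from your own data
gpa = pd.Series([3.57, 4.00, 2.78, 3.45, 3.45], name="GPA")

stats = {
    "max": gpa.max(),
    "min": gpa.min(),
    "mean": gpa.mean(),
    "median": gpa.median(),
    # mode() returns a Series because several values can tie; take the first
    "mode": gpa.mode().iloc[0],
}

for name, value in stats.items():
    print(name, round(value, 2))
```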

Hack Helpers

Here is how Mort started on this assignment by asking ChatGPT ... Regarding Python Pandas, what are some data sets that would be good for learning Pandas?

  • There are many data sets that are suitable for learning pandas, depending on your interests and the skills you want to develop.

    Here are some suggestions...

    • Titanic Dataset: This is a classic dataset for data analysis and machine learning, and is often used as an introduction to pandas. The dataset contains information about passengers on the Titanic, including their demographics, ticket class, and survival status.
    • Iris Dataset: This is another classic dataset that is often used in machine learning and data analysis courses. The dataset contains information about iris flowers, including their sepal length, sepal width, petal length, and petal width.

    • NBA Player Stats: This dataset contains information about NBA players, including their stats for various games, their positions, and their salaries. It is a great dataset for learning how to clean and manipulate data using pandas.

    • Housing Prices Dataset: This dataset contains information about housing prices in a particular city or region, including variables such as the number of bedrooms, square footage, and price. It is a great dataset for learning how to manipulate and analyze numerical data using pandas.

    • COVID-19 Dataset: This dataset contains information about the COVID-19 pandemic, including the number of cases, deaths, and recoveries in various countries and regions. It is a great dataset for learning how to manipulate and analyze time-series data using pandas.

    • World Bank Dataset: This dataset contains information about economic indicators from various countries, such as GDP, population, and poverty rates. It is a great dataset for learning how to manipulate and analyze large datasets using pandas.

  • These are just a few examples of the many datasets that are suitable for learning pandas. The best dataset for you will depend on your interests and the specific skills you want to develop.

    Follow up question, "where can I find Titanic data set?"

  • The Titanic dataset is a popular dataset for data analysis and machine learning, and it can be found on various websites and data repositories. Here are a few places where you can find the Titanic dataset...

    • Kaggle: The Titanic dataset is available on Kaggle, which is a popular platform for data scientists and machine learning enthusiasts. You can download the dataset from the Kaggle website after creating an account.
    • UCI Machine Learning Repository: The Titanic dataset is also available on the UCI Machine Learning Repository, which is a collection of datasets that are commonly used for machine learning research. You can download the dataset from the UCI Machine Learning Repository website.

    • Seaborn library: If you have the Seaborn library installed in your Python environment, you can load the Titanic dataset directly from the library using the following code:

      import seaborn as sns
      titanic_data = sns.load_dataset('titanic')
      
emissions_df = pd.read_csv('fuel-production-consumption.csv')
emissions_df.head()
   Year      Country      Gas_Prod     Gas_Cons  Coal_Prod  Coal_Cons  Oil_Prod  Oil_Cons  Gas_Prod_Per_Capita  Gas_Cons_Per_Capita  Coal_Prod_Per_Capita  Coal_Cons_Per_Capita  Oil_Prod_Per_Capita  Oil_Cons_Per_Capita  Population
0  1980  Afghanistan  1.699000e+09   56640000.0   119000.0   119000.0       0.0  406500.0                127.2                4.241               0.00891               0.00891                  0.0              0.03043  13360000.0
1  1981  Afghanistan  2.237000e+09   84960000.0   125000.0   125000.0       0.0  464600.0                169.9                6.450               0.00949               0.00949                  0.0              0.03527  13170000.0
2  1982  Afghanistan  2.294000e+09  141600000.0   145000.0   145000.0       0.0  452900.0                178.1               10.990               0.01126               0.01126                  0.0              0.03516  12880000.0
3  1983  Afghanistan  2.407000e+09  141600000.0   145000.0   145000.0       0.0  638800.0                192.0               11.290               0.01157               0.01157                  0.0              0.05095  12540000.0
4  1984  Afghanistan  2.407000e+09  141600000.0   148000.0   148000.0       0.0  638800.0                197.2               11.600               0.01213               0.01213                  0.0              0.05234  12200000.0
emissions_df.max()
Year                               2021
Country                        Zimbabwe
Gas_Prod                4040000000000.0
Gas_Cons                4000000000000.0
Coal_Prod                  8144000000.0
Coal_Cons                  8179000000.0
Oil_Prod                   4817000000.0
Oil_Cons                   5820000000.0
Gas_Prod_Per_Capita             71520.0
Gas_Cons_Per_Capita             27160.0
Coal_Prod_Per_Capita              21.77
Coal_Cons_Per_Capita              7.093
Oil_Prod_Per_Capita               233.0
Oil_Cons_Per_Capita               125.8
Population                 7786000000.0
dtype: object
emissions_df.min()
Year                           1973
Country                 Afghanistan
Gas_Prod                        0.0
Gas_Cons                        0.0
Coal_Prod                       0.0
Coal_Cons                       0.0
Oil_Prod                        0.0
Oil_Cons                        0.0
Gas_Prod_Per_Capita             0.0
Gas_Cons_Per_Capita             0.0
Coal_Prod_Per_Capita            0.0
Coal_Cons_Per_Capita            0.0
Oil_Prod_Per_Capita             0.0
Oil_Cons_Per_Capita             0.0
Population                    959.0
dtype: object
emissions_df.describe()
              Year      Gas_Prod      Gas_Cons     Coal_Prod     Coal_Cons      Oil_Prod      Oil_Cons  Gas_Prod_Per_Capita  Gas_Cons_Per_Capita  Coal_Prod_Per_Capita  Coal_Cons_Per_Capita  Oil_Prod_Per_Capita  Oil_Cons_Per_Capita    Population
count  9237.000000  8.708000e+03  8.689000e+03  8.817000e+03  8.817000e+03  8.792000e+03  8.555000e+03          7689.000000          7671.000000           7757.000000           7757.000000          7820.000000          7601.000000  8.132000e+03
mean   2000.443434  2.383170e+10  2.392697e+10  5.261801e+07  5.215915e+07  4.141132e+07  4.550758e+07           963.569615           612.621901              0.424189              0.434401             2.414781             1.392495  6.719422e+07
std      12.439275  1.896607e+11  1.891655e+11  4.355659e+08  4.349869e+08  2.901607e+08  3.268641e+08          4554.541934          2008.360465              1.577120              1.028148            11.332552             4.256257  4.709697e+08
min    1973.000000  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00             0.000000             0.000000              0.000000              0.000000             0.000000             0.000000  9.590000e+02
25%    1990.000000  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  3.233000e+05             0.000000             0.000000              0.000000              0.000000             0.000000             0.152800  1.308750e+06
50%    2001.000000  0.000000e+00  0.000000e+00  0.000000e+00  1.221000e+03  0.000000e+00  1.568000e+06             0.000000             1.406000              0.000000              0.002069             0.000000             0.583400  6.651000e+06
75%    2011.000000  1.614000e+09  3.981000e+09  1.720000e+05  2.001000e+06  3.194000e+06  1.207000e+07           112.100000           450.500000              0.022110              0.254800             0.280250             1.650000  2.144750e+07
max    2021.000000  4.040000e+12  4.000000e+12  8.144000e+09  8.179000e+09  4.817000e+09  5.820000e+09         71520.000000         27160.000000             21.770000              7.093000           233.000000           125.800000  7.786000e+09
fuel_final = emissions_df.dropna()
fuel_final = fuel_final.drop(columns=['Gas_Prod_Per_Capita', 'Gas_Cons_Per_Capita', 'Coal_Prod_Per_Capita', 'Coal_Cons_Per_Capita', 'Oil_Prod_Per_Capita', 'Oil_Cons_Per_Capita', 'Population'])
fuel_final.index = range(len(fuel_final))

# remove zeroes
fuel_empty = []
for row in range(0, len(fuel_final)):
  if (fuel_final.loc[row, 'Gas_Prod'] == 0):
    fuel_empty.append(row)
  elif (fuel_final.loc[row, 'Gas_Cons'] == 0):
    fuel_empty.append(row)
  elif (fuel_final.loc[row, 'Coal_Prod'] == 0):
    fuel_empty.append(row)
  elif (fuel_final.loc[row, 'Coal_Cons'] == 0):
    fuel_empty.append(row)
  elif (fuel_final.loc[row, 'Oil_Prod'] == 0):
    fuel_empty.append(row)
  elif (fuel_final.loc[row, 'Oil_Cons'] == 0):
    fuel_empty.append(row)
fuel_final = fuel_final.drop(fuel_final.index[fuel_empty])
fuel_final.index = range(len(fuel_final))
fuel_final.describe()
              Year      Gas_Prod      Gas_Cons     Coal_Prod     Coal_Cons      Oil_Prod      Oil_Cons
count  1853.000000  1.853000e+03  1.853000e+03  1.853000e+03  1.853000e+03  1.853000e+03  1.853000e+03
mean   2000.686454  9.785606e+10  9.927935e+10  2.302688e+08  2.256243e+08  1.362930e+08  1.709018e+08
std      11.140106  3.984665e+11  3.977010e+11  9.048997e+08  9.056939e+08  5.744224e+08  6.720070e+08
min    1980.000000  1.000000e+06  9.912000e+06  9.000000e+02  3.176000e+03  2.904000e+02  2.288000e+05
25%    1992.000000  1.161000e+09  3.820000e+09  1.020000e+06  1.899000e+06  7.549000e+05  9.789000e+06
50%    2001.000000  9.003000e+09  1.340000e+10  6.933000e+06  1.744000e+07  4.524000e+06  2.246000e+07
75%    2010.000000  3.530000e+10  4.034000e+10  6.426000e+07  6.453000e+07  7.481000e+07  8.898000e+07
max    2020.000000  4.040000e+12  4.000000e+12  8.144000e+09  8.179000e+09  4.817000e+09  5.820000e+09
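
The row-by-row loop above can also be written as a single boolean mask. This is a sketch of that alternative, not the code that produced the output above; it drops rows with a zero in any of the six production or consumption columns, and it runs on a tiny made-up table (countries A through D) in place of fuel_final:

```python
import pandas as pd

# Tiny made-up stand-in for fuel_final; all numbers are invented
fuel = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Gas_Prod": [5.0, 0.0, 3.0, 2.0],
    "Gas_Cons": [4.0, 1.0, 3.0, 2.0],
    "Coal_Prod": [1.0, 1.0, 2.0, 2.0],
    "Coal_Cons": [1.0, 1.0, 0.0, 2.0],
    "Oil_Prod": [7.0, 1.0, 1.0, 2.0],
    "Oil_Cons": [6.0, 1.0, 1.0, 2.0],
})

fuel_cols = ["Gas_Prod", "Gas_Cons", "Coal_Prod", "Coal_Cons", "Oil_Prod", "Oil_Cons"]

# True for rows where every fuel column is non-zero
no_zero = (fuel[fuel_cols] != 0).all(axis=1)

# Keep those rows and renumber the index from 0
fuel_nonzero = fuel[no_zero].reset_index(drop=True)

print(fuel_nonzero)
```

Listing the columns once in fuel_cols makes it harder to check one column twice and skip another by accident.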
fuel_final.min()
Year              1980
Country        Albania
Gas_Prod     1000000.0
Gas_Cons     9912000.0
Coal_Prod        900.0
Coal_Cons       3176.0
Oil_Prod         290.4
Oil_Cons      228800.0
dtype: object
def correlation(country):

  corr_data = fuel_final[fuel_final.Country == country]
  print(country)
  print("The correlation between gas production and consumption is: ", corr_data['Gas_Prod'].corr(corr_data['Gas_Cons']))
  print("The correlation between coal production and consumption is: ", corr_data['Coal_Prod'].corr(corr_data['Coal_Cons']))
  print("The correlation between oil production and consumption is: ", corr_data['Oil_Prod'].corr(corr_data['Oil_Cons']))
  print("\n")

for i in fuel_final.Country.unique():
  correlation(i)
Albania
The correlation between gas production and consumption is:  1.0
The correlation between coal production and consumption is:  0.9983671135750389
The correlation between oil production and consumption is:  0.7269680841195096


Algeria
The correlation between gas production and consumption is:  0.8284234639601752
The correlation between coal production and consumption is:  0.3192372115918825
The correlation between oil production and consumption is:  0.8338354337119105


Argentina
The correlation between gas production and consumption is:  0.9280700903964625
The correlation between coal production and consumption is:  -0.12951811472076866
The correlation between oil production and consumption is:  -0.11525627305459005


Australia
The correlation between gas production and consumption is:  0.9134223143571423
The correlation between coal production and consumption is:  0.7171564919877736
The correlation between oil production and consumption is:  -0.48336410244708533


Austria
The correlation between gas production and consumption is:  0.6585969526316259
The correlation between coal production and consumption is:  0.6746863339853054
The correlation between oil production and consumption is:  -0.6617415122382505


Bangladesh
The correlation between gas production and consumption is:  0.9954085972905028
The correlation between coal production and consumption is:  0.6184563728036183
The correlation between oil production and consumption is:  -0.9069198871901405


Brazil
The correlation between gas production and consumption is:  0.9810553072041654
The correlation between coal production and consumption is:  0.2384133236648425
The correlation between oil production and consumption is:  0.9637617252090602


Bulgaria
The correlation between gas production and consumption is:  -0.15111220036099837
The correlation between coal production and consumption is:  0.8659910124000999
The correlation between oil production and consumption is:  0.8763819224978519


Canada
The correlation between gas production and consumption is:  0.8320819818151721
The correlation between coal production and consumption is:  0.6082292711571285
The correlation between oil production and consumption is:  0.9045684582301484


Chile
The correlation between gas production and consumption is:  0.04855788860424116
The correlation between coal production and consumption is:  0.35681087513943216
The correlation between oil production and consumption is:  -0.8911358708539838


China
The correlation between gas production and consumption is:  0.985059119010339
The correlation between coal production and consumption is:  0.9983081243490006
The correlation between oil production and consumption is:  0.8893156039330106


Colombia
The correlation between gas production and consumption is:  0.9796652790343146
The correlation between coal production and consumption is:  0.3831450321252026
The correlation between oil production and consumption is:  0.9441848111059311


Croatia
The correlation between gas production and consumption is:  -0.06021330629606104
The correlation between coal production and consumption is:  0.7994111564321135
The correlation between oil production and consumption is:  -0.640239975943251


Czechia
The correlation between gas production and consumption is:  -0.6974901041779209
The correlation between coal production and consumption is:  0.9915337520986545
The correlation between oil production and consumption is:  0.1852268207575415


Egypt
The correlation between gas production and consumption is:  0.9497340948323015
The correlation between coal production and consumption is:  0.5835421597482772
The correlation between oil production and consumption is:  -0.673893290855846


France
The correlation between gas production and consumption is:  -0.7251693943829752
The correlation between coal production and consumption is:  0.9185531927941775
The correlation between oil production and consumption is:  -0.6634323024669895


Georgia
The correlation between gas production and consumption is:  -0.23411302102915654
The correlation between coal production and consumption is:  0.8815106543502035
The correlation between oil production and consumption is:  -0.6055968331166464


Germany
The correlation between gas production and consumption is:  -0.14919784306529152
The correlation between coal production and consumption is:  0.9775104225872394
The correlation between oil production and consumption is:  0.659772516244861


Greece
The correlation between gas production and consumption is:  -0.7748248194967237
The correlation between coal production and consumption is:  0.9973055058940936
The correlation between oil production and consumption is:  -0.6397448404647996


Hungary
The correlation between gas production and consumption is:  -0.061786406076335704
The correlation between coal production and consumption is:  0.9968639876141443
The correlation between oil production and consumption is:  0.6071942565891912


India
The correlation between gas production and consumption is:  0.9449739068277528
The correlation between coal production and consumption is:  0.982881209094128
The correlation between oil production and consumption is:  0.6645858666917066


Indonesia
The correlation between gas production and consumption is:  0.938300304325842
The correlation between coal production and consumption is:  0.9852700313312451
The correlation between oil production and consumption is:  -0.8692344012555523


Iran
The correlation between gas production and consumption is:  0.9987553096547839
The correlation between coal production and consumption is:  0.11297599443514915
The correlation between oil production and consumption is:  0.7513424050393397


Italy
The correlation between gas production and consumption is:  -0.44515374642790906
The correlation between coal production and consumption is:  0.05123790597985525
The correlation between oil production and consumption is:  -0.22127605989837382


Japan
The correlation between gas production and consumption is:  0.7273061318009235
The correlation between coal production and consumption is:  -0.9161029264045187
The correlation between oil production and consumption is:  0.5062980513082399


Kazakhstan
The correlation between gas production and consumption is:  0.15222274230563682
The correlation between coal production and consumption is:  0.9723059274623547
The correlation between oil production and consumption is:  0.19452782154957043


Kyrgyzstan
The correlation between gas production and consumption is:  0.2460199278170101
The correlation between coal production and consumption is:  0.7737301455373388
The correlation between oil production and consumption is:  -0.7080965079348311


Malaysia
The correlation between gas production and consumption is:  0.9890481404893454
The correlation between coal production and consumption is:  0.9592350629636897
The correlation between oil production and consumption is:  0.2730588478343626


Mexico
The correlation between gas production and consumption is:  0.6639389207997793
The correlation between coal production and consumption is:  0.9381013926821217
The correlation between oil production and consumption is:  0.2854917900409883


Morocco
The correlation between gas production and consumption is:  0.21097733501831528
The correlation between coal production and consumption is:  -0.9362012254190981
The correlation between oil production and consumption is:  -0.7606558378998611


Myanmar
The correlation between gas production and consumption is:  0.9599589479846825
The correlation between coal production and consumption is:  0.9189250643592559
The correlation between oil production and consumption is:  -0.305815164142203


New Zealand
The correlation between gas production and consumption is:  0.997794372600727
The correlation between coal production and consumption is:  0.8824108190793091
The correlation between oil production and consumption is:  0.1931655679172063


Nigeria
The correlation between gas production and consumption is:  0.9431108914030848
The correlation between coal production and consumption is:  0.934434116462668
The correlation between oil production and consumption is:  0.34104488581072323


Norway
The correlation between gas production and consumption is:  0.7330869914203065
The correlation between coal production and consumption is:  -0.5541619130430874
The correlation between oil production and consumption is:  0.5855104141353953


Pakistan
The correlation between gas production and consumption is:  0.9858419916634996
The correlation between coal production and consumption is:  0.8253671790562133
The correlation between oil production and consumption is:  0.8996767909923996


Peru
The correlation between gas production and consumption is:  0.9888783302148328
The correlation between coal production and consumption is:  0.12800223522910248
The correlation between oil production and consumption is:  -0.8024665565142914


Philippines
The correlation between gas production and consumption is:  0.9999962018758283
The correlation between coal production and consumption is:  0.9688230340538656
The correlation between oil production and consumption is:  -0.5920034312791445


Poland
The correlation between gas production and consumption is:  0.3551558222634344
The correlation between coal production and consumption is:  0.9796099107965117
The correlation between oil production and consumption is:  0.8921274061839379


Romania
The correlation between gas production and consumption is:  0.9783453135343
The correlation between coal production and consumption is:  0.9893225750571135
The correlation between oil production and consumption is:  0.8994459348690592


Russia
The correlation between gas production and consumption is:  0.9419470334164719
The correlation between coal production and consumption is:  -0.21083401242824787
The correlation between oil production and consumption is:  0.5507860321120374


Serbia
The correlation between gas production and consumption is:  0.31531906059312187
The correlation between coal production and consumption is:  0.983579076367945
The correlation between oil production and consumption is:  -0.7101095648951388


Slovakia
The correlation between gas production and consumption is:  0.6897074750519705
The correlation between coal production and consumption is:  0.8827055240582095
The correlation between oil production and consumption is:  -0.7471051074795697


Slovenia
The correlation between gas production and consumption is:  -0.13462047463517798
The correlation between coal production and consumption is:  0.9851363021999034
The correlation between oil production and consumption is:  -0.7471079387860973


South Africa
The correlation between gas production and consumption is:  -0.49339340164649725
The correlation between coal production and consumption is:  0.9143541765933124
The correlation between oil production and consumption is:  -0.7396351462508706


Spain
The correlation between gas production and consumption is:  -0.6044706796216464
The correlation between coal production and consumption is:  0.8911000149701018
The correlation between oil production and consumption is:  -0.8117230700072131


Taiwan
The correlation between gas production and consumption is:  -0.4713261364343826
The correlation between coal production and consumption is:  -0.8753469349790672
The correlation between oil production and consumption is:  -0.8164451404317592


Tajikistan
The correlation between gas production and consumption is:  0.6965032578748406
The correlation between coal production and consumption is:  0.9983854510126821
The correlation between oil production and consumption is:  0.10071815943063475


Thailand
The correlation between gas production and consumption is:  0.993732041029292
The correlation between coal production and consumption is:  0.8170646900500173
The correlation between oil production and consumption is:  0.946114215023692


Turkey
The correlation between gas production and consumption is:  0.6012012301579368
The correlation between coal production and consumption is:  0.960203470002108
The correlation between oil production and consumption is:  -0.1333529515489173


Ukraine
The correlation between gas production and consumption is:  -0.40461011161362054
The correlation between coal production and consumption is:  0.9785370443288366
The correlation between oil production and consumption is:  0.690634107446907


United Kingdom
The correlation between gas production and consumption is:  0.7409434244067105
The correlation between coal production and consumption is:  0.9334066914492226
The correlation between oil production and consumption is:  0.5880517269272993


United States
The correlation between gas production and consumption is:  0.9272276775479863
The correlation between coal production and consumption is:  0.972041780313141
The correlation between oil production and consumption is:  -0.3030309016126291


Uzbekistan
The correlation between gas production and consumption is:  0.8421127208007584
The correlation between coal production and consumption is:  0.8952945082754322
The correlation between oil production and consumption is:  0.4212094929689429


Venezuela
The correlation between gas production and consumption is:  0.9872840363217603
The correlation between coal production and consumption is:  -0.05987035192924389
The correlation between oil production and consumption is:  0.23592620866951014


Vietnam
The correlation between gas production and consumption is:  0.999999894146438
The correlation between coal production and consumption is:  0.797305147060114
The correlation between oil production and consumption is:  0.6936404350579995


World
The correlation between gas production and consumption is:  0.9995402721550243
The correlation between coal production and consumption is:  0.9981437543533751
The correlation between oil production and consumption is:  0.9951371074298835


College Board Quiz

I got 5/6.

Here is the question I got wrong. I thought that the amount of food could determine how popular the artist was, but I now realize that the average ticket price can be used to determine the number of attendees: divide the total dollar amount of tickets sold by the average ticket price.
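
The division described above can be sketched in a few lines. The dollar figures are made up for illustration and are not taken from the quiz.

```python
# Hypothetical figures for one artist in one month (not from the quiz itself).
total_ticket_sales = 125_000.00   # total dollar amount of tickets sold
average_ticket_price = 50.00      # average price of a single ticket

# Attendance is estimated by dividing total sales by the average price.
estimated_attendees = total_ticket_sales / average_ticket_price
print(estimated_attendees)
```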

Titanic Data

Look at a sample of data.

import seaborn as sns

# Load the titanic dataset
titanic_data = sns.load_dataset('titanic')

print("Titanic Data")


print(titanic_data.columns) # titanic data set

print(titanic_data[['survived','pclass', 'sex', 'age', 'sibsp', 'parch', 'class', 'fare', 'embark_town']]) # look at selected columns
Titanic Data
Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')
     survived  pclass     sex   age  sibsp  parch   class     fare  \
0           0       3    male  22.0      1      0   Third   7.2500   
1           1       1  female  38.0      1      0   First  71.2833   
2           1       3  female  26.0      0      0   Third   7.9250   
3           1       1  female  35.0      1      0   First  53.1000   
4           0       3    male  35.0      0      0   Third   8.0500   
..        ...     ...     ...   ...    ...    ...     ...      ...   
886         0       2    male  27.0      0      0  Second  13.0000   
887         1       1  female  19.0      0      0   First  30.0000   
888         0       3  female   NaN      1      2   Third  23.4500   
889         1       1    male  26.0      0      0   First  30.0000   
890         0       3    male  32.0      0      0   Third   7.7500   

     embark_town  
0    Southampton  
1      Cherbourg  
2    Southampton  
3    Southampton  
4    Southampton  
..           ...  
886  Southampton  
887  Southampton  
888  Southampton  
889    Cherbourg  
890   Queenstown  

[891 rows x 9 columns]
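
Before cleaning, it can help to count how many values are missing in each column. The sketch below uses a small hand-made DataFrame named `sample` (a hypothetical stand-in, not the real Titanic data) so it runs without downloading anything. `isna().sum()` and `dropna()` are standard Pandas calls.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few Titanic rows; the real data comes from
# sns.load_dataset('titanic').
sample = pd.DataFrame({
    'survived': [0, 1, 1, 0],
    'age': [22.0, 38.0, np.nan, 35.0],
    'deck': [np.nan, 'C', np.nan, np.nan],
    'embark_town': ['Southampton', 'Cherbourg', 'Southampton', None],
})

# Count the missing values in each column.
missing_counts = sample.isna().sum()
print(missing_counts)

# Count the rows that would remain after a plain dropna().
complete_rows = len(sample.dropna())
print(complete_rows, 'of', len(sample), 'rows have no missing values')
```

A column that is mostly empty, like `deck` here, is usually a candidate for dropping entirely rather than dropping every row that lacks it.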

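Use Pandas to clean the data. Most analysis tools, including Machine Learning libraries and Pandas itself, work best when the data is in a standardized format. Getting the data into that format is called 'Cleaning' the data; the cleaned data can then be used for 'Training' a model.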

# Preprocess the data
from sklearn.preprocessing import OneHotEncoder


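td = titanic_data  # td is another name for the same DataFrame, so titanic_data changes too
# Drop columns that duplicate or derive from other columns, or are mostly empty
td.drop(['alive', 'who', 'adult_male', 'class', 'embark_town', 'deck'], axis=1, inplace=True)
td.dropna(inplace=True)  # drop rows with missing values
td['sex'] = td['sex'].apply(lambda x: 1 if x == 'male' else 0)  # male = 1, female = 0
td['alone'] = td['alone'].apply(lambda x: 1 if x == True else 0)  # True = 1, False = 0

# Encode categorical variables
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(td[['embarked']])
onehot = enc.transform(td[['embarked']]).toarray()
cols = ['embarked_' + val for val in enc.categories_[0]]
# Caution: pd.DataFrame(onehot) gets a fresh 0..n-1 index, so this assignment aligns on
# index labels rather than row position. Labels with no match become NaN and are removed
# by the final dropna; passing index=td.index would keep every row with its own encoding.
td[cols] = pd.DataFrame(onehot)
td.drop(['embarked'], axis=1, inplace=True)
td.dropna(inplace=True)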

print(td)
     survived  pclass  sex   age  sibsp  parch      fare  alone  embarked_C  \
0           0       3    1  22.0      1      0    7.2500      0         0.0   
1           1       1    0  38.0      1      0   71.2833      0         1.0   
2           1       3    0  26.0      0      0    7.9250      1         0.0   
3           1       1    0  35.0      1      0   53.1000      0         0.0   
4           0       3    1  35.0      0      0    8.0500      1         0.0   
..        ...     ...  ...   ...    ...    ...       ...    ...         ...   
705         0       2    1  39.0      0      0   26.0000      1         0.0   
706         1       2    0  45.0      0      0   13.5000      1         0.0   
707         1       1    1  42.0      0      0   26.2875      1         0.0   
708         1       1    0  22.0      0      0  151.5500      1         0.0   
710         1       1    0  24.0      0      0   49.5042      1         1.0   

     embarked_Q  embarked_S  
0           0.0         1.0  
1           0.0         0.0  
2           0.0         1.0  
3           0.0         1.0  
4           0.0         1.0  
..          ...         ...  
705         0.0         1.0  
706         0.0         1.0  
707         1.0         0.0  
708         0.0         1.0  
710         0.0         0.0  

[564 rows x 11 columns]
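
The row count falls to 564 partly because the one-hot columns are matched to `td` by index label after rows have already been dropped. One possible way to keep each row paired with its own encoding is `pd.get_dummies`, which preserves the index of its input. This is only a sketch on a hypothetical five-row frame, not the method used in this lesson, and it would produce different numbers from the output shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; index labels 0..4 mimic the original row numbers.
frame = pd.DataFrame({
    'age': [22.0, np.nan, 26.0, 35.0, 54.0],
    'embarked': ['S', 'C', 'Q', 'S', 'C'],
})

# Dropping the row with a missing age leaves index labels 0, 2, 3, 4.
frame = frame.dropna()

# get_dummies keeps the index of its input, so no row is shifted or lost.
dummies = pd.get_dummies(frame['embarked'], prefix='embarked').astype(float)
encoded = pd.concat([frame.drop(columns='embarked'), dummies], axis=1)
print(encoded)
```

All four remaining rows keep their encodings, and no second `dropna` is needed.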

The result of 'Cleaning' data is that it becomes easier to analyze and draw conclusions from. With the Titanic data, once it is clean you would probably want to estimate the likely chance of survival.

This would involve analyzing various factors (such as age, gender, class, etc.) that may have affected a person's chances of survival, and using that information to make predictions about whether an individual would have survived or not.

  • Data description:
    • Survival - Survival (0 = No; 1 = Yes). Not included in test.csv file.
    • Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
    • Name - Name
    • Sex - Sex
    • Age - Age
    • Sibsp - Number of Siblings/Spouses Aboard
    • Parch - Number of Parents/Children Aboard
    • Ticket - Ticket Number
    • Fare - Passenger Fare
    • Cabin - Cabin
    • Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
  • Perished Mean/Average

print(td.query("survived == 0").mean())
survived       0.000000
pclass         2.464072
sex            0.844311
age           31.073353
sibsp          0.562874
parch          0.398204
fare          24.835902
alone          0.616766
embarked_C     0.185629
embarked_Q     0.038922
embarked_S     0.775449
dtype: float64
  • Survived Mean/Average
print(td.query("survived == 1").mean())
survived       1.000000
pclass         1.878261
sex            0.326087
age           28.481522
sibsp          0.504348
parch          0.508696
fare          50.188806
alone          0.456522
embarked_C     0.152174
embarked_Q     0.034783
embarked_S     0.813043
dtype: float64
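
The two averages above can also be produced side by side with a single `groupby`, which makes the perished and survived groups easier to compare. The sketch below assumes a small hypothetical frame named `cleaned` that reuses a few of the `td` column names. Its values are invented for illustration.

```python
import pandas as pd

# Hypothetical cleaned rows using the same column names as td.
cleaned = pd.DataFrame({
    'survived': [0, 0, 0, 1, 1, 1],
    'pclass':   [3, 3, 2, 1, 1, 2],
    'sex':      [1, 1, 1, 0, 0, 1],   # 1 = male, 0 = female
    'fare':     [7.25, 8.05, 13.0, 71.28, 53.10, 26.0],
})

# One row per outcome (0 = perished, 1 = survived), one column per feature mean.
means_by_outcome = cleaned.groupby('survived').mean()
print(means_by_outcome)

# Survival rate within each sex: the mean of a 0/1 column is a proportion.
rate_by_sex = cleaned.groupby('sex')['survived'].mean()
print(rate_by_sex)
```

Because `sex` and `alone` were encoded as 0 or 1 during cleaning, their means read directly as proportions, which is why the `sex` row in the output above can be interpreted as the share of males in each group.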

Survived Max and Min Stats

print(td.query("survived == 1").max())
print(td.query("survived == 1").min())
survived        1.0000
pclass          3.0000
sex             1.0000
age            80.0000
sibsp           4.0000
parch           5.0000
fare          512.3292
alone           1.0000
embarked_C      1.0000
embarked_Q      1.0000
embarked_S      1.0000
dtype: float64
survived      1.00
pclass        1.00
sex           0.00
age           0.75
sibsp         0.00
parch         0.00
fare          0.00
alone         0.00
embarked_C    0.00
embarked_Q    0.00
embarked_S    0.00
dtype: float64
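
As an alternative to two separate print statements, `DataFrame.agg` can return the minimum and maximum together in one table. The `survivors` frame below is a hypothetical stand-in with invented values, not a slice of the real data.

```python
import pandas as pd

# Hypothetical survivor rows with a few of the td columns.
survivors = pd.DataFrame({
    'age':   [0.75, 35.0, 80.0],
    'fare':  [19.26, 512.33, 30.0],
    'sibsp': [2, 0, 0],
})

# One row for the minimum and one for the maximum of every column.
summary = survivors.agg(['min', 'max'])
print(summary)
```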

Machine Learning

Visit Tutorials Point

Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, via a consistent interface in Python.

  • Description from ChatGPT: The Titanic dataset is a popular dataset for data analysis and machine learning. In the context of machine learning, accuracy refers to the percentage of correctly classified instances in a set of predictions. In this case, the testing data is a subset of the original Titanic dataset that the decision tree model has not seen during training. ... After training the decision tree model on the training data, we can evaluate its performance on the testing data by making predictions on the testing data and comparing them to the actual outcomes. The accuracy of the decision tree classifier on the testing data tells us how well the model generalizes to new data that it hasn't seen before. ... For example, if the accuracy of the decision tree classifier on the testing data is 0.8 (or 80%), this means that 80% of the predictions made by the model on the testing data were correct. ... Chance of survival could be done using various machine learning techniques, including decision trees, logistic regression, or support vector machines, among others.

  • The code below prepares the data for further analysis and reports an accuracy score for each model. IMO, the next step would be to insert a new passenger and predict their survival. The same approach could be applied to other datasets, for example predicting whether a player will hit a home run or whether a stock will go up or down.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split arrays or matrices into random train and test subsets.
X = td.drop('survived', axis=1)
y = td['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a decision tree classifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

# Test the model
y_pred = dt.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('DecisionTreeClassifier Accuracy:', accuracy)

# Train a logistic regression model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Test the model
y_pred = logreg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('LogisticRegression Accuracy:', accuracy)
DecisionTreeClassifier Accuracy: 0.7705882352941177
LogisticRegression Accuracy: 0.788235294117647
/Users/johnmortensen/opt/anaconda3/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
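
The warning above suggests two remedies: raise `max_iter` or scale the data. A minimal sketch of both, followed by a prediction for one new passenger, might look like the following. The training rows and the new passenger are hypothetical values invented for illustration, and only a subset of the `td` columns is used. `make_pipeline`, `StandardScaler`, and `predict_proba` are standard scikit-learn calls.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training rows with a subset of the td columns.
train = pd.DataFrame({
    'pclass':   [3, 1, 3, 1, 3, 2, 1, 3, 2, 1],
    'sex':      [1, 0, 0, 0, 1, 1, 0, 1, 0, 1],   # 1 = male, 0 = female
    'age':      [22, 38, 26, 35, 35, 27, 19, 32, 45, 42],
    'fare':     [7.25, 71.28, 7.93, 53.10, 8.05, 13.0, 30.0, 7.75, 13.5, 26.29],
    'survived': [0, 1, 1, 1, 0, 0, 1, 0, 1, 1],
})
X = train.drop('survived', axis=1)
y = train['survived']

# Scale the features first, then fit logistic regression with a higher iteration limit.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# A made-up new passenger with the same columns, in the same order, as X.
new_passenger = pd.DataFrame({'pclass': [2], 'sex': [0], 'age': [30], 'fare': [20.0]})

prediction = model.predict(new_passenger)[0]            # 0 = perished, 1 = survived
probabilities = model.predict_proba(new_passenger)[0]   # [P(perished), P(survived)]
print('Predicted survival:', prediction)
print('Probability of [perished, survived]:', probabilities)
```

Putting the scaler inside the pipeline means the new passenger is scaled with the same statistics as the training data, so the same `model` object can be reused for any later prediction.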