Pandas Fundamentals with Earthquake Data

Pandas Fundamentals with Earthquake Data#

Welcome to this practical tutorial on Pandas, here we will review pandas fundamentals, such as how to

Open csv files
Manipulate dataframe indexes
Parse date columns
Examine basic dataframe statistics
Manipulate text columns and extract values
Plot dataframe contents using
- Bar charts
- Histograms
- Scatter plots

First, import Numpy, Pandas and Matplotlib and set the display options.

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

Note: %matplotlib inline is a command specific to Jupyter Notebook and JupyterLab, to enable the inline plotting of graphs and visualizations. It is typically used to display these plots directly below the code cell in the notebook.

Data is in .csv format and downloaded from the USGS Earthquakes Database is available at:

http://www.ldeo.columbia.edu/~rpa/usgs_earthquakes_2014.csv

You can open this file directly with Pandas.

Using Pandas’ read_csv function directly on a URL to open it as a DataFrame#

#url for data
url = "http://www.ldeo.columbia.edu/~rpa/usgs_earthquakes_2014.csv"
#read file using pandas
data = pd.read_csv(url)

Pandas provides various methods for fetching and reading data depending on the file type. Here, , since the file is in .csv format, we will untilize the pd.read_csv() function and assign the resulting DataFrame to the variable data.

#display first 5 rows
data.head()

	time	latitude	longitude	depth	mag	magType	nst	gap	dmin	rms	net	id	updated	place	type
0	2014-01-31 23:53:37.000	60.252000	-152.7081	90.20	1.10	ml	NaN	NaN	NaN	0.2900	ak	ak11155107	2014-02-05T19:34:41.515Z	26km S of Redoubt Volcano, Alaska	earthquake
1	2014-01-31 23:48:35.452	37.070300	-115.1309	0.00	1.33	ml	4.0	171.43	0.34200	0.0247	nn	nn00436847	2014-02-01T01:35:09.000Z	32km S of Alamo, Nevada	earthquake
2	2014-01-31 23:47:24.000	64.671700	-149.2528	7.10	1.30	ml	NaN	NaN	NaN	1.0000	ak	ak11151142	2014-02-01T00:03:53.010Z	12km NNW of North Nenana, Alaska	earthquake
3	2014-01-31 23:30:54.000	63.188700	-148.9575	96.50	0.80	ml	NaN	NaN	NaN	1.0700	ak	ak11151135	2014-01-31T23:41:25.007Z	22km S of Cantwell, Alaska	earthquake
4	2014-01-31 23:30:52.210	32.616833	-115.6925	10.59	1.34	ml	6.0	285.00	0.04321	0.2000	ci	ci37171541	2014-02-01T00:13:20.107Z	10km WNW of Progreso, Mexico	earthquake

The pd.DataFrame.head() function returns the first 5 rows of a DataFrame. To display a different number of rows, an integer value can be provided within the function, such as data.head(10), which would provide the first 10 rows. Similarly, the pd.DataFrame.tail() function presents the last 5 rows by default, displaying the final rows of the DataFrame.

#data types and summaries

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120108 entries, 0 to 120107
Data columns (total 15 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   time       120108 non-null  object 
 1   latitude   120108 non-null  float64
 2   longitude  120108 non-null  float64
 3   depth      120107 non-null  float64
 4   mag        120065 non-null  float64
 5   magType    120065 non-null  object 
 6   nst        59688 non-null   float64
 7   gap        94935 non-null   float64
 8   dmin       85682 non-null   float64
 9   rms        119716 non-null  float64
 10  net        120108 non-null  object 
 11  id         120108 non-null  object 
 12  updated    120108 non-null  object 
 13  place      120108 non-null  object 
 14  type       120108 non-null  object 
dtypes: float64(8), object(7)
memory usage: 13.7+ MB

While we can onserve the first few rows of the data, it doesn’t provide comprehensive information about the entire DataFrame. For this Pandas offers the pd.DataFrame.info(). This function provies a basic summary of the DataFrame which includes counts of non-null values, data types, field names and other details.

You should have seen that the dates were not automatically parsed into datetime types. pd.read_csv() actucally provides options for dealing with this. Let’s take a look.

Re-Reading Data: Defining Date Columns and Index#

#read file using pandas
data = pd.read_csv(url, parse_dates=['time','updated'], index_col='id') #date cols specified and index col specified

#display first 5 rows
data.head()

	time	latitude	longitude	depth	mag	magType	nst	gap	dmin	rms	net	updated	place	type
id
ak11155107	2014-01-31 23:53:37.000	60.252000	-152.7081	90.20	1.10	ml	NaN	NaN	NaN	0.2900	ak	2014-02-05 19:34:41.515000+00:00	26km S of Redoubt Volcano, Alaska	earthquake
nn00436847	2014-01-31 23:48:35.452	37.070300	-115.1309	0.00	1.33	ml	4.0	171.43	0.34200	0.0247	nn	2014-02-01 01:35:09+00:00	32km S of Alamo, Nevada	earthquake
ak11151142	2014-01-31 23:47:24.000	64.671700	-149.2528	7.10	1.30	ml	NaN	NaN	NaN	1.0000	ak	2014-02-01 00:03:53.010000+00:00	12km NNW of North Nenana, Alaska	earthquake
ak11151135	2014-01-31 23:30:54.000	63.188700	-148.9575	96.50	0.80	ml	NaN	NaN	NaN	1.0700	ak	2014-01-31 23:41:25.007000+00:00	22km S of Cantwell, Alaska	earthquake
ci37171541	2014-01-31 23:30:52.210	32.616833	-115.6925	10.59	1.34	ml	6.0	285.00	0.04321	0.2000	ci	2014-02-01 00:13:20.107000+00:00	10km WNW of Progreso, Mexico	earthquake

#display dataframe info
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 120108 entries, ak11155107 to ak11453389
Data columns (total 14 columns):
 #   Column     Non-Null Count   Dtype              
---  ------     --------------   -----              
 0   time       120108 non-null  datetime64[ns]     
 1   latitude   120108 non-null  float64            
 2   longitude  120108 non-null  float64            
 3   depth      120107 non-null  float64            
 4   mag        120065 non-null  float64            
 5   magType    120065 non-null  object             
 6   nst        59688 non-null   float64            
 7   gap        94935 non-null   float64            
 8   dmin       85682 non-null   float64            
 9   rms        119716 non-null  float64            
 10  net        120108 non-null  object             
 11  updated    120108 non-null  datetime64[ns, UTC]
 12  place      120108 non-null  object             
 13  type       120108 non-null  object             
dtypes: datetime64[ns, UTC](1), datetime64[ns](1), float64(8), object(4)
memory usage: 13.7+ MB

By including the parse_dates argument of pd.read_csv and a list of datetime columns, we can automatically convert these fields at the time of loading the DataFrame.

Using `describe` to get the basic statistics of all the columns#

Note the highest and lowest magnitude of earthquakes in the databse.

#describe dataframe and transposing it to make it more readable
data.describe().T

# T - transpose dataframe

	count	mean	min	25%	50%	75%	max	std
time	120108	2014-07-05 09:10:37.116720128	2014-01-01 00:01:16.610000	2014-04-08 03:43:10.768999936	2014-07-07 10:44:06.035000064	2014-09-30 23:36:40.595000064	2014-12-31 23:54:33.900000	NaN
latitude	120108.0	38.399579	-73.462	34.228917	38.8053	53.8895	86.6514	21.938258
longitude	120108.0	-99.961402	-179.9989	-147.742025	-120.832	-116.0681	179.998	82.996858
depth	120107.0	28.375029	-9.9	4.1	9.2	22.88	697.36	62.215416
mag	120065.0	1.793958	-0.97	0.82	1.4	2.4	8.2	1.343466
nst	59688.0	17.878284	0.0	8.0	14.0	22.0	365.0	14.911369
gap	94935.0	124.048978	9.0	74.0	107.0	155.0	356.4	68.518595
dmin	85682.0	0.893198	0.0	0.02076	0.07367	0.447	64.498	2.903966
rms	119716.0	0.358174	0.0	0.07	0.2	0.59	8.46	0.364046

The pd.DataFrame.describe() function offers a statistical summary of all fields of numeric type. This summary helps in understanding the distribution of our data, a crucial aspect of exploratory data analysis.

Use of the T attribute (transpose), it’s solely at your discretion and how you prefer to display the returned DataFrame. This option can be particularly useful in rendering high-dimensional DataFrames with few rows, making them more comprehensible.

Using `nlargest` to get the top 20 earthquakes by magnitude#

Review the documentation for pandas.Series.nlargest here

#display the top 20 earthquakes by mag
data.nlargest(20,'mag')

	time	latitude	longitude	depth	mag	magType	nst	gap	dmin	rms	net	updated	place	type
id
usc000nzvd	2014-04-01 23:46:47.260	-19.6097	-70.7691	25.00	8.2	mww	NaN	23.0	0.609	0.66	us	2015-07-30 16:24:51.223000+00:00	94km NW of Iquique, Chile	earthquake
usc000rki5	2014-06-23 20:53:09.700	51.8486	178.7352	109.00	7.9	mww	NaN	22.0	0.133	0.71	us	2015-04-18 21:54:08.699000+00:00	19km SE of Little Sitkin Island, Alaska	earthquake
usc000p27i	2014-04-03 02:43:13.110	-20.5709	-70.4931	22.40	7.7	mww	NaN	44.0	1.029	0.82	us	2015-06-06 07:31:05.755000+00:00	53km SW of Iquique, Chile	earthquake
usc000phx5	2014-04-12 20:14:39.300	-11.2701	162.1481	22.56	7.6	mww	NaN	13.0	2.828	0.71	us	2015-04-18 21:54:27.398000+00:00	93km SSE of Kirakira, Solomon Islands	earthquake
usb000pr89	2014-04-19 13:28:00.810	-6.7547	155.0241	43.37	7.5	mww	NaN	16.0	3.820	1.25	us	2015-04-18 21:54:18.633000+00:00	70km SW of Panguna, Papua New Guinea	earthquake
usc000piqj	2014-04-13 12:36:19.230	-11.4633	162.0511	39.00	7.4	mww	NaN	17.0	2.885	1.00	us	2015-08-13 19:29:13.018000+00:00	112km S of Kirakira, Solomon Islands	earthquake
usb000slwn	2014-10-14 03:51:34.460	12.5262	-88.1225	40.00	7.3	mww	NaN	18.0	1.078	0.70	us	2015-08-13 19:35:02.679000+00:00	74km S of Intipuca, El Salvador	earthquake
usb000pq41	2014-04-18 14:27:24.920	17.3970	-100.9723	24.00	7.2	mww	NaN	46.0	2.250	1.20	us	2015-08-13 19:30:39.599000+00:00	33km ESE of Petatlan, Mexico	earthquake
usc000pft9	2014-04-11 07:07:23.130	-6.5858	155.0485	60.53	7.1	mww	NaN	21.0	3.729	0.88	us	2014-07-01 02:37:56+00:00	56km WSW of Panguna, Papua New Guinea	earthquake
usc000sxh8	2014-11-15 02:31:41.720	1.8929	126.5217	45.00	7.1	mww	NaN	18.0	1.397	0.71	us	2015-03-20 18:42:02.735000+00:00	154km NW of Kota Ternate, Indonesia	earthquake
usc000stdc	2014-11-01 18:57:22.380	-19.6903	-177.7587	434.00	7.1	mww	NaN	13.0	4.415	0.84	us	2015-01-20 09:03:09.040000+00:00	144km NE of Ndoi Island, Fiji	earthquake
usb000sk6k	2014-10-09 02:14:31.440	-32.1082	-110.8112	16.54	7.0	mww	NaN	22.0	5.127	0.43	us	2015-08-13 19:31:44.129000+00:00	Southern East Pacific Rise	earthquake
usc000mnvj	2014-02-12 09:19:49.060	35.9053	82.5864	10.00	6.9	mww	NaN	18.0	7.496	0.83	us	2015-01-30 23:03:45.902000+00:00	272km ESE of Hotan, China	earthquake
usc000nzwm	2014-04-01 23:57:58.790	-19.8927	-70.9455	28.42	6.9	mww	NaN	119.0	0.828	0.93	us	2014-05-29 23:32:13+00:00	91km WNW of Iquique, Chile	earthquake
usb000r2hc	2014-05-24 09:25:02.440	40.2893	25.3889	6.43	6.9	mww	NaN	25.0	0.402	0.67	us	2015-01-28 09:17:17.266000+00:00	22km SSW of Kamariotissa, Greece	earthquake
usc000rngj	2014-06-29 07:52:55.170	-55.4703	-28.3669	8.00	6.9	mww	NaN	25.0	4.838	0.76	us	2014-09-26 11:49:45+00:00	154km NNW of Visokoi Island,	earthquake
usc000rkg5	2014-06-23 19:19:15.940	-29.9772	-177.7247	20.00	6.9	mww	NaN	35.0	0.751	0.99	us	2014-09-19 17:23:16+00:00	80km SSE of Raoul Island, New Zealand	earthquake
usb000ruzk	2014-07-21 14:54:41.000	-19.8015	-178.4001	615.42	6.9	mww	NaN	15.0	3.934	0.96	us	2014-10-17 21:12:13+00:00	99km NNE of Ndoi Island, Fiji	earthquake
usc000rr6a	2014-07-07 11:23:54.780	14.7240	-92.4614	53.00	6.9	mww	NaN	51.0	0.263	1.38	us	2015-01-28 13:08:13.282000+00:00	4km W of Puerto Madero, Mexico	earthquake
usb000rzki	2014-08-03 00:22:03.680	0.8295	146.1688	13.00	6.9	mww	NaN	12.0	6.393	0.93	us	2014-10-29 19:52:55+00:00	Federated States of Micronesia region	earthquake

The pd.Series.nlargest() function is straightforward and incredibly useful. It operates on a pd.Series (a single column) and retrieves the top \(n\) values. When applied within a DataFrame context, it allows us to extract the rows containing the \(n\) largest values from the specified column.

Examine the structure of the place column. The state / country information seems to be in there. How would you get it out?

Extracting the state or country and adding it a new column to the dataframe called `country`.#

Note that some of the “countries” are actually U.S. states.

#text cleaning

#if there is a comma then split, if not just use text

def extract_locations(text):
    """ This function parses strings, splitting them based on a comma
        Then checks the length resulting list is more than zero
        Then takes the last entry from the list
        Then checks if the last in the list or the original sting itself is in the list of states
        If yes it returns USA
        If not it returns the entity
        If the entity has no comma then it returns the original text
        
        Parameters:
            text: takes a string input of variable length
        Return:
            string output of country name
    """
    
    #list of states of the US from
    # https://gist.githubusercontent.com/norcal82/e4c7e8113f377db184bb/raw/87558bb70c5c149f357663061bc1b1ab96c90b7e/state_names.py
    state_names = ["Alaska", "Alabama", "Arkansas", "American Samoa", "Arizona", "California", "Colorado", "Connecticut", "District ", 
                   "of Columbia", "Delaware", "Florida", "Georgia", "Guam", "Hawaii", "Iowa", "Idaho", "Illinois", "Indiana", "Kansas",
                   "Kentucky", "Louisiana", "Massachusetts", "Maryland", "Maine", "Michigan", "Minnesota", "Missouri", "Mississippi", 
                   "Montana", "North Carolina", "North Dakota", "Nebraska", "New Hampshire", "New Jersey", "New Mexico", "Nevada", 
                   "New York", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Puerto Rico", "Rhode Island", "South Carolina", 
                   "South Dakota", "Tennessee", "Texas", "Utah", "Virginia", "Virgin Islands", "Vermont", "Washington", "Wisconsin", 
                   "West Virginia", "Wyoming"]
    
   
    parts = text.split(', ')
    if len(parts) > 0:
        entity = parts[-1]
        if entity in state_names or text in state_names:
            return "USA"
        return entity
        
    return np.nan

Although this function may initially seem complex, its logic is quite straightforward. Observe the pattern within the place column, where country names typically appear after the final comma, or there are no commas, and the country/place name is provided in full.

To extract the country, we split each place name at commas:
```
parts = text.split(', ')
```
We then select the last string in the created list. If commas aren’t present, then the first and last string will be the same:
```
entity = parts[-1]
```
Additionally, we need to consider whether the place name is a U.S. state or not. If it is, we should return “USA”; otherwise, we can simply return entity as it won’t be a U.S. state:
```
if entity in state_names or text in state_names:
    return "USA"
return entity
```
While not absolutely necessary in this circumstance, it’s good practice to specify behavior if the data in place does not meet the specified logic above. This is why the final return is:
```
return np.nan
```

data['country'] = data['place'].apply(extract_locations)
data.head()

	time	latitude	longitude	depth	mag	magType	nst	gap	dmin	rms	net	updated	place	type	country
id
ak11155107	2014-01-31 23:53:37.000	60.252000	-152.7081	90.20	1.10	ml	NaN	NaN	NaN	0.2900	ak	2014-02-05 19:34:41.515000+00:00	26km S of Redoubt Volcano, Alaska	earthquake	USA
nn00436847	2014-01-31 23:48:35.452	37.070300	-115.1309	0.00	1.33	ml	4.0	171.43	0.34200	0.0247	nn	2014-02-01 01:35:09+00:00	32km S of Alamo, Nevada	earthquake	USA
ak11151142	2014-01-31 23:47:24.000	64.671700	-149.2528	7.10	1.30	ml	NaN	NaN	NaN	1.0000	ak	2014-02-01 00:03:53.010000+00:00	12km NNW of North Nenana, Alaska	earthquake	USA
ak11151135	2014-01-31 23:30:54.000	63.188700	-148.9575	96.50	0.80	ml	NaN	NaN	NaN	1.0700	ak	2014-01-31 23:41:25.007000+00:00	22km S of Cantwell, Alaska	earthquake	USA
ci37171541	2014-01-31 23:30:52.210	32.616833	-115.6925	10.59	1.34	ml	6.0	285.00	0.04321	0.2000	ci	2014-02-01 00:13:20.107000+00:00	10km WNW of Progreso, Mexico	earthquake	Mexico

Let’s create a new column in the DataFrame.

Create a column in the DataFrame using the syntax DataFrame['new_column']:
```
data['country']
```
Populate the new column with data. In this case, we want to extract the country name from the place column. To achieve this, we need to apply a function to the data within the column. Pandas facilitates this through the pd.DataFrame.apply() function, which takes a function as its argument:
```
data['place'].apply(extract_locations)
```

You can also attempt this using Pandas text data functions.

Displaying each unique value from the new column#

Review the documentation for pandas.Series.unique here

#each unique value
data.country.unique()

array(['USA', 'Mexico', 'Papua New Guinea', 'New Zealand',
       'South of the Fiji Islands', 'British Virgin Islands', 'Canada',
       'Fiji', 'Antarctica', 'Chile', 'Indonesia', 'Solomon Islands',
       'Micronesia', 'Russia', 'Philippines', 'Bolivia', 'Greece',
       'Japan', 'Iran', 'Tonga', 'Wallis and Futuna', 'CA',
       'Carlsberg Ridge', 'Pakistan',
       'Off the west coast of northern Sumatra', 'Burma', 'China', 'Peru',
       'Off the east coast of the North Island of New Zealand',
       'Costa Rica', 'Reykjanes Ridge', 'East Timor',
       'Central East Pacific Rise', 'Mid-Indian Ridge', 'Japan region',
       'Northern Mariana Islands', 'El Salvador', 'Samoa',
       'Northern Mid-Atlantic Ridge', 'Taiwan', 'South Sandwich Islands',
       'Colombia', 'Dominican Republic', 'Argentina', 'Saint Helena',
       'West of Vancouver Island', 'Tanzania', 'Vanuatu',
       'Bosnia and Herzegovina', 'India', 'Southeast of Easter Island',
       'Serbia', 'Off the coast of Oregon', 'Nicaragua',
       'Republic of the Congo', 'Western Indian-Antarctic Ridge',
       'U.S. Virgin Islands', 'Southern East Pacific Rise', '',
       'Guatemala', 'Kyrgyzstan', 'Kuril Islands', 'Afghanistan',
       'Scotia Sea', 'West of Macquarie Island',
       'North of Ascension Island', 'New Caledonia', 'Saudi Arabia',
       'Southern Pacific Ocean', 'Romania', 'Italy',
       'Southern Mid-Atlantic Ridge', 'Panama', 'South Indian Ocean',
       'West Chile Rise', 'South of Panama', 'Albania', 'Haiti',
       'Ecuador', 'Tajikistan', 'Venezuela', 'North Indian Ocean',
       'Hungary', 'Central Mid-Atlantic Ridge', 'Christmas Island',
       'West of the Galapagos Islands', 'Bouvet Island', 'Greenland Sea',
       'South of Africa', 'Eritrea', 'Vanuatu region',
       'North of Svalbard', 'Sea of Okhotsk',
       '1km ENE of Quarry at VICTORVILLE', 'Turkmenistan',
       'East of Severnaya Zemlya', '1km NNE of Quarry at BORON',
       'Southeast Indian Ridge', 'Turkey',
       'British Indian Ocean Territory', 'Australia', 'Fiji region',
       'Southwest of Africa', '3km WSW of Quarry at CUSHENBURY', 'Cuba',
       'Guadeloupe', 'Azerbaijan', 'eastern Montana',
       'South Atlantic Ocean', 'Anguilla', 'Mauritania',
       'East of the North Island of New Zealand',
       'Chagos Archipelago region', 'Banda Sea', 'Easter Island region',
       'Comoros', 'Galapagos Triple Junction region', 'Honduras',
       'South of the Kermadec Islands', 'Davis Strait',
       'East of the Kuril Islands', 'Off the south coast of Australia',
       'Prince Edward Islands', 'Mongolia', 'Mauritius - Reunion region',
       'Drake Passage', 'Indian Ocean Triple Junction',
       'Prince Edward Islands region', 'Cape Verde', 'Yemen',
       'Owen Fracture Zone region', 'Spain', 'North Atlantic Ocean',
       'Algeria', 'Pacific-Antarctic Ridge', 'Mauritius',
       'Southwest Indian Ridge', 'Aruba', 'Svalbard and Jan Mayen',
       'United Kingdom', 'Crozet Islands region', 'Barbados',
       'Cook Islands', 'Bangladesh', 'South Sudan', 'Greenland',
       'Martinique', 'Balleny Islands region', 'Cyprus', 'South of Tonga',
       'Malaysia', 'North Korea', 'Off the coast of Central America',
       'Cayman Islands', 'Portugal', 'Ethiopia',
       'Azores-Cape St. Vincent Ridge', 'Uzbekistan', 'India region',
       'French Southern Territories', 'Poland',
       'South Georgia Island region', 'North of New Zealand', 'Egypt',
       'Morocco', 'Slovenia', 'East of the South Sandwich Islands',
       'Ascension Island region', 'western Montana',
       'East central Pacific Ocean', 'Timor Sea', 'West of Australia',
       'Labrador Sea', 'Somalia', 'Zambia', 'Saint Martin', 'Ukraine',
       'Southwest of Australia', 'Mariana Islands region',
       'Norwegian Sea', 'France', 'Gulf of Alaska',
       'Northwest of Australia', 'Montenegro', 'Bermuda', 'South Africa',
       'North of Severnaya Zemlya', 'Saint Kitts and Nevis', 'Iraq',
       'Bouvet Island region', 'North of Franz Josef Land', 'Laptev Sea',
       'Brazil', 'Southeast central Pacific Ocean',
       'South Georgia and the South Sandwich Islands',
       'Northern California', 'Central California', 'Czech Republic',
       'Mozambique Channel', 'Santa Cruz Islands region',
       'Macquarie Island region', 'Alaska Peninsula', 'Oman',
       'Azores Islands region', 'Off the coast of Ecuador',
       'Central Alaska', 'Macedonia', 'Southeastern Alaska',
       'Southern Alaska', 'Lebanon', 'northern Alaska', 'Jordan',
       'Southern California', 'Puerto Rico region', 'Djibouti',
       'Molucca Sea', 'Leeward Islands', 'Germany', 'south of Panama',
       'Celebes Sea', 'Iceland', 'Falkland Islands region',
       'Ecuador region', 'Vietnam', 'South of Alaska',
       'Philippine Islands region', 'western Texas',
       'northern Mid-Atlantic Ridge', 'Russia region', 'Thailand',
       'Peru-Ecuador border region', 'Jamaica', 'Kermadec Islands region',
       'north of Ascension Island', 'Tristan da Cunha region',
       'Western Australia', 'Tonga region', 'Palau region',
       'East of the Philippine Islands', 'southern Idaho', 'Kazakhstan',
       'south of the Aleutian Islands', 'South Sandwich Islands region',
       'western Xizang', 'Norway', 'South Shetland Islands',
       'Virgin Islands region', 'offshore Northern California',
       'South of Australia', 'Southwestern Atlantic Ocean',
       'Saint Eustatius and Saba ', 'Jersey',
       'Northern East Pacific Rise', 'Trinidad and Tobago', 'Burundi',
       'Niue', 'Dominica', 'Nepal',
       'Off the coast of Southeastern Alaska',
       'Federated States of Micronesia region', 'Bay of Bengal',
       'Barbuda', 'South Napa Earthquake', 'East of the Volcano Islands',
       'Syria', 'Uganda', 'Sweden', 'South of the Mariana Islands',
       'Mozambique', 'Saint Vincent and the Grenadines', 'Sudan',
       'Socotra region', 'French Polynesia', 'Palau',
       'Tennessee-Virginia border region',
       'Democratic Republic of the Congo', 'Bhutan', 'Angola',
       'Portugal region', 'Malawi', 'South of Tasmania',
       'Eastern Greenland', 'Kosovo', 'Saint Lucia',
       'Off the south coast of Northwest Africa', 'Sierra Leone'],
      dtype=object)

#number of unique country values
data.country.nunique()

Observe the difference between pd.Series.unique() and pd.Series.nunique(). To get a better idea, try

len(data.country.unique())

Note that the country column is still imperfect. There are methods to further clean this column using libraries such as SpaCy or NLTK but that is outside of the scope of this tutorial.

I recommend taking a look at a course in Information Retrieval or Natural Language Processing to gain further familiarity, if interested.

Filtering A Dataset: Magnitude 4 or Larger Earthquakes#

#filtering in pandas

filter = data['mag']>=4
magnitude_4_and_higher = data[filter]

magnitude_4_and_higher.head()

	time	latitude	longitude	depth	mag	magType	nst	gap	dmin	rms	net	updated	place	type	country
id
usc000mqlp	2014-01-31 23:08:03.660	-4.9758	153.9466	110.18	4.2	mb	NaN	98.0	1.940	0.61	us	2014-04-08 01:43:19+00:00	115km ESE of Taron, Papua New Guinea	earthquake	Papua New Guinea
usc000mqln	2014-01-31 22:54:32.970	-28.1775	-177.9058	95.84	4.3	mb	NaN	104.0	1.063	1.14	us	2014-04-08 01:43:19+00:00	120km N of Raoul Island, New Zealand	earthquake	New Zealand
usc000mqls	2014-01-31 22:49:49.740	-23.1192	179.1174	528.34	4.4	mb	NaN	80.0	5.439	0.95	us	2014-04-08 01:43:19+00:00	South of the Fiji Islands	earthquake	South of the Fiji Islands
usc000mf1x	2014-01-31 22:19:44.330	51.1569	-178.0910	37.50	4.2	mb	NaN	NaN	NaN	0.83	us	2014-04-08 01:43:19+00:00	72km E of Amatignak Island, Alaska	earthquake	USA
usc000mqlm	2014-01-31 21:56:44.320	-4.8800	153.8434	112.66	4.3	mb	NaN	199.0	1.808	0.79	us	2014-04-08 01:43:19+00:00	100km ESE of Taron, Papua New Guinea	earthquake	Papua New Guinea

Pandas simplifies the process of data filtering.

Define our filter criteria - earthquakes with a magnitude of 4 or higher:
```
filter = data['mag'] >= 4
```
This filter creates a boolean mask, marking True where the mag is greater than or equal to \(4\).
Apply the created filter to the data:
```
data[filter]
```
This operation selects all rows from the data DataFrame where the filter evaluates to True. You can also create more intricate filters. For instance, try filtering earthquakes between magnitudes \(6\) and \(9\).

Hint - take a look here

Making a bar chart of the top 5 locations with the most earthquakes from the filtered dataset

Location name on the x axis, Earthquake count on the y axis

Method one using nlargest()

# get value counts of 
country_freq = magnitude_4_and_higher['country'].value_counts()

top_5 = country_freq.nlargest(5)

top_5.plot(kind='bar', grid=True)

plt.show()

../../../_images/108aa87cadff37fa166a6b0eb59a24a402d17c2be9002a06eb2c595e30d756db.png

Pandas offers the pd.Series.value_counts() function, which generates an ordered series displaying counts for each unique value within a series or column.

Try executing:

magnitude_4_and_higher['country'].value_counts()

Additionally, Pandas seamlessly interfaces with Matplotlib through pd.DataFrame.plot(). This function allows us to specify the desired plot type and other keyword arguments, offering extensive customization options.

For example:

top_5.plot(kind='bar', grid=True)

Refer to the documentation here for more detailed information.

Getting familiar with this syntax is crucial as it’s extensively used in various data analysis and manipulation packages commonly utilized in scientific computing.es and research.

Method 2 using head()

# get value counts
country_freq = magnitude_4_and_higher['country'].value_counts()

#get top 5 from series suing head()
top_5 = country_freq.head()

#plot bar chart using matplotlib.pyplot.bar()
plt.bar(top_5.index, top_5)

#rotate the x labels 90 degrees
plt.xticks(rotation=90)

plt.grid(which='major')

plt.show()

../../../_images/a5329d0b9fb817a982d956cbac4b140dc79f3d20adf82a49344c8f51fa508503.png

Notice the syntax variation when calling the Matplotlib.pyplot.bar() function directly instead of using pd.DataFrame.plot().

The pd.DataFrame.head() method could be utilized similarly to the first method. This plot showcases the adaptability of Pandas, highlighting that there’s no sole method to achieve an output. Experimenting with different methods will enhance your coding proficiency.

Making a histogram of the distribution of the Earthquake magnitudes#

For more information visit pandas.DataFrame.hist or matplotlib.pyplot.hist

#create figure
plt.figure(figsize=(15,8))

#create subplots
plt.subplot(1,2,1) #1 row, 2 cols and first position
#histogram, with ylog scale, grid and edges of bins in black
magnitude_4_and_higher['mag'].plot(kind='hist',logy=True, grid=True, edgecolor='black', bins=5)

#name x axis
plt.xlabel('Magnitude')

#name subplot
plt.title('Magnitude 4 and higher')

plt.subplot(1,2,2)
data['mag'].plot(kind='hist',logy=True, grid=True, color='red', edgecolor='black', bins=9)
plt.xlabel('Magnitude')
plt.title('Frequency of all earthquakes')

plt.show()

../../../_images/012eca45d0051a692301fee51ba6361cfd8939ee24e4e0eac566f60d8dbc51b2.png

Let’s explore another effective method for creating subplots.

We can separate the figure and subplot creation steps:

# Create figure
plt.figure(figsize=(15, 8))

# Create the first subplot
plt.subplot(1, 2, 1)
# Subplot code

# Create the second subplot
plt.subplot(1, 2, 2)
# Subplot code

Here, we establish a \(1 \times 2\) subplot grid, specifying the location for the subsequent subplot as \(1\) and the next as \(2\).

Take note of the title creation for each plot compared to the methods used previously.

Additionally, a histogram is generated, showcasing various keyword arguments available in pd.DataFrame.plot():

data['mag'].plot(kind='hist', logy=True, grid=True, color='red', edgecolor='black', bins=9)

It’s crucial to emphasize the significance of reviewing the documentation for the pd.DataFrame.plot() function.

Visualising the locations of earthquakes with a scatterplot of their latitude and longitude#

Examine the code block below, try to note the purpose of each step.

#create figure
plt.figure(figsize=(20, 20))

#create subplots
plt.subplot(2,1,1)
#make scatter plot
scatter = plt.scatter(magnitude_4_and_higher.longitude,
                      magnitude_4_and_higher.latitude,
                      s=magnitude_4_and_higher.mag,
                      c=magnitude_4_and_higher.depth,
                     cmap='Spectral')
#create colorbar
cbar = plt.colorbar(scatter)
# Set colorbar label
cbar.set_label('Depth [km]') 
#name subplot
plt.title('Magnitude 4 and higher')


#subplot 2
plt.subplot(2,1,2)
#make scatter plot
scatter = plt.scatter(data.longitude,
                      data.latitude,
                      s=data.mag,
                      c=data.depth,
                     cmap='Spectral')
#create colorbar
cbar = plt.colorbar(scatter) #format colorbar for 10^x)
# Set colorbar label
cbar.set_label('Depth [km]') 
#name subplot
plt.title('Frequency of all earthquakes')

plt.show()

../../../_images/7107d375acf77472a03e063e8912b52f5b5cafea2eefea521644e5c835009f4b.png

Did you understand how these plots were created?
Hint: Review the tutorial on Matplotlib.

Take a look matplotlib.pyplot for more cmap options. Try improving or changing aspects of the code above.

Final Thoughts#

By now, you should have acquired a foundational understanding of Pandas and its significance in data analysis and manipulation. Widely utilized in advanced scientific computing, Pandas offers extensive capabilities that we will delve into further in upcoming sections. Before progressing, ensure you comprehend and validate the calculations and transformations introduced in this chapter to solidify your grasp of the material.