
NOAA Lightning Strike Analysis (EDA/Feature Engineering with Python)

By Alfred Rico

Project Background

The NOAA is a U.S. government agency responsible for monitoring weather, oceans, and atmospheric activity, using a combination of advanced technologies to detect, track, and record environmental events. Although most lightning does not make contact with the ground, GOES satellites equipped with GLM (Geostationary Lightning Mapper) instrumentation are able to detect total lightning activity, including both cloud-to-ground and in-cloud events.

Analysis of this data has revealed notable patterns in lightning behavior, leading to intriguing discoveries such as its higher frequency in tropical regions. Lightning activity has also proven to be a valuable indicator of storm intensity and development: spikes in lightning can signal tornado formation, rapid storm intensification, and hail-producing storms, among other events.

Project Goal

Using Python in the Jupyter notebook environment, 2018 lightning strike data collected by the National Oceanic and Atmospheric Administration (NOAA) is examined: the total number of strikes is calculated for each month and plotted on a bar graph.

Technical Implementation

Python Packages

Importing Python packages

  • NumPy – Used primarily for numerical computation and a wide range of mathematical operations, including working with arrays and vectorized functions.
  • Pandas – Used for data manipulation, structuring, and handling, particularly with tabular data through DataFrames.
  • datetime – Used for working with dates and time data, allowing for parsing, formatting, and performing time-based calculations.
  • matplotlib.pyplot – Used for data visualization, enabling the creation of charts such as line graphs, bar charts, and histograms for exploratory analysis.
View Code
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
Preliminary Data Inspection

Importing the data set

df = pd.read_csv("lightning_strikes_1.csv")

Initial data inspection, validation, and exploration performed using head(), .shape, and info():

  • head() – Examines a sample of records to understand dataset structure and feature layout.
  • .shape – Returns a tuple (rows, columns) describing dataset dimensions.
  • info() – Assesses data types, missing values, and overall dataset integrity.

View Code
# Importing dataset
df = pd.read_csv("lightning_strikes_1.csv")  

# Inspect the first 5 rows
df.head()  
# Rows/ column counts
df.shape  
# More information about the data types
df.info()
Results:

head():

         date  number_of_strikes center_point_geom
0  2018-01-03                194     POINT(-75 27)
1  2018-01-03                 41   POINT(-78.4 29)
2  2018-01-03                 33   POINT(-73.9 27)
3  2018-01-03                 38   POINT(-73.8 27)
4  2018-01-03                 92     POINT(-79 28)

.shape:

(3401012, 3)

info():

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3401012 entries, 0 to 3401011
Data columns (total 3 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   date               object
 1   number_of_strikes  int64 
 2   center_point_geom  object
dtypes: int64(1), object(2)
memory usage: 77.8+ MB
  
Summary

Preliminary analysis using .shape reveals the dataset contains over 3.4 million observations across three variables, representing a large-scale, high-resolution record of lightning activity.

head() provides an initial view of the dataset structure, showing temporal data (date), strike counts (number_of_strikes), and geographic coordinates (center_point_geom), giving the data clear time and spatial dimensions.

info() provides further validation, confirming that the dataset is structurally consistent, with data types (integer and object fields) appropriate for the focus of this project and no immediate issues, such as NULL values, affecting usability.

With this project focused on monthly lightning strikes, the dataset appears well suited to examining temporal and geographic trends in lightning activity.
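The absence of missing values suggested by info() can also be checked directly with isna(). A minimal sketch, using an illustrative two-row frame in the same shape as the dataset (the real check would run on the full df):

```python
import pandas as pd

# Illustrative stand-in mirroring the dataset's three columns
df = pd.DataFrame({
    "date": ["2018-01-03", "2018-01-03"],
    "number_of_strikes": [194, 41],
    "center_point_geom": ["POINT(-75 27)", "POINT(-78.4 29)"],
})

# Per-column count of missing values; all zeros means no NULLs
null_counts = df.isna().sum()
print(null_counts)
```

An all-zero result confirms that no imputation or row dropping is needed before aggregation.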

Data Preparation

The date column is converted from string to datetime format using pd.to_datetime(), enabling time-based operations.

A new month feature is extracted from the newly converted datetime column using .dt.month, enabling aggregated analysis.

View Code
# Convert date column to proper datetime format
df['date'] = pd.to_datetime(df['date'])

# Create new 'month' column
df['month'] = df['date'].dt.month
df.head()
  
Results:

Data after datetime conversion and month feature extraction:

         date  number_of_strikes center_point_geom  month
0  2018-01-03                194     POINT(-75 27)      1
1  2018-01-03                 41   POINT(-78.4 29)      1
2  2018-01-03                 33   POINT(-73.9 27)      1
3  2018-01-03                 38   POINT(-73.8 27)      1
4  2018-01-03                 92     POINT(-79 28)      1
Summary

Data preparation begins with the date column, originally stored as a string (object) type. Because temporal analysis requires a proper time format, this column is converted to datetime format using pd.to_datetime(), enabling time-based operations and cleaner feature extraction.

A new month feature is engineered from the datetime column using .dt.month, which returns an integer Series of month numbers and allows lightning strike activity to be aggregated at the monthly level. At this phase of the analysis, the dataset has been enriched with a temporal feature that supports trend analysis and visualization.
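The conversion and extraction can be sketched on an illustrative two-row sample (the dates mirror rows seen in the real data, but the frame itself is a stand-in):

```python
import pandas as pd

# Small illustrative sample in the same shape as the dataset
df = pd.DataFrame({"date": ["2018-01-03", "2018-08-29"],
                   "number_of_strikes": [194, 1070457]})

# String dates become datetime64, unlocking the .dt accessor
df["date"] = pd.to_datetime(df["date"])

# .dt.month yields an integer Series of month numbers (1-12)
df["month"] = df["date"].dt.month
print(df["month"].tolist())  # [1, 8]
```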

Daily Strike Analysis

At this phase of the analysis, the 10 days of 2018 with the highest number of lightning strikes are calculated using the pandas groupby(), sum(), and sort_values() functions.

groupby() on the date column combines all rows with the same date into a single group.

sum() totals the numeric columns; center_point_geom (a string object) is excluded, as it is not summable.

sort_values() orders the daily totals in descending order for a clearer view of the highest data points.

View Code
# Calculate days with most lightning strikes (numeric columns only)
df.groupby('date').sum(numeric_only=True).sort_values('number_of_strikes', ascending=False).head(10)
Results:

Top 10 days with the highest number of lightning strikes (2018):

date        number_of_strikes
2018-08-29            1070457
2018-08-17             969774
2018-08-28             917199
2018-08-27             824589
2018-08-30             802170
2018-08-19             786225
2018-08-18             741180
2018-08-16             734475
2018-08-31             723624
2018-08-15             673455
Summary

Analysis shifts focus to daily lightning strike activity, grouping the dataset by date and calculating the total number of strikes per day using groupby() and sum(). Sorting with sort_values() in descending order highlights the 10 days with the most intense lightning activity.

One thing to note is that sum() was used instead of count(): the former correctly aggregates the full number of strikes recorded each day, whereas count() would only return how many rows exist for each date, not the number of strikes occurring in that period.
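The sum()-versus-count() distinction can be seen on a tiny illustrative frame, where two grid points report strikes on the same date:

```python
import pandas as pd

# Two rows sharing one date: distinct grid points recorded the same day
df = pd.DataFrame({"date": ["2018-01-03", "2018-01-03"],
                   "number_of_strikes": [194, 41]})

by_day = df.groupby("date")["number_of_strikes"]
total = by_day.sum()    # total strikes on 2018-01-03: 194 + 41 = 235
rows = by_day.count()   # rows for 2018-01-03: 2

print(int(total.iloc[0]), int(rows.iloc[0]))  # 235 2
```

count() answers "how many records exist for this date," while sum() answers "how many strikes occurred," which is the quantity this analysis needs.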

Monthly Trend Analysis

groupby() and sum() aggregate lightning strike activity by month.

dt.month_name() improves readability, with str.slice used to omit text beyond the third letter.

View Code
# Strikes per month
df.groupby('month')['number_of_strikes'].sum().sort_values(ascending=False).head(12)

# new 'month_txt' column
df['month_txt'] = df['date'].dt.month_name().str.slice(stop=3)
df.head()
Results:

Total lightning strikes by month (2018):

month  number_of_strikes
8               15525255
7                8320400
6                6445083
5                4166726
9                3018336
2                2071315
4                1524339
10               1093962
1                 860045
3                 854168
11                409263
12                312097

Data with month labels:

         date  number_of_strikes center_point_geom  month month_txt
0  2018-01-03                194     POINT(-75 27)      1       Jan
1  2018-01-03                 41   POINT(-78.4 29)      1       Jan
2  2018-01-03                 33   POINT(-73.9 27)      1       Jan
3  2018-01-03                 38   POINT(-73.8 27)      1       Jan
4  2018-01-03                 92     POINT(-79 28)      1       Jan
Summary

For the focus of this project, broader temporal trends beyond daily fluctuations were analyzed by aggregating total strike counts by month using groupby() and sum().

Numeric month values were also converted to abbreviated text labels to improve readability in analysis and visualization.

Overall, results from this approach reveal a clear seasonal pattern, with peak lightning activity occurring during the summer, specifically the months of June, July, and August.
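The label conversion itself is a one-liner; a minimal sketch on two illustrative dates (chaining month_name() with str.slice as in the project code):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2018-01-03", "2018-08-29"]))

# Full month name ("January", "August"), then keep the first three letters
abbrev = dates.dt.month_name().str.slice(stop=3)
print(abbrev.tolist())  # ['Jan', 'Aug']
```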

Visualization

A dataframe called df_by_month is created, providing the month number, month label, and total number of strikes for each month to support plotting.

View Code
# Dataframe for plotting
df_by_month = (df.groupby(['month', 'month_txt'])['number_of_strikes']
                 .sum()
                 .reset_index()
                 .sort_values('month'))
df_by_month

# Plotting
plt.bar(x=df_by_month['month_txt'], height=df_by_month['number_of_strikes'], label="Number of strikes")
plt.xlabel("Months (2018)")
plt.ylabel("Number of lightning strikes")
plt.title("Number of lightning strikes in 2018 by month")
plt.legend()
plt.show()
Results:

Monthly lightning strike dataset used for visualization:

    month month_txt  number_of_strikes
0       1       Jan             860045
1       2       Feb            2071315
2       3       Mar             854168
3       4       Apr            1524339
4       5       May            4166726
5       6       Jun            6445083
6       7       Jul            8320400
7       8       Aug           15525255
8       9       Sep            3018336
9      10       Oct            1093962
10     11       Nov             409263
11     12       Dec             312097

Monthly Lightning Strike Visualization:

[Bar chart: number of lightning strikes in 2018 by month]
Summary

The dataframe from this analysis summarizes lightning strike counts for each month of 2018; the numeric month values, combined with abbreviated labels, keep the visualization both readable and correctly ordered.

This bar chart reveals strong seasonal trends, with lightning activity increasing through spring, peaking sharply in August, and then declining through the winter, illustrating clear concentrations of strikes during the summer.

Project Overview

This project analyzes over 3.4 million lightning strike records collected by the National Oceanic and Atmospheric Administration (NOAA) in 2018, with the goal of identifying temporal patterns and seasonal trends in lightning activity. Using Python in the Jupyter notebook environment, the dataset was cleaned, transformed, aggregated, and visualized to examine strike frequency at both daily and monthly levels. Overall, this analysis provided a high-quality snapshot of how lightning activity varied across 2018.

Because of the sheer size of the file (102 MB), the full .csv could not be provided in this project's GitHub repository. Instead, the file was truncated to approximately its first 20 MB of data, and that sample was provided. Although the exact results are not reproducible from the sample, the technical implementation is.
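A truncation like this can be sketched with pandas' nrows parameter, which keeps the header plus the first N data rows, so the smaller file stays structurally identical to the original. The in-memory CSV and row limit below are illustrative stand-ins, not the actual 102 MB file:

```python
import io
import pandas as pd

# Illustrative stand-in for the full CSV (the real file is ~102 MB)
full_csv = io.StringIO(
    "date,number_of_strikes,center_point_geom\n"
    "2018-01-03,194,POINT(-75 27)\n"
    "2018-01-03,41,POINT(-78.4 29)\n"
    "2018-01-03,33,POINT(-73.9 27)\n"
)

# nrows keeps the header plus the first N data rows
sample = pd.read_csv(full_csv, nrows=2)
truncated = sample.to_csv(index=False)
print(truncated)
```

Writing `sample` back out with to_csv(index=False) yields a valid, smaller CSV with the same columns, which is what allows the notebook's code to run unchanged against the sample.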