
NOAA Lightning Strike Analysis (EDA/Feature Engineering with Python)

By Alfred Rico

Project Background

The NOAA is a U.S. government agency responsible for monitoring weather, oceans, and atmospheric activity, using a combination of advanced technologies to detect, track, and record environmental events. Although most lightning does not make contact with the ground, GOES satellites equipped with GLM (Geostationary Lightning Mapper) instrumentation are able to detect total lightning activity, including both cloud-to-ground and in-cloud events.

Analysis of this data has revealed notable patterns in lightning behavior, leading to intriguing discoveries such as its higher frequency in tropical regions. Lightning activity has also proven to be a valuable indicator of storm intensity and development: spikes in lightning can signal tornado formation, rapid storm intensification, and hail-producing storms, among other events.

Project Goal

Using Python in the Jupyter notebook environment, 2018 lightning strike data collected by the National Oceanic and Atmospheric Administration (NOAA) is examined: the total number of strikes is calculated for each month and plotted on a bar graph.

Technical Implementation

Python Packages

Importing Python packages

  • NumPy – Used primarily for numerical computation and a wide range of mathematical operations, including working with arrays and vectorized functions.
  • Pandas – Used for data manipulation, structuring, and handling, particularly with tabular data through DataFrames.
  • datetime – Used for working with dates and time data, allowing for parsing, formatting, and performing time-based calculations.
  • matplotlib.pyplot – Used for data visualization, enabling the creation of charts such as line graphs, bar charts, and histograms for exploratory analysis.
View Code
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
Preliminary Data Inspection

Importing the data set

df = pd.read_csv("lightning_strikes_1.csv")

Initial data inspection, validation, and exploration performed using head(), .shape, and info():

  • head() – Examines a sample of records to understand dataset structure and feature layout.
  • .shape – Returns a tuple (rows, columns) describing dataset dimensions.
  • info() – Assesses data types, missing values, and overall dataset integrity.

View Code
# Importing dataset
df = pd.read_csv("lightning_strikes_1.csv")  

# Inspect the first 5 rows
df.head()  
# Rows/ column counts
df.shape  
# More information about the data types
df.info()
Results:

head():

         date  number_of_strikes center_point_geom
0  2018-01-03                194     POINT(-75 27)
1  2018-01-03                 41   POINT(-78.4 29)
2  2018-01-03                 33   POINT(-73.9 27)
3  2018-01-03                 38   POINT(-73.8 27)
4  2018-01-03                 92     POINT(-79 28)

.shape:

(3401012, 3)

info():

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3401012 entries, 0 to 3401011
Data columns (total 3 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   date               object
 1   number_of_strikes  int64 
 2   center_point_geom  object
dtypes: int64(1), object(2)
memory usage: 77.8+ MB
  
Summary

Preliminary analysis using .shape reveals the dataset contains over 3.4 million observations across three variables, representing a large-scale, high-resolution record of lightning activity.

head() provides an initial view of the dataset structure, showing temporal data (date), strike counts (number_of_strikes), and geographic coordinates (center_point_geom), giving the data clear time and spatial dimensions.

info() provides further validation, confirming that the dataset is structurally consistent, with data types (integer and object fields) appropriate for the focus of this project and no immediate issues, such as NULL values, affecting usability.

With this project focused on monthly lightning strikes, the dataset appears well suited to examining temporal and geographic trends in lightning activity.
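The absence of missing values suggested by info() can also be checked directly with isna(). A minimal sketch, using an illustrative two-row frame in the same shape as the dataset (the real check would run on the full df):

```python
import pandas as pd

# Illustrative stand-in mirroring the dataset's three columns
df = pd.DataFrame({
    "date": ["2018-01-03", "2018-01-03"],
    "number_of_strikes": [194, 41],
    "center_point_geom": ["POINT(-75 27)", "POINT(-78.4 29)"],
})

# Per-column count of missing values; all zeros means no NULLs
null_counts = df.isna().sum()
print(null_counts)
```

An all-zero result confirms that no imputation or row dropping is needed before aggregation.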

Data Preparation

The date column is converted from string to datetime format using pd.to_datetime(), enabling time-based operations.

A new month feature is extracted from the newly converted datetime column using .dt.month, enabling aggregated analysis.

View Code
# Convert date column to proper datetime format
df['date'] = pd.to_datetime(df['date'])

# Create new 'month' column
df['month'] = df['date'].dt.month
df.head()
  
Results:

Data after datetime conversion and month feature extraction:

         date  number_of_strikes center_point_geom  month
0  2018-01-03                194     POINT(-75 27)      1
1  2018-01-03                 41   POINT(-78.4 29)      1
2  2018-01-03                 33   POINT(-73.9 27)      1
3  2018-01-03                 38   POINT(-73.8 27)      1
4  2018-01-03                 92     POINT(-79 28)      1
Summary

Data preparation begins with the date column, originally stored as a string (object) type. Because temporal analysis requires a proper time format, this column is converted to datetime format using pd.to_datetime(), enabling time-based operations and cleaner feature extraction.

A new month feature is engineered from the datetime column using .dt.month, which returns an integer Series of month numbers and allows lightning strike activity to be aggregated at the monthly level. At this phase of the analysis, the dataset has been enriched with a temporal feature that supports trend analysis and visualization.
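The conversion and extraction can be sketched on an illustrative two-row sample (the dates mirror rows seen in the real data, but the frame itself is a stand-in):

```python
import pandas as pd

# Small illustrative sample in the same shape as the dataset
df = pd.DataFrame({"date": ["2018-01-03", "2018-08-29"],
                   "number_of_strikes": [194, 1070457]})

# String dates become datetime64, unlocking the .dt accessor
df["date"] = pd.to_datetime(df["date"])

# .dt.month yields an integer Series of month numbers (1-12)
df["month"] = df["date"].dt.month
print(df["month"].tolist())  # [1, 8]
```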

Daily Strike Analysis

At this phase of the analysis, the 10 days of 2018 with the highest number of lightning strikes are calculated using the pandas groupby(), sum(), and sort_values() functions.

groupby() on the date column combines all rows with the same date into a single group.

sum() totals the numeric columns; center_point_geom (a string object) is excluded, as it is not summable.

sort_values() orders the daily totals in descending order for a clearer view of the highest data points.

View Code
# Calculate days with most lightning strikes (numeric columns only)
df.groupby('date').sum(numeric_only=True).sort_values('number_of_strikes', ascending=False).head(10)
Results:

Top 10 days with the highest number of lightning strikes (2018):

date        number_of_strikes
2018-08-29            1070457
2018-08-17             969774
2018-08-28             917199
2018-08-27             824589
2018-08-30             802170
2018-08-19             786225
2018-08-18             741180
2018-08-16             734475
2018-08-31             723624
2018-08-15             673455
Summary

Analysis shifts focus to daily lightning strike activity, grouping the dataset by date and calculating the total number of strikes per day using groupby() and sum(). Sorting with sort_values() in descending order highlights the 10 days with the most intense lightning activity.

One thing to note is that sum() was used instead of count(): the former correctly aggregates the full number of strikes recorded each day, whereas count() would only return how many rows exist for each date, not the number of strikes occurring in that period.
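The sum()-versus-count() distinction can be seen on a tiny illustrative frame, where two grid points report strikes on the same date:

```python
import pandas as pd

# Two rows sharing one date: distinct grid points recorded the same day
df = pd.DataFrame({"date": ["2018-01-03", "2018-01-03"],
                   "number_of_strikes": [194, 41]})

by_day = df.groupby("date")["number_of_strikes"]
total = by_day.sum()    # total strikes on 2018-01-03: 194 + 41 = 235
rows = by_day.count()   # rows for 2018-01-03: 2

print(int(total.iloc[0]), int(rows.iloc[0]))  # 235 2
```

count() answers "how many records exist for this date," while sum() answers "how many strikes occurred," which is the quantity this analysis needs.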

Monthly Trend Analysis

groupby() and sum() aggregate lightning strike activity by month.

dt.month_name() improves readability, with str.slice used to omit text beyond the third letter.

View Code
# Strikes per month
df.groupby('month')['number_of_strikes'].sum().sort_values(ascending=False).head(12)

# new 'month_txt' column
df['month_txt'] = df['date'].dt.month_name().str.slice(stop=3)
df.head()
Results:

Total lightning strikes by month (2018):

month  number_of_strikes
8               15525255
7                8320400
6                6445083
5                4166726
9                3018336
2                2071315
4                1524339
10               1093962
1                 860045
3                 854168
11                409263
12                312097

Data with month labels:

         date  number_of_strikes center_point_geom  month month_txt
0  2018-01-03                194     POINT(-75 27)      1       Jan
1  2018-01-03                 41   POINT(-78.4 29)      1       Jan
2  2018-01-03                 33   POINT(-73.9 27)      1       Jan
3  2018-01-03                 38   POINT(-73.8 27)      1       Jan
4  2018-01-03                 92     POINT(-79 28)      1       Jan
Summary

For the focus of this project, broader temporal trends beyond daily fluctuations were analyzed by aggregating total strike counts by month using groupby() and sum().

Numeric month values were also converted to abbreviated text labels to improve readability in analysis and visualization.

Overall, results from this approach reveal a clear seasonal pattern, with peak lightning activity occurring during the summer, specifically the months of June, July, and August.
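The label conversion itself is a one-liner; a minimal sketch on two illustrative dates (chaining month_name() with str.slice as in the project code):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2018-01-03", "2018-08-29"]))

# Full month name ("January", "August"), then keep the first three letters
abbrev = dates.dt.month_name().str.slice(stop=3)
print(abbrev.tolist())  # ['Jan', 'Aug']
```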

Visualization

A dataframe called df_by_month is created, providing the month number, month label, and total number of strikes for each month to support plotting.

View Code
# Dataframe for plotting
df_by_month = (df.groupby(['month', 'month_txt'])['number_of_strikes']
                 .sum()
                 .reset_index()
                 .sort_values('month'))
df_by_month

# Plotting
plt.bar(x=df_by_month['month_txt'], height=df_by_month['number_of_strikes'], label="Number of strikes")
plt.xlabel("Months (2018)")
plt.ylabel("Number of lightning strikes")
plt.title("Number of lightning strikes in 2018 by month")
plt.legend()
plt.show()
Results:

Monthly lightning strike dataset used for visualization:

    month month_txt  number_of_strikes
0       1       Jan             860045
1       2       Feb            2071315
2       3       Mar             854168
3       4       Apr            1524339
4       5       May            4166726
5       6       Jun            6445083
6       7       Jul            8320400
7       8       Aug           15525255
8       9       Sep            3018336
9      10       Oct            1093962
10     11       Nov             409263
11     12       Dec             312097

Monthly Lightning Strike Visualization:

[Bar chart: number of lightning strikes in 2018 by month]
Summary

The dataframe from this analysis summarizes lightning strike counts for each month of 2018; the numeric month values, combined with abbreviated labels, keep the visualization both readable and correctly ordered.

This bar chart reveals strong seasonal trends, with lightning activity increasing through spring, peaking sharply in August, and then declining through the winter, illustrating clear concentrations of strikes during the summer.

Project Overview

This project analyzes over 3.4 million lightning strike records collected by the National Oceanic and Atmospheric Administration (NOAA) in 2018, with the goal of identifying temporal patterns and seasonal trends in lightning activity. Using Python in the Jupyter notebook environment, the dataset was cleaned, transformed, aggregated, and visualized to examine strike frequency at both daily and monthly levels. Overall, this analysis provided a high-quality snapshot of how lightning activity varied across 2018.

Because of the sheer size of the file (102 MB), the full .csv could not be provided in this project's GitHub repository. Instead, the file was truncated to approximately its first 20 MB of data, and that sample was provided. Although the exact results are not reproducible from the sample, the technical implementation is.
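A truncation like this can be sketched with pandas' nrows parameter, which keeps the header plus the first N data rows, so the smaller file stays structurally identical to the original. The in-memory CSV and row limit below are illustrative stand-ins, not the actual 102 MB file:

```python
import io
import pandas as pd

# Illustrative stand-in for the full CSV (the real file is ~102 MB)
full_csv = io.StringIO(
    "date,number_of_strikes,center_point_geom\n"
    "2018-01-03,194,POINT(-75 27)\n"
    "2018-01-03,41,POINT(-78.4 29)\n"
    "2018-01-03,33,POINT(-73.9 27)\n"
)

# nrows keeps the header plus the first N data rows
sample = pd.read_csv(full_csv, nrows=2)
truncated = sample.to_csv(index=False)
print(truncated)
```

Writing `sample` back out with to_csv(index=False) yields a valid, smaller CSV with the same columns, which is what allows the notebook's code to run unchanged against the sample.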