
NOAA Lightning Strike Analysis (EDA Structuring)

by Alfred Rico


Project Background

The National Oceanic and Atmospheric Administration (NOAA) collects lightning strike data using satellite-based systems capable of detecting both cloud-to-ground and in-cloud activity. These datasets include detailed records of strike frequency, location, and timing.

In this project, multi-year NOAA data was structured and integrated to support consistent analysis. Feature engineering was used to extract temporal components such as weekday, week, month, and year, enabling comparisons across time. This allowed for the analysis of spatial concentration, frequency of lightning by location, and variations in strike patterns across days and seasons. Monthly strike percentages were also calculated to highlight seasonal trends and periods of increased lightning activity.

Project Goal

The purpose of this project is to structure and integrate multi-year NOAA lightning strike data to enable analysis of spatial, temporal, and seasonal patterns and generate actionable insights from the data.

Technical Implementation

Data Loading & Inspection

For this project, two datasets are used: a smaller file containing 2018 data and a larger file containing 2016-2017 data. The smaller file can be found in the Github repo (linked above), while the larger one exceeds the repository's file size restrictions; converting it to Parquet or compressing it did not bring it under the limit either, unfortunately. As with other projects, part of the second dataset can be truncated for demonstrative purposes, though the results will differ.

lightning_strikes_dataset_1.csv is loaded first, along with the necessary packages and libraries.

View Code
# Import
import pandas as pd
import numpy as np
import seaborn as sns
import datetime
from matplotlib import pyplot as plt

# Load
df = pd.read_csv('lightning_strikes_dataset_1.csv')
df.head()
Results:
date        number_of_strikes  center_point_geom
2018-01-03  194                POINT(-75 27)
2018-01-03  41                 POINT(-78.4 29)
2018-01-03  33                 POINT(-73.9 27)
2018-01-03  38                 POINT(-73.8 27)
2018-01-03  92                 POINT(-79 28)

Similar to other projects, this dataset has three columns: date, number_of_strikes, and center_point_geom. The date column will be converted into a usable datetime format, and df.info() can be used to check data types and confirm the conversion succeeded.

.shape runs a quick check on the shape of the dataset, followed by .drop_duplicates().shape to drop any duplicate rows and see whether the two outputs match. Since both match, it is established that there is at most one row for each date, area, and number of strikes.

View Code
# Datetime conversion
df['date'] = pd.to_datetime(df['date'])

# Check
df.shape

# Check for duplicates
df.drop_duplicates().shape
Results:

df.shape

(3401012, 3)

df.drop_duplicates().shape

(3401012, 3)
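
The df.info() dtype check mentioned above can be sketched on a couple of synthetic rows (the values here are illustrative, not from the NOAA file):

```python
import pandas as pd

# Small synthetic frame mirroring the dataset's three columns
sample = pd.DataFrame({
    'date': ['2018-01-03', '2018-01-04'],
    'number_of_strikes': [194, 41],
    'center_point_geom': ['POINT(-75 27)', 'POINT(-78.4 29)'],
})

sample['date'] = pd.to_datetime(sample['date'])

# df.info() prints the dtypes; the same fact can be asserted directly
sample.info()
assert sample['date'].dtype == 'datetime64[ns]'
```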

Summary

Two NOAA lightning strike datasets are used: a smaller 2018 file (included in the repo) and a larger 2016–2017 file (excluded due to size limits). The initial dataset is loaded and inspected, revealing three fields (date, number of strikes, and location). The date column is converted to datetime, and dataset integrity is verified by checking shape and confirming no duplicate records exist.

Geographic Analysis

As a launch point for this analysis, .sort_values() identifies the locations with the highest number of lightning strikes in a single day, in descending order.

This surfaces the most extreme events in the dataset and provides a clear starting point for understanding where lightning strikes are most concentrated.

View Code
# Sort by number of strikes (descending)
df.sort_values(by='number_of_strikes', ascending=False).head(10)
Results:
index   date        number_of_strikes  center_point_geom
302758  2018-08-20  2211               POINT(-92.5 35.5)
278383  2018-08-16  2142               POINT(-96.1 36.1)
280830  2018-08-17  2061               POINT(-90.2 36.1)
280453  2018-08-17  2031               POINT(-89.9 35.9)
278382  2018-08-16  1902               POINT(-96.2 36.1)
11517   2018-02-10  1899               POINT(-95.5 28.1)
277506  2018-08-16  1878               POINT(-89.7 31.5)
24906   2018-02-25  1833               POINT(-98.7 28.9)
284320  2018-08-17  1767               POINT(-90.1 36)
24825   2018-02-25  1741               POINT(-98 29)
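
As a design note, the same top-n rows can be obtained with nlargest(), which avoids a full sort; a sketch on synthetic counts (values here are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'number_of_strikes': [2211, 41, 1899, 33, 2142]})

by_sort = toy.sort_values(by='number_of_strikes', ascending=False).head(3)
by_nlargest = toy.nlargest(3, 'number_of_strikes')

# Both return the same top rows, in the same descending order
assert by_sort.equals(by_nlargest)
```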

To get the locations with the most days with at least one lightning strike, the value_counts() function is used on the center_point_geom column.

If each row represents one day at one location, then counting how many times each location occurs in the data reveals its number of days with at least one lightning strike.

By default, value_counts() sorts in descending order, revealing an interesting relationship between activity and geographic proximity.

View Code
# Locations appearing most in the dataset
df.center_point_geom.value_counts()
Results:
center_point_geom    count
POINT(-81.5 22.5)    108
POINT(-84.1 22.4)    108
POINT(-82.7 22.9)    107
POINT(-82.5 22.9)    107
POINT(-84.2 22.3)    106
...
POINT(-130.2 47.4)   1
POINT(-60.4 44.5)    1

Showing truncated results (full dataset contains 170,855 rows)
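
The row-counting logic behind this can be verified on a tiny synthetic frame (the coordinates below are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'date': pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-01']),
    'center_point_geom': ['POINT(-81.5 22.5)', 'POINT(-81.5 22.5)', 'POINT(-60 44)'],
})

# Each row is one (date, location) pair, so the occurrence count per
# location equals its number of days with at least one strike
days_with_strikes = toy['center_point_geom'].value_counts()
assert days_with_strikes['POINT(-81.5 22.5)'] == 2
assert days_with_strikes['POINT(-60 44)'] == 1
```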

Using value_counts() again, but limited to the top 20 results, the columns are renamed and a color gradient is applied.

View Code
# Top 20
(df.center_point_geom.value_counts()[:20]
    .rename_axis('unique_values')
    .reset_index(name='counts')
    .style.background_gradient())
Results:
#   unique_values      counts
0   POINT(-81.5 22.5)  108
1   POINT(-84.1 22.4)  108
2   POINT(-82.7 22.9)  107
3   POINT(-82.5 22.9)  107
4   POINT(-84.2 22.3)  106
5   POINT(-82.5 22.8)  106
6   POINT(-76 20.5)    105
7   POINT(-75.9 20.4)  105
8   POINT(-82.2 22.9)  104
9   POINT(-78 18.2)    104
10  POINT(-83.9 22.5)  103
11  POINT(-78 18.3)    102
12  POINT(-82 22.4)    102
13  POINT(-82 22.8)    102
14  POINT(-82.3 22.9)  102
15  POINT(-84 22.4)    102
16  POINT(-75.5 20.6)  101
17  POINT(-82 22.3)    101
18  POINT(-78.2 18.3)  101
19  POINT(-84 22.5)    101

Summary

The analysis begins by identifying the locations with the highest number of lightning strikes in a single day, which surfaces the most extreme events and provides a clear starting point for understanding geographic concentration. From there, frequency is examined by counting how often each location appears in the dataset, revealing which areas experience lightning most consistently over time. Focusing on the top locations and applying a gradient helps highlight relative intensity, making it easier to compare and identify areas with the highest overall lightning activity.

A key takeaway is that lightning activity appears to be geographically clustered, with the highest-frequency locations occurring in close proximity, suggesting concentrated regions of recurring lightning events rather than an even distribution.

Temporal Analysis

To find out which days of the week had higher concentrations, lightning strikes are categorized by day of the week. Using the date column, converted to datetime format earlier, a week column is extracted.

dt.isocalendar() is a function designed for use on a pandas series, returning in this case a new dataframe with year, week, and day columns. Appending .week extracts only the week number for use in this phase.

dt.day_name() is also used to create a weekday column, which provides the text name of the day for any given datetime-formatted date (useful for visualization).

View Code
# Create two new columns
df['week'] = df.date.dt.isocalendar().week
df['weekday'] = df.date.dt.day_name()
df.head()
Results:
index  date        number_of_strikes  center_point_geom  week  weekday
0      2018-01-03  194                POINT(-75 27)      1     Wednesday
1      2018-01-03  41                 POINT(-78.4 29)    1     Wednesday
2      2018-01-03  33                 POINT(-73.9 27)    1     Wednesday
3      2018-01-03  38                 POINT(-73.8 27)    1     Wednesday
4      2018-01-03  92                 POINT(-79 28)      1     Wednesday
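
As a sanity check, dt.isocalendar() and dt.day_name() can be exercised on a single known date; 2018-01-03 fell in ISO week 1, on a Wednesday:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2018-01-03']))

iso = dates.dt.isocalendar()          # DataFrame with year, week, day columns
assert list(iso.columns) == ['year', 'week', 'day']
assert iso.loc[0, 'week'] == 1
assert dates.dt.day_name()[0] == 'Wednesday'
```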

The mean number of lightning strikes for each weekday of the year is calculated using groupby() to get an idea of what the distribution will look like.

View Code
# Mean number of strikes per weekday of the year
df[['weekday', 'number_of_strikes']].groupby(['weekday']).mean()
Results:
weekday    number_of_strikes
Friday     13.349972
Monday     13.152804
Saturday   12.732694
Sunday     12.324717
Thursday   13.240594
Tuesday    13.813599
Wednesday  13.224568

Interestingly, Saturday and Sunday have fewer lightning strikes on average than the other five days of the week. Pursuing this further, a box plot is made showing the distribution across weekdays.

For this visualization, outliers are ignored by setting showfliers to False, keeping the focus on median values without possible skewing from outliers.

View Code
# Define order of days (plot)
weekday_order = ['Monday','Tuesday', 'Wednesday', 'Thursday','Friday','Saturday','Sunday']

# Plot
g = sns.boxplot(data=df,
                x='weekday',
                y='number_of_strikes',
                order=weekday_order,
                showfliers=False
               );
g.set_title('Lightning distribution per weekday (2018)');
Results:
[Boxplot: lightning distribution per weekday (2018)]
Summary

Although the boxplot further highlights the lower average lightning counts on Saturday and Sunday, the median remains consistent across all days, suggesting no meaningful shift in typical activity. The difference in averages is likely driven by sampling or temporal effects. This is consistent with what df.center_point_geom.value_counts() highlights: most activity is clustered in close proximity.

There is an aerosol theory going around that less traffic, fewer people at work, and less industrial release have a causal effect. I doubt it.

Seasonal Trends (Multi-Year)

To analyze monthly lightning strikes from 2016-2018, the second structured dataset is brought in, and the percentage of each year's total lightning strikes that occurred in a given month is calculated.

Like the first dataset, the date column needs to be converted.

View Code
# Load dataset
df_2 = pd.read_csv('lightning_strikes_dataset_2.csv')

# Datetime conversion
df_2['date'] = pd.to_datetime(df_2['date'])

df_2.head()
Results:
date        number_of_strikes  center_point_geom
2016-01-04  55                 POINT(-83.2 21.1)
2016-01-04  33                 POINT(-83.1 21.1)
2016-01-05  46                 POINT(-77.5 22.1)
2016-01-05  28                 POINT(-76.8 22.3)
2016-01-05  28                 POINT(-77 22.1)

The 2016-2017 dataframe is combined with the 2018 dataframe using concat() so that data can be aggregated across all years.

The week and weekday columns created earlier are dropped for this operation only, to simplify the results.

View Code
# Create new 2016-2018 dataframe
union_df = pd.concat([df.drop(['weekday', 'week'], axis=1), df_2], ignore_index=True)
union_df.head()
Results:
date        number_of_strikes  center_point_geom
2018-01-03  194                POINT(-75 27)
2018-01-03  41                 POINT(-78.4 29)
2018-01-03  33                 POINT(-73.9 27)
2018-01-03  38                 POINT(-73.8 27)
2018-01-03  92                 POINT(-79 28)
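
As a quick check of the concat() behavior, ignore_index=True renumbers the stacked rows so the combined frame has no duplicate index labels; a sketch with toy frames (illustrative values):

```python
import pandas as pd

a = pd.DataFrame({'number_of_strikes': [194, 41]})
b = pd.DataFrame({'number_of_strikes': [55, 33]})

combined = pd.concat([a, b], ignore_index=True)

assert len(combined) == len(a) + len(b)
assert list(combined.index) == [0, 1, 2, 3]   # index reset, no duplicates
```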

Three new columns need to be created, year, month, and month_txt (month name), to assist with labeling bars on the bar plot.

View Code
# Add 3 new columns
union_df['year'] = union_df.date.dt.year
union_df['month'] = union_df.date.dt.month
union_df['month_txt'] = union_df.date.dt.month_name()
union_df.head() 
Results:
date        number_of_strikes  center_point_geom  year  month  month_txt
2018-01-03  194                POINT(-75 27)      2018  1      January
2018-01-03  41                 POINT(-78.4 29)    2018  1      January
2018-01-03  33                 POINT(-73.9 27)    2018  1      January
2018-01-03  38                 POINT(-73.8 27)    2018  1      January
2018-01-03  92                 POINT(-79 28)      2018  1      January
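
The three datetime accessors can be sanity-checked on a single known date:

```python
import pandas as pd

d = pd.Series(pd.to_datetime(['2018-01-03']))

assert d.dt.year[0] == 2018
assert d.dt.month[0] == 1
assert d.dt.month_name()[0] == 'January'
```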

With this new dataframe, the total number of strikes per year can be computed.

View Code
union_df[['year', 'number_of_strikes']].groupby(['year']).sum()
Results:
year  number_of_strikes
2016  41582229
2017  35095195
2018  44600989

A new dataframe called lightning_by_month is created, totaling the lightning strikes that occurred in a given month for each year; it will feed the percentage calculation later.

To avoid errors with agg(), 'sum' is passed as a string, which is the form pandas expects for built-in aggregations ('sum', 'mean', 'max', etc.).

View Code
# Lightning by month
lightning_by_month = union_df.groupby(['month_txt', 'year']).agg(
    number_of_strikes = pd.NamedAgg(column = 'number_of_strikes', aggfunc='sum')
    ).reset_index()

lightning_by_month.head()
Results:
month_txt  year  number_of_strikes
April      2016  2636427
April      2017  3819075
April      2018  1524339
August     2016  7250442
August     2017  6021702

With the monthly strike totals in hand, agg() is used to recompute the yearly totals from before as a dataframe that can be merged downstream.

View Code
# lightning by year
lightning_by_year = union_df.groupby(['year']).agg(
    year_strikes = pd.NamedAgg(column='number_of_strikes', aggfunc='sum')
).reset_index()

lightning_by_year.head()
Results:
year  year_strikes
2016  41582229
2017  35095195
2018  44600989

With both lightning_by_month and lightning_by_year, a new dataframe called percentage_lightning can be created.

merge() offers this capability, joining lightning_by_month and lightning_by_year into a single dataframe on the year column. Wherever year contains the same value in both dataframes, a row is created in the merged result carrying all other columns.

The merge adds a year_strikes column to percentage_lightning, representing the total number of strikes for each year.

View Code
# Combine
percentage_lightning = lightning_by_month.merge(lightning_by_year, on='year')
percentage_lightning.head()
Results:
month_txt  year  number_of_strikes  year_strikes
April      2016  2636427            41582229
April      2017  3819075            35095195
April      2018  1524339            44600989
August     2016  7250442            41582229
August     2017  6021702            35095195
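
The many-to-one behavior of merge() on year can be sketched with toy totals (the numbers below are illustrative, not from the dataset):

```python
import pandas as pd

monthly = pd.DataFrame({
    'month_txt': ['April', 'April', 'August'],
    'year': [2016, 2017, 2016],
    'number_of_strikes': [100, 150, 300],
})
yearly = pd.DataFrame({'year': [2016, 2017],
                       'year_strikes': [400, 150]})

merged = monthly.merge(yearly, on='year')

# Every monthly row picks up its year's total
assert len(merged) == 3
assert merged.loc[merged['year'] == 2016, 'year_strikes'].eq(400).all()
```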

To get the percentage of total lightning strikes that occurred during each month, the number_of_strikes column is divided by the year_strikes column and multiplied by 100.

View Code
percentage_lightning['percentage_lightning_per_month'] = (percentage_lightning.number_of_strikes/
                                                          percentage_lightning.year_strikes * 100.0)
percentage_lightning.head()
Results:
month_txt  year  number_of_strikes  year_strikes  percentage_lightning_per_month
April      2016  2636427            41582229      6.340273
April      2017  3819075            35095195      10.882045
April      2018  1524339            44600989      3.417725
August     2016  7250442            41582229      17.436396
August     2017  6021702            35095195      17.158195
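
As a design note, the merge step could also be avoided entirely with groupby().transform('sum'), which broadcasts each year's total back onto the monthly rows; a sketch with illustrative numbers:

```python
import pandas as pd

monthly = pd.DataFrame({
    'year': [2016, 2017, 2016],
    'number_of_strikes': [100, 150, 300],
})

# transform('sum') returns a per-year total aligned to the original rows
year_totals = monthly.groupby('year')['number_of_strikes'].transform('sum')
monthly['pct_of_year'] = monthly['number_of_strikes'] / year_totals * 100

assert list(monthly['pct_of_year']) == [25.0, 100.0, 75.0]
```

Both approaches give the same percentages; the merge route keeps an explicit year_strikes column around, which is handy for inspection.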

Now the percentages by month for each year can be plotted.

View Code
# Plot
plt.figure(figsize=(10,6));

month_order = ['January', 'February', 'March', 'April', 'May', 'June', 
               'July', 'August', 'September', 'October', 'November', 'December']

sns.barplot(
    data = percentage_lightning,
    x = 'month_txt',
    y = 'percentage_lightning_per_month',
    hue = 'year',
    order = month_order );
plt.xlabel("Month");
plt.ylabel("% of lightning strikes");
plt.title("% of lightning strikes each Month (2016-2018)");
Results:
[Bar plot: % of lightning strikes each month (2016-2018)]
Summary

To look at how lightning activity changes over time, data from 2016-2018 is combined so everything can be analyzed in one place. Since both datasets follow the same format, they can be stacked together after converting the date column and dropping a couple of extra fields not needed for this part. From there, breaking the data down by year and month makes it easier to compare how activity shifts throughout the calendar.

Once totals are calculated, the focus shifts to how much each month contributes relative to the full year. Instead of just looking at raw counts, converting everything into percentages gives a clearer picture of seasonality and makes comparisons across years more consistent. Merging monthly totals with yearly totals allows that percentage calculation to happen cleanly in one place.

When plotted, a clear pattern starts to show. Lightning activity is not evenly spread out as it builds up throughout the year and peaks heavily in late summer. August stands out the most, especially in 2018 where it accounts for a significant share of total strikes, reinforcing that lightning activity tends to concentrate within specific seasonal windows rather than being evenly distributed.

Project Overview

This project brings together multiple years of NOAA lightning strike data to look at how activity varies across both location and time. After combining and structuring the datasets, a few consistent patterns start to emerge. Lightning activity is not evenly distributed. It tends to concentrate in certain areas, and a relatively small number of locations show up repeatedly with higher strike frequency, suggesting a clear clustering effect rather than random spread.

Across weekdays, there is not much variation. Activity stays fairly stable, with only a slight decrease on weekends, which does not meaningfully change the overall distribution. The more noticeable differences show up over longer periods. Lightning activity shifts throughout the year, with certain months consistently seeing higher volumes. These seasonal patterns repeat across multiple years, pointing to underlying environmental or atmospheric drivers rather than one-off fluctuations.

The seasonal consistency suggests that lightning activity can be anticipated to some extent, while the geographic concentration highlights areas where activity is persistently higher. Both of these patterns provide a starting point for more targeted monitoring, planning, or deeper analysis into what is driving the behavior.