
NOAA Lightning Strike Analysis (EDA Structuring)

by Alfred Rico


Project Background

The National Oceanic and Atmospheric Administration (NOAA) collects lightning strike data using satellite-based systems capable of detecting both cloud-to-ground and in-cloud activity. These datasets include detailed records of strike frequency, location, and timing.

In this project, multi-year NOAA data was structured and integrated to support consistent analysis. Feature engineering was used to extract temporal components such as weekday, week, month, and year, enabling comparisons across time. This allowed for the analysis of spatial concentration, frequency of lightning by location, and variations in strike patterns across days and seasons. Monthly strike percentages were also calculated to highlight seasonal trends and periods of increased lightning activity.

Project Goal

The purpose of this project is to structure and integrate multi-year NOAA lightning strike data to enable analysis of spatial, temporal, and seasonal patterns and generate actionable insights from the data.

Technical Implementation

Data Loading & Inspection

For this project, two datasets are used: a smaller file containing 2018 data and a larger file containing 2016-2017 data. The smaller file can be found in the Github repo (linked above), while the larger one exceeds the repository's file size restrictions; converting it to Parquet or compressing it did not bring it under the limit either, unfortunately. As with other projects, part of the second dataset can be truncated for demonstrative purposes, though the results will differ.

lightning_strikes_dataset_1.csv is loaded first, along with the necessary packages and libraries.

View Code
# Import
import pandas as pd
import numpy as np
import seaborn as sns
import datetime
from matplotlib import pyplot as plt

# Load
df = pd.read_csv('lightning_strikes_dataset_1.csv')
df.head()
Results:
date        number_of_strikes  center_point_geom
2018-01-03  194                POINT(-75 27)
2018-01-03  41                 POINT(-78.4 29)
2018-01-03  33                 POINT(-73.9 27)
2018-01-03  38                 POINT(-73.8 27)
2018-01-03  92                 POINT(-79 28)

Similar to other projects, this dataset has three columns: date, number_of_strikes, and center_point_geom. The date column will be converted into a usable datetime format, and df.info() can be used to check data types and confirm the conversion succeeded.

.shape runs a quick check on the shape of the dataset, followed by .drop_duplicates().shape to drop any duplicate rows and see whether the two outputs match. Since both match, it is established that there is at most one row for each date, area, and number of strikes.

View Code
# Datetime conversion
df['date'] = pd.to_datetime(df['date'])

# Check
df.shape

# Check for duplicates
df.drop_duplicates().shape
Results:

df.shape

(3401012, 3)

df.drop_duplicates().shape

(3401012, 3)
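
The df.info() dtype check mentioned above can be sketched on a couple of synthetic rows (the values here are illustrative, not from the NOAA file):

```python
import pandas as pd

# Small synthetic frame mirroring the dataset's three columns
sample = pd.DataFrame({
    'date': ['2018-01-03', '2018-01-04'],
    'number_of_strikes': [194, 41],
    'center_point_geom': ['POINT(-75 27)', 'POINT(-78.4 29)'],
})

sample['date'] = pd.to_datetime(sample['date'])

# df.info() prints the dtypes; the same fact can be asserted directly
sample.info()
assert sample['date'].dtype == 'datetime64[ns]'
```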

Summary

Two NOAA lightning strike datasets are used: a smaller 2018 file (included in the repo) and a larger 2016–2017 file (excluded due to size limits). The initial dataset is loaded and inspected, revealing three fields (date, number of strikes, and location). The date column is converted to datetime, and dataset integrity is verified by checking shape and confirming no duplicate records exist.

Geographic Analysis

As a launch point for this analysis, .sort_values() identifies the locations with the highest number of lightning strikes in a single day, in descending order.

This surfaces the most extreme events in the dataset and provides a clear starting point for understanding where lightning strikes are most concentrated.

View Code
# Sort by number of strikes (descending)
df.sort_values(by='number_of_strikes', ascending=False).head(10)
Results:
index   date        number_of_strikes  center_point_geom
302758  2018-08-20  2211               POINT(-92.5 35.5)
278383  2018-08-16  2142               POINT(-96.1 36.1)
280830  2018-08-17  2061               POINT(-90.2 36.1)
280453  2018-08-17  2031               POINT(-89.9 35.9)
278382  2018-08-16  1902               POINT(-96.2 36.1)
11517   2018-02-10  1899               POINT(-95.5 28.1)
277506  2018-08-16  1878               POINT(-89.7 31.5)
24906   2018-02-25  1833               POINT(-98.7 28.9)
284320  2018-08-17  1767               POINT(-90.1 36)
24825   2018-02-25  1741               POINT(-98 29)
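
As a design note, the same top-n rows can be obtained with nlargest(), which avoids a full sort; a sketch on synthetic counts (values here are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'number_of_strikes': [2211, 41, 1899, 33, 2142]})

by_sort = toy.sort_values(by='number_of_strikes', ascending=False).head(3)
by_nlargest = toy.nlargest(3, 'number_of_strikes')

# Both return the same top rows, in the same descending order
assert by_sort.equals(by_nlargest)
```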

To get the locations with the most days with at least one lightning strike, the value_counts() function is used on the center_point_geom column.

If each row represents one day at one location, then counting how many times each location occurs in the data reveals its number of days with at least one lightning strike.

By default, value_counts() sorts in descending order, revealing an interesting relationship between activity and geographic proximity.

View Code
# Locations appearing most in the dataset
df.center_point_geom.value_counts()
Results:
center_point_geom    count
POINT(-81.5 22.5)    108
POINT(-84.1 22.4)    108
POINT(-82.7 22.9)    107
POINT(-82.5 22.9)    107
POINT(-84.2 22.3)    106
...
POINT(-130.2 47.4)   1
POINT(-60.4 44.5)    1

Showing truncated results (full dataset contains 170,855 rows)
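
The row-counting logic behind this can be verified on a tiny synthetic frame (the coordinates below are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'date': pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-01']),
    'center_point_geom': ['POINT(-81.5 22.5)', 'POINT(-81.5 22.5)', 'POINT(-60 44)'],
})

# Each row is one (date, location) pair, so the occurrence count per
# location equals its number of days with at least one strike
days_with_strikes = toy['center_point_geom'].value_counts()
assert days_with_strikes['POINT(-81.5 22.5)'] == 2
assert days_with_strikes['POINT(-60 44)'] == 1
```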

Using value_counts() again, but limited to the top 20 results, the columns are renamed and a color gradient is applied.

View Code
# Top 20
(df.center_point_geom.value_counts()[:20]
    .rename_axis('unique_values')
    .reset_index(name='counts')
    .style.background_gradient())
Results:
#   unique_values      counts
0   POINT(-81.5 22.5)  108
1   POINT(-84.1 22.4)  108
2   POINT(-82.7 22.9)  107
3   POINT(-82.5 22.9)  107
4   POINT(-84.2 22.3)  106
5   POINT(-82.5 22.8)  106
6   POINT(-76 20.5)    105
7   POINT(-75.9 20.4)  105
8   POINT(-82.2 22.9)  104
9   POINT(-78 18.2)    104
10  POINT(-83.9 22.5)  103
11  POINT(-78 18.3)    102
12  POINT(-82 22.4)    102
13  POINT(-82 22.8)    102
14  POINT(-82.3 22.9)  102
15  POINT(-84 22.4)    102
16  POINT(-75.5 20.6)  101
17  POINT(-82 22.3)    101
18  POINT(-78.2 18.3)  101
19  POINT(-84 22.5)    101

Summary

The analysis begins by identifying the locations with the highest number of lightning strikes in a single day, which surfaces the most extreme events and provides a clear starting point for understanding geographic concentration. From there, frequency is examined by counting how often each location appears in the dataset, revealing which areas experience lightning most consistently over time. Focusing on the top locations and applying a gradient helps highlight relative intensity, making it easier to compare and identify areas with the highest overall lightning activity.

A key takeaway is that lightning activity appears to be geographically clustered, with the highest-frequency locations occurring in close proximity, suggesting concentrated regions of recurring lightning events rather than an even distribution.

Temporal Analysis

To find out which days of the week had higher concentrations, lightning strikes are categorized by day of the week. Using the date column, converted to datetime format earlier, a week column is extracted.

dt.isocalendar() is a function designed for use on a pandas series, returning in this case a new dataframe with year, week, and day columns. Appending .week extracts only the week number for use in this phase.

dt.day_name() is also used to create a weekday column, which provides the text name of the day for any given datetime-formatted date (useful for visualization).

View Code
# Create two new columns
df['week'] = df.date.dt.isocalendar().week
df['weekday'] = df.date.dt.day_name()
df.head()
Results:
index  date        number_of_strikes  center_point_geom  week  weekday
0      2018-01-03  194                POINT(-75 27)      1     Wednesday
1      2018-01-03  41                 POINT(-78.4 29)    1     Wednesday
2      2018-01-03  33                 POINT(-73.9 27)    1     Wednesday
3      2018-01-03  38                 POINT(-73.8 27)    1     Wednesday
4      2018-01-03  92                 POINT(-79 28)      1     Wednesday
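
As a sanity check, dt.isocalendar() and dt.day_name() can be exercised on a single known date; 2018-01-03 fell in ISO week 1, on a Wednesday:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2018-01-03']))

iso = dates.dt.isocalendar()          # DataFrame with year, week, day columns
assert list(iso.columns) == ['year', 'week', 'day']
assert iso.loc[0, 'week'] == 1
assert dates.dt.day_name()[0] == 'Wednesday'
```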

The mean number of lightning strikes for each weekday of the year is calculated using groupby() to get an idea of what the distribution will look like.

View Code
# Mean number of strikes per weekday of the year
df[['weekday', 'number_of_strikes']].groupby(['weekday']).mean()
Results:
weekday    number_of_strikes
Friday     13.349972
Monday     13.152804
Saturday   12.732694
Sunday     12.324717
Thursday   13.240594
Tuesday    13.813599
Wednesday  13.224568

Interestingly, Saturday and Sunday have fewer lightning strikes on average than the other five days of the week. Pursuing this further, a box plot is made showing the distribution across weekdays.

For this visualization, outliers are ignored by setting showfliers to False, keeping the focus on median values without possible skewing from outliers.

View Code
# Define order of days (plot)
weekday_order = ['Monday','Tuesday', 'Wednesday', 'Thursday','Friday','Saturday','Sunday']

# Plot
g = sns.boxplot(data=df,
                x='weekday',
                y='number_of_strikes',
                order=weekday_order,
                showfliers=False
               );
g.set_title('Lightning distribution per weekday (2018)');
Results:
[Boxplot: lightning distribution per weekday (2018)]
Summary

Although the boxplot further highlights the lower average lightning counts on Saturday and Sunday, the median remains consistent across all days, suggesting no meaningful shift in typical activity. The difference in averages is likely driven by sampling or temporal effects. This is consistent with what df.center_point_geom.value_counts() highlights: most activity is clustered in close proximity.

There is an aerosol theory going around that less traffic, fewer people at work, and less industrial release have a causal effect. I doubt it.

Seasonal Trends (Multi-Year)

To analyze monthly lightning strikes from 2016-2018, the second structured dataset is brought in, and the percentage of each year's total lightning strikes that occurred in a given month is calculated.

Like the first dataset, the date column needs to be converted.

View Code
# Load dataset
df_2 = pd.read_csv('lightning_strikes_dataset_2.csv')

# Datetime conversion
df_2['date'] = pd.to_datetime(df_2['date'])

df_2.head()
Results:
date        number_of_strikes  center_point_geom
2016-01-04  55                 POINT(-83.2 21.1)
2016-01-04  33                 POINT(-83.1 21.1)
2016-01-05  46                 POINT(-77.5 22.1)
2016-01-05  28                 POINT(-76.8 22.3)
2016-01-05  28                 POINT(-77 22.1)

The 2016-2017 dataframe is combined with the 2018 dataframe using concat() so that data can be aggregated across all years.

The week and weekday columns created earlier are dropped for this operation only, to simplify the results.

View Code
# Create new 2016-2018 dataframe
union_df = pd.concat([df.drop(['weekday', 'week'], axis=1), df_2], ignore_index=True)
union_df.head()
Results:
date        number_of_strikes  center_point_geom
2018-01-03  194                POINT(-75 27)
2018-01-03  41                 POINT(-78.4 29)
2018-01-03  33                 POINT(-73.9 27)
2018-01-03  38                 POINT(-73.8 27)
2018-01-03  92                 POINT(-79 28)
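
As a quick check of the concat() behavior, ignore_index=True renumbers the stacked rows so the combined frame has no duplicate index labels; a sketch with toy frames (illustrative values):

```python
import pandas as pd

a = pd.DataFrame({'number_of_strikes': [194, 41]})
b = pd.DataFrame({'number_of_strikes': [55, 33]})

combined = pd.concat([a, b], ignore_index=True)

assert len(combined) == len(a) + len(b)
assert list(combined.index) == [0, 1, 2, 3]   # index reset, no duplicates
```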

Three new columns need to be created, year, month, and month_txt (month name), to assist with labeling bars on the bar plot.

View Code
# Add 3 new columns
union_df['year'] = union_df.date.dt.year
union_df['month'] = union_df.date.dt.month
union_df['month_txt'] = union_df.date.dt.month_name()
union_df.head() 
Results:
date        number_of_strikes  center_point_geom  year  month  month_txt
2018-01-03  194                POINT(-75 27)      2018  1      January
2018-01-03  41                 POINT(-78.4 29)    2018  1      January
2018-01-03  33                 POINT(-73.9 27)    2018  1      January
2018-01-03  38                 POINT(-73.8 27)    2018  1      January
2018-01-03  92                 POINT(-79 28)      2018  1      January
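
The three datetime accessors can be sanity-checked on a single known date:

```python
import pandas as pd

d = pd.Series(pd.to_datetime(['2018-01-03']))

assert d.dt.year[0] == 2018
assert d.dt.month[0] == 1
assert d.dt.month_name()[0] == 'January'
```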

With this new dataframe, the total number of strikes per year can be computed.

View Code
union_df[['year', 'number_of_strikes']].groupby(['year']).sum()
Results:
year  number_of_strikes
2016  41582229
2017  35095195
2018  44600989

A new dataframe called lightning_by_month is created, totaling the lightning strikes that occurred in a given month for each year; it will feed the percentage calculation later.

To avoid errors with agg(), 'sum' is passed as a string, which is the form pandas expects for built-in aggregations ('sum', 'mean', 'max', etc.).

View Code
# Lightning by month
lightning_by_month = union_df.groupby(['month_txt', 'year']).agg(
    number_of_strikes = pd.NamedAgg(column = 'number_of_strikes', aggfunc='sum')
    ).reset_index()

lightning_by_month.head()
Results:
month_txt  year  number_of_strikes
April      2016  2636427
April      2017  3819075
April      2018  1524339
August     2016  7250442
August     2017  6021702

With the monthly strike totals in hand, agg() is used to recompute the yearly totals from before as a dataframe that can be merged downstream.

View Code
# lightning by year
lightning_by_year = union_df.groupby(['year']).agg(
    year_strikes = pd.NamedAgg(column='number_of_strikes', aggfunc='sum')
).reset_index()

lightning_by_year.head()
Results:
year  year_strikes
2016  41582229
2017  35095195
2018  44600989

With both lightning_by_month and lightning_by_year, a new dataframe called percentage_lightning can be created.

merge() offers this capability, joining lightning_by_month and lightning_by_year into a single dataframe on the year column. Wherever year contains the same value in both dataframes, a row is created in the merged result carrying all other columns.

The merge adds a year_strikes column to percentage_lightning, representing the total number of strikes for each year.

View Code
# Combine
percentage_lightning = lightning_by_month.merge(lightning_by_year, on='year')
percentage_lightning.head()
Results:
month_txt  year  number_of_strikes  year_strikes
April      2016  2636427            41582229
April      2017  3819075            35095195
April      2018  1524339            44600989
August     2016  7250442            41582229
August     2017  6021702            35095195
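
The many-to-one behavior of merge() on year can be sketched with toy totals (the numbers below are illustrative, not from the dataset):

```python
import pandas as pd

monthly = pd.DataFrame({
    'month_txt': ['April', 'April', 'August'],
    'year': [2016, 2017, 2016],
    'number_of_strikes': [100, 150, 300],
})
yearly = pd.DataFrame({'year': [2016, 2017],
                       'year_strikes': [400, 150]})

merged = monthly.merge(yearly, on='year')

# Every monthly row picks up its year's total
assert len(merged) == 3
assert merged.loc[merged['year'] == 2016, 'year_strikes'].eq(400).all()
```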

To get the percentage of total lightning strikes that occurred during each month, the number_of_strikes column is divided by the year_strikes column and multiplied by 100.

View Code
percentage_lightning['percentage_lightning_per_month'] = (percentage_lightning.number_of_strikes/
                                                          percentage_lightning.year_strikes * 100.0)
percentage_lightning.head()
Results:
month_txt  year  number_of_strikes  year_strikes  percentage_lightning_per_month
April      2016  2636427            41582229      6.340273
April      2017  3819075            35095195      10.882045
April      2018  1524339            44600989      3.417725
August     2016  7250442            41582229      17.436396
August     2017  6021702            35095195      17.158195
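
As a design note, the merge step could also be avoided entirely with groupby().transform('sum'), which broadcasts each year's total back onto the monthly rows; a sketch with illustrative numbers:

```python
import pandas as pd

monthly = pd.DataFrame({
    'year': [2016, 2017, 2016],
    'number_of_strikes': [100, 150, 300],
})

# transform('sum') returns a per-year total aligned to the original rows
year_totals = monthly.groupby('year')['number_of_strikes'].transform('sum')
monthly['pct_of_year'] = monthly['number_of_strikes'] / year_totals * 100

assert list(monthly['pct_of_year']) == [25.0, 100.0, 75.0]
```

Both approaches give the same percentages; the merge route keeps an explicit year_strikes column around, which is handy for inspection.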

Now the percentages by month for each year can be plotted.

View Code
# Plot
plt.figure(figsize=(10,6));

month_order = ['January', 'February', 'March', 'April', 'May', 'June', 
               'July', 'August', 'September', 'October', 'November', 'December']

sns.barplot(
    data = percentage_lightning,
    x = 'month_txt',
    y = 'percentage_lightning_per_month',
    hue = 'year',
    order = month_order );
plt.xlabel("Month");
plt.ylabel("% of lightning strikes");
plt.title("% of lightning strikes each Month (2016-2018)");
Results:
[Bar plot: % of lightning strikes each month (2016-2018)]
Summary

To look at how lightning activity changes over time, data from 2016-2018 is combined so everything can be analyzed in one place. Since both datasets follow the same format, they can be stacked together after converting the date column and dropping a couple of extra fields not needed for this part. From there, breaking the data down by year and month makes it easier to compare how activity shifts throughout the calendar.

Once totals are calculated, the focus shifts to how much each month contributes relative to the full year. Instead of just looking at raw counts, converting everything into percentages gives a clearer picture of seasonality and makes comparisons across years more consistent. Merging monthly totals with yearly totals allows that percentage calculation to happen cleanly in one place.

When plotted, a clear pattern starts to show. Lightning activity is not evenly spread out as it builds up throughout the year and peaks heavily in late summer. August stands out the most, especially in 2018 where it accounts for a significant share of total strikes, reinforcing that lightning activity tends to concentrate within specific seasonal windows rather than being evenly distributed.

Project Overview

This project brings together multiple years of NOAA lightning strike data to look at how activity varies across both location and time. After combining and structuring the datasets, a few consistent patterns start to emerge. Lightning activity is not evenly distributed. It tends to concentrate in certain areas, and a relatively small number of locations show up repeatedly with higher strike frequency, suggesting a clear clustering effect rather than random spread.

Across weekdays, there is not much variation. Activity stays fairly stable, with only a slight decrease on weekends, which does not meaningfully change the overall distribution. The more noticeable differences show up over longer periods. Lightning activity shifts throughout the year, with certain months consistently seeing higher volumes. These seasonal patterns repeat across multiple years, pointing to underlying environmental or atmospheric drivers rather than one-off fluctuations.

The seasonal consistency suggests that lightning activity can be anticipated to some extent, while the geographic concentration highlights areas where activity is persistently higher. Both of these patterns provide a starting point for more targeted monitoring, planning, or deeper analysis into what is driving the behavior.