NOAA Lightning Strike Analysis (Date String Manipulations)

by Alfred Rico

Project Background

The data used for this project represents cloud-to-ground lightning activity recorded by Vaisala's National Lightning Detection Network (NLDN), a system that tracks lightning strikes across the United States.

To make analysis of the data easier, the National Centers for Environmental Information (NCEI) aggregate individual lightning strikes into a spatial grid, where all strikes that occur within a 0.1° latitude by 0.1° longitude tile are grouped and summarized daily. Each grid cell is represented by a pair of coordinates corresponding to the center of that tile.

In this way, instead of working with millions of individual strike events, the data is structured by location and time with each row representing total strikes recorded within a tile on a given day.
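As a rough illustration of this gridding idea, a strike's coordinates can be snapped to the center of its tile. The exact NCEI/NLDN gridding convention is an assumption here; this sketch simply assumes each cell center lies on a multiple of 0.1°, matching the POINT values seen in the data.

```python
def tile_center(lat, lon, cell=0.1):
    """Snap coordinates to the center of their grid cell.

    Illustrative only -- the exact NCEI gridding convention may differ;
    cell centers are assumed to lie on multiples of 0.1 degrees.
    """
    return (round(round(lat / cell) * cell, 1),
            round(round(lon / cell) * cell, 1))

# Strikes at slightly different points fall into the same tile
print(tile_center(24.72, -101.54))  # (24.7, -101.5)
print(tile_center(24.68, -101.46))  # (24.7, -101.5)
```

Under this scheme, summing strikes per (tile, day) pair yields exactly the row structure described above.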

Due to the substantial size of the CSV file used in this project, even converting it to Parquet format did not reduce it enough to upload to the repository. The data is instead available through the bigquery-public-data.noaa_lightning.lightning_strikes public table, accessible through Google BigQuery.

Project Goal

Geotemporal aggregation makes the data practical to explore, especially when working with date features, which are the focus of this project.

Data collected by the National Oceanic and Atmospheric Administration (NOAA), holding records of 2016-2018 lightning strikes, will be used to calculate weekly and quarterly lightning strike totals, which are then visualized.

Technical Implementation

Data Loading & Inspection

Before getting started, all required libraries and extensions are imported, the dataset is loaded, and a brief overview is obtained using .head() while .info() is used to assess data types and data shape.

# Import
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
  
# Load dataset
df = pd.read_csv('lightning_strikes_dataset.csv')
df.head()
df.info()
Results:

df.head()

date number_of_strikes center_point_geom
0 2016-08-05 16 POINT(-101.5 24.7)
1 2016-08-05 16 POINT(-85 34.3)
2 2016-08-05 16 POINT(-89 41.4)
3 2016-08-05 16 POINT(-89.8 30.7)
4 2016-08-05 16 POINT(-86.2 37.9)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10479003 entries, 0 to 10479002
Data columns (total 3 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   date               object
 1   number_of_strikes  int64 
 2   center_point_geom  object
dtypes: int64(1), object(2)
memory usage: 239.8+ MB
Summary

As is common with CSV imports, the date column has been read in as an object (string) type and, as the target feature, will be converted to datetime format so that the required temporal components can be extracted.

.info() reveals the dataset contains ~10.5M rows and 3 columns, a sample of which is previewed with .head().

Feature Engineering

At this phase of the project, the date column is converted to a proper datetime format using pd.to_datetime().

Four new columns, week, month, quarter, and year, are created using the .dt.strftime() accessor with the associated format codes; the quarter column additionally uses .dt.to_period('Q').

# Convert date strings to datetime (required before using the .dt accessor)
df['date'] = pd.to_datetime(df['date'])

# Create columns
df['week'] = df['date'].dt.strftime('%Y-W%V')
df['month'] = df['date'].dt.strftime('%Y-%m')
df['quarter'] = df['date'].dt.to_period('Q').dt.strftime('%Y-Q%q')
df['year'] = df['date'].dt.strftime('%Y')

# Check
df.head(10)

Results:

df.head(10)

   date        number_of_strikes  center_point_geom    week      month    quarter  year
0  2016-08-05  16                 POINT(-101.5 24.7)   2016-W31  2016-08  2016-Q3  2016
1  2016-08-05  16                 POINT(-85 34.3)      2016-W31  2016-08  2016-Q3  2016
2  2016-08-05  16                 POINT(-89 41.4)      2016-W31  2016-08  2016-Q3  2016
3  2016-08-05  16                 POINT(-89.8 30.7)    2016-W31  2016-08  2016-Q3  2016
4  2016-08-05  16                 POINT(-86.2 37.9)    2016-W31  2016-08  2016-Q3  2016
5  2016-08-05  16                 POINT(-97.8 38.9)    2016-W31  2016-08  2016-Q3  2016
6  2016-08-05  16                 POINT(-81.9 36)      2016-W31  2016-08  2016-Q3  2016
7  2016-08-05  16                 POINT(-90.9 36.7)    2016-W31  2016-08  2016-Q3  2016
8  2016-08-05  16                 POINT(-106.6 26.1)   2016-W31  2016-08  2016-Q3  2016
9  2016-08-05  16                 POINT(-108 31.6)     2016-W31  2016-08  2016-Q3  2016
Summary

The date column was converted from a string to a datetime format to enable time-based calculations. New temporal features (week, month, quarter and year) were then extracted using datetime methods and formatting, allowing easier aggregation and analysis of geotemporal patterns.
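The format codes can be sanity-checked on a tiny hand-made sample (illustrative dates, not the project dataset). One caveat worth noting: %V is the ISO week number, so pairing it with the calendar year %Y can mislabel dates near a year boundary; %G is the matching ISO year.

```python
import pandas as pd

# Two illustrative dates, the second chosen to sit on a year boundary
sample = pd.DataFrame({'date': pd.to_datetime(['2016-08-05', '2018-12-31'])})
sample['week'] = sample['date'].dt.strftime('%Y-W%V')
sample['quarter'] = sample['date'].dt.to_period('Q').dt.strftime('%Y-Q%q')

print(sample['week'].tolist())     # ['2016-W31', '2018-W01']
print(sample['quarter'].tolist())  # ['2016-Q3', '2018-Q4']

# 2018-12-31 falls in ISO week 1 of 2019, so the ISO year %G differs:
print(sample['date'].dt.strftime('%G-W%V').tolist()[1])  # 2019-W01
```

For this project's 2016-2018 range the mismatch only affects a handful of boundary days, but it is worth keeping in mind when grouping by week.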

2018 Weekly Strikes Visualization

The dataset is filtered to 2018 and aggregated with groupby() and sum() before plotting.

With pandas 2.x+, numeric_only=True must be set in sum() during aggregation, or a TypeError will be thrown when the non-numeric columns are summed.
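The numeric_only behavior can be seen on a toy frame (hypothetical values): with numeric_only=True, string columns are simply dropped from the aggregation rather than raising an error.

```python
import pandas as pd

# Toy data mimicking the project's column types (hypothetical values)
toy = pd.DataFrame({
    'week':    ['2018-W01', '2018-W01', '2018-W02'],
    'geom':    ['POINT(-85 34.3)', 'POINT(-89 41.4)', 'POINT(-85 34.3)'],
    'strikes': [3, 4, 5],
})

# The string column 'geom' is excluded; only 'strikes' is summed
out = toy.groupby('week').sum(numeric_only=True).reset_index()
print(out)  # 2018-W01 -> 7 strikes, 2018-W02 -> 5 strikes
```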

# 2018 data summed by week
df_by_week_2018 = df[df['year'] == '2018'].groupby(['week']).sum(numeric_only=True).reset_index()
df_by_week_2018.head()

# Plot
plt.bar(x = df_by_week_2018['week'], height = df_by_week_2018['number_of_strikes'])
plt.xlabel("Week number")
plt.ylabel("Number of lightning strikes")
plt.title("Number of lightning strikes per week (2018)")
plt.show()

# Adjusting
plt.figure(figsize = (20, 5)) # Increase output size.
plt.bar(x = df_by_week_2018['week'], height = df_by_week_2018['number_of_strikes'])
plt.xlabel("Week number")
plt.ylabel("Number of lightning strikes")
plt.title("Number of lightning strikes per week (2018)")
plt.xticks(rotation = 45, fontsize = 8) # Rotate x-axis labels and decrease font size.
plt.show()

Results: [bar chart: number of lightning strikes per week, 2018]
Summary

The dataset was filtered for 2018 and grouped by week using groupby() to calculate total weekly lightning strikes. The results were visualized using bar(), with plt.figure(figsize=(20, 5)) and plt.xticks(rotation=45, fontsize=8) applied to improve readability.

The visualization shows relatively low activity early in the year, followed by a sharp increase during mid-year, with peak lightning activity occurring in late summer before declining toward the end of the year.

2016-2018 Quarterly Strikes Visualization

Analysis now shifts focus to quarterly lightning strikes across the full date range of available data. Because the totals are easiest to read in millions, .div(1000000) will be used during aggregation before plotting.

A number_of_strikes_formatted column is added representing the number in millions.
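The formatting chain can be previewed on a couple of sample totals: divide by one million, round to one decimal, cast to string, and append an 'M' suffix.

```python
import pandas as pd

# Sample quarterly totals (illustrative values)
totals = pd.Series([2683798, 21843820])

# Divide by 1e6, round to one decimal place, then build the label string
formatted = totals.div(1000000).round(1).astype(str) + 'M'
print(formatted.tolist())  # ['2.7M', '21.8M']
```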

A labeling function is used to annotate each bar in the graph with its corresponding number_of_strikes_formatted text. The pyplot function uses positional arguments x, y, and s for x-axis, y-axis, and text respectively.

# 2016-2018 data by quarter and sum
df_by_quarter = df.groupby(['quarter']).sum(numeric_only=True).reset_index()

# Text converted to millions
df_by_quarter['number_of_strikes_formatted'] = df_by_quarter['number_of_strikes'].div(1000000).round(1).astype(str) + 'M'

df_by_quarter.head()

# Labeling function: plt.text takes positional arguments x, y, and s
def addlabels(x, y, labels):
    for i in range(len(x)):
        plt.text(i, y[i], labels[i], ha = 'center', va = 'bottom')

# Plot
plt.figure(figsize = (15, 5))
plt.bar(x = df_by_quarter['quarter'], height = df_by_quarter['number_of_strikes'])
addlabels(df_by_quarter['quarter'], df_by_quarter['number_of_strikes'], df_by_quarter['number_of_strikes_formatted'])
plt.xlabel('Quarter')
plt.ylabel('Number of lightning strikes')
plt.title('Number of lightning strikes per quarter (2016-2018)')
plt.show()
Results:

df_by_quarter.head()

   quarter  number_of_strikes  number_of_strikes_formatted
0  2016-Q1  2683798            2.7M
1  2016-Q2  15084857           15.1M
2  2016-Q3  21843820           21.8M
3  2016-Q4  1969754            2.0M
4  2017-Q1  2444279            2.4M
Summary

The quarterly visualization reveals consistent seasonal trends, with peak lightning strikes occurring in Q2 and Q3 across all years, while also highlighting variation in intensity between years.

Year-over-Year Quarterly Strike Visualization

For a grouped analysis, two new columns, quarter_number and year, are created by extracting substrings from the quarter column to enable grouped comparisons.

A grouped bar chart is created to highlight lightning activity across the same quarter in different years.

# Create two new columns
df_by_quarter['quarter_number'] = df_by_quarter['quarter'].str[-2:]
df_by_quarter['year'] = df_by_quarter['quarter'].str[:4]

# Check
df_by_quarter.head()

# Plot
plt.figure(figsize = (15, 5))
p = sns.barplot(
    data = df_by_quarter,
    x = 'quarter_number',
    y = 'number_of_strikes',
    hue = 'year')
for b in p.patches:
    p.annotate(str(round(b.get_height()/1000000, 1))+'M', 
                   (b.get_x() + b.get_width() / 2., b.get_height() + 1.2e6), 
                   ha = 'center', va = 'bottom', 
                   xytext = (0, -12), 
                   textcoords = 'offset points')
plt.xlabel("Quarter")
plt.ylabel("Number of lightning strikes")
plt.title("Number of lightning strikes per quarter (2016-2018)")
plt.show()
Results:

df_by_quarter.head()

   quarter  number_of_strikes  number_of_strikes_formatted  quarter_number  year
0  2016-Q1  2683798            2.7M                         Q1              2016
1  2016-Q2  15084857           15.1M                        Q2              2016
2  2016-Q3  21843820           21.8M                        Q3              2016
3  2016-Q4  1969754            2.0M                         Q4              2016
4  2017-Q1  2444279            2.4M                         Q1              2017
Summary

Two new features, quarter_number and year, were extracted from the quarter column to enable grouped comparison. A grouped bar chart was then created with each bar being annotated with its respective value (in millions) to improve readability. Ultimately, the bar chart now highlights a grouped version of quarterly lightning activity across years.

Project Overview

This project focuses on transforming and extracting meaningful features from date string data to enable structured temporal analysis. Using Python, raw string-based date values are converted into datetime format and expanded into multiple time-based components such as week, month, quarter, and year. These engineered features are then used to aggregate and visualize lightning strike activity across different time scales, revealing seasonal patterns and year-over-year trends. The project highlights how effective date manipulation serves as a critical step in preparing data for temporal analysis and visualization.