Importing Python packages
- NumPy - Used primarily for numerical computation and wide ranging mathematical functions
- Pandas - Used for data manipulation, structuring and handling
import numpy as np
import pandas as pd
By Alfred Rico
TikTok's ever-growing community, with close to 2 billion monthly users worldwide as of 2026, presents moderators with the challenge of reviewing reported content suspected of containing user claims. To maintain trust and safety on the platform, moderators must take precautionary measures to prevent the spread of misinformation and ensure adherence to policy. To help reduce the review backlog, the overall focus of this project is to create a predictive model that can determine whether a video contains claims or offers opinions, allowing moderators to prioritize reports efficiently.
For this part of the project, the focus is on cursory inspection of the provided dataset, tiktok_dataset.csv. After building a dataframe with Python in a Jupyter notebook, the claims data will be organized for exploratory data analysis, cleaned, and transformed in order to derive engagement trends and identify potential correlations between user engagement variables and claim status. Engagement trends provide insight into how users interact with different types of content, revealing meaningful patterns associated with claim versus opinion videos. Incorporating these behavioral signals into a predictive model will improve its ability to distinguish between the two classes.
The dataset is loaded into the Jupyter notebook environment with
data = pd.read_csv("tiktok_dataset.csv")
Initial data inspection, validation, and exploration are performed using head(), info(), and describe() to understand structure, data types, and statistical distributions:
- head() - to examine a sample of records and the dataset's structure and feature layout
- info() - to assess data types, missing values, and dataset integrity
- describe() - for statistical distributions (important for identifying trends, variability, and potential outliers)
# Loading dataset
data = pd.read_csv("tiktok_dataset.csv")
# Displaying only the first 5 rows to examine structure and features
data.head()
# Summary information
data.info()
# Summary statistics
data.describe()
|   | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries are already hap... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more microorganisms in... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industrialist andrew ca... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. petersburg, wit... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of businesses allowin... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   #                         19382 non-null  int64
 1   claim_status              19084 non-null  object
 2   video_id                  19382 non-null  int64
 3   video_duration_sec        19382 non-null  int64
 4   video_transcription_text  19084 non-null  object
 5   verified_status           19382 non-null  object
 6   author_ban_status         19382 non-null  object
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB
|   | # | video_id | video_duration_sec | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|
| count | 19382.000000 | 1.938200e+04 | 19382.000000 | 19084.000000 | 19084.000000 | 19084.000000 | 19084.000000 | 19084.000000 |
| mean | 9691.500000 | 5.627454e+09 | 32.421732 | 254708.558688 | 84304.636030 | 16735.248323 | 1049.429627 | 349.312146 |
| std | 5595.245794 | 2.536440e+09 | 16.229967 | 322893.280814 | 133420.546814 | 32036.174350 | 2004.299894 | 799.638865 |
| min | 1.000000 | 1.234959e+09 | 5.000000 | 20.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 4846.250000 | 3.430417e+09 | 18.000000 | 4942.500000 | 810.750000 | 115.000000 | 7.000000 | 1.000000 |
| 50% | 9691.500000 | 5.618664e+09 | 32.000000 | 9954.500000 | 3403.500000 | 717.000000 | 46.000000 | 9.000000 |
| 75% | 14536.750000 | 7.843960e+09 | 47.000000 | 504327.000000 | 125020.000000 | 18222.000000 | 1156.250000 | 292.000000 |
| max | 19382.000000 | 9.999873e+09 | 60.000000 | 999817.000000 | 657830.000000 | 256130.000000 | 14994.000000 | 9599.000000 |
Output preview of dataset structure and summary statistics.
Upon inspection, data.head() displays the first five rows of the 19,382 entries, each representing either a claim or an opinion along with associated metadata. The dataset includes a mix of categorical, textual, and numerical features. data.info() provides insight into data types and completeness, highlighting variables that contain missing or null values. data.describe() is particularly relevant to the project objective, as it reveals the distribution of engagement-related variables. The results indicate strong right-skewness, where a small number of videos exhibit disproportionately high interaction values, suggesting the presence of high-end outliers likely associated with viral content.
Quartile analysis further supports this observation. Quartiles divide the sorted data into four equal parts. At the 50% level (median), view counts are approximately 10k, while at the 75% level they rise to around 500k. This sharp increase indicates that a subset of videos significantly outperforms the majority.
Additionally, the standard deviation across engagement metrics is relatively large, reinforcing the presence of variability and outliers. It is important to note that the mean is not fully representative of typical engagement levels, as it is heavily influenced by extreme values. The median provides a more reliable measure of central tendency in this case, as it remains unaffected by outliers.
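To illustrate why the median holds up better than the mean here, a minimal sketch with hypothetical view counts (toy numbers for illustration, not drawn from tiktok_dataset.csv):

```python
import pandas as pd

# Hypothetical view counts: four modest videos plus one viral outlier
views = pd.Series([1_000, 2_000, 3_000, 4_000, 1_000_000])

# The outlier drags the mean far above what a "typical" video receives,
# while the median stays at the middle of the modest values
print('mean:  ', views.mean())    # 202000.0
print('median:', views.median())  # 3000.0
```

A single viral video shifts the mean by two orders of magnitude while leaving the median untouched, which is exactly the pattern the quartile analysis above suggests.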
With the preliminary data inspection, and the machine learning model being created for this project in mind, it is apparent that the column claim_status will play a key role. This variable distinguishes whether a video contains a claim or expresses an opinion, serving as the basis for the downstream classification task.
# Identifying values and their counts for claim_status
data['claim_status'].value_counts()
| Claim Status | Count |
|---|---|
| claim | 9608 |
| opinion | 9476 |
The target variable claim_status contains a near-even distribution of 9,608 claim and 9,476 opinion videos. This balance will be beneficial downstream for classification, as it reduces the risk of model bias toward a dominant class and supports more reliable evaluation of model performance, suggesting the dataset is well-suited for training a model to distinguish between the two classes.
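The balance can also be confirmed as proportions rather than raw counts with value_counts(normalize=True); a quick sketch using a stand-in series built from the counts reported above:

```python
import pandas as pd

# Stand-in series reproducing the reported class counts;
# with the real dataset this would be data['claim_status']
claim_status = pd.Series(['claim'] * 9608 + ['opinion'] * 9476)

# normalize=True returns each class's share of the total
proportions = claim_status.value_counts(normalize=True)
print(proportions.round(3))  # claim ~0.503, opinion ~0.497
```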
To further strengthen the predictive model, user engagement metrics are examined to determine whether interaction patterns differ across claim status categories. The dataset is filtered by claim status using Boolean masking to compare engagement levels between claims and opinions. Mean and median view counts are calculated for each group.
# Average view count (claim status)
claims = data[data['claim_status'] == 'claim']
print('Mean view count claims:', claims['video_view_count'].mean())
print('Median view count claims:', claims['video_view_count'].median())
# Average view count (opinion status)
opinions = data[data['claim_status'] == 'opinion']
print('Mean view count opinions:', opinions['video_view_count'].mean())
print('Median view count opinions:', opinions['video_view_count'].median())
Mean view count claims: 501029.4527477102
Median view count claims: 501555.0
Mean view count opinions: 4956.43224989447
Median view count opinions: 4953.0
Although the mean and median values within each group are closely aligned, suggesting even distributions within classes, there is a marked disparity between categories. Claim videos exhibit mean and median view counts exceeding 500k, whereas opinion videos average close to 5k views. This strongly suggests that engagement patterns differ markedly by claim status, making view count, and likely other engagement metrics, a particularly strong feature for classification.
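The same per-class comparison can be done in a single call with groupby() instead of two Boolean masks; a sketch on a tiny illustrative frame (made-up values standing in for the full dataset):

```python
import pandas as pd

# Tiny illustrative frame standing in for the full dataset
df = pd.DataFrame({
    'claim_status': ['claim', 'claim', 'opinion', 'opinion'],
    'video_view_count': [600_000.0, 400_000.0, 6_000.0, 4_000.0],
})

# One groupby call yields mean and median per class at once
summary = df.groupby('claim_status')['video_view_count'].agg(['mean', 'median'])
print(summary)
```

This scales more cleanly than filtering one subset per class when more grouping columns or metrics are added later.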
Next, trends associated with author ban status are examined using groupby() to count the videos in each combination of claim status and author ban status (active, banned, or under review).
# Counts for each group combination of claim status and author ban status
data.groupby(['claim_status', 'author_ban_status']).count()[['#']]
| claim_status | author_ban_status | Count |
|---|---|---|
| claim | active | 6566 |
| claim | banned | 1439 |
| claim | under review | 1603 |
| opinion | active | 8817 |
| opinion | banned | 196 |
| opinion | under review | 463 |
The target variable claim_status is nearly evenly distributed, supporting unbiased model training and reliable evaluation.
Engagement analysis reveals a marked separation between classes, with claim videos averaging far higher view counts than opinion videos. The closeness of mean and median values within each category suggests relatively stable distributions, while the magnitude of the difference between categories indicates strong classification potential for engagement features.
Author ban status also seems to further highlight differences between classes, with claim videos disproportionately associated with banned authors.
Overall, the results indicate that engagement metrics and author status variables can enhance model performance.
Analysis shifts focus to the author_ban_status variable, examining whether user interaction patterns differ between authors labeled as active, banned, or under review.
groupby() is used to segment the dataset by author ban status, aggregating key engagement metrics (view, like, and share counts) using mean and median values. This provides a comprehensive view of typical engagement levels while accounting for potential outliers.
# Mean and median view, like, and share counts by author ban status
data.groupby(['author_ban_status']).agg(
{'video_view_count': ['mean', 'median'],
'video_like_count': ['mean', 'median'],
'video_share_count': ['mean', 'median']})
| author_ban_status | video_view_count mean | video_view_count median | video_like_count mean | video_like_count median | video_share_count mean | video_share_count median |
|---|---|---|---|---|---|---|
| active | 215927.039524 | 8616.0 | 71036.533836 | 2222.0 | 14111.466164 | 437.0 |
| banned | 445845.439144 | 448201.0 | 153017.236697 | 105573.0 | 29998.942508 | 14468.0 |
| under review | 392204.836399 | 365245.5 | 128718.050339 | 71204.5 | 25774.696999 | 9444.0 |
Results reveal a clear pattern: authors with a banned status consistently exhibit the highest engagement across all metrics, followed by those under review, while active authors show significantly lower engagement levels. Notably, the median share count for banned authors (14,468) is dramatically higher than that of active authors (437), a roughly 30-fold difference. Altogether, this further supports discriminative power from a modeling perspective, indicating that engagement features, especially share count, can serve as additional signals when identifying content that differs in behavior.
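The share-count ratio can be checked directly from the medians in the grouped output above; a quick sketch:

```python
# Median share counts from the grouped output above
median_shares = {'active': 437, 'banned': 14_468, 'under review': 9_444}

# Banned authors' median shares relative to active authors'
ratio = median_shares['banned'] / median_shares['active']
print(f"banned vs active median shares: {ratio:.1f}x")  # ~33.1x
```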
A count variable is added across engagement metrics in order to scale the mean and median figures against sample size, and a few key patterns emerge. Authors labeled as banned or under review consistently exhibit dramatically higher engagement across all metrics (views, likes, and shares) compared to active authors, suggesting content associated with these accounts spreads more widely than its counterparts. Another notable observation is the gap between mean and median values in most groups, most dramatically for active authors, where the mean sits far above the median, strongly indicating a small number of videos with exceptionally high engagement, the outliers. In this preliminary analysis of ban status and engagement level, it is apparent that high engagement is not evenly distributed across videos or author groups. Instead, a small amount of highly viral content drives much of the overall interaction, particularly among authors whose accounts are labeled as banned or under review. From a modeling perspective, these patterns indicate engagement features can carry a strong signal for classification but also require careful handling due to their susceptibility to extreme values.
# Adding a count variable
data.groupby(['author_ban_status']).agg(
{'video_view_count': ['count', 'mean', 'median'],
'video_like_count': ['count', 'mean', 'median'],
'video_share_count': ['count', 'mean', 'median']
})
| author_ban_status | video_view_count count | video_view_count mean | video_view_count median | video_like_count count | video_like_count mean | video_like_count median | video_share_count count | video_share_count mean | video_share_count median |
|---|---|---|---|---|---|---|---|---|---|
| active | 15383 | 215927.039524 | 8616.0 | 15383 | 71036.533836 | 2222.0 | 15383 | 14111.466164 | 437.0 |
| banned | 1635 | 445845.439144 | 448201.0 | 1635 | 153017.236697 | 105573.0 | 1635 | 29998.942508 | 14468.0 |
| under review | 2066 | 392204.836399 | 365245.5 | 2066 | 128718.050339 | 71204.5 | 2066 | 25774.696999 | 9444.0 |
Ban status and engagement analysis reveals that authors categorized as banned or under review consistently exhibit significantly higher engagement across all metrics compared to active authors. Share counts in particular strongly indicate that highly viral content is more commonly associated with moderated accounts. One of the bigger signals is the marked gap between mean and median values, suggesting severely right-skewed distributions, with a small number of highly viral videos driving the overall engagement averages. These patterns suggest that engagement metrics may serve as strong indicators for distinguishing content behavior, though careful interpretation is required due to the influence of outliers.
Because raw counts can be misleading, engagement metrics are next normalized by total views: raw likes, comments, and shares are converted into rates per view to better compare engagement across videos of different reach.
- likes_per_view: the number of likes divided by the number of views for each video
- comments_per_view: the number of comments divided by the number of views for each video
- shares_per_view: the number of shares divided by the number of views for each video
Next, groupby() is used to split the dataset into combinations of claim status and author ban status. The new engagement rate data for each group is compiled and summarized using agg() to calculate the count, mean, and median of each group.
# Create engagement rate features
data['likes_per_view'] = data['video_like_count'] / data['video_view_count']
data['comments_per_view'] = data['video_comment_count'] / data['video_view_count']
data['shares_per_view'] = data['video_share_count'] / data['video_view_count']
# Grouped aggregation
data.groupby(['claim_status', 'author_ban_status']).agg(
{'likes_per_view': ['count', 'mean', 'median'],
'comments_per_view': ['count', 'mean', 'median'],
'shares_per_view': ['count', 'mean', 'median']}
)
| claim_status | author_ban_status | likes_per_view count | likes_per_view mean | likes_per_view median | comments_per_view count | comments_per_view mean | comments_per_view median | shares_per_view count | shares_per_view mean | shares_per_view median |
|---|---|---|---|---|---|---|---|---|---|---|
| claim | active | 6566 | 0.329542 | 0.326538 | 6566 | 0.001393 | 0.000776 | 6566 | 0.065456 | 0.049279 |
| claim | banned | 1439 | 0.345071 | 0.358909 | 1439 | 0.001377 | 0.000746 | 1439 | 0.067893 | 0.051606 |
| claim | under review | 1603 | 0.327997 | 0.320867 | 1603 | 0.001367 | 0.000789 | 1603 | 0.065733 | 0.049967 |
| opinion | active | 8817 | 0.219744 | 0.218330 | 8817 | 0.000517 | 0.000252 | 8817 | 0.043729 | 0.032405 |
| opinion | banned | 196 | 0.206868 | 0.198483 | 196 | 0.000434 | 0.000193 | 196 | 0.040531 | 0.030728 |
| opinion | under review | 463 | 0.226394 | 0.228051 | 463 | 0.000536 | 0.000293 | 463 | 0.044472 | 0.035027 |
The analysis further strengthens the trend: claim videos generate higher engagement than opinion videos across all rate metrics. Claim videos average about 0.33 likes per view compared to 0.22 for opinion videos, 0.065 shares per view versus 0.043, and roughly two to three times as many comments per view.
Differences across author ban status, in contrast, are quite small. Likes per view for claim videos from active authors (0.329) and banned authors (0.345) are nearly identical, strongly indicating that content type, not moderation status (active, banned, or under review), drives user engagement.
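To quantify how small the ban-status differences are relative to the class gap, the mean likes_per_view values from the table above can be compared directly (simple unweighted averages across ban-status groups, for illustration only):

```python
# Mean likes_per_view by ban status, from the grouped output above
claim_rates = [0.329542, 0.345071, 0.327997]    # active, banned, under review
opinion_rates = [0.219744, 0.206868, 0.226394]  # active, banned, under review

# Spread across ban statuses within the claim class
within_spread = max(claim_rates) - min(claim_rates)
# Gap between the (unweighted) class averages
between_gap = sum(claim_rates) / 3 - sum(opinion_rates) / 3

print(f"within-class spread: {within_spread:.3f}")  # 0.017
print(f"between-class gap:   {between_gap:.3f}")    # 0.117
```

The between-class gap is almost seven times the within-class spread, which is the numerical version of the claim that content type dominates moderation status.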
This project explores patterns in TikTok video engagement to support the development of a predictive model capable of distinguishing between claim-based and opinion-based content. Using a dataset of over 19,000 videos, exploratory data analysis (EDA) was conducted to understand how user interaction metrics and author characteristics relate to content type.
Initial inspection revealed that engagement variables such as views, likes, shares, and comments exhibit strongly right-skewed distributions, where a small subset of highly viral videos drives disproportionately large values. This was supported by large standard deviations and significant gaps between quartiles, indicating the presence of extreme outliers. As a result, median values were identified as more reliable indicators of typical engagement behavior than means.
The target variable, claim_status, is nearly evenly distributed between claim and opinion videos. This balance provides a strong foundation for unbiased model training and evaluation. When comparing engagement across these classes, a clear pattern indicates that claim videos significantly outperform opinion videos across all engagement metrics. They receive substantially higher view counts, more interactions, and stronger engagement rates per view. This separation highlights engagement features as highly predictive signals for classification.
Further analysis of author ban status revealed that videos from authors labeled as banned or under review tend to generate higher overall engagement than those from active users. However, these differences are heavily influenced by viral outliers and are less consistent when normalized into engagement rates. In contrast, engagement rate analysis shows that content type, not moderation status, is the primary driver of user interaction behavior. Claim videos maintain higher likes, shares, and comments per view regardless of author status.
Overall, the findings indicate that:
- The target variable claim_status is nearly evenly distributed, supporting unbiased model training.
- Claim videos consistently generate higher engagement than opinion videos, making engagement metrics strong candidate features for classification.
- Content type, not author moderation status, is the primary driver of per-view engagement rates.
- Engagement distributions are heavily right-skewed, so medians and careful outlier handling are needed when modeling.

These insights provide a strong analytical foundation for the next phase of the project.