
TikTok Claims Project (EDA with Python)

By Alfred Rico

Key Takeaways

Problem Statement

TikTok's ever-growing community, with close to 2 billion monthly users worldwide as of 2026, presents moderators with the challenge of reviewing reported content suspected of containing user claims. To maintain trust and safety on the platform, moderators must take precautionary measures to prevent the spread of misinformation and ensure adherence to policy. To help reduce the review backlog, the overall focus of this project is to create a predictive model that determines whether a video makes a claim or offers an opinion, allowing moderators to prioritize reports efficiently.

Project Goal

For this part of the project, the focus is a cursory inspection of the provided dataset, tiktok_dataset.csv. After building a DataFrame with Python in a Jupyter notebook, the claims data will be organized for exploratory data analysis, cleaned, and transformed in order to surface engagement trends and identify possible correlations between user engagement variables and claim status. Engagement trends provide insight into how users interact with different types of content, revealing meaningful patterns associated with claim versus opinion videos. Incorporating these behavioral signals into a predictive model will improve its ability to distinguish between the two classes.

Technical Implementation

Python Packages

Importing Python packages

  • NumPy - Used primarily for numerical computation and its wide-ranging mathematical functions
  • Pandas - Used for data manipulation, structuring, and handling
import numpy as np
import pandas as pd
          
Preliminary Data Inspection

The dataset is loaded into the Jupyter notebook environment with

data = pd.read_csv("tiktok_dataset.csv")

Initial data inspection, validation, and exploration are performed using head(), info(), and describe() to understand structure, data types, and statistical distributions:

head() - to examine a sample of records and the feature layout
info() - to assess data types, missing values, and dataset integrity
describe() - for statistical distributions (important for identifying trends, variability, and potential outliers)

# Loading dataset  
data = pd.read_csv("tiktok_dataset.csv")
# Displaying only the first 5 rows to examine structure and features  
data.head()
# Summary information  
data.info()  
# Summary statistics  
data.describe()
          
Results:
|   | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries are already hap... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more microorganisms in... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industrialist andrew ca... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. petersburg, wit... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of businesses allowin... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |
Data Info:

RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB
|       | # | video_id | video_duration_sec | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
| count | 19382.000000 | 1.938200e+04 | 19382.000000 | 19084.000000 | 19084.000000 | 19084.000000 | 19084.000000 | 19084.000000 |
| mean | 9691.500000 | 5.627454e+09 | 32.421732 | 254708.558688 | 84304.636030 | 16735.248323 | 1049.429627 | 349.312146 |
| std | 5595.245794 | 2.536440e+09 | 16.229967 | 322893.280814 | 133420.546814 | 32036.174350 | 2004.299894 | 799.638865 |
| min | 1.000000 | 1.234959e+09 | 5.000000 | 20.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 4846.250000 | 3.430417e+09 | 18.000000 | 4942.500000 | 810.750000 | 115.000000 | 7.000000 | 1.000000 |
| 50% | 9691.500000 | 5.618664e+09 | 32.000000 | 9954.500000 | 3403.500000 | 717.000000 | 46.000000 | 9.000000 |
| 75% | 14536.750000 | 7.843960e+09 | 47.000000 | 504327.000000 | 125020.000000 | 18222.000000 | 1156.250000 | 292.000000 |
| max | 19382.000000 | 9.999873e+09 | 60.000000 | 999817.000000 | 657830.000000 | 256130.000000 | 14994.000000 | 9599.000000 |

Output preview of dataset structure and summary statistics.

Summary

Upon inspection, data.head() displays the first five rows of the 19,382 entries, each representing either a claim or an opinion along with associated metadata. The dataset includes a mix of categorical, textual, and numerical features.

data.info() provides insight into data types and completeness, highlighting variables that contain missing or null values.

data.describe() is particularly relevant to the project objective, as it reveals the distribution of engagement-related variables. The results indicate strong right-skewness, where a small number of videos exhibit disproportionately high interaction values—suggesting the presence of high-end outliers likely associated with viral content.

Quartile analysis further supports this observation. Quartiles divide the data into four equal parts after sorting. At the 50% level (median), view counts are approximately 10k, while at the 75% level they rise to around 500k. This sharp increase indicates that a subset of videos significantly outperforms the majority.

Additionally, the standard deviation across engagement metrics is relatively large, reinforcing the presence of variability and outliers. It is important to note that the mean is not fully representative of typical engagement levels, as it is heavily influenced by extreme values. The median provides a more reliable measure of central tendency in this case, as it remains unaffected by outliers.
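The pull that viral outliers exert on the mean can be sketched with a small synthetic example (hypothetical numbers, not drawn from the dataset):

```python
import pandas as pd

# Hypothetical right-skewed sample: 95 typical videos plus 5 "viral" outliers
views = pd.Series([1_000] * 95 + [900_000] * 5)

mean_views = views.mean()      # dragged upward by the 5 extreme values
median_views = views.median()  # still reflects the typical video

print(f"mean:   {mean_views:,.0f}")    # 45,950
print(f"median: {median_views:,.0f}")  # 1,000
```

Even though 95% of these hypothetical videos sit at 1,000 views, the mean lands more than 45x higher than the median, which is why the median is preferred here as the measure of typical engagement.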

Claim Status & Engagement Analysis

With the preliminary data inspection complete, and the machine learning model for this project in mind, it is apparent that the column claim_status will play a key role. This variable distinguishes whether a video contains a claim or expresses an opinion, serving as the basis for the downstream classification task.

# Identifying values and their counts for claim_status
data['claim_status'].value_counts()
Results:
| Claim Status | Count |
| claim | 9608 |
| opinion | 9476 |

The target variable claim_status contains a near even distribution of 9,608 claim and 9,476 opinion videos. This balance will be beneficial downstream for classification purposes as it reduces risks of model bias toward a dominant class and supports more reliable evaluations of model performance. For classification tasks such as this one, the near even distribution between claim and opinion videos suggests that this dataset is well-suited for training a model that can learn to distinguish between the two classes.
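As a quick check, the same balance can be expressed as proportions with `value_counts(normalize=True)`; a minimal sketch that rebuilds a stand-in column from the counts reported above:

```python
import pandas as pd

# Stand-in claim_status column reconstructed from the reported class counts
claim_status = pd.Series(['claim'] * 9608 + ['opinion'] * 9476)

# normalize=True returns each class as a fraction of the non-null total
proportions = claim_status.value_counts(normalize=True)
print(proportions.round(3))  # claim ~0.503, opinion ~0.497
```

A split this close to 50/50 means no resampling or class weighting is strictly required before model training.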

To further strengthen the predictive model, user engagement metrics are examined to determine whether interaction patterns differ across claim status categories. The dataset is filtered by claim status using Boolean masking to compare engagement levels between claims and opinions. Mean and median view counts are calculated for each group.

# Average view count (claim status)

claims = data[data['claim_status'] == 'claim']
print('Mean view count claims:', claims['video_view_count'].mean())
print('Median view count claims:', claims['video_view_count'].median())  

# Average view count (opinion status)

opinions = data[data['claim_status'] == 'opinion']
print('Mean view count opinions:', opinions['video_view_count'].mean())
print('Median view count opinions:', opinions['video_view_count'].median())
Results:
Mean view count claims: 501029.4527477102
Median view count claims: 501555.0  

Mean view count opinions: 4956.43224989447
Median view count opinions: 4953.0

Although the mean and median values within each group are closely aligned, suggesting symmetric distributions within each class, there is a marked disparity between categories. Claim videos exhibit mean and median view counts exceeding 500k, whereas opinion videos average close to 5k views. This strongly suggests that engagement patterns differ markedly by claim status, making view count, and likely other engagement metrics, a particularly strong feature for classification.
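The same two-group comparison can also be collapsed into a single `groupby` aggregation rather than two Boolean masks; a sketch on a hypothetical four-row stand-in for the dataset:

```python
import pandas as pd

# Hypothetical stand-in: two claim and two opinion videos
data = pd.DataFrame({
    'claim_status': ['claim', 'claim', 'opinion', 'opinion'],
    'video_view_count': [400_000.0, 600_000.0, 4_000.0, 6_000.0],
})

# One call computes both statistics for both groups at once
stats = data.groupby('claim_status')['video_view_count'].agg(['mean', 'median'])
print(stats)
```

This produces one row per class with `mean` and `median` columns, which scales more cleanly if additional engagement metrics are added later.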

Next, any trends associated with the ban status of the author will be examined using groupby() to count how many videos fall into each combination of claim status and author ban status (active, banned, or under review).

# Counts for each group combination of claim status and author ban status

data.groupby(['claim_status', 'author_ban_status']).count()[['#']]
Results:
| claim_status | author_ban_status | Count |
| claim | active | 6566 |
| claim | banned | 1439 |
| claim | under review | 1603 |
| opinion | active | 8817 |
| opinion | banned | 196 |
| opinion | under review | 463 |
Summary

The target variable claim_status is nearly evenly distributed, supporting unbiased model training and reliable evaluation.

Engagement analysis reveals a marked separation between classes, with claim videos averaging higher view counts than opinion videos. The closeness of mean and median values in each category group suggests relatively stable distributions within, while magnitudes of difference between categories indicates strong classification potential for engagement features.

Author ban status also seems to further highlight differences between classes, with claim videos disproportionately associated with banned authors.
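That disproportion can be quantified with a row-normalized `pd.crosstab`; a sketch that rebuilds one row per video from the group counts reported above (a hypothetical reconstruction, not the original file):

```python
import pandas as pd

# Group counts as reported in the results table
counts = {('claim', 'active'): 6566, ('claim', 'banned'): 1439,
          ('claim', 'under review'): 1603, ('opinion', 'active'): 8817,
          ('opinion', 'banned'): 196, ('opinion', 'under review'): 463}

# Expand into one row per video, then cross-tabulate
rows = [(c, b) for (c, b), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=['claim_status', 'author_ban_status'])

# normalize='index' gives the share of each ban status within each class
shares = pd.crosstab(df['claim_status'], df['author_ban_status'], normalize='index')
print(shares.round(3))
```

Roughly 15% of claim videos come from banned authors versus about 2% of opinion videos, which makes the association concrete.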

Overall, the results indicate that engagement metrics and author status variables can enhance model performance.

Ban Status & Engagement Analysis

Analysis shifts focus to the author_ban_status variable, examining whether user interaction patterns differ between authors labeled as active, banned, or under review.

groupby() is used to segment the dataset by author ban status, aggregating key engagement metrics (view, like, and share counts) with both mean and median values. Together these provide a view of typical engagement levels for each group while accounting for potential outliers.

# Mean and median of view, like, and share counts by author ban status
data.groupby(['author_ban_status']).agg(
    {'video_view_count': ['mean', 'median'],
    'video_like_count': ['mean', 'median'],
    'video_share_count': ['mean', 'median']})
Results:
| author_ban_status | view mean | view median | like mean | like median | share mean | share median |
| active | 215927.039524 | 8616.0 | 71036.533836 | 2222.0 | 14111.466164 | 437.0 |
| banned | 445845.439144 | 448201.0 | 153017.236697 | 105573.0 | 29998.942508 | 14468.0 |
| under review | 392204.836399 | 365245.5 | 128718.050339 | 71204.5 | 25774.696999 | 9444.0 |

Results reveal a clear pattern: authors with a banned status consistently exhibit the highest engagement across all metrics, followed by those under review, while active authors show significantly lower engagement levels. Notably, the median share count for banned authors (14,468) is dramatically higher than that of active authors (437), a roughly 30-fold difference. Altogether, this further supports discriminative power from a modeling perspective, indicating that engagement features, especially share count, can serve as additional signals when identifying content that differs in behavior.

A count variable is added across the engagement metrics in order to put the mean and median values in the context of sample size, and a few key patterns emerge. Authors labeled as banned or under review consistently exhibit dramatically higher engagement across all metrics (views, likes, and shares) compared to active authors, suggesting content associated with these accounts spreads more widely than its counterparts. Another interesting observation is the wide gap between the mean and median values within each group: the mean is substantially higher than the median, strongly indicating the presence of a small number of videos with exceptionally high engagement, the outliers. In this preliminary analysis of ban status and engagement level, it is apparent that high engagement is not evenly distributed across videos or author groups. Instead, a small set of highly viral content drives much of the overall interaction, particularly among authors whose accounts are labeled as banned or under review. From a modeling perspective, these patterns indicate that engagement features can carry a strong signal for classification but also require careful handling due to their susceptibility to extreme values.

# Adding a count variable
data.groupby(['author_ban_status']).agg(
    {'video_view_count': ['count', 'mean', 'median'],
     'video_like_count': ['count', 'mean', 'median'],
     'video_share_count': ['count', 'mean', 'median']
     })
Results:
| author_ban_status | view count | view mean | view median | like count | like mean | like median | share count | share mean | share median |
| active | 15383 | 215927.039524 | 8616.0 | 15383 | 71036.533836 | 2222.0 | 15383 | 14111.466164 | 437.0 |
| banned | 1635 | 445845.439144 | 448201.0 | 1635 | 153017.236697 | 105573.0 | 1635 | 29998.942508 | 14468.0 |
| under review | 2066 | 392204.836399 | 365245.5 | 2066 | 128718.050339 | 71204.5 | 2066 | 25774.696999 | 9444.0 |
Summary

Ban status and engagement analysis reveals that authors categorized as banned or under review consistently exhibit significantly higher engagement across all metrics compared to active authors. Share counts in particular strongly indicate that highly viral content is more commonly associated with moderated accounts. One of the clearest signals is the marked gap between mean and median values across all groups, pointing to severely right-skewed distributions in which a small number of highly viral videos drives the overall engagement averages. These patterns suggest that engagement metrics may serve as strong indicators for distinguishing content behavior, though careful interpretation is required due to the influence of outliers.
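One common way to tame this kind of right skew before modeling is a log transform; a minimal sketch with hypothetical view counts, using `np.log1p` so zero counts remain valid:

```python
import numpy as np
import pandas as pd

# Hypothetical skewed view counts, including one extreme viral outlier
views = pd.Series([0.0, 1_000.0, 5_000.0, 10_000.0, 900_000.0])

# log1p(x) = log(1 + x) compresses the long right tail and maps 0 to 0
log_views = np.log1p(views)

print(views.max() / views.median())  # raw spread: 180x the median
print(log_views.round(2).tolist())
```

On the log scale, the outlier sits only a few units above the typical values instead of orders of magnitude, which keeps it from dominating distance- or variance-based models.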

Engagement Rate Analysis by Claim and Ban Status

To better understand engagement, raw counts (likes, comments, and shares) are normalized by total views. Because raw numbers can be misleading, engagement metrics are converted into rates per view:

  • likes_per_view: represents the number of likes divided by the number of views for each video
  • comments_per_view: represents the number of comments divided by the number of views for each video
  • shares_per_view: represents the number of shares divided by the number of views for each video

Next, groupby() is used to split the dataset into combinations of claim status and author ban status. The new engagement rate data for each group is compiled and summarized using agg() to calculate count, mean and median of each group.

# Create engagement rate features

data['likes_per_view'] = data['video_like_count'] / data['video_view_count']
data['comments_per_view'] = data['video_comment_count'] / data['video_view_count']
data['shares_per_view'] = data['video_share_count'] / data['video_view_count']

# Grouped aggregation

data.groupby(['claim_status', 'author_ban_status']).agg(
    {'likes_per_view': ['count', 'mean', 'median'],
     'comments_per_view': ['count', 'mean', 'median'],
     'shares_per_view': ['count', 'mean', 'median']}
)
Results:
| claim_status | author_ban_status | likes count | likes mean | likes median | comments count | comments mean | comments median | shares count | shares mean | shares median |
| claim | active | 6566 | 0.329542 | 0.326538 | 6566 | 0.001393 | 0.000776 | 6566 | 0.065456 | 0.049279 |
| claim | banned | 1439 | 0.345071 | 0.358909 | 1439 | 0.001377 | 0.000746 | 1439 | 0.067893 | 0.051606 |
| claim | under review | 1603 | 0.327997 | 0.320867 | 1603 | 0.001367 | 0.000789 | 1603 | 0.065733 | 0.049967 |
| opinion | active | 8817 | 0.219744 | 0.218330 | 8817 | 0.000517 | 0.000252 | 8817 | 0.043729 | 0.032405 |
| opinion | banned | 196 | 0.206868 | 0.198483 | 196 | 0.000434 | 0.000193 | 196 | 0.040531 | 0.030728 |
| opinion | under review | 463 | 0.226394 | 0.228051 | 463 | 0.000536 | 0.000293 | 463 | 0.044472 | 0.035027 |
Summary

The analysis further strengthens the trend: claim videos generate higher engagement than opinion videos across all metrics. Claim videos, for example, average about 0.33 likes per view compared to 0.22 for opinion videos, and 0.065 shares per view versus 0.043. They also receive roughly 2-3x more comments per view.

Differences across author ban status, in contrast, are quite small. Claim videos from active users (0.329 likes per view) and banned users (0.345) are nearly even, strongly indicating that content type, not moderation status (active, banned, or under review), drives user engagement.

Project Overview

This project explores patterns in TikTok video engagement to support the development of a predictive model capable of distinguishing between claim-based and opinion-based content. Using a dataset of over 19,000 videos, exploratory data analysis (EDA) was conducted to understand how user interaction metrics and author characteristics relate to content type.

Initial inspection revealed that engagement variables such as views, likes, shares, and comments exhibit strong right-skewed distributions, where a small subset of highly viral videos drives disproportionately large values. This was supported by large standard deviations and significant gaps between quartiles, indicating the presence of extreme outliers. As a result, median values were identified as more reliable indicators of typical engagement behavior than means.

The target variable, claim_status, is nearly evenly distributed between claim and opinion videos. This balance provides a strong foundation for unbiased model training and evaluation. When comparing engagement across these classes, a clear pattern indicates that claim videos significantly outperform opinion videos across all engagement metrics. They receive substantially higher view counts, more interactions, and stronger engagement rates per view. This separation highlights engagement features as highly predictive signals for classification.

Further analysis of author ban status revealed that videos from authors labeled as banned or under review tend to generate higher overall engagement than those from active users. However, these differences are heavily influenced by viral outliers and are less consistent when normalized into engagement rates. In contrast, engagement rate analysis shows that content type, not moderation status, is the primary driver of user interaction behavior. Claim videos maintain higher likes, shares, and comments per view regardless of author status.

Overall, the findings indicate that:

  • Engagement metrics are powerful predictors of claim-based content
  • Content type drives user interaction more than author status
  • Outliers and skewness must be carefully handled in modeling

These insights provide a strong analytical foundation for the next phase of the project.

© 2026 Alfred Rico. All rights reserved.