Importing Python packages
- NumPy - Used primarily for numerical computation and wide ranging mathematical functions
- Pandas - Used for data manipulation, structuring and handling
import numpy as np
import pandas as pd
By Alfred Rico
TikTok's ever-growing community, with close to 2 billion monthly users worldwide as of 2026, presents moderators with the challenge of reviewing reported content suspected of containing user claims. To maintain trust and safety on the platform, moderators must take precautionary measures to prevent the spread of misinformation and ensure adherence to policy. To help reduce the review backlog, the overall focus of this project is to create a predictive model that can determine whether a video contains claims or offers opinions, allowing moderators to prioritize reports efficiently.
For this part of the project, the focus is on cursory inspection of the provided dataset, tiktok_dataset.csv. After building a dataframe with Python in a Jupyter notebook, the claims data will be organized for exploratory data analysis, cleaned, and transformed in order to derive engagement trends and identify potential correlations between user engagement variables and claim status. Engagement trends provide insight into how users interact with different types of content, revealing meaningful patterns associated with claim versus opinion videos. Incorporating these behavioral signals into a predictive model will improve its ability to distinguish between the two classes.
The dataset is loaded into the Jupyter notebook environment with
data = pd.read_csv("tiktok_dataset.csv")
Initial data inspection, validation, and exploration are performed using head(), info(), and describe() to understand structure, data types, and statistical distributions:
- head() - to examine a sample of records and the dataset's structure and feature layout
- info() - to assess data types, missing values, and dataset integrity
- describe() - for statistical distributions (important for identifying trends, variability, and potential outliers)
# Loading dataset
data = pd.read_csv("tiktok_dataset.csv")
# Displaying only the first 5 rows to examine structure and features
data.head()
# Summary information
data.info()
# Summary statistics
data.describe()
|   | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries are already hap... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more microorganisms in... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industrialist andrew ca... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. petersburg, wit... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of businesses allowin... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   #                         19382 non-null  int64
 1   claim_status              19084 non-null  object
 2   video_id                  19382 non-null  int64
 3   video_duration_sec        19382 non-null  int64
 4   video_transcription_text  19084 non-null  object
 5   verified_status           19382 non-null  object
 6   author_ban_status         19382 non-null  object
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB
|   | # | video_id | video_duration_sec | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|
| count | 19382.000000 | 1.938200e+04 | 19382.000000 | 19084.000000 | 19084.000000 | 19084.000000 | 19084.000000 | 19084.000000 |
| mean | 9691.500000 | 5.627454e+09 | 32.421732 | 254708.558688 | 84304.636030 | 16735.248323 | 1049.429627 | 349.312146 |
| std | 5595.245794 | 2.536440e+09 | 16.229967 | 322893.280814 | 133420.546814 | 32036.174350 | 2004.299894 | 799.638865 |
| min | 1.000000 | 1.234959e+09 | 5.000000 | 20.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 4846.250000 | 3.430417e+09 | 18.000000 | 4942.500000 | 810.750000 | 115.000000 | 7.000000 | 1.000000 |
| 50% | 9691.500000 | 5.618664e+09 | 32.000000 | 9954.500000 | 3403.500000 | 717.000000 | 46.000000 | 9.000000 |
| 75% | 14536.750000 | 7.843960e+09 | 47.000000 | 504327.000000 | 125020.000000 | 18222.000000 | 1156.250000 | 292.000000 |
| max | 19382.000000 | 9.999873e+09 | 60.000000 | 999817.000000 | 657830.000000 | 256130.000000 | 14994.000000 | 9599.000000 |
Output preview of dataset structure and summary statistics.
Upon inspection, data.head() displays the first five rows of the 19,382 entries, each representing either a claim or an opinion along with associated metadata. The dataset includes a mix of categorical, textual, and numerical features. data.info() provides insight into data types and completeness, highlighting variables that contain missing or null values. data.describe() is particularly relevant to the project objective, as it reveals the distribution of engagement-related variables. The results indicate strong right-skewness, where a small number of videos exhibit disproportionately high interaction values, suggesting the presence of high-end outliers likely associated with viral content.
Quartile analysis further supports this observation. Quartiles divide the sorted data into four equal parts. At the 50% level (median), view counts are approximately 10k, while at the 75% level they rise to around 500k. This sharp increase indicates that a subset of videos significantly outperforms the majority.
Additionally, the standard deviation across engagement metrics is relatively large, reinforcing the presence of variability and outliers. It is important to note that the mean is not fully representative of typical engagement levels, as it is heavily influenced by extreme values. The median provides a more reliable measure of central tendency in this case, as it remains unaffected by outliers.
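To illustrate why the median holds up better than the mean here, a minimal sketch with hypothetical view counts (toy numbers for illustration, not drawn from tiktok_dataset.csv):

```python
import pandas as pd

# Hypothetical view counts: four modest videos plus one viral outlier
views = pd.Series([1_000, 2_000, 3_000, 4_000, 1_000_000])

# The outlier drags the mean far above what a "typical" video receives,
# while the median stays at the middle of the modest values
print('mean:  ', views.mean())    # 202000.0
print('median:', views.median())  # 3000.0
```

A single viral video shifts the mean by two orders of magnitude while leaving the median untouched, which is exactly the pattern the quartile analysis above suggests.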
With the preliminary data inspection, and the machine learning model being created for this project in mind, it is apparent that the column claim_status will play a key role. This variable distinguishes whether a video contains a claim or expresses an opinion, serving as the basis for the downstream classification task.
# Identifying values and their counts for claim_status
data['claim_status'].value_counts()
| Claim Status | Count |
|---|---|
| claim | 9608 |
| opinion | 9476 |
The target variable claim_status contains a near-even distribution of 9,608 claim and 9,476 opinion videos. This balance will be beneficial downstream for classification, as it reduces the risk of model bias toward a dominant class and supports more reliable evaluation of model performance, suggesting the dataset is well-suited for training a model to distinguish between the two classes.
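The balance can also be confirmed as proportions rather than raw counts with value_counts(normalize=True); a quick sketch using a stand-in series built from the counts reported above:

```python
import pandas as pd

# Stand-in series reproducing the reported class counts;
# with the real dataset this would be data['claim_status']
claim_status = pd.Series(['claim'] * 9608 + ['opinion'] * 9476)

# normalize=True returns each class's share of the total
proportions = claim_status.value_counts(normalize=True)
print(proportions.round(3))  # claim ~0.503, opinion ~0.497
```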
To further strengthen the predictive model, user engagement metrics are examined to determine whether interaction patterns differ across claim status categories. The dataset is filtered by claim status using Boolean masking to compare engagement levels between claims and opinions. Mean and median view counts are calculated for each group.
# Average view count (claim status)
claims = data[data['claim_status'] == 'claim']
print('Mean view count claims:', claims['video_view_count'].mean())
print('Median view count claims:', claims['video_view_count'].median())
# Average view count (opinion status)
opinions = data[data['claim_status'] == 'opinion']
print('Mean view count opinions:', opinions['video_view_count'].mean())
print('Median view count opinions:', opinions['video_view_count'].median())
Mean view count claims: 501029.4527477102
Median view count claims: 501555.0
Mean view count opinions: 4956.43224989447
Median view count opinions: 4953.0
Although the mean and median values within each group are closely aligned, suggesting even distributions within classes, there is a marked disparity between categories. Claim videos exhibit mean and median view counts exceeding 500k, whereas opinion videos average close to 5k views. This strongly suggests that engagement patterns differ markedly by claim status, making view count, and likely other engagement metrics, a particularly strong feature for classification.
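The same per-class comparison can be done in a single call with groupby() instead of two Boolean masks; a sketch on a tiny illustrative frame (made-up values standing in for the full dataset):

```python
import pandas as pd

# Tiny illustrative frame standing in for the full dataset
df = pd.DataFrame({
    'claim_status': ['claim', 'claim', 'opinion', 'opinion'],
    'video_view_count': [600_000.0, 400_000.0, 6_000.0, 4_000.0],
})

# One groupby call yields mean and median per class at once
summary = df.groupby('claim_status')['video_view_count'].agg(['mean', 'median'])
print(summary)
```

This scales more cleanly than filtering one subset per class when more grouping columns or metrics are added later.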
Next, trends associated with author ban status are examined using groupby() to count the videos in each combination of claim status and author ban status (active, banned, or under review).
# Counts for each group combination of claim status and author ban status
data.groupby(['claim_status', 'author_ban_status']).count()[['#']]
| claim_status | author_ban_status | Count |
|---|---|---|
| claim | active | 6566 |
| claim | banned | 1439 |
| claim | under review | 1603 |
| opinion | active | 8817 |
| opinion | banned | 196 |
| opinion | under review | 463 |
The target variable claim_status is nearly evenly distributed, supporting unbiased model training and reliable evaluation.
Engagement analysis reveals a marked separation between classes, with claim videos averaging far higher view counts than opinion videos. The closeness of mean and median values within each category suggests relatively stable distributions, while the magnitude of the difference between categories indicates strong classification potential for engagement features.
Author ban status also seems to further highlight differences between classes, with claim videos disproportionately associated with banned authors.
Overall, the results indicate that engagement metrics and author status variables can enhance model performance.
Analysis shifts focus to the author_ban_status variable, examining whether user interaction patterns differ between authors labeled as active, banned, or under review.
groupby() is used to segment the dataset by author ban status, aggregating key engagement metrics (view, like, and share counts) using mean and median values. This provides a comprehensive view of typical engagement levels while accounting for potential outliers.
# Mean and median view, like, and share counts by author ban status
data.groupby(['author_ban_status']).agg(
{'video_view_count': ['mean', 'median'],
'video_like_count': ['mean', 'median'],
'video_share_count': ['mean', 'median']})
| author_ban_status | video_view_count mean | video_view_count median | video_like_count mean | video_like_count median | video_share_count mean | video_share_count median |
|---|---|---|---|---|---|---|
| active | 215927.039524 | 8616.0 | 71036.533836 | 2222.0 | 14111.466164 | 437.0 |
| banned | 445845.439144 | 448201.0 | 153017.236697 | 105573.0 | 29998.942508 | 14468.0 |
| under review | 392204.836399 | 365245.5 | 128718.050339 | 71204.5 | 25774.696999 | 9444.0 |
Results reveal a clear pattern: authors with a banned status consistently exhibit the highest engagement across all metrics, followed by those under review, while active authors show significantly lower engagement levels. Notably, the median share count for banned authors (14,468) is dramatically higher than that of active authors (437), a roughly 30-fold difference. Altogether, this further supports discriminative power from a modeling perspective, indicating that engagement features, especially share count, can serve as additional signals when identifying content that differs in behavior.
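The share-count ratio can be checked directly from the medians in the grouped output above; a quick sketch:

```python
# Median share counts from the grouped output above
median_shares = {'active': 437, 'banned': 14_468, 'under review': 9_444}

# Banned authors' median shares relative to active authors'
ratio = median_shares['banned'] / median_shares['active']
print(f"banned vs active median shares: {ratio:.1f}x")  # ~33.1x
```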
A count variable is added across engagement metrics in order to scale the mean and median figures against sample size, and a few key patterns emerge. Authors labeled as banned or under review consistently exhibit dramatically higher engagement across all metrics (views, likes, and shares) compared to active authors, suggesting content associated with these accounts spreads more widely than its counterparts. Another notable observation is the gap between mean and median values in most groups, most dramatically for active authors, where the mean sits far above the median, strongly indicating a small number of videos with exceptionally high engagement, the outliers. In this preliminary analysis of ban status and engagement level, it is apparent that high engagement is not evenly distributed across videos or author groups. Instead, a small amount of highly viral content drives much of the overall interaction, particularly among authors whose accounts are labeled as banned or under review. From a modeling perspective, these patterns indicate engagement features can carry a strong signal for classification but also require careful handling due to their susceptibility to extreme values.
# Adding a count variable
data.groupby(['author_ban_status']).agg(
{'video_view_count': ['count', 'mean', 'median'],
'video_like_count': ['count', 'mean', 'median'],
'video_share_count': ['count', 'mean', 'median']
})
| author_ban_status | video_view_count count | video_view_count mean | video_view_count median | video_like_count count | video_like_count mean | video_like_count median | video_share_count count | video_share_count mean | video_share_count median |
|---|---|---|---|---|---|---|---|---|---|
| active | 15383 | 215927.039524 | 8616.0 | 15383 | 71036.533836 | 2222.0 | 15383 | 14111.466164 | 437.0 |
| banned | 1635 | 445845.439144 | 448201.0 | 1635 | 153017.236697 | 105573.0 | 1635 | 29998.942508 | 14468.0 |
| under review | 2066 | 392204.836399 | 365245.5 | 2066 | 128718.050339 | 71204.5 | 2066 | 25774.696999 | 9444.0 |
Ban status and engagement analysis reveals that authors categorized as banned or under review consistently exhibit significantly higher engagement across all metrics compared to active authors. Share counts in particular strongly indicate that highly viral content is more commonly associated with moderated accounts. One of the bigger signals is the marked gap between mean and median values, suggesting severely right-skewed distributions, with a small number of highly viral videos driving the overall engagement averages. These patterns suggest that engagement metrics may serve as strong indicators for distinguishing content behavior, though careful interpretation is required due to the influence of outliers.
Because raw counts can be misleading, engagement metrics are next normalized by total views: raw likes, comments, and shares are converted into rates per view to better compare engagement across videos of different reach.
- likes_per_view: the number of likes divided by the number of views for each video
- comments_per_view: the number of comments divided by the number of views for each video
- shares_per_view: the number of shares divided by the number of views for each video
Next, groupby() is used to split the dataset into combinations of claim status and author ban status. The new engagement rate data for each group is compiled and summarized using agg() to calculate the count, mean, and median of each group.
# Create engagement rate features
data['likes_per_view'] = data['video_like_count'] / data['video_view_count']
data['comments_per_view'] = data['video_comment_count'] / data['video_view_count']
data['shares_per_view'] = data['video_share_count'] / data['video_view_count']
# Grouped aggregation
data.groupby(['claim_status', 'author_ban_status']).agg(
{'likes_per_view': ['count', 'mean', 'median'],
'comments_per_view': ['count', 'mean', 'median'],
'shares_per_view': ['count', 'mean', 'median']}
)
| claim_status | author_ban_status | likes_per_view count | likes_per_view mean | likes_per_view median | comments_per_view count | comments_per_view mean | comments_per_view median | shares_per_view count | shares_per_view mean | shares_per_view median |
|---|---|---|---|---|---|---|---|---|---|---|
| claim | active | 6566 | 0.329542 | 0.326538 | 6566 | 0.001393 | 0.000776 | 6566 | 0.065456 | 0.049279 |
| claim | banned | 1439 | 0.345071 | 0.358909 | 1439 | 0.001377 | 0.000746 | 1439 | 0.067893 | 0.051606 |
| claim | under review | 1603 | 0.327997 | 0.320867 | 1603 | 0.001367 | 0.000789 | 1603 | 0.065733 | 0.049967 |
| opinion | active | 8817 | 0.219744 | 0.218330 | 8817 | 0.000517 | 0.000252 | 8817 | 0.043729 | 0.032405 |
| opinion | banned | 196 | 0.206868 | 0.198483 | 196 | 0.000434 | 0.000193 | 196 | 0.040531 | 0.030728 |
| opinion | under review | 463 | 0.226394 | 0.228051 | 463 | 0.000536 | 0.000293 | 463 | 0.044472 | 0.035027 |
The analysis further strengthens the trend: claim videos generate higher engagement than opinion videos across all rate metrics. Claim videos average about 0.33 likes per view compared to 0.22 for opinion videos, 0.065 shares per view versus 0.043, and roughly two to three times as many comments per view.
Differences across author ban status, in contrast, are quite small. Likes per view for claim videos from active authors (0.329) and banned authors (0.345) are nearly identical, strongly indicating that content type, not moderation status (active, banned, or under review), drives user engagement.
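To quantify how small the ban-status differences are relative to the class gap, the mean likes_per_view values from the table above can be compared directly (simple unweighted averages across ban-status groups, for illustration only):

```python
# Mean likes_per_view by ban status, from the grouped output above
claim_rates = [0.329542, 0.345071, 0.327997]    # active, banned, under review
opinion_rates = [0.219744, 0.206868, 0.226394]  # active, banned, under review

# Spread across ban statuses within the claim class
within_spread = max(claim_rates) - min(claim_rates)
# Gap between the (unweighted) class averages
between_gap = sum(claim_rates) / 3 - sum(opinion_rates) / 3

print(f"within-class spread: {within_spread:.3f}")  # 0.017
print(f"between-class gap:   {between_gap:.3f}")    # 0.117
```

The between-class gap is almost seven times the within-class spread, which is the numerical version of the claim that content type dominates moderation status.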
This project explores patterns in TikTok video engagement to support the development of a predictive model capable of distinguishing between claim-based and opinion-based content. Using a dataset of over 19,000 videos, exploratory data analysis (EDA) was conducted to understand how user interaction metrics and author characteristics relate to content type.
Initial inspection revealed that engagement variables such as views, likes, shares, and comments exhibit strongly right-skewed distributions, where a small subset of highly viral videos drives disproportionately large values. This was supported by large standard deviations and significant gaps between quartiles, indicating the presence of extreme outliers. As a result, median values were identified as more reliable indicators of typical engagement behavior than means.
The target variable, claim_status, is nearly evenly distributed between claim and opinion videos. This balance provides a strong foundation for unbiased model training and evaluation. When comparing engagement across these classes, a clear pattern indicates that claim videos significantly outperform opinion videos across all engagement metrics. They receive substantially higher view counts, more interactions, and stronger engagement rates per view. This separation highlights engagement features as highly predictive signals for classification.
Further analysis of author ban status revealed that videos from authors labeled as banned or under review tend to generate higher overall engagement than those from active users. However, these differences are heavily influenced by viral outliers and are less consistent when normalized into engagement rates. In contrast, engagement rate analysis shows that content type, not moderation status, is the primary driver of user interaction behavior. Claim videos maintain higher likes, shares, and comments per view regardless of author status.
Overall, the findings indicate that:
- The target variable claim_status is nearly evenly distributed, supporting unbiased model training.
- Claim videos consistently generate higher engagement than opinion videos, making engagement metrics strong candidate features for classification.
- Content type, not author moderation status, is the primary driver of per-view engagement rates.
- Engagement distributions are heavily right-skewed, so medians and careful outlier handling are needed when modeling.

These insights provide a strong analytical foundation for the next phase of the project.