Unicorn Company Analysis (EDA with Python)

by Alfred Rico

Key Takeaways

Time to unicorn status is not consistent across industries. Some reach it much faster, while others take longer due to more complex development cycles or structural constraints. Growth speed is heavily influenced by industry rather than following a uniform path.
Industry plays a clear role in both how quickly companies grow and how large they become. Differences across industries point to factors like scalability, market demand, and operational complexity rather than random variation.
Valuation is also unevenly distributed, with technology-driven industries consistently appearing at the higher end. Other industries tend to cluster at lower maximum valuations, suggesting that both growth potential and eventual scale are closely tied to industry dynamics.

Project Background

The dataset used for this project covers more than 1,000 unicorn companies, meaning companies valued at one billion dollars or more. Features recorded for each company include, among others, industry, country, year founded, and select investors. When and how these companies reached unicorn status, a prestigious milestone, can provide valuable insight for investing firms, for example when deciding which companies to invest in next.

Project Goal

The objective of this project is to analyze unicorn company data to understand how valuation and growth timelines vary across industries, and to generate insights that support investment decision-making.

Technical Implementation

Data Loading & Inspection

Loading Unicorn_Companies.csv

matplotlib - used for data visualization

Pandas - used to manipulate, structure, and handle the data

note

Datetime operations were performed using pandas' built-in datetime tools (pd.to_datetime, .dt) without relying on Python's standalone datetime module.

The dataset is loaded with pandas and given an initial inspection to understand its structure, size, and data types.

View Code

# Importing
import pandas as pd
import matplotlib.pyplot as plt
  
# Loading
companies = pd.read_csv("Unicorn_Companies.csv")

# Inspection
companies.head(5)     # First 5 rows for a quick overview
companies.shape       # (rows, columns)
companies.size        # Total number of elements (rows x columns)
companies.info()      # Dtypes and non-null counts per column
companies.describe()  # Descriptive statistics for numeric columns

`companies.head(5)`

	Company	Valuation	Date Joined	Industry	City	Country/Region	Continent	Year Founded	Funding	Select Investors
0	Bytedance	$180B	4/7/17	Artificial intelligence	Beijing	China	Asia	2012	$8B	Sequoia Capital China...
1	SpaceX	$100B	12/1/12	Other	Hawthorne	United States	North America	2002	$7B	Founders Fund...
2	SHEIN	$100B	7/3/18	E-commerce	Shenzhen	China	Asia	2008	$2B	Tiger Global...
3	Stripe	$95B	1/23/14	Fintech	San Francisco	United States	North America	2010	$2B	Khosla Ventures...
4	Klarna	$46B	12/12/11	Fintech	Stockholm	Sweden	Europe	2005	$4B	Institutional Venture...

`companies.shape`

(1074, 10)

`companies.size`

10740

`companies.info()`

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1074 entries, 0 to 1073
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Company           1074 non-null   object
 1   Valuation         1074 non-null   object
 2   Date Joined       1074 non-null   object
 3   Industry          1074 non-null   object
 4   City              1058 non-null   object
 5   Country/Region    1074 non-null   object
 6   Continent         1074 non-null   object
 7   Year Founded      1074 non-null   int64 
 8   Funding           1074 non-null   object
 9   Select Investors  1073 non-null   object
dtypes: int64(1), object(9)
memory usage: 84.0+ KB

`companies.describe()`

	Year Founded
count	1074.000000
mean	2012.895717
std	5.698573
min	1919.000000
25%	2011.000000
50%	2014.000000
75%	2016.000000
max	2021.000000

Summary

companies.head(5) provides an initial snapshot of the dataset’s structure and key features. Of particular relevance to the project objective is the Date Joined column, which captures the date each company reached unicorn status (valuation ≥ $1B). This field will be especially important for temporal analysis, as its year component can be extracted to examine trends in how quickly companies achieve unicorn status across industries.

companies.size returns 10,740, the total number of individual values in the dataset (1,074 rows × 10 columns). This is distinct from companies.shape, which reports the dimensions themselves: (1074, 10).
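The relationship between shape and size can be checked directly. The snippet below is a minimal sketch on a toy DataFrame with hypothetical values (it is not the unicorn dataset), followed by the same arithmetic applied to the dimensions reported above.

```python
import pandas as pd

# Toy frame with hypothetical values: 3 rows, 2 columns
toy = pd.DataFrame({
    "Company": ["A", "B", "C"],
    "Year Founded": [2010, 2012, 2015],
})

n_rows, n_cols = toy.shape   # dimensions as a (rows, columns) tuple
total_cells = toy.size       # rows * columns

# Same arithmetic for the dimensions reported by companies.shape
full_cells = 1074 * 10

print(n_rows, n_cols, total_cells, full_cells)
```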

companies.info() reveals a few details that stand out immediately. First, there are 1,074 entries and 10 columns, and 9 of those 10 columns are object types; only Year Founded is numeric (int64). Columns like Funding and Valuation will likely need to be converted to numeric types, and Date Joined will need to be converted to a datetime format. There are also NULL values in two columns: 16 in City and 1 in Select Investors, which together affect fewer than 2% of rows. The dataset is small, but with over 1,000 entries it is still usable for analysis.
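The null counts reported by info() can also be tabulated explicitly. The following is a rough sketch on a toy DataFrame whose values are hypothetical; only the column names City and Select Investors come from the dataset. The last line repeats the percentage arithmetic for the counts quoted above, assuming in the worst case that the 17 missing values fall in 17 different rows.

```python
import pandas as pd

# Toy frame with hypothetical values and a few missing entries
toy = pd.DataFrame({
    "Company": ["A", "B", "C", "D"],
    "City": ["Beijing", None, "Stockholm", None],
    "Select Investors": ["X, Y", "Z", None, "W"],
})

null_counts = toy.isna().sum()                  # missing values per column
null_share = toy.isna().mean() * 100            # missing values per column, in percent
rows_with_nulls = toy.isna().any(axis=1).sum()  # rows with at least one missing value

# Upper bound on affected rows in the full dataset: (16 + 1) out of 1074
max_affected_pct = (16 + 1) / 1074 * 100

print(null_counts)
print(round(max_affected_pct, 2))
```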

companies.describe() summarizes the Year Founded distribution and highlights both an outlier and how recently most unicorns were founded. The minimum value of 1919 stands out immediately: it points either to a legacy company that achieved unicorn status much later or to a data irregularity. At the other end, the 25th percentile (2011) shows that at least three quarters of the companies were founded in 2011 or later, and the maximum of 2021, compared with the median (2014) and 75th percentile (2016), indicates that some companies reached unicorn status very shortly after being founded.
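One way to make the outlier reading concrete is the common 1.5 × IQR rule of thumb. This rule is not part of the original analysis; it is brought in here only as an illustration, using the quartiles from the describe() output above.

```python
# Quartiles of Year Founded, taken from the describe() output above
q1 = 2011.0
q3 = 2016.0

iqr = q3 - q1                   # interquartile range
lower_fence = q1 - 1.5 * iqr    # values below this are flagged by the rule
upper_fence = q3 + 1.5 * iqr    # values above this are flagged by the rule

min_year = 1919
max_year = 2021
min_is_flagged = min_year < lower_fence
max_is_flagged = max_year > upper_fence

print(lower_fence, upper_fence, min_is_flagged, max_is_flagged)
```

Under this rule the 1919 minimum falls far below the lower fence, while the 2021 maximum is within range. The rule would also flag other early founding years, so it is a screening aid rather than proof of a data error.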

Datetime Conversion & Feature Engineering

For this part of the analysis, the data is prepared by converting Date Joined to datetime format, extracting the year component, and engineering a Year Joined feature to support time-based analysis.

View Code

  # Convert to datetime
  companies['Date Joined'] = pd.to_datetime(companies['Date Joined'])
  # Checking
  companies.info()

  # Creating 'Year Joined' column
  companies['Year Joined'] = companies['Date Joined'].dt.year
  # Checking
  companies.head(5)

`Convert to datetime`

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1074 entries, 0 to 1073
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Company           1074 non-null   object        
 1   Valuation         1074 non-null   object        
 2   Date Joined       1074 non-null   datetime64[ns]
 3   Industry          1074 non-null   object        
 4   City              1058 non-null   object        
 5   Country/Region    1074 non-null   object        
 6   Continent         1074 non-null   object        
 7   Year Founded      1074 non-null   int64         
 8   Funding           1074 non-null   object        
 9   Select Investors  1073 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(8)
memory usage: 84.0+ KB

`Creating 'Year Joined' column (to the right)`

	Company	Valuation	Date Joined	Industry	City	Country/Region	Continent	Year Founded	Funding	Select Investors	Year Joined
0	Bytedance	$180B	2017-04-07	Artificial intelligence	Beijing	China	Asia	2012	$8B	Sequoia Capital China...	2017
1	SpaceX	$100B	2012-12-01	Other	Hawthorne	United States	North America	2002	$7B	Founders Fund...	2012
2	SHEIN	$100B	2018-07-03	E-commerce & direct-to-consumer	Shenzhen	China	Asia	2008	$2B	Tiger Global...	2018
3	Stripe	$95B	2014-01-23	Fintech	San Francisco	United States	North America	2010	$2B	Khosla Ventures...	2014
4	Klarna	$46B	2011-12-12	Fintech	Stockholm	Sweden	Europe	2005	$4B	Institutional Venture...	2011

Summary

As flagged during the initial inspection, the Date Joined column is converted from an object type to a datetime format, which supports more effective time-based analysis. A quick check with companies.info() confirms that the conversion was applied.
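The conversion above lets pandas infer the date layout. As an optional variation, the layout can be stated explicitly so that strings such as 4/7/17 are always read as month/day/two-digit year. The sketch below uses the five Date Joined values shown in the head() output; the format string is an assumption based on how those values look, not something the dataset documents.

```python
import pandas as pd

# Date Joined values as they appear in the first five rows
raw = pd.Series(["4/7/17", "12/1/12", "7/3/18", "1/23/14", "12/12/11"])

# Explicit layout: month/day/two-digit year (assumed from the values above)
parsed = pd.to_datetime(raw, format="%m/%d/%y")

years = parsed.dt.year
print(parsed.iloc[0], years.tolist())
```

Stating the format makes the parse fail loudly on a value that does not match, instead of silently guessing.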

Year Joined, a new feature, is engineered by extracting the year component from the datetime column. This transformation simplifies comparisons between when companies were founded and when they achieved unicorn status, further supporting trend analysis and visualization.
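As a small worked example of that comparison, the sketch below subtracts Year Founded from Year Joined for the five companies shown in the head() output. The column name years_to_unicorn and the negative-value check are illustrative additions, not part of the original notebook.

```python
import pandas as pd

# Values copied from the first five rows shown above
head = pd.DataFrame({
    "Company": ["Bytedance", "SpaceX", "SHEIN", "Stripe", "Klarna"],
    "Year Founded": [2012, 2002, 2008, 2010, 2005],
    "Year Joined": [2017, 2012, 2018, 2014, 2011],
})

# Years between founding and reaching unicorn status
head["years_to_unicorn"] = head["Year Joined"] - head["Year Founded"]

# A negative gap would mean "joined before founded", i.e. a data irregularity
suspicious = head[head["years_to_unicorn"] < 0]

print(head[["Company", "years_to_unicorn"]])
```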

Data Sampling & Time to Unicorn Visualization

The analysis shifts focus to visualization, where a random sample of 50 unicorn companies is selected using companies.sample(), with the random_state parameter specified to allow reproducibility.

A time_to_unicorn column is created by subtracting Year Founded from the year component of Date Joined.

The sample is then grouped by Industry and the maximum time_to_unicorn is calculated for each group. The results are sorted in descending order to make comparisons easier and to highlight the industries with the longest timelines.

A bar chart is created using matplotlib, plotting industries on the x-axis and their corresponding maximum time to unicorn on the y-axis. Labels are rotated for readability and the layout is adjusted to prevent overlap.

View Code

  # Data Sampling (name must match the variable used below)
  companies_sampled = companies.sample(n=50, random_state=42)

  # Time to unicorn data
  companies_sampled['time_to_unicorn'] = (
    companies_sampled['Date Joined'].dt.year - companies_sampled['Year Founded']
  )

  # Aggregate time_to_unicorn by Industry
  industry_max = (
    companies_sampled
    .groupby('Industry')['time_to_unicorn']
    .max()
    .sort_values(ascending=False)
  )

  
  # Create bar plot
  plt.figure()  # Initializing a new figure

  # Industry names (x), max times (y)
  plt.bar(industry_max.index, industry_max.values)

  # Rotating X-axis labels for readability
  plt.xticks(rotation=45, ha='right')

  # Axis labels
  plt.xlabel('Industry')
  plt.ylabel('Longest Time to Unicorn (Years)')

  # Chart title
  plt.title('Longest Time to Reach Unicorn Status by Industry (Sample)')

  # Prevent label cutoff
  plt.tight_layout()

  # Plot 
  plt.show()

Summary

The bar chart reveals clear differences in how long it takes companies across industries to reach unicorn status.

Overall, the sample indicates that time to unicorn is not evenly distributed across industries, which points to a strong influence of industry-specific dynamics such as longer development cycles (cybersecurity) or complex business models (health). In contrast, shorter time to unicorn values could reflect stronger market demand or rapid scalability in those industries.
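Because the chart plots the maximum per industry from a 50-row sample, a single company can set an industry's bar. One possible cross-check, sketched below on hypothetical numbers, is to report the median and the group size next to the maximum. The toy values are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical sample: industry labels are real categories, values are made up
toy = pd.DataFrame({
    "Industry": ["Fintech", "Fintech", "Fintech", "Health", "Health", "Cybersecurity"],
    "time_to_unicorn": [4, 6, 5, 12, 3, 9],
})

# Max (what the chart shows), median (typical case), count (how many rows back it)
summary = (
    toy.groupby("Industry")["time_to_unicorn"]
    .agg(["max", "median", "count"])
    .sort_values("max", ascending=False)
)

print(summary)
```

An industry with a high maximum but a low count or a much lower median is being ranked on very little evidence.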

Maximum Valuation by Industry Visualization

The Valuation column, originally stored as a string, is cleaned and converted into a numeric format. The dollar sign and 'B' suffix are removed and the values are cast to float before being scaled using 1e9 to represent actual dollar amounts.
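The same idea can be written as a small reusable helper. The sketch below is one possible version, not the notebook's code: parse_money is a hypothetical name, and support for an 'M' (millions) suffix and for non-matching text such as "Unknown" is an assumption aimed at columns like Funding, since the rows shown here only use 'B'.

```python
import pandas as pd

def parse_money(values: pd.Series) -> pd.Series:
    """Convert strings like '$180B' or '$918M' to floats in dollars.

    Strings that do not match the expected pattern become NaN.
    """
    multipliers = {"B": 1e9, "M": 1e6}
    # Capture the numeric part and the one-letter suffix separately
    parts = values.str.extract(r"^\$(?P<number>\d+(?:\.\d+)?)(?P<suffix>[BM])$")
    number = pd.to_numeric(parts["number"], errors="coerce")
    scale = parts["suffix"].map(multipliers)
    return number * scale

# '$180B' and '$2B' appear in the rows above; '$918M' and 'Unknown' are hypothetical
example = pd.Series(["$180B", "$2B", "$918M", "Unknown"])
converted = parse_money(example)
print(converted)
```

Returning NaN for unparseable strings keeps the column numeric and lets the missing cases be counted afterwards rather than raising on the first bad value.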

In this phase of the analysis, the data is grouped by Industry and the maximum valuation is calculated for each group using the new numeric valuation column, identifying the highest-valued company within each industry. The aggregated values are sorted in descending order for readability before plotting.

View Code

 
    # Create 'Valuation_numeric' column (string -> numeric)
    companies_sampled['Valuation_numeric'] = (
      companies_sampled['Valuation']
      .str.replace('$', '', regex=False)   # Remove $ sign
      .str.replace('B', '', regex=False)   # Remove B suffix
      .astype(float) * 1e9                 # Convert billions -> actual value
    )

    # Aggregate max valuation by Industry (column name must match the one created above)
    industry_max_val = (
      companies_sampled
      .groupby('Industry')['Valuation_numeric']
      .max()
      .sort_values(ascending=False)        # Largest first for a readable chart
    )


    # Bar Plot
    plt.figure()
  
    # Plot industries vs their max valuation
    plt.bar(industry_max_val.index, industry_max_val.values)
  
    # Improve x-axis readability
    plt.xticks(rotation=45, ha='right')

    # Axis labels
    plt.xlabel('Industry')
    plt.ylabel('Maximum Valuation (USD)')

    # Chart title
    plt.title('Maximum Unicorn Valuation by Industry (Sample)')

    # Adjust layout
    plt.tight_layout()

    # Plot
    plt.show()

Summary

Technology-driven industries dominate maximum unicorn valuations within this sample, suggesting that high-value companies may be driven by factors like demand and scalability.

Internet software & services and data management & analytics also show substantial valuations, indicating solid growth potential and demand, while industries like consumer & retail and auto & transportation appear toward the lower end.

Overall, the maximum valuation bar chart reinforces the pattern that technology-driven industries dominate the upper range, underscoring the impact of industry-specific dynamics.

Project Overview

This project explores a dataset of over 1,000 unicorn companies to better understand how growth timelines and valuations vary across industries. The analysis begins with data inspection, cleaning, and transformation, including converting key fields such as Date Joined into a usable datetime format and engineering new features like Year Joined to support time-based analysis.

After the preliminary inspection, the focus shifts to how long companies take to reach unicorn status, by calculating a time to unicorn metric and comparing it across industries. Results from the sampled data indicate that industries such as artificial intelligence and auto & transportation reach unicorn status more quickly, while others take longer.

Further analysis explores maximum valuations by industry, revealing that the artificial intelligence and fintech industries dominate the upper range of valuations. The findings indicate that both growth speed (time to unicorn) and valuation are unevenly distributed, highlighting the influence of industry dynamics such as scalability and demand.

What can be recommended:

Clearly identifying the industries that align with the firm's investment strategy (high growth, high valuation, or both)
After identifying key industries, filtering the dataset to include only companies within those sectors to allow more meaningful comparisons
Comparing valuation against the number of investors, since a high valuation backed by relatively few investors could indicate growth potential or untapped opportunity
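The last two recommendations could be prototyped along the following lines. This is only a sketch under stated assumptions: the toy rows are hypothetical, the target industries are placeholders for whatever the firm selects, Select Investors is assumed to be a comma-separated string, and valuation_per_investor is an illustrative ratio rather than an established metric.

```python
import pandas as pd

# Hypothetical rows; column names follow the dataset and the notebook
toy = pd.DataFrame({
    "Company": ["A", "B", "C", "D"],
    "Industry": ["Fintech", "Artificial intelligence", "Fintech", "Other"],
    "Valuation_numeric": [95e9, 180e9, 46e9, 100e9],
    "Select Investors": ["X, Y, Z", "X", "Y, Z", None],
})

# Placeholder choice of sectors matching the firm's strategy
target_industries = ["Fintech", "Artificial intelligence"]
focus = toy[toy["Industry"].isin(target_industries)].copy()

# Number of listed investors, assuming a comma-separated string
focus["investor_count"] = focus["Select Investors"].str.split(",").str.len()

# Illustrative ratio: valuation relative to the number of listed investors
focus["valuation_per_investor"] = focus["Valuation_numeric"] / focus["investor_count"]
ranked = focus.sort_values("valuation_per_investor", ascending=False)

print(ranked[["Company", "investor_count", "valuation_per_investor"]])
```

Since the column lists only select investors, the count is a rough proxy and the ranking should be read as a starting point for further review.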