
Unicorn Company Analysis (EDA with Python)

by Alfred Rico

Key Takeaways

Project Background

The data used for this project provides information on over 1,000 unicorn companies, companies valued at over one billion dollars. Features in this dataset for each company include, among others, their industry, country, year founded, and select investors. When and how these companies reached unicorn status, a prestigious milestone, can provide valuable insight for investing firms, for example, when deciding which companies to invest in next.

Project Goal

The objective of this project is to analyze unicorn company data to understand how valuation and growth timelines vary across industries, and to generate insights that support investment decision-making.

Technical Implementation

Data Loading & Inspection

Loading Unicorn_Companies.csv

matplotlib - used for data visualization

Pandas - used to manipulate, structure, and handle the data

note

Datetime operations were performed using pandas' built-in datetime tools (pd.to_datetime, .dt) without relying on Python's standalone datetime module.

Loading the dataset and performing an initial inspection to understand structure, size, and data types using pandas.

# Importing
import pandas as pd
import matplotlib.pyplot as plt
  
# Loading
companies = pd.read_csv("Unicorn_Companies.csv")

# Inspection
companies.head(5)     # First 5 rows for a quick look at the columns
companies.shape       # (rows, columns)
companies.size        # Total number of cells (rows x columns)
companies.info()      # Dtypes and non-null counts per column
companies.describe()  # Descriptive statistics for numeric columns

companies.head(5)

   Company    Valuation  Date Joined  Industry                 City           Country/Region  Continent      Year Founded  Funding  Select Investors
0  Bytedance  $180B      4/7/17       Artificial intelligence  Beijing        China           Asia           2012          $8B      Sequoia Capital China...
1  SpaceX     $100B      12/1/12      Other                    Hawthorne      United States   North America  2002          $7B      Founders Fund...
2  SHEIN      $100B      7/3/18       E-commerce               Shenzhen       China           Asia           2008          $2B      Tiger Global...
3  Stripe     $95B       1/23/14      Fintech                  San Francisco  United States   North America  2010          $2B      Khosla Ventures...
4  Klarna     $46B       12/12/11     Fintech                  Stockholm      Sweden          Europe         2005          $4B      Institutional Venture...

companies.shape

(1074, 10)

companies.size

10740

companies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1074 entries, 0 to 1073
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Company           1074 non-null   object
 1   Valuation         1074 non-null   object
 2   Date Joined       1074 non-null   object
 3   Industry          1074 non-null   object
 4   City              1058 non-null   object
 5   Country/Region    1074 non-null   object
 6   Continent         1074 non-null   object
 7   Year Founded      1074 non-null   int64 
 8   Funding           1074 non-null   object
 9   Select Investors  1073 non-null   object
dtypes: int64(1), object(9)
memory usage: 84.0+ KB
  

companies.describe()

       Year Founded
count   1074.000000
mean    2012.895717
std        5.698573
min     1919.000000
25%     2011.000000
50%     2014.000000
75%     2016.000000
max     2021.000000

Summary

companies.head(5) provides an initial snapshot of the dataset’s structure and key features. Of particular relevance to the project objective is the Date Joined column, which captures the date each company reached unicorn status (valuation ≥ $1B). This field will be especially important for temporal analysis, as its year component can be extracted to examine trends in how quickly companies achieve unicorn status across industries.

companies.size returns 10,740, the total number of cells in the dataset (1,074 rows × 10 columns), whereas companies.shape reports the two dimensions separately.
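
The relationship between size and shape can be checked on a small frame. A minimal sketch, using a made-up two-column frame in place of the real dataset:

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "Company": ["A", "B", "C"],
    "Year Founded": [2010, 2015, 2019],
})

n_rows, n_cols = toy.shape   # number of rows and number of columns
n_cells = toy.size           # every cell is counted, including missing ones

print(n_rows, n_cols, n_cells)
```

Because size counts cells rather than records, it says nothing about how many of those cells are actually populated; that is what info() adds.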

companies.info() reveals a few details that stand out immediately. First, there are 1,074 entries and 10 columns, and 9 of those 10 columns are object types; only Year Founded is numeric. Columns such as Funding and Valuation will likely need to be converted to numeric types, and Date Joined will need to be converted to a datetime format. Missing values are present in two columns: 16 in City and 1 in Select Investors, affecting fewer than 2% of the rows. The dataset is small, but with over 1,000 entries it is still usable for analysis.
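
The two follow-ups noted above, numeric conversion and missing-value counts, might look like the sketch below. It assumes money strings carry a B or M suffix and that some entries are not numeric at all; only $-prefixed B values are visible in the preview, so both are assumptions. parse_money is a hypothetical helper, and the toy rows are made up.

```python
import math
import pandas as pd

def parse_money(value):
    """Convert strings such as '$8B' or '$500M' to dollars (hypothetical helper).

    Returns NaN for anything that does not match that pattern.
    """
    multipliers = {"B": 1e9, "M": 1e6}
    if not isinstance(value, str):
        return math.nan
    text = value.strip().lstrip("$")
    if not text or text[-1] not in multipliers:
        return math.nan
    try:
        return float(text[:-1]) * multipliers[text[-1]]
    except ValueError:
        return math.nan

# Toy rows; "Unknown" is an assumed placeholder, not confirmed in the source
toy = pd.DataFrame({
    "Funding": ["$8B", "$500M", "Unknown"],
    "City": ["Beijing", None, "Stockholm"],
})

toy["funding_usd"] = toy["Funding"].map(parse_money)

# Missing values per column, equivalent to reading them off info()
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Returning NaN instead of raising keeps unparseable rows visible in the missing-value count rather than stopping the conversion.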

companies.describe() summarizes Year Founded, the only numeric column, and highlights both an outlier and how recently most unicorns were founded. The minimum of 1919 is a clear outlier: it suggests either a legacy company that achieved unicorn status much later or a data irregularity. At the other end, the maximum of 2021, read alongside the median (2014) and 75th percentile (2016), shows that most companies in the dataset were founded within roughly the last decade, and that some reached unicorn status very soon after being founded.
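
To follow up on an outlier like the 1919 minimum, the row behind it can be pulled out directly. A minimal sketch on made-up rows; the company names and the 1990 cutoff are illustrative assumptions:

```python
import pandas as pd

# Made-up rows that mimic the min / median / max pattern seen in describe()
toy = pd.DataFrame({
    "Company": ["OldCo", "MidCo", "NewCo"],
    "Year Founded": [1919, 2014, 2021],
})

# Row holding the minimum founding year
oldest = toy.loc[toy["Year Founded"].idxmin()]

# All rows founded before an arbitrary cutoff, for manual review
founded_before_1990 = toy[toy["Year Founded"] < 1990]

print(oldest["Company"], len(founded_before_1990))
```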

Datetime Conversion & Feature Engineering

For this part of the analysis, the data is prepared by converting Date Joined to datetime format, extracting the year component, and engineering a Year Joined feature to support time-based analysis.

# Convert to datetime
companies['Date Joined'] = pd.to_datetime(companies['Date Joined'])
# Checking
companies.info()

# Creating 'Year Joined' column
companies['Year Joined'] = companies['Date Joined'].dt.year
# Checking
companies.head(5)

Convert to datetime

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1074 entries, 0 to 1073
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Company           1074 non-null   object        
 1   Valuation         1074 non-null   object        
 2   Date Joined       1074 non-null   datetime64[ns]
 3   Industry          1074 non-null   object        
 4   City              1058 non-null   object        
 5   Country/Region    1074 non-null   object        
 6   Continent         1074 non-null   object        
 7   Year Founded      1074 non-null   int64         
 8   Funding           1074 non-null   object        
 9   Select Investors  1073 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(8)
memory usage: 84.0+ KB
  

Creating 'Year Joined' column (to the right)

   Company    Valuation  Date Joined  Industry                         City           Country/Region  Continent      Year Founded  Funding  Select Investors          Year Joined
0  Bytedance  $180B      2017-04-07   Artificial intelligence          Beijing        China           Asia           2012          $8B      Sequoia Capital China...  2017
1  SpaceX     $100B      2012-12-01   Other                            Hawthorne      United States   North America  2002          $7B      Founders Fund...          2012
2  SHEIN      $100B      2018-07-03   E-commerce & direct-to-consumer  Shenzhen       China           Asia           2008          $2B      Tiger Global...           2018
3  Stripe     $95B       2014-01-23   Fintech                          San Francisco  United States   North America  2010          $2B      Khosla Ventures...        2014
4  Klarna     $46B       2011-12-12   Fintech                          Stockholm      Sweden          Europe         2005          $4B      Institutional Venture...  2011

Summary

As flagged during the initial inspection, the Date Joined column is converted from an object type to datetime64[ns], which enables time-based operations such as extracting the year. A quick check with companies.info() confirms that the conversion was applied.

Year Joined, a new feature, is engineered by extracting the year component from the datetime column. This transformation simplifies comparisons between when companies were founded and when they achieved unicorn status, further supporting trend analysis and visualization.
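
When the date strings share one layout, the format can also be stated explicitly instead of inferred. A minimal sketch, assuming every Date Joined value follows the month/day/two-digit-year pattern seen in the preview rows; the format string is an assumption, not something the analysis itself specifies:

```python
import pandas as pd

# Three Date Joined strings taken from the preview rows
raw = pd.Series(["4/7/17", "12/1/12", "1/23/14"])

# %m/%d/%y = month/day/two-digit year (assumed to hold for the whole column)
parsed = pd.to_datetime(raw, format="%m/%d/%y")

# Same extraction used for the Year Joined feature
years = parsed.dt.year

print(years.tolist())
```

An explicit format makes parsing fail loudly on rows that do not match, which can surface irregular entries early.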

Data Sampling & Time to Unicorn Visualization

The analysis shifts focus to visualization, where a random sample of 50 unicorn companies is selected using companies.sample(), with the random_state parameter specified to allow reproducibility.

time_to_unicorn is created by subtracting Year Founded from the year component of Date Joined.

The sample is then grouped by Industry, the maximum time_to_unicorn is calculated for each group, and the results are sorted in descending order to simplify comparison and highlight the industries with the longest timelines.

A bar chart is created using matplotlib, plotting industries on the x-axis and their corresponding maximum time to unicorn on the y-axis. Labels are rotated for readability and the layout is adjusted to prevent overlap.
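
The reproducibility claim for the sampling step can be checked on a toy frame. A minimal sketch with made-up data; the seed value 42 is arbitrary, and any fixed integer would work the same way:

```python
import pandas as pd

# Made-up frame standing in for the full dataset
toy = pd.DataFrame({
    "Company": list("ABCDEFGHIJ"),
    "Year Founded": range(2010, 2020),
})

# Two draws with the same seed select the same rows in the same order
first = toy.sample(n=4, random_state=42)
second = toy.sample(n=4, random_state=42)

print(first.equals(second))
```

Sampling is without replacement by default, so no company can appear twice in the sample.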

# Data sampling (random_state makes the sample reproducible)
companies_sampled = companies.sample(n=50, random_state=42)

# Years from founding to unicorn status
companies_sampled['time_to_unicorn'] = (
    companies_sampled['Date Joined'].dt.year - companies_sampled['Year Founded']
)

# Longest time_to_unicorn per industry, sorted in descending order
industry_max = (
    companies_sampled
    .groupby('Industry')['time_to_unicorn']
    .max()
    .sort_values(ascending=False)
)

# Create bar plot
plt.figure()  # Initialize a new figure

# Industry names (x), max times (y)
plt.bar(industry_max.index, industry_max.values)

# Rotate x-axis labels for readability
plt.xticks(rotation=45, ha='right')

# Axis labels
plt.xlabel('Industry')
plt.ylabel('Longest Time to Unicorn (Years)')

# Chart title
plt.title('Longest Time to Reach Unicorn Status by Industry (Sample)')

# Prevent label cutoff
plt.tight_layout()

# Render the plot
plt.show()

Summary

The bar chart reveals clear differences in how long it takes companies across industries to reach unicorn status.

Overall, the sample indicates that time to unicorn is not evenly distributed across industries, which points to industry-specific dynamics such as longer development cycles (Cybersecurity) or complex business models (Health). Shorter time-to-unicorn values, in contrast, could reflect stronger market demand or rapid scalability in those industries. Because the chart shows the maximum within a 50-company sample, a single slow-growing company can set an industry's bar, so these differences are best read as indicative.
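
One way to see how much a single company drives an industry's maximum is to show the median and group size next to it. This is a hedged sketch on made-up numbers, not part of the original analysis; the column names follow the ones used above.

```python
import pandas as pd

# Made-up values: one slow company dominates the Fintech maximum
toy = pd.DataFrame({
    "Industry": ["Fintech", "Fintech", "Fintech", "Health", "Health"],
    "time_to_unicorn": [2, 3, 25, 6, 7],
})

# Max, median and group size per industry, sorted by max
summary = (
    toy.groupby("Industry")["time_to_unicorn"]
    .agg(["max", "median", "count"])
    .sort_values("max", ascending=False)
)

print(summary)
```

In this toy example Fintech has the larger maximum but the smaller median, which is the kind of gap a max-only chart cannot show.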

Maximum Valuation by Industry Visualization

The Valuation column, originally stored as a string, is cleaned and converted into a numeric format. The dollar sign and 'B' suffix are removed and the values are cast to float before being scaled using 1e9 to represent actual dollar amounts.

In this phase of the analysis, the data is grouped by Industry and the maximum valuation is calculated for each group from the new numeric valuation column, giving the highest valuation reached within each industry. The aggregated values are sorted in descending order for readability before plotting.

 
# Create 'Valuation_numeric' column (string -> numeric)
companies_sampled['Valuation_numeric'] = (
    companies_sampled['Valuation']
    .str.replace('$', '', regex=False)   # Remove $ sign
    .str.replace('B', '', regex=False)   # Remove B suffix
    .astype(float) * 1e9                 # Convert billions -> actual value
)

# Aggregate max valuation by Industry
industry_max_val = (
    companies_sampled
    .groupby('Industry')['Valuation_numeric']
    .max()
    .sort_values(ascending=False)        # Descending order for readability
)

# Bar plot
plt.figure()

# Plot industries vs their max valuation
plt.bar(industry_max_val.index, industry_max_val.values)

# Improve x-axis readability
plt.xticks(rotation=45, ha='right')

# Axis labels
plt.xlabel('Industry')
plt.ylabel('Maximum Valuation (USD)')

# Chart title
plt.title('Maximum Unicorn Valuation by Industry (Sample)')

# Adjust layout
plt.tight_layout()

# Render the plot
plt.show()
  
Summary

Technology-driven industries dominate maximum unicorn valuations within this sample, suggesting that the highest-valued companies may be driven by factors such as demand and scalability.

Internet software & services and data management & analytics also show substantial valuations, indicating solid growth potential and demand, while industries such as consumer & retail and auto & transportation appear toward the lower end.

Overall, the maximum valuation bar chart reinforces the pattern that technology-driven industries occupy the upper range, underscoring the impact of industry-specific dynamics.
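
A groupby max returns the top valuation per industry but not the company that holds it. A minimal sketch of recovering the company name with idxmax, using three rows from the dataset preview; the approach is an illustrative assumption rather than part of the original analysis:

```python
import pandas as pd

# Three rows taken from the dataset preview
toy = pd.DataFrame({
    "Company": ["Bytedance", "Stripe", "Klarna"],
    "Industry": ["Artificial intelligence", "Fintech", "Fintech"],
    "Valuation": ["$180B", "$95B", "$46B"],
})

# Same cleaning chain as above: strip '$' and 'B', then scale to dollars
toy["Valuation_numeric"] = (
    toy["Valuation"]
    .str.replace("$", "", regex=False)
    .str.replace("B", "", regex=False)
    .astype(float) * 1e9
)

# idxmax returns the row label of each industry's maximum valuation
top_rows = toy.loc[toy.groupby("Industry")["Valuation_numeric"].idxmax()]
top_by_industry = top_rows.set_index("Industry")["Company"]

print(top_by_industry)
```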

Project Overview

This project explores a dataset of over 1,000 unicorn companies to better understand how growth timelines and valuations vary across industries. The analysis begins with data inspection, cleaning, and transformation, including converting key fields such as Date Joined into a usable datetime format and engineering new features such as Year Joined to support time-based analysis.

After the preliminary inspection, the focus shifts to exploring how long companies take to reach unicorn status by calculating a time to unicorn metric and comparing it across industries. Results from the sampled data indicate that industries such as artificial intelligence and auto & transportation reach unicorn status more quickly, while others take longer.

Further analysis explores maximum valuations by industry, revealing that the artificial intelligence and fintech industries dominate the upper range of valuations. The findings indicate that both growth speed (time to unicorn) and valuation are unevenly distributed, highlighting the influence of industry dynamics such as scalability and demand.

What can be recommended:

  • Clearly identifying industries that align with the firm's investment strategy (high growth, high valuation, or both)
  • After identifying key industries, filtering the dataset to include only companies within those sectors to enable more meaningful comparisons
  • Comparing valuations against the number of investors, since a high valuation with relatively few investors could indicate growth potential or untapped opportunity
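
The last two recommendations could be prototyped roughly as follows. This is a sketch under stated assumptions: the rows are made up, target_industries is a hypothetical strategy choice, investor_count and valuation_per_investor are hypothetical column names, and Select Investors is assumed to be a comma-separated list. Since that column lists only select investors, the count is a rough proxy at best.

```python
import pandas as pd

# Made-up rows; column names follow the dataset
toy = pd.DataFrame({
    "Company": ["A", "B", "C", "D"],
    "Industry": ["Fintech", "Fintech", "Health", "Other"],
    "Valuation_numeric": [95e9, 46e9, 3e9, 100e9],
    "Select Investors": ["X, Y, Z", "X", "Y, Z", None],
})

# Keep only the industries of interest (hypothetical choice)
target_industries = ["Fintech", "Health"]
focus = toy[toy["Industry"].isin(target_industries)].copy()

# Count listed investors by splitting the comma-separated string
focus["investor_count"] = focus["Select Investors"].str.split(",").str.len()

# Valuation relative to the number of listed investors
focus["valuation_per_investor"] = (
    focus["Valuation_numeric"] / focus["investor_count"]
)

ranked = focus.sort_values("valuation_per_investor", ascending=False)
print(ranked[["Company", "investor_count", "valuation_per_investor"]])
```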