Exploratory Data Analysis on Worldwide Startups

Nov 30, 2020

Startups are known for innovation that arises from the gap between a problem and its solution, and for being growth-seeking businesses. The nature of such businesses requires heavy funding, so it is common for them to seek capital from a variety of sources such as angel investors and venture capital firms.

In this project we look for hidden insights in the dataset, which covers the years from 1902 until 2014: when each company was founded, its status, corporate actions such as Merger & Acquisition (M&A) or becoming a public company, and its funding status, that is, whether it has received seed, angel, or venture capital investment. The analysis is conducted in a Python 3 environment.

The dataset was acquired from Crunchbase in CSV format and contains startup companies worldwide recorded from 1902 until 2014. It consists of 54,294 entries (rows) with 38 attributes of data type object or float, as the picture below shows:

For further analysis we would like to estimate how likely a company is to succeed, using its status, market, and funding received as a basis and measuring the correlation among these attributes.

Various studies commonly define a successful startup through one of two routes that make a large amount of money for its founders, investors and first employees. A company can either have an IPO (Initial Public Offering) by going to a public stock market (e.g. Facebook going public, allowing everyone to invest in the company by buying shares sold by its insiders on the U.S. stock market), or be acquired by or merged with another company (M&A, e.g. Microsoft acquiring LinkedIn for $26B), where those who previously invested receive immediate cash in return for their shares. This process is often called an exit strategy (Guo, Lou, & Pérez-Castrillo, 2015). This project therefore considers both an IPO and an M&A process as the critical events that classify a startup as successful.

Data Cleansing & Feature Engineering

After acquiring the dataset we found that a number of data cleansing tasks were needed before any further analysis: the dataset has messy formatting and header labels, many missing values, and widely dispersed values that introduce outliers.

The data wrangling steps taken in this project are:

Fixing the spacing in header names such as ‘ market ‘ and ‘ funding_total_usd ‘.
Removing 4,855 duplicate rows.
Tackling uncommon formats.
The attribute ‘funding_total_usd’ is stored as strings with commas used as number separators, so we strip the commas and convert the data type to numeric.
Handling missing values.
Replacing missing (NaN) values, such as those in ‘funding_total_usd’, with 0.
Detecting and handling outliers.
When plotting distributions, outliers make the visualization hard to interpret, so we remove them using the interquartile range. Note that this step is used for Exploratory Data Analysis only, not for Machine Learning (in ML we would instead transform the data, for example with regression, polynomial regression, or a log transform).
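The cleansing steps above can be sketched in pandas roughly as follows. This is a minimal sketch on a small made-up frame, not the project's actual code: the two padded header names come from the dataset, while the sample values, the `name` column, the helper `remove_outliers_iqr`, and the conventional 1.5 × IQR fence are assumptions made for illustration.

```python
import pandas as pd

# Tiny made-up sample that mimics the raw file's problems (illustrative values only).
raw = pd.DataFrame({
    "name": ["A", "B", "B", "C", "D", "E"],
    " market ": ["Software", "Games", "Games", "Mobile", "Software", "Biotechnology"],
    " funding_total_usd ": ["1,000,000", "250,000", "250,000", None, "500,000", "900,000,000"],
})

# 1. Fix the spacing in the header names.
df = raw.rename(columns=str.strip)

# 2. Remove duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)

# 3. Strip the comma separators and convert to numeric.
#    Anything that still cannot be parsed becomes NaN (errors="coerce").
df["funding_total_usd"] = pd.to_numeric(
    df["funding_total_usd"].str.replace(",", "", regex=False),
    errors="coerce",
)

# 4. Replace missing funding values with 0.
df["funding_total_usd"] = df["funding_total_usd"].fillna(0)


def remove_outliers_iqr(frame, column, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    mask = frame[column].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask]


# 5. For plotting only: drop outliers using the interquartile range.
df_plot = remove_outliers_iqr(df, "funding_total_usd")
```

Keeping `df` and `df_plot` as separate frames reflects the note above: the outlier-free copy is meant for visualization, while the full data stays available for later modelling.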

Feature Engineering also brings advantages: it converts object-type attributes into numeric ones through One-Hot Encoding, and it can transform attributes that contain outliers (removing those rows altogether could reduce training accuracy later in the Machine Learning process). In this dataset we use:

One-Hot Encoding for the attribute ‘status’.
New variables ‘get_seed_funding’, ‘get_angel_fund’, and ‘get_venture’, and most importantly ‘successfull_code’.
Each is 1 for ‘Yes’ and 0 for ‘No’ for the condition it names.
A new attribute ‘founded_year’ in place of ‘founded_at’. The year in ‘founded_at’ is inconsistent with the year in ‘founded_month’, so we extract the year from ‘founded_month’ into ‘founded_year’ and then drop the ‘founded_at’ column.
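The feature engineering steps above might look roughly like this in pandas. It is a sketch under stated assumptions rather than the project's code: the source columns `seed` and `angel`, the 'YYYY-MM' format of ‘founded_month’, and the rule that ‘successfull_code’ is 1 when the status is 'acquired' are all assumptions, and the three rows are made up.

```python
import pandas as pd

# Made-up rows; 'seed', 'angel' and the 'YYYY-MM' month format are assumptions.
df = pd.DataFrame({
    "status": ["operating", "acquired", "closed"],
    "seed": [50000.0, 0.0, 0.0],
    "angel": [0.0, 0.0, 100000.0],
    "venture": [0.0, 2000000.0, 0.0],
    "founded_at": ["2010-05-01", "2007-01-01", "2012-09-15"],
    "founded_month": ["2010-05", "2006-01", "2012-09"],
})

# Binary flags: 1 for 'Yes' (some amount received), 0 for 'No'.
df["get_seed_funding"] = (df["seed"] > 0).astype(int)
df["get_angel_fund"] = (df["angel"] > 0).astype(int)
df["get_venture"] = (df["venture"] > 0).astype(int)

# Assumed rule: a startup counts as successful when its status is 'acquired'.
df["successfull_code"] = (df["status"] == "acquired").astype(int)

# Take the year from 'founded_month', then drop 'founded_at'.
df["founded_year"] = df["founded_month"].str[:4].astype(int)
df = df.drop(columns="founded_at")

# One-hot encode 'status' into status_acquired, status_closed, status_operating.
df = pd.get_dummies(df, columns=["status"], dtype=int)
```

Note that ‘successfull_code’ is derived before the one-hot step, because `get_dummies` replaces the original ‘status’ column. On real data, rows with a missing ‘founded_month’ would need handling before the integer conversion.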

Key Findings and Insights

This section has three sub-sections: startups, market, and funding.

START UP

a. Top 5 Countries by Number of Startups

Until 2014, the USA dominated the number of startups across the globe, with more than 50% of all startups in the dataset. This is not surprising, since the US has an immense support ecosystem that helps startups grow from ideation to scaling up the business. It is followed by England with 2,642 startups, Canada with 1,405, China with 1,239, and Germany with 968 as of 2014.

b. Start Up Status

Of the 49,437 startups recorded from 1902 until 2014, 5.4% are closed, 86.9% are operating, and 7.7% have been acquired. Being acquired is one of the exit strategies that define a successful startup.
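Percentages like these can be obtained from a status column with a normalized value count. The snippet below is a small sketch on ten made-up rows, so its shares are not the dataset's figures.

```python
import pandas as pd

# Ten made-up status labels (not the real distribution).
status = pd.Series(["operating"] * 7 + ["acquired"] * 2 + ["closed"], name="status")

# Share of each status in percent, rounded to one decimal place.
share = status.value_counts(normalize=True).mul(100).round(1)
```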

c. Start Up Founded Year Distribution

The figure above shows that 1995 marks the start of the worldwide growth in startups, with around 437 companies founded that year. The number almost doubled in the following years, reaching 731 startups in 2001.

History records this period as the ‘Dotcom Bubble’, when technology companies drew the market into overvaluation. In 1999, the height of the dotcom craze, there were 457 IPOs. Most were Internet and technology stocks. Of those, 117 doubled in price on the first day of trading. Tech and dotcom IPOs were minting new millionaires every day, both at the management level and retail investor level. But then the sell-off started on March 11, 2000. Investors suddenly realized that a tech or Internet company with a billion-dollar valuation but no revenue or earnings, and saddled with debt, has no future.

d. Those Who Survived the Dotcom Bubble and Became Tech Titans

After the Dotcom Bubble, companies with strong business revenues survived, namely Amazon, Netflix, eBay, Google, and Alibaba. Some of them are still leading tech companies today. The FANG+ companies (Facebook, Amazon, Netflix, Google, Alibaba) are the tech titans that have outperformed the wider market since the coronavirus (COVID-19) pandemic spurred record sell-offs in March 2020. Unlike other stocks, which hit their lows at that time, FANG+ shares have risen by as much as 80% from their lows in early March, driven by their performance and forward-looking valuations.

MARKET

a. Top 15 Startup Markets Worldwide

As the number of startups grows, they naturally reach almost every sector that touches people's lives. The most common category among startups worldwide is Software, followed by Biotechnology, Mobile, E-Commerce, Curated Web, Enterprise Software, Health Care, Clean Technology, Games, and embedded hardware & software systems.

The figure above, by contrast, shows that E-Commerce is the most popular category among startups in China, Indonesia, and India, which differs slightly from the United States.

b. Most Favorable Category of Startup Product

For almost three decades until 2014, Social Media, Curated Web, and Mobile were the most popular startup product categories worldwide. In the word cloud, the less popular a category is, the smaller its word is drawn.

FUNDING

a. Total Funding Distribution

Total funding is the sum of all funding a startup has obtained across all rounds, from seed, grants, angel investors, and venture capital. The picture above shows that total funding is very widely dispersed across startups, so we remove the outliers to get a clearer view, as can be seen below. The data shows that most startups received total funding below USD 2.5 million, and that the top 10% of startups received 78% of all funding across the globe.
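A concentration figure such as "the top 10% received 78%" can be computed by sorting the funding totals and summing the largest tenth. The function below is a sketch with a hypothetical name (`top_share`) and made-up amounts, so the resulting share is illustrative, not the dataset's.

```python
import pandas as pd


def top_share(values, top_fraction=0.10):
    """Share of the overall total held by the largest `top_fraction` of entries."""
    ordered = values.sort_values(ascending=False)
    n_top = max(1, int(len(ordered) * top_fraction))
    return ordered.iloc[:n_top].sum() / ordered.sum()


# Ten made-up funding totals: one large value dominates the sum.
funding = pd.Series([100, 50, 10, 10, 5, 5, 5, 5, 5, 5], dtype=float)
share = top_share(funding)  # the top 1 of 10 entries holds 100 of 200
```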

b. Total Funding of Various Unicorns in 2014

There were not as many unicorns in 2014 as there are today, but a few of them are still tech titans today. Facebook, one of the unicorns at that time, obtained the highest funding, almost USD 2.5 billion, ahead of Alibaba, Twitter, Cloudera, and Uber.

c. Seed Funding vs. Angel Funding vs. Venture Investment

Seed money, sometimes called seed funding or seed capital, is a form of securities offering in which an investor invests capital in a startup company in exchange for an equity stake or convertible note stake in the company. The term seed suggests that this is a very early investment, meant to support the business until it can generate cash of its own or until it is ready for further investments. Seed money options include friends and family funding, seed venture capital funds, angel funding, and crowdfunding.

In this dataset, the difference between seed funding and angel investment is that seed funding comes from institutional seed venture capital funds, whereas angel investment comes from informal or private investors, called angel investors, who invest based on their personal preference.

Venture capital is a form of private equity and a type of financing that investors provide to startup companies and small businesses that are believed to have long-term growth potential. Venture capital generally comes from well-off investors, investment banks and other financial institutions. However, it does not always take a monetary form; it can also be provided in the form of technical or managerial expertise. In the dataset, the column ‘venture’ holds the total investment amount from round A through round H.

The question is, how many startups receive seed funding, angel investment, and venture investment? The three pie charts below show the difference.

Angel investment is clearly the hardest source of funding for a startup to obtain, as angel investments are few in number and strong networking is required to get access to them. A business incubator can act as the hub between angel investors and startups.

Meanwhile, around 28% of startups receive seed funding. From an investor's perspective, giving seed funding has both an advantage and a disadvantage. The drawback is that the risk is higher than for investors who put money in during a VC round, since there is not yet a real market or revenue figures. On the other hand, investors who choose the right startup at the seed stage earn a higher return, because they need less money to join the shareholder list than in VC rounds. High risk, high return.

Last but not least, the data shows that out of every 100 startups, 47 are backed by venture capital institutions. Although this might make VC investment seem easy to get, a startup must be ready for the full due diligence process. If a VC is interested in the proposal, the firm or investor performs due diligence, which includes a thorough investigation of the company’s business model, products, management, and operating history, among other things.

Once due diligence has been completed, the firm or the investor will pledge an investment of capital in exchange for equity in the company. These funds may be provided all at once, but more typically the capital is provided in rounds. The firm or investor then takes an active role in the funded company, advising and monitoring its progress before releasing additional funds.

The investor exits the company after a period of time, typically four to six years after the initial investment, by initiating a merger, acquisition or initial public offering (IPO). Companies that reach such an exit are the ones we label as successful startups later in this project.

d. Do Successful Unicorns Require Seed Funding?

It is fascinating that the FANG+ companies (Facebook, Amazon, Netflix, Google, Alibaba), the tech titans that today account for the majority of tech market capitalization on Wall Street, did not receive seed funding right after they created their products.

Instead, they bootstrapped to fund themselves. By contrast, Dropbox and Uber (also unicorns today) received seed funding of USD 200,000 or less.

e. How much do the amounts differ between seed funding, angel investment, and each VC round?

From the dataset we can conclude that seed funding has the lowest average amount among all funding rounds and sources, at around USD 776,350.

The average angel investment is around USD 1 million.

The averages for the venture capital rounds are:
Average of Round A : USD 6.9 Million
Average of Round B : USD 13.5 Million
Average of Round C : USD 21 Million
Average of Round D : USD 28 Million
Average of Round E : USD 32 Million
Average of Round F : USD 48 Million
Average of Round G : USD 83 Million
Average of Round H : USD 175 Million
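Averages like the ones above can be computed per funding column. The sketch below assumes columns named `seed`, `angel`, and `round_A`, and it assumes that a 0 means "round not received" and is excluded from the mean; neither assumption is confirmed by the text, and the amounts are made up.

```python
import pandas as pd

# Made-up amounts; column names are assumptions about the dataset layout.
df = pd.DataFrame({
    "seed": [100000.0, 0.0, 300000.0],
    "angel": [0.0, 500000.0, 0.0],
    "round_A": [2000000.0, 0.0, 4000000.0],
})

# Mask zeros as NaN so that mean() only averages startups that received the round.
avg_per_round = df.where(df > 0).mean()
```

Including the zeros instead would pull every average down sharply, since most startups never reach the later rounds, so the choice should be stated alongside any reported figure.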

As of 2014, only four companies had received round H investment. Three of them are in E-Commerce: Flipkart (India), Deem (USA), and Locondo (Japan).

The fourth, Gumi, is categorised under games and is headquartered in Singapore.

Hypothesis & Statistical Significance Testing

After exploring the data, the following hypotheses arise:

Startups in the USA have a strong linear relationship with being successful, since the ecosystem is so well established.
Startups that receive venture investment are likely to be successful.
Founded year is one of the predictors of a startup's success.

Statistical Significance Testing

To test the hypotheses for statistical significance, we calculate the Pearson correlation and p-value between attributes.

Correlation is a measure of the extent of interdependence between variables.
Causation is the relationship between cause and effect between two variables.

It is important to know the difference between these two and that correlation does not imply causation. Determining correlation is much simpler than determining causation as causation may require independent experimentation.

Pearson Correlation

The Pearson Correlation measures the linear dependence between two variables X & Y. The resulting coefficient is a value between -1 and 1 inclusive, where:

1: Total positive linear correlation.
0: No linear correlation. The two variables have no linear relationship, although a nonlinear one may still exist.
-1: Total negative linear correlation.
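To make the three reference values concrete, the coefficient can be computed directly from its definition: the sum of products of deviations from the mean, divided by the square root of the product of the two sums of squared deviations. The function below is a plain-Python sketch (the name `pearson_r` and the sample lists are illustrative, not from the project).

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (numerator) and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


perfect_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])      # y rises with x
perfect_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])    # y falls as x rises
no_linear = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # y = x squared
```

The last case is worth noting: y depends entirely on x, yet the coefficient is 0 because the dependence is not linear.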

P-value

The p-value is the probability of observing a correlation at least as strong as the one measured if the two variables were in fact uncorrelated. Normally, we choose a significance level of 0.05, which means we call a correlation statistically significant when its p-value is below 0.05. By convention, when:

p-value is < 0.001: there is strong evidence that the correlation is significant.
p-value is < 0.05: there is moderate evidence that the correlation is significant.
p-value is < 0.1: there is weak evidence that the correlation is significant.
p-value is > 0.1: there is no evidence that the correlation is significant.

A new ‘target’ variable is created to mark whether a startup is successful. As mentioned in the first section, successful startups are those that have a Merger & Acquisition record or have become public companies.

The result below shows that the linear relationship between being a USA startup and being a successful startup is extremely weak. Since the p-value is < 0.001, the correlation between the country code USA and becoming a successful startup is statistically significant, but the coefficient is small (~0.09). In other words, building a startup in the USA does not by itself mean it will succeed.

The same holds for the relationship between receiving venture investment and becoming a successful startup. Since the p-value is < 0.001, the correlation is statistically significant, but the linear relationship is weak (~0.12), meaning that receiving venture investment is only slightly associated with becoming a successful startup.

Hence we conclude that Hypotheses 1 and 2 are not supported by the data.
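A test of this kind can be sketched with `scipy.stats.pearsonr`, which returns the coefficient and a two-sided p-value. The ten rows below are made up, so the numbers differ from the project's results, and `evidence_label` is a hypothetical helper that only encodes the convention listed earlier.

```python
from scipy.stats import pearsonr

# Made-up binary columns: 1 = received venture funding / counted as successful.
get_venture = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
target = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

# Pearson coefficient and two-sided p-value.
r, p_value = pearsonr(get_venture, target)


def evidence_label(p):
    """Map a p-value to the evidence wording used in this article (hypothetical helper)."""
    if p < 0.001:
        return "strong"
    if p < 0.05:
        return "moderate"
    if p < 0.1:
        return "weak"
    return "none"


label = evidence_label(p_value)
```

Significance and strength are separate questions. With tens of thousands of rows, even a coefficient near 0.1 produces a p-value far below 0.001, which is why a correlation can be statistically significant and still too weak to support a hypothesis.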

Suggestion & Summary

There are several suggestions for taking the analysis further:

Use the most recent dataset, as of Q3 2020, to see how startup dynamics changed during the coronavirus pandemic. This could be a basis for governments to mitigate the failure of promising startups.
Predict whether a startup has a high probability of success (defined as M&A or going public through an IPO) using a Machine Learning classification model.

Summary

The dataset covers startups, their corporate actions and the investment they obtained. It was sourced from Crunchbase in CSV format and contains startup companies worldwide recorded from 1902 until 2014. It consists of 54,294 entries (rows) with 38 attributes of data type object or float, with quite messy formatting. Because of this, several data cleansing steps are required before conducting further analysis.

For further research, it would be valuable to include variables that may correlate with the target, such as the founders' years of experience, the number of patents, copyrights or other goodwill of the company, the founders' grit score, customer count, etc., so that we can see which of them could raise the number of successful startups. These could subsequently improve the training data for Machine Learning prediction.

Source Code Archive

The complete Python code is available in my GitHub repository.

Disclaimer: The dataset of startup companies used in this project ends in 2014, which means corporate actions or funding obtained after 2014 are not included in this research.

Thank you for taking the time to read this research. Any ideas or suggestions are very welcome :)

#startup #digital #investment #venturecapital #exploratorydataanalysis #machinelearning