Predicting Successful Companies using Machine Learning

Fitrie Ratnasari
12 min read · Dec 20, 2020

This article is a continuation of the earlier piece Exploratory Data Analysis for Machine Learning on Successful Startup Worldwide.


Brief of Dataset Description & Objective

The dataset used is from Crunchbase, in CSV format, and covers worldwide startup companies recorded from 1902 until 2014. It consists of 54,294 entries (rows) with 38 attributes of object and float data types, as shown in the picture below:

The objective of this project is to predict successful startups among companies founded from 1904 until 2014 by training five different classification models on 70% of the dataset (using features such as market, total funding, origin country, and each investment round the companies obtained), and then testing on the remaining 30% to measure accuracy.
In simple words, we predict the labels of successful and non-successful companies by examining how well the five classification models, trained on the full set of features, generalise to new test data. This work can serve as a benchmark for investment institutions such as venture capital firms to do early assessments of startup companies.

In various studies, a successful startup is commonly defined through the exit strategies that make a large amount of money for its founders, investors, and first employees: the company can either have an IPO (Initial Public Offering) by going to a public stock market (e.g. Facebook going public, allowing everyone to invest in the company by buying shares sold by its insiders on the U.S. stock market), or be acquired by or merged (M&A) with another company (e.g. Microsoft acquiring LinkedIn for $26B), where those who previously invested receive immediate cash in return for their shares. This process is often denominated an exit strategy (Guo, Lou, & Pérez-Castrillo, 2015). This project therefore considers both an IPO (Initial Public Offering) and M&A (Mergers & Acquisitions) as the critical events that classify a startup as successful.

The initial plan, before any machine learning, is to inspect the dataset thoroughly, checking whether the data types and values of each attribute are appropriate; from that, we can determine what kind of data cleansing needs to be done before moving on to machine learning.

Data Cleansing & Feature Engineering

After acquiring the dataset, we found that numerous data-cleansing tasks were needed before any further analysis, since the dataset is quite messy: inconsistent formatting, poorly labelled headers, a lot of missing values, and widely dispersed values that introduce outliers.

The data-wrangling steps taken in this project are as follows (a code sketch of these steps appears after the list):

  1. Fixing the spacing in headers such as ‘ market ‘ and ‘ funding_total_usd ‘
  2. Removing 4,855 duplicate rows
  3. Tackling uncommon formats.
    The attribute ‘funding_total_usd’ used a string format with commas as thousands separators, so we remove the commas and convert the column to a numeric type.
  4. Handling missing values.
    There are a number of categorical variables in the dataset, so we replace the missing values with appropriate substitutes, as follows:
    - Replace missing value in ‘market’ with ‘other’
    - Replace missing value in ‘funding_total_usd’ with 0
    - Replace missing value in ‘status’ with majority status ‘operating’
    - Replace missing value in ‘country_code’ with ‘other’
    - Replace missing value in ‘region’ with ‘other’
    - Replace missing values in ‘founded_year’ with the mean of that column
  5. Adding the target variable that marks whether each company is successful or not; by definition, the successful ones are those that have gone public on the stock market or have been merged with or acquired by other companies.
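As a rough illustration, the steps above can be expressed with pandas as in the sketch below; the file name, the ‘ipo’/‘acquired’ status labels used to build the target, and the exact column names are assumptions based on the description above, not the project's actual code.

import pandas as pd

# Load the Crunchbase export (file name is an assumption)
df = pd.read_csv("investments_VC.csv")

# 1. Strip stray spaces from headers such as ' market ' and ' funding_total_usd '
df.columns = df.columns.str.strip()

# 2. Drop duplicate rows
df = df.drop_duplicates()

# 3. Remove the comma thousands separators and cast funding to numeric
df["funding_total_usd"] = pd.to_numeric(
    df["funding_total_usd"].astype(str).str.replace(",", "").str.strip(),
    errors="coerce",
)

# 4. Fill missing values with the substitutes listed above
df["market"] = df["market"].fillna("other")
df["funding_total_usd"] = df["funding_total_usd"].fillna(0)
df["status"] = df["status"].fillna("operating")
df["country_code"] = df["country_code"].fillna("other")
df["region"] = df["region"].fillna("other")
df["founded_year"] = df["founded_year"].fillna(df["founded_year"].mean())

# 5. Target: 1 if the company went public or was acquired, else 0
#    (the status labels 'ipo' and 'acquired' are assumptions)
df["success"] = df["status"].isin(["ipo", "acquired"]).astype(int)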

Feature engineering also brings advantages, such as converting object data types into numeric ones via label encoding. In this dataset we use the following (a code sketch follows the list):

  1. Label encoding for the categorical attributes ‘status’, ‘market’, ‘country_code’, and ‘region’.
    Label encoding is very useful when a category has numerous levels, such as the attribute ‘market’, which has more than 100 market categories.
  2. Replacing the attribute ‘founded_at’ with ‘founded_year’: since the year in ‘founded_at’ is inconsistent with the year in ‘founded_month’, we extract the year from ‘founded_month’ into the new attribute ‘founded_year’ and drop the ‘founded_at’ column.
  3. Feature scaling, one of the most important steps in many machine learning processes, since it normalises the dispersion of the data points, reduces the influence of outliers, and makes it easier for classification algorithms to draw decision boundaries between classes. In this research we use the Min-Max scaler to normalise the features.
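A minimal sketch of these three steps with scikit-learn, assuming the dataframe df from the cleansing sketch above and a ‘founded_month’ column stored as ‘YYYY-MM’ strings:

from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# 1. Label-encode the categorical columns
for col in ["status", "market", "country_code", "region"]:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

# 2. Extract the year from 'founded_month' and drop 'founded_at';
#    re-fill any gaps with the column mean, as in the cleansing step
df["founded_year"] = df["founded_month"].astype(str).str[:4].astype(float)
df["founded_year"] = df["founded_year"].fillna(df["founded_year"].mean())
df = df.drop(columns=["founded_at"])

# 3. Min-Max scale the numeric features into [0, 1]
#    (remaining gaps in round-level amounts filled with 0 -- an assumption)
numeric = df.select_dtypes(include="number").fillna(0)
X = MinMaxScaler().fit_transform(numeric.drop(columns=["success"]))
y = df["success"]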

Key Finding and Insights

Startups are known for innovating in the gap between a problem and its solution, and for being growth-seeking businesses; the nature of the business itself requires heavy funding, so it is common to look for capital from a variety of sources such as angel investors and venture capital firms.

In this research we would like to know whether investment, market category, founded year, and origin country are separable enough to classify the probability of a company being successful.

After the data wrangling, often called data preprocessing, we compute the correlation between attributes, with the following result:

Heatmap of correlation between features

The more bluish the colour, the more strongly correlated the variables. We can see at a glance that the attributes most correlated with the target are post_ipo_debt and funding_total.

The correlation of each attribute can also be seen in a bar plot of absolute Pearson correlation: the most correlated are status, founded year, funding rounds, and so on, yet only a few features have even a small correlation with the target. A sketch reproducing both plots is shown after the figure.

Bar plot of absolute Pearson correlation for each feature
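For reference, a short sketch of how both plots could be produced with seaborn and pandas, assuming the encoded dataframe df and the target column ‘success’ from the sketches above:

import matplotlib.pyplot as plt
import seaborn as sns

# Heatmap of pairwise correlations between the numeric attributes
corr = df.select_dtypes(include="number").corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="Blues")
plt.title("Correlation heatmap")
plt.show()

# Bar plot of absolute Pearson correlation of each feature with the target
corr["success"].drop("success").abs().sort_values(ascending=False).plot.bar(figsize=(12, 4))
plt.ylabel("|Pearson correlation| with target")
plt.show()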

Now, we would like to know how dispersed the target values are across the different attributes. The figure below shows that, using the two predictors funding rounds and founded year, the two target classes are heavily interleaved and close to one another, which makes it challenging for a machine learning model to distinguish between them.

Target Dispersion by using two different features (founded year and funding rounds)

Classification Machine Learning

Before running any machine learning algorithm, one thing to check is how skewed the target is. In this dataset the target is imbalanced: 92% of the 49,437 data points are labelled ‘0’ (non-successful) and only 8% are successful companies.

To tackle this imbalance, a stratified shuffle split is used, so that we neither need to upsample nor downsample the dataset; what is actually done is sampling the train and test sets with the same imbalanced class proportion, as sketched below.
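The sketch uses scikit-learn's StratifiedShuffleSplit and assumes the scaled feature matrix X and target y from the preprocessing sketches above; the 70/30 split follows the objective stated at the beginning, and random_state is arbitrary.

from sklearn.model_selection import StratifiedShuffleSplit

# One 70/30 split that keeps the 92/8 class proportion in both folds
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.30, random_state=42)
train_idx, test_idx = next(sss.split(X, y))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]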

Now we’re ready to look at five different classification algorithms and see how well they work on the dataset.

1. Logistic Regression

While Linear Regression is suited for estimating continuous values (e.g. estimating house price), it is not the best tool for predicting the class of an observed data point. In order to estimate the class of a data point, we need some sort of guidance on what would be the most probable class for that data point. For this, we use Logistic Regression.

Logistic regression fits a special s-shaped curve by taking the linear regression and transforming the numeric estimate into a probability with the sigmoid function σ.
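In standard notation, with z denoting the linear combination of the input features and θ the model coefficients:

σ(z) = 1 / (1 + e^(−z))

so the predicted probability of the positive class is P(y = 1 | x) = σ(θᵀx).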

Logistic Regression is a variation of Linear Regression, useful when the observed dependent variable, ‘y’ is categorical. It produces a formula that predicts the probability of the class label as a function of the independent variables. So, briefly, Logistic Regression passes the input through the logistic/sigmoid but then treats the result as a probability.
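A minimal sketch of fitting such a model on the stratified split above; the hyperparameters are scikit-learn defaults, not necessarily the ones used in this project.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

logreg = LogisticRegression(max_iter=1000)   # raise max_iter so the solver converges
logreg.fit(X_train, y_train)
print(classification_report(y_test, logreg.predict(X_test)))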

The result of this method is as follows:

As can be seen, it is only good at predicting non-successful companies and cannot predict any of the successful ones.

Measurement Model Result using Logistic Regression

2. Support Vector Machine

SVM works by mapping data to a high-dimensional feature space so that data points can be categorized, even when the data are not otherwise linearly separable. A separator between the categories is found, then the data is transformed in such a way that the separator can be drawn as a hyperplane. Following this, characteristics of new data can be used to predict the group to which a new record should belong. The result of this method is as follows:

Measurement Model Result using Support Vector Machine

The result looks the same as logistic regression: it is extremely bad at predicting successful companies. This is not surprising, since SVMs handle outlying data points well but struggle with data points that sit close to the margin of the decision boundary.
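For reference, a minimal sketch of how the SVM above could be fit on the same split; the RBF kernel and default hyperparameters are assumptions.

from sklearn.svm import SVC
from sklearn.metrics import classification_report

svm = SVC(kernel="rbf")          # kernel choice is an assumption
svm.fit(X_train, y_train)
print(classification_report(y_test, svm.predict(X_test)))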

3. Random Forest

Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees’ habit of overfitting to their training set. The hyperparameter that needs to be tuned is n_estimators, the number of trees. By using cross-validation we can minimise the out-of-bag error and obtain the best hyperparameter to use; for this dataset that is 100 trees, as the table and curve below show:

Out of Bag Error in Random Forest
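A sketch of how an out-of-bag error curve like the one above could be produced; the grid of tree counts is illustrative, not the exact one used in this project.

from sklearn.ensemble import RandomForestClassifier

oob_errors = {}
for n_trees in [20, 50, 100, 200, 400]:
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                random_state=42, n_jobs=-1)
    rf.fit(X_train, y_train)
    oob_errors[n_trees] = 1 - rf.oob_score_   # OOB error = 1 - OOB accuracy

print(oob_errors)   # pick the n_estimators with the lowest OOB error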

Using 100 trees, the result of this method is:

Measurement Model Result using Random Forest

Unlike the two earlier methods, Random Forest performs well. With 99% accuracy, 100% precision, 92% recall, a 96% F1 score, and 96% AUC, it can be considered a great model.

4. Gradient Boosting

Gradient boosting re-defines boosting as a numerical optimisation problem where the objective is to minimise the loss function of the model by adding weak learners using gradient descent. Intuitively, gradient boosting is a stage-wise additive model that generates learners during the learning process (i.e., trees are added one at a time, and existing trees in the model are not changed). The contribution of the weak learner to the ensemble is based on the gradient descent optimisation process. The calculated contribution of each tree is based on minimising the overall error of the strong learner.

There are numerous hyperparameters that need to be tuned: the number of trees, learning rate, max_depth, max_features, and subsample. Cross-validation is again used to find the best combination of these hyperparameters; a sketch of such a search is shown below.
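A hedged sketch using GridSearchCV; the parameter grid and scoring metric are illustrative, not the exact ones used in this project.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200, 400],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
    "max_features": ["sqrt", None],
    "subsample": [0.8, 1.0],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, scoring="f1", cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_estimator_)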

For this dataset, the best estimator obtained is:

With the result of this method as follows:

Measurement Model Result using Gradient Boosting

We can see that Gradient Boosting is the best of the methods so far. Although Random Forest is considered a great algorithm, Gradient Boosting outperforms it thanks to its ability to distinguish successful companies better, learning from weak learners built as dependent (sequential) trees, slightly differently from how Random Forest works.

5. Stacking Random Forest & Gradient Boosting (using Voting Classifier)

Stacking or Stacked Generalization is an ensemble machine learning algorithm. It uses a meta-learning algorithm to learn how to best combine the predictions from two or more base machine learning algorithms. After the base models are combined, a classifier such as a Voting Classifier or a Stacking Classifier is used to decide the final class.

In this research we use the two best models from earlier, Random Forest and Gradient Boosting, as base models, to see whether the combination would make an even better classifier. We then use a Voting Classifier to combine the two, as in the sketch below.
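A minimal sketch of that combination with scikit-learn's VotingClassifier; the soft-voting mode, the 100-tree random forest, and the reuse of the tuned gradient boosting model search.best_estimator_ from the sketch above are assumptions.

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.metrics import classification_report

rf_best = RandomForestClassifier(n_estimators=100, random_state=42)
gb_best = search.best_estimator_          # tuned gradient boosting model (assumption)

ensemble = VotingClassifier(
    estimators=[("rf", rf_best), ("gb", gb_best)],
    voting="soft",                        # soft voting averages class probabilities
)
ensemble.fit(X_train, y_train)
print(classification_report(y_test, ensemble.predict(X_test)))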

Measurement Model Result using Voting Classifier with Base model of Random Forest and Gradient Boosting

From the result above we can conclude that, even though the result can be considered great, stacking is not the best model; among these five models the best-performing classification algorithm is Gradient Boosting.

The reason this stacking is not the best model is simple: the more complex the stacked or ensemble model, the more prone it is to overfitting.

Result

From the five machine learning methods executed (Logistic Regression, Support Vector Machine, Random Forest, Gradient Boosting, and the stacking of Random Forest & Gradient Boosting using a Voting Classifier), here is a recap of each confusion matrix obtained:

Confusion matrices for the 5 different models

From the confusion matrices obtained, we can conclude that all models predict non-successful companies perfectly, but predicting successful companies (labelled ‘1’) can only be done by Random Forest, Gradient Boosting, and the stacking of those two models with a Voting Classifier.

Measurement Comparison of 5 Different Models used

Finally, by also comparing the measurement metrics above, we can rank the models from best to worst:

  1. Gradient Boosting
  2. Random Forest
  3. Stacking Gradient Boosting and Random Forest using Voting Classifier
  4. Support Vector Machine and Logistic Regression

The result above shows that Gradient Boosting performs best due to its nature of learning weak trees, each depending on the previous ones, so that it can cover almost all of the structure embodied in the data points. The figure below shows its ROC and Precision-Recall curves, which are almost perfect:

ROC curve and Precision-Recall trade-off curve for the Gradient Boosting model

Random Forest, even though it builds its trees from random sub-samples of the features, can also be considered a good model.

Support Vector Machines are not a good model for this dataset because they rely heavily on a decision boundary that is sensitive to data points near its margin. In this dataset the data points differ by only very small amounts, so the SVM fails to predict the successful companies from the given features. On the other hand, if the dataset contained extreme outliers, this method would be appropriate. The same goes for logistic regression: the sigmoid function cannot separate the data points in the given dataset, since they sit very close together and sometimes overlap.

Suggestion & Summary

There is one suggestion for taking this work further:

Add data that includes variables likely to correlate with the target, such as the founders' years of experience, the number of patents/copyrights or any other goodwill of the company, grit and resilience scores of the founders, customer count, and so on. With these we could see which of those predictors is most predictive of successful startups, and the prediction would give a more comprehensive view to help investment decisions.

Summary

The dataset covers startups, their corporate actions, and the investments they obtained, sourced from Crunchbase in CSV format, with worldwide startup companies recorded from 1902 until 2014. It consists of 54,294 entries (rows) with 38 attributes of object and float data types and quite messy formatting. Due to these limitations, several data-cleansing steps are required before conducting further analysis.

On the other hand, using investment records alone to predict successful startups is not sufficient (although it is possible), so we suggest adding data with variables that correlate with the target, such as the founders' years of experience, the number of patents/copyrights or any other goodwill of the company, grit scores of the founders, customer count, and so on; with these we could see which predictor correlates most strongly with success. After that we could try more complex models, such as a neural network (ResNet, for example), to see how they work on the dataset. We may revisit this project to complement this work.

Source Code Archive

Feel free to visit the source code of this work on my GitHub!

Thank you for taking your time reading this work, any ideas or advice are warmly welcome :)

References:

  1. IBM Machine Learning Foundation, Supervised Learning: Classification, 2020
  2. IBM Data Science For Professional: Machine Learning, 2020
