In this post, I will be going through the installations and requirements needed to run the benchmark model presented for DengAI hosted on DrivenData.
What is DrivenData?
Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. DrivenData is an organization that hosts online data science competitions in which the data and the problem are posed by a socially minded organization. You get to put your analytic skills to the test on real-world problems with real-world impact. DengAI is a competition hosted by DrivenData in which competitors predict local epidemics of dengue. The blog post on this link presents a benchmark model provided by DrivenData against which competitors can compare their results.
Selecting the correct wheel file
The benchmark is built with Python and Jupyter Notebook, so the first thing you need on your machine is Python. I am running this on Windows, so I had to install some of the required Python packages from the wheel files on gohlke packages.
When selecting the required packages, be careful to pick the ones that are compatible with your machine. For example, when I run python in my cmd, the startup banner shows the interpreter version and whether the build is 32-bit or 64-bit.
Therefore, when I download numpy, I select "numpy-1.12.1+mkl-cp35-cp35m-win32.whl" from the list. Here "cp35" matches a CPython 3.5 interpreter and "win32" matches a 32-bit Windows build.
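The same check can be done from Python itself. The following is a minimal sketch, assuming a CPython install on Windows. The helper name expected_wheel_tags is hypothetical, and it only distinguishes the two Windows platform tags, win32 and win_amd64.

```python
import struct
import sys


def expected_wheel_tags():
    """Return the (python_tag, platform_tag) pair to look for in a wheel filename.

    Hypothetical helper: it only distinguishes 32-bit from 64-bit Windows builds.
    """
    # "cp35" means CPython 3.5, "cp36" means CPython 3.6, and so on.
    python_tag = "cp{}{}".format(sys.version_info.major, sys.version_info.minor)
    # The pointer size in bits tells us whether the interpreter is a 32-bit or 64-bit build.
    bits = struct.calcsize("P") * 8
    platform_tag = "win_amd64" if bits == 64 else "win32"
    return python_tag, platform_tag


if __name__ == "__main__":
    python_tag, platform_tag = expected_wheel_tags()
    print("Look for a wheel whose name contains:", python_tag, "and", platform_tag)
```

On a 32-bit Python 3.5 install this reports cp35 and win32, which are the tags in the numpy filename above.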
Installing the packages
The benchmark requires a handful of libraries, which can be installed easily with pip from the command line as follows. It's not a problem if some of them are already installed.
1. Upgrade pip
python -m pip install --upgrade pip
2. Install Numpy
- Download the correct wheel file from numpy on gohlke
- Open command line in the downloaded folder
- Run
pip install "numpy wheel filename"
(E.g.: pip install "numpy-1.12.1+mkl-cp35-cp35m-win32.whl")
3. Install pandas
pip install "pandas wheel filename"
4. Install SciPy
pip install "scipy wheel filename"
5. Install Statsmodels
pip install "statsmodels wheel filename"
6. Install Scikit-learn
pip install -U scikit-learn
7. Install Seaborn
pip install seaborn
8. Install Matplotlib
pip install matplotlib
9. Install Jupyter
pip install jupyter
10. Run Jupyter
jupyter notebook
When you run this command, the Jupyter server that will run our model starts. It prints a token, which you will be asked for when running the notebook.
Create and run a notebook
I am using JetBrains PyCharm to run this model, which makes it easy to manage Python scripts. In PyCharm, create a new project and, inside it, create a new Jupyter Notebook.
Once the notebook is created, you will be presented with the editor to manage the cells. Copy In[1] from the benchmark post and paste it into a new cell. Then click the run button to run the cell. You might be prompted to enter the token from the Jupyter server, as mentioned in step 10 above.
If the first cell is successfully run, it indicates that your library installations are successful. Keep adding the other cells and run the benchmark.
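If you would rather check the installations before pasting any benchmark code, a small script can report which libraries are missing. This is a minimal sketch, not part of the benchmark. The function name find_missing and the list BENCHMARK_MODULES are hypothetical, and the list simply mirrors the libraries installed in the steps above.

```python
import importlib.util

# Import names for the libraries installed above.
# Note that scikit-learn is imported under the name "sklearn".
BENCHMARK_MODULES = [
    "numpy", "pandas", "scipy", "statsmodels",
    "sklearn", "seaborn", "matplotlib",
]


def find_missing(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(BENCHMARK_MODULES)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All benchmark libraries were found.")
```

The check only looks the modules up without importing them, so it runs quickly and works in a notebook cell or as a plain script.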
Points to remember
- It is always better to run each cell individually, which is exactly what Jupyter notebooks are useful for.
- If filepaths are troublesome, try using absolute paths.
- If exceptions or errors persist after you have applied a fix, try restarting the Jupyter server.
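For the file path tip, the sketch below shows one way to turn a relative path into an absolute one. The helper name to_absolute and the file location are hypothetical. Replace the path with wherever you saved the competition data.

```python
import os


def to_absolute(path_text):
    """Expand a possibly relative path against the current working directory."""
    return os.path.abspath(path_text)


if __name__ == "__main__":
    # Hypothetical location of a downloaded data file.
    csv_path = to_absolute(os.path.join("data", "dengue_features_train.csv"))
    print(csv_path)
```

Printing the absolute path also shows which working directory the notebook is using, which is often the real cause of a "file not found" error.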