Data science uses computational techniques to extract insights from data. A library is a collection of pre-written code that provides functionality for solving specific programming problems; the Python libraries described below are commonly used in data science.
Prerequisites
Installation methods include the following (using scipy as an example):
- Distributions: Anaconda
- pip:
pip3 install scipy
- conda:
conda install -c conda-forge scipy
- Package manager:
sudo apt-get install python3-scipy
- Source
With pip or conda, you can control the package versions for a specific project to prevent conflicts. System package managers, like apt-get, install packages across the entire computer and often provide older versions. Compiling from source is much more difficult but is necessary for debugging and development.
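To see which version of a package the active environment actually provides, a small check like the one below can help. This is only a sketch: the helper name installed_version is hypothetical, and it relies on the standard-library importlib.metadata module (Python 3.8 or later).

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report what the current environment provides for a few packages.
for name in ("numpy", "scipy"):
    print(name, installed_version(name))
```

Running this inside different conda or pip environments should show the versions pinned for each project.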
For more advanced users who will need to install or upgrade regularly, Miniforge is a more suitable way to install the conda (and mamba, a faster conda alternative) package manager.
Installing and managing packages in Python
Beginning users
- Install Anaconda (it installs all packages you need and all other tools mentioned below).
- For writing and executing code, use notebooks in JupyterLab for exploratory and interactive computing, and Spyder or VS Code for writing scripts and packages.
Advanced users
Conda
- Install Miniforge.
- Keep the base conda environment minimal, and use one or more conda environments to install the package you need for the task or project you’re working on.
Alternative if you prefer pip/PyPI
- Install Python from python.org, Homebrew, or your Linux package manager.
- Use Poetry, a well-maintained tool that provides dependency resolution and environment management in a similar fashion to conda.
Python libraries
NumPy
NumPy is a low-level library, implemented largely in C, that underpins high-level mathematical functions. It provides a high-performance multidimensional array object and tools for working with these arrays.
The NumPy API is used extensively in pandas, SciPy, Matplotlib, scikit-learn, scikit-image and most other data science and scientific Python packages.
NumPy: the absolute basics for beginners — NumPy v1.26 Manual
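As a minimal illustration of the multidimensional array object, the sketch below builds a small 2-D array and applies element-wise and axis-wise operations. The variable names are arbitrary and chosen only for this example.

```python
import numpy as np

# Build a 2x3 array holding the integers 0..5.
grid = np.arange(6).reshape(2, 3)

# Arithmetic applies element by element, without an explicit loop.
doubled = grid * 2

# Reductions can run along a chosen axis; axis=0 sums down each column.
column_sums = grid.sum(axis=0)

print(grid.shape)
print(doubled)
print(column_sums)
```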
SciPy
SciPy is a collection of mathematical algorithms and convenience functions built on NumPy. It uses NumPy arrays as the basic data structure, and comes with modules for various commonly used tasks in scientific programming.
SciPy sub-packages:
- Special functions (scipy.special)
- Integration (scipy.integrate)
- Optimization (scipy.optimize)
- Interpolation (scipy.interpolate)
- Fourier Transforms (scipy.fft)
- Signal Processing (scipy.signal)
- Linear Algebra (scipy.linalg)
- Sparse Arrays (scipy.sparse)
- Sparse eigenvalue problems with ARPACK
- Compressed Sparse Graph Routines (scipy.sparse.csgraph)
- Spatial data structures and algorithms (scipy.spatial)
- Statistics (scipy.stats)
- Multidimensional image processing (scipy.ndimage)
- File IO (scipy.io)
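Two of the sub-packages listed above can be tried with a short sketch: scipy.integrate for numerical integration and scipy.optimize for minimization. The functions being integrated and minimized are made up for illustration.

```python
from scipy import integrate, optimize

# Integrate x^2 from 0 to 1; quad returns the estimate and an error bound.
area, abs_error = integrate.quad(lambda x: x ** 2, 0.0, 1.0)

# Find the minimum of (x - 2)^2, which lies at x = 2.
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2)

print(area)      # close to 1/3
print(result.x)  # close to 2.0
```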
Matplotlib
Matplotlib is a plotting library used to create graphs, charts, and other visualizations.
Quick start guide — Matplotlib 3.8.0 documentation
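A minimal plotting sketch follows. It assumes a non-interactive setting, so it selects the Agg backend and writes the figure to an in-memory PNG buffer instead of opening a window; the data points are invented for the example.

```python
import io

import matplotlib

matplotlib.use("Agg")  # render off-screen; no display is required
import matplotlib.pyplot as plt

xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

fig, ax = plt.subplots()
ax.plot(xs, ys, marker="o", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# Save the chart as PNG bytes; pass a file path instead to write it to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
png_bytes = buffer.getvalue()
```

In a notebook or an interactive session, the backend line can be dropped and plt.show() used instead of saving.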
scikit-learn
scikit-learn is a widely used, modern machine learning library. It supports a range of learning algorithms, both supervised and unsupervised.
Examples of algorithms available in scikit-learn include the following:
- K-means
- Decision trees
- Linear and logistic regression
- Clustering
scikit-learn is built largely on NumPy and SciPy.
scikit-learn: machine learning in Python — scikit-learn 1.3.2 documentation
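The sketch below shows one unsupervised and one supervised algorithm from the list above, K-means and logistic regression, on a tiny hand-made dataset. The data and parameter choices (n_init=10, random_state=0) are assumptions made for the example, not recommendations.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Six 2-D points forming two well-separated groups.
points = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
])

# Unsupervised: K-means groups the points without using any labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
cluster_labels = kmeans.labels_

# Supervised: logistic regression learns from the known labels.
targets = np.array([0, 0, 0, 1, 1, 1])
classifier = LogisticRegression().fit(points, targets)
predictions = classifier.predict([[0.1, 0.1], [5.0, 5.1]])

print(cluster_labels)
print(predictions)
```

Most scikit-learn estimators follow this same fit and predict pattern.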
pandas
pandas is a data analysis library for working with tabular data. For example, say you want to explore a dataset stored in a CSV file on your computer. pandas will load the data from that CSV into a DataFrame, which is essentially a table.
pandas is built on top of the NumPy package, meaning a lot of the structure of NumPy is used or replicated in pandas. Data in pandas is often used to feed statistical analysis in SciPy, plotting functions from Matplotlib, and machine learning algorithms in scikit-learn.
pandas - Python Data Analysis Library (pydata.org)
pandas can read data from sources such as Excel files, CSV files, and SQL databases. The library is organized around two core data structures: the Series, which is one-dimensional, and the DataFrame, which is two-dimensional.
pandas is effective in the following areas:
- Splitting of data
- Merging two or more datasets
- Data aggregation
- Selecting or subsetting data
- Data reshaping
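The operations listed above can be sketched on a small in-memory CSV. The column names and values are invented for illustration; with a real file, the path would be passed to pd.read_csv instead of a StringIO object.

```python
import io

import pandas as pd

csv_text = (
    "city,year,sales\n"
    "Oslo,2022,10\n"
    "Oslo,2023,15\n"
    "Lima,2022,7\n"
    "Lima,2023,9\n"
)

# Load the CSV into a DataFrame (two-dimensional).
sales = pd.read_csv(io.StringIO(csv_text))

# A single column is a Series (one-dimensional).
sales_column = sales["sales"]

# Selecting or subsetting: keep only the rows for 2023.
recent = sales[sales["year"] == 2023]

# Splitting and aggregation: total sales per city.
totals = sales.groupby("city")["sales"].sum()

# Merging: attach a region to each row from a second table.
regions = pd.DataFrame(
    {"city": ["Oslo", "Lima"], "region": ["Europe", "South America"]}
)
merged = sales.merge(regions, on="city")

# Reshaping: one row per city, one column per year.
wide = sales.pivot(index="city", columns="year", values="sales")

print(totals)
print(wide)
```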
Seaborn
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
An introduction to seaborn — seaborn 0.13.0 documentation (pydata.org)
PyTorch
Based on the Torch library, PyTorch is an open-source machine learning library used for tasks like computer vision and natural language processing.
PyTorch requires Python 3.8 or greater.
Tip: By default, you will have to use the command python3 to run Python. If you want to use just the command python, instead of python3, you can symlink python to the python3 binary.
Scrapy
Scrapy is a library for building web crawlers that extract data from websites.
statsmodels
statsmodels is a Python module that provides various statistical models and functions to explore, analyze, and visualize data. It is an open-source library that is built on top of NumPy, SciPy, and pandas libraries.
Theano
Theano is a Python library that allows you to define, optimize, and efficiently evaluate mathematical expressions involving multi-dimensional arrays. It is built on top of NumPy.
Note that Theano can run computations on a GPU, where it typically performs better than on a CPU.