The Ultimate Guide to Essential Linux Commands for Data Scientists
1. What is Linux?
Linux is an open-source operating system kernel first released by Linus Torvalds in 1991. It is widely used on servers, mainframes, and embedded systems, and it serves as the foundation for the many Linux distributions, commonly referred to as “distros.” Linux is known for its stability, security, and flexibility.
2. Why Linux for Data Science and MLOps?
- Flexibility: Linux offers a wide range of tools and libraries essential for data science and MLOps tasks.
- Performance: Linux is known for its efficient resource management, making it ideal for running complex computational tasks.
- Compatibility: Many popular data science and machine learning frameworks are developed and optimized for Linux.
- Community Support: Linux has a large and active community that provides extensive documentation, tutorials, and support.
3. Getting Started with Linux
3.1. Choosing a Linux Distribution
There are numerous Linux distributions available, each with its own strengths and focus. Some popular distributions for data science and MLOps include:
- Ubuntu: Known for its ease of use and extensive software repositories.
- CentOS/RHEL: Favored for stability and long-term support.
- Fedora: Provides bleeding-edge software for developers.
- Debian: Prioritizes stability and security.
4. Basic Linux Commands for Data Scientists and MLOps Engineers
4.1. File System Navigation
- pwd: Print the current working directory.
- Example:
pwd
- ls: List directory contents.
  - -l: Detailed listing.
  - -a: Include hidden files.
- Example:
ls -la
- cd: Change directory.
  - cd <directory>: Move to a specific directory.
  - cd ..: Move up one directory.
  - cd ~: Move to the home directory.
- Example:
cd /home/user
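The navigation commands above can be combined into one short session. The following is a minimal sketch rather than a recipe: it works inside a throwaway directory created with mktemp, and the directory names (project, data) are hypothetical.

```shell
#!/bin/sh
set -eu

# Create a throwaway workspace so the sketch does not touch real files.
workdir=$(mktemp -d)
mkdir -p "$workdir/project/data"

cd "$workdir/project/data"    # move to a specific directory
start_dir=$(pwd)              # pwd prints the current working directory
echo "Now in: $start_dir"

cd ..                         # move up one directory
parent_dir=$(pwd)
echo "Moved up to: $parent_dir"

ls -la                        # detailed listing, including hidden files

# Return to the home directory (fall back to / if it is unavailable).
cd ~ 2>/dev/null || cd /
rm -r "$workdir"              # clean up the workspace
```

Capturing the output of pwd in a variable, as done here, is a common way to remember a location before moving elsewhere in a script.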
4.2. File and Directory Operations
- mkdir: Create a new directory.
  - mkdir <directory_name>: Create a directory with a specific name.
- Example:
mkdir new_directory
- rmdir: Remove a directory.
  - rmdir <directory_name>: Remove an empty directory.
- Example:
rmdir old_directory
- rm: Remove files or directories.
  - rm <file>: Remove a file.
  - rm -r <directory>: Remove a directory and its contents recursively.
- Example:
rm file.txt
- cp: Copy files or directories.
  - cp <source> <destination>: Copy a file.
  - cp -r <source_directory> <destination_directory>: Copy a directory recursively.
- Example:
cp file1.txt /path/to/destination
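To see how these commands behave together, here is a small sketch that runs entirely inside a temporary directory. The file and directory names (raw_data, backup, train.csv) are made up for illustration. It also shows the difference between rmdir, which only removes empty directories, and rm -r.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
cd "$workdir"

mkdir raw_data backup              # create two directories
echo "id,label" > raw_data/train.csv

cp raw_data/train.csv backup/      # copy one file into a directory
cp -r raw_data raw_data_copy       # copy a whole directory recursively

rm backup/train.csv                # remove a single file
rmdir backup                       # works because backup is now empty

# rmdir refuses non-empty directories; rm -r is needed for those.
if rmdir raw_data_copy 2>/dev/null; then
    rmdir_result="removed"
else
    rmdir_result="refused"
    rm -r raw_data_copy
fi
echo "rmdir on a non-empty directory: $rmdir_result"

remaining=$(ls)
echo "Remaining entries: $remaining"

cd /
rm -r "$workdir"
```

Keep in mind that rm -r deletes without asking and there is no undo. Adding -i (rm -ri) makes rm prompt before each removal.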
4.3. Working with Text Files
- cat: Display file content.
- Example:
cat file.txt
- less/more: View file content one page at a time.
- Example:
less file.txt
- head/tail: Display the beginning/end of a file.
- Example:
head file.txt
- nano/vim: Text editors for creating and editing files.
- Example:
nano file.txt
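When working with datasets, you rarely want to print a whole file. A quick sketch of peeking at a CSV with head and tail follows. The sample file and its contents are invented for this example, and the -n option (number of lines) is an addition not covered above.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
sample="$workdir/sample.csv"

# Build a small CSV: one header line plus five data rows.
printf 'id,score\n' > "$sample"
for i in 1 2 3 4 5; do
    printf '%s,%s\n' "$i" "$((i * 10))" >> "$sample"
done

cat "$sample"                      # print the whole file

header=$(head -n 1 "$sample")      # first line only
last_row=$(tail -n 1 "$sample")    # last line only

echo "Header:   $header"
echo "Last row: $last_row"

echo "Preview (header plus two rows):"
head -n 3 "$sample"

rm -r "$workdir"
```

Without -n, head and tail both default to showing 10 lines.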
4.4. Managing Processes
- ps: Display information about active processes.
  - ps aux: Display all processes.
- Example:
ps aux
- kill: Terminate a process.
  - kill <process_id>: Terminate a process by ID.
- Example:
kill 1234
- top/htop: Interactive process viewer for monitoring system resources.
- Example:
top
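A typical use of ps and kill is stopping a job that was started in the background. The sketch below uses sleep as a stand-in for a long-running training script. The ps -p and kill -0 forms are additions beyond the options listed above, so treat this as one possible approach.

```shell
#!/bin/sh
set -eu

# Start a stand-in for a long-running job in the background.
sleep 300 &
job_pid=$!                      # $! holds the PID of the last background job
echo "Started job with PID $job_pid"

# Show only that process instead of the full "ps aux" listing.
ps -p "$job_pid" || echo "ps is not available on this system"

# kill sends SIGTERM by default, which asks the process to exit.
kill "$job_pid"
wait "$job_pid" 2>/dev/null || true   # reap the job; a killed job returns non-zero

# kill -0 sends no signal; it only checks whether the PID still exists.
if kill -0 "$job_pid" 2>/dev/null; then
    job_state="running"
else
    job_state="terminated"
fi
echo "Job is now: $job_state"
```

If a process ignores the default signal, kill -9 <process_id> forces termination, but the process gets no chance to clean up, so it is best kept as a last resort.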
4.5. Package Management
- apt/yum/dnf: System package managers for installing, updating, and removing software packages.
  - apt install <package_name>: Install a package (Ubuntu/Debian). This usually requires sudo.
- Example:
sudo apt install python3
- pip: Package installer for Python packages.
  - pip install <package_name>: Install a Python package.
- Example:
pip install numpy
- conda: Package and environment manager, commonly used for Python packages.
  - conda install <package_name>: Install a package into the active conda environment.
- Example:
conda install pandas
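Which system package manager you use depends on the distribution. As a rough sketch, a script can check which one is on the PATH before deciding what to run. The function name detect_pkg_manager is hypothetical, and the script only prints a suggested command instead of installing anything.

```shell
#!/bin/sh
set -eu

# Print the first package manager found on PATH, or "none".
detect_pkg_manager() {
    for candidate in apt dnf yum; do
        if command -v "$candidate" >/dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    echo "none"
}

pkg_manager=$(detect_pkg_manager)
echo "System package manager: $pkg_manager"

# Print (rather than run) a matching install command.
case "$pkg_manager" in
    apt)     echo "sudo apt install python3-pip" ;;
    dnf|yum) echo "sudo $pkg_manager install python3-pip" ;;
    *)       echo "No supported package manager found" ;;
esac
```

Printing the command first is a deliberate choice here: system-wide installs need root privileges, so it is safer to review them before running.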
4.6. File Transfer
- scp: Securely copy files between hosts.
  - scp <source> <destination>: Copy a file from the source to the destination.
- Example:
scp file.txt user@remote_host:/path/to/destination
- rsync: Synchronize files and directories between hosts.
  - rsync -av <source> <destination>: Synchronize files and directories recursively.
- Example:
rsync -av source_directory/ user@remote_host:/path/to/destination
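One detail of rsync that often surprises people is the trailing slash on the source path. The sketch below demonstrates it with local directories standing in for a remote host, since the same source and destination syntax applies either way. The file name model.bin and the destination names are invented, and the script skips the copy if rsync is not installed.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
mkdir -p "$workdir/source_directory" "$workdir/with_slash" "$workdir/without_slash"
echo "weights" > "$workdir/source_directory/model.bin"

if command -v rsync >/dev/null 2>&1; then
    # Trailing slash: copy the contents of source_directory.
    rsync -a "$workdir/source_directory/" "$workdir/with_slash"
    # No trailing slash: copy the directory itself into the destination.
    rsync -a "$workdir/source_directory" "$workdir/without_slash"
    rsync_status="ran"
else
    rsync_status="missing"
    echo "rsync is not installed; skipping the copy"
fi

# List the files each destination now contains.
layout=$(cd "$workdir" && find with_slash without_slash -type f | sort)
echo "$layout"

rm -r "$workdir"
```

With the trailing slash, model.bin lands directly in the destination. Without it, the file ends up one level deeper, inside a source_directory folder.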
4.7. Memory, Process, and Disk Monitoring
- free: Display amount of free and used memory in the system.
- Example:
free -h
- top/htop: Interactive process viewer for monitoring system resources.
- Example:
htop
- pmap: Report memory map of a process.
- Example:
pmap <pid>
- df: Display free and used disk space on mounted file systems.
- Example:
df -h
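These monitoring commands are also useful inside scripts, for instance to check for free disk space before downloading a dataset. Here is a minimal sketch. The 1 MiB threshold is an arbitrary placeholder, and the -P and -k options to df are additions that make the output easier to parse.

```shell
#!/bin/sh
set -eu

# Space available on the file system holding the current directory, in KiB.
# -P forces one line per file system; -k reports sizes in 1024-byte blocks.
available_kib=$(df -Pk . | awk 'NR == 2 { print $4 }')
echo "Available here: ${available_kib} KiB"

required_kib=1024   # hypothetical requirement: 1 MiB
if [ "$available_kib" -ge "$required_kib" ]; then
    echo "Enough space for the download"
else
    echo "Not enough space" >&2
fi

# free is Linux-specific, so only call it when it is present.
if command -v free >/dev/null 2>&1; then
    free -h
fi
```

The human-readable -h output is better for reading at the terminal, while plain numbers like those from -k are easier to compare in a script.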
4.8. Miscellaneous Commands
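- nohup: Run a command so that it keeps running after you log out or close the terminal. When output would otherwise go to the terminal, it is written to nohup.out instead.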
- Example:
nohup python script.py &
- find: Search for files in a directory hierarchy.
- Example:
find /path/to/search -name "*.txt"
- du: Display disk usage of files and directories.
- Example:
du -sh /path/to/directory
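The three commands in this section can be tried safely in a scratch directory. The following sketch uses invented file names and a trivial background job in place of a real script, so read it as an illustration of the pattern.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)
mkdir -p "$workdir/logs/archive"
echo "a" > "$workdir/logs/run1.txt"
echo "b" > "$workdir/logs/archive/run0.txt"
echo "c" > "$workdir/logs/notes.md"

# find descends into subdirectories, so both .txt files are matched.
txt_count=$(find "$workdir" -name "*.txt" | wc -l | tr -d ' ')
echo "Text files found: $txt_count"

# du -sh prints one human-readable total for the whole tree.
du -sh "$workdir"

# Redirecting output keeps the log in a known place instead of nohup.out.
nohup sh -c 'echo "job finished"' > "$workdir/job.log" 2>&1 < /dev/null &
job_pid=$!
wait "$job_pid"                 # only for the demo; real jobs are left running
job_output=$(cat "$workdir/job.log")
echo "Background job logged: $job_output"

rm -r "$workdir"
```

For a real training run, you would replace the sh -c command with something like python script.py and drop the wait, then check the log file later.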
Here, I’ve covered essential Linux commands for data scientists, along with some examples. By mastering these commands, you’ll be equipped to navigate the Linux environment efficiently and perform common tasks related to data science and MLOps. In practice, you will often need to combine these commands with additional options and arguments to get your work done.
If you would like to see what production-ready code can look like, check out the code repository below:
https://github.com/kshitijkutumbe/usa-visa-approval-prediction