Already using NumPy, Pandas, and Scikit-learn? Here are five more powerful Python data science tools that deserve a place in your toolkit.

Python’s rich ecosystem of data science tools is a big draw for users. The only downside of such a broad and deep collection is that sometimes the best tools can get overlooked.

Here’s a rundown of some of the best newer or lesser-known data science projects available for Python. Some, like Polars, are getting more attention than before but still deserve wider notice. Others, like ConnectorX, are hidden gems.

ConnectorX

Most data sits in a database somewhere, but computation typically happens outside of a database. Getting data to and from the database for actual work can be a slowdown.

ConnectorX loads data from databases into many common data-wrangling tools in Python, and it keeps things fast by minimizing the amount of work to be done.

Like Polars (which I’ll discuss soon), ConnectorX uses a Rust library at its core. This allows for optimizations like loading from a data source in parallel by partitioning it. Data in PostgreSQL, for instance, can be loaded this way by specifying a partition column.

Aside from PostgreSQL, ConnectorX also supports reading from MySQL/MariaDB, SQLite, Amazon Redshift, Microsoft SQL Server and Azure SQL, and Oracle. The results can be funneled into a Pandas DataFrame or a PyArrow table, or into Modin, Dask, or Polars by way of PyArrow.

DuckDB

Data science folks who use Python ought to be aware of SQLite, a small but powerful and speedy relational database packaged with Python. Since it runs as an in-process library, rather than a separate application, it’s lightweight and responsive.

DuckDB is a little like someone answered the question, “What if we made SQLite for OLAP?” Like other OLAP database engines, it uses a columnar datastore and is optimized for long-running analytical query workloads.
But it gives you all the things you expect from a conventional database, like ACID transactions. And there’s no separate software suite to configure; you can get it running in a Python environment with a single pip install command.

DuckDB can directly ingest data in CSV, JSON, or Parquet format. The resulting databases can also be partitioned into multiple physical files for efficiency, based on keys (e.g., by year and month). Querying works like any other SQL-powered relational database, but with additional built-in features like the ability to take random samples of data or use window functions.

DuckDB also has a small but useful collection of extensions, including full-text search, Excel import/export, direct connections to SQLite and PostgreSQL, Parquet file export, and support for many common geospatial data formats and types.

Optimus

One of the least enviable jobs you can be stuck with is cleaning and preparing data for use in a DataFrame-centric project. Optimus is an all-in-one toolset for loading, exploring, cleansing, and writing data back out to a variety of data sources.

Optimus can use Pandas, Dask, cuDF (and Dask + cuDF), Vaex, or Spark as its underlying data engine. Data can be loaded in from and saved back out to Arrow, Parquet, Excel, a variety of common database sources, or flat-file formats like CSV and JSON.

The data manipulation API resembles Pandas, but adds .rows() and .cols() accessors to make it easy to do things like sort a DataFrame, filter by column values, alter data according to criteria, or narrow the range of operations based on some criteria. Optimus also comes bundled with processors for handling common real-world data types like email addresses and URLs.

One possible issue with Optimus is that, although it’s still under active development, its last official release was in 2020. This means it may not be as up-to-date as other components in your stack.
Polars

If you spend much of your time working with DataFrames and you’re frustrated by the performance limits of Pandas, reach for Polars. This DataFrame library for Python offers a convenient syntax similar to Pandas.

Unlike Pandas, though, Polars uses a library written in Rust that takes maximum advantage of your hardware out of the box. You don’t need to use special syntax to take advantage of performance-enhancing features like parallel processing or SIMD; it’s all automatic. Even simple operations like reading from a CSV file are faster.

Polars provides eager and lazy execution modes, so queries can be executed immediately or deferred until needed. It also provides a streaming API for processing queries incrementally, although streaming isn’t available yet for many functions. And Rust developers can craft their own Polars extensions using pyo3.

Snakemake

Data science workflows are hard to set up, and even harder to set up in a consistent, predictable way. Snakemake was created to automate the process, setting up data analysis workflows in ways that ensure everyone gets the same results. Many existing data science projects rely on Snakemake. The more moving parts you have in your data science workflow, the more likely you’ll benefit from automating that workflow with Snakemake.

Snakemake workflows resemble GNU make workflows: you define the steps of the workflow with rules, which specify what they take in, what they put out, and what commands to execute to accomplish that. Workflow rules can be multi-threaded (assuming that gives them any benefit), and configuration data can be piped in from JSON or YAML files. You can also define functions in your workflows to transform data used in rules, and write the actions taken at each step to logs.

Snakemake jobs are designed to be portable. They can be deployed on any Kubernetes-managed environment, or in specific cloud environments like Google Cloud Life Sciences or Tibanna on AWS.
Workflows can be “frozen” to use a specific set of packages, and successfully executed workflows can have unit tests automatically generated and stored with them. And for long-term archiving, you can store the workflow as a tarball.
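The make-style rule model that Snakemake borrows can be illustrated in plain Python. This is a toy sketch, not Snakemake’s API or its workflow file syntax: the `Rule` class, the `build` function, and the file names are all hypothetical. Each rule names its inputs, its outputs, and an action, and a target is built by first building whatever its inputs depend on.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Rule:
    """A make-style rule: input files, output files, and an action to run."""
    name: str
    inputs: list[str]
    outputs: list[str]
    action: Callable[[Path], None]


def build(target: str, rules: list[Rule], workdir: Path, log: list[str]) -> None:
    """Create `target`, running the rules for its inputs first."""
    # This sketch only checks whether the file exists; real tools also
    # decide when an existing output is stale and must be rebuilt.
    if (workdir / target).exists():
        return
    rule = next((r for r in rules if target in r.outputs), None)
    if rule is None:
        raise FileNotFoundError(f"no rule produces {target!r}")
    for dependency in rule.inputs:
        build(dependency, rules, workdir, log)
    rule.action(workdir)
    log.append(rule.name)  # record which step ran, in order


def clean(workdir: Path) -> None:
    # Drop blank lines from the raw file.
    lines = (workdir / "raw.txt").read_text().splitlines()
    kept = [line for line in lines if line.strip()]
    (workdir / "clean.txt").write_text("\n".join(kept))


def count(workdir: Path) -> None:
    # Count the lines that survived cleaning.
    n = len((workdir / "clean.txt").read_text().splitlines())
    (workdir / "count.txt").write_text(str(n))


# Rules are listed out of order on purpose; dependencies decide the order.
rules = [
    Rule("count", ["clean.txt"], ["count.txt"], count),
    Rule("clean", ["raw.txt"], ["clean.txt"], clean),
]

with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    (workdir / "raw.txt").write_text("a\n\nb\nc\n")
    steps: list[str] = []
    build("count.txt", rules, workdir, steps)
    result = (workdir / "count.txt").read_text()

print(steps, result)
```

Asking for `count.txt` runs `clean` before `count` even though the rules are declared in the opposite order, which is the property that lets a workflow tool give everyone the same results from the same rule set.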