Efficient way to Query/Read CSV File using SQL

As you are already aware that, Data is the core part of any business solution these days, and generating insights from that data is the final outcome of every business solution. There are multiple storages available to store raw/processed data and the most common one is storing data into CSV file.

Programming languages these days have built-in packages which help to read/write CSV files. Other than programming languages, databases also have the support to import CSV data into tables and then query that data using SQL.

So when you want to perform some SQL queries on the CSV file to get some insights into the data, it involves you to first import the CSV data into the database table and then you can perform SQL queries over that data. So this process seems a bit long.

This made my mind search for something simpler and quicker way to perform SQL on CSV files. And the solution that I found really impressed me and made this process a lot simple and quicker.

The Solution is DuckDB:

DuckDB is an in-process SQL OLAP database management system

Some main core features of DuckDB are:

  • In-process, serverless
  • APIs for Python/R/Java/C/C++, and others
  • Extensive SQL support
  • Direct Parquet & CSV querying
  • Parallel query processing
  • Free & Open Source

I think it’s best to dive into code and see how DuckDB can help us with a simple and quick way to perform SQL on CSV files without any data import process.

I’ll use Python in the coding example. To install DuckDB for python:

pip install duckdb

Let’s take a look at the sample CSV file:

And we want to get the data for the users who belong to the Development department.

import duckdb

con = duckdb.connect(database=":memory:", read_only=False)

con.execute("SELECT * FROM read_csv_auto('users.csv', HEADER=TRUE, ALL_VARCHAR=1) WHERE department = 'Development' ")

data = con.fetchdf()

let’s understand this line by line, the first line is the import statement

import duckdb

and the second line is to create a DuckDB connection but as you can see this is not using a database instead it’s using :memory: which means data will be processed in the memory. you can use a database as well just like as follow:

con = duckdb.connect(database="my_database.db", read_only=False)

The third line execute a query on the CSV file which you passed to read_csv_auto fucntion

con.execute("SELECT * FROM read_csv_auto('users.csv', HEADER=TRUE, ALL_VARCHAR=1) WHERE department = 'Development' ")

and the last line fetches the query data and convert it into a pandas dataframe which can be used for other purposes.

data = con.fetchdf()

From above example, you can now have a better understanding why I mentioned DuckDB a simple and quick way to perform SQL on CSV file by directly querying it.

If you guys think this is the optimized solution then go for the clap or otherwise please share your suggestions that how we can further improve this thing.

If you have any questions, please feel free to leave a comment below.



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store