Excel files are not a data source. They are a dumping ground with a .xlsx extension. But they are everywhere, and ignoring them is not an option.
The pattern is always the same: someone sends a spreadsheet, the business needs it in Postgres, and you have an afternoon to sort it. Here is how I do it. If you are pulling from multiple files in S3 with inconsistent column names, normalize them first before loading — this guide assumes a single clean file.
Read the file
import pandas as pd
df = pd.read_excel("data.xlsx", sheet_name="Sales")
If you don’t know the sheet name, pd.ExcelFile("data.xlsx").sheet_names tells you.
Clean before you load
Excel sheets are messy by nature. Fix what will break before you touch Postgres:
# Normalize column names
df.columns = (
df.columns.str.strip()
.str.lower()
.str.replace(" ", "_", regex=False)
.str.replace(r"[^\w]", "", regex=True)
)
# Drop empty rows (common in Excel)
df.dropna(how="all", inplace=True)
# Fix types that came in wrong
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df["sale_date"] = pd.to_datetime(df["sale_date"], dayfirst=True, errors="coerce")
The errors="coerce" is doing real work. Excel dates are a mess, and amounts typed by humans will have commas, currency symbols, spaces. Coerce turns garbage into NaN so you see the problem instead of loading wrong data silently.
Low volume: use pandas
Under 100k rows, to_sql with SQLAlchemy is fine:
from sqlalchemy import create_engine
engine = create_engine("postgresql+psycopg2://user:pass@host/db")
df.to_sql("sales", engine, if_exists="replace", index=False)
if_exists="replace" drops and recreates the table. Use "append" if you’re adding to existing data.
High volume: use COPY
Over a few hundred thousand rows, to_sql gets slow because it inserts row by row. Use COPY instead:
import psycopg2
import io
conn = psycopg2.connect("postgresql://user:pass@host/db")
cur = conn.cursor()
buffer = io.StringIO()
df.to_csv(buffer, index=False, header=False)
buffer.seek(0)
cur.copy_expert(
"COPY sales (col1, col2, amount, sale_date) FROM STDIN WITH CSV",
buffer
)
conn.commit()
This streams the data directly. On a 1M row file I have seen 15x the throughput of to_sql. The table needs to exist first and column order matters.
One thing to check
After the first load, look at what Postgres inferred for your column types. to_sql will often create TEXT where you wanted DATE, or FLOAT8 where you wanted INTEGER. Define the schema explicitly if the types matter.
The data is rarely as clean as the person who sent the file thinks it is. Run df.dtypes and df.describe() before you load anything.
Full script for low volume
import pandas as pd
import psycopg2
import io
from sqlalchemy import create_engine
df = pd.read_excel(“data.xlsx”, sheet_name=“Sales”)
df.columns = (
df.columns.str.strip()
.str.lower()
.str.replace(" “, “_”, regex=False)
.str.replace(r”[^\w]", “”, regex=True)
)
df.dropna(how=“all”, inplace=True)
df[“amount”] = pd.to_numeric(df[“amount”], errors=“coerce”)
df[“sale_date”] = pd.to_datetime(df[“sale_date”], dayfirst=True, errors=“coerce”)
engine = create_engine(“postgresql+psycopg2://user:pass@host/db”)
df.to_sql(“sales”, engine, if_exists=“replace”, index=False)