Normalize S3 Files to a Single Excel

Every data project has this problem: five teams, five files, five names for the same column. You do not need to transform the data. You just need to agree on what things are called, then combine them.

This is that script.

What you need

pip install boto3 pandas openpyxl

1. Define the column map

This is the only thing you need to maintain over time. Every canonical column name gets its known aliases, all the names you have seen for the same thing across different teams and systems:

COLUMN_MAP = {
    "id": "id",
    "product_id": "id",
    "prod_id": "id",
    "name": "name",
    "product_name": "name",
    "price": "price",
    "unit_price": "price",
    "amount": "price",
}

When a column is not in the map, it keeps its own name as-is. You do not need to list everything upfront, only the names that conflict between files. Add to the map as you discover new aliases.

2. Read files from S3

boto3 gives you the raw bytes. pandas handles the rest. The function is small because the logic is simple: look at the extension, pick the right reader.

import boto3
import pandas as pd
import io

def read_s3_file(s3, bucket, key):
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    if key.endswith(".csv"):
        return pd.read_csv(io.BytesIO(body))
    if key.endswith((".xlsx", ".xls")):
        return pd.read_excel(io.BytesIO(body))
    return None

If you have other formats, add them here. None is returned for anything unrecognised, and the main loop skips it.

3. Normalize column names

Strip whitespace and lowercase everything first, so the map lookup is consistent regardless of how each team formatted their headers. Then rename using the map. Columns not in the map are left alone by the lambda.

def normalize_columns(df):
    df.columns = df.columns.str.strip().str.lower()
    return df.rename(columns=lambda c: COLUMN_MAP.get(c, c))

4. List, load, and merge

List everything under the S3 prefix, read each file, normalize it, and concat. ignore_index=True resets the row index so you get a clean sequential index in the output. Columns that appear in some files but not others will be present with NaN where data was missing, which is correct behavior.

s3 = boto3.client("s3")
bucket = "your-bucket"
prefix = "data/"

paginator = s3.get_paginator("list_objects_v2")
keys = [
    obj["Key"]
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
    for obj in page.get("Contents", [])
]

frames = []
for key in keys:
    df = read_s3_file(s3, bucket, key)
    if df is not None:
        frames.append(normalize_columns(df))

combined = pd.concat(frames, ignore_index=True)
combined.to_excel("normalized.xlsx", index=False)

Note the paginator instead of a plain list_objects_v2 call. The plain call returns at most 1000 keys. If your prefix has more files, you will silently miss them. The paginator handles any number of files with no extra code.

The map is the hard part. The code is not.

Once you have the normalized file, loading it into Postgres is the next step.

Full script

import boto3
import pandas as pd
import io
COLUMN_MAP = {
“id”: “id”,
“product_id”: “id”,
“prod_id”: “id”,
“name”: “name”,
“product_name”: “name”,
“price”: “price”,
“unit_price”: “price”,
“amount”: “price”,
}
def read_s3_file(s3, bucket, key):
body = s3.get_object(Bucket=bucket, Key=key)[“Body”].read()
if key.endswith(".csv"):
return pd.read_csv(io.BytesIO(body))
if key.endswith((".xlsx", “.xls”)):
return pd.read_excel(io.BytesIO(body))
return None
def normalize_columns(df):
df.columns = df.columns.str.strip().str.lower()
return df.rename(columns=lambda c: COLUMN_MAP.get(c, c))
s3 = boto3.client(“s3”)
bucket = “your-bucket”
prefix = “data/”
paginator = s3.get_paginator(“list_objects_v2”)
keys = [
obj[“Key”]
for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
for obj in page.get(“Contents”, [])
]
frames = []
for key in keys:
df = read_s3_file(s3, bucket, key)
if df is not None:
frames.append(normalize_columns(df))
combined = pd.concat(frames, ignore_index=True)
combined.to_excel(“normalized.xlsx”, index=False)

1. Define the column map

2. Read files from S3

3. Normalize column names

4. List, load, and merge

Full script

Want this running in production?