Data transformation converts raw, messy source data into clean, analysis-ready datasets. Four core transformations are covered here: normalisation (scaling numeric values to a standard range), discretisation (converting continuous values into categorical bins), sampling (selecting representative subsets of large datasets), and compression (reducing data size without losing essential information). All four belong to the ETL process and are used in both data warehousing and machine-learning preprocessing.
Real-life analogy: The food preparation kitchen
Raw ingredients from farms (source systems) arrive dirty, in different sizes, in bulk. Before cooking (analysis): wash and peel (cleaning), cut to uniform size (normalisation), group into categories (discretisation), taste-test a sample (sampling), vacuum-seal to reduce volume (compression). Data transformation is the preparation kitchen that makes raw ingredients usable.
1. Data Normalisation (feature scaling)
Three scaling methods dominate. Min-Max scales values to [0, 1] but is sensitive to outliers, which compress the rest of the range. Z-score standardisation gives zero mean and unit variance and handles outliers better. Robust scaling uses the median and IQR and works best when many outliers exist.
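As a sanity check, the three formulas can be computed by hand on a small array (a minimal sketch with made-up values; sklearn's scalers, shown next, do the same arithmetic):

```python
import numpy as np

x = np.array([22, 35, 45, 55, 80], dtype=float)

# Min-Max: (x - min) / (max - min) -> values land in [0, 1]
x_mm = (x - x.min()) / (x.max() - x.min())

# Z-score: (x - mean) / std -> mean 0, std 1 (population std, like StandardScaler)
x_std = (x - x.mean()) / x.std()

# Robust: (x - median) / IQR -> a single outlier barely moves the scale
q1, q3 = np.percentile(x, [25, 75])
x_rob = (x - np.median(x)) / (q3 - q1)

print(x_mm.round(2))   # first value 0.0, last value 1.0
print(x_std.round(2))  # mean ~0, std ~1
print(x_rob.round(2))  # median maps to 0.0
```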
Normalisation methods with sklearn
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
data = pd.DataFrame({
'Age': [22, 35, 45, 55, 80, 28, 42],
'Salary': [25000, 80000, 150000, 250000, 500000, 45000, 120000],
'Score': [65, 72, 88, 91, 95, 70, 85]
})
# Min-Max: scales to [0, 1]
mm = MinMaxScaler()
std = StandardScaler() # Z-score: mean=0, std=1
rob = RobustScaler() # Median + IQR: robust to outliers
X_mm = pd.DataFrame(mm.fit_transform(data), columns=data.columns)
X_std = pd.DataFrame(std.fit_transform(data), columns=data.columns)
X_rob = pd.DataFrame(rob.fit_transform(data), columns=data.columns)
print("Min-Max range:", X_mm.min().round(2).to_dict(), "to", X_mm.max().round(2).to_dict())
print("Z-score mean:", X_std.mean().round(2).to_dict())
print("Z-score std: ", X_std.std().round(2).to_dict())
# IMPORTANT: Fit on TRAINING data only, transform both train and test
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(data, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit + transform
X_test_scaled = scaler.transform(X_test) # ONLY transform (no refit!)
DBMS normalisation vs data normalisation — different concepts!
DBMS normalisation = structuring tables to eliminate redundancy (1NF/2NF/3NF/BCNF). Data transformation normalisation = scaling numerical feature values to a standard range for machine learning. Same word, completely different meanings. GATE tests both in different contexts.
2. Discretisation — continuous to categorical
Equal-width, equal-frequency, and custom binning
import pandas as pd
import numpy as np
ages = pd.Series([18, 22, 25, 31, 35, 42, 47, 55, 62, 68, 72, 85])
# METHOD 1: Equal-width (uniform interval size)
bins_ew = pd.cut(ages, bins=4, labels=['Young','Adult','Middle','Senior'])
print("Equal-width:", pd.cut(ages, bins=4).value_counts().sort_index())
# (17.933, 34.75], (34.75, 51.5], (51.5, 68.25], (68.25, 85] — each 16.75 years wide
# METHOD 2: Equal-frequency (same count per bin — quantile-based)
bins_ef = pd.qcut(ages, q=4, labels=['Q1','Q2','Q3','Q4'])
print("Equal-freq:", pd.qcut(ages, q=4).value_counts().sort_index())
# Each bin has ~3 values regardless of range
# METHOD 3: Custom domain-based bins (most meaningful)
bins_custom = pd.cut(ages,
bins=[0, 19, 29, 39, 59, 100],
labels=['Teen','Young Adult','Adult','Middle Aged','Senior'])
print("Custom:", bins_custom.value_counts())
# Income discretisation for data warehouse reporting
salary = pd.Series([15000, 35000, 75000, 120000, 200000, 500000])
salary_cat = pd.cut(salary,
bins=[0, 25000, 50000, 100000, 250000, float('inf')],
labels=['Low','Lower-Middle','Middle','Upper-Middle','High'])
print("Salary categories:", salary_cat.tolist())
| Method | Bin sizes | When to use | Problem |
|---|---|---|---|
| Equal-width | Same range per bin | Uniform distribution data | Skewed data: most values in one bin |
| Equal-frequency | Same count per bin | Skewed data, ensure coverage | May group very different values together |
| Custom/Domain | Business-defined boundaries | When business rules define categories | Requires domain knowledge |
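The equal-width failure mode from the table is easy to reproduce: on a right-skewed column, a single extreme value stretches the bin edges and almost everything lands in the first bin (a minimal sketch with made-up incomes):

```python
import pandas as pd

# Right-skewed incomes: one extreme value stretches the bin edges
income = pd.Series([20, 22, 25, 28, 30, 31, 33, 35, 40, 1000])

ew = pd.cut(income, bins=4)   # equal-width: each bin spans (1000-20)/4 = 245
ef = pd.qcut(income, q=4)     # equal-frequency: 2-3 values per bin

print(ew.value_counts().sort_index())  # 9 values in the first bin, 1 in the last
print(ef.value_counts().sort_index())  # balanced counts across all four bins
```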
3. Sampling — working with representative subsets
Sampling strategies for large datasets
import pandas as pd
import numpy as np
np.random.seed(42)
n = 100_000
df = pd.DataFrame({
'CustomerID': range(n),
'Age': np.random.randint(18, 80, n),
'Region': np.random.choice(['North','South','East','West'], n, p=[0.4,0.3,0.2,0.1]),
'Purchased': np.random.choice([0, 1], n, p=[0.8, 0.2]) # 80/20 imbalance
})
# 1. Simple Random Sampling (SRS)
srs = df.sample(n=5000, random_state=42)
print(f"SRS: {len(srs)} rows, Region: {srs.Region.value_counts(normalize=True).round(2).to_dict()}")
# 2. Stratified Sampling (preserve class proportions)
stratified = df.groupby('Region', group_keys=False).apply(
lambda x: x.sample(frac=0.05, random_state=42)
)
print(f"Stratified: {len(stratified)} rows — region dist preserved")
# 3. Systematic Sampling — every k-th record
k = 20
systematic = df.iloc[::k]
print(f"Systematic (every {k}th): {len(systematic)} rows")
# 4. Over/Under-sampling for imbalanced datasets (fraud, churn)
from sklearn.utils import resample
minority = df[df.Purchased == 1] # 20,000 (20%)
majority = df[df.Purchased == 0] # 80,000 (80%)
# Oversample minority to balance classes
minority_up = resample(minority, n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(f"Balanced dataset: {balanced.Purchased.value_counts().to_dict()}")
# 5. Reservoir sampling for streaming data (unknown total size)
def reservoir_sample(stream, k):
    """Uniform random sample of size k from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace a stored item with probability k / (i + 1)
            j = np.random.randint(0, i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir
4. Data Compression
| Technique | How it works | Best for |
|---|---|---|
| Run-Length Encoding (RLE) | Replace runs: AAABBB → 3A3B | Sorted, repeated values (Region column sorted by region) |
| Dictionary Encoding | Map frequent values to short codes: {North:0,South:1} | Low-cardinality string columns (Status, Region, Category) |
| Delta Encoding | Store differences: [100,102,105] → [100,+2,+3] | Sequential numeric data (timestamps, IDs) |
| Parquet/ORC (columnar) | Store columns together + per-column compression | Data lakes, BigQuery, Redshift, Spark — 10-50x smaller |
| Gzip/Snappy/Zstd | General-purpose byte-stream compression | File storage, network transfer, S3 objects |
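The first three techniques in the table are simple enough to sketch in a few lines of pure Python (illustrative only — real columnar engines implement these at the byte level):

```python
from itertools import groupby

# Run-length encoding: collapse consecutive repeats into (value, count) pairs
def rle_encode(values):
    return [(v, sum(1 for _ in run)) for v, run in groupby(values)]

# Dictionary encoding: map each distinct value to a small integer code
def dict_encode(values):
    codes = {v: i for i, v in enumerate(dict.fromkeys(values))}
    return codes, [codes[v] for v in values]

# Delta encoding: store the first value, then successive differences
def delta_encode(nums):
    return [nums[0]] + [b - a for a, b in zip(nums, nums[1:])]

print(rle_encode("AAABBB"))                      # [('A', 3), ('B', 3)]
print(dict_encode(["North", "South", "North"]))  # ({'North': 0, 'South': 1}, [0, 1, 0])
print(delta_encode([100, 102, 105]))             # [100, 2, 3]
```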
Columnar storage compression comparison
import pandas as pd, io
# 100k rows of realistic data
df = pd.DataFrame({
'Region': (['North','South','East','West'] * 25000),
'Status': (['Active','Inactive','Pending'] * 33333 + ['Active']),
'Amount': pd.Series(range(100000)) * 0.99,
'OrderDate': pd.date_range('2020-01-01', periods=100000, freq='1h'),
})
# CSV (row-based)
csv_buf = io.BytesIO(); df.to_csv(csv_buf); csv_size = csv_buf.tell()
# Parquet + Snappy (columnar + light compression)
p_snap = io.BytesIO(); df.to_parquet(p_snap, compression='snappy'); snap_size = p_snap.tell()
# Parquet + Gzip (columnar + heavy compression)
p_gz = io.BytesIO(); df.to_parquet(p_gz, compression='gzip'); gz_size = p_gz.tell()
print(f"CSV (row store): {csv_size:>10,} bytes (baseline)")
print(f"Parquet + Snappy: {snap_size:>10,} bytes ({csv_size/snap_size:.1f}x smaller, fast)")
print(f"Parquet + Gzip: {gz_size:>10,} bytes ({csv_size/gz_size:.1f}x smaller, slow)")
# Dictionary encoding: Region column as category saves ~20x
str_mem = df['Region'].memory_usage(deep=True)
cat_mem = df['Region'].astype('category').memory_usage(deep=True)
print(f"\nRegion as string: {str_mem:,} bytes")
print(f"Region as category: {cat_mem:,} bytes ({str_mem//cat_mem}x smaller)")
Practice questions
- When to use Min-Max vs Z-score normalisation? (Answer: Min-Max: need fixed [0,1] range, no significant outliers (neural networks, image pixels). Z-score: data has outliers or algorithm assumes Gaussian distribution (linear regression, PCA, LDA). Robust scaler: many extreme outliers.)
- Equal-width vs equal-frequency discretisation — when does each fail? (Answer: Equal-width fails for skewed data — most values fall in one bin. Equal-frequency ensures uniform bin population but may group very different values if data has gaps.)
- Why is stratified sampling preferred for imbalanced datasets? (Answer: If 1% of data is fraud, SRS of 1000 records may have 0-20 fraud cases by chance. Stratified guarantees approximately 10 fraud cases (1%), preserving class distribution for reliable model training.)
- Run-length encoding gives best compression when: (Answer: Sorted, low-cardinality columns with many consecutive repeated values. Example: Region column sorted as AAAA...BBBB compresses dramatically. Unsorted or high-cardinality columns may expand.)
- What is the key advantage of columnar storage (Parquet/ORC) over row-based storage (CSV) for analytics? (Answer: Analytics queries read only needed columns — columnar stores skip irrelevant columns entirely. Also enables highly effective per-column compression (all values in a column have same type/range). Row stores must read full rows even if only 2 of 50 columns are needed.)
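The stratified-sampling answer above can be checked directly with sklearn's `train_test_split(stratify=...)` (a minimal sketch with a synthetic ~1% fraud label):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
y = (rng.random(100_000) < 0.01).astype(int)  # ~1% positive "fraud" labels
X = np.arange(len(y)).reshape(-1, 1)          # dummy feature column

# stratify=y forces the split to preserve the ~1% class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=1000, stratify=y, random_state=42
)
print(f"Overall fraud rate:  {y.mean():.3%}")
print(f"Test-set fraud rate: {y_te.mean():.3%}")  # matches overall, never 0
```

Without `stratify=y`, a plain random split of 1,000 rows can draw anywhere from 0 to ~20 fraud cases by chance.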
On LumiChats
LumiChats can write complete ETL pipeline code in Python (pandas, sklearn, PySpark) including all transformation types. Describe your data: 'CSV with age, skewed income, region, and 95% non-fraud labels' — LumiChats writes the full preprocessing pipeline.