
AI in Travel

Build ETL Pipelines for Data Science Workflows in About 30 Lines of Python

Image by Author | Ideogram

 

You know that feeling when you have data scattered across different formats and sources, and you need to make sense of it all? That’s exactly what we’re solving today. Let’s build an ETL pipeline that takes messy data and turns it into something actually useful.

In this article, I’ll walk you through creating a pipeline that processes e-commerce transactions. Nothing fancy, just practical code that gets the job done.

We’ll grab data from a CSV file (like you’d download from an e-commerce platform), clean it up, and store it in a proper database for analysis.

🔗 Link to the code on GitHub

 

What Is an Extract, Transform, Load (ETL) Pipeline?

 
Every ETL pipeline follows the same pattern. You grab data from somewhere (Extract), clean it up and make it better (Transform), then put it somewhere useful (Load).

 

ETL Pipeline | Image by Author | diagrams.net (draw.io)

 

The process begins with the extract phase, where data is retrieved from various source systems such as databases, APIs, files, or streaming platforms. During this phase, the pipeline identifies and pulls relevant data while maintaining connections to disparate systems that may operate on different schedules and formats.

Next, the transform phase is the core processing stage, where extracted data undergoes cleaning, validation, and restructuring. This step addresses data quality issues, applies business rules, performs calculations, and converts data into the required format and structure. Common transformations include data type conversions, field mapping, aggregations, and the removal of duplicates or invalid records.
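A couple of those common transformations, as a minimal pandas sketch (the column name here is illustrative, not tied to the pipeline we build below):

import pandas as pd

# Remove exact duplicate rows, then coerce a column to a numeric type
df = df.drop_duplicates()
df["price"] = df["price"].astype(float)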

Finally, the load phase transfers the transformed data into the target system. This step can occur through full loads, where entire datasets are replaced, or incremental loads, where only new or changed data is added. The loading strategy depends on factors such as data volume, system performance requirements, and business needs.
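As a rough sketch of the difference between the two strategies (assuming a pandas DataFrame df, an open SQLite connection conn, and illustrative table and column names):

# Full load: replace the entire table with the new dataset
df.to_sql("transactions", conn, if_exists="replace", index=False)

# Incremental load: append only rows newer than what is already stored
last_seen = pd.read_sql("SELECT MAX(transaction_date) AS d FROM transactions", conn)["d"][0]
new_rows = df[df["transaction_date"] > last_seen]
new_rows.to_sql("transactions", conn, if_exists="append", index=False)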

 

Step 1: Extract

 
The “extract” step is where we get our hands on data. In the real world, you might be downloading this CSV from your e-commerce platform’s reporting dashboard, pulling it from an FTP server, or getting it via API. Here, we’re reading from an available CSV file.

import pandas as pd

def extract_data_from_csv(csv_file_path):
    try:
        print(f"Extracting data from {csv_file_path}...")
        df = pd.read_csv(csv_file_path)
        print(f"Successfully extracted {len(df)} records")
        return df
    except FileNotFoundError:
        # Fall back to generating sample data; create_sample_csv_data()
        # is defined in the full code on GitHub
        print(f"Error: {csv_file_path} not found. Creating sample data...")
        csv_file = create_sample_csv_data()
        return pd.read_csv(csv_file)
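If your data lived behind a REST API instead, the extract step might look like this sketch (the endpoint URL is a placeholder, not a real service):

import requests
import pandas as pd

def extract_data_from_api(url="https://example.com/api/transactions"):
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return pd.DataFrame(response.json())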

 

Now that we have the raw data from its source (raw_transactions.csv), we need to transform it into something usable.

 

Step 2: Transform

 
This is where we make the data actually useful.

def transform_data(df):
    print("Transforming data...")
    
    df_clean = df.copy()
    
    # Remove records with missing emails
    initial_count = len(df_clean)
    df_clean = df_clean.dropna(subset=['customer_email'])
    removed_count = initial_count - len(df_clean)
    print(f"Removed {removed_count} records with missing emails")
    
    # Calculate derived fields
    df_clean['total_amount'] = df_clean['price'] * df_clean['quantity']
    
    # Extract date components
    df_clean['transaction_date'] = pd.to_datetime(df_clean['transaction_date'])
    df_clean['year'] = df_clean['transaction_date'].dt.year
    df_clean['month'] = df_clean['transaction_date'].dt.month
    df_clean['day_of_week'] = df_clean['transaction_date'].dt.day_name()
    
    # Create customer segments
    df_clean['customer_segment'] = pd.cut(df_clean['total_amount'], 
                                        bins=[0, 50, 200, float('inf')], 
                                        labels=['Low', 'Medium', 'High'])
    
    return df_clean

 

First, we’re dropping rows with missing emails because incomplete customer data isn’t helpful for most analyses.

Then we calculate total_amount by multiplying price and quantity. This seems obvious, but you’d be surprised how often derived fields like this are missing from raw data.

The date extraction is really handy. Instead of just having a timestamp, now we have separate year, month, and day-of-week columns. This makes it easy to analyze patterns like “do we sell more on weekends?”
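For example, once the transform has run, a quick groupby over the cleaned data answers exactly that question:

# Total revenue by day of week -- do weekends outsell weekdays?
df_clean.groupby("day_of_week")["total_amount"].sum().sort_values(ascending=False)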

The customer segmentation using pd.cut() can be particularly useful. It automatically buckets customers into spending categories. Now instead of just having transaction amounts, we have meaningful business segments.
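To make the bucketing concrete, here is a tiny standalone illustration of what pd.cut() does with those bins:

import pandas as pd

amounts = pd.Series([25, 120, 480])
segments = pd.cut(amounts, bins=[0, 50, 200, float("inf")],
                  labels=["Low", "Medium", "High"])
print(segments.tolist())  # ['Low', 'Medium', 'High']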

 

Step 3: Load

 
In a real project, you might be loading into a database, sending to an API, or pushing to cloud storage.

Here, we’re loading our clean data into a proper SQLite database.

import sqlite3

def load_data_to_sqlite(df, db_name="ecommerce_data.db", table_name="transactions"):
    print(f"Loading data to SQLite database '{db_name}'...")
    
    conn = sqlite3.connect(db_name)
    
    try:
        # Replace the table on each run so the pipeline is rerunnable
        df.to_sql(table_name, conn, if_exists="replace", index=False)
        
        # Verify the load by counting the rows just written
        cursor = conn.cursor()
        cursor.execute(f"SELECT COUNT(*) FROM {table_name}")
        record_count = cursor.fetchone()[0]
        
        print(f"Successfully loaded {record_count} records to '{table_name}' table")
        
        return f"Data successfully loaded to {db_name}"
        
    finally:
        conn.close()

 

Now analysts can run SQL queries, connect BI tools, and actually use this data for decision-making.
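For instance, a query like this one (using the database and table names from the load step) surfaces revenue by customer segment:

import sqlite3
import pandas as pd

conn = sqlite3.connect("ecommerce_data.db")
revenue_by_segment = pd.read_sql(
    "SELECT customer_segment, SUM(total_amount) AS revenue "
    "FROM transactions GROUP BY customer_segment ORDER BY revenue DESC",
    conn,
)
conn.close()
print(revenue_by_segment)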

SQLite works well for this because it’s lightweight, requires no setup, and creates a single file you can easily share or back up. The if_exists="replace" parameter means you can run this pipeline multiple times without worrying about duplicate data.

We’ve added verification steps so you know the load was successful. There’s nothing worse than thinking your data is safely stored only to find an empty table later.

 

Running the ETL Pipeline

 
This orchestrates the entire extract, transform, load workflow.

def run_etl_pipeline():
    print("Starting ETL Pipeline...")
    
    # Extract
    raw_data = extract_data_from_csv('raw_transactions.csv')
    
    # Transform  
    transformed_data = transform_data(raw_data)
    
    # Load
    load_result = load_data_to_sqlite(transformed_data)
    
    print("ETL Pipeline completed successfully!")
    
    return transformed_data

 

Notice how this ties everything together. Extract, transform, load, done. You can run this and immediately see your processed data.
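If you are following along in a single script, the usual entry point (not shown in the snippets above) ties it off:

if __name__ == "__main__":
    run_etl_pipeline()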

You can find the complete code on GitHub.

 

Wrapping Up

 
This pipeline takes raw transaction data and turns it into something an analyst or data scientist can actually work with. You’ve got clean records, calculated fields, and meaningful segments.

Each function does one thing well, and you can easily modify or extend any part without breaking the rest.

Now try running it yourself, then modify it for another use case. Happy coding!
 
 

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.






AI in Travel

MP Govt Signs Deal with Submer to Build Eco-Friendly AI Data Centres

 The Madhya Pradesh State Electronics Development Corporation (MPSEDC) has signed a Memorandum of Understanding (MoU) with Spain-based Submer Technologies. 

The agreement will pave the way for developing up to 1 gigawatt of next-generation, AI-ready data centre capacity in Madhya Pradesh, using Submer’s advanced cooling technologies. These technologies, like immersion cooling and direct-to-chip solutions, help save energy, reduce water usage, and lower the overall environmental impact of data centres.

“Following our visit to Submer’s facility, we are convinced of the potential for transformative collaboration. This partnership reflects our vision for sustainable technology, job creation, and positioning Madhya Pradesh as a preferred destination for global innovation,” said Mohan Yadav, Chief Minister of Madhya Pradesh.

The deal was finalised after a high-level visit to Submer’s innovation centre in Barcelona on July 17, 2025. 

As part of the agreement, the MP government will support the project by helping with land allocation, approvals, and investment incentives.

On the other hand, Submer will offer expertise in design, training, and technical support to set up the facilities. The company’s solutions have already led to 600 GWh of electricity savings and saved over 3 billion liters of water worldwide.

“This MoU marks the beginning of a robust partnership that will catalyze local employment, skill development, and innovation while building scalable infrastructure for the AI era,” said Sanjay Dubey, additional chief secretary, Department of Science and Technology, Government of Madhya Pradesh.

Submer’s leadership team is expected to visit MP later this month to explore potential sites and meet with local partners.




AI in Travel

DAZN Opens India’s First Sports-Tech GCC in Hyderabad, Plans to Hire 3,000 by 2026

DAZN, the world’s leading sports streaming platform, opened India’s first sports technology global capability centre (GCC) in Hyderabad on July 18, 2025. The new centre will serve as DAZN’s largest global hub and is expected to create around 3,000 jobs by the end of 2026.

DAZN plans to invest ₹500 crores over the next three years to expand operations in Hyderabad. The centre will focus on developing advanced sports technology, using AI and real-time analytics, while also working with academic institutions for training, research, and job creation in Telangana.

Speaking at the launch, Telangana IT and Industries minister Duddila Sridhar Babu said that the move reflects DAZN’s trust in Telangana’s skilled talent, strong infrastructure, and supportive government.

He also emphasised the rapid growth of Hyderabad as a top destination for GCCs. “Nearly one new GCC is being added every week in the city,” he said.

Babu also spoke about the state’s broader growth plans, including expanding development to tier-2 and tier-3 cities. “Telangana is investing over $15 billion in infrastructure projects like AI City, sports city, EV mobility, and the regional ring road,” he said. 

DAZN has been growing rapidly in India since launching its centre of excellence in Hyderabad in 2023. In just two years, it has expanded to over 1,500 employees, including engineers, developers, and data scientists.

“Hyderabad has been a perfect destination for DAZN to grow its technology and product operations, thanks to the state government’s progressive policies, world-class infrastructure, and highly skilled talent pool,” Sandeep Tiku, DAZN’s CTO, said.




AI in Travel

Delta Air Lines to Expand Use of AI in Pricing This Year

by Lacey Pfalz
Last updated: 8:50 AM ET, Sat July 19, 2025

Delta Air Lines is set to expand its use of artificial intelligence in airfare pricing after a pilot program in which AI set 3 percent of the airline’s fares. Privacy advocates and government officials, however, are concerned that the move could lead to price hikes and discriminatory pricing.

The airline was one of the first to consider using AI to set airfare, a plan announced back in 2023 by its president, Glen Hauenstein. Delta has partnered with Israeli company Fetcherr to power the AI pricing.

According to Fortune, Hauenstein told investors during the latest financial call that Delta will expand the use of AI from setting 3 percent of ticket prices to 20 percent by the end of the year, with a goal of doing away with static pricing altogether. 

“This is a full reengineering of how we price and how we will be pricing in the future,” he said on the call. Eventually, he told investors, “we will have a price that’s available on that flight, on that time, to you, the individual.”

Yet what does that mean, exactly?

While Delta maintains that their fares are public and based on trip-related factors, travel websites have a history of changing fares based on factors like web browser or ZIP code.

The expansion of AI into determining fares, some critics say, could end fair pricing because travelers will never see a universal rate, only the rate that the AI algorithm predicts a traveler will pay based on a variety of factors about that specific traveler.  

“They are trying to see into people’s heads to see how much they’re willing to pay,” Justin Kloczko, who analyzes so-called surveillance pricing for the California nonprofit Consumer Watchdog, told Fortune. “They are basically hacking our brains.”

There are laws protecting consumers from being charged different rates based on their sex or ethnicity, but Consumer Watchdog and others warn that pricing could become predatory for people of different classes. 

Lawmakers are also taking note. Senator Ruben Gallego (D-AZ) tweeted this message about it on X on July 15: “Delta’s CEO just got caught bragging about using AI to find your pain point — meaning they’ll squeeze you for every penny. This isn’t fair pricing or competitive pricing. It’s predatory pricing. I won’t let them get away with this.” 

The conversation about integrating AI into businesses and travel brands keeps returning to the question of ethical implementation, as worries about AI replacing people’s jobs and exposing protected information become top of mind.




