Building Modern Data Lakehouses on Google Cloud with Apache Iceberg and Apache Spark

Sponsored Content

 

 

 

The landscape of big data analytics is constantly evolving, with organizations seeking more flexible, scalable, and cost-effective ways to manage and analyze vast amounts of data. This pursuit has led to the rise of the data lakehouse paradigm, which combines the low-cost storage and flexibility of data lakes with the data management capabilities and transactional consistency of data warehouses. At the heart of this revolution are open table formats like Apache Iceberg and powerful processing engines like Apache Spark, all empowered by the robust infrastructure of Google Cloud.

 

The Rise of Apache Iceberg: A Game-Changer for Data Lakes

 

For years, data lakes, typically built on cloud object storage like Google Cloud Storage (GCS), offered unparalleled scalability and cost efficiency. However, they often lacked the crucial features found in traditional data warehouses, such as transactional consistency, schema evolution, and performance optimizations for analytical queries. This is where Apache Iceberg shines.

Apache Iceberg is an open table format designed to address these limitations. It sits on top of your data files (like Parquet, ORC, or Avro) in cloud storage, providing a layer of metadata that transforms a collection of files into a high-performance, SQL-like table. Here’s what makes Iceberg so powerful:

  • ACID Compliance: Iceberg brings Atomicity, Consistency, Isolation, and Durability (ACID) properties to your data lake. This means that data writes are transactional, ensuring data integrity even with concurrent operations. No more partial writes or inconsistent reads.
  • Schema Evolution: One of the biggest pain points in traditional data lakes is managing schema changes. Iceberg handles schema evolution seamlessly, allowing you to add, drop, rename, or reorder columns without rewriting the underlying data. This is critical for agile data development.
  • Hidden Partitioning: Iceberg intelligently manages partitioning, abstracting away the physical layout of your data. Users no longer need to know the partitioning scheme to write efficient queries, and you can evolve your partitioning strategy over time without data migrations.
  • Time Travel and Rollback: Iceberg maintains a complete history of table snapshots. This enables “time travel” queries, allowing you to query data as it existed at any point in the past. It also provides rollback capabilities, letting you revert a table to a previous good state, invaluable for debugging and data recovery.
  • Performance Optimizations: Iceberg’s rich metadata allows query engines to prune irrelevant data files and partitions efficiently, significantly accelerating query execution. It avoids costly file listing operations, directly jumping to the relevant data based on its metadata.

By providing these data warehouse-like features on top of a data lake, Apache Iceberg enables the creation of a true “data lakehouse,” offering the best of both worlds: the flexibility and cost-effectiveness of cloud storage combined with the reliability and performance of structured tables.
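
To make features like schema evolution, time travel, and rollback concrete, here is a minimal PySpark sketch. It assumes a SparkSession (spark) already configured with an Iceberg catalog named lakehouse and the Iceberg SQL extensions enabled; the table name and snapshot ID are purely illustrative.


Python

# Assumes `spark` is a SparkSession with an Iceberg catalog named `lakehouse`
# and the Iceberg SQL extensions enabled.

# Schema evolution: add a column without rewriting existing data files
spark.sql("ALTER TABLE lakehouse.sales.orders ADD COLUMN discount DOUBLE")

# Time travel: query the table as of a snapshot ID or a timestamp
spark.sql("SELECT * FROM lakehouse.sales.orders VERSION AS OF 4348197208309830400").show()
spark.sql("SELECT * FROM lakehouse.sales.orders TIMESTAMP AS OF '2024-01-01 00:00:00'").show()

# Rollback: restore the table to a known-good snapshot
spark.sql("CALL lakehouse.system.rollback_to_snapshot('sales.orders', 4348197208309830400)")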

Google Cloud’s BigLake tables for Apache Iceberg in BigQuery offer a fully managed table experience similar to standard BigQuery tables, but all of the data is stored in customer-owned storage buckets. Supported features include:

  • Table mutations via GoogleSQL data manipulation language (DML)
  • Unified batch and high throughput streaming using the Storage Write API through BigLake connectors such as Spark
  • Iceberg V2 snapshot export and automatic refresh on each table mutation
  • Schema evolution to update column metadata
  • Automatic storage optimization
  • Time travel for historical data access
  • Column-level security and data masking

Here’s an example of how to create an empty BigLake Iceberg table using GoogleSQL:


SQL

CREATE TABLE PROJECT_ID.DATASET_ID.my_iceberg_table (
  name STRING,
  id INT64
)
WITH CONNECTION PROJECT_ID.REGION.CONNECTION_ID
OPTIONS (
  file_format = "PARQUET",
  table_format = "ICEBERG",
  storage_uri = 'gs://BUCKET/PATH');

 

You can then load data into the table, using LOAD DATA INTO to import from files or INSERT INTO to copy rows from another table.


SQL

# Load from file
LOAD DATA INTO PROJECT_ID.DATASET_ID.my_iceberg_table
FROM FILES (
uris=['gs://bucket/path/to/data'],
format="PARQUET");

# Load from table
INSERT INTO PROJECT_ID.DATASET_ID.my_iceberg_table
SELECT name, id
FROM PROJECT_ID.DATASET_ID.source_table

 

In addition to the fully managed offering, Apache Iceberg is also supported as a read-only external table in BigQuery. Use this to point to an existing path that already contains data files.


SQL

CREATE OR REPLACE EXTERNAL TABLE PROJECT_ID.DATASET_ID.my_external_iceberg_table
WITH CONNECTION PROJECT_ID.REGION.CONNECTION_ID
OPTIONS (
  format="ICEBERG",
  uris =
    ['gs://BUCKET/PATH/TO/DATA'],
  require_partition_filter = FALSE);

 

 

Apache Spark: The Engine for Data Lakehouse Analytics

 

While Apache Iceberg provides the structure and management for your data lakehouse, Apache Spark is the processing engine that brings it to life. Spark is a powerful open-source, distributed processing system renowned for its speed, versatility, and ability to handle diverse big data workloads. Spark’s in-memory processing, robust ecosystem of tools including ML and SQL-based processing, and deep Iceberg support make it an excellent choice.

Apache Spark is deeply integrated into the Google Cloud ecosystem. Benefits of using Apache Spark on Google Cloud include:

  • Access to a true serverless Spark experience without cluster management using Google Cloud Serverless for Apache Spark.
  • Fully managed Spark experience with flexible cluster configuration and management via Dataproc.
  • Accelerate Spark jobs using the new Lightning Engine for Apache Spark preview feature.
  • Configure your runtime with GPUs and drivers preinstalled.
  • Run AI/ML jobs using a robust set of libraries available by default in Spark runtimes, including XGBoost, PyTorch and Transformers.
  • Write PySpark code directly inside BigQuery Studio via Colab Enterprise notebooks along with Gemini-powered PySpark code generation.
  • Easily connect to your data in BigQuery native tables, BigLake Iceberg tables, external tables, and GCS (see the sketch after this list).
  • Integration with Vertex AI for end-to-end MLOps.
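
As a minimal illustration of that connectivity, here is a PySpark sketch using the spark-bigquery connector and GCS. It assumes the connector is available in the runtime (as it is on Dataproc and Serverless for Apache Spark); the project, dataset, table, and bucket names are placeholders.


Python

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-connectivity").getOrCreate()

# Read a BigQuery native or BigLake Iceberg table via the spark-bigquery connector
bq_df = spark.read.format("bigquery") \
    .option("table", "PROJECT_ID.DATASET_ID.my_iceberg_table") \
    .load()

# Read raw Parquet files directly from a Cloud Storage bucket
gcs_df = spark.read.parquet("gs://BUCKET/PATH/TO/DATA")

bq_df.printSchema()
gcs_df.printSchema()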

 

Iceberg + Spark: Better Together

 

Together, Iceberg and Spark form a potent combination for building performant and reliable data lakehouses. Spark can leverage Iceberg’s metadata to optimize query plans, perform efficient data pruning, and ensure transactional consistency across your data lake.

Your Iceberg tables and BigQuery native tables are accessible via BigLake metastore. This exposes your tables to open source engines with BigQuery compatibility, including Spark.


Python

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("BigLake Metastore Iceberg") \
    .config("spark.sql.catalog.CATALOG_NAME", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.CATALOG_NAME.catalog-impl", "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog") \
    .config("spark.sql.catalog.CATALOG_NAME.gcp_project", "PROJECT_ID") \
    .config("spark.sql.catalog.CATALOG_NAME.gcp_location", "LOCATION") \
    .config("spark.sql.catalog.CATALOG_NAME.warehouse", "WAREHOUSE_DIRECTORY") \
    .getOrCreate()
spark.conf.set("viewsEnabled", "true")

# Use the catalog and dataset (namespace)
spark.sql("USE `CATALOG_NAME`;")
spark.sql("USE NAMESPACE DATASET_NAME;")

# Create a dataset for temporary query results
spark.sql("CREATE NAMESPACE IF NOT EXISTS MATERIALIZATION_NAMESPACE")
spark.conf.set("materializationDataset", "MATERIALIZATION_NAMESPACE")

# List the tables in the dataset
df = spark.sql("SHOW TABLES;")
df.show()

# Query the tables
sql = """SELECT * FROM DATASET_NAME.TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()
sql = """SELECT * FROM DATASET_NAME.ICEBERG_TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()

sql = """SELECT * FROM DATASET_NAME.READONLY_ICEBERG_TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()

 

Extending the functionality of BigLake metastore further, the Iceberg REST catalog (in preview) lets you access Iceberg data from any data processing engine. Here’s how to connect to it using Spark:


Python

import google.auth
from google.auth.transport.requests import Request
from google.oauth2 import service_account
import pyspark
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

catalog = ""
spark = SparkSession.builder.appName("") \
    .config("spark.sql.defaultCatalog", catalog) \
    .config(f"spark.sql.catalog.{catalog}", "org.apache.iceberg.spark.SparkCatalog") \
    .config(f"spark.sql.catalog.{catalog}.type", "rest") \
    .config(f"spark.sql.catalog.{catalog}.uri", "https://biglake.googleapis.com/iceberg/v1beta/restcatalog") \
    .config(f"spark.sql.catalog.{catalog}.warehouse", "gs://") \
    .config(f"spark.sql.catalog.{catalog}.token", "") \
    .config(f"spark.sql.catalog.{catalog}.oauth2-server-uri", "https://oauth2.googleapis.com/token") \
    .config(f"spark.sql.catalog.{catalog}.header.x-goog-user-project", "") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config(f"spark.sql.catalog.{catalog}.io-impl", "org.apache.iceberg.hadoop.HadoopFileIO") \
    .config(f"spark.sql.catalog.{catalog}.rest-metrics-reporting-enabled", "false") \
    .getOrCreate()

 

 

Completing the lakehouse

 

Google Cloud provides a comprehensive suite of services that complement Apache Iceberg and Apache Spark, enabling you to build, manage, and scale your data lakehouse with ease while leveraging many of the open-source technologies you already use:

  • Dataplex Universal Catalog: Dataplex Universal Catalog provides a unified data fabric for managing, monitoring, and governing your data across data lakes, data warehouses, and data marts. It integrates with BigLake Metastore, ensuring that governance policies are consistently enforced across your Iceberg tables, and enabling capabilities like semantic search, data lineage, and data quality checks.
  • Google Cloud Managed Service for Apache Kafka: Run fully managed Kafka clusters on Google Cloud, including Kafka Connect. Data streams can be written directly to BigQuery, including to managed Iceberg tables, with low-latency reads.
  • Cloud Composer: A fully managed workflow orchestration service built on Apache Airflow (see the sketch after this list).
  • Vertex AI: Use Vertex AI to manage the full end-to-end ML Ops experience. You can also use Vertex AI Workbench for a managed JupyterLab experience to connect to your serverless Spark and Dataproc instances.
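
For example, a Cloud Composer DAG can schedule a serverless Spark batch each day. The sketch below is only an illustration using the Google provider's DataprocCreateBatchOperator; the project, region, and GCS path are placeholders, and the operator arguments should be checked against your provider version.


Python

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator

with DAG(
    dag_id="iceberg_lakehouse_batch",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Submit a PySpark job as a serverless batch (Dataproc Batches API)
    run_spark_batch = DataprocCreateBatchOperator(
        task_id="run_spark_batch",
        project_id="PROJECT_ID",
        region="REGION",
        batch={
            "pyspark_batch": {
                "main_python_file_uri": "gs://BUCKET/PATH/TO/job.py",
            }
        },
    )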

 

Conclusion

 

The combination of Apache Iceberg and Apache Spark on Google Cloud offers a compelling solution for building modern, high-performance data lakehouses. Iceberg provides the transactional consistency, schema evolution, and performance optimizations that were historically missing from data lakes, while Spark offers a versatile and scalable engine for processing these large datasets.

To learn more, check out our free webinar on July 8th at 11AM PST where we’ll dive deeper into using Apache Spark and supporting tools on Google Cloud.

Author: Brad Miro, Senior Developer Advocate – Google

 
 





7 Python Web Development Frameworks for Data Scientists


 

Python is widely known for its popularity among engineers and data scientists, but it’s also a favorite choice for web developers. In fact, many developers prefer Python over JavaScript for building web applications because of its simple syntax, readability, and the vast ecosystem of powerful frameworks and tools available.

Whether you are a beginner or an experienced developer, Python offers frameworks that cater to every need, from lightweight micro-frameworks that require just a few lines of code, to robust full-stack solutions packed with built-in features. Some frameworks are designed for rapid prototyping, while others focus on security, scalability, or lightning-fast performance. 

In this article, we will review seven of the most popular Python web frameworks. You will discover which ones are best suited for building anything from simple websites to complex, high-traffic web applications. No matter your experience level, there is a Python framework that can help you bring your web project to life efficiently and effectively.

 

Python Web Development Frameworks

 

1. Django: The Full-Stack Powerhouse for Scalable Web Apps

Django is a robust, open-source Python framework designed for rapid development of secure and scalable web applications. With its built-in ORM, admin interface, authentication, and a vast ecosystem of reusable components, Django is ideal for building everything from simple websites to complex enterprise solutions. 

Learn more: https://www.djangoproject.com/
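
As a small, hypothetical taste of the style, a Django view plus a URL route inside an existing app might look like this:

# views.py
from django.http import JsonResponse

def health(request):
    # A tiny JSON endpoint; Django provides routing, middleware, the ORM, and the admin around it
    return JsonResponse({"status": "ok"})

# urls.py
from django.urls import path
from . import views

urlpatterns = [
    path("health/", views.health),
]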

 

2. Flask: The Lightweight and Flexible Microframework

Flask is a minimalist Python web framework that gives you the essentials to get started, while letting you add only what you need. It’s perfect for small to medium-sized applications, APIs, and rapid prototyping. Flask’s simplicity, flexibility, and extensive documentation make it a top choice for developers who want full control over their project’s architecture.

Learn more: https://flask.palletsprojects.com/
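
To illustrate that minimalism, here is a sketch of a tiny Flask app (assuming Flask is installed; the route and names are just examples):

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/")
def index():
    # A single route returning JSON; add blueprints and extensions as the project grows
    return jsonify(message="Hello from Flask")

if __name__ == "__main__":
    app.run(debug=True)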

 

3. FastAPI: Modern, High-Performance APIs with Ease

FastAPI is best known for building high-performance APIs, but with Jinja2 templates you can also create fully featured websites that combine both backend and frontend functionality within the same framework. Built on top of Starlette and Pydantic, FastAPI offers asynchronous support, automatic interactive documentation, and exceptional speed, making it one of the fastest Python web frameworks available.

Learn more: https://fastapi.tiangolo.com/
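
A minimal sketch of a FastAPI service (assuming fastapi and uvicorn are installed; start it with uvicorn main:app --reload):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Item(BaseModel):
    name: str
    price: float

@app.post("/items")
async def create_item(item: Item):
    # The request body is validated by Pydantic; interactive docs are served at /docs
    return {"name": item.name, "price": item.price}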

 

4. Gradio: Effortless Web Interfaces for Machine Learning

Gradio is an open-source Python framework that allows you to rapidly build and share web-based interfaces for machine learning models. It is highly popular among the machine learning community, as you can build, test, and deploy your ML web demos on Hugging Face for free in just minutes. You don’t need front-end or back-end experience; just basic Python knowledge is enough to create high-performance web demos and APIs.

Learn more: https://www.gradio.app/
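
A minimal sketch of a Gradio demo (assuming the gradio package is installed; the predict function stands in for a real model):

import gradio as gr

def predict(text: str) -> str:
    # Placeholder for a real model's inference call
    return text.upper()

demo = gr.Interface(fn=predict, inputs="text", outputs="text", title="Demo")

if __name__ == "__main__":
    demo.launch()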

 

5. Streamlit: Instantly Build Data Web Apps

Streamlit is designed for data scientists and engineers who want to create beautiful, interactive web apps directly from Python scripts. With its intuitive API, you can build dashboards, data visualizations, and ML model demos in minutes, with no need for HTML, CSS, or JavaScript. Streamlit is perfect for rapid prototyping and sharing insights with stakeholders.

Learn more: https://streamlit.io/
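
A minimal sketch of a Streamlit app (assuming streamlit and pandas are installed; run it with streamlit run app.py):

import pandas as pd
import streamlit as st

st.title("Quick data explorer")

rows = st.slider("Number of rows", min_value=10, max_value=100, value=50)
df = pd.DataFrame({"x": range(rows), "y": [i ** 2 for i in range(rows)]})

st.line_chart(df.set_index("x"))
st.dataframe(df.head())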

 

6. Tornado: Scalable, Non-Blocking Web Server and Framework

Tornado is a powerful Python web framework and asynchronous networking library, designed for building scalable and high-performance web applications. Unlike traditional frameworks, Tornado uses non-blocking network I/O, which makes it ideal for handling thousands of simultaneous connections, perfect for real-time web services like chat applications, live updates, and long polling.

Learn more: https://www.tornadoweb.org/en/stable/guide.html 
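
A minimal sketch of a Tornado app in its non-blocking, asynchronous style (assuming Tornado 6+):

import asyncio

import tornado.web

class MainHandler(tornado.web.RequestHandler):
    async def get(self):
        # Non-blocking handler; many of these can be in flight concurrently
        self.write({"message": "Hello from Tornado"})

async def main():
    app = tornado.web.Application([(r"/", MainHandler)])
    app.listen(8888)
    await asyncio.Event().wait()  # keep the event loop running

if __name__ == "__main__":
    asyncio.run(main())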

 

7. Reflex: Pure Python Web Apps, Simplified

Reflex (formerly Pynecone) lets you build full-stack web applications using only Python, no JavaScript required. It compiles your Python code into modern web apps, handling both the frontend and backend seamlessly. Reflex is perfect for Python developers who want to create interactive, production-ready web apps without switching languages.

Learn more: https://reflex.dev/ 

 

Conclusion

 
FastAPI is my go-to framework for creating REST API endpoints for machine learning applications, thanks to its speed, simplicity, and production-ready features. 

For sharing machine learning demos with non-technical stakeholders, Gradio is incredibly useful, allowing you to build interactive web interfaces with minimal effort.

Django stands out as a robust, full-featured framework that lets you build any web-related application with complete control and scalability.

If you need something lightweight and quick to set up, Flask is an excellent choice for simple web apps and prototypes.

Streamlit shines when it comes to building interactive user interfaces for data apps in just minutes, making it perfect for rapid prototyping and visualization.

For real-time web applications that require handling thousands of simultaneous connections, Tornado is a strong option due to its non-blocking, asynchronous architecture.

Finally, Reflex is a modern framework designed for building production-ready applications that are both simple to develop and easy to deploy.
 
 

Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master’s degree in technology management and a bachelor’s degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.





What Does Python’s __slots__ Actually Do?


 

What if there were a way to make your Python code faster? __slots__ in Python is easy to implement and can improve the performance of your code while reducing its memory usage.

In this article, we will walk through how it works using a real-world data science project that Allegro uses as a challenge in its data science recruitment process. Before we get into the project, let’s build a solid understanding of what __slots__ does.

 

What is __slots__ in Python?

 
In Python, every object normally keeps a dictionary of its attributes (its __dict__). This allows you to add, change, or delete them at runtime, but it also comes at a cost: extra memory and slower attribute access.
A __slots__ declaration tells Python that these are the only attributes an instance will ever need. It is a constraint, but in return it saves memory and speeds up attribute access. Let’s see an example.

class WithoutSlots:
    def __init__(self, name, age):
        self.name = name
        self.age = age

class WithSlots:
    __slots__ = ['name', 'age']

    def __init__(self, name, age):
        self.name = name
        self.age = age

 

In the second class, __slots__ tells Python not to create a dictionary for each object. Instead, it reserves a fixed spot in memory for the name and age values, making it faster and decreasing memory usage.
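
You can see both effects directly: a slotted instance has no per-instance __dict__, and assigning an attribute that is not declared in __slots__ raises an AttributeError.

w = WithoutSlots("Ada", 36)
s = WithSlots("Ada", 36)

w.nickname = "ada"                 # fine: stored in the per-instance __dict__
print(w.__dict__)                  # {'name': 'Ada', 'age': 36, 'nickname': 'ada'}

print(hasattr(s, "__dict__"))      # False: no per-instance dictionary is created
try:
    s.nickname = "ada"             # not listed in __slots__, so this fails
except AttributeError as err:
    print("AttributeError:", err)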

 

Why Use __slots__?

 
Now, before starting the data project, let’s go over the reasons why you should use __slots__.

  • Memory: Objects take up less space when Python skips creating a dictionary.
  • Speed: Accessing values is quicker because Python knows where each value is stored.
  • Bugs: This structure prevents silent bugs, because only the declared attributes can be assigned.

 

Using Allegro’s Data Science Challenge as an Example

 
In this data project, Allegro asked data science candidates to predict laptop prices by building machine learning models.

 
A real data project to understand Python slots
 

Link to this data project: https://platform.stratascratch.com/data-projects/laptop-price-prediction

There are three different datasets:

  • train_dataset.json
  • val_dataset.json
  • test_dataset.json

Good. Let’s continue with the data exploration process.

 

Data Exploration

Now let’s load one of them to see the dataset’s structure.

import json
import pandas as pd

with open('train_dataset.json', 'r') as f:
    train_data = json.load(f)
df = pd.DataFrame(train_data).dropna().reset_index(drop=True)
df.head()

 

Here is the output.

 
 

Good, let’s see the columns.

 

Here is the output.

 
 

Now, let’s check the numerical columns.

 

Here is the output.

 
 

Data Exploration with __slots__ vs Regular Classes

Let’s create a class called SlottedDataExploration, which will use the __slots__ attribute. It allows only one attribute called df. Let’s see the code.

class SlottedDataExploration:
    __slots__ = ['df']

    def __init__(self, df):
        self.df = df

    def info(self):
        return self.df.info()

    def head(self, n=5):
        return self.df.head(n)

    def tail(self, n=5):
        return self.df.tail(n)

    def describe(self):
        return self.df.describe(include="all")

 

Now let’s see the same implementation using a regular class instead of __slots__.

class DataExploration:
    def __init__(self, df):
        self.df = df

    def info(self):
        return self.df.info()

    def head(self, n=5):
        return self.df.head(n)

    def tail(self, n=5):
        return self.df.tail(n)

    def describe(self):
        return self.df.describe(include="all")

 

You can read more about how class methods work in this Python Class Methods guide.

 

Performance Comparison: Time Benchmark

Now let’s measure the performance of both classes by timing them and comparing their memory usage.

import time
from pympler import asizeof  # memory measurement

start_normal = time.time()
de = DataExploration(df)
_ = de.head()
_ = de.tail()
_ = de.describe()
_ = de.info()
end_normal = time.time()
normal_duration = end_normal - start_normal
normal_memory = asizeof.asizeof(de)

start_slotted = time.time()
sde = SlottedDataExploration(df)
_ = sde.head()
_ = sde.tail()
_ = sde.describe()
_ = sde.info()
end_slotted = time.time()
slotted_duration = end_slotted - start_slotted
slotted_memory = asizeof.asizeof(sde)

print(f"⏱️ Normal class duration: {normal_duration:.4f} seconds")
print(f"⏱️ Slotted class duration: {slotted_duration:.4f} seconds")

print(f"📦 Normal class memory usage: {normal_memory:.2f} bytes")
print(f"📦 Slotted class memory usage: {slotted_memory:.2f} bytes")

 

Now let’s see the result.
 
 

The slotted class duration is 46.45% faster, but the memory usage is the same for this example.
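
The memory usage comes out the same here because both wrappers hold a reference to the same large DataFrame, which dominates the measurement. The savings from __slots__ show up when you create many small instances; here is a minimal sketch (the Point classes are illustrative) using pympler again:

from pympler import asizeof

class PointDict:
    def __init__(self, x, y):
        self.x, self.y = x, y

class PointSlots:
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x, self.y = x, y

dict_points = [PointDict(i, i) for i in range(100_000)]
slot_points = [PointSlots(i, i) for i in range(100_000)]

print("dict-based points:", asizeof.asizeof(dict_points), "bytes")
print("slotted points:   ", asizeof.asizeof(slot_points), "bytes")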

 

Machine Learning in Action

 
Now let’s continue with the machine learning. But before doing so, let’s do a train and test split.

 

Train and Test Split

Now we have three different datasets, train, val, and test, so let’s first find their indices.

train_indeces = train_df.dropna().index
val_indeces = val_df.dropna().index
test_indeces = test_df.dropna().index

 

Now it’s time to assign those indices to select those datasets easily in the next step.

train_df = new_df.loc[train_indeces]
val_df = new_df.loc[val_indeces]
test_df = new_df.loc[test_indeces]

 

Great, now let’s format these data frames, because scikit-learn expects the labels in a flat (n,) shape instead of (n, 1). To do that, we need to use .ravel() after to_numpy().

X_train = train_df[selected_features].to_numpy()
X_val = val_df[selected_features].to_numpy()
X_test = test_df[selected_features].to_numpy()

y_train = df.loc[train_indeces][label_col].to_numpy().ravel()
y_val = df.loc[val_indeces][label_col].to_numpy().ravel()
y_test = df.loc[test_indeces][label_col].to_numpy().ravel()

 

Applying Machine Learning Models

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error 
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import VotingRegressor
from sklearn import linear_model
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, MaxAbsScaler
import matplotlib.pyplot as plt
from sklearn import tree
import seaborn as sns
def rmse(y_true, y_pred): 
    return mean_squared_error(y_true, y_pred, squared=False)
def regression(regressor_name, regressor):
    pipe = make_pipeline(MaxAbsScaler(), regressor)
    pipe.fit(X_train, y_train) 
    predicted = pipe.predict(X_test)
    rmse_val = rmse(y_test, predicted)
    print(regressor_name, ':', rmse_val)
    pred_df[regressor_name+'_Pred'] = predicted
    plt.figure(regressor_name)
    plt.title(regressor_name)
    plt.xlabel('predicted')
    plt.ylabel('actual')
    sns.regplot(y=y_test,x=predicted)

 

Next, we will define a dictionary of regressors and run each model.

regressors = {
    'Linear' : LinearRegression(),
    'MLP': MLPRegressor(random_state=42, max_iter=500, learning_rate="constant", learning_rate_init=0.6),
    'DecisionTree': DecisionTreeRegressor(max_depth=15, random_state=42),
    'RandomForest': RandomForestRegressor(random_state=42),
    'GradientBoosting': GradientBoostingRegressor(random_state=42, criterion='squared_error',
                                                  loss="squared_error",learning_rate=0.6, warm_start=True),
    'ExtraTrees': ExtraTreesRegressor(n_estimators=100, random_state=42),
}
pred_df = pd.DataFrame(columns =["Actual"])
pred_df["Actual"] = y_test
for key in regressors.keys():
    regression(key, regressors[key])

 

Here are the results.

 
 

Now, let’s implement this with both slotted and regular classes.

 

Machine Learning with __slots__ vs Regular Classes

Now let’s check the code with slots.

class SlottedMachineLearning:
    __slots__ = ['X_train', 'y_train', 'X_test', 'y_test', 'pred_df']

    def __init__(self, X_train, y_train, X_test, y_test):
        self.X_train = X_train
        self.y_train = y_train
        self.X_test = X_test
        self.y_test = y_test
        self.pred_df = pd.DataFrame({'Actual': y_test})

    def rmse(self, y_true, y_pred):
        return mean_squared_error(y_true, y_pred, squared=False)

    def regression(self, name, model):
        pipe = make_pipeline(MaxAbsScaler(), model)
        pipe.fit(self.X_train, self.y_train)
        predicted = pipe.predict(self.X_test)
        self.pred_df[name + '_Pred'] = predicted

        score = self.rmse(self.y_test, predicted)
        print(f"{name} RMSE:", score)

        plt.figure(figsize=(6, 4))
        sns.regplot(x=predicted, y=self.y_test, scatter_kws={"s": 10})
        plt.xlabel('Predicted')
        plt.ylabel('Actual')
        plt.title(f'{name} Predictions')
        plt.grid(True)
        plt.show()

    def run_all(self):
        models = {
            'Linear': LinearRegression(),
            'MLP': MLPRegressor(random_state=42, max_iter=500, learning_rate="constant", learning_rate_init=0.6),
            'DecisionTree': DecisionTreeRegressor(max_depth=15, random_state=42),
            'RandomForest': RandomForestRegressor(random_state=42),
            'GradientBoosting': GradientBoostingRegressor(random_state=42, learning_rate=0.6, warm_start=True),
            'ExtraTrees': ExtraTreesRegressor(n_estimators=100, random_state=42)
        }

        for name, model in models.items():
            self.regression(name, model)

 

Here is the regular class application.

class MachineLearning:
    def __init__(self, X_train, y_train, X_test, y_test):
        self.X_train = X_train
        self.y_train = y_train
        self.X_test = X_test
        self.y_test = y_test
        self.pred_df = pd.DataFrame({'Actual': y_test})

    def rmse(self, y_true, y_pred):
        return mean_squared_error(y_true, y_pred, squared=False)

    def regression(self, name, model):
        pipe = make_pipeline(MaxAbsScaler(), model)
        pipe.fit(self.X_train, self.y_train)
        predicted = pipe.predict(self.X_test)
        self.pred_df[name + '_Pred'] = predicted

        score = self.rmse(self.y_test, predicted)
        print(f"{name} RMSE:", score)

        plt.figure(figsize=(6, 4))
        sns.regplot(x=predicted, y=self.y_test, scatter_kws={"s": 10})
        plt.xlabel('Predicted')
        plt.ylabel('Actual')
        plt.title(f'{name} Predictions')
        plt.grid(True)
        plt.show()

    def run_all(self):
        models = {
            'Linear': LinearRegression(),
            'MLP': MLPRegressor(random_state=42, max_iter=500, learning_rate="constant", learning_rate_init=0.6),
            'DecisionTree': DecisionTreeRegressor(max_depth=15, random_state=42),
            'RandomForest': RandomForestRegressor(random_state=42),
            'GradientBoosting': GradientBoostingRegressor(random_state=42, learning_rate=0.6, warm_start=True),
            'ExtraTrees': ExtraTreesRegressor(n_estimators=100, random_state=42)
        }

        for name, model in models.items():
            self.regression(name, model)

 

Performance Comparison: Time Benchmark

Now let’s benchmark both classes, just as we did in the previous section.

import time

start_normal = time.time()
ml = MachineLearning(X_train, y_train, X_test, y_test)
ml.run_all()
end_normal = time.time()
normal_duration = end_normal - start_normal
normal_memory = (
    ml.X_train.nbytes +
    ml.X_test.nbytes +
    ml.y_train.nbytes +
    ml.y_test.nbytes
)

start_slotted = time.time()
sml = SlottedMachineLearning(X_train, y_train, X_test, y_test)
sml.run_all()
end_slotted = time.time()
slotted_duration = end_slotted - start_slotted
slotted_memory = (
    sml.X_train.nbytes +
    sml.X_test.nbytes +
    sml.y_train.nbytes +
    sml.y_test.nbytes
)

print(f"⏱️ Normal ML class duration: {normal_duration:.4f} seconds")
print(f"⏱️ Slotted ML class duration: {slotted_duration:.4f} seconds")

print(f"📦 Normal ML class memory usage: {normal_memory:.2f} bytes")
print(f"📦 Slotted ML class memory usage: {slotted_memory:.2f} bytes")

time_diff = normal_duration - slotted_duration
percent_faster = (time_diff / normal_duration) * 100
if percent_faster > 0:
    print(f"✅ Slotted ML class is {percent_faster:.2f}% faster than the regular ML class.")
else:
    print(f"ℹ️ No speed improvement with slots in this run.")

memory_diff = normal_memory - slotted_memory
percent_smaller = (memory_diff / normal_memory) * 100
if percent_smaller > 0:
    print(f"✅ Slotted ML class uses {percent_smaller:.2f}% less memory than the regular ML class.")
else:
    print(f"ℹ️ No memory savings with slots in this run.")

 

Here is the output.

 
 

Conclusion

 
By preventing the creation of a dynamic __dict__ for each instance, Python’s __slots__ is very good at reducing memory usage and speeding up attribute access. You saw how it works in practice through both data exploration and machine learning tasks using Allegro’s real recruitment project.

In small datasets, the improvements might be minor. But as data scales, the benefits become more noticeable, especially in memory-bound or performance-critical applications.
 
 

Nate Rosidi is a data scientist and works in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.







Perplexity Leads ChatGPT on Apple App Store in India Post Airtel Offer

Just one day after Airtel announced its offer of a complimentary Perplexity Pro subscription, the Perplexity app has seen an unprecedented surge in downloads, particularly on iOS. 

Thanks to the Airtel promotion, the AI search engine has climbed to the top spot on the App Store in India, overtaking OpenAI’s ChatGPT. This milestone was announced by Perplexity AI’s CEO, Aravind Srinivas, in a post on LinkedIn.

As reported by Mint News, Google’s Gemini ranks fifth among the top free apps on the App Store, following Perplexity and ChatGPT. 

On the other hand, ChatGPT remains at the top of the charts on the Google Play Store, where Perplexity has yet to break into the leading free apps.

Perplexity Pro offers users access to advanced AI models, including GPT-4.1, Claude, Grok 4, and others, as well as image generation capabilities across compatible models. Subscribers also gain access to the company’s newly introduced Comet browser, which is otherwise available to free users only through an invitation.

Following Airtel’s announcement of its major partnership with an AI startup based in the US to offer a complimentary year-long subscription to Perplexity AI Pro, numerous Airtel subscribers took advantage of the promotion. 

The Perplexity AI Pro subscription is highly advantageous for students, researchers, professionals, and educators. In India, the annual subscription to the Pro tier costs Rs 17,000 when purchased directly from Perplexity.

With the Airtel promotion, a full year’s subscription is now available for free, encouraging more individuals to sign up and try it out. This initiative has led to a significant number of subscribers downloading the app via App Stores, as using the app is the simplest method to access the AI model.



