Saturday, February 21, 2026

Why GROUP BY ALL Improves Analytical Workflows

As data scientists, we use GROUP BY constantly.

We aggregate. We summarize. We compute KPIs. We build feature tables.

And in many analytical queries, especially those against wide tables, we end up writing the same column names twice:

  • Once in the SELECT clause
  • Again in the GROUP BY clause

This repetition is tedious — and sometimes error-prone.

In Snowflake, there is a convenient feature that simplifies this pattern:

GROUP BY ALL


The Traditional Pattern

Consider a typical aggregation:

SELECT
    region,
    product_category,
    sales_channel,
    SUM(revenue) AS total_revenue,
    COUNT(*) AS order_count
FROM sales
GROUP BY
    region,
    product_category,
    sales_channel;

Notice the duplication: every non-aggregated column must be explicitly listed twice.

When the number of grouping columns grows — 5, 10, sometimes more — the query becomes longer and easier to break during refactoring.

If you remove a column from SELECT but forget to remove it from GROUP BY, you introduce inconsistency.
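
For example, suppose sales_channel is dropped from the SELECT list during a refactor but accidentally left behind in GROUP BY (a hypothetical slip, shown only to illustrate the hazard):

SELECT
    region,
    product_category,
    SUM(revenue) AS total_revenue
FROM sales
GROUP BY
    region,
    product_category,
    sales_channel;  -- leftover grouping column

The query still runs, but the result is silently split by sales_channel, so the output contains what look like duplicate (region, product_category) rows.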


Using GROUP BY ALL

With GROUP BY ALL, Snowflake automatically groups by all non-aggregated columns in the SELECT list.

The same query becomes:

SELECT
    region,
    product_category,
    sales_channel,
    SUM(revenue) AS total_revenue,
    COUNT(*) AS order_count
FROM sales
GROUP BY ALL;

That’s it. No duplication. The grouping columns are inferred from the SELECT clause.


Why This Matters in Practice

This is not just syntactic sugar. It improves:

  • Readability — the grouping logic is visually cleaner.
  • Maintainability — when modifying selected dimensions, you only change one place.
  • Error Reduction — you eliminate mismatches between SELECT and GROUP BY.

For data scientists who iterate quickly — especially during exploratory analysis — this reduces friction.
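
As a quick illustration of the maintainability point, GROUP BY ALL also picks up non-aggregated expressions, not just plain columns. In a hypothetical exploratory query (the orders table and its columns here are invented for illustration), adding or removing a dimension only touches the SELECT list:

SELECT
    DATE_TRUNC('MONTH', order_date) AS order_month,
    region,
    SUM(revenue) AS total_revenue,
    COUNT(*) AS order_count
FROM orders
GROUP BY ALL;

Here the grouping keys are inferred as order_month and region; swapping DATE_TRUNC('MONTH', ...) for DATE_TRUNC('WEEK', ...) requires no change to the grouping clause.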


When Should You Use It?

GROUP BY ALL is particularly useful when:

  • Building summary tables
  • Creating feature engineering pipelines
  • Writing intermediate analytical transformations
  • Working with many dimension columns

However, if you are writing standardized queries meant to run across multiple database systems, portability matters — not all engines support this syntax.


A Broader Perspective

Small syntactic improvements can compound.

In modern data platforms like Snowflake, productivity features are not trivial — they reflect a design philosophy: reduce friction in analytical workflows.

As data scientists, we often spend more time transforming and summarizing data than training models. Anything that makes aggregation safer and cleaner is valuable.


Final Thoughts

Machine learning often receives the spotlight, but high-quality analytics begins with clean aggregation.

GROUP BY ALL is a small feature — but for daily analytical work, small improvements matter.

And sometimes, better engineering starts with better syntax.

Friday, October 20, 2023

Speeding Up Bulk Data Loads in Oracle with Hints

Bulk loading large volumes of data from staging tables into production tables is a common task, but it can be a slow and tedious process with basic INSERT statements in Oracle. Query hints can dramatically improve load times, as I recently discovered.

In our ETL process, we needed to load order data from a staging table to the main orders table. Our initial INSERT statement looked like:
    INSERT INTO Orders (order_id, customer_id, order_date, total_amount)
       SELECT order_id, customer_id, order_date, total_amount
         FROM Staging_Orders;
This took many hours for millions of rows. Then we turned to Oracle's hint syntax, which lets you fine-tune how a SQL statement is executed.
Adding direct-path and parallel hints looks like this:
INSERT /*+ APPEND PARALLEL(4) */ INTO Orders (order_id, customer_id, order_date, total_amount)
SELECT order_id, customer_id, order_date, total_amount
FROM Staging_Orders;
In the statement above, the /*+ APPEND */ hint instructs Oracle to perform a direct-path insert for faster data loading, and the PARALLEL(4) hint instructs Oracle to use parallel processing with four parallel execution servers (adjust the number based on your system's resources).
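
One related detail: the PARALLEL hint on its own typically affects only the query side; for the insert itself to run in parallel, parallel DML usually needs to be enabled for the session, and after a direct-path insert the table cannot be queried in the same session until you commit. A minimal sketch of the full sequence might look like this:

ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ APPEND PARALLEL(4) */ INTO Orders (order_id, customer_id, order_date, total_amount)
SELECT order_id, customer_id, order_date, total_amount
  FROM Staging_Orders;

COMMIT;  -- a direct-path insert must be committed before the table can be read in the same session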

The results were mind-blowing: loading finished in under 10 minutes! For large data loads, enabling direct-path INSERT with parallelism can cut the time from hours to minutes.

I hope this example gives you some ideas on optimizing bulk data operations in Oracle through hints. Let me know if you have any other tips for supercharged ETL performance!

Thursday, March 16, 2023

Leveraging Oracle's Compress and Partition Features for Efficient Data Management

Recently, one of my clients loaded a substantial amount of data into an Oracle database table, which consumed an excessive amount of disk space and resulted in sluggish query performance. To address these issues, I suggested implementing Oracle's compress and partition table features. By applying these techniques, we successfully saved disk space and improved query performance.
In this example, we will create a table that combines compression with interval partitioning. The table will store a sales transaction dataset, which includes columns for transaction ID, product ID, date, and revenue. We will use range partitioning with an interval of 1 month and enable Advanced Row Compression for the table.

Creating the table with compression and interval partitioning:
CREATE TABLE sales_transactions
(
  transaction_id NUMBER,
  product_id NUMBER,
  transaction_date DATE,
  revenue NUMBER
)
COMPRESS FOR OLTP
PARTITION BY RANGE (transaction_date) INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(
  PARTITION sales_transactions_initial VALUES LESS THAN (TO_DATE('2023-01-01', 'YYYY-MM-DD'))
);
In this example, we have created a range-partitioned table with an interval of 1 month. The PARTITION BY RANGE clause specifies that we are partitioning the table based on the transaction_date column. The INTERVAL clause defines a 1-month interval for partition creation using the NUMTOYMINTERVAL function. The table also has an initial partition, sales_transactions_initial, that holds data for dates before '2023-01-01'. We have also added the COMPRESS FOR OLTP clause to enable Advanced Row Compression for the table.

Inserting data into the table:
INSERT INTO sales_transactions (transaction_id, product_id, transaction_date, revenue)
  VALUES (1, 101, TO_DATE('2022-12-15', 'YYYY-MM-DD'), 1000);
INSERT INTO sales_transactions (transaction_id, product_id, transaction_date, revenue)
  VALUES (2, 102, TO_DATE('2023-01-05', 'YYYY-MM-DD'), 1500);
INSERT INTO sales_transactions (transaction_id, product_id, transaction_date, revenue)
  VALUES (3, 103, TO_DATE('2023-01-20', 'YYYY-MM-DD'), 2000);
INSERT INTO sales_transactions (transaction_id, product_id, transaction_date, revenue)
  VALUES (4, 104, TO_DATE('2023-02-15', 'YYYY-MM-DD'), 2500);
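
Because the table is interval partitioned, Oracle creates the monthly partitions automatically as rows for new months arrive. One way to confirm this (the system-generated partition names will differ in your database) is to query the data dictionary:

SELECT partition_name, high_value
  FROM user_tab_partitions
 WHERE table_name = 'SALES_TRANSACTIONS'
 ORDER BY partition_position;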

Querying the table:
SELECT * FROM sales_transactions 
   WHERE transaction_date >= TO_DATE('2023-01-01', 'YYYY-MM-DD') 
   AND transaction_date < TO_DATE('2023-02-01', 'YYYY-MM-DD');
This query retrieves sales transactions for January 2023. Oracle will use partition pruning to read only the relevant partition(s) for the specified date range, improving query performance. By combining compression and interval partitioning, you can optimize storage space, manage large tables efficiently, and enhance query performance in your Oracle database.
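
If you want to verify that partition pruning is actually taking place, one option is to check the execution plan; for a partitioned table, the plan output includes Pstart and Pstop columns showing which partitions are touched:

EXPLAIN PLAN FOR
SELECT * FROM sales_transactions
   WHERE transaction_date >= TO_DATE('2023-01-01', 'YYYY-MM-DD')
   AND transaction_date < TO_DATE('2023-02-01', 'YYYY-MM-DD');

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);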

Wednesday, June 15, 2022

Convincing an Intern to Perform Data Science Tasks Using SQL

An intern has joined my client's company to work on a data science task, and I will be helping her along the way. She is familiar with the Python and R programming languages and has completed a number of projects with these tools. However, I have convinced her that it is a good idea to learn the database query language SQL and use it to perform most of her data work. These are the four reasons I gave:
  1. Unlike in an academic environment, business data are mostly stored in relational databases. It is economical and secure to perform data analytics and data science work inside the database using SQL.
  2. SQL is a standard language. SQL scripts, with little or no change, can run on most relational databases, such as Snowflake, Oracle, SQL Server, MySQL, PostgreSQL, etc.
  3. Most data analytics and data science processes can be implemented in SQL and deployed inside the database.
  4. SQL was developed roughly 50 years ago. It is still widely used and will continue to be so for many years to come. Not many programming languages have survived such a long period of time. Learning SQL is a good investment.


Friday, September 24, 2021

An Antique Neural Network for Image Classification

Quite accidentally, I found the picture below showing my research work in 1994, when I trained a neural network to identify land use types from a satellite image. The neural network had a structure of 3 input neurons, corresponding to the blue, green, and red bands of a Landsat TM image, 14 hidden neurons, and 7 output neurons representing seven land use types, including cornfields, wheat fields, water bodies, and impervious surfaces. The number of training data points was 65.

I coded everything from scratch, including the algorithm and the user interface, in C++. There were no prebuilt packages to use, no internet to search for information, and no such thing as asking questions on social media. The computer I used had a 486 CPU, 32 megabytes of memory, and a hard drive with less than 500 megabytes of storage.

How things have changed in 27 years! With a deep learning neural network, such as a convolutional neural network, many layers of neurons serving different purposes can be stacked together to form a complex structure and collectively perform recognition tasks that were unthinkable in the past. And all of this can be done with a few lines of Python, with execution often happening in a cloud computing environment with virtually unlimited computation and storage resources. The progress is astonishing.

Wednesday, September 22, 2021

Online Course: Oracle SQL for Random Sampling

Since many companies store their critical business data in Oracle databases, it is advantageous to perform random sampling in the same environment using SQL, without moving the data. For example, it is time-consuming to pull a large data set out of a database and do random sampling with Python on a laptop. In addition, once the data leave the protection of the database, they are exposed to various security risks.

Competition-winning data scientist and long-time Oracle SQL practitioner Dr. Jay Zhou has created an online course in which he shares his expertise in performing random sampling using Oracle SQL. Students learn practical skills that can be applied immediately in their work. Hundreds of people from 85 countries have taken the course.

The course begins with a description of scenarios where random sampling is necessary. A number of useful Oracle SQL random functions are introduced. The course uses examples and presents SQL scripts to perform the following common tasks.
  • How to quickly view random samples of the data. There are multiple ways to do this task.
  • How to select a precise number of samples randomly.
  • How to split data randomly. This is a necessary task when we build a machine learning model and need to produce three data sets, i.e., training, testing, and validation sets.
  • How to select random samples by groups. For example, we want to randomly select 100 students, 50 of them female and 50 male, from a school.
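
The course's own scripts are not reproduced here, but to give a flavor of the general idea, here is a minimal sketch of two such tasks in Oracle SQL (the students table and its columns are invented for illustration):

-- Select exactly 100 rows at random (FETCH FIRST requires Oracle 12c or later)
SELECT *
  FROM students
 ORDER BY DBMS_RANDOM.VALUE
 FETCH FIRST 100 ROWS ONLY;

-- Randomly tag roughly 70% of rows as training data and the rest as testing data
SELECT s.*,
       CASE WHEN DBMS_RANDOM.VALUE < 0.7 THEN 'TRAIN' ELSE 'TEST' END AS data_split
  FROM students s;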

Are there any course requirements or prerequisites?

Very basic Oracle SQL knowledge

Who is this course for?

SQL developers, data analysts, data scientists, statisticians

Please take the course here.

Saturday, September 18, 2021

Taking Operational Efficiency to the Next Level: Leverage the 95-5 Rule of Automation

Through an odyssey of over two decades helping clients in various industries solve hard problems, I have gained a deep appreciation of a pattern that can be leveraged to dramatically improve the quality and efficiency of the work and, ultimately, the return on investment of businesses.

Whether it is inventory planning, detecting fraudulent financial transactions, or finding costly insurance claims, it generally holds true that 95% of the work can be handled by automated algorithms. The remaining 5% needs to be done by domain experts using their expertise, intuition, and creativity. I call this the 95-5 rule of automation.

The 95-5 rule is not simply a flat division of labor between machines and human experts in that proportion; it has a structural and temporal implication. Algorithms are first applied to the raw problem, which involves a large number of cases and a lot of data and is hard or inefficient to solve manually. This step produces as its output a simpler problem in which the work is greatly reduced, typically by 95%. Human experts then work on this reduced problem and make their judgment calls to reach the final decision.

Take as an example our solution to a worker compensation insurance claim problem. A company receives about 200 worker injury claims daily. Our algorithm highlights 10 (5% of the total) of them as potentially costly using a machine learning model based on factors including age, cause of injury, and injury body parts. Using these 10 cases as a starting point, analysts review them carefully and take proper action. The solution has resulted in a 40% reduction in claim loss.

To recap, in the real world the 95-5 rule of automation works this way: apply algorithms to the raw problem to reduce the work by 95%, and then have human experts take on the reduced problem.

Here are the benefits as reported by our clients that have adopted solutions based on the 95-5 rule of automation.

  • Improved outcomes. For example, a bank sees its fraud loss reduced by 70%. Another bank finds the bad debt rate dropping by 50%.
  • Increased efficiency. In a K12 education company, content tagging is 100 times more efficient than a manual process.
  • More jobs. A group in a bank hires more analysts because the operation there drives a good return on investment.
  • Improved employee morale. This is because employees work on the reduced problem, where the same amount of effort generates more fruitful outcomes. (I did not realize this point until I saw a report produced by an independent department at a client company.)

When the rule is applied to inventory planning, our advanced optimization algorithm generates a set of recommended safety stocks for all items which serves as the foundation for planners to make further improvements.

One lesson that we have learned is that, unless the circumstances are exceptionally simple, domain experts should not work on the raw problem directly. Unfortunately, this principle is violated every day, resulting in ineffective, inefficient, and unscalable operations and a stressed workforce. The whole situation is avoidable.

The 95-5 rule of automation has worked remarkably well for us. I hope you make the most of it in your organization and take operational efficiency to the next level.