✧    ✧                                        ✧ 
                                                 ✧
   ✦                 ✧   ▅   ✦                     ▅     
✧        ▁✧              █    ✧  ▃ ✦▇      ✧✦  ☽ ✦ █     
    ▇ ▁  █               █     ▇▅█  █▃    ✦✦       █     
    █▁█▅▁█ ▃▃▁✧▇▃✧ ▁▁  ▁▅█  ▁ ✦███▁ ██▅  ✦      ▇  █ ▇   
    ██████▅███▅██▅▁██▁▁███▁▅█▁▅████▅███▁▁▅▁▅▅▅▁▁█▁▁█▁█▁▁▁

Skyline computation is an essential database operation that has many applications in multi-criteria decision making scenarios such as recommender systems. With the growth of data, efficiently finding interesting data within large datasets has become essential. Skyline queries are a method to select a subset of data by returning tuples that are not dominated by any other tuple. A tuple t dominates another tuple s if t is not worse than s in any attribute and is strictly better in at least one. As such, Skyline queries have emerged as an increasingly popular tool for identifying a set of interesting objects that balance different user-specified criteria.

Given a multidimensional data set, where the dimensions correspond to the criteria that need to be balanced, a Skyline query returns a set of interesting data points, aka. skyline points, that are not dominated by any other point in all dimensions. A point m dominates another point n, if m is better than or equal to n in all dimensions and strictly better than n in at least one dimension.

An Illustrative Example

The best known example use case for a skyline query is a hotel booking scenario where users are looking for hotels. Assume many hotels are available and the user wants to find one based on three criteria: distance to the center, hotel rating and price per night. Further assume that the user is unable to say which of these criteria is more important. So, we need to look for hotels representing a good combination of all three criteria. The skyline consists of all hotels that represent a “good” combinations of the three criteria. For each of the other hotels, there is always at least one hotel in the skyline that is better with respect to the three criteria. So, being presented the skyline, the user gets an overview of the available hotels and can make the final decision with respect to her personal preferences for the three criteria.

Example Hotel Dataset

HOTEL_ID RATING PRICE DISTANCE_FROM_CENTER
a 4 200 3
b 5 300 4
c 1 600 5
d 2 700 1
e 4 100 2
f 5 290 3
g 1 190 5
h 1 290 5
i 5 190 1
j 2 490 5
k 5 490 1

Skyline of the Example Hotel Dataset

Screenshot 2024-05-22 at 21-31-38 Online Interactive 3D Scatter Plot
Figure 1. Here Hotels i and e are better or equal to all hotels in all dimensions and strictly better than other hotels in at least one dimension. Thus, i and e form the Skyline in this 3 Dimensional space.
Chart(2)
Figure 2. Hotels i and e compared hotels c and f. While hotel f, like hotel i, has a rating of 5, it is more expensinve than hotel i and and farther from the City Center than hotel i. So f is not on the Skyline. Hotel c has lower rating, farther from the City Center and pricey, so it is not on the Skyline at all.

Skyline using SQL

The following are few different options for calculating the Skyline or Pareto Frontier or Pareto Front using SQL

Option 1

SELECT hotel_id, price, rating, distance_from_center FROM hotels AS o WHERE
 not EXISTS(
SELECT 1 FROM hotels AS i WHERE
  i.price <= o.price AND i.rating >= o.rating and i.distance_from_center <= o.distance_from_center
  AND (i.price < o.price OR i.rating > o.rating OR i.distance_from_center < o.distance_from_center)
);

Query Output

HOTEL_ID RATING PRICE DISTANCE_FROM_CENTER
e 4 100 2
i 5 190 1

Option 2

SELECT p.*
FROM hotels p
LEFT JOIN hotels q
  ON (q.price <= p.price AND q.rating >= p.rating AND q.distance_from_center < p.distance_from_center)
  OR (q.price < p.price AND q.rating >= p.rating AND q.distance_from_center <= p.distance_from_center)
  OR (q.price <= p.price AND q.rating > p.rating AND q.distance_from_center <= p.distance_from_center)
WHERE q.price IS NULL AND q.rating IS NULL and q.distance_from_center is NULL;

Query Output

HOTEL_ID RATING PRICE DISTANCE_FROM_CENTER
e 4 100 2
i 5 190 1

Option 3

WITH ranked AS (
  SELECT
    *
    , dense_rank() OVER (ORDER BY price ASC, rating DESC, distance_from_center ASC ) AS rank1
    , dense_rank() OVER (ORDER BY rating DESC, price ASC, rating DESC, distance_from_center ASC ) AS rank2
    , dense_rank() OVER (ORDER BY distance_from_center ASC, rating DESC, price ASC, rating DESC) AS rank3

  FROM hotels
)
SELECT *
FROM ranked
WHERE rank1 = 1 OR rank2 = 1 OR rank3 = 1;

Query Output

HOTEL_ID RATING PRICE DISTANCE_FROM_CENTER RANK1 RANK2 RANK3
i 5 190 1 2 1 1
e 4 100 2 1 5 4

See also