Optimizing query performance using OpenSearch indexing - OpenSearch Documentation
# Optimizing query performance using OpenSearch indexing

**Introduced 2.11**
Query performance can be slow when using external data sources for reasons such as network latency, data transformation, and data volume. You can optimize your query performance by using OpenSearch indexes, such as a skipping index or a covering index:

- A **skipping index** uses skip acceleration methods, such as partition, minimum and maximum values, and value sets, to ingest and create compact aggregate data structures. This makes it an economical option for direct querying scenarios. For more information, see [Skipping indexes](#skipping-indexes).
- A **covering index** ingests all or some of the data from the source into OpenSearch and makes it possible to use all OpenSearch Dashboards and plugin functionality. For more information, see [Covering indexes](#covering-indexes).
- A **materialized view** enhances query performance by storing precomputed and aggregated data from the source data. For more information, see [Materialized views](#materialized-views).

For comprehensive guidance on each indexing process, see the Flint Index Reference Manual.
## Data sources use case: Accelerate performance

To get started with accelerating query performance, perform the following steps:

1. Go to **OpenSearch Plugins** > **Query Workbench** and select your data source from the **Data sources** dropdown menu.
2. From the navigation menu, select a database.
3. View the results in the table and confirm that you have the correct data.
4. Create an OpenSearch index by following these steps:
   1. Select **Accelerate data**. A pop-up window appears.
   2. Enter your database and table details under **Select data fields**.
   3. For **Acceleration type**, select the type of acceleration according to your use case and then enter the information for your acceleration type. For more information, see the following sections:
      - [Skipping indexes](#skipping-indexes)
      - [Covering indexes](#covering-indexes)
      - [Materialized views](#materialized-views)
## Skipping indexes

A skipping index uses skip acceleration methods, such as partition, minimum and maximum values, and value sets, to ingest data using compact aggregate data structures. This makes it an economical option for direct querying scenarios.

With a skipping index, you can index only the metadata of the data stored in Amazon S3. When you query a table with a skipping index, the query planner references the index and rewrites the query to efficiently locate the data, instead of scanning all partitions and files. This allows the skipping index to quickly narrow down the specific location of the stored data.
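The pruning behavior described above can be sketched in a few lines of Python. This is a conceptual illustration only, with hypothetical file names and statistics, not the actual Flint implementation:

```python
# Sketch of skip-index pruning: the index stores only per-file metadata
# (here, the min and max of an indexed column), so a query planner can
# skip any file whose value range cannot contain the predicate value.
# File names and statistics are hypothetical.

files = {
    "part-0001.parquet": {"min": 10, "max": 99},
    "part-0002.parquet": {"min": 100, "max": 499},
    "part-0003.parquet": {"min": 500, "max": 999},
}

def files_to_scan(predicate_value, index):
    """Return only the files whose [min, max] range can match the value."""
    return [
        name
        for name, stats in index.items()
        if stats["min"] <= predicate_value <= stats["max"]
    ]

# A query like `WHERE srcport = 443` now touches one file instead of three.
print(files_to_scan(443, files))  # ['part-0002.parquet']
```

Because only metadata is stored, the index stays small even when the underlying S3 data is large.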
### Define skipping index settings

1. Under **Skipping index definition**, select **Generate** to automatically generate a skipping index. Alternatively, to manually choose the fields you want to add, select **Add fields**. Choose from the following types:
   - **Partition**: Uses data partition details to locate data. This type is best for partitioning-based columns, such as year, month, day, or hour.
   - **MinMax**: Uses the lower and upper bounds of the indexed column to locate data. This type is best for numeric columns.
   - **ValueSet**: Uses a unique value set to locate data. This type is best for columns with low to moderate cardinality that require exact matching.
   - **BloomFilter**: Uses the Bloom filter algorithm to locate data. This type is best for columns with high cardinality that do not require exact matching.
2. Select **Create acceleration** to apply your skipping index settings.
3. View the skipping index query details and then select **Run**. OpenSearch adds your index to the left navigation pane.
Alternatively, you can manually create a skipping index using Query Workbench. Select your data source from the dropdown menu and run a query such as the following:

```sql
CREATE SKIPPING INDEX
ON datasourcename.gluedatabasename.vpclogstable (
    `srcaddr` BLOOM_FILTER,
    `dstaddr` BLOOM_FILTER,
    `day` PARTITION,
    `account_id` BLOOM_FILTER
)
WITH (
    index_settings = '{"number_of_shards":5,"number_of_replicas":1}',
    auto_refresh = true,
    checkpoint_location = 's3://accountnum-vpcflow/AWSLogs/checkpoint'
)
```
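The `BLOOM_FILTER` type used above trades a small false-positive rate for very compact storage, which is why it suits high-cardinality columns such as IP addresses. A toy Python sketch of the idea follows; the sizes, hashing scheme, and class are illustrative assumptions, not the production algorithm:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions over an m-bit array.
    Membership tests can return false positives but never false
    negatives, so a planner can safely skip a file whose filter
    answers 'absent'."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, value):
        # Derive k deterministic positions by salting the value.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = True

    def might_contain(self, value):
        return all(self.bits[pos] for pos in self._positions(value))

# One filter per data file: the file is skipped only when the answer is False.
f = BloomFilter()
f.add("10.0.0.1")
print(f.might_contain("10.0.0.1"))  # True
```

A value that was never added almost always tests `False`, so most files are skipped; the rare false positive only costs an unnecessary scan, never a missed result.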
## Covering indexes

A covering index ingests all or some of the data from the source into OpenSearch and makes it possible to use all OpenSearch Dashboards and plugin functionality.

With a covering index, you can ingest data from a specified column in a table. This is the most performant of the three indexing types. Because OpenSearch ingests all data from your desired column, you get better performance and can perform advanced analytics.

OpenSearch creates a new index from the covering index data. You can use this new index to create visualizations or for anomaly detection and geospatial capabilities. You can manage the covering index with Index State Management. For more information, see Index State Management.
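The performance advantage comes from locality: every covered column already lives in OpenSearch, so a query over those columns never goes back to the source. A rough Python sketch of that idea, with plain dictionaries standing in for OpenSearch documents and hypothetical VPC flow log rows:

```python
# Sketch: once the covered columns are ingested as documents, a query
# over those columns is answered entirely from the index; the flag below
# stands in for an expensive S3 scan that never happens.
# Rows are hypothetical; column names follow the VPC flow log example.

source_table_scanned = False

covering_index = [
    {"srcaddr": "10.0.0.1", "dstport": 443, "action": "ACCEPT"},
    {"srcaddr": "10.0.0.2", "dstport": 22,  "action": "REJECT"},
    {"srcaddr": "10.0.0.3", "dstport": 443, "action": "ACCEPT"},
]

def query_covered(index, **filters):
    """Answer an equality-filter query using only ingested documents."""
    return [
        doc for doc in index
        if all(doc.get(col) == val for col, val in filters.items())
    ]

hits = query_covered(covering_index, dstport=443, action="ACCEPT")
print(len(hits), source_table_scanned)  # 2 False
```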
### Define covering index settings

1. For **Index name**, enter a valid index name. Note that each table can have multiple covering indexes.
2. Choose a **Refresh type**. By default, OpenSearch automatically refreshes the index. Otherwise, you must manually trigger a refresh using a `REFRESH` statement.
3. Enter a **Checkpoint location**, which is a path for refresh job checkpoints. The location must be a path in a file system compatible with the Hadoop Distributed File System (HDFS). For more information, see Starting streaming queries.
4. Define the covering index fields by selecting **(add fields here)** under **Covering index definition**.
5. Select **Create acceleration** to apply your covering index settings.
6. View the covering index query details and then select **Run**. OpenSearch adds your index to the left navigation pane.
Alternatively, you can manually create a covering index on your table using Query Workbench. Select your data source from the dropdown menu and run a query such as the following:

```sql
CREATE INDEX vpc_covering_index
ON datasourcename.gluedatabasename.vpclogstable (
    version, account_id, interface_id, srcaddr, dstaddr,
    srcport, dstport, protocol, packets, bytes, start,
    action, log_status STRING,
    `aws-account-id`, `aws-service`, `aws-region`,
    year, month, day, hour
)
WITH (
    auto_refresh = true,
    refresh_interval = '15 minute',
    checkpoint_location = 's3://accountnum-vpcflow/AWSLogs/checkpoint'
)
```
## Materialized views

With materialized views, you can use complex queries, such as aggregations, to power Dashboards visualizations. Materialized views ingest a small amount of your data, depending on the query, into OpenSearch. OpenSearch then forms an index from the ingested data that you can use for visualizations. You can manage the materialized view index with Index State Management. For more information, see Index State Management.
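The benefit described above comes from computing the aggregation once at refresh time rather than on every dashboard query. A minimal Python sketch of that trade-off, using hypothetical per-flow traffic rows:

```python
from collections import defaultdict

# Raw source rows (hypothetical): per-flow byte counts by region.
flows = [
    {"region": "us-east-1", "bytes": 100},
    {"region": "us-east-1", "bytes": 250},
    {"region": "us-west-2", "bytes": 75},
]

def refresh_materialized_view(rows):
    """Precompute the aggregate once; dashboards then read it directly."""
    view = defaultdict(int)
    for row in rows:
        view[row["region"]] += row["bytes"]
    return dict(view)

mview = refresh_materialized_view(flows)

# A visualization query is now a cheap lookup instead of a full aggregation
# over the raw rows.
print(mview["us-east-1"])  # 350
```

Only the aggregated result (one document per region here) is stored, which is why a materialized view ingests far less data than a covering index over the same table.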
### Define materialized view settings

1. For **Index name**, enter a valid index name. Note that each table can have multiple materialized views.
2. Choose a **Refresh type**. By default, OpenSearch automatically refreshes the index. Otherwise, you must manually trigger a refresh using a `REFRESH` statement.
3. Enter a **Checkpoint location**, which is a path for refresh job checkpoints. The location must be a path in an HDFS-compatible file system.
4. Enter a **Watermark delay**, which defines how late data can arrive and still be processed, such as 1 minute or 10 seconds.
5. Define the materialized view fields under **Materialized view definition**.
6. Select **Create acceleration** to apply your materialized view settings.
7. View the materialized view query details and then select **Run**. OpenSearch adds your index to the left navigation pane.
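The **Watermark delay** setting bounds how long the refresh job waits for late-arriving events before finalizing a window. A simplified Python sketch of that semantics follows; the timestamps are hypothetical, and the real mechanism is Spark Structured Streaming's watermarking, not this loop:

```python
# Sketch of watermark semantics: an event is accepted only if it is no
# older than (latest event time seen so far) - (watermark delay).
# Timestamps are hypothetical event times in seconds.

WATERMARK_DELAY = 60  # seconds, analogous to a '1 Minute' setting

def process_stream(events, delay=WATERMARK_DELAY):
    max_event_time = float("-inf")
    accepted, dropped = [], []
    for t in events:
        max_event_time = max(max_event_time, t)
        if t >= max_event_time - delay:
            accepted.append(t)
        else:
            dropped.append(t)  # too late: outside the watermark window
    return accepted, dropped

# The event at t=100 arrives 90 s behind the newest event (t=190),
# which exceeds the 60 s delay, so it is dropped.
accepted, dropped = process_stream([120, 150, 190, 100, 160])
print(dropped)  # [100]
```

A larger delay tolerates later data at the cost of holding windows open (and state) longer before results are finalized.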
Alternatively, you can manually create a materialized view on your table using Query Workbench. Select your data source from the dropdown menu and run a query such as the following:

```sql
CREATE MATERIALIZED VIEW table_name__week_live_mview AS
SELECT
    cloud.account_uid AS `aws.vpc.cloud_account_uid`,
    cloud.region AS `aws.vpc.cloud_region`,
    cloud.zone AS `aws.vpc.cloud_zone`,
    cloud.provider AS `aws.vpc.cloud_provider`,
    CAST(IFNULL(src_endpoint.port, 0) AS LONG) AS `aws.vpc.srcport`,
    CAST(IFNULL(src_endpoint.svc_name, 'Unknown') AS STRING) AS `aws.vpc.pkt-src-aws-service`,
    CAST(IFNULL(src_endpoint.ip, '0.0.0.0') AS STRING) AS `aws.vpc.srcaddr`,
    CAST(IFNULL(src_endpoint.interface_uid, 'Unknown') AS STRING) AS `aws.vpc.src-interface_uid`,
    CAST(IFNULL(src_endpoint.vpc_uid, 'Unknown') AS STRING) AS `aws.vpc.src-vpc_uid`,
    CAST(IFNULL(src_endpoint.instance_uid, 'Unknown') AS STRING) AS `aws.vpc.src-instance_uid`,
    CAST(IFNULL(src_endpoint.subnet_uid, 'Unknown') AS STRING) AS `aws.vpc.src-subnet_uid`,
    CAST(IFNULL(dst_endpoint.port, 0) AS LONG) AS `aws.vpc.dstport`,
    CAST(IFNULL(dst_endpoint.svc_name, 'Unknown') AS STRING) AS `aws.vpc.pkt-dst-aws-service`,
    CAST(IFNULL(dst_endpoint.ip, '0.0.0.0') AS STRING) AS `aws.vpc.dstaddr`,
    CAST(IFNULL(dst_endpoint.interface_uid, 'Unknown') AS STRING) AS `aws.vpc.dst-interface_uid`,
    CAST(IFNULL(dst_endpoint.vpc_uid, 'Unknown') AS STRING) AS `aws.vpc.dst-vpc_uid`,
    CAST(IFNULL(dst_endpoint.instance_uid, 'Unknown') AS STRING) AS `aws.vpc.dst-instance_uid`,
    CAST(IFNULL(dst_endpoint.subnet_uid, 'Unknown') AS STRING) AS `aws.vpc.dst-subnet_uid`,
    CASE
        WHEN regexp(dst_endpoint.ip, '(10\\..*)|(192\\.168\\..*)|(172\\.1[6-9]\\..*)|(172\\.2[0-9]\\..*)|(172\\.3[0-1]\\.*)')
        THEN 'ingress'
        ELSE 'egress'
    END AS `aws.vpc.flow-direction`,
    CAST(IFNULL(connection_info['protocol_num'], 0) AS INT) AS `aws.vpc.connection.protocol_num`,
    CAST(IFNULL(connection_info['tcp_flags'], '0') AS STRING) AS `aws.vpc.connection.tcp_flags`,
    CAST(IFNULL(connection_info['protocol_ver'], '0') AS STRING) AS `aws.vpc.connection.protocol_ver`,
    CAST(IFNULL(connection_info['boundary'], 'Unknown') AS STRING) AS `aws.vpc.connection.boundary`,
    CAST(IFNULL(connection_info['direction'], 'Unknown') AS STRING) AS `aws.vpc.connection.direction`,
    CAST(IFNULL(traffic.packets, 0) AS LONG) AS `aws.vpc.packets`,
    CAST(IFNULL(traffic.bytes, 0) AS LONG) AS `aws.vpc.bytes`,
    CAST(FROM_UNIXTIME(time / 1000) AS TIMESTAMP) AS `@timestamp`,
    CAST(FROM_UNIXTIME(start_time / 1000) AS TIMESTAMP) AS `start_time`,
    CAST(FROM_UNIXTIME(start_time / 1000) AS TIMESTAMP) AS `interval_start_time`,
    CAST(FROM_UNIXTIME(end_time / 1000) AS TIMESTAMP) AS `end_time`,
    status_code AS `aws.vpc.status_code`,
    severity AS `aws.vpc.severity`,
    class_name AS `aws.vpc.class_name`,
    category_name AS `aws.vpc.category_name`,
    activity_name AS `aws.vpc.activity_name`,
    disposition AS `aws.vpc.disposition`,
    type_name AS `aws.vpc.type_name`,
    region AS `aws.vpc.region`,
    accountid AS `aws.vpc.account-id`
FROM
    datasourcename.gluedatabasename.vpclogstable
WITH (
    auto_refresh = true,
    refresh_interval = '15 Minute',
    checkpoint_location = 's3://accountnum-vpcflow/AWSLogs/checkpoint',
    watermark_delay = '1 Minute'
)
```
## Limitations

This feature is still under development, so there are some limitations. For real-time updates, see the developer documentation on GitHub.