As mentioned at the beginning of this hands-on, there are two major use cases for Amazon ES: log analysis and full-text search. Log analysis was already covered in Lab 2. In Lab 4, let’s look at some more advanced usage, focusing on full-text search.
So far, you have mainly tried the log analysis features of Amazon ES. However, as the name indicates, Elasticsearch was originally developed as a product for full-text search. In this section, let’s try full-text search using Amazon ES.
First, you will create a new index to perform full-text search.
Click the Dev tools icon on the left of the screen to open the Dev tools menu.
Copy the following block of code to the “Console” below, and click the ▶ button on the right to execute the API. This example creates an index called mydocs that has only one field, “content”. In Lab 2, the index was created with field mappings that Amazon ES detected automatically when data was inserted. Here, the mapping is defined in advance so that the way the text is analyzed is specified explicitly.
PUT mydocs
{
"mappings" : {
"properties" : {
"content" : {
"type" : "text",
"analyzer": "standard"
}
}
}
}
POST mydocs/_bulk
{"index":{"_index":"mydocs","_type":"_doc"}}
{"content":"Amazon Redshift is a high speed enterprise grade data warehouse service."}
{"index":{"_index":"mydocs","_type":"_doc"}}
{"content":"Amazon Web Services offers various kinds of analytics services."}
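The body of the _bulk request above is newline-delimited JSON: each document is preceded by an action line. As a rough illustration, the following Python sketch builds the same body from a list of documents. The helper name build_bulk_body is hypothetical and only produces the text of the request; it does not send anything to Amazon ES.

```python
import json


def build_bulk_body(index_name, documents):
    """Return an NDJSON string: one action line followed by one source line per document."""
    lines = []
    for document in documents:
        # Action line, mirroring the metadata used in the request above.
        lines.append(json.dumps({"index": {"_index": index_name, "_type": "_doc"}}))
        # Source line: the document itself.
        lines.append(json.dumps(document))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


docs = [
    {"content": "Amazon Redshift is a high speed enterprise grade data warehouse service."},
    {"content": "Amazon Web Services offers various kinds of analytics services."},
]
body = build_bulk_body("mydocs", docs)
print(body)
```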
When creating the index in step 2 above, "analyzer": "standard" was set. When an analyzer is specified, Elasticsearch automatically analyzes text fields to make them easier to search later. The standard analyzer is the default analyzer in Amazon ES and provides a variety of settings to help you search. For more information, see the Elasticsearch documentation.
Now, you will actually execute a search query against the index created above.
Execute the following query with the _search API. As a query parameter, a condition on the “content” field is specified; here it matches documents that contain “redshift”.
GET mydocs/_search?q=content:"redshift"
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "mydocs",
"_type" : "_doc",
"_id" : "YKXvBXEBdQ_VtJWAeY85",
"_score" : 0.2876821,
"_source" : {
"content" : "Amazon Redshift is a high speed enterprise grade data warehouse service."
}
}
]
}
}
GET mydocs/_analyze
{
"analyzer": "standard",
"text": "Amazon Redshift is a high speed enterprise grade data warehouse service."
}
{
"tokens" : [
{
"token" : "amazon",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "redshift",
"start_offset" : 7,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "is",
"start_offset" : 16,
"end_offset" : 18,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "a",
"start_offset" : 19,
"end_offset" : 20,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "high",
"start_offset" : 21,
"end_offset" : 25,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "speed",
"start_offset" : 26,
"end_offset" : 31,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "enterprise",
"start_offset" : 32,
"end_offset" : 42,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "grade",
"start_offset" : 43,
"end_offset" : 48,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "data",
"start_offset" : 49,
"end_offset" : 53,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "warehouse",
"start_offset" : 54,
"end_offset" : 63,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "service",
"start_offset" : 64,
"end_offset" : 71,
"type" : "<ALPHANUM>",
"position" : 10
}
]
}
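To get a feel for what the _analyze output above represents, here is a minimal Python sketch that approximates it for plain English text. It is only an approximation under a stated assumption: it splits on runs of word characters with a regular expression and lowercases them, whereas the real standard analyzer uses Unicode text segmentation rules. The function name simple_standard_analyze is hypothetical.

```python
import re


def simple_standard_analyze(text):
    """Approximate the standard analyzer for simple English text (illustration only)."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": match.group().lower(),   # terms are lowercased
            "start_offset": match.start(),    # character offset where the term starts
            "end_offset": match.end(),        # character offset just past the term
            "position": position,             # ordinal position of the term
        })
    return tokens


sentence = "Amazon Redshift is a high speed enterprise grade data warehouse service."
tokens = simple_standard_analyze(sentence)
for token in tokens:
    print(token)
```

Because “Redshift” is stored as the lowercased term “redshift”, the earlier search for content:"redshift" finds the document regardless of the capitalization in the original text.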
Next, you will set up synonyms. When users search, their keywords do not always match the exact words in the text body; they may use keywords with similar meanings but different expressions. By setting synonyms, you can return appropriate search results in such cases.
DELETE mydocs
PUT mydocs
{
"mappings" : {
"properties" : {
"content" : {
"type" : "text",
"analyzer": "my_analyzer"
}
}
},
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"my_synonym",
"lowercase",
"stop"
]
}
},
"filter": {
"my_synonym" : {
"type": "synonym",
"synonyms": [
"amazon web services,aws,cloud",
"redshift,rs,dwh"
]
}
}
}
}
}
}
POST mydocs/_bulk
{"index":{"_index":"mydocs","_type":"_doc"}}
{"content":"Amazon Redshift is a high speed enterprise grade data warehouse service."}
{"index":{"_index":"mydocs","_type":"_doc"}}
{"content":"Amazon Web Services offers various kinds of analytics services."}
GET mydocs/_search?q=content:"aws"
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.8630463,
"hits" : [
{
"_index" : "mydocs",
"_type" : "_doc",
"_id" : "maUaBnEBdQ_VtJWA-48R",
"_score" : 0.8630463,
"_source" : {
"content" : "Amazon Web Services offers various kinds of analytics services."
}
}
]
}
}
GET mydocs/_analyze
{
"analyzer": "my_analyzer",
"text": "Amazon Web Services offers various kinds of analytics services."
}
{
"tokens" : [
{
"token" : "amazon",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "web",
"start_offset" : 7,
"end_offset" : 10,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "services",
"start_offset" : 11,
"end_offset" : 19,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "offers",
"start_offset" : 20,
"end_offset" : 26,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "various",
"start_offset" : 27,
"end_offset" : 34,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "kind",
"start_offset" : 35,
"end_offset" : 39,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "analytics",
"start_offset" : 43,
"end_offset" : 52,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "services",
"start_offset" : 53,
"end_offset" : 61,
"type" : "<ALPHANUM>",
"position" : 8
}
]
}
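The idea behind the synonym rules can be illustrated with a small Python sketch. This is not how Elasticsearch implements the synonym filter internally; it simply treats each rule as a group of equivalent expressions and checks whether any expression from the query term's group appears in the document. The function names parse_rules, expand, and matches are hypothetical.

```python
import re

SYNONYM_RULES = [
    "amazon web services,aws,cloud",
    "redshift,rs,dwh",
]


def parse_rules(rules):
    """Map every expression to the full group of equivalent expressions."""
    groups = {}
    for rule in rules:
        terms = [term.strip().lower() for term in rule.split(",")]
        for term in terms:
            groups[term] = terms
    return groups


def expand(term, groups):
    """Return all expressions equivalent to the term (or just the term itself)."""
    return groups.get(term.lower(), [term.lower()])


def matches(query_term, document, groups):
    """True if the document contains any expression equivalent to the query term."""
    words = re.findall(r"\w+", document.lower())
    padded = " " + " ".join(words) + " "
    return any(f" {candidate} " in padded for candidate in expand(query_term, groups))


groups = parse_rules(SYNONYM_RULES)
documents = [
    "Amazon Redshift is a high speed enterprise grade data warehouse service.",
    "Amazon Web Services offers various kinds of analytics services.",
]
for document in documents:
    print(matches("aws", document, groups), document)
```

In this simplified model, “aws” matches only the second document, because “aws” belongs to the same group as “amazon web services”.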
In this section, you have tried synonyms and filters in full-text search. However, this covers only a small part of the full-text search features. Beyond the settings you have made in this section, Amazon ES allows you to configure a variety of full-text search settings. Please read the Elasticsearch documentation for more details.
In the previous section, you learned how to apply synonyms by embedding them directly into mapping templates. This approach works well if you only use a small number of synonyms. However, handling millions of synonyms in a mapping template is heavy work and hard to manage. In this section, you will import a synonym file as a package and associate it with an existing Amazon ES domain.
To use a synonym file in an Amazon ES domain, you will upload the file to S3 and import it as a package. Then you will associate the imported package with the Amazon ES domain.
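As an illustration of what such a synonym file might contain, the following Python sketch writes the two synonym groups from the previous section to a text file, one rule per line with comma-separated terms. The file name synonyms.txt and the helper write_synonym_file are assumptions made for this example; the sketch only creates a local file and does not upload it to S3.

```python
import tempfile
from pathlib import Path


def write_synonym_file(path, rules):
    """Write one comma-separated synonym rule per line and return the written text."""
    text = "\n".join(", ".join(group) for group in rules) + "\n"
    Path(path).write_text(text, encoding="utf-8")
    return text


rules = [
    ["amazon web services", "aws", "cloud"],
    ["redshift", "rs", "dwh"],
]

# A temporary directory is used here so that the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "synonyms.txt"  # hypothetical file name
    written = write_synonym_file(target, rules)
    read_back = target.read_text(encoding="utf-8")

print(read_back)
```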
Now you can execute a query and confirm the effect of synonyms.
DELETE mydocs
PUT mydocs
{
"mappings" : {
"properties" : {
"content" : {
"type" : "text",
"analyzer": "my_analyzer"
}
}
},
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"my_synonym",
"lowercase",
"stop"
]
}
},
"filter": {
"my_synonym" : {
"type": "synonym",
"synonyms_path": "analyzers/F0123456789"
}
}
}
}
}
}
POST mydocs/_bulk
{"index":{"_index":"mydocs","_type":"_doc"}}
{"content":"Amazon Redshift is a high speed enterprise grade data warehouse service."}
{"index":{"_index":"mydocs","_type":"_doc"}}
{"content":"Amazon Web Services offers various kinds of analytics services."}
GET mydocs/_search?q=content:"aws"
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.8630463,
"hits" : [
{
"_index" : "mydocs",
"_type" : "_doc",
"_id" : "maUaBnEBdQ_VtJWA-48R",
"_score" : 0.8630463,
"_source" : {
"content" : "Amazon Web Services offers various kinds of analytics services."
}
}
]
}
}
In this section, you have tried packages in Amazon ES. With packages, you can easily manage your own synonym files separately from the Amazon ES domain and its mapping template. This feature is especially useful for full-text search use cases. Please read the Amazon ES documentation for more details.
So far, you have learned how to write search queries that call the _search API directly. However, Elasticsearch queries must be written as nested JSON, which is troublesome to write. There are several ways to solve this problem. For example, Python has a high-level query library called Elasticsearch DSL and a low-level client called Python Elasticsearch Client, both of which allow you to write and execute queries relatively easily.
In this section, you will write a query using SQL, a third method not mentioned so far. Using the SQL function of Open Distro, you can issue Elasticsearch queries using familiar SQL.
Again, you will execute the API using Dev tools as before.
Click the Dev tools icon on the left of the screen to open the Dev tools menu.
Copy the following block of code to the “Console” below, and click the ▶ button on the right to execute the API. This query targets all indexes that match “workshop-log*” and aggregates the number of records and the average temperature per status. As described in Section 2 of Lab 2, grouping requires a field of keyword type. Here we use status.keyword.
POST _opendistro/_sql
{
"query": """
select
status.keyword
, count(*) as cnt
, avg(currentTemperature) as avgTemperature
from
workshop-log*
group by
status.keyword
"""
}
{
"schema": [
{
"name": "status.keyword",
"type": "double"
},
{
"name": "cnt",
"alias": "cnt",
"type": "double"
},
{
"name": "avgTemperature",
"alias": "avgTemperature",
"type": "double"
}
],
"total": 3,
"datarows": [
[
"OK",
86425,
79.99312698871854
],
[
"WARN",
7673,
79.42864590121204
],
[
"FAIL",
1940,
80.93762886597938
]
],
"size": 3,
"status": 200
}
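The response above separates column metadata ("schema") from the values ("datarows"). If you process such a response in a program, you will often want one record per row. The following Python sketch shows one way to do that; the helper name rows_as_dicts is hypothetical, and the sample response is a trimmed copy of the output above.

```python
def rows_as_dicts(response):
    """Combine column names from "schema" with each entry of "datarows"."""
    # Prefer the alias when one is present, otherwise fall back to the column name.
    names = [column.get("alias", column["name"]) for column in response["schema"]]
    return [dict(zip(names, row)) for row in response["datarows"]]


sample_response = {
    "schema": [
        {"name": "status.keyword", "type": "double"},
        {"name": "cnt", "alias": "cnt", "type": "double"},
        {"name": "avgTemperature", "alias": "avgTemperature", "type": "double"},
    ],
    "datarows": [
        ["OK", 86425, 79.99312698871854],
        ["WARN", 7673, 79.42864590121204],
        ["FAIL", 1940, 80.93762886597938],
    ],
}

records = rows_as_dicts(sample_response)
for record in records:
    print(record)
```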
You can also receive the result in CSV format by adding format=csv to the request as follows.
POST _opendistro/_sql?format=csv
{
"query": """
select
status.keyword
, count(*) as cnt
, avg(currentTemperature) as avgTemperature
from
workshop-log*
group by
status.keyword
"""
}
In addition to csv, other formats such as jdbc and raw are also supported. See the Open Distro document for more information.
status.keyword,cnt,avgTemperature
OK,96930.0,79.99918497885072
WARN,8595.0,79.84409540430482
FAIL,2162.0,80.45189639222941
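CSV output like the above is convenient to load into other tools. As a small sketch, the following Python code parses that text with the standard csv module and converts the numeric columns. The function name parse_sql_csv is hypothetical, and the sample text is copied from the output above.

```python
import csv
import io


def parse_sql_csv(text):
    """Parse the CSV text returned with format=csv into typed records."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {
            "status": row["status.keyword"],
            # Counts arrive as text such as "96930.0", so convert via float first.
            "cnt": int(float(row["cnt"])),
            "avgTemperature": float(row["avgTemperature"]),
        }
        for row in reader
    ]


csv_text = """status.keyword,cnt,avgTemperature
OK,96930.0,79.99918497885072
WARN,8595.0,79.84409540430482
FAIL,2162.0,80.45189639222941
"""

parsed = parse_sql_csv(csv_text)
for record in parsed:
    print(record)
```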
Now you know that you can execute queries in SQL. Next, let’s see what this query looks like when it is translated into an Elasticsearch query. Open Distro has the _opendistro/_sql/_explain API, which returns the Elasticsearch query equivalent to a query written in SQL.
POST _opendistro/_sql/_explain
{
"query": """
select
status.keyword
, count(*) as cnt
, avg(currentTemperature) as avgTemperature
from
workshop-log*
group by
status.keyword
"""
}
The response is the equivalent Elasticsearch query, which can be sent to the _search API.
{
"from" : 0,
"size" : 0,
"_source" : {
"includes" : [
"status.keyword",
"COUNT",
"AVG"
],
"excludes" : [ ]
},
"stored_fields" : "status.keyword",
"aggregations" : {
"status#keyword" : {
"terms" : {
"field" : "status.keyword",
"size" : 200,
"min_doc_count" : 1,
"shard_min_doc_count" : 0,
"show_term_doc_count_error" : false,
"order" : [
{
"_count" : "desc"
},
{
"_key" : "asc"
}
]
},
"aggregations" : {
"cnt" : {
"value_count" : {
"field" : "_index"
}
},
"avgTemperature" : {
"avg" : {
"field" : "currentTemperature"
}
}
}
}
}
}
You can execute this query with the _search API as shown below.
POST _search
{
"from" : 0,
"size" : 0,
"_source" : {
"includes" : [
"status",
"COUNT",
"AVG"
],
"excludes" : [ ]
},
"stored_fields" : "status",
"aggregations" : {
"status.keyword" : {
"terms" : {
"field" : "status.keyword",
"size" : 200,
"min_doc_count" : 1,
"shard_min_doc_count" : 0,
"show_term_doc_count_error" : false,
"order" : [
{
"_count" : "desc"
},
{
"_key" : "asc"
}
]
},
"aggregations" : {
"cnt" : {
"value_count" : {
"field" : "_index"
}
},
"avgTemperature" : {
"avg" : {
"field" : "currentTemperature"
}
}
}
}
}
}
The aggregation result corresponds to the one returned by _opendistro/_sql. Note that the normal _search API does not support output formats such as csv and raw.
{
"took" : 109,
"timed_out" : false,
"_shards" : {
"total" : 134,
"successful" : 134,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 10000,
"relation" : "gte"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"status.keyword" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "OK",
"doc_count" : 96932,
"cnt" : {
"value" : 96932
},
"avgTemperature" : {
"value" : 79.99918499566706
}
},
{
"key" : "WARN",
"doc_count" : 8595,
"cnt" : {
"value" : 8595
},
"avgTemperature" : {
"value" : 79.84409540430482
}
},
{
"key" : "FAIL",
"doc_count" : 2162,
"cnt" : {
"value" : 2162
},
"avgTemperature" : {
"value" : 80.45189639222941
}
}
]
}
}
}
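To see how the nested _search response relates to the flat rows returned by the SQL API, the following Python sketch flattens the aggregation buckets into rows of status, count, and average temperature. The helper name buckets_to_rows is hypothetical, and the sample is a trimmed copy of the response above.

```python
def buckets_to_rows(response, aggregation_name):
    """Flatten terms-aggregation buckets into [key, cnt, avgTemperature] rows."""
    buckets = response["aggregations"][aggregation_name]["buckets"]
    return [
        [bucket["key"], bucket["cnt"]["value"], bucket["avgTemperature"]["value"]]
        for bucket in buckets
    ]


search_response = {
    "aggregations": {
        "status.keyword": {
            "buckets": [
                {"key": "OK", "doc_count": 96932,
                 "cnt": {"value": 96932}, "avgTemperature": {"value": 79.99918499566706}},
                {"key": "WARN", "doc_count": 8595,
                 "cnt": {"value": 8595}, "avgTemperature": {"value": 79.84409540430482}},
                {"key": "FAIL", "doc_count": 2162,
                 "cnt": {"value": 2162}, "avgTemperature": {"value": 80.45189639222941}},
            ]
        }
    }
}

flat_rows = buckets_to_rows(search_response, "status.keyword")
for flat_row in flat_rows:
    print(flat_row)
```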
This completes the query execution with the _sql API. Note that this API does not support all standard SQL commands. For example, the only aggregate functions currently supported are avg(), count(), max(), min(), and sum(). Please see the Open Distro document for more details.
This lab focused on the advanced aspects of Elasticsearch search: full-text search, customization, and SQL API usage. The Elasticsearch Workshop is now complete. Please do not forget to clean up by following these steps. If you keep the resources you used in this workshop, you will continue to be charged.