As mentioned at the beginning of this hands-on, there are two major use cases for Amazon ES: log analysis and full-text search. Log analysis was covered in Lab 2. In Lab 4, let’s look at some more advanced usage, focusing on full-text search.
So far, you have mainly tried the log analysis features of Amazon ES. However, as the name indicates, Elasticsearch was originally developed as a product for full-text search. In this section, let’s try full-text search using Amazon ES.
First, you will create a new index to perform full-text search.
Click the icon on the left of the screen to open the Dev tools menu.
Copy the following block of code into the “Console” below, and click the ▶ button on the right to execute the API. In this example, an index called mydocs with a single field called “content” is created. In Lab 2, the index was created with field mappings that Amazon ES recognized automatically when the data was inserted. Here, the index is created with an explicit mapping defined in advance, so that the way the text is analyzed is clearly specified.
PUT mydocs
{
"mappings" : {
"properties" : {
"content" : {
"type" : "text",
"analyzer": "standard"
}
}
}
}
POST mydocs/_bulk
{"index":{"_index":"mydocs","_type":"_doc"}}
{"content":"Amazon Redshift is a high speed enterprise grade data warehouse service."}
{"index":{"_index":"mydocs","_type":"_doc"}}
{"content":"Amazon Web Services offers various kinds of analytics services."}
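The _bulk request above alternates an action line with a document line, one JSON object per line. As a minimal sketch of that layout, the following Python builds the same body from a list of documents. The function name `build_bulk_body` is hypothetical and chosen only for this illustration; nothing is sent to a domain.

```python
import json

def build_bulk_body(index_name, documents):
    """Render documents as a newline-delimited _bulk request body."""
    lines = []
    for doc in documents:
        # Action line: index the document on the next line into index_name.
        lines.append(json.dumps({"index": {"_index": index_name, "_type": "_doc"}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The bulk API expects every line, including the last, to end with a newline.
    return "\n".join(lines) + "\n"

docs = [
    {"content": "Amazon Redshift is a high speed enterprise grade data warehouse service."},
    {"content": "Amazon Web Services offers various kinds of analytics services."},
]
bulk_body = build_bulk_body("mydocs", docs)
print(bulk_body, end="")
```

Generating the body this way avoids hand-editing the paired lines when the number of documents grows.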
When creating the index in step 2 above, "analyzer": "standard" was set. By specifying an analyzer, Elasticsearch automatically analyzes text fields at indexing time to make them easier to search later. The standard analyzer is the default analyzer in Amazon ES and provides a variety of settings to help you search. For more information, see the Elasticsearch documentation on the standard analyzer.
Now, you will actually execute a search query against the index created above.
Execute the following _search API call. The query parameter specifies the condition that the “content” field matches the word “redshift”.
GET mydocs/_search?q=content:"redshift"
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "mydocs",
"_type" : "_doc",
"_id" : "YKXvBXEBdQ_VtJWAeY85",
"_score" : 0.2876821,
"_source" : {
"content" : "Amazon Redshift is a high speed enterprise grade data warehouse service."
}
}
]
}
}
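The q parameter used in this search is parsed with the query string syntax, so the same condition can also be written as a request body. The sketch below, using hypothetical helper names, builds both forms in Python. It only constructs the strings and dictionaries and does not contact a domain. Note that clients other than the Dev tools console need the query URL-encoded.

```python
import json
from urllib.parse import urlencode

def uri_search_path(index_name, query):
    """Build the path and URL-encoded query string for a URI search."""
    return f"{index_name}/_search?{urlencode({'q': query})}"

def query_string_body(query):
    """Build a request body that uses the same query string syntax."""
    return {"query": {"query_string": {"query": query}}}

query = 'content:"redshift"'
search_path = uri_search_path("mydocs", query)
search_body = query_string_body(query)

print(search_path)
print(json.dumps(search_body, indent=2))
```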
GET mydocs/_analyze
{
"analyzer": "standard",
"text": "Amazon Redshift is a high speed enterprise grade data warehouse service."
}
You will notice that all words are converted to lowercase, as shown below. This is because the standard analyzer includes a filter called the Lower Case Token Filter, which converts all words to lowercase. In addition to the standard analyzer used here, there are a variety of built-in analyzers. For more information, see the Elasticsearch documentation on analyzers.
{
"tokens" : [
{
"token" : "amazon",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "redshift",
"start_offset" : 7,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "is",
"start_offset" : 16,
"end_offset" : 18,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "a",
"start_offset" : 19,
"end_offset" : 20,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "high",
"start_offset" : 21,
"end_offset" : 25,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "speed",
"start_offset" : 26,
"end_offset" : 31,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "enterprise",
"start_offset" : 32,
"end_offset" : 42,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "grade",
"start_offset" : 43,
"end_offset" : 48,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "data",
"start_offset" : 49,
"end_offset" : 53,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "warehouse",
"start_offset" : 54,
"end_offset" : 63,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "service",
"start_offset" : 64,
"end_offset" : 71,
"type" : "<ALPHANUM>",
"position" : 10
}
]
}
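To make the shape of this output concrete, here is a rough Python approximation of what the analyzer does to this sentence: split the text into words, lowercase each one, and record its offsets and position. This is only a sketch. The real standard tokenizer follows Unicode text segmentation rules rather than the simple regular expression assumed here, and the function name is hypothetical.

```python
import re

def simple_standard_analyze(text):
    """Approximate the standard analyzer: split into words, then lowercase."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": match.group().lower(),   # lowercase filter
            "start_offset": match.start(),    # character offset where the word begins
            "end_offset": match.end(),        # offset just past the last character
            "position": position,             # ordinal position of the word
        })
    return tokens

text = "Amazon Redshift is a high speed enterprise grade data warehouse service."
tokens = simple_standard_analyze(text)
for token in tokens:
    print(token)
```

Because both the indexed text and the search keyword pass through the same lowercasing step, a search for "redshift" matches the original word "Redshift".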
Next, you will set up synonyms. When users perform a search, they do not always use keywords that exactly match the words in the text body. They might use keywords with similar meanings but different expressions. By setting synonyms, you will be able to return appropriate search results in such cases.
DELETE mydocs
PUT mydocs
{
"mappings" : {
"properties" : {
"content" : {
"type" : "text",
"analyzer": "my_analyzer"
}
}
},
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"my_synonym",
"lowercase",
"stop"
]
}
},
"filter": {
"my_synonym" : {
"type": "synonym",
"synonyms": [
"amazon web services,aws,cloud",
"redshift,rs,dwh"
]
}
}
}
}
}
}
POST mydocs/_bulk
{"index":{"_index":"mydocs","_type":"_doc"}}
{"content":"Amazon Redshift is a high speed enterprise grade data warehouse service."}
{"index":{"_index":"mydocs","_type":"_doc"}}
{"content":"Amazon Web Services offers various kinds of analytics services."}
GET mydocs/_search?q=content:"aws"
Even though the word “aws” does not appear in the document, you can confirm that the document is returned as a search result.
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.8630463,
"hits" : [
{
"_index" : "mydocs",
"_type" : "_doc",
"_id" : "maUaBnEBdQ_VtJWA-48R",
"_score" : 0.8630463,
"_source" : {
"content" : "Amazon Web Services offers various kinds of analytics services."
}
}
]
}
}
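The idea behind this result can be sketched in a few lines of Python: each synonym rule defines a group of equivalent terms, and a search keyword is expanded to the whole group before matching. This is a simplified illustration with hypothetical function names, not how Elasticsearch implements the synonym filter internally.

```python
import re

SYNONYM_RULES = [
    "amazon web services,aws,cloud",
    "redshift,rs,dwh",
]

def build_synonym_map(rules):
    """Map every term in a rule to the full group of equivalent terms."""
    synonym_map = {}
    for rule in rules:
        group = [term.strip() for term in rule.split(",")]
        for term in group:
            synonym_map[term] = group
    return synonym_map

def contains_phrase(doc_words, phrase_words):
    """True if phrase_words occurs as a contiguous run inside doc_words."""
    n = len(phrase_words)
    return any(doc_words[i:i + n] == phrase_words
               for i in range(len(doc_words) - n + 1))

def matches(document, keyword, synonym_map):
    """True if the document contains the keyword or any of its synonyms."""
    doc_words = re.findall(r"\w+", document.lower())
    candidates = synonym_map.get(keyword.lower(), [keyword.lower()])
    return any(contains_phrase(doc_words, c.split()) for c in candidates)

synonym_map = build_synonym_map(SYNONYM_RULES)
doc_redshift = "Amazon Redshift is a high speed enterprise grade data warehouse service."
doc_aws = "Amazon Web Services offers various kinds of analytics services."

print(matches(doc_aws, "aws", synonym_map))
print(matches(doc_redshift, "aws", synonym_map))
```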
Finally, execute the following command to see how the original document is analyzed.
GET mydocs/_analyze
{
"analyzer": "my_analyzer",
"text": "Amazon Web Services offers various kinds of analytics services."
}
This time, unlike the previous analysis, you can see that the word “of” is not included. This is because a filter called “stop” was added when the index was recreated earlier. This Stop Token Filter removes common words such as “is” and “of”, which are of little use as search terms, from the token stream. This prevents the document from being matched when searching for the word “of”.
{
"tokens" : [
{
"token" : "amazon",
"start_offset" : 0,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "web",
"start_offset" : 7,
"end_offset" : 10,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "services",
"start_offset" : 11,
"end_offset" : 19,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "offers",
"start_offset" : 20,
"end_offset" : 26,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "various",
"start_offset" : 27,
"end_offset" : 34,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "kind",
"start_offset" : 35,
"end_offset" : 39,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "analytics",
"start_offset" : 43,
"end_offset" : 52,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "services",
"start_offset" : 53,
"end_offset" : 61,
"type" : "<ALPHANUM>",
"position" : 8
}
]
}
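The gap in the position values above (5 is followed by 7) is the trace left by the removed stop word. The following Python sketch imitates that behavior under simplifying assumptions: the stop word set is a small illustrative subset rather than the filter's actual list, and the word splitting is a plain regular expression.

```python
import re

# A small illustrative subset of English stop words (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}

def analyze_with_stop_filter(text, stop_words=STOP_WORDS):
    """Lowercase each word and drop stop words while keeping original positions."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        word = match.group().lower()
        if word in stop_words:
            # The word is dropped, but the position counter still advances,
            # which leaves a gap in the remaining positions.
            continue
        tokens.append({"token": word, "position": position})
    return tokens

text = "Amazon Web Services offers various kinds of analytics services."
filtered = analyze_with_stop_filter(text)
for token in filtered:
    print(token)
```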
In this section, you have tried using synonyms and filters in full-text search. However, this section has covered only a small part of the full-text search features. Beyond the settings you made in this section, Amazon ES allows you to configure a variety of full-text search settings. Please read the Elasticsearch documentation for more details.
In the previous section, you learned how to apply synonyms by embedding them directly into the mapping template. This approach works well if you only use a small number of synonyms. However, handling millions of synonyms in a mapping template is heavy work and hard to manage. In this section, you will import a synonym file as a package and associate it with an existing Amazon ES domain.
To use a synonym file in an Amazon ES domain, you will upload the file to S3 and import it as a package. Then you will associate the imported package with the Amazon ES domain.
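As a sketch of preparing such a file, the Python below renders synonym groups into text with one comma-separated rule per line. It assumes the file uses the same rule format as the inline "synonyms" list from the previous section. The function name is hypothetical, and the upload to S3 is not shown.

```python
def render_synonym_file(groups):
    """Render synonym groups as one comma-separated rule per line."""
    lines = []
    for group in groups:
        cleaned = [term.strip() for term in group if term.strip()]
        # A rule needs at least two terms to define an equivalence.
        if len(cleaned) >= 2:
            lines.append(",".join(cleaned))
    return "\n".join(lines) + "\n"

synonym_groups = [
    ["amazon web services", "aws", "cloud"],
    ["redshift", "rs", "dwh"],
]
synonym_file_text = render_synonym_file(synonym_groups)
print(synonym_file_text, end="")
```

Keeping the groups in a data structure and generating the file makes it easier to review and version a large synonym set before uploading it.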
Now you can execute a query and confirm the effect of synonyms.
DELETE mydocs
PUT mydocs
{
"mappings" : {
"properties" : {
"content" : {
"type" : "text",
"analyzer": "my_analyzer"
}
}
},
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"my_synonym",
"lowercase",
"stop"
]
}
},
"filter": {
"my_synonym" : {
"type": "synonym",
"synonyms_path": "analyzers/F0123456789"
}
}
}
}
}
}
POST mydocs/_bulk
{"index":{"_index":"mydocs","_type":"_doc"}}
{"content":"Amazon Redshift is a high speed enterprise grade data warehouse service."}
{"index":{"_index":"mydocs","_type":"_doc"}}
{"content":"Amazon Web Services offers various kinds of analytics services."}
GET mydocs/_search?q=content:"aws"
You can get the same search results as in the previous section.
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.8630463,
"hits" : [
{
"_index" : "mydocs",
"_type" : "_doc",
"_id" : "maUaBnEBdQ_VtJWA-48R",
"_score" : 0.8630463,
"_source" : {
"content" : "Amazon Web Services offers various kinds of analytics services."
}
}
]
}
}
In this section, you have tried using packages in Amazon ES. You can easily manage your own synonym files separately from the Amazon ES domain and its mapping template. This feature is especially useful for full-text search use cases. Please read the Amazon ES documentation for more details.
So far, you have learned how to write search queries that call the _search API directly. However, an Elasticsearch query has to be written as nested JSON, which is troublesome to write. There are several ways to solve this problem. For example, Python has a high-level query description library called Elasticsearch DSL and a low-level client called Python Elasticsearch Client, which allow you to write and execute queries relatively easily.
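To illustrate why helpers make nested queries easier to write, here is a minimal sketch in plain Python that composes a query body from small functions. The helper names are hypothetical and are not the API of Elasticsearch DSL or the Python Elasticsearch Client; the sketch only builds the dictionary that would be sent as the request body.

```python
import json

def match(field, text):
    """Build a match query clause for a single field."""
    return {"match": {field: text}}

def bool_query(must=(), must_not=()):
    """Combine clauses into a bool query, omitting empty sections."""
    body = {}
    if must:
        body["must"] = list(must)
    if must_not:
        body["must_not"] = list(must_not)
    return {"bool": body}

# Documents that mention "amazon" but not "redshift".
search_body = {
    "query": bool_query(
        must=[match("content", "amazon")],
        must_not=[match("content", "redshift")],
    )
}
print(json.dumps(search_body, indent=2))
```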
In this section, you will write a query using SQL, a third method not mentioned above. Using the SQL function of Open Distro, you can issue Elasticsearch queries in familiar SQL.
Again, you will execute the API using Dev tools as before.
Click the icon on the left of the screen to open the Dev tools menu.
Copy the following block of code into the “Console” below, and click the ▶ button on the right to execute the API. This query aggregates the number of records and the average temperature per status, across all indexes that match “workshop-log*”. As described in Section 2 of Lab 2, grouping requires a field of the keyword type, so status.keyword is used here.
POST _opendistro/_sql
{
"query": """
select
status.keyword
, count(*) as cnt
, avg(currentTemperature) as avgTemperature
from
workshop-log*
group by
status.keyword
"""
}
When you execute the code above, you will get a result like the following (note that the actual aggregate values differ for each execution environment). You can see the aggregate results displayed in “datarows”.
{
"schema": [
{
"name": "status.keyword",
"type": "double"
},
{
"name": "cnt",
"alias": "cnt",
"type": "double"
},
{
"name": "avgTemperature",
"alias": "avgTemperature",
"type": "double"
}
],
"total": 3,
"datarows": [
[
"OK",
86425,
79.99312698871854
],
[
"WARN",
7673,
79.42864590121204
],
[
"FAIL",
1940,
80.93762886597938
]
],
"size": 3,
"status": 200
}
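In this response, the column names are in "schema" and the values are in "datarows", so a client has to pair them up to get named records. The Python sketch below does that for a copy of the response shown above. The function name is hypothetical.

```python
def rows_as_dicts(response):
    """Pair each data row with the column names from the schema."""
    # Prefer the alias when one is present, otherwise fall back to the name.
    names = [column.get("alias", column["name"]) for column in response["schema"]]
    return [dict(zip(names, row)) for row in response["datarows"]]

sql_response = {
    "schema": [
        {"name": "status.keyword", "type": "double"},
        {"name": "cnt", "alias": "cnt", "type": "double"},
        {"name": "avgTemperature", "alias": "avgTemperature", "type": "double"},
    ],
    "total": 3,
    "datarows": [
        ["OK", 86425, 79.99312698871854],
        ["WARN", 7673, 79.42864590121204],
        ["FAIL", 1940, 80.93762886597938],
    ],
    "size": 3,
    "status": 200,
}

records = rows_as_dicts(sql_response)
for record in records:
    print(record)
```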
Now let’s display the aggregate result in a tabular form, like an ordinary SQL execution result. Execute the same query with the query parameter format=csv as follows.
POST _opendistro/_sql?format=csv
{
"query": """
select
status.keyword
, count(*) as cnt
, avg(currentTemperature) as avgTemperature
from
workshop-log*
group by
status.keyword
"""
}
As a result of execution, you can confirm that the following simple CSV output is displayed. In addition to csv, other formats such as jdbc and raw are also supported. See the Open Distro documentation for more information.
status.keyword,cnt,avgTemperature
OK,96930.0,79.99918497885072
WARN,8595.0,79.84409540430482
FAIL,2162.0,80.45189639222941
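CSV output like this is easy to consume with standard tooling. As a small sketch, the following Python parses a copy of the output above with the standard csv module and converts the numeric columns. The function name and the shortened "status" key are choices made for this illustration.

```python
import csv
import io

csv_text = """status.keyword,cnt,avgTemperature
OK,96930.0,79.99918497885072
WARN,8595.0,79.84409540430482
FAIL,2162.0,80.45189639222941
"""

def parse_sql_csv(text):
    """Parse the CSV response into typed rows."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "status": row["status.keyword"],
            # Counts arrive as values such as "96930.0", so convert via float.
            "cnt": int(float(row["cnt"])),
            "avgTemperature": float(row["avgTemperature"]),
        })
    return rows

rows = parse_sql_csv(csv_text)
total_records = sum(row["cnt"] for row in rows)
print(rows)
print(total_records)
```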
Now you know that you can execute queries in SQL. Next, let’s see what this query looks like when it is translated into an Elasticsearch query. Open Distro provides the _opendistro/_sql/_explain API, which returns the Elasticsearch query equivalent to a query written in SQL.
POST _opendistro/_sql/_explain
{
"query": """
select
status.keyword
, count(*) as cnt
, avg(currentTemperature) as avgTemperature
from
workshop-log*
group by
status.keyword
"""
}
The following result is returned. The result itself is the query to be sent to the _search API.
{
"from" : 0,
"size" : 0,
"_source" : {
"includes" : [
"status.keyword",
"COUNT",
"AVG"
],
"excludes" : [ ]
},
"stored_fields" : "status.keyword",
"aggregations" : {
"status#keyword" : {
"terms" : {
"field" : "status.keyword",
"size" : 200,
"min_doc_count" : 1,
"shard_min_doc_count" : 0,
"show_term_doc_count_error" : false,
"order" : [
{
"_count" : "desc"
},
{
"_key" : "asc"
}
]
},
"aggregations" : {
"cnt" : {
"value_count" : {
"field" : "_index"
}
},
"avgTemperature" : {
"avg" : {
"field" : "currentTemperature"
}
}
}
}
}
}
Paste the above result as the request body and execute the _search API as shown below.
POST _search
{
"from" : 0,
"size" : 0,
"_source" : {
"includes" : [
"status",
"COUNT",
"AVG"
],
"excludes" : [ ]
},
"stored_fields" : "status",
"aggregations" : {
"status.keyword" : {
"terms" : {
"field" : "status.keyword",
"size" : 200,
"min_doc_count" : 1,
"shard_min_doc_count" : 0,
"show_term_doc_count_error" : false,
"order" : [
{
"_count" : "desc"
},
{
"_key" : "asc"
}
]
},
"aggregations" : {
"cnt" : {
"value_count" : {
"field" : "_index"
}
},
"avgTemperature" : {
"avg" : {
"field" : "currentTemperature"
}
}
}
}
}
}
You now receive the same result as from _opendistro/_sql. Note that the normal _search API does not support output formats such as csv and raw.
{
"took" : 109,
"timed_out" : false,
"_shards" : {
"total" : 134,
"successful" : 134,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 10000,
"relation" : "gte"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"status.keyword" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "OK",
"doc_count" : 96932,
"cnt" : {
"value" : 96932
},
"avgTemperature" : {
"value" : 79.99918499566706
}
},
{
"key" : "WARN",
"doc_count" : 8595,
"cnt" : {
"value" : 8595
},
"avgTemperature" : {
"value" : 79.84409540430482
}
},
{
"key" : "FAIL",
"doc_count" : 2162,
"cnt" : {
"value" : 2162
},
"avgTemperature" : {
"value" : 80.45189639222941
}
}
]
}
}
}
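Because the _search API returns the aggregation as nested buckets rather than rows, a client that wants tabular data has to flatten it. The Python sketch below does this for a trimmed copy of the response above. The function name is hypothetical, and the metric names cnt and avgTemperature are taken from the query.

```python
def flatten_buckets(response, aggregation_name):
    """Turn the buckets of a terms aggregation into flat rows."""
    buckets = response["aggregations"][aggregation_name]["buckets"]
    return [
        {
            "status": bucket["key"],
            "cnt": bucket["cnt"]["value"],
            "avgTemperature": bucket["avgTemperature"]["value"],
        }
        for bucket in buckets
    ]

# Trimmed copy of the response shown above (only the aggregations part).
search_response = {
    "aggregations": {
        "status.keyword": {
            "buckets": [
                {"key": "OK", "doc_count": 96932,
                 "cnt": {"value": 96932}, "avgTemperature": {"value": 79.99918499566706}},
                {"key": "WARN", "doc_count": 8595,
                 "cnt": {"value": 8595}, "avgTemperature": {"value": 79.84409540430482}},
                {"key": "FAIL", "doc_count": 2162,
                 "cnt": {"value": 2162}, "avgTemperature": {"value": 80.45189639222941}},
            ]
        }
    }
}

flat_rows = flatten_buckets(search_response, "status.keyword")
for row in flat_rows:
    print(row)
```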
This completes the query execution with the _sql API. Note that this API does not support all standard SQL commands. For example, the only aggregate functions currently supported are avg(), count(), max(), min(), and sum(). Please see the Open Distro documentation for more details.
This lab focused on the advanced aspects of Elasticsearch search: full-text search, customization, and SQL API usage. The whole Elasticsearch Workshop is now complete. Please do not forget to clean up by following these steps. If you keep the resources you used in this workshop, you will continue to be charged.