Lab 4: Amazon ES Advanced Usage

As mentioned at the beginning of this hands-on, there are two major use cases for Amazon ES: log analysis and full-text search. Log analysis was already covered in Lab 2. In this Lab 4, let’s take a look at some of the more advanced usage, focusing on full-text search.

Section 1: Full-Text Search Using Amazon ES

So far, you have tried Amazon ES features mainly for log analysis. However, as the name Elasticsearch indicates, Elasticsearch was originally developed as a product for full-text search. In this section, let’s try full-text search using Amazon ES.

First, you will create a new index to perform full-text search.

  1. Click the Dev Tools icon on the left of the screen to open the Dev Tools menu.

  2. Copy the following block of code to the “Console” below, and click the ▶ button on the right to execute the API. In this example, an index called mydocs with a single field called “content” will be created. In Lab 2, you created an index whose field mappings were automatically recognized by Amazon ES when data was inserted; here, you will instead create an index with an explicit mapping so that the text is analyzed in a well-defined way.

PUT mydocs
{
  "mappings" : {
    "properties" : {
      "content" : {
        "type" : "text",
        "analyzer": "standard"
      }
    }
  }
}
  3. Execute the following command to add two documents to the index you have created.
POST mydocs/_bulk
{"index":{"_index":"mydocs","_type":"_doc"}}
{"content":"Amazon Redshift is a high speed enterprise grade data warehouse service."}
{"index":{"_index":"mydocs","_type":"_doc"}}
{"content":"Amazon Web Services offers various kinds of analytics services."}

When creating the index in step 2 above, "analyzer": "standard" was set. By specifying an analyzer, Elasticsearch automatically analyzes text fields to make them easier to search later. The standard analyzer is the default analyzer in Amazon ES and provides a variety of settings to help you search. For more information, see the Elasticsearch documentation.
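
If you want to double-check that the mapping and the analyzer setting were applied as intended, you can retrieve the index mapping with the standard mapping API (an optional check):

GET mydocs/_mapping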

Executing a full-text search query

Now, you will actually execute a search query against the index created above.

  1. Copy the following code to the Dev Tools Console, and click the ▶︎ button on the right to execute the API. You can execute a search query by calling the _search API. In the query parameter, a search condition is specified on the “content” field, in this case matching “redshift”.
GET mydocs/_search?q=content:"redshift"
  2. As expected, you should get results like the following (an equivalent request-body form of this query is shown after the result). Note that values such as _id will differ in your environment.
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "mydocs",
        "_type" : "_doc",
        "_id" : "YKXvBXEBdQ_VtJWAeY85",
        "_score" : 0.2876821,
        "_source" : {
          "content" : "Amazon Redshift is a high speed enterprise grade data warehouse service."
        }
      }
    ]
  }
}
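
By the way, the URI search used above is a shorthand. The same search can also be written as a request body using the match query; the following is an equivalent form, shown here for reference:

GET mydocs/_search
{
  "query": {
    "match": {
      "content": "redshift"
    }
  }
}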

  3. Did you notice that the document matched even though you searched for “redshift” instead of “Redshift”? To see why, execute the following query to check how the sentence is analyzed.
GET mydocs/_analyze
{
  "analyzer": "standard",
  "text": "Amazon Redshift is a high speed enterprise grade data warehouse service."
}
  4. You will notice that all words have been converted to lowercase, as follows. This is because the standard analyzer includes the Lower Case Token Filter, which converts all tokens to lowercase (a search-side check is shown after the output below). In addition to the standard analyzer used here, there are a variety of built-in analyzers; see the Elasticsearch documentation for more information.

    {
      "tokens" : [
        {
          "token" : "amazon",
          "start_offset" : 0,
          "end_offset" : 6,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "redshift",
          "start_offset" : 7,
          "end_offset" : 15,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "is",
          "start_offset" : 16,
          "end_offset" : 18,
          "type" : "<ALPHANUM>",
          "position" : 2
        },
        {
          "token" : "a",
          "start_offset" : 19,
          "end_offset" : 20,
          "type" : "<ALPHANUM>",
          "position" : 3
        },
        {
          "token" : "high",
          "start_offset" : 21,
          "end_offset" : 25,
          "type" : "<ALPHANUM>",
          "position" : 4
        },
        {
          "token" : "speed",
          "start_offset" : 26,
          "end_offset" : 31,
          "type" : "<ALPHANUM>",
          "position" : 5
        },
        {
          "token" : "enterprise",
          "start_offset" : 32,
          "end_offset" : 42,
          "type" : "<ALPHANUM>",
          "position" : 6
        },
        {
          "token" : "grade",
          "start_offset" : 43,
          "end_offset" : 48,
          "type" : "<ALPHANUM>",
          "position" : 7
        },
        {
          "token" : "data",
          "start_offset" : 49,
          "end_offset" : 53,
          "type" : "<ALPHANUM>",
          "position" : 8
        },
        {
          "token" : "warehouse",
          "start_offset" : 54,
          "end_offset" : 63,
          "type" : "<ALPHANUM>",
          "position" : 9
        },
        {
          "token" : "service",
          "start_offset" : 64,
          "end_offset" : 71,
          "type" : "<ALPHANUM>",
          "position" : 10
        }
      ]
    }
    
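
Because the field’s analyzer is also applied to the query string at search time, an all-uppercase search term should match as well. You can verify this with the following query (an optional check; the same document should be returned):

GET mydocs/_search?q=content:"REDSHIFT"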

Synonym Settings

Next, you will set up synonyms. When users perform a search, they do not always use keywords that exactly match the words in the document body; they may use keywords with similar meanings but different expressions. By setting synonyms, you can return proper search results even in such cases.

  1. Copy the following code to the Dev Tools Console, and click the ▶︎ button on the right to execute the API. This deletes the index you created earlier.
DELETE mydocs
  2. Then, create a new index. This time, a synonym setting called “my_synonym” is added at the end. The three terms “amazon web services”, “aws”, and “cloud” are treated as the same here; likewise “redshift”, “rs”, and “dwh”. (An optional analyzer check is shown after the code block below.)
PUT mydocs
{
  "mappings" : {
    "properties" : {
      "content" : {
        "type" : "text",
        "analyzer": "my_analyzer"
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [
              "my_synonym",
              "lowercase",
              "stop"
            ]
          }
        },
        "filter": {
          "my_synonym" : {
            "type": "synonym",
            "synonyms": [
              "amazon web services,aws,cloud",
              "redshift,rs,dwh"
            ]
          }
        }
      }
    }
  }
}
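Before adding data, you can optionally check how the custom analyzer handles a synonym by analyzing the single term “aws” (the exact token list depends on how the synonym filter expands multi-word entries, but you should see the terms of its synonym group):

GET mydocs/_analyze
{
  "analyzer": "my_analyzer",
  "text": "aws"
}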
  3. Add the data as before.
POST mydocs/_bulk
{"index":{"_index":"mydocs","_type":"_doc"}}
{"content":"Amazon Redshift is a high speed enterprise grade data warehouse service."}
{"index":{"_index":"mydocs","_type":"_doc"}}
{"content":"Amazon Web Services offers various kinds of analytics services."}
  4. Execute the search query in the same way. This time, however, you will search for a word that is not included in either sentence.
GET mydocs/_search?q=content:"aws"
  5. Even though the word “aws” does not appear in the sentence, you can confirm that the expected search result is returned. (You can also try the other synonyms, as shown after the result below.)

    {
      "took" : 4,
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1,
          "relation" : "eq"
        },
        "max_score" : 0.8630463,
        "hits" : [
          {
            "_index" : "mydocs",
            "_type" : "_doc",
            "_id" : "maUaBnEBdQ_VtJWA-48R",
            "_score" : 0.8630463,
            "_source" : {
              "content" : "Amazon Web Services offers various kinds of analytics services."
            }
          }
        ]
      }
    }
    
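You can also try the other terms in the synonym group. For example, searching for “cloud” should return the same document:

GET mydocs/_search?q=content:"cloud"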
  6. Finally, execute the following command to see how the original document is analyzed.

GET mydocs/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Amazon Web Services offers various kinds of analytics services."
}
  7. This time, unlike before, you can see that the words “is” and “of” are not included. This is because a filter called “stop” was added when recreating the index earlier. The Stop Token Filter excludes stop words such as articles and prepositions, which are of little use for search, before indexing. This prevents the document from matching when searching for a word like “of” (you can try this yourself after the output below).

    {
      "tokens" : [
        {
          "token" : "amazon",
          "start_offset" : 0,
          "end_offset" : 6,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "web",
          "start_offset" : 7,
          "end_offset" : 10,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "services",
          "start_offset" : 11,
          "end_offset" : 19,
          "type" : "<ALPHANUM>",
          "position" : 2
        },
        {
          "token" : "offers",
          "start_offset" : 20,
          "end_offset" : 26,
          "type" : "<ALPHANUM>",
          "position" : 3
        },
        {
          "token" : "various",
          "start_offset" : 27,
          "end_offset" : 34,
          "type" : "<ALPHANUM>",
          "position" : 4
        },
        {
          "token" : "kind",
          "start_offset" : 35,
          "end_offset" : 39,
          "type" : "<ALPHANUM>",
          "position" : 5
        },
        {
          "token" : "analytics",
          "start_offset" : 43,
          "end_offset" : 52,
          "type" : "<ALPHANUM>",
          "position" : 7
        },
        {
          "token" : "services",
          "start_offset" : 53,
          "end_offset" : 61,
          "type" : "<ALPHANUM>",
          "position" : 8
        }
      ]
    }
    
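To see the effect of the stop filter on the search side, try searching for “of”. Because the analyzer removes this word from the query as well, you should get zero hits:

GET mydocs/_search?q=content:"of"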

In this section, you have tried synonyms and filters in full-text search. However, only a small part of the full-text search features has been covered here. Beyond the settings you have made in this section, Amazon ES allows you to configure a wide variety of full-text search settings. Please read the Elasticsearch documentation for more details.

Section 2: Applying Synonyms with a File

In the previous section, you learned how to apply synonyms by embedding them directly in the index settings. This approach works if you only use a small number of synonyms. However, handling millions of synonyms in a mapping template is heavy work and hard to manage. In this section, you will import a synonym file as a package and associate it with an existing Amazon ES domain.

Importing a package and associating it with the Amazon ES domain

To use a synonym file in an Amazon ES domain, you will upload the file to S3 and import it as a package. Then you will associate the imported package with the Amazon ES domain.

  1. Download the synonym file to your local environment.
  2. Open the [S3] management console.
  3. Choose “workshop-YYYYMMDD-YOURNAME” created in Lab 1 (replace YYYYMMDD and YOURNAME appropriately). Choose [+ Create folder], type “package”, and click [Save].
  4. Choose “package”, click [Upload], select the synonym file you downloaded, and click [Upload].
  5. Open the [Elasticsearch Service] management console.
  6. Choose “Packages”, then click [Import].
  7. Name it “custom-synonym”, type “s3://workshop-YYYYMMDD-YOURNAME/package/custom_synonym.csv” as the package source, and click [Import].
  8. After the page transition, click [Associate to a domain], choose “workshop-esdomain”, then click [Associate].
  9. Wait a few minutes until the domain status becomes available.
  10. Make a note of the package ID (it looks like F123456789), as you will use it later.

Executing a query in association with synonym package

Now you can execute a query and confirm the effect of synonyms.

  1. Copy the following code to the Dev Tools Console, and click the ▶︎ button on the right to execute the API. This deletes the index you created earlier.
DELETE mydocs
  2. Then, create a new index. Note that the filter configuration is not the same as in Section 1: instead of listing synonyms inline, “my_synonym” now references the package through synonyms_path. Replace “F0123456789” in “analyzers/F0123456789” with the package ID you noted earlier, then execute the API. (An optional analyzer check is shown after the code block below.)
PUT mydocs
{
  "mappings" : {
    "properties" : {
      "content" : {
        "type" : "text",
        "analyzer": "my_analyzer"
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [
              "my_synonym",
              "lowercase",
              "stop"
            ]
          }
        },
        "filter": {
          "my_synonym" : {
            "type": "synonym",
            "synonyms_path": "analyzers/F0123456789"
          }
        }
      }
    }
  }
}
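As in Section 1, you can optionally verify the analyzer before adding data. Assuming the synonym file defines the same groups as before, analyzing “aws” should again show its synonym expansions:

GET mydocs/_analyze
{
  "analyzer": "my_analyzer",
  "text": "aws"
}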
  3. Add the data as before.
POST mydocs/_bulk
{"index":{"_index":"mydocs","_type":"_doc"}}
{"content":"Amazon Redshift is a high speed enterprise grade data warehouse service."}
{"index":{"_index":"mydocs","_type":"_doc"}}
{"content":"Amazon Web Services offers various kinds of analytics services."}
  4. Execute the search query in the same way.
GET mydocs/_search?q=content:"aws"
  5. You should get the same search results as in the previous section.

    {
      "took" : 4,
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1,
          "relation" : "eq"
        },
        "max_score" : 0.8630463,
        "hits" : [
          {
            "_index" : "mydocs",
            "_type" : "_doc",
            "_id" : "maUaBnEBdQ_VtJWA-48R",
            "_score" : 0.8630463,
            "_source" : {
              "content" : "Amazon Web Services offers various kinds of analytics services."
            }
          }
        ]
      }
    }
    

In this section, you have tried the usage of packages in Amazon ES. You can easily manage your own synonym files separately from the Amazon ES domain and its mapping template. This feature is especially useful for the full-text search use case. Please read the Amazon ES documentation for more details.

Section 3: Log Analysis Using SQL

So far, you have learned how to write search queries that are executed directly against the _search API. However, Elasticsearch queries need to be described as nested JSON, which is troublesome to write. There are several ways to address this. For example, Python has a high-level query library called Elasticsearch DSL and a low-level client called Python Elasticsearch Client, which allow you to write and execute queries relatively easily.

In this section, you will try writing queries using SQL, a third method not mentioned above. Using the SQL function of Open Distro, you can issue Elasticsearch queries in familiar SQL.

Executing queries with _opendistro/_sql API

Again, you will execute the API using Dev tools as before.

  1. Click the Dev Tools icon on the left of the screen to open the Dev Tools menu.

  2. Copy the following block of code to the “Console” below, and click the ▶ button on the right to execute the API. This query aggregates, over all indices matching “workshop-log*”, the number of records and the average temperature per status. As described in Section 2 of Lab 2, a keyword-type field is required when grouping, so status.keyword is used here.

POST _opendistro/_sql
{
  "query": """
select
  status.keyword
  , count(*) as cnt
  , avg(currentTemperature) as avgTemperature
from
  workshop-log*
group by
  status.keyword
"""
}
  3. Executing the code above, you will get a result like the following (note that the actual aggregate values will differ per execution environment). You can see the aggregate results displayed in “datarows”.

    {
      "schema": [
        {
          "name": "status.keyword",
          "type": "double"
        },
        {
          "name": "cnt",
          "alias": "cnt",
          "type": "double"
        },
        {
          "name": "avgTemperature",
          "alias": "avgTemperature",
          "type": "double"
        }
      ],
      "total": 3,
      "datarows": [
        [
          "OK",
          86425,
          79.99312698871854
        ],
        [
          "WARN",
          7673,
          79.42864590121204
        ],
        [
          "FAIL",
          1940,
          80.93762886597938
        ]
      ],
      "size": 3,
      "status": 200
    }
    
  4. Now let’s display the aggregate results in a form closer to a typical SQL execution result. Execute the same query with the query parameter format=csv, as follows.

POST _opendistro/_sql?format=csv
{
  "query": """
select
  status.keyword
  , count(*) as cnt
  , avg(currentTemperature) as avgTemperature
from
  workshop-log*
group by
  status.keyword
"""
}
  5. As a result, you can confirm that simple CSV values are displayed as follows. In addition to csv, other formats such as jdbc and raw are also supported. See the Open Distro documentation for more information.

    status.keyword,cnt,avgTemperature
    OK,96930.0,79.99918497885072
    WARN,8595.0,79.84409540430482
    FAIL,2162.0,80.45189639222941
    

Converting an _sql API query to an Elasticsearch query

Now you know that you can execute queries in SQL. Next, let’s see what this query looks like when translated into a native Elasticsearch query. Open Distro provides the _opendistro/_sql/_explain API, which returns the Elasticsearch query equivalent to a query written in SQL.

  1. Execute the following SQL.
POST _opendistro/_sql/_explain
{
  "query": """
select
  status.keyword
  , count(*) as cnt
  , avg(currentTemperature) as avgTemperature
from
  workshop-log*
group by
  status.keyword
"""
}
  2. The following result is returned. This is itself the query body that would be sent to the _search API.

    {
      "from" : 0,
      "size" : 0,
      "_source" : {
        "includes" : [
          "status.keyword",
          "COUNT",
          "AVG"
        ],
        "excludes" : [ ]
      },
      "stored_fields" : "status.keyword",
      "aggregations" : {
        "status#keyword" : {
          "terms" : {
            "field" : "status.keyword",
            "size" : 200,
            "min_doc_count" : 1,
            "shard_min_doc_count" : 0,
            "show_term_doc_count_error" : false,
            "order" : [
              {
                "_count" : "desc"
              },
              {
                "_key" : "asc"
              }
            ]
          },
          "aggregations" : {
            "cnt" : {
              "value_count" : {
                "field" : "_index"
              }
            },
            "avgTemperature" : {
              "avg" : {
                "field" : "currentTemperature"
              }
            }
          }
        }
      }
    }
    
  3. Paste the result above into the _search API and execute it as shown below (note that the aggregation name and included fields have been adjusted slightly).

POST _search
{
  "from" : 0,
  "size" : 0,
  "_source" : {
    "includes" : [
      "status",
      "COUNT",
      "AVG"
    ],
    "excludes" : [ ]
  },
  "stored_fields" : "status",
  "aggregations" : {
    "status.keyword" : {
      "terms" : {
        "field" : "status.keyword",
        "size" : 200,
        "min_doc_count" : 1,
        "shard_min_doc_count" : 0,
        "show_term_doc_count_error" : false,
        "order" : [
          {
            "_count" : "desc"
          },
          {
            "_key" : "asc"
          }
        ]
      },
      "aggregations" : {
        "cnt" : {
          "value_count" : {
            "field" : "_index"
          }
        },
        "avgTemperature" : {
          "avg" : {
            "field" : "currentTemperature"
          }
        }
      }
    }
  }
}
  4. You can now receive the same result as from _opendistro/_sql. Note that the plain _search API does not support output formats such as csv and raw.

    {
      "took" : 109,
      "timed_out" : false,
      "_shards" : {
        "total" : 134,
        "successful" : 134,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 10000,
          "relation" : "gte"
        },
        "max_score" : null,
        "hits" : [ ]
      },
      "aggregations" : {
        "status.keyword" : {
          "doc_count_error_upper_bound" : 0,
          "sum_other_doc_count" : 0,
          "buckets" : [
            {
              "key" : "OK",
              "doc_count" : 96932,
              "cnt" : {
                "value" : 96932
              },
              "avgTemperature" : {
                "value" : 79.99918499566706
              }
            },
            {
              "key" : "WARN",
              "doc_count" : 8595,
              "cnt" : {
                "value" : 8595
              },
              "avgTemperature" : {
                "value" : 79.84409540430482
              }
            },
            {
              "key" : "FAIL",
              "doc_count" : 2162,
              "cnt" : {
                "value" : 2162
              },
              "avgTemperature" : {
                "value" : 80.45189639222941
              }
            }
          ]
        }
      }
    }
    

This completes the query execution with the _sql API. Note that this API does not support all standard SQL commands. For example, only the aggregate functions avg(), count(), max(), min(), and sum() are currently supported. Please see the Open Distro documentation for more details.
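
As a quick sketch of what is supported, the following query combines a WHERE filter with the supported aggregate functions (the field names are those used in this workshop; exact syntax support may vary by Open Distro version):

POST _opendistro/_sql
{
  "query": """
select
  count(*) as cnt
  , avg(currentTemperature) as avgTemperature
from
  workshop-log*
where
  status.keyword = 'FAIL'
"""
}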

Summary

This lab focused on the advanced aspects of Elasticsearch: full-text search, customization, and the SQL API. All of the Elasticsearch Workshop is now complete. Please do not forget to clean up by following these steps. If you keep the resources you used in this workshop, you will continue to be charged.