How to fix “Could not push logs to Elasticsearch cluster” error – EFK


I run an EFK stack (Elasticsearch, Fluentd, Kibana) and noticed that my logs were not reaching Elasticsearch. The Kibana UI showed no incoming logs, and when I logged into the Fluentd node, I saw the following error:

error="could not push logs to Elasticsearch cluster ({:host=>\"esclient1\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}, {:host=>\"esclient2\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): read timeout reached"

This initially got me thinking that something was wrong with my td-agent configuration. Some suggested that I had overloaded my Elasticsearch nodes, and others encouraged tweaking the Fluentd td-agent configuration. In my case, all my Elasticsearch nodes were fine in terms of load.

Fluentd could ship some of the logs, but others were piling up in the buffer on the Fluentd node, and I could also see buffer-related errors such as running out of space and BufferOverflowError:

#0 failed to write data into buffer by buffer overflow action=:throw_exception
2021-02-18 17:59:05 +0800 [warn]: #0 send an error event stream to @ERROR: error_class=Fluent::Plugin::Buffer::BufferOverflowError error="buffer space has too many data" location="/opt/td-agent/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.4.2/lib/fluent/plugin/buffer.rb:298:in `write'" tag="openshift.operations"
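As a stopgap while Elasticsearch is being fixed, the Fluentd buffer can be made more forgiving so logs queue on disk instead of being dropped. A hedged sketch of a td-agent `<buffer>` section (the match tag, buffer path, and size values are illustrative assumptions, not my actual config):

```
<match openshift.operations>
  @type elasticsearch
  # ... your existing host/port/user settings ...
  <buffer>
    @type file                          # spill chunks to disk instead of memory
    path /var/log/td-agent/buffer/es    # hypothetical buffer path
    total_limit_size 8GB                # raise the cap that triggered BufferOverflowError
    chunk_limit_size 8MB
    flush_thread_count 4
    overflow_action block               # apply back-pressure instead of throw_exception
    retry_max_interval 30
  </buffer>
</match>
```

This only buys time; the root cause on the Elasticsearch side still has to be fixed.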


This can happen when Elasticsearch is unavailable, overloaded, or unhealthy and shards are not assigned properly. I suggest checking the cluster health first.

curl http://localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "elastic-iot",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 5,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 178,
  "active_shards" : 248,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 285,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 46.52908067542214
}

It is obvious that the Elasticsearch cluster itself is unhealthy. I now need to find out why the cluster is in the “red” state and why some shards are “unassigned”. To do that, the _cluster/allocation/explain API can help:

curl http://localhost:9200/_cluster/allocation/explain?pretty
...
"node_allocation_decisions" : [
  {
    "node_id" : "s2Yqo7SiTzakRpj0rMO-RA",
    "node_name" : "iot-esdatanode1",
    "transport_address" : "IP:9300",
    "node_decision" : "no",
    "weight_ranking" : 2,
    "deciders" : [
      {
        "decider" : "replica_after_primary_active",
        "decision" : "NO",
        "explanation" : "primary shard for this replica is not yet active"
      },
      {
        "decider" : "filter",
        "decision" : "NO",
        "explanation" : "node does not match index setting [index.routing.allocation.require] filters [box_type:\"hot\"]"
      }
    ]
  }
]
...

To see shard allocation across the nodes, you can run:

curl localhost:9200/_cat/allocation
111   2.2gb 114.9gb 184.4gb 299.3gb 38 iot-esdatanode1
  7 391.4mb  51.5gb 247.8gb 299.3gb 17 iot-esdatanode2
114   2.5gb  86.7gb 212.5gb 299.3gb 28 iot-esdatanode3
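The first column is the shard count per node, and the skew here (node2 holds only 7 shards) is easy to miss in a wide table. A small sketch that parses the plain-text rows above into per-node shard counts:

```python
# Parse the _cat/allocation rows shown above into {node: shard_count}.
rows = """\
111   2.2gb 114.9gb 184.4gb 299.3gb 38 iot-esdatanode1
  7 391.4mb  51.5gb 247.8gb 299.3gb 17 iot-esdatanode2
114   2.5gb  86.7gb 212.5gb 299.3gb 28 iot-esdatanode3""".splitlines()

# Last whitespace-separated field is the node name, first is the shard count.
shards = {line.split()[-1]: int(line.split()[0]) for line in rows}
print(shards)
# {'iot-esdatanode1': 111, 'iot-esdatanode2': 7, 'iot-esdatanode3': 114}
```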

This should explain why those shards are not assigned. Your reason might be different, but in my case the indices being created required a node with the attribute “hot” (referring to the hot/warm/cold architecture), and none of my Elasticsearch nodes had that attribute; it had somehow gone missing from my nodes.
So to solve this, I have to set the attribute on my “hot” nodes, and those shards should then get assigned.
You can do this by logging into the node and starting Elasticsearch with:

./bin/elasticsearch -Enode.attr.box_type=hot

You can also set this permanently in the Elasticsearch configuration.

Edit /etc/elasticsearch/elasticsearch.yml and add the following:

node.attr.box_type: "hot"

Restart Elasticsearch and those unassigned shards should get assigned.

sudo systemctl restart elasticsearch
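Shards normally get assigned on their own once the attribute is in place; if some stay stuck because earlier allocation attempts hit the retry limit, a POST to _cluster/reroute?retry_failed=true asks Elasticsearch to retry them. A hedged sketch of automating the check-and-retry (the endpoint URL is an assumption; add authentication and TLS as your cluster requires):

```python
import urllib.request

ES = "http://localhost:9200"  # assumed endpoint; adjust scheme/auth for your cluster

def is_recovered(health: dict) -> bool:
    """True once _cluster/health reports no unassigned shards."""
    return health["status"] != "red" and health["unassigned_shards"] == 0

def retry_failed_allocations() -> None:
    """Ask Elasticsearch to retry shard allocations that previously gave up."""
    req = urllib.request.Request(f"{ES}/_cluster/reroute?retry_failed=true",
                                 method="POST")
    urllib.request.urlopen(req)

# The predicate against the numbers seen earlier, and after a full recovery:
print(is_recovered({"status": "red", "unassigned_shards": 285}))   # False
print(is_recovered({"status": "green", "unassigned_shards": 0}))   # True
```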


To summarize: you see this error because Elasticsearch is unavailable or unhealthy. First check the cluster health and see whether it is “green”. Also check the load on your Elasticsearch node(s), as that can be another cause of this error.