Whenever you create an Amazon EKS cluster with Terraform, it will also create an associated CloudWatch log group named /aws/eks/<cluster-name>/cluster [1].
Through it, you can see every request that was issued against the cluster's API server.
It contains quite a lot of data. One can use the web UI to navigate the logs; personally, I find it very slow and unnecessarily distracting, so I will use the command line instead.
gathering the data
I recommend dumping the output into a temporary file first, then processing it as a post-mortem analysis.
time aws --profile "${profile}" logs tail --since 20h "/aws/eks/${cluster_name}/cluster" > /tmp/out.txt
real 2m27.090s
user 0m24.328s
sys 0m4.346s
du -sh /tmp/out.txt
wc -l /tmp/out.txt
504M /tmp/out.txt
405856 /tmp/out.txt
processing the data
The audit data are buried inside all this mess, so grep is useful to focus only on those lines.
Also, the data are JSON-encoded, but because Amazon prefixes each line with some metadata of its own, we have to pre-process it. We still want to preserve the date for later analysis, though.
time cat /tmp/out.txt \
| grep "kube-apiserver-audit-" \
| sed -r 's/^(.+) kube-apiserver-audit-[^ ]+ \{(.+)$/{"date": "\1", \2/' \
> /tmp/audit.txt
du -sh /tmp/audit.txt
wc -l /tmp/audit.txt
467M /tmp/audit.txt
374682 /tmp/audit.txt
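To illustrate what that sed expression does, here is its effect on a single made-up line (both the timestamp and the log stream suffix are fabricated):

```shell
echo '2025-05-05T08:27:13.718000+00:00 kube-apiserver-audit-abc123 {"kind":"Event","verb":"create"}' \
  | sed -r 's/^(.+) kube-apiserver-audit-[^ ]+ \{(.+)$/{"date": "\1", \2/'
# {"date": "2025-05-05T08:27:13.718000+00:00", "kind":"Event","verb":"create"}
```

The result is valid JSON again, with the CloudWatch timestamp folded in as a date field, ready for jq.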
analysing the data
From that point, being fluent in jq helps a lot, because most of what we want to do can be done with it.
To avoid wasting CPU cycles analysing data we know won't be useful, we can add a pre-filter using grep. It can be used, for instance, to keep only the lines that mention information relevant to the current analysis, and to drop the users that we know won't give us any useful insight.
For instance, to get a hint of what kinds of agents tried to “abuse” our API:
cat /tmp/audit.txt \
| grep 401 \
| grep -v '"username":"\(system:\|eks:\|.\+terraform\)' \
| jq -r .userAgent \
| sort \
| uniq -c \
| sort -n
1 Expanse, a Palo Alto Networks company, searches across the global IPv4 space multiple times per day to identify customers' presences on the Internet. If you would like to be excluded from our scans, please send IP addresses/domains to: scaninfo@paloaltonetworks.com
1 Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 10.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729)
1 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36
1 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36
1 Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_3 like Mac OS X) AppleWebKit/603.3.8 (KHTML, like Gecko) Mobile/14G60 MicroMessenger/7.0.5(0x17000523) NetType/4G Language/zh_CN
2 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36
2 Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36
2 Mozilla/5.0 zgrab/0.x
6 Mozilla/5.0 (compatible; CensysInspect/1.1; +https://about.censys.io/)
6 Mozilla/5.0 (compatible; InternetMeasurement/1.0; +https://internet-measurement.com/)
6 TLS tester from https://testssl.sh/dev/
6 python-httpx/0.27.0
6 python-requests/2.32.3
23 null
30 Mozilla/5.0 (compatible; Nmap Scripting Engine; https://nmap.org/book/nse.html)
108 Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/88.0.4324.192 Safari/537.36-AmazonAutoTester-Palisade-Riddler
113 Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)Chrome/51.0.2704.103 Safari/537.36 DynamicScanningFramework (prosecdev-dast-team@amazon.com) -AmazonAutoTester-DyePack [DyePack] [PROD] [804694375827]
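As a sanity check of the grep -v expression above, here is how it behaves on a few fabricated usernames (only the last one survives the filter):

```shell
printf '%s\n' \
  '{"username":"system:anonymous"}' \
  '{"username":"eks:node-manager"}' \
  '{"username":"ci-terraform"}' \
  '{"username":"alice"}' \
  | grep -v '"username":"\(system:\|eks:\|.\+terraform\)'
# {"username":"alice"}
```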
Another example could be to trace the lifecycle of a particular resource.
time {
cat /tmp/audit.txt \
| jq -r -c 'select(
( .verb | IN("create", "delete"))
and ( .objectRef.resource | IN("deployments") )
and ( .objectRef.namespace | IN("default") )
and ( .objectRef.name | contains("-doc") )
)|"\(.date): \(.verb)"'
}
2025-05-05T08:27:13.718000+00:00: create
2025-05-05T10:41:28.610000+00:00: delete
2025-05-05T10:41:52.485000+00:00: create
real 0m9.971s
user 0m9.288s
sys 0m0.626s
We can see the initial creation at 08:27 and some redeployment occurring at 10:41.
Again, note that jq has a cost, and that using grep as a pre-filter, while less precise, can speed things up quite a bit.
time {
cat /tmp/audit.txt \
| grep '"resource":"deployments"' \
| grep '"namespace":"default"' \
| grep '"name":"[^"]\+-doc"' \
| grep '"verb":"\(create\|delete\)"' \
| jq -r '"\(.date): \(.verb)"'
}
2025-05-05T08:27:13.718000+00:00: create
2025-05-05T10:41:28.610000+00:00: delete
2025-05-05T10:41:52.485000+00:00: create
real 0m0.414s
user 0m0.273s
sys 0m0.292s
note about aws logs start-query
After writing this article, I discovered aws logs start-query as another way to get similar data. So far, I don't see the gain; maybe if I ever need to process much bigger data sets, it might be of interest.
It requires submitting a query, waiting for it to complete, and then fetching the output.
query_id=$(aws --profile "${profile}" logs start-query \
  --log-group-name "/aws/eks/${cluster_name}/cluster" \
  --start-time $(($(date +%s) - 36000)) \
  --end-time $(date +%s) \
  --query-string "fields @timestamp, @message | sort @timestamp desc" \
  | jq -r .queryId)
sleep 10 # wait for some time for the result to be here.
aws --profile "${profile}" logs get-query-results --query-id "${query_id}" | head -300
{
"queryLanguage": "CWLI",
"results": [
[
{
"field": "@timestamp",
"value": "2025-05-05 15:59:56.148"
},
{
"field": "@message",
"value": "{\"kind\":\"Event\",\"apiVersion\":\"audit.k8s.io/v1\",\"level\":\"Request\",\"a..."
},
{
"field": "@ptr",
"value": "..."
}
],
[
{
"field": "@timestamp",
"value": "2025-05-05 15:59:56.148"
},
{
"field": "@message",
"value": "{\"kind\":\"Event\",\"apiVersion\":\"audit.k8s.io/v1\",\"level\":\"Metadata\",\"aud...."
},
{
"field": "@ptr",
"value": "..."
}
],
...
This results in a long feedback loop, which is not practical when trying to extract
information from the data. Therefore, I would be tempted to still export the
data and process it with jq. Doing so would be harder than with aws logs tail,
because I would have to extract the @message and also inject the
@timestamp anyway. I may investigate this path another time, but for now, it
seems unnecessarily complicated for the needs I have.
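If I ever revisit that path, the arbitrary sleep could at least be replaced by polling the query status until it reaches a terminal state. A rough sketch, using the CLI's built-in --query/--output flags instead of jq (the wait_for_query helper is my own invention, not part of the CLI):

```shell
# Poll get-query-results until the query reaches a terminal status,
# then print that status.
wait_for_query() {
  while :; do
    status=$(aws --profile "${profile}" logs get-query-results \
      --query-id "$1" --query status --output text)
    case "${status}" in
      Complete|Failed|Cancelled|Timeout) echo "${status}"; return ;;
      *) sleep 2 ;;
    esac
  done
}
```

Then wait_for_query "${query_id}" blocks until the results are actually ready, instead of hoping that 10 seconds is enough.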
[1] The resource that creates this log group:

resource "aws_cloudwatch_log_group" "this" {
  count = local.create && var.create_cloudwatch_log_group ? 1 : 0

  name              = "/aws/eks/${var.cluster_name}/cluster"
  retention_in_days = var.cloudwatch_log_group_retention_in_days
  kms_key_id        = var.cloudwatch_log_group_kms_key_id
  log_group_class   = var.cloudwatch_log_group_class

  tags = merge(
    var.tags,
    var.cloudwatch_log_group_tags,
    { Name = "/aws/eks/${var.cluster_name}/cluster" }
  )
}