Application-level Purple Teaming: A case study
- 29 Sep 2020
Attack-aware applications have been discussed in AppSec for over a decade - the concept that an application can detect that it is being attacked and fight back. However, the majority of organisations don’t do this, or even the basics of detection and response for those apps, leaving a visibility gap on critical assets and a notable hole in detection and response strategies.
In our separate article on the ‘Our Thinking’ section of the F-Secure website, we discussed the history of attack-aware applications and our assessment of why this hasn’t taken off. Specifically, we believe all of the current approaches (native code-level; tools-based using Web Application Firewalls (WAFs) and RASPs; anti-fraud techniques) are either lacking in coverage or too effort-intensive. As such, we have instead been experimenting with a more modular, iterative way of introducing attack-awareness to apps: application-level purple teaming.
In this post, we will walk through a case study where we performed this exercise against an existing application and the positive effect this had. This will include:
This should provide useful insights for organisations similarly looking to review and improve the detection and response capabilities of their critical apps and enhance their resilience.
This approach is context-dependent and will involve collaboration between security, development and, latterly, detection and response/blue team members. However, the effort required from each of these teams is generally short-term, involving each of them in turn through specifically focused sprints. Throughout all of this, we have found that the general principles these activities should align to are:
Where AppSensor-style approaches to attack-aware applications have focused heavily on the code level and on software security ways of working, application-level purple teaming is deliberately iterative and collaborative, aligning more closely with how detection and response teams already work.
In general, purple teaming is the sharing of knowledge between offensive/red and defensive/blue teams; red teams emulating real-life attacker tactics, techniques and procedures (TTPs) and blue teams reviewing and enhancing their resilience. In that regard, the purple teaming we discuss in this article will be significantly different to expectations:
The extent to which the above statements hold true, or to which the approach closer resembles a traditional purple team, is dependent on the existing level of sophistication within the app and the organisation.
Throughout our exercise, we were seeking to identify whether the application did each of the following:
Our general observations of client applications lead us to believe that a majority of large organisations collect some form of telemetry from their applications. However, this is usually focused on activities such as detecting fraud, not on security events. Security-focused alerting and response therefore generally do not exist, beyond the out-of-the-box capabilities provided by tools like WAFs. For most organisations, the first significant focus would therefore likely be security-focused refinements to that logging. The focus should then shift to the introduction of exemplar, high-fidelity alerting for the categories of attack relevant to the app and the organisation.
The application in our case study had little existing security-focused logging, and no existing alerting. Applications with sufficient existing logging and an established detection and response/blue team elsewhere in the business would likely start at a later stage than our case study - with further collaboration with the blue team, and the use of more numerous and sophisticated payloads.
This post describes our approach to application-level purple teaming for a specific app; yours would likely be different. We’ve started discussing this approach with a few different organisations, who each have different requirements and levels of sophistication. For this case study, the general flow looked like this:
To explore the above, we worked with a client to identify a critical application as a proof-of-concept - a file sharing portal. The application allowed file uploads and downloads, with access controls limiting access to specific vendors and files. We considered attacks from two distinct user roles:
The app had been pentested before as part of its development lifecycle, with some low and medium risk vulnerabilities identified, which had since been resolved. It is important to note that trying to find vulnerabilities was not the focus of the application-level purple team exercise; we should assume this work has already been undertaken, and focus our test cases on factors beyond just vulnerabilities, such as abuse of legitimate functionality. This is particularly true for internal users.
The application had a front-end AngularJS UI, with front and back proxies in place and multiple API micro-services. Each of those APIs and proxies performed logging, meaning we had the following log sources:
Early in the process, the developers highlighted to us that while reasonably verbose logs were maintained, they were not security-focused. Through this, they realised that most of the attacks we discussed with them would not be easily visible from those logs. Specifically, the following key threats and attacks were discussed:
Using the insights from this threat modelling, we produced a list of areas to focus our work on, alongside specific test cases. Throughout this post, we will reference the following terminology:
Firstly, we agreed with the client a range of key metrics we required visibility of. Note that these are independent of our test case categories, acting as a broad underpinning to any application-level logging. To understand what malicious or anomalous behaviour looks like, it is critical to first understand general application load and user behaviour.
What? - How the application and its users behave, and general application health metrics, split into login metrics and general request metrics. In general, we should focus on authenticated traffic, to avoid unnecessarily large data sets from common, low-effort attacks, such as those using automated tooling. While we might consider those attacks, we will probably conclude that being aware of them isn’t cost-effective.
Metrics:
Login and user metrics:
Request metrics:
For our categories and test cases, we were not seeking to align to a specific framework, nor to OWASP’s AppSensor Detection Points or Top 10. At this point, that felt unnecessary - trying to solve a problem we weren’t yet sophisticated enough to solve. Rather, our categories and test cases should be relevant to our identified threat model. As such, the following broad categories were identified:
For each of those categories, we created between 5 and 10 test cases and sought to capture a yes/no/partially status across the logging > alerting > responding journey for each of them.
However, the MITRE ATT&CK framework is the industry-standard approach for categorising detection and response activities, so our work below will make reference to it. While there are often no direct mappings to ATT&CK attacker techniques, the high-level phases are analogous and can be useful in framing application-level efforts as part of a broader detection and response strategy. This is shown in the diagram below of F-Secure’s adapted kill chain, with those phases mapped onto ATT&CK ones beneath it:
Where attacker activities would usually be detected in phases such as C2, persistence, internal reconnaissance and lateral movement, many of our actions here could be detected earlier.
Enumeration:
What? - Where users attempt to iterate through multiple files, objects, or users to violate access controls and perform privilege escalation.
Why? - Due to the app’s purpose as a file sharing application, access control issues and enumeration of files, vendors and users were considered the most likely attacker actions.
Relevant MITRE ATT&CK phases: Discovery, Initial Access
Test cases:
Injection:
What? - Where users attempt to manipulate application content and behaviour to compromise systems and users. This includes common vulnerabilities such as XSS and SQL injection.
Why? - This category encompasses the most commonly exploited web app vulnerabilities, such as those within the OWASP Top 10 and those attempted by automated tooling.
Relevant MITRE ATT&CK phases: Initial Access, Execution, Persistence
Test cases:
Whether these test cases were initially logged or alerted is described in iteration #1; improvement activities were then carried out by the development team and validated in iteration #2.
The 3 application log sources identified in our threat modelling contained indicators which could help with logging and alerting on malicious actions across our test cases, but often missed useful metadata such as timestamps or source IP addresses. Combined with a lack of correlation or request IDs, this would make it difficult to correlate user actions and trace events - whether malicious or benign.
All of these logs (both at an application level and a proxy level) were also only visible by viewing the log files on their respective hosts - they were not aggregated into a centralised source that could be used as one app-wide dataset.
Several of our key login and request metrics were partially visible from the logs, but often only implicitly. For example, login events were visible via the relevant “/login” HTTP requests, but counts of those login events were not maintained. Timestamps and source IP addresses were frequently not visible, meaning it would be difficult to trace those events or identify when or from where a user had logged in.
A number of logical issues were identified at the code-level which prevented thorough logging of failed logins or OTP attempts. For example, a failed login counter was incremented, with accounts locked when it reached 5. However, that counter included both username/password and OTP failures, meaning it was not possible to differentiate between the two, nor identify whether a user had eventually gained account access. The value was also reset to 0 once an account was unlocked, meaning the total number of historical account lockouts or failed logins could not be counted.
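To make these events traceable, we discussed with the dev team what a security-focused audit record could contain. The sketch below is purely illustrative - every field name is hypothetical - but it captures what the existing logs lacked: a timestamp, a source IP, a request ID for correlation, and separate counters for password and OTP failures that survive an account unlock.

# Illustrative failed-login audit event (all field names hypothetical)
event: login_failed
timestamp: "2020-09-29T10:15:32Z"
source_ip: "203.0.113.10"
request_id: "b3d2c1a0-7f4e-4c2b-9a1d-5e6f7a8b9c0d"   # correlates with proxy and API logs
user_id: 1042
failure_type: otp                  # password and OTP failures logged as distinct types
failed_password_count: 2           # running totals, not reset when the account unlocks
failed_otp_count: 1
account_locked: false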
The application was not capable of maintaining a count of overall requests to it, either on a per-user level or across the application as a whole. To properly identify peaks in traffic, potential denial of service attacks or many other suspicious indicators, it is critical to first understand what healthy application traffic looks like.
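That baseline did not yet exist, but once request counts are aggregated centrally (as happens in iteration #2), flagging deviations from it is well suited to a rule engine. For instance, ElastAlert’s spike rule type compares current volume against a reference window - the sketch below is hypothetical, with assumed index and threshold values:

# Hypothetical spike rule comparing current request volume against the previous window
name: Request volume spike across the application
type: spike
index: proxy-logs*           # assumed index of aggregated proxy logs
timeframe:
  minutes: 15
spike_height: 3              # alert when requests triple versus the reference window
spike_type: "up"
threshold_ref: 500           # ignore windows with fewer than 500 baseline requests
alert:
- "debug"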
Due to the aggregation and visualisation issues described above, there was no central source of this log data, nor a way of visualising it and any resultant alerts.
As the app was a file transfer application with multiple distinct user roles, a majority of our enumeration test cases focused on enumerating valid files and user profiles. This encompassed both:
The majority of our enumeration test cases were partially visible in the logs, but without sufficient detail (such as timestamps and source IPs). No alerting was present. The only response was account lockout when password brute-force attempts were performed - this is arguably the best understood example of an attack-aware application and the one most organisations likely have in-place.
Interestingly, while username enumeration and password spraying attempts were handled by the application through preventative controls, no detective controls were present. Unsuccessful login attempts may be logged, but due to the lack of timestamps and source IP addresses, it was not possible to identify if multiple usernames had been attempted from the same source in a given time period.
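Once those timestamps and source IP addresses are present and aggregated - as they later were in iteration #2 - this pattern becomes detectable. As a hedged sketch, an ElastAlert cardinality rule along the following lines could flag many distinct usernames being attempted from a single source; the index and field names are assumptions that would need to match the real log schema:

# Hypothetical rule: many distinct usernames attempted from one source IP
name: Possible password spraying from a single source
type: cardinality
index: audit_logs*
query_key: source_ip             # group login failures by source
cardinality_field: username      # count distinct usernames per source
max_cardinality: 10              # alert if more than 10 usernames within the timeframe
timeframe:
  minutes: 30
filter:
- term:
    event: "login_failed"
alert:
- "debug"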
Access control issues such as file enumeration and user profile enumeration were partially visible through key indicators such as HTTP 403 and 404 responses. However, due to the variation in those indicators, they could not reliably be used to identify access control violations. Alerting based solely on a 403 or 404 would give little insight into the cause of the malicious activity and would present little benefit to anyone triaging the logs; other factors, such as the API controller responding, could help clarify the cause of the response code. Multiple variants of enumeration were attempted to map out the resultant HTTP responses and indicators for this app.
As the internal user role was intended to access all files across all vendors, no file enumeration access control violations would occur. In this case, suspicious behaviour might instead be a user downloading a high quantity of files in a given time period (e.g. 100 files in an hour), rather than 403 responses. This would therefore not be flagged explicitly in the logs, but could be inferred by maintaining a count of such events per user.
Directory brute-force attempts are traditionally indicated by HTTP 404 responses. However, with JavaScript front-ends supported by back-end APIs, this is non-trivial to identify. Some traditional “server-side 404s” may occur if API paths (e.g. https://www.example.com/api/v1/user1) were enumerated. However, many “client-side 404s” visually displayed a 404 to the user due to the lack of an API route (e.g. https://www.example.com/users), but neither generated a 404 response nor appeared in the server logs; rather 200 responses with “404” in the response body were visible in the proxy logs.
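Any alerting on directory brute-forcing therefore needs to account for both forms. As a rough sketch - the status and response_body field names are assumptions about how the proxy logs end up being parsed - an ElastAlert filter would need to cover something like:

# Hypothetical filter covering both "server-side" and "client-side" 404s
filter:
- query:
    query_string:
      query: 'status:404 OR (status:200 AND response_body:"404")'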
The test cases we used for this category encompassed basic examples of these types of attacks. As expected, all were partially logged (i.e. technically visible in the logs, but not explicitly flagged, and lacking timestamps and detailed information), with none alerted or responded to. There were also no in-built rulesets to help detect a broad range of payloads.
Notably, the detection and prevention of these specific attack types (e.g. Cross-Site Scripting, SQL injection) has been the area most frequently outsourced to tools such as WAFs in the past.
Following iteration #1, the organisation decided to use the ELK stack (Elasticsearch, Logstash, Kibana), consistent with its use elsewhere in the team and the business. A majority of logs were now sent to Logstash and aggregated into various Elasticsearch indexes. That log data was stored and queried in Elasticsearch, which could be used to look for key indicators, aided greatly by the increased presence of timestamps and source IP addresses.
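How the logs reach Logstash will depend on the environment; as a minimal sketch, if a lightweight shipper such as Filebeat were running on each host, its configuration could be as simple as the following (the paths and hostname are purely illustrative):

# Illustrative Filebeat configuration for shipping app and proxy logs to Logstash
filebeat.inputs:
- type: log
  paths:
    - /var/log/rails-app/*.log       # hypothetical application log path
    - /var/log/proxy/access.log      # hypothetical proxy log path
output.logstash:
  hosts: ["logstash.internal:5044"]  # hypothetical Logstash endpoint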
At the suggestion of the dev team, Yelp’s ElastAlert tool was used to perform alerting. ElastAlert builds on top of Elasticsearch queries, allowing the team to query the log data in Elasticsearch and then match on the results. Where those query results weren’t sufficient, it also supports “match enhancements”, which perform a second query based on the results of the first - for example:
ElastAlert rules themselves are defined via YAML files, some of which are shown below. These YAML definitions may or may not reference match enhancements or built-in ElastAlert rules. In our example, we created or used rules for enumeration attempts, user lockouts and failed and successful logins. ElastAlert was then used to send all of the logs and any rule hits to the Kibana visualisation tool. Additionally, the Sigma tool supports compilation to ElastAlert, which would enable tool-agnostic creation of rules for the app.
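As a hedged illustration of that tool-agnostic option, the first ElastAlert rule shown later in this post could instead be expressed as a Sigma rule roughly like the one below and then compiled via Sigma’s ElastAlert backend; the logsource values are placeholders and would need to match the Elasticsearch field mappings:

# Illustrative Sigma equivalent of the access control violation rule
title: Access control violation downloading a file
status: experimental
logsource:
  product: fileshare-app        # hypothetical logsource definition
detection:
  selection:
    status: 403
    controller: 'Api::V1::DownloadsController'
  condition: selection
level: medium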
A key benefit of this approach was that all of our log data was visible in Kibana, not just known-bad alerts. This would later allow detection and response teams to hunt through legitimate user activity and events surrounding alerts, similar to how they function in infrastructure detection and response.
In Kibana, the dev team created the following “Dashboards” to monitor, as well as the raw log data in the “Discover” tab:
Examples of powerful Kibana views for some of our test cases and attack types are visible below.
No approach for response to any detected test cases was yet defined. At this point, it was deemed premature to discuss this with the team.
Incremental changes to the data the application logged and proper aggregation to Elasticsearch and Kibana meant that our baseline metrics were much more visible. We could see key requests and responses for relevant user behaviour in Kibana, which could be used as the basis for high-fidelity alerting.
Account lockout actions were visible in our “Audit logs” Kibana dashboard, with key identifiers highlighted:
Teams using Kibana could identify the request ID or user ID in question and view all logs related to that. This would allow detection and response teams to identify any other linked attacks or anomalous activity. For example, all successful login events, OTPs and other conditions were visible in the “Audit logs” dashboard:
Further login-related metrics were summarised in the “Log ons” dashboard:
This provided detailed insight into our login-related metrics, such as who logged in when, and counts of both successful and failed logins. Request-related metrics were also visible, in the “Metrics” dashboard. This could be used to identify expected traffic over a given time period, as well as actual traffic, filtered by individual users and HTTP response codes:
This analysis was performed offline - i.e. in Elasticsearch, ElastAlert and Kibana - meaning the application itself was unaware of login alerts or request metrics. As such, response could not yet be achieved. In iteration #1, we had discussed a long-term plan for this with the developers: to introduce AppSensor-style code-level logic, so that the application was natively aware of attacks and could respond to them. However, this was not our preferred solution, due to the high amount of coding effort involved, which was seen as at odds with functional development. During iteration #2, the dev team suggested that they could instead use ElastAlert and webhooks to send rule hits to the app itself, rather than just to Kibana. Those webhooks could then reuse existing application functionality to respond appropriately.
Further incremental recommendations included new test cases we could assess, new alert categories, refinements to the Kibana dashboards and displaying source IP and login time to the end user.
All of the test cases identified for this category now generated log data, which was stored in Elasticsearch and visible in Kibana. A small number of exemplar, high-fidelity alerts were also created as a result of that logging. No improvement in response had been attempted, as this was not yet our focus.
The log data’s presence in Elasticsearch allowed us to query aggregated data over specific time periods. The dev team established thresholds for certain attack indicators to generate alerts from, such as more than 5 files downloaded by an internal user in an hour, or more than 5 403s occurring in a 5-minute period. This allowed the creation of a small number of enumeration alerts:
Here we can see the following mapping on our logging > alerting > responding journey:
name: Access control violation downloading a file
type: any
index: rails-app*
realert:
  minutes: 5
timeframe:
  hours: 1
filter:
- query:
    query_string:
      query: 'status:403 AND controller:"Api::V1::DownloadsController"'
alert:
- "debug"
match_enhancements:
- "elastalert_modules.app_enhancements.SessionDataEnhancement"
Here we can see the following mapping on our logging > alerting > responding journey:
name: Download frequency for user exceeds 5 per hour
type: frequency
index: audit_logs*
realert:
  minutes: 0
num_events: 5
attach_related: true
query_key: user.id
timeframe:
  hours: 1
filter:
- term:
    message: "successfully downloaded an upload"
alert:
- "debug"
Many of these alerts were very coarse-grained, with a lack of insight into specific access control violation causes. Additionally, the defined thresholds were too low, resulting in a number of false positives. However, the foundations for that alerting were now in place. Incremental recommendations were therefore made to increase the value and accuracy of these alerts.
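As an example of the kind of refinement suggested, the download frequency rule could be tuned by raising its threshold and suppressing repeat alerts for the same user; the numbers below are illustrative rather than agreed values:

# Illustrative tuning of the download frequency rule to reduce false positives
name: Download frequency for user exceeds 50 per hour
type: frequency
index: audit_logs*
num_events: 50        # raised from 5 after observing legitimate usage
query_key: user.id
realert:
  minutes: 60         # suppress repeat alerts for the same user for an hour
timeframe:
  hours: 1
filter:
- term:
    message: "successfully downloaded an upload"
alert:
- "debug"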
The strong foundations established by the use of Elasticsearch, Kibana and ElastAlert meant that it was now trivial to introduce new alerts. Indeed, while we were working with the dev team during iteration #2, numerous new alerts (such as access control and account lockout ones) could be introduced. Long-term, these alerts should be defined and maintained by the detection and response team, who have greater insight into attacker actions. At this stage, our focus was on introducing exemplar, high-fidelity alerts and identifying application-specific indicators.
As with our metric capturing and visualisation, the alerting performed here was done offline, but could be used to enable responsiveness using the ElastAlert tool and webhooks, as described above.
Throughout this process, the importance of the dev team was evident: they understood what they were resolving and what useful log data to alert on looked like, rather than simply following prescribed steps in a report issued to them. For example, a list of resultant 403 and 404 responses may have been identified by the security team, but those responses may have changed by the time the dev team implemented the change. When the dev team understood what to log and alert on, they were far more effective at quickly and accurately implementing alerts - putting security in the minds of developers throughout the process is always a key goal. This will also make it easier for detection and response teams to understand application context and define their own alerts when they become involved at later stages.
Our recommendations for the devs to implement here had included a few different short-, medium- and long-term approaches:
However, it became apparent that these actions wouldn’t be a long-term or hugely beneficial use of time. Any hard-coded, high-fidelity detections would effectively be a denylist - our own WAF, in a sense. Integrating a commonplace WAF and using the logs it generated therefore seemed the most pragmatic solution - though one that would be a reasonable integration undertaking in itself. For now, it was decided not to introduce any logging or alerting improvements for this category, deferring that to a future iteration.
Our initial threat modelling exercise helped us understand the application and generate appropriate test cases. This would also allow us to demonstrate where we were focusing and where we had improved, and evidence the value of the exercise to senior stakeholders. In iteration #1, we baselined the application’s current detection and response capability across those test cases, noting particular weaknesses in its logging and making suggestions for what alerting to begin including. In iteration #2, we reviewed the efficacy of those logging changes and made incremental suggestions for new alerting to include.
The main changes introduced so far in our case study can be summarised as:
The below tables show the metrics we now captured, alongside the improved logging and marginally improved alerting across our 2 test case categories:
As we described in the article on the ‘Our Thinking’ section of the F-Secure website, our motivation for this research came from engagements where we were able to achieve end-to-end compromise of a client’s estate, using a web application as the entry point, while remaining undetected. While this example application would not yet be in a state to detect or respond to those attacks, the above demonstrates that the organisation is now closer on that journey. Specifically, attacks enumerating user profiles or files, in an attempt to elevate privileges or access sensitive data, could now broadly be detected. Detection of attacks relevant to an end-to-end compromise would rely more on “Injection” detection, and improvement in that area should be evidenced after iteration #3.
The above allows us to easily identify some residual gaps; the next 3 focuses for this application will likely be:
To address the second and third focuses, we plan to reuse existing components again. Specifically, integrating the ModSecurity WAF with the OWASP Core Rule Set should serve as a starting point for many common web security attack types, particularly helping to address the “Injection” gap. This approach was previously considered by AppSensor, but our goal here is to send the WAF logs to our ELK stack as simply another log source, rather than to a bespoke component like AppSensor; we can then hunt through those logs and define appropriate alerts, similar to the above.
Note that we’re only interested in using the WAF for telemetry - to generate logs - not to try and prevent exploitation; as discussed in our separate “Why?” post, prevention is not our focus here. What that log data looks like and how easy it is to get it in the necessary format and sent to Elasticsearch remains to be seen, but it is considered an easier solution than “rolling our own” with hard-coded injection detection, and would provide broad coverage.
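Once those WAF logs are in Elasticsearch, alerting on them should follow the same pattern as the rules above. A rough sketch, assuming the ModSecurity audit log is parsed into its own index and the CRS rule ID surfaces in a field named rule_id (both assumptions about the eventual parsing), might look like:

# Hypothetical rule over parsed ModSecurity/CRS audit logs
name: Repeated CRS rule hits from a single source
type: frequency
index: modsecurity*              # assumed index for parsed WAF audit logs
num_events: 10
query_key: source_ip
timeframe:
  minutes: 10
filter:
- query:
    query_string:
      query: 'rule_id:*'         # any CRS rule match; field name is an assumption
alert:
- "debug"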
A variant of WAFs is RASPs (Runtime Application Self-Protection tools). RASPs integrate into the application via a binary that can be executed on the web server, and have many benefits, including automatic IP-based response, increased telemetry, and support for common detection and response toolsets. However, RASPs alone are not sufficient to make an app attack-aware, due to:
However, if your organisation is already using a commodity RASP, it is likely beneficial to integrate this into your app-level detection and response solution, rather than an open source WAF which may require more tuning.
For responsiveness, our intention is to avoid significant code changes if at all possible, and to avoid duplication of effort. As we’ve already created a source of all of our log data and the logic to generate key alerts - in an offline manner - we don’t want to redo all of that at the code level. The dev team believe that ElastAlert, currently used to send notifications of triggered rules to Kibana, can help us do this. This could involve creating a webhook-style endpoint on the app which ElastAlert posts alerts to, or directly exposing app endpoints such as “lock account”, “introduce CAPTCHA” or “kill session” to ElastAlert (though the latter obviously introduces other architectural risks).
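As a sketch of what that could look like, ElastAlert’s HTTP POST alerter can send each rule match to an endpoint; the application endpoint below is hypothetical and would need to be built (and appropriately authenticated) by the dev team:

# Illustrative use of ElastAlert's HTTP POST alerter to notify the app of a match
name: Account lockout response via webhook
type: frequency
index: audit_logs*
num_events: 5
query_key: user.id
timeframe:
  minutes: 10
filter:
- term:
    event: "login_failed"        # hypothetical field, as in the earlier sketches
alert:
- post
http_post_url: "https://app.internal/api/v1/security/alerts"   # hypothetical app webhook
http_post_all_values: true       # include the matched document fields in the POST body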
Broadly, the design we’re aiming for - its aggregation of logs, its flow along logging > alerting > responding and its reuse of existing components - is shown in the diagram below:
The above are our next steps for this case study. There also exists multiple areas to expand and improve upon for app-level purple teaming as a whole. Our current and planned focuses include:
This is a specific case study of our use of application-level purple teaming for one organisation. Organisations seeking to adopt this approach as part of their AppSec and detection and response efforts should assess the relative strengths of their existing app-level logging, alerting and response, and consider where their starting point should be. Wherever that starting point is, alignment to the 6 high-level principles introduced at the start of this article will help to focus efforts and demonstrate value. For more on that demonstration of value and the history of this area, please refer to the article on the ‘Our Thinking’ section of the F-Secure website.
Verizon DBIR: https://enterprise.verizon.com/en-gb/resources/reports/dbir/
F-Secure - Building cyber resilience by changing your approach to testing: https://www.f-secure.com/en/consulting/our-thinking/cyber-security-readiness
OWASP AppSensor documentation: http://www.appsensor.org/overview.html
OWASP AppSensor detection points: https://wiki.owasp.org/index.php/OWASP_AppSensor_Project#tab=Detection_Points
MITRE ATT&CK framework: https://attack.mitre.org/techniques/enterprise/
ElastAlert docs: https://elastalert.readthedocs.io/en/latest/elastalert.html#overview
ElastAlert GitHub: https://github.com/Yelp/elastalert
ModSecurity: https://www.modsecurity.org/about.html
OWASP ModSecurity Core Rule Set (CRS): https://owasp.org/www-project-modsecurity-core-rule-set/