38 results found
-
Add .NET Counter publication to App Insights on Startup
There is a whole suite of .NET counters available to publish to App Insights. This is a low-lift modification that lets us gain many insights, directly inside App Insights Metrics, that we can use to diagnose customer issues.
This enables us to do performance investigations that otherwise require manual intervention (such as downloading .ETL files and opening them in PerfView or capturing dump files and analyzing them).
We can skip these manual steps and jump right to "close to root causes" by publishing this information. There is little cost and no significant performance degradation associated with…
1 vote -
Dynamic Scaling in DXP - improve performance and reduce costs
The default hardware SKU for Optimizely is P1V3. P1V3 has only 2 cores, while P2V3 has 4. Because the number of cores available to the system dictates the default number of threads the system attempts to regulate, it would be best to move our default hardware SKU from P1V3 to P2V3. However, doing so would increase costs, which we do not necessarily want. This is a proposal to decrease costs while simultaneously increasing hardware resources during peak hours.
The idea is to increase a customer's lower environment hardware to P2V3 during "working hours" (to…
1 vote -
25 Production Service Buses Broken
We need to define what "working" means for a service bus so that reliability engineering can maintain reliability based on these metrics.
When a service bus no longer functions, reliability engineering should be equipped with the capacity to upgrade service buses in order to meet a component-specific SLA.
We need product management to define what "working" means for a service bus so reliability engineering can respond appropriately when a service bus is "broken", rather than having to go back to PM as though we need an exception for every broken service bus.
We also need monitoring in place to ensure…
1 vote -
Assess binding settings to resolve resets
We need to evaluate whether applying the setting Microsoft suggests is worthwhile to resolve resets. As part of that effort, we would have to evaluate the risks of doing so.
1 vote -
Align IQI Health Check with Dashboard Actionable Reports
Currently, the IQI report offers information such as a breakdown of HTTP status codes. Unfortunately, that does not give the customer or partner information they can act on, because it requires further investigation on their part to make it actionable.
We want to bridge the gap between the data and being able to take action, on our side. The information we offer within the Dashboard is much closer to directly actionable; in many cases it is directly actionable.
For example, on the dashboard we offer an EpiCms database deadlock report. The customer or partner needs to identify which…
1 vote -
IQI Report - App Insights as the Source of Truth
There are multiple similar functions happening within Optimizely. They all need to use the same data as the single source of truth. They are:
Incident Management, primary data is App Insights.
Deep diagnostics done by Problem Management, primary data is App Insights
App Insights Dashboard
IQI Health Check reporting, primary data for health information (aside from Pingdom's availability info, which is used for legal reasons) is Cloudflare.
The inconsistency of source data causes issues because we cannot make a consistent presentation of the information when the source being utilized is entirely different.
It also adds possibility of…
1 vote -
Identify and Aid Customers with Production Live-Locks
There's the concept of a "dead" lock and a "live" lock. A live-lock is essentially a race condition within a production environment. It causes stair stepping of CPU usually until the server crashes.
This often happens when a developer accidentally uses a non-thread-safe object in a multi-threaded manner.
The object being used (for example a HashSet) needs to be identified and replaced with a thread-safe type so that the live-lock goes away and the CPU goes back to normal.
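The fix described above can be sketched in a language-neutral way. The post is about .NET, so the following Python is only an illustration of the idea (serialize access to a shared collection), not the code a customer would ship. `ThreadSafeSet` and `fill_concurrently` are hypothetical names introduced for this sketch.

```python
import threading


class ThreadSafeSet:
    """Hypothetical wrapper: a set whose reads and writes are serialized by a lock."""

    def __init__(self):
        self._items = set()
        self._lock = threading.Lock()

    def add(self, item):
        # Only one thread at a time may touch the underlying set.
        with self._lock:
            self._items.add(item)

    def __contains__(self, item):
        with self._lock:
            return item in self._items

    def __len__(self):
        with self._lock:
            return len(self._items)


def fill_concurrently(shared, workers=8, per_worker=1000):
    """Have several threads add disjoint ranges of integers to `shared`."""

    def work(offset):
        for i in range(per_worker):
            shared.add(offset * per_worker + i)

    threads = [threading.Thread(target=work, args=(n,)) for n in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared
```

The design choice is to keep the lock inside the collection, so callers cannot forget to take it. In .NET the equivalent step would be swapping the shared object for a type that is documented as thread safe.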
1 vote -
Down-Sampling Service Bus App Insights
Much of the log analytics costs come from voluminous amounts of service bus activity that is largely useless for analytics purposes. We could generally use a fraction of the analytics and we would be just fine.
For diagnostic purposes, we generally need to inspect the contents of the service bus itself to identify problems.
1 vote -
Consider shifting from Adaptive Sampling to Fixed Sampling
There's a known bug in Adaptive Sampling that prevents us from getting accurate analytics from App Insights. It calls the value of App Insights largely into question because we cannot tell whether the metrics within App Insights are accurate or not.
Moving from Adaptive Sampling to Fixed Rate Sampling resolves this issue, but it also can cause an increase in log analytics and the possibility of exceeding the log quota.
If we can come up with a weekly way to determine the appropriate sampling rate for a given type of log data for a customer and tweak the analytics…
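One possible way to derive such a rate is sketched below. This is an assumption-laden illustration, not an agreed method: `fixed_sampling_percentage`, `weekly_volume_gb` (the unsampled volume observed for one log type) and `weekly_budget_gb` (the share of the quota allotted to it) are hypothetical names. It rounds to a percentage of the form 100/N, which is the form the Application Insights documentation recommends for fixed-rate sampling as far as I know.

```python
import math


def fixed_sampling_percentage(weekly_volume_gb, weekly_budget_gb):
    """Return a fixed sampling percentage that keeps volume within budget.

    Both arguments are hypothetical inputs: the unsampled weekly volume for
    one log type and the weekly budget assigned to that log type.
    """
    if weekly_budget_gb <= 0:
        raise ValueError("weekly_budget_gb must be positive")
    if weekly_volume_gb <= weekly_budget_gb:
        # Everything fits, so keep every item.
        return 100.0
    # Keep 1 item in N, with N rounded up so the budget is not exceeded.
    n = math.ceil(weekly_volume_gb / weekly_budget_gb)
    return 100.0 / n
```

For example, 40 GB of unsampled data against a 10 GB budget gives N = 4, that is, a 25% sampling rate.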
1 vote -
Identify and Communicate Crashing Instance Causes by Exception/Log Inspection
Due to my elevated level of access, nobody on the team except Erik gets these error emails from Microsoft.
We need an easy way to address these with customers through the ticketing system.
These are causing instance crashes for the sites. Here's an example of what's happening on moco.
Here are the email threads I'm receiving from Microsoft that show they're having outages.
1 vote -
IP Address restriction in Cloudflare
It would be great if we could configure an IP address whitelist in Cloudflare so that only a specific set of source IP addresses is allowed to access our DXP instances. This would allow us to block public access to non-production environments.
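The rule being requested amounts to a membership check of the source address against a list of allowed networks. The sketch below only illustrates that logic with Python's standard `ipaddress` module. It is not Cloudflare configuration, and `ALLOWED_NETWORKS` and `is_allowed` are hypothetical names using documentation-reserved example addresses.

```python
import ipaddress

# Hypothetical allowlist, using address ranges reserved for documentation.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),    # e.g. an office range
    ipaddress.ip_network("198.51.100.17/32"),  # e.g. a single build agent
]


def is_allowed(source_ip, allowed=ALLOWED_NETWORKS):
    """Return True if source_ip falls inside any allowed network."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in allowed)
```

A request from `203.0.113.45` would pass this check, while one from `192.0.2.1` would be rejected.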
2 votes -
When there are errors on the servicebus we don't log them. It would help if we logged errors for troubleshooting purposes.
When there are issues with the servicebus, it is often not possible to see what went wrong. You can go to the servicebus and look at the graph and see that there are "server errors", "user errors" and "throttled messages", but there is no way to see the details of the server errors or the user errors.
Sometimes we can see a log entry in the CMS logs, but not always, and I guess it depends on the type of error.
If we had more information it would help when troubleshooting customers having problems with the servicebus.
14 votes -
Allow disabling warmup for integration and preproduction
Integration and preproduction environments are sometimes protected by IP whitelists, and in those cases the warmup step always fails with status code 401. The warmup system waits for about 15 minutes until these requests time out. This increases delivery times to these environments, especially when CI/CD is set up.
Ex:
2026-03-10 12:40:24 Information Starting to warm up the targets slots...
2026-03-10 12:40:25 Information Preparing target slot for Go Live (<masked>/slot) (warming up the slot)
2026-03-10 12:52:21 Warning Timed out waiting for all instances for webapp <masked> and slot "slot" to become ready!
2026-03-10 12:52:21 Information Validating deployment ID uniqueness between slots…
7 votes -
rolling restart flag
Please make it possible to flag for a rolling restart (one instance at a time) when executing deployments through DevOps / GitHub release pipelines.
Note:
The old feature request seems to have disappeared:
https://world.optimizely.com/forum/developer-forum/Developer-to-developer/Thread-Container/2023/5/is-it-possible-to-do-a-rolling-restart-from-an-api-call-powershell-cmdlet-or-something-similar/
1 vote -
DXP Management Portal
Restart Site - Have "Restart one instance at a time" checked by default, since leaving it unchecked can cause serious issues upon restart.
Also add the checkbox state to the confirmation message so users are aware of it.
1 vote -
Failover alerts
We need to receive alerts when a site goes into failover. The failover CMS should show a warning that it's disabled during failover (instead of an error message).
9 votes -
Google Tag Gateway
Support Google Tag Gateway with Cloudflare.
1 vote -
Regenerate Content Graph keys and secrets through PaaS portal
Clients should have the ability to regenerate Content Graph keys and secrets in the self-service PaaS portal.
6 votes
Good news - this idea is now being explored by our product and design teams. We’re researching potential solutions and scoping out what an implementation might look like. We’ll share updates here as our thinking evolves.
-
Request for Wildcard Hostname Support on DXP
Wildcard hostnames are not supported on DXP, and all configured hostnames must be explicitly mapped in the CMS admin.
We will have thousands of customers for whom we will be setting up individual sites. Is it possible to configure a wildcard entry on DXP—such as *.procase.riogrande.com—to support this model? Managing tens of thousands of individual URLs in Optimizely DXP would be difficult for both Opti and us, and it would significantly increase our customer onboarding time.
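To make the requested behavior concrete, here is a minimal sketch of how a wildcard entry could be matched against incoming hostnames. It is not how DXP works today. `matches_wildcard` and the hostname `customer1.procase.riogrande.com` are hypothetical, and the sketch assumes the wildcard covers exactly one leading label, as TLS wildcard certificates do.

```python
def matches_wildcard(hostname, pattern):
    """Return True if hostname matches pattern.

    A pattern starting with "*." matches exactly one extra leading label
    (an assumption of this sketch); any other pattern must match exactly.
    """
    hostname = hostname.lower().rstrip(".")
    pattern = pattern.lower().rstrip(".")
    if not pattern.startswith("*."):
        return hostname == pattern
    suffix = pattern[1:]  # e.g. ".procase.riogrande.com"
    if not hostname.endswith(suffix):
        return False
    label = hostname[: -len(suffix)]
    # The wildcard part must be a single, non-empty label.
    return bool(label) and "." not in label
```

With this rule a single entry such as `*.procase.riogrande.com` would cover every per-customer site, while the bare domain and deeper subdomains would still need explicit entries.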
1 vote -
Upload very large single files using the CMS
People need to upload very large files, such as video files, and are running into Cloudflare file size limits.
Being able to upload a very large file into the storage blobs using the Deployment API and then being able to refer to it in the CMS may be a way to work around the size limits in Cloudflare.
The problem with doing this now is that files uploaded via the Deployment API won't land in the same container that the CMS is configured to read from. Also, usually when people add media, a reference in the DB is…
4 votes