Hello friends,

I have not written a post in quite some time (this is the understatement of the year). This post presents a calculator for the Power BI Premium Dataflows workload, together with a brief explanation of its parameters. The parameters are grouped in a few categories:
  1. Parameters that can't be changed by the user - left section in the calculator (you can change these parameters in the calculator app at the bottom of the post for educational purposes only - their values are not under the user's control):
    • Container TTL [min] - this represents the time the M container survives (in minutes). The container might survive longer if it is in the middle of a query: once the query finishes, the "time bomb" is evaluated and, if the TTL has been exceeded, the container is disposed. The container is also disposed if it is idle and the TTL has passed (the assessment is not done only at the end of a query). The value for this parameter is 20 minutes.
    • Memory Usage Factor [%] - this represents the memory usage factor for M. Out of all the memory granted by the infrastructure, the dataflows engine needs to retain a bit for its own operation (orchestration of the ETL process, transaction management, etc.). Everything else (pretty much everything) is used by the actual ETL engine (the M engine). The value for this parameter is 95% (i.e. only 5% is used by the dataflows engine for internal operations; 95% goes to ETL).
    • CPU Usage Factor [multiplier] - the M engine has the ability to execute multiple ETL core data processes (all the transforms that lead to a dataflow entity being created) in parallel. Moreover, each query is executed in a separate process named a mashup container (it is literally a process whose name is MashupContainer.exe - it has nothing to do with any containerization technology; it simply uses the same term). The dataflows engine limits the number of containers of the mashup engine based on a number of criteria. Each container can run only a single query at a given moment. If the number of concurrent queries exceeds the number of containers that the dataflows engine is willing to spend, queries are serialized - the number of containers therefore represents the degree of parallelism for the dataflows engine. One of the criteria used to limit this is the available CPU - ETL processes being CPU "consumers", we do not want to create too many context switches, as that has performance implications. To avoid unnecessary context switching, one should not have more tasks than the number of cores available. However, threads that are blocked on IO operations (e.g. reading data from the data source) provide good parallelism, as the CPU is not used during that time. To complicate matters further, it is very hard to know a priori whether a query is going to use a lot of CPU (e.g. an expensive in-memory join or sort algorithm) or almost none at all (e.g. it simply reads the data, without any transforms, from a remote data source). Further, other workloads (e.g. Analysis Services) run on the same dedicated capacity in parallel with the dataflows engine, which clouds the decision even more. Based on experience, we set the value of the multiplier to 2. That means that, from the point of view of CPU utilization alone, we could use 2x the number of cores (e.g. the P1 sku has 4 [backend, i.e. capacity] cores, so we could have 8 containers from the point of view of CPU restrictions alone; a short arithmetic sketch follows the full parameter list below). Other factors weigh in, and the system will use the smallest of the values computed based on these different criteria.
    • WSS 2nd Container [MB] - (WSS stands for working set size) when the Enhanced (Dataflows) Compute Engine is enabled, the ETL core pipeline (per entity) is split into two parts:
      • read the data and make the necessary transforms (the "ET" part of ETL) - this uses the "main" pool of containers mentioned earlier.
      • write the data into SQL and the underlying storage (blob or ADLS Gen2) - given that this is simply streaming some data, albeit using the same M engine, it does not need a lot of memory per container; therefore, we enabled a secondary pool of containers (the same number of secondary containers as main containers) with a different, much smaller, working set size. This parameter specifies the working set size for each of the secondary containers. The value is 64 MB.
    • Max Cache Container [MB] - represents the maximum cache size per container in MB. Each data source can define a caching strategy so that it does not repeatedly fetch the data from its original location across evaluations (e.g. it can cache an Excel file). The value of this parameter is 2048 MB.
    • Max Temp Size Container [MB] - represents the maximum temporary size per container in MB. This space is used by the data firewall of the M engine (the privacy rules). When the mashup combines data sources with different privacy boundaries and the query plan cannot guarantee that the data flows only in the secure direction (e.g. public => private), the M engine will first cache the necessary data in this temporary space (e.g. the public data in the example just mentioned). The temporary space needs to be large enough to cache all the data needed by a query that cannot otherwise be safely processed. The value of this parameter is currently 2048 MB.
    • Sql Memory Usage Factor [%] - if the enhanced compute engine is enabled (see below), some memory has to be set aside for the SQL engine that is hosted on the dedicated capacity within the dataflows workload. The value of this parameter is 30%.
    • Sql Data Streaming Enabled - inside the engine, the flow can be decomposed into reading and transforming the data and then streaming it to both storage (blob or ADLS Gen2) and the SQL database (using the secondary containers described above). The alternative is to read, transform, and write the data in the same M query and then upload the data from the blob into the SQL database. The value of this parameter is true (we chose the first strategy, the one that uses the secondary containers).
  2. Parameters that can be changed in the UX by the administrator of the capacity in the admin portal/capacities section - middle section in the calculator. The UX looks like this:

    • WSS Container [MB] - (labeled "Container Size (Mb)" in the UX) represents the working set size of the mashup container. The minimum value is 700 MB, and so is its default value. It is recommended to increase it to 1.5-2 GB if complex transforms are involved. Using the calculator, you can see the impact of such a choice (less parallelism, but each operation might finish earlier because more memory is assigned to it). Increasing the container size results in less parallelism because dividing the same amount of memory by a larger container size (to get the maximum number of containers based on the memory criterion) yields fewer containers, each with more memory. Upgrading to a higher sku can allow you to increase the memory per operation while also adding more parallelism (as the higher-end skus have more memory). Experimentation can also help establish which of these levers matters most for the project (increase the sku, add more memory per container, add parallelism, etc.).
    • Max Memory [%] - represents the quota (out of the physical memory of the dedicated capacity) given to the dataflows workload. As a note, the quotas of all of the workloads can be overcommitted (they can add up to more than 100%) - as a matter of fact, the Analysis Services (datasets) workload always has 100% of the memory allocated, so, as a whole, the workloads are indeed overcommitted. However, no single workload can have more than a 100% allocation. There is also an absolute minimum of about 1200 MB for the dataflows workload. This is not visible per se in the UX, but you will get an error if you enter a percentage that results in an absolute amount of memory smaller than these 1200 MB. For the P1 sku, this translates to the fact that the smallest percentage you can assign is 5% (the sketch after this parameter list walks through this arithmetic).
    • Enhanced Engine Enabled - (labeled "Enhanced Dataflows Compute Engine" in the UX) represents the customer's choice to direct the dataflows workload to use and manage an internal SQL engine that caches the data in order to deliver better performance (by rewriting queries that would otherwise use Azure storage so that they use the cached data in SQL instead - but more about that in another post).
  3. Parameters that depend on the hardware choice (dedicated capacity instantiated) - middle section in the calculator:
    • Sku - represents the designated moniker for the capacity that was instantiated, from P1 to P5. P1-P3 are mainstream capacities that can be instantiated in every region without the customer needing to contact us. P4 and P5 are super-large capacities that can be instantiated only through specific deals with Microsoft.
    • Hardware Cores - not a parameter per se; it is driven by the selection of the sku. One note: the hardware cores are different from the v-cores. For example, P1 has 8 v-cores: 4 frontend and 4 backend (dedicated). These 4 backend cores are the hardware cores the calculator considers for its internal projections - this is why P1 shows 4 cores and not 8 (which would include the 4 shared frontend cores).
    • Hardware Memory - again, not a parameter per se, but a projection of the selection of the sku - it represents the amount of memory of the dedicated capacity.
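
To make the arithmetic behind some of these parameters concrete, here is a minimal sketch in Python. It only illustrates the two simple calculations called out above - the CPU-based container cap and the ~1200 MB floor for Max Memory [%]; the function names and the assumed 25 GB of memory for a P1 are mine for illustration, not values surfaced by the product.

    # Illustrative only - not the engine's code.
    CPU_USAGE_FACTOR = 2          # the multiplier described in the first section
    MIN_WORKLOAD_MB = 1200        # approximate absolute minimum for the dataflows workload

    def cpu_container_limit(backend_cores):
        # From the CPU point of view alone, the engine is willing to run up to
        # (backend cores x factor) containers; other criteria may lower this.
        return backend_cores * CPU_USAGE_FACTOR

    def smallest_allowed_max_memory_pct(hardware_memory_mb):
        # Smallest whole percentage whose absolute value clears the ~1200 MB floor.
        pct = 1
        while hardware_memory_mb * pct / 100 < MIN_WORKLOAD_MB:
            pct += 1
        return pct

    p1_cores, p1_memory_mb = 4, 25 * 1024   # assumed P1 figures: 4 backend cores, ~25 GB
    print(cpu_container_limit(p1_cores))                   # -> 8
    print(smallest_allowed_max_memory_pct(p1_memory_mb))   # -> 5 (5% of ~25 GB is ~1280 MB)
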
The results are computed and presented in the right section of the calculator:
  1. Max Containers [#] - represents the maximum number of M containers (and, therefore, the ETL degree of parallelism), calculated once the memory and CPU constraints are taken into account. This number applies to the main container pool (not the secondary container pool described above). A rough sketch of this computation follows this list.
  2. Container Max WSS [MB] - represents the actual maximum working set size for each of the main containers.
  3. Has Secondary Containers - indicates whether the engine will use a secondary pool of M containers. This pool (if used) has less memory per container (equal to the WSS 2nd Container [MB] parameter), and the number of secondary containers equals the number of containers in the main pool.
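
Putting the inputs and results together, below is a rough sketch of how the three results could be derived. The min-of-constraints shape follows from the text above ("the system will use the smallest of the computed values"); the exact way the memory budget is split between the M pool, the SQL cache, and the secondary containers is my simplification for illustration, not the published formula.

    # Rough illustration of the calculator's outputs - a simplification, not the exact engine formula.
    MEMORY_USAGE_FACTOR = 0.95       # share of the workload memory handed to the M engine
    SQL_MEMORY_USAGE_FACTOR = 0.30   # share assumed to be reserved for SQL when the enhanced engine is on
    CPU_USAGE_FACTOR = 2
    WSS_2ND_CONTAINER_MB = 64

    def dataflows_calculator(hardware_memory_mb, backend_cores, max_memory_pct,
                             wss_container_mb, enhanced_engine_enabled):
        workload_mb = hardware_memory_mb * max_memory_pct / 100
        m_engine_mb = workload_mb * MEMORY_USAGE_FACTOR
        if enhanced_engine_enabled:
            # Assumption: the SQL share is carved out of the same budget.
            m_engine_mb *= (1 - SQL_MEMORY_USAGE_FACTOR)
        # Simplification: the secondary pool is used when streaming to SQL (enhanced engine on).
        has_secondary_containers = enhanced_engine_enabled
        per_container_mb = wss_container_mb + (WSS_2ND_CONTAINER_MB if has_secondary_containers else 0)
        memory_limit = int(m_engine_mb // per_container_mb)
        cpu_limit = backend_cores * CPU_USAGE_FACTOR
        max_containers = max(1, min(memory_limit, cpu_limit))
        return max_containers, wss_container_mb, has_secondary_containers

    # Example: assumed P1 figures (25 GB, 4 backend cores), 20% quota, 700 MB containers, enhanced engine on.
    print(dataflows_calculator(25 * 1024, 4, 20, 700, True))   # -> (4, 700, True) under these assumptions
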

The calculator is embedded in the page below for your convenience (direct link is here):



