I have not written a post in quite some time (this is the understatement of the year). This post is a calculator for the Power BI Premium Dataflows workload and a brief explanation of its parameters. The parameters are grouped into a few categories:
- Parameters that can't be changed by the user - left section in the calculator (you can change these parameters in the calculator app at the bottom of the post for educational purposes only - the value of these parameters is not under the user's control):
- Container TTL [min] - this represents the time the M container survives (in minutes). The container might survive longer if it's in the middle of a query. Once a query finishes, the "time bomb" is evaluated and, if the TTL is exceeded, the container is disposed. The container is also disposed if it's idle and the TTL has passed (the assessment is not done only at the end of a query). The value of this parameter is 20 minutes.
- Memory Usage Factor [%] - this represents the memory usage factor for M. Out of all the memory granted by the infrastructure, the dataflows engine needs to retain a bit for its own operation (orchestration of the ETL process, transaction management, etc.). Everything else (pretty much everything) is used for the actual ETL engine (the M engine). The value of this parameter is 95% (i.e. only 5% is used by the dataflows engine for internal operations; 95% goes to ETL).
- CPU Usage Factor [multiplier] - the M engine has the ability to execute multiple ETL core data processes (all the transforms that lead to a dataflow entity being created) in parallel. Moreover, each query is executed in a separate process named a mashup container (it's literally a process named MashupContainer.exe - it has nothing to do with any containerization technology, it simply uses the same term). The dataflows engine limits the number of mashup containers based on a number of criteria. Each container can run only a single query at a given moment; if the number of concurrent queries exceeds the number of containers the dataflows engine is willing to spend, queries are serialized - the number of containers therefore represents the degree of parallelism for the dataflows engine. One of the criteria used to limit the pool is the available CPU - ETL processes are CPU consumers, and we do not want to create too many context switches, as that has performance implications. To avoid unnecessary context switching, one should not have more tasks than the number of cores available. However, threads that are blocked on IO operations (e.g. reading data from the data source) provide good parallelism, as the CPU is not used during that time. To complicate matters further, it is very hard to know a priori whether a query is going to use a lot of CPU (e.g. an expensive in-memory join or sort algorithm) or almost none at all (e.g. it simply reads the data, without any transforms, from a remote data source). Further, other workloads (e.g. Analysis Services) run on the same dedicated capacity in parallel with the dataflows engine, which clouds the decision even more. Based on experience, we set the value of the multiplier to 2. That means that, from the point of view of CPU utilization alone, we could use 2x the number of cores (e.g. the P1 sku has 4 [backend, i.e. capacity] cores, so we could have 8 containers from the point of view of the CPU restriction alone). Other factors weigh in, and the system will use the smallest of the values computed from these different criteria (a small sketch after this parameter list illustrates the building blocks).
- WSS 2nd Container [MB] - [WSS stands for working set size] when the Enhanced (Dataflows) Engine is enabled, the ETL core pipeline (per entity) is split into two parts:
- read the data and make the necessary transforms (the "ET" part of ETL) - this uses the "main" pool of containers mentioned earlier.
- write the data into SQL and the underlying storage (blob or ADLS Gen2) - given that this simply streams data, albeit using the same M engine, it does not need a lot of memory for the container; therefore, we enabled a secondary pool of containers (the same number of secondary containers as main containers) with a different, much smaller, working set size. This parameter specifies the working set size for each of the secondary containers. The value is 64 MB.
- Max Cache Container [MB] - represents the maximum cache size per container in MB. Each data source can define a caching strategy so that it doesn't go back, over and over across evaluations, to the data in its original location (e.g. it caches an Excel file). The value of this parameter is 2048 MB.
- Max Temp Size Container [MB] - represents the maximum temporary size per container in MB. This space is used by the data firewall of the M engine (the privacy rules). When the mashup contains a combination of data sources with different privacy boundaries and the query plan cannot guarantee that the data flows only in the secure direction (e.g. public => private), the M engine will first cache the necessary data in this temporary space (e.g. the public data in the example just mentioned). The temporary space needs to be large enough to cache all the data needed for a query that cannot otherwise be safely processed. The value of this parameter is currently 2048 MB.
- Sql Memory Usage Factor [%] - if the enhanced compute engine is enabled (see below), some memory has to be set aside for the SQL engine that is hosted on the dedicated capacity within the dataflows workload. The value of this parameter is 30%.
- Sql Data Streaming Enabled - inside the engine, the flow can be decomposed into reading and transforming the data and then streaming the data to both storage (blob or ADLS Gen2) and the SQL database (using the secondary containers described above). The alternative is to read and transform the data while writing it at the same time (in the same M query) and then upload the data from the blob into the SQL database. The value of this parameter is true (we chose the first strategy, the one that uses the secondary containers).
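To make the fixed factors above a bit more tangible, here is a minimal sketch (Python, purely for illustration - the function names and the exact way the factors combine are my assumptions, not the engine's actual code) of the two building blocks they imply: the CPU-based container cap and the memory left over for the main container pool.

```python
# Illustrative only - not the calculator's actual implementation.
# Rough building blocks implied by the fixed parameters described above.

CPU_USAGE_FACTOR = 2            # containers allowed per backend core (CPU criterion)
MEMORY_USAGE_FACTOR = 0.95      # share of the granted memory that goes to the M (ETL) engine
SQL_MEMORY_USAGE_FACTOR = 0.30  # share reserved for SQL when the enhanced engine is on

def cpu_container_cap(backend_cores: int) -> int:
    """Max containers allowed by the CPU criterion alone (e.g. P1: 4 * 2 = 8)."""
    return backend_cores * CPU_USAGE_FACTOR

def etl_memory_mb(granted_memory_mb: float, enhanced_engine: bool) -> float:
    """Memory left for the main container pool after the dataflows engine's own
    overhead and (assumption) the SQL carve-out when the enhanced engine is on."""
    memory = granted_memory_mb * MEMORY_USAGE_FACTOR
    if enhanced_engine:
        memory *= 1 - SQL_MEMORY_USAGE_FACTOR
    return memory

print(cpu_container_cap(4))            # 8 containers from the CPU criterion on a P1
print(etl_memory_mb(5120, True))       # ~3404.8 MB left for ETL out of a 5120 MB grant
```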
- Parameters that can be changed in the UX by the administrator of the capacity in the admin portal/capacities section - middle section in the calculator. The UX looks like this:
- WSS Container [MB] - (labeled "Container Size (Mb)" in the UX) represents the working set size of the mashup container. The minimum value is 700 MB, which is also the default. It is recommended to increase it to 1.5-2 GB if complex transforms are involved. Using the calculator, you can see the impact of such a choice (less parallelism, but each operation might finish earlier because more memory is assigned to it). Increasing the container size reduces parallelism because dividing the same memory budget by a larger per-container size (to get the maximum number of containers based on the memory criterion) yields fewer containers, each with more memory. Upgrading to a higher sku can let you increase the memory per operation while also adding more parallelism (as the higher-end skus have more memory). Experimentation can also help establish which of the parameters is critical for the project (increase the sku, add more memory per container, add parallelism, etc.).
- Max Memory [%] - represents the quota (out of the physical memory of the dedicated capacity) given to the dataflows workload. As a note, the quotas of all the workloads can be overcommitted (they can add up to more than 100%) - as a matter of fact, the Analysis Services (datasets) workload is always allocated 100% of the memory, so, as a whole, the workloads are indeed overcommitted. However, no single workload can have more than a 100% allocation by itself. There is also an absolute minimum of about 1200 MB for the dataflows workload. This is not visible per se in the UX, but you will get an error if you enter a percentage that would end up with an absolute amount of memory smaller than those 1200 MB. For the P1 sku, this translates to 5% being the smallest percentage you can assign (a small check after this list illustrates why).
- Enhanced Engine Enabled - (labeled "Enhanced Dataflows Compute Engine" in the UX) represents the customer's choice to direct the dataflows workload to use and manage an internal SQL engine that caches the data in order to deliver better performance (by rewriting queries that would otherwise go against Azure storage to use that cached data in SQL instead - but more about that in another post).
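As a quick illustration of the 1200 MB floor mentioned above, the snippet below computes the smallest whole Max Memory [%] that still clears it. It assumes a P1 has roughly 25 GB (25600 MB) of capacity memory; the rounding and the function name are mine, not the product's.

```python
import math

# Illustrative check of the absolute floor on the dataflows workload memory.
MIN_WORKLOAD_MB = 1200  # approximate minimum mentioned above

def min_max_memory_percent(capacity_memory_mb: int) -> int:
    """Smallest whole percentage that still grants at least ~1200 MB."""
    return math.ceil(MIN_WORKLOAD_MB / capacity_memory_mb * 100)

# Assumption: P1 ~ 25600 MB of capacity memory.
print(min_max_memory_percent(25600))  # 5 -> matches the 5% minimum mentioned for P1
```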
- Parameters that depend on the hardware choice (dedicated capacity instantiated) - middle section in the calculator:
- Sku - represents a designated moniker for the capacity that was instantiated, from P1 to P5. P1-P3 are mainstream capacities that can be instantiated in every region without the customer needing to contact us. P4 and P5 are super-large capacities that can be instantiated only through specific deals with Microsoft.
- Hardware Cores - not a parameter per se; it is driven by the selection of the sku. One note: the hardware cores are different from the v-cores. For example, P1 has 8 v-cores: 4 frontend and 4 backend (dedicated). These 4 backend cores are the hardware cores the calculator considers for its internal projections - this is why P1 shows 4 cores and not 8 (which would include the 4 shared frontend cores).
- Hardware Memory - again, not a parameter per se, but a projection of the selection of the sku - it represents the amount of memory of the dedicated capacity.
The results are computed and presented in the right section of the calculator:
- Max Containers [#] - represents the maximum number of M containers (and, therefore, the ETL degree of parallelism). This is calculated once the memory and CPU constraints are taken into account. This number is for the main container pool (not the secondary container pool described above); a rough reconstruction of this computation follows the list below.
- Container Max WSS [MB] - represents the actual maximum working set size for each of the main containers.
- Has Secondary Containers - indicates whether the engine will use a secondary pool of M containers. This pool (if used) will have less memory per container (equal to the WSS 2nd Container [MB] parameter), and the number of containers will be equal to the number of containers in the main pool.
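Putting the pieces together, here is a rough, hedged reconstruction of how Max Containers could be derived from the inputs above for a P1 with typical settings. The post does not spell out the calculator's exact formula (for instance, how the 64 MB secondary containers are charged against the budget), so treat this strictly as an approximation based on the parameter descriptions, with an illustrative 20% memory quota.

```python
import math

# Rough reconstruction - an approximation, not the calculator's exact formula.
def max_containers(capacity_memory_mb: float, backend_cores: int,
                   max_memory_pct: float, container_wss_mb: float,
                   enhanced_engine: bool) -> int:
    # Memory criterion: memory granted to the workload, minus the engine's own
    # overhead (5%) and the SQL carve-out (30%) when the enhanced engine is on,
    # divided by the per-container working set size.
    budget = capacity_memory_mb * max_memory_pct * 0.95
    if enhanced_engine:
        budget *= 0.70
    memory_cap = math.floor(budget / container_wss_mb)

    # CPU criterion: 2 containers per backend core.
    cpu_cap = backend_cores * 2

    # The engine uses the smallest of the values computed from the different criteria.
    return min(memory_cap, cpu_cap)

# P1 (~25600 MB, 4 backend cores), 20% workload memory (illustrative choice),
# 700 MB default container size, enhanced engine on:
print(max_containers(25600, 4, 0.20, 700, True))  # 4 -> memory-bound, not CPU-bound
```

In this configuration the memory criterion (4 containers) wins over the CPU criterion (8 containers), which is exactly the kind of trade-off the calculator is meant to expose when you play with the container size and the memory percentage.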
The calculator is embedded in the page below for your convenience (direct link is here):