Group Cursor in esProc

Published on 29 August 14
In big data computing, besides grouping and aggregate operations, you sometimes need to retrieve one group of data at a time for analysis. Examples include analyzing sales by date, collecting statistics on the sales curve of each product, and studying the purchase habits of each client.
In esProc, you can use the function cs.fetch(;x) or cs.skip(;x) to get or skip records until the value of expression x changes. This yields a group of consecutive records. For example, retrieve one product at a time to examine the sales data of each product:
Group Cursor in esProc - Image 1
From B7, the records of the 20th product can be retrieved like this:
Group Cursor in esProc - Image 2
Data retrieval from an esProc cursor is a one-way street: records can only be read forward. The data in the cursor must therefore already be in order if each retrieval is to return one complete group of records.
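
The behavior described above can be illustrated outside esProc. The following is a minimal Python sketch, not esProc code: the class `GroupCursor` and its methods `fetch_group` and `skip_groups` are hypothetical names that only imitate the idea of reading or skipping records until the grouping value changes, on a forward-only stream that is assumed to be sorted by that value.

```python
_END = object()  # sentinel marking an exhausted cursor


class GroupCursor:
    """Forward-only cursor over records already sorted by the grouping key."""

    def __init__(self, records, key):
        self._it = iter(records)
        self._key = key
        self._pending = _END  # first record of the next group, read ahead

    def fetch_group(self):
        """Return the next run of consecutive records that share one key value."""
        first = self._pending if self._pending is not _END else next(self._it, _END)
        self._pending = _END
        if first is _END:
            return []
        group, current = [first], self._key(first)
        for rec in self._it:
            if self._key(rec) != current:
                self._pending = rec  # belongs to the next group; keep it for later
                break
            group.append(rec)
        return group

    def skip_groups(self, n):
        """Skip up to n groups and return how many were actually skipped."""
        skipped = 0
        for _ in range(n):
            if not self.fetch_group():
                break
            skipped += 1
        return skipped


# Made-up sample data: (product id, quantity), sorted by product id.
orders = [("B1001", 5), ("B1001", 3), ("B1002", 7), ("B1003", 1), ("B1003", 2)]
cursor = GroupCursor(orders, key=lambda rec: rec[0])
first_group = cursor.fetch_group()  # both B1001 records
cursor.skip_groups(1)               # pass over B1002
third_group = cursor.fetch_group()  # both B1003 records
```

Because the cursor never moves backward, an unsorted input would return the same product in several separate runs, which is why the data has to be ordered first.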

As we know, the @z option can be used to retrieve a file by block, or to retrieve data from a cursor by block. However, when retrieving by block, esProc decides how the data is divided, and this can sometimes cause trouble.

First, let's prepare a data file: take the data used above, which is already sorted by product sequence number, and store it in a new binary file, Order_Products:
Group Cursor in esProc - Image 3
In later computation, retrieving the data by segment produces the situation shown below:
Group Cursor in esProc - Image 4
The data is divided into 100 segments. A3 retrieves the data of the 1st segment and A5 retrieves the data of the 2nd segment, as shown below:
Group Cursor in esProc - Image 5
This exposes a problem: the sales records of product number B1445 appear in both segments. If each segment is aggregated right after it is retrieved, the same product number can appear more than once in the combined result, and a second aggregation is needed to get the final figures. Piecewise computation of this kind is very common in parallel processing of big data, and the split groups make it more complicated. To avoid this, the data should be segmented by group when it is stored.
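
The problem can be reproduced with a small Python sketch. This is an illustration only, not esProc code: `split_fixed` and `aggregate` are hypothetical helpers, the quantities are made up, and an even split by record count is just an assumption about how a by-block division might fall.

```python
from collections import Counter


def split_fixed(records, n_segments):
    """Cut records into n_segments nearly equal slices, ignoring group keys."""
    size, extra = divmod(len(records), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        end = start + size + (1 if i < extra else 0)
        segments.append(records[start:end])
        start = end
    return segments


def aggregate(segment):
    """Total quantity per product id within one segment."""
    totals = Counter()
    for pid, qty in segment:
        totals[pid] += qty
    return totals


# Made-up sample data: (product id, quantity), sorted by product id.
orders = [
    ("B1444", 2), ("B1444", 1),
    ("B1445", 4), ("B1445", 6),
    ("B1446", 3), ("B1446", 5),
]

seg1, seg2 = split_fixed(orders, 2)
partials = [aggregate(seg1), aggregate(seg2)]

# Product ids that show up in both partial results.
straddling = set(partials[0]) & set(partials[1])

# A second aggregation pass is needed to merge the partial totals.
merged = sum(partials, Counter())
```

Here the cut falls in the middle of B1445, so each partial result holds only part of its total, and the merge step is what restores the correct figure.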

When storing binary data from a cursor, simply use the @g option. The data written to the file is then segmented by group, so that when the file is later retrieved by block, all records of the same group are guaranteed to be read together. For example:
Group Cursor in esProc - Image 6
The data, sorted by product sequence number, is saved as a binary file, Order_Products_G, and segmented by group according to PID. This differs slightly from the way the data was written to Order_Products earlier. Please note that piecewise storage is only valid for binary files.

Retrieving by segment now gives a different result:
Group Cursor in esProc - Image 7
This time, the data retrieved in A3 and A5 is as follows:
Group Cursor in esProc - Image 8
Segment 1 now contains all records of product number B1445, and segment 2 starts with the records of the next product. As can be seen, if segmenting by group is enabled when a binary file is written, every group is placed entirely within one segment for later retrieval from the cursor. Segmenting by group guarantees the integrity of the data in each group, which makes piecewise computation over big data simpler.
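
The idea of group-aligned segments can also be sketched in Python. The source does not describe how the @g option places its segment boundaries internally, so the approach below (start from an even split and push each cut forward to the next group boundary) is an assumption for illustration, and `split_by_group` is a hypothetical helper.

```python
def split_by_group(records, n_segments, key):
    """Cut sorted records into at most n_segments slices without splitting a group."""
    segments, start = [], 0
    total = len(records)
    for i in range(1, n_segments + 1):
        # Start from the even split point, but never move backward.
        end = max(start, total * i // n_segments)
        # Push the cut forward until it no longer falls inside a group.
        while 0 < end < total and key(records[end]) == key(records[end - 1]):
            end += 1
        if end > start:
            segments.append(records[start:end])
        start = end
    return segments


# Made-up sample data: (product id, quantity), sorted by product id.
orders = [
    ("B1444", 2), ("B1444", 1),
    ("B1445", 4), ("B1445", 6),
    ("B1446", 3), ("B1446", 5),
]

segments = split_by_group(orders, 2, key=lambda rec: rec[0])
ids_per_segment = [{pid for pid, _ in seg} for seg in segments]
```

With these boundaries, every product id belongs to exactly one segment, so per-segment aggregates can be concatenated without a second aggregation pass. The trade-off is that segments are no longer of equal size.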





This blog is listed under Development & Implementations Community
