MyPage is a personalized page based on your interests.The page is customized to help you to find content that matters you the most.


I'm not curious

An Illustration of Processing Big Text Files with esProc Cursor

Published on 03 September 14
257
0
0
esProc can process big text files conveniently by providing cursor data object. The following example is to illustrate this.
Letâs assume that there is a text file, sales.txt, with ten million sales records. Its main fields include SellerID, OrderDate and Amount. Now compute each salesmanâs total Amount of big orders in the past four years. The big orders refer to those whose amount is above 2000.

esProc code:
An Illustration of Processing Big Text Files with esProc Cursor - Image 1
Code interpretation:
A1: If all the ten million records are read into memory simultaneously, memory will overflow. So the job will be done in batches.
A2: Read by looping, 100,000 rows at a time.
B3: Filter each batch of data, select those records whose amount is above 2000 after the year of 2011.
B4: Group and summarize the filtered data, seek the sales of each salesperson in this batch of data.
B5: Add the computed result of this batch of data to a certain variable (B1), and move on to the computation of next batch of data.
B6: After all the computations, sales of each salesperson in every batch of data will be found in B1. Last, group and summarize these sales data and seek each salespersonâs total sales amount.

Analysis:
In cell A1, esProc cursor is created with function cursor. The cell name is the cursorâs variable name. When the cursor is created, data will not be read in the memory directly. Read-in will only be executed while fetch operation or other equal operations are going on, e.g., the code for A1,100000 in cell A2 represents reading data from cursor by looping with 100,000 rows at a time. We can see that the data size in memory is always kept in a relatively small level and no overflows will occur.

select and groups are computation functions specially used with structured data. After the data is read in the memory with esProc cursor, they can be processed and analyzed by employing functions of professional structured data computation library. This is more convenient than writing underlying code by hand.

Equipped with functions and grammar of semi-structured data processing, e.g., function for data split and merging, looping and traversal statement and branch judgment statement, esProc cursor can do complex task of data cleansing and arrangement and form easily computed structured data.

Splitting and analyzing
For instance, the format of weblog is too complex to be computed and analyzed directly. A typical web blog text need to be transformed into a two-dimensional table of standard format in order to be used in structured data computation or be stored in a database.
A record in the original weblog:
10.10.10.145 - - [01/May/2013:03:24:56 -0400] "GET /product/p0040001/review.jsp?page=5 HTTP/1.1" 200 8100 "http://www.xyz.com/xyz.html""Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36" 0 tankLee in 9fc0792eb272b78916c3872e9ad â

The records in a two-dimensional table are:
An Illustration of Processing Big Text Files with esProc Cursor - Image 2
The following case omits the process of file accesses and the final merging of multiple batches of data(refer to the previous example), and lists the code for splitting and analyzing directly.
An Illustration of Processing Big Text Files with esProc Cursor - Image 3
Data cleansing
Letâs see a typical example of data cleansing. Since the employee table read in from a file is not a standard format, it need to be reorganized into standard structural data in batches. Data of the current batch will be stored in cell D3 temporarily. The rule for reorganizing is:
1. The record is invalid if UserID and firstName is null or blank string.
2. UserID can only contain digits; the record is invalid if letters appear in it.
3. For repeated UserID, only the last entered record is kept.
4. Delete possible spacing before and after the data.
5. Capitalize all the first letters of firstName.
6. Full name is combined in the form of firstName+â.â+âlastNameâ. But, if lastName is null or blank string, fullname equals to firstName.

The following table also omits the process of file accesses and the merging of multiple batches of data, and only lists the code for data cleansing:
An Illustration of Processing Big Text Files with esProc Cursor - Image 4







esProc can process big text files conveniently by providing cursor data object. The following example is to illustrate this.

Letâs assume that there is a text file, sales.txt, with ten million sales records. Its main fields include SellerID, OrderDate and Amount. Now compute each salesmanâs total Amount of big orders in the past four years. The big orders refer to those whose amount is above 2000.

esProc code:

An Illustration of Processing Big Text Files with esProc Cursor - Image 1

Code interpretation:

A1: If all the ten million records are read into memory simultaneously, memory will overflow. So the job will be done in batches.

A2: Read by looping, 100,000 rows at a time.

B3: Filter each batch of data, select those records whose amount is above 2000 after the year of 2011.

B4: Group and summarize the filtered data, seek the sales of each salesperson in this batch of data.

B5: Add the computed result of this batch of data to a certain variable (B1), and move on to the computation of next batch of data.

B6: After all the computations, sales of each salesperson in every batch of data will be found in B1. Last, group and summarize these sales data and seek each salespersonâs total sales amount.

Analysis:

In cell A1, esProc cursor is created with function cursor. The cell name is the cursorâs variable name. When the cursor is created, data will not be read in the memory directly. Read-in will only be executed while fetch operation or other equal operations are going on, e.g., the code for A1,100000 in cell A2 represents reading data from cursor by looping with 100,000 rows at a time. We can see that the data size in memory is always kept in a relatively small level and no overflows will occur.

select and groups are computation functions specially used with structured data. After the data is read in the memory with esProc cursor, they can be processed and analyzed by employing functions of professional structured data computation library. This is more convenient than writing underlying code by hand.

Equipped with functions and grammar of semi-structured data processing, e.g., function for data split and merging, looping and traversal statement and branch judgment statement, esProc cursor can do complex task of data cleansing and arrangement and form easily computed structured data.

Splitting and analyzing

For instance, the format of weblog is too complex to be computed and analyzed directly. A typical web blog text need to be transformed into a two-dimensional table of standard format in order to be used in structured data computation or be stored in a database.

A record in the original weblog:

10.10.10.145 - - [01/May/2013:03:24:56 -0400] "GET /product/p0040001/review.jsp?page=5 HTTP/1.1" 200 8100 "http://www.xyz.com/xyz.html""Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36" 0 tankLee in 9fc0792eb272b78916c3872e9ad â

The records in a two-dimensional table are:

An Illustration of Processing Big Text Files with esProc Cursor - Image 2

The following case omits the process of file accesses and the final merging of multiple batches of data(refer to the previous example), and lists the code for splitting and analyzing directly.

An Illustration of Processing Big Text Files with esProc Cursor - Image 3

Data cleansing

Letâs see a typical example of data cleansing. Since the employee table read in from a file is not a standard format, it need to be reorganized into standard structural data in batches. Data of the current batch will be stored in cell D3 temporarily. The rule for reorganizing is:

1. The record is invalid if UserID and firstName is null or blank string.

2. UserID can only contain digits; the record is invalid if letters appear in it.

3. For repeated UserID, only the last entered record is kept.

4. Delete possible spacing before and after the data.

5. Capitalize all the first letters of firstName.

6. Full name is combined in the form of firstName+â.â+âlastNameâ. But, if lastName is null or blank string, fullname equals to firstName.

The following table also omits the process of file accesses and the merging of multiple batches of data, and only lists the code for data cleansing:

An Illustration of Processing Big Text Files with esProc Cursor - Image 4

This blog is listed under Development & Implementations and Data & Information Management Community

Related Posts:
Post a Comment

Please notify me the replies via email.

Important:
  • We hope the conversations that take place on MyTechLogy.com will be constructive and thought-provoking.
  • To ensure the quality of the discussion, our moderators may review/edit the comments for clarity and relevance.
  • Comments that are promotional, mean-spirited, or off-topic may be deleted per the moderators' judgment.
You may also be interested in
Awards & Accolades for MyTechLogy
Winner of
REDHERRING
Top 100 Asia
Finalist at SiTF Awards 2014 under the category Best Social & Community Product
Finalist at HR Vendor of the Year 2015 Awards under the category Best Learning Management System
Finalist at HR Vendor of the Year 2015 Awards under the category Best Talent Management Software
Hidden Image Url

Back to Top