QUICK START
1- Select the Source Tab, then select the tab for the type of input you want to parse: Files, Webpages, or Pasted Text. Add your source(s) using the control buttons for adding files or webpage URLs, or paste text into the text view box.
2- Select the Rules Tab and configure your rule set. We suggest starting with a simple set of data (even if you have to make it up) and one set of rules, then experimenting to see how the rules affect the parsing of your data and the format of the output. Once you have a good understanding of how things work, you can configure more complex sets of rules and really benefit from the application's power. We believe this is the best way to get the most out of the application.
3- Select the Output Tab, then choose the Source Radio Button for where the data to parse is coming from (File, Web, or Text) and the Parse Style Radio Button (Series, Parallel, or Nested), then press run. Once processing is complete, you can copy the results to your clipboard or save them to a file.
Also, a report is generated which you can copy or save.
SOURCE
Datamate™ allows for three sources of data input: Files, Webpages, and Copy & Paste Text. Any text file format can be parsed with Datamate™ (txt, rtf, xml, html, etc.). You can also save the lists of file paths and webpage URLs and load them later for ease of use.
RULES
Here is where the power of Datamate™ comes from: our Rules configuration tool.
Rule Name: A name to help remember what the rule does (optional).
Input Data Starting Point: The place in the input data where Datamate™ will start parsing. This is helpful to control erroneous output by bypassing a certain amount of data at the beginning of the input before it starts looking for the first input tokens (i.e., elements). Also, this will help speed up processing if the input data is large and the elements you want to output are toward the end of the data (optional).
Input Elements Beginning Token: The set of characters used to identify the beginning point of the piece of data (element) you want to output (required).
Input Elements Ending Token: The set of characters used to identify the ending point of the piece of data (element) you want to output. You can also choose a standard delimiter from the drop-down box located to the right of the text-field. If you choose a standard delimiter, whatever is entered in the text-field will be disregarded (one or the other is required).
Input Data Finishing Point: The place in the input data where Datamate™ will stop parsing. This is helpful to control erroneous output by bypassing a certain amount of data at the end of the input: once this point is reached, Datamate™ stops looking for any more input tokens (i.e., elements). Also, this will help speed up processing if the input data is large and the elements you want to output are toward the beginning of the data (optional).
Output Elements Prefix: The set of characters that you want to add to the beginning of the element you parsed out from the input data (optional).
Output Elements Suffix: The set of characters that you want to add to the end of the element you parsed out from the input data (optional).
Output Delimiter: A standard delimiter to add to the end of the element parsed out from the data, applied after the Output Elements Suffix if one is used (optional).
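To make the fields above concrete, here is a minimal Python sketch of how a single rule might be applied to a string of input data. It is not Datamate™'s actual implementation: the Rule class, its field names, and the apply_rule function are hypothetical names chosen for illustration, matching is assumed to be a plain case-sensitive substring search, and the Starting Point text itself is assumed to be skipped.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical model of one rule; the field names are assumptions."""
    begin_token: str        # Input Elements Beginning Token (required)
    end_token: str          # Input Elements Ending Token (or a delimiter)
    start_point: str = ""   # Input Data Starting Point (optional)
    finish_point: str = ""  # Input Data Finishing Point (optional)
    prefix: str = ""        # Output Elements Prefix (optional)
    suffix: str = ""        # Output Elements Suffix (optional)
    delimiter: str = ""     # Output Delimiter (optional)
    name: str = ""          # Rule Name (optional)


def apply_rule(rule: Rule, data: str) -> str:
    if not rule.begin_token or not rule.end_token:
        raise ValueError("beginning and ending tokens are required")

    # Narrow the search window with the optional starting/finishing points.
    lo = 0
    if rule.start_point:
        i = data.find(rule.start_point)
        if i != -1:
            lo = i + len(rule.start_point)
    hi = len(data)
    if rule.finish_point:
        j = data.find(rule.finish_point, lo)
        if j != -1:
            hi = j
    window = data[lo:hi]

    # Collect every element found between a beginning and an ending token.
    out = []
    pos = 0
    while True:
        b = window.find(rule.begin_token, pos)
        if b == -1:
            break
        b += len(rule.begin_token)
        e = window.find(rule.end_token, b)
        if e == -1:
            break
        # Prefix, then element, then suffix, then the output delimiter.
        out.append(rule.prefix + window[b:e] + rule.suffix + rule.delimiter)
        pos = e + len(rule.end_token)
    return "".join(out)


rule = Rule(begin_token="[", end_token="]", start_point="BEGIN",
            finish_point="END", prefix="<", suffix=">", delimiter="\n")
print(apply_rule(rule, "[skip] BEGIN [a] x [b] END [late]"), end="")
```

In this sketch the elements before BEGIN and after END are ignored, which mirrors how the Starting Point and Finishing Point are described as a way to bypass unwanted data.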
EXAMPLE: There's a bunch of Webpages from which we want to get just the body section and discard all the rest. We want the output of each page to be separated by a new line. Here's how we set up the rules.
Rule Name= HTML Body
Input Data Starting Point Token= We'll leave this blank, because we probably won't know a good place to start and webpages are usually small, so there is no real performance hit to be concerned with.
Input Elements Beginning Point Token= <body We'll set it to Case Insensitive, because some page builders use upper case while others use lower case and we can't be sure which we will encounter. We won't use a '>' after 'body' because the body tag may have additional attributes; if we added the greater-than symbol, our beginning token would not match anything in such a webpage.
Input Elements Ending Point Token= </body> We'll set it to Case Insensitive again for the same reasons as above, and since we can only use a token OR a standard delimiter, we'll leave the drop-down box to the right set to None.
Input Data Finishing Point Token= We'll leave this blank for the same reasons as the Input Data Starting Point Token.
Output Elements Prefix= <html><body For this example we want to display the output as simple webpages, so we add back some HTML tags. Note that we are not adding the '>' symbol after 'body', because the element we parsed out already begins with the rest of the original body tag.
Output Elements Suffix= </body></html> Same reason as above.
Output Delimiter= New Line This will help us see where the output of one webpage ends and the next begins.
Now we'll add the rule and save it if we want.
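As a rough illustration of what this rule produces, the following Python sketch imitates it with a case-insensitive regular expression. This is only an approximation under stated assumptions, not how Datamate™ works internally: extract_bodies and the two sample pages are made up for the example.

```python
import re


def extract_bodies(pages):
    """Imitate the 'HTML Body' rule on a list of page strings (hypothetical helper)."""
    # Tokens are matched literally (re.escape) and case-insensitively.
    pattern = re.compile(
        re.escape("<body") + "(.*?)" + re.escape("</body>"),
        re.IGNORECASE | re.DOTALL,
    )
    results = []
    for page in pages:
        for match in pattern.finditer(page):
            # Prefix + element + suffix, followed by the New Line delimiter.
            results.append("<html><body" + match.group(1) + "</body></html>" + "\n")
    return "".join(results)


# Made-up sample pages: one upper case with an attribute, one lower case.
pages = [
    '<HTML><HEAD><TITLE>A</TITLE></HEAD><BODY class="x"><p>One</p></BODY></HTML>',
    "<html><head></head><body><p>Two</p></body></html>",
]
print(extract_bodies(pages), end="")
```

Because the beginning token stops at '<body', the element for the first page starts with ' class="x">', so the attributes survive and the prefix '<html><body' joins up with them cleanly.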
OUTPUT
You can extract and output the elements of your data in three distinct fashions: Series, Parallel, or Nested. Under some circumstances, changing the Parse Style won't make a difference, such as when using only one rule with a simple set of data.
Parsing the data in Series means that Datamate™ will go through the input data one source file at a time, applying the first rule and outputting all the elements it finds, then the second rule (if applicable; only one rule is required), then the third rule, and so on. The output will contain all of the parsed elements of the first rule for the first file, then the elements of the second rule for the first file, and so on, followed by the first rule for the second file, then the second rule for the second file, and so on in that order.
Parsing the data in Parallel will result in Datamate™ outputting, for the first file, the first element of the first rule, then the first element of the second rule, and so on; then, for the second file, the first element of the first rule, the first element of the second rule, and so on.
Parsing the data in Nested format is very similar to the Parallel format, except it processes the rules in a nested fashion to create an output data set that will appear like the Parallel output, but will help ensure a more reliable result for input data that may have missing values or other issues.
We suggest you try each Parse Style and see which works best for your data.
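The difference between the Series and Parallel orderings can be sketched in Python as follows. This is only an illustration of the ordering as we read the descriptions above: series_order and parallel_order are hypothetical names, the input is assumed to be already-parsed elements grouped by file and then by rule, and filling uneven rule results with a placeholder is an assumption. Nested is not modelled here.

```python
from itertools import zip_longest


def series_order(per_file):
    """per_file: one entry per file, each a list of per-rule element lists."""
    out = []
    for rule_results in per_file:
        # All elements of rule 1, then all elements of rule 2, and so on.
        for elements in rule_results:
            out.extend(elements)
    return out


def parallel_order(per_file, missing="NO-DATA"):
    out = []
    for rule_results in per_file:
        # First element of every rule, then second element of every rule...
        # The 'missing' placeholder for uneven lists is an assumption.
        for row in zip_longest(*rule_results, fillvalue=missing):
            out.extend(row)
    return out


# Made-up data: two files, two rules (names and emails).
per_file = [
    [["n1", "n2"], ["e1", "e2"]],  # file 1: rule 1 results, rule 2 results
    [["n3"], ["e3"]],              # file 2
]
print(series_order(per_file))
print(parallel_order(per_file))
```

With this data, Series groups the names before the emails within each file, while Parallel pairs each name with its email.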
REPORT
A Processing Report will be generated and can be copied or saved to a file.
TERMINOLOGY
Element: The set of text found between the Input Elements Beginning Point Token and Input Elements Ending Point Token.
Token: A set of characters, including special characters, spaces, tabs, and new lines.
NO-ELEMENT: This will be returned in the output if during parsing the application found the Input Elements Beginning Point Token and Input Elements Ending Point Token, but no value (element) was found between them.
NO-DATA: This will be returned in the output if during parsing the application did NOT find either the Input Elements Beginning Point Token or Input Elements Ending Point Token where it was expected to find them.
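The distinction between NO-ELEMENT and NO-DATA can be illustrated with a short Python sketch. The function extract_one is hypothetical, and it assumes that "no value" means an empty string between the two tokens; how Datamate™ treats whitespace-only values is not stated here.

```python
def extract_one(data, begin_token, end_token):
    """Return the first element, or a NO-DATA / NO-ELEMENT marker."""
    b = data.find(begin_token)
    if b == -1:
        return "NO-DATA"      # beginning token not found
    b += len(begin_token)
    e = data.find(end_token, b)
    if e == -1:
        return "NO-DATA"      # ending token not found after the beginning token
    element = data[b:e]
    # Both tokens were found; an empty value between them is NO-ELEMENT.
    return element if element else "NO-ELEMENT"


print(extract_one("name=Ann;", "name=", ";"))
print(extract_one("name=;", "name=", ";"))
print(extract_one("nothing here", "name=", ";"))
```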
Copyright © 2015 Procypher Corporation. All Rights Reserved.
[email protected] www.procypher.com