Sites with large amounts of data usually have a complex structure, protection against scraping, and other features that complicate collecting information. However, such projects usually have one significant advantage over ordinary sites: the data is embedded in the page code as JSON, which makes it possible to pull huge lists in a few seconds. Of course, not every site that uses JSON exposes it openly; sometimes you have to work hard to extract it from the document. Unlike regular HTML layout, JSON gives direct access to the data array, without selecting style classes and elements in the document one by one, so the lists can be obtained quickly and painlessly.
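To make the idea of "JSON inside the page" concrete, here is a minimal Python sketch. It assumes the data sits in a `<script type="application/ld+json">` tag, which is only one common way sites embed JSON; the sample HTML and the names `JsonScriptParser` and `extract_json_blocks` are made up for illustration and are not from any particular site or tool.

```python
import json
from html.parser import HTMLParser


class JsonScriptParser(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self._chunks = []

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self.blocks.append("".join(self._chunks))
            self._inside = False


def extract_json_blocks(html):
    """Return the decoded JSON objects found in the page, skipping broken ones."""
    parser = JsonScriptParser()
    parser.feed(html)
    parser.close()
    results = []
    for raw in parser.blocks:
        try:
            results.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # invalid JSON is ignored rather than crashing the run
    return results


# Hypothetical page: the same product appears once as JSON and once as markup.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"name": "Example product", "description": "Demo", "image": ["a.jpg", "b.jpg"]}
</script>
</head><body><div class="title">Example product</div></body></html>
"""

product = extract_json_blocks(SAMPLE_HTML)[0]
print(product["name"], len(product["image"]))
```

The whole record arrives as one dictionary, so there is no need to write a separate selector for the title, the description, and each photo.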
Millions of products: how do you collect them?
First, I had to think about how best to collect all the data without making mistakes, while keeping the parsing simple and fast. This required the following steps:
Write a script that collects the list of category and product URLs by parsing the sitemap;
Check that each page contains JSON and that the JSON is valid, so that no errors occur during parsing;
Collect the information itself, always as a separate step, so that you can return at any time to the required list, item, counter, etc.
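The first two steps can be sketched in Python roughly as follows. This is an illustration under assumptions, not the script I actually used: it assumes a standard sitemaps.org XML sitemap and that product URLs contain `/product/`; the sample sitemap and all function names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return every <loc> URL listed in the sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def split_urls(urls, product_marker="/product/"):
    """Split URLs into (categories, products). The marker is an assumption."""
    products = [u for u in urls if product_marker in u]
    categories = [u for u in urls if product_marker not in u]
    return categories, products


def is_valid_json(raw):
    """True if the text decodes as JSON, so bad pages can be skipped early."""
    try:
        json.loads(raw)
    except (TypeError, ValueError):  # JSONDecodeError is a ValueError
        return False
    return True


# Hypothetical sitemap used only for the demonstration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/category/shoes</loc></url>
  <url><loc>https://example.com/product/1001</loc></url>
  <url><loc>https://example.com/product/1002</loc></url>
</urlset>"""

categories, products = split_urls(urls_from_sitemap(SAMPLE_SITEMAP))
print(len(categories), "categories,", len(products), "products")
print(is_valid_json('{"name": "ok"}'), is_valid_json("{broken"))
```

Keeping the URL lists separate from the collection step means the list can be rebuilt or rechecked without touching data that has already been gathered.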
Resuming parsing from the right point
Another key point is that the script itself should store the data needed to continue parsing, so that you do not have to dig into its internals or remember where you left off. Otherwise you have to go to the folder with the action and look up the counter number or other element from which the action should start. Avoiding these unnecessary operations is very convenient, especially when the number of scripts is growing or you are working on several projects at once. In that situation it is easy to confuse the action folders and continue from the wrong point.
Of course, ZennoPoster has a separate tool for saving errors, so you can determine exactly where they occurred in the script and continue the action from that element. But we need not only to determine where the process stopped, but also to restart the action automatically from that point. This is the key to success when parsing a huge amount of data, as in this particular case.
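Outside ZennoPoster, the same "continue from where it stopped" behavior can be sketched in plain Python with a small checkpoint file. This is only an illustration of the idea, not ZennoPoster's mechanism; the file name `checkpoint.json`, the `next_index` key, and the function names are all hypothetical.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the next item to process (0 if there is no checkpoint)."""
    try:
        with open(path, encoding="utf-8") as fh:
            return int(json.load(fh)["next_index"])
    except (FileNotFoundError, ValueError, KeyError, TypeError):
        return 0


def save_checkpoint(path, next_index):
    """Write the counter to a temp file first, then swap it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump({"next_index": next_index}, fh)
    os.replace(tmp, path)


def run(urls, checkpoint_path, process):
    """Process URLs in order, starting from the saved counter."""
    start = load_checkpoint(checkpoint_path)
    for index in range(start, len(urls)):
        process(urls[index])
        save_checkpoint(checkpoint_path, index + 1)  # saved only after success


# Demonstration: the first launch crashes on "u2", the second one resumes there.
checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
urls = ["u0", "u1", "u2", "u3"]
done = []
crash_once = {"armed": True}


def process(url):
    if url == "u2" and crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("simulated crash")
    done.append(url)


try:
    run(urls, checkpoint, process)
except RuntimeError:
    pass
run(urls, checkpoint, process)
print(done)
```

Because the counter is written only after an item succeeds, a crash never skips an item; at worst the item that failed is processed again on the next launch.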
So we have reached the point where you need to prioritize what information to collect from the site.
The first and most important elements to parse are the title, description, links to the source, photos, etc. Additional data such as tags, categories, and other information come next. It all depends on the specific task.
This data needs to be saved to a CSV or XLSX table. If you take the JSON code, most likely all of this information is already present in it, and you only have to save it and distribute it across cells. As I mentioned in this article, it can be done conveniently and quickly with ZennoPoster.
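For readers who do not use ZennoPoster, here is a rough Python equivalent of distributing JSON fields across cells of a CSV table. The field names (`title`, `description`, `url`, `photos`, `tags`) and the sample record are assumptions for illustration; a real site will use its own keys.

```python
import csv
import io

# Column order of the output table; these keys are hypothetical.
FIELDS = ["title", "description", "url", "photos", "tags"]


def to_row(record):
    """Pick the wanted fields from a JSON record; lists are joined into one cell."""
    row = {}
    for field in FIELDS:
        value = record.get(field, "")
        if isinstance(value, (list, tuple)):
            value = "|".join(str(v) for v in value)
        row[field] = value
    return row


def write_csv(records, fh):
    """Write a header plus one row per record to an open text file."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        writer.writerow(to_row(record))


# Hypothetical record, as it might come out of the page's JSON.
records = [
    {
        "title": "Red shoes",
        "description": "Leather",
        "url": "https://example.com/product/1001",
        "photos": ["a.jpg", "b.jpg"],
        "tags": ["shoes", "red"],
        "internal_id": 77,  # not in FIELDS, so it is left out of the table
    }
]

buffer = io.StringIO()
write_csv(records, buffer)
print(buffer.getvalue())
```

When writing to a real file instead of `io.StringIO`, open it with `newline=""` as the `csv` module documentation recommends, so that row endings are not doubled on Windows.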
After the basic steps, you need to bring all the collected information together, especially if you collect data separately by category. For example, I do not put a lot of data into a single table at once, because then everything is at risk if the PC crashes, the action stops, or there are simply errors in the script code. Instead, I usually make selections from lists: by categories, their numbers, or titles for quick search, always with a counter number to sort by. If the counter is part of the file name, you cannot pick the wrong number when you resume parsing.
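The counter-in-the-file-name approach might look like this in Python. It is a sketch under assumptions: the naming scheme `0001_category.csv` is my illustration rather than a fixed convention, and the function names and sample rows are hypothetical.

```python
import csv
import re
import tempfile
from pathlib import Path

# Assumed naming scheme: "<counter>_<category>.csv", e.g. "0001_shoes.csv".
PATTERN = re.compile(r"^(?P<counter>\d+)_(?P<category>.+)\.csv$")


def numbered_chunks(folder):
    """Return (counter, path) pairs for all chunk files, sorted by counter."""
    found = []
    for path in Path(folder).iterdir():
        match = PATTERN.match(path.name)
        if match:
            found.append((int(match.group("counter")), path))
    return sorted(found)


def next_chunk_path(folder, category):
    """Build the next file name from the highest counter already on disk."""
    chunks = numbered_chunks(folder)
    counter = chunks[-1][0] + 1 if chunks else 1
    return Path(folder) / f"{counter:04d}_{category}.csv"


def merge_chunks(folder, output_path):
    """Combine all chunk files in counter order, keeping a single header row."""
    header_written = False
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for _, path in numbered_chunks(folder):
            with open(path, newline="", encoding="utf-8") as fh:
                rows = list(csv.reader(fh))
            if not rows:
                continue
            if not header_written:
                writer.writerow(rows[0])
                header_written = True
            writer.writerows(rows[1:])


# Demonstration with two small hypothetical category files.
folder = Path(tempfile.mkdtemp())
for category, title in [("shoes", "Red shoes"), ("bags", "Blue bag")]:
    path = next_chunk_path(folder, category)
    with open(path, "w", newline="", encoding="utf-8") as fh:
        csv.writer(fh).writerows([["title", "category"], [title, category]])

merged = folder / "merged.csv"
merge_chunks(folder, merged)
print([p.name for _, p in numbered_chunks(folder)])
print(merged.read_text(encoding="utf-8"))
```

Since the next number is derived from the files already on disk, a crash in the middle of one category costs at most that one small file, and the final table can be rebuilt from the chunks at any time.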
As the final step, we combine all the data, save it in the required format, and send it to the customer. Cell filtering should be agreed with the client in advance, as should the number of rows and columns in the table. That way you will not have to redo all the work for free! If the customer overlooked something himself, that is his problem, and you can request an additional fee for finishing the job and rerunning the parsing.
Value your own time and other people's: time is money, and money is, after all, what you work for!