I have been parsing data for a long time, including for clients on quark.ru (a popular freelance exchange): I collect the required elements from the pages of whatever site the client specifies and save them to a table or a MySQL database. A typical task for me is collecting data such as the title, the product or page description, keywords, tags, and so on.
I can say with confidence that the more popular a site becomes, the harder it is to pull data from it, because its owners understand that information is money and are not ready to share it for free. That is exactly the case with the following website: agroserver.
What makes parsing it difficult?
First of all, the site protects itself against data collection at the software level, so you have to work hard to get the right information in a short time.
This type of data can include meta tags, product characteristics, and individual page elements. In this particular case, I had to collect ads from selected categories along with their details: title, description, category, links, photos, and so on. I usually write automated parsing scripts in a program called ZennoPoster, which not only makes it easier to collect almost any data from a website but also supports a huge number of additional features:
Proxy support, including free ones;
Convenient formatting of tags and other text values through the built-in regular expression tool;
Randomization, direct GET and POST requests to the server, and many other useful features.
To collect the data behind every link on this project, I had to take the following nuances into account:
After several consecutive requests from the same IP address, the site returned a 503 error or a captcha;
Proxies had to be used and changed very often, which, of course, does nothing for the speed of the script or of you as the contractor;
For some reason, it was important for the client to get the category of each ad, which was not included in the general listing, and this complicated an already difficult task;
The table had to be delivered in Excel format instead of CSV, which also cost a lot of time and nerves.
Note: I usually deliver parsed data in CSV, because it is convenient for later editing and consists of plain text with the necessary delimiters. With Microsoft Office formats, and Excel in particular, importing data sometimes causes difficulties: the format supports tabs and is therefore not always recognized correctly by applications.
Problem solving and parsing speed
So, we have reached the point where I had to solve a large number of different tasks and speed up the collection of information from the site several times over.
The first thing I did was collect the listing pages with ads, to simplify the job, and then separately take the addresses I needed from those lists (especially since the listings already contained the data I needed).
Next I set up proxies for the cases of errors and blocking by the site, fixed minor bugs, and tested the script.
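Outside ZennoPoster, the proxy switching described above can be sketched in Python. This is only an illustration, not the script used for the order: the function names `fetch_via_proxy` and `fetch_with_rotation`, the timeout, and the attempt limit are assumptions, and captcha detection is left out because it depends on the site's markup.

```python
import itertools
import urllib.error
import urllib.request


def fetch_via_proxy(url, proxy, timeout=15):
    """Fetch a page through one HTTP proxy and return its text.

    A 503 response surfaces as urllib.error.HTTPError.
    """
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_with_rotation(url, proxies, fetch=fetch_via_proxy, max_attempts=5):
    """Try the URL through successive proxies until one of them succeeds."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    pool = itertools.cycle(proxies)  # start over when the list runs out
    last_error = None
    for _ in range(max_attempts):
        proxy = next(pool)
        try:
            return fetch(url, proxy)
        except OSError as error:  # URLError and HTTPError are OSError subclasses
            last_error = error  # blocked or dead proxy: move on to the next one
    raise RuntimeError(f"all {max_attempts} attempts failed for {url}") from last_error
```

Passing the `fetch` function as a parameter keeps the rotation logic testable without touching the network.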
It is worth noting that ZennoPoster has a great feature for fast parsing: it can run the same script in multithreaded mode, which makes the job much faster and easier.
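The same idea, one parsing routine applied to many pages at once, can be shown with Python's standard thread pool. This is a rough sketch rather than how ZennoPoster works internally; `parse_all`, `parse_one`, and the default of 8 workers are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_all(urls, parse_one, workers=8):
    """Apply parse_one to every URL in parallel threads.

    Results come back in the same order as the input URLs.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_one, urls))
```

Threads suit this kind of job because most of the time is spent waiting for the server to respond, not computing.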
After collecting the lists and testing the script for minor errors, the next step was to find the category of each product. This took most of the time spent on the whole order, because I had to open every product page and take the name of the category and the link to it.
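Pulling the category name and link out of a product page might look like the sketch below. The markup is hypothetical: it assumes the category is a link with the class `category`, which is not necessarily how agroserver marks it up, so the condition in `handle_starttag` would need to be adapted to the real page.

```python
from html.parser import HTMLParser


class CategoryParser(HTMLParser):
    """Collects the first <a class="category"> link (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.category = None  # becomes (name, href) once the link is found
        self._capturing = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "a" and "category" in classes and self.category is None:
            self._capturing = True
            self._href = attributes.get("href")
            self._text = []

    def handle_data(self, data):
        if self._capturing:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._capturing:
            self.category = ("".join(self._text).strip(), self._href)
            self._capturing = False


def extract_category(html):
    """Return (category name, category link), or None if no link was found."""
    parser = CategoryParser()
    parser.feed(html)
    return parser.category
```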
A little more complexity from the client
Then came an interesting twist: the client did not want each ad on its own line, as is usually done, but all the ads of one producer on a single line. I cannot say why it had to be done this way, but this headache had to be implemented somehow in a working script.
I did the following: I selected the ads of one company and simply added them to the list in the usual way, one per line, and then separately merged all of that data into 1 line. This way I killed two birds with one stone: I did not have to struggle with sorting a huge amount of data, and I solved the problem quickly.
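In Python, the "one line per producer" layout could be built roughly as follows. The field names `company`, `title`, and `url` are assumptions for the example; the real ads carried more fields (description, category, photos).

```python
from collections import defaultdict


def one_row_per_company(ads):
    """Collapse a flat list of ads into one row per company.

    Each row starts with the company name, followed by the fields
    of all of its ads in the order they were collected.
    """
    grouped = defaultdict(list)
    for ad in ads:
        grouped[ad["company"]].append(ad)

    rows = []
    for company, company_ads in grouped.items():
        row = [company]
        for ad in company_ads:
            row.extend([ad["title"], ad["url"]])
        rows.append(row)
    return rows
```

Grouping by key avoids sorting the whole data set: each ad is visited once and appended to its company's list.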
Merging all files into one in the desired format
Now all the ads had to be combined into 1 file, with each manufacturer on 1 line. This task was also non-trivial, because the parsing had originally been done in CSV format, and the client needed Excel.
The point is that CSV is simple delimited text, while XLSX is a structured workbook that stores text, numbers, tabs, and everything else. As a result, I had to completely reformat every file to solve this problem, and there were, by the way, several thousand separate documents.
What helped me cope was, once again, the same product: ZennoPoster. It can save in either format, but it has no standard way of converting from one to the other, because that is not provided by default.
I had to write one more script in which I completely changed the output format, and along the way I learned that a .bat file can join all the existing CSV documents with a single click.
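For comparison, here is a small Python sketch of the merging step using only the standard library. The `;` delimiter and UTF-8 encoding are assumptions, and the function name `merge_csv_files` is made up for the example. Writing the result as XLSX is not shown, since it would need a third-party library such as openpyxl.

```python
import csv
from pathlib import Path


def merge_csv_files(source_dir, target_path, delimiter=";"):
    """Append the rows of every .csv file in source_dir to one target file.

    Files are processed in name order; returns the number of rows written.
    """
    target = Path(target_path).resolve()
    row_count = 0
    with open(target, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file, delimiter=delimiter)
        for path in sorted(Path(source_dir).glob("*.csv")):
            if path.resolve() == target:
                continue  # do not read the output file back into itself
            with open(path, newline="", encoding="utf-8") as in_file:
                for row in csv.reader(in_file, delimiter=delimiter):
                    writer.writerow(row)
                    row_count += 1
    return row_count
```

Reading and rewriting rows through the `csv` module, rather than concatenating raw bytes, keeps quoting consistent even when a field contains the delimiter.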
The results of parsing the site
We have come to the conclusions I drew from this parsing job, and here is what I see:
First, I gained new experience in collecting information and assembling it into a single file format;
I learned to work in multithreaded mode and solve problems in a nontrivial way;
The speed of my solutions increased many times over thanks to a logical structure and a clear formulation of the problems.
The next step will be even faster parsing by processing data in JSON format, which I will talk about later…