Python method to extract content (excluding navigation) from an HTML page
Of course, an HTML page can be parsed using any number of Python parsers, but I'm afraid there doesn't seem to be a publicly available parsing script that extracts the meaningful content (excluding sidebars, navigation, etc.) from a given HTML document.
I'm guessing it would be something like collecting the DIV and P elements and then checking them for a minimum amount of text content, but I'm sure a solid implementation would include plenty of things I haven't thought of.
Try an HTML parsing library for Python. Such libraries offer very easy ways to extract information from an HTML file.
Trying to extract data generically from web pages would require people to write their pages in a similar way ... but there is an almost unlimited number of ways to express a page that looks identical, let alone all the combinations you can have to express the same information.
Was there a specific kind of information you were trying to extract, or some other end goal?
You could try extracting any content in 'div' and 'p' markers and comparing the relative sizes of all the information in the page. The problem is that people probably group information into collections of 'div's and 'p's (or at least they do if they are writing well-formed HTML!).
Maybe if you formed a tree of how the information is related (the nodes would be the 'p' or 'div' or whatever, and each node would contain the associated text), you could do some kind of analysis to identify the smallest 'p' or 'div' that encompasses what appears to be the majority of the information?
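The tree idea above could be sketched roughly as follows. This is only an illustration using the standard library's html.parser; the names Node, TreeBuilder and smallest_majority_block, the 50% threshold, and the choice to count words rather than characters are all assumptions made for the example, not part of any existing tool.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not become the "current" node.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}  # their text is code, not page content


class Node:
    """One element of the page (hypothetical helper class for this sketch)."""

    def __init__(self, tag, attrs=(), parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.items = []  # child Nodes and text strings, in document order

    def word_count(self):
        # Words in this element, including all of its descendants.
        return sum(item.word_count() if isinstance(item, Node) else len(item.split())
                   for item in self.items)


class TreeBuilder(HTMLParser):
    """Builds a simple element tree; tolerant of stray or missing end tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.items.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore unmatched end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP_TAGS and data.strip():
            self.current.items.append(data)


def smallest_majority_block(html, tags=("div", "p"), share=0.5):
    """Return the deepest element in `tags` holding more than `share` of all words.

    With share >= 0.5 at most one child can qualify at each level, so a single
    walk down from the root is enough. Returns None if nothing qualifies.
    """
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    threshold = builder.root.word_count() * share
    node, best = builder.root, None
    while node is not None:
        if node.tag in tags:
            best = node
        node = next((child for child in node.items
                     if isinstance(child, Node) and child.word_count() > threshold), None)
    return best
```

Calling smallest_majority_block(page_source) returns a Node (or None), and its attrs dictionary can be used to see which element was picked. Real pages with badly nested markup would probably need a more forgiving parser than this.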
[EDIT] Perhaps if you can get it into the tree structure I suggested, you could then use a points system similar to SpamAssassin's. Define some rules that attempt to classify the information. Some examples:
+1 point for every 100 words
+1 point for each child element that has > 100 words
-1 point if the section name contains the word 'nav'
-2 points if the section name contains the word 'advert'
If you have lots of low-scoring rules that add up when you find more relevant-looking sections, I think this could grow into a fairly powerful and robust technique.
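Here is a rough sketch of how those four example rules might be applied, again using only the standard library's html.parser. It assumes that "section name" means the element's id and class attributes, and the names Node, TreeBuilder, PENALTIES, score and rank_sections are made up for illustration.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}

# Keyword found in the section name -> points added (negative = penalty).
PENALTIES = {"nav": -1, "advert": -2}


class Node:
    def __init__(self, tag, attrs=(), parent=None):
        self.tag, self.parent, self.items = tag, parent, []
        attrs = dict(attrs)
        # Assumption: the "section name" is the element's id plus its class.
        self.name = " ".join(filter(None, (attrs.get("id"), attrs.get("class")))).lower()

    def children(self):
        return [item for item in self.items if isinstance(item, Node)]

    def word_count(self):
        # Words in this element, including all of its descendants.
        return sum(item.word_count() if isinstance(item, Node) else len(item.split())
                   for item in self.items)

    def walk(self):
        yield self
        for child in self.children():
            yield from child.walk()


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.items.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore unmatched end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP_TAGS and data.strip():
            self.current.items.append(data)


def score(node):
    points = node.word_count() // 100                                   # +1 per 100 words
    points += sum(1 for c in node.children() if c.word_count() > 100)   # wordy children
    points += sum(p for kw, p in PENALTIES.items() if kw in node.name)  # nav / advert
    return points


def rank_sections(html, tags=("div", "p")):
    """Return (score, node) pairs for every element in `tags`, best first."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    sections = [n for n in builder.root.walk() if n.tag in tags]
    return sorted(((score(n), n) for n in sections),
                  key=lambda pair: pair[0], reverse=True)
```

The first entry of rank_sections(page_source) is then the best guess at the main content. New rules can be added as extra lines in score() or extra entries in PENALTIES, which is where the "lots of low-scoring rules" idea would come in.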
[EDIT2] Looking at Readability, it seems to do pretty much exactly what I just suggested! Maybe it could be improved to understand tables better?