parsing - How might one go about implementing a forward index in PHP?


I'm looking to implement a simple forward indexer in PHP. Yes, I realize PHP is probably not the best tool for the job, but I want to do it anyway. The logic behind this is simple: I want one, and I want it in PHP.

Let us make some basic assumptions:

  1. The entire Interweb consists of approximately five thousand HTML and/or plain-text documents. Each document resides within a particular domain (UID).

  2. The result of our awesome PHP-based forward indexing algorithm should be along the following lines:

    UID 1 -> index.html -> Helen, with, that, Champion, Freqs

    UID 1 -> foo.html -> Chicken, Uhad, go, home, eat, sheep

    UID 2 -> blah.html -> next, week, current, badgearwawa

    UID 2 -> gah.txt -> one, one, and, one, is, no, numberwise

Ideally, I would love to see solutions that take into account, or even illustrate, the concepts of tokenization / word-boundary disambiguation / part-of-speech tagging. Of course, I realize that this is wishful thinking, and I will therefore be humbled by any worthy effort at parsing said fictional documents to:

  1. Extract the actual text content within each document as a list of words.
  2. Ignore any garbage such as <html> tags, and compute a list consisting of the UID (which can be a domain, for example), followed by the name of the document (the resource within the domain), and finally the list of words for the document. I realize that HTML tags play an important role in the semantics of the text within a document, but at this level I do not care.
  3. Keeping a solution in mind, make a list of words that the document is cooler to read, which needs to be read in the first document.
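The first two steps above could be sketched roughly as follows. This is a minimal illustration, not a full tokenizer: the function name extractWords and the sample markup are made up for the example, and "word" is assumed to mean any run of letters, digits, or apostrophes.

```php
<?php
/**
 * Hypothetical helper: reduce an HTML or plain-text document to a list of
 * lowercase word tokens. Word boundaries are assumed to be any run of
 * characters that are not letters, digits, or apostrophes.
 */
function extractWords(string $document): array
{
    // Drop <script> and <style> elements together with their contents;
    // removing only the tags would leave the code inside them behind.
    $text = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $document) ?? '';

    // Replace remaining tags with spaces so adjacent words do not fuse,
    // then turn entities such as &amp; back into characters.
    $text = preg_replace('/<[^>]+>/', ' ', $text) ?? '';
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Lowercase (multibyte-aware when the mbstring extension is available).
    $text = function_exists('mb_strtolower')
        ? mb_strtolower($text, 'UTF-8')
        : strtolower($text);

    // Split on runs of non-word characters, discarding empty pieces.
    $words = preg_split("/[^\\p{L}\\p{N}']+/u", $text, -1, PREG_SPLIT_NO_EMPTY);

    return $words === false ? [] : $words;
}

// Sample document, invented for illustration.
$html = '<html><head><script>var x = 1;</script></head>'
      . '<body><h1>Chicken</h1><p>go home &amp; eat sheep</p></body></html>';

echo implode(', ', extractWords($html)), PHP_EOL;
```

Plain-text documents pass through the same function unchanged, since they simply contain no tags to remove. Proper word-boundary disambiguation or part-of-speech tagging would need far more than a regular expression.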

    At this stage, I do not care how the index is stored. Even a rudimentary group of 'print' statements will suffice.

    Thanks in advance; I hope this was clear enough.
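As a rough sketch of the "print statements" approach, the index can live in a nested PHP array (UID, then document name, then word list) and be printed one line per document. The function names, the input shape (a list of [uid, name, content] triples), and the sample documents are all assumptions made for this example; the tokenizer is deliberately naive and ASCII-only.

```php
<?php
/** Hypothetical naive tokenizer: strip tags, lowercase, split on non-alphanumerics. */
function tokenize(string $text): array
{
    $words = preg_split('/[^a-z0-9]+/', strtolower(strip_tags($text)), -1, PREG_SPLIT_NO_EMPTY);
    return $words === false ? [] : $words;
}

/**
 * Build a forward index: UID => document name => list of words.
 * $documents is assumed to be a list of [uid, name, rawContent] triples.
 */
function buildForwardIndex(array $documents): array
{
    $index = [];
    foreach ($documents as [$uid, $name, $content]) {
        $index[$uid][$name] = tokenize($content);
    }
    return $index;
}

/** Render the index as "UID -> document -> word, word, ..." lines. */
function formatForwardIndex(array $index): array
{
    $lines = [];
    foreach ($index as $uid => $docs) {
        foreach ($docs as $name => $words) {
            $lines[] = sprintf('%s -> %s -> %s', $uid, $name, implode(', ', $words));
        }
    }
    return $lines;
}

// Sample documents, invented for illustration.
$documents = [
    ['UID 1', 'page.html', '<p>Chicken go home, eat sheep</p>'],
    ['UID 2', 'gah.txt', 'One, one and one is no numberwise'],
];

$lines = formatForwardIndex(buildForwardIndex($documents));
echo implode(PHP_EOL, $lines), PHP_EOL;
```

Keeping the index as a plain array defers the storage question entirely; the same structure could later be serialized or written to a database without changing how it is built.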

    $p->load("www.page.com"); $p->find("body")->plaintext;

    And it will give you all the text. If you just want to iterate over the links:

      foreach ($p->find("a") as $link) { echo $link->innertext; }

    Take a look at it; it is very useful and powerful.
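The snippet above relies on a parser object $p whose setup is not shown; the method names resemble the third-party PHP Simple HTML DOM Parser, though that is only a guess. As an alternative that needs nothing beyond PHP's bundled DOM extension, the same two operations (grab the body text, iterate over the links) might look like the sketch below. The sample markup is invented for illustration.

```php
<?php
// Sample markup, invented for illustration. A real indexer would fetch
// each document first (for example with file_get_contents()).
$html = '<html><body><p>Hello world</p>'
      . '<a href="/a">First link</a> <a href="/b">Second link</a></body></html>';

// Collect libxml warnings internally instead of printing them;
// real-world HTML is rarely well-formed.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Text of the whole <body>, with all tags dropped.
$body = $doc->getElementsByTagName('body')->item(0);
$bodyText = $body !== null ? trim($body->textContent) : '';

// Text of each <a> element.
$linkTexts = [];
foreach ($doc->getElementsByTagName('a') as $link) {
    $linkTexts[] = trim($link->textContent);
}

echo $bodyText, PHP_EOL;
foreach ($linkTexts as $text) {
    echo $text, PHP_EOL;
}
```

Note that textContent concatenates the text of adjacent elements without inserting whitespace, so words from neighbouring tags can run together; a tokenizer built on top of it may want to walk the text nodes individually instead.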

