vectorization - Vectorizing List of Unique Words into 0 or 1 using Python


I am quite new to Python, and I have recently been working on some text processing to calculate the similarity between two texts.

So far I have been able to pre-process the texts: lowercasing them, tokenizing them, removing stopwords, and doing other basic pre-processing with the NLTK library. I have also been able to create a list of unique words from all the text files.

Now that I have the list of unique words, I want to vectorize it so that only some of the words are set to 1 (and the rest to 0), according to a text file that I have.

For example, after vectorizing the list of unique words, it should look like the following:

  terrible | Best | Move | Elephant | Fly | Home | Irresponsible | Vested
  1        | 1    | 0    | 0        | 0   | 1    | 0             | 0

I tried googling and looking through Stack Overflow, but the common solutions seem to use scikit-learn's feature extraction to convert the list into features. However, I only want 0 or 1, and the 1s should be determined by a text file. For example, there is a text file (every word in it vectorizes to 1) whose similarity to this dictionary I would like to calculate, so it should look something like the following:

Text_to_Compare.txt

  terrible | Fly | Vested
  1        | 1   | 1

And then, I will compare "Text_to_Compare.txt" to the list of unique words and calculate the similarity result.

Could anyone please tell me how I can vectorize the list of unique words into only 0s and 1s, and vectorize "Text_to_Compare.txt" to all 1s?

Thank you!

Do you want to do this?

  text_file = ['hello', 'world', 'test']
  term_dict = {'something': 0, 'word': 0, 'world': 0}

  for word in text_file:
      if word in term_dict:
          term_dict[word] = 1

If you have tokenized your file (with the .split() method in Python), then the words will be available as a list. Assuming that you have normalized each word (lowercased, stemmed, stripped of punctuation) in both your dictionary and your text_file, the above code should work. Just set all your values to 0, then loop through your file and check whether each word is in the dict. If so, set that value to 1.
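If you need an ordered vector rather than a dict (for example, to line it up against the list of unique words), the same membership check can be sketched as below. The function name `binary_vector` and the sample data are made up for illustration; they are not from the question.

```python
def binary_vector(vocabulary, tokens):
    """Return one 0/1 entry per vocabulary word, in vocabulary order."""
    present = set(tokens)  # a set makes each membership test cheap
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical data: the unique-word list and the tokens of one text file.
vocabulary = ["terrible", "best", "move", "elephant", "fly", "home"]
tokens = ["fly", "terrible", "fly", "unknown"]

vector = binary_vector(vocabulary, tokens)
print(vector)  # [1, 0, 0, 0, 1, 0]
```

Words that appear in the file but not in the vocabulary ("unknown" here) are simply ignored, and repeated words still map to a single 1.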

Here is how you can create a dict with the values set to 0:

  new_dict = {word: 0 for word in text_file}

That's it. Again, note that my code assumes that you are normalizing all the terms - comparing apples to apples - and that is always important when working with text.
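As a rough illustration of what "normalizing" could mean here, the sketch below lowercases each token and strips ASCII punctuation using only the standard library. It does not stem or remove stopwords (the question already does that with NLTK), and the helper names `normalize` and `tokenize` are my own choice.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(word):
    """Lowercase a token and remove ASCII punctuation (no stemming here)."""
    return word.lower().translate(_STRIP_PUNCT)


def tokenize(text):
    """Split on whitespace, normalize each token, and drop empty results."""
    return [tok for tok in (normalize(w) for w in text.split()) if tok]


print(tokenize("Hello, World! -- hello"))  # ['hello', 'world', 'hello']
```

Whatever normalization you pick, apply the same function to both the unique-word list and the file you compare against it.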

Last edit: if you have two lists of unique words (after tokenization and normalization):

  def normalize(word):
      # do stuff - i.e., lower; stem; strip punctuation; etc.
      return word.lower()

  # text_doc and other_text_doc are the two texts, as strings
  word_list_one = [normalize(word) for word in text_doc.split()]
  word_list_two = [normalize(word) for word in other_text_doc.split()]

  # If you know which of your lists is longer, you can build
  # a dictionary of ones and zeros from the two lists
  word_dict = dict([(word, 1) if word in word_list_one else (word, 0) for word in word_list_two])

  # Note that in the above code, word_list_two should be the longer
  # of your two lists (assuming I understand your code properly).
  # A person with more Python experience can definitely improve my code.
  # I just wanted to show you another option.
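Once both texts are 0/1 vectors over the same word list, the similarity step from the question could look something like the sketch below. The question does not say which measure it wants, so cosine and Jaccard similarity are my assumptions. The two sample vectors come from the question's example: the first is the row of values shown there, and the second marks "terrible", "Fly" and "Vested" in the same eight-word order.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length 0/1 vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Shared 1s divided by positions where either vector has a 1."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


doc_one = [1, 1, 0, 0, 0, 1, 0, 0]
doc_two = [1, 0, 0, 0, 1, 0, 0, 1]

print(jaccard_similarity(doc_one, doc_two))  # 0.2
print(cosine_similarity(doc_one, doc_two))   # roughly 0.333
```

Both functions require the vectors to be built against the same word list in the same order; otherwise the positions do not line up.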

Please let me know whether this works for you. Hope this helps a bit!

