vectorization - Vectorizing List of Unique Words into 0 or 1 using Python
I am quite new to Python and have recently been working on some text processing in order to compute a cosine similarity between two texts.
So far I have been able to do basic pre-processing on the text using the NLTK library, such as lowercasing, tokenizing and removing stopwords. I have also been able to create a list of unique words from all the text files.
Now that I have the list of unique words, only a few of those words should be vectorized to 1 (and the rest to 0), according to a text file that I supply.
For example, after vectorizing the list of unique words, it should look like the following:

terrible | Best | Move | Elephant | Fly | Home | Irresponsible | Vested
1 | 1 | 0 | 0 | 0 | 1 | 0 | 0

I tried Googling and searching Stack Overflow, but the common solution seems to be scikit-learn's feature extraction, which converts the list into features. However, I only want 0 or 1, and which words get a 1 should be specified by a text file. For example, there is a text file (whose words are all vectorized to 1) that I would like to compare against this dictionary to calculate the similarity, so it should look something like the following:
Text_to_Compare.txt

terrible | Fly | Vested
1 | 1 | 1

And then I will compare "Text_to_Compare.txt" to the list of unique words and calculate the similarity result.
Could anyone please tell me how I can vectorize the list of unique words into only 0 or 1, and vectorize "Text_to_Compare.txt" to all 1s?
Thank you!
Is this what you want to do?

    text_file = ['hello', 'world', 'test']
    term_dict = {'something': 0, 'word': 0, 'world': 0}

    # Flag every dictionary term that also appears in the file
    for word in text_file:
        if word in term_dict:
            term_dict[word] = 1

Once you have tokenized your file (with the .split() method in Python), the words will be available as a list. Assuming that you have normalized each word (lowercased, stemmed, stripped of punctuation) in both your dictionary and your text_file, the above code should work. Just set all your values to 0 and loop through your file, checking whether each word is in the dict. If so, set that value to 1.
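If an ordered 0/1 vector is more convenient than a dictionary, the same membership check can be sketched as follows. This is only an illustration: the helper name `binary_vector` is hypothetical, and the word lists are borrowed from the question's example and lowercased on the assumption that both sides have already been normalized.

```python
def binary_vector(vocabulary, tokens):
    """Return one 0/1 flag per vocabulary word, in vocabulary order."""
    present = set(tokens)  # set lookups avoid rescanning the token list
    return [1 if word in present else 0 for word in vocabulary]

# Example data taken from the question (assumed already normalized)
vocabulary = ["terrible", "best", "move", "elephant",
              "fly", "home", "irresponsible", "vested"]
tokens = ["terrible", "fly", "vested"]

vector = binary_vector(vocabulary, tokens)
print(vector)  # [1, 0, 0, 0, 1, 0, 0, 1]
```

Keeping the vocabulary in a fixed order matters here, because two vectors can only be compared position by position if they were built over the same word order.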
Here is how you can create a dictionary with all the values set to 0:

    new_dict = {word: 0 for word in text_file}

That's it. Again, note that my code assumes that you are normalizing all the terms, so that you compare apples to apples. That is always important when working with text.
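To make the normalization step concrete, here is a minimal sketch using only the standard library. The `normalize` helper and the sample sentence are made up for illustration, and the helper deliberately skips stemming and stopword removal, which the question handles with NLTK.

```python
import string

def normalize(word):
    """Lowercase a token and strip leading/trailing punctuation (no stemming)."""
    return word.lower().strip(string.punctuation)

# Hypothetical sample text, not from the original question
raw_text = "Terrible! The elephant can FLY, apparently."

tokens = [normalize(w) for w in raw_text.split()]
tokens = [t for t in tokens if t]  # drop tokens that were pure punctuation

# Every normalized word starts out with the value 0
zero_dict = {word: 0 for word in tokens}
print(tokens)
```

Whatever normalization is chosen, the point is to apply the same function to both the dictionary words and the words from the file being compared.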
Last edit: if you have two lists of unique words (after tokenization and normalization):

    def normalize(word):
        # Do stuff here, e.g. lowercase, stem, strip punctuation, etc.
        # Lowercasing is only a placeholder so the function returns a value.
        return word.lower()

    # text_doc and other_text_doc are the two documents as strings
    word_list_one = [normalize(word) for word in text_doc.split()]
    word_list_two = [normalize(word) for word in other_text_doc.split()]

    # If you know which of your two lists is the longer one, you can
    # create the dictionary of zeros and ones directly from the two lists
    word_dict = dict([(word, 1) if word in word_list_one else (word, 0)
                      for word in word_list_two])

    # Note: in the code above, word_list_two should be the longer of your
    # two lists (if I understand your question properly)

Someone with more Python experience can definitely improve my code. I just wanted to show you another option. Please let me know whether this works for you. Hope this helps a bit!
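The question's end goal is a similarity score between two texts. One common choice for 0/1 vectors is cosine similarity, which is an assumption here rather than something the code above computes. A minimal sketch, with a hypothetical `cosine_similarity` helper and two example vectors built over the same eight-word vocabulary from the question:

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)

# First vector: the 0/1 row shown in the question.
# Second vector: terrible, fly and vested flagged in the same word order.
doc_vector = [1, 1, 0, 0, 0, 1, 0, 0]
compare_vector = [1, 0, 0, 0, 1, 0, 0, 1]

score = cosine_similarity(doc_vector, compare_vector)
print(round(score, 2))  # 0.33
```

Both vectors must be built over the same vocabulary in the same order; otherwise the positions do not line up and the score is meaningless.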