python - Counting number of documents


I have a corpus, and I need to count the number of documents and tokens in the corpus as a whole, but also in each of its subparts.

The code I have developed so far looks like this:

    import os

    def combined_data(path):
        word = article = 0
        for root, dirs, files in os.walk(path):
            for f in files:
                # count only article files, skipping the metadata files
                if not f.endswith('_metadata.txt') and f.endswith('.txt'):
                    article += 1
                    p = os.path.join(root, f)
                    with open(p) as two_file:
                        for line in two_file.readlines():
                            word += len(line.split())
        # write_to_data and current_path come from the rest of my script (not shown)
        write_to_data(word, article, current_path)

The counting is very crude, I know, and it needs more work. However, I can't figure out how to count the total (the whole corpus) and also each of the parts the corpus is made of. The structure is: the whole corpus contains parts (part 1, part 2, part 3), and each part contains sub-parts of its own, so part 1 contains (part 1, part 2, part 3).

So in summary it is a list of lists:

    [corpus, [part 1 [part 1, part 2]], [part 3 [...]] ...]

So I think the order would be (from the example above):

    corpus -> counting part 1 -> counting part 1.part 1 -> counting part 1.part 2 -> calculation

Someone asked what the parts are. They are folders: the main folder is the corpus, that folder contains many folders, each of which is a part of the corpus, and those folders contain more folders or files. So it is folder -> folders -> (folders or files).

So basically I want to count all the files under each folder. I want one count for the root folder, which means everything is counted, then a count for each folder under the root, then a count for each folder inside those (if there are more folders).

And I want to print it like this:

    Corpus: x articles, x words
    Faculty of Natural Science: x articles, x words
    Physics Institute: x articles, x words

Here the Faculty of Natural Science is a sub-corpus of the corpus, and the Physics Institute is a sub-corpus of the Faculty of Natural Science. I hope this makes it clear.

When given the keyword argument topdown=False, os.walk generates a directory's subdirectories before the directory itself. In other words, it behaves like a post-order tree traversal, and we can use it to compute the number of entries in each part of the corpus recursively.
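To see that ordering for yourself, here is a minimal sketch. It assumes a throwaway tree created with tempfile; the folder names `part` and `sub` and the helper `walk_order` are made up for illustration.

```python
import os
import tempfile

def walk_order(top):
    """Return directory paths (relative to top) in the order os.walk yields them bottom-up."""
    return [os.path.relpath(dirpath, top)
            for dirpath, dirnames, filenames in os.walk(top, topdown=False)]

with tempfile.TemporaryDirectory() as top:
    # Hypothetical layout: top/part/sub
    os.makedirs(os.path.join(top, "part", "sub"))
    order = walk_order(top)

# The deepest folder comes first and the root ('.') comes last.
print(order)
```

Because every child shows up before its parent, any per-folder result for the children is already available when the parent is processed.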

Let's assume that our directory structure is like this:

    ./corpus
    ├── part_1
    │   ├── sub_1
    │   │   ├── 1
    │   │   ├── 2
    │   │   └── 3
    │   └── sub_2
    │       ├── 1
    │       └── 2
    └── part_2
        └── part_1
            ├── 1
            ├── 2
            └── 3

We can find the count of entries in each subdirectory by walking bottom-up and adding up the subdirectory sizes:

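    import os

    counts = {}
    for dirpath, dirnames, filenames in os.walk('./corpus', topdown=False):
        # files directly inside this directory
        counts[dirpath] = len(filenames)
        # subdirectories were visited earlier, so their totals already exist
        for d in dirnames:
            key = os.path.join(dirpath, d)
            counts[dirpath] += counts[key]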

A test:

    >>> counts
    {'./corpus': 8, './corpus/part_1': 5, './corpus/part_1/sub_1': 3, './corpus/part_1/sub_2': 2, './corpus/part_2': 3, './corpus/part_2/part_1': 3}
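The same bottom-up idea can carry both numbers from the question, articles and words, per folder. The following is only a sketch under some assumptions: the helper names `count_corpus` and `print_report` are made up, files are assumed to be UTF-8 text, and "words" means whitespace-separated tokens, as in the question's code.

```python
import os

def count_corpus(top):
    """Map each directory under top to (articles, words), subdirectories included."""
    totals = {}
    for dirpath, dirnames, filenames in os.walk(top, topdown=False):
        articles = words = 0
        for name in filenames:
            # Same filter as in the question: .txt files that are not metadata.
            if name.endswith(".txt") and not name.endswith("_metadata.txt"):
                articles += 1
                # Assumption: the files are UTF-8 encoded text.
                with open(os.path.join(dirpath, name), encoding="utf-8") as fh:
                    words += sum(len(line.split()) for line in fh)
        # Children were visited earlier, so their totals are already stored.
        # .get() covers entries os.walk lists but does not descend into
        # (for example symlinked directories).
        for d in dirnames:
            sub_articles, sub_words = totals.get(os.path.join(dirpath, d), (0, 0))
            articles += sub_articles
            words += sub_words
        totals[dirpath] = (articles, words)
    return totals

def print_report(top):
    """Print one line per folder; a parent path sorts before its children."""
    totals = count_corpus(top)
    for dirpath in sorted(totals):
        articles, words = totals[dirpath]
        print(f"{dirpath}: {articles} articles, {words} words")
```

Calling `print_report('./corpus')` prints one line per folder with its path, article count and word count. Turning the paths into display names such as "Faculty of Natural Science" is left out here, since that depends on how the folders are actually named.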
