Question on input data format

`example.sh` describes the input format as:

```
This file provides information about running the Dynamic Topic Model
or the Document Influence Model.  It gives two command-line examples
for running the software and several example commands in R for reading
output files.

Dynamic topic models and the influence model have been implemented
here in c / c++.  This implementation takes two input files:

 (a) foo-mult.dat, which is one-doc-per-line, each line of the form

   unique_word_count index1:count1 index2:count2 ... indexn:countn

   where each index is an integer corresponding to a unique word.

 (b) foo-seq.dat, which is of the form

   Number_Timestamps
   number_docs_time_1
   ...
   number_docs_time_i
   ...
   number_docs_time_NumberTimestamps

   - The docs in foo-mult.dat should be ordered by date, with the first
     docs from time1, the next from time2, ..., and the last docs from
     timen.
```

`test-mult.dat` looks like this (1000 lines):

```
28 12:1 44:1 75:10 76:1 77:1 78:1 79:1 80:1 81:2 82:1 83:1 84:1 85:1 86:1 87:2 88:4 89:1 90:1 91:1 92:1 93:1 94:1 95:1 96:1 97:1 98:1 99:2 100:1
60 771:1 388:1 98:1 134:1 8:1 908:1 1037:1 600:1 405:1 1046:1 516:1 27:2 773:1 37:1 1137:1 1138:1 302:1 433:2 51:1 59:1 999:1 1119:1 224:1 67:1 69:1 71:1 584:1 330:1 77:1 269:1 337:1 83:1 1112:1 349:2 1118:1 1125:1 1120:1 1121:1 1122:1 1123:1 1124:1 101:2 1126:1 1127:1 488:1 1129:1 618:3 1131:1 1132:1 1133:1 1134:1 1135:1 1136:2 1128:1 114:4 1139:1 1140:1 1141:1 631:1 1130:1
17 257:1 546:2 547:1 548:1 549:1 6:1 551:1 552:1 553:1 554:1 550:2 418:1 174:1 433:1 315:1 92:1 415:1
11 288:1 1248:1 5:1 1063:2 269:1 654:1 656:2 532:1 373:1 1247:1 543:1
25 909:1 407:1 797:1 543:1 555:1 693:1 823:4 569:1 1226:1 1227:1 1228:2 1229:1 1230:1 1231:4 1232:1 1233:4 1234:1 1235:1 1236:1 1237:3 1238:1 1239:1 1106:1 113:1 243:1
```
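To make the per-line layout concrete, here is a minimal Python sketch that parses one line of a `-mult.dat` file according to the description quoted from `example.sh`. The function name `parse_mult_line` is hypothetical and is not part of the DTM code; the consistency check against the leading count is my own assumption about what that first number means.

```python
def parse_mult_line(line):
    """Parse 'n index1:count1 ... indexn:countn' into a {word_index: count} dict.

    Hypothetical helper for illustration only; not part of the DTM release.
    """
    fields = line.split()
    n_unique = int(fields[0])  # leading field: number of unique words in the document
    counts = {}
    for pair in fields[1:]:
        index, count = pair.split(":")
        counts[int(index)] = int(count)
    # Assumption: the leading number equals the number of index:count pairs.
    if len(counts) != n_unique:
        raise ValueError(
            f"expected {n_unique} unique words, found {len(counts)}"
        )
    return counts


# Fourth sample line from test-mult.dat shown above.
doc = parse_mult_line(
    "11 288:1 1248:1 5:1 1063:2 269:1 654:1 656:2 532:1 373:1 1247:1 543:1"
)
print(len(doc), sum(doc.values()))
```

For that sample line the dict has 11 entries, matching the leading `11`, which is consistent with reading the first field as the unique-word count.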

`test-seq.dat` looks like this (10 lines):

```
10
25
50
75
100
100
100
100
```

I don't understand how time periods are matched between `test-mult.dat` (one document per line) and `test-seq.dat` (the number of documents per time period; in this case, 10 time periods). Can someone clarify how the input data should be formatted? Are we assuming that the first 10 documents in `test-mult.dat` correspond to time period 1, the next 25 documents to time period 2, and so on?
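For reference, here is how I would sketch the correspondence if the description in `example.sh` is read literally: the first line of the `-seq.dat` file is the number of time periods, each following line is the number of documents in one period, and documents in the `-mult.dat` file are assigned to periods in order. This is only my reading, not confirmed behaviour of the software. The helper names `read_seq` and `slice_of_each_doc` and the tiny three-period input are hypothetical.

```python
def read_seq(text):
    """Return the list of per-period document counts from -seq.dat contents.

    Assumes the first number is Number_Timestamps and the rest are the
    per-period counts, as described in example.sh.
    """
    values = [int(token) for token in text.split()]
    n_periods, docs_per_period = values[0], values[1:]
    if len(docs_per_period) != n_periods:
        raise ValueError(
            f"header says {n_periods} periods, found {len(docs_per_period)} counts"
        )
    return docs_per_period


def slice_of_each_doc(docs_per_period):
    """Return a 0-based period label for every document line, in file order."""
    labels = []
    for period, n_docs in enumerate(docs_per_period):
        labels.extend([period] * n_docs)
    return labels


# Hypothetical seq file: 3 periods holding 2, 1 and 3 documents.
counts = read_seq("3\n2\n1\n3\n")
labels = slice_of_each_doc(counts)
print(labels)  # [0, 0, 1, 2, 2, 2]
```

Under this reading, the leading `10` in `test-seq.dat` would be the number of time periods rather than a document count, and the first 25 lines of `test-mult.dat` would belong to time period 1. The per-period counts should then sum to the number of lines in `test-mult.dat`, which gives a quick sanity check.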

