Python XML parsing of hand-written or hand-edited XML files

agredo · 29 January 2016

Hello, dear community.

I'm currently trying my hand at Python. I've written a small program that stores a noun together with its plural and its article (gender) in an XML file.

Code:

<?xml version="1.0"?>

<lexicon-segment>

<group name="Pretty in Pink">Name1, Name2, Name3</group>

<word plural="Hunde" gender="masc">Hund</word>

<word plural="Frauen" gender="fem">Frau</word>

<word plural="Häuser" gender="neut">Haus</word>

<word plural="Stühle" gender="masc">Stuhl</word>

<word plural="Bären" gender="masc">Bär</word>

<word plural="Bäume" gender="masc">Baum</word>

</lexicon-segment>

Now I read this XML back in. That works without any problems.

I once had a case where such an XML file had been edited by hand or written entirely by hand. Unfortunately I can't trace which of the two it was.

And unfortunately that file can no longer be read in.

Code:

<?xml version="1.0"?>

<lexicon-segment>

<group name="Samantha">Name1, Name2, Name3, Name4</group>

<word plural="Filme" gender="masc">Film</word>

<word plural="Autoren" gender="masc">Autor</word>

<word plural="Schnitte" gender="masc">Schnitt</word>

<word plural="Damen" gender="fem">Dame</word>

</lexicon-segment>

On the line myStuff = ET.parse(myFile); I always get the error "not well-formed (invalid token): line 1, column 1".
Unfortunately I don't understand how to deal with this error.

Saving the XML

Code:

            ET.SubElement(root, "word", gender=words[2], plural=words[1]).text = words[0];
            tree = ET.ElementTree(root);
            tree.write("lexicon-segment-Pretty-in-Pink-From-" + name + ".xml");

Reading all XML files in the folder

Code:

        #Get Files
        allFiles = glob.glob("./*.xml");

        for i in range(len(allFiles)):
            allFiles[i] = allFiles[i][2:-1];
            allFiles[i] += "l";

        allFilesList = [];

        #Store all xmls
        for i in range(len(allFiles)):
            myFile = open(allFiles[i]);
            myStuff = ET.parse(myFile);
            root = myStuff.getroot();
            allFilesList.append(root);           

        list = root.attrib;
        print(list);
        
        attList =  [];

Could someone also briefly explain how to store a string from index n to the end?
At the moment I've solved it like this:

for i in range(len(allFiles)):
    allFiles[i] = allFiles[i][2:-1];
    allFiles[i] += "l";

I can only get it to work without the "l". Sorry, but I really couldn't find out how to express that inside the brackets.

Thanks in advance for answering what are probably silly questions.

Agredo

0x8100 · 29 January 2016

A string from index n to the end is simple:

Code:

langer_string = 'abcdef'
kurzer_string = langer_string[2:] # cdef

I assume you want to remove the "./" from your glob results? That isn't necessary, though.

Code:

from lxml import etree
import glob

allFilesList = []

for file in glob.glob("./*.xml"):
  allFilesList.append(etree.parse(file).getroot())
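
To make the slicing point concrete, here is a minimal sketch using only the standard library. The file name "./lexicon.xml" and the helper xml_files are made up for illustration; the point is that [2:-1] stops one character before the end, which is why the "l" had to be appended again, while [2:] keeps the rest of the string.

```python
import glob
import os

# Slicing: [n:] keeps everything from index n to the end,
# while [n:-1] stops one character before the end.
path = "./lexicon.xml"      # hypothetical file name
from_index_2 = path[2:]     # "lexicon.xml"
one_short = path[2:-1]      # "lexicon.xm" (the final "l" is cut off)


def xml_files(folder="."):
    """Return the sorted paths of all .xml files in `folder` (hypothetical helper)."""
    return sorted(glob.glob(os.path.join(folder, "*.xml")))
```

The paths returned by glob can be handed to the parser directly, so no string surgery is needed at all.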

Is the hyphen in the XML intentional? It makes the XML not well-formed.

Code:

-<lexicon-segment>

agredo · 29 January 2016

Thanks, I thought I had tried str[2:].

Yes, of course, it works now.

Thanks.

I didn't know that I can leave out the "./". I'll try that right away, thanks.

And regarding the last point, the actual problem: I made a mistake there. I opened the XML in the browser, and the browser added a "-" in front. I've edited that out of my post.
To restate the problem:
I can read the file that was created by this program. Another file, which may have been edited by hand or written entirely by hand, I cannot read. That one produces the error "not well-formed (invalid token): line 1, column 1".

0x8100 · 29 January 2016

On Linux I use "xmllint" in the console to check whether XML files are valid. The same thing is available online (search for "xml validator", e.g. http://www.xmlvalidation.com), so you can have your file checked there.
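
A well-formedness check can also be done from Python itself. This is only a sketch with the standard library parser; the helper name check_well_formed and the two sample documents are made up for illustration, and the exact wording of the error message depends on the parser.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_bytes):
    """Return None if xml_bytes parses, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError as err:
        return str(err)
    return None


# Hypothetical samples: one well-formed, one with a stray "-" before the root.
good = b'<lexicon-segment><word plural="Hunde" gender="masc">Hund</word></lexicon-segment>'
bad = b'-<lexicon-segment></lexicon-segment>'

good_result = check_well_formed(good)   # None
bad_result = check_well_formed(bad)     # an error message string
```

Catching ET.ParseError like this is also one way to skip or report a broken file instead of letting the whole loop over the folder abort.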

agredo · 29 January 2016

Nothing was found there.
When I choose "Validate against external XML Schema", practically every line is reported as wrong.

217 errors have been found!

Errors in the XML document:

1: 18 cvc-elt.1: Cannot find the declaration of element 'lexicon-segment'.

Errors in file xml-schema:

1: 18 s4s-elt-invalid: Element 'lexicon-segment' is not a valid element in a schema document.
1: 18 s4s-elt-schema-ns: The namespace of element 'lexicon-segment' must be from the schema namespace, 'http://www.w3.org/2001/XMLSchema'.
1: 18 schema_reference.4: Failed to read schema document 'file:///xml-schema', because 1) could not find the document; 2) the document could not be read; 3) the root element of the document is not <xsd:schema>.
1: 65 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Cassandra, Michael'.
1: 96 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'a, Christina, Christian, Leonie'.
1: 143 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Hund'.
1: 189 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Frau'.
1: 241 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Haus'.
1: 294 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Stuhl'.
1: 342 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'B'.
1: 348 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Ã¤'.
1: 349 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'r'.

XML document:

1 <lexicon-segment><group name="Pretty in Pink">Cassandra, Michaela, Christina, Christian, Leonie</group><word gender="masc" plural="Hunde">Hund</word><word gender="fem" plural="Frauen">Frau</word><word gender="neut" plural="Häuser">Haus</word><word gender="masc" plural="Stühle">Stuhl</word><word gender="masc" plural="Bären">Bär</word></lexicon-segment>

xml-schema

1 <lexicon-segment><group name="Pretty in Pink">Name1, Name2, Name3, Name4</group><word gender="masc" plural="Hunde">Hund</word><word gender="fem" plural="Frauen">Frau</word><word gender="neut" plural="Häuser">Haus</word><word gender="masc" plural="Stühle">Stuhl</word><word gender="masc" plural="Bären">Bär</word></lexicon-segment>

And reading works for that one.

And here is the one that did not work:

8416 errors have been found!

Errors in the XML document:

1: 18 cvc-elt.1: Cannot find the declaration of element 'lexicon-segment'.

Errors in file xml-schema:

1: 18 s4s-elt-invalid: Element 'lexicon-segment' is not a valid element in a schema document.
1: 18 s4s-elt-schema-ns: The namespace of element 'lexicon-segment' must be from the schema namespace, 'http://www.w3.org/2001/XMLSchema'.
1: 18 schema_reference.4: Failed to read schema document 'file:///xml-schema', because 1) could not find the document; 2) the document could not be read; 3) the root element of the document is not <xsd:schema>.
3: 20 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Imke Grothenn, Is'.
3: 77 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'abelle Schulz, Lea Jakubigk, Saskia Langrock Nicole Hober'.
5: 41 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Film'.
6: 44 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Autor'.
7: 47 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Schnitt'.
8: 40 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Dame'.
9: 51 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Drehbuch'.
10: 44 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Licht'.
11: 40 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Ton'.
12: 43 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Preis'.
13: 50 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Abenteuer'.
14: 62 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Schauspielerin'.
15: 46 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Sprache'.
16: 43 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Wort'.
17: 41 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Laut'.
18: 49 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Ausdruck'.
19: 59 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Kommunikation'.
20: 51 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'GesprÃƒÂ¤ch'.
21: 42 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Satz'.
22: 69 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Sprachwissenschaft'.
23: 52 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Wortgruppe'.
24: 56 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Sprachwandel'.
25: 45 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Bilanz'.
26: 45 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Gewinn'.
27: 47 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Verlust'.
28: 44 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Steuer'.
29: 47 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Konzern'.
30: 48 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Vertrag'.
31: 48 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Abkommen'.
32: 50 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Strategie'.
33: 41 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Bank'.
34: 50 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Vorstand'.
35: 43 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Baum'.
36: 40 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Ast'.
37: 43 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Wald'.
38: 42 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Blume'.
39: 44 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'BlÃƒÂ¼te'.
40: 43 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Gras'.
41: 44 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'KÃƒÂ¤fer'.
42: 41 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Tier'.
43: 44 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Busch'.
44: 41 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Pilz'.
45: 43 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Kunst'.
46: 50 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'KÃƒÂ¼nstler'.
47: 42 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Bild'.
48: 49 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'PortrÃƒÂ¤t'.
49: 48 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'GemÃƒÂ¤lde'.
50: 44 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Pinsel'.
51: 51 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Zeichnung'.
52: 42 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Farbe'.
53: 43 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Stift'.
54: 51 s4s-elt-character: Non-whitespace characters are not allowed in schema elements other than 'xs:appinfo' and 'xs:documentation'. Saw 'Staffelei'.

XML document:

1 <lexicon-segment>
2 <group name="Samantha">
3 Imke Grothenn, Isabelle Schulz, Lea Jakubigk, Saskia Langrock Nicole Hober
4 </group>
5 <word gender="masc" plural="Filme">Film</word>
6 <word gender="masc" plural="Autoren">Autor</word>
7 <word gender="masc" plural="Schnitte">Schnitt</word>
8 <word gender="fem" plural="Damen">Dame</word>
9 <word gender="neut" plural="DrehbÃƒÂ¼cher">Drehbuch</word>
10 <word gender="neut" plural="Lichter">Licht</word>
11 <word gender="masc" plural="TÃƒÂ¶ne">Ton</word>
12 <word gender="masc" plural="Preise">Preis</word>
13 <word gender="neut" plural="Abenteuer">Abenteuer</word>
14 <word gender="fem" plural="Schauspielerinnen">Schauspielerin</word>
15 <word gender="fem" plural="Sprachen">Sprache</word>
16 <word gender="neut" plural="WÃƒÂ¶rter">Wort</word>
17 <word gender="masc" plural="Laute">Laut</word>
18 <word gender="fem" plural="AusdrÃƒÂ¼cke">Ausdruck</word>
19 <word gender="fem" plural="Kommunikationen">Kommunikation</word>
20 <word gender="neut" plural="GesprÃƒÂ¤che">GesprÃƒÂ¤ch</word>
21 <word gender="masc" plural="SÃƒÂ¤tze">Satz</word>
22 <word gender="fem" plural="Sprachwissenschaften">Sprachwissenschaft</word>
23 <word gender="fem" plural="Wortgruppen">Wortgruppe</word>
24 <word gender="masc" plural="Sprachwandel">Sprachwandel</word>
25 <word gender="fem" plural="Bilanzen">Bilanz</word>
26 <word gender="masc" plural="Gewinne">Gewinn</word>
27 <word gender="masc" plural="Verluste">Verlust</word>
28 <word gender="fem" plural="Steuern">Steuer</word>
29 <word gender="masc" plural="Konzerne">Konzern</word>
30 <word gender="masc" plural="VertrÃƒÂ¤ge">Vertrag</word>
31 <word gender="neut" plural="Abkommen">Abkommen</word>
32 <word gender="fem" plural="Strategien">Strategie</word>
33 <word gender="fem" plural="Banken">Bank</word>
34 <word gender="masc" plural="VorstÃƒÂ¤nde">Vorstand</word>
35 <word gender="masc" plural="BÃƒÂ¤ume">Baum</word>
36 <word gender="masc" plural="Ãƒ?ste">Ast</word>
37 <word gender="masc" plural="WÃƒÂ¤lder">Wald</word>
38 <word gender="fem" plural="Blumen">Blume</word>
39 <word gender="fem" plural="BlÃƒÂ¼ten">BlÃƒÂ¼te</word>
40 <word gender="neut" plural="GrÃƒÂ¤ser">Gras</word>
41 <word gender="masc" plural="KÃƒÂ¤fer">KÃƒÂ¤fer</word>
42 <word gender="neut" plural="Tiere">Tier</word>
43 <word gender="masc" plural="BÃƒÂ¼sche">Busch</word>
44 <word gender="masc" plural="Pilze">Pilz</word>
45 <word gender="fem" plural="KÃƒÂ¼nste">Kunst</word>
46 <word gender="masc" plural="KÃƒÂ¼nstler">KÃƒÂ¼nstler</word>
47 <word gender="neut" plural="Bilder">Bild</word>
48 <word gender="neut" plural="PortrÃƒÂ¤ts">PortrÃƒÂ¤t</word>
49 <word gender="neut" plural="GemÃƒÂ¤lde">GemÃƒÂ¤lde</word>
50 <word gender="masc" plural="Pinsel">Pinsel</word>
51 <word gender="fem" plural="Zeichnungen">Zeichnung</word>
52 <word gender="fem" plural="Farben">Farbe</word>
53 <word gender="masc" plural="Stifte">Stift</word>
54 <word gender="fem" plural="Staffeleien">Staffelei</word>
55 </lexicon-segment>

xml-schema

1 <lexicon-segment>
2 <group name="Samantha">
3 Imke Grothenn, Isabelle Schulz, Lea Jakubigk, Saskia Langrock Nicole Hober
4 </group>
5 <word gender="masc" plural="Filme">Film</word>
6 <word gender="masc" plural="Autoren">Autor</word>
7 <word gender="masc" plural="Schnitte">Schnitt</word>
8 <word gender="fem" plural="Damen">Dame</word>
9 <word gender="neut" plural="DrehbÃƒÂ¼cher">Drehbuch</word>
10 <word gender="neut" plural="Lichter">Licht</word>
11 <word gender="masc" plural="TÃƒÂ¶ne">Ton</word>
12 <word gender="masc" plural="Preise">Preis</word>
13 <word gender="neut" plural="Abenteuer">Abenteuer</word>
14 <word gender="fem" plural="Schauspielerinnen">Schauspielerin</word>
15 <word gender="fem" plural="Sprachen">Sprache</word>
16 <word gender="neut" plural="WÃƒÂ¶rter">Wort</word>
17 <word gender="masc" plural="Laute">Laut</word>
18 <word gender="fem" plural="AusdrÃƒÂ¼cke">Ausdruck</word>
19 <word gender="fem" plural="Kommunikationen">Kommunikation</word>
20 <word gender="neut" plural="GesprÃƒÂ¤che">GesprÃƒÂ¤ch</word>
21 <word gender="masc" plural="SÃƒÂ¤tze">Satz</word>
22 <word gender="fem" plural="Sprachwissenschaften">Sprachwissenschaft</word>
23 <word gender="fem" plural="Wortgruppen">Wortgruppe</word>
24 <word gender="masc" plural="Sprachwandel">Sprachwandel</word>
25 <word gender="fem" plural="Bilanzen">Bilanz</word>
26 <word gender="masc" plural="Gewinne">Gewinn</word>
27 <word gender="masc" plural="Verluste">Verlust</word>
28 <word gender="fem" plural="Steuern">Steuer</word>
29 <word gender="masc" plural="Konzerne">Konzern</word>
30 <word gender="masc" plural="VertrÃƒÂ¤ge">Vertrag</word>
31 <word gender="neut" plural="Abkommen">Abkommen</word>
32 <word gender="fem" plural="Strategien">Strategie</word>
33 <word gender="fem" plural="Banken">Bank</word>
34 <word gender="masc" plural="VorstÃƒÂ¤nde">Vorstand</word>
35 <word gender="masc" plural="BÃƒÂ¤ume">Baum</word>
36 <word gender="masc" plural="Ãƒ?ste">Ast</word>
37 <word gender="masc" plural="WÃƒÂ¤lder">Wald</word>
38 <word gender="fem" plural="Blumen">Blume</word>
39 <word gender="fem" plural="BlÃƒÂ¼ten">BlÃƒÂ¼te</word>
40 <word gender="neut" plural="GrÃƒÂ¤ser">Gras</word>
41 <word gender="masc" plural="KÃƒÂ¤fer">KÃƒÂ¤fer</word>
42 <word gender="neut" plural="Tiere">Tier</word>
43 <word gender="masc" plural="BÃƒÂ¼sche">Busch</word>
44 <word gender="masc" plural="Pilze">Pilz</word>
45 <word gender="fem" plural="KÃƒÂ¼nste">Kunst</word>
46 <word gender="masc" plural="KÃƒÂ¼nstler">KÃƒÂ¼nstler</word>
47 <word gender="neut" plural="Bilder">Bild</word>
48 <word gender="neut" plural="PortrÃƒÂ¤ts">PortrÃƒÂ¤t</word>
49 <word gender="neut" plural="GemÃƒÂ¤lde">GemÃƒÂ¤lde</word>
50 <word gender="masc" plural="Pinsel">Pinsel</word>
51 <word gender="fem" plural="Zeichnungen">Zeichnung</word>
52 <word gender="fem" plural="Farben">Farbe</word>
53 <word gender="masc" plural="Stifte">Stift</word>
54 <word gender="fem" plural="Staffeleien">Staffelei</word>
55 </lexicon-segment>

0x8100 · 29 January 2016

Validating against a schema is pointless if you don't have one. Hence the error messages.

A schema like that defines the structure of an XML file (e.g. how elements must be built, value ranges, etc.). See https://de.wikipedia.org/wiki/XML_Schema

Which XML library are you using? Python's own or, for example, lxml? Can you attach one of the supposedly invalid files?

Elfire · 30 January 2016

It looks to me as if your text editor is adding a byte order mark.
Open the file in Notepad++ and go to Encoding -> Convert to UTF-8 without BOM. Then save the file and try again.
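
If Notepad++ is not at hand, the BOM can also be detected and dropped in Python. A minimal sketch, assuming the file is UTF-8; the helper names and the sample bytes are invented for illustration. The "utf-8-sig" codec decodes UTF-8 and silently removes a leading BOM if there is one.

```python
import codecs
import xml.etree.ElementTree as ET


def has_utf8_bom(data):
    """True if the raw bytes start with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


def decode_without_bom(data):
    """Decode UTF-8 bytes; "utf-8-sig" drops a leading BOM if present."""
    return data.decode("utf-8-sig")


# Hypothetical sample: a BOM followed by a single UTF-8 encoded element.
sample = codecs.BOM_UTF8 + '<word plural="Bäume" gender="masc">Baum</word>'.encode("utf-8")

text = decode_without_bom(sample)
root = ET.fromstring(text)
plural = root.get("plural")
```

For a real file, the same effect should be achievable with open(path, encoding="utf-8-sig") before handing the file to the parser.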

agredo · 30 January 2016

I did it with Python's own library: import xml.etree.ElementTree as ET;
Of course:

Addendum (30 January 2016)

Oh, that's cool. When I convert the file to UTF-8 without BOM, it at least gets as far as "Bäume" (see file: lexicon-segment-Samantha.zip), and then it reports the error: 'charmap' codec can't encode character '\u201e' in position 9: character maps to <undefined>.

Code:

print("gender " + child.attrib["gender"]);

print("plural: " + child.attrib["plural"]);

print("Wort: " + allFilesList[i][k].text  + "\n\n");

The error is raised by print("plural: " + child.attrib["plural"]);.
Something seems to be wrong with "Bäume". "Verträge", which comes before it and also contains an umlaut, causes no problem...

After commenting out these lines, which print the contents, the program does save the file correctly.

How can I convert an XML file to UTF-8 without BOM?
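
One possible explanation for this error, stated as an assumption rather than a diagnosis: the UTF-8 file is being decoded with the Windows default encoding cp1252, and the console uses cp850. Under that assumption the entry after "Bäume", which is "Äste", turns into text containing '\u201e', and print cannot encode that character for the console. The sketch below reproduces this with plain strings; the code pages and the helper safe_print are assumptions for illustration.

```python
# Assumption: the UTF-8 bytes were decoded as cp1252 (Windows default),
# and the console encoding is cp850. Both are guesses, not confirmed.
original = "Äste"
misread = original.encode("utf-8").decode("cp1252")   # mojibake version
line = "plural: " + misread

try:
    line.encode("cp850")          # what print() has to do for the console
    failure = None
except UnicodeEncodeError as err:
    failure = err                 # the character at failure.start cannot be encoded


def safe_print(text, encoding="cp850"):
    """Print text, escaping characters the target encoding cannot represent."""
    print(text.encode(encoding, errors="backslashreplace").decode(encoding))
```

If this is what happens, the earlier umlauts only appear to work: they are printed as two wrong characters that the console happens to support. Reading the file with an explicit UTF-8 encoding would address the cause rather than the symptom.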

TWry · 7 March 2016

This may already be solved... still, here is the tip to specify the encoding in the XML prolog. For example:

Code:

<?xml version="1.0" encoding="UTF-8"?>
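
When the file is written with ElementTree, the prolog can be generated instead of typed by hand. A minimal sketch: the element values are taken from the example data in this thread, and the in-memory buffer merely stands in for a real file opened in binary mode.

```python
import io
import xml.etree.ElementTree as ET

root = ET.Element("lexicon-segment")
ET.SubElement(root, "word", gender="masc", plural="Bäume").text = "Baum"

# BytesIO stands in for a real file; a file name would work here as well.
buffer = io.BytesIO()
ET.ElementTree(root).write(buffer, encoding="utf-8", xml_declaration=True)

data = buffer.getvalue()
first_line = data.split(b"\n", 1)[0]   # the generated XML declaration
```

With encoding="utf-8" the umlauts are written as UTF-8 bytes, and the declaration tells any reader which encoding to expect.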

agredo · 25 April 2016

Thank you.

What helped was:

myFile = open(xmlFileNameList, "r+", encoding="utf-8");
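
For anyone finding this later, here is a self-contained sketch of that fix. The temporary file and its content are made up for illustration. It shows two variants: opening the file in text mode with an explicit encoding, as above, and passing the path to the parser so that it reads the bytes itself and follows the encoding in the XML declaration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample document with a non-ASCII attribute value.
xml_text = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<lexicon-segment><word plural="Äste" gender="masc">Ast</word></lexicon-segment>'
)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "lexicon.xml")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(xml_text)

    # Variant 1: text mode with an explicit encoding instead of the OS default.
    with open(path, "r", encoding="utf-8") as handle:
        root_text_mode = ET.parse(handle).getroot()

    # Variant 2: pass the path; the parser reads the raw bytes itself.
    root_from_path = ET.parse(path).getroot()

plural_text_mode = root_text_mode[0].get("plural")
plural_from_path = root_from_path[0].get("plural")
```

Read-only mode "r" is enough for parsing; "r+" is only needed if the same handle is also used for writing.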

Many thanks to all of you.

Can this thread be marked as solved?
